Hello everyone,
My group has been conducting a large-scale analysis using TCGA data. I'm using the expression data to identify differentially expressed (DE) genes, following the DESeq2 vignette along with lfcShrink (apeglm). I run the comparison between healthy and diseased samples for multiple organs.
However, for almost every organ the number of healthy samples is only about 1-15% of the number of diseased samples (e.g., 44 healthy vs 525 diseased, 130 vs 903, or even 3 vs 309!). I do get results for almost every organ studied, but I am skeptical about the actual statistical significance of those results and about how much bias is introduced by such a large difference in the number of samples per condition.
Should I do something differently in the analysis because of this imbalance in samples per condition, or is such an analysis pointless because of it? Are results with an adjusted p-value < 0.1 still considered significant, as indicated by DESeq2? Should I lower the adjusted p-value cutoff to less than 0.05, or find a formula for the significance cutoff?
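To make clear what I mean by the cutoff: as far as I understand, DESeq2's adjusted p-values come from the Benjamini-Hochberg procedure by default, so padj < 0.1 means accepting roughly 10% false discoveries among the genes called significant. Below is a minimal Python sketch of that adjustment, only to illustrate the arithmetic. It is not DESeq2 code, the function name `benjamini_hochberg` is my own, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for five hypothetical genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
padj = benjamini_hochberg(raw)

# Genes passing each cutoff under discussion.
calls_at_010 = [p < 0.1 for p in padj]
calls_at_005 = [p < 0.05 for p in padj]
print(padj, calls_at_010, calls_at_005)
```

With these toy numbers, four genes pass at 0.1 but only two pass at 0.05, which is the kind of difference I am unsure how to judge when the groups are this unbalanced.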
I have searched for similar cases online, but I could not find any as extremely imbalanced as ours, which is why I am asking here. I have read that DESeq2 does not need equal numbers of samples per condition to provide significant results, but I am not sure whether that covers extreme cases like ours.
Thanks in advance
I see, thank you very much for your quick response!
I have the same problem, but I am using limma-voom. I have 82 cancer samples and 390 control samples. Any suggestions?