Hi Michael and community,
As always, thank you for your dedication to DESeq2. I'd like to ask: is it reasonable to run a DESeq2 analysis on only a subset of the original raw count matrix (would the DESeq2 statistical model still apply)? For instance, if I am only interested in the coding genes of the transcriptome, can I filter out the non-coding genes (rows) from the original count matrix, keep all my samples (columns), and then run DESeq2 to find DE coding genes between sample conditions?
Thank you in advance!
Alan
R version 3.4.3 (2017-11-30), DESeq2_1.18.1
Hi Michael,
Thank you for your quick reply. I understand it's essential to give DESeq2 enough information (genes) to allow more accurate estimation of size factors and dispersion.
Can I push my question one step further? In my experiment, I am most interested in comparing the DE of cytokine mRNAs between 2 conditions (trt vs. control), 10 samples each. If I just want to look at the DE of cytokine genes (e.g. a non-exhaustive list: https://www.rndsystems.com/products/human-cytokine-array-kit) between the two experimental conditions, this would likely include only tens to maybe a couple of hundred genes (out of the 60,000 or so total genes, coding plus non-coding, that I got from the RNA-Seq raw data)...
In this case, which would you recommend?
(1) I could run DESeq2 with the limited numbers of rows (using raw counts), or
(2) Could I first use all of the ~60,000 genes in 10 samples as the input count matrix, get the normalized counts, filter for the cytokine genes, and then run DESeq()? Is this possible? (I know this may sound like a totally unreasonable request, but...)
(3) Other advice on more appropriate analysis approach?
Thank you very much in advance, Michael. Thank you for your time and patience!
Alan
You should just run DESeq2 on all the genes. It’s not a good idea to subset to “interesting” ones because of the two problems I outlined above (priors and the scaling factor).
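To illustrate the scaling-factor point, here is a minimal Python sketch of a median-of-ratios size factor calculation, which is the idea behind DESeq2's default size factor estimation. It is not DESeq2's code: the `size_factors` helper, the gene names, and the counts are all made up for illustration. The toy data assume sample B was sequenced twice as deeply as sample A, and that three "cytokine" genes are truly up 4-fold in B on top of that.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative helper, not DESeq2 code).

    counts: dict mapping gene name -> list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes with a zero count are skipped.
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count to its geometric mean, then the median.
        log_ratios = [math.log(counts[g][j]) - m for g, m in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical toy data: columns are [sample A, sample B].
background = {f"bg{i}": [10 * i, 20 * i] for i in range(1, 8)}   # B = 2 x A (depth only)
cytokines = {f"cyt{i}": [10 * i, 80 * i] for i in range(1, 4)}   # B = 8 x A (depth x 4-fold DE)

full_sf = size_factors({**background, **cytokines})
subset_sf = size_factors(cytokines)

# Fold change of cyt1 after dividing each count by its sample's size factor.
a, b = cytokines["cyt1"]
full_fc = (b / full_sf[1]) / (a / full_sf[0])
subset_fc = (b / subset_sf[1]) / (a / subset_sf[0])

print("size factor ratio B/A, all genes:     ", full_sf[1] / full_sf[0])
print("size factor ratio B/A, cytokines only:", subset_sf[1] / subset_sf[0])
print("normalized fold change, all genes:     ", full_fc)
print("normalized fold change, cytokines only:", subset_fc)
```

With all genes, the size factor ratio between the samples comes out at roughly 2 (the depth difference) and the cytokine fold change stays at roughly 4. With only the cytokine rows, the ratio is roughly 8, so the real change is absorbed into the scaling factor and the normalized fold change collapses to roughly 1.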
OK, I see. Thank you very much for the explanation, Michael.
Alan
And another (dummy) question, please: could a possible solution be to generate the DESeq object and, after size factor estimation, filter out all the non-interesting genes? Or could this approach bias the DE results? Thanks for your help (and patience!).
No, it's just generally not a good idea, because there are other estimates across all genes that would be disrupted by subsetting to only a few genes.
OK, thanks for your quick response!