Question

Is it reasonable to run DESeq2 on only a subset of transcripts of the original raw count matrix?

Alan ▴ 20

@alan-15011

Hi Michael and community,

As always, thank you for your dedication to DESeq2. I'd like to ask: is it reasonable to run a DESeq2 analysis on only a "subset" of the original raw count matrix (would the DESeq2 statistical model still apply)? For instance, if I am only interested in the coding genes of the transcriptome, can I filter out non-coding genes (the rows) from the original count matrix, keep all my samples (the columns), and then run DESeq2 to find DE coding genes between sample conditions?

Thank you in advance!

Alan

R version 3.4.3 (2017-11-30), DESeq2_1.18.1

deseq2 • 4.9k views

Written 6.6 years ago by Alan • updated 6.6 years ago by Michael Love

score 4 · Answer 1 · 2018-02-15

Michael Love 42k

@mikelove

United States

You can subset to a smaller set of rows; here, with protein-coding genes, I don't see a problem. That said, you generally want to let DESeq() see as many genes as possible, as this helps the dispersion and LFC estimation steps, which construct priors by looking at all genes. And for normalization, it is required that not all the genes be greatly differentially expressed, or else it's not possible to estimate the size factor (library size correction). So by looking at all expressed genes, DESeq() has a good shot at estimating the library size, because in a well-designed experiment most genes are not greatly differentially expressed (or else spike-in controls should have been used).
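To make the size factor point concrete, here is a minimal Python sketch of the median-of-ratios idea behind DESeq2's size factor estimation. It is a toy under stated assumptions, not DESeq2's actual code: the function name `size_factors`, the two-sample layout, and all counts are made up for illustration. Sample B is sequenced twice as deep as sample A, and the two hypothetical "cytokine" genes are truly 8-fold up in B.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of gene rows, each row a list of raw counts (one per sample).
    Only genes with a nonzero count in every sample are used, so that the
    log geometric mean of each gene is finite.
    """
    n_samples = len(counts[0])
    log_rows = [[math.log(c) for c in row]
                for row in counts if all(c > 0 for c in row)]
    # Per-gene log geometric mean across samples (the pseudo-reference).
    log_geo_means = [sum(r) / n_samples for r in log_rows]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene in sample j to the pseudo-reference.
        log_ratios = [r[j] - g for r, g in zip(log_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical data: columns are [sample A, sample B]; B has 2x the depth.
background = [[a, 2 * a] for a in (10, 20, 40, 80, 160, 320, 640, 1280)]
cytokines = [[50, 800], [30, 480]]  # 2x depth times a true 8-fold change

sf_all = size_factors(background + cytokines)
sf_subset = size_factors(cytokines)

ratio_all = sf_all[1] / sf_all[0]           # depth ratio seen with all genes
ratio_subset = sf_subset[1] / sf_subset[0]  # depth ratio seen with subset

# Fold change of the first cytokine gene after normalization.
fc_all = (cytokines[0][1] / sf_all[1]) / (cytokines[0][0] / sf_all[0])
fc_subset = (cytokines[0][1] / sf_subset[1]) / (cytokines[0][0] / sf_subset[0])

print(f"B/A size factor ratio, all genes:   {ratio_all:.2f}")
print(f"B/A size factor ratio, subset only: {ratio_subset:.2f}")
print(f"normalized fold change, all genes:   {fc_all:.2f}")
print(f"normalized fold change, subset only: {fc_subset:.2f}")
```

With all ten genes, the B/A size factor ratio comes out at about 2, the true depth difference, and the cytokine gene keeps its 8-fold change. With only the two cytokine genes, the ratio comes out at about 16, so the real change is absorbed into the size factors and the normalized fold change collapses to about 1.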

Answer by Michael Love • 6.6 years ago

Hi Michael,

Thank you for your quick reply. I understand it's essential to give DESeq2 enough information (genes) to allow more accurate estimation of size factors and dispersion.

Can I push my question one step further? In my experiment, I am most interested in comparing the DE of cytokine mRNAs between 2 conditions (trt vs. control), 10 samples each. So if I just want to look at the DE of cytokine genes (e.g., a non-exhaustive list: https://www.rndsystems.com/products/human-cytokine-array-kit) between the two experimental conditions, this would likely include only tens to maybe a couple of hundred genes (out of the 60,000 or so total genes, coding plus non-coding, I got from the RNA-Seq raw data)...

In this case, which would you recommend?

(1) I could run DESeq2 with the limited number of rows (using raw counts), or

(2) I could first use all of the ~60,000 genes in 10 samples as the input count matrix, get the normalized counts, filter for those cytokine genes, and then run DESeq(). Is it possible to do this? (I know this may sound like a totally unreasonable request, but...)

(3) Other advice on more appropriate analysis approach?

Thank you very much in advance, Michael. Thank you for your time and patience!

Alan

Reply by Alan • 6.6 years ago

You should just run DESeq2 on all the genes. It’s not a good idea to subset to “interesting” ones because of the two problems I outlined above (priors and the scaling factor).

Reply by Michael Love • 6.6 years ago

OK, I see. Thank you very much for the explanation, Michael.

Alan

Reply by Alan • 6.6 years ago

And another (dummy) question, please: could a possible solution be to generate the DESeq object and, after size factor estimation, filter out all non-interesting genes? Or could this approach bias the DE results? Thanks for your help (and patience!).

Reply by jgarces • 4.6 years ago

No, it's just generally not a good idea, because there are other estimates across all genes that would be disrupted by subsetting to only a few genes.
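As a rough illustration of why estimates that pool information across genes suffer when only a few genes remain, here is a toy Python simulation. It is not DESeq2's procedure, only an analogy: it treats the quantity being pooled as the spread of some per-gene value, and shows that this spread is estimated much less stably from 50 genes than from 2,000. The gene counts, the true spread of 0.5, the seed, and the names `all_genes` and `spread_of_estimates` are all arbitrary assumptions.

```python
import random
from statistics import stdev

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical per-gene values (think: deviations of a per-gene estimate
# from a fitted trend), drawn with a true spread of 0.5.
all_genes = [rng.gauss(0.0, 0.5) for _ in range(10000)]


def spread_of_estimates(n_genes, n_repeats=100):
    """Estimate the spread from random subsets of n_genes genes, repeat,
    and return how much that estimate itself varies between subsets."""
    estimates = [stdev(rng.sample(all_genes, n_genes))
                 for _ in range(n_repeats)]
    return stdev(estimates)


spread_small = spread_of_estimates(50)    # e.g. a short list of genes
spread_large = spread_of_estimates(2000)  # a much larger gene set

print(f"variability of the pooled estimate, 50 genes:   {spread_small:.4f}")
print(f"variability of the pooled estimate, 2000 genes: {spread_large:.4f}")
```

The estimate based on 50 genes varies far more from subset to subset than the one based on 2,000 genes, which is the general reason a prior built from a handful of genes is less reliable than one built from all of them.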