Question

DESeq2 counts distribution across both genes and samples

igor ▴ 50

United States

DESeq2 assumes a negative binomial distribution for counts. That refers to the distribution of counts for a single gene across all samples.

What about the distribution of counts of all genes within one sample? Each sample is normalized based on the geometric mean of all counts (the size factor). That sounds like it does not take the shape of the distribution into account. For example, suppose two samples have the same number of reads, but one sample has many low and high counts while the other has only medium counts. The means would be the same, but many genes would differ between the two samples. Is that taken into account?

deseq2 rnaseq • 2.8k views

updated 3.1 years ago by BarryGutierrez • written 8.6 years ago by igor

score 0 · Answer 1 · 2016-08-30

"That refers to the distribution of counts for a single gene across all samples."

Not exactly. The counts K_ij for a gene i are not iid across samples j, because the mean mu_ij differs from sample to sample. Even within the same group, the mu_ij are not equal, because the size factors s_j are not equal. Take a look at the formula in the Materials and Methods section of the DESeq2 paper.
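For reference, the model from the DESeq2 paper can be written out as follows. This is a restatement using the paper's notation as I recall it (alpha_i is the gene-wise dispersion, q_ij the quantity proportional to the true concentration of fragments, x_jr the design matrix entries, beta_ir the log2 fold-change coefficients), so check it against the paper before relying on it:

```latex
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad
\mu_{ij} = s_j \, q_{ij}, \qquad
\log_2 q_{ij} = \sum_r x_{jr} \, \beta_{ir}
```

Because s_j enters the mean directly, two samples from the same group (same q_ij) still have different means whenever their size factors differ.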

"Each sample is normalized based on the geometric mean of all counts (or size factor)."

Again, not exactly. The size factor for a sample is the median, over genes, of the ratio of that sample's counts to a pseudo-reference sample. The pseudo-reference is created by taking the geometric mean of each gene (row) across all samples. Take a look at Eq 5 in the original DESeq paper.
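The median-of-ratios calculation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not DESeq2's actual code: the function name `estimate_size_factors` and the toy matrix are made up for this example, and dropping genes that have a zero count in any sample is an assumption made so that the log geometric mean stays finite.

```python
import numpy as np

def estimate_size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with a non-zero count in every sample, so that the
    # log of the geometric mean is finite (assumption of this sketch).
    keep = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[keep])
    # Pseudo-reference: per-gene geometric mean across samples, on the log scale.
    log_ref = log_counts.mean(axis=1, keepdims=True)
    # Size factor: per-sample median of the ratios to the pseudo-reference.
    return np.exp(np.median(log_counts - log_ref, axis=0))

# Toy data: sample 2 has exactly twice the count of sample 1 for every gene.
counts = np.array([
    [10, 20],
    [100, 200],
    [1000, 2000],
    [50, 100],
    [5, 10],
])
size_factors = estimate_size_factors(counts)
print(size_factors, size_factors[1] / size_factors[0])
```

In this toy case every gene has the same ratio between the two samples, so the ratio of the two size factors equals that common ratio. The median is what makes the estimate insensitive to a minority of genes with very different ratios.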

I'm not sure what the concern is here. One way to think about size factor estimation: plot two samples against each other in a scatter plot, with each point representing a gene. We are looking for a vector of size factors such that the ratio of the size factors for these two samples gives the slope of the line going through the bulk of the points.
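To make the scatter-plot picture concrete, here is a small sketch with two hypothetical samples. The counts are invented for illustration: most genes follow B = 3 * A, and the last two genes are strongly differentially expressed. With only two samples, the ratio of the median-of-ratios size factors reduces to the median of the per-gene ratios B/A, which is the slope of the line through the bulk of the points on the original count scale.

```python
import numpy as np

# Hypothetical counts for two samples (one entry per gene).
# The first five genes satisfy b = 3 * a; the last two are outliers.
a = np.array([10, 40, 100, 250, 600, 30, 80], dtype=float)
b = np.array([30, 120, 300, 750, 1800, 900, 8], dtype=float)

# Pseudo-reference: log of the per-gene geometric mean of the two samples.
log_ref = (np.log(a) + np.log(b)) / 2

# Size factors: median ratio of each sample to the pseudo-reference.
s_a = np.exp(np.median(np.log(a) - log_ref))
s_b = np.exp(np.median(np.log(b) - log_ref))

slope = s_b / s_a                # slope implied by the size factors
total_ratio = b.sum() / a.sum()  # naive ratio of total counts, for comparison
print(slope, total_ratio)
```

Here the size factor ratio comes out at 3 (up to floating point), matching the majority of genes, while the ratio of total counts is pulled away from 3 by the two outlying genes. This is the sense in which the method looks at the per-gene relationship between samples rather than only at a per-sample mean or total.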