Hi,
I have samples where one gene accounts for more than 40% of the total number of reads in normal conditions. In one phenotype that I consider, that gene is up-regulated.
How will that impact the differential expression of the other genes?
How does DESeq2 do the normalisation to avoid considering these other genes artificially down-regulated because of that?
Let's look at 'fake' numbers of RNA molecules:
Phenotype 1
G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 | G10 |
1000 | 50 | 60 | 12 | 150 | 180 | 140 | 10 | 190 | 45 |
Total number of molecules: ~2000
Phenotype 2
G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 | G10 |
1500 | 50 | 60 | 12 | 150 | 180 | 140 | 10 | 190 | 45 |
Total number of molecules: ~2500
The sequencing depth might be the same between the two, so normalising by sequencing depth is not going to help correct for that. Also, DESeq2 assumes a log normal distribution for the gene expression levels, but I was wondering if such a high read count for one single gene might make that assumption wrong?
I am unsure if this is simply equivalent to half of the genes being up-regulated in the sample, with no genes down-regulated, which DESeq2 is clearly equipped to tackle, or if it is different?
Could you explain how DESeq2 accounts for cases such as this one?
Thanks very much,
Delphine
EDIT: I have attached an MA plot, and swapped "up" and "down" in the text above because my plot was the other way around compared to what I had written (the gene I am talking about is up-regulated in this plot because of the condition used as the baseline).
Can you post an image (you can use imgur.com for hosting) of the MA plot if you use DESeq2? You can get a quick sense of how the normalization works. Or you can even plug in some simulated counts like you have above to see how it works.
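To make the "plug in some simulated counts" suggestion concrete, here is a minimal sketch in Python of the median-of-ratios idea that DESeq2's size factor estimation is based on. It is not DESeq2's own code: the function name `median_of_ratios`, the sequencing depth of 1,000,000 reads, and the use of expected (non-integer, noise-free) counts are all assumptions made for illustration. It uses the fake molecule numbers from the question and assumes both samples are sequenced to the same depth.

```python
import numpy as np

# Fake molecule numbers from the question: rows = phenotypes, columns = G1..G10.
molecules = np.array([
    [1000, 50, 60, 12, 150, 180, 140, 10, 190, 45],  # Phenotype 1
    [1500, 50, 60, 12, 150, 180, 140, 10, 190, 45],  # Phenotype 2
], dtype=float)

# Assumption: both samples are sequenced to the same depth, so each gene's
# expected read count is its share of the molecules times the depth.
depth = 1_000_000
counts = molecules / molecules.sum(axis=1, keepdims=True) * depth

def median_of_ratios(counts):
    """Simplified median-of-ratios size factors (hypothetical helper).

    Each gene's count is compared to its geometric mean across samples;
    the size factor of a sample is the median of those ratios over genes.
    Genes with a zero count would have to be dropped first; none occur here.
    """
    log_counts = np.log(counts)
    log_geo_mean = log_counts.mean(axis=0)
    return np.exp(np.median(log_counts - log_geo_mean, axis=1))

# Naive comparison: depth is identical, so raw counts are compared directly.
naive_log2_fc = np.log2(counts[1] / counts[0])

# Median-of-ratios comparison: divide each sample by its size factor first.
size_factors = median_of_ratios(counts)
normalized = counts / size_factors[:, None]
log2_fc = np.log2(normalized[1] / normalized[0])

for i, (naive, corrected) in enumerate(zip(naive_log2_fc, log2_fc), start=1):
    print(f"G{i}: naive log2FC = {naive:+.3f}, median-of-ratios log2FC = {corrected:+.3f}")
```

In this toy case, nine of the ten genes share the same ratio between the two samples, so the median ignores G1 and the size factors absorb the shift it causes. With depth-only scaling, G2 to G10 look down-regulated; after median-of-ratios scaling they come out unchanged and G1 shows its true 1.5-fold increase. The sketch relies on most genes not changing, and real data would add count noise on top.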