How to determine the cause of no difference between treatments in rnaseq?
2
0
Entering edit mode
@jdmontenegroc-21026
Last seen 5.6 years ago

Hello, I recently had a fairly straightforward dge analysis comparing two treatments with three biological replicates each. All samples were clones of the same plant and the tests were performed under the same growth conditions except for the treatment. DGE using the the glmQLFit and glmQLFTest commands resulted in 0 genes differentially expressed between the groups for a FDR threshold of 0.01.

I understand that the exact test will produce no genes with DE under two circumstances: 1) there is too much variation in the expression levels within groups and therefore the test estimates a high p.value which translates into no DGE and 2) both groups have actually very similar expression patterns even though the dispersion of the data is not too big.

How can I test which one of the above is the main reason driving the result of no differential genes? Is there any other explanation for that result that I have not considered?

Thank you for your help.

Regards,

Juan D. Montenegro

RNAseq DGE edger • 3.4k views
ADD COMMENT
1
Entering edit mode

Note that an FDR of .01 is quite conservative, going with an FDR <= 0.10 is not uncommon so take a look at that.

As Aaron points out though, there are many reasons for not detecting any DE genes. Maybe the effect of the treatment isnt vere big at all, in which case more replicates will (always) help ... Or maybe there's some variability / technical artifact that you aren't accounting for ...

ADD REPLY
0
Entering edit mode

Thank you Steve,

Indeed going through the experimental design I found out the "clones" were not actually clones, but plants from the same "variety" that show fairly similar behaviour and phenotypes, but are not consistent one generation after the next, so yes, there was some variability between the repeats that I was not taking into account, but then again, I cannot add that factor to the model matrix, unless I can do this:

ind<-factor(c(1, 2, 3, 4, 5, 6))
treat<-factor(c(1, 1, 1, 2, 2, 2))
design<-mode.matrix(~ind+treat)

would that make sense at all?

Cheers,

Juan Daniel

ADD REPLY
0
Entering edit mode

No, unfortunately it wouldn't make sense to encode the design in this way as there's no replication within your ind factor.

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

It's all relative. Large dispersions aren't a problem if your effect sizes are even larger. There's not really any absolute distinction between the two possibilities, other than some rules of thumb based on prior experience.

If you're talking about RNA-seq in (nominally) controlled lab conditions, I would consider dispersions above 0.1 to be quite high. I might then say, "Oh, that's pretty high. Better go have a closer look at the data to see if there's outliers/batch effects/etc. that might be inflating the estimated variance and masking any effect." If I have dispersions below 0.01, then I might say, "Well, that's pretty low, so if I'm not getting any DE genes, there's probably not a strong effect in the first place."

Of course, detection power is also influenced by other factors such as the number of samples, the magnitude of the counts and the number of genes (due to the multiple testing correction). If you're not getting any DE genes with 100 deeply sequenced samples that exhibit Poisson-level variation (i.e., dispersions of zero), then there's probably no differential expression of any biological relevance.

If you want to do some testing, just take your estimated dispersion and the average count for each gene, introduce a known log-fold change, sample counts from a negative binomial distribution and see how many genes you detect. I believe that there are a couple of Bioconductor packages to do this kind of empirical power calculation for you, though I have not used them personally.

ADD COMMENT
0
Entering edit mode

Dear Aaron,

Thank you for your answer. I agree there are a lot confounding factors I may be missing since I was not part of the original experimental design. I recently found out that the biological repeats that were supposed to be clones, were actually just from the same "variety" and the phenotype of that variety has been shown not to be stable from generation to generation, so the stabilization part of the breeding process is still ongoing. Which suggests there are some differences in the genomic background of the biological repeats. I am currently testing this by discovering sample specific SNPs. But from the MDSplot I can see there is quite a bit of variation in the expression levels of one group. Also, the common dispersion calculated of 0.23 also suggests more dispersion than that expected from clones.

I'll look around to see if I can find any package that can help me do the testing you suggest and update here.

ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia

The secret is to plot the data, as illustrated in the edgeR QL workflow article:

  1. Use plotMDS to see if your samples separate between treatments (as in Figure 2 of the workflow). In my experience, when people fail to find any DE but believe they should have, the reason is most commonly because of quality control issues that are evident from the MDS plot.

  2. Use plotBCV to plot the dispersions (Figure 5 of the workflow). This shows you exactly how large or small the dispersions are.

  3. Use plotMD to plot the fold-changes (Figure 7 of the workflow). This shows you exactly how large the fold changes are.

  4. Look at the top 10-30 genes in the topTags table. Are the reported fold changes (logFC) large or small?

ADD COMMENT
0
Entering edit mode

Dear Gordon,

Thank you for your answer. I did perform quality checks on the data before doing the DGE test. After filtering genes with fewer than 10 alignments in fewer than 3 libraries I checked the BCV and MDS plots after normalization. The BCVplot shows quite a bit of variation (common dispersion = 0.23). The MDSplot shows two groups: the repeats in one are quite close to each other, while the other group is spread mostly across the X axis.

I am not sure if the high dispersion of the second group would be enough to skew the dispersion estimates of both groups. The logFC of the top 20 genes in the table is 3 (so, 8-fold change in read counts between groups), so not so much.

ADD REPLY

Login before adding your answer.

Traffic: 503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6