HI all
I am analyzing an RNAseq dataset to find DE genes using edgeR. Strangely I get no significantly differentially expressed genes, and the same FDR for every gene when summarizing the results.
I ran the same code on a public dataset and got fine results, so I guess it has something to do with the specs of the dataset but I cant figure out what.
The data is from 8 mice, genetically homogeneous, 4 of them treated and 4 untreated. What I know is that the biological material used for sequencing was very little, so the quality is maybe not the best.
Any help would be appreciated to find out where I am going wrong, many thanks!
My Code:
library(edgeR) d <- DGEList(counts=data_matrix[2:46079,3:10], group=group) d$samples isexpr <- rowSums(cpm(d)>1) >= 4 y <- d[isexpr, , keep.lib.sizes=FALSE] y$samples y <- calcNormFactors(y) plotMDS(y, method="bcv", col=as.numeric(y$samples$group)) legend("bottomleft", as.character(unique(d$samples$group)), col=1:3, pch=20) design<- model.matrix(~ 0 + y$samples$group) y <- estimateDisp(y, design, robust=TRUE) plotBCV(y) fit <- glmQLFit(y, design, robust=TRUE) plotQLDisp(fit) qlf <- glmQLFTest(fit) topTags(qlf,n=15)
Results:
logFC | logCPM | F | PValue | FDR | |
Gene1 | 0.929197 | 7.740159 | 21.96436 | 0.000363 | 0.998617 |
Gene2 | 1.056017 | 5.989199 | 19.84983 | 0.000562 | 0.998617 |
Gene3 | 1.172161 | 5.667213 | 17.50894 | 0.000944 | 0.998617 |
Gene4 | 0.897815 | 7.162813 | 16.90687 | 0.001086 | 0.998617 |
Gene5 | 1.523411 | 5.06676 | 16.61661 | 0.001164 | 0.998617 |
Gene6 | -1.72696 | 4.947108 | 16.4548 | 0.00121 | 0.998617 |
Gene7 | -0.903 | 6.165918 | 16.37441 | 0.001233 | 0.998617 |
Gene8 | -3.23804 | 4.054428 | 16.32435 | 0.001248 | 0.998617 |
Gene9 | -3.75089 | 5.074982 | 19.1382 | 0.001352 | 0.998617 |
Gene10 | 0.970776 | 6.206563 | 15.89572 | 0.001385 | 0.998617 |
Gene11 | 0.749168 | 7.167152 | 15.84059 | 0.001403 | 0.998617 |
Gene12 | 0.799071 | 6.561417 | 15.34004 | 0.001588 | 0.998617 |
The following specs are relevant:
After filtering (<1% was removed)
please just tell me if any further info would be helpful!
Well, the results you report are very surprising. It is very unusual that you could start with 46,000 genes but filter less than 1%. When we analyse mouse data we typically filter nearly half the genes using a similar filtering rule to you. I don't understand how you could be filtering less than 1% of genes.
Secondly, as we've already discussed, your code sequence is actually testing whether the counts in the last group are equal to 1 rather than testing DE. It seems incredible that none of the counts could be judged different from 1, so I conclude that the dispersion in your data must be enormous. I would view any NB common dispersion greater than about 0.1 for genetically identical mice to represent a poor quality experiment or an error in the analysis.
Hey Gordon. sorry for the late reply, was traveling.
i have to say that this comment of yours confuses me, will i test for DE if using
regarding filtering, library sizes changed for example from 7849976 to 7814960, which I would argue is not much.
looking at my dispersion plot I agree that its not optimally set up, but since i am a beginner i have to admit I am a bit struggling to find an adequate solution, can you advise a better strategy?
Thanks for your effort!
https://imgur.com/a/b2H2hQ4
and
https://imgur.com/a/zXhOJuW
What is important is the number of genes that are filtered, not the number of reads. I thought you were saying that only 1% of genes were filtered, which would be crazy. The percentage of genes filtered should be much greater than the percentage of reads filtered.
The dispersion shown in the BCV plot is huge. No RNA-seq experiment with genetically identical mice, good quality RNA and controlled lab conditions should be as variable as this. You should trouble-shoot your data using MDS plots and so on. Maybe you have a low RNA quality problem, which you might not be able to much about at this stage.
Also, if you have used Ensembl genes, you could also consider using our recommended Rsubread read mapping and summarization pipeline with builtin Rsubread Entrez Gene annotation. That does tend to reduce variability because it concentrates on better annotated exons.
My apologies for unclear terminology. You are absolutely correct, the number of genes goes down from around 46000 to 12000.
I will work on your suggestions, but the experimentalists have now told me that they barely had any RNA to begin with, so I assume its really a problem of low quality data.
Thanks!