Hi all experts,
After doing DE analysis by edgeR, we can find the number of up and down-expressed genes by something like, "summary(de <- decideTestsDGE(lrt5))". I would highly appreciate if you could please help me out to figure out the following issues, which I cannot completely understand with reading and browsing net.
1) Please let me know how fold change (FC) is calculated during GLM method, it's based on cpm? say we compare A and B conditions, fold change of DE genes are calculated as cpmA / cpmB or there is another story? if your response is positive, why the software doesn't use RPKM(FPKM) that take into account the gene length? is there any priority for using CPM in relative to FPKM?
2) Please kindly tell me how we can export DE genes with all related information, like logFC, logCPM, P-value and FDR from edgeR software?
3) As I found on net, we can convert cpm to rpkm via this formula: RPKM = 2^(logCPM-log2(geneLength)). Please let me know if the same formula can be ued for FPKM?
4) If I correctly understood from edgeR manual, drawing heatmap by logCPM is more desirable than logFPKM? if it's right, please kindly tell me why there is such preference?
Thank you in advance for all your clarifications.
To elaborate on Ryan's answer:
?predFC
. This function adds a prior count to avoid undefined or unstable log-fold changes due to zeroes and low counts respectively, and then fits a GLM to these modified counts. The coefficients corresponding to differences between conditions are then directly used as the log-fold changes. The reason that we don't correct for gene length is because it would just cancel out when you compute the (log-)fold change for any given gene.$table
member of thetopTags
output, you don't even need to coerce it to adata.frame
.rpkm
instead. This function can also be used for FPKM, which is basically the same as RPKM (see Ryan's link).Thank you very much for all responses. They were really helpful. However, I'm still confused about logFC. I did the similar analysis using edgeR within Trinity package, it uses just exact test for DE analysis; here, when two samples,A and B (A-B), compared, genes with higher FPKM in sample B than A get the positive logFC and vice versa. I was wondering if it is correct when DE analysis is done by edgeR in R (out of Trinity package)? my mean is can we judge the up and down-regulated genes among DE genes resulted from edgeR analysis by looking at their FPKM values?
For exporting DE genes from edgeR, I used the following commands
But, the "DEexport.txt" contained just 10 rows, while the number of DE is about 1500 based on the output of "summary(de <- decideTestsDGE(lrt5)) ". Could you please let me know how I can get all 1500 DE genes?
Regarding the log-fold change, if you don't post any code, it's impossible to tell what's happening. But why are you bothering with FPKMs? If you want to know up- or down-regulation, just look at the sign of the log-FC.
For exporting, have a look at
?topTags
and then
argument. Settingn=Inf
will get you all genes.Thanks for the comment. Regarding the logFC and FPKM, here is the code, this script https://github.com/trinityrnaseq/trinityrnaseq/blob/master/Analysis/DifferentialExpression/run_DE_analysis.pl, within Trinity package made a DE analysis by edgeR (or other software) easy for the beginners, like me. You're right with looking at the sign of the logFC for identification of up and down-expressed genes. However, during the transcript analysis, I face with several transcripts of a single gene that have different FPKM values, as I read, those transcripts with the higher FPKM are more reliable than others while those with low FPKM may be artifacts. It is useful for our biologists when we try to experimentally study expression analysis in the lab, say by qRT-PCR, ... That's why I'm looking for FPKM values in addition to the sign of logFC.
Regarding topTags, yes, I tried
n=Inf
, but as you mentioned it returned all genes not just DE genes. Please kindly tell me how to modify the above code to get just DE genes at the FDR cutoff of 0.05?Thanks
No response, yet. Please any suggestions!
This is the point at which you are usually admonished to read the help page. Actually, Aaron already pointed you to the help page when he said to look at
?topTags
.From
?topTags
:Learning how to figure out the answer yourself is an invaluable skill, if you expect to be doing anything with R/Bioconductor (or any Open Source software, for that matter). There is limited support, and you will find that in the vast majority of cases it is faster and way more useful for you to figure out the answer yourself.