Hello alll. I am a newbie in bioinformatics and still learning.
The thing is, I want to do differential expressed gene(DEG) analysis in the TCGA RNA-seq data.
DESeq2 and edgeR manuals said they require raw count data as input, so I downloaded STAD.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data
form [FireHose] (http://gdac.broadinstitute.org/) and exact count column from it.
My code for DEGs in tumor vs nomral goes like:
library(edgeR) library(dplyr) d <- DGEList(counts=count_data, group=group) d <- calcNormFactors(d) design <- model.matrix(~group) d <- estimateDisp(d,design) f <- glmFit(d,design) lrt <- glmLRT(f, coef = 2) topTags(lrt, n =20) DEG.edgeR <- topTags(lrt, n=Inf, sort.by = "logFC", adjust.method = "BH")$table DEG.edgeR$gene <- rownames(DEG.edgeR) DEG.edgeR.sig <- filter(DEG.edgeR, abs(logFC) >= 2 & FDR < 0.05)
and I get a data frame with DEGs sorted by logFC. Here comes my question:
1. Am I using the right data for DESeq2/edgeR and is my code ok?
2. Suppose I am right on 1, now I want to do a correalation analysis among the DEGs' expressinon, what data should I use?
Since I now only have raw count data, it varies so much and it seems I should perform some kind of normalization first and then do correlation analysis. Is it right?
Also, if I want to do a boxplot presantation of a gene's expression in tumor and normal sample, common practice is to use TCGA RPKM/FPKM data for this. Does this mean I have to download RPKM/FRKM data too just for correlation analysis and ploting?
It sounds weired to me to using one data to do DEG analysis and another to do correlation analysis and ploting.
Thanks for your reply, Aaron! I saw your answer days ago and realized I haven't understand how edgeR works so I went back reading the edgeR User Guild, took me a while to get a glimpse of the capability and usage of edgeR.
To answer your suggestion:
Yes, the user guide proposed a
lfc
option inglmTreat.
I was wrong to filter the DEG usingdplyr.
As for correlation analysis, you're right again. Since all genes are DE genes, they might be supposed to be positively or negatively correlated with each other. I was wrong to analysis like that. I know about WGCNA, but they don't recommend feeding DEG result to WGCNA. I may have to find another way.
I want RPKM mainly because I am to do some plot presentation of the expression data ( boxplot, barplot etc). Raw count data is not proper for that. I finally get log transformed RPKM value using the following code:
rpkm <- rpkm(d, log = TRUE)
after getting gene length using
getGeneLengthAndGCContent
fromEDASeq
package.