Question

CPM output from edgeR for a downstream application

0

Entering edit mode

ycding ▴ 10

@ycding-7496

Last seen 4.5 years ago

United States

Dear edgeR users,

I am not an experienced R user. I ran the edgeR for the TCGA RNAseq data using raw count from the Rsubread featureCounts and the TCGA miRNAseq data using raw count from TCGA level3 data to identify differential expressed genes, see the codes below; now I want to examine miRNA and mRNA expresion interaction, Should I use the Count per Million (CPM) from edgeR as input data for the interaction prediction?

thank you very much!!

Yuanchun Ding

---------------------------------------------------------------------------------------------------------------------------------

#loading edgeR, limma and splines;
library(edgeR)
library(limma)
library(splines)

#create treatment factor with old group as reference group and then create batchid factor
Treat <- factor(group$agegroup)
Treat <- relevel(Treat, ref="old")
batch <- factor(group$batchid)

#filter genes to include genes with at least two counts per million in at least three samples
keep=rowSums(cpm(counts) >2) >= 3
counts_final=counts[keep,]
table(keep)

#create a DGEList and apply TMM normalization
y=DGEList(counts=counts_final, group=Treat)
y=calcNormFactors(y)

#to create design matrix using batch as a block factor,so compare young to old within each batch
design=model.matrix(~batch + Treat)
rownames(design)=colnames(y)

# to estimate dispersion and plot BCV
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

plotBCV(y)

# to detrmine differential expresssed genes; by default, the test is for the last
# coefficient in the desgin matrix, which is the young to old treatment effect.
fit=glmFit(y, design)
lrt=glmLRT(fit)
DEGs_all=topTags(lrt,n=15223)
write.csv(DEGs_all, "DEGs_all_052715.csv")

#subset genes with FDR < 0.05 and abs(logFC)>1
DEGs_FDR05 = subset(as.data.frame(DEGs_all), FDR < 0.05)
DEGs_FDR05_logFC1=subset(DEGs_FDR05, abs(logFC)>1)
genes_FDR05=rownames(DEGs_FDR05)

write.csv(DEGs_FDR05_logFC1, "DEGs_FDR05_FC2.csv")

#summarize total nubmer of differential expressed genes at 5% FDR
dt=decideTestsDGE(lrt)
summary(dt)

#to pick out DEGs
isDE=as.logical(dt)
DEnames=rownames(y)[isDE]

# to plot all the logFCs against average count size, highlighting DEGs
plotSmear(lrt, de.tags=DEnames)
abline(h=c(-1,1), col="blue")

# to export individual counts-per-million for all 15223 genes
top15223=rownames(topTags(lrt, n=15223))
CPM_15223genes=cpm(y)[top15223,]
write.csv(CPM_15223genes_trans, "CPM_15223genes.csv")

--------------------------------------------------------------------------------------------------------------------------------------

CPM edgeR • 3.8k views

ADD COMMENT • link 9.9 years ago ycding ▴ 10

score 0 · Answer 1 · 2015-05-28

If you're trying to determine if differential expression of miRNAs is related to differential expression of mRNAs, it would make more sense to compare the relevant log-fold changes across your DE comparison of interest. For example, you'd expect that upregulation of a miRNA would have a repressive effect on its target mRNAs.

As to the specifics; that depends on how you intend to predict an interaction between miRNAs and RNAs. If you already have a table of miRNA-mRNA relationships, then you can treat all genes regulated by a given miRNA as a gene set. You could then use roast to examine the change in mRNA expression across the set for your given DE comparison. Alternatively, you can just intersect the DE miRNAs with the DE mRNAs to identify likely miRNA-target relationships.

If you don't have this table, then the problem becomes harder, because you're trying to predict targets from the sign of the log-fold change alone. I can't imagine that this would be particularly informative because you're not using the mRNA/miRNA sequence to obtain potential binding sites. I guess there would be other software to do this for you, but the sensibility of feeding in log-fold changes or CPMs would depend on the specific program.

score 0 · Answer 2 · 2015-05-28

Hi Aaron,

thank you so much for your quick answer!!

I am going to use the integrative analysis method developed by Dr. Iterson etal published in Nuclear Acid Research( integerative analysis of microarray and mRNA expression); yes, the integrative idea is similar to what you suggested, first use target prediction tools to generate a list of predicted mRNA targets for a miRNA and then use a gene set test to detect consistent effect of the miRNA on the joint expression of multiple targets. in the paper, microarray data were used; for microarray data, after background correction, log2 transformaiton, quantile normalization and batch effect removal by combat, then the data are ready for testing miRNA and mRNA interaction. however, for RNAseq data, I am not sure CPMs or logCPM from edgeR are good input data for running the interaction as edgeR manual indicates that for most part, edgeR does not produce any quantity that could be called a "normalized count". from edgeR, if 200 mRNA DEGs and 100 miRNA DEGs were generated, then I guess I can just examine the interaction between those DEGs; if CPMs are not suitable, what kind of output file I should generated from edgeR or other programs to perform the integrative analysis?