Hi,
I'm investigating the potential effect of several mutations on gene expression. I've RNA-seq from ~30 tumors and 4 control samples. I screened for these 30 tumors about 20 genes for mutations. I translated these mutation data into a binary matrix (rows = samples, columns = genes ; 0 = not mutated, 1 = mutated).
For the RNA-Seq I aligned with STAR and counted the number of reads per gene with featurecounts. Then I merged everything into a big read count matrix (column = sample ; rows=gene).
Now I want to detect genes which expression is affected by these mutations. So I used limma as follow:
The design matrix harbors a column for each gene (~20), and an additional column indicating if the samples is a tumor or not. I also add an offset column. For example :
mut dataframe :
offset geneA geneB geneC isTumor
sample1 1 0 1 1 1
sample2 1 1 1 0 1
sample3 1 0 0 1 0
So :
design <- mut
design <- design[,colSums(design)>2] # remove gene that is mutated in less than 3 samples
design <- cbind(offset=1,design) # add offset
# egdeR
dge <- DGEList(counts=counts)
dge <- calcNormFactors(dge)
cpm <- cpm(dge,log=T)
keep.exprs <- rowSums(cpm>1)>=3 # min 3 samples with CPM > 1
dge <- dge[keep.exprs,]
# limma voom
geneExpr <- voom(dge,design)
glm = lmFit(geneExpr$E, design = design)
glm = eBayes(glm)
testResults <- decideTests(glm, method="global",adjust.method="BH", p.value=0.05)[,-1]
Now in testResults I've a matrix containing for each gene and each covariate (the 20 screened genes for mutation) if it's up (=1) or down regulated (=-1) ; (no significant change = 0). Is that correct for you ?
Thanks
FYI I based my analysis on this paper : http://www.nature.com/ncomms/2015/150109/ncomms6901/full/ncomms6901.html . Supplementary data 2 contains the original R code I used to wrote the few lines above.
P.S. : I first post this question to biostars but some users adviced me to post it here. https://www.biostars.org/p/206505/
Aaron, what would be a good way to assess the quality of limma model fits?
One approach would be to plot the residuals against each covariate. If the model fitted well, then there shouldn't be any trend in this plot, because it should have been regressed out by the model fit. It's probably easier to do this for all genes at once - if there's a noticeable trend, you should be concerned as most genes are poorly fitted. If there isn't a trend, then poor fits for a handful of genes can probably be ignored.
Hi @Aaron. Could you tell me how to plot such plot (residuals vs covariates). I did the analysis and get a list of DE genes for each of the covariates. And I found a significant degree of interestion between these lists of DE genes.. This make me feel that something is wrong in me design. Any ideas ?
A big intersection is not necessarily symptomatic of a problem with your design. Many genes may genuinely be affected by or associated with the mutations, so how do you know that's not the case?
Anyway, to make the residual vs covariate plot, you'll probably want to do something like this:
The model is likely to be insufficient if the two boxplots aren't very similar, e.g., due to a difference in the medians (unlikely, as this should average out across many genes) or due to a difference in the inter-quartile ranges (more likely, due to improperly modelled interaction effects that inflate the residuals for a few observations in each gene).
That being said, I'm not really sure what you could do if that were the case, because there's too many interaction terms that need testing (20-choose-2, and that's just for two-way interactions). Also, because you only have about 30 samples, you don't have that many residual d.f. to play with, which further restricts the model parametrization. It might be better to stick to a simple additive model, accept some inaccuracy (inflation of the variances will probably make limma conservative, which is better than the alternative), and move on to selecting candidates for experimental validation.
I check the boxplot for all my covariates and they seems all very similar (no differences). So the model seems ok. As you advice, I'll select some interesting candidates for further testing. Thank you