Question

Mutation effect on expression (RNA-Seq) using limma

0

Entering edit mode

Nicolas Rosewick ▴ 20

@nicolas-rosewick-10121

Last seen 4.8 years ago

Belgium/Brussels/ULB

Hi,

I'm investigating the potential effect of several mutations on gene expression. I've RNA-seq from ~30 tumors and 4 control samples. I screened for these 30 tumors about 20 genes for mutations. I translated these mutation data into a binary matrix (rows = samples, columns = genes ; 0 = not mutated, 1 = mutated).

For the RNA-Seq I aligned with STAR and counted the number of reads per gene with featurecounts. Then I merged everything into a big read count matrix (column = sample ; rows=gene).

Now I want to detect genes which expression is affected by these mutations. So I used limma as follow:

The design matrix harbors a column for each gene (~20), and an additional column indicating if the samples is a tumor or not. I also add an offset column. For example :

mut dataframe :

         offset geneA   geneB   geneC   isTumor
sample1      1      0       1       1         1
sample2      1      1       1       0         1  
sample3      1      0       0       1         0

So :

design <- mut
design <- design[,colSums(design)>2] # remove gene that is mutated in less than 3 samples
design <- cbind(offset=1,design) # add offset
# egdeR
dge <- DGEList(counts=counts)
dge <- calcNormFactors(dge)
cpm <- cpm(dge,log=T)
keep.exprs <- rowSums(cpm>1)>=3 # min 3 samples with CPM > 1
dge <- dge[keep.exprs,]
# limma voom
geneExpr <- voom(dge,design)
glm = lmFit(geneExpr$E, design = design) 
glm = eBayes(glm)

testResults <- decideTests(glm, method="global",adjust.method="BH", p.value=0.05)[,-1]

Now in testResults I've a matrix containing for each gene and each covariate (the 20 screened genes for mutation) if it's up (=1) or down regulated (=-1) ; (no significant change = 0). Is that correct for you ?

Thanks

FYI I based my analysis on this paper : http://www.nature.com/ncomms/2015/150109/ncomms6901/full/ncomms6901.html . Supplementary data 2 contains the original R code I used to wrote the few lines above.

P.S. : I first post this question to biostars but some users adviced me to post it here. https://www.biostars.org/p/206505/

limma rnaseq mutations • 1.7k views

ADD COMMENT • link updated 8.3 years ago by Aaron Lun ★ 28k • written 8.3 years ago by Nicolas Rosewick ▴ 20

score 3 · Answer 1 · 2016-08-12

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 1 hour ago

The city by the bay

I don't know why you're running voom and then only using the E matrix, you should just do:

glm = lmFit(geneExpr, design = design)

... otherwise you won't use the weights in the linear model, which is the whole point of voom.

That aside, your interpretation of the results is basically correct; testing the covariate for each mutated gene in the design matrix will tell you how many of the genes in your data set are DE in a manner that is associated with that gene's mutation pattern. There are, however, several caveats:

Causality. The best you can probably say is that the DE is associated with a mutation in a gene, rather than the latter causing the former. I would guess that you didn't go and experimentally introduce the mutations into the samples yourself, so for all we know, they're just markers for some underlying cause that actually drives the DE.
Orthogonality. If the pattern of mutation across patients is very similar for two or more genes, then you may fail to detect DE associated with either gene. This is because, if you drop the covariate for one gene from the design matrix, its friend will still be present and will fit the data just as well. If the null model fits well, the null hypothesis won't be rejected.
Additivity. Are the effects of gene mutations really additive? Is the effect of a mutation in a tumour really the same as in a control sample? If not, then your model won't fit well and you'll get inflated variances and/or correlated residuals later on. This will lead to some inaccuracies in error control, which - in the absence of more knowledge about the factors - you'll just have to accept.

ADD COMMENT • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Aaron, what would be a good way to assess the quality of limma model fits?

ADD REPLY • link 8.3 years ago Yury Bukhman ▴ 20

0

Entering edit mode

One approach would be to plot the residuals against each covariate. If the model fitted well, then there shouldn't be any trend in this plot, because it should have been regressed out by the model fit. It's probably easier to do this for all genes at once - if there's a noticeable trend, you should be concerned as most genes are poorly fitted. If there isn't a trend, then poor fits for a handful of genes can probably be ignored.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi @Aaron. Could you tell me how to plot such plot (residuals vs covariates). I did the analysis and get a list of DE genes for each of the covariates. And I found a significant degree of interestion between these lists of DE genes.. This make me feel that something is wrong in me design. Any ideas ?

ADD REPLY • link 8.3 years ago Nicolas Rosewick ▴ 20

0

Entering edit mode

A big intersection is not necessarily symptomatic of a problem with your design. Many genes may genuinely be affected by or associated with the mutations, so how do you know that's not the case?

Anyway, to make the residual vs covariate plot, you'll probably want to do something like this:

resids <- geneExpr$E - glm$coefficients %*% t(design)
covA <- matrix(design[,"geneA"], nrow=nrow(resids), ncol=ncol(resids), byrow=TRUE)
boxplot(split(resids, covA))

The model is likely to be insufficient if the two boxplots aren't very similar, e.g., due to a difference in the medians (unlikely, as this should average out across many genes) or due to a difference in the inter-quartile ranges (more likely, due to improperly modelled interaction effects that inflate the residuals for a few observations in each gene).

That being said, I'm not really sure what you could do if that were the case, because there's too many interaction terms that need testing (20-choose-2, and that's just for two-way interactions). Also, because you only have about 30 samples, you don't have that many residual d.f. to play with, which further restricts the model parametrization. It might be better to stick to a simple additive model, accept some inaccuracy (inflation of the variances will probably make limma conservative, which is better than the alternative), and move on to selecting candidates for experimental validation.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

I check the boxplot for all my covariates and they seems all very similar (no differences). So the model seems ok. As you advice, I'll select some interesting candidates for further testing. Thank you

ADD REPLY • link 8.3 years ago Nicolas Rosewick ▴ 20