Setting contrasts in limma where null hypothesis is that Test = 2 x Control
1
0
Entering edit mode
stuart ▴ 10
@stuart-9561
Last seen 3.5 years ago
Australia

Hello,

I am analysing RNA-seq data to find differentially expressed genes between a wild-type (Ctl) and mutant (Mut). The mutant has a large DNA duplication encompassing several hundred genes, so they always come up as ~2x upregulated but this is not biologically interesting. Hence the H0 (null) for these particular genes is that Mut = 2 x Ctl, while H0 for the rest of the genes is Mut = Ctl.

My workaround for this was to do two runs of contrast testing on the limma.voom output, one run where H0 is Mut = Ctl, the second where H0 is Mut = 2 x Ctl. Then I would get 2 topTable outputs, one from which I'd select rows for the non-duplicated genes and the other from which I'd select the duplicated genes. However, I am having difficulty specifying (or perhaps even understanding) the contrasts for these comparisons and the results are not making sense to me. In particular, I would have expected the log-FC values from both comparisons to be well-correlated, but they are actually very different, especially for genes with high expression. So is this even a valid approach?

Here is an example using the zebrafish dataset:

library(zebrafishRNASeq)
library(limma)
library(edgeR)
library(ggplot2)
data(zfGenes)
filter <- apply(zfGenes, 1, function(x) length(x[x>10]) > 1)
cmat <- zfGenes[filter,]
design <- matrix(c(rep(1,3), rep(0,6), rep(1,3)), ncol=2, dimnames = list(c(),c('ctl','mut')))
nf <- calcNormFactors(cmat)
y <- voom(cmat, design, lib.size=colSums(cmat)*nf)
fit <- lmFit(y,design)

## make a function to run eBayes etc. with different null hypotheses ##
run_cf_eb <- function(H0 = 1, fit){
  contrasts = paste0('mut-', H0, '*ctl')
  cont.matrix <- makeContrasts(contrasts = contrasts, levels=c('ctl','mut'))
  print(cont.matrix)
  fit_b <- contrasts.fit(fit, cont.matrix)
  fit_b <- eBayes(fit_b)
  return( topTable(fit_b, n=Inf, sort.by='none') )
}

tt_1 <- run_cf_eb(H0 = 1, fit) ## H0 is 'ctl = 1 x mut' ##
tt_2 <- run_cf_eb(H0 = 2, fit) ## H0 is 'ctl = 2 x mut' ##
tt_all <- merge(tt_1, tt_2, by = 'row.names', suffixes = c('.H0_1','.H0_2'))

## Plot logFC from H0('ctl=mut') vs logFC from H0('ctl=2mut'): ##
ggplot(tt_all) + geom_point(aes(x=logFC.H0_1, y=logFC.H0_2, colour = AveExpr.H0_1))

Output:

      Contrasts
Levels mut-1*ctl
   ctl        -1
   mut         1
      Contrasts
Levels mut-2*ctl
   ctl        -2
   mut         1

 

I realize that I will probably have to tweak the calcNormFactors bit to account for the duplication and also re-run the p-value adjustment after the analysis, however my question right now is just about how to approach the contrasts.

Thanks,

Stuart

 

differential gene expression limma rnaseq design and contrast matrix limma voom • 1.6k views
ADD COMMENT
4
Entering edit mode
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia

You are forgetting that limma is working with expression values on the log2-scale, so a 2-fold increase in Mutant vs Control is represented by Mut - Ctl = log2(2) = 1. If you conduct a standard DE analysis by:

design <- cbind(Intercept=1, MutvsCtl=c(0,0,0,1,1,1))
y <- voom(cmat, design, lib.size=colSums(cmat)*nf)
fit <- lmFit(y, design)
fit <- eBayes(fit)
topTable(fit, coef=2)

then you would see a lot of logFC values about 1 in the top-table for genes in the duplicated region. Is this what you actually see?

If 'InDup' is TRUE for genes in the duplicated region, then look at:

plotMD(fit, col=2, status=InDup)

to see whether the logFC are doing what you expect.

To formally test whether the true logFC is equal to 1, you can do this:

t.stat <- (fit$coefficients[,2] - 1) / fit$stdev.unscaled[,2] / sqrt(fit$s2.post)
p.value <- 2*pt(-abs(t.stat), df=fit$df.total)
FDR <- p.adjust(p.value, method="BH")

To formally test whether genes in the duplicated region tend to be up-regulated overall you can conduct a gene set test:

roast(y, index=InDup, design=design)

or even better:

roast(y, index=InDup, design=design, gene.weights=pmax(fit$Amean,0))

to weight genes according to their expression levels.

Note: this answer was edited after Stuart's comment.

ADD COMMENT
0
Entering edit mode

Thank you very much, this was really helpful.

The duplicated genes show an average ~ +0.8 logFC. This is less than +1.0 probably due to homeostatic mechanisms at play, which indicates that I should treat duplicated genes that are downregulated from the logFC +1.0 baseline with caution. Nonetheless it is pretty clear that the duplicated genes are generally overexpressed.

The following slightly modified code worked on limma_3.22.7 in R3.1.3:

t.stat <- (fit$coefficients[,2] - 1) / fit$stdev.unscaled[,2] / sqrt(fit$s2.post)
p.value <- 2*pt(-abs(t.stat), df = fit$df.total)
FDR <- p.adjust(p.value, method="BH")

Thanks,

Stuart

edit: changed degrees of freedom from '4' to 'fit$df.total' according to comments below

ADD REPLY
0
Entering edit mode

For future reference, it's typically a good idea to be using the latest version of all packages. Currently, limma is at version 3.26.5, so you're two Bioconductor releases behind (for reference, we're currently at BioC version 3.2 on R 3.2.2+). There's no real reason to use an older version unless you're trying to reproduce old results.

ADD REPLY
0
Entering edit mode

Yes, well spotted, there were two typos in my code for the t.stat and p.value. Note that the df for pt() should be the total empirical Bayes df, not just the ordinary residual df.

ADD REPLY

Login before adding your answer.

Traffic: 850 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6