limma differences with and without an intercept
Guest User ★ 13k
@guest-user-4897
Last seen 10.2 years ago
When setting up a model in limma, if I include or exclude an intercept term, I get slightly different results. For example, here is a simplified version of my model, which uses a paired design; each patient is represented twice, with one measurement when sick and one when recovered:

design = model.matrix(~ patient + status, data=samples_norm)
colnames(design)
 [1] "(Intercept)"        "patientA010"        "patientA011"
 [4] "patientA019"        "patientA024"        "patientA030"
 [7] "patientA031"        "patientA034"        "patientA036"
[10] "patientA041"        "patientA055"        "patientA067"
[13] "patientA073"        "patientA080"        "statusexacerbation"

The first patient is patient A004. If I run an analysis and compare patient A010 to the baseline (patientA004):

y = DGEList(counts=counts)
y = calcNormFactors(y)
v = voom(y, design)
fit = lmFit(v, design)
fit = eBayes(fit)
a = rownames(topTable(fit, coef=2, n=Inf, p.value=0.05))
length(a)
[1] 70

If I do the same analysis, but without the intercept term, I get a slightly different result:

design = model.matrix(~ 0 + patient + status, data=samples_norm)
y = DGEList(counts=counts)
y = calcNormFactors(y)
v = voom(y, design)
fit = lmFit(v, design)
cm = makeContrasts(
    onepatient=patientA010-patientA004,
    levels=design)
fit = contrasts.fit(fit, cm)
fit = eBayes(fit)
b = rownames(topTable(fit, n=Inf, p.value=0.05))
length(b)
[1] 72

length(intersect(a, b))
[1] 69

This does not occur using the pickrell1 dataset. At first I thought this might just be due to rounding of the p-values, but one of the discordant genes is not close to the cutoff, so that isn't it. Should I not expect these two ways of parameterizing the model to give the same results?

-- output of sessionInfo():

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin13.0.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   grid      parallel  stats     graphics  grDevices utils
[8] datasets  methods   base

other attached packages:
 [1] Vennerable_3.0            gtools_3.2.1
 [3] RBGL_1.38.0               graph_1.40.1
 [5] CHBUtils_0.1              tweeDEseqCountData_1.0.10
 [7] sva_3.8.0                 statmod_1.4.18
 [9] rpart_4.1-5               RColorBrewer_1.0-5
[11] mgcv_1.7-28               nlme_3.1-113
[13] DESeq_1.14.0              lattice_0.20-24
[15] locfit_1.5-9.1            corpcor_1.6.6
[17] BiocInstaller_1.12.0      affyPLM_1.38.0
[19] preprocessCore_1.24.0     gcrma_2.34.0
[21] affy_1.40.0               gplots_2.12.1
[23] biomaRt_2.18.0            knitr_1.5
[25] devtools_1.4.1            reshape_0.8.4
[27] plyr_1.8                  DESeq2_1.2.10
[29] RcppArmadillo_0.4.000.2   Rcpp_0.11.0
[31] GenomicRanges_1.14.4      XVector_0.2.0
[33] IRanges_1.20.6            vsn_3.30.0
[35] gridExtra_0.9.1           ggplot2_0.9.3.1
[37] HTSFilter_1.2.1           Biobase_2.22.0
[39] BiocGenerics_0.8.0        edgeR_3.4.2
[41] limma_3.18.11             googleVis_0.4.7
[43] xtable_1.7-1              extrafont_0.16
[45] dplyr_0.1.1

loaded via a namespace (and not attached):
 [1] affyio_1.30.0        annotate_1.40.0      AnnotationDbi_1.24.0
 [4] assertthat_0.1       Biostrings_2.30.1    bitops_1.0-6
 [7] caTools_1.16         codetools_0.2-8      colorspace_1.2-4
[10] compiler_3.0.2       DBI_0.2-7            dichromat_2.0-0
[13] digest_0.6.4         evaluate_0.5.1       extrafontdb_1.0
[16] formatR_0.10         gdata_2.13.2         genefilter_1.44.0
[19] geneplotter_1.40.0   gtable_0.1.2         httr_0.2
[22] KernSmooth_2.23-10   labeling_0.2         MASS_7.3-29
[25] Matrix_1.1-2         memoise_0.1          munsell_0.4.2
[28] proto_0.3-10         RCurl_1.95-4.1       reshape2_1.2.2
[31] RJSONIO_1.0-3        RSQLite_0.11.4       Rttf2pt1_1.2
[34] scales_0.2.3         stats4_3.0.2         stringr_0.6.2
[37] survival_2.37-7      tools_3.0.2          whisker_0.3-2
[40] XML_3.98-1.1         zlibbioc_1.8.0

-- Sent via the guest posting facility at bioconductor.org.
@gordon-smyth
Last seen 17 minutes ago
WEHI, Melbourne, Australia

Dear Rory,

> When setting up a model in limma, if I include or exclude an intercept
> term, I get slightly different results.

There are a few situations in which contrasts.fit() uses an approximation to avoid having to refit the linear model for each gene. The help page for contrasts.fit gives the following warning:

"Warning. For efficiency reasons, this function does not re-factorize the
design matrix for each probe. A consequence is that, if the design matrix
is non-orthogonal and the original fit included quality weights or missing
values, then the unscaled standard deviations produced by this function
are approximate rather than exact. The approximation is usually
acceptable. The results are always exact if the original fit was a
oneway model."

This approximation has generally been fine in the past with microarray data, but the voom weights are pushing the approximation to the point that it has a noticeable effect on some analyses.  We are planning to rewrite the linear model code in limma to eliminate the approximation by storing more information from each fit.
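In the meantime, any comparison against a chosen reference level can be tested exactly by keeping the intercept parameterization and releveling the factor, so that the comparison of interest becomes a coefficient of the fit and contrasts.fit() is not needed. A minimal sketch, assuming the counts and samples_norm objects from your post and that patient is stored as a factor:

library(limma)
library(edgeR)

# Make A004 the reference level so that "patientA010" estimates A010 - A004
samples_norm$patient <- relevel(samples_norm$patient, ref = "A004")
design <- model.matrix(~ patient + status, data = samples_norm)

y   <- DGEList(counts = counts)
y   <- calcNormFactors(y)
v   <- voom(y, design)
fit <- eBayes(lmFit(v, design))

# Tested exactly, with no contrasts.fit() approximation
topTable(fit, coef = "patientA010", n = Inf, p.value = 0.05)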

> For example here is a simplified version of my model,
> which uses a paired design, each patient is
> represented twice with a measurement when sick and when recovered:
>
> design = model.matrix(~ patient + status,  data=samples_norm)
> colnames(design)
> [1] "(Intercept)"        "patientA010"        "patientA011"
> [4] "patientA019"        "patientA024"        "patientA030"
> [7] "patientA031"        "patientA034"        "patientA036"
> [10] "patientA041"        "patientA055"        "patientA067"
> [13] "patientA073"        "patientA080"        "statusexacerbation"
>
> The first patient is patient A004. If I run an analysis and compare patient A010 to the baseline (patientA004):
>
> y = DGEList(counts=counts)
> y = calcNormFactors(y)
> v = voom(y, design)
> fit = lmFit(v, design)
> fit = eBayes(fit)
> a = rownames(topTable(fit, coef=2, n=Inf, p.value=0.05))
> length(a)
> [1] 70

This result is exact.

> If I do the same analysis, but without the intercept term, I get a slightly different result:
> design = model.matrix(~ 0 + patient + status,
>    data=samples_norm)
> y = DGEList(counts=counts)
> y = calcNormFactors(y)
> v = voom(y, design)
> fit = lmFit(v, design)
> cm = makeContrasts(
>    onepatient=patientA010-patientA004,
>    levels=design)
> fit = contrasts.fit(fit, cm)
> fit = eBayes(fit)
> b = rownames(topTable(fit, n=Inf, p.value=0.05))
> length(b)
> [1] 72

This result is approximate.

>> length(intersect(a, b))
> [1] 69
>
> This does not occur using the pickrell1 dataset.

Because the design matrix is orthogonal in that case.  The approximation only arises when there is an additive term (+ status in your case).
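If you want to gauge how much the approximation matters for a given dataset, one simple check is to compare the moderated t-statistics for the same comparison from the two parameterizations. A sketch, using hypothetical object names: fit_exact for the eBayes() fit from the intercept model and fit_approx for the eBayes() fit after contrasts.fit() on the no-intercept model:

# Moderated t-statistics for A010 - A004 under the two parameterizations
t_exact  <- fit_exact$t[, "patientA010"]
t_approx <- fit_approx$t[, "onepatient"]

# Summarise and visualise the discrepancy introduced by the approximation
summary(t_exact - t_approx)
plot(t_exact, t_approx); abline(0, 1)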

Best wishes
Gordon


I still see the warning message about the approximation in contrasts.fit() with limma version 3.30.8. In my case, the difference between the exact and approximate methods is about 900 genes vs 1500 genes at adj.P.Val < 0.05, using an additive model for batch correction with limma/voom. Unfortunately, the package I am using (EGSEA) requires contrasts as input, and it seems that each contrast column has to sum to 0. So when I use the design for the exact method and designate a single treatment coefficient as a contrast, I get error messages from the package. And if I use the approximate method, there is a large difference, which makes me wonder about the accuracy of the approximation with my voom weights. I am not sure what to do. Perhaps I could correct for the batch effect first using the exact method and then feed the results into the EGSEA pipeline midstream. Any thoughts would be greatly appreciated.

 


I can't answer questions about the EGSEA package, and I don't want to continue a 3-year-old post anyway. If you want to use the EGSEA package, please post a new question with an EGSEA tag so that the EGSEA authors will see it.

You can of course simply do gene set testing in limma. You don't need EGSEA unless you want to combine lots of methods together.
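For example, a minimal sketch of a competitive gene set test with limma's camera(), assuming a voom object v and design matrix design like those earlier in this thread, and a hypothetical named list of gene sets called sets:

# Map gene set identifiers to row indices of the voom object
idx <- ids2indices(sets, rownames(v))

# Competitive gene set test; camera() uses the voom weights stored in v.
# Here the comparison of interest is taken to be the last design column
# (e.g. statusexacerbation in the design earlier in this thread).
camera(v, idx, design, contrast = ncol(design))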

You could also use limma-trend instead of limma-voom so there is no issue with approximate contrasts.
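A minimal sketch of the limma-trend route, assuming the counts, design and contrast matrix cm from earlier in the thread (the prior.count value is just an illustrative choice):

# limma-trend models logCPM values without observation-level weights,
# so contrasts.fit() is exact here
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
logCPM <- cpm(y, log = TRUE, prior.count = 3)

fit <- lmFit(logCPM, design)
fit <- contrasts.fit(fit, cm)
fit <- eBayes(fit, trend = TRUE)
topTable(fit, n = Inf, p.value = 0.05)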
