Hi,
I analyzed a publicly available gene expression dataset GSE4917 with two treatments (dex and control), 4 treatment time points (0.5, 2, 4, 24hr) and 3 replicates under each condition (24 samples in total). I would like to compare expression changes of dex vs. control at 24hr (3 samples under each condition and 6 samples in total). I tried two ways: 1. put all 24 samples in the design matrix and make a contrast matrix for dex 24hr vs. control 24hr, and 2. extract expression values of 6 samples treated with 24hr and run limma in the subset dataset. I found the p-values obtained from option 1 is significantly smaller than from option 2.
I checked R codes for eBay function and found the code to compute p-values is in Line 51:
out$p.value <- 2*pt(-abs(out$t),df=df.total))
I found that the degree of freedom used for p-value computation is based on the total 24 samples in the model, instead of the samples actually used in the comparison (i.e. 3 samples with dex versus 3 samples with control at 24 hr), which mainly drive the significance of p-values.
My question here is that for computing p-values for subgroup comparion, would it be more reasonable using the degree of freedom based on actual sample size in the particular comparion rather than based on all the samples in the design model matrix? Thank you very much!
Below are my R codes:
library(GEOquery)
library(oligo)
library(limma)
library(dplyr)
# load data
getGEOSuppFiles("GSE4917")
untar("./GSE4917/GSE4917_RAW.tar", exdir = "./GSE4917/data")
celFiles <- list.celfiles("./GSE4917/data", full.names = TRUE, listGzipped = TRUE)
raw.data <- read.celfiles(celFiles)
# assigan phenotype information
pData(raw.data)$treatment_agent <- rep(c("control", "dex"), 12)
pData(raw.data)$treatment_time <- rep(rep(c(0.5, 2, 4, 24), each=2), 3)
pData(raw.data)$donor <- rep(c("rep1", "rep2", "rep3"), each=8)
pData(raw.data)$status <- paste0(pData(raw.data)$treatment_agent, pData(raw.data)$treatment_time)
# RMA normalization
rma.data <- rma(raw.data)
# 1. Perform limma: include all samples in linear model
design <- model.matrix(~ -1 + factor(rma.data$status))
colnames(design) <- levels(factor(rma.data$status))
fit <- lmFit(rma.data, design)
contrast <- makeContrasts(contrast = dex24 - control24, levels = design)
fit2 <- contrasts.fit(fit, contrast)
fit2 <- eBayes(fit2)
results <- topTable(fit2, coef = "contrast", adjust = "BH", num=Inf)
head(results)
# logFC AveExpr t P.Value adj.P.Val B
#204602_at 3.769458 8.515285 14.035859 1.948652e-11 4.342182e-07 11.000748
#203962_s_at 2.379628 7.575251 9.819835 7.526558e-09 8.385715e-05 7.905465
#204388_s_at 1.774723 8.344796 8.985486 3.034137e-08 2.253656e-04 7.044685
#203961_at 1.406293 7.895797 8.225007 1.164625e-07 6.487834e-04 6.167950
#204560_at 3.131549 8.672537 7.612022 3.636916e-07 1.620828e-03 5.391914
#206100_at 2.217901 7.230835 6.964108 1.280040e-06 4.753853e-03 4.500461
# 2. Perform limma: only include samples treated with 24 hour
sample.sel <- row.names(pData(raw.data))[pData(raw.data)$treatment_time=="24"]
rma.data.sel <- rma.data[, sampleNames(rma.data)%in%sample.sel]
design.sub <- model.matrix(~ -1 + factor(rma.data.sel$status))
colnames(design.sub) <- levels(factor(rma.data.sel$status))
fit.sub <- lmFit(rma.data.sel, design.sub)
contrast.sub <- makeContrasts(contrast = dex24 - control24, levels = design.sub)
fit2.sub <- contrasts.fit(fit.sub, contrast.sub)
fit2.sub <- eBayes(fit2.sub)
results.sub <- topTable(fit2.sub, coef = "contrast", adjust = "BH", num=Inf)
head(results.sub)
# logFC AveExpr t P.Value adj.P.Val B
#204602_at 3.769458 9.040398 13.670974 2.631346e-06 0.05863428 -0.4221248
#202149_at 3.066575 8.826255 11.835967 6.954618e-06 0.07748488 -0.5320640
#204388_s_at 1.774723 8.770217 10.951224 1.168845e-05 0.08681791 -0.6027886
#200648_s_at 2.112701 10.351704 9.815470 2.413108e-05 0.13442822 -0.7178475
#202150_s_at 2.381962 9.219177 9.221501 3.633009e-05 0.16190867 -0.7921431
#203962_s_at 2.379628 7.978510 8.839395 4.785161e-05 0.17517888 -0.8463093
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] pd.hg.u133a_3.12.0 DBI_1.1.0 RSQLite_2.2.1 dplyr_1.0.2 limma_3.44.3 oligo_1.52.1
[7] Biostrings_2.56.0 XVector_0.28.0 IRanges_2.22.2 S4Vectors_0.26.1 oligoClasses_1.50.4 GEOquery_2.56.0
[13] Biobase_2.48.0 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.18.2 tidyselect_1.1.0 purrr_0.3.4 splines_4.0.3
[5] lattice_0.20-41 vctrs_0.3.4 generics_0.0.2 yaml_2.2.1
[9] blob_1.2.1 rlang_0.4.8 pillar_1.4.6 glue_1.4.2
[13] bit64_4.0.5 affyio_1.58.0 matrixStats_0.57.0 GenomeInfoDbData_1.2.3
[17] foreach_1.5.1 lifecycle_0.2.0 zlibbioc_1.34.0 codetools_0.2-16
[21] memoise_1.1.0 ff_4.0.4 GenomeInfoDb_1.24.2 preprocessCore_1.50.0
[25] Rcpp_1.0.5 readr_1.4.0 BiocManager_1.30.10 DelayedArray_0.14.1
[29] affxparser_1.60.0 bit_4.0.4 hms_0.5.3 digest_0.6.25
[33] GenomicRanges_1.40.0 grid_4.0.3 tools_4.0.3 bitops_1.0-6
[37] magrittr_1.5 RCurl_1.98-1.2 tibble_3.0.4 crayon_1.3.4
[41] tidyr_1.1.2 pkgconfig_2.0.3 ellipsis_0.3.1 Matrix_1.2-18
[45] xml2_1.3.2 rstudioapi_0.11 iterators_1.0.13 R6_2.4.1
[49] compiler_4.0.3
Thanks for your reply.
So it really depends on what we put in the linear model, right? I'm wondering if a dataset contains samples with many unrelated treatments/disease types that results in a very large sample size, we should only include samples in groups with biological relevance that we want to compare in the design model, rather than including all samples which might lead to false positives?
That's a reasonable concern. You can certainly argue (I would!) that having really disparate samples in one model is sub-optimal, particularly if the expected variance of the different groups is different. That's actually one of the assumptions for ANOVA, that the within-group variability is expected to be similar. In the context of a t-test, there's the Welch adjustment that is supposed to help with non-constant variance, but I don't know if there's anything similar for ANOVA models, so it's up to the analyst to make sure things are OK.
And I read from the codes eBayes function in Line 51 that p-values are computed based on the distribution of t-statistics not the F-statistics:
Please direct me to the correct place if I've got the wrong codes!
Thank you!
That's true. For a contrast limma uses the t distribution as the null. A t-statistic, under the null hypothesis is expected to be distributed t. The same is true for a contrast, which is what limma is computing.
The difference being that a t-statistic is conventionally computed using just two groups, and the denominator of the statistic is computed using the within-group variance of the two groups being compared. A contrast has the same form as a t-statistic, being a difference in means in the numerator, but the denominator is computed using the within-group variance from all the groups in the model. This tends to make a contrast more powerful than a t-statistic because the variance is not a very efficient statistic, and requires more data to converge to the underlying parameter than a more efficient statistic (like the mean). So more data means more accurate variance estimate. And a more accurate variance estimate, all things equal, results in more power to detect differences.
Thanks for the explanations!
I agree that the estimation of t-statistics using a denomintor based on within-group variance from all the groups makes sense. My question here is that since the contrast is estimated mainly based on means of samples from the two groups, would it be more reasonable to estimate p-values using the degree of freedom based on the two group sample size rather than the total sample size (i.e. df.total)? The significance of 3 vs 3 sample comparison looks driven by the total number of samples (24 samples) rather than the moderate t-statistics.
No, that's not how it works. The degrees of freedom have nothing to do with the numerator of the statistic. The numerator only tells you how far apart the means of the two groups are. The denominator is used as sort of a yardstick to say if that difference is large or not, based on the expectation from the null distribution, which itself is defined by the degrees of freedom.
So it makes no sense to compute a statistic based on a larger df and then pretend like it was computed using a smaller df.
Anyway, you are questioning methods that Ronald Fisher developed back in the 1920s, and that thousands of statisticians have used for the last 100 years. And if you used
lm
to fit the model, the t-statistic would be based on the total.df. At some point you either decide that everybody is wrong, and try to prove that mathematically, or you can be like me and say 'sounds legit'.Thanks for your reply!
I got your point that p-values are estimated based on the linear model we created at the first step. So I guess the take-home message here is that to be cautious to include samples from unrelated groups in the first design matrix, to avoid potential false positives in the comparison of the phenotypes of our interest.
No, you will not generally get false positives by including all the samples in your linear model.
In your case the extra samples are not unrelated but an intrinsic part of the same experimental design. Including all the samples will usually result in more power and consistency, but this is a genuine increase in power rather than a source of false positives.
The only situation in which there would be an issue would be if your 24 hour samples are much more variable than the earlier time points, and you can handle that contingency in limma by using sample weights.