I'm working on analyzing a proteomic dataset in which I am interested in evaluating the difference in protein expression between case and control and the effect and interaction of gender on expression differences by case status.
I also have a factor (site) and a covariate (age) that I need to adjust for, neither of which I am interested in knowing the effects of, and that is where I'm having trouble. In a typical regression, I would just add them to the model when estimating my coefficients of interest, but, following this extremely helpful guide, I think I can/should include site as a random effects variable (if it's possible to do that with age, a continuous covariate, I would not know how to do that, which is why I don't list it as well even though the effect of age is not of interest to me). One of the issues I'm having with that is that it results in site being included as a blocking variable, and from my experience with blocking in clinical trials, this would not be a blocking variable; it's not that strict, it's just the sites across which observations are distributed.
So far, I've set up how to look at: differential expression by case status; differential expression by gender; differential expression by case status within genders; differential expression by gender within case status; and interaction between case status and gender. I believe that's more or less been coded correctly, but I will still include it and please do let me know if there are any improvements that could be made. I've also set up those same comparisons, but with two different attempts adjusting for age and site. All of that is included below. I am asking if either of those methods for adjusting is appropriate in this case, and if not, how should I adjust for age and site?
Any assistance is greatly appreciated!
This is the rough structure of the data I'm working with, except rather than "expression" being a single column, it's more like 8000 columns.
# Create mock dataset
pheno_dat <- data.frame(matrix(NA, nrow = 10, ncol = 5))
pheno_dat[1,] <- c("case", "female", "A", 44, "case_female")
pheno_dat[2,] <- c("control", "female", "B", 36, "control_female")
pheno_dat[3,] <- c("case", "female", "A", 45, "case_female")
pheno_dat[4,] <- c("control", "female", "C", 38, "control_female")
pheno_dat[5,] <- c("case", "female", "B", 46, "case_female")
pheno_dat[6,] <- c("control", "female", "A", 38, "control_female")
pheno_dat[7,] <- c("case", "male", "A", 48, "case_male")
pheno_dat[8,] <- c("control", "male", "B", 39, "control_male")
pheno_dat[9,] <- c("case", "male", "C", 47, "case_male")
pheno_dat[10,] <- c("control", "male", "C", 41, "control_male")
colnames(pheno_dat) <- c("status", "gender", "site", "age", "stat_gender")
pheno_dat$status <- as.factor(pheno_dat$status)
pheno_dat$status <- relevel(pheno_dat$status, ref="control")
pheno_dat$gender <- as.factor(pheno_dat$gender)
pheno_dat$gender <- relevel(pheno_dat$gender, ref="male")
pheno_dat$age <- as.numeric(pheno_dat$age)
pheno_dat$site <- as.factor(pheno_dat$site)
pheno_dat$stat_gender <- as.factor(pheno_dat$stat_gender)
pheno_dat$stat_gender <- relevel(pheno_dat$stat_gender, ref="control_male")
> pheno_dat
> status gender site age stat_gender
> 1 case female A 44 case_female
> 2 control female B 36 control_female
> 3 case female A 45 case_female
> 4 control female C 38 control_female
> 5 case female B 46 case_female
> 6 control female A 38 control_female
> 7 case male A 48 case_male
> 8 control male B 39 control_male
> 9 case male C 47 case_male
> 10 control male C 41 control_male
exprs_dat <- as.data.frame(c(9.7424779,
8.541871,
7.6822924,
9.6101021,
9.2693605,
8.8050989,
8.8638765,
7.9937876,
7.3689429,
8.9164766))
colnames(exprs_dat) <- "expression"
# Convert to expression set
exp_set <- as.ExpressionSet(exprs_dat)
exp_set@phenoData@data <- pheno_dat
The modeling strategy I've used is similar between the following two methods of covariate adjustment. I believe this part is done correctly (but again, please correct me if I'm wrong!).
This is my attempt at covariate adjustment by adding age to the design matrix and treating site as a random effects variable.
# Design matrix adjusting for age
means_stat_gender_age <- model.matrix(~ 0 + exp_set$stat_gender + exp_set$age)
colnames(means_stat_gender_age) <- c("control_male", "case_female", "case_male", "control_female", "age")
# Setting up site as a blocking (random effects) variable
cor <- duplicateCorrelation(exp_set, means_stat_gender_age, block=exp_set$site)
> cor$consensus.correlation
> [1] -0.3233333
# Model blocking on site
fit_stat_gender_age_block <- lmFit(object=exp_set, design=means_stat_gender_age, block=exp_set$site, correlation=cor$consensus.correlation)
fit_stat_gender_age_block <- eBayes(fit_stat_gender_age_block)
> topTable(fit_stat_gender_age_block)
> control_male case_female case_male control_female age AveExpr F P.Value adj.P.Val
> 1 -3.97873 -4.984572 -6.532959 -2.566777 0.3094411 8.679429 3894.656 5.733608e-09 5.733608e-09
# Contrasts for comparisons of interest, adjusting for age and blocking by site
contrasts_comp_int_stat_gender_age_block <- makeContrasts(
caseVScontrol=(case_male+case_female)/2-(control_male+control_female)/2, # Differential expression by case status
femaleVSmale=(case_female+control_female)/2-(case_male+control_male)/2, # Differential expression by gender
caseVScontrol_female=case_female-control_female, caseVScontrol_male=case_male-control_male, # Differential expression by case status within gender
femaleVSmale_case=case_female-case_male, femaleVSmale_control=control_female-control_male, # Differential expression by gender within case status
statusXgender=(case_female-control_male)-(control_female-control_male)-(case_male-control_male), # Interaction between case status and gender
levels=colnames(means_stat_gender_age))
fit_contrasts_comp_int_stat_gender_age_block <- contrasts.fit(fit_stat_gender_age_block, contrasts_comp_int_stat_gender_age_block)
fit_contrasts_comp_int_stat_gender_age_block <- eBayes(fit_contrasts_comp_int_stat_gender_age_block)
> topTable(fit_contrasts_comp_int_stat_gender_age_block)
> caseVScontrol femaleVSmale caseVScontrol_female caseVScontrol_male femaleVSmale_case femaleVSmale_control statusXgender
> 1 -2.486012 1.48017 -2.417794 -2.554229 1.548387 1.411952 0.1364348
> AveExpr F P.Value adj.P.Val
> 1 8.679429 0.9186268 0.4954753 0.4954753
This is my attempt at covariate adjustment by adding age and site to the design matrix.
# Design matrix adjusting for age and site
means_stat_gender_cor <- model.matrix(~ 0 + exp_set$stat_gender + exp_set$age + exp_set$site)
colnames(means_stat_gender_cor) <- c("control_male", "case_female", "case_male", "control_female", "age", "B", "C")
fit_stat_gender_cor <- lmFit(exp_set, means_stat_gender_cor)
fit_stat_gender_cor <- eBayes(fit_stat_gender_cor)
> topTable(fit_stat_gender_cor)
> control_male case_female case_male control_female age B C AveExpr F P.Value adj.P.Val
> 1 -2.58555 -3.574291 -5.064257 -1.339882 0.2779798 -0.1102714 -0.04674809 8.679429 83.64332 0.001962154 0.001962154
# Contrasts for comparisons of interest, adjusting for age and blocking by site
contrasts_comp_int_stat_gender_cor <- makeContrasts(
caseVScontrol=(case_male+case_female)/2-(control_male+control_female)/2, # Differential expression by case status
femaleVSmale=(case_female+control_female)/2-(case_male+control_male)/2, # Differential expression by gender
caseVScontrol_female=case_female-control_female, caseVScontrol_male=case_male-control_male, # Differential expression by case status within gender
femaleVSmale_case=case_female-case_male, femaleVSmale_control=control_female-control_male, # Differential expression by gender within case status
statusXgender=(case_female-control_male)-(control_female-control_male)-(case_male-control_male), # Interaction between case status and gender
levels=colnames(means_stat_gender_cor))
fit_contrasts_comp_int_stat_gender_cor <- contrasts.fit(fit_stat_gender_cor, contrasts_comp_int_stat_gender_cor)
fit_contrasts_comp_int_stat_gender_cor <- eBayes(fit_contrasts_comp_int_stat_gender_cor)
> topTable(fit_contrasts_comp_int_stat_gender_cor)
> caseVScontrol femaleVSmale caseVScontrol_female caseVScontrol_male femaleVSmale_case femaleVSmale_control statusXgender
> 1 -2.356557 1.367817 -2.234408 -2.478707 1.489967 1.245668 0.2442984
> AveExpr F P.Value adj.P.Val
> 1 8.679429 0.3691084 0.7826167 0.7826167
If anything is missing or unclear, please let me know, and I'll address it as best as possible!
Cheers!
# please also include the results of running the following in an R session
sessionInfo( )
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.5
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] statmod_1.4.36 BiocManager_1.30.18 leukemiasEset_1.30.0 destiny_3.8.1 limma_3.50.3 ROCR_1.0-11
[7] glmnet_4.1-4 Matrix_1.4-1 gridExtra_2.3 janitor_2.1.0 UniprotR_2.2.0 magrittr_2.0.3
[13] glue_1.6.2 Gmisc_3.0.0 htmlTable_2.4.1 Rcpp_1.0.9 robust_0.7-1 fit.models_0.64
[19] sva_3.42.0 BiocParallel_1.28.3 genefilter_1.76.0 mgcv_1.8-40 nlme_3.1-158 report_0.5.1
[25] broom_1.0.0 flextable_0.7.2 rempsyc_0.0.4.5 devtools_2.4.4 usethis_2.1.6 forcats_0.5.1
[31] table1_1.4.2 psych_2.2.5 dplyr_1.0.9 tidyr_1.2.0 Hmisc_4.7-0 Formula_1.2-4
[37] survival_3.3-1 lattice_0.20-45 ggplot2_3.3.6 lmSupport_2.9.13 readxl_1.4.0 Biobase_2.54.0
[43] BiocGenerics_0.40.0
loaded via a namespace (and not attached):
#Removed because of character limit
I understand that your mock dataset has expression data for just one protein instead of 8000, but is the phenoData complete? Does your real dataset have 10 samples, same as the mock dataset?
No, the real data has 153 samples. My apologies, I thought I'd included that somewhere. The balance of status, gender, site, and age is approximately the same as I've put here though!