The present analysis is an exercise physiology study with a mixed factorial design. The design is as follows: Time (pre vs. post intervention; repeated measures) x Metformin status (on drug vs off drug; independent subjects) x dietary intervention (control diet vs MUFA enriched vs PUFA enriched; independent subjects). The data under investigation comes from a custom preoteomics aptamer array (104 individual subjects; two timepoints; 857 x 208 expression matrix). The primary contrast of interest is the first one (pre-post). I have set up the analysis using the attached code, but others on the research team find the initial results highly improbable.
I am unclear on how to interpret this and I want to give their concerns due diligence. Any advice would be appreciated. My primary concern is that some aspect of the following code is inappropriate for the stated application due to faulty implementation/violated assumptions.
library(dplyr)
library(impute)
library(vsn)
library(limma)
#filter combined proteomics data to exclude MNAR probes where abundances are NA for more than 33.3% of samples
combined<-inner_join(join,bind,by="Order_Key_USE")
na<-apply(combined,2,is.na)
na_index<-apply(na,2,sum)/dim(combined)[1]
combined<-combined[,na_index<.333]
#join additional metadata to samples
add_data<-read.csv("meta.csv",check.names=FALSE,na.strings=".")[,1:10]
add_data$Sex<-as.factor(add_data$Sex)
add_data$Smoker<-as.factor(add_data$Smoker)
final_join<-dplyr::inner_join(add_data,combined,by="StudyID")
write.csv(final_join,file="Intermediate_one.csv")
#set up objects for limma and normalize expression data with vsn
R<-read.csv("Intermediate_one.csv",check.names=FALSE)
abunds<-justvsn(t(R[,-c(1:25)]))
# impute remaining NA assuming they are not MNAR
edata<-impute.knn(abunds,k = 10, rowmax = 1.0, colmax = 1.0, maxp = 1500, rng.seed=362436069)$data
rownames(edata)<-colnames(R[,-c(1:25)])
#set up model matrix for limma and estimate correlation of individual subject samples across repeated measures
pheno_table<-R[,c(2,3,14,18)]
colnames(pheno_table)<-c("StudyID","Treatment","Study","Main")
Treat<-factor(paste(pheno_table$Study,pheno_table$Main,sep="."))
Treat<-factor(paste(Treat,pheno_table$Treatment,sep="."))
levels(Treat)<-list("PUFApre.Metform"=c("PUFA.1.Metform"),
"PUFApost.Metform"=c("PUFA.2.Metform"),
"PUFApre.NoMetform"=c("PUFA.1.NoMetform"),
"PUFApost.NoMetform"=c("PUFAt.2.NoMetform"),
"NoDietpre.Metform"=c("NoDiet.1.Metform"),
"NoDietpost.Metform"=c("NoDiet.2.Metform"),
"NoDietpre.NoMetform"=c("NoDiet.1.NoMetform"),
"NoDietpost.NoMetform"=c("NoDiet.2.NoMetform"),
"MUFApre.Metform"=c("MUFA.1.Metform"),
"MUFApost.Metform"=c("MUFA.2.Metform"),
"MUFApre.NoMetform"=c("MUFA.1.NoMetform"),
"MUFApost.NoMetform"=c("MUFA.2.NoMetform"))
design<-model.matrix(~0+Treat)
colnames(design)<-levels(Treat)
corfit<-duplicateCorrelation(edata,design,block=pheno_table$StudyID)
#Fit linear model for within subjects contrast
fit<-lmFit(edata,design,block=pheno_table$StudyID,correlation=corfit$consensus)
cm <- makeContrasts(
`PUFApost.Metform-PUFApre.Metform` = PUFApost.Metform-PUFApre.Metform,
`NoDietpost.Metform-NoDietpre.Metform` = NoDietpost.Metform-NoDietpre.Metform,
`MUFApost.Metform-MUFApre.Metform` = MUFApost.Metform-MUFApre.Metform,
`PUFApost.NoMetform-PUFApre.NoMetform` = PUFApost.NoMetform-PUFApre.NoMetform,
`NoDietpost.NoMetform-NoDietpre.NoMetform` = NoDietpost.NoMetform-NoDietpre.NoMetform,
`MUFApost.NoMetform-MUFApre.NoMetform` = MUFApost.NoMetform-MUFApre.NoMetform,
levels=design)
#clean up DE table
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2,robust=TRUE)
T<-topTable(fit2,number=100000)
T<-cbind(rownames(T),T)
colnames(T)[1]<-"FeatureID"
Concerning the part before limma.I don't know from which instrument your proteomics data came out. Usually, a log transform is needed. It's not clear if it's already applied. You should also check graphically the VSN and the impute steps by plotting the replicates (or the samples) before and after each processing.