Question

Sanity Checking Mixed Factorial Proteomics Analysis Workflow Using Limma

0

Entering edit mode

JMallory • 0

@jmallory-13488

Last seen 5.9 years ago

The present analysis is an exercise physiology study with a mixed factorial design. The design is as follows: Time (pre vs. post intervention; repeated measures) x Metformin status (on drug vs off drug; independent subjects) x dietary intervention (control diet vs MUFA enriched vs PUFA enriched; independent subjects). The data under investigation comes from a custom preoteomics aptamer array (104 individual subjects; two timepoints; 857 x 208 expression matrix). The primary contrast of interest is the first one (pre-post). I have set up the analysis using the attached code, but others on the research team find the initial results highly improbable.

I am unclear on how to interpret this and I want to give their concerns due diligence. Any advice would be appreciated. My primary concern is that some aspect of the following code is inappropriate for the stated application due to faulty implementation/violated assumptions.

library(dplyr)
library(impute)
library(vsn)
library(limma)

#filter combined proteomics data to exclude MNAR probes where abundances are NA for more than 33.3% of samples       
combined<-inner_join(join,bind,by="Order_Key_USE")
na<-apply(combined,2,is.na)
na_index<-apply(na,2,sum)/dim(combined)[1]
combined<-combined[,na_index<.333]

#join additional metadata to samples
add_data<-read.csv("meta.csv",check.names=FALSE,na.strings=".")[,1:10]
add_data$Sex<-as.factor(add_data$Sex)
add_data$Smoker<-as.factor(add_data$Smoker)
final_join<-dplyr::inner_join(add_data,combined,by="StudyID")
write.csv(final_join,file="Intermediate_one.csv") 

#set up objects for limma and normalize expression data with vsn 
R<-read.csv("Intermediate_one.csv",check.names=FALSE)
abunds<-justvsn(t(R[,-c(1:25)]))

# impute remaining NA assuming they are not MNAR 
edata<-impute.knn(abunds,k = 10, rowmax = 1.0, colmax = 1.0, maxp = 1500, rng.seed=362436069)$data
rownames(edata)<-colnames(R[,-c(1:25)])

#set up model matrix for limma and estimate correlation of individual  subject samples across repeated measures 
pheno_table<-R[,c(2,3,14,18)]
colnames(pheno_table)<-c("StudyID","Treatment","Study","Main")
Treat<-factor(paste(pheno_table$Study,pheno_table$Main,sep="."))
Treat<-factor(paste(Treat,pheno_table$Treatment,sep="."))
levels(Treat)<-list("PUFApre.Metform"=c("PUFA.1.Metform"),
                    "PUFApost.Metform"=c("PUFA.2.Metform"),
                    "PUFApre.NoMetform"=c("PUFA.1.NoMetform"),
                    "PUFApost.NoMetform"=c("PUFAt.2.NoMetform"),
                    "NoDietpre.Metform"=c("NoDiet.1.Metform"),
                    "NoDietpost.Metform"=c("NoDiet.2.Metform"),
                    "NoDietpre.NoMetform"=c("NoDiet.1.NoMetform"),
                    "NoDietpost.NoMetform"=c("NoDiet.2.NoMetform"),
                    "MUFApre.Metform"=c("MUFA.1.Metform"),
                    "MUFApost.Metform"=c("MUFA.2.Metform"),
                    "MUFApre.NoMetform"=c("MUFA.1.NoMetform"),
                    "MUFApost.NoMetform"=c("MUFA.2.NoMetform"))
design<-model.matrix(~0+Treat)
colnames(design)<-levels(Treat)
corfit<-duplicateCorrelation(edata,design,block=pheno_table$StudyID)

#Fit linear model for within subjects contrast
fit<-lmFit(edata,design,block=pheno_table$StudyID,correlation=corfit$consensus)
cm <- makeContrasts(
    `PUFApost.Metform-PUFApre.Metform` = PUFApost.Metform-PUFApre.Metform,
    `NoDietpost.Metform-NoDietpre.Metform` = NoDietpost.Metform-NoDietpre.Metform,
    `MUFApost.Metform-MUFApre.Metform` = MUFApost.Metform-MUFApre.Metform,
    `PUFApost.NoMetform-PUFApre.NoMetform` = PUFApost.NoMetform-PUFApre.NoMetform,
    `NoDietpost.NoMetform-NoDietpre.NoMetform` = NoDietpost.NoMetform-NoDietpre.NoMetform,
    `MUFApost.NoMetform-MUFApre.NoMetform` = MUFApost.NoMetform-MUFApre.NoMetform,
    levels=design)

#clean up DE table 
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2,robust=TRUE)
T<-topTable(fit2,number=100000)
T<-cbind(rownames(T),T)
colnames(T)[1]<-"FeatureID"

limma • 969 views

ADD COMMENT • link updated 5.9 years ago by Gordon Smyth 52k • written 5.9 years ago by JMallory • 0

score 0 · Answer 1 · 2019-06-01

The limma part of this only starts at # set up model matrix for limma and .... Before that, there seem to be some proteomics-specific steps that I would not be qualified to answer. The data munging step is also a mystery to me as well, so I can only assume that you've done that part correctly.

Anyway, onto the limma part. It mostly looks fine, though there is surely a better way of setting levels(Treat). Note that calling topTable without coef will do an ANOVA-like analysis, testing all contrasts at once. This may not be what you want if you need DE results for each individual contrast.

You don't mention exactly why your colleagues find the results to be improbable. I would suggest (i) finding a handful of DE proteins that they're not liking, and (ii) looking back through the expression values to determine the step at which it became DE. You can do this by just plotting raw or transformed or imputed expression values - and if there is a clear difference even with the raw values, then there's really nothing more to be said. Well, unless you mixed up the protein names during pre-processing.

score 0 · Answer 2 · 2019-06-02

Well, you've made a number of non-trivial and debatable pre-processing steps. We don't know anything about your custom proteomics array, so we can't tell you whether your pre-processing is good or bad. Actually it would be hard to give definitive advice even if we did know your platform: pre-processing proteomics data for DE analyses is in general an unsolved problem because of the complex pattern of NAs.

I will say though that inputing imputing data to limma is never correct because it overstates the amount of data that you actually have and will over-state the significance of any results you obtain. There is no need to impute data because limma already handles missing values automatically (assuming they are MAR).

Also, limma doesn't need the data to be variance-stabilized (although it should be on the log-scale). It would be simpler and more robust to model the mean-variance relationship using trend=TRUE in eBayes rather than using VSN. I can understand why would you might impute data or variance-stabilize if you were inputing data to a clustering algorithm, say, but limma doesn't need so much pre-processing.

As Aaron has already remarked, it seems strange that you've formed a topTable for all the contrasts at once. If the first contrast is the one of interest, why haven't you tested for that contrast by itself?