Hello,
Experiment background: 5 genotypes (Factor A) 32 subjects 3 sampling times (Factor B) 9 plant samples per subject We are interested in comparing the changes in gene expression between genotypes after accounting for expression before infestation. We ran a variance partition analysis and found that RIN has a median of 5%, so we are considering to include RIN scores as a covariate in our analysis. Previous posts mentioned that using a surrogate variable for RIN scores works better in RNAseq analysis instead of using the original RIN scores as a covariate. I have some questions on how to set up SVA to get the surrogate variable.
Do we set up the null model with our without RIN score?
mod = model.matrix(~0 + genotype + time.dpi + genotype*time.dpi + RIN_score, data= samples) mod0 = model.matrix(~1, data= samples) or mod0 = model.matrix(~1 + RIN_score, data= samples)
Once we get the sv, do we include it in the model as modsv = model.matrix( ~ 0 + genotype + time.dpi + genotype*time.dpi + SV1)?
Below is the code we have run so far. We would appreciate any feedback on it. We noticed that the cor$consensus value dropped from 0.1 to 0.07 when including SV1 instead of RIN scores. Is this expected? Will it be better to not include RIN score in the model?
y <- DGEList(counts=x.2, genes=annotation)
design <- model.matrix(~ 0 + genotype + time.dpi + genotype*time.dpi + RIN_score)
keep <- filterByExpr(y, design)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y, method="TMM")
y <- estimateDisp(y, design, robust=TRUE)
v <- voom(y, design)
cor<-duplicateCorrelation(v, design, block = subject)
cor$consensus
v <- voom(y, desig, block = subject, correlation = cor$consensus)
cor<-duplicateCorrelation(v, design, block = subject)
cor$consensus
# variance partition analysis step
form <- ~ (1| genotype) + (1| time.dpi) + (1| genotype:time.dpi) + (1 | subject) + RIN_score
vp <- fitExtractVarPartModel(v, form, samples)
plotVarPart(sortCols(vp))
# sva to get an sv for RIN score
cpm.tab <- cpm(y, normalized.lib.sizes=TRUE)
mod = model.matrix(~0 + genotype + time.dpi + genotype*time.dpi + RIN_score, data= samples)
mod0 = model.matrix(~1, data= samples)
nsv = num.sv(cpm.tab, mod, method="be")
sva1 = svaseq(cpm.tab, mod, mod0, n.sv= nsv)
colnames(sva1$sv) <- "SVA1"
samples.sva = cbind(samples, sva1$sv)
y <- DGEList(counts=x.2, genes = annotation)
modsv <- model.matrix(~0 + genotype + time.dpi + genotype*time.dpi + samples.sva$SVA1)
keep <- filterByExpr(y, modsv)
y <- y[keep,,keep.lib.sizes=FALSE]
table(keep)
y <- calcNormFactors(y, method="TMM")
y <- estimateDisp(y, modsv, robust=TRUE)
# possible way to analyze gene expression data
v <- voom(y, modsv)
cor<-duplicateCorrelation(v, modsv, block = subject)
cor$consensus
v <- voom(y, modsv, block = subject, correlation = cor$consensus)
cor<-duplicateCorrelation(v, modsv, block = subject)
cor$consensus
fit<-lmFit(v, modsv, block= subject, correlation=cor$consensus)
fit<-eBayes(fit)
summary(decideTests(fit))
# include your problematic code here with any corresponding output
# please also include the results of running the following in an R session
sessionInfo( )
Agreed. I'll add that with small sample size, a covariate can explain some fraction of expression variation even under the null.