Dear all,
i am aiming to integrate limma and sva for RNAseq analysis, and i would appreciate please an advice on the following R code (it is below, is does it look correct ?), and your insights on 2 questions :
1 -- could SVASEQ work on logCPM values ? It works with raw counts, however I am receiving errors when using logCPM or CPM in svaseq function.
2 -- if SVA computes the surrogate variable based on RAW counts, and LIMMA (typically) uses logCPM, is it OK to do : fit <- lmFit(logCPM, modSv); where modSV has been estimated based on RAW counts ?
The R code that I am using is below :
library("edgeR")
library("sva")
pheno <- read.delim("pheno.data", row.names="Samples")
#> pheno
# sample subject experiment
#siCTL1 1 1 siCTL
#siCTL2 2 2 siCTL
#siCTL3 3 3 siCTL
#siX1 4 1 siX
#siX2 5 2 siX
#siX3 6 3 siX
eset <- as.matrix(read.delim("eset.data", row.names="Gene"))
#> head(eset)
# siCTL1 siCTL2 siCTL3 siX1 siX2 siX3
#Xkr4 51 0 0 50 0 0
#Rp1 6 0 0 1 0 0
#Sox17 373 0 0 206 0 0
#Mrpl15 973 2158 1251 1079 1749 1293
#Lypla1 6419 4172 882 5016 3001 1398
#Tcea1 2681 1942 717 2213 1404 929
### keeping only non-zero values in eset
nonzero_rows = apply(eset, 1, function(x) !all(x==0) )
eset_nonzero <- eset[nonzero_rows,]
### setting up the FACTOR GROUP
group <- factor(pheno$experiment)
group <- relevel(group, ref="siCTL")
### setting up the FACTOR SUBJECT
subject <- factor(pheno$subject)
#subject <- factor(c(1,2,3,1,2,3))
## setting up the DESIGN MATRIX for EDGER/LIMMA
design <- model.matrix(~group+subject, data=pheno)
## estimating size factors on raw counts
## running svaseq on normalized counts
### slightly more filtering based on CPM values.
y <- DGEList(counts=eset_nonzero, group=group)
keep <- rowSums(cpm(y) > 0.5) >= 6
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
logCPM <- cpm(y,log=TRUE,prior.count=3)
plotMDS(logCPM)
### the code related to SVA ....
mod = model.matrix(~group+subject, data=pheno)
mod0 = model.matrix(~1, data=pheno)
n.sv = num.sv(eset_nonzero, mod, method="leek")
## n.sv = num.sv(logCPM, mod, method="leek")
## n.sv = num.sv(eset_nonzero, mod, method="be)
svobj = svaseq(eset_nonzero, mod, mod0, n.sv=n.sv)
plot(svobj$sv,pch=19,col="blue")
modSv = cbind(mod, svobj$sv)
# mod0Sv = cbind(mod0, svobj$sv)
# pValuesSv = f.pvalue(eset_nonzero,modSv,mod0Sv)
# qValuesSv = p.adjust(pValuesSv,method="fdr")
## instead of :
## fit <- lmFit(logCPM, design)
fit <- lmFit(logCPM, modSv)
fit <- eBayes(fit, trend=TRUE, robust=TRUE)
results <- topTable(fit, coef=2, adjust="fdr", number=Inf)
write.table(results, file="zz.be.results.edgeR.txt",
sep="\t", eol="\n", row.names=TRUE, col.names=TRUE)
Dear Aaron, thank you again for your comments ! i will modify the R code, and will re-post the questions. Thanks a lot !