Question

(batch) corrections of RNA-seq data: integrating LIMMA and SVA

0

Entering edit mode

Bogdan ▴ 670

@bogdan-2367

Last seen 18 months ago

Palo Alto, CA, USA

Dear all,

i am aiming to integrate limma and sva for RNAseq analysis, and i would appreciate please an advice on the following R code (it is below, is does it look correct ?), and your insights on 2 questions :

1 -- could SVASEQ work on logCPM values ? It works with raw counts, however I am receiving errors when using logCPM or CPM in svaseq function.

2 -- if SVA computes the surrogate variable based on RAW counts, and LIMMA (typically) uses logCPM, is it OK to do : fit <- lmFit(logCPM, modSv); where modSV has been estimated based on RAW counts ?

The R code that I am using is below :

library("edgeR") library("sva")

pheno <- read.delim("pheno.data", row.names="Samples")

#> pheno # sample subject experiment #siCTL1 1 1 siCTL #siCTL2 2 2 siCTL #siCTL3 3 3 siCTL #siX1 4 1 siX #siX2 5 2 siX #siX3 6 3 siX

eset <- as.matrix(read.delim("eset.data", row.names="Gene"))

#> head(eset) # siCTL1 siCTL2 siCTL3 siX1 siX2 siX3 #Xkr4 51 0 0 50 0 0 #Rp1 6 0 0 1 0 0 #Sox17 373 0 0 206 0 0 #Mrpl15 973 2158 1251 1079 1749 1293 #Lypla1 6419 4172 882 5016 3001 1398 #Tcea1 2681 1942 717 2213 1404 929

### keeping only non-zero values in eset nonzero_rows = apply(eset, 1, function(x) !all(x==0) ) eset_nonzero <- eset[nonzero_rows,]

### setting up the FACTOR GROUP group <- factor(pheno$experiment) group <- relevel(group, ref="siCTL")

### setting up the FACTOR SUBJECT subject <- factor(pheno$subject) #subject <- factor(c(1,2,3,1,2,3))

## setting up the DESIGN MATRIX for EDGER/LIMMA design <- model.matrix(~group+subject, data=pheno)

## estimating size factors on raw counts ## running svaseq on normalized counts

### slightly more filtering based on CPM values.

y <- DGEList(counts=eset_nonzero, group=group)

keep <- rowSums(cpm(y) > 0.5) >= 6 y <- y[keep,,keep.lib.sizes=FALSE]

y <- calcNormFactors(y) logCPM <- cpm(y,log=TRUE,prior.count=3) plotMDS(logCPM)

### the code related to SVA ....

mod = model.matrix(~group+subject, data=pheno) mod0 = model.matrix(~1, data=pheno)

n.sv = num.sv(eset_nonzero, mod, method="leek") ## n.sv = num.sv(logCPM, mod, method="leek") ## n.sv = num.sv(eset_nonzero, mod, method="be)

svobj = svaseq(eset_nonzero, mod, mod0, n.sv=n.sv) plot(svobj$sv,pch=19,col="blue")

modSv = cbind(mod, svobj$sv)

# mod0Sv = cbind(mod0, svobj$sv) # pValuesSv = f.pvalue(eset_nonzero,modSv,mod0Sv) # qValuesSv = p.adjust(pValuesSv,method="fdr")

## instead of : ## fit <- lmFit(logCPM, design)

fit <- lmFit(logCPM, modSv) fit <- eBayes(fit, trend=TRUE, robust=TRUE)

results <- topTable(fit, coef=2, adjust="fdr", number=Inf) write.table(results, file="zz.be.results.edgeR.txt", sep="\t", eol="\n", row.names=TRUE, col.names=TRUE)

limma sva svaseq • 2.8k views

ADD COMMENT • link updated 8.2 years ago by Aaron Lun ★ 28k • written 8.2 years ago by Bogdan ▴ 670

score 3 · Accepted Answer · 2017-02-25

1) You don't mention the type of error you're getting, but if you look at the documentation for svaseq, it says that the supplied data matrix dat will be log-transformed after adding constant. Clearly, giving the function log-values to be log-transformed again would be silly.

2) Well, it's probably okay because svaseq operates internally on log-counts. As long as the surrogate variables are correctly identified, it shouldn't matter how you get there. I haven't used sva but I suspect you may need to divide each count by the (effective) library size before running the function, otherwise the first surrogate variable would just represent the differences in library size.

There are, however, other issues with your code. The first is why you are using limma without voom to analyse RNA-seq data. There had better be good reasons for using the limma-trend approach as it does not perform as well when your library sizes are very different. The second point is that you should be ensuring that pheno$subject is a factor when constructing the design matrix. In your code, only subject is guaranteed to be a factor, not pheno$subject. Thirdly, your second filtering step requires all 6 libraries to have CPMs above 0.5. This will remove strongly DE genes that are silent in one condition.