Question

Remove batch effect in small RNASeq study (SVA or others?)

1

Entering edit mode

shirley zhang ★ 1.0k

@shirley-zhang-2038

Last seen 10.2 years ago

I have a RNASeq paired-end data from two batches (8 samples from batch1, and 7 samples from batch2). After alignment using TopHat2, then I got count using HTseq-count, and FPKM value via Cufflinks. A big batch effect can be viewed in PCA using both log10(raw count) and log10(FPKM) value.

I can NOT use the block factor in edgeR to remove batch effect since I need to first obtain residuals after adjusting for batch effect, then test the residuals for hundreds of thousands of SNPs (eQTL analysis).

My question is how to remove batch effect without using edgeR:

1. is SVA ok for such a small sample size (N=15)?
2. If SVA does not work, any other suggestions?

Many thanks,
Shirley

RNASeq Alignment edgeR sva • 8.1k views

ADD COMMENT • link updated 9.8 years ago by Gordon Smyth 51k • written 10.6 years ago by shirley zhang ★ 1.0k

Gordon Smyth · Answer 1 · 2014-04-28

4

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

Please give the code sequence you used to remove the batch effect and to make the PCA plot.

You should normalize the data before using removeBatchEffect(), and quantile is one possibility.

Gordon

ADD COMMENT • link 10.6 years ago • updated 9.8 years ago Gordon Smyth 51k

0

Entering edit mode

Dear Dr. Smyth,

Thanks for your reply. Here is the code sequence I used to remove the batch effect and to make the PCA plot.

First, getting raw counts for each feature per sample using htseq-count (~64K features by using Ensemble.gtf). Then, getting a count data.matrix by combing all samples together (64K
rows, and 14 columns). 8 samples from batch1, and 6 samples from batch2.

>count = cbind(count.s1, count.s2, ...., count.s14)

#remove the batch effect
library(edgeR)
batch = as.factor(c(rep("b1", 8), rep("b2", 6)))

logCPM <- cpm(count,log=TRUE,prior.count=5)
logCPM <- removeBatchEffect(logCPM, batch=batch)

#PCA
B.res = prcomp(logCPM, scale = TRUE)
pc.s = summary(.Bres)$importance[2,1:2]
pc1.var = round(pc.s[["PC1"]],2)
pc2.var = round(pc.s[["PC2"]],2)

pdf(file = "all.count.logCPM.rmBatch.pca")
plot(B.res$rotation[,1:2], main = maintitle,  xlab = paste("PC1(variance:",
pc1.var*100, "%)", sep =""), ylab = paste("PC2 (variance:",
pc2.var*100,"%)", sep =""))
dev.off()

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] edgeR_3.4.0  limma_3.18.7

Many thanks.
Shirley

ADD REPLY • link updated 9.8 years ago by Gordon Smyth 51k • written 10.6 years ago by shirley zhang ★ 1.0k

0

Entering edit mode

Dear Shirley,

I don't think that scaling in prcomp() is appropriate here.

Better would be:

  library(edgeR)
  dge <- DGEList(counts=count)
  dge <- calcNormFactors(dge)
  logCPM <- cpm(dge,log=TRUE,prior.count=5)
  plotMDS(logCPM)
  logCPM <- removeBatchEffect(logCPM,batch=batch)
  plotMDS(logCPM)

Best wishes
Gordon

ADD REPLY • link 10.6 years ago • updated 9.8 years ago Gordon Smyth 51k

0

Entering edit mode

To add colours to the MDS plot:

  plotMDS(logCPM, col=as.numeric(batch))

Gordon

ADD REPLY • link 10.6 years ago • updated 9.8 years ago Gordon Smyth 51k

0

Entering edit mode

Dear Dr. Smyth,

Thank you very much for taking your time to look through my codes, and also provided an more approciate code sequence. Thank you!

By following your codes, I think the batch effect was removed efficiently as shown in the attached figures. Do you agree?

Furthermore, I found a very useful paper you published in Nature Protocol 2013.
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html

In the paper, you wrote:

d = DGEList(counts = count, group = samples$condition)
d = calcNormFactors(d)
nc = cpm(d, normalized.lib.sizes = TRUE)

My question is:
In my case, should I add option "normalized.lib.sizes = TRUE" in cpm() after calling calcNormFactors() like the following:

dge <- DGEList(counts=count)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge,log=TRUE,prior.count=5, normalized.lib.sizes = TRUE)

Many thanks for your help,
Shirley

-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.plotMDS.pdf
Type: application/pdf
Size: 4598 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.rmBatch.plotMDS.pdf
Type: application/pdf
Size: 4584 bytes
Desc: not available

ADD REPLY • link updated 9.8 years ago by Gordon Smyth 51k • written 10.6 years ago by shirley zhang ★ 1.0k

0

Entering edit mode

Hi Shirley,

I beleive that "normalized.lib.sizes = TRUE" has become the default when calling the cpm function on a DGEList, so you should not need to specify it.

-Ryan

ADD REPLY • link updated 9.8 years ago by Gordon Smyth 51k • written 10.6 years ago by Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Dear Ryan,

I think you are right. As shown by ?cpm

## S3 method for class 'DGEList'
     cpm(x, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25,
...)

BTW, do you have experience about how to set "prior.count", e.g. 0.25 vs. 5?

Many thanks,
Shirley

ADD REPLY • link updated 9.8 years ago by Gordon Smyth 51k • written 10.6 years ago by shirley zhang ★ 1.0k

0

Entering edit mode

Dear Shirley,

I have only used fold changes based on prior counts for descriptive purposes. My understanding is that edgeR uses the counts as is to do the statistics (i.e. p-values), but adds in the prior count when computing the logFC/logCPM values to avoid taking the log of zero and to avoid excessively large fold changes in low-abundance genes. However, I do not have a sense of what values are appropriate and I usually stick with the default.

-Ryan

ADD REPLY • link updated 9.8 years ago by Gordon Smyth 51k • written 10.6 years ago by Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Shirley,

I specifically recommended to you a particular choice for prior.count that I thought would be appropriate for your purpose.

Gordon

ADD REPLY • link 10.6 years ago • updated 9.8 years ago Gordon Smyth 51k

0

Entering edit mode

Thank you, Dr. Smyth. I really appreciate your constructive and thoughtful reply.

Ryan, thank you too for your comments.

Best,
Shirley

ADD REPLY • link updated 9.8 years ago by Gordon Smyth 51k • written 10.6 years ago by shirley zhang ★ 1.0k

Gordon Smyth · Answer 2 · 2014-04-28

0

Entering edit mode

Gordon Smyth 51k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

Dear Shirley,

I would probably do it like this:

   library(edgeR)
   logCPM <- cpm(y,log=TRUE,prior.count=5)
   logCPM <- removeBatchEffect(logCPM, batch=batch)

Best wishes
Gordon

ADD COMMENT • link 10.6 years ago • updated 9.8 years ago Gordon Smyth 51k

0

Entering edit mode

Dear Dr. Smyth,

Thank you very much for your quick reply. I did as you suggested by first getting log CPM value, then call removeBatchEffect(). I found the PCA figure looks better than before, but there is still a batch effect.

I attached two PCA figures. One is based on log10(raw count) which is before calling cpm() and removeBatchEffect(). Another one is after.

Could you look at them and give me more suggestions. Will a quantile normalization across samples be a good idea since CPM() is still a normalization only within each sample??

Thanks again for your help,
Shirley

-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.log10.pca.pdf
Type: application/pdf
Size: 5034 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.rmBatch.pca.pdf
Type: application/pdf
Size: 5120 bytes
Desc: not available

ADD REPLY • link updated 9.8 years ago by Gordon Smyth 51k • written 10.6 years ago by shirley zhang ★ 1.0k