I have a RNASeq paired-end data from two batches (8 samples from batch1, and 7 samples from batch2). After alignment using TopHat2, then I got count using HTseq-count, and FPKM value via Cufflinks. A big batch effect can be viewed in PCA using both log10(raw count) and log10(FPKM) value.
I can NOT use the block factor in edgeR to remove batch effect since I need to first obtain residuals after adjusting for batch effect, then test the residuals for hundreds of thousands of SNPs (eQTL analysis).
My question is how to remove batch effect without using edgeR:
1. is SVA ok for such a small sample size (N=15)?
2. If SVA does not work, any other suggestions?
Many thanks,
Shirley
Dear Dr. Smyth,
Thanks for your reply. Here is the code sequence I used to remove the batch effect and to make the PCA plot.
First, getting raw counts for each feature per sample using htseq-count (~64K features by using Ensemble.gtf). Then, getting a count data.matrix by combing all samples together (64K
rows, and 14 columns). 8 samples from batch1, and 6 samples from batch2.
Many thanks.
Shirley
Dear Shirley,
I don't think that scaling in prcomp() is appropriate here.
Better would be:
Best wishes
Gordon
To add colours to the MDS plot:
Gordon
Dear Dr. Smyth,
Thank you very much for taking your time to look through my codes, and also provided an more approciate code sequence. Thank you!
By following your codes, I think the batch effect was removed efficiently as shown in the attached figures. Do you agree?
Furthermore, I found a very useful paper you published in Nature Protocol 2013.
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
In the paper, you wrote:
My question is:
In my case, should I add option "normalized.lib.sizes = TRUE" in cpm() after calling calcNormFactors() like the following:
Many thanks for your help,
Shirley
-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.plotMDS.pdf
Type: application/pdf
Size: 4598 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: all.count.logCPM.rmBatch.plotMDS.pdf
Type: application/pdf
Size: 4584 bytes
Desc: not available
Hi Shirley,
I beleive that "normalized.lib.sizes = TRUE" has become the default when calling the cpm function on a DGEList, so you should not need to specify it.
-Ryan
Dear Ryan,
I think you are right. As shown by ?cpm
BTW, do you have experience about how to set "prior.count", e.g. 0.25 vs. 5?
Many thanks,
Shirley
Dear Shirley,
I have only used fold changes based on prior counts for descriptive purposes. My understanding is that edgeR uses the counts as is to do the statistics (i.e. p-values), but adds in the prior count when computing the logFC/logCPM values to avoid taking the log of zero and to avoid excessively large fold changes in low-abundance genes. However, I do not have a sense of what values are appropriate and I usually stick with the default.
-Ryan
Shirley,
I specifically recommended to you a particular choice for prior.count that I thought would be appropriate for your purpose.
Gordon
Thank you, Dr. Smyth. I really appreciate your constructive and thoughtful reply.
Ryan, thank you too for your comments.
Best,
Shirley