Hello,
I need some advice. I'm looking at PCA plots of RNA-seq data and am trying to understand whether my data has batch effects. I performed alignment using STAR, and then obtained gene counts and gene TPM values.
I used ```prcomp``` to find the principal components and plotted PC1 vs PC2 and PC2 vs PC3 for:
- (a) Raw counts
- (b) Log2(counts + 1)
- (c) TPM values
- (d) Log2(TPM + 1)
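For reference, variant (b) can be sketched roughly as below. This is a minimal illustration on simulated data, not the actual analysis: the `counts` matrix, its dimensions, and the variable names are all made up for the example. The one detail worth checking in a real run is orientation, since `prcomp` treats rows as observations and a genes x samples matrix therefore needs transposing.

```r
# Simulated stand-in for a genes x samples count matrix (hypothetical data).
set.seed(1)
n_genes <- 200
n_samples <- 8
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = 100, size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)

# Log-transform so a handful of highly expressed genes do not dominate.
log_counts <- log2(counts + 1)

# Drop genes with zero variance; they carry no information for PCA.
log_counts <- log_counts[apply(log_counts, 1, var) > 0, , drop = FALSE]

# prcomp() treats rows as observations, so transpose to samples x genes.
pca <- prcomp(t(log_counts), center = TRUE, scale. = FALSE)

# Percentage of total variance explained by each component.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Sample coordinates on the first three components, ready for plotting.
scores <- pca$x[, 1:3]
```

The pairs plotted above correspond to `plot(scores[, 1], scores[, 2])` and `plot(scores[, 2], scores[, 3])`, and `pct_var` gives the percentages usually shown on the axis labels.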
I am showing the PCA plots below (these are links to images on Google Drive).
It seems that there could be a batch effect, but I'm not 100% sure, since I'm doing this for the first time.
- Can anyone advise on whether this is really a batch effect?
- If there is a batch effect, could it be mitigated using ComBat or SVA, or by adjusting for batch in the linear model?
Please advise.
Thank you!
K
This may be a bit late, but if you are still tackling the problem you could try guided PCA (a link to the vignette: https://cran.r-project.org/web/packages/gPCA/vignettes/gPCA.pdf) to identify batch effects. However, you must know which batch each sample comes from for it to work.
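To give a feel for the idea, here is a rough base-R sketch of a guided-PCA-style statistic with a permutation test. It does not use the gPCA package and may differ from it in scaling and filtering details. The function name `gpca_delta`, the simulated matrix `x`, and the batch labels are all hypothetical. As I understand the method, the statistic compares the variance captured along the batch-guided first component with the variance captured by the ordinary first PC, so values near 1 suggest that batch drives the main axis of variation.

```r
# Guided-PCA-style delta statistic in base R (illustrative only).
# x: samples x features matrix; batch: one label per sample.
gpca_delta <- function(x, batch) {
  x <- scale(x, center = TRUE, scale = FALSE)     # center each feature
  y <- model.matrix(~ 0 + factor(batch))          # samples x batches indicators
  v_unguided <- svd(x)$v[, 1]                     # ordinary first PC loading
  v_guided <- svd(t(y) %*% x)$v[, 1]              # batch-guided first loading
  var(as.vector(x %*% v_guided)) / var(as.vector(x %*% v_unguided))
}

# Simulated data with a deliberate batch shift (hypothetical).
set.seed(42)
n_samples <- 12
n_features <- 300
batch <- factor(rep(c("A", "B"), each = 6))
x <- matrix(rnorm(n_samples * n_features), nrow = n_samples)
x[batch == "B", 1:100] <- x[batch == "B", 1:100] + 2

# Observed statistic, and its null distribution under shuffled batch labels.
delta_obs <- gpca_delta(x, batch)
delta_perm <- replicate(200, gpca_delta(x, sample(batch)))
p_value <- mean(delta_perm >= delta_obs)
```

A small `p_value` here only says that the known batch labels line up with the main variance direction more than shuffled labels do; it does not tell you whether batch is confounded with the biological groups.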
Late response, but thank you so much. The guided PCA package looks very interesting! I will check it out.
I ended up doing a strong filtering of features with low gene counts before running edgeR, and this helped avoid the issue.
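For anyone reading later, a low-count filter of this kind can be sketched in base R as below. The thresholds (`min_count = 10`, `min_samples = 3`), the function name `filter_low_counts`, and the simulated matrix are assumptions for illustration, not the values used in this analysis. The rule is loosely modelled on the counts-per-million filtering commonly done before edgeR: keep a gene only if its CPM reaches a cutoff in at least a minimum number of samples.

```r
# Keep genes whose CPM reaches a cutoff in at least `min_samples` samples.
# The cutoff is the CPM equivalent of `min_count` reads in a median-sized library.
filter_low_counts <- function(counts, min_count = 10, min_samples = 3) {
  lib_sizes <- colSums(counts)
  cpm <- t(t(counts) / lib_sizes) * 1e6           # counts per million, per sample
  cpm_cutoff <- min_count / median(lib_sizes) * 1e6
  keep <- rowSums(cpm >= cpm_cutoff) >= min_samples
  list(counts = counts[keep, , drop = FALSE], keep = keep)
}

# Simulated genes x samples matrix: 500 low-expressed and 500 well-expressed genes.
set.seed(7)
counts <- matrix(
  rnbinom(1000 * 6, mu = rep(c(2, 200), each = 500), size = 2),
  nrow = 1000
)

filtered <- filter_low_counts(counts, min_count = 10, min_samples = 3)
n_kept <- nrow(filtered$counts)
```

`min_samples` is usually set to the size of the smallest experimental group, so that a gene expressed in only one condition is not thrown away.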
Batch effects arise from technical differences, for example when samples are sequenced separately. If that is your case, you can correct for the effect with the ComBat function from the sva package. If the samples were sequenced together, you may want to check for possible differences between lanes. Also, verify the earlier steps of your pipeline: check mapping and coverage, and discard multi-mapped reads. There are many details to take into account; check the Bioconductor vignettes for the edgeR package.
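Besides ComBat, the question also raises adjusting for batch in the linear model. A minimal base-R sketch of that option follows. The sample sheet, the batch names, and the condition labels are hypothetical. The point is that batch enters the design matrix as a covariate, and that this only works when batch and condition are not completely confounded.

```r
# Hypothetical sample sheet: two sequencing runs, conditions balanced across runs.
samples <- data.frame(
  sample = paste0("s", 1:8),
  batch = factor(rep(c("run1", "run2"), each = 4)),
  condition = factor(rep(c("control", "treated"), times = 4),
                     levels = c("control", "treated"))
)

# Additive design: the condition effect is estimated after accounting for batch.
design <- model.matrix(~ batch + condition, data = samples)
design_rank <- qr(design)$rank                    # equals ncol(design) if estimable

# Confounded layout: every run1 sample is control, every run2 sample is treated.
confounded <- samples
confounded$condition <- factor(
  ifelse(confounded$batch == "run1", "control", "treated"),
  levels = c("control", "treated")
)
design_bad <- model.matrix(~ batch + condition, data = confounded)
bad_rank <- qr(design_bad)$rank                   # smaller than ncol(design_bad)

# With edgeR, a full-rank `design` like the first one would be passed to
# estimateDisp() and glmQLFit() in place of a condition-only design.
```

If the rank is lower than the number of columns, as in the confounded layout, no model or correction tool can separate the batch effect from the biological effect.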
Thank you for the feedback. I did in fact use edgeR for differential expression analysis of this data, and we got an unusually large number of differentially expressed genes.
That is why I'm going back to the data and the PCA plots, to see if there is a batch effect. I need help interpreting the PCA plots, to understand whether the separation I see is large enough to indicate a batch effect.