In my bulk RNA-seq dataset, I noticed a few outliers on MDS plot and on further investigation found that these samples originated from one extraction batch (extbatch) 7.
Fig 1: MDS plot showing outliers and that they belong to extbatch 7.
Most of the samples from batch 7 and even batch 8 had lower RNA yield than other samples.
Fig 2: Mean and SD of RNA yield over extraction batches.
So now i am trying to correct this. I tried sva::ComBat()
to correct for this extbatch and I ended up with negative values in my data. For comparison, I tried correction using scran::mnnCorrect
. This also produced negative values.
Fig 3: Distribution of raw counts, counts after scran correction and combat correction. X-axis are individual samples. Y-axis are counts forced to limits -100K and 100K. Colours denote extraction batches. Red is the offending batch 7 with mostly bad samples. See appearance of negative values after batch correction.
https://i.imgur.com/wcIXtW3.jpg
Now, there is the argument that rather than correcting the batch, include it as a factor in my GLM model (~extbatch+family+condition
). I did that and found 2 DEGs. If I remove samples from this batch 7 from my data altogether and run my model (~family+condition
), I find many DEGs.
So, some of my questions are
1. Why do I get negative values and how do I fix that?
2. Is it better to discard these offending samples (I lose half my data) rather than fixing it using batch correction.
3. Why is the GLM with extbatch failing to produce any DEGs while removing the batch works?
4. If anyone has any better suggestions to handle this?
Ahan. So batch effect correction works only if all the samples in that batch are affected and not partially? By fixing the negative values, I meant to say to make them all positive because counts cannot be negative. And negative counts cannot be used by DGE packages. Anywya, I will try
sva
. Thanks for the suggestions.You shouldn't be applying the batch correction methods on the raw counts. If you're getting negative "count" values, you're doing something wrong. There are some complicated ways to get "corrected count" values with quantile-quantile mapping, but the simplest approach is just to include batch in the design matrix.
My thinking was that I would batch correct the raw counts to get corrected raw counts which would be used in DESeq2 along with the model (which would now exclude the corrected batch variable). What kind of counts do you think the correction should be applied on? It cannot be log-cpm, vst, rlog, voom etc, because they cannot be fed into DESeq2. The negative values are worrying. I wonder if I could add a pseudocount to push them all above zero.
No, your proposed approach is not correct. As I have already said, batch correction methods are not designed to be used with the raw counts. If you try to do so... well, you'll get nonsensical negative counts, and no amount of pseudo-count addition can protect you against that (notwithstanding the other biases introduced by adding pseudo-counts in the first place). In any case, you should be including the batch factor directly in the model for DE analysis. Supplying "corrected" values will fail to account for the uncertainty of the correction parameters.