Question

RNA-Seq Batch correction negative values

1

Entering edit mode

firestar ▴ 20

@rmf-13755

Last seen 3 months ago

Sweden

In my bulk RNA-seq dataset, I noticed a few outliers on MDS plot and on further investigation found that these samples originated from one extraction batch (extbatch) 7.

Fig 1: MDS plot showing outliers and that they belong to extbatch 7.

Most of the samples from batch 7 and even batch 8 had lower RNA yield than other samples.

Fig 2: Mean and SD of RNA yield over extraction batches.

So now i am trying to correct this. I tried sva::ComBat() to correct for this extbatch and I ended up with negative values in my data. For comparison, I tried correction using scran::mnnCorrect. This also produced negative values.

Fig 3: Distribution of raw counts, counts after scran correction and combat correction. X-axis are individual samples. Y-axis are counts forced to limits -100K and 100K. Colours denote extraction batches. Red is the offending batch 7 with mostly bad samples. See appearance of negative values after batch correction.

https://i.imgur.com/wcIXtW3.jpg

Now, there is the argument that rather than correcting the batch, include it as a factor in my GLM model (~extbatch+family+condition). I did that and found 2 DEGs. If I remove samples from this batch 7 from my data altogether and run my model (~family+condition), I find many DEGs.

So, some of my questions are

1. Why do I get negative values and how do I fix that?
2. Is it better to discard these offending samples (I lose half my data) rather than fixing it using batch correction.
3. Why is the GLM with extbatch failing to produce any DEGs while removing the batch works?
4. If anyone has any better suggestions to handle this?

rna-seq batch-effect combat scran sva • 7.3k views

ADD COMMENT • link updated 7.5 years ago by Aaron Lun ★ 28k • written 7.5 years ago by firestar ▴ 20

score 1 · Answer 1 · 2017-09-27

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 35 minutes ago

The city by the bay

You get negative values because these methods, by and large, are designed to work on log-expression values that can go from -Inf to Inf. This is what they do, so there's nothing to "fix". If you want to account for the batch effect in a DE analysis, include batch as a factor in the design matrix.
According to your first MDS plot, samples in batch 7 are highly variable within the same batch. Batch correction won't help here, because your variability is occurring within the same batch. You need to figure out why this is the case before discarding those samples. For example, what are the DE genes between the batch 7 samples on the left and their counterparts on the right? Sex-related genes, or something?
See my answer to point 2. Including a batch effect won't solve your particular problem, because the variability is occurring within the batch. This will inflate the dispersions and reduce power. Removing the entire batch will avoid this effect - though obviously at the cost of losing samples from batch 7.
If you still can't figure out what's happening in batch 7, you could try using RUVSeq or sva to guess the factor of unwanted variation and put that in the model. However, there's no substitute for properly understanding what's happening with your data.

ADD COMMENT • link 7.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

Ahan. So batch effect correction works only if all the samples in that batch are affected and not partially? By fixing the negative values, I meant to say to make them all positive because counts cannot be negative. And negative counts cannot be used by DGE packages. Anywya, I will try sva. Thanks for the suggestions.

ADD REPLY • link 7.5 years ago firestar ▴ 20

0

Entering edit mode

You shouldn't be applying the batch correction methods on the raw counts. If you're getting negative "count" values, you're doing something wrong. There are some complicated ways to get "corrected count" values with quantile-quantile mapping, but the simplest approach is just to include batch in the design matrix.

ADD REPLY • link 7.5 years ago Aaron Lun ★ 28k

0

Entering edit mode

My thinking was that I would batch correct the raw counts to get corrected raw counts which would be used in DESeq2 along with the model (which would now exclude the corrected batch variable). What kind of counts do you think the correction should be applied on? It cannot be log-cpm, vst, rlog, voom etc, because they cannot be fed into DESeq2. The negative values are worrying. I wonder if I could add a pseudocount to push them all above zero.

ADD REPLY • link 7.5 years ago firestar ▴ 20

0

Entering edit mode

No, your proposed approach is not correct. As I have already said, batch correction methods are not designed to be used with the raw counts. If you try to do so... well, you'll get nonsensical negative counts, and no amount of pseudo-count addition can protect you against that (notwithstanding the other biases introduced by adding pseudo-counts in the first place). In any case, you should be including the batch factor directly in the model for DE analysis. Supplying "corrected" values will fail to account for the uncertainty of the correction parameters.

ADD REPLY • link 7.5 years ago Aaron Lun ★ 28k