Question

ComBat normalization for batch effect change 0 values

1

Entering edit mode

hackingfactory ▴ 10

@hackingfactory-20629

Last seen 5.6 years ago

Hi all, I have a matrix (dds) of DESeq2 normalized read counts of the expression of coding/non-coding RNA. I then tried to remove a known batch effect due to strandedness using ComBat After modeling the matrix

modcombat <- model.matrix(~1, data = colData)

where colData is a df with samples in rows and condition and batch in 2 different columns, I run combat:

batch_corrected<- ComBat(dat=log2(counts(dds, normalized=T)+1), batch = colData$batch, mod = modcombat, par.prior = TRUE, prior.plots = F)

However, looking at the expression of some element (the matrix contains both expression of coding and noncoding genes as well as of those of some Alu elements) that was 0 in dds normalized counts, after applying ComBat it has raised to 0.0722117750339454 How could be that a 0 value (no expression) has been changed to a higher value? How to avoid this behaviour? using 'mean.only = T' makes things even worse and a much higher number of sample that had 0 value for specific elements now have a higher one Thank you

ComBat batch effect sva normalization • 3.9k views

ADD COMMENT • link updated 5.6 years ago by James W. MacDonald 67k • written 5.6 years ago by hackingfactory ▴ 10

score 1 · Answer 1 · 2019-04-26

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 1 day ago

United States

In your code, you have

batch_corrected <- ComBat(dat = log2(counts(dds, normalized = TRUE) +1), ...

Where the critical part to consider is

counts(dds, normalized = TRUE) + 1

If you have a count value of zero, and you add 1 to it, what is the value now?

You might also consider using svaseq instead, which is intended for RNA-Seq data. In general, if you are using things like DESeq2, you don't adjust the counts directly, but instead use offsets in the model, so you are probably better off generating surrogate variables that you can add to the model rather than trying to adjust your counts.

ADD COMMENT • link 5.6 years ago James W. MacDonald 67k

0

Entering edit mode

James, if I have a value of 0 and I add 1 THEN i apply log2, I still have zero and since dat = log2(counts(dds, normalized=T)) my question still remains. Moreover I used DESeq2 just to normalize by sequencing depth and then ComBat, on log2 normalized read counts, to remove batch effect because I need normalized read counts (both for sequencing depth and batch effect) for downstream anaylses different from differential gene expression (such as WGCNA for example) Thanx

ADD REPLY • link 5.6 years ago hackingfactory ▴ 10

1

Entering edit mode

Oh yeah good point.

Anyway, you are making the assumption that ComBat won't have any effect on zero counts, but why should that be true? If one batch is ~0.07 counts smaller, on average, than another batch, to account for that difference you have to add 0.07 to all the counts in the smaller batch. Not just the non-zero counts.

ComBat doesn't just adjust the mean however, it adjusts the variance as well, which in the context of count data might not make any sense at all. I don't know if that's a recommended use case.

I also don't know what you mean by 'normalized counts', because in the context of RNA-Seq, that isn't really a thing. You can compute normalization factors that will be used as offsets, but that doesn't affect the counts. And if you are using something like vst or rlog from DESeq2, then you shouldn't be taking logs afterwards anyway. And if you are planning to use for something like WGCNA, then you probably want to use vst or rlog because I am not sure how robust to departures from normality (and heteroscedasticity) WGCNA is.

ADD REPLY • link 5.6 years ago James W. MacDonald 67k

0

Entering edit mode

I am not making any assumption, i just asked WHY ComBat change zero values (i.e. zero counts). If a gene/element in a sample has ZERO read count, WHY adjusting for batch effect should increase that number? That's what I'm asking. Is there an answer besides "why not" ? I also use ComBat with mean.only=T and have more sample where the 0 values are corected to higher ones. Moreover, as suggested by the author in this post (https://support.bioconductor.org/p/73266/) i set to 0 all the negative values. By 'normalized read counts' I mean counts normalized for the sequencing depth/library size and to me it is a thing otherwise how could you compare samples sequenced at different depths? The estimateSizeFactor performs this normalization. rlog and vst are alternative that I also used having the same result.

ADD REPLY • link 5.6 years ago hackingfactory ▴ 10

0

Entering edit mode

Yes you are making assumptions. You assume that a count of zero means something special, when it does not, and are expecting ComBat to adjust all the genes with counts one way and leave the genes with zero counts alone. That doesn't make any sense - ComBat is adjusting the distribution of the data, by shifting the mean and (probably) adjusting the variance. The zeros are part of the distribution, and get adjusted just like anything else. If the adjustment is to increase the counts, and some were zero to begin with, they end up being larger than zero, because that is what ComBat is intended to do.

You also seem to have missed the main point that Evan Johnson made, which was [sic]:

ComBat assumes that the data symmetric and bell-shaped.

Which does not describe count data, which are neither symmetric nor bell-shaped.

And to reiterate - when you 'normalize' RNA-Seq data this does not affect the counts themselves in any way it just computes size factors that are used as offsets in the linear model to account for differences in sequencing depth. So the counts aren't normalized, but instead normalization factord are computed from the counts that are then used when you make comparisons.

> dds <- makeExampleDESeqDataSet(1000,4)
> sizeFactors(dds)
NULL
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      27      12      34      19
gene2       4       4      20       8
gene3     146     270     241     154
gene4      13       4       4      15
gene5       5       7       3       9
gene6      32       8      18       5
> dds <- estimateSizeFactors(dds)
> sizeFactors(dds)
 sample1  sample2  sample3  sample4 
1.017329 1.040678 1.026172 1.036060 
> head(counts(dds))
      sample1 sample2 sample3 sample4
gene1      27      12      34      19
gene2       4       4      20       8
gene3     146     270     241     154
gene4      13       4       4      15
gene5       5       7       3       9
gene6      32       8      18       5

Do you see a difference in the 'normalized' counts? I don't.

ADD REPLY • link 5.6 years ago James W. MacDonald 67k

0

Entering edit mode

Thanks James, I think you described it well.

If anyone is interested, we are developing a ComBat-seq that is more appropriate for counts data. The software is still in beta version, and we are close to publishing. In the interest of open science, its available to anyone to try out—just no guarantees we won’t tweak it before we send to bioconductor.

The download site is: https://github.com/zhangyuqing/ComBat-seq. The process would be to use combat-seq on untransformed counts first, then apply DESeq2 normalization (or anything else you want to do with the counts).

On Apr 29, 2019, at 3:51 PM, James W. MacDonald [bioc] <noreply@bioconductor.org<a rel="nofollow" href="mailto:noreply@bioconductor.org">noreply@bioconductor.org> wrote:

Activity on a post you are following on support.bioconductor.orghttps://support.bioconductor.org/

User James W. MacDonaldhttps://support.bioconductor.org/u/5106/ wrote Comment: ComBat normalization for batch effect change 0 valueshttps://support.bioconductor.org/p/120408/#120496:

Yes you are making assumptions. You assume that a count of zero means something special, when it does not, and are expecting ComBat to adjust all the genes with counts one way and leave the genes with zero counts alone. That doesn't make any sense - ComBat is adjusting the distribution of the data, by shifting the mean and (probably) adjusting the variance. The zeros are part of the distribution, and get adjusted just like anything else. If the adjustment is to increase the counts, and some were zero to begin with, they end up being larger than zero, because that is what ComBat is intended to do.

You also seem to have missed the main point that Evan Johnson made, which was [sic]:

ComBat assumes that the data symmetric and bell-shaped.

Which does not describe count data, which are neither symmetric nor bell-shaped.

And to reiterate - when you 'normalize' RNA-Seq data this does not affect the counts themselves in any way it just computes size factors that are used as offsets in the linear model to account for differences in sequencing depth. So the counts aren't normalized, but instead normalization factord are computed from the counts that are then used when you make comparisons. NULL sample1 sample2 sample3 sample4 gene1 27 12 34 19 gene2 4 4 20 8 gene3 146 270 241 154 gene4 13 4 4 15 gene5 5 7 3 9 gene6 32 8 18 5 sample1 sample2 sample3 sample4 1.017329 1.040678 1.026172 1.036060 sample1 sample2 sample3 sample4 gene1 27 12 34 19 gene2 4 4 20 8 gene3 146 270 241 154 gene4 13 4 4 15 gene5 5 7 3 9 gene6 32 8 18 5

Do you see a difference in the 'normalized' counts? I don't.

ADD REPLY • link 5.6 years ago W. Evan Johnson ▴ 870

0

Entering edit mode

I'll give ComBat-seq a try Thank you

ADD REPLY • link 5.6 years ago hackingfactory ▴ 10

0

Entering edit mode

I'll give ComBa-seq a try Thank you

ADD REPLY • link 5.6 years ago hackingfactory ▴ 10

0

Entering edit mode

To me a value of 0 means not expressed and that's why I asked why it's been changed by ComBat. I understand that shifting the distribution mean would affect all the counts but I was wondering, from a biological perspective, if makes sense to change a gene "status" from unexpressed to "barely expressed" because it changes the biological meaning.

Regarding normalization you are missing a piece, just like with log2. After estimateSizeFactors give a try to:

head(counts(dds, normalized=T)) and see what happens...

My goal is to obtain a matrix of read counts not only normalized for sequencing depth but also adjusted for known batch effect and not having 0 values changed because I do ot want element not expressed to become expressed (I am also looking at Alu element not only genes). Is it possible? Thank you

ADD REPLY • link 5.6 years ago hackingfactory ▴ 10

0

Entering edit mode

This is why ComBat should never be used with RNA-seq. I am not sure why you chose to use it in the first place. I used it once, looked away in horror, and then never touched it again.

ADD REPLY • link 5.6 years ago Kevin Blighe ★ 4.0k

0

Entering edit mode

Since you mentioned WGCNA, I'll add my 2 cents. I use varianceStabilizingTransformation in DESeq2 on RNA-seq data. This normalizes and approximately log-transforms the data. If needed, I then apply ComBat or whatever other adjustment is appropriate, and I don't worry about preserving any zeros since, for VST data, zero is not a special value (in fact, you will typically have no zeroes or anything close to it after VST).

Note the order: first normalization/VST (or normalization and log transform), then ComBat. I would not run ComBat or any other linear model-based adjustment on the natural scale data (count or microarray expression) because they are heavily heteroschedastic.

ADD REPLY • link 5.6 years ago Peter Langfelder ★ 3.0k

0

Entering edit mode

Thank you Peter, Actually I am using ComBat because of what you suggested in the tutorial. I also used VST, but ComBat still change 0 read counts to different values. Moreover I would like to clear that I need normalized read counts (for sequencing depth) and adjusted for batch effect not only for WGCNA but also for other analyses, such a ONE-WAY ANOVA between several groups (conditions) each with hundreds of samples to test for any significant association of the expression profile to various clinical traits, and for this I would prefer not to use VST. Thank you

ADD REPLY • link 5.6 years ago hackingfactory ▴ 10