Question

DESeq normalization seems to reverse the directionality of log2FoldChange - is this normal?

0

Entering edit mode

stwestreich • 0

@stwestreich-10664

Last seen 7.3 years ago

Hello all,

I'm a graduate student working on analyzing RNA-Seq data from a microbiome - many different species all found in the same environment. I used some annotation software and Python to get a list of counts for hits to each organism, within each sample. Imported these counts into R, created a combined data table, and then used DESeq2 for comparing the 12 wild-type (WT) samples with my 12 samples from diseased individuals.

When comparing between DESeq's normalized counts, AKA counts(dds, normalized=TRUE), versus the raw counts per million transcripts (CPM) for these organisms, I noticed something interesting. For some of the higher-abundance organisms, the CPM is higher in my WT samples, if only slightly. However, in DESeq2's normalized data, counts are higher in my diseased samples.

In other words, for some of the organism counts, they are being shifted down in my WT samples and up in my experimental (diseased) samples, to the point where the final calculated log2FoldChange shows these organisms as more active in my experimental samples - even though the CPM is higher in my WT samples.

I know that DESeq fits to a negative binomial distribution, but is that strong enough to lead to a reversed direction of the log2FoldChange for these entries?

Code I'm running:

> completeCondition <- data.frame(condition=factor(c(rep("control", length(control_files)), rep("experimental", length(exp_files)))))
> colnames(complete_table) <- completeCondition
> dds <- DESeqDataSetFromMatrix(complete_table, completeCondition, ~ condition)
> dds <- DESeq(dds)

> normalized_data <- counts(dds, normalized=TRUE)

Any help is greatly appreciated!

deseq2 rnaseq rna-seq normalization cpm • 2.9k views

ADD COMMENT • link updated 8.8 years ago by Wolfgang Huber ★ 13k • written 8.8 years ago by stwestreich • 0

score 1 · Answer 1 · 2016-05-09

Of course it's normal for the normalization to change the relative levels between samples. This is the purpose of the normalization. In general, the normalization methods used for RNA-seq attempt to normalize such that the log2 fold change of the average or median gene (or in this case, the average organism) is zero, under the assumption that a large fraction of genes (organisms) are not changing. The assumptions of these normalization methods were chosen based on the typical features of RNA-seq gene counts, not metagenomics. If you're not sure whether the assumptions made by the normalization are appropriate, I suggest you consult the relevant paper to understand more about the normalization, so you can make an informed decision about whether it is the right choice. I will say, though, that un-normalized counts or counts per million are almost certainly inadequate.

Lastly, the use of the negative binomial distribution is irrelevant here, since computing the normalized counts does not, as far as I know, involve any kind of model fitting; it's just a direct computation from the counts.

score 0 · Answer 2 · 2016-05-10

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 19 days ago

EMBL European Molecular Biology Laborat…

Did you try scatterplots of the counts, for the different conditions, and highlight the species in question.

One more comment and a question:

- DESeq2 works off the actual counts, not CPM. The latter will lead to nonsensical results. Please see the vignette.

- You said these were RNA-seq data (in which the objects being counted are different genes / (m)RNAs), but then refer to the objects being counted as organisms. Which do you mean?

ADD COMMENT • link 8.8 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Hello Wolfgang,

Thank you for the suggestion! I haven't yet tried looking at a scatterplot with the species in question selected, but I'll attempt that to see how much it shifts in position.

Yes, I am aware that DESeq2 works off of actual counts. I am comparing the FPM matrix results with the normalized data (counts(dds)). I am not using the CPM values as input.

This is RNA-seq data, yes, but I'm pooling all gene hits to the same organism to look at changes in organism abundance.

ADD REPLY • link 8.8 years ago stwestreich • 0