Counter-directional LFC shrinkage
2
0
Entering edit mode
ennui • 0
@5a77cdfc
Last seen 6 months ago
Switzerland

Hi,

First a caveat: I am new to enrichment analyses, following the DESeq2 guide. The data is not mine and I have very little to say in terms of experimental design, I am just trying to improve on a "naive" analysis of producing a ration counts-per-million which exaggerates low-count reads.

I noticed something counter-intuitive in my data, namely that lfcShrink does the opposite of what it should: low-count reads are exaggerated and high-count reads are shrunk.

Two example genes below.

# counts, metadata defined elsewhere
> dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData = metadata,
    design = ~condition
  )
> dds$condition <- relevel(dds$condition, ref = "reference")

# many zeros in this dataset hence poscount
> dds <- DESeq2::DESeq(dds, sfType = "poscount")

> result_unshrunk <- results(dds, name = "condition_condition1_vs_reference")
> result_unshrunk[c('gene1', 'gene2'),]

log2 fold change (MLE): condition1 vs reference 
Wald test p-value: condition1 vs reference 
DataFrame with 2 rows and 6 columns
       baseMean log2FoldChange     lfcSE      stat     pvalue      padj
      <numeric>      <numeric> <numeric> <numeric>  <numeric> <numeric>
gene1   11.0575       -5.49035   2.08908  -2.62811 0.00858602  0.116205
gene2 7537.5202        8.97127   5.23205   1.71468 0.08640462  0.289067

> result_apeglm <- lfcShrink(dds, coef = "condition_condition1_vs_reference", type = "apeglm")
> result_apeglm[c('gene1', 'gene2'),]

log2 fold change (MAP): condition1 vs reference 
Wald test p-value: condition1 vs reference 
DataFrame with 2 rows and 5 columns
       baseMean log2FoldChange     lfcSE     pvalue      padj
      <numeric>      <numeric> <numeric>  <numeric> <numeric>
gene1   11.0575      10.457654        NA 0.00858602  0.116205
gene2 7537.5202       0.123898        NA 0.08640462  0.289067

Looking at the normalized counts below (from plotCounts). It seems obvious to me that gene2 should be the one with higher LFC. Both genes have the same counts in reference, but one has a clearly higher average. In the un-shrunk calculation gene2 is higher, but the relationship reverses under the shrunk model.

Gene 1 - `plotCounts` Gene 2 - `plotCounts`

Now, this is just a small snapshot of the data, which I cannot share, so I appreciate that it's not possible to some up with a definitive answer via a forum. Still, I would appreciate some guidance as to how to diagnose what is really happening?

Below are the distributions from plotMA.

Basic LFC calculation apeglm-shrunk LFCs

Does this give a hint? Perhaps it's the non-symmetric LFC distribution that's the problem? Thanks a lot for any help.

DESeq2 • 756 views
ADD COMMENT
0
Entering edit mode

Is this RNA-seq? The unshrunken MA-plot looks very odd and indicates that normalization is completely offset.

ADD REPLY
0
Entering edit mode

I agree it looks offset (and weird). The data is from NGS, not RNAseq, but we still want to see relative abundance between conditions - I thought the same principles would apply.

ADD REPLY
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

It's not just the average count but also the dispersion that affects apeglm.

You can try ashr for another type of shrinkage.

ADD COMMENT
0
Entering edit mode

Thank you for your response. The ashr shrinkage indeed preserves more of the original ranks in the data.

So if I understand correctly, LFC for Gene 2 in the example is strongly shrunk because it's more evenly dispersed between 0 and 10e6 in condition1. Gene 1 is clustered around 0 with a single outlier, hence less dispersion, hence increase rather than decrease in LFC?

ADD REPLY
0
Entering edit mode

Gene 1 is hard to predict given the paucity of data for the comparison of interest. What is happening is that the relatively low dispersion from condition 1 and 2 is bringing the dispersion down for the whole gene but then you only have a single count for the reference.

I would actually recommend you not process the data altogether for this dataset (so instead run DESeq separately on condition 1 vs ref and condition 2 vs ref), because I think the model is too confident from pooling dispersion across all samples. For a two group comparison I would guess it would be less confident, more accurately summarizing the data.

ADD REPLY
0
Entering edit mode
swbarnes2 ★ 1.4k
@swbarnes2-14086
Last seen 1 day ago
San Diego

The y axes in those plots are logged. The average for condition1 is going to be higher than condition2 because of the very high outliers.

ADD COMMENT

Login before adding your answer.

Traffic: 506 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6