Question

LFC shrinkage executed when not expected (DESeq2)

0

Entering edit mode

Jonas B. ▴ 40

@jonas-b-14652

Last seen 5.3 years ago

Belgium, Antwerp, University of Antwerp

Dear all,

While trying to obtain specific LFCs, we noticed an inconsistency in the results.

We are performing a RNAseq analysis of an experiment with three fixed factors: treatment (control, CL), genotype (G1, G2, G3) and time (t0, t1, t2).

We are interested in specific comparisons, for instance comparing the treatment: control vs CL. Following the manual, there are two ways to do this.

The first method would be to perform an analysis where you contrast both treatments with the results function:

# First perform the analysis of the DESeqDataSet
dds_withTHREEWAYint <- DESeqDataSetFromMatrix(countData = mydata_matrix, colData = colData, design = ~ treatment + genotype + time + treatment:genotype + treatment:time + genotype:time + treatment:genotype:time)
dds_withTHREEWAYint_DESeq <- DESeq(dds_withTHREEWAYint)

# treatment drought vs control
results_for_contrast_CLvcontrol_BH <- results(dds_withTHREEWAYint_DESeq, contrast=c("treatment", "CL","control"), alpha=0.05, pAdjustMethod = "BH")

# Head of results
head(results_for_contrast_CLvcontrol_BH)

The last line generates the following output:

> head(results_for_contrast_CLvcontrol_BH)
log2 fold change (MLE): treatment CL vs control 
Wald test p-value: treatment CL vs control 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange     lfcSE       stat     pvalue       padj
          <numeric>      <numeric> <numeric>  <numeric>  <numeric>  <numeric>
AT1G01010  233.9382    -0.05907449 0.2442976 -0.2418136 0.80892457 0.90003409
AT1G01020  227.5727    -0.23369322 0.1027148 -2.2751670 0.02289591 0.08155662
AT1G01030  116.6140     0.68119565 0.2921068  2.3320089 0.01970022 0.07283691
AT1G01040 1480.6111    -0.14968164 0.1093810 -1.3684431 0.17117343 0.35035306
AT1G01046    0.0000             NA        NA         NA         NA         NA
AT1G01050 1328.1905     0.10908425 0.1433199  0.7611243 0.44658283 0.64690174

A second method, but maybe less straight forward, would be to use a group design. Here, we only use treatment as group and then compare both treatment groups, which looks like this:

# First perform the analysis of the DESeqDataSet
dds_withTHREEWAYint <- DESeqDataSetFromMatrix(countData = mydata_matrix, colData = colData, design = ~ treatment + genotype + time + treatment:genotype + treatment:time + genotype:time + treatment:genotype:time)
dds_withTHREEWAYint_forgroupanalysis <- dds_withTHREEWAYint

# Combine the factors of interest into a single factor with all combinations of the original factors.
dds_withTHREEWAYint_forgroupanalysis$group <- factor(paste0(dds_withTHREEWAYint_forgroupanalysis$treatment))
# Change the design to include just this factor, e.g. ~ group.
design(dds_withTHREEWAYint_forgroupanalysis) <- ~ group
# With the factors and the new design in place, perform DESeq (now the new group factor and the group design will be used).
dds_withTHREEWAYint_forgroupanalysis <- DESeq(dds_withTHREEWAYint_forgroupanalysis)
# Using the resultsNames-function, one can print all the new combinations of the fixed factors (here treatment, genotype and time).
# These names can be used to create contrasts of choice. 
resultsNames(dds_withTHREEWAYint_forgroupanalysis)

# Extract results
res_BH <- results(dds_withTHREEWAYint_forgroupanalysis, contrast=c("group", "CL", "control"), alpha=0.05, pAdjustMethod = "BH")
# Head of the results:
head(res_BH)

Which generates the following output:

> resultsNames(dds_withTHREEWAYint_forgroupanalysis)
[1] "Intercept"    "groupCL"      "groupcontrol"
> res_BH <- results(dds_withTHREEWAYint_forgroupanalysis, contrast=c("group", "CL", "control"), alpha=0.05, pAdjustMethod = "BH")
> head(res_BH)
log2 fold change (MAP): group CL vs control 
Wald test p-value: group CL vs control 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange      lfcSE      stat       pvalue         padj
          <numeric>      <numeric>  <numeric> <numeric>    <numeric>    <numeric>
AT1G01010  233.9382     0.70431932 0.15242364  4.620801 3.822612e-06 1.174188e-05
AT1G01020  227.5727    -0.08691073 0.03719792 -2.336441 1.946829e-02 3.213914e-02
AT1G01030  116.6140     0.22132906 0.24070875  0.919489 3.578398e-01 4.368425e-01
AT1G01040 1480.6111    -0.34129933 0.06420846 -5.315488 1.063717e-07 4.053913e-07
AT1G01046    0.0000             NA         NA        NA           NA           NA
AT1G01050 1328.1905    -0.28299095 0.12400197 -2.282149 2.248055e-02 3.666190e-02

What strikes us, is the difference in LFC. The LFCs of the group design were shrunk (indicated by MAP), where the LFCs of the other analysis where not shrunk (indicated by MLE).

Since a recent post by Michael Love (New function lfcShrink() in DESeq2) mentioned that the DESeq function no longer shrinks the LFCs, we didn't understand why MAP was still used.

We also noticed that the p-values were different. We think that this is because the design was and a different method was used. Is it correct to assume this? Based on your experience, which p-value would you advice to use?

Thank you for your advice and time,

Kind regards,

Jonas

Antwerp University

deseq2 • 1.7k views

ADD COMMENT • link 7.0 years ago Jonas B. ▴ 40

score 0 · Answer 1 · 2018-02-07

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 41 minutes ago

United States

What version of DESeq2 do you have?

The second run really can’t produce MAP estimates because betaPrior is set to FALSE since version 1.16 (April 207).

ADD COMMENT • link 7.0 years ago Michael Love 43k

0

Entering edit mode

Dear Michael,

Thank you for your quick response.

It turns out that my version is outdated. I really didn't expect this, since I've only installed it in December 2017. Having looked into this, I've noticed that an older version was installed because I'm running an older R-version. I didn't know this, now I do!

Lesson learned: Always update every package and language! (Or at least be very aware of it!)

I've quickly ran it again on a computer running R-version 3.4.3 and package version 1.18.1.

Now, both analyses report that MLE was used, so no shrinkage should have occurred. However, our results still show different LFCs and p-values when comparing both analyses.

Output for group analysis:

> res_BH_3
log2 fold change (MLE): group CL vs control 
Wald test p-value: group CL vs control 
DataFrame with 33208 rows and 6 columns
            baseMean log2FoldChange      lfcSE       stat       pvalue         padj
           <numeric>      <numeric>  <numeric>  <numeric>    <numeric>    <numeric>
AT1G01010   233.9382     0.70816864 0.15325652  4.6208061 3.822519e-06 1.146368e-05
AT1G01020   227.5727    -0.08693879 0.03720991 -2.3364421 1.946821e-02 3.138168e-02
AT1G01030   116.6140     0.22439652 0.24404246  0.9194979 3.578352e-01 4.274354e-01
AT1G01040  1480.6111    -0.34162718 0.06427015 -5.3154872 1.063723e-07 3.956375e-07
AT1G01046     0.0000             NA         NA         NA           NA           NA
...              ...            ...        ...        ...          ...          ...
MIR854c   0.02422714      0.0906822   2.925159 0.03100077    0.9752689           NA
MIR854d   0.00000000             NA         NA         NA           NA           NA
MIR854e   0.00000000             NA         NA         NA           NA           NA
MIR855    0.00000000             NA         NA         NA           NA           NA
MIR858b   0.01255371      0.1708312   2.925159 0.05840066    0.9534295           NA

Output for the other analysis:

> head(results_for_contrast_CLvcontrol_BH)
log2 fold change (MLE): treatment CL vs control 
Wald test p-value: treatment CL vs control 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange     lfcSE       stat     pvalue       padj
          <numeric>      <numeric> <numeric>  <numeric>  <numeric>  <numeric>
AT1G01010  233.9382    -0.05907449 0.2442976 -0.2418136 0.80892457 0.90003409
AT1G01020  227.5727    -0.23369322 0.1027148 -2.2751670 0.02289591 0.08155662
AT1G01030  116.6140     0.68119565 0.2921068  2.3320089 0.01970022 0.07283691
AT1G01040 1480.6111    -0.14968164 0.1093810 -1.3684431 0.17117343 0.35035306
AT1G01046    0.0000             NA        NA         NA         NA         NA
AT1G01050 1328.1905     0.10908425 0.1433199  0.7611243 0.44658283 0.64690174

These LFCs are exactly the same as the ones reported in the first post (the one with the older version, where shrinkage occurred). Noteworthy: The adjusted p-values of the group analysis also changed slightly.

This means we are still left with the same question, now using R version 3.4.3 and DESeq2 version 1.18.1. Could you please explain why we get different LFCs and p-values?

Thank you in advance!

Kind regards,

Jonas

ADD REPLY • link 7.0 years ago Jonas B. ▴ 40

0

Entering edit mode

Here, if I follow the designs correctly, you are comparing the CL vs control effect in the interaction design, with the CL vs control effect when the only variable in the design is treatment (control or CL). The interpretation of CL vs control is not the same here. In the interaction design, CL vs control is the effect seen when all other variables are at the reference level. Take a look at the diagram here:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions

ADD REPLY • link 7.0 years ago Michael Love 43k

0

Entering edit mode

Dear Michael,

Thank you for your response.

We got confused with the different methods to analyse specific contrasts.

Following the manual, we've now decided to do the following:

For any three-way contrasts of interest, we believe that we have to follow the interaction instructions (manual), where our dds contains the three way model, followed by the creation of factors and the change in design to groups. The following code of the manual was used and adjusted:

dds$group <- factor(paste0(dds$genotype, dds$condition))
design(dds) <- ~ group
dds <- DESeq(dds)
resultsNames(dds)
results(dds, contrast=c("group", "IB", "IA"))

Next, to test our single factors (e.g. make a comparison just between two treatments, without taking into account genotypes or time points), we believe that we have to use the contrast functionality without transforming our design to a group design. Meaning, our dds still contains all interactions, up until the threeway interaction, and we then use the contrast function to compare specific groups of a single factor. For example, within treatments, we compare treatmentA versus treatmentB. For this, we based ourselves on the following code of the manual:

results(dds, contrast=c("condition","C","B"))

However, we do not have a clue how to contrast the two-way interactions. We expect that we need to use the group design, but we don't know how to use it in this case.

Sorry to make things so complicated.

Thank you for your time and advice (and patience).

Kind regards,

Jonas and colleagues

ADD REPLY • link 7.0 years ago Jonas B. ▴ 40

0

Entering edit mode

hi Jonas,

this from before is key:

"In the interaction design, CL vs control is the effect seen when all other variables are at the reference level. "

So with respect to your statement

"make a comparison just between two treatments, without taking into account genotypes or time points"

using contrast like you have above is just comparing the effect of condition for the reference level of the other variables, and when the other variables are not the reference level, those samples are not used in the contrast at all.

In an interaction design, there is no coefficient which represents something like "the condition effect, without taking into account the genotypes or time points". What you have is many different condition effects, stratified by genotype and time point and genotype x time point. I think this is a sufficiently complex design that you need to partner with someone who has experience with interactions and linear modeling.

ADD REPLY • link 7.0 years ago Michael Love 43k

0

Entering edit mode

Dear Michael,

Thank you so much for your time and response. I do believe that I understand the the mistake we're making.

And indeed, our design is complex enough to solicit somebody with more experience in this topic. I will only return to the forum with more precise questions or when unexplainable errors occur. I understand that discussing an entire design is out of the scope of the forum.

I want to thank you for all the help offered and time invested. The response time on the forum is amazing and also the help in previous posts allowed me to make quite some progress already.

Kind regards,

Jonas

ADD REPLY • link 7.0 years ago Jonas B. ▴ 40

0

Entering edit mode

Dear Michael,

I've been experimenting with the group function a bit more, on a simpeler data set.

The experiment had the following design:

three treatments: control, mild, severe
three zones: mer, elo, mat

To test specific combinations of both factors, the manual advices to do the following:

dds$group <- factor(paste0(dds$genotype, dds$condition)) design(dds) <- ~ group dds <- DESeq(dds) resultsNames(dds) results(dds, contrast=c("group", "IB", "IA"))

For my data, it could look like this:

dds_withint_forgroupanalysis <- DESeqDataSetFromMatrix(countData = mydata_matrix, colData = colData, design = ~ zone + treatment + zone:treatment)
dds_withint_forgroupanalysis$group <- factor(paste0(dds_withint_forgroupanalysis$treatment, dds_withint_forgroupanalysis$zone))
design(dds_withint_forgroupanalysis) <- ~ group
dds_withint_forgroupanalysis <- DESeq(dds_withint_forgroupanalysis)
resultsNames(dds_withint_forgroupanalysis)
res_BH <- results(dds_withint_forgroupanalysis, contrast=c("group", "mildelo", "controlelo"), alpha=0.05, pAdjustMethod = "BH")
res_BH

Where the output looks like this:

> res_BH
log2 fold change (MLE): group mildelo vs controlelo 
Wald test p-value: group mildelo vs controlelo 
DataFrame with 45795 rows and 6 columns
                 baseMean log2FoldChange     lfcSE        stat     pvalue      padj
                <numeric>      <numeric> <numeric>   <numeric>  <numeric> <numeric>
Zm00001d036955  0.0544819      0.3367455 6.3427193  0.05309166 0.95765888        NA
Zm00001d024843 11.6731276      1.2986291 1.7364082  0.74788243 0.45453108        NA
...

I wanted to compare the fold changes, calculated by DESeq2 as shown above Log2(avg_mildelo/avg_controlelo), to the fold changes which I've calculated myself. For this, I've used the normalised expression values, extracted from the DESeq-analysis. Extracted using this code:

dds_withint <- DESeqDataSetFromMatrix(countData = mydata_matrix, colData = colData, design = ~ zone + treatment + zone:treatment)
dds_effectofinteraction <- DESeq(dds_withint, test="LRT", reduced=~ zone + treatment)
normcounts <- counts(dds_effectofinteraction, normalized = TRUE)
write.csv(normcounts, file="norm_counts.csv")

And with those normalised expression values, I've calculated the fold changes of the same contrast as the one I've calculated with DESeq2. However, I notice there is a small discrepancy in the fold changes. For instance: calculated LFC = -0.65, DESeq2 LFC = -0.59.

In the manual, the following is stated:

... The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the dispersions and to estimate the log2 fold changes of the model.

And also a bit further:

... The coefficients βi give the log2 fold changes for gene i for each column of the model matrix X. Note that the model can be generalized to use sample- and gene-dependent normalization factors sij.

So my guess is that the fold change reported by DESeq2 is not a shrunken one (since MLE is stated in the output, but also not the simple Log2(avg_mildelo/avg_controlelo) for example. In fact, the LFC reported by DESeq is normalised itself, based on a factor for each sample/gene combination, which I cannot simply replicate myself with the normalised counts.

To conclude, my two questions:

1. Could you please verify the reasoning in the last paragraph of the above? Feel free to elaborate a bit more on this in Layman's terms.

2. Concerning the groups: I'm a bit confused by the names function when using groups. The output of the names functions looks like this:

> resultsNames(dds_withint_forgroupanalysis)
[1] "Intercept"                      "group_controlmat_vs_controlelo" "group_controlmer_vs_controlelo" "group_mildelo_vs_controlelo"    "group_mildmat_vs_controlelo"   
[6] "group_mildmer_vs_controlelo"    "group_severeelo_vs_controlelo"  "group_severemat_vs_controlelo"  "group_severemer_vs_controlelo"

Where in the contrast function you use different names, e.g.:

res_BH <- results(dds_withint_forgroupanalysis, contrast=c("group", "mildelo", "controlelo"), alpha=0.05, pAdjustMethod = "BH")

This is correct, right?

ADD REPLY • link 7.0 years ago Jonas B. ▴ 40

0

Entering edit mode

The MLE should be very close to the simple log2(average of normalized counts / average of normalized counts). But instead of computing based on averages, the DESeq2 method uses a GLM and solves for the MLE, and there are minor modifications to stabilize the output, e.g. bounding the expected count so as not to produce 1/0 during the algorithm, etc.

Because you have an intercept, the coefficients actually produced by the model are all compared to the reference level (see the vignette section on factor levels). We wrote the results() function with an argument 'contrast' which takes care of everything for you, so you can compare two arbitrary levels of a factor variable.

ADD REPLY • link 7.0 years ago Michael Love 43k

0

Entering edit mode

Thank you Michael!

You wrote:

"...so you can compare two arbitrary levels of a factor variable."

This looks like you referred to a one factor design.

So in a two factor design, this means that I can compare any combination the two factors against each other using the group functionality, right?

ADD REPLY • link 7.0 years ago Jonas B. ▴ 40

0

Entering edit mode

Yes. Levels in the group design correspond to combinations of levels from both factors.