Hi!
I already posted the same question as a comment in this thread: Subtracting Background Signal Before DESeq Analysis, but Michael kindly advised me to reformulate the problem here. The issue relates to SLAM-seq, a metabolic labeling technique for analyzing transcription: newly made RNA is pulse-labeled with a modified nucleotide, which is later detected in RNA-seq as a T->C mismatch. The readout is the count of T->C conversions, which has a distribution similar to ordinary read counts. An alternative readout is the count of reads containing mismatches.
Unlike typical RNA-seq, SLAM-seq has intrinsic background arising from sequencing errors, unspecific labeling, and other factors (collectively: noise). I now tend to think that the best approach would be to incorporate this noise into the model. The noise-to-signal ratio may vary from sample to sample; let's say noise constitutes about 20-40% of the total signal magnitude (so 20-40% of genes have signals close to or below the noise level).
I am thinking of this problem using the following analogy (please correct me if you think it's wrong): imagine a count matrix to which an evil dwarf randomly adds a number to each count, sampled from a normal distribution with known mean and SD. Having no way to recover the exact counts, should I subtract the mean noise from each count, or should I include the noise in the model? The latter seems the right way to go, but I struggle to formulate proper design formulas for DESeq2 or limma.
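To make the analogy concrete, here is a small simulation sketch (all numbers are made up); it also shows why plain subtraction bothers me:

## Simulating the "evil dwarf": an additive background on top of true counts.
## All numbers here are made up for illustration.
set.seed(1)
true_counts <- matrix(rnbinom(1000 * 6, mu = 100, size = 10), nrow = 1000)
## background drawn from a normal with known mean and SD, floored at zero
background <- matrix(pmax(0, round(rnorm(1000 * 6, mean = 30, sd = 10))),
                     nrow = 1000)
observed <- true_counts + background
## subtracting the mean background removes the bias but not the extra
## variance the dwarf introduced, which is one argument for modeling it
corrected <- pmax(observed - 30, 0)
c(mean(true_counts), mean(corrected))                       # means roughly agree
c(var(as.vector(true_counts)), var(as.vector(corrected)))  # variance is inflated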
I'd like to consider the following time-series experiment with 2 time points (t0, t1) and 2 conditions (control, treatment).
Case 1. I have a separate measurement to determine the background levels, and I assume the background levels are identical across all samples. The design I was thinking of is:
~ time + condition + time:condition
However, I'm not sure how to incorporate the noise into the formula in this case, since the noise is the same for all samples and is estimated indirectly (e.g., from a control sample).
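For concreteness, this is roughly how I would set up Case 1 in DESeq2 (the counts and sample layout below are simulated placeholders for the T->C conversion counts):

library(DESeq2)
counts_matrix <- matrix(rnbinom(1000 * 8, mu = 100, size = 10), nrow = 1000)
coldata <- data.frame(
  time      = factor(rep(c("t0", "t1"), each = 4)),
  condition = factor(rep(c("control", "treatment"), times = 4))
)
dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
                              colData   = coldata,
                              design    = ~ time + condition + time:condition)
dds <- DESeq(dds)
resultsNames(dds)  # the interaction term tests condition-specific time effects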
Case 2. I have a separate measurement to determine the background levels, and I estimate the background separately for each SAMPLE. The design I was thinking of is:
~noise + time + condition + time:condition
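A sketch of what I have in mind, with hypothetical per-sample noise estimates entered as a continuous covariate:

library(DESeq2)
counts_matrix <- matrix(rnbinom(1000 * 8, mu = 100, size = 10), nrow = 1000)
coldata <- data.frame(
  time      = factor(rep(c("t0", "t1"), each = 4)),
  condition = factor(rep(c("control", "treatment"), times = 4)),
  ## made-up per-sample noise fractions, centered (mean 0.29) so the
  ## intercept and factor effects keep their usual interpretation
  noise     = c(0.25, 0.31, 0.22, 0.38, 0.27, 0.35, 0.24, 0.30) - 0.29
)
dds <- DESeqDataSetFromMatrix(counts_matrix, coldata,
                              design = ~ noise + time + condition + time:condition)
dds <- DESeq(dds)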
Case 3. I have a separate measurement to determine the background levels, and I estimate the background separately for each GENE. In this case, I don't see a clear way to avoid subtracting the background counts for each gene. I thought of determining which genes are above the noise level with an ANOVA, and then subtracting the background only for those genes; genes that do not pass the ANOVA test would simply get 0 counts. Intuitively I think there is a more appropriate way to do it, but I fail to see it.
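A literal sketch of the subtraction idea, with hypothetical per-gene background estimates (I suspect a model-based alternative would be preferable):

set.seed(1)
observed   <- matrix(rnbinom(1000 * 8, mu = 100, size = 10), nrow = 1000)
background <- rnbinom(1000, mu = 30, size = 10)  # one estimate per gene
## R recycles the vector down each column, so background[i] is subtracted
## from gene i in every sample; flooring at zero keeps counts valid and
## effectively zeroes out genes that never rise above their background
corrected <- pmax(observed - background, 0)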
Any hints that would help me to get started are greatly appreciated.
Cheers, Lech
Thanks Michael, I will look into limma's quality weighting; there seem to be plenty of functions in this package I was not aware of. But I'm still struggling with handling the background noise in the experimental design. Could you please give me feedback on the following ideas:
Case 1. Constant noise determined from a control sample. Would it be correct in that case to treat the noise as part of the intercept term? My design would then be something like:
~condition + time
Case 2. Noise estimated for each condition and time point separately. I could then introduce the noise as an interaction term, with a formula like:
~condition:noise + time:noise
or
~0 + condition:noise + time:noise
Am I getting somewhere with this?
Cheers, Lech
Case 1 is not possible / meaningful, as far as I understand what you want to do (control for certain samples having different expression artifacts).
Case 2 has perfect confounding as you've written it: a noise term estimated per condition and time point cannot be separated from the condition and time effects themselves. But if you mean adding a variable that controls for something like "batch", then sure, this is used all the time with DESeq2.
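For illustration, a minimal sketch of that batch-covariate pattern (all data and names are simulated placeholders):

library(DESeq2)
counts_matrix <- matrix(rnbinom(1000 * 8, mu = 100, size = 10), nrow = 1000)
coldata <- data.frame(
  batch     = factor(rep(c("b1", "b2"), times = 4)),
  condition = factor(rep(c("control", "treatment"), each = 4))
)
## each batch contains both conditions, so the design is full rank;
## model.matrix() lets you verify this before running DESeq()
mm <- model.matrix(~ batch + condition, coldata)
stopifnot(qr(mm)$rank == ncol(mm))
dds <- DESeqDataSetFromMatrix(counts_matrix, coldata,
                              design = ~ batch + condition)
dds <- DESeq(dds)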