Hi!
I already posted the same question as a comment in this thread: Subtracting Background Signal Before DESeq Analysis, but Michael kindly advised me to reformulate the problem here. The issue relates to SLAM-seq, a metabolic labeling technique for analyzing transcription: newly made RNA is pulse-labeled with a modified nucleotide, which is later detected in RNA-seq as a T->C mismatch. The readout is the count of T->C conversions, which has a distribution similar to ordinary read counts. An alternative readout is the count of reads containing mismatches.
Unlike typical RNA-seq, SLAM-seq has intrinsic background arising from sequencing errors, unspecific labeling, and other factors (collectively: noise). I now tend to think that the best approach would be to incorporate this noise into the model. The noise-to-signal ratio may vary from sample to sample; let's say noise constitutes about 20-40% of the total signal magnitude (so 20-40% of genes have signals close to or below the noise level).
I am thinking of this problem using the following analogy (please correct me if you think it's wrong): imagine a count matrix to which an evil dwarf randomly adds a number to each count, sampled from a normal distribution with known mean and SD. Having no way to recover the exact counts, should I subtract the mean noise from each count, or should I include the noise in the model? The latter seems the right way to go, but I struggle to formulate proper design formulas for DESeq2 or limma.
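To make the analogy concrete, here is a small simulation sketch (all numbers are made up); it also shows why plain subtraction bothers me:

## Simulating the "evil dwarf": an additive background on top of true counts.
## All numbers here are made up for illustration.
set.seed(1)
true_counts <- matrix(rnbinom(1000 * 6, mu = 100, size = 10), nrow = 1000)
## background drawn from a normal with known mean and SD, floored at zero
background <- matrix(pmax(0, round(rnorm(1000 * 6, mean = 30, sd = 10))),
                     nrow = 1000)
observed <- true_counts + background
## subtracting the mean background removes the bias but not the extra
## variance the dwarf introduced, which is one argument for modeling it
corrected <- pmax(observed - 30, 0)
c(mean(true_counts), mean(corrected))                       # means roughly agree
c(var(as.vector(true_counts)), var(as.vector(corrected)))  # variance is inflated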
I'd like to consider the following time-series experiment with 2 time points (t0, t1) and 2 conditions (control, treatment).
Case 1. I have a separate measurement to determine the background levels, and I assume the background levels are identical across all samples. The design I was thinking of is:
~ time + condition + time:condition
However, I'm not sure how to incorporate the noise into the formula in this case, since the noise is the same for all samples and is estimated indirectly (e.g., from a control sample).
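For concreteness, this is roughly how I would set up Case 1 in DESeq2 (the counts and sample layout below are simulated placeholders for the T->C conversion counts):

library(DESeq2)
counts_matrix <- matrix(rnbinom(1000 * 8, mu = 100, size = 10), nrow = 1000)
coldata <- data.frame(
  time      = factor(rep(c("t0", "t1"), each = 4)),
  condition = factor(rep(c("control", "treatment"), times = 4))
)
dds <- DESeqDataSetFromMatrix(countData = counts_matrix,
                              colData   = coldata,
                              design    = ~ time + condition + time:condition)
dds <- DESeq(dds)
resultsNames(dds)  # the interaction term tests condition-specific time effects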
Case 2. I have a separate measurement to determine the background levels, and I estimate the background separately for each SAMPLE. The design I was thinking of is:
~noise + time + condition + time:condition
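A sketch of what I have in mind, with hypothetical per-sample noise estimates entered as a continuous covariate:

library(DESeq2)
counts_matrix <- matrix(rnbinom(1000 * 8, mu = 100, size = 10), nrow = 1000)
coldata <- data.frame(
  time      = factor(rep(c("t0", "t1"), each = 4)),
  condition = factor(rep(c("control", "treatment"), times = 4)),
  ## made-up per-sample noise fractions, centered (mean 0.29) so the
  ## intercept and factor effects keep their usual interpretation
  noise     = c(0.25, 0.31, 0.22, 0.38, 0.27, 0.35, 0.24, 0.30) - 0.29
)
dds <- DESeqDataSetFromMatrix(counts_matrix, coldata,
                              design = ~ noise + time + condition + time:condition)
dds <- DESeq(dds)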
Case 3. I have a separate measurement to determine the background levels, and I estimate the background separately for each GENE. In this case, I don't see a clear way to avoid subtracting the background counts for each gene. I thought of determining which genes are above the noise level with an ANOVA, and then subtracting the background only for those genes; genes that do not pass the ANOVA test would simply get 0 counts. Intuitively I think there is a more appropriate way to do it, but I fail to see it.
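A literal sketch of the subtraction idea, with hypothetical per-gene background estimates (I suspect a model-based alternative would be preferable):

set.seed(1)
observed   <- matrix(rnbinom(1000 * 8, mu = 100, size = 10), nrow = 1000)
background <- rnbinom(1000, mu = 30, size = 10)  # one estimate per gene
## R recycles the vector down each column, so background[i] is subtracted
## from gene i in every sample; flooring at zero keeps counts valid and
## effectively zeroes out genes that never rise above their background
corrected <- pmax(observed - background, 0)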
Any hints that would help me to get started are greatly appreciated.
Cheers, Lech
Thanks Michael, I will look into limma's quality weighting; there seem to be plenty of functions in this package I was not aware of. But I'm still struggling with handling the background noise in the experimental design. Could you please give me feedback on the following ideas:
Case 1. Constant noise determined from a control sample. Would it be correct in that case to treat the noise as part of the intercept term? My design would then be something like:
~condition + time
Case 2. Noise estimated for each condition and time point separately. I could then introduce the noise as an interaction term, with a formula like:
~condition:noise + time:noise
or
~0 + condition:noise + time:noise
Am I getting somewhere with this?
Cheers, Lech
Case 1 is not possible / meaningful, as far as I understand what you want to do (control for certain samples having different expression artifacts).
Case 2 has perfect confounding as you've written it: a noise term estimated per condition and time point cannot be separated from the condition and time effects themselves. But if you mean adding a variable that controls for something like "batch", then sure, this is used all the time with DESeq2.
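For illustration, a minimal sketch of that batch-covariate pattern (all data and names are simulated placeholders):

library(DESeq2)
counts_matrix <- matrix(rnbinom(1000 * 8, mu = 100, size = 10), nrow = 1000)
coldata <- data.frame(
  batch     = factor(rep(c("b1", "b2"), times = 4)),
  condition = factor(rep(c("control", "treatment"), each = 4))
)
## each batch contains both conditions, so the design is full rank;
## model.matrix() lets you verify this before running DESeq()
mm <- model.matrix(~ batch + condition, coldata)
stopifnot(qr(mm)$rank == ncol(mm))
dds <- DESeqDataSetFromMatrix(counts_matrix, coldata,
                              design = ~ batch + condition)
dds <- DESeq(dds)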