Question

Treatment dose in treatment-control RNA-seq

JacobK • 0

@jacobk-9130

Germany

Dear Bioconductor community,

I have an RNA-seq dataset consisting of 16 samples. Two different genotypes are expected to respond differently to a treatment (4x4 samples). Half of the mice are untreated and half received an additive in their food. Because individual mice eat different amounts during the day, the daily dose per unit body weight varies widely. I know that it is possible to build a quantitative model in which the treatment is a real-valued covariate, so that the treatment effect is expressed per unit of treatment dose. Then, if I am not mistaken, there are no treated and untreated groups anymore and there will be no log2FC between them. Most lab people like to have fold change values. My questions:

1. Is it possible to stick with the treated vs. untreated groups here but get a clearer picture of the effect by using the dose as a continuous covariate? Or, phrased somewhat differently: could the continuous covariate be introduced only for the treated animals, or for all animals but set to 0 for the untreated ones, e.g. as the difference from the mean dose within the group (0 for all untreated animals and a real number for each treated animal)?
2. Is there a standard way to decide between a continuous covariate, binning the dose into a factor, or no covariate at all? What measure could be used to justify this choice, and the number of bins when binning is preferable?

Any comments on the approach in general and how to actually analyze this design in DESeq2 or edgeR are most welcome. Thanks for your support and keep up the fantastic work,

Jacob.

deseq2 edger dose rna-seq • 1.9k views

updated 9.5 years ago by Aaron Lun ★ 28k • written 9.5 years ago by JacobK • 0

score 1 · Answer 1 · 2015-11-08

For starters, does the dose actually have any effect? Have a look at an MDS/PCA plot and see if the treated mice separate according to dosage. It may be that, past a certain point, the effect on expression is invariant to the actual dosage. If that's the case, then you can just use a simple one-way layout.

Otherwise, it starts to get a bit tricky. If you perform a linear regression on the dosage (setting untreated as dosages of zero, for example), the value of the coefficient will represent the log-fold change per unit of treatment. However, if the gene expression response is non-linear, this value will not make any sense. The linear model will also fit poorly to a non-linear response, resulting in inflated dispersion estimates from edgeR. Instead, I would suggest using splines, which would provide a more flexible fit and avoid dispersion overestimation. A spline-based model can then be used to test for DE across all dosages within each genotype (or for a differential dosage effect between genotypes, if you've set it up to use genotype-specific coefficients).
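
To make the two designs above concrete, here is a minimal plain-Python sketch of what the design matrices could look like. It only builds the matrices and does not run any DE analysis. The genotype labels (WT/KO), the dose values and the knot position are all invented for illustration, and the piecewise-linear basis is a deliberately simple stand-in for a proper spline basis (in R one would more likely build the design with model.matrix and something like splines::ns).

```python
# Hypothetical sample sheet: 2 genotypes x (4 untreated + 4 treated) mice.
# Genotype labels and dose values are invented for illustration.
TREATED_DOSES = {"WT": [1.2, 2.5, 3.1, 4.0], "KO": [0.8, 1.9, 2.7, 3.6]}

samples = []
for genotype, doses in TREATED_DOSES.items():
    samples += [(genotype, 0.0)] * 4            # untreated mice: dose set to zero
    samples += [(genotype, d) for d in doses]   # treated mice: measured dose


def linear_row(genotype, dose):
    """Columns: intercept, KO offset, WT dose slope, KO dose slope."""
    ko = 1.0 if genotype == "KO" else 0.0
    return [1.0, ko, dose * (1.0 - ko), dose * ko]


def spline_row(genotype, dose, knots=(2.0,)):
    """Same layout, with each slope replaced by a piecewise-linear spline basis.

    Basis per genotype: dose, then max(dose - knot, 0) for each knot.
    Every basis function is zero at dose 0, so untreated mice only inform
    the intercept and the KO offset.
    """
    ko = 1.0 if genotype == "KO" else 0.0
    basis = [dose] + [max(dose - k, 0.0) for k in knots]
    return [1.0, ko] + [b * (1.0 - ko) for b in basis] + [b * ko for b in basis]


linear_design = [linear_row(g, d) for g, d in samples]
spline_design = [spline_row(g, d) for g, d in samples]

print(len(linear_design), "samples,", len(linear_design[0]), "columns (linear)")
print(len(spline_design), "samples,", len(spline_design[0]), "columns (spline)")
```

In the linear design, the last two columns are the genotype-specific coefficients whose values would be read as log-fold change per unit of dose. In the spline design, testing all dose columns of one genotype together corresponds to testing for DE across all dosages within that genotype.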

Now, it's true that the log-fold changes of spline coefficients don't have any obvious interpretation. However, from edgeR's perspective (and if you're not using treat), the log-fold change is just a descriptive value that doesn't play any part in the significance calculations. There's nothing stopping you from going back, fitting a one-way treatment-genotype model or a linear regression model, and computing the log-fold changes from them. You can then report the p-value from the spline model with the log-fold changes from the other models. Make sure you keep track of what your log-fold changes are, though - it's easy to get confused with multiple models flying around.
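
One way to avoid the confusion mentioned above is to label every reported column with the model it came from. The following is only a sketch of that bookkeeping in plain Python: the gene names and all numbers are placeholders, not results, and the helper `combine` and the column naming scheme are hypothetical.

```python
# Placeholder per-gene results; every number here is made up for illustration.
spline_results = {            # from the spline model: used for significance
    "geneA": {"PValue": 0.0004},
    "geneB": {"PValue": 0.31},
}
one_way_results = {           # from the one-way model: used for effect size
    "geneA": {"logFC": 1.8},
    "geneB": {"logFC": -0.2},
}


def combine(pvalue_table, logfc_table, pvalue_model, logfc_model):
    """Merge two per-gene tables, tagging each column with its source model."""
    combined = {}
    for gene in pvalue_table:
        if gene not in logfc_table:
            raise KeyError(f"{gene} is missing from the {logfc_model} table")
        combined[gene] = {
            f"PValue_{pvalue_model}": pvalue_table[gene]["PValue"],
            f"logFC_{logfc_model}": logfc_table[gene]["logFC"],
        }
    return combined


report = combine(
    spline_results, one_way_results,
    pvalue_model="spline",
    logfc_model="oneway_treated_vs_untreated",
)
for gene, row in report.items():
    print(gene, row)
```

With explicit column names, a reader of the final table can see that the p-value and the log-fold change answer slightly different questions.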

Finally, I would not bin covariates if I could avoid it. I guess that, if you look at the dosages and there are distinct groups, that could motivate a sensible choice of binning. Otherwise, it's a toss-up between trying to keep sufficient residual d.f. and getting a good model fit - and if you're going to do that, you might as well fit a spline, which accounts for the continuous nature of the covariates more elegantly.
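
The residual d.f. trade-off can be made concrete with some simple counting for the 16-sample design in the question. This sketch assumes full-rank designs (for binning, every bin must contain at least one mouse) and genotype-specific dose terms. The function names are hypothetical.

```python
N_SAMPLES = 16     # from the question: 2 genotypes, 8 mice each
N_GENOTYPES = 2


def residual_df(n_columns, n_samples=N_SAMPLES):
    """Residual d.f. of a full-rank design: samples minus fitted coefficients."""
    return n_samples - n_columns


def columns_one_way():
    # one coefficient per genotype x treatment group
    return N_GENOTYPES * 2


def columns_binned(n_bins):
    # per genotype: one untreated level plus n_bins dose bins among the treated
    return N_GENOTYPES * (1 + n_bins)


def columns_dose_curve(curve_df):
    # per genotype: an intercept plus curve_df dose terms
    # (curve_df = 1 is a straight line, larger values a spline basis)
    return N_GENOTYPES * (1 + curve_df)


designs = [
    ("one-way treated/untreated", columns_one_way()),
    ("linear dose", columns_dose_curve(1)),
    ("2 dose bins", columns_binned(2)),
    ("spline, 2 d.f.", columns_dose_curve(2)),
    ("4 dose bins (one mouse per bin)", columns_binned(4)),
]
for label, n_columns in designs:
    print(f"{label}: {n_columns} coefficients, {residual_df(n_columns)} residual d.f.")
```

Under these assumptions, k bins and a spline with k d.f. cost the same number of coefficients, which is the sense in which one might as well fit the spline.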