DESeq: residuals instead of read counts
@tuuli-lappalainen-4439
I've been using DESeq for RNASeq data analysis of differential expression of genes and exons. However, I have a dilemma: instead of using read count data, I would like to use residuals of linear regression (after rescaling), but I'm unsure whether this is a valid thing to do.

The reason to use residuals is the correction for variation in insert size between lanes (as e.g. in Montgomery et al., Nature 2010) - or potentially some other confounders as well. I have been planning to rescale the residuals to have all the values >0, and the median across the samples per gene the same as in the original count data, and then to use these data similarly to raw counts. Would this be a proper way to analyze the data? (A sketch of what I mean is in the P.S. below.)

Many thanks in advance,

Tuuli Lappalainen
Department of Genetic Medicine and Development
University of Geneva Medical School
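P.S. For concreteness, the rescaling I have in mind would be roughly the following (a sketch only; 'res' is a genes x samples matrix of regression residuals and 'counts' the matching matrix of raw read counts - both names are placeholders):

  ## shift each gene's residuals so that all values are > 0
  shifted <- res - apply( res, 1, min ) + 1
  ## rescale so the per-gene median matches that of the original counts
  target  <- apply( counts,  1, median )
  current <- apply( shifted, 1, median )
  scaled  <- shifted * ( target / current )
  ## round, so the values could be passed to DESeq in place of raw counts
  pseudo.counts <- round( scaled )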
RNASeq DESeq
Simon Anders
@simon-anders-3855
Dear Tuuli,

No, this would not work well, because it invalidates the shot noise estimation. The proper place to introduce such corrections would be alongside the size factors. Unfortunately, DESeq does not at the moment offer the functionality to enter such extra factors, but it seems I really should add this, as it is actually trivial. The alternative would be to use GLMs. Maybe a few notes about both possibilities.

(a) DESeq models the counts K_ij of gene i in sample j as following a negative-binomial distribution with mean s_j q_ij. Here, q_ij is the expected concentration of transcripts from gene i under the condition (treatment) of sample j, i.e., q_ij is the same for all samples j within a replicate group. But even within a replicate group, the expectation values of the counts K_ij will be different -- primarily because the sequencing depth varies from sample to sample. Hence, we multiply q_ij with the "size factor" s_j, which models the relation between (biological) transcript concentration and (technical) read-out in terms of read counts.

At the moment, this factor s_j only depends on the sample j, but not on the gene i. If your confounder also only depends on j and not on i, you can simply multiply it into the size factor (provided you have estimated the confounder as a multiplicative rather than an additive correction). If it depends on i, too, the math stays the same -- only, there is no slot in the CountDataSet object to store this information. Maybe I should add this, as you are not the first one to ask: other users wanted to incorporate a GC correction (as used in the Pickrell et al. paper), and this would be done the same way.
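In code, the per-sample case could look like this (a sketch; 'countTable', 'conditions', 'confounder.factor' and the condition names are placeholders, with 'confounder.factor' being your multiplicative per-sample correction, one value per lane):

  library( DESeq )
  cds <- newCountDataSet( countTable, conditions )
  cds <- estimateSizeFactors( cds )
  ## fold the multiplicative per-sample correction into the size factors
  sizeFactors( cds ) <- sizeFactors( cds ) * confounder.factor
  cds <- estimateVarianceFunctions( cds )
  res <- nbinomTest( cds, "untreated", "treated" )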
(b) In the last release, DESeq got the functionality to fit GLMs -- this is not yet mentioned in the vignette, but it is documented in the help pages. With this, you can do the per-gene confounder estimation and the main-effect estimation in a unified fashion. In brief, it works as follows:

- When calling 'newCountDataSet', the second argument is no longer a factor of conditions but rather a data frame with the predictors, with one row for each column in the count table and one column for each factor. (In principle, some of the predictors could be numerical rather than factors.)

- When calling 'estimateVarianceFunctions', use the option 'method="pooled"'. This estimates one pooled variance for all samples.

- Instead of 'nbinomTest', use 'nbinomFitGLM' and 'nbinomGLMTest', as follows (assuming that the model frame contains two columns, 'confounder' and 'treatment'):

  # Fit the reduced model (takes a few minutes):
  fit0 <- nbinomFitGLM( cds, count ~ confounder )
  # Fit the full model (takes a few minutes):
  fit1 <- nbinomFitGLM( cds, count ~ confounder + treatment )
  # Calculate the p-values by a simple chi-squared test:
  pvals <- nbinomGLMTest( fit1, fit0 )

This makes sense only if you expect your confounder to change in value for each gene. If it doesn't, it should go into the size factors (or, more generally, into the GLM's offset).

The catch is that the new "pooled" variance estimation only works if the design has "cell replicates", i.e., if the design matrix has replicated rows, and this is often not the case. The authors of the "edgeR" package found an ingenious way around the problem (their newest version uses what they call Cox-Reid dispersion estimation), and I guess I should take over this idea.

Cheers
Simon
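P.S. Putting the pieces together, the whole GLM route might look like the sketch below; 'countTable' and the design values are made up for illustration, and the function names are as above (they may change in later versions). Note that this design has replicated rows, as the pooled variance estimation requires:

  library( DESeq )
  ## a model frame with one row per column of the count table
  design <- data.frame(
     confounder = factor( c( "a", "a", "b", "a", "b", "b" ) ),
     treatment  = factor( c( "ctl", "ctl", "ctl", "trt", "trt", "trt" ) ) )
  cds <- newCountDataSet( countTable, design )
  cds <- estimateSizeFactors( cds )
  ## pooled variance estimation, as needed for the GLM functions
  cds <- estimateVarianceFunctions( cds, method = "pooled" )
  ## reduced and full fits (each takes a few minutes)
  fit0 <- nbinomFitGLM( cds, count ~ confounder )
  fit1 <- nbinomFitGLM( cds, count ~ confounder + treatment )
  ## chi-squared test on the difference in deviance
  pvals <- nbinomGLMTest( fit1, fit0 )
  padj  <- p.adjust( pvals, method = "BH" )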