Hi. I have recently worked with both microarray and RNA-Seq data. For differential expression analysis of microarrays, I used limma with log2-transformed intensities as input. For RNA-Seq, I used DESeq2 with raw counts (derived from salmon) as input.
Why does limma want/require log2-transformed intensities, while DESeq2 wants untransformed counts?
I'm aware that limma uses a different regression model than DESeq2 (linear model vs. negative binomial GLM) because of the different data types (intensities vs. counts), so I'm more interested in why the input data should or shouldn't be transformed before the regression analysis.
I imagine DESeq2 doesn't want log2-transformed values because the negative binomial GLM already handles the heteroscedasticity, and the log link function ensures the model coefficients are log2 fold changes. limma's choice of an ordinary linear regression model means that, to satisfy the homoscedasticity assumption, it needs to reduce the heteroscedasticity of the intensity data with a log transform. If this is true, why doesn't limma avoid requiring log2-transformed input and simply use an appropriate GLM with a log link function? Or alternatively, why doesn't DESeq2 avoid the extra computation of fitting a GLM and simply log2-transform the counts, which are then used as input to a linear regression?
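The heteroscedasticity point is easy to check by simulation. Below is a minimal numpy sketch (with an arbitrary shared dispersion chosen for illustration, not DESeq2's actual estimation procedure): the raw-scale standard deviation of negative binomial counts grows strongly with the mean, while on the log2 scale it is roughly stabilized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate RNA-Seq-like counts for two genes with very different means,
# using a negative binomial with a shared dispersion (an assumption for
# illustration only). NB variance = mu + dispersion * mu^2.
dispersion = 0.1
for mu in (10, 1000):
    n = 1.0 / dispersion                 # NB "size" parameter
    p = n / (n + mu)
    counts = rng.negative_binomial(n, p, size=100_000)
    logged = np.log2(counts + 1)         # pseudocount to avoid log2(0)
    print(f"mu={mu}: sd(counts)={counts.std():.1f}, "
          f"sd(log2 counts)={logged.std():.2f}")
```

The raw-scale standard deviations differ by well over an order of magnitude between the two genes, while the log2-scale standard deviations are of comparable size, which is the kind of variance stabilization an ordinary linear model wants.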
I ask because I'm pretty sure these approaches give different results: using untransformed values in a regression with a log link should give different estimates than using log2-transformed values in a regression with the identity link. Does one of these data-transformation approaches have more justification than the other?
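That the two approaches disagree can be shown concretely. The sketch below (a hypothetical two-group simulation with an arbitrary dispersion, not either package's actual pipeline) uses the fact that for a saturated two-group Poisson/NB log-link GLM, the fitted group means are the sample means, so the coefficient reduces to a log2 ratio of means; OLS on log2-transformed counts instead estimates a difference of mean log2 values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-group comparison with a 4-fold true change (means 2 and
# 8) and an arbitrary NB dispersion of 0.1 -- illustrative only.
size = 1.0 / 0.1  # NB "size" parameter for dispersion 0.1
group_a = rng.negative_binomial(size, size / (size + 2.0), 5000)
group_b = rng.negative_binomial(size, size / (size + 8.0), 5000)

# Saturated two-group log-link GLM: coefficient = log2 ratio of group means.
lfc_glm = np.log2(group_b.mean() / group_a.mean())

# OLS on log2(count + 1) (pseudocount to avoid log2(0)): coefficient =
# difference of the mean log2 values.
lfc_ols = np.log2(group_b + 1).mean() - np.log2(group_a + 1).mean()

print(f"log-link GLM LFC:     {lfc_glm:.2f}")
print(f"OLS on log2 data LFC: {lfc_ols:.2f}")
# The estimates disagree because E[log X] != log E[X] (Jensen's inequality)
# and because the pseudocount compresses low counts, so the transformation
# is a modelling decision, not a cosmetic one.
```

At these low counts the GLM estimate sits near the true 4-fold change while the log2-then-OLS estimate is visibly shrunken toward zero, which is one reason the choice of where the log enters the model matters.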
Thanks for pointing me toward the limma-voom paper; that answered my question.