I have a question about DESeq2 data normalization. I know that DESeq2 requires raw read counts and that the software normalizes by sequencing depth.
But what if I want to account for technical variation? Normally I would quantile-normalize the data, but I understand that DESeq2 does not support quantile-normalized data, so how can I correct for this kind of variability?
Thanks in advance,
Marco
Thanks. So would you recommend performing variance stabilization before looking for DE genes (or differentially accessible ATAC-seq regions) if I am concerned about technical variation?
You don't need to apply the VST before DE. In fact, you cannot supply transformed data to DESeq2, which instead uses the original counts and models the heteroskedasticity via the NB GLM.
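To make the point about raw counts concrete, here is a minimal Python sketch of the two ideas involved: median-of-ratios size factors (the sequencing-depth normalization) and the negative binomial variance function, Var = mu + alpha * mu^2, that lets the model handle the mean-variance trend without transforming the data. This only illustrates the arithmetic on toy numbers. It is not DESeq2 code (DESeq2 is an R package), and the names `size_factors`, `nb_variance` and `toy_counts` are made up for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples would be zero.
    """
    n_samples = len(counts[0])
    kept = [row for row in counts if all(c > 0 for c in row)]
    # Log of the per-gene geometric mean across samples
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in kept]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count in sample j to its geometric mean
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(kept, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def nb_variance(mu, alpha):
    """Negative binomial variance for mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2


# Toy data (assumption): sample 2 was sequenced exactly twice as deep
toy_counts = [[10, 20], [100, 200], [50, 100], [0, 3]]
toy_factors = size_factors(toy_counts)
print(toy_factors)
print(nb_variance(100, 0.1))
```

In this toy case the second size factor is twice the first, so dividing the counts by the size factors puts both samples on a common scale, while the raw counts themselves are what the model is fit to.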
Let me check in later with more links to software for technical variation that can be useful here and that integrates with DESeq2 (RUV, cqn, sva, etc.).
So, as I see it, there are two categories of methods for modeling extra technical variation: one based on covariates, e.g. gene GC content and gene length (such as cqn), and the other based on factor analysis (such as RUV and sva).
We have examples of incorporating these in the vignette and workflow.
The covariate-based methods are useful if you have biased counts related to per-sample fluctuations in PCR or RNA degradation. If you use Salmon with --gcBias (and --posBias for positional bias), and then tximport, you don't need the covariate-based methods to deal with that type of technical variation, as Salmon has already corrected for these biases during its estimation steps and the correction is passed along to DESeq2 via tximport. You can assess GC bias and positional bias with MultiQC (using the FastQC modules, and soon Salmon modules as well).
The factor analysis methods are useful for removing additional technical variation regardless of the source, but if the bias is partially confounded with the biological covariates, it's possible to remove some signal. This doesn't happen with Salmon or the covariate-based methods, because they work on a per-sample basis and only remove variation that can be explained by gene, transcript or cDNA fragment features.
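As a rough illustration of the confounding concern, one could check how strongly an estimated technical factor tracks the biological condition before adding it to the design. The Python sketch below uses a plain Pearson correlation on made-up numbers. This check is an assumption made for illustration, not something sva or RUV does for you, and the names `pearson`, `condition`, `factor_confounded` and `factor_balanced` are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Made-up example: six samples, three per condition (0 = control, 1 = treated)
condition = [0, 0, 0, 1, 1, 1]

# A factor that almost mirrors the condition: adjusting for it
# would also remove much of the biological signal.
factor_confounded = [-1.1, -0.9, -1.0, 1.0, 0.9, 1.1]

# A factor spread evenly across both conditions: safer to adjust for.
factor_balanced = [-1.0, 1.0, 0.0, -1.0, 1.0, 0.0]

r_confounded = pearson(condition, factor_confounded)
r_balanced = pearson(condition, factor_balanced)
print(r_confounded, r_balanced)
```

A correlation close to 1 or -1 suggests the factor and the condition are hard to separate, which is the situation where a factor analysis method may remove real signal.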
No, I recommend checking the normalized (and perhaps VST'd) data, and unless you find a good reason to think quantile normalization is needed, don't worry about it. As Michael says below, if you feel you have inter-sample technical variation, you can look into SVA, RUVSeq and possibly other approaches to creating covariates that can be used within DESeq2 to account for inter-sample variation.
Thank you both for the replies, I'll check what you suggested and let you know if I have more questions.