I imported a set of salmon quantifications into R with tximport using default settings, and used the code from the tximport manual page exactly as given to prepare the data for edgeR. The result is a DGEList containing the offsets for the downstream DGE analysis.
Issue: The DGEList (y$samples) does not contain usable normalization factors (norm.factors are all 1) for obtaining TMM-normalized counts via cpm(y, log=F).
Therefore, the question is how to feed normalization factors into y$samples$norm.factors while still using the information from tximport.
One can of course run calcNormFactors(y) manually, but then the length offsets from tximport are not taken into account. Is there a recommended approach?
Maybe add a short comment to the tximport vignette referencing the suggestions from Aaron below. Using the corrected counts for things like clustering is standard, so I was actually surprised that no one had asked this before (to the best of my knowledge; maybe I missed the respective threads).
Have a look at csaw::calculateCPM(), which does exactly what you request (see usage here). You'll need to convert it back into a SummarizedExperiment, though, since the function doesn't take DGEList objects. Alternatively, you can use csaw::normFactors() instead of calcNormFactors() to keep everything in SummarizedExperiment form. (Note the difference in the weighted default, though, as this was built for ChIP-seq data.)
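For readers who want to see the whole path in one place, here is a minimal sketch of that suggestion. It assumes the offset calculation follows the pattern shown in the tximport vignette for edgeR; `cts` and `lens` are simulated stand-ins for `txi$counts` and `txi$length`, and all other variable names are made up for illustration.

```r
library(edgeR)
library(csaw)
library(SummarizedExperiment)

# Simulated stand-ins (assumption) for txi$counts and txi$length from tximport.
set.seed(1)
n_genes <- 200
n_samples <- 4
dn <- list(paste0("g", seq_len(n_genes)), paste0("s", seq_len(n_samples)))
cts <- matrix(rnbinom(n_genes * n_samples, mu = 100, size = 5),
              n_genes, n_samples, dimnames = dn)
lens <- matrix(runif(n_genes * n_samples, 500, 3000),
               n_genes, n_samples, dimnames = dn)

# Scale the lengths by their per-gene geometric mean so that the
# correction does not change the overall magnitude of the counts.
norm_mat <- lens / exp(rowMeans(log(lens)))
norm_cts <- cts / norm_mat

# Effective library sizes: TMM factors computed on the length-corrected
# counts, multiplied by the corrected column totals.
eff_lib <- calcNormFactors(norm_cts) * colSums(norm_cts)

# Combine length and library-size effects into one log-scale offset matrix.
norm_mat <- log(sweep(norm_mat, 2, eff_lib, "*"))

y <- DGEList(cts)
y <- scaleOffset(y, norm_mat)

# calculateCPM() works on a SummarizedExperiment, so wrap counts and offsets.
se <- SummarizedExperiment(assays = list(counts = y$counts, offset = y$offset))
se$totals <- y$samples$lib.size

# With use.offsets = TRUE the offsets are used instead of norm.factors.
cpms <- calculateCPM(se, use.offsets = TRUE, log = FALSE)
```

The resulting `cpms` matrix can then be passed to clustering or MA plots. Setting `log = TRUE` (optionally with `prior.count`) should give log-CPMs instead.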
I'll add this to the tximport vignette if Aaron gives the ok. I'm less knowledgeable about the internals, so I want to make sure I don't promulgate something inaccurate.
Looks fine to me. You needn't set use.norm.factors if you have use.offsets=TRUE; the latter overrides the former. The only other comment is to avoid T and F, but I know that Mike would never put those in a vignette anyway.
Mike, if you open a PR on the vignette, I can put in some comments to explain what and why, especially around the offset calculation part. Otherwise I'll have to re-remember everything the next time this pops up.
If you have an offsets matrix in your DGEList then you won't use the norm.factors anyway, so it wouldn't matter if you did something with them or not. Put a different way, the offsets are supposed to be better than simple normalization factors, and are preferentially used by glmFit.
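A small sketch of this precedence, assuming that edgeR's `getOffset()` returns the stored offset matrix whenever one is present and only falls back to library sizes and norm.factors otherwise. The counts, the offset matrix, and the norm.factors values below are arbitrary illustrative numbers.

```r
library(edgeR)

set.seed(2)
cts <- matrix(rnbinom(40, mu = 50, size = 5), nrow = 10, ncol = 4)
y <- DGEList(cts)

# Hypothetical gene-by-sample offset matrix, e.g. from length corrections.
off <- matrix(log(runif(40, 0.5, 2)), nrow = 10, ncol = 4)
y <- scaleOffset(y, off)

before <- getOffset(y)

# Change the normalization factors to arbitrary values.
y$samples$norm.factors <- c(0.8, 1.1, 0.9, 1.25)

after <- getOffset(y)

# If the offset matrix takes precedence, the two should be identical.
unchanged <- identical(before, after)
```

If `unchanged` comes out as expected, whatever is stored in norm.factors has no influence on the offsets that the GLM fitting functions pick up.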
This I understand, but what about obtaining normalized counts for non-GLM applications like clustering, or checking normalization efficiency with MA plots? Can you use the offsets for those, too?
This is beyond my knowledge of edgeR. I checked that chunk of tximport vignette code with Aaron at some point, to make sure we were doing it properly.
Oops. Spotted something. Will open an issue.
Edit: actually, ignore that, it's fine - phew.