Hi there
I have a question about which salmon-generated gene counts are appropriate for eQTL analysis, given that they should take gene length into account. I would much appreciate your suggestions on this.
I used tximport to obtain gene-level counts/abundance estimates for our RNA-seq data (estimated using the default countsFromAbundance="no"). The downstream steps involve TMM normalisation (to obtain normalised counts), rank transformation and batch correction, before running the eQTL analysis.
With this approach, I am concerned that the counts do not take into account the length of genes. Your document here:
https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html
section: Use with downstream Bioconductor differential expression packages
outlines two ways in which one can get the counts ready for DE analysis, by taking into account gene/transcript length. I have a couple of questions:
Do the two methods work in a similar way, and can either be applied to edgeR/DESeq2/limma-voom?
If yes, would counts generated with tximport's countsFromAbundance = "lengthScaledTPM" be the appropriate ones to use for my eQTL pipeline?
If no, would the block of code under edgeR be appropriate to apply to my eQTL processing pipeline too? I use the output of calcNormFactors to modify my counts. I don't know if/where I should use the offset, or if running calcNormFactors(cts/normMat) would do what I need.
Thanks for your help!
Thanks Michael. I still have a few issues ...
1. If I re-import with DESeqDataSetFromTximport, can I supply TMM normalisation factors to your code snippet? (I know they are similar but I guess not identical (?), and I want to stay consistent with previous analyses I have run.) I don't know what (if any) between-sample normalisation counts(.., normalized=TRUE) performs.
2. If I want to estimate the offset, and apply it to modify my counts (with TMM normalisation factors and gene length), would this code estimate offset by taking into account TMM normalisation factors and gene length?
Sure, in general it's not that hard to use TMM here. To calculate something like a DESeq2 "size factor", you just multiply edgeR's output of calcNormFactors by the column sum, and then scale it to be centered around 1:
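The snippet for this step is not shown in the thread as it appears here, so the following is only a minimal sketch of the calculation described. The toy matrix `x` and the variable names `nf` and `sf` are made up for illustration, and centring by the geometric mean is an assumption (the text only says "centered around 1"):

```r
library(edgeR)

# Toy count matrix (genes x samples); the values are made up for illustration.
set.seed(1)
x <- matrix(rpois(800, lambda = 50), nrow = 200,
            dimnames = list(paste0("gene", 1:200), paste0("s", 1:4)))

# TMM normalisation factors from edgeR (one per sample).
nf <- calcNormFactors(x)

# Multiply by the column sums to get something like an effective library size.
sf <- nf * colSums(x)

# Centre around 1; dividing by the geometric mean is an assumed choice here.
sf <- sf / exp(mean(log(sf)))
```

After this, `sf` holds one positive value per sample, and the geometric mean of those values is 1.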
Now you want normalized counts with TMM normalization and gene length divided out. So you want to obtain the size factor after dividing the gene length matrix out, i.e. when x = cts/normMat.
Finally, rather than stuff these pieces into a DESeqDataSet, you can skip that and just divide the size factor out of the gene-length-normalized matrix:
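Again, the original snippet is not shown here, so this is a hedged sketch of the last two steps put together rather than the exact code from the reply. `cts` and `normMat` are toy stand-ins for `txi$counts` and `txi$length`; scaling `normMat` by its row-wise geometric mean follows the edgeR block of the tximport vignette and is assumed to be what `normMat` refers to in this thread; `normCts` is a name chosen for illustration:

```r
library(edgeR)

# Toy stand-ins for txi$counts and txi$length (genes x samples); values are made up.
set.seed(1)
dn <- list(paste0("gene", 1:200), paste0("s", 1:4))
cts <- matrix(rpois(800, lambda = 100), nrow = 200, dimnames = dn)
normMat <- matrix(runif(800, min = 500, max = 3000), nrow = 200, dimnames = dn)

# Assumption: as in the vignette's edgeR block, scale each gene's lengths by their
# geometric mean so that dividing them out does not change the magnitude of the counts.
normMat <- normMat / exp(rowMeans(log(normMat)))

# Gene-length-normalised matrix: this is the x from which the size factor is computed.
x <- cts / normMat

# TMM-based size factor computed on x, centred around 1 (geometric mean, assumed).
sf <- calcNormFactors(x) * colSums(x)
sf <- sf / exp(mean(log(sf)))

# Divide each sample's column by its size factor.
normCts <- sweep(x, 2, sf, "/")
```

`normCts` has the same dimensions as `cts` and could then go into the rank transformation and batch correction steps mentioned in the question.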
Michael, that is super helpful. Thank you!!