The current tximport vignette recommends the following workflow for limma-voom input:
files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")
names(files) <- paste0("sample", 1:6)
txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv,
countsFromAbundance = "lengthScaledTPM")
library(limma)
y <- DGEList(txi$counts)
y <- calcNormFactors(y)
design <- model.matrix(~condition, data = sampleTable)
v <- voom(y, design)
I'm a bit confused with the logic here. I have read the tximport source code, and noticed the "lengthScaledTPM" mode pretty much does what it is--giving out the length-scaled TPM (or FPKM in some cases) as if it was read counts. I understand TMM normalization (or DESeq2's normalization for that matter) is designed for raw reads, rather than length-normalized metrics such as TPM. So what's the point in running calcNormFactors?
FYI, the precision weights from TMM are calculated from a binomial distribution, and won't make much sense for non-count data. It probably doesn't do much harm, either, but it can be turned off with
doWeighting=FALSE
.Well, by triggering the argument
countsFromAbundance
, thecounts
slot of the tximport output is derived using the abundance measure named in the command (by default, TPM for the fishes and FPKM for RSEM). I'm citing the code in question below:So, the TPM matrix the columns sum to 1e6. The matrices tximport returns in the "counts" slot have column sum equal to the number of fragments aligned to transcripts by the quantifiers. That's what Ryan means by "you have a count matrix, not a TPM matrix".
You can think of the countsFromAbundance counts matrix as "regressing out" differences in counts due to changes in gene length across samples (from DTU), instead of using an offset to account for this.