Question

Is there any point to TMM-normalize TPM prior to limma-voom?

0

Entering edit mode

johnmcma ▴ 10

@johnmcma-12132

Last seen 2.9 years ago

United States

The current tximport vignette recommends the following workflow for limma-voom input:

files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")

names(files) <- paste0("sample", 1:6) txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv, countsFromAbundance = "lengthScaledTPM") library(limma) y <- DGEList(txi$counts) y <- calcNormFactors(y) design <- model.matrix(~condition, data = sampleTable) v <- voom(y, design)

I'm a bit confused with the logic here. I have read the tximport source code, and noticed the "lengthScaledTPM" mode pretty much does what it is--giving out the length-scaled TPM (or FPKM in some cases) as if it was read counts. I understand TMM normalization (or DESeq2's normalization for that matter) is designed for raw reads, rather than length-normalized metrics such as TPM. So what's the point in running calcNormFactors?

voom limma-voom tximport edger • 5.5k views

ADD COMMENT • link updated 8.1 years ago by Ryan C. Thompson ★ 7.9k • written 8.1 years ago by johnmcma ▴ 10

score 5 · Answer 1 · 2017-03-28

5

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

tximport should always giving you a matrix of values on a raw count scale. The only difference is whether it takes the estimated counts reported directly by Kallisto or whether it starts from the TPM values and transforms them back to counts. So either way you have a count matrix, not a TPM matrix. Regardless, I see no reason that either TMM or DESeq's normalization shouldn't work on TPM values. It's still preferable to run it on the counts, but my guess is that the difference would be minimal in practice.

By the way, neither normalization methods is terrible complicated, so I encourage you to read the papers and understand the methods so you will be able to decide for yourself when they are applicable.

ADD COMMENT • link 8.1 years ago Ryan C. Thompson ★ 7.9k

1

Entering edit mode

FYI, the precision weights from TMM are calculated from a binomial distribution, and won't make much sense for non-count data. It probably doesn't do much harm, either, but it can be turned off with doWeighting=FALSE.

ADD REPLY • link 8.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Well, by triggering the argument countsFromAbundance, the counts slot of the tximport output is derived using the abundance measure named in the command (by default, TPM for the fishes and FPKM for RSEM). I'm citing the code in question below:

makeCountsFromAbundance <- function(countsMat, abundanceMat, lengthMat, countsFromAbundance) {
  countsSum <- colSums(countsMat)
  if (countsFromAbundance == "lengthScaledTPM") {
    newCounts <- abundanceMat * rowMeans(lengthMat)
  } else if (countsFromAbundance == "scaledTPM") {
    newCounts <- abundanceMat
  } else {
    stop("expecting 'lengthScaledTPM' or 'scaledTPM'")
  }
  newSum <- colSums(newCounts)
  countsMat <- t(t(newCounts) * (countsSum/newSum))
  countsMat
}

ADD REPLY • link 8.1 years ago johnmcma ▴ 10

0

Entering edit mode

So, the TPM matrix the columns sum to 1e6. The matrices tximport returns in the "counts" slot have column sum equal to the number of fragments aligned to transcripts by the quantifiers. That's what Ryan means by "you have a count matrix, not a TPM matrix".

You can think of the countsFromAbundance counts matrix as "regressing out" differences in counts due to changes in gene length across samples (from DTU), instead of using an offset to account for this.

ADD REPLY • link 8.1 years ago Michael Love 43k