Question

dtuScaledTPM vs lengthScaledTPM in DTU analysis

3

fiona.dick91 ▴ 60

@fionadick91-16521

Norway

Hi,

I have a question concerning the "new" scaling method offered by tximport called dtuScaledTPM and how it affects DTU analysis. So far I have used scaledTPM, which as I understand it (correct me if I'm wrong) scales the TPM values to library size by multiplying the TPM of a transcript in a sample by the column sum of the count matrix, and thus brings them back onto the count scale? dtuScaledTPM additionally includes the transcript length in the library size information (dividing the count-based library size by the library size calculated from TPM*transcript length). And the transcript length is the median of the transcript lengths of all transcripts in a gene (where the transcript length itself is the average across all samples). Is this correct? And if so, why is this beneficial for DTU analysis? I understand that using lengthScaledTPM is not advantageous, but I can't wrap my head around why this method is better "just" because the transcript length value is the median instead of the mean?

Sorry if this is a rather confusing question. I'm seeking to understand why this method should be used for DTU. It would be great to get a worst-case toy example where lengthScaledTPM would not work, and also one where the lack of length scaling in scaledTPM would not work well.

Thanks in advance

Fiona

tximport dtu salmon TPM rnaseqDTU • 4.1k views

ADD COMMENT • link written 6.0 years ago by fiona.dick91 ▴ 60

score 5 · Answer 1 · 2019-04-03

Michael Love 43k

@mikelove

United States

I understand that this is tricky. It took us a while to work this out, and then a few iterations to explain it in the paper as well. OK, so if you understand the problem with lengthScaledTPM and why we can instead use scaledTPM, then we're already quite far along. (For others reading this thread, see the rnaseqDTU workflow or paper, and the section that starts with "For DTU analysis, we suggest generating counts from abundance, using the scaledTPM method...".)

The idea behind dtuScaledTPM is that the scaledTPM values do not reflect the fact that, in general, long genes will have higher counts, and so higher precision. We may want to get a bit closer to the original count scale for each transcript, while still avoiding the problem with original counts and lengthScaledTPM. So we can, within each gene, multiply all the TPM estimates by a single value: the median, over the transcripts of that gene, of the mean effective transcript length over samples. Since all TPMs for all samples within a gene are scaled by the same value, we avoid the issue discussed in the workflow that occurs with original counts or lengthScaledTPM.
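Since a toy example was requested, here is a minimal Python sketch of the arithmetic as described in this thread. It is not the tximport code: the function `counts_from_abundance`, the helper `g1_summary`, and all numbers are made up for illustration, and the lengths are held constant across samples to keep things simple. Gene g1 has a short isoform (t1) and a long one (t2) whose usage flips between the two samples while the gene's total TPM stays at 100.

```python
from statistics import median

def counts_from_abundance(counts, tpm, lengths, tx2gene, method):
    """Toy version of the three scalings (hypothetical helper, not tximport).

    counts, tpm, lengths: dict transcript -> list of per-sample values.
    tx2gene: dict transcript -> gene.
    """
    txs = list(tpm)
    n = len(next(iter(tpm.values())))
    # Mean (effective) length of each transcript over samples.
    mean_len = {t: sum(lengths[t]) / n for t in txs}
    if method == "scaledTPM":
        factor = {t: 1.0 for t in txs}          # no length scaling
    elif method == "lengthScaledTPM":
        factor = mean_len                       # one factor per transcript
    elif method == "dtuScaledTPM":
        by_gene = {}
        for t in txs:
            by_gene.setdefault(tx2gene[t], []).append(mean_len[t])
        # One factor per gene: median of the isoforms' mean lengths.
        factor = {t: median(by_gene[tx2gene[t]]) for t in txs}
    else:
        raise ValueError(method)
    out = {t: [] for t in txs}
    for j in range(n):
        new = {t: tpm[t][j] * factor[t] for t in txs}
        # Rescale so each sample keeps its original library size.
        scale = sum(counts[t][j] for t in txs) / sum(new.values())
        for t in txs:
            out[t].append(new[t] * scale)
    return out

# Made-up data: original counts are proportional to TPM * length.
tx2gene = {"t1": "g1", "t2": "g1", "t3": "g2"}
lengths = {"t1": [1000, 1000], "t2": [4000, 4000], "t3": [2000, 2000]}
tpm = {"t1": [80, 20], "t2": [20, 80], "t3": [900, 900]}
counts = {"t1": [800, 200], "t2": [800, 3200], "t3": [18000, 18000]}

def g1_summary(res, j):
    """Return (t1 proportion within g1, g1 share of the library) for sample j."""
    total = res["t1"][j] + res["t2"][j]
    library = total + res["t3"][j]
    return res["t1"][j] / total, total / library

for method in ("scaledTPM", "lengthScaledTPM", "dtuScaledTPM"):
    res = counts_from_abundance(counts, tpm, lengths, tx2gene, method)
    print(method, [tuple(round(x, 3) for x in g1_summary(res, j)) for j in range(2)])
```

In this toy data, lengthScaledTPM roughly doubles g1's share of the library between the two samples (about 0.08 versus 0.16) although the gene's total TPM is unchanged, and the within-gene proportions no longer match the TPM proportions. scaledTPM and dtuScaledTPM both keep the 0.8/0.2 proportions and a constant gene share. They differ only in the per-gene factor, which puts the dtuScaledTPM counts for g1 on a scale that reflects the typical isoform length of that gene.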

ADD COMMENT • link 6.0 years ago Michael Love 43k

2

Thanks for taking the time to answer. So the main difference from lengthScaledTPM is that you take the average over all samples:

"Since all TPMs for all samples within a gene are scaled by the same value"

and so you do not unintentionally scale down the whole gene's expression in one sample (group A) and not the other (group B). Hence you do not divide by a different total gene expression amount when calculating the proportions (as opposed to what happens when using lengthScaledTPM, which is wanted for DGE).

And the main argument for including transcript length scaling (in DTU) basically boils down to not confounding the low dispersion (or high precision) of long transcripts "just" because they generally have higher counts (due to their length).

Is that right?

ADD REPLY • link 6.0 years ago fiona.dick91 ▴ 60

2

With dtuScaledTPM, compared to scaledTPM, we want long transcripts to tend to have counts in a range similar to their original counts. Actually, the precision has more to do with the Poisson component than the dispersion component (if we are considering a Gamma-Poisson for modeling the data).
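To make the Poisson versus dispersion point concrete, here is the variance under the usual Gamma-Poisson (negative binomial) parameterization. The symbols are my own notation, not from the thread: $K$ is a count, $\mu$ its mean, and $\alpha$ the dispersion.

```latex
\operatorname{Var}(K) = \mu + \alpha \mu^2,
\qquad
\mathrm{CV}^2 = \frac{\operatorname{Var}(K)}{\mu^2} = \frac{1}{\mu} + \alpha
```

The $1/\mu$ term is the Poisson component. It shrinks as the count grows, so a transcript with larger counts has a smaller relative error through this term even when $\alpha$ is unchanged.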

ADD REPLY • link 6.0 years ago Michael Love 43k

2

What would you recommend using for downstream linear modeling types of analyses (e.g., network or transcript-QTL)? It still seems like dtuScaledTPM is the best choice. Also, would you recommend a variance stabilizing transformation on dtuScaledTPM? Thanks!

ADD REPLY • link 5.0 years ago mgandal ▴ 20

2

If you are doing something marginally per transcript, you could use any countsFromAbundance method. This thread is about DTU modeling, where the outcome is a vector (over isoform counts). Yes, I’d recommend VST for eQTL discovery; I’ve benchmarked this in a collaboration and it works well.

ADD REPLY • link 5.0 years ago Michael Love 43k