Hi,
I have a question concerning the "new" scaling method offered by tximport, called dtuScaledTPM, and how it affects DTU analysis.
So far I have used scaledTPM, which, as I understand it (correct me if I'm wrong), scales the TPM values to library size by multiplying each transcript's TPM in a sample by the column sum of the count matrix (relative to the TPM column sum), and thus brings them back onto the count scale?
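To check my understanding, here is roughly what I picture happening (a toy Python sketch with invented numbers, not the tximport code):

```python
# Toy sketch of countsFromAbundance="scaledTPM" as I understand it
# (illustrative only, not the tximport implementation).
import numpy as np

# 3 transcripts x 2 samples
tpm = np.array([[2e5, 1e5],
                [3e5, 4e5],
                [5e5, 5e5]])        # columns sum to 1e6, by definition of TPM
counts = np.array([[ 80.,  30.],
                   [120., 160.],
                   [200., 210.]])   # estimated counts from the quantifier

lib_size = counts.sum(axis=0)                   # per-sample library size
scaled_tpm = tpm * lib_size / tpm.sum(axis=0)   # rescale each column

# Column sums now match the original library sizes (count scale),
# while within-sample proportions are exactly the TPM proportions.
print(scaled_tpm.sum(axis=0))                   # -> [400. 400.]
```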
The dtuScaledTPM method additionally brings the transcript length into the library size information (dividing the count-based library size by a library size calculated from TPM × transcript length). And the transcript length used is the median of the transcript lengths of all transcripts in a gene (where each transcript's length itself is the average across all samples). Is this correct? And if so, why is this beneficial for DTU analysis? I understand that using lengthScaledTPM is not advantageous, but I can't wrap my head around why this method is better "just" because the transcript length value is a median instead of a mean.
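For the length factor, this is how I picture the median-over-isoforms step (again a toy Python sketch with made-up lengths, not the actual implementation):

```python
# Toy sketch of the length factor I believe dtuScaledTPM uses: average each
# transcript's effective length over samples, then take the median over the
# isoforms of its gene, so every isoform of a gene gets the SAME length.
# (Illustrative only, not the tximport code.)
import numpy as np

length = np.array([[1000., 1040.],    # txA1 \ gene A
                   [ 300.,  320.],    # txA2 /
                   [2000., 2000.]])   # txB1   gene B
tx2gene = ["A", "A", "B"]

mean_len = length.mean(axis=1)        # average over samples: [1020, 310, 2000]
gene_median = {g: np.median(mean_len[[i for i, gg in enumerate(tx2gene) if gg == g]])
               for g in set(tx2gene)}
dtu_len = np.array([gene_median[g] for g in tx2gene])

print(dtu_len)   # -> [665. 665. 2000.]  both isoforms of gene A share one length
```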
Sorry if this is a rather confusing question. I'm seeking to understand why this method should be used for DTU. It would be great to get a worst-case toy example where lengthScaledTPM would not work, and also one where the lack of length scaling in scaledTPM would not work well.
Thanks in advance
Fiona
Thanks for taking the time to answer. So the main difference to lengthScaledTPM is that you take the average over all samples (and the median over a gene's isoforms), and so you do not inadvertently scale down the whole gene's expression in one sample (group A) but not in the other (group B). Hence you do not divide by a different total gene expression amount when calculating the proportions (as opposed to what happens with lengthScaledTPM, which is what you want for DGE). And the main argument for including transcript length scaling at all (in DTU) basically boils down to not confounding the low dispersion (or high precision) of long transcripts "just" because they generally have higher counts (due to their length).
Is that right?
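To put numbers on the denominator point, here is a toy sketch (illustrative Python, invented values) of an isoform switch between a long and a short isoform of one gene:

```python
# One gene, a long and a short isoform, with usage switching between two
# samples. Compare per-isoform length scaling (lengthScaledTPM-style) with a
# single shared per-gene length (dtuScaledTPM-style). Toy numbers, not tximport.
import numpy as np

length = np.array([2000., 200.])       # per-isoform effective lengths
tpm = np.array([[0.9, 0.1],            # long isoform: 90% -> 10% of gene TPM
                [0.1, 0.9]])           # short isoform: 10% -> 90%

# lengthScaledTPM-style: each isoform scaled by its OWN length
ls = tpm * length[:, None]
ls_prop = ls / ls.sum(axis=0)
print(ls_prop[0])          # long-isoform proportion: ~[0.989, 0.526]

# dtuScaledTPM-style: one shared length per gene (median over its isoforms)
shared = np.median(length)             # 1100 for both isoforms
dtu = tpm * shared
dtu_prop = dtu / dtu.sum(axis=0)
print(dtu_prop[0])         # -> [0.9, 0.1], exactly the TPM proportions

# With per-isoform lengths, the gene TOTAL also differs between the samples
# (1820 vs 380 here) even though total gene TPM is identical, so the two
# groups end up with a different denominator (and apparent precision).
print(ls.sum(axis=0))      # -> [1820. 380.]
```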
With dtuScaledTPM, and comparing to scaledTPM, we want long transcripts to have counts in a range similar to their original counts. Actually, the precision has more to do with the Poisson component than with the dispersion component (if we are considering a Gamma-Poisson model for the data).

What would you recommend for downstream linear modeling (e.g., network or transcript-QTL) types of analyses? It still seems like dtuScaledTPM is the best choice. Also, would you recommend a variance stabilizing transform on dtuScaledTPM? Thanks!

If you are doing something marginally per transcript, you could use any countsFromAbundance method. This thread is about DTU modeling, where the outcome is a vector (over isoform counts). Yes, I'd recommend VST for eQTL discovery; I've benchmarked this in a collaboration and it works well.
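The point about precision living in the count scale can be seen with a quick back-of-envelope (an editorial sketch with invented numbers, not from the thread): for a fixed true isoform proportion, the uncertainty of the estimated proportion shrinks with the gene-level count, which is what keeping counts near their original range preserves for long transcripts.

```python
# Binomial approximation to the standard error of an isoform-proportion
# estimate: SE = sqrt(p * (1 - p) / n), where n is the gene-level count.
# Toy numbers only; shows that shrinking counts inflates the uncertainty.
import math

p = 0.5                                   # true isoform proportion
se = {n: math.sqrt(p * (1 - p) / n) for n in (2000, 200)}
print(se[2000], se[200])                  # ~0.011 vs ~0.035
```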