Hi everyone,
The DESeq2 documentation states that the input needs to be un-normalized counts (= raw counts?), while tximport suggests, for salmon data, applying countsFromAbundance="lengthScaledTPM" and using the result as a regular count matrix.
- To my understanding, tximport implies that its length-scaled counts can be treated like un-normalized counts in DESeq2. But why? Is it because these bias-corrected counts from tximport are different from normalized or transformed counts?
- Are there any reasons to prefer importing raw read counts (with offset) over countsFromAbundance="lengthScaledTPM", or can the two be used interchangeably?
- As the outcome of countsFromAbundance="lengthScaledTPM" is corrected to a similar scale as TPMs: could I use the data likewise, e.g., to compare counts between different genes within the same sample? And after transformation via rlog/vst, would the values be on a scale that could be compared between genes AND between samples? Basically, would TPM no longer be required?
I'm grateful for any clarification.
Thank you for the response (and to the other responders as well) and apologies for my late reply. The reference to Soneson is especially helpful. I think I was simply confused by isoform-length scaled vs. gene-length normalized. If I understand correctly, the scaling to counts really refers to an actual upscaling of the TPM values to the level of counts via library size (scaledTPM), and to the level of counts while also considering different isoform lengths (lengthScaledTPM).
However, does 'scaledTPM' at least inherit the normalisation by gene length from TPM, and is this its advantage over the original raw counts? Does 'lengthScaledTPM' then add another layer to data already normalized by gene length, by considering different isoform lengths between samples? Otherwise, what would be the advantage of scaledTPM over the original raw counts (without offset)?
I'm grateful for further clarification.
Kind regards, Christian
The advantages and disadvantages depend on the aim, i.e. what you plan to do with the values (testing, plots, etc.).
By the way, lengthScaledTPM puts the gene length back in. It just does so in a way that differential transcript usage (DTU) won't bias a differential gene expression (DGE) analysis.
scaledTPM has the advantage over raw counts that the latter are subject to this bias (DTU can produce spurious, "apparent" DGE).
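To make the difference concrete, here is a minimal sketch in base R on invented toy numbers. It assumes that scaledTPM is the TPM rescaled to each sample's library size, and that lengthScaledTPM first multiplies the TPM by the gene's average length across samples and then rescales, which is how I read the tximport documentation. The helper `rescale_to_libsize` and all values are hypothetical, not part of any package.

```r
# Toy gene-level inputs (all values invented): rows = genes, columns = samples.
# geneA switches to a longer isoform in s2, so its effective length doubles,
# while its TPM (and that of geneB) stays the same.
dn  <- list(c("geneA", "geneB"), c("s1", "s2"))
tpm <- matrix(5e5, nrow = 2, ncol = 2, dimnames = dn)
len <- matrix(c(1000, 1000, 2000, 1000), nrow = 2, dimnames = dn)
raw <- matrix(c(1000, 1000, 2000, 1000), nrow = 2, dimnames = dn)

libsize <- colSums(raw)

# Hypothetical helper: rescale each column of x to the original library size.
rescale_to_libsize <- function(x, libsize) {
  sweep(x, 2, libsize / colSums(x), `*`)
}

# Assumed definitions of the two scalings (see lead-in).
scaled_tpm        <- rescale_to_libsize(tpm, libsize)
length_scaled_tpm <- rescale_to_libsize(tpm * rowMeans(len), libsize)

# Fold change s2 / s1 per gene for each type of count table.
fc <- cbind(raw             = raw[, "s2"] / raw[, "s1"],
            scaledTPM       = scaled_tpm[, "s2"] / scaled_tpm[, "s1"],
            lengthScaledTPM = length_scaled_tpm[, "s2"] / length_scaled_tpm[, "s1"])
print(fc)
print(length_scaled_tpm)
```

In this toy case the raw counts of geneA double relative to geneB purely because of the isoform switch, whereas both scaled tables give the same fold change for both genes. The lengthScaledTPM table additionally assigns more counts to geneA than to geneB, reflecting its longer average length, which is the sense in which gene length is "put back in".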
Many thanks for the clarification.
Then only scaledTPM inherits the normalisation by gene length from TPM, while in lengthScaledTPM the feature length is added back again and we are on a similar level as with raw counts. However, both methods account for DTU between samples.
Would this also mean that if someone tried to calculate TPM values according to the standard formula (as used for raw counts), this would be valid using lengthScaledTPM counts, but not for scaledTPM?
I do not follow why you would calculate TPM from something derived from TPM.
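For what it is worth, the arithmetic behind the question can be checked on toy numbers. The sketch below is base R only and rests on two assumptions: the scalings are defined as I read them from the tximport documentation (TPM, or TPM times the across-sample average length, rescaled to library size), and the TPM recalculation divides by that same average length. Which lengths a downstream pipeline would actually use is not known here. The function `tpm_from_counts` and all values are hypothetical.

```r
# Toy gene-level inputs (all values invented): geneA doubles its length in s2.
dn  <- list(c("geneA", "geneB"), c("s1", "s2"))
tpm <- matrix(5e5, nrow = 2, ncol = 2, dimnames = dn)
len <- matrix(c(1000, 1000, 2000, 1000), nrow = 2, dimnames = dn)
raw <- matrix(c(1000, 1000, 2000, 1000), nrow = 2, dimnames = dn)

avg_len <- rowMeans(len)   # average length per gene across samples
libsize <- colSums(raw)    # library size per sample

# Assumed definitions of the two count tables (see lead-in).
rescale <- function(x) sweep(x, 2, libsize / colSums(x), `*`)
scaled_tpm        <- rescale(tpm)
length_scaled_tpm <- rescale(tpm * avg_len)

# Standard TPM formula: divide by length, then scale each sample to one million.
tpm_from_counts <- function(counts, lengths) {
  rate <- counts / lengths
  sweep(rate, 2, colSums(rate), `/`) * 1e6
}

from_length_scaled <- tpm_from_counts(length_scaled_tpm, avg_len)
from_scaled        <- tpm_from_counts(scaled_tpm, avg_len)

print(from_length_scaled)  # matches the original tpm matrix
print(from_scaled)         # length is divided out a second time
```

Under these assumptions, the lengthScaledTPM table returns the original TPM, while the scaledTPM table does not, because its values were never multiplied by a length in the first place. In practice the TPM from salmon is already available, so this is a consistency check rather than a recommended step.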
I get your point, as we are already starting with TPM. This was more to check that I understand the principle correctly.
However, I'm also looking at a differential expression pipeline that uses DESeq2 but currently only accepts gene count matrices as input (no support for the offset), and I would like to clarify whether using the 'lengthScaledTPM', 'scaledTPM' or 'original raw counts' count matrices would be a valid choice. I did not check whether this actually happens, but if this pipeline recalculated TPMs based on 'lengthScaledTPM' (or 'scaledTPM' or 'original raw counts'), would this still provide reasonable results?
For the reasons outlined in the Soneson paper, I'd prefer this pipeline: https://nf-co.re/rnaseq
You cannot use these scaled TPM approaches with gene count tables that don't involve transcript abundance estimation.
I'm sorry for causing confusion here, but the gene count matrices I refer to are actually obtained via https://nf-co.re/rnaseq.
The problem with the 'rnaseq' pipeline is that it produces count matrices with the different methods described above, but does not perform further downstream processing (i.e., differential gene expression), except perhaps for PCA. For this downstream processing there is another pipeline (https://nf-co.re/differentialabundance/), which can perform treatment comparisons and other downstream tasks based on gene count matrices. So, while this downstream pipeline is not capable of making use of the 'offset' for raw counts via DESeq2 (as this is not implemented yet), it can apply standard DESeq2 processing with the lengthScaledTPM/scaledTPM count matrices as input. At least this is what I understand from the documentation of tximport, where it is written:
Does this sound reasonable, or have I maybe misunderstood the concept here? I'm just searching for the best way to proceed, and I was also not sure whether TPMs can be recalculated, as the pipeline uses count matrices as input without the TPMs from salmon.
Oh, I see.
If you can't use counts + offset, then I prefer passing lengthScaledTPM to DESeq2, i.e. txi$counts to DESeqDataSetFromMatrix().
So you can use the first pipeline to produce those counts, and analyze them with R or, it sounds like, with this other pipeline.
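A minimal sketch of that route in R, assuming DESeq2 is installed. The count matrix below is simulated and only stands in for txi$counts; the sample names, the `condition` column and the design are hypothetical. With real data, the matrix would come from a tximport call such as the one shown in the comment, where `files` and `tx2gene` are placeholders for your own inputs.

```r
library(DESeq2)

# With real data (not run here; 'files' and 'tx2gene' are placeholders):
# txi <- tximport::tximport(files, type = "salmon", tx2gene = tx2gene,
#                           countsFromAbundance = "lengthScaledTPM")
# cts <- txi$counts

# Simulated stand-in for txi$counts (invented, non-integer values).
set.seed(1)
cts <- matrix(rexp(6 * 4, rate = 1 / 200), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4)))

# Hypothetical sample table; row names must match the columns of cts.
coldata <- data.frame(condition = factor(c("ctrl", "ctrl", "trt", "trt")),
                      row.names = colnames(cts))

# DESeqDataSetFromMatrix() expects integer values, so the scaled counts
# are rounded before being passed in.
dds <- DESeqDataSetFromMatrix(countData = round(cts),
                              colData   = coldata,
                              design    = ~ condition)
dds
```

From here the usual DESeq() and results() steps apply on a real data set. When counts + offset can be used instead, the DESeq2 function for that is DESeqDataSetFromTximport(), which takes the txi object directly.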