Question

difference among tximport scaledTPM, lengthScaledTPM and the original TPM output by salmon/kallisto

7


tangming2005 ▴ 200

@tangming2005-6754

Last seen 9 weeks ago

United States

Hi,

I am testing salmon and kallisto for RNA-seq. Both tools output estimated counts and TPM. I have read around and put my notes here: https://github.com/crazyhottommy/RNA-seq-analysis/blob/master/salmon_kalliso_STAR_compare.md#counts-versus-tpmrpkmfpkm

My questions are:

1. From the help of the tximport function:

countsFromAbundance:

character, either "no" (default), "scaledTPM", or "lengthScaledTPM", for whether to generate estimated counts using abundance estimates scaled up to library size (scaledTPM) or additionally scaled using the average transcript length over samples and the library size (lengthScaledTPM). if using scaledTPM or lengthScaledTPM, then the counts are no longer correlated with average transcript length, and so the length offset matrix should not be used.

To my understanding, TPM is a unit that is scaled by (effective) feature length first and then by sequencing depth. So, what are scaledTPM and lengthScaledTPM? Does tximport use the estimated counts to get the TPM?

2. What is the difference between the TPM output by salmon/kallisto and the TPM returned by the tximport function?

3. How does tximport mathematically convert counts to TPM, if it uses the estimated counts to get the TPM?

Thanks very much!

Ming Tang

tximport deseq2 TPM RPKM rnaseq • 19k views

ADD COMMENT • link updated 6.6 years ago by luxeredias ▴ 20 • written 8.4 years ago by tangming2005 ▴ 200

1


luxeredias ▴ 20

@luxeredias-15360

Last seen 3.2 years ago

Brazil - Belo Horizonte - UFMG

Dear all,

Following up on the topic, I also struggled a bit to figure out the salmon+tximport outputs, but after reading forums and papers and looking into my own data I came up with this scheme of how I think RNA-seq normalization of salmon+tximport data works.

https://drive.google.com/file/d/1FQJ6Ao2L9Z2CLVA5clE8DzvwHR0ulBZD/view?usp=sharing

Box patterns (full lines, interrupted lines, yellow highlighting, red line) refer to the output file category (tx-lvl/gene-lvl, lib-size-length-corrected/not corrected, abundance/count).

Best,

Thomaz Luscher Dias

UFMG-Brazil

ADD COMMENT • link 6.6 years ago luxeredias ▴ 20

score 16 · Accepted Answer · 2016-07-11

16


Michael Love 43k

@mikelove

Last seen 5 days ago

United States

hi Ming Tang,

First, in case it's easier to just read the code that produces these counts, you can look over the few lines of code here:

https://github.com/Bioconductor-mirror/tximport/blob/master/R/tximport.R#L371-L378

1) scaledTPM is TPMs scaled up to library size, while lengthScaledTPM first multiplies TPM by feature length (averaged over samples) and then scales up to library size. These quantities are then on the same scale as the original counts, except that they are no longer correlated with feature length across samples.

2) No difference. tximport is simply importing the TPMs and providing them back to the user as a matrix (txOut=TRUE), or summarizing these values among isoforms of a gene (txOut=FALSE).

3) Counts are never converted to TPMs. The default is to import the estimated counts and estimated TPMs from the quantification files, and then summarize these to the gene level.
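
To make the arithmetic in 1) concrete, here is a minimal base-R sketch of the rescaling on toy data. The helper name counts_from_abundance and the toy matrices are made up for illustration, and the function only follows the description above (TPM, optionally multiplied by the average feature length over samples, then rescaled so each column sums to the original library size). It is not a copy of tximport's source, so treat it as an approximation.

```r
# Hypothetical helper: a sketch of the rescaling described above,
# not tximport's own implementation.
counts_from_abundance <- function(counts, abundance, eff_length,
                                  method = c("scaledTPM", "lengthScaledTPM")) {
  method <- match.arg(method)
  lib_size <- colSums(counts)  # per-sample total of estimated counts
  new_counts <- abundance
  if (method == "lengthScaledTPM") {
    # multiply each feature's TPM by its length averaged over samples
    new_counts <- abundance * rowMeans(eff_length)
  }
  # rescale every column so that it sums to the original library size
  sweep(new_counts, 2, lib_size / colSums(new_counts), `*`)
}

# Toy data (assumed values): 3 features x 2 samples
dn <- list(c("g1", "g2", "g3"), c("s1", "s2"))
counts     <- matrix(c(100, 200, 700, 300, 300, 400), nrow = 3, dimnames = dn)
eff_length <- matrix(c(1000, 2000, 500, 1000, 1000, 500), nrow = 3, dimnames = dn)

# TPM: counts per unit length, rescaled so each sample sums to one million
rate <- counts / eff_length
tpm  <- sweep(rate, 2, colSums(rate), `/`) * 1e6

scaled        <- counts_from_abundance(counts, tpm, eff_length, "scaledTPM")
length_scaled <- counts_from_abundance(counts, tpm, eff_length, "lengthScaledTPM")

colSums(counts)         # original library sizes
colSums(scaled)         # same totals as the original counts
colSums(length_scaled)  # same totals as the original counts
```

In both cases the TPM matrix itself is left untouched; only the counts are regenerated from it.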

ADD COMMENT • link 8.4 years ago Michael Love 43k

2


Thanks Michael. I understand much better now. Correct me if I am wrong:

The tximport function just imports the estimated counts/TPM and summarizes them to the gene level.

tx.salmon <- tximport(salmon.files, type = "salmon", tx2gene = tx2gene, 
                      reader = read_tsv, countsFromAbundance = "no")

tx.salmon$counts will be the count table from the original salmon quantification, but summarized to the gene level.

tx.salmon$abundance will be the TPM table from the original salmon quantification, also summarized to the gene level.

Alternatively, one can generate the count table from TPM (not from the original estimated counts):

tx.salmon.scale <- tximport(salmon.files, type = "salmon", tx2gene = tx2gene, 
                      reader = read_tsv, countsFromAbundance = "lengthScaledTPM")

tx.salmon.scale$abundance will be the same as tx.salmon$abundance (I checked),

but tx.salmon.scale$counts will be generated as TPM * featureLength (averaged over samples), then scaled up to the library size.

Values of tx.salmon.scale$counts are very close to tx.salmon$counts, but account for transcript length changes across samples.

ADD REPLY • link 8.4 years ago tangming2005 ▴ 200

3


Yes correct.

ADD REPLY • link 8.4 years ago Michael Love 43k

0


@Michael Love, thank you. That was succinct; much appreciated. It's too bad the Toil project in Xena (TCGA) does not provide the library size for this conversion, or just the TPM counts.

A

ADD REPLY • link 7.3 years ago Ahdee ▴ 50

0


Dear Michael, sorry for disturbing you now.

It is a little late for this post, but I'm confused about tximport. I want to convert my data to a DESeq2 object. The data were downloaded from https://xenabrowser.net/datapages/?dataset=TcgaTargetGtexgeneexpected_count&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443.

The data I use resemble what the tximport vignette describes as RSEM sample.genes.results, and the R help says that the tx2gene argument is required for gene-level summarization only for methods that provide transcript-level estimates only (kallisto, Salmon, Sailfish). So I wrote my code as tximport::tximport("prostate_rna", type = "rsem", txIn = F, txOut = T), where the object prostate_rna is the downloaded data back-transformed with 2^data - 1. But R gives me an error: Error in computeRsemGeneLevel(files, importer, geneIdCol, abundanceCol, : all(c(geneIdCol, abundanceCol, lengthCol) %in% names(raw)) is not TRUE In addition: Warning message: Unnamed col_types should have the same length as col_names. Using smaller of the two.

I don't know what to do to transform my data into a DESeq2 object.

ADD REPLY • link 5.3 years ago YunYun ▴ 70

0


tximport is only for importing files that are directly output by one of the software tools listed in the vignette. It cannot import this type of data, which has been preprocessed by a program other than the ones in the vignette.

ADD REPLY • link 5.3 years ago Michael Love 43k

0


Thanks for your reply. Unfortunately the UCSC data (UCSC Toil RNAseq Recompute) only provide the RSEM expected_count in units of log2(expected_count+1), so I have to transform it back to expected_count, because as I understand it DESeq2 expects raw counts without any processing.

Is there any way to handle the UCSC data and bring it into a DESeq2 object? I found a suggestion in the forum to use the function round, but I also found that the more recommended way is tximport. In your opinion, can I take the function round into consideration?

Thanks Michael!

ADD REPLY • link 5.3 years ago YunYun ▴ 70

0


I don’t have a recommendation for what to do here. I wouldn’t use DESeq2 if you don’t have access to the right input data.

ADD REPLY • link 5.3 years ago Michael Love 43k

0


Thank you a lot, I'll try another way to deal with the ucsc data

ADD REPLY • link 5.3 years ago YunYun ▴ 70

0


Dear Michael - are scaledTPM and lengthScaledTPM values comparable across samples (which TPM is definitely not)? I could be completely mistaken, but while both adjust for library size and gene length, neither adjusts for RNA/library composition, as the DESeq2 median-of-ratios and edgeR TMM normalization methods do?

ADD REPLY • link 5.1 years ago hermidalc ▴ 20

1


These two have library size differences baked in. The column sum is equal to the number of mapped reads. So not comparable across samples.
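
A small base-R illustration of what "baked in" means here, using assumed toy numbers: two samples with identical relative abundances (TPM) but different hypothetical totals of mapped reads. The rescaling below is a sketch of the scaledTPM idea, not tximport's own code.

```r
# Same relative abundances (TPM) in both samples, different sequencing depth.
dn  <- list(c("g1", "g2", "g3"), c("s1", "s2"))
tpm <- matrix(c(2e5, 3e5, 5e5, 2e5, 3e5, 5e5), nrow = 3, dimnames = dn)
lib_size <- c(s1 = 1e6, s2 = 4e6)  # hypothetical mapped-read totals

# scaledTPM-style counts: each TPM column rescaled to sum to its library size
scaled <- sweep(tpm, 2, lib_size / colSums(tpm), `*`)

colSums(scaled)                   # equals lib_size
scaled[, "s2"] / scaled[, "s1"]   # the same factor for every feature: depth, not biology
```

Every feature differs between the two columns by the ratio of the library sizes even though the TPMs are identical, which is why these values still need library-size normalization before samples are compared.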

ADD REPLY • link 5.1 years ago Michael Love 43k


0


Thank you for the explanation. Oh ok, so scaledTPM and lengthScaledTPM values are equivalent to gene-level quantification read counts? Similar to what would be produced by HTSeq or featureCounts?