Question

tximport discrepancy between TPMs and lengthScaledTPMs

0

Entering edit mode

igor ▴ 50

@igor

Last seen 20 days ago

United States

I am importing Salmon output with tximport to get gene-level expression levels using the suggested approach:

txi = tximport(quant_files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")

Generally, this works as expected. However, sometimes I notice a large discrepancy between TPMs (txi$abundance) and counts/lengthScaledTPMs (txi$counts).

For example, I am looking at one sample where the TPMs for the top two genes are 676,935 and 54,165 (this is clearly a problematic library), but the top two counts/lengthScaledTPMs are 661,979 and 3,917. Top 10 genes are 83% of total for TPMs, but 99.7% for counts. 97% of the genes end up with counts between 0 and 0.01. For comparison, in the original Salmon estimates, top 10 are 33% of the total. I am confident in the tximport results, but what would cause such behavior? Can I trust any of the values?

tximport • 1.3k views

ADD COMMENT • link updated 5.9 years ago by Michael Love 43k • written 5.9 years ago by igor ▴ 50

score 0 · Answer 1 · 2019-06-18

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 2 days ago

United States

I don’t necessarily see any problem with what you’ve presented above. Counts are not proportional to expression across genes, and they shouldn’t be expected to be.

There may be some issues with this library though. With this library where 67% of the total gene expression is from a single gene, how many genes have original estimated counts of 10 or more?

ADD COMMENT • link 5.9 years ago Michael Love 43k

0

Entering edit mode

My concern was that the top gene is 67% of total by TPMs, but 98% by counts. I understand the relative increase or decrease. The part that worries me is that a single gene is soaking up nearly all of the reads. Regardless of any scaling or normalization, it's hard to reconcile all other genes essentially disappearing.

By original estimates, I assume you mean the ones direct from Salmon (without tximport). There are 993 genes with over 10 counts, so the library is substantially more balanced. The top gene is only 42k (less than 10% of total).

ADD REPLY • link 5.9 years ago igor ▴ 50

0

Entering edit mode

Long genes do take all the counts, this is a property of RNA-seq data.

Ok so 993 genes with original counts greater than 10, how many for lengthScaledTPM?

ADD REPLY • link 5.9 years ago Michael Love 43k

0

Entering edit mode

For lengthScaledTPM, only 33 with counts greater than 10.

ADD REPLY • link 5.9 years ago igor ▴ 50

0

Entering edit mode

I think if this sample is peculiar among your other samples as having 67% of the TPM from one gene, then the counts from abundance approach may not work. But you may also need to discard that samples for QC reasons?

ADD REPLY • link 5.9 years ago Michael Love 43k

0

Entering edit mode

The problem is that I have a lot of samples that are not much better, so it's hard to define a cutoff.

I also tried the more traditional STAR/featureCounts approach and had only 64k total counts. The top gene from Salmon only had 20. Clearly, there is something very odd about this sample.

I would think I am doing something wrong, but I don't see such huge discrepancies when dealing with good quality samples.

There is probably not a clear answer, but I was hoping to figure out what might be the cause.

ADD REPLY • link 5.9 years ago igor ▴ 50