I am importing Salmon output with tximport
to get gene-level expression levels using the suggested approach:
txi = tximport(quant_files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM")
Generally, this works as expected. However, sometimes I notice a large discrepancy between TPMs (txi$abundance
) and counts/lengthScaledTPMs (txi$counts
).
For example, I am looking at one sample where the TPMs for the top two genes are 676,935 and 54,165 (this is clearly a problematic library), but the top two counts/lengthScaledTPMs are 661,979 and 3,917. Top 10 genes are 83% of total for TPMs, but 99.7% for counts. 97% of the genes end up with counts between 0 and 0.01. For comparison, in the original Salmon estimates, top 10 are 33% of the total. I am confident in the tximport
results, but what would cause such behavior? Can I trust any of the values?
My concern was that the top gene is 67% of total by TPMs, but 98% by counts. I understand the relative increase or decrease. The part that worries me is that a single gene is soaking up nearly all of the reads. Regardless of any scaling or normalization, it's hard to reconcile all other genes essentially disappearing.
By original estimates, I assume you mean the ones direct from Salmon (without
tximport
). There are 993 genes with over 10 counts, so the library is substantially more balanced. The top gene is only 42k (less than 10% of total).Long genes do take all the counts, this is a property of RNA-seq data.
Ok so 993 genes with original counts greater than 10, how many for lengthScaledTPM?
For lengthScaledTPM, only 33 with counts greater than 10.
I think if this sample is peculiar among your other samples as having 67% of the TPM from one gene, then the counts from abundance approach may not work. But you may also need to discard that samples for QC reasons?
The problem is that I have a lot of samples that are not much better, so it's hard to define a cutoff.
I also tried the more traditional STAR/featureCounts approach and had only 64k total counts. The top gene from Salmon only had 20. Clearly, there is something very odd about this sample.
I would think I am doing something wrong, but I don't see such huge discrepancies when dealing with good quality samples.
There is probably not a clear answer, but I was hoping to figure out what might be the cause.