I have used DESeq2 for differential gene expression analysis and found discrepancies of gene ranking from two pipelines upstream of DESeq2: STAR-rsem-tximport; STAR-salmon-tximport.. Some of the discrepancies are quite large and I picked one gene here that ranked No98 from rsem-tximport but No.3 from salmon-tximport. It turned out that both rsem and salmon gave the same count numbers but they are significantly different in lengths. Any advice would be appreciated.
Here is the counts from rsem-tximport:
> txi$counts["ENSMUSG00000040170.13",] 28 7 21 34 6 26 45 12 15 50 38 27 48 35 33 2 23 49 14 4 139 108 119 561 100 76 242 206 208 116 405 201 54 78 98 84 285 52 91 88 5 9 3 98 137 111
Here is the counts from salmon-tximport
> txi$counts["ENSMUSG00000040170.13",] 28 7 21 34 6 26 45 12 139.00000 108.00000 119.00000 561.00004 100.00000 76.00000 242.00001 206.00000 15 50 38 27 48 35 33 2 208.00000 116.00000 405.00000 201.00007 54.00000 78.00002 98.00000 84.00002 23 49 14 4 5 9 3 285.00001 52.00000 91.00000 88.00000 98.00000 137.00000 111.00001
gene length from rsem-tximport
> txi$length["ENSMUSG00000040170.13",] 28 7 21 34 6 26 45 12 15 50 3549.18 3529.81 3202.68 2884.12 3425.77 3419.01 3424.29 3581.70 3648.28 3573.82 38 27 48 35 33 2 23 49 14 4 3636.29 2926.10 2974.95 3017.98 3649.82 2255.17 3631.78 3567.58 3545.08 3017.01 5 9 3 3746.35 3772.59 3405.43
gene length from salmon-tximport
> txi$length["ENSMUSG00000040170.13", ] 28 7 21 34 6 26 45 12 319.5377 2249.3429 1809.2934 325.9756 947.2694 2014.2411 296.0048 1795.9675 15 50 38 27 48 35 33 2 402.6961 566.5026 462.6234 366.7090 2158.0163 1717.6544 950.9149 595.1135 23 49 14 4 5 9 3 879.1492 1908.8248 862.0149 291.5122 3475.7957 2344.3731 552.1021
Thanks for the comments. It seems that tximport takes effective length directly from rsem gene.results file. Look at the effective length from rsem output for sample 9, it has the exact number from tximport length above.
But for salmon, tximport has to calculates effective length from salmon transcripts quantification file. So it is tximport vs rsem, not sure which gene length is more appropriate.
I did not do --gcBias with salmon. I will run again with the flag turned on and see there is any differences.
tximport uses the estimated effective length of the transcripts, and computes a weighted average of these to produce a gene length.
So either the effective length of the transcripts could be different between Salmon and RSEM, or the isoform usage, or both, to produce a difference in the estimated gene length.
If GC bias was not used, then I would guess it's more differences in the isoform usage that are changing gene length.
Hi Michael, I just want to update to your comment above.
What I found is that both tximport-rsem and tximport-salmon generate highly correlated gene counts but very different gene length estimations. As shown below I have 53456 genes, and 40617 have the same counts across all samples.
Looking at gene length, not only the numbers are different, but the orders of sample (sorted by gene length) are different as well. As shown below there are only 1295 genes that have the same order.
I also compared the transcript length and they are very similar. Do you have comments on which gene length estimates is better?
PS. I did --gcBias with salmon, both gene counts and lengths are changed but the gene ranking are similar to the original salmon, relatively to rsem.
I don't really have any more informative comments on this. I wouldn't use code like the above to compare rank, but instead simply plotting rank(x) vs rank(y).
I wasn't trying to compare the ranks but was trying to show that it is the gene length estimation that drives differences of DESeq2 results. I wish someone could work it out that the top two rated quantification software have discrepancies that is large enough to puzzle biologist. I couldn't find any paper that compare these two software head to head. Thanks again for your comments.
I just noticed that in Salmon version 0.11.0 a fix has been applied to correct a bug that for multi-isoform genes resulted in the wrong calculation of lengths and effective lengths of genes. Although this was, according to the release notes, (apparently) *only* an issue when these numbers were calculated within Salmon, and not when using
tximport
, it may be worth to check this in more detail.See Salmon v0.11.0 release notes here.
Thanks. We only use the effective transcript lengths and the transcript abundances in tximport to calculate the average transcript length (of a gene), so any miscalculation in gene lengths upstream won’t affect tximport results.