Hi,
I have an issue using tximport
with kallisto output. I never had this issue before, and I can't figure out why it is happening. When applying it on a large dataset (over 3000 transcriptome mapped cells from single-cell RNAseq data), some genes from a subset of the cells (1/4 of the cells) have NA
values. When I checked the counts of these genes in the corresponding abundance.tsv
files, some of them seem to be expressed at high levels (around 1000 counts). Many of these genes also have only 1 transcript.
The code I'm running is the following:
txi <- tximport(files = abundance.files, type = "kallisto", tx2gene = tx2gene, importer = read_tsv)
I also tried importing with rhdf5
, and it didn't change in anything.
My sessionInfo()
is the following:
sessionInfo() R version 3.4.1 (2017-06-30) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.6 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] readr_1.1.1 tximport_1.4.0 magrittr_1.5 loaded via a namespace (and not attached): [1] zlibbioc_1.22.0 compiler_3.4.1 plyr_1.8.4 R6_2.2.2 hms_0.3 tools_3.4.1 [7] rhdf5_2.20.0 reshape2_1.4.2 tibble_1.3.4 Rcpp_0.12.13 stringi_1.1.5 stringr_1.2.0 [13] rlang_0.1.4
Any help to fix this issue is highly appreciated!
Thanks for your help!
Best,
Leon
“I also tried importing with
rhdf5
, and it didn't change in anything.”When you tried this, did you specify ‘importer’? If so, can you try not specifying importer, but using our built-in methods for importing kallisto h5 files?
I first didn't specify
importer
. I had previously used an earlier release oftximport
(which had thereader
argument). However, this time with the latest release, thereader
argument was giving me an error. Therefore, I tried to executetximport
with the default arguments, and a message popped telling me that I have to importrhdf5
so thattximport
functions in its default settings. So I did that and this is when I realised that I hadNA
values. Suspecting that probably something wrong happened with theh5
files, I tried usingread_tsv
as an alternative approach and got the exact same output.Ok so it's probably not to do with importer, but if you want to import much faster you can install rhdf5:
And then try without specifying 'importer'. You will still get NA's with this approach, and I'll comment again with more to diagnose this.
Indeed, and as I mentioned earlier, when I tried using
rhdf5
without specifyingimporter
, I always gotNA
values. However, regarding the speed, I didn't feel that it was loading faster. But this was just my feeling because I didn't test for speed explicitly.I wonder if this could have to do with the automatic determination of the column types by read_tsv. By default I think it uses the first 1,000 values to determine the column type, and if these values are all 0 it may try to read the whole column as integers. This caused problems for me once (albeit not exactly the same problem as yours). Could you try to specify another importer function, e.g.
```
read_tsv2 <- function(...) readr::read_tsv(..., progress = FALSE,
col_types = list(
target_id = col_character(),
length = col_integer(),
eff_length = col_double(),
est_counts = col_double(),
tpm = col_double()
))
```
Thanks Charlotte, if this is the problem, using rhdf5 should avoid the NA. This is the importer that is used if the rhdf5 package is detected.
Another curious thing is that Leon reports no NA when txOut=TRUE, which wouldn’t happen i think if the importer was the issue.
True, but that was only for 50 cells, for which txOut=FALSE also worked, so I guess it is not yet clear whether there is a set of cells for which txOut=TRUE works but txOut=FALSE doesn't.
I'm wondering, if it's the importer's fault, why would more or less files make a difference?
Dear Michael and Charlotte,
Thank you for your much appreciated help. The solution Charlotte provided solved my issue. Also, regarding importing
h5
files, I did a mistake when trying to load these files, since I have specifiedabundance.tsv
files totximport
instead ofh5
files. Directly loadingh5
files using the default settings oftximport
does not giveNA
values. My complete script for this is the following:Please let me know if I need to provide further information so that others that might face a similar issue be able to troubleshoot it quite straightforwardly.
Best,
Leon
Thank Leon for the useful feedback here.
We can either have a specific read_tsv importer for each method 'type' that explicitly sets the col_types, or I could set guess_max=Inf so that read_tsv uses the full file for setting col_types. The first approach should be faster, so I'll probably go with this.
At this point I can debug on my own now that we have the source of the NAs, and will put a fix into the devel branch.
The transcripts FASTA file was directly downloaded from Ensembl. The file can be found here. The file corresponds to Homo_sapiens.GRCh38.cdna.all.fa.gz. No
R
code was used to generate thetx2gene
file.What we did is to retrieve from the FASTA file (which is exactly the same file that was used to generate the kallisto index) all transcript IDs. These IDs were then used to retrieve gene IDs and gene names from Biomart.What we did is to retrieve from the FASTA file (which is exactly the same file that was used to generate the kallisto index) all transcript and gene IDs, as well as gene names. A custom madepython
script was used to retrieve all these information. Thetx2gene
file that was provided to the code includes only transcript IDs and gene names. The issue is not at all related to thetx2gene
file, because all transcripts have a corresponding gene name. Moreover, the same genes that showNA
values in some cells have actual count values in other cells.Best,
Leon
Any NA if you set txOut=TRUE?
When I set
txOut = TRUE
, I get noNA
values. However, I didn't run this code on all my cells, since this is time consuming. I first identified all cells havingNA
values, then selected 50 of them and ran the code. Now if I settxOut = FALSE
by using only this same subset of cells (i.e. the 50 selected cells), I still don't getNA
values. Any hints on what could have happened when taking all 3000 cells together?Best,
Leon
My guess is that the
tx2gene
doesn't line up with the transcripts in the FASTA file, and hence the quantifications. Biomart is convenient but using it makes it sometimes difficult to know which version you got. Can you try this instead:wget ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.gtf.gz
Then construct a tx2gene table from 'txs' above.
Dear Michael,
I still don't understand why you think this is related to
tx2gene
. Many cells from the same dataset (3/4 of the cells) and also across multiple other datasets having fewer cells don't show this problem at all (using the sametx2gene
file). Also, and as I mentioned it C: tximport returns NA counts, when usingtximport
with only a subset of the cells showingNA
values, this problem doe not emerge at all (i.e. the same genes that previously hadNA
values now show actual counts). Also, when I checked many of these genes individually by tracing back their transcript ID (which was unique in some cases), and then searched for this transcript in the correspondingabundance.tsv
files, I actually found count values and notNA
. The transcript ID matched perfectly the ID in theabundance.tsv
files! In my view, everything I tried points to the fact that it is not a problem of concordance between the files, but an issue with the internal workings oftximport
, that might not support very large datasets. Doestximport
try to correct the counts of cells (or samples more generally) having a high variance in gene expression? Also, do you think that this issue might be fixed any time soon?Thank you for your help!
Best,
Leon
I’d be happy to fix the issue but we need to identify the source of the NA. tximport is simply summing the counts from transcripts associated with a gene in tx2gene. Can you please email me three objects saved in an RData file: an example of some columns that contain NA values, an example of the output for the same samples when txOut=TRUE, and tx2gene? I’ll report back here if I find anything.
In version 1.7.3, I hard code the
col_types
to avoid this error:https://github.com/mikelove/tximport/commit/f80fcaac7411ae590688c237088072a313772668
Thanks for your bug report