I am trying to use tximport to make data matrices from kallisto output files. I have tried using both the h5 and tsv output files, and both are producing errors. When I try to use the tsv files, I am running the following:
`dir <- "/Users/My_Name/Downloads"
sampleruns <- c("SRR3402457_1.tsv", "SRR3402460_1.tsv", "SRR3402456_1.tsv", "SRR3402459_1.tsv")
files <- file.path(dir, sampleruns)
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
txi.kallisto.tsv <- tximport(files, type = "kallisto", tx2gene = tx2gene,
                             ignoreAfterBar = TRUE, ignoreTxVersion = TRUE)`
This gives me the following error:
Error in .local(object, ...) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
Example IDs (file): [TR100009|c0_g1_i1|m, TR100024|c0_g1_i1|m, TR100032|c0_g1_i1|m, ...]
Example IDs (tx2gene): [ORF TR100009|c0_g1_i1|g.500685 TR100009|c0_g1_i1|m.500685 type:internal len:127 (-), ORF TR100024|c0_g1_i1|g.500687 TR100024|c0_g1_i1|m.500687 type:complete len:111 (+), ORF TR100032|c0_g1_i1|g.500688 TR100032|c0_g1_i1|m.500688 type:complete len:120 (-), ...]
This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.
For reference, my tx2gene table looks like this (the printout wraps, so the TXNAME and GENEID columns appear in separate blocks):
TXNAME
1 TR100009|c0_g1_i1|g.500685 TR100009|c0_g1_i1|m.500685 type:internal len:127 (-)
2 TR100024|c0_g1_i1|g.500687 TR100024|c0_g1_i1|m.500687 type:complete len:111 (+)
3 TR100032|c0_g1_i1|g.500688 TR100032|c0_g1_i1|m.500688 type:complete len:120 (-)
4 TR100037|c0_g1_i1|g.500691 TR100037|c0_g1_i1|m.500691 type:3prime_partial len:101 (-)
5 TR10004|c1_g1_i1|g.85724 TR10004|c1_g1_i1|m.85724 type:internal len:189 (+)
6 TR100051|c0_g1_i1|g.500696 TR100051|c0_g1_i1|m.500696 type:internal len:147 (-)
GENEID
1 TR100009|c0_g1_i1|g.500685 TR100009|c0_g1_i1|m.500685 type:internal len:127 (-)
2 TR100024|c0_g1_i1|g.500687 TR100024|c0_g1_i1|m.500687 type:complete len:111 (+)
3 TR100032|c0_g1_i1|g.500688 TR100032|c0_g1_i1|m.500688 type:complete len:120 (-)
4 TR100037|c0_g1_i1|g.500691 TR100037|c0_g1_i1|m.500691 type:3prime_partial len:101 (-)
5 TR10004|c1_g1_i1|g.85724 TR10004|c1_g1_i1|m.85724 type:internal len:189 (+)
6 TR100051|c0_g1_i1|g.500696 TR100051|c0_g1_i1|m.500696 type:internal len:147 (-)
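To make the mismatch concrete, here is a minimal base-R sketch that uses only the example strings from the error message. The regular expression and the variable names (`tx2gene_ids`, `file_ids`, `tx_token`, `tx_short`) are my own illustrative assumptions and not part of tximport. The sketch only shows that the short file-side IDs appear to be embedded inside the long tx2gene strings; I am not sure this is the right way to build a tx2gene table.

```r
# Example strings copied from the error message above.
tx2gene_ids <- c(
  "ORF TR100009|c0_g1_i1|g.500685 TR100009|c0_g1_i1|m.500685 type:internal len:127 (-)",
  "ORF TR100024|c0_g1_i1|g.500687 TR100024|c0_g1_i1|m.500687 type:complete len:111 (+)"
)
file_ids <- c("TR100009|c0_g1_i1|m", "TR100024|c0_g1_i1|m")

# Pull out the space-delimited token that contains "|m." followed by digits.
# (Assumption: this token is the transcript-like ID inside the long header.)
tx_token <- regmatches(tx2gene_ids, regexpr("[^ ]+\\|m\\.[0-9]+", tx2gene_ids))

# Drop everything from the first "." onward, which is what the
# file-side IDs in the error message look like.
tx_short <- sub("\\..*", "", tx_token)

print(tx_token)
print(tx_short %in% file_ids)
```

If the last line prints `TRUE` for every entry, the two ID sets differ only in the extra text surrounding the transcript token.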
When I instead try to use h5 files, I am running the following:
dir <- "/Users/My_Name/Downloads"
sampleruns <- c("SRR3402457_1.h5", "SRR3402460_1.h5", "SRR3402456_1.h5", "SRR3402459_1.h5")
files <- file.path(dir, sampleruns)
names(files) <- paste0("sample", 1:4)
txi.kallisto <- tximport(files, type = "kallisto", txOut = TRUE, tx2gene = tx2gene)
I have tried the above with and without the tx2gene argument. Either way, I am getting the following error:
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read.delim (install 'readr' package for speed up)
1 Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<89>HDF'
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 3 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 4 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote, :
line 5 appears to contain embedded nulls
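The `<89>HDF` in the error looks like the start of the HDF5 file signature, which suggests the binary file is being parsed as text by `read.delim`. Below is a minimal base-R sketch for checking whether a file begins with that signature at offset 0. The helper name `is_hdf5` is hypothetical, and the temporary files are stand-ins rather than my real kallisto output.

```r
# The 8-byte signature that begins an HDF5 file: \x89 H D F \r \n \x1a \n
hdf5_signature <- as.raw(c(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a))

# Hypothetical helper: TRUE if the file's first 8 bytes equal the HDF5 signature.
is_hdf5 <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  first_bytes <- readBin(con, what = "raw", n = 8)
  identical(first_bytes, hdf5_signature)
}

# Demonstration on stand-in temporary files (not real kallisto output).
fake_h5 <- tempfile(fileext = ".h5")
writeBin(c(hdf5_signature, as.raw(0x00)), fake_h5)
fake_tsv <- tempfile(fileext = ".tsv")
writeLines("target_id\tlength\teff_length\test_counts\ttpm", fake_tsv)

print(is_hdf5(fake_h5))
print(is_hdf5(fake_tsv))
```

Running `is_hdf5()` on each of the `.h5` paths in `files` would confirm whether they are genuine HDF5 files that are simply being read with the wrong reader.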
Lastly, here is the output of sessionInfo():
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] tximportData_1.12.0 GenomicFeatures_1.36.4 AnnotationDbi_1.46.1 Biobase_2.44.0 GenomicRanges_1.36.1
[6] GenomeInfoDb_1.20.0 IRanges_2.18.3 S4Vectors_0.22.1 BiocGenerics_0.30.0 rhdf5_2.28.1
[11] tximport_1.12.3 edgeR_3.26.8 limma_3.40.6
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 compiler_3.6.1 pillar_1.4.2 BiocManager_1.30.10
[5] XVector_0.24.0 prettyunits_1.0.2 progress_1.2.2 bitops_1.0-6
[9] tools_3.6.1 zlibbioc_1.30.0 biomaRt_2.40.5 zeallot_0.1.0
[13] digest_0.6.23 bit_1.1-14 RSQLite_2.1.3 memoise_1.1.0
[17] tibble_2.1.3 lattice_0.20-38 pkgconfig_2.0.3 rlang_0.4.2
[21] Matrix_1.2-18 DelayedArray_0.10.0 DBI_1.0.0 yaml_2.2.0
[25] GenomeInfoDbData_1.2.1 rtracklayer_1.44.4 httr_1.4.1 stringr_1.4.0
[29] hms_0.5.2 Biostrings_2.52.0 vctrs_0.2.0 locfit_1.5-9.1
[33] bit64_0.9-7 grid_3.6.1 R6_2.4.1 BiocParallel_1.18.1
[37] XML_3.98-1.20 magrittr_1.5 Rhdf5lib_1.6.3 blob_1.2.0
[41] matrixStats_0.55.0 GenomicAlignments_1.20.1 Rsamtools_2.0.3 backports_1.1.5
[45] SummarizedExperiment_1.14.1 assertthat_0.2.1 stringi_1.4.3 RCurl_1.95-4.12
[49] crayon_1.3.4
Any help or insight into how I might solve EITHER of these errors (I just need one filetype to work) would be greatly appreciated. Thank you!