Question

tximport error: TXNAME from Txdb.mm10 and kallisto target_id mismatch

Kaustav • 0

@kaustav-13212

NY

Hello

I'm relatively new to Bioconductor (and R). My overall goal is to find differentially expressed genes from RNA-Seq data using DESeq2 (I switched from Cuffdiff). I have 6 samples: 2 conditions with 3 biological replicates each. I used kallisto to get counts from each fastq file with the mm10 version of the mouse genome. I have used DESeq before, for which I made my own count matrix with counts from 'htseq-count'. This time, however, I am trying to use tximport to import the RNA-Seq count files generated by kallisto. I get the following error:

> txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> k <- keys(txdb, keytype = "GENEID")
> df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
'select()' returned 1:many mapping between keys and columns
> tx2gene <- df[, 2:1]
> txi.kallisto.tsv <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene)
reading in files
1 2 3 4 5 6
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :
 
  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

I checked to see that the files are intact:

> all(file.exists(fileEB))

[1] TRUE

After browsing through the forums, I also found a solution to try and use 'ignoreTxVersion', but that didn't work either:

> txi.kallisto.tsv <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene, ignoreTxVersion = TRUE)
reading in files
1 2 3 4 5 6
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :

  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

It seems that there is a mismatch between the transcript names. The tx2gene dataframe looks like this:

> head(tx2gene)
TXNAME GENEID
1 uc009veu.1 100009600
2 uc033jjg.1 100009600
3 uc012fog.1 100009609
4 uc011xhj.2 100009614
5 uc007inp.2 100009664
6 uc008vqx.2 100012

The kallisto output tsv looks like this:

target_id  length  eff_length  est_counts  tpm
AF240164   597     398         0.0494548   0.00760895
AF240165   285     86.0009     0           0
AF240166   463     264         0           0
AF240167   540     341         0           0
AF240168   671     472         0           0
AF240169   461     262         0           0
AF240170   535     336         0           0
AF240171   624     425         0           0
AF240172   683     484         0           0

When I search for the terms in the first column of the kallisto output I get:

LOCUS       AF240169                 461 bp    mRNA    linear   HTC 30-APR-2001
DEFINITION  Mus musculus MRP6 mRNA.
ACCESSION   AF240169
VERSION     AF240169.1
KEYWORDS    HTC.
SOURCE      Mus musculus (house mouse)

I can't figure out the cause of the mismatch. I definitely used the mm10 build of the mouse genome downloaded from the UCSC server. I know it may partly be an issue with kallisto, and I will post this to other forums as well, but I wanted to ask if anyone has faced this before and knows of a solution. I checked some of the terms for matches manually (after changing the case), but didn't find any.

I will be happy to provide other details if you ask.

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.2 (Final)

Matrix products: default
BLAS/LAPACK: /hpc/packages/minerva-common/intel/parallel_studio_xe_2015/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_gf_lp64.so

locale:
[1] C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] rhdf5_2.16.0
[2] readr_0.2.2
[3] tximport_1.0.2
[4] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
[5] GenomicFeatures_1.26.0
[6] AnnotationDbi_1.36.0
[7] Biobase_2.34.0
[8] GenomicRanges_1.26.1
[9] GenomeInfoDb_1.10.0
[10] IRanges_2.8.0
[11] S4Vectors_0.12.0
[12] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9.4              XVector_0.12.1
[3] GenomicAlignments_1.8.4    splines_3.4.0
[5] zlibbioc_1.20.0            BiocParallel_1.11.2
[7] xtable_1.8-2               lattice_0.20-35
[9] DESeq_1.24.0               tools_3.4.0
[11] SummarizedExperiment_1.2.3 grid_3.4.0
[13] DBI_0.5-1                  genefilter_1.54.2
[15] survival_2.40-1            Matrix_1.2-9
[17] rtracklayer_1.34.2         geneplotter_1.50.0
[19] RColorBrewer_1.1-2         bitops_1.0-6
[21] biomaRt_2.30.0             RCurl_1.95-4.8
[23] RSQLite_1.0.0              compiler_3.4.0
[25] Rsamtools_1.26.1           Biostrings_2.40.2
[27] XML_3.98-1.4               annotate_1.50.0

tximport kallisto txdb.mmusculus.ucsc.mm10.knowngene • 2.3k views

ADD COMMENT • link updated 7.9 years ago by James W. MacDonald 68k • written 7.9 years ago by Kaustav • 0

score 2 · Accepted Answer · 2017-06-08

James W. MacDonald 68k

@james-w-macdonald-5106

United States

The transcript IDs in your tx2gene are UCSC transcript IDs, but the IDs in your kallisto file are GenBank accession numbers. You will have better luck doing something like

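library(org.Mm.eg.db)  # mouse annotation, to match the mm10 samples
# Map Entrez Gene IDs to GenBank accession numbers
tx2gene <- select(org.Mm.eg.db, keys(org.Mm.eg.db), "ACCNUM")
# tximport expects transcript IDs in the first column, gene IDs in the second
tx2gene <- tx2gene[,2:1]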
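
This is also why ignoreTxVersion = TRUE made no difference. Per the tximport documentation, that option splits transcript IDs on the "." character to drop version information, which only helps when both sides share the same kind of ID. The sketch below uses a few IDs copied from the question; strip_version is a hypothetical helper that only approximates the option and is not tximport's own code.

```r
# A few IDs copied from the question
ucsc_ids     <- c("uc009veu.1", "uc033jjg.1", "uc012fog.1")  # from tx2gene
kallisto_ids <- c("AF240164", "AF240165", "AF240166")        # from abundance.tsv

# Hypothetical helper: drop everything from the first "." onward,
# as a rough stand-in for what ignoreTxVersion does
strip_version <- function(ids) sub("\\..*$", "", ids)

# Overlap before and after stripping versions
overlap_raw      <- intersect(kallisto_ids, ucsc_ids)
overlap_stripped <- intersect(strip_version(kallisto_ids),
                              strip_version(ucsc_ids))

length(overlap_raw)
length(overlap_stripped)
```

Both overlaps are empty, because "uc..." and "AF..." identifiers come from different naming schemes, so no amount of version trimming makes them match.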

You could also check to make sure that the first column of your tx2gene has the same values as your kallisto files:

all(<first column of kallisto file> %in% tx2gene[,1])
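
As a minimal sketch of that check, assuming a toy abundance file and a toy tx2gene (the file is written to a temporary path, and the gene labels geneA and geneB are made up for illustration), you could read one kallisto output and see which target_id values are missing from the mapping:

```r
# Toy stand-in for one kallisto abundance.tsv (rows taken from the question)
tsv <- tempfile(fileext = ".tsv")
writeLines(c(
  "target_id\tlength\teff_length\test_counts\ttpm",
  "AF240164\t597\t398\t0.0494548\t0.00760895",
  "AF240165\t285\t86.0009\t0\t0"
), tsv)

# Toy tx2gene: one GenBank accession and one UCSC-style ID.
# The gene labels are hypothetical placeholders.
tx2gene <- data.frame(
  TXNAME = c("AF240164", "uc009veu.1"),
  GENEID = c("geneA", "geneB"),
  stringsAsFactors = FALSE
)

# Read the quantification file and compare IDs with the first tx2gene column
abundance <- read.delim(tsv, stringsAsFactors = FALSE)
in_map <- abundance$target_id %in% tx2gene[, 1]

all(in_map)                    # TRUE only if every transcript is mapped
mean(in_map)                   # fraction of transcripts that are mapped
abundance$target_id[!in_map]   # the IDs that would be dropped
```

Looking at the unmatched IDs, rather than only the TRUE/FALSE result, usually shows right away whether the two sides use different ID types or just differ in a version suffix.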

ADD COMMENT • link 7.9 years ago James W. MacDonald 68k

Thank you James, it worked like a charm. Spent the last 48h on this, and finally victory!

ADD REPLY • link 7.9 years ago Kaustav • 0