Hello
I'm relatively new to Bioconductor (and R). My overall goal is to find differentially expressed genes from RNA-Seq data using DESeq2 (I switched from Cuffdiff). I have 6 samples: 2 conditions with 3 biological replicates each. I used kallisto to get counts from each fastq file with the mm10 version of the mouse genome. I have used DESeq before, for which I made my own count matrix from 'htseq-count' output. However, this time I am trying to use tximport to import the RNA-Seq count files generated by kallisto. I get the following error:
> txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> k <- keys(txdb, keytype = "GENEID")
> df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")
'select()' returned 1:many mapping between keys and columns
> tx2gene <- df[, 2:1]
> txi.kallisto.tsv <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene)
reading in files
1 2 3 4 5 6
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
I checked that the files exist at the given paths:
> all(file.exists(fileEB))
[1] TRUE
After browsing through the forums, I also found a suggestion to use 'ignoreTxVersion', but that didn't work either:
> txi.kallisto.tsv <- tximport(fileEB, type = "kallisto", tx2gene = tx2gene, ignoreTxVersion = TRUE)
reading in files
1 2 3 4 5 6
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
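As a rough illustration of what ignoring transcript versions amounts to, the sketch below strips a trailing ".N" suffix from some made-up IDs. The helper `strip_version` and all IDs here are hypothetical, and the regular expression is an assumption for illustration, not tximport's actual code. The point is that an option like this can only help when the two sets of IDs differ by the version suffix alone.

```r
# Hypothetical helper: drop a trailing ".<digits>" version suffix.
# Illustrative only; not taken from the tximport source.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Made-up IDs: an annotation with versions, a quantification without.
annot_ids <- c("txA.1", "txB.2")
quant_ids <- c("txA", "txB")

# Without stripping, nothing matches.
match_raw <- any(quant_ids %in% annot_ids)

# After stripping versions, every ID matches.
match_stripped <- all(quant_ids %in% strip_version(annot_ids))

# IDs from a different naming scheme still fail to match after stripping.
other_ids <- c("ZZ000001", "ZZ000002")
match_other <- any(strip_version(other_ids) %in% strip_version(annot_ids))

print(c(raw = match_raw, stripped = match_stripped, other = match_other))
```

If the IDs come from entirely different naming schemes, as in the last case, stripping versions changes nothing and the same error is expected.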
It seems that there is a mismatch between the transcript names. The tx2gene dataframe looks like this:
> head(tx2gene)
TXNAME GENEID
1 uc009veu.1 100009600
2 uc033jjg.1 100009600
3 uc012fog.1 100009609
4 uc011xhj.2 100009614
5 uc007inp.2 100009664
6 uc008vqx.2 100012
The kallisto output tsv looks like this:
target_id length eff_length est_counts tpm
0
AF240164 597 398 0.0494548 0.00760895
AF240165 285 86.0009 0 0
AF240166 463 264 0 0
AF240167 540 341 0 0
AF240168 671 472 0 0
AF240169 461 262 0 0
AF240170 535 336 0 0
AF240171 624 425 0 0
AF240172 683 484 0
When I search for the terms in the first column of the kallisto output I get:
LOCUS AF240169 461 bp mRNA linear HTC 30-APR-2001
DEFINITION Mus musculus MRP6 mRNA.
ACCESSION AF240169
VERSION AF240169.1
KEYWORDS HTC.
SOURCE Mus musculus (house mouse)
I can't figure out the cause of the mismatch. I definitely used the mm10 build of the mouse genome downloaded from the UCSC server. I know it may partly be an issue with kallisto, and I will post this to other forums as well, but I wanted to ask whether anyone has faced this before and knows of a solution. I checked some of the IDs for matches manually (after changing the case), but didn't find any.
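The manual check can also be done programmatically. The sketch below is a minimal example that rebuilds small data frames from the rows pasted above (the object name `quant` is made up for illustration) and counts how many kallisto `target_id` values appear in the `TXNAME` column of `tx2gene`. On real data, `quant` would presumably be read from one of the kallisto tsv files instead, as noted in the comment.

```r
# The first rows of tx2gene, copied from the head() output above.
tx2gene <- data.frame(
  TXNAME = c("uc009veu.1", "uc033jjg.1", "uc012fog.1",
             "uc011xhj.2", "uc007inp.2", "uc008vqx.2"),
  GENEID = c("100009600", "100009600", "100009609",
             "100009614", "100009664", "100012"),
  stringsAsFactors = FALSE
)

# target_id values copied from the kallisto output above.
# On real data (assumption): quant <- read.delim(fileEB[1], stringsAsFactors = FALSE)
quant <- data.frame(
  target_id = c("AF240164", "AF240165", "AF240166", "AF240167", "AF240168",
                "AF240169", "AF240170", "AF240171", "AF240172"),
  stringsAsFactors = FALSE
)

# IDs present in both the quantification and the annotation.
shared <- intersect(quant$target_id, tx2gene$TXNAME)
n_shared <- length(shared)

# IDs in the quantification that have no entry in tx2gene.
unmatched <- setdiff(quant$target_id, tx2gene$TXNAME)

cat("shared IDs:", n_shared, "\n")
cat("unmatched IDs:", length(unmatched), "of", nrow(quant), "\n")
```

For these rows no IDs are shared, which fits the error message: the kallisto IDs look like GenBank accessions, while `tx2gene` holds UCSC knownGene transcript names.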
I will be happy to provide other details if you ask.
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.2 (Final)
Matrix products: default
BLAS/LAPACK: /hpc/packages/minerva-common/intel/parallel_studio_xe_2015/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_gf_lp64.so
locale:
[1] C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] rhdf5_2.16.0
[2] readr_0.2.2
[3] tximport_1.0.2
[4] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
[5] GenomicFeatures_1.26.0
[6] AnnotationDbi_1.36.0
[7] Biobase_2.34.0
[8] GenomicRanges_1.26.1
[9] GenomeInfoDb_1.10.0
[10] IRanges_2.8.0
[11] S4Vectors_0.12.0
[12] BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9.4 XVector_0.12.1
[3] GenomicAlignments_1.8.4 splines_3.4.0
[5] zlibbioc_1.20.0 BiocParallel_1.11.2
[7] xtable_1.8-2 lattice_0.20-35
[9] DESeq_1.24.0 tools_3.4.0
[11] SummarizedExperiment_1.2.3 grid_3.4.0
[13] DBI_0.5-1 genefilter_1.54.2
[15] survival_2.40-1 Matrix_1.2-9
[17] rtracklayer_1.34.2 geneplotter_1.50.0
[19] RColorBrewer_1.1-2 bitops_1.0-6
[21] biomaRt_2.30.0 RCurl_1.95-4.8
[23] RSQLite_1.0.0 compiler_3.4.0
[25] Rsamtools_1.26.1 Biostrings_2.40.2
[27] XML_3.98-1.4 annotate_1.50.0
Thank you James, it worked like a charm. Spent the last 48h on this, and finally victory!