Question

tximport::summarizeToGene fails to ignoreTxVersion


Benjamin • 0

@benjamin-21234

Last seen 5.8 years ago

TU Dortmund

Using the salmon transcript-level quantification files from airway2 (https://github.com/mikelove/airway2/tree/master/inst), the following is supposed to strip the version suffix from the Ensembl gene IDs:

library(tximeta)

srrs = list.files(path = "~/Downloads/airway2-master/inst/extdata/quants/", full.names = TRUE)

txm_raw = tximeta(file.path(srrs, "quant.sf.gz"), type = "salmon") 
txm_con = summarizeToGene(txm_raw, ignoreTxVersion = TRUE)

loading existing TxDb created: 2019-07-02 20:01:47
obtaining transcript-to-gene mapping from TxDb
Error in .local(object, ...) : 

  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

Example IDs (file): [ENST00000456328, ENST00000450305, ENST00000488147, ...]

Example IDs (tx2gene): [ENST00000456328.2, ENST00000450305.2, ENST00000473358.1, ...]

  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.
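The mismatch reported in the error can be reproduced in base R without any Bioconductor packages. The following is a minimal sketch that uses only the example IDs quoted in the error message; the names `file_ids`, `tx2gene_ids` and `strip_version` are made up for illustration, and this is not a copy of tximport's internal matching code. It suggests that the lookup can only succeed when the version suffix is removed from both sides.

```r
# Example IDs copied from the error message above
file_ids    <- c("ENST00000456328", "ENST00000450305", "ENST00000488147")
tx2gene_ids <- c("ENST00000456328.2", "ENST00000450305.2", "ENST00000473358.1")

# Direct comparison: unversioned IDs are never found among versioned ones
direct <- file_ids %in% tx2gene_ids

# Hypothetical helper: drop a trailing ".<digits>" version suffix
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Compare again after stripping the version from both sides
stripped <- strip_version(file_ids) %in% strip_version(tx2gene_ids)

print(direct)
print(stripped)
```

Running the same kind of check on the real transcript IDs and the first column of the real tx2gene table should show which side still carries the version.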

Similarly, the lookup fails with mapIds from AnnotationDbi, where an option like ignoreTxVersion would also be useful.

Current work-around:

names(GRanges_object) <- sapply(names(GRanges_object), function(x) unlist(strsplit(x, "\\."))[[1]])
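The same idea can be written without `sapply`, because `sub` is already vectorized. This is a small base R sketch, assuming the version is a dot followed by digits at the end of each ID; `ids` is a stand-in for `names(GRanges_object)` and holds made-up example input.

```r
# Stand-in for names(GRanges_object): two versioned IDs and one without a version
ids <- c("ENSG00000000003.14", "ENSG00000000005.5", "ENSG00000000419")

# Remove a trailing ".<digits>" version; IDs without a version are left unchanged
unversioned <- sub("\\.[0-9]+$", "", ids)

print(unversioned)
```

Applied to the object above, this would read `names(GRanges_object) <- sub("\\.[0-9]+$", "", names(GRanges_object))`. Unlike the `strsplit` version, which cuts at the first dot, this pattern only removes a numeric suffix at the very end of the ID.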

Session Info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] GenomicFeatures_1.36.3 AnnotationDbi_1.46.0   Biobase_2.44.0        
[4] GenomicRanges_1.36.0   GenomeInfoDb_1.20.0    IRanges_2.18.1        
[7] S4Vectors_0.22.0       BiocGenerics_0.30.0    tximeta_1.2.1         

loaded via a namespace (and not attached):
 [1] httr_1.4.0                  tidyr_0.8.3                 vsn_3.52.0                 
 [4] bit64_0.9-7                 jsonlite_1.6                assertthat_0.2.1           
 [7] BiocManager_1.30.4          affy_1.62.0                 BiocFileCache_1.8.0        
[10] blob_1.1.1                  GenomeInfoDbData_1.2.1      Rsamtools_2.0.0            
[13] progress_1.2.2              pillar_1.4.2                RSQLite_2.1.1              
[16] lattice_0.20-38             limma_3.40.2                glue_1.3.1                 
[19] digest_0.6.19               XVector_0.24.0              colorspace_1.4-1           
[22] preprocessCore_1.46.0       Matrix_1.2-17               XML_3.98-1.20              
[25] pkgconfig_2.0.2             biomaRt_2.40.1              zlibbioc_1.30.0            
[28] purrr_0.3.2                 scales_1.0.0                affyio_1.54.0              
[31] BiocParallel_1.18.0         tibble_2.1.3                AnnotationFilter_1.8.0     
[34] ggplot2_3.2.0               SummarizedExperiment_1.14.0 lazyeval_0.2.2             
[37] magrittr_1.5                crayon_1.3.4                memoise_1.1.0              
[40] tools_3.6.0                 prettyunits_1.0.2           hms_0.4.2                  
[43] matrixStats_0.54.0          stringr_1.4.0               munsell_0.5.0              
[46] DelayedArray_0.10.0         ensembldb_2.8.0             Biostrings_2.52.0          
[49] compiler_3.6.0              rlang_0.4.0                 grid_3.6.0                 
[52] RCurl_1.95-4.12             tximport_1.12.3             rstudioapi_0.10            
[55] rappdirs_0.3.1              bitops_1.0-6                gtable_0.3.0               
[58] DBI_1.0.0                   curl_3.3                    R6_2.4.0                   
[61] GenomicAlignments_1.20.1    knitr_1.23                  dplyr_0.8.2                
[64] rtracklayer_1.43.3          bit_1.1-14                  ProtGenerics_1.16.0        
[67] readr_1.3.1                 stringi_1.4.3               Rcpp_1.0.1                 
[70] dbplyr_1.4.2                tidyselect_0.2.5            xfun_0.8

tximport tximeta AnnotationDbi • 2.6k views


score 1 · Answer 1 · 2019-07-04

I think the quantification files do include the version suffix, so it works with the defaults (shown here with the airway2 files):

> coldata <- data.frame(files=file.path(list.files(),"quant.sf.gz"), names=letters[1:8])
> se <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8
found matching transcriptome:
[ Gencode - Homo sapiens - release 27 ]
loading existing TxDb created: 2018-10-25 12:50:54
loading existing transcript ranges created: 2019-05-20 15:35:22
fetching genome info

> gse <- summarizeToGene(se)
loading existing TxDb created: 2018-10-25 12:50:54
obtaining transcript-to-gene mapping from TxDb
loading existing gene ranges created: 2019-07-04 13:15:09
summarizing abundance
summarizing counts
summarizing length

> gse <- addIds(gse, "ENTREZID")
mapping to new IDs using 'org.Hs.eg.db' data package
if all matching IDs are desired, and '1:many mappings' are reported,
set multiVals='list' to obtain all the matching IDs
it appears the rows are gene IDs, setting 'gene' to TRUE
'select()' returned 1:many mapping between keys and columns

> mcols(gse)
DataFrame with 58288 rows and 2 columns
                              gene_id    ENTREZID
                          <character> <character>
ENSG00000000003.14 ENSG00000000003.14        7105
ENSG00000000005.5   ENSG00000000005.5       64102
ENSG00000000419.12 ENSG00000000419.12        8813
ENSG00000000457.13 ENSG00000000457.13       57147
ENSG00000000460.16 ENSG00000000460.16       55732