Different annotation packages (org.Mm.eg.db, EnsDb.Mmusculus.v79, BiomaRt) returning drastically different numbers of transcripts
1
0
Entering edit mode
Nicholas • 0
@3611f731
Last seen 1 hour ago
United States

I am currently in the process importing Kallisto data for a mouse RNAseq experiment using the tximport package and am creating tx2gene objects to use for mapping. I decided to try three different annotation packages: (1) org.Mm.eg.db, (2) EnsDb.Mmusculus.v79, and (3) biomaRt. However, when I ran the code (shown below) to extract the transcript to gene mappings, I only returned ~27k transcripts when using org.Mm.eg.db, but returned ~104k when using EnsDb.Mmusculus.v79 and ~280k when using biomaRt. Why are there such drastic differences?

I understand that biomaRt will always be the most up to date since it is pulled directly from Ensembl, but I was curious why org.Mm.eg.db had so few transcripts returned. Although the code below pertains to creating tx2gene objects, my question is more broadly related to annotation packages in general.

Code for getting transcripts with org.Mm.eg.db:

# Using org.Mm.eg.db
keys <- keys(org.Mm.eg.db, keytype = 'ENSEMBLTRANS')

tx2gene.org <- select(org.Mm.eg.db, keys, columns = c('ENSEMBLTRANS', 'SYMBOL'), keytype = 'ENSEMBLTRANS') %>%
  dplyr::rename(target_id = ENSEMBLTRANS, gene_name = SYMBOL)

length(keys)
dim(tx2gene.org)

> length(keys)
[1] 26481
> dim(tx2gene.org)
[1] 26909     2

Code for getting transcripts with EnsDb.Mmusculus.v79:

# Using EnsDb.Mmusculus.v79

tx2gene.ensdb <- transcripts(EnsDb.Mmusculus.v79, columns=c("tx_id", "gene_name")) %>%
  dplyr::as_tibble() %>%
  dplyr::rename(target_id = tx_id) %>%
  dplyr::select(target_id, gene_name)

dim(tx2gene.ensdb)

> dim(tx2gene.ensdb)
[1] 104129      2
`

Code for getting transcripts with biomaRt:

# Using biomaRt 
myMart <- useEnsembl(biomart = 'genes', dataset = 'mmusculus_gene_ensembl', mirror = 'useast')

tx2gene.bm <- getBM(mart = myMart, attributes = c('ensembl_transcript_id_version', 'external_gene_name')) %>%
  as_tibble() %>%
  dplyr::rename(target_id = ensembl_transcript_id_version, 
                gene_name = external_gene_name)

dim(tx2gene.bm)

> dim(tx2gene.bm)
[1] 278375      2
> sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.62.0             EnsDb.Mmusculus.v79_2.99.0 org.Mm.eg.db_3.20.0        enrichplot_1.26.5         
 [5] msigdbr_7.5.1              clusterProfiler_4.14.4     gprofiler2_0.2.3           GSVA_2.0.4                
 [9] GSEABase_1.68.0            graph_1.84.0               annotate_1.84.0            XML_3.99-0.18             
[13] gplots_3.2.0               RColorBrewer_1.1-3         d3heatmap_0.9.0            DT_0.33                   
[17] gt_0.11.1                  plotly_4.10.4              cowplot_1.1.3              edgeR_4.4.1               
[21] limma_3.62.1               datapasta_3.1.0            beepr_2.0                  EnsDb.Hsapiens.v86_2.99.0 
[25] ensembldb_2.30.0           AnnotationFilter_1.30.0    GenomicFeatures_1.58.0     AnnotationDbi_1.68.0      
[29] Biobase_2.66.0             GenomicRanges_1.58.0       GenomeInfoDb_1.42.1        IRanges_2.40.1            
[33] S4Vectors_0.44.0           BiocGenerics_0.52.0        tximport_1.34.0            lubridate_1.9.4           
[37] forcats_1.0.0              stringr_1.5.1              dplyr_1.1.4                purrr_1.0.2               
[41] readr_2.1.5                tidyr_1.3.1                tibble_3.2.1               ggplot2_3.5.1             
[45] tidyverse_2.0.0            rhdf5_2.50.1              

loaded via a namespace (and not attached):
  [1] splines_4.4.2               later_1.4.1                 BiocIO_1.16.0               filelock_1.0.3             
  [5] ggplotify_0.1.2             bitops_1.0-9                R.oo_1.27.0                 httr2_1.0.7                
  [9] lifecycle_1.0.4             lattice_0.22-6              magrittr_2.0.3              yaml_2.3.10                
 [13] ggtangle_0.0.6              httpuv_1.6.15               DBI_1.2.3                   abind_1.4-8                
 [17] pkgload_1.4.0               zlibbioc_1.52.0             audio_0.1-11                R.utils_2.12.3             
 [21] RCurl_1.98-1.16             yulab.utils_0.1.9           rappdirs_0.3.3              GenomeInfoDbData_1.2.13    
 [25] ggrepel_0.9.6               irlba_2.3.5.1               tidytree_0.4.6              codetools_0.2-20           
 [29] DelayedArray_0.32.0         DOSE_4.0.0                  xml2_1.3.6                  tidyselect_1.2.1           
 [33] aplot_0.2.4                 farver_2.1.2                UCSC.utils_1.2.0            ScaledMatrix_1.14.0        
 [37] BiocFileCache_2.14.0        matrixStats_1.5.0           base64enc_0.1-3             GenomicAlignments_1.42.0   
 [41] jsonlite_1.8.9              progress_1.2.3              tools_4.4.2                 treeio_1.30.0              
 [45] Rcpp_1.0.13-1               glue_1.8.0                  SparseArray_1.6.0           qvalue_2.38.0              
 [49] MatrixGenerics_1.18.0       HDF5Array_1.34.0            withr_3.0.2                 BiocManager_1.30.25        
 [53] fastmap_1.2.0               rhdf5filters_1.18.0         caTools_1.18.3              digest_0.6.37              
 [57] rsvd_1.0.5                  gridGraphics_0.5-1          timechange_0.3.0            R6_2.5.1                   
 [61] mime_0.12                   colorspace_2.1-2            GO.db_3.20.0                gtools_3.9.5               
 [65] RSQLite_2.3.9               R.methodsS3_1.8.2           utf8_1.2.4                  generics_0.1.3             
 [69] data.table_1.16.4           rtracklayer_1.66.0          prettyunits_1.2.0           httr_1.4.7                 
 [73] htmlwidgets_1.6.4           S4Arrays_1.6.0              pkgconfig_2.0.3             gtable_0.3.6               
 [77] blob_1.2.4                  SingleCellExperiment_1.28.1 XVector_0.46.0              htmltools_0.5.8.1          
 [81] fgsea_1.32.0                ProtGenerics_1.38.0         scales_1.3.0                png_0.1-8                  
 [85] SpatialExperiment_1.16.0    ggfun_0.1.8                 rstudioapi_0.17.1           tzdb_0.4.0                 
 [89] reshape2_1.4.4              rjson_0.2.23                nlme_3.1-166                curl_6.1.0                 
 [93] cachem_1.1.0                KernSmooth_2.23-26          parallel_4.4.2              restfulr_0.0.15            
 [97] pillar_1.10.1               grid_4.4.2                  vctrs_0.6.5                 promises_1.3.2             
[101] BiocSingular_1.22.0         dbplyr_2.5.0                beachmat_2.22.0             xtable_1.8-6               
[105] magick_2.8.5                cli_3.6.3                   locfit_1.5-9.10             compiler_4.4.2             
[109] Rsamtools_2.22.0            rlang_1.1.4                 crayon_1.5.3                plyr_1.8.9                 
[113] fs_1.6.5                    stringi_1.8.4               viridisLite_0.4.2           BiocParallel_1.40.0        
[117] babelgene_22.9              munsell_0.5.1               Biostrings_2.74.1           lazyeval_0.2.2             
[121] GOSemSim_2.32.0             Matrix_1.7-1                patchwork_1.3.0             hms_1.1.3                  
[125] sparseMatrixStats_1.18.0    bit64_4.5.2                 Rhdf5lib_1.28.0             KEGGREST_1.46.0            
[129] statmod_1.5.0               shiny_1.10.0                SummarizedExperiment_1.36.0 igraph_2.1.2               
[133] memoise_2.0.1               ggtree_3.14.0               fastmatch_1.1-6             bit_4.5.0.1                
[137] gson_0.1.0                  ape_5.8-1
AnnotationDbi EnsDb.Mmusculus.v79 biomaRt org.Mm.eg.db • 145 views
ADD COMMENT
0
Entering edit mode

Just an update on this: I tried tximport with tx2gene objects from the three annotation packages and from the GTF file itself. The best performance was using the GTF file itself, thank you ATpoint for the suggestion. This begs the question: why use annotation packages at all, when the GTF file gives the best performance?

It makes sense that the GTF file would yield the best results as you create the kallisto index from the fasta file that is linked to that GTF. Should this be the standard for annotating kallisto inputs with tximport?

Code creating tx2gene object from GTF file:

# Getting annotations directly from gtf file ----
gtf.mm.ensembl <- rtracklayer::import('Mus_musculus.GRCm39.113.gtf') 

tx2gene.gtf <- mcols(gtf.mm.ensembl)[,c(7,10)] %>%
  as_tibble() %>%
  dplyr::rename(gene_name = gene_name, target_id = transcript_id) %>%
  dplyr::select(target_id, gene_name) %>%
  na.omit() %>%
  distinct(target_id, .keep_all = T)
ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 4 hours ago
United States

The reason you get fewer mappings using the org.Mm.eg.db package is because that package is based on annotations from NCBI, but you are using Ensembl IDs. The two annotation services use different methods to determine where the genes are, how many transcripts each have, etc. Because of that, there isn't very good consistency between NCBI and Ensembl IDs

There is an ongoing project called MANE where NCBI and Ensembl/Genecode are collaborating to find a single transcript for each human gene that they can agree is the same thing, which indicates how difficult it is to map between annotation services.

It has been my longstanding contention around these parts that you should stick with whatever IDs you started with, because cross-mapping is a fool's errand. If you have Ensembl transcripts, stick with an EnsDb or biomaRt.

1
Entering edit mode

that you should stick with whatever IDs you started with,

Meaning, you got your fasta file for the kallisto index from somewhere, e.g. Ensembl or GENCODE, so I recommend to take the GTF file from there and make the tx2gene from this GTF, ensuring that the annotation version is the same. Eg transcript fasta from GENCODE vM23, then take the vM23 GTF, and none of these annotation packages. This only creates trouble, unless you obtained the transcript sequences exactly from the same package.

ADD REPLY
0
Entering edit mode

I agree that it's difficult to match an EnsDb object to the build, but I think that is orthogonal to OP's issue, which is mapping transcript IDs to gene symbols, and for which an EnsDb or biomaRt will be required.

ADD REPLY
0
Entering edit mode

Got it, so using the GTF directly rather than any of the annotation packages will yield more accurate results than any of the packages? I got my fasta file from Ensembl.

ADD REPLY

Login before adding your answer.

Traffic: 685 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6