Question

Use of org.Rn.eg.db with GRCr8 reference genome

0

Entering edit mode

Erika • 0

@bd3dd3f9

Last seen 4 weeks ago

United States

Can a count matrix made with GRCr8 reference genome be annotated with org.Rn.eg.db? My attempt to add gene name annotations failed, as described below.

UPDATE: the gene ids are not in ensembl format, so perhaps I need to use a different annotation file? Or a different key?

GRCr8 genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_036323735.1/

Files used to make STAR reference genome: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/323/735/GCF_036323735.1_GRCr8/GCF_036323735.1_GRCr8_genomic.gtf.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/323/735/GCF_036323735.1_GRCr8/GCF_036323735.1_GRCr8_genomic.fna.gz

# Count matrix
> slice_sample(adult_matrix, n=10)
              07  09  58  08  59  03  06  12  05  04  01  10  02  11  60  57
LOC120093222   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Dcaf17        97  91  74  54  94  76 111 105  70  82 103  68 105  97  69  85
LOC134483108   0   1   0   0   0   1   0   0   0   0   0   0   1   0   0   2
Itga2b        17  23  21  26  29  25  27  39  16  30  33  21  24  25  23  12
Naa20        244 335 191 243 279 221 242 380 210 217 286 220 291 242 202 177
LOC134481053   1   1   0   1   1   1   0   0   0   0   0   0   0   0   1   2
Helb          67  72  39  38  56  54  46  69  41  32  58  31  50  50  33  46
LOC100911384   1   0   0   0   1   1   1   1   0   1   0   0   0   0   1   0
Slc5a7        16   9   4   7   9   6  13   8   9   5  12  10   4   7  41   9
LOC120099428   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

# Meta data for count matrix
> slice_sample(adult_meta, n = 10)
   subject diet   age timepoint
1       09  CON Adult       T28
2       10  CON Adult       T28
3       58 EtOH Adult       T28
4       07  CON Adult       T28
5       59  CON Adult       T28
6       01 EtOH Adult       T28
7       05 EtOH Adult       T28
8       08  CON Adult       T28
9       60  CON Adult       T28
10      57 EtOH Adult       T28

# constructing DeSeq object 
adult_dds <- DESeqDataSetFromMatrix(countData = adult_matrix,
                       colData = adult_meta, 
                       design = ~diet)

# prefiltering, only keeping genes with more than XX reads
adult_keep <- rowSums(counts(adult_dds)) >= 10
adult_dds <- adult_dds[adult_keep,]

# running DESeq
adult_dds <- DESeq(adult_dds)
adult_res <- results(adult_dds, contrast = c("diet", "CON", "EtOH"), alpha = 0.05)
adult_res

summary(adult_res)

# genes ranked by p-value
adult_res_ordered <- adult_res[order(adult_res$pvalue),] #padj?
head(adult_res_ordered)

# DF and gene name annotations
adult_res_df <- as.data.frame(adult_res_ordered)
adult_res_df$symbol <- mapIds(org.Rn.eg.db, keys = rownames(adult_res_df), keytype = "ENSEMBL", column = "SYMBOL")
# Error in .testForValidKeys(x, keys, keytype, fks) : 
  # None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.

# compare rownames (genes) to ensembl_ids
> rownames(adult_res_df)
   [1] "Sncb"         "Ptp4a3"       "Zmiz2"        "Klc4"         "Nicol1"       "Stard10"      "Rpsa"         "Clip3"       
   [9] "Pld3"         "Rnasek"

> available_ensembl_ids <- keys(org.Rn.eg.db, keytype = "ENSEMBL")

> available_ensembl_ids[1:10]
 [1] "ENSRNOG00000017701" "ENSRNOG00000028896" "ENSRNOG00000032908" "ENSRNOG00000009845" "ENSRNOG00000016924" "ENSRNOG00000005260"
 [7] "ENSRNOG00000013594" "ENSRNOG00000029762" "ENSRNOG00000010265" "ENSRNOG00000049882"

# incompatible formats

> sessionInfo( )
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.7.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.3.0                ggprism_1.0.5               EnhancedVolcano_1.24.0      ggrepel_0.9.6              
 [5] org.Rn.eg.db_3.20.0         AnnotationDbi_1.68.0        DESeq2_1.46.0               SummarizedExperiment_1.36.0
 [9] Biobase_2.66.0              MatrixGenerics_1.18.1       matrixStats_1.5.0           GenomicRanges_1.58.0       
[13] GenomeInfoDb_1.42.3         IRanges_2.40.1              S4Vectors_0.44.0            BiocGenerics_0.52.0        
[17] BiocManager_1.30.25         afex_1.4-1                  lme4_1.1-36                 Matrix_1.7-3               
[21] rstatix_0.7.2               emmeans_1.10.7              lubridate_1.9.4             forcats_1.0.0              
[25] stringr_1.5.1               dplyr_1.1.4                 purrr_1.0.4                 readr_2.1.5                
[29] tidyr_1.3.1                 tibble_3.2.1                ggplot2_3.5.1               tidyverse_2.0.0            

loaded via a namespace (and not attached):
 [1] DBI_1.2.3               rlang_1.1.5             magrittr_2.0.3          compiler_4.4.1          RSQLite_2.3.9          
 [6] png_0.1-8               vctrs_0.6.5             reshape2_1.4.4          pkgconfig_2.0.3         crayon_1.5.3           
[11] fastmap_1.2.0           backports_1.5.0         XVector_0.46.0          labeling_0.4.3          utf8_1.2.4             
[16] rmarkdown_2.29          tzdb_0.4.0              UCSC.utils_1.2.0        nloptr_2.2.0            bit_4.6.0              
[21] xfun_0.51               cachem_1.1.0            zlibbioc_1.52.0         jsonlite_1.9.1          blob_1.2.4             
[26] DelayedArray_0.32.0     BiocParallel_1.40.0     broom_1.0.7             parallel_4.4.1          R6_2.6.1               
[31] stringi_1.8.4           RColorBrewer_1.1-3      car_3.1-3               boot_1.3-31             numDeriv_2016.8-1.1    
[36] estimability_1.5.1      Rcpp_1.0.14             knitr_1.49              splines_4.4.1           timechange_0.3.0       
[41] tidyselect_1.2.1        rstudioapi_0.17.1       abind_1.4-8             yaml_2.3.10             codetools_0.2-20       
[46] lattice_0.22-6          lmerTest_3.1-3          plyr_1.8.9              KEGGREST_1.46.0         withr_3.0.2            
[51] evaluate_1.0.3          Biostrings_2.74.1       pillar_1.10.1           carData_3.0-5           generics_0.1.3         
[56] vroom_1.6.5             hms_1.1.3               munsell_0.5.1           minqa_1.2.8             glue_1.8.0             
[61] tools_4.4.1             locfit_1.5-9.12         mvtnorm_1.3-3           grid_4.4.1              colorspace_2.1-1       
[66] nlme_3.1-167            GenomeInfoDbData_1.2.13 Formula_1.2-5           cli_3.6.4               fansi_1.0.6            
[71] S4Arrays_1.6.0          gtable_0.3.6            digest_0.6.37           SparseArray_1.6.2       farver_2.1.2           
[76] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7              bit64_4.6.0-1          
[81] MASS_7.3-65

DESeq2 org.Rn.eg.db • 800 views

ADD COMMENT • link updated 5 weeks ago by James W. MacDonald 68k • written 5 weeks ago by Erika • 0

score 1 · Answer 1 · 2025-03-14

You've got it backwards. It should be

adult_res_df$symbol <- mapIds(org.Rn.eg.db, keys = rownames(adult_res_df), column = "ENSEMBL", keytype = "SYMBOL")

The keys in this instance are the symbols, so the keytype is SYMBOL, not ENSEMBL

You might get good mapping, but maybe not. What's happening under the hood is that the symbols are mapped to NCBI Gene IDs, and then mapped to Ensembl IDs. But there are any number of NCBI Gene IDs that don't map to Ensembl IDs, whereas the symbol to Ensembl mapping might. You can get the direct mappings in two different ways. You can use an EnsDb from the AnnotationHub, or use biomaRt.

> library(AnnotationHub)
> hub <- AnnotationHub()
  |===========================| 100%

snapshotDate(): 2024-10-28

## find an ensdb
> query(hub, c("rattus norvegicus","ensdb"))
AnnotationHub with 39 records
# snapshotDate(): 2024-10-28
# $dataprovider: Ensembl
# $species: Rattus norvegicus
# $rdataclass: EnsDb
# additional mcols():
#   taxonomyid, genome,
#   description,
#   coordinate_1_based,
#   maintainer, rdatadateadded,
#   preparerclass, tags,
#   rdatapath, sourceurl,
#   sourcetype 
# retrieve records with, e.g.,
#   'object[["AH53239"]]' 


  AH53239  |
  AH53743  |
  AH56709  |
  AH57792  |
  AH60814  |
  ...       
  AH116989 |
  AH119434 |
  AH119435 |
  AH119436 |
  AH119437 |
           title                   
  AH53239  Ensembl 87 EnsDb for ...
  AH53743  Ensembl 88 EnsDb for ...
  AH56709  Ensembl 89 EnsDb for ...
  AH57792  Ensembl 90 EnsDb for ...
  AH60814  Ensembl 91 EnsDb for ...
  ...      ...                     
  AH116989 Ensembl 112 EnsDb for...
  AH119434 Ensembl 113 EnsDb for...
  AH119435 Ensembl 113 EnsDb for...
  AH119436 Ensembl 113 EnsDb for...
  AH119437 Ensembl 113 EnsDb for...

## That's a lot. Let's use the latest
> ensdb <- hub[["AH119437"]]
downloading 1 resources
retrieving 1 resource
  |===========================| 100%

loading from cache
require("ensembldb")
Warning message:
package 'GenomeInfoDb' was built under R version 4.4.2 

## now query 
> select(ensdb, ids, "GENEID","GENENAME")
  GENENAME             GENEID
1     Sncb ENSRNOG00000018039
2   Ptp4a3 ENSRNOG00000007628
3    Zmiz2 ENSRNOG00000025923
4     Klc4 ENSRNOG00000018168
5   Nicol1 ENSRNOG00000065022
6  Stard10 ENSRNOG00000019491
7     Rpsa ENSRNOG00000018645
8    Clip3 ENSRNOG00000050104

The alternative is biomaRt, but right now you cannot even connect to ensembl.org, so that's off the table.