Problems with enrichGO
0
0
Entering edit mode
@dfb033f2
Last seen 11 months ago
Finland

Hello everyone! I'm trying to perform GO and KEGG enrichment analysis. I also followed the advice given in this link: "No genes can be mapped...." using enrichGO in clusterProfiler using enrichGO in clusterProfiler But it does not work, can you help me?

Here the script and the error:

  > #import data
> all_genesxlsx <- read_xlsx("D:/R2/Prova/all_genes.xlsx")
> diff_genesxlsx <- read_xlsx("D:/R2/Prova/diff_genes.xlsx")
> 
> 
> #conversion dataframe in vector of character
> all_genes <- as.character(all_genesxlsx)
> diff_genes <- as.character(diff_genesxlsx)
> 
> 
> #control
> is.character(all_genes)
[1] TRUE
> is.character(diff_genes)
[1] TRUE
> 
> 
> # Remove "" an \n
> diff_genes_def <- gsub("[\"\n]", "", diff_genes)
> all_genes_def <- gsub("[\"\n]", "", all_genes)
> print(all_genes_def)
[1] "c(WP_000002828.1, WP_000002852.1, WP_000003115.1, WP_000003406.1, WP_000003532.1, WP_000003544.1, WP_000003555.1, ...)
> print(diff_genes_def)
[1] "c(WP_000002828.1, WP_000002852.1, WP_000003115.1, WP_000003406.1, WP_000003532.1, WP_000003544.1, WP_000003555.1, ...)
> #GO_analysys
> GO_analysis <- enrichGO(gene = diff_genes_def, 
+                         universe = all_genes_def, 
+                         OrgDb = org.Abaumannii.eg.db,
+                         keyType = "REFSEQ",
+                         ont = "CC",              # either "BP", "CC" or "MF",
+                         pAdjustMethod = "none",
+                         pvalueCutoff = 1, 
+                         qvalueCutoff = 1,
+                         readable = FALSE)
--> No gene can be mapped....
--> Expected input gene ID: WP_001075596.1,WP_000959085.1,WP_000220477.1,WP_000776215.1,WP_000091932.1,WP_049581137.1
--> return NULL...

Here the session info:

    > sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=Italian_Italy.utf8  LC_CTYPE=Italian_Italy.utf8   
[3] LC_MONETARY=Italian_Italy.utf8 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.utf8    

time zone: Europe/Rome
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readxl_1.4.3               biomaRt_2.58.2             clusterProfiler_4.10.0    
 [4] org.Abaumannii.eg.db_0.0.1 AnnotationHub_3.10.0       BiocFileCache_2.10.1      
 [7] dbplyr_2.4.0               BiocManager_1.30.22        AnnotationForge_1.44.0    
[10] AnnotationDbi_1.64.1       IRanges_2.36.0             S4Vectors_0.40.2          
[13] Biobase_2.62.0             BiocGenerics_0.48.1       

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3            rstudioapi_0.15.0             jsonlite_1.8.8               
  [4] magrittr_2.0.3                farver_2.1.1                  fs_1.6.3                     
  [7] zlibbioc_1.48.0               vctrs_0.6.5                   memoise_2.0.1                
 [10] RCurl_1.98-1.14               ggtree_3.10.0                 htmltools_0.5.7              
 [13] progress_1.2.3                curl_5.2.0                    cellranger_1.1.0             
 [16] gridGraphics_0.5-1            plyr_1.8.9                    cachem_1.0.8                 
 [19] igraph_2.0.1.1                mime_0.12                     lifecycle_1.0.4              
 [22] pkgconfig_2.0.3               Matrix_1.6-5                  R6_2.5.1                     
 [25] fastmap_1.1.1                 gson_0.1.0                    GenomeInfoDbData_1.2.11      
 [28] shiny_1.8.0                   digest_0.6.34                 aplot_0.2.2                  
 [31] enrichplot_1.22.0             colorspace_2.1-0              patchwork_1.2.0              
 [34] RSQLite_2.3.5                 filelock_1.0.3                fansi_1.0.6                  
 [37] httr_1.4.7                    polyclip_1.10-6               compiler_4.3.2               
 [40] bit64_4.0.5                   withr_3.0.0                   BiocParallel_1.36.0          
 [43] viridis_0.6.5                 DBI_1.2.1                     ggforce_0.4.1                
 [46] MASS_7.3-60.0.1               rappdirs_0.3.3                HDO.db_0.99.1                
 [49] tools_4.3.2                   ape_5.7-1                     scatterpie_0.2.1             
 [52] interactiveDisplayBase_1.40.0 httpuv_1.6.14                 glue_1.7.0                   
 [55] nlme_3.1-164                  GOSemSim_2.28.1               promises_1.2.1               
 [58] grid_4.3.2                    shadowtext_0.1.3              reshape2_1.4.4               
 [61] fgsea_1.28.0                  generics_0.1.3                gtable_0.3.4                 
 [64] tidyr_1.3.1                   data.table_1.15.0             hms_1.1.3                    
 [67] xml2_1.3.6                    tidygraph_1.3.1               utf8_1.2.4                   
 [70] XVector_0.42.0                ggrepel_0.9.5                 BiocVersion_3.18.1           
 [73] pillar_1.9.0                  stringr_1.5.1                 yulab.utils_0.1.4            
 [76] later_1.3.2                   splines_4.3.2                 dplyr_1.1.4                  
 [79] tweenr_2.0.2                  treeio_1.26.0                 lattice_0.22-5               
 [82] bit_4.0.5                     tidyselect_1.2.0              GO.db_3.18.0                 
 [85] Biostrings_2.70.2             gridExtra_2.3                 graphlayouts_1.1.0           
 [88] stringi_1.8.3                 lazyeval_0.2.2                ggfun_0.1.4                  
 [91] yaml_2.3.8                    codetools_0.2-19              ggraph_2.1.0                 
 [94] tibble_3.2.1                  qvalue_2.34.0                 ggplotify_0.1.2              
 [97] cli_3.6.1                     xtable_1.8-4                  munsell_0.5.0                
[100] Rcpp_1.0.12                   GenomeInfoDb_1.38.5           png_0.1-8                    
[103] XML_3.99-0.16.1               parallel_4.3.2                ellipsis_0.3.2               
[106] ggplot2_3.4.4                 blob_1.2.4                    prettyunits_1.2.0            
[109] DOSE_3.28.2                   bitops_1.0-7                  viridisLite_0.4.2            
[112] tidytree_0.4.6                scales_1.3.0                  purrr_1.0.2                  
[115] crayon_1.5.2                  rlang_1.1.3                   cowplot_1.1.3                
[118] fastmatch_1.1-4               KEGGREST_1.42.0   
clusterProfiler enrichGO • 1.5k views
ADD COMMENT
0
Entering edit mode

What do you get from

sum(all_genes %in% keys(org.Abaumanii.eg.db))
ADD REPLY
0
Entering edit mode

I get this:

    > sum(all_genes %in% keys(org.Abaumanii.eg.db))
Error in h(simpleError(msg, call)) : 
  errore durante la valutazione dell'argomento 'table' nella selezione di un metodo per la funzione '%in%': errore durante la valutazione dell'argomento 'x' nella selezione di un metodo per la funzione 'keys': object 'org.Abaumanii.eg.db' not found`

So I do not understand if the file input is wrong, or i the org.package does not contain all the informations. To prepare the input files, I did the following: for all_genes, I took my genome and selected the specific RefSeq for each gene. For diff_genes, on the other hand, I took the list of differentially expressed genes and associated each one with the corresponding RefSeq (based on my genome). I did all of this in Excel.

ADD REPLY
0
Entering edit mode

Oh, I missed the second n in baumannii. Try with the correct spelling (and note that if you get an error that ends with "object 'org.Abaumanii.eg.db' not found", that's a hint that you probably either have a typo or haven't loaded the package yet).

ADD REPLY
0
Entering edit mode

Don't worry, I hadn't noticed it either.

Here is the result:

    > sum(all_genes_def %in% keys(org.Abaumannii.eg.db))
[1] 0
ADD REPLY
0
Entering edit mode

Well, that's the problem then. From a cursory check, it appears that things like WP_000002852.1 aren't actually Gene IDs, but instead are NCBI protein IDs.

ADD REPLY
0
Entering edit mode

Yes, I selected them because when I send the command using REFSEQ as the keytype, it asks for this information. For these reasons, I associated this code WP_000002852.1 to each gene

> #GO_analysys
> GO_analysis <- enrichGO(gene = diff_genes_def, 
+                         universe = all_genes_def, 
+                         OrgDb = org.Abaumannii.eg.db,
+                         keyType = "REFSEQ",
+                         ont = "CC",              # either "BP", "CC" or "MF",
+                         pAdjustMethod = "none",
+                         pvalueCutoff = 1, 
+                         qvalueCutoff = 1,
+                         readable = FALSE)
--> No gene can be mapped....
--> Expected input gene ID: WP_001003277.1,WP_000578583.1,WP_001177530.1,WP_001205590.1,WP_001281765.1,WP_045543779.1
--> return NULL...
ADD REPLY
0
Entering edit mode

OK, then what do you get for

sum(all_genes_def %in%  keys(org.Abaumannii.eg.db, "REFSEQ"))
##or maybe
sum(all_genes_def %in%  keys(org.Abaumannii.eg.db, "ACCNUM"))
ADD REPLY
0
Entering edit mode

I get this

> sum(all_genes_def %in%  keys(org.Abaumannii.eg.db, "REFSEQ"))
[1] 0
> ##or maybe
> sum(all_genes_def %in%  keys(org.Abaumannii.eg.db, "ACCNUM"))
[1] 0
ADD REPLY
0
Entering edit mode

If none of the IDs in your all_genes_def object are in your OrgDb, then no test can be made.

ADD REPLY
0
Entering edit mode

Oh no! Do you think there are problems related to the creation of the orgpackage or the input file?

What should the input file look like? In my case it is just one column with the list of codes.

ADD REPLY
0
Entering edit mode

Probably can I use bitr command? error in getting entrez ids for gene names Or this is not usefull in my case?

ADD REPLY
1
Entering edit mode

The OrgDb packages are just SQLite databases that can be used to (for example, in this case) map certain IDs to GO IDs, which can then be used to do a Fisher's exact test.

The issue is that you have a set of IDs that you want to map, and when you check to see if any of those IDs are contained in the OrgDb, the answer is no. In which case it's impossible to map to the GO IDs, and you therefore cannot do the Fisher's exact test. You need an OrgDb that actually contains the IDs that you want to map for this to work correctly.

Using bitr won't help because that function is just meant to simplify querying an OrgDb. It doesn't 'fix' the problem.

ADD REPLY
0
Entering edit mode

Ok thank you, because I did the org.db of A. baumannii 470, as you suggested here Error MakeOrgPackagefromNCBI. I am not sure about the fact that I don't get the input genes wrong, could you point me to an example?

ADD REPLY
1
Entering edit mode

I also made that package. Here are the seven IDs you present in the first post

> ids <- scan("clipboard", "c")
Read 7 items
> ids
[1] "WP_000002828.1" "WP_000002852.1"
[3] "WP_000003115.1" "WP_000003406.1"
[5] "WP_000003532.1" "WP_000003544.1"
[7] "WP_000003555.1"
## check the OrgDb
> library(org.Abaumannii.eg.db)
> select(org.Abaumannii.eg.db, ids, "ENTREZID","REFSEQ")
'select()' returned 1:1 mapping
between keys and columns
          REFSEQ ENTREZID
1 WP_000002828.1     <NA>
2 WP_000002852.1     <NA>
3 WP_000003115.1 66396598
4 WP_000003406.1 66398643
5 WP_000003532.1     <NA>
6 WP_000003544.1     <NA>
7 WP_000003555.1 66397740
## Only three map Let's try the NCBI.sqlite file
> library(RSQLite)
> ocon <- dbConnect(SQLite(), "NCBI.sqlite")
## check the rna_accession column
> dbGetQuery(ocon, paste0("select * from gene2accession where rna_accession in ('", paste(ids, collapse = "','"), "');"))
 [1] tax_id               
 [2] gene_id              
 [3] status               
 [4] rna_accession        
 [5] rna_gi               
 [6] protein_accession    
 [7] protein_gi           
 [8] genomic_dna_accession
 [9] genomic_dna_gi       
[10] genomic_start        
[11] genomic_end          
[12] orientation          
[13] assembly             
[14] peptide_accession    
[15] peptide_gi           
[16] symbol               
<0 rows> (or 0-length row.names)

## Nothing there. Try protein_accession
> dbGetQuery(ocon, paste0("select * from gene2accession where protein_accession in ('", paste(ids, collapse = "','"), "');"))
  tax_id  gene_id status
1    470 66396598   <NA>
2    470 66397740   <NA>
3    470 66398643   <NA>
  rna_accession rna_gi
1             -      -
2             -      -
3             -      -
  protein_accession protein_gi
1    WP_000003115.1  445925260
2    WP_000003555.1  445925700
3    WP_000003406.1  445925551
  genomic_dna_accession
1         NZ_CP043953.1
2         NZ_CP043953.1
3         NZ_CP043953.1
  genomic_dna_gi genomic_start
1     1751996929       1364290
2     1751996929       2596640
3     1751996929       3566756
  genomic_end orientation assembly
1     1364922           -        -
2     2597830           -        -
3     3567013           -        -
  peptide_accession peptide_gi
1                 -          -
2                 -          -
3                 -          -
         symbol
1 F3P16_RS06395
2 F3P16_RS12215
3          tusA

So from the limited number of IDs I have, only three map to the REFSEQ IDs in the OrgDb, and that matches up with the omnibus SQLite Db we already made. This probably has more to do with what data are available at NCBI for this organism than anything else.

ADD REPLY
0
Entering edit mode

Hi James, thank you so much for your information and for the test you conducted. Since the issue is related to the information available on NCBI, what do you suggest I do? Because I work with this microorganism. Do you think it's possible to contact NCBI support for further information? These might be basic questions, but I'm completely in the dark, and perhaps you've already encountered similar issues.

ADD REPLY

Login before adding your answer.

Traffic: 536 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6