Question

Error for AnnotationForge makeOrgPackageFromNCBI function

Gayatri • 0

India

Hello everyone,

I am a PhD student running GSEA on Candida albicans data. I queried AnnotationHub for an existing OrgDb record and found none, so I am trying to build an organism database for C. albicans. I have gone through the threads "Problem making orgdb package for bacteria (Pseudomonas) using annotation hub and annotation forge", "error with downloading NCBI data for makeorgpackagefromNCBI?", and "AnnotationForge not working for building custom org packages", but I am still encountering the errors below. I tried deleting the NCBI.sqlite file and downloading the files directly from https://ftp.ncbi.nlm.nih.gov/gene/DATA/, but to no avail. I also tried raising the timeout option to 10000. Any help would be highly appreciated.


> hub <- AnnotationHub()
  |=====================================================================================| 100%

snapshotDate(): 2023-10-23
> query(hub, c("OrgDb","Candida albicans"))
AnnotationHub with 0 records
# snapshotDate(): 2023-10-23

> getOption('timeout')
[1] 60
> options(timeout = 10000)
> getOption('timeout')
[1] 10000

> makeOrgPackageFromNCBI("0.1", 
+                        "Gayatri <gayatri@catg.edu.in>", 
+                        "Gayatri", 
+                        ".", 
+                        "237561", 
+                        "Candida", 
+                        "albicans", 
+                        rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed

> list.files()
[1] "gene_info.gz"      "gene2accession.gz" "gene2go.gz"        "gene2pubmed.gz"   
[5] "gene2refseq.gz" 

> sessionInfo()
R version 4.3.3 (2024-02-29 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnotationForge_1.44.0 biomaRt_2.58.2         AnnotationHub_3.10.1  
 [4] BiocFileCache_2.10.2   dbplyr_2.5.0           GenomeInfoDb_1.38.8   
 [7] AnnotationDbi_1.64.1   IRanges_2.36.0         S4Vectors_0.40.2      
[10] Biobase_2.62.0         BiocGenerics_0.48.1   

loaded via a namespace (and not attached):
 [1] KEGGREST_1.42.0               vctrs_0.6.5                   tools_4.3.3                  
 [4] bitops_1.0-7                  generics_0.1.3                curl_5.2.1                   
 [7] tibble_3.2.1                  fansi_1.0.6                   RSQLite_2.3.6                
[10] blob_1.2.4                    pkgconfig_2.0.3               lifecycle_1.0.4              
[13] GenomeInfoDbData_1.2.11       compiler_4.3.3                stringr_1.5.1                
[16] Biostrings_2.70.3             progress_1.2.3                httpuv_1.6.15                
[19] htmltools_0.5.8.1             RCurl_1.98-1.14               yaml_2.3.8                   
[22] interactiveDisplayBase_1.40.0 pillar_1.9.0                  later_1.3.2                  
[25] crayon_1.5.2                  cachem_1.0.8                  mime_0.12                    
[28] tidyselect_1.2.1              digest_0.6.35                 stringi_1.8.3                
[31] dplyr_1.1.4                   BiocVersion_3.18.1            fastmap_1.1.1                
[34] cli_3.6.2                     magrittr_2.0.3                XML_3.99-0.16.1              
[37] utf8_1.2.4                    prettyunits_1.2.0             filelock_1.0.3               
[40] promises_1.3.0                rappdirs_0.3.3                bit64_4.0.5                  
[43] XVector_0.42.0                httr_1.4.7                    bit_4.0.5                    
[46] png_0.1-8                     hms_1.1.3                     memoise_2.0.1                
[49] shiny_1.8.1.1                 rlang_1.1.3                   Rcpp_1.0.12                  
[52] xtable_1.8-4                  glue_1.7.0                    DBI_1.2.2                    
[55] xml2_1.3.6                    BiocManager_1.30.22           R6_2.5.1                     
[58] zlibbioc_1.48.2              
Warning message:
call dbDisconnect() when finished working with a connection

OrgDb AnnotationForge • 836 views

ADD COMMENT • link updated 6 weeks ago by James W. MacDonald 67k • written 7 months ago by Gayatri • 0

score 0 · Answer 1 · 2024-04-22

James W. MacDonald 67k

United States

That error indicates that your NCBI.sqlite database is missing the gene2pubmed table, so you should regenerate that db. Just delete it and then rerun the script as you already have.
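Before deleting the file, it can help to confirm which tables are actually missing. The following is a minimal sketch, not part of AnnotationForge: `missing_ncbi_tables()` is a hypothetical helper, and the list of expected tables assumes each table is named after its download file without the `.gz` suffix (the error messages in this thread confirm that only for gene2pubmed and gene2refseq).

```r
library(DBI)
library(RSQLite)

# Assumption: one table per downloaded file, named without the .gz suffix.
expected_tables <- c("gene2pubmed", "gene2accession", "gene2refseq",
                     "gene_info", "gene2go")

# Hypothetical helper: return the expected tables that are absent from the db.
missing_ncbi_tables <- function(db_path, expected = expected_tables) {
  # Guard first: dbConnect() would create an empty file if none exists.
  if (!file.exists(db_path)) return(expected)
  con <- dbConnect(SQLite(), db_path)
  on.exit(dbDisconnect(con))  # always close the connection
  setdiff(expected, dbListTables(con))
}

missing <- missing_ncbi_tables("NCBI.sqlite")
if (length(missing) > 0) {
  message("Missing tables: ", paste(missing, collapse = ", "))
}
```

If any table is reported missing, the cache is incomplete and deleting NCBI.sqlite before rerunning is the fix described above.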

ADD COMMENT • link 7 months ago James W. MacDonald 67k

Hello James,

How do I regenerate the db? By running the makeOrgPackageFromNCBI() command? I have tried that, first by deleting just the NCBI.sqlite file and rerunning the command, and then by deleting both NCBI.sqlite and the gene2pubmed file (which is around 180 MB). But I still get an error saying that the gene2accession file was only partially transferred, because that file is re-downloaded too, even though I had already downloaded it before running the command.

ADD REPLY • link 7 months ago Gayatri • 0

I don't really follow what you are saying. All you have to do is delete the NCBI.sqlite db and re-run the script exactly as you did above. If you say rebuildCache = FALSE you shouldn't download anything. And the error you got before didn't say anything about downloading files. It said that you were missing the gene2pubmed table.

ADD REPLY • link 7 months ago James W. MacDonald 67k

What I meant is that even after deleting the NCBI.sqlite file and re-running the script, an empty NCBI.sqlite file (0 KB) is created, which causes the error I mentioned:

preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
Error: no such table: main.gene2pubmed

I made sure any partially created NCBI.sqlite files were deleted, then downloaded the data directly from NCBI, and only then re-ran the script. But the creation of this empty NCBI.sqlite file causes the script to terminate. What do you suggest I do about this?
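For readers fetching the files by hand, here is a rough sketch of doing it from R with a raised timeout. The base URL and file names come from this thread; `incomplete_files()` and `fetch_ncbi_files()` are hypothetical helpers written for illustration. Note the size check only catches missing or zero-byte files, so a download that was cut off partway would still pass it.

```r
# Base URL from the original post; file names from the download log.
base_url <- "https://ftp.ncbi.nlm.nih.gov/gene/DATA/"
ncbi_files <- c("gene2pubmed.gz", "gene2accession.gz", "gene2refseq.gz",
                "gene_info.gz", "gene2go.gz")

# Hypothetical helper: list the files that are missing or zero bytes in `dir`.
incomplete_files <- function(dir = ".", files = ncbi_files) {
  sizes <- file.size(file.path(dir, files))  # NA when a file is absent
  files[is.na(sizes) | sizes == 0]
}

# Hypothetical helper: download the given files into `dir` with a raised
# timeout, restoring the previous timeout option afterwards.
fetch_ncbi_files <- function(files, dir = ".", timeout = 10000) {
  old <- options(timeout = timeout)
  on.exit(options(old))
  for (f in files) {
    # mode = "wb" keeps the gzip files binary-safe on Windows
    download.file(paste0(base_url, f), file.path(dir, f), mode = "wb")
  }
  invisible(file.path(dir, files))
}

# Usage (needs network access, so it is left commented out):
# fetch_ncbi_files(incomplete_files("."))
```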

ADD REPLY • link 7 months ago Gayatri • 0

That's weird. I don't have any problem at all generating the OrgDb on my box. How big is the gene2pubmed.gz file? I get this:

gzip -dc gene2pubmed.gz | wc -l
57054066

So just over 57M rows. I get fewer from the NCBI.sqlite file, but it's definitely there.

> library(RSQLite)
Warning message:
package 'RSQLite' was built under R version 4.3.2 
> con <- dbConnect(SQLite(), "NCBI.sqlite")
> dbGetQuery(con, "select count(*) from gene2pubmed;") 
  count(*)
1  4843195
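On Windows, where `gzip` and `wc` may not be available, the same line count can be approximated from within R. This is a small sketch; `count_gz_lines()` is a hypothetical helper, not an AnnotationForge function. A count far below the figure shown above would suggest a truncated download.

```r
# Hypothetical helper: count lines in a gzip-compressed text file,
# reading in chunks so the whole file is never held in memory.
count_gz_lines <- function(path, chunk = 100000L) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break  # end of file reached
    n <- n + length(lines)
  }
  n
}

# Usage, assuming the file sits in the working directory:
# count_gz_lines("gene2pubmed.gz")
```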

ADD REPLY • link 7 months ago James W. MacDonald 67k

Hello James,

Thank you so much for suggesting ways to solve my query. After many trials, and waiting for good internet speed, I got the command to work and now successfully have my organism package built.

Regards,

Gayatri Brahmandam.

ADD REPLY • link 5 months ago Gayatri • 0

Hi, I have the same issue as you! Could this be caused by internet speed? I have already downloaded these files twice!

> makeOrgPackageFromNCBI(
+   version = '1.0.0',
+   author = 'Gabriela Librais',
+   maintainer = 'gnunesma@uwo.ca',
+   outputDir = ".",
+   tax_id = "5476",
+   genus = "Candida",
+   species = "albicans",
+   NCBIFilesDir = ".",
+   rebuildCache = FALSE  # This will stop the re-downloading
+ )
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
Error: no such table: main.gene2refseq
>

ADD REPLY • link 6 weeks ago Gabriela • 0

That error indicates that you have a file called NCBI.sqlite in your working directory that is missing some tables. You should delete that file and then run makeOrgPackageFromNCBI again, using the same arguments. This will re-generate the correct NCBI.sqlite file and then create the package.
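The delete-then-rerun step could be scripted roughly as follows. This is a sketch under assumptions: `remove_stale_db()` and `rebuild_org_package()` are hypothetical helpers, the cache is assumed to be named NCBI.sqlite and to live in the working directory as described above, and the maintainer string in the usage comment is a placeholder.

```r
# Hypothetical helper: delete a stale NCBI.sqlite in `dir`.
# Returns TRUE if a file was removed, FALSE if there was nothing to remove.
remove_stale_db <- function(dir = ".", db_name = "NCBI.sqlite") {
  db_path <- file.path(dir, db_name)
  if (!file.exists(db_path)) return(FALSE)
  file.remove(db_path)
}

# Hypothetical wrapper: clear the stale cache, then rebuild from the
# already-downloaded files in the working directory.
rebuild_org_package <- function(...) {
  remove_stale_db(".")
  AnnotationForge::makeOrgPackageFromNCBI(
    outputDir = ".",
    NCBIFilesDir = ".",
    rebuildCache = FALSE,  # reuse the local .gz files, as in the thread
    ...
  )
}

# Usage (placeholder maintainer; tax_id, genus, species from the post above):
# rebuild_org_package(version = "1.0.0",
#                     maintainer = "Your Name <you@example.org>",
#                     author = "Your Name",
#                     tax_id = "5476", genus = "Candida", species = "albicans")
```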

ADD REPLY • link 6 weeks ago James W. MacDonald 67k