Error MakeOrgPackagefromNCBI

I need to create an orgDb for my microorganism, but it gives me an error that I'll report below:

>  > makeOrgPackageFromNCBI(version = "0.1",
> +                        author = "Cinzia Spagnoli",
> +                        maintainer = "Cinzia Spagnoli",
> +                        outputDir = ".",
> +                        tax_id = "575584",
> +                        genus = "Acinetobacter",
> +                        species = "baumannii")
> If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
> preparing data from NCBI ...
> starting download for
> [1] gene2pubmed.gz
> [2] gene2accession.gz
> [3] gene2refseq.gz
> [4] gene_info.gz
> [5] gene2go.gz
> getting data for gene2pubmed.gz
> extracting data for our organism from : gene2pubmed
> getting data for gene2accession.gz
> extracting data for our organism from : gene2accession
> getting data for gene2refseq.gz
> extracting data for our organism from : gene2refseq
> getting data for gene_info.gz
> extracting data for our organism from : gene_info
> getting data for gene2go.gz
> extracting data for our organism from : gene2go
> processing gene2pubmed
> processing gene_info: chromosomes
> processing gene_info: description
> Error in prepareDataFromNCBI(tax_id, NCBIFilesDir, outputDir, rebuildCache,  :
>   no information found for species with tax id 575584

You will rarely find a particular strain in any annotation data, and instead you should use the 'main' taxon ID, which for A. baumannii happens to be 470.

## how many genes for 470?
$ awk '$1 == 470' gene_info | wc -l
## now how about 575584
$ awk '$1 == 575584' gene_info | wc -l

No idea how many genes one might expect for this bacterium, but you will get better results using 470.

I tried, but it does not seem to work.

> library(AnnotationForge)
> makeOrgPackageFromNCBI(version = "0.1",
+                          author = "Cinzia Spagnoli",
+                          maintainer = "Cinzia Spagnoli",
+                          outputDir = ".",
+                          tax_id = "470",
+                          genus = "Acinetobacter",
+                          species = "baumannii")
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  error reading from the connection
In addition: Warning messages:
1: In .Internal(shortRowNames(x, type)) :
  closing unused connection 3 (D:/OneDrive - Universita degli Studi Roma Tre/Documenti/gene2pubmed.gz)
2: call dbDisconnect() when finished working with a connection 
3: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  invalid or incomplete compressed data

It might be due either to the spaces in your path or to the fact that it's a OneDrive directory. It's normally better to just build on the Desktop and delete the build files after installing.
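
A minimal sketch of that advice in base R, in case it helps. The names `is_risky_build_dir` and `build_dir` are hypothetical (they are not part of AnnotationForge), and using `tempdir()` rather than the Desktop is only an assumption; any short local path without spaces should do.

```r
# Hypothetical helper (not part of AnnotationForge): flags paths that contain
# spaces or that sit inside a OneDrive-synced folder.
is_risky_build_dir <- function(path) {
  grepl(" ", path, fixed = TRUE) || grepl("OneDrive", path, ignore.case = TRUE)
}

# A fresh scratch directory; tempdir() is an assumption, the Desktop works too.
build_dir <- file.path(tempdir(), "orgdb_build")
dir.create(build_dir, showWarnings = FALSE, recursive = TRUE)

if (is_risky_build_dir(build_dir)) {
  warning("build directory contains spaces or is under OneDrive: ", build_dir)
}

# Then build from that directory (not run here, it downloads large files):
# setwd(build_dir)
# makeOrgPackageFromNCBI(version = "0.1",
#                        author = "Cinzia Spagnoli",
#                        maintainer = "Cinzia Spagnoli",
#                        outputDir = ".",
#                        tax_id = "470",
#                        genus = "Acinetobacter",
#                        species = "baumannii")
```

The path from the warning messages above, `D:/OneDrive - Universita degli Studi Roma Tre/Documenti`, would be flagged by this check on both counts.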

> makeOrgPackageFromNCBI("0.0.1","me <>","me", tax_id = "470", genus = "Acinetobacter", species = "baumannii", rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
making the OrgDb package ...
Populating genes table:
genes table filled
Populating pubmed table:
pubmed table filled
Populating gene_info table:
gene_info table filled
Populating entrez_genes table:
entrez_genes table filled
Populating alias table:
alias table filled
Populating refseq table:
refseq table filled
Populating accessions table:
accessions table filled
Populating go table:
go table filled
table metadata filled

'select()' returned many:1
mapping between keys and columns
Dropping GO IDs that are too new for the current GO.db
Populating go table:
go table filled
Populating go_bp table:
go_bp table filled
Populating go_cc table:
go_cc table filled
Populating go_mf table:
go_mf table filled
'select()' returned many:1
mapping between keys and columns
Populating go_bp_all table:
go_bp_all table filled
Populating go_cc_all table:
go_cc_all table filled
Populating go_mf_all table:
go_mf_all table filled
Populating go_all table:
go_all table filled
Creating package in c:/Users/jmacdon/Desktop/ 
Now deleting temporary database file
[1] ""

> install.packages("", type = "source", repos = NULL)
Installing package into 'C:/Users/jmacdon/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)
* installing *source* package '' ...
* DONE (
> library(

> select(, head(keys(, "SYMBOL")
'select()' returned 1:1 mapping
between keys and columns
       GID        SYMBOL
1 66395337          dnaA
2 66395338          dnaN
3 66395339          recF
4 66395340          gyrB
5 66395341          cybC
6 66395342 F3P16_RS00030

> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

I apologize for the delay in responding. However, the command still doesn't work for me. I would need to create the package from this genome:

I don't know what to tell you. I already told you that you can't build it for that strain, and you have to use 470 instead. I can get it to build (see above), and told you not to use a OneDrive path. Saying 'the command still doesn't work for me' without code or output isn't helpful at all (doesn't work how?).

Hello, while working on different projects, I recently got back to working on this code. I tried running your code again, but it's not working, even though I manually downloaded these files into the working directory: [1] gene2pubmed.gz [2] gene2accession.gz [3] gene2refseq.gz [4] gene_info.gz [5] gene2go.gz

Here is the code:

> makeOrgPackageFromNCBI("0.0.1","me <>","me",
+ tax_id = "470",
+ genus = "Acinetobacter",
+ species = "baumannii")
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
Error: no such table: gene2accession_date

That error indicates that you already have a file called NCBI.sqlite in your working directory, and it's incomplete (missing the gene2accession_date table). Here's mine:

> library(RSQLite)
Warning message:
package 'RSQLite' was built under R version 4.3.2 
> con <- dbConnect(SQLite(), "NCBI.sqlite")
> dbListTables(con)
 [1] "altGO"              
 [2] "altGO_date"         
 [3] "gene2accession"     
 [4] "gene2accession_date"
 [5] "gene2go"            
 [6] "gene2go_date"       
 [7] "gene2pubmed"        
 [8] "gene2pubmed_date"   
 [9] "gene2refseq"        
[10] "gene2refseq_date"   
[11] "gene_info"          
[12] "gene_info_date"     

## it's just a dumb little table that says when the db was built
> dbGetQuery(con, "select * from gene2accession_date;")
1 2023-06-28

The easiest thing to do is to delete your NCBI.sqlite DB and then run makeOrgPackageFromNCBI again. But do note that you have to add rebuildCache = FALSE to your call, or you will download all those files again!
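
As a rough sketch of that clean-up step in base R: the helper names `expected_tables`, `missing_cache_tables` and `remove_stale_cache` are hypothetical (none of them exist in AnnotationForge or RSQLite), and leaving the altGO tables out of the expected set is an assumption.

```r
# Tables the complete cache listed above contains for the five NCBI files
# (the altGO tables are left out here; treating them as optional is an assumption).
expected_tables <- function() {
  files <- c("gene2accession", "gene2go", "gene2pubmed", "gene2refseq", "gene_info")
  sort(c(files, paste0(files, "_date")))
}

# Hypothetical helper: which expected tables are absent from a dbListTables() result?
missing_cache_tables <- function(tables) setdiff(expected_tables(), tables)

# Hypothetical helper: delete the cache file if present; TRUE if a file was removed.
remove_stale_cache <- function(dir = ".", db_name = "NCBI.sqlite") {
  db_path <- file.path(dir, db_name)
  file.exists(db_path) && file.remove(db_path)
}

# Possible use with RSQLite (not run here):
# con <- RSQLite::dbConnect(RSQLite::SQLite(), "NCBI.sqlite")
# incomplete <- length(missing_cache_tables(RSQLite::dbListTables(con))) > 0
# RSQLite::dbDisconnect(con)
# if (incomplete) remove_stale_cache(".")
```

A cache that stopped after gene2pubmed would report `gene2accession_date` among the missing tables, which matches the error message in the question.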

Well, thanks for the advice! I reran the call and this is what I get. Fortunately, it downloaded most of the .gz files for me, but it crashed at GO.

> makeOrgPackageFromNCBI(version = "0.0.1",
+ author = "Cinzia Spagnoli",
+ maintainer = "Cinzia Spagnoli",
+ outputDir = '.',
+ tax_id = "470",
+ genus = "Acinetobacter",
+ species = "baumannii",
+ )
If files are not cached locally this may take awhile to assemble a 33 GB cache databse in the NCBIFilesDir directory. Subsequent calls to this function should be faster (seconds). The cache will try to rebuild once per day.Please also see AnnotationHub for some pre-builtOrgDb downloads
preparing data from NCBI ...
starting download for
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
rebuilding the cache
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
rebuilding the cache
extracting data for our organism from : gene_info
getting data for gene2go.gz
rebuilding the cache
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error in download.file(url, dest, quiet = TRUE) :
  download from '' failed
In addition: Warning messages:
1: In download.file(url, dest, quiet = TRUE) :
  downloaded length 0 != reported length 0
2: In download.file(url, dest, quiet = TRUE) :
  URL '': Timeout of 1000 seconds was reached

con <- dbConnect(SQLite(), "NCBI.sqlite")
dbListTables(con)
[1] "gene2accession" "gene2accession_date" "gene2go" "gene2go_date"
[5] "gene2pubmed" "gene2pubmed_date" "gene2refseq" "gene2refseq_date"
[9] "gene_info" "gene_info_date"

Two remarks:

You can increase the timeout further, for example to 4000 seconds, through options(timeout = 4000).

Although not related to the timeout, note that you did not follow the convention for specifying the author and maintainer: the email address should be between < and >. See ?makeOrgPackageFromNCBI, which tells you to do it like this:

author = "Some One <>",
maintainer = "Some One <>",
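
Both remarks can be sketched in a few lines of base R. This is only an illustration: `has_email_brackets` is a hypothetical helper (AnnotationForge does not provide it), its regular expression is a loose shape check rather than real email validation, and the address used is a placeholder.

```r
# Raise the download timeout for this session; keep a larger existing value.
old_timeout <- getOption("timeout")
options(timeout = max(4000, old_timeout))

# Hypothetical check (not part of AnnotationForge): "Name <user@host>" shape.
has_email_brackets <- function(x) {
  grepl("^[^<>]+ <[^<>@ ]+@[^<>@ ]+>$", x)
}

has_email_brackets("Cinzia Spagnoli")                   # FALSE
has_email_brackets("Some One <some.one@example.org>")   # TRUE, placeholder address
```

Note that `options(timeout = ...)` only lasts for the current R session, so it has to be set again before each new build attempt.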

I did it, after several attempts and thanks to your invaluable advice! Now I'm trying to perform GO and KEGG enrichment analysis. I also followed the advice given in this thread: "No genes can be mapped...." using enrichGO in clusterProfiler. But it does not work. Can you help me?

GO classification

all_genes <- read.csv("all_genes.csv")
diff_genes <- read.csv("diff_genes.csv")

GO_analysis <- enrichGO(gene = diff_genes,
                        universe = all_genes,
                        OrgDb =,
                        ont = "CC", # either "BP", "CC" or "MF",
                        pAdjustMethod = "none",
                        pvalueCutoff = 1,
                        qvalueCutoff = 1,
                        readable = TRUE,
                        pool = TRUE)
--> No gene can be mapped....
--> Expected input gene ID: 66397190,66398467,66396010,66395397,66398543,66398024
--> return NULL...

Happy to hear you got it working!

Please open a new thread for your new question. Before you do so, though, double-check that the input you pass to enrichGO as gene (here diff_genes, and likewise all_genes for universe) is a character vector, and that its values are indeed Entrez IDs (and thus match the reported Expected input gene IDs). In other words, first do all the checks that I suggested in my post you linked to...!
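
One way those checks might look in base R, as a sketch only. `as_id_vector` is a hypothetical helper, and the toy data frame with its `gene_id` column is made up to stand in for the result of read.csv("diff_genes.csv"), which returns a data frame rather than a character vector.

```r
# Hypothetical helper: turn one column of a data frame (or any vector) into a
# clean character vector of unique IDs, dropping NA and empty entries.
as_id_vector <- function(x, column = 1) {
  if (is.data.frame(x)) x <- x[[column]]
  ids <- trimws(as.character(x))
  unique(ids[!is.na(ids) & nzchar(ids)])
}

# Toy stand-in for read.csv("diff_genes.csv"); the column name is made up.
diff_genes <- data.frame(gene_id = c(66397190, 66398467, 66396010))
gene_ids <- as_id_vector(diff_genes)

is.character(gene_ids)             # should be TRUE before calling enrichGO
all(grepl("^[0-9]+$", gene_ids))   # Entrez-style IDs are digits only
```

If either check fails on the real files, the values are probably gene symbols or locus tags rather than Entrez IDs, which would be consistent with the "No gene can be mapped" message.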


