Question

Query about creating a custom organism database with the AnnotationForge package (AnnotationForge::makeOrgPackageFromNCBI)

0

abhisek001 • 0

I ran makeOrgPackageFromNCBI from the AnnotationForge package in RStudio with the inputs shown below, but I could not create a custom database for Acinetobacter baumannii. I want to run a gene set enrichment analysis on RNA-seq data from my mutated strain of Acinetobacter baumannii, and this species is not available in the R packages for GO analysis.

First I ran the command with rebuildCache=TRUE. It started downloading the files and then got stuck while reading the gene2accession file. The code and the error are shown below.

Then I downloaded all the files from the NCBI FTP site (https://ftp.ncbi.nih.gov/gene/DATA/), put them in the working directory, and ran the second command below with rebuildCache=FALSE, but I got the second error shown. I even manually renamed the content of gene2accession.gz from gene2accession to main.gene2accession, but that did not work either.

Please guide me. Thanks in advance.

makeOrgPackageFromNCBI(version = "0.1",
                       author = "Some one <somone2001@gmail.com>",
                       maintainer = "Some one <somone2001@gmail.com>",
                       outputDir = "/home/omic/analysis/R_studio",
                       NCBIFilesDir = "/home/omic/analysis/R_studio",
                       tax_id = "470",
                       genus = "Acinetobacter",
                       species = "baumannii",
                       rebuildCache=TRUE)

Error:

getting data for gene2pubmed.gz
rebuilding the cache
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
rebuilding the cache
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  error reading from the connection
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  invalid or incomplete compressed data


makeOrgPackageFromNCBI(version = "0.1",
                       author = "Some one <somone2001@gmail.com>",
                       maintainer = "Some one <somone2001@gmail.com>",
                       outputDir = "/home/omic/analysis/R_studio",
                       NCBIFilesDir = "/home/omic/analysis/R_studio",
                       tax_id = "470",
                       genus = "Acinetobacter",
                       species = "baumannii",
                       rebuildCache=FALSE)



Error:

preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
Error: no such table: main.gene2accession
sessionInfo( )

AnnotationForge OrganismData OrganismDb • 1.7k views

ADD COMMENT • link 22 months ago abhisek001 • 0

score 2 · Answer 1 · 2023-06-28

I'm not sure about the first error - that might be due to a timeout, although you should have got a timeout error instead of a scan error. The second error has to do with the first though. When you ran the first attempt, a SQLite database was created, and in the second attempt (by using rebuildCache = FALSE) you are saying 'just use the existing SQLite database instead of re-downloading', in which case there are missing tables that result in the error you see.
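One way to confirm this diagnosis is to look inside the cached database before retrying. The following is a minimal sketch, assuming the cache is the NCBI.sqlite file in your NCBIFilesDir and that the five table names match the downloaded file names without the .gz suffix; the helper name missingNcbiTables is made up for illustration.

```r
library(DBI)
library(RSQLite)

## Return the expected NCBI tables that are absent from the cache file.
## The default table names are an assumption based on the file names in
## this thread (gene2pubmed.gz -> gene2pubmed, and so on).
missingNcbiTables <- function(dbfile,
                              expected = c("gene2pubmed", "gene2accession",
                                           "gene2refseq", "gene_info",
                                           "gene2go")) {
    ## dbConnect() would silently create an empty file, so check first
    stopifnot(file.exists(dbfile))
    con <- dbConnect(SQLite(), dbfile)
    on.exit(dbDisconnect(con))
    setdiff(expected, dbListTables(con))
}

## Usage (the path is an example; point it at your own NCBIFilesDir):
## missingNcbiTables("/home/omic/analysis/R_studio/NCBI.sqlite")
```

If this returns any table names, a run with rebuildCache = FALSE would be expected to fail with a "no such table" error for the first one it needs.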

Ideally there would be a facility to download the files from NCBI by hand and then tell AnnotationForge to use those files. I recently tried to add that functionality, but it wasn't quite right and I have backed it out until I come up with a better solution. In the interim you can create the SQLite file yourself using the following function, and then try your second method again.

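writeFilesToDb <- function(file, file.dir = ".") {
    library(AnnotationForge)
    library(RSQLite)
    ## Column definitions for the NCBI files (internal, unexported helper)
    pfiles <- AnnotationForge:::.primaryFiles()
    ## Anything else (e.g. idmapping_selected.tab.gz) has no definition
    if (!file %in% names(pfiles))
        stop("'", file, "' is not one of: ",
             paste(names(pfiles), collapse = ", "))
    tmp <- file.path(file.dir, file)
    file <- pfiles[file]
    NCBIcon <- dbConnect(SQLite(), file.path(file.dir, "NCBI.sqlite"))
    ## Always close the connection, even if a step below fails
    on.exit(dbDisconnect(NCBIcon))
    ## Table name is the file name without the .gz suffix
    tableName <- sub("\\.gz$", "", names(file))
    AnnotationForge:::.writeToNCBIDB(NCBIcon, tableName, filepath=tmp, file)
    AnnotationForge:::.setNCBIDateStamp(NCBIcon, tableName)
    invisible(NULL)
}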

## try it out
> fls <- dir(".", "^gene.+gz")
> fls
[1] "gene_info.gz"      "gene2accession.gz" "gene2go.gz"       
[4] "gene2pubmed.gz"    "gene2refseq.gz"
> for(i in fls) writeFilesToDb(i)
> makeOrgPackageFromNCBI("0.1","me <me@mine.org>", "me", ".", "470", "Acinetobacter","baumannii",rebuildCache = FALSE)
preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
<snip>
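
Because the first error in the question reported "invalid or incomplete compressed data", it may also be worth checking each hand-downloaded archive before loading it. This is only a sketch: it assumes a gzip executable is on the PATH (usual on Linux), and gzIsIntact is a made-up helper name, not part of AnnotationForge.

```r
## TRUE if `gzip -t` can decompress the whole file without error.
## Assumes the external `gzip` program is installed and on the PATH.
gzIsIntact <- function(path) {
    stopifnot(file.exists(path))
    status <- system2("gzip", c("-t", shQuote(path)),
                      stdout = FALSE, stderr = FALSE)
    status == 0
}

## Usage: check every NCBI gene file in the current directory
## fls <- dir(".", "^gene.+gz")
## vapply(fls, gzIsIntact, logical(1))
```

A file that fails this check was probably truncated during download and should be fetched again.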

score 0 · Answer 2 · 2023-06-28

0

abhisek001 • 0

Thank you, James W. MacDonald sir, for your effort on this problem. I followed your suggested two-step process but still do not get the desired output. The run now proceeds as follows.

preparing data from NCBI ...
starting download for 
[1] gene2pubmed.gz
[2] gene2accession.gz
[3] gene2refseq.gz
[4] gene_info.gz
[5] gene2go.gz
getting data for gene2pubmed.gz
extracting data for our organism from : gene2pubmed
getting data for gene2accession.gz
extracting data for our organism from : gene2accession
getting data for gene2refseq.gz
extracting data for our organism from : gene2refseq
getting data for gene_info.gz
extracting data for our organism from : gene_info
getting data for gene2go.gz
extracting data for our organism from : gene2go
processing gene2pubmed
processing gene_info: chromosomes
processing gene_info: description
processing alias data
processing refseq data
processing accession data
processing GO data
Error: no such table: altGO_date
>

ADD COMMENT • link 22 months ago abhisek001 • 0

1

Oh, I forgot. You need the idmapping file as well, from

https://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz
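
If you fetch this file from within R, the default download timeout may be too short for a large archive. Below is a hedged sketch that raises the timeout for a single download; fetchIfMissing is a hypothetical helper, and placing the file in the same directory as the NCBI files is an assumption, since the thread does not say where it should go.

```r
## Download `url` into `destdir` unless the file is already there.
## `timeout` is in seconds and is restored to its old value afterwards.
fetchIfMissing <- function(url, destdir = ".", timeout = 3600) {
    dest <- file.path(destdir, basename(url))
    if (file.exists(dest)) return(invisible(dest))
    old <- options(timeout = timeout)
    on.exit(options(old))
    ## mode = "wb" keeps the compressed file byte-for-byte intact
    status <- download.file(url, dest, mode = "wb")
    if (status != 0) stop("download failed for ", url)
    invisible(dest)
}

## Usage (destination directory is an assumption):
## fetchIfMissing(paste0("https://ftp.expasy.org/databases/uniprot/",
##                       "current_release/knowledgebase/idmapping/",
##                       "idmapping_selected.tab.gz"))
```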

ADD REPLY • link 22 months ago James W. MacDonald 68k

0

Thank you, MacDonald sir, for helping me out, but now I get the following errors when I run the commands below, both with idmapping_selected.tab.gz included and without it. Please tell me where the problem is.

In the first case:

> fls <- dir(".", "^gene.+gz")
> fls
[1] "gene_info.gz"      "gene2accession.gz" "gene2go.gz"       
[4] "gene2pubmed.gz"    "gene2refseq.gz"   
> for(i in fls) writeFilesToDb(i)
Warning messages:
1: In for (i in seq_len(n)) { :
  closing unused connection 3 (./idmapping_selected.tab.gz)
2: call dbDisconnect() when finished working with a connection

In the second case:

> fls <- dir(".", "*.+gz")
> fls

[1] "gene_info.gz"              "gene2accession.gz"        
[3] "gene2go.gz"                "gene2pubmed.gz"           
[5] "gene2refseq.gz"            "idmapping_selected.tab.gz"

> for(i in fls) writeFilesToDb(i)

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'dbUnquoteIdentifier': Cannot pass NA to dbQuoteIdentifier()
Called from: h(simpleError(msg, call))

ADD REPLY • link 22 months ago abhisek001 • 0

1

Sorry, I don't think I was clear. You need to use writeFilesToDb for just the files you get from NCBI (not the idmapping_selected.tab.gz). So the regexp for finding those files is as I originally showed you, "^gene.+gz". Once you have generated the NCBI.sqlite file, you can then run makeOrgPackageFromNCBI with rebuildCache = FALSE.

To reiterate, you don't use the idmapping_selected.tab.gz directly at all. It's used internally by makeOrgDbFromNCBI to make the GO tables.
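
To make the file selection concrete, here is a small sketch of how the "^gene.+gz" pattern separates the NCBI files from the idmapping file. The helper name ncbiGeneFiles is invented for illustration; the file names are the ones listed earlier in this thread.

```r
## Keep only files whose names start with "gene" and contain "gz" later on;
## idmapping_selected.tab.gz does not start with "gene", so it is dropped.
ncbiGeneFiles <- function(files) grep("^gene.+gz", files, value = TRUE)

fls <- c("gene_info.gz", "gene2accession.gz", "gene2go.gz",
         "gene2pubmed.gz", "gene2refseq.gz", "idmapping_selected.tab.gz")
ncbiGeneFiles(fls)
```

Only the five gene files come back, which is the set that writeFilesToDb should be given.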

ADD REPLY • link 22 months ago James W. MacDonald 68k

0

Thanks, MacDonald sir, the process has now completed.

Creating package in ./org.Abaumannii.eg.db 
Now deleting temporary database file
complete!
[1] "org.Abaumannii.eg.sqlite"

ADD REPLY • link 22 months ago abhisek001 • 0