how to download multiple gff files from NCBI
KB ▴ 50
@k-8495

Hello,

I have about 100 bacterial species. I used the "reutils" package to download their fasta files. I am now trying to download their gff files. I have the names and NC_ ids for these.

I looked online and see that there is a package, "genomes", that can read the file into R, but you need to provide the complete link to it. E.g.: file <- "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Yersinia_pestis_CO92_uid57621/NC_003132.gff"

If the link had been something like "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/NC_017105.gff", it would have been easy to create the link programmatically in R and download the file. But that is not the case, since the link also contains a species name and another number.

1) Is there a "reutils"-type package where I can give the NC_ id and download the gff file?

2) Is there a way to create this link programmatically for each bacterium?

Any insights? Thanks

 

@martin-morgan-1513

You can get the directory listing using curl and FTP:

library(RCurl)
curl <- getCurlHandle()

url <- "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/"
xx <- getURL(url=url, ftplistonly=TRUE, curl=curl)
entries <- strsplit(xx, "\r?\n")[[1]]  # FTP listings may end lines with \r\n

Parse these for the species names:

ncbi_taxa <- sub("^([[:alpha:]]+_[[:alpha:]]+)_.*", "\\1", entries)

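and match them up with the information you have (your_taxa in "Genus_species" form, your_NC the corresponding NC_ ids), dropping any species that has no match:

keep <- match(your_taxa, ncbi_taxa)   # first matching directory per species
found <- !is.na(keep)                 # species absent from the listing give NA
# assumes your_NC holds bare ids such as "NC_003132", so ".gff" is appended
gff_urls <- paste0(url, entries[keep[found]], "/", your_NC[found], ".gff")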
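As a small worked example of the parse-and-match steps, here is a sketch on toy data that needs no network access. Only the Yersinia directory name and NC_003132 come from the question; the other entries, the species "Made_up", and the id "NC_000000" are invented for illustration, and the sketch assumes the NC_ ids are stored without the ".gff" extension.

```r
url <- "ftp://ftp.ncbi.nih.gov/genomes/Bacteria/"

# Toy directory listing: the first entry is from the question,
# the other two are hypothetical stand-ins.
entries <- c("Yersinia_pestis_CO92_uid57621",
             "Bacillus_subtilis_168_uid57675",
             "all.gff.tar.gz")

# Keep only the leading "Genus_species" part; entries that do not
# fit the pattern are returned unchanged by sub().
ncbi_taxa <- sub("^([[:alpha:]]+_[[:alpha:]]+)_.*", "\\1", entries)

# Toy user input: one species present in the listing, one absent.
your_taxa <- c("Yersinia_pestis", "Made_up")
your_NC   <- c("NC_003132", "NC_000000")

keep  <- match(your_taxa, ncbi_taxa)  # NA where a species is not listed
found <- !is.na(keep)

# Subset both sides with the same index so directories and ids stay aligned.
gff_urls <- paste0(url, entries[keep[found]], "/", your_NC[found], ".gff")
print(gff_urls)
```

Note that match() returns only the first directory for each species, so a species with several strains in the listing would need grep() or %in% instead.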

You could then use download.file(), or RCurl and this suggestion, to download each gff file. Import the downloaded files, or the URLs to the files directly, with rtracklayer::import().
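The download step might look something like the sketch below, which loops over the URLs with download.file() and records which ones succeeded instead of stopping at the first failure. The function name download_gffs, the dest_dir argument, and the fetch argument (there only so the downloader can be swapped out) are hypothetical names for this illustration, not part of any package.

```r
# Download each URL into dest_dir and return a logical vector,
# named by URL, that is TRUE where the download succeeded.
download_gffs <- function(gff_urls, dest_dir = "gff",
                          fetch = utils::download.file) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  vapply(gff_urls, function(u) {
    dest <- file.path(dest_dir, basename(u))
    tryCatch(
      # download.file() returns 0 on success
      fetch(u, destfile = dest, quiet = TRUE) == 0,
      error   = function(e) FALSE,
      warning = function(w) FALSE
    )
  }, logical(1))
}

# Possible usage, assuming gff_urls was built as described above:
# ok <- download_gffs(gff_urls)
# gff_urls[!ok]   # URLs that could not be downloaded
```

Each file that downloaded successfully could then be read with rtracklayer::import(), one file at a time.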

 


Fabulous - thanks so much!! Works great!!
