Retrieve GenBank file using RefSeq ID
1
0
Entering edit mode
heiko_kin ▴ 60
@heiko_kin-23266
Last seen 3.9 years ago

Dear all,

I am looking for a way to load the .gb GenBank file for a specific RefSeq ID directly into R. I had no success using the genbankr, rentrez and biomaRt package. Does anyone have a suggestion on how this can be done?

Thanks, Heiko

genbankr • 2.0k views
ADD COMMENT
0
Entering edit mode

I typically accomplish these kinds of operations by either directly accessing EDirect in the terminal, or having R build a bunch of EDirect queries directly, passing those to the terminal, and capturing the output.

In the terminal this simply looks like:

$ esearch -db nuccore -query NM_130786 | efetch -format gp

This is pretty easily recreated in R with as long as R has access to edirect - check Sys.getenv("PATH"):

> query <- paste0("esearch -db ",
                  "nuccore ", # your specified DB
                  "-query ", 
                  "NM_130786", # your query ID
                  " | ",
                  "efetch -format",
                  " gp")
> z <- system(command = query,
              intern = TRUE,
              timeout = 300L)
> head(z)
[1] "LOCUS       NM_130786               3382 bp    mRNA    linear   PRI 12-DEC-2020"
[2] "DEFINITION  Homo sapiens alpha-1-B glycoprotein (A1BG), mRNA."                  
[3] "ACCESSION   NM_130786"                                                          
[4] "VERSION     NM_130786.4"                                                        
[5] "KEYWORDS    RefSeq; MANE Select."                                               
[6] "SOURCE      Homo sapiens (human)"                                               
> tail(z)
[1] "     3181 gccccagctt agttttttaa aaagtttatt ttagccattc taataggtat gtagacatat"
[2] "     3241 ctaatagtgg ttttccttgc gtttccttaa tgtctgatga tgttaagcat tttttcccca"
[3] "     3301 agtgcttagt tgccatctat atagcatctt tgatgaaatg tctgtttata tattttgcac"
[4] "     3361 actttaaaat attgggttgt tt"                                         
[5] "//"                                                                         
[6] ""      

There are a few quirks with this process: If you run a loop across a whole bunch of system() calls R can eventually run out of connections, which is weird. Also depending on your version of RStudio, calling system() with the default timeout argument can cause R to hang indefinitely. This does not occur if you load up R from the terminal or just use the GUI, and appears to have been fixed in the most recent version of RStudio - but don't quote me on that.

I like running queries this way, but it does require a bit of background knowledge about edirect. If you're new to it, there's an old Cookbook for edirect that can be helpful.

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 4 minutes ago
United States

You will have to be more explicit than that. What exactly do you want to do? What have you tried?

It seems pretty straightforward to download and read in a .gb file using rentrez (less so with genbankr, but that has more to do with the arbitrariness of .gb records than any fault with genbankr. I mean how do you programmatically prepare for random-ish inputs?).

> z <- entrez_search("nuccore","NM_130786[ACCNUM]")
> zz <- entrez_fetch("nuccore", z, rettype ="gbwithparts",retmode = "text")
> strsplit(zz, "\\n")[1:3]
[[1]]
  [1] "LOCUS       NM_130786               3382 bp    mRNA    linear   PRI 12-DEC-2020"  
  [2] "DEFINITION  Homo sapiens alpha-1-B glycoprotein (A1BG), mRNA."                    
  [3] "ACCESSION   NM_130786"                                                            
  [4] "VERSION     NM_130786.4"                                                          
  [5] "KEYWORDS    RefSeq; MANE Select."                                                 
  [6] "SOURCE      Homo sapiens (human)"                                                 
  [7] "  ORGANISM  Homo sapiens"                                                         
  [8] "            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;"    
  [9] "            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;"         
 [10] "            Catarrhini; Hominidae; Homo."                                         
 [11] "REFERENCE   1  (bases 1 to 3382)"                                                 
 [12] "  AUTHORS   Barallobre-Barreiro J, Gupta SK, Zoccarato A, Kitazume-Taneike R,"    
 [13] "            Fava M, Yin X, Werner T, Hirt MN, Zampetaki A, Viviano A, Chong M,"   
 [14] "            Bern M, Kourliouros A, Domenech N, Willeit P, Shah AM, Jahangiri M,"  
 [15] "            Schaefer L, Fischer JW, Iozzo RV, Viner R, Thum T, Heineke J,"        
 [16] "            Kichler A, Otsu K and Mayr M."                                        
 [17] "  TITLE     Glycoproteomics Reveals Decorin Peptides With Anti-Myostatin"     
<snip>
ADD COMMENT
0
Entering edit mode

Thank you James!

This is a wonderful solution... I need to extract information from that file and will do that by using regex. Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 991 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6