Dear all,
I am looking for a way to load the .gb GenBank file for a specific RefSeq ID directly into R. I had no success using the genbankr, rentrez and biomaRt package. Does anyone have a suggestion on how this can be done?
Thanks, Heiko
Dear all,
I am looking for a way to load the .gb GenBank file for a specific RefSeq ID directly into R. I had no success using the genbankr, rentrez and biomaRt package. Does anyone have a suggestion on how this can be done?
Thanks, Heiko
You will have to be more explicit than that. What exactly do you want to do? What have you tried?
It seems pretty straightforward to download and read in a .gb file using rentrez
(less so with genbankr
, but that has more to do with the arbitrariness of .gb records than any fault with genbankr
. I mean how do you programmatically prepare for random-ish inputs?).
> z <- entrez_search("nuccore","NM_130786[ACCNUM]")
> zz <- entrez_fetch("nuccore", z, rettype ="gbwithparts",retmode = "text")
> strsplit(zz, "\\n")[1:3]
[[1]]
[1] "LOCUS NM_130786 3382 bp mRNA linear PRI 12-DEC-2020"
[2] "DEFINITION Homo sapiens alpha-1-B glycoprotein (A1BG), mRNA."
[3] "ACCESSION NM_130786"
[4] "VERSION NM_130786.4"
[5] "KEYWORDS RefSeq; MANE Select."
[6] "SOURCE Homo sapiens (human)"
[7] " ORGANISM Homo sapiens"
[8] " Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;"
[9] " Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;"
[10] " Catarrhini; Hominidae; Homo."
[11] "REFERENCE 1 (bases 1 to 3382)"
[12] " AUTHORS Barallobre-Barreiro J, Gupta SK, Zoccarato A, Kitazume-Taneike R,"
[13] " Fava M, Yin X, Werner T, Hirt MN, Zampetaki A, Viviano A, Chong M,"
[14] " Bern M, Kourliouros A, Domenech N, Willeit P, Shah AM, Jahangiri M,"
[15] " Schaefer L, Fischer JW, Iozzo RV, Viner R, Thum T, Heineke J,"
[16] " Kichler A, Otsu K and Mayr M."
[17] " TITLE Glycoproteomics Reveals Decorin Peptides With Anti-Myostatin"
<snip>
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I typically accomplish these kinds of operations by either directly accessing EDirect in the terminal, or having R build a bunch of EDirect queries directly, passing those to the terminal, and capturing the output.
In the terminal this simply looks like:
This is pretty easily recreated in R with as long as R has access to edirect - check
Sys.getenv("PATH")
:There are a few quirks with this process: If you run a loop across a whole bunch of
system()
calls R can eventually run out of connections, which is weird. Also depending on your version of RStudio, callingsystem()
with the defaulttimeout
argument can cause R to hang indefinitely. This does not occur if you load up R from the terminal or just use the GUI, and appears to have been fixed in the most recent version of RStudio - but don't quote me on that.I like running queries this way, but it does require a bit of background knowledge about edirect. If you're new to it, there's an old Cookbook for edirect that can be helpful.