help with PubMed Central OAI
stubben (@stubben-4185)
I've been using Efetch to get some full text articles from PubMed Central, which works fine...

    library(XML)  # provides xmlParse() and xpathSApply()
    url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC2784878"
    x <- readLines(url)
    doc <- xmlParse(x)
    xpathSApply(doc, "//abstract", xmlValue)

    [1] "The majority of all genes have so far been identified and annotated systematically through in silico gene finding.
    Here we report the finding of 3662 strand-specific transcriptionally active regions (TARs) in the genome of Bacillus
    subtilis by the use of tiling arrays.

I recently noticed the PMC copyright says to use the FTP or OAI service for any "automated" retrievals, so I thought I would try OAI, but I can't get the same XPath queries to work.

    url <- "http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:2784878"
    x2 <- readLines(url)  # will warn about incomplete final line
    doc2 <- xmlParse(x2)
    xpathSApply(doc2, "//abstract", xmlValue)

    list()

This query does work, so I know there's an abstract tag.

    table(xpathSApply(doc2, "//*", xmlName))

              abstract                 ack           addr-line                 aff             article  article-categories
                     1                   1                   1                   1                   1                   1
            article-id        article-meta       article-title        author-notes                back                body
                     3                   1                  79                   1                   1                   1
               caption             contrib       contrib-group copyright-statement             corresp                date
                     7                   3                   1                   1                   1                   1

Thanks for any help.

Chris Stubben
@duncan-temple-lang-1540
Hi Chris

The problem is that the <abstract> node has a namespace. So the following will do what you want (and also avoids using readLines() by retrieving the URL directly in xmlParse()).

    url <- "http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:2784878"
    doc2 = xmlParse(url)
    getNodeSet(doc2, "//x:abstract", c("x" = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle"))

or

    xpathSApply(doc2, "//x:abstract", xmlValue, namespaces = c("x" = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle"))

The namespace is defined on the node.

D.

On 4/20/12 10:33 AM, Chris Stubben wrote:
> I've been using Efetch to get some full text articles from Pubmed Central, which works fine...
> [...]
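To see why the unprefixed query comes back empty, here is a minimal offline sketch that needs no network access. The XML string is a made-up stand-in for the OAI response, not the real record: its structure, the placement of the `xmlns` declaration on `<article>`, and the abstract text are all assumptions for illustration. Only the namespace URI is taken from the answer above. The `local-name()` query at the end is an additional, standard XPath 1.0 alternative that was not part of the original reply.

```r
library(XML)  # same package used in the thread

# Hypothetical stand-in for the OAI response (not the real record): one element
# declares a default namespace, so its descendants belong to that namespace too.
txt <- '<record>
  <article xmlns="http://dtd.nlm.nih.gov/2.0/xsd/archivearticle">
    <front><abstract>Example abstract text.</abstract></front>
  </article>
</record>'
doc <- xmlParse(txt, asText = TRUE)

# In XPath 1.0 an unprefixed name only matches elements in no namespace,
# so this finds nothing even though an <abstract> element exists.
plain <- xpathSApply(doc, "//abstract", xmlValue)

# Bind an arbitrary prefix ("x") to the namespace URI and use it in the path.
ns <- c(x = "http://dtd.nlm.nih.gov/2.0/xsd/archivearticle")
prefixed <- xpathSApply(doc, "//x:abstract", xmlValue, namespaces = ns)

# Alternative: ignore namespaces and match on the element's local name.
by_local <- xpathSApply(doc, "//*[local-name() = 'abstract']", xmlValue)

length(plain)  # 0
prefixed       # "Example abstract text."
by_local       # "Example abstract text."
```

The prefix itself is arbitrary; what has to match is the URI it is bound to. The `local-name()` form is convenient when the namespace URI is unknown or varies between records, at the cost of also matching same-named elements from any other namespace.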