Hello,
We make extensive use of NCBI's blast application [1] in our
workflows. One of the optional output formats of the application is
XML formatted data. That output works very well for most of our
purposes as this form of output is complete. However, we have
encountered an issue where, for very large inputs, output to XML
becomes very resource heavy for our server, bringing our workflow to a
crawl. We are trying end-runs around the issue including using other
output format options (flat ascii tables, html, etc.), and also by
saving the output to NCBI's ASN.1 archive format [2] and then
converting using the blast_formatter application [3] but none fit the
bill.
NCBI makes its AsnLib tool kit available, but we don't have the resources
at this time to dive into C and C++. We are wondering if there are
any resources available in R for reading NCBI's ASN.1 archive format.
Do such beasts exist?
Thanks,
Ben
[1] http://www.ncbi.nlm.nih.gov/books/NBK1763/
[2] http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/ASNLIB.HTML
[3] http://home.cc.umanitoba.ca/~psgendb/birchhomedir/doc/NCBI/blast_formatter.txt
Ben Tupper
Bigelow Laboratory for Ocean Sciences
180 McKown Point Rd. P.O. Box 475
West Boothbay Harbor, Maine 04575-0475
http://www.bigelow.org
Hello again,
> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Ben Tupper
> Sent: Wednesday, January 09, 2013 10:43 AM
> To: Bioconductor mailing list
> Subject: [BioC] ASN.1
>
Thanks to a number of off-list communications, I realized that reading
the ASN.1 archive format directly in R was not going to be fruitful.
My boss suggested that we try splitting the input FASTA files into
smaller pieces and then outputting smaller XML files. That works well
(and that's why he's the boss!). These XML files are easier to deal
with. One little surprise is that *sometimes* blast outputs
multi-document XML files (more than one xmlRoot in a single file). The
XML package doesn't appear to be able to work with those. Instead, we
read such files as plain text, find the lines where each document
starts and ends, and feed each slab of text to the XML tree parser.
The XML file(s) have all the things we want for our home-made flat
tables and html files. Below is the function we have used to good
effect so far. It also works with the multi-document XML files.
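
For illustration only, the splitting step could look something like
the sketch below. The function name splitFasta, the chunk size n, and
the output file naming are all assumptions made for this example; it is
not the code we actually run, and it reads the whole FASTA file into
memory, which may not suit very large inputs.

```r
# Hypothetical helper (not from our workflow): split a FASTA file into
# pieces of at most 'n' records each and return the files written.
splitFasta <- function(file, n = 1000, prefix = "chunk") {
  s <- readLines(file)
  # each record starts with a header line beginning with ">"
  start <- grep("^>", s)
  if (length(start) == 0) stop("no FASTA records found in ", file)
  # each record ends on the line before the next header
  end <- c(start[-1] - 1, length(s))
  # assign each record to a chunk: 1, 1, ..., 2, 2, ...
  chunk <- ceiling(seq_along(start) / n)
  out <- sprintf("%s_%03d.fasta", prefix, unique(chunk))
  for (k in unique(chunk)) {
    rec <- which(chunk == k)
    writeLines(s[start[rec[1]]:end[rec[length(rec)]]], out[k])
  }
  out
}

# small self-contained demonstration using temporary files
fa <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "ACGT", ">seq2", "GGCC", "TTAA",
             ">seq3", "ACGTAC"), fa)
pieces <- splitFasta(fa, n = 2, prefix = file.path(tempdir(), "chunk"))
```

Each file named in 'pieces' can then be run through blast separately,
giving one smaller XML output per piece.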
Cheers,
Ben
library(XML)

# Read one or more blast XML files, including multi-document
# (container-style) XML files.
#
# file - character vector of one or more XML filenames, possibly
#    compressed
# useInternalNodes - (default is TRUE) see xmlTreeParse
# asList - (default is FALSE) if TRUE return a list of XML root nodes
#    even if there is only one. Ignored if the length of the 'file'
#    input is greater than 1 or one or more files are multi-document.
read.blastXML <- function(file, useInternalNodes = TRUE, asList = FALSE){

   # we have more than one file - a recursion is required
   if (length(file) > 1){
      x <- lapply(file, read.blastXML,
         useInternalNodes = useInternalNodes)
      # we must unlist because any one of the input files may be
      # a multi-document XML file
      return(unlist(x, recursive = TRUE))
   }

   # is it a simple one-document file?
   x <- try(xmlRoot(xmlTreeParse(file,
         useInternalNodes = useInternalNodes)),
      silent = TRUE)

   # if not, then it could be a multi-document XML file. In that case
   # we scan the file into a character vector, find the lines where
   # each document starts, and then parse the XML in slabs of text
   if (inherits(x, "try-error")) {
      cat("read.blastXML: unable to read xml file... trying as multi-document xml\n")
      # scan in as text (gzfile also reads uncompressed files)
      ff <- gzfile(file)
      s <- scan(ff, what = character(), sep = "\n", quiet = TRUE)
      close(ff)
      # the start/stop points: each document begins with an
      # "<?xml" declaration
      ix <- c(grep("^<\\?xml", s), length(s) + 1)
      n <- length(ix) - 1
      # results list
      x <- vector(mode = "list", length = n)
      for (i in seq_len(n)) {
         # note that we may end up with 'try-errors' in each element;
         # if so the end user will have to figure out what to do next
         x[[i]] <- try(xmlRoot(xmlTreeParse(s[ix[i]:(ix[i+1] - 1)],
            asText = TRUE,
            useInternalNodes = useInternalNodes)))
      }
   } else {
      if (asList) x <- list(x)
   }

   return(x)
}
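
To make the multi-document step concrete, here is a small base-R
sketch of just the text-splitting part, run on made-up input. The toy
lines in 's' are invented for the example (real blast XML is far
larger), and the sketch assumes that each document begins with an
"<?xml" declaration at the start of a line.

```r
# toy input: two XML documents concatenated in one character vector,
# one element per line (invented data, for illustration only)
s <- c('<?xml version="1.0"?>', '<BlastOutput>', '  <a>1</a>',
       '</BlastOutput>',
       '<?xml version="1.0"?>', '<BlastOutput>', '  <a>2</a>',
       '</BlastOutput>')

# line numbers where each document starts, plus a sentinel one past
# the last line so the final document has an end point
ix <- c(grep("^<\\?xml", s), length(s) + 1)

# cut the text into one slab of lines per document
docs <- lapply(seq_len(length(ix) - 1),
               function(i) s[ix[i]:(ix[i + 1] - 1)])
```

Each element of 'docs' is then the text of a single XML document,
which is what the function above hands to xmlTreeParse with
asText = TRUE.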