Hello,
I would like to know if there is a Bioconductor (or more general R) package that facilitates efficient importing/reading of very large genomic annotation files. I have a custom annotation set of ~10,000 columns for 1.1M BP positions (rows).
The read.table and read.csv options fail because of the table size (but they work if I import only 100,000 rows at a time). I know that Bioconductor has many packages for annotating sequences, etc., but I couldn't find anything that helps to streamline the process of bringing your own annotations into the workspace for further processing/manipulation.
Thanks for any suggestions,
Kathleen
ANNOT <- read.table('D:/DATA/ANNOT.txt', stringsAsFactors=FALSE, colClasses=classes, skip=1, na.strings="", sep="\t", nrows=1112880)
#Error: cannot allocate vector of size 8.5 Mb
#In addition: Warning messages:
#1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
#Reached total allocation of 32703Mb: see help(memory.size)
#2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
#Reached total allocation of 32703Mb: see help(memory.size)
#3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
#Reached total allocation of 32703Mb: see help(memory.size)
#4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
#Reached total allocation of 32703Mb: see help(memory.size)

system.time(ANNOT100000 <- read.table('D:/DATA/ANNOT.txt', stringsAsFactors=FALSE, colClasses=classes, skip=1, na.strings="", sep="\t", nrows=100000))
#user system elapsed
#377.74 15.71 393.54

sessionInfo()
#I'm using Revolution R, which has R version 3.0.3, but I have the same problem in R 3.1.1
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Revobase_7.2.0 RevoMods_7.2.0 RevoScaleR_7.2.0 lattice_0.20-27
[5] rpart_4.1-5

loaded via a namespace (and not attached):
[1] codetools_0.2-9 foreach_1.4.2 grid_3.0.3 iterators_1.0.7
[5] pkgXMLBuilder_1.0 revoIpe_1.0 tools_3.0.3 XML_3.98-1.1
Hi Martin, Thanks for your response.
The problem I am having is that I am not using the annotation file to annotate a set of SNPs or genes, as is usually the case. I actually want to bring the entire file in because I want to do some processing on the matrix itself.
I took a look at rhdf5, but I'm a bit confused as to whether .txt files can be converted to this format. Moreover, even if they can, it seems I would have to import the file into R in order to convert it to the new file type, and I can't get the file imported. The other option I have, obviously, is to break up the file into smaller chunks and import them separately, but I was hoping to avoid that.
The idea would be to do a one-time transformation of the data from its unwieldy plain-text representation to a more reasonable one. For SQL or HDF5 you would likely do this with 'standard' tools outside R, or iterate through the file in chunks in R (without breaking it up into separate files), appending each chunk to an existing SQL / HDF5 file. Are these actually SNPs? If so, you might want to look at the snpStats package. Again, it will help to be more explicit about what your 'annotation' data look like.
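To illustrate the "iterate through the file in chunks" idea, here is a minimal base-R sketch. It is only one possible way to do it: the helper name process_in_chunks, its arguments, and the toy file are hypothetical, and handle_chunk is a placeholder for whatever you do with each chunk (for example, appending it to an SQL or HDF5 file). The key point is that read.table() is called repeatedly on one open connection, so each call continues where the previous one stopped.

```r
# Hypothetical helper: stream a tab-delimited file through one open
# connection in fixed-size chunks, so the whole table is never in memory.
process_in_chunks <- function(path, chunk_rows, col_classes, handle_chunk) {
    con <- file(path, open = "r")
    on.exit(close(con))
    readLines(con, n = 1L)  # consume the header line (same effect as skip = 1)
    total <- 0L
    repeat {
        chunk <- tryCatch(
            read.table(con, nrows = chunk_rows, sep = "\t",
                       colClasses = col_classes, na.strings = "",
                       stringsAsFactors = FALSE),
            # read.table() signals an error once no lines remain; note that
            # this simple sketch also treats any other read error as "done"
            error = function(e) NULL
        )
        if (is.null(chunk) || nrow(chunk) == 0L) break
        handle_chunk(chunk)  # e.g. append the chunk to an on-disk store
        total <- total + nrow(chunk)
        if (nrow(chunk) < chunk_rows) break  # short chunk: end of file
    }
    total
}

# Toy demonstration: a header plus 7 rows, read 3 rows at a time
demo_file <- tempfile(fileext = ".txt")
writeLines(c("id\tGO1\tGO2",
             paste0("rs", 1:7, "\t", rep(c(0, 1), length.out = 7), "\t", 1)),
           demo_file)
chunk_sizes <- integer(0)
n_rows_read <- process_in_chunks(
    demo_file, 3L, c("character", "integer", "integer"),
    function(chunk) chunk_sizes <<- c(chunk_sizes, nrow(chunk))
)
```

Because only one chunk is held at a time, the chunk size can be tuned to whatever fits comfortably in memory; the 100,000-row reads that already work would be a natural starting point.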
Thanks again, Martin. Sorry. It's 1.1M SNPs (rs IDs) with 10,000 gene ontology annotations - the gene in which the SNP resides is/is not annotated to that gene ontology term.
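A rough calculation suggests why the full import fails, and why a sparse representation might help. The sketch below assumes every cell is stored as an R integer or logical (4 bytes each) and that annotations are coded 0/1; neither is stated in the thread. Under that assumption the dense table needs roughly 41 GiB, more than the 32703Mb limit in the warning. The helper chunk_to_coords is hypothetical: it keeps only the (row, column) positions of the 1 entries in a chunk.

```r
# Dense size of a 1,112,880 x 10,000 table at 4 bytes per cell
# (integer or logical columns are an assumption, not known from the post)
n_rows <- 1112880
n_cols <- 10000
dense_gib <- n_rows * n_cols * 4 / 2^30

# Hypothetical helper: reduce a 0/1 chunk to the coordinates of its 1 entries.
# `row_offset` is the number of rows read before this chunk.
chunk_to_coords <- function(chunk, row_offset = 0L) {
    hits <- which(as.matrix(chunk) == 1L, arr.ind = TRUE)  # NA cells are dropped
    data.frame(row = unname(hits[, 1]) + row_offset,
               col = unname(hits[, 2]))
}

# Toy chunk: 3 SNPs x 4 GO terms, imagined as rows 101-103 of the full file
toy <- data.frame(GO1 = c(0L, 1L, 0L), GO2 = c(0L, 0L, 0L),
                  GO3 = c(1L, 0L, 0L), GO4 = c(0L, 1L, 0L))
coords <- chunk_to_coords(toy, row_offset = 100L)
```

Coordinates collected this way across all chunks are the kind of input that a sparse-matrix constructor such as sparseMatrix() in the Matrix package takes as its i and j arguments. Whether this pays off depends on how many of the SNP-by-term cells are actually annotated.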
Hi Kathleen -- I added to my answer with some suggestions; not sure if these are helpful...
Hi Martin,
I apologize for the delay. I was away from the office.
I will see if I can implement the function you provided.
Thanks again,
Kathleen