Question

bioconductor packages for importing custom annotation tables into R?

0

Entering edit mode

k.askland ▴ 20

@kaskland-6850

Last seen 9.5 years ago

United States

Hello,

I would like to know if there is a bioconductor (or more general R) package that facilitates efficient importing/reading of very large genomic annotation files. I have a custom annotation set of ~10,000 columns for 1.1M BP positions (rows) .

The read.table and read.csv options fail because of the table size (but works if I only import 100,000 rows at a time). I know that Bioconductor has many packages for annotating sequences, etc..., but I couldn't find anything that helps to streamline the process of bringing your own annotations into the workspace for further processing/manipulation.

Thanks for any suggestions,

Kathleen

ANNOT <- read.table('D:/DATA/ANNOT.txt', stringsAsFactors=FALSE, 
    colClasses=classes, skip=1, na.strings="", sep="\t", nrows=1112880)
#Error: cannot allocate vector of size 8.5 Mb
#In addition: Warning messages:
#1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  #Reached total allocation of 32703Mb: see help(memory.size)
#2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  #Reached total allocation of 32703Mb: see help(memory.size)
#3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  #Reached total allocation of 32703Mb: see help(memory.size)
#4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  #Reached total allocation of 32703Mb: see help(memory.size)

system.time(ANNOT100000 <- read.table('D:/DATA/ANNOT.txt', stringsAsFactors=FALSE, 
    colClasses=classes, skip=1, na.strings="", sep="\t", nrows=100000))

   #user  system elapsed 
 #377.74   15.71  393.54

sessionInfo() #I'm using Revolution R, which has R version 3.0.3, but I have the same problem in R 3.1.1

R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Revobase_7.2.0   RevoMods_7.2.0   RevoScaleR_7.2.0 lattice_0.20-27 
[5] rpart_4.1-5     

loaded via a namespace (and not attached):
[1] codetools_0.2-9   foreach_1.4.2     grid_3.0.3        iterators_1.0.7  
[5] pkgXMLBuilder_1.0 revoIpe_1.0       tools_3.0.3       XML_3.98-1.1

annotation importing • 2.3k views

ADD COMMENT • link updated 10.4 years ago by Martin Morgan 25k • written 10.4 years ago by k.askland ▴ 20

score 2 · Accepted Answer · 2014-10-31

2

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 11 weeks ago

United States

Hi Kathleen -- one solution is to use a data base (SQLite via the RSQLite package being a good bet for cross-platform portability; it's used in all of our annotation packages). You would want to query the data base for relevant information, rather than reading the entire file in at a time. I really don't know whether this is suitable for the size (especially number of columns) you're talking about. The data.table (maybe requires in-memory data, so likely to be as problematic as other approaches). and dplyr packages are also very effective at dealing with large data.frame style data.

It's hard to imagine what 10,000 annotation columns represent? Maybe many of these are somehow more like a matrix (e.g., of SNPs across samples)? One could then use something like rhdf5.

It might help if you provided a little bit more information on what your data looks like.

From your answer, it sounds like you have an 'adjacency matrix'. Likely it is very sparse (few '1' values). Here's some toy data

    m <- matrix(as.integer(runif(100 * 26) < .05), 100,
                dimnames=list(paste0("rs", 1:100), LETTERS))
    fl <- tempfile()
    write.csv(m, fl, quote=FALSE)

A strategy is to represent it instead as a matrix of 'from' and 'to' edges, and to populate the matrix by iterating through (below, in chunks of size N = 10, but in actuality maybe N=100000) the large file. I implemented this as

    fromto <- matrix(character(), 0, 2,
                     dimnames=list(NULL, c("from", "to")))
    N <- 10

    con <- file(fl)
    open(con)
    colnames <- strsplit(readLines(con, 1), ",")[[1]][-1]
    repeat {
        val <- tryCatch({
            read.csv(con, header=FALSE, row.names=1, nrows=N)
        }, error=function(err) {
            message("error? ", conditionMessage(err))
            matrix(integer(), 0, 0)
        })
        if (nrow(val) == 0)
            break
        idx <- val == 1
        m <- matrix(c(rownames(val)[row(val)[idx]],
                      colnames[col(val)[idx]]), ncol=2)
        fromto <- rbind(fromto, m)
    }
    close(con)

To work with these, one might use a list of some sort, or perhaps a 'sparse' matrix from the Matrix package or a graph from the graph package.

    split(fromto[,2], fromto[, 1])
    IRanges::splitAsList(fromto[,2], fromto[,1])

Maybe there are some ideas in there that are helpful...

ADD COMMENT • link 10.4 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin, Thanks for your response.

The problem I am having is that I am not using the annotation file to annotate a set of SNPs or genes, as is usually the case. I actually want to bring the entire file in because I want to do some processing on the matrix itself.

I tried to take a look at rhdf5, but I'm a bit confused as to whether .txt files can be converted to this format. Moreover, it seems that, even if so, it would require that I import the file into R in order to convert to a new file type, but I can't get the file imported. The other option I have, obviously, is to break up the file into smaller chunks and import them separately. But, I was hoping to avoid that.

ADD REPLY • link 10.4 years ago k.askland ▴ 20

0

Entering edit mode

The idea would be to do a one-time transformation of the data from its unwieldy plain text representation to a more reasonable representation; in sql or hdf5 likely you'd do this with 'standard' tools outside R, or iterate through the file in (not break it up into) chunks in R, and with each chunk append it to an existing sql / hdf5 file. Are these actually SNPs? Because then you might want to look at the snpStats package. Again it'll help to be more explicit about what the 'annotation' data you have looks like.

ADD REPLY • link 10.4 years ago Martin Morgan 25k

0

Entering edit mode

Thanks again, Martin. Sorry. It's 1.1M SNPs (rs IDs) with 10,000 gene ontology annotations - the gene in which the SNP resides is/is not annotated to that gene ontology term.

ADD REPLY • link 10.4 years ago k.askland ▴ 20

0

Entering edit mode

Hi Kathleen -- I added to my answer with some suggestions; not sure if these are helpful...

ADD REPLY • link 10.4 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin,

I apologize for the delay. I was away from the office.

I will see if I can implement the function you provided.

Thanks again,

Kathleen

ADD REPLY • link 10.4 years ago k.askland ▴ 20