[R] tapply for enormous (>2^31 row) matrices

0

Entering edit mode

Matthew Keller ▴ 100

@matthew-keller-2483

Last seen 10.6 years ago

Thank you all very much for your help (on both the r-help and the bioconductor listserves). Benilton - I couldn't get sqldf to install on the server I'm using (error is: Error : package 'gsubfn' does not have a name space). I think this was a problem for R 2.13, and I'm trying to get the admin's to install a more up-to-date version. I know that I need to probably learn a modicum of SQL given the sizes of datasets I'm using now. I ended up using a modified version of Hervé Pagès' excellent code (thank you!). I got a huge (40-fold) speed bump by using the data.table package for indexing/aggregate steps, making an hours long job a minutes long job. SO - read.table is hugely useful if you're dealing with indexing/apply-family functions on huge datasets. By the way, I'm not sure why, but read.table was a bit faster than scan for this problem... Here is the code for others: require(data.table) computeAllPairSums <- function(filename, nbindiv,nrows.to.read) { con <- file(filename, open="r") on.exit(close(con)) ans <- matrix(numeric(nbindiv * nbindiv), nrow=nbindiv) chunk <- 0L while (TRUE) { #read.table faster than scan df0 <- read.table(con,col.names=c("ID1", "ID2", "ignored", "sharing"), colClasses=c("integer", "integer", "NULL", "numeric"),nrows=nrows.to.read,comment.char="") DT <- data.table(df0) setkey(DT,ID1,ID2) ss <- DT[,sum(sharing),by="ID1,ID2"] if (nrow(df0) == 0L) break chunk <- chunk + 1L cat("Processing chunk", chunk, "... ") idd <- as.matrix(subset(ss,select=1:2)) newvec <- as.vector(as.matrix(subset(ss,select=3))) ans[idd] <- ans[idd] + newvec cat("OK\n") } ans } On Wed, Feb 22, 2012 at 3:20 PM, ilai <keren at="" math.montana.edu=""> wrote: > On Tue, Feb 21, 2012 at 4:04 PM, Matthew Keller <mckellercran at="" gmail.com=""> wrote: > >> X <- read.big.matrix("file.loc.X",sep=" ",type="double") >> hap.indices <- bigsplit(X,1:2) #this runs for too long to be useful on >> these matrices >> #I was then going to use foreach loop to sum across the splits >> identified by bigsplit > > How about just using foreach earlier in the process ? e.g. split > file.loc.X to (80) sub files and then run > read.big.matrix/bigsplit/sum inside %dopar% > > If splitting X beforehand is a problem, you could also use ?scan to > read in different chunks of the file, something like (untested > obviously): > # for X a matrix 800x4 > lineind<- seq(1,800,100) ?# create an index vec for the lines to read > ReducedX<- foreach(i = 1:8) %dopar%{ > ?x <- scan('file.loc.X',list(double(0),double(0),double(0),double(0) ),skip=lineind[i],nlines=100) > ... do your thing on x (aggregate/tapply etc.) > ?} > > Hope this helped > Elai. > > > >> >> SO - does anyone have ideas on how to deal with this problem - i.e., >> how to use a tapply() like function on an enormous matrix? This isn't >> necessarily a bigtabulate question (although if I screwed up using >> bigsplit, let me know). If another package (e.g., an SQL package) can >> do something like this efficiently, I'd like to hear about it and your >> experiences using it. >> >> Thank you in advance, >> >> Matt >> >> >> >> -- >> Matthew C Keller >> Asst. Professor of Psychology >> University of Colorado at Boulder >> www.matthewckeller.com >> >> ______________________________________________ >> R-help at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting- guide.html >> and provide commented, minimal, self-contained, reproducible code. -- Matthew C Keller Asst. Professor of Psychology University of Colorado at Boulder www.matthewckeller.com

PROcess PROcess • 1.1k views

ADD COMMENT • link updated 13.2 years ago by Gabor Grothendieck ▴ 30 • written 13.2 years ago by Matthew Keller ▴ 100

0

Entering edit mode

Gabor Grothendieck ▴ 30

@gabor-grothendieck-2332

Last seen 10.6 years ago

On Thu, Feb 23, 2012 at 11:39 AM, Matthew Keller <mckellercran at="" gmail.com=""> wrote: > Thank you all very much for your help (on both the r-help and the > bioconductor listserves). > > Benilton - I couldn't get sqldf to install on the server I'm using > (error is: Error : package 'gsubfn' does not have a name space). I > think this was a problem for R 2.13, and I'm trying to get the admin's > to install a more up-to-date version. I know that I need to probably > learn a modicum of SQL given the sizes of datasets I'm using now. Right. See the troubleshooting section of the sqldf home page: http://code.google.com/p/sqldf/#Troubleshooting -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com

ADD COMMENT • link 13.2 years ago Gabor Grothendieck ▴ 30

Login before adding your answer.