matrix transformation
Bucher Elmar ▴ 80
@bucher-elmar-3683
Last seen 10.2 years ago
Dear Mailing List,

I wrote the following function "matrix2tuple.sf" to translate a "cartesian xy matrix" into a "tuple matrix", so that it can be stored in a relational database. The code works fine for a test set. My problem is that my real matrix is 7000 x 10000, which ends up as 70,000,000 tuples. The transformation takes days X(... Has anyone an idea how I can optimize the described functions for speed?

Best Wishes, Elmar Bucher

##### BEGIN CODE LISTING ####

matrix2tuple.sf <- function(xy.matrix.m = NULL) {
    tuple.m <- NULL
    x.axis.v <- colnames(xy.matrix.m)
    y.axis.v <- rownames(xy.matrix.m)
    ## walk over every (column, row) pair and grow the tuple matrix row by row
    for (x.v in x.axis.v) {
        for (y.v in y.axis.v) {
            tuple.v <- c(x.v, y.v, xy.matrix.m[y.v, x.v])
            if (is.null(tuple.m)) {    # first tuple: start the result matrix
                tuple.m <- tuple.v
            } else {                   # otherwise append a row
                tuple.m <- rbind(tuple.m, tuple.v)
            }
        }
    }
    return(tuple.m)
}

put.db.sf <- function(conn.s4 = NULL, x.v = NULL, y.v = NULL, xy.v = NULL) {
    query.v <- paste("INSERT INTO matrixdbtb ('xaxis','yaxis','xy') VALUES('",
                     x.v, "','", y.v, "','", xy.v, "');", sep = "")
    #catch.df <- dbGetQuery(conn.s4, query.v)
    cat(query.v, "\n")
}

## main ##
matrix.m <- c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16")
dim(matrix.m) <- c(4, 4)
colnames(matrix.m) <- c("A","B","C","D")
rownames(matrix.m) <- c("a","b","c","d")
tuple.m <- matrix2tuple.sf(matrix.m)
for (i in 1:nrow(tuple.m)) {
    put.db.sf(x.v = tuple.m[i,1], y.v = tuple.m[i,2], xy.v = tuple.m[i,3])
}

#### END CODE LISTING ########
@sean-davis-490
Last seen 3 months ago
United States
Hi, Elmar.

The following should be pretty quick:

tuple.m = data.frame(xaxis = rep(colnames(matrix.m), each = nrow(matrix.m)),
                     yaxis = rep(rownames(matrix.m)),
                     xy = as.vector(matrix.m))

For loading this many rows, you will want to be using all the normal tricks of your database. This will often include removing indexes from the table, temporarily disabling foreign key constraints on the table, etc. You should minimally be using a transaction around, say, every 100000 inserts. Finally, several databases have a specific "loader" functionality that can take large tables like your 70M-row table and insert them very quickly. dbWriteTable might include this functionality or might not, depending on the database. Assuming this is a one-off affair, though, even writing the table to disk and then using a dedicated loader will probably work just fine.

If you have questions on the database side of things, perhaps the best place to write is the R-sig-db list.

Sean
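A minimal sketch of the "dump to disk, then bulk-load" route Sean mentions for a one-off job, assuming the reshaped tuple.m data frame from the one-liner above and a PostgreSQL table named matrixdbtb; the file name and the \copy invocation are illustrative only:

## Write all tuples to a tab-delimited file in a single call.
write.table(tuple.m, file = "matrixdbtb.tsv", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

## Then load it on the database side, e.g. from psql:
##   \copy matrixdbtb (xaxis, yaxis, xy) FROM 'matrixdbtb.tsv'
## The MySQL counterpart would be LOAD DATA INFILE.

This avoids issuing any INSERT statements from R at all, which is usually the fastest option for a 70-million-row load.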
Dear Sean and Vincent,

Thank you very much for the quick response. Sean, I will try out your line of code, and Vincent, I will follow your links.

Have a nice day!
So long, Elmar
Thank you Sean, you are a wizard! Your one-liner

tuple.m = data.frame(xaxis = rep(colnames(matrix.m), each = nrow(matrix.m)),
                     yaxis = rep(rownames(matrix.m)),
                     xy = as.vector(matrix.m))

together with dbWriteTable does magic: what took minutes before now takes seconds (in both PostgreSQL and MySQL). Yeah X).

Have a great week too,
Elmar
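For reference, a rough sketch of how the one-liner and dbWriteTable might be combined against PostgreSQL; the connection details are placeholders, tuple.m is the data frame built above, and the matrixdbtb table is assumed to exist already:

library(RPostgreSQL)

## Placeholder connection details.
conn <- dbConnect(PostgreSQL(), dbname = "mydb", host = "localhost",
                  user = "me", password = "secret")

## Append the whole tuple data frame to the existing table in one call.
dbWriteTable(conn, "matrixdbtb", tuple.m, append = TRUE, row.names = FALSE)

dbDisconnect(conn)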
@vincent-j-carey-jr-4
Last seen 10 weeks ago
United States
This question has no direct relationship to Bioconductor. You should look at the archives of the R-sig-DB mailing list, https://stat.ethz.ch/mailman/listinfo/r-sig-db, and perhaps pose the question there if it has not been previously handled.

DBI's dbWriteTable is surely a relevant function to understand. Various Bioconductor facilities use DBI to create and manage large tables in SQLite -- AnnotationDbi and Genominator are two packages that come to mind. Other schemes for handling large data resources alongside R are described in the CRAN task view for high-performance computing.
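To make the dbWriteTable suggestion concrete, here is a small self-contained sketch using an in-memory SQLite database via RSQLite; the table name matrixdbtb and the tuple.m data frame are carried over from the thread above:

library(DBI)
library(RSQLite)

## Throwaway in-memory SQLite database.
conn <- dbConnect(SQLite(), dbname = ":memory:")

## dbWriteTable creates the table and bulk-loads the whole data frame
## in a single call, instead of millions of individual INSERTs.
dbWriteTable(conn, "matrixdbtb", tuple.m, row.names = FALSE)

## Sanity check: count the loaded tuples.
dbGetQuery(conn, "SELECT count(*) AS n FROM matrixdbtb")

dbDisconnect(conn)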
Ido M. Tamir ▴ 150
@ido-m-tamir-2778
Last seen 10.2 years ago
I didn't look at the R code, but the inserts could be sped up if you do them in batches rather than as individual inserts. Something like:

AUTOCOMMIT OFF
100 x INSERT ...
COMMIT

Maybe do some cheap profiling by just commenting out the db code, to see whether the database is what is slowing things down.

best,
ido
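In R/DBI terms, Ido's batching idea amounts to wrapping chunks of INSERT statements in a single transaction. A rough sketch, assuming an open DBI connection conn and the tuple.m tuples from above; the chunk size is arbitrary, and newer DBI versions offer dbBegin()/dbCommit()/dbExecute() instead of the raw SQL used here:

chunk.size <- 100000
n <- nrow(tuple.m)
for (start in seq(1, n, by = chunk.size)) {
    end <- min(start + chunk.size - 1, n)
    dbGetQuery(conn, "BEGIN")    # one transaction per chunk, not per row
    for (i in start:end) {
        query.v <- paste("INSERT INTO matrixdbtb (xaxis, yaxis, xy) VALUES ('",
                         tuple.m[i, 1], "', '", tuple.m[i, 2], "', '",
                         tuple.m[i, 3], "');", sep = "")
        dbGetQuery(conn, query.v)
    }
    dbGetQuery(conn, "COMMIT")
}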