BHC appears to be broken


Joseph Viviano ▴ 20

@joseph-viviano-5901

Last seen 10.6 years ago

Hello all,

I am having a great deal of trouble getting BHC to run on non-trivial datasets. I am using the following commands:

data <- read.csv("data.csv")
itemLabels <- names(data)
timePoints <- 1:24 # for the time-course case

nDataItems <- nrow(data) # this equals 152000, approximately
nFeatures <- ncol(data) # this equals 24

BHC_OUT <- bhc(data,itemLabels,timePoints"time-course",verbose=TRUE)

---

This causes R to immediately lock up on Windows 7, Linux Mint 13, and OS X 10.6.8. The input data are variance-normalized time series exported from MATLAB.

Here is a sample time series from the .csv:

-1.7858,-0.26742,0.37038,-0.87986,-0.55435,-0.89642,-1.2815,-0.62659,-0.98028,-1.0542,-1.0058,0.51103,0.90252,2.5272,-0.3048,0.81275,0.22414,0.15235,-0.20437,0.2545,0.95103,1.4214,0.82618,0.77179

Any help would be greatly appreciated.

Cheers, Joseph

BHC BHC • 1.4k views

ADD COMMENT • link 12.0 years ago Joseph Viviano ▴ 20


Dan Tenenbaum ★ 8.2k

@dan-tenenbaum-4256

Last seen 10 months ago

United States

On Wed, Apr 24, 2013 at 3:56 PM, Joseph Viviano <vivianoj at yorku.ca> wrote:

> data <- read.csv("data.csv")

Can you share this dataset, or at least enough of it to reproduce the problem?

> BHC_OUT <- bhc(data,itemLabels,timePoints"time-course",verbose=TRUE)

This line produces a syntax error.

In order to help you we need a fully reproducible example. Also, please send the output of the sessionInfo() command.

Dan

ADD COMMENT • link 12.0 years ago Dan Tenenbaum ★ 8.2k
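For anyone preparing the kind of reproducible example requested above, here is a minimal sketch in base R. It assumes the full data set is a numeric matrix with one time series per row; the synthetic matrix, the subset size of 50, and the temporary file path are placeholders for illustration only, not the poster's actual data.

```r
# Synthetic stand-in for the real data: 1000 series x 24 time points.
set.seed(1)
full <- matrix(rnorm(1000 * 24), nrow = 1000, ncol = 24)

# Keep a small random subset of rows that is easy to share.
keep <- sort(sample(nrow(full), 50))
subsample <- full[keep, , drop = FALSE]

# Write without header or row names, so it reads back with header = FALSE.
out_file <- file.path(tempdir(), "subsample.csv")
write.table(subsample, out_file, sep = ",",
            row.names = FALSE, col.names = FALSE)

# Confirm the round trip before sharing the file.
check <- read.csv(out_file, header = FALSE)
stopifnot(nrow(check) == 50, ncol(check) == 24)

# Environment details (R version, platform, loaded packages) to paste into the post.
print(sessionInfo())
```

Before posting, it is worth confirming that the subset still triggers the problem in a fresh R session.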


Joseph Viviano ▴ 20

@joseph-viviano-5901

Last seen 10.6 years ago

Hello, my apologies for the sloppy post.

You can find a sample dataset here: https://www.dropbox.com/sh/p1od9e4vx8ky66a/igt2OkNDbQ

And the code I ran was essentially:

data <- read.csv("subsample.csv",header=FALSE)
itemLabels <- t(read.csv("labels.csv", header=FALSE)) #read in and transpose
timePoints <- 1:24 #number of timepoints
BHC_OUT <- bhc(data,itemLabels,timePoints,"time-course",verbose=TRUE,numThreads=8)

This is where it completely locks up. Also, note that I get the same result with multiple permutations of the bhc command, and that this occurs on multiple versions of R for me (including the latest releases).

I should note that I have demeaned and variance normalized all time series before entering them into bhc, if that makes a difference.

Cheers, Joseph

ADD COMMENT • link 12.0 years ago Joseph Viviano ▴ 20
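The demeaning and variance normalization mentioned above was done in MATLAB and the exact procedure is not given in the thread. As a rough sketch only, the equivalent per-series standardization in base R might look like the following, assuming one time series per row and the sample standard deviation (n - 1 denominator). The input matrix is synthetic.

```r
# Synthetic stand-in data: 5 series x 24 time points with non-zero mean.
set.seed(7)
raw <- matrix(rnorm(5 * 24, mean = 3, sd = 2), nrow = 5)

# scale() standardizes columns, so transpose, standardize, and transpose back
# to get each row (time series) to mean 0 and standard deviation 1.
normalized <- t(scale(t(raw)))

# Each row should now have mean ~0 and sd ~1.
print(round(rowMeans(normalized), 10))
print(round(apply(normalized, 1, sd), 10))
```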


On Fri, Apr 26, 2013 at 3:00 PM, Joseph Viviano <vivianoj at yorku.ca> wrote:

> BHC_OUT <- bhc(data,itemLabels,timePoints,"time-course",verbose=TRUE,numThreads=8)
>
> This is where it completely locks up. Also, note that I get the same result with multiple permutations of the bhc command, and that this occurs on multiple versions of R for me (including the latest releases).

Thanks. It does appear to use increasing amounts of CPU and memory. I'm cc'ing the BHC maintainer.

Dan

ADD REPLY • link 12.0 years ago Dan Tenenbaum ★ 8.2k


Dear Joseph,

I believe the problem is that you're calling 'bhc' incorrectly. There is a legacy value that needs to be entered as the 3rd input, which you're missing (see below). Apologies for the presence of this - it's no longer required by the algorithm, but it is there so that scripts written for an early release of the package don't crash. The documentation examples for 'bhc' have it set correctly, but it's not particularly highlighted. (If we do a version 2.0, I'm sure we'll bite the bullet and remove this.)

Here is a modified version of your code that should demonstrate both how to make it work and how to make it crash (tested on the R 3.0.0 Snow Leopard build for Mac OS X).

library(BHC)
data <- read.csv("subsample.csv",header=FALSE)
itemLabels <- t(read.csv("labels.csv", header=FALSE)) #read in and transpose
timePoints <- 1:24 #number of timepoints

##THIS ONE WORKS
startTime <- Sys.time()
BHC_OUT <- bhc(data, itemLabels, 0, timePoints,"time-course",verbose=TRUE)
plot(BHC_OUT, axes=FALSE)
print(Sys.time() - startTime)

##THIS ONE CRASHES 'R'
BHC_OUT <- bhc(data, itemLabels, timePoints,"time-course",verbose=TRUE)

BTW, I noticed the following code comment in your original email to the list:

# this equals 152000, approximately

Does this imply you might be trying the randomised algorithm for time-series BHC with over 10^5 items? If so, I would be interested to hear how you get on - we've never tried it for such a large data set. (The feedback would also be very useful, as I have in mind some ideas for a next-generation clustering tool, so any insights you gain from running a large data set would be very informative. Thanks!)

Please let me know if you have any further questions.

Best regards,
Rich

--
Dr. Richard Savage
Tel: +44 (0)24 765 72507
Systems Biology Centre
University of Warwick
Coventry CV4 7AL
United Kingdom
http://sites.google.com/site/drrichsavage/
http://21stcenturyscientist.blogspot.com/

ADD REPLY • link 12.0 years ago Rich Savage ▴ 60
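One way to guard against this kind of positional mix-up is to pass every argument by name, so that a missing value cannot shift the later arguments into the wrong slots. The sketch below is illustrative only. The argument names (in particular `nFeatureValues` for the legacy third input, plus `timePoints` and `dataType`) are taken from my reading of the BHC documentation and are an assumption here, not something stated in the thread; check them with `args(BHC::bhc)` before relying on them. The data are synthetic stand-ins, not the poster's files.

```r
# Synthetic stand-in data (NOT the poster's dataset): 20 series x 24 time points.
set.seed(42)
data <- matrix(rnorm(20 * 24), nrow = 20, ncol = 24)
itemLabels <- paste0("item", seq_len(nrow(data)))
timePoints <- 1:24

# Arguments passed by name. "nFeatureValues" is ASSUMED to be the name of the
# legacy third argument; verify against args(BHC::bhc).
bhc_args <- list(
  data           = data,
  itemLabels     = itemLabels,
  nFeatureValues = 0,
  timePoints     = timePoints,
  dataType       = "time-course",
  verbose        = TRUE
)

if (requireNamespace("BHC", quietly = TRUE)) {
  # Refuse to call if any assumed name is not a formal argument of bhc().
  unknown <- setdiff(names(bhc_args), names(formals(BHC::bhc)))
  if (length(unknown) > 0) {
    stop("Unexpected argument names: ", paste(unknown, collapse = ", "))
  }
  BHC_OUT <- do.call(BHC::bhc, bhc_args)
  print(class(BHC_OUT))
} else {
  message("BHC is not installed; skipping the clustering call.")
}
```

With positional arguments, leaving out the third value means `timePoints` lands in the legacy slot and the string "time-course" lands where the time points are expected, which appears to be what happened in the failing call above.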
