Fastest way to read CSV files

Gaston Fiore ▴ 40

Hello everyone,

Is there a faster method to read CSV files than the read.csv function? I have CSV files containing a rectangular array with about 17 rows and 1.5 million columns of integer entries, and read.csv is too slow for my needs.

Thanks for your help,

-Gaston

ADD COMMENT • link updated 14.6 years ago by Paul Leo ▴ 970 • written 14.6 years ago by Gaston Fiore ▴ 40

Sean Davis 21k

Try using scan and then rearrange the resulting vector.

Sean

_______________________________________________
Bioconductor mailing list
Bioconductor@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.6 years ago Sean Davis 21k
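A minimal sketch of the scan-and-rearrange idea, assuming a comma-separated file of integers with no header and a known number of rows. The temporary file and the tiny 3 x 5 matrix are stand-ins for illustration, not the original poster's data.

```r
# Build a small stand-in for the real file: 3 rows x 5 columns of integers.
x <- matrix(1:15, nrow = 3, byrow = TRUE)
csv_path <- tempfile(fileext = ".csv")
write.table(x, file = csv_path, sep = ",",
            row.names = FALSE, col.names = FALSE)

# scan() returns one long vector in file order, i.e. row after row.
n_rows <- 3
v <- scan(csv_path, what = integer(), sep = ",", quiet = TRUE)

# R fills matrices column by column by default, so byrow = TRUE is
# needed to get the values back into their original positions.
y <- matrix(v, nrow = n_rows, byrow = TRUE)
```

Without byrow = TRUE the result has the right dimensions but the values end up in the wrong cells, because the file is stored row by row.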

This piqued my interest, as for really large datasets it can in general speed up things greatly to use binary formats (1.5 million does not sound *that* big to me). I have no experience with this in R, but a little search brought up e.g. readBin(). So it might be possible, especially if your data is quite simple (all integers), to first convert your data externally to a binary format (using perl or python or ..) and then read it with readBin().

Disclaimer: Quite likely a random thought from an ill-informed bystander.

best,
Stijn

Stijn van Dongen (forename pronunciation: [Stan])
EMBL-EBI, Hinxton, Cambridge, CB10 1SD, UK
http://micans.org/stijn

ADD REPLY • link 14.6 years ago Stijn van Dongen ▴ 80
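The convert-once idea above could look roughly like this. This is only a sketch: the conversion is done in R rather than in an external language to keep it self-contained, the file names are temporary placeholders, and the data are assumed to be plain integers that fit in 4 bytes.

```r
csv_path <- tempfile(fileext = ".csv")
bin_path <- tempfile(fileext = ".bin")

# Stand-in data: 4 rows x 6 columns of integers, written as CSV.
x <- matrix(1:24, nrow = 4)
write.table(x, csv_path, sep = ",", row.names = FALSE, col.names = FALSE)

# One-time conversion: parse the text once and store raw 4-byte integers.
v <- scan(csv_path, what = integer(), sep = ",", quiet = TRUE)
con <- file(bin_path, "wb")
writeBin(v, con, size = 4L)
close(con)

# Later loads skip text parsing entirely.
n_values <- file.size(bin_path) / 4   # 4 bytes per stored integer
con <- file(bin_path, "rb")
v2 <- readBin(con, what = integer(), n = n_values, size = 4L)
close(con)

# The values were stored in file (row-major) order, hence byrow = TRUE.
x2 <- matrix(v2, nrow = nrow(x), byrow = TRUE)
```

The slow text parse is paid once; whether that is worthwhile depends on how often the same file is read.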

Binary is always a good thought, but reading into another language to write binary to load into R is probably not going to be a big time saver over using R's capabilities.

> x=matrix(floor(runif(1.7e6 * 20)*1000),nr=20)
> dim(x)
[1]      20 1700000
> write.table(x,file='abc.txt',sep="\t",col.names=FALSE,row.names=FALSE)
> system.time((y = matrix(scan('abc.txt',what='integer'),nr=20)))
Read 34000000 items
   user  system elapsed
 17.555   0.685  18.258
> dim(y)
[1]      20 1700000

So, a 1.7 million column by 20 row table of integers can be read in about 18 seconds using scan, just to give a rough sketch of profiling results. You might be able to get close using read.table and setting column classes appropriately, also.

Sean

ADD REPLY • link 14.6 years ago Sean Davis 21k
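One detail worth checking when adapting the scan call above: scan() takes the type from the *value* of `what`, so what='integer' (a character string) returns character data, whereas what=integer() returns integers. The sketch below shows the difference on a small temporary file and also the read.table variant with declared column classes; the file and sizes are made up for illustration and no timings are claimed.

```r
path <- tempfile(fileext = ".txt")
x <- matrix(1:12, nrow = 3)
write.table(x, file = path, sep = "\t",
            row.names = FALSE, col.names = FALSE)

# 'what' is matched by type: a string literal means "read as character".
as_char <- scan(path, what = "integer", quiet = TRUE)
as_int  <- scan(path, what = integer(), quiet = TRUE)

# read.table with colClasses avoids per-column type guessing;
# an unnamed colClasses value is recycled across all columns.
df <- read.table(path, sep = "\t", header = FALSE, colClasses = "integer")
m <- unname(as.matrix(df))
```

Declaring the type up front usually helps both approaches, since R no longer has to infer it from the text.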

Hi,

If you did do this in binary, we'd see the following:

> x <- matrix(floor(runif(1.7e6 * 20)*1000),nr=20)
> z <- writeBin(as.vector(x),file("test.bin","wb"))
> system.time({zz <- readBin(file("test.bin","rb"),numeric(),20*1700000); dim(zz) <- c(20,1700000)})
   user  system elapsed
  0.171   0.574   0.751

So, less than a second to read this in.

If you were working in, say, Perl, you could write data like this as follows:

open M, ">test2.bin";
for($i=0; $i<20*1700000; $i++) {
    print M pack('i',$i);
}
close M;

and read that file into R as:

> system.time({e <- readBin("test2.bin",integer(),20*1700000,size=4); dim(e) <- c(20,1700000)})
   user  system elapsed
  0.093   0.273   0.370

Even faster, specifying explicitly the int size.

--Misha

ADD REPLY • link 14.6 years ago Misha Kapushesky ▴ 130
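A hedged variation on the binary round trip above, entirely in R. In the first example of the post, floor(runif(...)) produces doubles, which writeBin stores as 8 bytes each; the second file holds 4-byte integers, so it is half the size, which may partly account for the faster read. The sketch below stores integers explicitly, fixes the byte order, and closes its connections; the smaller matrix size and the temporary file are assumptions for illustration.

```r
# Integer storage: as.integer() makes writeBin emit 4-byte values.
x <- matrix(as.integer(floor(runif(20 * 1000) * 1000)), nrow = 20)
bin_path <- tempfile(fileext = ".bin")

con <- file(bin_path, "wb")
# Fixing 'endian' keeps the file readable on machines with another byte order.
writeBin(as.vector(x), con, size = 4L, endian = "little")
close(con)

con <- file(bin_path, "rb")
e <- readBin(con, what = integer(), n = length(x),
             size = 4L, endian = "little")
close(con)

# writeBin stored the matrix column by column, so restoring dim() is enough.
dim(e) <- dim(x)
```

Because the vector was written in R's own column-major order, no byrow rearrangement is needed on the way back in.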

Very nice. I'm eating my words....

Sean

ADD REPLY • link 14.6 years ago Sean Davis 21k

Thanks Misha, that's very instructive. I'd like to add that this can be made quite parametrizable, in that it is possible to write and read the dimensions of the object as well. In fact, by writing some kind of 'cookie' number it would be possible to have code that can recognize what *type* of data it needs to read. In the example below however, just the dimensions are first written to and then read from file. When reading, the dimensions are no longer hardcoded, but read from the same connection.

x <- matrix(floor(runif(1.7e4 * 20)*1000),nr=20)
cn <- file("test.bin","wb")
writeBin(dim(x), cn)
writeBin(as.vector(x), cn)
close(cn)

cn <- file("test.bin", "rb")
dims <- readBin(cn, integer(), 2)
x2 <- matrix(readBin(cn,numeric(), dims[1] * dims[2]), nrow=dims[1], ncol=dims[2])
close(cn)

sum(x != x2)

A hex dump of the file test.bin gives this for the first line:

        <----integer 1 ---> <--- integer 2 --->
0000000 0014 0000 4268 0000 0000 0000 c000 4070

Indeed, hexadecimal 0x14 == 20 and hexadecimal 0x4268 == 17000, this on a little-endian machine.

best,
Stijn

ADD REPLY • link 14.6 years ago Stijn van Dongen ▴ 80
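The 'cookie' idea mentioned above could be sketched as follows. Everything about the header layout here is hypothetical: the MAGIC value, the type codes (1 = integer, 2 = double) and the function names write_matrix / read_matrix are made up for illustration and are not part of any existing format.

```r
MAGIC <- 0x4D545831L   # hypothetical cookie identifying this ad hoc format

write_matrix <- function(x, path) {
  type_code <- if (is.integer(x)) 1L else 2L   # hypothetical type codes
  con <- file(path, "wb")
  on.exit(close(con))
  # Header: cookie, type code, number of rows, number of columns.
  writeBin(c(MAGIC, type_code, dim(x)), con, size = 4L, endian = "little")
  writeBin(as.vector(x), con, endian = "little")
  invisible(path)
}

read_matrix <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, integer(), n = 4L, size = 4L, endian = "little")
  # Refuse files that do not start with the expected cookie.
  stopifnot(length(header) == 4L, header[1] == MAGIC)
  what <- if (header[2] == 1L) integer() else numeric()
  dims <- header[3:4]
  values <- readBin(con, what, n = prod(dims), endian = "little")
  matrix(values, nrow = dims[1], ncol = dims[2])
}

# Round trip for both supported storage types.
p_int <- tempfile(fileext = ".bin")
p_dbl <- tempfile(fileext = ".bin")
xi <- matrix(1:6, nrow = 2)
xd <- matrix(c(0.5, 1.5, 2.5, 3.5), nrow = 2)
write_matrix(xi, p_int)
write_matrix(xd, p_dbl)
```

The reader no longer needs to be told either the dimensions or the element type, at the cost of 16 header bytes per file.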

Maybe worth mentioning save(..., compress=FALSE) / load(), which will be fast (though not as fast as readBin, and difficult to load parts of the data) and robust. Also SQL, NetCDF and friends which will be portable / interoperable.

Depending on use case, it can be tricky to get good timings on these operations: your OS has probably cached those values when written, so input seems very fast, whereas when they've been removed from cache the first access could be considerably slower (order of magnitude is my casual impression).

Martin

Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center

ADD REPLY • link 14.6 years ago Martin Morgan 25k
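For reference, the save/load route mentioned above might look like this. It is a small sketch with a temporary file and toy data; loading into a separate environment is just one way to avoid overwriting objects in the workspace.

```r
x <- matrix(1:20, nrow = 4)
rda_path <- tempfile(fileext = ".RData")

# compress = FALSE trades disk space for less work when reading back.
save(x, file = rda_path, compress = FALSE)

# load() restores objects under their saved names and returns those names.
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)
```

Unlike the raw readBin approach, dimensions and types are restored automatically, but the whole object has to be read at once.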

Hi,

Martin is absolutely right. For our data analysis needs here we use NetCDF extensively. It's about as fast as direct binary reads, is portable, etc., without the headache of worrying about many nitty gritty details.

--Misha

ADD REPLY • link 14.6 years ago Misha Kapushesky ▴ 130


sorry, this:

>         <----integer 1 ---> <--- integer 2 --->
> 0000000 0014 0000 4268 0000 0000 0000 c000 4070

should have been:

        <-int 1-> <-int 2->
0000000 0014 0000 4268 0000 0000 0000 c000 4070

> Thanks Misha, that's very instructive.
> I'd like to add that this can be made quite parametrizable, in that it is
> possible to write and read the dimensions of the object as well. In fact, by
> writing some kind of 'cookie' number it would be possible to have code that can
> recognize what *type* of data it needs to read. In the example below however,
> just the dimensions are first written to and then read from file. When reading,
> the dimensions are no longer hardcoded, but read from the same connection.
>
> x <- matrix(floor(runif(1.7e4 * 20)*1000),nr=20)
> cn <- file("test.bin","wb")
> writeBin(dim(x), cn)
> writeBin(as.vector(x), cn)
> close(cn)
>
> cn <- file("test.bin", "rb")
> dims <- readBin(cn, integer(), 2)
> x2 <- matrix(readBin(cn,numeric(), dims[1] * dims[2]), nrow=dims[1], ncol=dims[2])
> close(cn)
>
> sum(x != x2)
>
> a hex dump of the file test.bin gives this for the first line:
>
>         <----integer 1 ---> <--- integer 2 --->
> 0000000 0014 0000 4268 0000 0000 0000 c000 4070
>
> indeed, hexadecimal 0x14 == 20 and hexadecimal 4268 == 17000,
> this on a little endian machine.
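The 'cookie' idea mentioned above is only described, not shown. A minimal sketch of how it could work follows. The function names, the tag values 1 and 2, and the file layout (one tag, two dimensions, then the payload) are all assumptions made up for this illustration, not an established format.

```r
# Hypothetical type tags ("cookies"); the values are arbitrary choices.
COOKIE_INTEGER <- 1L
COOKIE_DOUBLE  <- 2L

# Write: type tag, then dimensions, then the data in column-major order.
write_matrix_bin <- function(x, path) {
  stopifnot(is.matrix(x), is.integer(x) || is.double(x))
  cn <- file(path, "wb")
  on.exit(close(cn))
  writeBin(if (is.integer(x)) COOKIE_INTEGER else COOKIE_DOUBLE, cn)
  writeBin(dim(x), cn)
  writeBin(as.vector(x), cn)
  invisible(path)
}

# Read: the tag decides which type readBin() is asked for.
read_matrix_bin <- function(path) {
  cn <- file(path, "rb")
  on.exit(close(cn))
  cookie <- readBin(cn, integer(), 1)
  dims   <- readBin(cn, integer(), 2)
  what   <- switch(as.character(cookie),
                   "1" = integer(),
                   "2" = numeric(),
                   stop("unknown cookie: ", cookie))
  matrix(readBin(cn, what, dims[1] * dims[2]), nrow = dims[1], ncol = dims[2])
}

# Round trip for an integer and a double matrix.
bin_path <- tempfile(fileext = ".bin")
xi <- matrix(1:6, nrow = 2)
xd <- xi / 2
ri <- read_matrix_bin(write_matrix_bin(xi, bin_path))
rd <- read_matrix_bin(write_matrix_bin(xd, bin_path))
```

Storing integers as integers also halves the payload compared with the double-based example, since R writes 4 bytes per integer and 8 per double by default. Like that example, the file is written in the machine's native byte order, so it is not portable across endianness unless the endian argument is set on both sides.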

ADD REPLY • link 14.6 years ago Stijn van Dongen ▴ 80


Paul Leo ▴ 970

@paul-leo-2092

Last seen 10.6 years ago

Yep, Sean is correct: scan is the way to go. Here is a code snippet that works out the dimensions of the matrix you are reading in by reading the first few lines, and then rearranges the result. You may need to change the "sep" arguments in the read calls.

file<-"input"
options(show.error.messages = TRUE)
### reads file input.TXT
chromo<-try(read.delim(paste(file,".TXT",sep=""),header=T,nrows=1,sep="\t",fill=TRUE))
num.vars<-dim(chromo)[2]
vars.names<-colnames(chromo)[1:dim(chromo)[2]]
##########################
header.lines<-1
num.lines<-1
###################################
chromo<-try(scan(paste(file,".TXT",sep=""),what=character(num.vars),skip=header.lines,sep="\t",fill=TRUE))
num.lines<-length(chromo)/(num.vars)
dim(chromo)<-c(num.vars,num.lines)
chromo<-t(chromo)
colnames(chromo)<-vars.names

A few million lines takes < 20 secs, typically.
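The same idea, cut down to the integer-only case from the original question, could be sketched as below. The helper name read_int_table and the demo file are hypothetical, and the sketch assumes exactly one header line and no missing fields.

```r
# Hypothetical helper: read a delimited file of integers with one header line.
read_int_table <- function(path, sep = ",") {
  # The header line gives the column names and hence the number of columns.
  header <- scan(path, what = character(), sep = sep, nlines = 1, quiet = TRUE)
  # Everything after the header is read as one long integer vector.
  values <- scan(path, what = integer(), sep = sep, skip = 1, quiet = TRUE)
  stopifnot(length(values) %% length(header) == 0)
  # scan() returns values row by row, so fill the matrix by row.
  matrix(values, ncol = length(header), byrow = TRUE,
         dimnames = list(NULL, header))
}

# Tiny demo file with a header and two data rows.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6"), demo_path)
m <- read_int_table(demo_path)
```

Reading with what = integer() keeps the result as an integer matrix, whereas the snippet above reads everything as character and leaves any conversion to the caller.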

ADD COMMENT • link 14.6 years ago Paul Leo ▴ 970
