Question

flowCore: using read.FCS with which.lines is not time efficient?

skiaphrene ▴ 10

Switzerland

Dear flowCore team,

I have recently started using flowCore (and other packages) to analyse flow cytometry data.

I have a collection of 8 FCS files of 80-180 MB each, and I can easily load these into R using read.FCS.

However, to practice with the packages at first, I wanted to limit the number of events read. After reading the read.FCS help page, I tried the which.lines parameter to restrict what was read. I expected this to make reading the files faster, but the opposite was true.

Using

ff <- read.FCS( my.fcs.file, transformation=FALSE)

takes 4 to 8 seconds per file.

However, both

ff <- read.FCS( my.fcs.file, transformation=FALSE, which.lines=1:100000)
ff <- read.FCS( my.fcs.file, transformation=FALSE, which.lines=100000)

were much slower (neither had finished after 2 minutes for the first file).

So effectively, reading in the full files and then sub-selecting rows with either

ff <- ff[1:100000,]
ff <- ff[sample.int(nrow(ff),10000),]

is much faster (though, obviously, has higher memory requirements)!

Is this normal? Am I missing something? Should I just stick to my workaround?

Thanks in advance for your help!

Best regards,

-- Alex

My R session info is:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] flowType_2.4.0  BH_1.54.0-4     Rcpp_0.11.3     flowCore_1.32.1

loaded via a namespace (and not attached):
 [1] Biobase_2.26.0      BiocGenerics_0.12.0 clue_0.3-48         cluster_1.15.3      coda_0.16-1         corpcor_1.6.7      
 [7] DEoptimR_1.0-2      feature_1.2.10      flowClust_3.4.0     flowMeans_1.18.0    flowMerge_2.14.0    flowViz_1.30.0     
[13] graph_1.44.0        grid_3.1.1          hexbin_1.27.0       IDPmisc_1.1.17      KernSmooth_2.23-13  ks_1.9.2           
[19] lattice_0.20-29     latticeExtra_0.6-26 MASS_7.3-35         MCMCpack_1.3-3      misc3d_0.8-4        mvtnorm_1.0-0      
[25] parallel_3.1.1      pcaPP_1.9-60        RColorBrewer_1.0-5  rgl_0.93.1098       Rgraphviz_2.10.0    robustbase_0.91-1  
[31] rrcov_1.3-4         sfsmisc_1.0-26      stats4_3.1.1        tools_3.1.1


updated 10.3 years ago by Jiang, Mike ★ 1.3k • written 10.5 years ago by skiaphrene ▴ 10

score 0 · Answer 1 · 2015-01-07

The data section of an FCS file is stored as a stream of raw bytes, so reading the entire data chunk in one pass is more efficient.

`which.lines` is provided mainly for circumstances in which there is not enough memory to read a single FCS file in full (which almost never happens nowadays). As you have experienced, it takes more time because multiple disk I/O operations are involved. There is also extra overhead in R for calculating the location of each data slab and concatenating the slabs afterwards.

Therefore, using `which.lines` is not recommended unless you have to. (I may add this note to the help page.)
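
To make the difference in access pattern concrete, here is a minimal base-R sketch. It is not flowCore's implementation: a flat binary file of doubles stands in for the data segment of an FCS file, and the names `read_all`, `read_lines_by_seek`, `n_events` and `n_params` are hypothetical, chosen only for illustration. The first strategy reads the whole block in one call and subsets in memory; the second seeks to each requested event, reads it separately, and binds the pieces together.

```r
# Illustration only: a flat binary file of doubles stands in for the data
# segment of an FCS file. This is NOT how flowCore is implemented.
n_events <- 20000L  # hypothetical number of events (rows)
n_params <- 8L      # hypothetical number of parameters (columns)

events <- matrix(runif(n_events * n_params), nrow = n_events, ncol = n_params)
path <- tempfile(fileext = ".bin")
# Store events row by row, so each event is one contiguous 8 * n_params byte record.
writeBin(as.vector(t(events)), path, size = 8L)

# Strategy 1: read the whole data block in a single call.
read_all <- function(path, n_events, n_params) {
  con <- file(path, "rb")
  on.exit(close(con))
  values <- readBin(con, what = "double", n = n_events * n_params, size = 8L)
  matrix(values, ncol = n_params, byrow = TRUE)
}

# Strategy 2: compute the byte offset of each requested event, seek to it,
# read that one record, and concatenate all records at the end.
read_lines_by_seek <- function(path, lines, n_params) {
  con <- file(path, "rb")
  on.exit(close(con))
  rows <- lapply(lines, function(i) {
    seek(con, where = (i - 1) * n_params * 8, origin = "start")
    readBin(con, what = "double", n = n_params, size = 8L)
  })
  do.call(rbind, rows)
}

wanted <- sort(sample.int(n_events, 5000L))
t_all  <- system.time(a <- read_all(path, n_events, n_params)[wanted, , drop = FALSE])
t_seek <- system.time(b <- read_lines_by_seek(path, wanted, n_params))

# Both strategies must return the same events.
stopifnot(isTRUE(all.equal(a, b)))
cat(sprintf("read all, then subset: %.3f s\n", t_all[["elapsed"]]))
cat(sprintf("seek and read per event: %.3f s\n", t_seek[["elapsed"]]))
unlink(path)
```

Timings depend on the machine, the disk and the operating system's file cache, so the printed numbers are only indicative. The point of the sketch is the shape of the work: one large read versus thousands of small seek-and-read calls followed by a concatenation step.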