Support of parquet files for on-disk Bioconductor objects
sandmann.t

I am curious about the use of parquet files as an on-disk storage back-end for Bioconductor objects. So far, I have found some references to parquet files in individual Bioconductor packages, but there doesn't seem to be broader support, e.g. via the (awesome!) DelayedArray package, yet.

I am aware of support for true matrix-storage formats, e.g.

  • HDF5 files in HDF5Array (thank you, Herve!), but I am looking for even better support for cloud storage systems, or
  • tiledb, supported via TileDBArray (thank you, Aaron!), but - unlike parquet files - tiledb has not been adopted in my work environment yet.

Before I continue experimenting with marrying parquet and Bioconductor, I was wondering whether "parquet-backed Bioconductor objects" are a bad idea to begin with (and if so, why!), or whether there are ongoing efforts already that I might benefit from (or contribute to).

Many thanks for any thoughts and pointers,

Thomas

Am I correct in thinking that parquet files are very column-oriented? So they're a great analogy to a data.frame, but not such a good match for matrices/arrays, where you might want to extract features along any dimension? I guess I'm worried that things might appear array-like, but performance will be very different in different dimensions.

In principle, a Parquet file would be no different from 10X's HDF5 format for sparse matrices. Each matrix column would constitute a Parquet row group, containing the usual i/j/x sparse triplet (maybe the j column can be omitted as it is redundant with the row group ID for the matrix column). If i is used as the sort column within the row group, then you've got a CSC layout inside the Parquet file. At that point, the performance can be expected to be similar to the HDF5 format, i.e., great for column access, pretty bad for row access. Given that we already have a TENxMatrix, I don't see why we couldn't have a ParquetMatrix in the same manner.

Aaron Lun: Just to make sure I understand, you mentioned that "each matrix column would constitute a Parquet row group."

For a typical RNA-seq experiment, the number of rows (= genes) is in the tens of thousands. Isn't that a little small for a parquet row group for efficient access? I think arrow::write_parquet() defaults to the total number of rows if the data has fewer than 250 million cells (rows x cols). (The "total number of rows" refers to the number of i/j/x sparse triplets here, I think.)

For the tenx_pbmc4k dataset with 19773 detected genes in 4340 cells I end up with 6 row groups in the parquet file (see example below).

In the i/j/x representation, there will also be different numbers of rows (i) for each column (j), e.g. in a single-cell experiment, different numbers of genes will be detectable in each cell. I am not sure how to choose a single row group size in that case.

Perhaps you can help me understand what you meant, and whether I should try to optimize this choice?

library(arrow)
library(Matrix)
library(TENxPBMCData)

tenx_pbmc4k <- suppressMessages(TENxPBMCData(dataset = "pbmc4k"))

# i/j/x triplet representation of the sparse count matrix
df <- as.data.frame(
  Matrix::summary(
    as(counts(tenx_pbmc4k), "dgCMatrix")
  )
)

# replace the integer indices with gene and cell identifiers
df <- data.frame(
  i = factor(row.names(tenx_pbmc4k)[df$i], levels = row.names(tenx_pbmc4k)),
  j = factor(tenx_pbmc4k$Barcode[df$j], levels = tenx_pbmc4k$Barcode),
  x = df$x
)
# range of the number of detected genes per cell
range(table(df$j)) #  498 5251

parquet_file <- tempfile(fileext = ".parquet")

# chunk_size argument: scalar integer, how many rows will be in each row group
arrow::write_parquet(x = df, sink = parquet_file, use_dictionary = TRUE,
                     chunk_size = NULL, version = "2.6")
pq <- arrow::ParquetFileReader$create(parquet_file)
pq$num_row_groups  # 6
Oops. I was thinking that the row group sizes were variable and we could have fine-grained control over their sizes. Apparently not.

Well, no matter; it can still be made to work. Just store the usual triplets as you did in df, sorted by j and then i. Then you can easily strip out a column's worth of matrix data by querying the file on j.
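
As a rough sketch of what that query could look like (reusing the parquet_file written in the snippet above, with a hypothetical cell barcode; if arrow cannot compare the dictionary-encoded j column against a plain string, j could be written as a character or integer column instead):

library(arrow)
library(dplyr)

one_cell <- "AAACCTGAGAAGGCCT-1"   # hypothetical cell barcode
col_triplets <- open_dataset(parquet_file, format = "parquet") |>
  filter(j == one_cell) |>         # predicate is pushed down to the Parquet reader
  select(i, x) |>
  collect()
head(col_triplets)

Whether this actually skips irrelevant row groups (rather than scanning and discarding) depends on the row-group layout and statistics, which is where the chunk_size discussion above comes back in.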

Row access will probably suck, though no more than HDF5, given that it would be a full scan of the dataset in both cases.

Total number of rows will be the number of non-zero elements, which should be moderately sized (>200 million) for a medium-sized single-cell dataset, e.g., ~100k cells.

Thanks a lot for pointing that out! I agree, and I am curious about performance as well. There are a few advantages of the parquet format that might make it attractive even if access speed cannot match that of true array-based back-ends, e.g. the ability to add new data to an existing dataset simply by saving parquet files with the same schema in the same directory.
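
A minimal sketch of that "append by adding files" idea (df_2023 and df_2024 are hypothetical placeholders for two batches of triplets sharing the same i/j/x schema):

library(arrow)
library(dplyr)

ds_dir <- file.path(tempdir(), "counts_dataset")
dir.create(ds_dir, showWarnings = FALSE)

arrow::write_parquet(df_2023, file.path(ds_dir, "batch-2023.parquet"))  # initial batch
arrow::write_parquet(df_2024, file.path(ds_dir, "batch-2024.parquet"))  # added later, without rewriting existing files

ds <- open_dataset(ds_dir, format = "parquet")  # both files are exposed as one dataset
ds |> collect() |> nrow()                       # total number of triplets across batches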

Great point; I haven't dived deeply enough into row groups to understand how to use them most effectively. (Added to my to-do list now.) Thanks a lot for sharing your thoughts; I will explore further and share my progress. Any and all feedback will be much appreciated as I learn more.

Aaron Lun

While it doesn't really solve your matrix problem, I was intrigued enough by the premise to start work on https://github.com/LTLA/ParquetDataFrame.

library(S4Vectors)         # provides DataFrame()
library(ParquetDataFrame)  # installed from https://github.com/LTLA/ParquetDataFrame

# Mocking up a file:
tf <- tempfile()
on.exit(unlink(tf))
arrow::write_parquet(mtcars, tf)

# Creating a vector on-disk:
ParquetColumnVector(tf, column="gear")
## <32> DelayedArray object of type "double":
##  [1]  [2]  [3]    . [31] [32] 
##    4    4    4    .    5    4 

# This happily lives inside DataFrames:
collected <- list()
for (x in colnames(mtcars)) {
    collected[[x]] <- ParquetColumnVector(tf, column=x)
}
DataFrame(collected)

So we can now construct DataFrames with a mix of normal, Parquet-derived and other columns. The show method for the DataFrame could possibly be more efficient if I could extract data from multiple columns at once; I'm not sure whether that would be worth creating a separate ParquetDataFrame class.
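
For example (a small sketch reusing tf and collected from the snippet above), an ordinary in-memory column can sit next to the Parquet-backed ones:

mixed <- DataFrame(collected)
mixed$model <- rownames(mtcars)  # plain character column alongside the Parquet-backed columns
mixed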

Anyway, contributions welcome.

Aaron Lun: Wow, that's awesome. Thanks a lot for sharing your ParquetDataFrame code. As always, plenty for me to learn from.
