Printing DataFrame with nested DataFrames causes error
@wdesouza
Brazil

I would like to use the DataFrame class to represent a data.frame with nested data frames, for example a data frame that has a list of data frames as a column (one data frame per row).

library(S4Vectors)
df <- DataFrame(a=c(1,2,3), b=c("a","b","c"))
df

Outputs:

DataFrame with 3 rows and 2 columns
          a           b
  <numeric> <character>
1         1           a
2         2           b
3         3           c

Now add a list of data frames as a new column of the DataFrame. These data frames may have different columns and different numbers of rows.

df$c <- list(DataFrame(x=c(1,2)), DataFrame(x=1,y=2), DataFrame())
df

Outputs an error:

DataFrame with 3 rows and 3 columns
Error in as.vector(x, mode = "character") : 
  no method for coercing this S4 class to a vector

But extracting a single cell works:

df[2, 3]

[[1]]
DataFrame with 1 row and 2 columns
          x         y
  <numeric> <numeric>
1         1         2

df[1, 3]

[[1]]
DataFrame with 2 rows and 1 column
          x
  <numeric>
1         1
2         2

However, each of these returns a list of length 1 rather than the DataFrame itself.

Is there a better way to work with nested data frames using Bioconductor base classes?


I wonder whether these nested-data-frame structures are really consistent with R's vectorization and end-user (including the person who creates these objects!) comprehension?

For me a more natural way to represent this (when all nested DataFrames have the same columns) would be a single data frame with one or more columns (e.g., df$group) describing the partitioning of rows into groups. Operations on columns (e.g., 'take the log of column x') are easily vectorized (df$logx <- log(df$x)) and many group-wise operations can be efficiently implemented using the *List infrastructure (e.g., the mean of column x by group, mean(splitAsList(df$x, df$group))).
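
As a rough sketch of this long layout, the snippet below uses base R only. The column names group and x and their values are made up for illustration, and tapply() is used here as a stand-in for the splitAsList() approach, not as the suggested implementation.

```r
# Hypothetical long-format table: one row per nested record,
# with `group` recording which parent row each record belongs to.
df <- data.frame(group = c(1, 1, 2), x = c(1, 2, 4))

# Column-wise operations stay vectorized.
df$logx <- log(df$x)

# Group-wise summary in base R
# (stand-in for mean(splitAsList(df$x, df$group))).
group_means <- tapply(df$x, df$group, mean)
```

With this layout, nested tables that share the same columns never need to be stored as list elements at all.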

Even if the data frames have different structure, I do think that a 'tidy' data structure will in the end be more useful.


Thank you, Martin, for the comment. Actually, the nested data frames may have different shapes (number of columns and rows). The data I am working with come from web APIs (via the httr and jsonlite packages). I will update my example.

@michael-lawrence-3846
United States

A fix for the display issue will propagate soon. For the extraction issue, what were you expecting if not a single-element list?


Thank you, Michael. I updated my R installation and got the latest version of the S4Vectors package. The error no longer occurs. I expected the DataFrame object itself rather than a list. I tested Hervé's example with base data frames and the behavior was the same.

@herve-pages-1542
Seattle, WA, United States

Hi,

Note that this kind of nesting also "works" with ordinary data frames:

df <- data.frame(a=1:4, b=LETTERS[1:4])
df$c <- list(data.frame(x=1:2),
             NULL,
             data.frame(x=1:3,y=LETTERS[7:9],
                        stringsAsFactors=FALSE),
             data.frame())

Trying to display the object doesn't raise an error like in the DataFrame case but doesn't really do a good job:

df
#   a b                c
# 1 1 A             1, 2
# 2 2 B             NULL
# 3 3 C 1, 2, 3, G, H, I
# 4 4 D             NULL
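
Because the printed form flattens the nested contents, a per-row summary can be easier to read. The following is a minimal sketch in base R that rebuilds the same object; the variable names n_nested_rows and is_df are made up for illustration.

```r
df <- data.frame(a = 1:4, b = LETTERS[1:4])
df$c <- list(data.frame(x = 1:2),
             NULL,
             data.frame(x = 1:3, y = LETTERS[7:9],
                        stringsAsFactors = FALSE),
             data.frame())

# Number of rows held in each cell of the list column
# (NULL and empty data frames both count as 0).
n_nested_rows <- vapply(df$c, NROW, integer(1))

# Which cells actually hold a data frame.
is_df <- vapply(df$c, is.data.frame, logical(1))
```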

2D-style subsetting works and also returns a data frame wrapped in a list of length 1:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows

This behaves as expected if we think of 2D-style subsetting df[4, 3] as equivalent to df[[3]][4]. One could argue that this semantic is somewhat arbitrary and that we should rather think of it as equivalent to df[[3]][[4]]. However, the df[[j]][[i]] semantic would not be desirable in certain situations, e.g. when the j-th column of a DataFrame is an IRanges object. It would also cause some surprises, e.g. when i is an integer vector that is the result of a computation and is expected to be of arbitrary length but ends up being of length 1 in some situations.

One can always work around the small inconvenience of the current semantic (df[[j]][i]) by doing df[[3]][[4]].
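
To make the difference concrete, here is a small base R sketch; the index vector i and the variable names are made up for illustration.

```r
df <- data.frame(a = 1:4, b = LETTERS[1:4])
df$c <- list(data.frame(x = 1:2), NULL, data.frame(x = 1:3), data.frame())

# df[i, j] follows the df[[j]][i] semantic: the result is a list
# whatever the length of i, so code that computes i stays predictable.
i <- c(1, 3)
many <- df[i, 3]   # list of length 2
one  <- df[3, 3]   # list of length 1

# Double brackets reach the nested data frame itself.
nested <- df[[3]][[3]]
```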

So it looks like all that needs to be fixed is the display of a DataFrame with columns that are lists of data-frame-like objects.

Cheers,

H.


The display has been fixed in devel. The dropping behavior is already complex enough, so the goal is just consistency with data.frame.

 


Thanks Michael.

I forgot about df[[i, j]] (I never use it) but it works on ordinary data frames and does df[[j]][[i]]:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows

df[[4, 3]]
# data frame with 0 columns and 0 rows

Maybe DataFrame objects could support it too.

H.


I also forgot about that, thanks for the reminder. Support added.


Great, thanks! I should probably do the same for DelayedArray objects.

H.


Thank you, Hervé, for the explanation; it clarified things a lot. I think the fix in the development version worked for me. I agree with you and Michael that the behavior of DataFrame should be consistent with base data.frame.
