Question

Proper way to read in a DataFrame with CharacterList columns that was saved to a text file

Leonardo Collado Torres ★ 1.1k

@lcolladotor

United States

Hi,

What is the proper way to read in a DataFrame from a text file that has CharacterList columns? With the code below, I can see that write.table() writes the text file in such a way that the CharacterList column has c() calls. I'm guessing that there's a simple argument change or a function that then allows you to read this information, but I'm not finding it.

Thank you,

Leonardo

> library('S4Vectors')
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply,
    parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table,
    tapply, union, unique, unsplit, which, which.max, which.min

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    colMeans, colSums, expand.grid, rowMeans, rowSums

> library('GenomicRanges')
Loading required package: IRanges
Loading required package: GenomeInfoDb
There were 12 warnings (use warnings() to see them)
> df <- DataFrame(x = 1:5, y = CharacterList(lapply(1:5, function(i) {
+     letters[seq_len(i)]}
+ )))
> 
> write.table(df, file = 'test.tsv', sep = '\t', row.names = FALSE, quote = FALSE)
> system('head test.tsv')
x    y
1    a
2    c("a", "b")
3    c("a", "b", "c")
4    c("a", "b", "c", "d")
5    c("a", "b", "c", "d", "e")
> 
> df2 <- read.table('test.tsv', header = TRUE, sep = '\t', stringsAsFactors = FALSE)
> df2
  x                y
1 1                a
2 2          c(a, b)
3 3       c(a, b, c)
4 4    c(a, b, c, d)
5 5 c(a, b, c, d, e)
> 
> options(width = 120)
> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------
 setting  value                                 
 version  R version 3.3.0 RC (2016-05-01 r70572)
 system   x86_64, darwin13.4.0                  
 ui       AQUA                                  
 language (EN)                                  
 collate  en_US.UTF-8                           
 tz       America/New_York                      
 date     2016-06-16                            

Packages ---------------------------------------------------------------------------------------------------------------
 package       * version date       source        
 BiocGenerics  * 0.19.1  2016-06-11 Bioconductor  
 devtools        1.11.1  2016-04-21 CRAN (R 3.3.0)
 digest          0.6.9   2016-01-08 CRAN (R 3.3.0)
 GenomeInfoDb  * 1.9.1   2016-05-13 Bioconductor  
 GenomicRanges * 1.25.4  2016-06-10 Bioconductor  
 IRanges       * 2.7.6   2016-06-10 Bioconductor  
 memoise         1.0.0   2016-01-29 CRAN (R 3.3.0)
 S4Vectors     * 0.11.4  2016-06-11 Bioconductor  
 withr           1.0.1   2016-02-04 CRAN (R 3.3.0)
 XVector         0.13.0  2016-05-05 Bioconductor  
 zlibbioc        1.19.0  2016-05-05 Bioconductor  


## Doesn't work to simply use DataFrame

> DataFrame(df2)
DataFrame with 5 rows and 2 columns
          x                y
  <integer>      <character>
1         1                a
2         2          c(a, b)
3         3       c(a, b, c)
4         4    c(a, b, c, d)
5         5 c(a, b, c, d, e)

s4vectors genomicranges • 1.8k views

updated 8.3 years ago by Michael Lawrence ★ 11k • written 8.3 years ago by Leonardo Collado Torres ★ 1.1k

score 1 · Accepted Answer · 2016-06-16

Calling write.table() implies as.data.frame(), which coerces the CharacterList to a list. write.table() does not actually handle list columns (what should it do?), but as it turns out, the coercion from DataFrame to data.frame classes the list columns as "AsIs", which coincidentally ends up coercing the list to a character vector at write time. There's no obvious way to coerce a list to a character vector, and the current implementation just deparses each element, much as dput() would, which is where the c() calls in the file come from.
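To see where the c() calls come from, here is a minimal base-R sketch of that list-to-character coercion. It uses a plain list rather than the actual DataFrame code path, and the variable names `cells` and `as_text` are only illustrative.

```r
# A plain list standing in for a list column (illustrative, not the S4Vectors internals)
cells <- list("a", c("a", "b"))

# as.character() on a list keeps length-1 strings as they are and
# deparses longer elements into the text of a c() call
as_text <- as.character(cells)
print(as_text)
```

The second element becomes the literal text of a c() call, which matches what ends up in test.tsv.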

I would generally avoid writing list columns (is expand() an option?), but if you have to, list columns are typically encoded as comma-separated cells in tabular text. You could of course use strsplit() and unstrsplit() to move back and forth. It might be a good idea for read.table() to support compound cells. I think data.table::fread() already does. But it's definitely pushing the limits of tabular text.
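The comma-separated round trip could look roughly like the sketch below. It is written in base R so it runs without Bioconductor, and it assumes that no element contains a comma and that no list element is empty. With S4Vectors/IRanges loaded, unstrsplit(y, sep = ",") should be able to replace the vapply() step, and CharacterList() can wrap the strsplit() result. The names `flat`, `tsv`, `df2` and `restored` are illustrative.

```r
# List column equivalent to the one in the question, as a plain list
y <- lapply(1:5, function(i) letters[seq_len(i)])

# Collapse each element into one comma-separated cell before writing
flat <- data.frame(
    x = 1:5,
    y = vapply(y, paste, character(1), collapse = ","),
    stringsAsFactors = FALSE
)

tsv <- tempfile(fileext = ".tsv")
write.table(flat, file = tsv, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the cells back as plain strings, then split them into a list again
df2 <- read.table(tsv, header = TRUE, sep = "\t",
                  colClasses = c("integer", "character"))
restored <- strsplit(df2$y, ",", fixed = TRUE)
```

The separator is a design choice: any character that cannot occur in the values and differs from the column separator would work equally well.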