Unexpected behaviour of set operations on GRanges objects when dplyr is loaded
1
0
Entering edit mode
@lizingsimmons-6935
Last seen 4.1 years ago
Germany

Hi there,

I'm getting unexpected behaviour when using set operation functions in an R session with both dplyr and GenomicRanges loaded. Set operation functions in dplyr used to not work on GRanges objects, which is fine, as if I accidentally used union rather than GenomicRanges::union I'd get an error and know how to fix it. Now I no longer get an error using dplyr::union on GRanges objects, but I also don't get the expected results (see reproducible example below).

I wasn't sure if this was an issue with dplyr or GenomicRanges, so I first reported it on the dplyr github issues. They think it's an issue with the dispatch, which ends up calling S4Vectors::union.Vector instead of base::union or GenomicRanges::union. The discussion is still ongoing there, but I thought it was worth also bringing it up here in case the Bioconductor team can provide some insight.

Thanks in advance for any help!

library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind,
#>     colMeans, colnames, colSums, dirname, do.call, duplicated,
#>     eval, evalq, Filter, Find, get, grep, grepl, intersect,
#>     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
#>     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
#>     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which, which.max,
#>     which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:GenomicRanges':
#> 
#>     intersect, setdiff, union
#> The following object is masked from 'package:GenomeInfoDb':
#> 
#>     intersect
#> The following objects are masked from 'package:IRanges':
#> 
#>     collapse, desc, intersect, setdiff, slice, union
#> The following objects are masked from 'package:S4Vectors':
#> 
#>     first, intersect, rename, setdiff, setequal, union
#> The following objects are masked from 'package:BiocGenerics':
#> 
#>     combine, intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

x <- GRanges("chr1", IRanges(c(2, 9) , c(7, 19)), strand=c("+", "-"))
y <- GRanges("chr1", IRanges(5, 10), strand="-") 

GenomicRanges::union(x,y)
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       2-7      +
#>   [2]     chr1      5-19      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
dplyr::union(x, y)
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1       2-7      +
#>   [2]     chr1      9-19      -
#>   [3]     chr1      5-10      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

GenomicRanges::intersect(x,y)
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      9-10      -
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
dplyr::intersect(x, y)
#> GRanges object with 0 ranges and 0 metadata columns:
#>    seqnames    ranges strand
#>       <Rle> <IRanges>  <Rle>
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Sierra 10.12.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] dplyr_0.7.8          GenomicRanges_1.34.0 GenomeInfoDb_1.18.1 
#> [4] IRanges_2.16.0       S4Vectors_0.20.1     BiocGenerics_0.28.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.0             bindr_0.1.1            knitr_1.21            
#>  [4] XVector_0.22.0         magrittr_1.5           zlibbioc_1.28.0       
#>  [7] tidyselect_0.2.5       R6_2.3.0               rlang_0.3.1           
#> [10] stringr_1.3.1          highr_0.7              tools_3.5.1           
#> [13] xfun_0.4               htmltools_0.3.6        assertthat_0.2.0      
#> [16] yaml_2.2.0             digest_0.6.18          tibble_2.0.0          
#> [19] crayon_1.3.4           bindrcpp_0.2.2         GenomeInfoDbData_1.2.0
#> [22] purrr_0.2.5            bitops_1.0-6           RCurl_1.95-4.11       
#> [25] glue_1.3.0             evaluate_0.12          rmarkdown_1.11        
#> [28] stringi_1.2.4          pillar_1.3.1           compiler_3.5.1        
#> [31] pkgconfig_2.0.2
Created on 2019-01-10 by the reprex package (v0.2.1)
GenomicRanges S4Vectors dplyr • 2.0k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

There's not really a problem here. If you load two packages that contain the same function, the second package will mask that function from the first package. For whatever reason, dplyr is now dispatching something on GRanges objects and it is doing something with those GRanges rather than erroring out. But you know this, and already know to use the fully qualified function name (GenomicRanges::union) in order to get the right output, so I am not sure what the problem is?

ADD COMMENT
0
Entering edit mode

I guess the point is that dispatch could be smarter. I went ahead and inverted the dispatch cascade in S4Vectors 0.21.10, so that it flows from S3 to S4. Should resolve this issue.

ADD REPLY
0
Entering edit mode

Thank you very much for making this change!

ADD REPLY
0
Entering edit mode

IMO this is a problem because the behaviour (S4Vectors::union being called) is unexpected, since at least in my experience this kind of behaviour doesn't happen in any other situations (either the correct method is called for the object, or an error occurs), because it produces significantly different results to the correct method for the object, and because it still returns a GRanges it's difficult to spot. I only noticed because of a numerical difference in results after loading dplyr. Since both dplyr and GenomicRanges are pretty popular packages I'm sure I'm not the only person who could experience this issue.

You could argue that everyone should always be explicit about which version of union they are using, but realistically that isn't common practice. Over on Github I was pointed to the conflicted package, though, which is a great way to spot where this is necessary!

ADD REPLY

Login before adding your answer.

Traffic: 457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6