Question

Subsetting a GenomicRanges::GRanges with "%in%" fails

1

Entering edit mode

Aditya ▴ 160

@aditya-7667

Last seen 2.5 years ago

Germany

Subsetting a GRanges with == works:

granges <- GenomicRanges::GRanges(
               c('chr1', 'chr2'), 
               '100-200', 
               seqinfo =  GenomeInfoDb::seqinfo(BSgenome.Mmusculus.UCSC.mm10::Mmusculus))
subset(granges, seqnames == 'chr1')

But subsetting with %in% fails:

subset(granges, seqnames %in% c('chr1', 'chr3'))
Error in match(x, table, nomatch = 0L) : 'match' requires vector arguments

genomicranges • 2.9k views

ADD COMMENT • link 5.3 years ago Aditya ▴ 160

score 0 · Answer 1 · 2019-09-04

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 3.0 years ago

United States

Running that code fails for me:

> granges <- GenomicRanges::GRanges(
+                               c('chr1', 'chr2'), 
+                               '100-200')
> granges %>% subset(seqnames == 'chr1')
Error in granges %>% subset(seqnames == "chr1") : 
  could not find function "%>%"

But the problem is likely that the %in%() generic is not visible, so the version in base is being called, which does not support the run-length encoding of the seqnames (an Rle object). The easiest solution is to just call library(GenomicRanges).

ADD COMMENT • link 5.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thank you Michael :-). Sorry for forgetting require(magrittr). Question now updated without magrittr to keep it simple. What would be a clean-namespace alternative to require(GenomicRanges) for GRanges subsetting?

ADD REPLY • link 5.3 years ago Aditya ▴ 160

1

Entering edit mode

You could always do something like IRanges::`%in%`(x, y). Keeping the workspace clean can be a pain though, because you'll constantly have to remember the origin of each function.

ADD REPLY • link 5.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

I see, IRanges is the place where it lives - thanks!

(Michael, merging my answer - based on your comments - into your's and then deleting mine could keep things clean for future reference)

ADD REPLY • link 5.3 years ago Aditya ▴ 160

1

Entering edit mode

Nope, %in% doesn't live in IRanges. IRanges used to define S4 methods for %in% but not anymore. The fact that IRanges still exports %in% is a leftover from a long time ago. That needs to be removed. The S4Vectors package defines the %in% method for Vector derivatives. Note that %in% is an implicit generic. This means that we don't have a setGeneric() statement somewhere that promotes it to an S4 generic function. It just gets automatically promoted on the first setMethod statement.

A cleaner situation would be to define the %in% generic with an explicit setGeneric statement in BiocGenerics. We will do this soon. Then all you need to do is import the generic from BiocGenerics in your NAMESPACE. Then just use %in% normally. This will call the generic, which takes care of dispatching to the appropriate method. Always call the generic. Trying to call a particular method is not robust.

So no need to do things like S4Vectors::`%in%`(x, y) or BiocGenerics::`%in%`(x, y). If you fully import BiocGenerics and S4Vectors (with import(BiocGenerics) and import(S4Vectors)), which I strongly recommend, then you can start using %in% now without having to worry where the generic is defined.

H.

ADD REPLY • link 5.3 years ago Hervé Pagès 16k

0

Entering edit mode

Oh, that's illuminating - thx!

Is the following understanding correct?

BioCgenerics contains all generics required by BioC classes. When a need for a new generic arises, it is explicitated in this package and exported. The idiom is the @import BiocGenerics, and then one is free to play.
S4Vectors::Vector is the fundamental BioC datatype. Other S4 classes, e.g. GenomicRanges, SummarizedExperiment, etc. inherit from Vector and build further on that. Therefore, the idiom is to also @import S4Vectors and then the BioC playfield lays wide open.

ADD REPLY • link 5.3 years ago Aditya ▴ 160

1

Entering edit mode

Not exactly as there are generics specific to certain data types defined by the packages you mention and many others. Unless you are using a package for a very specific reason (like a single utility function), it's simplest to just import the package in bulk.

ADD REPLY • link 5.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

I see - it looks like I have to switch idioms then.

My default idiom has been to maintain a fully clean namespace, knowing exactly which function from which package is being used, being 100 % sure that no namespace clashes can occur.

But it looks like I need to drop that paradigm in the S4 world...

ADD REPLY • link 5.3 years ago Aditya ▴ 160

0

Entering edit mode

Would the following statement be right then?

On the one hand BiocGenerics exports generics common to multiple BioC packages.

On the other hand, GenomicRanges exports GRanges-specific generics, SummarizedExperiment exports SummarizedExperiment-specific generics, etc.

ADD REPLY • link 5.3 years ago Aditya ▴ 160

1

Entering edit mode

Yes, that's the idea. See ?BiocGenerics for the 2 kinds of generics we define in the BiocGenerics package. Also note that the proteomics folks define their own set of S4 generics in the ProtGenerics package.

ADD REPLY • link 5.3 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks Herve! So the apparant namespace clashes upon require(BiocGenerics) are actually intentional promotions of base primitives and S3 generics to S4 generics?

ADD REPLY • link 5.3 years ago Aditya ▴ 160

0

Entering edit mode

Exactly.

ADD REPLY • link 5.3 years ago Hervé Pagès 16k

0

Entering edit mode

FYI %in% is now an explicit S4 generic defined in the BiocGenerics package: https://github.com/Bioconductor/BiocGenerics/commit/30d0813751536f0dd03b36d1f90c116e889d1954

H.

ADD REPLY • link 5.3 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks Herve, also for providing the update :-)

ADD REPLY • link 5.3 years ago Aditya ▴ 160