Entering edit mode
Martin,
I may have run out of Hoorays, but your commit today implementing
subset for SummarizedExperiments is surely deserving of a few
"WooHoos" and a "Yowsah!".
So....
WooHoo WooHoo and a big ol' Yowsah!
Thanks!
~ malcolm_cook@stowers.org
________________________________
From: Cook, Malcolm
Sent: Thursday, February 13, 2014 12:20 PM
To: 'Michael Lawrence'; 'Martin Morgan'
Cc: 'bioconductor@r-project.org'
Subject: RE: [BioC] subsetting SummarizedExperiments - a proposal (or
a hack?)
Michael,
Just to say, �Hooray� on your response to this thread, and especially
more recent activity, where I see:
1) your discussion on R-devel re �getting environment from "top"
promise: http://r.789695.n4.nabble.com/getting-environment-from-quot-
top-quot-promise-td4685138.html#none
2) your commit last night in service of allowing interfaces to
GRanges (change log copied below)
What nice syntactic sugar you are serving us. And what lurking
dragons and self-inflicted gunshot wounds to the foot your are saving
me.
I look forward to the next release!
Did I say �Hooray!�
Hooray!
~Malcolm
Revision: 86351
Author: m.lawrence
Date: a day ago
Paths:
Modified
/trunk/madman/Rpacks/IRanges/R/Vector-class.R<https: hedgehog.fhcrc.="" org="" bioconductor="" trunk="" madman="" rpacks="" iranges="" r="" vector-class.r="">
Modified
/trunk/madman/Rpacks/IRanges/NAMESPACE<https: hedgehog.fhcrc.org="" bio="" conductor="" trunk="" madman="" rpacks="" iranges="" namespace="">
Modified
/trunk/madman/Rpacks/IRanges/R/List-class.R<https: hedgehog.fhcrc.or="" g="" bioconductor="" trunk="" madman="" rpacks="" iranges="" r="" list-class.r="">
Modified
/trunk/madman/Rpacks/IRanges/DESCRIPTION<https: hedgehog.fhcrc.org="" b="" ioconductor="" trunk="" madman="" rpacks="" iranges="" description="">
generalize eval() to Vectors, add as.env,Vector that forms an
environment by querying the Vector for fixedColumnNames() and the
mcols(). fixedColumnNames() returns an empty vector of names by
default, but e.g. GenomicRanges can use it to provide start, end,
etc. Change subset() and with() to call eval directly on the
Vector. Now we can do subset(gr, seqnames=="chr2" & score > 15).
From: Michael Lawrence [mailto:lawrence.michael@gene.com]
Sent: Saturday, February 08, 2014 7:55 AM
To: Martin Morgan
Cc: Cook, Malcolm; bioconductor@r-project.org
Subject: Re: [BioC] subsetting SummarizedExperiments - a proposal (or
a hack?)
On Wed, Feb 5, 2014 at 5:54 PM, Martin Morgan
<mtmorgan@fhcrc.org<mailto:mtmorgan@fhcrc.org>> wrote:
Hi Malcom --
On 02/05/2014 02:11 PM, Cook, Malcolm wrote:
> Martin,
>
> Google says: 'Your search - "subset.SummarizedExperiment" - did not
match any documents.'
If it had existed, it would have been an S4 method revealed by
library(GenomicAlignments) # library(GenomicRanges) in release
showMethods("subset")
method?"subset,SummarizedExperiment"
?"subset,SummarizedExperiment-method"
with tab completion available on the last two after getting through
"subset". Google would have found "subset,SummarizedExperiment-method"
(if it existed) but Google and tab completion with ? would have failed
(though the latter in perhaps a suggestive way by showing alternate
tab completions) when a relevant method is defined on a contained
class. But yes, a method does not exist.
>
> So I offer the below for your consideration, along with a few
examples and a quick benchmark comparison with using the more explicit
`[` extraction operator.
>
>
>
> subset.SummarizedExperiment<-function(x
> ,rowSubset=TRUE
> ,colSubset=TRUE
> ,assaySubset=TRUE
> ,drop=FALSE
> ,provenanceTrack=FALSE
> ,...) {
> ## PURPOSE: implement subsetting of SummarizedExperiments,
> ## allowing expressions in terms of:
> ##
> ## + the GRanges (and its mcols meta-data) held in rowData
> ## + the experimental meta-data held in the colData DataFrame
> ## + the names of the assays
> x<-x[
>
##eval(as.expression(substitute(row)),mcols(rowData(x)),.GlobalEnv)
> eval(as.expression(substitute(rowSubset)),as.data.frame(ro
wData(x)),.GlobalEnv)
>
,eval(as.expression(substitute(colSubset)),colData(x),.GlobalEnv)
> ,drop=drop,...]
> if (! identical(TRUE,assaySubset))
assays(x)<-assays(x)[assaySubset]
> if(provenanceTrack) {
> exptData(x)$rowSubset<-c(exptData(x)$rowSubset,as.character
(substitute(rowSubset)))
> exptData(x)$colSubset<-c(exptData(x)$colSubset,as.character
(substitute(colSubset)))
> }
> x
> }
> attr(subset,'ex')<-function() {
> example(SummarizedExperiment)
> assays(se1)$a2<-assays(se1)$counts*2
> assays(se1)$a3<-assays(se1)$counts*3
> benchmark(replications=100
>
,se1.ss1<-se1[start(rowData(se1))==344757,se1$Treatment=='ChIP']
> ,se1.ss2<-subset(se1,start==344757,Treatment=='ChIP')
> )
> stopifnot(identical(assays(se1.ss1),assays(se1.ss2)))
> se1.ss3<-subset(se1,strand=='+',Treatment=='ChIP',c('a2','a3'))
> stopifnot(identical(c('a2','a3'),names(assays(se1.ss3))))
> }
>
>
> The rbenchmarks shows the cost of overhead in this contrived minimal
example. YMMV:
>
>
test replications elapsed relative user.self sys.self user.child
sys.child
> 3 c("a2",
"a3") 100 0.001 1 0.001 0.000 0
0
> 1 se1.ss1 <- se1[start(rowData(se1)) == 344757, se1$Treatment ==
"ChIP"] 100 3.113 3113 3.111 0.001 0
0
> 2 se1.ss2 <- subset(se1, start == 344757, Treatment ==
"ChIP") 100 3.486 3486 3.486 0.000 0
0
>
> Is all this in the spirit of things as you see it? It sure makes
my use of SE "scan" (erhm, like poetry;)
My own problem with subset() is that the context of the call is hard
to get correct. So your subset with this
f <- function(x) {
id <- 697568
subset(x, start==id)
}
fails
> f(se1)
Error in x[eval(as.expression(substitute(rowSubset)),
as.data.frame(rowData(x)), :
error in evaluating the argument 'i' in selecting a method for
function '[': Error in eval(expr, envir, enclos) : object 'id' not
found
and this fails (does not select the value of id set in f()) silently
> id <- 387456
> start(f(se1))
[1] 387456
I think I might have solved this problem with a simple solution. The
promise knows the expression and the environment in which to evaluate
it. Since we have no problem getting the expression with substitute(),
why not get the environment, too?
Just added this to IRanges and it seems to work:
SEXP top_prenv(SEXP nm, SEXP env)
{
SEXP promise = findVar(nm, env);
while(TYPEOF(promise) == PROMSXP) {
env = PRENV(promise);
promise = PREXPR(promise);
}
return env;
}
Let's see if we can break it.
Other R idioms are similarly dangerous even for data.frame using
base::subset.data.frame as illustrated here
http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-
subset
in the high-voted answer.
But maybe I shouldn't stand between users, guns, and feet?
A couple of small SummarizedExperiment things in your code ...
start(se1) == start(rowData(se1))
and likewise for other functions in the GRanges 'API'. Also it turns
out that it is expensive in the (current) implementation of
SummarizedExperiment to associated dimnames with returned assay() or
assays(), so it is much faster to
assays(x) <- assays(x, withDimnames=FALSE)[assaySubset]
(dimnames get stored on the rowData and colData rather than
redundantly on (each of the) assays, although this is probably a false
(space) economy). The combination of GRanges API and expensive assay()
/ assays() can help save one from unintended inefficiencies, e.g.,
dimnames(se1) rather than dimnames(assay(se1)).
Martin
>
> if you like, I also have castLong and castWide functions to reshape
a (subsetted) SE as a data.table for use in ggplot and friends.
>
> You like?
>
> Cheers,
>
> ~Malcolm
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793<tel:%28206%29%20667-2793>
_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org<mailto:bioconductor@r-project.org>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
[[alternative HTML version deleted]]