It's funny, Paul asked me this in a private email. I gave him a
complicated solution and suggested he ask on the mailing list so all
would benefit. And of course Sean offers an excellent simple solution,
to which I add a bit below...
"Sean Davis" <sdavis2 at="" mail.nih.gov=""> writes:
> On Wed, Feb 27, 2008 at 3:21 AM, Paul Christoph Schr?der
> <pschrode at="" alumni.unav.es=""> wrote:
>> Hello all,
>>
>> I'm using the GSEABase package For gene set enrichment I used the
java software provided by the Broad Institute, but now since I want to
use R I'm trying to implement it to my workflow.
>>
>> If I'm interested only in a gene set collection like 'c2' from
MSIGDB, what would be the appropiate command for getting the gene set
collection?
>>
>> I tried:
>> getBroadSets('msigdb_v2.1.xml',
base="
http://www.broad.mit.edu/gsea/donwloads.jsp")
>
> Hi, Paul.
>
> You need to specify the actual uri of the file that you want to get:
>
>
These are all gene sets in the Broad collection. If you wanted just
those with 'c2' designation then you might do something like
> isC2 <- sapply(broadsets,
+ function(x) bcCategory(collectionType(x))) == "c2"
> c2collection <- broadsets[isC2]
> c2collection
GeneSetCollection
names: CHESLER_D6MIT150_TRANSCRIPTION_FACTORS_GLOCUS, DEATHPATHWAY,
..., ET743_RESIST_DN (1687 total)
unique identifiers: MSX3, BACH1, ..., RUSC2 (16966 total)
types in collection:
geneIdType: SymbolIdentifier (1 total)
collectionType: BroadCollection (1 total)
Reassuringly, the number of gene sets identified (1687) is the same as
from the Broad.
For the hard-core, my off-list response to Paul was along the lines of
1. download msigdb_v2.1.xml by hand (because I didn't pay attention to
the URL my copy came from, because the Broad people ask you to
register before using it so it might be nice to let them know,
implicitly, that their data is uesful and to whom, and because I want
to reference it on more than one occaision so might as well avoid the
bandwidth download; mostly though because I forgot where to get it
from directly).
2. define the following function
getLocalBroadSets <- function(file, category) {
xpathq <- paste('//GENESET[@CATEGORY_CODE="',
category,
'"]', sep="")
handler <- GSEABase:::.BroadXMLNodeToGeneSet_factory(NULL)
GeneSetCollection(GSEABase:::.fromXML(file, xpathq, handler))
}
3. Evaluate the commands
> fl <- '~/tmp/msigdb_v2.xml'
> getLocalBroadSets(fl, "c1")
to which Paul rightly asks for an explanation of the cryptic function:
getLocalBroadSets <- function(file, category) {
## 'file' is an XML file that we'll read in (parse) using the XML
package
xpathq <- paste('//GENESET[@CATEGORY_CODE="',
category,
'"]', sep="")
handler <- GSEABase:::.BroadXMLNodeToGeneSet_factory(NULL)
GeneSetCollection(GSEABase:::.fromXML(file, xpathq, handler))
}
Here's a more fully commented version. The key thing is that you can
reformulate the query to very efficiently extract complicated subsets
of the msigdb_v2.1 xml file.
getLocalBroadSets <- function(file, category) {
## 'file' is a text file formatted as xml. We'll read it in to R
## using the XML package. You can look at the file and you'll see
## something like
##
##
## <msigdb name="chr16q24" version="1" build_date="1">
## <geneset standard_name="chr5q23" ...="" ##="" ##="" things="" like="" 'msigdb',="" 'geneset'="" are="" called="" each="" a="" ##="" 'node'.="" things="" like="" 'standard_name'="" is="" an="" 'attribute'.="" ##="" ##="" it's="" possible="" using="" the="" amazing="" xml="" package="" and="" the="" underlying="" ##="" libxml="" library,="" to="" treat="" the="" xml="" as="" a="" kind="" of="" data="" base="" that="" ##="" you="" can="" query="" against.="" the="" syntax="" of="" queries="" is="" defined="" by="" the="" ##="" xpath="" query="" language,="" and="" is="" really="" quite="" simple="" (see="" the="" ##="" abbreviated="" syntax="" at="" ##="" ##="" http:="" www.w3.org="" tr="" xpath#path-abbrev="" ##="" xpathq="" <-="" paste('="" geneset[@category_code="',
category,
'" ]',="" sep="" )="" ##="" we="" construct="" a="" query="" that="" says="" 'select="" any="" geneset="" node="" with="" ##="" attribute="" category_code="" equal="" to="" the="" value="" of="" the="" variable="" ##="" 'category'="" ##="" ##="" this="" next="" line="" is="" best="" treated="" as="" black="" magic,="" but="" basically="" ##="" when="" we="" find="" a="" node="" from="" our="" query,="" we="" want="" to="" process="" it="" into="" ##="" a="" geneset.="" in="" the="" xml="" package="" we="" do="" this="" by="" providing="" a="" ##="" function="" that's="" invoked="" whenever="" the="" object="" is="" ##="" encountered.="" often="" we="" want="" our="" handler="" to="" remember="" things="" ##="" (e.g.,="" the="" location="" of="" the="" file="" that="" is="" being="" parsed)="" for="" each="" ##="" time="" it="" gets="" invoked.="" one="" way="" of="" doing="" this="" is="" by="" using="" r's="" ##="" 'lexical="" scope'="" to="" define="" a="" function="" that="" returns="" a="" ##="" function.="" the="" returned="" function="" (type="" 'handler'="" after="" the="" ##="" definition="" below)="" parses="" the="" node="" into="" a="" gene="" set,="" but="" has="" ##="" access="" to="" the="" variables="" defined="" in="" the="" function="" that="" defined="" it="" ##="" (.broadxmlnodetogeneset_factory).="" as="" i="" said,="" best="" to="" treat="" as="" ##="" black="" magic="" handler="" <-="" gseabase:::.broadxmlnodetogeneset_factory(null)="" ##="" we'll="" now="" parse="" the="" file="" form="" xml,="" using="" our="" xpathq="" to="" identify="" ##="" the="" nodes="" to="" extract,="" and="" our="" handler="" to="" convert="" each="" node="" from="" ##="" xml="" to="" a="" geneset.="" we'll="" wrap="" the="" entire="" object="" into="" a="" ##="" genesetcollection,="" instead="" of="" just="" returning="" a="" list="" genesetcollection(gseabase:::.fromxml(file,="" xpathq,="" handler))="" }=""> That should do it for you.
>
> Sean
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793