GeneSetCollection from GSEA
2
0
Entering edit mode
@paul-christophschroder-1940
Last seen 10.2 years ago
Hello all, I'm using the GSEABase package For gene set enrichment I used the java software provided by the Broad Institute, but now since I want to use R I'm trying to implement it to my workflow. If I'm interested only in a gene set collection like 'c2' from MSIGDB, what would be the appropiate command for getting the gene set collection? I tried: getBroadSets('msigdb_v2.1.xml', base="http://www.broad.mit.edu/gsea/donwloads.jsp") But it fails: Error: 'getBroadSets' failed to create gene sets: msigdb_v2.1.xml does not seem to be XML, nor to identify a file name Thanks a lot! Paul -- Paul C. Schr?der PhD-Student Division of Proteomics, Genomics & Bioinformatics Center for Applied Medicine (CIMA) University of Navarra Avda. Pio XII, 55 E-31008 Pamplona, Spain Tel: +34 948 194700, ext 5023 email: pschrode at alumni.unav.es
Proteomics GSEABase Proteomics GSEABase • 2.0k views
ADD COMMENT
0
Entering edit mode
@sean-davis-490
Last seen 3 months ago
United States
On Wed, Feb 27, 2008 at 3:21 AM, Paul Christoph Schr?der <pschrode at="" alumni.unav.es=""> wrote: > Hello all, > > I'm using the GSEABase package For gene set enrichment I used the java software provided by the Broad Institute, but now since I want to use R I'm trying to implement it to my workflow. > > If I'm interested only in a gene set collection like 'c2' from MSIGDB, what would be the appropiate command for getting the gene set collection? > > I tried: > getBroadSets('msigdb_v2.1.xml', base="http://www.broad.mit.edu/gsea/donwloads.jsp") Hi, Paul. You need to specify the actual uri of the file that you want to get: broadsets <- getBroadSets('ftp://ftp.broad.mit.edu/pub/gsea/msigdb_v2.1.xml') That should do it for you. Sean
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 3 months ago
United States
It's funny, Paul asked me this in a private email. I gave him a complicated solution and suggested he ask on the mailing list so all would benefit. And of course Sean offers an excellent simple solution, to which I add a bit below... "Sean Davis" <sdavis2 at="" mail.nih.gov=""> writes: > On Wed, Feb 27, 2008 at 3:21 AM, Paul Christoph Schr?der > <pschrode at="" alumni.unav.es=""> wrote: >> Hello all, >> >> I'm using the GSEABase package For gene set enrichment I used the java software provided by the Broad Institute, but now since I want to use R I'm trying to implement it to my workflow. >> >> If I'm interested only in a gene set collection like 'c2' from MSIGDB, what would be the appropiate command for getting the gene set collection? >> >> I tried: >> getBroadSets('msigdb_v2.1.xml', base="http://www.broad.mit.edu/gsea/donwloads.jsp") > > Hi, Paul. > > You need to specify the actual uri of the file that you want to get: > > These are all gene sets in the Broad collection. If you wanted just those with 'c2' designation then you might do something like > isC2 <- sapply(broadsets, + function(x) bcCategory(collectionType(x))) == "c2" > c2collection <- broadsets[isC2] > c2collection GeneSetCollection names: CHESLER_D6MIT150_TRANSCRIPTION_FACTORS_GLOCUS, DEATHPATHWAY, ..., ET743_RESIST_DN (1687 total) unique identifiers: MSX3, BACH1, ..., RUSC2 (16966 total) types in collection: geneIdType: SymbolIdentifier (1 total) collectionType: BroadCollection (1 total) Reassuringly, the number of gene sets identified (1687) is the same as from the Broad. For the hard-core, my off-list response to Paul was along the lines of 1. download msigdb_v2.1.xml by hand (because I didn't pay attention to the URL my copy came from, because the Broad people ask you to register before using it so it might be nice to let them know, implicitly, that their data is uesful and to whom, and because I want to reference it on more than one occaision so might as well avoid the bandwidth download; mostly though because I forgot where to get it from directly). 2. define the following function getLocalBroadSets <- function(file, category) { xpathq <- paste('//GENESET[@CATEGORY_CODE="', category, '"]', sep="") handler <- GSEABase:::.BroadXMLNodeToGeneSet_factory(NULL) GeneSetCollection(GSEABase:::.fromXML(file, xpathq, handler)) } 3. Evaluate the commands > fl <- '~/tmp/msigdb_v2.xml' > getLocalBroadSets(fl, "c1") to which Paul rightly asks for an explanation of the cryptic function: getLocalBroadSets <- function(file, category) { ## 'file' is an XML file that we'll read in (parse) using the XML package xpathq <- paste('//GENESET[@CATEGORY_CODE="', category, '"]', sep="") handler <- GSEABase:::.BroadXMLNodeToGeneSet_factory(NULL) GeneSetCollection(GSEABase:::.fromXML(file, xpathq, handler)) } Here's a more fully commented version. The key thing is that you can reformulate the query to very efficiently extract complicated subsets of the msigdb_v2.1 xml file. getLocalBroadSets <- function(file, category) { ## 'file' is a text file formatted as xml. We'll read it in to R ## using the XML package. You can look at the file and you'll see ## something like ## ## ## <msigdb name="chr16q24" version="1" build_date="1"> ## <geneset standard_name="chr5q23" ...="" ##="" ##="" things="" like="" 'msigdb',="" 'geneset'="" are="" called="" each="" a="" ##="" 'node'.="" things="" like="" 'standard_name'="" is="" an="" 'attribute'.="" ##="" ##="" it's="" possible="" using="" the="" amazing="" xml="" package="" and="" the="" underlying="" ##="" libxml="" library,="" to="" treat="" the="" xml="" as="" a="" kind="" of="" data="" base="" that="" ##="" you="" can="" query="" against.="" the="" syntax="" of="" queries="" is="" defined="" by="" the="" ##="" xpath="" query="" language,="" and="" is="" really="" quite="" simple="" (see="" the="" ##="" abbreviated="" syntax="" at="" ##="" ##="" http:="" www.w3.org="" tr="" xpath#path-abbrev="" ##="" xpathq="" <-="" paste('="" geneset[@category_code="', category, '" ]',="" sep="" )="" ##="" we="" construct="" a="" query="" that="" says="" 'select="" any="" geneset="" node="" with="" ##="" attribute="" category_code="" equal="" to="" the="" value="" of="" the="" variable="" ##="" 'category'="" ##="" ##="" this="" next="" line="" is="" best="" treated="" as="" black="" magic,="" but="" basically="" ##="" when="" we="" find="" a="" node="" from="" our="" query,="" we="" want="" to="" process="" it="" into="" ##="" a="" geneset.="" in="" the="" xml="" package="" we="" do="" this="" by="" providing="" a="" ##="" function="" that's="" invoked="" whenever="" the="" object="" is="" ##="" encountered.="" often="" we="" want="" our="" handler="" to="" remember="" things="" ##="" (e.g.,="" the="" location="" of="" the="" file="" that="" is="" being="" parsed)="" for="" each="" ##="" time="" it="" gets="" invoked.="" one="" way="" of="" doing="" this="" is="" by="" using="" r's="" ##="" 'lexical="" scope'="" to="" define="" a="" function="" that="" returns="" a="" ##="" function.="" the="" returned="" function="" (type="" 'handler'="" after="" the="" ##="" definition="" below)="" parses="" the="" node="" into="" a="" gene="" set,="" but="" has="" ##="" access="" to="" the="" variables="" defined="" in="" the="" function="" that="" defined="" it="" ##="" (.broadxmlnodetogeneset_factory).="" as="" i="" said,="" best="" to="" treat="" as="" ##="" black="" magic="" handler="" <-="" gseabase:::.broadxmlnodetogeneset_factory(null)="" ##="" we'll="" now="" parse="" the="" file="" form="" xml,="" using="" our="" xpathq="" to="" identify="" ##="" the="" nodes="" to="" extract,="" and="" our="" handler="" to="" convert="" each="" node="" from="" ##="" xml="" to="" a="" geneset.="" we'll="" wrap="" the="" entire="" object="" into="" a="" ##="" genesetcollection,="" instead="" of="" just="" returning="" a="" list="" genesetcollection(gseabase:::.fromxml(file,="" xpathq,="" handler))="" }=""> That should do it for you. > > Sean > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
On Wed, Feb 27, 2008 at 12:15 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > 1. download msigdb_v2.1.xml by hand (because I didn't pay attention to > the URL my copy came from, because the Broad people ask you to > register before using it so it might be nice to let them know, > implicitly, that their data is uesful and to whom, and because I want > to reference it on more than one occaision so might as well avoid the > bandwidth download; mostly though because I forgot where to get it > from directly). This point ABOVE IS VERY IMPORTANT! Thanks for reminding us that free software is not free to develop and requires grant support, etc.... Sean
ADD REPLY

Login before adding your answer.

Traffic: 537 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6