GeneSetCollection from GSEA

0

Entering edit mode

Paul ChristophSchröder ▴ 70

@paul-christophschroder-1940

Last seen 10.6 years ago

Hello all, I'm using the GSEABase package For gene set enrichment I used the java software provided by the Broad Institute, but now since I want to use R I'm trying to implement it to my workflow. If I'm interested only in a gene set collection like 'c2' from MSIGDB, what would be the appropiate command for getting the gene set collection? I tried: getBroadSets('msigdb_v2.1.xml', base="http://www.broad.mit.edu/gsea/donwloads.jsp") But it fails: Error: 'getBroadSets' failed to create gene sets: msigdb_v2.1.xml does not seem to be XML, nor to identify a file name Thanks a lot! Paul -- Paul C. Schr?der PhD-Student Division of Proteomics, Genomics & Bioinformatics Center for Applied Medicine (CIMA) University of Navarra Avda. Pio XII, 55 E-31008 Pamplona, Spain Tel: +34 948 194700, ext 5023 email: pschrode at alumni.unav.es

Proteomics GSEABase Proteomics GSEABase • 2.1k views

ADD COMMENT • link updated 17.2 years ago by Martin Morgan 25k • written 17.2 years ago by Paul ChristophSchröder ▴ 70

0

Entering edit mode

Sean Davis 21k

@sean-davis-490

Last seen 9 weeks ago

United States

On Wed, Feb 27, 2008 at 3:21 AM, Paul Christoph Schr?der <pschrode at="" alumni.unav.es=""> wrote: > Hello all, > > I'm using the GSEABase package For gene set enrichment I used the java software provided by the Broad Institute, but now since I want to use R I'm trying to implement it to my workflow. > > If I'm interested only in a gene set collection like 'c2' from MSIGDB, what would be the appropiate command for getting the gene set collection? > > I tried: > getBroadSets('msigdb_v2.1.xml', base="http://www.broad.mit.edu/gsea/donwloads.jsp") Hi, Paul. You need to specify the actual uri of the file that you want to get: broadsets <- getBroadSets('ftp://ftp.broad.mit.edu/pub/gsea/msigdb_v2.1.xml') That should do it for you. Sean

ADD COMMENT • link 17.2 years ago Sean Davis 21k

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 12 weeks ago

United States

It's funny, Paul asked me this in a private email. I gave him a complicated solution and suggested he ask on the mailing list so all would benefit. And of course Sean offers an excellent simple solution, to which I add a bit below... "Sean Davis" <sdavis2 at="" mail.nih.gov=""> writes: > On Wed, Feb 27, 2008 at 3:21 AM, Paul Christoph Schr?der > <pschrode at="" alumni.unav.es=""> wrote: >> Hello all, >> >> I'm using the GSEABase package For gene set enrichment I used the java software provided by the Broad Institute, but now since I want to use R I'm trying to implement it to my workflow. >> >> If I'm interested only in a gene set collection like 'c2' from MSIGDB, what would be the appropiate command for getting the gene set collection? >> >> I tried: >> getBroadSets('msigdb_v2.1.xml', base="http://www.broad.mit.edu/gsea/donwloads.jsp") > > Hi, Paul. > > You need to specify the actual uri of the file that you want to get: > > These are all gene sets in the Broad collection. If you wanted just those with 'c2' designation then you might do something like > isC2 <- sapply(broadsets, + function(x) bcCategory(collectionType(x))) == "c2" > c2collection <- broadsets[isC2] > c2collection GeneSetCollection names: CHESLER_D6MIT150_TRANSCRIPTION_FACTORS_GLOCUS, DEATHPATHWAY, ..., ET743_RESIST_DN (1687 total) unique identifiers: MSX3, BACH1, ..., RUSC2 (16966 total) types in collection: geneIdType: SymbolIdentifier (1 total) collectionType: BroadCollection (1 total) Reassuringly, the number of gene sets identified (1687) is the same as from the Broad. For the hard-core, my off-list response to Paul was along the lines of 1. download msigdb_v2.1.xml by hand (because I didn't pay attention to the URL my copy came from, because the Broad people ask you to register before using it so it might be nice to let them know, implicitly, that their data is uesful and to whom, and because I want to reference it on more than one occaision so might as well avoid the bandwidth download; mostly though because I forgot where to get it from directly). 2. define the following function getLocalBroadSets <- function(file, category) { xpathq <- paste('//GENESET[@CATEGORY_CODE="', category, '"]', sep="") handler <- GSEABase:::.BroadXMLNodeToGeneSet_factory(NULL) GeneSetCollection(GSEABase:::.fromXML(file, xpathq, handler)) } 3. Evaluate the commands > fl <- '~/tmp/msigdb_v2.xml' > getLocalBroadSets(fl, "c1") to which Paul rightly asks for an explanation of the cryptic function: getLocalBroadSets <- function(file, category) { ## 'file' is an XML file that we'll read in (parse) using the XML package xpathq <- paste('//GENESET[@CATEGORY_CODE="', category, '"]', sep="") handler <- GSEABase:::.BroadXMLNodeToGeneSet_factory(NULL) GeneSetCollection(GSEABase:::.fromXML(file, xpathq, handler)) } Here's a more fully commented version. The key thing is that you can reformulate the query to very efficiently extract complicated subsets of the msigdb_v2.1 xml file. getLocalBroadSets <- function(file, category) { ## 'file' is a text file formatted as xml. We'll read it in to R ## using the XML package. You can look at the file and you'll see ## something like ## ## ## <msigdb name="chr16q24" version="1" build_date="1"> ## <geneset standard_name="chr5q23" ...="" ##="" ##="" things="" like="" 'msigdb',="" 'geneset'="" are="" called="" each="" a="" ##="" 'node'.="" things="" like="" 'standard_name'="" is="" an="" 'attribute'.="" ##="" ##="" it's="" possible="" using="" the="" amazing="" xml="" package="" and="" the="" underlying="" ##="" libxml="" library,="" to="" treat="" the="" xml="" as="" a="" kind="" of="" data="" base="" that="" ##="" you="" can="" query="" against.="" the="" syntax="" of="" queries="" is="" defined="" by="" the="" ##="" xpath="" query="" language,="" and="" is="" really="" quite="" simple="" (see="" the="" ##="" abbreviated="" syntax="" at="" ##="" ##="" http:="" www.w3.org="" tr="" xpath#path-abbrev="" ##="" xpathq="" <-="" paste('="" geneset[@category_code="', category, '" ]',="" sep="" )="" ##="" we="" construct="" a="" query="" that="" says="" 'select="" any="" geneset="" node="" with="" ##="" attribute="" category_code="" equal="" to="" the="" value="" of="" the="" variable="" ##="" 'category'="" ##="" ##="" this="" next="" line="" is="" best="" treated="" as="" black="" magic,="" but="" basically="" ##="" when="" we="" find="" a="" node="" from="" our="" query,="" we="" want="" to="" process="" it="" into="" ##="" a="" geneset.="" in="" the="" xml="" package="" we="" do="" this="" by="" providing="" a="" ##="" function="" that's="" invoked="" whenever="" the="" object="" is="" ##="" encountered.="" often="" we="" want="" our="" handler="" to="" remember="" things="" ##="" (e.g.,="" the="" location="" of="" the="" file="" that="" is="" being="" parsed)="" for="" each="" ##="" time="" it="" gets="" invoked.="" one="" way="" of="" doing="" this="" is="" by="" using="" r's="" ##="" 'lexical="" scope'="" to="" define="" a="" function="" that="" returns="" a="" ##="" function.="" the="" returned="" function="" (type="" 'handler'="" after="" the="" ##="" definition="" below)="" parses="" the="" node="" into="" a="" gene="" set,="" but="" has="" ##="" access="" to="" the="" variables="" defined="" in="" the="" function="" that="" defined="" it="" ##="" (.broadxmlnodetogeneset_factory).="" as="" i="" said,="" best="" to="" treat="" as="" ##="" black="" magic="" handler="" <-="" gseabase:::.broadxmlnodetogeneset_factory(null)="" ##="" we'll="" now="" parse="" the="" file="" form="" xml,="" using="" our="" xpathq="" to="" identify="" ##="" the="" nodes="" to="" extract,="" and="" our="" handler="" to="" convert="" each="" node="" from="" ##="" xml="" to="" a="" geneset.="" we'll="" wrap="" the="" entire="" object="" into="" a="" ##="" genesetcollection,="" instead="" of="" just="" returning="" a="" list="" genesetcollection(gseabase:::.fromxml(file,="" xpathq,="" handler))="" }=""> That should do it for you. > > Sean > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M2 B169 Phone: (206) 667-2793

ADD COMMENT • link 17.2 years ago Martin Morgan 25k

0

Entering edit mode

On Wed, Feb 27, 2008 at 12:15 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > 1. download msigdb_v2.1.xml by hand (because I didn't pay attention to > the URL my copy came from, because the Broad people ask you to > register before using it so it might be nice to let them know, > implicitly, that their data is uesful and to whom, and because I want > to reference it on more than one occaision so might as well avoid the > bandwidth download; mostly though because I forgot where to get it > from directly). This point ABOVE IS VERY IMPORTANT! Thanks for reminding us that free software is not free to develop and requires grant support, etc.... Sean

ADD REPLY • link 17.2 years ago Sean Davis 21k

Login before adding your answer.