Question

Difficulties in using the mgsa package for Gene Set Analysis

Guest User ★ 13k

@guest-user-4897

Dear list,

I have been trying to apply the MGSA method for gene set analysis to my data using the mgsa package that is part of the Bioconductor release, but so far I haven't been able to make it work.

When I use the package's readGAF function to create the list of gene sets from the GO categories, with the Rat files downloaded from the GO webpage (http://www.geneontology.org/GO.downloads.annotations.shtml), the resulting object looks like this (edited for brevity):

    Object of class MgsaGoSets
    16779 sets over 29266 unique items.

    Set annotations:
               term
    GO:0000002 mitochondrial genome maintenan...
    ...
    GO:0000014 Catalysis of the hydrolysis of...
    ... and 16774 other sets.

    Item annotations:
            symbol  name
    1302934 St8sia5 ST8 alpha-N-acetyl-neuraminide...
    ...
    1302939 Eef1g   eukaryotic translation elongat...
    ... and 29261 other items.

Applying the function mgsa() to my list of differentially expressed genes and these gene sets doesn't work, as it looks for matches between the 'symbol' category in the gene sets and the genes of interest. However, the numbers in the 'symbol' category are RGD IDs (from the Rat Genome Database, http://rgd.mcw.edu/), and I haven't been able to find a way to either change these to something else (Entrez ID, gene symbol, etc.) or somehow get the RGD IDs for my genes of interest without looking them up manually.

So, in order to apply MGSA to my data, I am hoping to get some help on how to do one of these three things:

1) Modify the MgsaGoSets object so it uses as 'symbol' a more common gene ID, such as Entrez ID, instead of RGD ID.

2) Obtain the RGD IDs of my list of differentially expressed genes from a more common gene ID.

3) Create a named list of vectors of gene identifiers, where each GO category is one item in the list and is associated with a vector of all the gene IDs that make up the category, in a similar way to the process explained in the third section of the package creators' Bioinformatics paper (PMID: 21561920).

I would welcome any suggestion you may have, as I am quite interested in comparing the results of this analysis to other gene set analysis methods.

Thanks in advance for your help!

Juan

-- output of sessionInfo():

    > sessionInfo()
    R version 2.15.2 (2012-10-26)
    Platform: i386-apple-darwin9.8.0/i386 (32-bit)

    locale:
    [1] C/en_US.UTF-8/C/C/C/C

    attached base packages:
    [1] grid      stats     graphics  grDevices utils     datasets
    [7] methods   base

    other attached packages:
     [1] mgsa_1.6.0           gplots_2.11.0        MASS_7.3-22
     [4] KernSmooth_2.23-8    caTools_1.14         gdata_2.12.0
     [7] gtools_2.7.0         BiocInstaller_1.8.3  xtable_1.7-0
    [10] GOstats_2.24.0       graph_1.36.1         Category_2.24.0
    [13] rat2302cdf_2.11.0    genefilter_1.40.0    RColorBrewer_1.0-5
    [16] affycoretools_1.30.0 KEGG.db_2.8.0        GO.db_2.8.0
    [19] annotate_1.36.0      rat2302.db_2.8.1     org.Rn.eg.db_2.8.0
    [22] RSQLite_0.11.2       DBI_0.2-5            AnnotationDbi_1.20.3
    [25] limma_3.14.3         affy_1.36.0          Biobase_2.18.0
    [28] BiocGenerics_0.4.0

    loaded via a namespace (and not attached):
     [1] AnnotationForge_1.0.3 Biostrings_2.26.2     GSEABase_1.20.1
     [4] IRanges_1.16.4        RBGL_1.34.0           RCurl_1.95-3
     [7] XML_3.95-0.1          affyio_1.26.0         annaffy_1.30.0
    [10] biomaRt_2.14.0        bitops_1.0-4.2        gcrma_2.30.0
    [13] lattice_0.20-10       parallel_2.15.2       preprocessCore_1.20.0
    [16] splines_2.15.2        stats4_2.15.2         survival_2.36-14
    [19] tools_2.15.2          zlibbioc_1.4.0

-- Sent via the guest posting facility at bioconductor.org.
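
For readers following option 3 in the question, here is a minimal base-R sketch of building a named list of gene vectors, one per set. It does not call the mgsa package. The table `gaf_rows`, its column names, and the set names `set_A`/`set_B` are assumptions made up for illustration; only the two ID/symbol pairs 1302934/St8sia5 and 1302939/Eef1g appear in the post, and their set membership here is invented.

```r
# Toy stand-in for three columns of an annotation table (object ID, gene
# symbol, set ID). Set membership below is invented for illustration only.
gaf_rows <- data.frame(
  object_id = c("1302934", "1302939", "1302939", "9000001", "9000002"),
  symbol    = c("St8sia5", "Eef1g",   "Eef1g",   "GeneA",   "GeneA"),
  set_id    = c("set_A",   "set_A",   "set_B",   "set_B",   "set_B"),
  stringsAsFactors = FALSE
)

# split() groups the symbols by set; unique() drops a symbol that occurs
# more than once within the same set.
sets_by_symbol <- lapply(split(gaf_rows$symbol, gaf_rows$set_id), unique)

print(sets_by_symbol)
```

Keying the list on whatever identifier the study genes already use (symbols here) avoids the ID conversion step altogether, at the cost of relying on the symbols being spelled consistently in both sources.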

GO rat2302 Category mgsa • 1.4k views

updated 11.9 years ago by Sebastian Bauer ▴ 60 • written 11.9 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2013-01-17

Sebastian Bauer ▴ 60

@sebastian-bauer-2067

Dear Juan,

[...]

> Item annotations:
>         symbol  name
> 1302934 St8sia5 ST8 alpha-N-acetyl-neuraminide...
> ...
> 1302939 Eef1g   eukaryotic translation elongat...
> ... and 29261 other items.
>
> Applying the function mgsa() to my list of differentially expressed genes
> and these gene sets doesn't work, as it looks for matches between the
> 'symbol' category in the gene sets and the genes of interest. However, the
> numbers in the 'symbol' category are RGD IDs (from the Rat Genome
> Database, http://rgd.mcw.edu/), and I haven't been able to find a way to
> either change these to something else (Entrez ID, gene symbol, etc) or
> somehow get the RGD IDs for my genes of interest without looking for them
> manually.
>
> So, in order to apply MGSA to my data, I am hoping to get some help on how
> to do one of these three things:
>
> 1) Modify the MgsaGoSets object so it uses as 'symbol' a more common gene
> ID, such as Entrez ID, instead of RGD ID.

I've peeked into the RGD association file. As far as I understand it (I found no documentation in the README), it provides both RGD IDs and gene symbols. The readGAF() function reads both in, as you can see in the output. However, only the primary ID is used by mgsa(), and the primary ID is the RGD ID. If you can turn your list into a list of gene symbols, you could use the undocumented gaf@itemAnnotations data frame to convert from one namespace to the other.

> 2) Obtain the RGD IDs of my list of differentially expressed genes from a
> more common gene ID.

I'm unfortunately no expert in this, but maybe you can use BioMart at Ensembl for this. Unfortunately, the site doesn't work for me currently, so I couldn't try it out. See http://www.ensembl.org/info/data/biomart.html

Hope this helps.

Bye
Sebastian
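
The namespace conversion suggested in this answer can be sketched in base R as follows. This is only an illustration on a toy table, assuming the item annotation data frame has the primary (RGD) IDs as row names and a `symbol` column, as the printed output suggests. The names `item_ann` and `my_symbols`, and the row 9000001/GeneA, are hypothetical; with real data the table would come from the `itemAnnotations` slot mentioned above.

```r
# Toy stand-in for the item annotation data frame: row names are primary
# IDs, 'symbol' holds gene symbols. The third row is hypothetical.
item_ann <- data.frame(
  symbol    = c("St8sia5", "Eef1g", "GeneA"),
  row.names = c("1302934", "1302939", "9000001"),
  stringsAsFactors = FALSE
)

# Hypothetical study genes, given as symbols.
my_symbols <- c("Eef1g", "GeneA", "NotAnnotated")

# Keep every primary ID whose symbol is among the study genes. Using %in%
# (rather than match) returns all IDs when several share one symbol.
my_primary_ids <- rownames(item_ann)[item_ann$symbol %in% my_symbols]

# Symbols with no annotation row are worth reporting rather than dropping silently.
unmatched <- setdiff(my_symbols, item_ann$symbol)

print(my_primary_ids)
print(unmatched)
```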

written 11.9 years ago by Sebastian Bauer ▴ 60

Dear Sebastian,

Thanks for your reply, I wasn't aware of the existence of the @itemAnnotations data frame!

I have tried to convert the primary IDs to gene symbols, but in doing so I've become aware of another problem: the RGD IDs are stored as the row names of the data frame, but some of them refer to the same symbol, so you can't simply substitute the gene symbols for the RGD IDs (row names cannot be repeated). This also means that the gene sets, as defined by the readGAF function, include repeated genes, which would likely affect the results of the analysis. As an example, the first set defined (GO:0000002) is composed of 17 RGD IDs but only 14 different genes, as 3 of them appear twice, with a different RGD ID in each case.

I believe that, by using the information in the different slots of the MgsaGoSets object, it should be possible to remove the duplicated entries from both the sets and the item annotations, or at least to create a list that the mgsa() function can use for the analysis, so I'll start looking into doing that.

I've also looked into the suggestion of using BioMart, which was a very good idea, but I'd still be facing the problem of the duplicated elements in the gene sets.

Thanks again!

Juan

________________________________________
From: Sebastian Bauer [sebastian.bauer@charite.de]
Sent: Thursday, January 17, 2013 11:39 AM
To: Juan M.Adrian [guest]
Cc: bioconductor at r-project.org; Adrian Segarra, Juan
Subject: Re: Difficulties in using the mgsa package for Gene Set Analysis

Dear Juan, [...]
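
One way the duplicate removal described in this reply might look, as a base-R sketch on invented data: each set of primary IDs is translated to symbols and then reduced with unique(). Every ID, symbol, and set name below is hypothetical, and `collapse_to_symbols` is an illustrative helper, not part of the mgsa package.

```r
# Hypothetical annotation table in which two pairs of primary IDs share a symbol.
item_ann <- data.frame(
  symbol    = c("GeneA", "GeneA", "GeneB", "GeneC", "GeneC"),
  row.names = c("9000001", "9000002", "9000003", "9000004", "9000005"),
  stringsAsFactors = FALSE
)

# Hypothetical gene sets keyed by primary ID.
sets_by_id <- list(
  set_A = c("9000001", "9000002", "9000003"),
  set_B = c("9000003", "9000004", "9000005")
)

# Translate IDs to symbols with an exact match on row names, drop IDs that
# have no annotation row, and keep each symbol once per set.
collapse_to_symbols <- function(ids, ann) {
  syms <- ann$symbol[match(ids, rownames(ann))]
  unique(syms[!is.na(syms)])
}

sets_by_symbol <- lapply(sets_by_id, collapse_to_symbols, ann = item_ann)

# Set sizes before and after collapsing, to see how many duplicates were removed.
set_sizes <- data.frame(
  ids     = lengths(sets_by_id),
  symbols = lengths(sets_by_symbol)
)
print(set_sizes)
```

The resulting list is keyed by symbol, so the study genes would need to be given as symbols as well. match() is used instead of indexing the data frame by row name because character row indexing on a data frame allows partial matches.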

written 11.9 years ago by Adrian Segarra, Juan ▴ 10