goseq analysis - extracting list of my genes which are DE in the enriched GO category
2
0
Entering edit mode
Helen Wright ▴ 10
@helen-wright-4840
Last seen 10.2 years ago
Hello everyone This is my first message to Bioconductor, I hope someone out there can help me.?I have been working through the goseq vignette and adapting it to my own dataset. ?I have got my list of enriched go categories, which looks like this: > head(GO.wall) ? ? ? ? category over_represented_pvalue under_represented_pvalue 3217 GO:0006954 ? ? ? ? ? ?1.496746e-10 ? ? ? ? ? ? ? ? ? ? ? ?1 3548 GO:0008009 ? ? ? ? ? ?4.502492e-10 ? ? ? ? ? ? ? ? ? ? ? ?1 9081 GO:0042379 ? ? ? ? ? ?1.010877e-09 ? ? ? ? ? ? ? ? ? ? ? ?1 2160 GO:0005126 ? ? ? ? ? ?1.033292e-09 ? ? ? ? ? ? ? ? ? ? ? ?1 2159 GO:0005125 ? ? ? ? ? ?1.534272e-09 ? ? ? ? ? ? ? ? ? ? ? ?1 4251 GO:0009611 ? ? ? ? ? ?1.983311e-09 ? ? ? ? ? ? ? ? ? ? ? ?1 I can get a list of the genes which are contained in each GO category using : >?ensembl.gene.ids=get("GO:0006954", org.Hs.egGO2ALLEGS) > unlist(mget(unique(ensembl.gene.ids), org.Hs.egGENENAME)) but this just gives me a list of ALL the genes attached to this GO category. ?I?would like to find out which of my DE genes are in this GO category. I have referred to a previous Bioconductor post <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20110308="" 92b2="" 7df4="" attachment.pl=""> which shows how to get a list of all the genes in each GO category, like this: > data(genes) > genes2go=getgo(names(genes),'hg19','ensGene') > go2genes=goseq:::reversemapping(genes2go) Here is an extract from go2genes: > go2genes$'GO:0006954' ? [1] "ENSG00000088812" "ENSG00000078747" "ENSG00000172216" "ENSG00000196839" "ENSG00000243646" "ENSG00000128271" "ENSG00000100014" "ENSG00000128335" "ENSG00000100292" "ENSG00000128284" ?[11] "ENSG00000240972" "ENSG00000117215" "ENSG00000050628" "ENSG00000187554" "ENSG00000188257" "ENSG00000159377" "ENSG00000117525" "ENSG00000116288" "ENSG00000158769" "ENSG00000160712" ?[21] "ENSG00000135744" "ENSG00000117335" "ENSG00000163435" "ENSG00000177807" "ENSG00000117586" "ENSG00000078900" "ENSG00000178585" "ENSG00000168297" "ENSG00000134070" "ENSG00000144802" ?[31] "ENSG00000113889" "ENSG00000172936" "ENSG00000074416" "ENSG00000085276" "ENSG00000121807" "ENSG00000157017" "ENSG00000233276" "ENSG00000175040" "ENSG00000072274" "ENSG00000132170" > class(go2genes) [1] "list" I have my list of genes in a data.frame called "pwf" which looks like this" > head(pwf) ? ? ? ? ? ? ? ? ? ? ? ? ? DEgenes bias.data ? ? ? ? pwf ENSG00000000003 ? ? ? 0 ? ?1563.5 0.003486092 ENSG00000000005 ? ? ? 0 ? ?1353.0 0.003287926 ENSG00000000419 ? ? ? 0 ? ?1079.5 0.002955906 ENSG00000000457 ? ? ? 0 ? ?3128.5 0.003943449 ENSG00000000460 ? ? ? 0 ? ?3126.0 0.003943445 ENSG00000000938 ? ? ? 0 ? ?2437.0 0.003896895 The number 1 in pwf$DEgenes indicates a differentially expressed gene. So what (I think) I need to do is: 1) get the GOTERM (e.g. GO:0006954) from GO.wall$category 2) look in go2genes for $'GO:0006954' and get the list of all genes associated with this GOTERM 3) look for these genes in rownames(pwf), and if pwf$DEgenes==1 then print GO.wall$category, GO.wall$over_represented_pvalue, rownames(pwf) I have no idea if this is possible. ?The sort of output I would expect would be something like (GOTERM, p-value, list of genes from this category enriched in my dataset): GO:0006954 1.496746e-10 "ENSG00000078747"?"ENSG00000177807" "ENSG00000117586" "ENSG00000078900"?"ENSG00000175040" "ENSG00000072274" "ENSG00000132170" Thank you for taking the time to read my email. I hope someone out there will be able to help me. Many thanks and have a nice weekend everybody. Helen Wright University of Liverpool, UK > sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C/en_US.UTF-8/C/C/C/C attached base packages: [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base other attached packages: ?[1] GO.db_2.5.0 ? ? ? ? ? ?org.Hs.eg.db_2.5.0 ? ? RSQLite_0.9-4 ? ?DBI_0.2-5 ? ? ? ? ? ? ?AnnotationDbi_1.14.1 ? edgeR_2.2.5 ?goseq_1.4.0 ? ? ? ? ? ?geneLenDataBase_0.99.7 ?[9] BiasedUrn_1.04 ? ? ? ? DESeq_1.4.1 ? ? ? ? ? ?locfit_1.5-6 ? lattice_0.19-30 ? ? ? ?akima_0.5-4 ? ? ? ? ? ?Biobase_2.12.2 Rsamtools_1.4.3 ? ? ? ?Biostrings_2.20.3 [17] GenomicFeatures_1.4.4 ?GenomicRanges_1.4.8 ? ?IRanges_1.10.6 loaded via a namespace (and not attached): ?[1] BSgenome_1.20.0 ? ?Matrix_0.999375-50 RColorBrewer_1.0-5 RCurl_1.6-10 ? ? ? XML_3.4-2 ? ? ? ? ?annotate_1.30.1 ? ?biomaRt_2.8.1 ? ? ?genefilter_1.34.0 ?geneplotter_1.30.0 [10] grid_2.13.1 ? ? ? ?limma_3.8.3 ? ? ? ?mgcv_1.7-6 nlme_3.1-101 ? ? ? rtracklayer_1.12.4 splines_2.13.1 survival_2.36-9 ? ?tools_2.13.1 ? ? ? xtable_1.5-6
GO Category goseq GO Category goseq • 5.3k views
ADD COMMENT
0
Entering edit mode
@iain-gallagher-2532
Last seen 9.3 years ago
United Kingdom
Hi Helen Here's how I did exactly what you want (with added symbol and logFC goodness) I was studying cows hence the bosTau4 and using ENSEBML gene ids so hence the ensGene. You can adapt this all you like though. These are the libraries I loaded upfront (i.e. before the edgeR analysis from which the following is snipped): library(edgeR) library(org.Bt.eg.db)#cows! library(goseq) library(GO.db) library(annotate) #this calculates a vector that indicates whether each gene is regulated (at FDR 1% in my case) or not. It's fed to the nullp function of goseq (see below) # e.g. 0 0 1 1 0 0 1 # two genes not regulated, next two are, next two not etc genes <- as.integer(p.adjust(lrtFiltered$table$p.value, method = "BH") < 0.01)#FDR = 1% #...GO ANALYSIS...# pwf=nullp(genes,'bosTau4','ensGene') # genes = the regulated genes goCats <- goseq(pwf, 'bosTau4','ensGene', test.cats=("GO:BP")) sigCats <- goCats[which(goCats[,2] < 0.05),] # getting the sig categories #let's annotate the GO categories cats <- sigCats$category terms <- stack(lapply(mget(cats, GOTERM, ifnotfound=NA), Term)) #write out GO data sigCats$Term <- with(sigCats, terms$values[match(terms$ind, sigCats$category)] ) #get the genes in each category allGos <- stack(getgo(rownames(topGenes$table), 'bosTau4', 'ensGene')) # so here I pull the GO terms for every gene that is regulated. #add the terms onlySigCats <- allGos[allGos$values %in% sigCats$category,] onlySigCats$Term <- with( onlySigCats, terms$value[match(onlySigCats$values, terms$ind)] ) #add the gene symbol onlySigCats$Symbol <- with( onlySigCats, annot2[,2][match(onlySigCats$ind, rownames(annot2) )] ) #add the gene logFC onlySigCats$logFC <- with( onlySigCats, topGenes$table$logFC[match(onlySigCats$ind, rownames(topGenes$table) )] ) You can then write out this table. HTH Best Iain > sessionInfo() R version 2.13.1 (2011-07-08) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C [3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8 [5] LC_MONETARY=C LC_MESSAGES=en_GB.utf8 [7] LC_PAPER=en_GB.utf8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] annotate_1.30.0 GO.db_2.5.0 goseq_1.4.0 [4] geneLenDataBase_0.99.7 BiasedUrn_1.04 org.Bt.eg.db_2.5.0 [7] RSQLite_0.9-4 DBI_0.2-5 AnnotationDbi_1.14.1 [10] Biobase_2.12.2 edgeR_2.2.5 loaded via a namespace (and not attached): [1] biomaRt_2.8.1 Biostrings_2.20.3 BSgenome_1.20.0 [4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 grid_2.13.1 [7] IRanges_1.10.6 lattice_0.19-33 limma_3.8.3 [10] Matrix_0.9996875-3 mgcv_1.7-6 nlme_3.1-102 [13] RCurl_1.6-9 rtracklayer_1.12.4 tools_2.13.1 [16] XML_3.4-2 xtable_1.5-6 > --- On Fri, 9/9/11, Helen Wright <hlwright.uk at="" gmail.com=""> wrote: > From: Helen Wright <hlwright.uk at="" gmail.com=""> > Subject: [BioC] goseq analysis - extracting list of my genes which are DE in the enriched GO category > To: bioconductor at r-project.org > Date: Friday, 9 September, 2011, 14:30 > Hello everyone > > This is my first message to Bioconductor, I hope someone > out there can > help me.?I have been working through the goseq vignette > and adapting > it to my own dataset. ?I have got my list of enriched go > categories, > which looks like this: > > > head(GO.wall) > ? ? ? ?? category? ? ? ? > ???over_represented_pvalue > under_represented_pvalue > 3217 GO:0006954 ? ? ? ? ? ?1.496746e-10 ? ? ? ? > ? ? ? ? ? ? ? ?1 > 3548 GO:0008009 ? ? ? ? ? ?4.502492e-10 ? ? ? ? > ? ? ? ? ? ? ? ?1 > 9081 GO:0042379 ? ? ? ? ? ?1.010877e-09 ? ? ? ? > ? ? ? ? ? ? ? ?1 > 2160 GO:0005126 ? ? ? ? ? ?1.033292e-09 ? ? ? ? > ? ? ? ? ? ? ? ?1 > 2159 GO:0005125 ? ? ? ? ? ?1.534272e-09 ? ? ? ? > ? ? ? ? ? ? ? ?1 > 4251 GO:0009611 ? ? ? ? ? ?1.983311e-09 ? ? ? ? > ? ? ? ? ? ? ? ?1 > > > I can get a list of the genes which are contained in each > GO category using : > >?ensembl.gene.ids=get("GO:0006954", > org.Hs.egGO2ALLEGS) > > unlist(mget(unique(ensembl.gene.ids), > org.Hs.egGENENAME)) > > but this just gives me a list of ALL the genes attached to > this GO > category. ?I?would like to find out which of my DE genes > are in this > GO category. > > > I have referred to a previous Bioconductor post > <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20110308="" 92="" b27df4="" attachment.pl=""> > which shows how to get a list of all the genes in each GO > category, > like this: > > > data(genes) > > genes2go=getgo(names(genes),'hg19','ensGene') > > go2genes=goseq:::reversemapping(genes2go) > > Here is an extract from go2genes: > > go2genes$'GO:0006954' > ? [1] "ENSG00000088812" "ENSG00000078747" > "ENSG00000172216" > "ENSG00000196839" "ENSG00000243646" "ENSG00000128271" > "ENSG00000100014" "ENSG00000128335" "ENSG00000100292" > "ENSG00000128284" > ?[11] "ENSG00000240972" "ENSG00000117215" > "ENSG00000050628" > "ENSG00000187554" "ENSG00000188257" "ENSG00000159377" > "ENSG00000117525" "ENSG00000116288" "ENSG00000158769" > "ENSG00000160712" > ?[21] "ENSG00000135744" "ENSG00000117335" > "ENSG00000163435" > "ENSG00000177807" "ENSG00000117586" "ENSG00000078900" > "ENSG00000178585" "ENSG00000168297" "ENSG00000134070" > "ENSG00000144802" > ?[31] "ENSG00000113889" "ENSG00000172936" > "ENSG00000074416" > "ENSG00000085276" "ENSG00000121807" "ENSG00000157017" > "ENSG00000233276" "ENSG00000175040" "ENSG00000072274" > "ENSG00000132170" > > > class(go2genes) > [1] "list" > > > I have my list of genes in a data.frame called "pwf" which > looks like this" > > head(pwf) > ? ? ? ? ? ? ? ? ? ? ? ? ? DEgenes? ? > bias.data ? ? ? ? pwf > ENSG00000000003 ? ? ? 0 ? ?1563.5 0.003486092 > ENSG00000000005 ? ? ? 0 ? ?1353.0 0.003287926 > ENSG00000000419 ? ? ? 0 ? ?1079.5 0.002955906 > ENSG00000000457 ? ? ? 0 ? ?3128.5 0.003943449 > ENSG00000000460 ? ? ? 0 ? ?3126.0 0.003943445 > ENSG00000000938 ? ? ? 0 ? ?2437.0 0.003896895 > > The number 1 in pwf$DEgenes indicates a differentially > expressed gene. > > > So what (I think) I need to do is: > 1) get the GOTERM (e.g. GO:0006954) from GO.wall$category > 2) look in go2genes for $'GO:0006954' and get the list of > all genes > associated with this GOTERM > 3) look for these genes in rownames(pwf), and if > pwf$DEgenes==1 then > print GO.wall$category, GO.wall$over_represented_pvalue, > rownames(pwf) > > > I have no idea if this is possible. ?The sort of output I > would expect > would be something like (GOTERM, p-value, list of genes > from this > category enriched in my dataset): > > GO:0006954 1.496746e-10 > "ENSG00000078747"?"ENSG00000177807" "ENSG00000117586" > "ENSG00000078900"?"ENSG00000175040" "ENSG00000072274" > "ENSG00000132170" > > > Thank you for taking the time to read my email. I hope > someone out > there will be able to help me. > > Many thanks and have a nice weekend everybody. > > Helen Wright > University of Liverpool, UK > > > > > sessionInfo() > R version 2.13.1 (2011-07-08) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > locale: > [1] C/en_US.UTF-8/C/C/C/C > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets > ?methods ? base > other attached packages: > ?[1] GO.db_2.5.0 ? ? ? ? ? ?org.Hs.eg.db_2.5.0 ? ? > RSQLite_0.9-4 > ? ?DBI_0.2-5 ? ? ? ? ? ? ?AnnotationDbi_1.14.1 ? > edgeR_2.2.5 > ?goseq_1.4.0 ? ? ? ? ? ?geneLenDataBase_0.99.7 > ?[9] BiasedUrn_1.04 ? ? ? ? DESeq_1.4.1 ? ? ? ? ? > ?locfit_1.5-6 > ? lattice_0.19-30 ? ? ? ?akima_0.5-4 ? ? ? ? ? > ?Biobase_2.12.2 > Rsamtools_1.4.3 ? ? ? ?Biostrings_2.20.3 > [17] GenomicFeatures_1.4.4 ?GenomicRanges_1.4.8 ? > ?IRanges_1.10.6 > loaded via a namespace (and not attached): > ?[1] BSgenome_1.20.0 ? ?Matrix_0.999375-50 > RColorBrewer_1.0-5 > RCurl_1.6-10 ? ? ? XML_3.4-2 ? ? ? ? > ?annotate_1.30.1 ? ?biomaRt_2.8.1 > ? ? ?genefilter_1.34.0 ?geneplotter_1.30.0 > [10] grid_2.13.1 ? ? ? ?limma_3.8.3 ? ? ? > ?mgcv_1.7-6 > nlme_3.1-101 ? ? ? rtracklayer_1.12.4 splines_2.13.1 > survival_2.36-9 ? ?tools_2.13.1 ? ? ? xtable_1.5-6 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode

Hi Ian and Helen - I tried this bit of code but got an error at the point of 

> terms <- stack(lapply(mget(cats, GOTERM, ifnotfound=NA), Term))
Error in stack(lapply(mget(cats, GOTERM, ifnotfound = NA), Term)) : 
  error in evaluating the argument 'x' in selecting a method for function 'stack': Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘Term’ for signature ‘"logical"’

am I missing something, or not put the code in correctly>

preceeding the offeding line, this is how I set it up

KEGG.Sig.PC3.40.samp <- goseq(pwf.KG, "hg19", "knownGene", method="Sampling",repcnt=1000, test.cats = c("KEGG"))

goCats <- KEGG.Sig.PC3.40.samp
sigCats <- goCats[which(goCats[,2]<0.05),]

cats <- sigCats$category

Is it a KEGG thing (does this only work with GO terms)?

Cheers,

Matt

 

output of sessionInfo():

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] grDevices datasets  parallel  graphics  stats4    stats     utils     methods   base     

other attached packages:
 [1] edgeR_3.10.2              limma_3.24.15             annotate_1.46.1          
 [4] XML_3.98-1.3              GO.db_3.1.2               goseq_1.20.0             
 [7] BiasedUrn_1.06.1          org.Hs.eg.db_3.1.2        RSQLite_1.0.0            
[10] DBI_0.3.1                 AnnotationDbi_1.30.1      Biobase_2.28.0           
[13] geneLenDataBase_1.4.0     genefilter_1.50.0         ggplot2_1.0.1            
[16] DESeq2_1.8.1              RcppArmadillo_0.5.200.1.0 Rcpp_0.12.0              
[19] GenomicRanges_1.20.5      GenomeInfoDb_1.4.1        IRanges_2.2.5            
[22] S4Vectors_0.6.3           BiocGenerics_0.14.0       BiocInstaller_1.18.4     

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.1          lattice_0.20-33         Rsamtools_1.20.4        Biostrings_2.36.3      
 [5] digest_0.6.8            plyr_1.8.3              futile.options_1.0.0    acepack_1.3-3.3        
 [9] zlibbioc_1.14.0         GenomicFeatures_1.20.2  Matrix_1.2-2            rpart_4.1-10           
[13] proto_0.3-10            splines_3.2.1           BiocParallel_1.2.20     geneplotter_1.46.0     
[17] stringr_1.0.0           foreign_0.8-65          RCurl_1.95-4.7          biomaRt_2.24.0         
[21] munsell_0.4.2           rtracklayer_1.28.8      mgcv_1.8-7              nnet_7.3-10            
[25] gridExtra_2.0.0         Hmisc_3.16-0            GenomicAlignments_1.4.1 MASS_7.3-43            
[29] bitops_1.0-6            grid_3.2.1              nlme_3.1-121            xtable_1.7-4           
[33] gtable_0.1.2            magrittr_1.5            scales_0.2.5            stringi_0.5-5          
[37] XVector_0.8.0           reshape2_1.4.1          latticeExtra_0.6-26     futile.logger_1.4.1    
[41] Formula_1.2-1           lambda.r_1.1.7          RColorBrewer_1.1-2      tools_3.2.1            
[45] survival_2.38-3         colorspace_1.2-6        cluster_2.0.3   

 

 

ADD REPLY
0
Entering edit mode
@darshinimamindla-8843
Last seen 8.6 years ago
United States

Hi Iain Gallagher ,

I tried using your code to extract the genes of my interest that are DE from the list of DElist given to goseq analysis. But i got the following error. Could you please tell me what was the mistake that is being done. 

#Goseq analysis
genes=as.integer(p.adjust(de.cmn$table$PValue[de.cmn$table$logFC!=0],method="BH")<.05)

names(genes)=row.names(de.cmn$table[de.cmn$table$logFC!=0,])
table(genes)

pwf=nullp(genes,"hg19","knownGene")

GO.wall=goseq(pwf,"hg19","knownGene")

enriched=GO.wall$category[p.adjust(GO.wall$over_represented_pvalue, method="BH")<.05]

biocLite("org.Hs.eg.db")
library(org.Hs.eg.db)
biocLite("GO.db")
library(GO.db)
biocLite("annotate")
library(annotate)
sigCats <- GO.wall[which(GO.wall[,2] < 0.05),]
cats <- sigCats$category
terms <- stack(lapply(mget(cats, GO:0005615, ifnotfound=NA), extracellular space))

ERROR: Error: unexpected symbol in "terms <- stack(lapply(mget(cats, GO:0005615, ifnotfound=NA), extracellular space"

ADD COMMENT
0
Entering edit mode

Can you try this! Change this line

terms <- stack(lapply(mget(cats, GO:0005615, ifnotfound=NA), extracellular space))

To this

terms <- stack(lapply(mget(cats, "GO:0005615", ifnotfound=NA), extracellular space))

 

 

ADD REPLY

Login before adding your answer.

Traffic: 972 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6