Hi
I am trying to run a GO enrichment test on genes identified in experiments on an
affymetrix hgu133plus2 chip. I get zero annotations, even though within hgu133plus2.db
I can find annotations for the genes of interest. I've spent a while trying to get it to
work but with no success. What (no doubt very basic thing) am I doing wrong?
Best
Will
library_calls <- function() { library('hgu133plus2.db') library('topGO') } get_gene_universe <- function(affy_library) { # get all entrez ids associated with this chip gene_universe = as.integer(keys(eval(parse(text = affy_library)), keytype='ENTREZID')) return(gene_universe) } probesets_to_entrez <- function(probeset_keys, affy_library) { # get entrez ids for each of these entrez_ids_df = select(eval(parse(text = affy_library)), probeset_keys, 'ENTREZID', 'PROBEID') # return unique entrez_ids = unique(entrez_ids_df$ENTREZID) # remove any NAs entrez_ids = as.integer(entrez_ids[!is.na(entrez_ids)]) return(entrez_ids) } find_GO_terms_for_probesets <- function(probeset_keys, affy_library) { # get GO ids for each of these GO_ids_df = select(eval(parse(text = affy_library)), probeset_keys, 'GO','PROBEID') # restrict to BP GO_ids = unique(GO_ids_df[GO_ids_df$ONTOLOGY == 'BP', 'GO']) # check for annotations select(GO.db, keys = GO_ids, columns = 'DEFINITION', keytype = 'GOID') } # admin library_calls() # define library affy_library = 'hgu133plus2.db' # define test keys probeset_keys = c('201060_x_at', '213558_at', '223294_at', '201061_s_at', '202976_s_at', '224177_s_at') # get gene universe in entrez ids gene_universe = get_gene_universe(affy_library) # get genes of interest in entrez ids gene_list = probesets_to_entrez(probeset_keys, affy_library) # define inputs to topGOdata creator gene_vector = factor(as.integer(gene_universe %in% gene_list)) names(gene_vector) = gene_universe # check that GO terms exist find_GO_terms_for_probesets(probeset_keys, affy_library) # GOID # 1 GO:0051260 The process of creating protein oligomers, compounds composed of a small number, usually between three and ten, of identical component monomers. Oligomers may be formed by the polymerization of a number of monomers or the depolymerization of a large protein polymer. # 2 GO:1901585 Any process that modulates the frequency, rate or extent of acid-sensing ion channel activity. # 3 GO:0007010 A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of cytoskeletal structures. # ... # create new topGOdata object GOdata_BP = new("topGOdata", ontology = 'BP', allGenes = gene_vector, annotationFun = annFUN.db, affyLib = affy_library ) # Fails: # Building most specific GOs ..... ( 0 GO terms found. ) # Build GO DAG topology .......... ( 0 GO terms and 0 relations. ) # Error in if (is.na(index) || index < 0 || index > length(nd)) stop("vertex is not in graph: ", : # missing value where TRUE/FALSE needed sessionInfo() # R version 3.1.2 (2014-10-31) # Platform: x86_64-w64-mingw32/x64 (64-bit) # locale: # [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252 LC_MONETARY=German_Switzerland.1252 # [4] LC_NUMERIC=C LC_TIME=German_Switzerland.1252 # attached base packages: # [1] parallel stats4 stats graphics grDevices utils datasets methods base # other attached packages: # [1] topGO_2.18.0 SparseM_1.05 GO.db_3.0.0 graph_1.44.0 hgu133plus2.db_3.0.0 # [6] org.Hs.eg.db_3.0.0 RSQLite_1.0.0 DBI_0.3.1 AnnotationDbi_1.28.1 GenomeInfoDb_1.2.2 # [11] IRanges_2.0.0 S4Vectors_0.4.0 Biobase_2.26.0 BiocGenerics_0.12.0 # loaded via a namespace (and not attached): # [1] grid_3.1.2 lattice_0.20-29 tools_3.1.2
So I have now tried
however I get the following error:
I tried to correct for this with
but I then get a different error:
I would love to check all of this with the topGO documentation, but the reference manual is sadly
so sparsely filled in that there is effectively no more detail than you get in a vignette.
I think I have set you wrong. The geneSel argument actually appears to be some function used to decide which of your allGenes are the selected genes. In other words
genSel(gene_universe)
should return a boolean vector, where TRUE indicates that a particular gene is in the selected gene set, and FALSE. As an example,
> library(topGO)
> data(geneList)
> topDiffGenes
function (allScore)
{
return(allScore < 0.01)
}
> head(geneList)
1095_s_at 1130_at 1196_at 1329_s_at 1340_s_at 1342_g_at
1.0000000 1.0000000 0.6223795 0.5412240 1.0000000 1.0000000
So in the example in the vignette, you pass in the geneList vector as the allGenes argument, and topDiffGenes() as the geneSel argument.
Sorry to have mixed you up - I in general use GOstats for this sort of analysis, as I find it to be much less confusing (but I have made several contributions to that package, so go figure...).