GO.db: how to get GO Term

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 5 hours ago

WEHI, Melbourne, Australia

What is the recommended way to find the GO term that corresponds to a given GO accession number? I can do this: > library(GO.db) > Term(as.list(GOTERM["GO:0000166"])[[1]]) [1] "nucleotide binding" But this seems a bit inelegant, using as.list() and list subsetting as an intermediaries both inside and outside of S4 generic functions. And it needs more work to generalise to multiple accession numbers. Is there something more direct? Thanks a lot Gordon ----------------------------------------------- Associate Professor Gordon K Smyth, NHMRC Senior Research Fellow, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. smyth at wehi.edu.au http://www.wehi.edu.au/bioinformatics http://www.statsci.org/smyth

GO GO • 2.9k views

ADD COMMENT • link updated 15.8 years ago by Wacek Kusnierczyk ▴ 180 • written 15.8 years ago by Gordon Smyth 52k

0

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 8.7 years ago

United States

Hi Gordon, One thing you can do to make this more efficient is to use mget instead of as.list(). That way you won't be pulling the whole mapping out of the database into a list just to get one thing back out. mget("GO:0000166",GOTERM,ifnotfound=NA) Also, with mget() you can pass in multiple accessions into the 1st argument and it will just hand you a longer list back. mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA) Is this what you were looking for? Marc Gordon K Smyth wrote: > What is the recommended way to find the GO term that corresponds to a > given GO accession number? I can do this: > > > library(GO.db) > > Term(as.list(GOTERM["GO:0000166"])[[1]]) > [1] "nucleotide binding" > > But this seems a bit inelegant, using as.list() and list subsetting as > an intermediaries both inside and outside of S4 generic functions. > And it needs more work to generalise to multiple accession numbers. > Is there something more direct? > > Thanks a lot > Gordon > > ----------------------------------------------- > Associate Professor Gordon K Smyth, > NHMRC Senior Research Fellow, > Bioinformatics Division, Walter and Eliza Hall Institute of Medical > Research, 1G Royal Parade, Parkville, Vic 3052, Australia. > smyth at wehi.edu.au > http://www.wehi.edu.au/bioinformatics > http://www.statsci.org/smyth > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 15.8 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Marc Carlson wrote: > One thing you can do to make this more efficient is to use mget instead > of as.list(). That way you won't be pulling the whole mapping out of > the database into a list just to get one thing back out. > > mget("GO:0000166",GOTERM,ifnotfound=NA) > > Also, with mget() you can pass in multiple accessions into the 1st > argument and it will just hand you a longer list back. > > mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA) > just being curious, i have checked the performance of all three solutions posted on this list: library(GO.db) library(rbenchmark) ids = sapply(sample(GOTERM, 100), GOID) print( benchmark(replications=100, columns=c('test', 'elapsed'), eapply=eapply(GOTERM[ids], Term), lapply=lapply(as.list(GOTERM[ids]), Term), mget=lapply(mget(ids, GOTERM), Term))) # test elapsed # 3 eapply 10.925 # 1 lapply 11.091 # 2 mget 11.160 it appears that they are (with the particular data sample used) virtually equivalent in efficiency. vQ

ADD REPLY • link 15.8 years ago Wacek Kusnierczyk ▴ 180

0

Entering edit mode

Wacek Kusnierczyk wrote: > Marc Carlson wrote: >> One thing you can do to make this more efficient is to use mget instead >> of as.list(). That way you won't be pulling the whole mapping out of >> the database into a list just to get one thing back out. >> >> mget("GO:0000166",GOTERM,ifnotfound=NA) >> >> Also, with mget() you can pass in multiple accessions into the 1st >> argument and it will just hand you a longer list back. >> >> mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA) >> > > just being curious, i have checked the performance of all three > solutions posted on this list: > > library(GO.db) > library(rbenchmark) > > ids = sapply(sample(GOTERM, 100), GOID) > print( > benchmark(replications=100, columns=c('test', 'elapsed'), > eapply=eapply(GOTERM[ids], Term), > lapply=lapply(as.list(GOTERM[ids]), Term), > mget=lapply(mget(ids, GOTERM), Term))) > > # test elapsed > # 3 eapply 10.925 > # 1 lapply 11.091 > # 2 mget 11.160 > > it appears that they are (with the particular data sample used) > virtually equivalent in efficiency. I'm not the definitive source for this, but I guess the performance is dominated by, on the one hand, creating S4 instances for each table entry (e.g., in as.list), and on the other immediately extracting a slot from the created S4 object. With this in mind, I thought one could do tbl <- toTable(GOTERM[ids]) res1 <- with(tbl, Term[!duplicated(go_id)]) identical(sort(unlist(res0)), sort(res1)) This is about 10x faster, but now I'm starting to appreciate some of the work the software is doing -- there are duplicate go_ids returned by toTable, corresponding to synonyms for the terms I've entered. Inspired by this success, I looked at the underlying SQL schema (with GO_dbschema()) and intercepted at few calls to the db (with debug(dbGetQuery)) to arrive at this sql <- sprintf("SELECT DISTINCT term FROM go_term LEFT JOIN go_synonym ON go_term._id=go_synonym._id WHERE go_term.go_id IN ('%s');", paste(ids, collapse="','")) res2 <- dbGetQuery(GO_dbconn(), sql)[[1]] identical(sort(res1), sort(res2)) another 2x gain in speed, but also really paying a significant price in terms of responsibility for what the code is doing. Martin > vQ

ADD REPLY • link 15.8 years ago Martin Morgan 25k

0

Entering edit mode

Hi Gordon, Martin, Wacek, Martin Morgan wrote: > Wacek Kusnierczyk wrote: >> Marc Carlson wrote: >>> One thing you can do to make this more efficient is to use mget instead >>> of as.list(). That way you won't be pulling the whole mapping out of >>> the database into a list just to get one thing back out. >>> >>> mget("GO:0000166",GOTERM,ifnotfound=NA) >>> >>> Also, with mget() you can pass in multiple accessions into the 1st >>> argument and it will just hand you a longer list back. >>> >>> mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA) >>> >> just being curious, i have checked the performance of all three >> solutions posted on this list: >> >> library(GO.db) >> library(rbenchmark) >> >> ids = sapply(sample(GOTERM, 100), GOID) >> print( >> benchmark(replications=100, columns=c('test', 'elapsed'), >> eapply=eapply(GOTERM[ids], Term), >> lapply=lapply(as.list(GOTERM[ids]), Term), >> mget=lapply(mget(ids, GOTERM), Term))) >> >> # test elapsed >> # 3 eapply 10.925 >> # 1 lapply 11.091 >> # 2 mget 11.160 >> >> it appears that they are (with the particular data sample used) >> virtually equivalent in efficiency. > > I'm not the definitive source for this, but I guess the performance is > dominated by, on the one hand, creating S4 instances for each table > entry (e.g., in as.list), and on the other immediately extracting a slot > from the created S4 object. > > With this in mind, I thought one could do > > tbl <- toTable(GOTERM[ids]) > res1 <- with(tbl, Term[!duplicated(go_id)]) > identical(sort(unlist(res0)), sort(res1)) > > This is about 10x faster, but now I'm starting to appreciate some of the > work the software is doing -- there are duplicate go_ids returned by > toTable, corresponding to synonyms for the terms I've entered. > > Inspired by this success, I looked at the underlying SQL schema (with > GO_dbschema()) and intercepted at few calls to the db (with > debug(dbGetQuery)) to arrive at this > > sql <- sprintf("SELECT DISTINCT term > FROM go_term > LEFT JOIN go_synonym > ON go_term._id=go_synonym._id > WHERE go_term.go_id IN ('%s');", > paste(ids, collapse="','")) > res2 <- dbGetQuery(GO_dbconn(), sql)[[1]] WARNING: There is no guarantee that the data.frame returned by dbGetQuery() will have its rows sorted in the same order than the ids injected in the IN clause of the SQL statement. So, here 'res2' is of length 100 but the mapping between 'ids' and 'res2' is lost. Note that this is not a dbGetQuery() feature but an SQL feature (the same would apply from the sqlite3 command line client). Also note that eapply() has the same kind of problem: one needs to be careful with the result of eapply(GOTERM[ids], Term) because, strictly speaking, there is no reason why the returned list should have its elements in the same order as the keys in 'ids'. This behaviour follows in fact what the original eapply() does on a *real* environment where, according to the man page, "the output is not sorted" even though it seems that it is sorted in the lexicographic order of the symbols defined in the environment. But since this is not documented, no code should rely on this. Anyway, IMO eapply() should not be used on Bimap objects because of this confusing behaviour. > identical(sort(res1), sort(res2)) > > another 2x gain in speed, but also really paying a significant price in > terms of responsibility for what the code is doing. Even faster (DISTINCT and JOIN not needed): GOid2Term <- function(ids) { sql <- sprintf("SELECT go_id, term FROM go_term WHERE go_id IN ('%s')", paste(ids, collapse="','")) res <- dbGetQuery(GO_dbconn(), sql) ans <- res[[2]] names(ans) <- res[[1]] unname(ans[ids]) } Cheers, H. > > Martin > >> vQ > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 15.8 years ago Hervé Pagès 16k

0

Entering edit mode

Interesting, thanks. vQ > Hi Gordon, Martin, Wacek, > > Martin Morgan wrote: >> Wacek Kusnierczyk wrote: >>> Marc Carlson wrote: >>>> One thing you can do to make this more efficient is to use mget >>>> instead >>>> of as.list(). That way you won't be pulling the whole mapping out of >>>> the database into a list just to get one thing back out. >>>> >>>> mget("GO:0000166",GOTERM,ifnotfound=NA) >>>> >>>> Also, with mget() you can pass in multiple accessions into the 1st >>>> argument and it will just hand you a longer list back. >>>> >>>> mget(c("GO:0000066","GO:0000166"),GOTERM,ifnotfound=NA) >>>> >>> just being curious, i have checked the performance of all three >>> solutions posted on this list: >>> >>> library(GO.db) >>> library(rbenchmark) >>> >>> ids = sapply(sample(GOTERM, 100), GOID) >>> print( >>> benchmark(replications=100, columns=c('test', 'elapsed'), >>> eapply=eapply(GOTERM[ids], Term), >>> lapply=lapply(as.list(GOTERM[ids]), Term), >>> mget=lapply(mget(ids, GOTERM), Term))) >>> >>> # test elapsed >>> # 3 eapply 10.925 >>> # 1 lapply 11.091 >>> # 2 mget 11.160 >>> >>> it appears that they are (with the particular data sample used) >>> virtually equivalent in efficiency. >> >> I'm not the definitive source for this, but I guess the performance is >> dominated by, on the one hand, creating S4 instances for each table >> entry (e.g., in as.list), and on the other immediately extracting a slot >> from the created S4 object. >> >> With this in mind, I thought one could do >> >> tbl <- toTable(GOTERM[ids]) >> res1 <- with(tbl, Term[!duplicated(go_id)]) >> identical(sort(unlist(res0)), sort(res1)) >> >> This is about 10x faster, but now I'm starting to appreciate some of the >> work the software is doing -- there are duplicate go_ids returned by >> toTable, corresponding to synonyms for the terms I've entered. >> >> Inspired by this success, I looked at the underlying SQL schema (with >> GO_dbschema()) and intercepted at few calls to the db (with >> debug(dbGetQuery)) to arrive at this >> >> sql <- sprintf("SELECT DISTINCT term >> FROM go_term >> LEFT JOIN go_synonym >> ON go_term._id=go_synonym._id >> WHERE go_term.go_id IN ('%s');", >> paste(ids, collapse="','")) >> res2 <- dbGetQuery(GO_dbconn(), sql)[[1]] > > WARNING: There is no guarantee that the data.frame returned by > dbGetQuery() > will have its rows sorted in the same order than the ids injected in the > IN > clause of the SQL statement. So, here 'res2' is of length 100 but the > mapping between 'ids' and 'res2' is lost. Note that this is not a > dbGetQuery() > feature but an SQL feature (the same would apply from the sqlite3 command > line client). > > Also note that eapply() has the same kind of problem: one needs to be > careful with the result of eapply(GOTERM[ids], Term) because, strictly > speaking, there is no reason why the returned list should have its > elements in the same order as the keys in 'ids'. This behaviour follows > in fact what the original eapply() does on a *real* environment > where, according to the man page, "the output is not sorted" even though > it seems that it is sorted in the lexicographic order of the symbols > defined in the environment. But since this is not documented, no code > should rely on this. Anyway, IMO eapply() should not be used on Bimap > objects because of this confusing behaviour. > >> identical(sort(res1), sort(res2)) >> >> another 2x gain in speed, but also really paying a significant price in >> terms of responsibility for what the code is doing. > > Even faster (DISTINCT and JOIN not needed): > > GOid2Term <- function(ids) > { > sql <- sprintf("SELECT go_id, term > FROM go_term > WHERE go_id IN ('%s')", > paste(ids, collapse="','")) > res <- dbGetQuery(GO_dbconn(), sql) > ans <- res[[2]] > names(ans) <- res[[1]] > unname(ans[ids]) > } > > Cheers, > H. > > >> >> Martin >> >>> vQ >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 15.8 years ago Wacek Kusnierczyk ▴ 180

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 12 weeks ago

United States

Gordon K Smyth wrote: > What is the recommended way to find the GO term that corresponds to a > given GO accession number? I can do this: > > > library(GO.db) > > Term(as.list(GOTERM["GO:0000166"])[[1]]) > [1] "nucleotide binding" Not sure about 'recommended', but these Term(GOTERM[["GO:0000166"]]) eapply(GOTERM[gids], Term) seem a bit better Martin > > But this seems a bit inelegant, using as.list() and list subsetting as > an intermediaries both inside and outside of S4 generic functions. And > it needs more work to generalise to multiple accession numbers. Is > there something more direct? > > Thanks a lot > Gordon > > ----------------------------------------------- > Associate Professor Gordon K Smyth, > NHMRC Senior Research Fellow, > Bioinformatics Division, Walter and Eliza Hall Institute of Medical > Research, 1G Royal Parade, Parkville, Vic 3052, Australia. > smyth at wehi.edu.au > http://www.wehi.edu.au/bioinformatics > http://www.statsci.org/smyth > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 15.8 years ago Martin Morgan 25k

0

Entering edit mode

Wacek Kusnierczyk ▴ 180

@wacek-kusnierczyk-88

Last seen 10.6 years ago

Gordon K Smyth wrote: > What is the recommended way to find the GO term that corresponds to a > given GO accession number? I can do this: > > > library(GO.db) > > Term(as.list(GOTERM["GO:0000166"])[[1]]) > [1] "nucleotide binding" > > But this seems a bit inelegant, using as.list() and list subsetting as > an intermediaries both inside and outside of S4 generic functions. > And it needs more work to generalise to multiple accession numbers. > Is there something more direct? library(GO.db) # sample ids ids = sapply(sample(GOTERM, 10), GOID) # retrieve terms by id lapply(as.list(GOTERM[ids]), Term) vQ

ADD COMMENT • link 15.8 years ago Wacek Kusnierczyk ▴ 180

Login before adding your answer.