On Sat, Feb 26, 2011 at 12:00 PM, <duke.lists at gmx.com> wrote:
> Dear colleagues,
> I used to download the GO database from geneontology.org and did some
> C++ coding to manipulate the data as I wished. Now I want to try my
> luck with R - Bioconductor. I have heard of tons of tools supporting
> GO such as GO.db, topGO, goseq, GOstats, biomaRt etc... and I have
> been reading their descriptions and examples, but honestly I am
> overwhelmed and don't really know which package I should use to fulfill
> my task. So please advise me how I can do the following two simple
> tasks:
> 1. I have a list of genes (with gene names from UCSC such as Foxp3
> etc...). How do I filter this list to get genes that have a certain GO
> term such as transcription factor?
Since you said it was a simple task, consider the simple solution
involving the "%annTo%" operator, which tells whether the symbols on
the left have been annotated to the term on the right:
> c("FOXP3", "BRCA2") %annTo% "mammary"
> c("FOXP3", "BRCA2") %annTo% "transcription factor"
You could use the named logical vectors generated in this way to
perform the filtering you describe. But see below.
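As a minimal sketch of that filtering step, assuming only that the operator returns a logical vector named by symbol: the flags and gene names below are invented placeholders, and keep_annotated is a hypothetical helper, not part of any package.

```r
# Hypothetical result of a call such as  syms %annTo% "transcription factor":
# a logical vector named by gene symbol (names and values are placeholders).
hits <- c(GeneA = TRUE, GeneB = FALSE, GeneC = NA)

# Keep the names of the entries flagged TRUE; NA flags are treated as "no".
keep_annotated <- function(flags) names(flags)[!is.na(flags) & flags]

tf_genes <- keep_annotated(hits)
print(tf_genes)
```

Treating NA as "not annotated" is a design choice here; a symbol that could not be looked up at all may deserve a warning instead.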
> 2. How do I know the capacity of the latest GO database on
> Bioconductor, for example, how many genes are available for mm9, and how
> many of them have the GO term transcription factor?
The "GO database" concerns the gene ontology, a structure of terms and
relationships among them. The association of GO terms to gene names
for mouse is presented in various ways, but the most basic one is in
the org.Mm.eg.db package. With that, you could use
to find, among other statistics,
org.Mm.egGO has 29984 mapped keys (of 63329 keys). Your question
concerning transcription factor mapping is not completely precise, and
you might want to survey the family of GO terms to come up with a set
of terms that meets your requirement. Here's a
demonstration of related queries:
> get("GO:0003700", GOTERM)
GOID: GO:0003700
Term: sequence-specific DNA binding transcription factor activity
Ontology: MF
Definition: Interacting selectively and non-covalently with a specific
DNA sequence in order to modulate transcription. The transcription
factor may or may not also interact selectively with a protein or
macromolecular complex.
Synonym: GO:0000130
Secondary: GO:0000130
> tfg = get("GO:0003700", revmap(org.Mm.egGO))
> length(tfg)
[1] 940
org.Mm.egGO is a mapping from mouse Entrez Gene IDs to GO term tags.
revmap reverses this mapping, taking a tag to the set of genes
mapped to that tag by Entrez.
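To make the direction of the two mappings concrete, here is a toy sketch in base R. The gene and term identifiers are made up, and invert_map is a hypothetical helper standing in for what revmap does on the real annotation object. The final lines run the corresponding queries on org.Mm.eg.db only if that package is installed; the counts depend on the annotation release, so none are shown.

```r
# Toy stand-in for org.Mm.egGO: gene id -> GO ids (all identifiers invented).
gene2go <- list(g1 = c("GO:A", "GO:B"), g2 = "GO:A", g3 = character(0))

# Invert a gene -> terms list into a term -> genes list.
invert_map <- function(m) {
  long <- data.frame(gene = rep(names(m), lengths(m)),
                     go = unlist(m, use.names = FALSE),
                     stringsAsFactors = FALSE)
  split(long$gene, long$go)
}

go2gene <- invert_map(gene2go)
print(lengths(go2gene))        # number of genes per term
n_mapped <- sum(lengths(gene2go) > 0)   # "mapped keys": genes with >= 1 term
print(n_mapped)

# The same questions against the real annotation, if it is available.
if (requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(org.Mm.eg.db))
  cat("genes with at least one GO term:",
      length(mappedkeys(org.Mm.egGO)), "\n")
  tfg <- get("GO:0003700", revmap(org.Mm.egGO))
  cat("genes annotated directly to GO:0003700:", length(unique(tfg)), "\n")
}
```

Note that the direct mapping counts only genes annotated to that exact term, not to more specific terms beneath it in the ontology.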
Now, to return to the first question -- it isn't simple and a lot of
presuppositions have to be made explicit. One of the most problematic
is the commitment to use gene symbols. If you don't read the docs
about Bioconductor annotation and R packages pertaining thereto, it's
hard to make progress. My code defining "%annTo%" will follow -- it
is not particularly efficient or well-designed, but it shows the
components that have to be identified and used for this to work. All
the annotation is based on SQLite tables distributed with Bioconductor
packages, with interfaces defined in DBI and RSQLite packages. The
%annTo% operator is defined below. More thoughtful solutions are
sym2terms = function (sym) {
  # map one gene symbol to the names of the GO terms it is annotated to
  if (length(sym)>1) stop("scalar symbol input required") # bad form
  require(org.Hs.eg.db) # generalize
  require(GO.db) # provides GOTERM
  egs = sapply(sym, function(z) get(z, revmap(org.Hs.egSYMBOL)))
  gos = mget(egs, org.Hs.egGO)
  goids = sapply(gos, lapply, "[[", "GOID")
  sapply(lapply(goids, function(z) get(z, GOTERM)), "Term")
}
genesAnnotatedTo = function(syms, term="transcription factor",
    meth=agrep) {
  # TRUE for each symbol with at least one term name matching 'term'
  tms = lapply(syms, sym2terms)
  chk = sapply(tms, function(z) length(meth(term, z))>0)
  names(chk) = syms
  chk
}
"%annTo%" = function(x,y) genesAnnotatedTo(x,y)
> As a side note, I have never tried any of those GO packages, so any
> advice/suggestions are welcome. Thank you so much in advance,
> D.
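One possible alternative to the %annTo% code above, sketched under some assumptions: it uses plain substring matching with grepl rather than agrep's approximate matching, and it relies on the select() interface of recent AnnotationDbi releases. flag_symbols and the toy table are hypothetical, for illustration only.

```r
# Given a long table with one row per (SYMBOL, TERM) pair, flag each requested
# symbol that has at least one GO term name containing `pattern`.
flag_symbols <- function(syms, tab, pattern) {
  hit <- unique(tab$SYMBOL[grepl(pattern, tab$TERM, ignore.case = TRUE)])
  setNames(syms %in% hit, syms)
}

# Toy table with invented rows, only to show the expected shape.
toy <- data.frame(SYMBOL = c("GeneA", "GeneA", "GeneB"),
                  TERM = c("transcription factor binding", "nucleus",
                           "DNA repair"),
                  stringsAsFactors = FALSE)
toy_flags <- flag_symbols(c("GeneA", "GeneB", "GeneC"), toy,
                          "transcription factor")
print(toy_flags)

# With the annotation packages installed, build the table from real data.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("GO.db", quietly = TRUE)) {
  syms <- c("FOXP3", "BRCA2")
  # symbol -> GO id (one row per annotation)
  sym2go <- AnnotationDbi::select(org.Hs.eg.db::org.Hs.eg.db, keys = syms,
                                  columns = "GO", keytype = "SYMBOL")
  # GO id -> term name
  go2term <- AnnotationDbi::select(GO.db::GO.db,
                                   keys = unique(na.omit(sym2go$GO)),
                                   columns = "TERM", keytype = "GOID")
  tab <- merge(sym2go, go2term, by.x = "GO", by.y = "GOID")
  print(flag_symbols(syms, tab, "transcription factor"))
}
```

Matching on term names is still a heuristic; choosing an explicit set of GO ids and matching on those would make the filter easier to reproduce.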