Entering edit mode
Duke
▴
210
@duke-4050
Last seen 10.4 years ago
On 2/28/11 2:31 PM, Vincent Carey wrote:
> On Mon, Feb 28, 2011 at 2:07 PM, Duke<duke.lists at="" gmx.com=""> wrote:
>> On 2/26/11 8:09 PM, Vincent Carey wrote:
>>> On Sat, Feb 26, 2011 at 6:40 PM, Duke<duke.lists at="" gmx.com="">
wrote:
>>>> Hi Vincent,
>>>>
>>>> On 2/26/11 2:38 PM, Vincent Carey wrote:
>>>>> On Sat, Feb 26, 2011 at 12:00 PM,<duke.lists at="" gmx.com="">
wrote:
>>>>>> Dear colleagues,
>>>>>>
>>>>>> I used to download GO database from geneoncology.org and did
some c++
>>>>>> coding to manipulate the data as I wished. Now I want to try my
luck
>>>>>> with R
>>>>>> - Bioconductor. I have heard of tons of tools supporting GO
such as
>>>>>> GO.db,
>>>>>> topGO, goseq, GOstats, biomart etc... and I have been reading
their
>>>>>> description and examples, but honestly I am overhelmed and dont
really
>>>>>> know
>>>>>> which package I should use to fulfill my task. So please advise
me how
>>>>>> I can
>>>>>> do the following two simple tasks:
>>>>>>
>>>>>> 1. I have a list of genes (with gene names from UCSC such as
Foxp3
>>>>>> etc...). How do I filter this list to get genes that have
certain GO
>>>>>> term
>>>>>> such as transcription factor?
>>>>> since you said it was a simple task, consider the simple
solution
>>>>> involving the "%annTo%" operator, which tells whether the
symbols on
>>>>> the left have been annotated to the term on the right:
>>>>>
>>>>>> c("FOXP3", "BRCA2") %annTo% "mammary"
>>>>> FOXP3 BRCA2
>>>>> FALSE TRUE
>>>>>> c("FOXP3", "BRCA2") %annTo% "transcription factor"
>>>>> FOXP3 BRCA2
>>>>> TRUE FALSE
>>>>>
>>>>> you could use the named logical vectors generated in this way to
>>>>> perform the filtering you describe. but see below.
>>>>>
>>>>>> 2. How do I know the capacity of the latest GO database on
>>>>>> bioconductor,
>>>>>> for example, how many genes available for mm9, and how many of
them
>>>>>> have GO
>>>>>> term transcription factor?
>>>>> The "GO database" concerns the gene ontology, a structure of
terms and
>>>>> relationships among them. The association of GO terms to gene
names
>>>>> for mouse is presented in various ways, but the most basic one
is in
>>>>> the org.Mm.eg.db package. With that, you could use
>>>>>
>>>>> org.Mm.eg()
>>>>>
>>>>> to find, among other statistics,
>>>>>
>>>>> org.Mm.egGO has 29984 mapped keys (of 63329 keys). Your
question
>>>>> concerning transcription factor mapping is not completely
precise, and
>>>>> you might want to survey the family of GO terms to come up with
a set
>>>>> of terms that meets your requirement.
>>>> You are right, I did not make it clear. What I wanted was exactly
what
>>>> you
>>>> said: I need to get a list of genes that have a certain set of
terms (for
>>>> example *transcription*) - which corresponds to a family of GO
terms, not
>>>> just one GO term.
>>>>
>>>>> Here's a
>>>>> demonstration of related queries:
>>>>>
>>>>>> get("GO:0003700", GOTERM)
>>>>> GOID: GO:0003700
>>>>> Term: sequence-specific DNA binding transcription factor
activity
>>>>> Ontology: MF
>>>>> Definition: Interacting selectively and non-covalently with a
specific
>>>>> DNA sequence in order to modulate transcription. The
transcription
>>>>> factor may or may not also interact selectively with a
protein or
>>>>> macromolecular complex.
>>>>> Synonym: GO:0000130
>>>>> Secondary: GO:0000130
>>>>>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>>>>>> length(tfg)
>>>>> [1] 940
>>>> This works fine in case of one GO term (GO: 0003700). Is there a
similar
>>>> function like getterm("transcription", revmap()) to get all the
genes
>>>> that
>>>> their GO terms contain *transcription*?
>>> As far as I know there is no such function. Solutions depend on
the
>>> type of wild-card you are looking for. If SQL LIKE operator is
>>> adequate
>>>
>>> termlikeToTags = function (liketok)
>>> {
>>> require(GO.db)
>>> gcon = GO_dbconn()
>>> as.character(dbGetQuery(gcon, paste("select go_id from
go_term
>>> where term like '%",
>>> liketok, "%'", sep = ""))[[1]])
>>> }
>>>
>>> if you want to use full regular expressions
>>>
>>> reToTags = function (retok, ...) {
>>> require(GO.db)
>>> allt = dbGetQuery(GO_dbconn(), "select go_id, term from
go_term")
>>> inds = grep(retok, as.character(allt[,"term"]), value=FALSE,
...)
>>> # will error if value set at call
>>> if (length(inds)>0) return(as.character(allt[inds,"go_id"]))
>>> stop("retok did not grep to any term")
>>> }
>>>
>>>> length(reToTags("transcription"))
>>> [1] 372
>>>> length(termlikeToTags("transcription"))
>>> [1] 372
>>>> length(reToTags("transcrip.*"))
>>> [1] 411
>>>> length(termlikeToTags("transcrip.*"))
>>> [1] 0
>>>
>> As far as I understand, these functions will list out all the GO
Terms
>> containing "transcription" for example. In order to search for all
of the
>> genes in the database that have these GO Terms, I suppose I will
have to
>> loop
>>
>> tfg = get("GO:0003700", revmap(org.Mm.egGO))
>> length(tfg)
>>
>> for each of the term and then sum them up? Since your %annTo%
function is to
>> check a certain gene list to see if any of them has the specified
term,
>> would it be better to just run that function to the available genes
in the
>> database? How do I list and use the gene list in the database?
> I'd suggest you read the vignettes at
> http://www.bioconductor.org/help/bioc-
views/release/bioc/html/AnnotationDbi.html,
> particularly http://www.bioconductor.org/packages/2.7/bioc/vignettes
/AnnotationDbi/inst/doc/AnnotationDbi.pdf
>
> The codes I provided are hints in the direction of solutions for the
> family of questions you pose.
>
>>>>> org.Mm.egGO is a mapping from mouse entrez gene ids to GO term
tags.
>>>>> revmap reverses this mapping and takes a tag to the set of genes
>>>>> mapped to the tag by entrez.
>>>>>
>>>>> Now, to return to the first question -- it isn't simple and a
lot of
>>>>> presuppositions have to be made explicit. One of the most
problematic
>>>>> is the commitment to use gene symbols. If you don't read the
docs
>>>>> about bioconductor annotation and R packages pertaining thereto,
it's
>>>>> hard to make progress.
>>>> Can you elaborate a little bit more why using gene symbols is
>>>> problematic?
>>>> And if it is really a problem, what gene ID I should use to avoid
that
>>>> problem?
>>> Entrez gene IDs are generally more specific. If you haven't run
into
>>> collisions and ambiguities with symbols by now, maybe you don't
have
>>> to worry about it.
>>>
>> Well, our reference database is UCSC refFlat downloaded from UCSC
genome
>> browser, and in that format there is only geneName (SYMBOL) and
name
>> (refSEQ). There is no Entrez Gene ID in that file. We actually have
a
>> problem of one same geneName (SYMBOL) has more than one isoforms
(more than
>> one refSEQ), but in fact we dont have a good way of differentiate
between
>> those (by mathematical methods). So we use either gene symbol or
refSeq ID.
>>
> most of our annotation packages have mappings that employ refseq
identifiers.
> the functions i provided can be readily altered to use refseq ids as
> keys, and if you
> introduce types for the identifier/term tokens, the functions can be
> implemented as methods
> that perform appropriately for inputs from different vocabularies.
> but you will have to
> learn how to use the mapping objects or the sqlite tables
> productively. Package GSEABase is
> also a useful resource for working with collections of gene
identifiers.
>
Thanks for both of your suggestion Vincent. I will definitely check
them
out. Your suggestions/hints do help me a lot :).
Bests,
D.