Hello,
A naive question (I am by no means an ace R user) concerning GOstats
and splice variants:
why do you rely on LocusLink to map GO terms when GOA takes splice
variants into account as well via, for example, RefSeq? Using the
GOstats tool to study Affymetrix u133a data, I noticed that the
hgu133aACCNUM mapping offers RefSeq mapping, if I understand correctly,
knowing that you are limited to the GenBank accession number that
Affymetrix attributes to a probe set.
Thanks for any help/comments
David
On Mar 30, 2005, at 4:03 AM, Rickman David wrote:
>
> Hello,
>
> A naive question (I am by no means an ace R user) concerning GOstats
> and splice variants:
>
> why do you rely on locuslink to map GO terms when GOA that take into
> account splice variants as well via, for example, RefSeq? Using the
> GOstats tool to study Affymetrix u133a data, I noticed that the
> hgu133aACCNUM mapping offers RefSeq mapping if I understand - knowing
> that you are limited to the genbank accession number attribution for a
> probe set offered by Affymetrix.
>
> Thanks for any help/comments
>
David,
I'm perhaps not the best person to answer this (Robert Gentleman and
his team are), but I think the annotation pipeline that is used for the
Bioconductor packages goes through LocusLink (Entrez Gene) in all
cases. Since the mapping is through LocusLink, there isn't a way to
get back to "transcript-level" detail.
Sean
Hi Sean,
What is indicated in the hgu133aACCNUM html for the hgu133a meta-data
package is: "For all the Affymetrix chips, the manufacturer/user
provided ids are GenBank accession numbers." So the starting material
for the pipeline here is GenBank acc #. It seems possible that with
this starting material one could potentially reduce the level of
ambiguity.
As an example -- take the affy ids 207039_at and 211156_at (GenBank
accession numbers NM_000077 and AF115544, respectively). They
correspond to LocusLink number 1029. This number corresponds to 3
transcripts encoding 3 proteins (p12, p14 and p16). GOA attributes the
same GO id 0016301 (kinase activity) to both p12 (NP_478104) and p14
(NP_478102) while attributing 8 GO ids to p16 (NP_000068) (none of
which is 0016301). Entrez Gene lists AF115544 as the source sequence
for NM_058197 (NP_478104). NM_000077 corresponds to the variant
NP_000068. The mapping by Dr. Gentleman et al yields the same 2 GO
terms for both probe sets (see example below). The LocusLink (GeneID)
# 1029 should yield
Of course, using the actual target sequence (which is given by affy) as
the starting material would better resolve variants and would also
permit proper flagging of problem probe sets (see Mecham et al.,
Physiol. Genomics 2004 and Harbig et al., NAR 2005) and ultimately map
probe sets to GOA. But as you indicated, maybe Dr. Gentleman (or maybe
Chenwei Lin) could shed some light on why it is better to pass from
probe set/accession number provided by affy to LocusLink to GO id to
study the potential enrichment of GO ids in an affy microarray
experiment.
###### EXAMPLE QUERY ####################
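> library(annotate)   # provides getOntology()
> library(hgu133a)    # provides the hgu133aGO environment
> # Note: as far as I understand, getOntology() returns the terms of one
> # ontology only ("MF" unless the ontology argument says otherwise), so
> # BP and CC terms would not be listed by this call.
> affyGO <- eapply(hgu133aGO, getOntology)
> affyGO$"211156_at"
[1] "GO:0004861" "GO:0016301"
> affyGO$"207039_at"
[1] "GO:0004861" "GO:0016301"
>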
Here we see that for both probe sets we have kinase activity
(GO:0016301) and cyclin-dependent protein kinase inhibitor activity
(GO:0004861), and not, for example, cell cycle arrest (GO:0007050) or
cell cycle checkpoint (GO:0000075), 2 of the TAS GO ids among the 8 GO
ids attributed by GOA to NP_000068.
A sampling from EBI_GOA_assoc_xrefs for LL 1029:
Supp RefSeq NP   LocusLink_Gene Symbol   GO id        DB:reference     Evidence
                 1029_CDKN2A             GO:0007049   PMID:7606716     NAS
                 1029_CDKN2A             GO:0008372   UniProt:Q16360   ND
NP_478102        1029_CDKN2A             GO:0016301   GOA:spkw         IEA
NP_478104        1029_CDKN2A             GO:0016301   GOA:spkw         IEA
NP_000068        1029_CDKN2A             GO:0007049   GOA:spkw         IEA
NP_000068        1029_CDKN2A             GO:0000075   PMID:7972006     TAS
NP_000068        1029_CDKN2A             GO:0045786   GOA:spkw         IEA
NP_000068        1029_CDKN2A             GO:0004861   PMID:7972006     TAS
NP_000068        1029_CDKN2A             GO:0007050   PMID:7972006     TAS
NP_000068        1029_CDKN2A             GO:0005634   UniProt:P42771   NR
NP_000068        1029_CDKN2A             GO:0000079   PMID:7972006     TAS
NP_000068        1029_CDKN2A             GO:0008285   PMID:7972006     TAS
David
################################
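The difference between a protein-level and a gene-level view of the GOA sampling above can be sketched in a few lines of base R. This is only an illustration: the data frame `goa` is typed in by hand from the LL 1029 rows that carry a RefSeq protein id (the two rows without one are left out), and the names `goa`, `by_protein`, `by_gene` and `p16_only` are made up for the example and belong to no Bioconductor package.

```r
# GOA rows for LocusLink 1029 (CDKN2A) that carry a RefSeq protein id,
# copied by hand from the sampling above.
goa <- data.frame(
  protein = c("NP_478102", "NP_478104", rep("NP_000068", 8)),
  go = c("GO:0016301", "GO:0016301",
         "GO:0007049", "GO:0000075", "GO:0045786", "GO:0004861",
         "GO:0007050", "GO:0005634", "GO:0000079", "GO:0008285"),
  stringsAsFactors = FALSE
)

# Protein-level view: one set of GO ids per RefSeq protein.
by_protein <- split(goa$go, goa$protein)

# Gene-level view: every GO id pooled under the single LocusLink id.
by_gene <- sort(unique(goa$go))

# GO ids that GOA lists for p16 (NP_000068) but for neither of the
# other two proteins; a gene-level mapping hands them to all probe sets.
p16_only <- setdiff(by_protein[["NP_000068"]],
                    unlist(by_protein[c("NP_478102", "NP_478104")]))

print(lengths(by_protein))
print(by_gene)
print(p16_only)
```

Pooling at the gene level keeps every term but loses the information about which protein product each term was assigned to, which is the ambiguity raised in this message.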
-----Original Message-----
From: Sean Davis [mailto:sdavis2@mail.nih.gov]
Sent: Wednesday, March 30, 2005 1:19 PM
To: Rickman David
Cc: bioconductor@stat.math.ethz.ch
Subject: Re: [BioC] GOstats question
On Mar 30, 2005, at 8:58 AM, Rickman David wrote:
> Hi Sean,
>
> What is indicated in the hgu133aACCNUM html for the hgu133a meta-data
> package is: "For all the Affymetrix chips, the manufacturer/user
> provided ids are GenBank accession numbers." So the starting material
> for the pipeline here is GenBank acc #. It seems possible that with
> this starting material one could potentially reduce the level of
> ambiguity.
>
> As an example -- take the affy ids 207039_at and 211156_at (NM_000077
> and AF115544, respective GenBank# ids). They correspond to locuslink
> number 1029. This number corresponds to 3 transcripts encoding 3
> proteins (p12, p14 and p16). GOA attributes same GO_ID 0016301
> (kinase activity) for both p12 (NP_478104) and p14 (NP_478102) while
> attributing 8 GO ids for p16 (NP_000068) (none of which are 0016301).
> Entrez Gene associates AF115544 as the source sequence for NM_058197
> (NP_478104). NM_000077 corresponds to the variant NP_000068. The
> mapping by Dr. Gentleman et al yields the same 2 GO terms for both
> probe sets (see example below). The locuslink (GeneID) # 1029 should
> yield
>
> Of course using the actual target sequence (which is given by affy) as
> the starting material would help better to resolve variants as well as
> permit a proper flagging of problem probe sets (see Mecham et al.
> Physiol.Genom 2004 and Harbig et al NAR 2005) and ultimately map probe
> sets to GOA. But as you indicated, maybe Dr. Gentleman (or maybe
> Chenwei Lin) could shed some light to why it is better to pass from
> probe set/accession number provided by affy to locuslink to GO id to
> study the potential enrichment of GO ids in an affy microarray
> experiment.
>
David,
I think Robert answered this indirectly today for another post. The
Bioconductor team maps based on ID matching in public databases. In
order to be general, I think the mapping from GenBank accession numbers
to LocusLink (Entrez Gene) is via UniGene. A GenBank accession number
is looked up in the UniGene database. If found, the associated
LocusLink id(s) are assigned to that probe. Then, the information
contained in LocusLink (GO, KEGG, etc.) is used to provide further
annotation. While for individual sequences (RefSeqs, in particular)
it is possible to determine the Gene ID or RefSeq directly, this is not
in general possible for GenBank accession numbers without going through
UniGene (and even this isn't 100% foolproof). Note that going through
UniGene precludes any attempt to work at the transcript (or protein)
level.
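The lookup chain described above can be mimicked with a few small named lookups in base R. This is a toy sketch, not the AnnBuilder code: the tables are hand-made, the UniGene cluster id `Hs.EXAMPLE` is a placeholder invented for the illustration, and the function name `annotate_probe` is hypothetical. Only the probe ids, accession numbers, LocusLink id and the two GO ids are taken from this thread.

```r
# Probe set -> GenBank accession (as supplied by the manufacturer).
probe2acc <- c("207039_at" = "NM_000077", "211156_at" = "AF115544")

# Accession -> UniGene cluster. "Hs.EXAMPLE" is a made-up placeholder id.
acc2unigene <- c("NM_000077" = "Hs.EXAMPLE", "AF115544" = "Hs.EXAMPLE")

# UniGene cluster -> LocusLink (Entrez Gene) id.
unigene2ll <- c("Hs.EXAMPLE" = "1029")

# LocusLink id -> GO ids; annotation is attached at the gene level.
ll2go <- list("1029" = c("GO:0004861", "GO:0016301"))

# Follow the chain for one probe set. This toy version simply stops
# with an error if an id is missing from one of the tables.
annotate_probe <- function(probe) {
  acc <- probe2acc[[probe]]
  cluster <- acc2unigene[[acc]]
  ll <- unigene2ll[[cluster]]
  list(accession = acc, locuslink = ll, go = ll2go[[ll]])
}

res <- lapply(names(probe2acc), annotate_probe)
names(res) <- names(probe2acc)

# Two different accessions end up with identical gene-level annotation.
print(identical(res[["207039_at"]]$go, res[["211156_at"]]$go))
```

Because the accession is discarded once the LocusLink id has been found, nothing downstream can tell the two transcripts apart.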
While there are other methods for annotating probesets (see the
articles you cite above), they all require aligning target or probe
sequences (also available from Affy) to known entities (like RefSeq,
etc.). That is NOT what the Bioconductor team attempts to do, and it is
a HUGE task to do well (I have done this process for some long oligo
arrays). You could do this yourself, if necessary. Also, you could
look at Ensembl, which does its own annotation of Affymetrix arrays.
The downside of doing these things yourself (or not using the
annotation packages provided by Bioconductor) is that you then need to
either modify the nice functions from the Bioconductor project to use
your own data or make your data conform to the structures needed for
the functions to work (which, as you point out, in this case will not
suffice).
Hope this helps.
Sean
"David,
I think Robert answered this indirectly today for another post. The
BioConductor team maps based on ID matching in public databases. "
I am new to the list and didn't see his posting --
"In
order to be general, I think the mapping from genbank accession numbers
to locuslink (Entrez Gene) is via Unigene. A GenBank accession number
is looked up in the Unigene database. If found, the associated
locuslink(s) are assigned to that probe. Then, the information
contained in locuslink (GO, KEGG, etc) is used to provide further
annotation. "
Even if the design (or the aim of the Bioconductor team) is limited to a
"general approach" which precludes working at the level of protein
product (or transcript) -- which is the basis of the GO annotation and
usually the goal of any test of GO category enrichment for a microarray
result -- then for a given LL # we should have all available GO terms
attributed, right? The example I gave showed that for at least two probe
sets (sharing the same LL #) this is not the case -- we have only 2 GO
terms to work with versus 12 (again using GOA as the reference) for a
well-characterized gene.
"While there are other methods for annotating probesets (see the
articles you cite above), they all require aligning target or probe
sequences (also available from Affy) to known entities (like refseq,
etc.) and is NOT what the BioConductor team attempts to do (and is a
HUGE task to do well, having done this process for some long oligo
arrays). You could do this yourself, if necessary. Also, you could
look at Ensembl which does their own annotation of Affymetrix arrays.
The downside of doing these things yourself (or not using the
annotation packages provided by bioconductor) is that you then need to
either modify the nice functions from the bioconductor project to use
your own data or you need to make your data conform to the structures
needed for the functions to work (which as you point out, in this case,
will not suffice)."
It looks like that is what it takes to get to the core of the problem.
One of my aims (I am sure like many using Affy data) is to
summarize/study lists of probe sets derived from some test at the level
of GO terms. Therefore it is almost intuitive that the key to that aim
is to resolve both the multiplicity issue (many probe sets to one
protein product, somewhat addressed in the GOstats package at the level
of LocusLink) and the splice variant issue -- otherwise, it seems that
analyses will always stay at a "general" level.
Thanks for the suggestions and the comments
David
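The multiplicity issue mentioned above (several probe sets for one gene) can be illustrated with base R: before GO categories are tested, the selected probe sets are reduced to unique gene ids so that a gene represented by several probe sets is counted once. This is a hand-made toy, not the GOstats implementation. The first two probe-to-LocusLink pairs come from this thread; the probe id `hypothetical_at` and its id `0000` are invented for the example.

```r
# Probe set -> LocusLink id. The first two pairs come from this thread;
# "hypothetical_at" and its id "0000" are invented for the illustration.
probe2ll <- c("207039_at" = "1029",
              "211156_at" = "1029",
              "hypothetical_at" = "0000")

# A list of "interesting" probe sets from some test.
selected <- c("207039_at", "211156_at", "hypothetical_at")

# Collapse to one entry per gene so that LocusLink 1029 is not counted twice.
selected_genes <- unique(unname(probe2ll[selected]))

# How many selected probe sets stand behind each gene.
probes_per_gene <- table(probe2ll[selected])

print(selected_genes)
print(probes_per_gene)
```

Collapsing at the LocusLink level handles the many-to-one problem only; it does nothing for the splice variant issue, since both probe sets still receive the same gene-level GO ids.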
On Mar 30, 2005, at 10:19 AM, Rickman David wrote:
>
> I am new to the list and didn't see his posting --
>
I just meant that you could probably glean some detail from his note
that I may have left out. I am always deleting stuff that doesn't
interest me at the moment, so I just meant to point out that the
subject has come up....
> Even if the design (or the aim of the Bioconductor team) is limited to
> a "general approach" which precludes working at the level of protein
> product (or transcript) -- which is the basis of the GO annotation and
> usually the goal of any test of GO category enrichment for a microarray
> result -- then for a given LL # we should have all available GO terms
> attributed, right? The example I gave showed that for at least two
> probe sets (sharing the same LL #) this is not the case -- we have only
> 2 GO terms to work with versus 12 (again using the same reference GOA
> as a reference) for a well characterized gene.
> It looks like that is what it takes to get to core of the problem --
> One of my aims (I am sure like many using Affy data) is to
> summarize/study lists of probe sets derived from some test at the level
> of GO terms. Therefore it is almost intuitive that key to that aim is
> to resolve both the multiplicity issues (many probe sets to one protein
> product, somewhat addressed in the GOstats package -- at the level of
> LocusLink) as well as the splice variant issues -- otherwise, it seems
> that analyses will always stay at a "general" level.
>
Just out of curiosity, I pulled down the most recent hgu133a annotation
package. I think your GO terms are there, so perhaps you have an older
hgu133a package?
> library(reposTools)
Loading required package: tools
> install.packages2('hgu133a',lib='/Users/sdavis/Library/R/library')
> library(annotate)
> library(hgu133a)
> names(get('207039_at',hgu133aGO))
[1] "GO:0007049" "GO:0007049" "GO:0007050" "GO:0000075" "GO:0004861"
[6] "GO:0016301" "GO:0045786" "GO:0008285" "GO:0005634" "GO:0000079"
> names(get('211156_at',hgu133aGO))
[1] "GO:0007049" "GO:0007049" "GO:0007050" "GO:0000075" "GO:0004861"
[6] "GO:0016301" "GO:0045786" "GO:0008285" "GO:0005634" "GO:0000079"
>
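The two query results in this thread can be compared directly with base R set operations. The vectors below are typed in by hand from the console output shown in the thread; the variable names are made up for the example. The repeated GO:0007049 entry is kept as printed; it may correspond to two evidence codes for the same term, but that is an assumption.

```r
# GO ids reported for 207039_at by the earlier getOntology() query.
first_query <- c("GO:0004861", "GO:0016301")

# names() of the hgu133aGO entry for the same probe set, as printed above.
second_query <- c("GO:0007049", "GO:0007049", "GO:0007050", "GO:0000075",
                  "GO:0004861", "GO:0016301", "GO:0045786", "GO:0008285",
                  "GO:0005634", "GO:0000079")

# The same GO id can appear more than once (here GO:0007049), so
# de-duplicate before counting.
second_unique <- unique(second_query)

# GO ids present in the second result but absent from the first.
missing_from_first <- setdiff(second_unique, first_query)

print(length(second_unique))
print(missing_from_first)
print(all(first_query %in% second_unique))
```

The first result is a strict subset of the second, so the two queries disagree only in how many of the gene's terms they return, not in which terms belong to it.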
Hi,
Finding fault with any annotation that is widely available is pretty
trivial, and I personally think that it is not a useful exercise. We
have chosen a particular method of building annotation that is well
documented in publications and, perhaps more importantly, we have
published the code, so that you may use it as you see fit and so that
you may use it to understand the process that we have used.
So the short answer to David's question is: because that link
(LocusLink) provides us with a mechanism to unambiguously link a
variety of data sources (or rather to make use of links that have been
made by others). The other choice is UniGene, and one could certainly
build a UniGene-based annotation system. Which is better depends on
your perspective. And it would not take much tweaking to get AnnBuilder
to do that, if that is what you want. Please note, our goal was not and
is not to produce some elaborate annotation system that satisfies all
comers, but rather 1) to produce software from which you can build your
own annotation for your own purposes and have that work well with the
Bioconductor packages and 2) to produce generic annotation that is
broadly useful to the whole community (note also that we get many
complaints already about how big and slow this is - and we have tried
to remedy that issue).
We are open to concrete suggestions for improvements from those who
are knowledgeable about particular data sources. We are more open to
patches and code contributions that are demonstrated to work widely and
to be of wide practical interest (not just on your favorite species or
annotation resource).
If there is substantial interest in implementing some of the recent
suggestions, we are happy to help coordinate efforts to make
improvements that are of use to the entire community. We have always
accepted patches and well thought-out contributions, and will continue
to do so. We also continue to update our methodology and to make use of
more accurate information as it becomes available.
Best wishes,
Robert
On Mar 30, 2005, at 7:37 AM, Sean Davis wrote:
+---------------------------------------------------------------------------+
| Robert Gentleman                         phone:  (206) 667-7700           |
| Head, Program in Computational Biology   fax:    (206) 667-1319           |
| Division of Public Health Sciences       office: M2-B865                  |
| Fred Hutchinson Cancer Research Center                                    |
| email: rgentlem@fhcrc.org                                                 |
+---------------------------------------------------------------------------+
>Even if the design (or the aim of the Bioconductor team) is limited to a
>"general approach" which precludes working at the level of protein
>product (or transcript) -- which is the basis of the GO annotation and
>usually the goal of any test of GO category enrichment for a microarray
>result -- then for a given LL # we should have all available GO terms
>attributed, right? The example I gave showed that for at least two probe
>sets (sharing the same LL #) this is not the case -- we have only 2 GO
>terms to work with versus 12 (again using the same reference GOA as a
>reference) for a well characterized gene.
The data packages were built a few months ago and will certainly not
have 100% coverage now. You can always build your own data packages if
you want to have updated annotation.
>_______________________________________________
>Bioconductor mailing list
>Bioconductor@stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor
Jianhua Zhang
Department of Medical Oncology
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084