A "simple" question:
I built an annotation package using UniGene (UG) IDs as the key.
Now I want to get the GenBank (GB) annotation for a list of genes.
How can I do this? I used mget with the nameACCNUM environment, but I
got the UG IDs back.
Thanks
--
Mayte Suarez Farinas
The Rockefeller University
1230 York Avenue, Box 212
New York, NY 10021
phone: 1-212-327-8186
fax: 1-212-327-7422
Mayte,
If you want to combine data sets, I would suggest using Unigene, as it
naturally takes GenBank accessions and groups them into clusters based
(in a somewhat algorithmic way) on alignment to the genome. If one looks
at a Unigene record from Hs.data, it has some specific information about
each cluster (LocusLink ID, name, etc.), but is otherwise a long list of
sequence accession numbers and clone IDs. AnnBuilder uses these lists to
assign GenBank accession numbers or IMAGE clones to Unigene clusters.
With "old" microarray designs, some (and perhaps many) of the GenBank
sequences may not be mapped to a Unigene cluster at all, in which case
you "lose" that gene. Also, some arrays were designed in the past, with
each feature assigned a Unigene cluster at that time. Unfortunately,
these Unigene IDs may or may not be comparable to Unigene IDs from the
current build, so it is probably worthwhile re-annotating them based on
IMAGE clone or GenBank accession (which serves to "update" the array
features to the "newest" annotation).
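
To make the Hs.data lookup concrete, here is a minimal sketch in Python
(not R, and not AnnBuilder's actual code). It assumes a record layout
with one "ID" line per cluster, "SEQUENCE" lines carrying ACC= and
CLONE= fields, and "//" closing each record. The sample text, the
accessions, the cluster IDs, and the function name parse_clusters are
all made up for illustration.

```python
# Toy text imitating the assumed layout of a UniGene Hs.data file.
# All accessions, clone IDs, and cluster IDs here are invented.
SAMPLE = """\
ID          Hs.1001
TITLE       hypothetical gene A
SEQUENCE    ACC=AA000001.1; NID=g1; CLONE=IMAGE:111; END=5'
SEQUENCE    ACC=AA000002.1; NID=g2; CLONE=IMAGE:112; END=3'
//
ID          Hs.1002
TITLE       hypothetical gene B
SEQUENCE    ACC=AA000003.2; NID=g3
//
"""


def parse_clusters(text):
    """Return (accession -> cluster, clone -> cluster) lookups."""
    acc_to_cluster = {}
    clone_to_cluster = {}
    cluster = None
    for line in text.splitlines():
        if line.startswith("ID"):
            # Start of a new cluster record.
            cluster = line.split()[1]
        elif line.startswith("SEQUENCE") and cluster is not None:
            # Fields are separated by ';' and look like KEY=VALUE.
            for field in line[len("SEQUENCE"):].split(";"):
                key, _, value = field.strip().partition("=")
                if key == "ACC":
                    # Drop the version suffix (".1") from the accession.
                    acc_to_cluster[value.split(".")[0]] = cluster
                elif key == "CLONE":
                    clone_to_cluster[value] = cluster
        elif line.startswith("//"):
            # End of the record.
            cluster = None
    return acc_to_cluster, clone_to_cluster


acc_to_cluster, clone_to_cluster = parse_clusters(SAMPLE)

# Re-annotate array features by accession; unmapped features come back
# as None, which is the "lost" gene case described above.
features = ["AA000002", "ZZ999999"]
annotation = {acc: acc_to_cluster.get(acc) for acc in features}
print(annotation)
```

The same lookup keyed on clone ID (clone_to_cluster) would cover arrays
annotated by IMAGE clone rather than by accession.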

All this is a long-winded way of saying that your best bet is probably
to map all of the arrays to Unigene and then match each to the others
using common Unigene IDs. This is not perfect, but there is currently no
perfect solution to this problem. If any of your platforms are composed
of oligos rather than cDNA, the problem could be much more involved, but
it can be approached in a similar manner. One can use AnnBuilder to
re-annotate each platform, or it could be done outside R using Perl and
files downloaded from Unigene itself. (I would suggest the former.)
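
As a rough illustration of matching platforms through common Unigene
IDs, here is a small Python sketch. The lookup table, the two platform
lists, and the helper to_clusters are hypothetical toy data and names,
not output from AnnBuilder or from any real array.

```python
# Hypothetical accession -> UniGene cluster lookup (invented values).
ACC_TO_UG = {
    "AA000001": "Hs.1001",
    "AA000002": "Hs.1001",
    "AA000003": "Hs.1002",
    "AA000004": "Hs.1002",
    "AA000005": "Hs.1003",
}

# GenBank accession attached to each feature on two toy platforms.
platform_s1 = ["AA000001", "AA000003", "AA000005", "ZZ999999"]
platform_s2 = ["AA000002", "AA000004", "AA000005"]


def to_clusters(accessions, lookup):
    """Group accessions by UniGene cluster, dropping unmapped ones."""
    clusters = {}
    for acc in accessions:
        ug = lookup.get(acc)
        if ug is not None:
            clusters.setdefault(ug, []).append(acc)
    return clusters


s1 = to_clusters(platform_s1, ACC_TO_UG)
s2 = to_clusters(platform_s2, ACC_TO_UG)

# Intersect directly on accession, then on cluster ID.
shared_gb = sorted(set(platform_s1) & set(platform_s2))
shared_ug = sorted(set(s1) & set(s2))

print(shared_gb)  # ['AA000005']
print(shared_ug)  # ['Hs.1001', 'Hs.1002', 'Hs.1003']
```

In this toy case the two platforms share one accession but three
clusters, because different accessions from the same cluster were
printed on each array.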

As for why this can't be done with GB IDs: there are MILLIONS of
possible GenBank IDs, some of which represent truly IDENTICAL sequences.
However, there is no way to tell how similar two GB IDs should be (in
terms of expression behavior) without further processing them, which is
exactly what Unigene is designed to do.
Hope this helps.
Sean
----- Original Message -----
From: "Mayte Suarez-Farinas" <mayte@babel.rockefeller.edu>
To: "Sean Davis" <sdavis2@mail.nih.gov>
Sent: Monday, June 07, 2004 6:56 PM
Subject: Re: [BioC] GB from a package built with ABPkgBuilder
> On Mon, 7 Jun 2004, Sean Davis wrote:
>
> Sean,
> Thank you for your answer. I will explain why I need this, because
> I need some advice.
>
> I was trying to use a multi-study approach as in the MergeMaid
> software (Parmigiani's recent paper and software). Let's say that I
> have 3 studies. One of them is from the SMD database (let's call it
> S1). The SMD data come with annotation included, so I used that
> annotation. I initially wanted to work with GB as the key for the 3
> studies in order to have more genes (at some point the software takes
> the mean of the measurements with the same ID; that can be avoided,
> but I first tried to use GB).
> The GB ID intersection for studies S2 and S3 is OK, a reasonable
> number, but when I intersected S1 with either S2 or S3 (using GB),
> the intersection was fewer than 500 genes (out of 43000 spots in S1).
> However, if I do the intersection with UG, I obtain, say, 9000. It
> seems as if, for the same UG, S1 and S2 (or S3) have completely
> different GB IDs!
> Is there some reason for that? (I am not very familiar with GB and
> annotation details.)
>
> Then I decided to create an annotation package for S1 using
> ABPkgBuilder, as I usually do. I did it using UG as the key and also
> using the IMAGE ID. With the IMAGE ID I got very poor annotation,
> so I only have the annotation with UG.
>
> Any suggestions, comments, or advice? I would really appreciate it.
>
> P.S. Everything I did used the latest version of everything, with
> annotations updated every week.
>
> > Mayte,
> >
> > Unfortunately, Unigene is a method for "collapsing" perhaps hundreds
> > of GenBank accession numbers into a single "Unigene cluster". As
> > such, a Unigene may represent hundreds of GenBank sequences. It is
> > not very meaningful to get a GenBank sequence from this. There are
> > other options. First, you can use the RefSeq sequence for those that
> > have one (this probably makes the most sense, but you would have to
> > think about what your use will be). Second, you could go outside R
> > and collect the "best" Unigene sequence from NCBI (they maintain a
> > file that maps each UG ID to the GenBank accession of the "best"
> > GenBank entry representing the Unigene). Third, you could use the
> > Hs.data file to get ALL the GenBank accessions associated with the
> > UG. (The ACCNUM environment usually contains only those listed in
> > LocusLink for the gene, if I'm not mistaken.) There are many other
> > options.
> > Why do you need them, if I might ask?
> >
> > I'm not sure why you are getting the UG back again when querying
> > the ACCNUM environment.
> >
> > Sean
> >
> > ----- Original Message -----
> > From: "Mayte Suarez-Farinas" <mayte@babel.rockefeller.edu>
> > To: <bioconductor@stat.math.ethz.ch>
> > Sent: Monday, June 07, 2004 6:13 PM
> > Subject: [BioC] GB from a package built with ABPkgBuilder
> >
> >
> > >
> > > A "simple" question:
> > >
> > > I built an annotation package using UniGene (UG) IDs as the key.
> > > Now I want to get the GenBank (GB) annotation for a list of genes.
> > > How can I do this? I used mget with the nameACCNUM environment,
> > > but I got the UG IDs back.
> > >
> > > Thanks
> > >
> >
> >
>