Nianhua Li
Dear list,
Two months ago I posted some questions about the data sources used by
athPkgBuilder on the bioc-devel list, but got no reply. I would like to try
one more time here. Sorry for the double posting.
----------------------------------------------------------------
I took a close look at the athPkgBuilder function in AnnBuilder (the builder
of ath1121501 and ag) and have some questions about the data sources being
used:
1. probeset id to gene mapping:
The current mapping strategy is:
1) map probe id to "Representative.Public.ID" using the Affymetrix GeneChip
   annotation data
2) use "Representative.Public.ID" as if it were an AGI locus id to get other
   annotations (pathway, go, etc) from TAIR
It seems that "Representative.Public.ID" is a mix of AGI locus ids, UniGene
clusters and a small number of ids from other sources. In the Affymetrix
annotation file there is another column called "Transcript ID (Array Design)",
which has almost the same values as "Representative.Public.ID". I suspect it
originated from ftp://ftp.tigr.org/pub/data/a_thaliana/Affymetrix/. I am not
sure whether Affymetrix updates those two columns on a regular basis.
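To check how mixed the column really is, a small Python sketch like the one below could count how many values look like AGI locus ids. It assumes the usual AGI pattern ("AT", a chromosome code 1-5, C or M, then "G" and five digits); the example values are made up for illustration and are not taken from the annotation file.

```python
import re
from collections import Counter

# AGI locus ids look like AT1G01010: "AT", chromosome code (1-5, C, M), "G", five digits.
AGI_LOCUS = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def classify_public_id(public_id):
    """Return 'agi_locus' if the value looks like an AGI locus id, else 'other'."""
    return "agi_locus" if AGI_LOCUS.match(public_id.strip()) else "other"


def summarize_public_ids(public_ids):
    """Count how many values fall into each class."""
    return Counter(classify_public_id(value) for value in public_ids)


# Hypothetical column values, for illustration only.
example_ids = ["AT1G01010", "At3g12345", "ATCG00020", "X12345", "At.12345"]
print(summarize_public_ids(example_ids))
```

Anything counted as "other" would need a different route to TAIR annotations.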
But if all the annotations (chromosome, go, pathway) come from TAIR, maybe we
should use TAIR's mapping of probeset id to AGI locus id:
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ :
"The oligonucleotide sequences of the probes were mapped to the Arabidopsis
Transcripts dataset from the Arabidopsis genome TAIR6 version (released
November 11, 2005). The dataset included mitochondria and chloroplast genes,
as well as pseudogenes and non-coding RNAs. The mapping to the TAIR6
Transcripts was performed using the BLASTN program with e-value cutoff
< 9.9e-6. For the 25-mer oligo probes used on the Affy chips, the required
match length to achieve this e-value is 23 or more identical nucleotides. To
assign a probe set to a given locus, at least 9 of the probes included in the
probe set were required to match a transcript at that locus."
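The assignment rule quoted above (at least 9 matching probes per probe set and locus) can be sketched in Python as follows. This is only an illustration of the rule as stated in the README, not TAIR's actual code: the input format, the function name and the choice to keep every locus that reaches the threshold are my assumptions.

```python
from collections import defaultdict

MIN_MATCHING_PROBES = 9  # threshold quoted in the TAIR README


def assign_probesets(probe_hits, min_probes=MIN_MATCHING_PROBES):
    """Assign probe sets to loci.

    probe_hits: iterable of (probeset_id, probe_id, locus_id) tuples, one per
    BLASTN hit that already passed the e-value cutoff (assumed input format).
    Returns {probeset_id: sorted list of loci hit by >= min_probes distinct probes}.
    """
    probes_per_pair = defaultdict(set)
    for probeset_id, probe_id, locus_id in probe_hits:
        probes_per_pair[(probeset_id, locus_id)].add(probe_id)

    assigned = defaultdict(list)
    for (probeset_id, locus_id), probes in probes_per_pair.items():
        if len(probes) >= min_probes:
            assigned[probeset_id].append(locus_id)
    return {probeset_id: sorted(loci) for probeset_id, loci in assigned.items()}


# Made-up hits: ps1 has 11 probes on one locus and 5 on another; ps2 has only 8.
hits = (
    [("ps1", i, "AT1G01010") for i in range(11)]
    + [("ps1", i, "AT1G01020") for i in range(5)]
    + [("ps2", i, "AT2G01008") for i in range(8)]
)
print(assign_probesets(hits))
```

Probe sets that never reach the threshold (ps2 here) simply get no locus, which is the coverage gap discussed next.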
Not all probeset ids have matching AGI locus ids. Do we need to provide
mappings to other gene identifiers, such as GenBank accession numbers or
Entrez Gene IDs, to make the annotations more complete? Affymetrix has started
to provide probeset id to Entrez Gene ID mappings in their annotation files.
Should we include that information? Also, I can see three possible ways to get
the probe-to-GenBank mapping: 1) from the Affymetrix annotation file directly,
2) probe to AGI locus and then AGI locus to GenBank accession, all from TAIR,
3) probe to Entrez Gene from Affymetrix, and then Entrez Gene to GenBank from
NCBI. Which way is best? Or should we use the "voting" algorithm used by
ABPkgBuilder?
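To make the "voting" option concrete, here is a minimal Python sketch of a plain majority vote over the three routes. I have not checked how ABPkgBuilder actually votes, so the tie handling (leave the probeset unmapped) and all names and ids below are assumptions for illustration.

```python
from collections import Counter


def vote(candidates):
    """Return the id proposed by the most sources.

    candidates: list of ids, with None where a source has no mapping.
    Returns None when no source has a mapping or the top ids are tied.
    """
    counts = Counter(c for c in candidates if c is not None)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave unresolved rather than guess
    return ranked[0][0]


def combine_sources(probesets, *sources):
    """Vote per probeset over any number of {probeset: id} dictionaries."""
    return {ps: vote([src.get(ps) for src in sources]) for ps in probesets}


# Hypothetical mappings from the three routes described above.
from_affy = {"ps1": "A1", "ps2": "B1"}
from_tair = {"ps1": "A1", "ps2": "B2"}
from_ncbi = {"ps1": "A2"}
print(combine_sources(["ps1", "ps2", "ps3"], from_affy, from_tair, from_ncbi))
```

With only three sources, ties and single-source mappings will be common, so the tie rule matters more than it might seem.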
2. chromosome location
The current package gets chromosome locations from
ftp://ftp.arabidopsis.org/home/tair/Genes/est_mapping/est.Assignment.Locus
Even though the file itself seems to be updated very often, the directory it
is located in and the README file are not, so it is not clear to me how it is
generated and updated. Any hints on that? Would
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ be a better
source? The meaning of chromosome location in those two sources may be
different, though. The former is the location of a GenBank EST, and the
latter is the "chromosome coordinates of the best probe set match to the
Transcripts dataset".
3. gene description (ath1121501GENENAME)
The current package (1.12.1) gets the description from
ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR_sequenced_genes
The descriptions are the same as those in
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/
Both give the description of the AGI locus corresponding to an Affymetrix
probeset.
In the Affymetrix annotation file, there is a column called "Target
Description". It is the description of the gene that a probeset targets. All
probesets have descriptions there, so we would get better coverage than by
getting descriptions from TAIR. When the "Representative Public ID" (or
"Transcript ID") is an AGI locus id, the description seems to have been
retrieved from TAIR. However, it is not clear how this information is
updated, or whether it is synchronized with TAIR's updates. Another possible
source of descriptions is Entrez Gene, since Affymetrix maps probesets to
Entrez Gene.
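One way to trade coverage against freshness would be to prefer the TAIR description and fall back to the Affymetrix "Target Description" only when TAIR has none, while recording which source was used. This is just a possibility I am floating, sketched in Python with made-up data; the function name and dictionaries are hypothetical.

```python
def pick_description(probeset, tair_desc, affy_desc):
    """Return (description, source), preferring TAIR over Affymetrix.

    tair_desc and affy_desc are {probeset: description} dictionaries.
    Returns (None, None) when neither source has a non-empty description.
    """
    for source, table in (("TAIR", tair_desc), ("Affymetrix", affy_desc)):
        text = table.get(probeset)
        if text:
            return text, source
    return None, None


# Made-up descriptions, for illustration only.
tair = {"ps1": "description from TAIR"}
affy = {"ps1": "description from Affymetrix", "ps2": "description from Affymetrix"}
for ps in ("ps1", "ps2", "ps3"):
    print(ps, pick_description(ps, tair, affy))
```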
4. pathway
Pathway information is currently obtained from AraCyc, a pathway tool in TAIR:
http://www.arabidopsis.org/tools/aracyc/introduction.jsp . I think it only
contains metabolic pathways (I may be wrong, as I only read the introduction).
KEGG contains regulatory pathways as well, and it is also manually curated.
Those two sources are independent of each other. Shall we include both of
them?
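If both were included, the simplest arrangement I can think of is to keep the two sources side by side per locus instead of merging pathway names, since the names will not match across databases. A Python sketch under that assumption (the input dictionaries and pathway names are made up):

```python
def merge_pathways(aracyc, kegg):
    """Combine two {locus: iterable of pathway names} mappings.

    Returns {locus: {"AraCyc": [...], "KEGG": [...]}} covering every locus
    present in either source; pathway lists are de-duplicated and sorted.
    """
    merged = {}
    for locus in sorted(set(aracyc) | set(kegg)):
        merged[locus] = {
            "AraCyc": sorted(set(aracyc.get(locus, ()))),
            "KEGG": sorted(set(kegg.get(locus, ()))),
        }
    return merged


# Hypothetical inputs, for illustration only.
aracyc_map = {"AT1G01010": ["pathway A"], "AT1G01020": ["pathway B", "pathway B"]}
kegg_map = {"AT1G01010": ["pathway K"], "AT1G01030": ["pathway L"]}
print(merge_pathways(aracyc_map, kegg_map))
```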
5. pubmed
Probeset to pubmed mapping is currently obtained from
ftp://ftp.arabidopsis.org/home/tair/Ontologies/Plant_Ontology/stru-060309.txt .
The pubmed ids represent the publications that TAIR used to map an AGI locus
id to a concept in Plant Ontology. But I think an environment like
ath1121501PUBMED should represent the publications about the gene matching a
probeset. I didn't find an AGI locus to pubmed mapping in TAIR, so we have to
get it from either Entrez Gene ids or GenBank accessions. This gets back to
the first question: what is the best way to map probesets to GenBank/Entrez
Gene?
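Whichever intermediate id is chosen, the pubmed mapping then becomes a two-step lookup (probeset to gene id, gene id to pubmed ids). A minimal Python sketch of chaining two such mappings, with made-up ids and hypothetical names:

```python
def compose(first, second):
    """Chain two one-to-many mappings.

    first:  {key: iterable of intermediate ids}
    second: {intermediate id: iterable of values}
    Returns {key: sorted unique values}; keys with no values are dropped.
    """
    result = {}
    for key, intermediates in first.items():
        values = set()
        for mid in intermediates:
            values.update(second.get(mid, ()))
        if values:
            result[key] = sorted(values)
    return result


# Hypothetical ids, for illustration only.
probeset_to_gene = {"ps1": ["g1"], "ps2": ["g2", "g3"], "ps3": ["g9"]}
gene_to_pubmed = {"g1": ["100", "101"], "g2": ["101"], "g3": ["102"]}
print(compose(probeset_to_gene, gene_to_pubmed))
```

Probesets whose gene has no publications (ps3 here) drop out, so coverage depends on both steps.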
I hope this email is not too long. Any feedback will be highly appreciated.
If we decide to use a better data source, I will be happy to do the
implementation.
many thanks
Nianhua Li
computational biology, public health, FHCRC