Simon Lin
In the following two unrelated messages, both Sean and Nianhua suggested
downloading and parsing data tables from NCBI. The gene_info table and
several others seem very useful. If that is the case, why not pre-load
them into SQLite and distribute the database as part of the annotation
package for human?

Simon

=================
Date: Tue, 12 Jun 2007 05:59:55 -0400
From: Sean Davis <sdavis2 at mail.nih.gov>
Subject: Re: [BioC] from RefSeq GI protein identifiers to GO terms
To: Lina Hultin-Rosenberg <lina.hultin-rosenberg at ki.se>
Cc: bioconductor at stat.math.ethz.ch
Message-ID: <466E6E9B.3020609 at mail.nih.gov>
Content-Type: text/plain; charset=ISO-8859-1

Lina Hultin-Rosenberg wrote:
>> Dear list,
>>
>> This might be a question that has been discussed previously, but I could
>> not find any good solution for it. I have lists of human proteins from
>> various proteomics studies that I want to compare with regard to the GO
>> terms associated with them. I have the RefSeq GI protein ID for each
>> protein, and my question is how best to map those to other identifiers
>> that I can use in subsequent GO analysis.
>>
>> It might be that this problem is best solved outside R, but maybe someone
>> can still give me a hint toward the best solution. For me this is a
>> problem that comes up quite often - the need to map between different
>> identifiers - and I have not yet found any really good solution to it. If
>> I use IPI, for example, I always lose some proteins/genes since the
>> coverage is rather bad, but maybe there is no solution that will give
>> perfect mapping?!
>
>
The file located here:
ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz
and described in detail here:
ftp://ftp.ncbi.nih.gov/gene/DATA/README
maps RefSeq identifiers to Entrez Gene IDs. Once you have the Entrez Gene
ID, you can use the Bioconductor annotation packages to get GO mappings.
The file above is a tab-delimited text file, so you should be able to read
it into R and do the matching by GI number rather easily.
Hope that helps.
Sean
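
The matching Sean describes can be sketched outside R as well. The Python
below is only an illustration, not Sean's code: the function name
map_gi_to_gene_id is hypothetical, the sample rows are made up, and the
column positions (GeneID in column 2, protein GI in column 7) are an
assumption that should be checked against the README before use.

```python
def map_gi_to_gene_id(lines, protein_gis, tax_id="9606", gene_col=1, gi_col=6):
    """Return {protein GI: set of Entrez Gene IDs} for the requested GIs.

    gene_col and gi_col are 0-based column positions; the defaults are an
    assumption about the gene2refseq layout, not taken from the thread.
    """
    wanted = {str(gi) for gi in protein_gis}
    mapping = {}
    for line in lines:
        if line.startswith("#"):  # skip the header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(gene_col, gi_col):  # skip short or blank rows
            continue
        if fields[0] != tax_id:  # keep only the requested organism
            continue
        gi = fields[gi_col]
        if gi in wanted:
            mapping.setdefault(gi, set()).add(fields[gene_col])
    return mapping


# Made-up rows in the assumed layout; these are not real NCBI records.
SAMPLE_GENE2REFSEQ = [
    "#tax_id\tGeneID\tstatus\tRNA_acc\tRNA_gi\tprotein_acc\tprotein_gi\n",
    "9606\t101\tREVIEWED\tNM_X1.1\t5001\tNP_X1.1\t7001\n",
    "9606\t102\tREVIEWED\tNM_X2.1\t5002\tNP_X2.1\t7002\n",
    "10090\t201\tREVIEWED\tNM_X3.1\t5003\tNP_X3.1\t7001\n",
]

gi_to_gene = map_gi_to_gene_id(SAMPLE_GENE2REFSEQ, [7001, "7002", 9999])
print(gi_to_gene)
```

For the real file, the same function should accept the handle returned by
gzip.open("gene2refseq.gz", "rt") in place of the sample list. GIs that are
not found are simply absent from the result, which makes the unmapped
proteins easy to list.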
========================
Message: 4
Date: Mon, 11 Jun 2007 12:36:31 +0000 (UTC)
From: Nianhua Li <nialicn@yahoo.com>
Subject: Re: [BioC] getting Locus Link ids from gene symbol
To: bioconductor at stat.math.ethz.ch
Message-ID: <loom.20070611t142932-100 at post.gmane.org>
Content-Type: text/plain; charset=us-ascii
Hi, Alex,
You can parse ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
There are 4 useful columns: tax_id (column 1), GeneID (column 2), Symbol
(column 3), and Synonyms (column 5). You can:
1 Read in the file
2 Filter it based on tax_id
3 Match your gene symbols to the "Symbol" column and find their Gene ID
4 Remove the matched gene symbols from your list
5 Match the rest of the gene symbols to the "Synonyms" column and find their Gene ID
hope this helps
nianhua
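
The five steps above could be sketched roughly as follows. This is a
Python illustration rather than Nianhua's own code: the function name
symbols_to_gene_ids and the sample rows are made up, and it assumes that
Synonyms are separated by "|" with "-" marking an empty field.

```python
def symbols_to_gene_ids(lines, symbols, tax_id="9606"):
    """Map gene symbols to Gene IDs, falling back to the Synonyms column."""
    by_symbol, by_synonym = {}, {}
    # Steps 1-2: read the file and keep only rows for the requested tax_id.
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] != tax_id:
            continue
        gene_id, symbol, synonyms = fields[1], fields[2], fields[4]
        by_symbol.setdefault(symbol, set()).add(gene_id)
        if synonyms != "-":
            for synonym in synonyms.split("|"):
                by_synonym.setdefault(synonym, set()).add(gene_id)

    result = {}
    for query in symbols:
        if query in by_symbol:
            # Step 3: exact match on the Symbol column.
            result[query] = by_symbol[query]
        elif query in by_synonym:
            # Steps 4-5: only symbols left unmatched are tried as synonyms.
            result[query] = by_synonym[query]
    return result


# Made-up rows (tax_id, GeneID, Symbol, LocusTag, Synonyms); not real records.
SAMPLE_GENE_INFO = [
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n",
    "9606\t101\tABC1\t-\tOLD1|OLDA\n",
    "9606\t102\tDEF2\t-\tOLDX\n",
    "10090\t201\tAbc1\t-\t-\n",
]

found = symbols_to_gene_ids(SAMPLE_GENE_INFO, ["ABC1", "OLDX", "NOPE"])
print(found)
```

Each value is a set because a synonym can point to more than one gene, so
an ambiguous symbol shows up as a set with several Gene IDs instead of
being silently resolved.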
Nianhua Li
Software Developer
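
Simon's suggestion at the top of this thread, pre-loading such a table
into SQLite, might look roughly like the sketch below. It is only an
illustration in Python with the standard sqlite3 module: the table name,
column names, index name and the helper load_gene_info are all
hypothetical, the sample rows are made up, and nothing here reflects how
the Bioconductor annotation packages are actually built.

```python
import sqlite3


def load_gene_info(conn, lines, tax_id="9606"):
    """Load gene_info rows for one organism into SQLite; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene_info ("
        "gene_id INTEGER PRIMARY KEY, symbol TEXT, synonyms TEXT)"
    )
    rows = []
    for line in lines:
        if line.startswith("#"):  # skip the header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5 and fields[0] == tax_id:
            rows.append((int(fields[1]), fields[2], fields[4]))
    conn.executemany("INSERT OR REPLACE INTO gene_info VALUES (?, ?, ?)", rows)
    # An index on symbol keeps symbol lookups fast on the full table.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_gene_info_symbol ON gene_info(symbol)")
    conn.commit()
    return len(rows)


# Made-up rows (tax_id, GeneID, Symbol, LocusTag, Synonyms); not real records.
SAMPLE_ROWS = [
    "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\n",
    "9606\t101\tABC1\t-\tOLD1|OLDA\n",
    "9606\t102\tDEF2\t-\tOLDX\n",
    "10090\t201\tAbc1\t-\t-\n",
]

conn = sqlite3.connect(":memory:")  # a file path would be used for distribution
loaded = load_gene_info(conn, SAMPLE_ROWS)
hit = conn.execute(
    "SELECT gene_id FROM gene_info WHERE symbol = ?", ("ABC1",)
).fetchone()
print(loaded, hit)
```

Once the table is loaded, the download and parsing happen once at build
time, and users only run indexed queries against the shipped database.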