Struggling to convert a large list of non-model genes into human orthologs, any suggestions?
ronin • 0
@73e6e70d
Estonia

I am working with Perca fluviatilis and have a list of several thousand genes that I would like to convert into human orthologs. The list looks something like this:

PFLUV_G00277780
PFLUV_G00269580
PFLUV_G00217690
PFLUV_G00013790
PFLUV_G00218480
PFLUV_G00127550
PFLUV_G00171730
PFLUV_G00002260
PFLUV_G00161260
PFLUV_G00274260

Is there any resource available that can convert these into human genes? For a few dozen it is easy to just search the gene name on NCBI, but I am dealing with thousands, so I cannot do this manually. Thanks in advance for any suggestions or tips you might have.

annotation • 820 views
@james-w-macdonald-5106
United States

Normally I would suggest using the Orthology.eg.db package, which you can use to map between two species. Unfortunately you have what NCBI calls 'LocusTags' rather than the usual NCBI Gene IDs, and there isn't an easy way that I know of to map LocusTags to Gene IDs. There is a hard way to do it, using NCBI's Entrez Direct (EDirect) command-line utilities (esearch, efetch, xtract), which are very powerful but IMO not intuitive at all. Anyway, you can install those utilities from NCBI, and once you have done so, you can craft a super-obvious query:

esearch -db gene -query "txid8168 [orgn]"  | 
efetch -format docsum | 
xtract -set Set -rec Rec -pattern DocumentSummary -block DocumentSummary \
-pkg Common -wrp ID -element Id -wrp Locus -element OtherAliases | 
xtract -pattern Rec -def "-" -element ID Locus > tmp.txt

I mean obviously. And then you will have

$ head tmp.txt
120556064   PFLUV_G00038130
120561572   PFLUV_G00088970, TNF-a
120555793   PFLUV_G00011040
120549074   PFLUV_G00233340, HIF-1a
120548044   PFLUV_G00223160, ALOX5
120547785   PFLUV_G00224060
120566625   PFLUV_G00121520
120560661   PFLUV_G00075320, PTGES2
120553616   PFLUV_G00260310, LTAH4

You can use this file to map your LocusTags to NCBI Gene IDs, after which you can use the Orthology.eg.db package to map those to human NCBI Gene IDs:

> library(Orthology.eg.db)
> z <- read.table("tmp.txt", header = FALSE, sep = "\t")
> head(z)
         V1                      V2
1 120556064         PFLUV_G00038130
2 120561572  PFLUV_G00088970, TNF-a
3 120555793         PFLUV_G00011040
4 120549074 PFLUV_G00233340, HIF-1a
5 120548044  PFLUV_G00223160, ALOX5
6 120547785         PFLUV_G00224060
> select(Orthology.eg.db, as.character(z[1:20,1]), "Homo.sapiens","Perca.fluviatilis")
   Perca.fluviatilis Homo.sapiens
1          120556064         3291
2          120561572           NA
3          120555793         4306
4          120549074         3091
5          120548044          240
6          120547785        60481
7          120566625           NA
8          120560661        80142
9          120553616         4048
10          22976156           NA
11          22976155           NA
12          22976154           NA
13          22976153           NA
14          22976152           NA
15          22976151           NA
16          22976150           NA
17          22976149           NA
18          22976148           NA
19          22976147           NA
20          22976146           NA

Thank you very much for your detailed reply, this was a big help. In case others come across this problem in the future: the code did not work quite as intended for me, but it did when I ran it as:

esearch -db gene -query "txid8168 [orgn]" | efetch -format docsum | xtract -pattern DocumentSummary -block DocumentSummary -element Id,OtherAliases -tab "\t" | awk -F"\t" '{if ($2 == "") print $1, "-"; else print $1, $2}' | sed 's/ /\t/g' > tmp.txt

It did everything perfectly, with a tab separated output. So now I have the full list of P. fluviatilis genes -- perfect!

I am now trying to convert a specific list of genes by matching them against the contents of tmp.txt, but I am struggling to come up with a simple way of doing so. My general plan is:

step 1: get a tab-separated file containing both my list of genes (list.txt), and the tmp.txt list
step 2: use a python script to look for every gene included in list.txt that appears in tmp.txt
step 3: have the orthologs from tmp.txt added to list.txt

My Python is not very good. Do you think this is something AI could solve, or has this question been asked already? I have tried searching the website for this type of question but did not find anything.

Thank you again for your response, it is appreciated.
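
For anyone following the same plan, the three steps above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested pipeline. It assumes tmp.txt has the Gene ID in the first tab-separated field and the aliases in the remaining field(s), that every locus tag of interest starts with "PFLUV_", and that list.txt holds one locus tag per line. The function names and the output file name are hypothetical. Note that tmp.txt only provides NCBI Gene IDs, so the resulting IDs still have to go through Orthology.eg.db to get the human orthologs.

```python
def load_locus_to_gene_id(lines, prefix="PFLUV_"):
    """Build {locus_tag: gene_id} from lines of 'GeneID<TAB>aliases'.

    Assumption: locus tags are recognised by `prefix`; other aliases
    (e.g. gene symbols) and the "-" placeholder are ignored.
    """
    lookup = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_id = fields[0].strip()
        # Aliases may be comma-separated within one field, or spread over
        # several tab fields if spaces were turned into tabs upstream.
        for field in fields[1:]:
            for alias in field.split(","):
                alias = alias.strip()
                if alias.startswith(prefix):
                    lookup[alias] = gene_id
    return lookup


def map_gene_list(lines, lookup, missing="NA"):
    """Return (locus_tag, gene_id) pairs in input order.

    Tags that are absent from the lookup get the `missing` marker.
    """
    pairs = []
    for line in lines:
        tag = line.strip()
        if tag:
            pairs.append((tag, lookup.get(tag, missing)))
    return pairs


def annotate_list(list_path, table_path, out_path):
    """Read both files and write 'locus_tag<TAB>gene_id' to out_path."""
    with open(table_path) as fh:
        lookup = load_locus_to_gene_id(fh)
    with open(list_path) as fh:
        pairs = map_gene_list(fh, lookup)
    with open(out_path, "w") as out:
        for tag, gene_id in pairs:
            out.write(f"{tag}\t{gene_id}\n")
    return pairs
```

Calling something like annotate_list("list.txt", "tmp.txt", "list_with_ids.txt") should then give a two-column file whose second column can be read into R and passed to select() as in the answer above. Building a dictionary first keeps the lookup fast even with many thousands of genes.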
