Normally I would suggest using the Orthology.eg.db package, which you can use to map between two species. Unfortunately you have what NCBI calls 'LocusTags' rather than the usual NCBI Gene IDs, and there isn't an easy way that I know of to map LocusTags to Gene IDs. There is a hard way to do it, using NCBI's Entrez Direct (EDirect) command-line utilities, which are very powerful but IMO not intuitive at all. Anyway, you can install those utilities from NCBI, and once you have done so, you can craft a super-obvious query:
esearch -db gene -query "txid8168 [orgn]" |
efetch -format docsum |
xtract -set Set -rec Rec -pattern DocumentSummary -block DocumentSummary \
-pkg Common -wrp ID -element Id -wrp Locus -element OtherAliases |
xtract -pattern Rec -def "-" -element ID Locus > tmp.txt
I mean obviously. And then you will have
$ head tmp.txt
120556064 PFLUV_G00038130
120561572 PFLUV_G00088970, TNF-a
120555793 PFLUV_G00011040
120549074 PFLUV_G00233340, HIF-1a
120548044 PFLUV_G00223160, ALOX5
120547785 PFLUV_G00224060
120566625 PFLUV_G00121520
120560661 PFLUV_G00075320, PTGES2
120553616 PFLUV_G00260310, LTAH4
Which you can use to map your LocusTags to NCBI Gene IDs, after which you can use the Orthology.eg.db package to map to human NCBI Gene IDs:
> library(Orthology.eg.db)
> z <- read.table("tmp.txt", header = FALSE, sep = "\t")
> head(z)
V1 V2
1 120556064 PFLUV_G00038130
2 120561572 PFLUV_G00088970, TNF-a
3 120555793 PFLUV_G00011040
4 120549074 PFLUV_G00233340, HIF-1a
5 120548044 PFLUV_G00223160, ALOX5
6 120547785 PFLUV_G00224060
> select(Orthology.eg.db, as.character(z[1:20,1]), "Homo.sapiens","Perca.fluviatilis")
Perca.fluviatilis Homo.sapiens
1 120556064 3291
2 120561572 NA
3 120555793 4306
4 120549074 3091
5 120548044 240
6 120547785 60481
7 120566625 NA
8 120560661 80142
9 120553616 4048
10 22976156 NA
11 22976155 NA
12 22976154 NA
13 22976153 NA
14 22976152 NA
15 22976151 NA
16 22976150 NA
17 22976149 NA
18 22976148 NA
19 22976147 NA
20 22976146 NA
Thank you very much for your detailed reply, this was a big help. Just in case others come across this problem in the future: the code did not work exactly as intended, but when I ran it as:
It did everything perfectly, with tab-separated output. So now I have the full list of P. fluviatilis genes. Perfect!
I am now trying to convert a specific list of genes by matching them against the contents of tmp.txt, but I am struggling to come up with a simple method of doing so. My general plan is:
My Python is not very good. Do you think this is something AI could solve, or is this a question that has already been asked? I have searched the website for this type of question but did not find anything.
Thank you again for your response, it is appreciated.
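For what it's worth, matching a gene list against tmp.txt is a plain dictionary lookup, so here is a minimal Python sketch of one way it could be done. It assumes tmp.txt is tab-separated, with the NCBI Gene ID in the first column and a comma-separated alias list (LocusTag first, then any gene symbols) in the second, as in the output shown earlier. The function names and the tag PFLUV_G00000000 are made up for illustration, and the demo reads from an in-memory string rather than the real file.

```python
import io


def load_locus_map(handle):
    """Build {alias: gene_id} from lines of 'GeneID<TAB>alias1, alias2, ...'."""
    locus_to_gene = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_id, aliases = fields[0], fields[1]
        for alias in aliases.split(","):
            alias = alias.strip()
            # "-" is the placeholder xtract writes when no alias exists
            if alias and alias != "-":
                # keep the first Gene ID seen if an alias repeats
                locus_to_gene.setdefault(alias, gene_id)
    return locus_to_gene


def match_genes(wanted, locus_to_gene):
    """Return (tag, gene_id) pairs; gene_id is None when the tag is absent."""
    return [(tag, locus_to_gene.get(tag)) for tag in wanted]


# Demo on two lines copied from the output above (tab-separated by assumption).
sample = (
    "120556064\tPFLUV_G00038130\n"
    "120561572\tPFLUV_G00088970, TNF-a\n"
)
locus_map = load_locus_map(io.StringIO(sample))

# PFLUV_G00000000 is a hypothetical tag, used to show the "not found" case.
result = match_genes(["PFLUV_G00088970", "PFLUV_G00000000"], locus_map)
for tag, gene_id in result:
    print(tag, gene_id, sep="\t")
```

With the real file, replace the io.StringIO(sample) call with an open("tmp.txt") handle and pass your own list of LocusTags. The Gene IDs that come back can then go into the select() call on Orthology.eg.db shown above.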