Hello all,
I am trying to convert between types of bioinformatic identifiers, as NCBI, Ensembl, and other sources are not exactly cohesive, even after all of these years. Is there a tool that I can use to convert, say, an NCBI protein accession (NP_xxxxxxxx) to its Ensembl ID or gene symbol equivalent (e.g. hemoglobin subunit beta in Homo sapiens = HBB)? I am using RStudio.
Thank you for any help,
J. Denton
Thank you for the code, James. However, I seem to have issues when trying to get the nps data to work. For example, I am trying to get the Ensembl equivalent of NP_001265470.1, but when I plug it into
nps <- head(grep("^NP_001265470.1", keys(org.Hs.eg.db, "REFSEQ"), value = TRUE), 20)
the outcome is
nps
character(0).
(Your examples work fine, however.) Also, is it possible to pull this information from multiple databases? I understand this is the Homo sapiens package, but some of my proteins are from different species and I can't seem to find a streamlined, multi-species package.
Thanks, J. Denton
Just an update: it appears that nps dislikes periods in the accession numbers, so I had to edit NP_001265470.1 down to just NP_001265470. I'm not sure how that would affect the total accuracy, but at least I managed to figure that part out. My other questions still stand, but I still appreciate the help.
Thanks,
J. Denton
The OrgDb packages do not have versioned IDs, so you do need to strip off the version numbers first.
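As a rough sketch of the stripping step (the helper name strip_version is made up for illustration, and the lookup part assumes org.Hs.eg.db and AnnotationDbi are installed), something like this should work. Note that the "." in the grep pattern is also a regex wildcard, but that is not why the search came back empty: the REFSEQ keys simply carry no version suffix to match.

```r
# Hypothetical helper: drop a trailing ".<digits>" version suffix,
# leaving accessions without a version untouched.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

ids <- c("NP_001265470.1", "NP_062749")
unversioned <- strip_version(ids)

# Optional lookup; skipped when the annotation packages are not installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE) &&
    requireNamespace("AnnotationDbi", quietly = TRUE)) {
  db <- org.Hs.eg.db::org.Hs.eg.db
  # Keep only accessions present in this OrgDb; select() stops with an
  # error when none of the supplied keys are valid.
  known <- unversioned[unversioned %in% AnnotationDbi::keys(db, keytype = "REFSEQ")]
  if (length(known) > 0) {
    res <- AnnotationDbi::select(db, keys = known,
                                 columns = c("SYMBOL", "ENSEMBL"),
                                 keytype = "REFSEQ")
    print(res)
  }
}
```

Dropping the version means you get whatever the installed OrgDb release has for that accession, not necessarily the exact sequence version you started from.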
There isn't a multi-species package, so you will either need to search using each species separately, or hypothetically you could use NCBI's Eutils to map things (there is the CRAN reutils package that will do things from within R). Although try as I might, I can never figure out how to use Eutils fluently.
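For the "each species separately" route, a minimal sketch might look like the following. The function name map_refseq and the default package list are assumptions for illustration; it also assumes every OrgDb you pass has SYMBOL and ENSEMBL columns, which may not hold for all species.

```r
# Hypothetical helper: look up RefSeq accessions in several OrgDb packages
# in turn and stack whatever matches are found.
map_refseq <- function(ids, pkgs = c("org.Hs.eg.db", "org.Mm.eg.db")) {
  ids <- sub("\\.[0-9]+$", "", ids)  # OrgDb keys are unversioned
  hits <- list()
  for (pkg in pkgs) {
    # Skip packages that are not installed rather than failing.
    if (!requireNamespace(pkg, quietly = TRUE) ||
        !requireNamespace("AnnotationDbi", quietly = TRUE)) next
    db <- getExportedValue(pkg, pkg)  # the OrgDb object shares the package name
    known <- ids[ids %in% AnnotationDbi::keys(db, keytype = "REFSEQ")]
    if (length(known) == 0) next
    res <- AnnotationDbi::select(db, keys = known,
                                 columns = c("SYMBOL", "ENSEMBL"),
                                 keytype = "REFSEQ")
    res$package <- pkg  # records which species package matched
    hits[[pkg]] <- res
  }
  if (length(hits) == 0) return(data.frame())
  do.call(rbind, hits)
}

print(map_refseq(c("NP_001265470.1", "NP_062749")))
```

The package column in the result shows which species database each accession was found in. Accessions from species without an installed OrgDb are silently dropped, so you would want to compare the input against the output.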
Thanks for the response. I made a new post on the forum with some updated questions, one being whether there is a tool out there that can translate an accession number into an organism's actual name (e.g. NP_062749 = Mus musculus). I was thinking about combining several of the OrgDb databases using SQLite, but if that does not pan out I will definitely check out Eutils.
What if I want to convert identifiers within the same database?
For example, I want to convert gene symbol to transcript to protein, all within the RefSeq database.
Is that possible?