Converting between bioinformatic identifier types (e.g., NCBI protein to Ensembl protein) in R
j_denton • 0

Hello all, I am trying to convert bioinformatic identifiers from one form to another, as NCBI, Ensembl, and other sources are not exactly cohesive, even after all of these years. Is there a tool I can use to convert, say, an NCBI protein accession (NP_xxxxxxxx) to its Ensembl or gene symbol equivalent (e.g., hemoglobin subunit beta in Homo sapiens = HBB)? I am using RStudio. Thank you for any help, J. Denton

RStudio convert ensembldb NCBI • 1.6k views
@james-w-macdonald-5106

You can use either the OrgDb packages supplied by Bioconductor, or you can use biomaRt.

> library(org.Hs.eg.db)

## get some IDs
> nps <- head(grep("^NP_", keys(org.Hs.eg.db, "REFSEQ"), value = TRUE), 20)
> nps
 [1] "NP_570602"    "NP_000005"   
 [3] "NP_001334352" "NP_001334353"
 [5] "NP_001334354" "NP_000653"   
 [7] "NP_001153642" "NP_001153643"
 [9] "NP_001153644" "NP_001153645"
[11] "NP_001153646" "NP_001153647"
[13] "NP_001153648" "NP_001153651"
[15] "NP_001278891" "NP_000006"   
[17] "NP_001076"    "NP_001371601"
[19] "NP_001371602" "NP_001371603"

## map Ensembl and then symbol
> select(org.Hs.eg.db, nps, "ENSEMBL", "REFSEQ")
'select()' returned 1:1 mapping
between keys and columns
         REFSEQ         ENSEMBL
1     NP_570602 ENSG00000121410
2     NP_000005 ENSG00000175899
3  NP_001334352 ENSG00000175899
4  NP_001334353 ENSG00000175899
5  NP_001334354 ENSG00000175899
6     NP_000653 ENSG00000171428
7  NP_001153642 ENSG00000171428
8  NP_001153643 ENSG00000171428
9  NP_001153644 ENSG00000171428
10 NP_001153645 ENSG00000171428
11 NP_001153646 ENSG00000171428
12 NP_001153647 ENSG00000171428
13 NP_001153648 ENSG00000171428
14 NP_001153651 ENSG00000171428
15 NP_001278891 ENSG00000171428
16    NP_000006 ENSG00000156006
17    NP_001076 ENSG00000196136
18 NP_001371601 ENSG00000196136
19 NP_001371602 ENSG00000196136
20 NP_001371603 ENSG00000196136
> select(org.Hs.eg.db, nps, "SYMBOL", "REFSEQ")
'select()' returned 1:1 mapping
between keys and columns
         REFSEQ   SYMBOL
1     NP_570602     A1BG
2     NP_000005      A2M
3  NP_001334352      A2M
4  NP_001334353      A2M
5  NP_001334354      A2M
6     NP_000653     NAT1
7  NP_001153642     NAT1
8  NP_001153643     NAT1
9  NP_001153644     NAT1
10 NP_001153645     NAT1
11 NP_001153646     NAT1
12 NP_001153647     NAT1
13 NP_001153648     NAT1
14 NP_001153651     NAT1
15 NP_001278891     NAT1
16    NP_000006     NAT2
17    NP_001076 SERPINA3
18 NP_001371601 SERPINA3
19 NP_001371602 SERPINA3
20 NP_001371603 SERPINA3

## now biomaRt
> library(biomaRt)
> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> getBM(c("refseq_peptide","ensembl_gene_id"), "refseq_peptide",nps,  mart)
   refseq_peptide ensembl_gene_id
1       NP_000005 ENSG00000175899
2       NP_000006 ENSG00000156006
3       NP_000653 ENSG00000171428
4       NP_001076 ENSG00000196136
5    NP_001153642 ENSG00000171428
6    NP_001153643 ENSG00000171428
7    NP_001153644 ENSG00000171428
8    NP_001153645 ENSG00000171428
9    NP_001153646 ENSG00000171428
10   NP_001153647 ENSG00000171428
11   NP_001153648 ENSG00000171428
12   NP_001153651 ENSG00000171428
13   NP_001278891 ENSG00000171428
14   NP_001334352 ENSG00000175899
15   NP_001334353 ENSG00000175899
16   NP_001334354 ENSG00000175899
17   NP_001371601 ENSG00000196136
18   NP_001371602 ENSG00000196136
19   NP_001371603 ENSG00000196136
20      NP_570602 ENSG00000121410
> getBM(c("refseq_peptide","hgnc_symbol"), "refseq_peptide",nps,  mart)
   refseq_peptide hgnc_symbol
1       NP_000005         A2M
2       NP_000006        NAT2
3       NP_000653        NAT1
4       NP_001076    SERPINA3
5    NP_001153642        NAT1
6    NP_001153643        NAT1
7    NP_001153644        NAT1
8    NP_001153645        NAT1
9    NP_001153646        NAT1
10   NP_001153647        NAT1
11   NP_001153648        NAT1
12   NP_001153651        NAT1
13   NP_001278891        NAT1
14   NP_001334352         A2M
15   NP_001334353         A2M
16   NP_001334354         A2M
17   NP_001371601    SERPINA3
18   NP_001371602    SERPINA3
19   NP_001371603    SERPINA3
20      NP_570602        A1BG

Thank you for the code, James. However, I am having trouble getting it to work with my own IDs. For example, I am trying to get the Ensembl equivalent of NP_001265470.1, but when I plug it into nps <- head(grep("^NP_001265470.1", keys(org.Hs.eg.db, "REFSEQ"), value = TRUE), 20), nps comes back as character(0). (Your examples work fine.)

Also, is it possible to pull this information from multiple databases? I understand this is the Homo sapiens package, but some of my proteins are from other species and I can't seem to find a streamlined, multi-species package. Thanks, J. Denton


Just an update: it appears the lookup does not accept the version suffix (the period and trailing digits) on accession numbers, so I had to trim NP_001265470.1 down to NP_001265470. I'm not sure how that affects accuracy, but at least I figured that part out. My other questions still stand, and I appreciate the help. Thanks, J. Denton


The OrgDb packages do not have versioned IDs, so you do need to strip off the version numbers first.
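
A minimal sketch of stripping the version suffix before the lookup. The helper name strip_version is hypothetical (it is not part of any package), and it assumes a versioned accession ends in a period followed by digits.

```r
# Hypothetical helper: drop a trailing ".<digits>" version suffix, if present.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

versioned <- c("NP_001265470.1", "NP_000005.3", "NP_570602")
unversioned <- strip_version(versioned)
unversioned
# "NP_001265470" "NP_000005" "NP_570602"

# The lookup itself only runs if the human OrgDb package is installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)
  print(select(org.Hs.eg.db, unversioned, "ENSEMBL", "REFSEQ"))
}
```

Because the keys are matched exactly, grep() is not needed here; the unversioned IDs can be passed straight to select().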

There isn't a multi-species package, so you will either need to search using each species separately, or hypothetically you could use NCBI's Eutils to map things (there is the CRAN reutils package that will do things from within R). Although try as I might, I can never figure out how to use Eutils fluently.
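
Searching each species separately could be sketched roughly as below. The function name map_across_species and the vector orgdb_pkgs are hypothetical, and the example assumes the listed OrgDb packages (here human and mouse) are installed; any package that is missing is simply skipped.

```r
# Assumption: extend this vector with whichever OrgDb packages you have installed.
orgdb_pkgs <- c("org.Hs.eg.db", "org.Mm.eg.db")

# Hypothetical helper: look up unversioned RefSeq IDs in each OrgDb in turn.
map_across_species <- function(ids, pkgs, column = "SYMBOL") {
  results <- lapply(pkgs, function(pkg) {
    # Skip packages that are not installed.
    if (!requireNamespace(pkg, quietly = TRUE)) return(NULL)
    # Each OrgDb package exports an object with the same name as the package.
    db <- getExportedValue(pkg, pkg)
    # Only query IDs this species actually has; select() errors if none are valid.
    hits <- intersect(ids, AnnotationDbi::keys(db, "REFSEQ"))
    if (length(hits) == 0) return(NULL)
    out <- AnnotationDbi::select(db, hits, column, "REFSEQ")
    out$package <- pkg
    out
  })
  # Returns NULL if nothing was found anywhere.
  do.call(rbind, results)
}

ids <- c("NP_000005", "NP_062749")
map_across_species(ids, orgdb_pkgs)
```

The package column records which species database each accession was found in, so IDs from mixed organisms can go in as one vector.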


Thanks for the response. I made a new post on the forum with some updated questions, one being whether there is a tool that can translate an accession number into the organism's actual name (e.g., NP_062749 = Mus musculus). I was thinking about combining several of the OrgDb databases using SQLite; if that does not pan out I will definitely check out Eutils.


What if I want to convert between identifiers within the same database? For example, I want to go from gene symbol to transcript to protein, all within RefSeq. Is that possible?
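
One possible approach, sketched under assumptions rather than offered as a definitive answer: the REFSEQ column of an OrgDb holds both transcript and protein accessions for a gene, so they can be retrieved by symbol and then labelled by prefix. The helper names refseq_type and refseq_for_symbols are hypothetical. Note that this does not pair each transcript with its own protein; for paired rows, requesting the biomaRt attributes refseq_mrna and refseq_peptide together in one getBM() call may be an option, assuming both attributes are available in your mart.

```r
# Hypothetical helper: label RefSeq accessions by their prefix.
refseq_type <- function(ids) {
  ifelse(grepl("^[NX]M_", ids), "mRNA",
  ifelse(grepl("^[NX]P_", ids), "protein",
  ifelse(grepl("^[NX]R_", ids), "non-coding RNA", "other")))
}

# Hypothetical helper: fetch every RefSeq accession for the given gene symbols.
refseq_for_symbols <- function(symbols, pkg = "org.Hs.eg.db") {
  db <- getExportedValue(pkg, pkg)
  out <- AnnotationDbi::select(db, symbols, "REFSEQ", "SYMBOL")
  out$TYPE <- refseq_type(out$REFSEQ)
  out
}

# Only runs if the human OrgDb package is installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  print(refseq_for_symbols("HBB"))
}
```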
