biomaRt to match UniProt protein IDs with protein names
@a8115bcd

I'm trying to use biomaRt (a Bioconductor package) to convert a column of UniProt IDs to their corresponding protein names in RStudio, but the names are still not changing. I'd appreciate it if someone could look at my code and correct it so that it actually works.

#Install the biomaRt (a bioconductor package)
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("biomaRt")

# Load the required libraries
library(biomaRt)

# Read the CSV file
fold_changes <- read.csv("fold_changes.csv")

# Connect to the Ensembl database using biomaRt
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Get the protein names using biomaRt
search_ids <- c("protein")
protein_names <- biomaRt::getBM(attributes = c("ensembl_gene_id","uniprotswissprot", "description","hgnc_symbol","gene_biotype"),#,"entrezgene" 
                                filters = "uniprotswissprot", 
                                values = search_ids, 
                                mart = ensembl)

# Rename the protein name column to "protein_name"
colnames(fold_changes)[1] <- "protein_name"

# Write the updated data frame to a new CSV file
write.csv(fold_changes, file = "fold_changes_with_names.csv", row.names = FALSE)
@james-w-macdonald-5106

Your call to getBM has an extra '#' in it that is probably not helping things. You are also asking the database for a lot of attributes in one query, which often returns far more rows than you expect; it's usually better to ask for just a few. If you only want the protein name, request just that. You should also always include your filter as an attribute, because the database won't return results in the same order as the IDs you supplied. And if you have NCBI Gene IDs, it's probably better to query a resource that is based on them rather than something like the Biomart server, which is based on Ensembl. So, a few examples.

> library(biomaRt)
> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> library(org.Hs.eg.db)
## get some random Gene IDs
> egids<- head(keys(org.Hs.eg.db), 20)
> z <- getBM(c("entrezgene_id","ensembl_gene_id","uniprotswissprot","description","hgnc_symbol"), "entrezgene_id", egids, mart)
## check for duplicates
> table(z$entrezgene_id)

 1  2  9 10 12 13 14 15 16 18 19 20 21 22 23 24 25 
 2  2  2  2  2  2  2  2  2  2  2  2  2  2 14  2  2
> head(z)
  entrezgene_id ensembl_gene_id uniprotswissprot
1            23 ENSG00000225989           Q8NE71
2            23 ENSG00000225989                 
3            23 ENSG00000236149           Q8NE71
4            23 ENSG00000236149                 
5            10 ENSG00000156006           P11245
6            10 ENSG00000156006                 
                                                                 description
1 ATP binding cassette subfamily F member 1 [Source:HGNC Symbol;Acc:HGNC:70]
2 ATP binding cassette subfamily F member 1 [Source:HGNC Symbol;Acc:HGNC:70]
3 ATP binding cassette subfamily F member 1 [Source:HGNC Symbol;Acc:HGNC:70]
4 ATP binding cassette subfamily F member 1 [Source:HGNC Symbol;Acc:HGNC:70]
5                   N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646]
6                   N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646]
  hgnc_symbol
1       ABCF1
2       ABCF1
3       ABCF1
4       ABCF1
5        NAT2
6        NAT2
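
One way to tame the duplicate rows in a result like z is to drop the rows where uniprotswissprot came back empty and then keep only the distinct combinations of the columns you care about. The sketch below is only an illustration, not part of the session above: it rebuilds the six rows printed by head(z) by hand so that it runs without a Biomart connection, and the names z_filled and z_unique are made up for the example.

```r
# Toy copy of the rows printed by head(z) above, typed in by hand
# so the example runs without querying Biomart.
z <- data.frame(
  entrezgene_id    = c(23, 23, 23, 23, 10, 10),
  ensembl_gene_id  = c("ENSG00000225989", "ENSG00000225989",
                       "ENSG00000236149", "ENSG00000236149",
                       "ENSG00000156006", "ENSG00000156006"),
  uniprotswissprot = c("Q8NE71", "", "Q8NE71", "", "P11245", ""),
  hgnc_symbol      = c("ABCF1", "ABCF1", "ABCF1", "ABCF1", "NAT2", "NAT2"),
  stringsAsFactors = FALSE
)

# Drop rows where no Swiss-Prot accession was returned
z_filled <- z[z$uniprotswissprot != "", ]

# Keep one row per Gene ID / accession / symbol combination;
# the Ensembl gene ID is left out because it is what multiplies the rows here
z_unique <- unique(z_filled[, c("entrezgene_id", "uniprotswissprot", "hgnc_symbol")])
print(z_unique)
```

Whether it is safe to discard the Ensembl gene IDs depends on what you need downstream; if you need them, keep that column and expect more than one row per Gene ID.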

## Alternative method
> library(UniProt.ws)
> ws <- UniProt.ws()
> select(ws, egids, "id", "GeneID")

   From      Entry       Entry.Name
1     1     P04217       A1BG_HUMAN
2     1     V9HWD8     V9HWD8_HUMAN
3     2     P01023       A2MG_HUMAN
4     9     P18440       ARY1_HUMAN
5     9     F5H5R8     F5H5R8_HUMAN
6     9     Q400J6     Q400J6_HUMAN
7    10     P11245       ARY2_HUMAN
8    10     A4Z6T7     A4Z6T7_HUMAN
9    12     P01011       AACT_HUMAN
10   12 A0A024R6P0 A0A024R6P0_HUMAN
11   13     P22760       AAAD_HUMAN
12   14     Q13685       AAMP_HUMAN
13   14     C9JEH3     C9JEH3_HUMAN
14   15     Q16613       SNAT_HUMAN
15   15     F1T0I5     F1T0I5_HUMAN
16   16     P49588       SYAC_HUMAN
17   18     P80404       GABT_HUMAN
18   18     X5D8S1     X5D8S1_HUMAN
19   19     O95477      ABCA1_HUMAN
20   19 A0A7I2V5U0 A0A7I2V5U0_HUMAN
21   19     B2RUU2     B2RUU2_HUMAN
22   19     B7XCW9     B7XCW9_HUMAN
23   20     Q9BZC7      ABCA2_HUMAN
24   21     Q99758      ABCA3_HUMAN
25   21     Q4LE27     Q4LE27_HUMAN
26   22     O75027      ABCB7_HUMAN
27   22 A0A087WW65 A0A087WW65_HUMAN
28   22 A0A0S2Z2Z3 A0A0S2Z2Z3_HUMAN
29   23     Q8NE71      ABCF1_HUMAN
30   23 A0A1U9X609 A0A1U9X609_HUMAN
31   23     Q2L6I2     Q2L6I2_HUMAN
32   24     P78363      ABCA4_HUMAN
33   24     Q6AI28     Q6AI28_HUMAN
34   25     P00519       ABL1_HUMAN
35   25 A0A024R8E2 A0A024R8E2_HUMAN
36   25     Q59FK4     Q59FK4_HUMAN
Warning message:
IDs not mapped: 11, 17, 3

This still has duplicates, but not as many.


Oh wait, I think I misunderstood.

> upids <- head(keys(ws, "UniProtKB"), 20)
> upids
 [1] "A0A0C5B5G6" "A0A1B0GTW7" "A0JNW5"     "A0JP26"     "A0PK11"    
 [6] "A1A4S6"     "A1A519"     "A1L190"     "A1L3X0"     "A1X283"    
[11] "A2A2Y4"     "A2RU14"     "A2RUB6"     "A2RUC4"     "A4D1B5"    
[16] "A4GXA9"     "A5D8V7"     "A5PLL7"     "A6BM72"     "A6H8Y1"
>  select(ws, upids, "gene_names", "UniProtKB")


         From      Entry                        Gene.Names
1  A0A0C5B5G6 A0A0C5B5G6                           MT-RNR1
2  A0A1B0GTW7 A0A1B0GTW7                       CIROP LMLN2
3      A0JNW5     A0JNW5 BLTP3B KIAA0701 SHIP164 UHRF1BP1L
4      A0JP26     A0JP26                            POTEB3
5      A0PK11     A0PK11                             CLRN2
6      A1A4S6     A1A4S6                    ARHGAP10 GRAF2
7      A1A519     A1A519                      FAM170A ZNFD
8      A1L190     A1L190              SYCE3 C22orf41 THEG2
9      A1L3X0     A1L3X0                            ELOVL7
10     A1X283     A1X283      SH3PXD2B FAD49 KIAA1295 TKS4
11     A2A2Y4     A2A2Y4                    FRMD3 EPB41L4O
12     A2RU14     A2RU14                           TMEM218
13     A2RUB6     A2RUB6                            CCDC66
14     A2RUC4     A2RUC4                      TYW5 C2orf60
15     A4D1B5     A4D1B5                         GSAP PION
16     A4GXA9     A4GXA9                              EME2
17     A5D8V7     A5D8V7                     ODAD3 CCDC151
18     A5PLL7     A5PLL7            PEDS1 KUA PDES TMEM189
19     A6BM72     A6BM72   MEGF11 KIAA1781 UNQ1949/PRO4432
20     A6H8Y1     A6H8Y1       BDP1 KIAA1241 KIAA1689 TFNR
## or maybe
> getBM(c("uniprotswissprot","hgnc_symbol"), "uniprotswissprot", upids, mart)
   uniprotswissprot hgnc_symbol
1        A0A1B0GTW7       CIROP
2            A0JNW5      BLTP3B
3            A0JP26      POTEB3
4            A0PK11       CLRN2
5            A1A4S6    ARHGAP10
6            A1A519     FAM170A
7            A1L190       SYCE3
8            A1L3X0      ELOVL7
9            A1X283    SH3PXD2B
10           A2A2Y4       FRMD3
11           A2RU14     TMEM218
12           A2RUB6      CCDC66
13           A2RUC4        TYW5
14           A4D1B5        GSAP
15           A4GXA9        EME2
16           A5D8V7       ODAD3
17           A5PLL7       PEDS1
18           A6BM72      MEGF11
19           A6H8Y1        BDP1
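
Once you have a mapping like that, you still have to attach it to your own table; renaming a column in fold_changes will not do it. A minimal sketch of that step follows, under some assumptions: the column names protein and log2FC and the fold-change values are invented stand-ins for whatever is in fold_changes.csv, and res is a hand-typed copy of the first three rows of the getBM output above so the example runs offline. Using match() keeps the rows in the order of your input, which is why the filter needs to be among the returned attributes.

```r
# Hypothetical stand-in for read.csv("fold_changes.csv");
# the column names and the log2FC values are assumptions for illustration.
fold_changes <- data.frame(
  protein = c("A0JNW5", "A0A0C5B5G6", "A0A1B0GTW7", "A0JP26"),
  log2FC  = c(1.2, -0.4, 0.8, 2.1),
  stringsAsFactors = FALSE
)

# Hand-typed copy of the first rows of the getBM() result printed above.
# In a real session this would come from something like:
#   res <- getBM(c("uniprotswissprot", "hgnc_symbol"),
#                "uniprotswissprot", fold_changes$protein, mart)
res <- data.frame(
  uniprotswissprot = c("A0A1B0GTW7", "A0JNW5", "A0JP26"),
  hgnc_symbol      = c("CIROP", "BLTP3B", "POTEB3"),
  stringsAsFactors = FALSE
)

# Look up each input ID in the result; IDs that were not returned give NA
idx <- match(fold_changes$protein, res$uniprotswissprot)
fold_changes$protein_name <- res$hgnc_symbol[idx]
print(fold_changes)
```

Note that match() only picks up the first hit for each ID, so if the mapping has several rows per accession you would want to deduplicate it first or use merge() instead.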

Is that what you meant?
