Potential bug in STRINGdb$get_aliases()
0
0
Entering edit mode
@64951acf
Last seen 2.6 years ago
United States

Heya,

First, thanks for the STRINGdb package :)

I was having an issue where incorrect STRING_ids were getting assigned, and looking into it i think I found the problem is coming from the get_aliases() method. Apologies for this being long, but just want to lay out what I've seen.

For example, searching for TAIR ID "AT1G01030" is returning a STRING_id of "3702.AT1G01040.2", while it should be returning "3702.AT1G01030.1". The mapping seems to be correct in the underlying protein.aliases table (as shown below), so I started looking into the method.

Demonstration of problem in R

library(tidyverse)
library(STRINGdb)

packageVersion("STRINGdb")
# ‘2.6.0’ # note i'm using 2.6.0, but the method in question looks like it has the same code in the latest 2.8.0

string_db <- STRINGdb$new(version = "11.5",
                          species = 3702)

string_db$mp("AT1G01030")
# "3702.AT1G01040.2"

# we can see this is what's set up by the get_aliases() function like so
alias_df <- string_db$get_aliases()

alias_df %>% filter(alias == "AT1G01030")
#              STRING_id     alias
# 54830 3702.AT1G01040.2 AT1G01030
    # where the alias is linked to the incorrect STRING_id

## and we can see the info doesn't come in that way from the initial table
initial_alias_df <- read.table(paste0(tempdir(), "/", "3702.protein.aliases.v11.5.txt.gz"), 
                               skip = 1, sep = "\t", header = FALSE, quote = "", 
                               stringsAsFactors = FALSE, fill = TRUE,
                               col.names = c("STRING_id", "alias", "sources"))

initial_alias_df %>% filter(alias == "AT1G01030")
#          STRING_id     alias                  sources
# 1 3702.AT1G01030.1 AT1G01030     Ensembl_ArrayExpress
# 2 3702.AT1G01030.1 AT1G01030          Ensembl_Genomes
# 3 3702.AT1G01030.1 AT1G01030       Ensembl_TAIR_LOCUS
# 4 3702.AT1G01030.1 AT1G01030 Ensembl_TAIR_TRANSLATION
# 5 3702.AT1G01030.1 AT1G01030             Ensembl_gene
# 6 3702.AT1G01030.1 AT1G01030            Plants_Source
# 7 3702.AT1G01030.1 AT1G01030                  TAIR_ID
    # as all are in there linked to 3702.AT1G01030.1

Where I think the problem is in get_aliases()

# in the get_aliases method, the protein table is read in
proteins <- string_db$get_proteins()

# and the alias table is read in
aliasDf <- read.table(paste0(tempdir(), "/", "3702.protein.aliases.v11.5.txt.gz"), 
                               skip = 1, sep = "\t", header = FALSE, quote = "", 
                               stringsAsFactors = FALSE, fill = TRUE,
                               col.names = c("STRING_id", "alias", "sources"))

# at this point we're good (same as above)
aliasDf %>% filter(alias == "AT1G01030")
#          STRING_id     alias                  sources
# 1 3702.AT1G01030.1 AT1G01030     Ensembl_ArrayExpress
# 2 3702.AT1G01030.1 AT1G01030          Ensembl_Genomes
# 3 3702.AT1G01030.1 AT1G01030       Ensembl_TAIR_LOCUS
# 4 3702.AT1G01030.1 AT1G01030 Ensembl_TAIR_TRANSLATION
# 5 3702.AT1G01030.1 AT1G01030             Ensembl_gene
# 6 3702.AT1G01030.1 AT1G01030            Plants_Source
# 7 3702.AT1G01030.1 AT1G01030                  TAIR_ID

# the aliasDf is then cut down to the first two columns
aliasDf = subset(aliasDf, select = c("STRING_id", "alias"))

# next 3 protein sub-tables are generated like so
pr1 <- data.frame(STRING_id = proteins$protein_external_id, 
                 alias = proteins$preferred_name,
                 stringsAsFactors = FALSE)

pr2 <- data.frame(STRING_id = proteins$protein_external_id,
                 alias = proteins$protein_external_id, 
                 stringsAsFactors = FALSE)

pr3 <- data.frame(STRING_id = proteins$protein_external_id,
                 alias = unlist(strsplit(proteins$protein_external_id,
                                "\\."))[seq(from = 2, to = 2 * nrow(proteins), by = 2)],
                 stringsAsFactors = FALSE)

# if we look at that 3rd table, the indexing on the split STRING_ids is not pulling out the core ID 
# each time, but is cycling through the taxid and version also

unlist(strsplit(proteins$protein_external_id, "\\."))[seq(from = 2, to = 2 * nrow(proteins), by = 2)] %>% head
# "AT1G01010" "3702"      "1"         "AT1G01030" "3702"      "2"

head(pr3)
#          STRING_id     alias
# 1 3702.AT1G01010.1 AT1G01010
# 2 3702.AT1G01020.1      3702
# 3 3702.AT1G01030.1         1
# 4 3702.AT1G01040.2 AT1G01030
# 5 3702.AT1G01050.1      3702
# 6 3702.AT1G01060.2         2
    # which happens to be improperly linking some, like on row 4 there our example "AT1G01030" to "3702.AT1G01040.2"

# this holds true after merging them and then in the final returned dataframe
aliasDf2 <- rbind(pr1, pr2, pr3, aliasDf)

aliasDf2 %>% filter(alias == "AT1G01030")
#         STRING_id     alias
# 1 3702.AT1G01040.2 AT1G01030
# 2 3702.AT1G01030.1 AT1G01030
# 3 3702.AT1G01030.1 AT1G01030
# 4 3702.AT1G01030.1 AT1G01030
# 5 3702.AT1G01030.1 AT1G01030
# 6 3702.AT1G01030.1 AT1G01030
# 7 3702.AT1G01030.1 AT1G01030
# 8 3702.AT1G01030.1 AT1G01030
    # where that first row is linking improperly

Suggested fix

# if the indexing step during the pr3-table part counts by 3s instead of 2s, things seem to work out
pr3_mod <- data.frame(STRING_id = proteins$protein_external_id,
                      alias = unlist(strsplit(proteins$protein_external_id,
                                     "\\."))[seq(from = 2, to = 3 * nrow(proteins), by = 3)],
                      stringsAsFactors = FALSE)

head(pr3_mod)
#          STRING_id     alias
# 1 3702.AT1G01010.1 AT1G01010
# 2 3702.AT1G01020.1 AT1G01020
# 3 3702.AT1G01030.1 AT1G01030
# 4 3702.AT1G01040.2 AT1G01040
# 5 3702.AT1G01050.1 AT1G01050
# 6 3702.AT1G01060.2 AT1G01060

# then after merging, only the correct ones are still linked
aliasDf2_mod <- rbind(pr1, pr2, pr3_mod, aliasDf)

aliasDf2_mod %>% filter(alias == "AT1G01030")
#          STRING_id     alias
# 1 3702.AT1G01030.1 AT1G01030
# 2 3702.AT1G01030.1 AT1G01030
# 3 3702.AT1G01030.1 AT1G01030
# 4 3702.AT1G01030.1 AT1G01030
# 5 3702.AT1G01030.1 AT1G01030
# 6 3702.AT1G01030.1 AT1G01030
# 7 3702.AT1G01030.1 AT1G01030
# 8 3702.AT1G01030.1 AT1G01030

I've never used STRING_ids before, but i suspect maybe there was a change at some point in how they are formatted such that the splitting/indexing above needs to be shifted ¯\_(ツ)_/¯

I couldn't find somewhere to report an issue, so I'm posting here as an "Ask a question". If there is somewhere I should log this as an issue, please point me there.

Thanks!

Session info

sessionInfo()
# R version 4.1.2 (2021-11-01)
# Platform: x86_64-conda-linux-gnu (64-bit)
# Running under: Ubuntu 20.04.4 LTS
# 
# Matrix products: default
# BLAS/LAPACK: /global/smf/miniconda38_admin/envs/RNAseq_Rtools_03-2022/lib/libopenblasp-r0.3.18.so
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base
# 
# other attached packages:
#  [1] STRINGdb_2.6.0  forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7
#  [5] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.6
#  [9] ggplot2_3.3.5   tidyverse_1.3.1
# 
# loaded via a namespace (and not attached):
#  [1] Rcpp_1.0.8         lubridate_1.8.0    png_0.1-7          gtools_3.9.2
#  [5] assertthat_0.2.1   utf8_1.2.2         chron_2.3-56       R6_2.5.1
#  [9] cellranger_1.1.0   plyr_1.8.6         backports_1.4.1    reprex_2.0.1
# [13] RSQLite_2.2.8      httr_1.4.2         sqldf_0.4-11       pillar_1.7.0
# [17] gplots_3.1.1       rlang_0.4.12       readxl_1.3.1       rstudioapi_0.13
# [21] blob_1.2.2         hash_3.0.1         gsubfn_0.7         proto_1.0.0
# [25] igraph_1.2.11      RCurl_1.98-1.6     bit_4.0.4          munsell_0.5.0
# [29] broom_0.7.12       compiler_4.1.2     modelr_0.1.8       pkgconfig_2.0.3
# [33] tidyselect_1.1.1   fansi_1.0.2        crayon_1.5.0       tzdb_0.2.0
# [37] dbplyr_2.1.1       withr_2.5.0        bitops_1.0-7       grid_4.1.2
# [41] jsonlite_1.8.0     gtable_0.3.0       lifecycle_1.0.1    DBI_1.1.2
# [45] magrittr_2.0.2     scales_1.1.1       KernSmooth_2.23-20 cachem_1.0.6
# [49] cli_3.2.0          stringi_1.7.6      fs_1.5.2           xml2_1.3.3
# [53] ellipsis_0.3.2     generics_0.1.2     vctrs_0.3.8        RColorBrewer_1.1-2
# [57] tools_4.1.2        bit64_4.0.5        glue_1.6.2         hms_1.1.1
# [61] plotrix_3.8-2      fastmap_1.1.0      colorspace_2.0-3   caTools_1.18.2
# [65] rvest_1.0.2        memoise_2.0.1      haven_2.4.3
STRINGdb • 1.2k views
ADD COMMENT
1
Entering edit mode

Dear Mike,

Thanks for pointing out the issue. I can reproduce it, and the fix will be pushed soon. I'll keep you updated.

Best, Damain.

ADD REPLY
1
Entering edit mode

The bug was in the part of code relating to handling the parsing of identifiers with dot inside. This is now done in preprocessing so it wasn't even necessary. It's fixed in version 2.8.1 (Bionc version 3.15) I cannot retroactively bugfix the older version of Bionconductor so unfortunately you will have to update it.

The fix is pushed to the Bionc github, it should go live in the next few days.

Sorry for the the issue and thanks for the feedaback, Damian.

ADD REPLY
0
Entering edit mode

Great, thanks for the quick work, Damian :)

Cheers, Mike

ADD REPLY

Login before adding your answer.

Traffic: 519 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6