Heya,
First, thanks for the STRINGdb package :)
I was having an issue where incorrect STRING_ids were getting assigned, and looking into it i think I found the problem is coming from the get_aliases()
method. Apologies for this being long, but just want to lay out what I've seen.
For example, searching for TAIR ID "AT1G01030" is returning a STRING_id of "3702.AT1G01040.2", while it should be returning "3702.AT1G01030.1". The mapping seems to be correct in the underlying protein.aliases table (as shown below), so I started looking into the method.
Demonstration of problem in R
library(tidyverse)
library(STRINGdb)
packageVersion("STRINGdb")
# ‘2.6.0’ # note i'm using 2.6.0, but the method in question looks like it has the same code in the latest 2.8.0
string_db <- STRINGdb$new(version = "11.5",
species = 3702)
string_db$mp("AT1G01030")
# "3702.AT1G01040.2"
# we can see this is what's set up by the get_aliases() function like so
alias_df <- string_db$get_aliases()
alias_df %>% filter(alias == "AT1G01030")
# STRING_id alias
# 54830 3702.AT1G01040.2 AT1G01030
# where the alias is linked to the incorrect STRING_id
## and we can see the info doesn't come in that way from the initial table
initial_alias_df <- read.table(paste0(tempdir(), "/", "3702.protein.aliases.v11.5.txt.gz"),
skip = 1, sep = "\t", header = FALSE, quote = "",
stringsAsFactors = FALSE, fill = TRUE,
col.names = c("STRING_id", "alias", "sources"))
initial_alias_df %>% filter(alias == "AT1G01030")
# STRING_id alias sources
# 1 3702.AT1G01030.1 AT1G01030 Ensembl_ArrayExpress
# 2 3702.AT1G01030.1 AT1G01030 Ensembl_Genomes
# 3 3702.AT1G01030.1 AT1G01030 Ensembl_TAIR_LOCUS
# 4 3702.AT1G01030.1 AT1G01030 Ensembl_TAIR_TRANSLATION
# 5 3702.AT1G01030.1 AT1G01030 Ensembl_gene
# 6 3702.AT1G01030.1 AT1G01030 Plants_Source
# 7 3702.AT1G01030.1 AT1G01030 TAIR_ID
# as all are in there linked to 3702.AT1G01030.1
Where I think the problem is in get_aliases()
# in the get_aliases method, the protein table is read in
proteins <- string_db$get_proteins()
# and the alias table is read in
aliasDf <- read.table(paste0(tempdir(), "/", "3702.protein.aliases.v11.5.txt.gz"),
skip = 1, sep = "\t", header = FALSE, quote = "",
stringsAsFactors = FALSE, fill = TRUE,
col.names = c("STRING_id", "alias", "sources"))
# at this point we're good (same as above)
aliasDf %>% filter(alias == "AT1G01030")
# STRING_id alias sources
# 1 3702.AT1G01030.1 AT1G01030 Ensembl_ArrayExpress
# 2 3702.AT1G01030.1 AT1G01030 Ensembl_Genomes
# 3 3702.AT1G01030.1 AT1G01030 Ensembl_TAIR_LOCUS
# 4 3702.AT1G01030.1 AT1G01030 Ensembl_TAIR_TRANSLATION
# 5 3702.AT1G01030.1 AT1G01030 Ensembl_gene
# 6 3702.AT1G01030.1 AT1G01030 Plants_Source
# 7 3702.AT1G01030.1 AT1G01030 TAIR_ID
# the aliasDf is then cut down to the first two columns
aliasDf = subset(aliasDf, select = c("STRING_id", "alias"))
# next 3 protein sub-tables are generated like so
pr1 <- data.frame(STRING_id = proteins$protein_external_id,
alias = proteins$preferred_name,
stringsAsFactors = FALSE)
pr2 <- data.frame(STRING_id = proteins$protein_external_id,
alias = proteins$protein_external_id,
stringsAsFactors = FALSE)
pr3 <- data.frame(STRING_id = proteins$protein_external_id,
alias = unlist(strsplit(proteins$protein_external_id,
"\\."))[seq(from = 2, to = 2 * nrow(proteins), by = 2)],
stringsAsFactors = FALSE)
# if we look at that 3rd table, the indexing on the split STRING_ids is not pulling out the core ID
# each time, but is cycling through the taxid and version also
unlist(strsplit(proteins$protein_external_id, "\\."))[seq(from = 2, to = 2 * nrow(proteins), by = 2)] %>% head
# "AT1G01010" "3702" "1" "AT1G01030" "3702" "2"
head(pr3)
# STRING_id alias
# 1 3702.AT1G01010.1 AT1G01010
# 2 3702.AT1G01020.1 3702
# 3 3702.AT1G01030.1 1
# 4 3702.AT1G01040.2 AT1G01030
# 5 3702.AT1G01050.1 3702
# 6 3702.AT1G01060.2 2
# which happens to be improperly linking some, like on row 4 there our example "AT1G01030" to "3702.AT1G01040.2"
# this holds true after merging them and then in the final returned dataframe
aliasDf2 <- rbind(pr1, pr2, pr3, aliasDf)
aliasDf2 %>% filter(alias == "AT1G01030")
# STRING_id alias
# 1 3702.AT1G01040.2 AT1G01030
# 2 3702.AT1G01030.1 AT1G01030
# 3 3702.AT1G01030.1 AT1G01030
# 4 3702.AT1G01030.1 AT1G01030
# 5 3702.AT1G01030.1 AT1G01030
# 6 3702.AT1G01030.1 AT1G01030
# 7 3702.AT1G01030.1 AT1G01030
# 8 3702.AT1G01030.1 AT1G01030
# where that first row is linking improperly
Suggested fix
# if the indexing step during the pr3-table part counts by 3s instead of 2s, things seem to work out
pr3_mod <- data.frame(STRING_id = proteins$protein_external_id,
alias = unlist(strsplit(proteins$protein_external_id,
"\\."))[seq(from = 2, to = 3 * nrow(proteins), by = 3)],
stringsAsFactors = FALSE)
head(pr3_mod)
# STRING_id alias
# 1 3702.AT1G01010.1 AT1G01010
# 2 3702.AT1G01020.1 AT1G01020
# 3 3702.AT1G01030.1 AT1G01030
# 4 3702.AT1G01040.2 AT1G01040
# 5 3702.AT1G01050.1 AT1G01050
# 6 3702.AT1G01060.2 AT1G01060
# then after merging, only the correct ones are still linked
aliasDf2_mod <- rbind(pr1, pr2, pr3_mod, aliasDf)
aliasDf2_mod %>% filter(alias == "AT1G01030")
# STRING_id alias
# 1 3702.AT1G01030.1 AT1G01030
# 2 3702.AT1G01030.1 AT1G01030
# 3 3702.AT1G01030.1 AT1G01030
# 4 3702.AT1G01030.1 AT1G01030
# 5 3702.AT1G01030.1 AT1G01030
# 6 3702.AT1G01030.1 AT1G01030
# 7 3702.AT1G01030.1 AT1G01030
# 8 3702.AT1G01030.1 AT1G01030
I've never used STRING_ids before, but i suspect maybe there was a change at some point in how they are formatted such that the splitting/indexing above needs to be shifted ¯\_(ツ)_/¯
I couldn't find somewhere to report an issue, so I'm posting here as an "Ask a question". If there is somewhere I should log this as an issue, please point me there.
Thanks!
Session info
sessionInfo()
# R version 4.1.2 (2021-11-01)
# Platform: x86_64-conda-linux-gnu (64-bit)
# Running under: Ubuntu 20.04.4 LTS
#
# Matrix products: default
# BLAS/LAPACK: /global/smf/miniconda38_admin/envs/RNAseq_Rtools_03-2022/lib/libopenblasp-r0.3.18.so
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] STRINGdb_2.6.0 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
# [5] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.6
# [9] ggplot2_3.3.5 tidyverse_1.3.1
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.8 lubridate_1.8.0 png_0.1-7 gtools_3.9.2
# [5] assertthat_0.2.1 utf8_1.2.2 chron_2.3-56 R6_2.5.1
# [9] cellranger_1.1.0 plyr_1.8.6 backports_1.4.1 reprex_2.0.1
# [13] RSQLite_2.2.8 httr_1.4.2 sqldf_0.4-11 pillar_1.7.0
# [17] gplots_3.1.1 rlang_0.4.12 readxl_1.3.1 rstudioapi_0.13
# [21] blob_1.2.2 hash_3.0.1 gsubfn_0.7 proto_1.0.0
# [25] igraph_1.2.11 RCurl_1.98-1.6 bit_4.0.4 munsell_0.5.0
# [29] broom_0.7.12 compiler_4.1.2 modelr_0.1.8 pkgconfig_2.0.3
# [33] tidyselect_1.1.1 fansi_1.0.2 crayon_1.5.0 tzdb_0.2.0
# [37] dbplyr_2.1.1 withr_2.5.0 bitops_1.0-7 grid_4.1.2
# [41] jsonlite_1.8.0 gtable_0.3.0 lifecycle_1.0.1 DBI_1.1.2
# [45] magrittr_2.0.2 scales_1.1.1 KernSmooth_2.23-20 cachem_1.0.6
# [49] cli_3.2.0 stringi_1.7.6 fs_1.5.2 xml2_1.3.3
# [53] ellipsis_0.3.2 generics_0.1.2 vctrs_0.3.8 RColorBrewer_1.1-2
# [57] tools_4.1.2 bit64_4.0.5 glue_1.6.2 hms_1.1.1
# [61] plotrix_3.8-2 fastmap_1.1.0 colorspace_2.0-3 caTools_1.18.2
# [65] rvest_1.0.2 memoise_2.0.1 haven_2.4.3
Dear Mike,
Thanks for pointing out the issue. I can reproduce it, and the fix will be pushed soon. I'll keep you updated.
Best, Damain.
The bug was in the part of code relating to handling the parsing of identifiers with dot inside. This is now done in preprocessing so it wasn't even necessary. It's fixed in version 2.8.1 (Bionc version 3.15) I cannot retroactively bugfix the older version of Bionconductor so unfortunately you will have to update it.
The fix is pushed to the Bionc github, it should go live in the next few days.
Sorry for the the issue and thanks for the feedaback, Damian.
Great, thanks for the quick work, Damian :)
Cheers, Mike