Hi, I am using the drawProteins package to draw protein domains, as described nicely in several other places. My problem is that some UniProt entries are missing the CHAIN feature, which is required for drawing the background chain in the plot. The CHAIN feature essentially provides the length of a given protein. Is there a way to add this information to the data.frame produced by drawProteins::feature_to_dataframe()? I am new to R, so there is probably an embarrassingly simple solution to this problem. I understand how to add rows to a data.frame, but I do not understand how to add this information to the slightly more complicated data.frame created by drawProteins. Alternatively, I could contact UniProt.
Here is the code I am using. If you replace UniProt ID Q4WXX3 with Q4WVE3 (a different protein), you can see what is missing.
Thanks in advance!
library("drawProteins")
library("ggplot2")

# Download the features for one UniProt accession and flatten them
prot <- drawProteins::get_features("Q4WXX3")
prot_data <- drawProteins::feature_to_dataframe(prot)

# Build the plot layer by layer
p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data,
                 labels = c("AgoA"))
p <- draw_domains(p, prot_data,
                  label_domains = FALSE)
p <- draw_regions(p, prot_data)
p <- draw_repeat(p, prot_data)
p <- draw_motif(p, prot_data)
p <- draw_phospho(p, prot_data, size = 8)

p <- p + theme_bw(base_size = 20) + # white background
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank()) +
  theme(axis.ticks = element_blank(),
        axis.text.y = element_blank()) +
  theme(panel.border = element_blank())
p <- p + theme(legend.position = "bottom") + labs(fill = "")

prot_subtitle <- paste0("\nsource:Uniprot")
p <- p + labs(title = "Protein Domains",
              subtitle = prot_subtitle)
p
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.3
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.38.0 BiocInstaller_1.32.1 forcats_0.3.0 stringr_1.3.1
[5] dplyr_0.7.8 purrr_0.2.5 readr_1.3.1 tidyr_0.8.2
[9] tibble_2.0.1 tidyverse_1.2.1 ggplot2_3.1.0 drawProteins_1.2.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 lubridate_1.7.4 lattice_0.20-38 prettyunits_1.0.2
[5] assertthat_0.2.0 digest_0.6.18 R6_2.3.0 cellranger_1.1.0
[9] plyr_1.8.4 backports_1.1.3 stats4_3.5.1 RSQLite_2.1.1
[13] httr_1.4.0 pillar_1.3.1 rlang_0.3.1 progress_1.2.0
[17] lazyeval_0.2.1 curl_3.3 readxl_1.2.0 rstudioapi_0.9.0
[21] blob_1.1.1 S4Vectors_0.20.1 labeling_0.3 RCurl_1.95-4.11
[25] bit_1.1-14 munsell_0.5.0 broom_0.5.1 compiler_3.5.1
[29] modelr_0.1.2 pkgconfig_2.0.2 BiocGenerics_0.28.0 tidyselect_0.2.5
[33] IRanges_2.16.0 XML_3.98-1.16 crayon_1.3.4 withr_2.1.2
[37] bitops_1.0-6 grid_3.5.1 nlme_3.1-137 jsonlite_1.6
[41] gtable_0.2.0 DBI_1.0.0 magrittr_1.5 scales_1.0.0
[45] cli_1.0.1 stringi_1.2.4 bindrcpp_0.2.2 xml2_1.2.0
[49] generics_0.0.2 tools_3.5.1 bit64_0.9-7 Biobase_2.42.0
[53] glue_1.3.0 hms_0.4.2 parallel_3.5.1 yaml_2.2.0
[57] AnnotationDbi_1.44.0 colorspace_1.4-0 rvest_0.3.2 memoise_1.1.0
[61] bindr_0.1.1 haven_2.0.0
Hi James, Nice job. To add to your answer, we can use the amino acid sequence to calculate the protein length. Here is some code that will do that. Best wishes, Paul
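
The code from this reply is not preserved in the thread, so the following is only a minimal sketch of the idea it describes: take the protein length from the amino acid sequence and add it as a CHAIN row. It is not Paul's original code. The helper `add_chain_row()` is hypothetical and not part of drawProteins. The sketch assumes that the data.frame from `feature_to_dataframe()` has the columns `type`, `description`, `begin`, `end`, `length`, `accession`, `entryName`, `taxid` and `order`, and that `length` is stored as `end - begin`. A mock data.frame is used so that it runs without a network connection.

```r
# Hypothetical helper (not part of drawProteins): prepend a CHAIN row that
# spans the whole protein so that draw_chains() has something to draw.
add_chain_row <- function(features, protein_length,
                          description = "full-length protein") {
  stopifnot(is.data.frame(features), nrow(features) > 0,
            is.numeric(protein_length), protein_length >= 1)
  # Copy the first row to reuse accession, entryName, taxid and order
  chain <- features[1, , drop = FALSE]
  chain$type <- "CHAIN"
  chain$description <- description
  chain$begin <- 1
  chain$end <- protein_length
  chain$length <- protein_length - 1  # assumed convention: end - begin
  out <- rbind(chain, features)
  rownames(out) <- NULL
  out
}

# Offline mock using the column layout assumed above (values are made up)
mock <- data.frame(
  type = c("DOMAIN", "REGION"),
  description = c("example domain", "example region"),
  begin = c(300, 1),
  end = c(420, 50),
  length = c(120, 49),
  accession = "X00000",
  entryName = "MOCK_PROTEIN",
  taxid = 0,
  order = 1,
  stringsAsFactors = FALSE
)
mock_sequence <- strrep("M", 900)  # stand-in for a real amino acid sequence
mock_with_chain <- add_chain_row(mock, nchar(mock_sequence))

# With real data (assumes each element returned by get_features() carries a
# `sequence` string; check with str(prot[[1]]) before relying on it):
# prot <- drawProteins::get_features("Q4WVE3")
# prot_data <- drawProteins::feature_to_dataframe(prot)
# prot_data <- add_chain_row(prot_data, nchar(prot[[1]]$sequence))
```

After this step the rest of the plotting code from the question should work unchanged, since `draw_chains()` only needs a row whose `type` is "CHAIN" with sensible `begin` and `end` values.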