Question

Creating a Biostrings PDict object from amino-acid sequences

rubi ▴ 110

@rubi-6462

Last seen 6.4 years ago

Hi,

I'm trying to match a vector of peptide sequences against an AAStringSet to get all perfect matches.

I thought the most straightforward way to do this is to create a PDict object from the vector of peptide sequences using:

PDict(peptide.seq.vec)

And then use one of the matchPDict functions to match that PDict object against the AAStringSet reference and get all perfect matches.

However, running:

PDict(peptide.seq.vec)

Already throws this error:

Error in .Call2("new_XString_from_CHARACTER", classname, x, start(solved_SEW),  : 
  key 73 (char 'I') not in lookup table

peptide.seq.vec[1] is

"KNVSIGIVGKD"

Is PDict expecting DNA sequences only? The documentation for PDict says it accepts a character vector, not necessarily a DNA string.

Any idea?

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
 [1] stats4    parallel  grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Biostrings_2.42.1    XVector_0.14.0       matrixStats_0.51.0   topGO_2.26.0         SparseM_1.72        
 [6] graph_1.50.0         fastcluster_1.1.22   cluster_2.0.5        GO.db_3.4.0          org.Hs.eg.db_3.4.0  
[11] AnnotationDbi_1.36.0 Biobase_2.34.0       gageData_2.12.0      gage_2.24.0          biomaRt_2.30.0      
[16] rtracklayer_1.34.1   GenomicRanges_1.26.2 GenomeInfoDb_1.10.0  IRanges_2.8.1        S4Vectors_0.12.1    
[21] BiocGenerics_0.20.0  doBy_4.5-15          yaml_2.1.14          doParallel_1.0.10    iterators_1.0.8     
[26] foreach_1.4.3        snpEnrichment_1.7.0  fgsea_1.0.2          Rcpp_0.12.8          data.tree_0.6.2     
[31] zoo_1.7-13           gplots_3.0.1         ggdendro_0.1-20      RColorBrewer_1.1-2   venneuler_1.1-0     
[36] rJava_0.9-8          scales_0.4.1         reshape2_1.4.2       plotrix_3.6-3        outliers_0.14       
[41] Hmisc_3.17-4         Formula_1.2-1        survival_2.40-1      lattice_0.20-34      data.table_1.9.6    
[46] edgeR_3.16.1         limma_3.30.2         ggpmisc_0.2.12       dplyr_0.5.0          plyr_1.8.4          
[51] magrittr_1.5         gridExtra_2.2.1      ggplot2_2.2.1        dendextend_1.3.0     ape_4.0             

loaded via a namespace (and not attached):
 [1] colorspace_1.2-7           class_7.3-14               modeltools_0.2-21          mclust_5.2                
 [5] rstudioapi_0.6             flexmix_2.3-13             mvtnorm_1.0-5              codetools_0.2-15          
 [9] splines_3.3.2              snpStats_1.24.0            robustbase_0.92-6          jsonlite_1.1              
[13] Rsamtools_1.26.1           kernlab_0.9-25             png_0.1-7                  DiagrammeR_0.9.0          
[17] httr_1.2.1                 assertthat_0.1             Matrix_1.2-7.1             lazyeval_0.2.0            
[21] acepack_1.4.1              visNetwork_1.0.3           htmltools_0.3.5            tools_3.3.2               
[25] igraph_1.0.1               gtable_0.2.0               fastmatch_1.0-4            rgexf_0.15.3              
[29] trimcluster_0.1-2          gdata_2.17.0               nlme_3.1-128               fpc_2.1-10                
[33] stringr_1.1.0              gtools_3.5.0               XML_3.98-1.4               DEoptimR_1.0-6            
[37] zlibbioc_1.20.0            MASS_7.3-45                SummarizedExperiment_1.2.3 rpart_4.1-10              
[41] latticeExtra_0.6-28        stringi_1.1.2              RSQLite_1.0.0              Rook_1.1-1                
[45] caTools_1.17.1             BiocParallel_1.8.1         chron_2.3-47               prabclus_2.2-6            
[49] bitops_1.0-6               GenomicAlignments_1.8.4    htmlwidgets_0.8            R6_2.2.0                  
[53] DBI_0.5-1                  whisker_0.3-2              foreign_0.8-67             KEGGREST_1.14.0           
[57] RCurl_1.95-4.8             nnet_7.3-12                tibble_1.2                 KernSmooth_2.23-15        
[61] viridis_0.3.4              locfit_1.5-9.1             influenceR_0.1.0           digest_0.6.11             
[65] diptest_0.75-7             brew_1.0-6                 munsell_0.4.3

biostrings pdict • 1.3k views

updated 7.9 years ago by Hervé Pagès 16k • written 7.9 years ago by rubi ▴ 110

score 2 · Accepted Answer · 2017-02-02

Hervé Pagès 16k

@herve-pages-1542

Last seen 7 days ago

Seattle, WA, United States

Hi Rubi,

PDict objects are for DNA sequences only. See the man page:

    The PDict class is a container for storing a preprocessed
    dictionary of DNA patterns...
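As a small illustration of why the error names 'I' in particular: the number 73 in the message is the ASCII code of "I", and the letters before it in the peptide (K, N, V, S) all happen to be valid IUPAC DNA codes, so the first letter that fails a DNA lookup is "I". The sketch below only checks the peptide against `DNA_ALPHABET`; it is not a description of what `PDict()` does internally.

```r
library(Biostrings)

peptide <- "KNVSIGIVGKD"
peptide_letters <- strsplit(peptide, "")[[1]]

# Letters of the peptide that are not in the IUPAC DNA alphabet
bad <- setdiff(peptide_letters, DNA_ALPHABET)

# First offending letter and its ASCII code (the "key" in the error message)
first_bad <- peptide_letters[!(peptide_letters %in% DNA_ALPHABET)][1]
first_bad_code <- utf8ToInt(first_bad)
```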

There are other restrictions on what PDict() can preprocess; see the man page (?PDict) for the details.

If your set of patterns cannot be preprocessed, then don't preprocess it ;-), i.e. pass the patterns as an AAStringSet object directly to one of the matchPDict functions, in place of a PDict. See section D, "USING A NON-PREPROCESSED DICTIONARY", in the examples section of ?matchPDict for some examples.

Cheers,

H.
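A minimal sketch of the non-preprocessed approach, assuming made-up example data (the `peptides` and `proteins` objects and the names `protA` and `protB` are hypothetical, not from the original post). It loops over the subjects one at a time, since `countPDict()` and `matchPDict()` take a single sequence as the subject:

```r
library(Biostrings)

# Hypothetical example data: peptides to look up and proteins to search in
peptides <- AAStringSet(c("KNVSIGIVGKD", "GIVG", "WWW"))
proteins <- AAStringSet(c(protA = "MKNVSIGIVGKDAA", protB = "GIVGGIVG"))

# No PDict() call: the AAStringSet itself is passed as the dictionary.
# Each column holds the exact-match counts of all peptides in one protein.
counts <- vapply(
    seq_along(proteins),
    function(i) countPDict(peptides, proteins[[i]]),
    integer(length(peptides))
)
dimnames(counts) <- list(as.character(peptides), names(proteins))

# Match positions of every peptide in the first protein
hits <- matchPDict(peptides, proteins[[1]])
starts <- startIndex(hits)
```

Here `counts` is a peptides-by-proteins matrix of exact-match counts, and `starts` lists the match start positions in `protA` for each peptide. This will be slower than a preprocessed dictionary, so for a large reference it may be worth timing it on a subset first.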

Also, please see the thread "matching of AAStringSet vs. another AAStringSet" for a similar question and an efficient solution for the exact-matching case, based on the CRAN package AhoCorasickTrie.

H.
