How to reverse complement some elements of a DNAStringSet
1
0
Entering edit mode
@254804fa
Last seen 8 weeks ago
United Kingdom

I have a DNAStringSet from which I need to reverse complement some of the sequences. I found a Biostars response to a very similar question https://www.biostars.org/p/405892/ but that used sub-setting to return only the reverse complemented sequences when I would like to retain all of the sequences. I assumed I could use lapply to vectorise an ifelse function but I can't quite figure out how to make it work.

> # example DNAStringSet
> myFasta <- DNAStringSet(c("CCGAAAACCATGATGGTTGCCAG", "TTACACTCTTTCCCTACACGACGCTC", "TTCCGATCTGTAAAACGACGGCCAGT"))
> # vector of which sequences I would like to be reverse complemented
> myFasta_rc <- c(TRUE, FALSE, TRUE)
> # attempt to lapply ifelse across the DNAStringSet
> lapply(myFasta, function(x) {ifelse(myFasta_rc[] == "TRUE", reverseComplement(myFasta[x]), myFasta[x])})
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'reverseComplement': invalid subscript


> sessionInfo( )
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.70.2                  
 [3] rtracklayer_1.62.0                BiocIO_1.12.0                    
 [5] lubridate_1.9.3                   forcats_1.0.0                    
 [7] stringr_1.5.1                     dplyr_1.1.4                      
 [9] purrr_1.0.2                       readr_2.1.5                      
[11] tidyr_1.3.1                       tibble_3.2.1                     
[13] ggplot2_3.5.1                     tidyverse_2.0.0                  
[15] ShortRead_1.60.0                  GenomicAlignments_1.38.2         
[17] SummarizedExperiment_1.32.0       Biobase_2.62.0                   
[19] MatrixGenerics_1.14.0             matrixStats_1.4.1                
[21] Rsamtools_2.18.0                  GenomicRanges_1.54.1             
[23] Biostrings_2.70.3                 GenomeInfoDb_1.38.8              
[25] XVector_0.42.0                    IRanges_2.36.0                   
[27] S4Vectors_0.40.2                  BiocGenerics_0.48.1              
[29] BiocManager_1.30.25               BiocParallel_1.36.0              

loaded via a namespace (and not attached):
 [1] gtable_0.3.5            rjson_0.2.23            hwriter_1.3.2.1        
 [4] latticeExtra_0.6-30     lattice_0.22-5          tzdb_0.4.0             
 [7] vctrs_0.6.5             tools_4.3.3             bitops_1.0-8           
[10] generics_0.1.3          parallel_4.3.3          fansi_1.0.6            
[13] pkgconfig_2.0.3         Matrix_1.6-5            RColorBrewer_1.1-3     
[16] lifecycle_1.0.4         GenomeInfoDbData_1.2.11 compiler_4.3.3         
[19] deldir_2.0-4            munsell_0.5.1           codetools_0.2-20       
[22] yaml_2.3.10             RCurl_1.98-1.16         pillar_1.9.0           
[25] crayon_1.5.3            DelayedArray_0.28.0     abind_1.4-8            
[28] tidyselect_1.2.1        stringi_1.8.4           restfulr_0.0.15        
[31] grid_4.3.3              colorspace_2.1-1        cli_3.6.3              
[34] SparseArray_1.2.4       magrittr_2.0.3          S4Arrays_1.2.1         
[37] XML_3.99-0.17           utf8_1.2.4              withr_3.0.1            
[40] scales_1.3.0            timechange_0.3.0        jpeg_0.1-10            
[43] interp_1.1-6            png_0.1-8               hms_1.1.3              
[46] rlang_1.1.4             Rcpp_1.0.13             glue_1.8.0             
[49] rstudioapi_0.16.0       R6_2.5.1                zlibbioc_1.48.2
biostr Biostrings • 334 views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

You should always try the vectorized method first, rather than looping.

> myFasta[myFasta_rc] <- reverseComplement(myFasta[myFasta_rc])
> myFasta
DNAStringSet object of length 3:
    width seq
[1]    23 CTGGCAACCATCATGGTTTTCGG
[2]    26 TTACACTCTTTCCCTACACGACGCTC
[3]    26 ACTGGCCGTCGTTTTACAGATCGGAA
>
0
Entering edit mode

Fantastic, thank you

ADD REPLY

Login before adding your answer.

Traffic: 601 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6