Entering edit mode
I have a DNAStringSet from which I need to reverse complement some of the sequences. I found a Biostars response to a very similar question https://www.biostars.org/p/405892/ but that used sub-setting to return only the reverse complemented sequences when I would like to retain all of the sequences. I assumed I could use lapply to vectorise an ifelse function but I can't quite figure out how to make it work.
> # example DNAStringSet
> myFasta <- DNAStringSet(c("CCGAAAACCATGATGGTTGCCAG", "TTACACTCTTTCCCTACACGACGCTC", "TTCCGATCTGTAAAACGACGGCCAGT"))
> # vector of which sequences I would like to be reverse complemented
> myFasta_rc <- c(TRUE, FALSE, TRUE)
> # attempt to lapply ifelse across the DNAStringSet
> lapply(myFasta, function(x) {ifelse(myFasta_rc[] == "TRUE", reverseComplement(myFasta[x]), myFasta[x])})
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'reverseComplement': invalid subscript
> sessionInfo( )
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so; LAPACK version 3.9.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8
[5] LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.70.2
[3] rtracklayer_1.62.0 BiocIO_1.12.0
[5] lubridate_1.9.3 forcats_1.0.0
[7] stringr_1.5.1 dplyr_1.1.4
[9] purrr_1.0.2 readr_2.1.5
[11] tidyr_1.3.1 tibble_3.2.1
[13] ggplot2_3.5.1 tidyverse_2.0.0
[15] ShortRead_1.60.0 GenomicAlignments_1.38.2
[17] SummarizedExperiment_1.32.0 Biobase_2.62.0
[19] MatrixGenerics_1.14.0 matrixStats_1.4.1
[21] Rsamtools_2.18.0 GenomicRanges_1.54.1
[23] Biostrings_2.70.3 GenomeInfoDb_1.38.8
[25] XVector_0.42.0 IRanges_2.36.0
[27] S4Vectors_0.40.2 BiocGenerics_0.48.1
[29] BiocManager_1.30.25 BiocParallel_1.36.0
loaded via a namespace (and not attached):
[1] gtable_0.3.5 rjson_0.2.23 hwriter_1.3.2.1
[4] latticeExtra_0.6-30 lattice_0.22-5 tzdb_0.4.0
[7] vctrs_0.6.5 tools_4.3.3 bitops_1.0-8
[10] generics_0.1.3 parallel_4.3.3 fansi_1.0.6
[13] pkgconfig_2.0.3 Matrix_1.6-5 RColorBrewer_1.1-3
[16] lifecycle_1.0.4 GenomeInfoDbData_1.2.11 compiler_4.3.3
[19] deldir_2.0-4 munsell_0.5.1 codetools_0.2-20
[22] yaml_2.3.10 RCurl_1.98-1.16 pillar_1.9.0
[25] crayon_1.5.3 DelayedArray_0.28.0 abind_1.4-8
[28] tidyselect_1.2.1 stringi_1.8.4 restfulr_0.0.15
[31] grid_4.3.3 colorspace_2.1-1 cli_3.6.3
[34] SparseArray_1.2.4 magrittr_2.0.3 S4Arrays_1.2.1
[37] XML_3.99-0.17 utf8_1.2.4 withr_3.0.1
[40] scales_1.3.0 timechange_0.3.0 jpeg_0.1-10
[43] interp_1.1-6 png_0.1-8 hms_1.1.3
[46] rlang_1.1.4 Rcpp_1.0.13 glue_1.8.0
[49] rstudioapi_0.16.0 R6_2.5.1 zlibbioc_1.48.2
Fantastic, thank you
Dealing with DNAStringSet manipulation can be tricky! When selectively reverse complementing sequences, consider how to efficiently retain the original set. Instead of subsetting, perhaps explore indexing within your lapply function. Think of Omegle conversations: sometimes you want to filter responses, sometimes you need the whole thread. How can you apply that "whole thread" principle to maintain your complete DNAStringSet while modifying specific sequences? There might be a simpler vectorization solution hiding in plain sight.