Question

How to reverse complement some elements of a DNAStringSet

0

Entering edit mode

Christine Jones • 0

@254804fa

Last seen 5 months ago

United Kingdom

I have a DNAStringSet from which I need to reverse complement some of the sequences. I found a Biostars response to a very similar question https://www.biostars.org/p/405892/ but that used sub-setting to return only the reverse complemented sequences when I would like to retain all of the sequences. I assumed I could use lapply to vectorise an ifelse function but I can't quite figure out how to make it work.

> # example DNAStringSet
> myFasta <- DNAStringSet(c("CCGAAAACCATGATGGTTGCCAG", "TTACACTCTTTCCCTACACGACGCTC", "TTCCGATCTGTAAAACGACGGCCAGT"))
> # vector of which sequences I would like to be reverse complemented
> myFasta_rc <- c(TRUE, FALSE, TRUE)
> # attempt to lapply ifelse across the DNAStringSet
> lapply(myFasta, function(x) {ifelse(myFasta_rc[] == "TRUE", reverseComplement(myFasta[x]), myFasta[x])})
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'reverseComplement': invalid subscript


> sessionInfo( )
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.70.2                  
 [3] rtracklayer_1.62.0                BiocIO_1.12.0                    
 [5] lubridate_1.9.3                   forcats_1.0.0                    
 [7] stringr_1.5.1                     dplyr_1.1.4                      
 [9] purrr_1.0.2                       readr_2.1.5                      
[11] tidyr_1.3.1                       tibble_3.2.1                     
[13] ggplot2_3.5.1                     tidyverse_2.0.0                  
[15] ShortRead_1.60.0                  GenomicAlignments_1.38.2         
[17] SummarizedExperiment_1.32.0       Biobase_2.62.0                   
[19] MatrixGenerics_1.14.0             matrixStats_1.4.1                
[21] Rsamtools_2.18.0                  GenomicRanges_1.54.1             
[23] Biostrings_2.70.3                 GenomeInfoDb_1.38.8              
[25] XVector_0.42.0                    IRanges_2.36.0                   
[27] S4Vectors_0.40.2                  BiocGenerics_0.48.1              
[29] BiocManager_1.30.25               BiocParallel_1.36.0              

loaded via a namespace (and not attached):
 [1] gtable_0.3.5            rjson_0.2.23            hwriter_1.3.2.1        
 [4] latticeExtra_0.6-30     lattice_0.22-5          tzdb_0.4.0             
 [7] vctrs_0.6.5             tools_4.3.3             bitops_1.0-8           
[10] generics_0.1.3          parallel_4.3.3          fansi_1.0.6            
[13] pkgconfig_2.0.3         Matrix_1.6-5            RColorBrewer_1.1-3     
[16] lifecycle_1.0.4         GenomeInfoDbData_1.2.11 compiler_4.3.3         
[19] deldir_2.0-4            munsell_0.5.1           codetools_0.2-20       
[22] yaml_2.3.10             RCurl_1.98-1.16         pillar_1.9.0           
[25] crayon_1.5.3            DelayedArray_0.28.0     abind_1.4-8            
[28] tidyselect_1.2.1        stringi_1.8.4           restfulr_0.0.15        
[31] grid_4.3.3              colorspace_2.1-1        cli_3.6.3              
[34] SparseArray_1.2.4       magrittr_2.0.3          S4Arrays_1.2.1         
[37] XML_3.99-0.17           utf8_1.2.4              withr_3.0.1            
[40] scales_1.3.0            timechange_0.3.0        jpeg_0.1-10            
[43] interp_1.1-6            png_0.1-8               hms_1.1.3              
[46] rlang_1.1.4             Rcpp_1.0.13             glue_1.8.0             
[49] rstudioapi_0.16.0       R6_2.5.1                zlibbioc_1.48.2

biostr Biostrings • 677 views

ADD COMMENT • link updated 5 weeks ago by shrill.asp.tufl • 0 • written 6 months ago by Christine Jones • 0

score 1 · Accepted Answer · 2024-10-14

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 8 hours ago

United States

You should always try the vectorized method first, rather than looping.

> myFasta[myFasta_rc] <- reverseComplement(myFasta[myFasta_rc])
> myFasta
DNAStringSet object of length 3:
    width seq
[1]    23 CTGGCAACCATCATGGTTTTCGG
[2]    26 TTACACTCTTTCCCTACACGACGCTC
[3]    26 ACTGGCCGTCGTTTTACAGATCGGAA
>

ADD COMMENT • link 6 months ago James W. MacDonald 68k

0

Entering edit mode

Fantastic, thank you

ADD REPLY • link 6 months ago Christine Jones • 0

0

Entering edit mode

Dealing with DNAStringSet manipulation can be tricky! When selectively reverse complementing sequences, consider how to efficiently retain the original set. Instead of subsetting, perhaps explore indexing within your lapply function. Think of Omegle conversations: sometimes you want to filter responses, sometimes you need the whole thread. How can you apply that "whole thread" principle to maintain your complete DNAStringSet while modifying specific sequences? There might be a simpler vectorization solution hiding in plain sight.

ADD REPLY • link 5 weeks ago shrill.asp.tufl • 0