Question

Replace a specific nucleotide sequence, at a specific position, with another in Biostrings


Benjamin ▴ 20

@Benjamin-24571


I have a data frame with a column 'sequences' that contains stretches of nucleotides. I would like to create a new column that introduces a mutation into each sequence, for example replacing "ACA" with "ATA". Importantly, I would like to do this at a specific position, for example, position 2. The sequence "ACAACA" would therefore become "ATAACA". If the sequence does not contain the pattern "ACA" at that position, I would like it to remain unchanged.

I can see that with replaceAt() you can specify x (in this case a DNAStringSet object built from the 'sequences' column), the position (IRanges(1, 3) is the range from position 1 to 3 in the sequence), and the replacement ("ATA"), but it will replace whatever sequence is at this position. Any idea how to make this specific to a sequence? I think I could write an ifelse statement with a grep/regex in it to achieve this, but eventually I would like to build a loop and replace the static "ACA" and "ATA" with vectors or data frame columns holding lists of mutations to iterate through. Any help would be very welcome!

# Example data for convenience
library(Biostrings)
Read <- c("1", "2", "3", "4")
Sequences <- c("ATACCCACG", "AAAGGGAAT", "GCCGATGCG", "ACCAAATCC")
df <- data.frame(Read, Sequences)

# Almost works, but positions 1-3 are overwritten in every sequence
df$Mut <- as.character(replaceAt(DNAStringSet(df$Sequences), IRanges(1, 3), "ATA"))

#sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)


DNASeqData Biostrings • 2.6k views


score 1 · Answer 1 · 2021-01-16

Hello Benjamin, for your example I can suggest this code:

library(Biostrings)

Read <- c("1", "2", "3", "4")
# I changed the sequences a bit so that the example covers ACA -> ATA
Sequences <- c("ACACCCACG", "AAAACAAAT", "GCCGATACA", "AACAAATCC")
df <- data.frame(Read, Sequences)
DS_df <- DNAStringSet(df$Sequences)

## Replace "ACA" at bases 1:3 with "ATA"
# Search only the first three bases of each sequence
at <- subseq(DS_df, start = 1, width = 3)
# Match coordinates are relative to `at`; they are valid for DS_df
# only because the window starts at position 1
midx1_3 <- vmatchPattern("ACA", at, fixed = FALSE)
DS_df2 <- replaceAt(DS_df, midx1_3, value = "ATA")

# Store the result as a plain character column
df$Mut <- as.character(DS_df2)
df
  Read Sequences       Mut
1    1 ACACCCACG ATACCCACG
2    2 AAAACAAAT AAAACAAAT
3    3 GCCGATACA GCCGATACA
4    4 AACAAATCC AACAAATCC

As you can see, only the first sequence changed, since it is the only one with "ACA" at positions 1 to 3.
I just followed section (C) ADVANCED EXAMPLES of ?replaceAt. If you need to do this with different kinds of "mutations" and lists, I suggest reading ?matchPattern, which documents a number of related functions that may help you.
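
For a start position other than 1, match coordinates found inside a subseq() window would no longer line up with the full sequences. One way around this, as a minimal sketch: compare the window to the pattern directly and call replaceAt() only on the sequences that match. The helper name `replace_if_match_at` is made up for illustration and is not part of Biostrings; the sketch assumes exact matching (no IUPAC ambiguity codes) and that `pattern` and `replacement` are plain character strings.

```r
library(Biostrings)

# Hypothetical helper (not a Biostrings function): replace `pattern` with
# `replacement` only in sequences where `pattern` starts exactly at `start`.
replace_if_match_at <- function(x, pattern, replacement, start = 1L) {
  w <- nchar(pattern)
  end <- start + w - 1L

  # Sequences shorter than the window can never match, and subseq()
  # would fail on them, so they are excluded up front.
  long_enough <- width(x) >= end
  hit <- long_enough
  hit[long_enough] <-
    as.character(subseq(x[long_enough], start = start, width = w)) == pattern

  # Work on a character copy and overwrite only the matching entries.
  out <- as.character(x)
  if (any(hit)) {
    out[hit] <- as.character(
      replaceAt(x[hit], IRanges(start, width = w), replacement)
    )
  }
  DNAStringSet(out)
}

x <- DNAStringSet(c("ACACCCACG", "AAAACAAAT", "GCCGATACA", "AACAAATCC"))
mut_start1 <- replace_if_match_at(x, "ACA", "ATA", start = 1)
mut_start7 <- replace_if_match_at(x, "ACA", "ATA", start = 7)
```

With these inputs, `mut_start1` should differ from `x` only in the first sequence and `mut_start7` only in the third, since those are the only sequences carrying "ACA" at the requested positions.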

If you can't find a solution there, there is always dplyr::mutate() with stringr::str_replace().
It depends on how you want your code to perform the iterations; in the end, use whichever approach works best for you.
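
For comparison, the same position-specific replacement can be sketched in base R, without Biostrings, dplyr or stringr, while iterating over a small table of mutations. The helper name `replace_at_position`, the `mutations` table and the generated column names are all made up for illustration; the sketch assumes plain character sequences and exact matching.

```r
# Hypothetical helper (illustrative name): swap `pattern` for `replacement`
# only when `pattern` starts exactly at position `start` of a sequence.
replace_at_position <- function(seqs, pattern, replacement, start = 1L) {
  end <- start + nchar(pattern) - 1L
  # substr() returns a shorter string when a sequence ends before `end`,
  # so too-short sequences simply do not match; which() also drops NAs.
  hit <- which(substr(seqs, start, end) == pattern)
  out <- seqs
  out[hit] <- paste0(
    substr(seqs[hit], 1L, start - 1L),            # bases before the window
    replacement,
    substr(seqs[hit], end + 1L, nchar(seqs[hit])) # bases after the window
  )
  out
}

df <- data.frame(
  Read = c("1", "2", "3", "4"),
  Sequences = c("ACACCCACG", "AAAACAAAT", "GCCGATACA", "AACAAATCC"),
  stringsAsFactors = FALSE
)

# Hypothetical table of mutations: one row per (from, to, start) triple.
mutations <- data.frame(
  from = c("ACA", "ACA"),
  to = c("ATA", "ATA"),
  start = c(1L, 7L),
  stringsAsFactors = FALSE
)

# Add one column per mutation, each derived from the original sequences.
for (i in seq_len(nrow(mutations))) {
  col <- paste0("Mut_", mutations$from[i], "_", mutations$start[i])
  df[[col]] <- replace_at_position(
    df$Sequences, mutations$from[i], mutations$to[i], mutations$start[i]
  )
}

replace_at_position("ACAACA", "ACA", "ATA", start = 1)  # "ATAACA"
```

Each mutation is applied to the original `Sequences` column rather than to the output of the previous one, so the new columns are independent of each other. Chaining mutations instead would only require feeding each result back in as `seqs`.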