Performing basic Multiple Sequence Alignment in R?

0

Entering edit mode

Tal Galili ▴ 20

@tal-galili-4408

Last seen 10.1 years ago

Hello everyone, I am not sure if this should go on the general R mailing list (for example, if there is a text mining solution that might work here) or the bioconductor mailing list (since I wasn't able to find a solution to my question on searching their lists) - so this time I tried both, and in the future I'll know better (in case it should go to only one of the two). The task I'm trying to achieve is to align several sequences together. I don't have a basic pattern to match to. All that I know is that the "True" pattern should be of length "30" and that the sequences I'm looking at, have had missing values introduced to them at random points. Here is an example of such sequences, were on the left we see what is the real location of the missing values, and on the right we see the sequence that we will be able to observe. My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same) Real_sequence The_sequence_we_see 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG Here is an example code to reproduce the above example: ATCG <- c("A","T","C","G") set.seed(40) original.seq <- sample(ATCG, 30, T) seqS <- matrix(original.seq,200,30, T) change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG) { number.of.changes <- sample(seq_len(number.of.changes), 1) new.letters <- sample(letters.to.change.with , number.of.changes, T) where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F) x[where.to.change.the.letters] <- new.letters return(x) } change.letters(original.seq) insert.missing.values <- function(x) change.letters(x, 3, "-") insert.missing.values(original.seq) seqS2 <- t(apply(seqS, 1, change.letters)) seqS3 <- t(apply(seqS2, 1, insert.missing.values)) seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")}) require(stringr) # library(help=stringr) all.seqS <- str_replace(seqS4,"-" , "") # how do we allign this? data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS) I understand that if all I had was a string and a pattern I would be able to use library(Biostrings) pairwiseAlignment(...) But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern). Is there a known method for doing this in R? Thanks, Tal ----------------Contact Details:------------------------------------------------------- Contact me: Tal.Galili@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) ---------------------------------------------------------------------- ------------------------ [[alternative HTML version deleted]]

GO GO • 2.8k views

ADD COMMENT • link 13.8 years ago Tal Galili ▴ 20

0

Entering edit mode

Thomas Girke ★ 1.7k

@thomas-girke-993

Last seen 6 months ago

United States

To computer the multiple alignments, you need to use external programs like dialign, ClustalW, T-Coffee, MUSCLE or Multalin. If you are familiar with the command-line then you can execute them from R with the system() function, and then import the resulting alignment files into R as XMultipleAlignment objects. For details see: http://bioconductor.org/packages/release/bioc/html/Biostrings.html. For handling multiple alignements in R, you also may want take a look at some of these examples http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Multiple- Sequence-Alignments-MSAs- or the ape library at CRAN. Thomas On Tue, Dec 21, 2010 at 11:29:05AM +0200, Tal Galili wrote: > Hello everyone, > > I am not sure if this should go on the general R mailing list (for example, > if there is a text mining solution that might work here) or the bioconductor > mailing list (since I wasn't able to find a solution to my question on > searching their lists) - so this time I tried both, and in the future I'll > know better (in case it should go to only one of the two). > > > The task I'm trying to achieve is to align several sequences together. > I don't have a basic pattern to match to. All that I know is that the > "True" pattern should be of length "30" and that the sequences I'm looking > at, have had missing values introduced to them at random points. > Here is an example of such sequences, were on the left we see what is the > real location of the missing values, and on the right we see the sequence > that we will be able to observe. My goal is to reconstruct the left column > using only the sequences I've got on the right column (based on the fact > that many of the letters in each position are the same) > > Real_sequence The_sequence_we_see > > 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG > > 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG > > 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG > > 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG > > 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG > > 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG > > 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG > > > Here is an example code to reproduce the above example: > > ATCG <- c("A","T","C","G") > > set.seed(40) > > > original.seq <- sample(ATCG, 30, T) > > > seqS <- matrix(original.seq,200,30, T) > > > change.letters <- function(x, number.of.changes = 15, > letters.to.change.with = ATCG) > > { > > > number.of.changes <- sample(seq_len(number.of.changes), 1) > > > new.letters <- sample(letters.to.change.with , number.of.changes, T) > > > where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F) > > > x[where.to.change.the.letters] <- new.letters > > > return(x) > > } > > > change.letters(original.seq) > > > insert.missing.values <- function(x) change.letters(x, 3, "-") > > > insert.missing.values(original.seq) > > > seqS2 <- t(apply(seqS, 1, change.letters)) > > > seqS3 <- t(apply(seqS2, 1, insert.missing.values)) > > > seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")}) > > require(stringr) > > # library(help=stringr) > > > all.seqS <- str_replace(seqS4,"-" , "") > > > # how do we allign this? > > > data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS) > > > I understand that if all I had was a string and a pattern I would be able to > use > > library(Biostrings) > > > pairwiseAlignment(...) > > > > But in the case I present we are dealing with many sequences to align to one > another (instead of aligning them to one pattern). > > Is there a known method for doing this in R? > > > Thanks, > > Tal > > > > ----------------Contact > Details:------------------------------------------------------- > Contact me: Tal.Galili at gmail.com | 972-52-7275845 > Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | > www.r-statistics.com (English) > -------------------------------------------------------------------- -------------------------- > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 13.8 years ago Thomas Girke ★ 1.7k

0

Entering edit mode

Tal Galili ▴ 20

@tal-galili-4408

Last seen 10.1 years ago

Hello everyone, I am not sure if this should go on the general R mailing list (for example, if there is a text mining solution that might work here) or the bioconductor mailing list (since I wasn't able to find a solution to my question on searching their lists) - so this time I tried both, and in the future I'll know better (in case it should go to only one of the two). The task I'm trying to achieve is to align several sequences together. I don't have a basic pattern to match to. All that I know is that the "True" pattern should be of length "30" and that the sequences I'm looking at, have had missing values introduced to them at random points. Here is an example of such sequences, were on the left we see what is the real location of the missing values, and on the right we see the sequence that we will be able to observe. My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same) Real_sequence The_sequence_we_see 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG Here is an example code to reproduce the above example: ATCG <- c("A","T","C","G") set.seed(40) original.seq <- sample(ATCG, 30, T) seqS <- matrix(original.seq,200,30, T) change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG) { number.of.changes <- sample(seq_len(number.of.changes), 1) new.letters <- sample(letters.to.change.with , number.of.changes, T) where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F) x[where.to.change.the.letters] <- new.letters return(x) } change.letters(original.seq) insert.missing.values <- function(x) change.letters(x, 3, "-") insert.missing.values(original.seq) seqS2 <- t(apply(seqS, 1, change.letters)) seqS3 <- t(apply(seqS2, 1, insert.missing.values)) seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")}) require(stringr) # library(help=stringr) all.seqS <- str_replace(seqS4,"-" , "") # how do we allign this? data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS) I understand that if all I had was a string and a pattern I would be able to use library(Biostrings) pairwiseAlignment(...) But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern). Is there a known method for doing this in R? Thanks, Tal ----------------Contact Details:------------------------------------------------------- Contact me: Tal.Galili@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) ---------------------------------------------------------------------- ------------------------ [[alternative HTML version deleted]]

ADD COMMENT • link 13.8 years ago Tal Galili ▴ 20

0

Entering edit mode

I don't have an answer, trying to solicit more input with additional questions. > From: tal.galili at gmail.com > Date: Tue, 21 Dec 2010 11:21:03 +0200 > To: r-help at r-project.org; bioconductor at r-project.org > Subject: [R] Performing basic Multiple Sequence Alignment in R? > > Hello everyone, > > I am not sure if this should go on the general R mailing list (for example, > if there is a text mining solution that might work here) or the bioconductor > mailing list (since I wasn't able to find a solution to my question on > searching their lists) - so this time I tried both, and in the future I'll > know better (in case it should go to only one of the two). > I take it you don't want an R interface for clustal and I seem to recall, from doing this a few years ago, that alignment by exact string matching was a bit of a research area ( I think you can find papers on citeseer for example). It does seem you are asking about exact string matches for alignment markers- your left sequences appear exactly someplace on the right- but your overall interests are not real clear. I never got my code fully working but I was happy that I could do different strains of e coli ( or something in the 5-10 Mbp genome range ) very quickly ( seconds as I recall ) and you could also presumably find similar items that had moved a long way. Earlier someone came here with a task and was pointed to bio packages but I thought there may be something in computational linguistics or mining better suited to needs but no one ever volunteered anything. > > The task I'm trying to achieve is to align several sequences together. > I don't have a basic pattern to match to. All that I know is that the > "True" pattern should be of length "30" and that the sequences I'm looking > at, have had missing values introduced to them at random points. Alternatively I guess someone could make an R interface for various BLAST's, sometimes the help desk at NCBI can get questions like this to the right person internally. > Here is an example of such sequences, were on the left we see what is the > real location of the missing values, and on the right we see the sequence > that we will be able to observe. My goal is to reconstruct the left column > using only the sequences I've got on the right column (based on the fact > that many of the letters in each position are the same) > > Real_sequence The_sequence_we_see > 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG > 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG > 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG > 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG > 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG > 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG > 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG >

ADD REPLY • link 13.8 years ago Mike Marchywka ▴ 10

Login before adding your answer.