Performing basic Multiple Sequence Alignment in R?
2
0
Entering edit mode
Tal Galili ▴ 20
@tal-galili-4408
Last seen 10.1 years ago
Hello everyone, I am not sure if this should go on the general R mailing list (for example, if there is a text mining solution that might work here) or the bioconductor mailing list (since I wasn't able to find a solution to my question on searching their lists) - so this time I tried both, and in the future I'll know better (in case it should go to only one of the two). The task I'm trying to achieve is to align several sequences together. I don't have a basic pattern to match to. All that I know is that the "True" pattern should be of length "30" and that the sequences I'm looking at, have had missing values introduced to them at random points. Here is an example of such sequences, were on the left we see what is the real location of the missing values, and on the right we see the sequence that we will be able to observe. My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same) Real_sequence The_sequence_we_see 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG Here is an example code to reproduce the above example: ATCG <- c("A","T","C","G") set.seed(40) original.seq <- sample(ATCG, 30, T) seqS <- matrix(original.seq,200,30, T) change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG) { number.of.changes <- sample(seq_len(number.of.changes), 1) new.letters <- sample(letters.to.change.with , number.of.changes, T) where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F) x[where.to.change.the.letters] <- new.letters return(x) } change.letters(original.seq) insert.missing.values <- function(x) change.letters(x, 3, "-") insert.missing.values(original.seq) seqS2 <- t(apply(seqS, 1, change.letters)) seqS3 <- t(apply(seqS2, 1, insert.missing.values)) seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")}) require(stringr) # library(help=stringr) all.seqS <- str_replace(seqS4,"-" , "") # how do we allign this? data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS) I understand that if all I had was a string and a pattern I would be able to use library(Biostrings) pairwiseAlignment(...) But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern). Is there a known method for doing this in R? Thanks, Tal ----------------Contact Details:------------------------------------------------------- Contact me: Tal.Galili@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) ---------------------------------------------------------------------- ------------------------ [[alternative HTML version deleted]]
GO GO • 2.8k views
ADD COMMENT
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 6 months ago
United States
To computer the multiple alignments, you need to use external programs like dialign, ClustalW, T-Coffee, MUSCLE or Multalin. If you are familiar with the command-line then you can execute them from R with the system() function, and then import the resulting alignment files into R as XMultipleAlignment objects. For details see: http://bioconductor.org/packages/release/bioc/html/Biostrings.html. For handling multiple alignements in R, you also may want take a look at some of these examples http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Multiple- Sequence-Alignments-MSAs- or the ape library at CRAN. Thomas On Tue, Dec 21, 2010 at 11:29:05AM +0200, Tal Galili wrote: > Hello everyone, > > I am not sure if this should go on the general R mailing list (for example, > if there is a text mining solution that might work here) or the bioconductor > mailing list (since I wasn't able to find a solution to my question on > searching their lists) - so this time I tried both, and in the future I'll > know better (in case it should go to only one of the two). > > > The task I'm trying to achieve is to align several sequences together. > I don't have a basic pattern to match to. All that I know is that the > "True" pattern should be of length "30" and that the sequences I'm looking > at, have had missing values introduced to them at random points. > Here is an example of such sequences, were on the left we see what is the > real location of the missing values, and on the right we see the sequence > that we will be able to observe. My goal is to reconstruct the left column > using only the sequences I've got on the right column (based on the fact > that many of the letters in each position are the same) > > Real_sequence The_sequence_we_see > > 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG > > 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG > > 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG > > 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG > > 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG > > 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG > > 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG > > > Here is an example code to reproduce the above example: > > ATCG <- c("A","T","C","G") > > set.seed(40) > > > original.seq <- sample(ATCG, 30, T) > > > seqS <- matrix(original.seq,200,30, T) > > > change.letters <- function(x, number.of.changes = 15, > letters.to.change.with = ATCG) > > { > > > number.of.changes <- sample(seq_len(number.of.changes), 1) > > > new.letters <- sample(letters.to.change.with , number.of.changes, T) > > > where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F) > > > x[where.to.change.the.letters] <- new.letters > > > return(x) > > } > > > change.letters(original.seq) > > > insert.missing.values <- function(x) change.letters(x, 3, "-") > > > insert.missing.values(original.seq) > > > seqS2 <- t(apply(seqS, 1, change.letters)) > > > seqS3 <- t(apply(seqS2, 1, insert.missing.values)) > > > seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")}) > > require(stringr) > > # library(help=stringr) > > > all.seqS <- str_replace(seqS4,"-" , "") > > > # how do we allign this? > > > data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS) > > > I understand that if all I had was a string and a pattern I would be able to > use > > library(Biostrings) > > > pairwiseAlignment(...) > > > > But in the case I present we are dealing with many sequences to align to one > another (instead of aligning them to one pattern). > > Is there a known method for doing this in R? > > > Thanks, > > Tal > > > > ----------------Contact > Details:------------------------------------------------------- > Contact me: Tal.Galili at gmail.com | 972-52-7275845 > Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | > www.r-statistics.com (English) > -------------------------------------------------------------------- -------------------------- > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Tal Galili ▴ 20
@tal-galili-4408
Last seen 10.1 years ago
Hello everyone, I am not sure if this should go on the general R mailing list (for example, if there is a text mining solution that might work here) or the bioconductor mailing list (since I wasn't able to find a solution to my question on searching their lists) - so this time I tried both, and in the future I'll know better (in case it should go to only one of the two). The task I'm trying to achieve is to align several sequences together. I don't have a basic pattern to match to. All that I know is that the "True" pattern should be of length "30" and that the sequences I'm looking at, have had missing values introduced to them at random points. Here is an example of such sequences, were on the left we see what is the real location of the missing values, and on the right we see the sequence that we will be able to observe. My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same) Real_sequence The_sequence_we_see 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG Here is an example code to reproduce the above example: ATCG <- c("A","T","C","G") set.seed(40) original.seq <- sample(ATCG, 30, T) seqS <- matrix(original.seq,200,30, T) change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG) { number.of.changes <- sample(seq_len(number.of.changes), 1) new.letters <- sample(letters.to.change.with , number.of.changes, T) where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F) x[where.to.change.the.letters] <- new.letters return(x) } change.letters(original.seq) insert.missing.values <- function(x) change.letters(x, 3, "-") insert.missing.values(original.seq) seqS2 <- t(apply(seqS, 1, change.letters)) seqS3 <- t(apply(seqS2, 1, insert.missing.values)) seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")}) require(stringr) # library(help=stringr) all.seqS <- str_replace(seqS4,"-" , "") # how do we allign this? data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS) I understand that if all I had was a string and a pattern I would be able to use library(Biostrings) pairwiseAlignment(...) But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern). Is there a known method for doing this in R? Thanks, Tal ----------------Contact Details:------------------------------------------------------- Contact me: Tal.Galili@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) ---------------------------------------------------------------------- ------------------------ [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
I don't have an answer, trying to solicit more input with additional questions. > From: tal.galili at gmail.com > Date: Tue, 21 Dec 2010 11:21:03 +0200 > To: r-help at r-project.org; bioconductor at r-project.org > Subject: [R] Performing basic Multiple Sequence Alignment in R? > > Hello everyone, > > I am not sure if this should go on the general R mailing list (for example, > if there is a text mining solution that might work here) or the bioconductor > mailing list (since I wasn't able to find a solution to my question on > searching their lists) - so this time I tried both, and in the future I'll > know better (in case it should go to only one of the two). > I take it you don't want an R interface for clustal and I seem to recall, from doing this a few years ago, that alignment by exact string matching was a bit of a research area ( I think you can find papers on citeseer for example). It does seem you are asking about exact string matches for alignment markers- your left sequences appear exactly someplace on the right- but your overall interests are not real clear. I never got my code fully working but I was happy that I could do different strains of e coli ( or something in the 5-10 Mbp genome range ) very quickly ( seconds as I recall ) and you could also presumably find similar items that had moved a long way. Earlier someone came here with a task and was pointed to bio packages but I thought there may be something in computational linguistics or mining better suited to needs but no one ever volunteered anything. > > The task I'm trying to achieve is to align several sequences together. > I don't have a basic pattern to match to. All that I know is that the > "True" pattern should be of length "30" and that the sequences I'm looking > at, have had missing values introduced to them at random points. Alternatively I guess someone could make an R interface for various BLAST's, sometimes the help desk at NCBI can get questions like this to the right person internally. > Here is an example of such sequences, were on the left we see what is the > real location of the missing values, and on the right we see the sequence > that we will be able to observe. My goal is to reconstruct the left column > using only the sequences I've got on the right column (based on the fact > that many of the letters in each position are the same) > > Real_sequence The_sequence_we_see > 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG > 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG > 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG > 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG > 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG > 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG > 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG >
ADD REPLY

Login before adding your answer.

Traffic: 359 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6