Question

msa alignement: sorted sequences by sequence name

0

Entering edit mode

Olorin ▴ 50

@olorin-12230

Last seen 8.2 years ago

Hello,

I wonder if there is a way to use msaPrettyPrint() with sorted alignement by sequence name without rewriting the output fasta file. I know I can read the output fasta file with python, and re-order the sequences by the sequence name, but I'm pretty sure there is a quickest way to achieve that.

I mean i want to see in fasta file and pdf file (output from msaPrettyPrint):

>A  

seq_A

>B

seq_B

Anybody can help?

msa alignement output sorted • 4.2k views

ADD COMMENT • link 8.2 years ago Olorin ▴ 50

score 1 · Answer 1 · 2017-01-27

TeXshade reorders the sequences by similarity. So changing the order in the FASTA file will not help. My suggestion is to enforce a certain order in the alignment object and then to force TeXshade to leave the order unchanged by an extra tweak.

msaPrettyPrint() uses exactly the same order as in the multiple alignment object. The order of the sequences in the multiple alignment object can be altered by all MSA methods in the 'msa' package: they have an argument 'order' which can either be "aligned" (i.e. as the algorithm reorders them) or "input" (as in the input sequences object). If this is not what you want, you can reorder the sequences as you like by simply using a permutation of the sequence. If you want to order sequences alphabetically by their names, you can do that as in the following example (re-order sequences before alignment and then use 'order="input"'):

library(msa)

filepath <- system.file("examples", "exampleAA.fasta", package="msa")
mySeqs <- readAAStringSet(filepath)

## order sequences alphabetically by names
permutation <- order(names(mySeqs))
myAlignment <- msa(mySeqs[permutation],
                   order="input")  ## forces msa() to leave order as it is
myAlignment

msaPrettyPrint(myAlignment, output="pdf",
               code=paste0("\\orderseqs{1-", nrow(myAlignment), "}")) ## reordering off

Sorry that the solution is sort of complicated, but since TeXshade reorders by default, there seems to be no other way.

score 1 · Answer 2 · 2017-01-28

PROBLEM SOLVED

______________________________

for (i in 1:n){
  current_seq <- readAAStringSet(myFiles[i])
  permutation <- order(names(current_seq )
  current_align <- msa(current_seq[permutation],order="input")
  #I prefer to make tex file first
  msaPrettyPrint(current_align, output="tex", askForOverwrite=FALSE, verbose=FALSE,
                 file = tex_file_1, alFile = fasta_file_1,
                 code=tex_code)
  #compile tex file
  texi2pdf(tex_file_1, clean=TRUE)
  #move outputs files
  file.rename(from = fasta_file_1, to = fasta_file_2)
  file.rename(from = tex_file_1, to = tex_file_2)
  file.rename(from = pdf_file_1, to = pdf_file_2
}

#This code need to be insert in the loop
#Tex_mis take basics customisations (logo consensus etc..)
#Put all basics stuff here
tex_misc = "\\shadingmode{identical}
\\threshold{50}
\\shadingcolors{blues}
\\showsequencelogo[chemical]{top}
\\hidelogoscale
\\shownumbering{right}
\\\showconsensus{}
\\showlegend"
tex_misc = unlist(strsplit(tex_misc, split="\n"))
tex_code = c(tex_misc)
#Put the full name for each sequences
for (i in 1:33){
  ind = permutation[i]
  print(names(current_seq[ind]))
  new_name = gsub(pattern = "_", replacement = "-",names(current_seq[ind]))
  tex_code = c(tex_code,paste("\\nameseq{",i,"}{",new_name,"}",sep =""))
}

________________________________________________________

Good things to know:

https://github.com/Bioconductor-mirror/msa/blob/master/R/msaPrettyPrint.R for helping retreving default customisations (logo, colors, consensus).

You can also use msaPrettyPrint() with only its own arguments (without changing the code), looks at the tex file to find how the basics stuff are written in tex file and copy/paste them in tex_misc in here (that's how I did)

LaTeX doesnt handle every character in sequence name. Any issues may be due to a wrong char in one sequence name.

/!\ for adding several LINES in tex code, the code arguments in msaPrettyPrint() need to have a list ( eg custom_tex_code = c(), where each elements in the list/vector will be a line in the tex file

The code shown above isnt ready to direct use. I use msaPrettyPrint() in a loop because i have several multifasta file to align seperately.

I can answer questions if you have some issues. Now i'm good in this awesome function.

score 0 · Answer 3 · 2017-01-28

0

Entering edit mode

Olorin ▴ 50

@olorin-12230

Last seen 8.2 years ago

Thanks for you great help! Your solutions is tricky but I understand how you solve my problem =)

Your solution works fine. Sequences in fasta file are sorted by name and in the pdf too.

But there are 2 littles issues:

-My sequences names are like : Escherichia coli|gene_name|. In the pdf i get only Escherichia. It's a problem since I have n sequences from the same genus (e.g Escherichia strain_a, Escherichia strain_b, and so on..). I wonder how I can keep the full name.

-A very little issue is I lost the logo on the pdf.

________

EDIT: see my next answer.

ADD COMMENT • link 8.2 years ago Olorin ▴ 50

0

Entering edit mode

There is not always the space to display full names if they are longer. I suggest to use some string processing to shorten the names properly, e.g. to remove "Escherichia coli" entirely (if you only have E.coli sequences anyway) or to abbreviate it by "E.coli". This can be done by altering the names of the original sequences of the row names of the alignment object.

ADD REPLY • link 8.2 years ago UBod ▴ 300

0

Entering edit mode

Yes. I know right. But I need to keep the full species names ( Genus species strain ) but my pdf is nice

ADD REPLY • link 8.2 years ago Olorin ▴ 50