msa alignement: sorted sequences by sequence name
3
0
Entering edit mode
Olorin ▴ 50
@olorin-12230
Last seen 8.0 years ago

Hello,

I wonder if there is a way to use msaPrettyPrint() with sorted alignement by sequence name without rewriting the output fasta file. I know I can read the output fasta file with python, and re-order the sequences by the sequence name, but I'm pretty sure there is a quickest way to achieve that.

I mean i want to see in fasta file and pdf file (output from msaPrettyPrint):

>A  

seq_A

>B

seq_B

Anybody can help?

msa alignement output sorted • 3.9k views
ADD COMMENT
1
Entering edit mode
UBod ▴ 300
@ubodenhofer-5425
Last seen 8 months ago
University of Applied Sciences Upper Au…

TeXshade reorders the sequences by similarity. So changing the order in the FASTA file will not help. My suggestion is to enforce a certain order in the alignment object and then to force TeXshade to leave the order unchanged by an extra tweak.

 

msaPrettyPrint() uses exactly the same order as in the multiple alignment object. The order of the sequences in the multiple alignment object can be altered by all MSA methods in the 'msa' package: they have an argument 'order' which can either be "aligned" (i.e. as the algorithm reorders them) or "input" (as in the input sequences object). If this is not what you want, you can reorder the sequences as you like by simply using a permutation of the sequence. If you want to order sequences alphabetically by their names, you can do that as in the following example (re-order sequences before alignment and then use 'order="input"'):

library(msa)

filepath <- system.file("examples", "exampleAA.fasta", package="msa")
mySeqs <- readAAStringSet(filepath)

## order sequences alphabetically by names
permutation <- order(names(mySeqs))
myAlignment <- msa(mySeqs[permutation],
                   order="input")  ## forces msa() to leave order as it is
myAlignment

msaPrettyPrint(myAlignment, output="pdf",
               code=paste0("\\orderseqs{1-", nrow(myAlignment), "}")) ## reordering off

Sorry that the solution is sort of complicated, but since TeXshade reorders by default, there seems to be no other way.

 

ADD COMMENT
1
Entering edit mode
Olorin ▴ 50
@olorin-12230
Last seen 8.0 years ago

PROBLEM SOLVED

______________________________

for (i in 1:n){
  current_seq <- readAAStringSet(myFiles[i])
  permutation <- order(names(current_seq )
  current_align <- msa(current_seq[permutation],order="input")
  #I prefer to make tex file first
  msaPrettyPrint(current_align, output="tex", askForOverwrite=FALSE, verbose=FALSE,
                 file = tex_file_1, alFile = fasta_file_1,
                 code=tex_code)
  #compile tex file
  texi2pdf(tex_file_1, clean=TRUE)
  #move outputs files
  file.rename(from = fasta_file_1, to = fasta_file_2)
  file.rename(from = tex_file_1, to = tex_file_2)
  file.rename(from = pdf_file_1, to = pdf_file_2
}

 

#This code need to be insert in the loop
#Tex_mis take basics customisations (logo consensus etc..)
#Put all basics stuff here
tex_misc = "\\shadingmode{identical}
\\threshold{50}
\\shadingcolors{blues}
\\showsequencelogo[chemical]{top}
\\hidelogoscale
\\shownumbering{right}
\\\showconsensus{}
\\showlegend"
tex_misc = unlist(strsplit(tex_misc, split="\n"))
tex_code = c(tex_misc)
#Put the full name for each sequences
for (i in 1:33){
  ind = permutation[i]
  print(names(current_seq[ind]))
  new_name = gsub(pattern = "_", replacement = "-",names(current_seq[ind]))
  tex_code = c(tex_code,paste("\\nameseq{",i,"}{",new_name,"}",sep =""))
}

________________________________________________________

Good things to know:

https://github.com/Bioconductor-mirror/msa/blob/master/R/msaPrettyPrint.R for helping retreving default customisations (logo, colors, consensus).

You can also use msaPrettyPrint() with only its own arguments (without changing the code), looks at the tex file to find how the basics stuff are written in tex file and copy/paste them in tex_misc in here (that's how I did)

LaTeX doesnt handle every character in sequence name. Any issues may be due to a wrong char in one sequence name.

/!\ for adding several LINES in tex code, the code arguments in msaPrettyPrint() need to have a list ( eg custom_tex_code = c(), where each elements in the list/vector will be a line in the tex file

The code shown above isnt ready to direct use. I use msaPrettyPrint() in a loop because i have several multifasta file to align seperately.

I can answer questions if you have some issues. Now i'm good in this awesome function.

ADD COMMENT
0
Entering edit mode

Awesome, Olorin! May I add a few more bits:

  • Not all of your customizations really need to be done in the LaTeX source. The most common customizations can be made using optional arguments of the msaPrettyPrint() function. See '?msaPrettyPrint' for details. 
  • Those interested in where more details about LaTeX customiizations can be found, have a look at the TeXshade manual: http://mirrors.ctan.org/macros/latex/contrib/texshade/texshade.pdf
ADD REPLY
0
Entering edit mode

I saw right. you can see msaPrettyPrint help. If you add code arguments, it overrides all other arguments:

code    
this argument can be used to specify the entire LaTeX code in the texshade environment. This overrides all arguments that customize the appearance of the output. Instead, all customizations must be done as LaTeX commands provided by the package texshade.sty directly. This option should only be used by expert users and for special applciations in which the possibilities of the customizations of the msaPrettyPrint function turn out to be insufficient.

- Ty for the links

ADD REPLY
0
Entering edit mode

That is true, 'code' overrides everything else. However, if you just want to add some customizations on top of what msaPrettyPrint() configures by itself, you can use the 'furtherCode' argument which only adds TeX code without overriding the rest.

ADD REPLY
0
Entering edit mode

Good to know. thank you

ADD REPLY
0
Entering edit mode
Olorin ▴ 50
@olorin-12230
Last seen 8.0 years ago

Thanks for you great help! Your solutions is tricky but I understand how you solve my problem =)

Your solution works fine. Sequences in fasta file are sorted by name and in the pdf too.

But there are 2 littles issues:

-My sequences names are like : Escherichia coli|gene_name|. In the pdf i get only Escherichia. It's a problem since I have n sequences from the same genus (e.g Escherichia strain_a, Escherichia strain_b, and so on..). I wonder how I can keep the full name.

-A very little issue is I lost the logo on the pdf.

________

EDIT: see my next answer.

ADD COMMENT
0
Entering edit mode

There is not always the space to display full names if they are longer. I suggest to use some string processing to shorten the names properly, e.g. to remove "Escherichia coli" entirely (if you only have E.coli sequences anyway) or to abbreviate it by "E.coli". This can be done by altering the names of the original sequences of the row names of the alignment object.

ADD REPLY
0
Entering edit mode

Yes. I know right. But I need to keep the full species names ( Genus species strain ) but my pdf is nice

ADD REPLY

Login before adding your answer.

Traffic: 546 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6