Question

Colour a phylogenetic tree, based on column in DECIPHER?

0

Entering edit mode

reubenmcgregor88 • 0

@reubenmcgregor88-13722

Last seen 3.8 years ago

Some background in case I have got the wrong idea here. I have 240 amino acid sequences and want to see how closely related these sequences are (as well as visualising it), and how similar they are between different strains of bacteria. I have a data frame with the sequences as well as a column indicating to which strain each sequence belongs.

I have calculated a distance matrix after aligning the amino acid sequences in DECIPHER using the following code:

aa_aligned <- AlignSeqs(NZ_50_mature_D0.4_first50aa)

d <- DistanceMatrix(aa_aligned)

I then clustered them based on this matrix using IdCluster:

c <- IdClusters(d, show=TRUE, verbose = T)

This gives a tree, but with 240 sequences it is hard to read the names of the id's to then relate back to which strain they belong.

I would like to be able to colour the tree based on the "strain" column in my data frame. This way it will be easier visually to see if the sequences form the same strains cluster together. Is this possible?

Note: there are 50 "strains" so a key would be useful too.

If there is a better way t achieve what I am trying to do then any advise is welcome, I am still new to this kind of analysis in R,

Thank you.

r decipher biostrings • 1.9k views

ADD COMMENT • link updated 6.9 years ago by Erik Wright ▴ 150 • written 6.9 years ago by reubenmcgregor88 • 0

score 2 · Accepted Answer · 2018-01-02

2

Entering edit mode

Erik Wright ▴ 150

@erik-wright-14386

Last seen 6 months ago

United States

Thanks for your interest in DECIPHER. Without your sequences it will be hard to provide an exact answer, but this should hopefully get you started...

Here is some artificial data for this example:

aa <- AAStringSet(c(Strain1="NLK", Strain2="NLV", Strain3="NIL"))

aa_aligned <- AlignSeqs(aa)

d <- DistanceMatrix(aa_aligned)

So far this is similar to what you wrote. Next, you will need to get the dendrogram:

tree <- IdClusters(d, type="dendrogram")

I am going to assume that you have a vector of colors named by the strain labels:

colors <- c(Strain1="red", Strain2="blue", Strain3="green")

Then you can color the labels or edges based on the strain name.

tree_colored <- dendrapply(tree,
    FUN=function(n) {
        if(is.leaf(n))
            attr(n, "nodePar") <- list(lab.col=colors[attr(n, "label")], pch=NA)
        n
    })

The key line above is:

attr(n, "nodePar") <- list(lab.col=colors[attr(n, "label")], pch=NA)

This could be modified to color only the leaves of the tree:

attr(n, "edgePar") <- list(col=colors[attr(n, "label")])

See ?dendrogram for more options. Now, we can draw the plot:

plot(tree_colored)

You can add a key with the legend function:

legend("topright", names(colors), text.col=colors)

I hope that helps.

ADD COMMENT • link 6.9 years ago Erik Wright ▴ 150

0

Entering edit mode

Very useful,

The problem is my original data frame is something like this (with about 200 sequences:

id	strain	full_length_emm
GAS131275	emm77.0	GGCTAGAAAAGATACGAATAAACAGTATTCGCTTAGAA
GAS09190	emm87.0	ATGGCTAGAAAAGATACGAATAAACAGTATTCGCTTAG

So I then did the following:

### define where in the table the sequnences are (full_length_emm)
seq <- NZ_50_fullseq$full_length_emm
### and define that the names of the sequences are in the table "id" column
names(seq) <- NZ_50_fullseq$id

### tell biostrings that "seq" is a DNA string
NZ_50_dna <- DNAStringSet(seq)

### translate the DNA to amino acids

NZ_50_aa <- aa <- translate(NZ_50_dna)

### Select first 50 amino acids
NZ_50_aa <- subseq(NZ_50_aa, 1, 50)

So I would like to do the colouring relating back to the original data frame, so the names on the tree remain as the "id" column and the colours are per strain?
On this note it is probably more useful to do align the sequences for each "strain" separately again relating back to the original data frame. Is there a way to call "AlignSeqs" on a subset of the sequences based on the "strains" column mentioned?

Thank you again, and sorry if these are somehow missing the point.

ADD REPLY • link 6.9 years ago reubenmcgregor88 • 0

0

Entering edit mode

If I understand correctly, bullet point #1 sounds like an indexing problem. You could simply lookup the strain name in a vector of colors named by strain.

For bullet point #2, you could subset the sequences before alignment:

w <- which(NZ_50_fullseq$strain=="em77.0")

aa_subset <- aa[w]

ADD REPLY • link 6.9 years ago Erik Wright ▴ 150

0

Entering edit mode

Yes subsetting seems to be the easiest way to do it,

Is there a way to save the image opened in the web browser when using: BrowseSeqs(), rather than doing each strain iteratively and browsing one by one?

ADD REPLY • link 6.9 years ago reubenmcgregor88 • 0

0

Entering edit mode

The BrowseSeqs() function opens a webpage and not an image. You can specify the htmlFile argument to set the filepath.

ADD REPLY • link 6.9 years ago Erik Wright ▴ 150