Are phylogenetic tree construction algorithms any different than general clustering algorithms?

Question

Are phylogenetic tree construction algorithms any different from clustering algorhithms? I suspect the answer is no.
Of course phylogenetic tree construction uses biological knowledge, e.g special distance metrics, but does it brings anything new to clustering algorithm (e.g hierarchical, neighbor-joining etc).

M__ · Accepted Answer

The goal of a phylogeny is to estimate the "expected" number of mutations between all taxa in the analysis and their hypothetical common ancestors. A cluster-analysis will only identify the "observed" mutations and "expected" and "observed" mutations can be majorly different due to the major artefact of reversion mutation. This is particularly true of nucleotide phylogenies.

The key difference between clustering algorithms based on a "uncorrected" distance matrix and phylogeny is the latter is based on an explicit model to accommodate reversion mutations. The real problem is that there are only 4 bases, so randomly there is 1/4 chance that a mutation at a given position will mutate back to the original, e.g. A->C->A. Observed mutational differences = 0, expected (real) mutational differences = 2. What phylogeny is concerned about is reconstructing that "2". The problem is significant because almost every gene has regions of rapid mutations and regions of low mutations.

The primary way of doing this is via the Jukes-Cantor correction and is foundational in all nucleotide trees expect "p-distances". If nucleotide divergence is lower than 75% then it is possible to estimate the expected number of mutations using the observed via the JC correction. Moreover, when combined with a method of estimating rate variation within a gene, (usually the discrete gamma distribution), JC correction is very effective at recovering the "true tree". This is because fast evolving homologous sites are grouped together - big correction via JC, slow evolving sites are grouped together - small correction via JC. Other approaches to improving the JC correction are via identifying the bias between purine to purine mutations and pyrimidine to pyrimidine and purine to pyrimidine mutations.

The importance of JC was demonstrated by simulation studies of 4 taxa ((a,b), (c,d)), if two strains evolved very quickly when there sister lineages evolve slowly a clustering algorithm will report the fast strains are a sister group, i.e. ((b, c),(a, d)). If the JC method is implemented via maximum likelihood (or Bayesian) then it correctly recovers the true tree ((a,b), (c,d)). The artefact is known as long branch attraction.

Clustering algorithms based on a distance matrix that implements the JC correction, tend to perform poorly for long branch-attraction artefacts. This doesn't mean the correction is useless, just not particularly powerful. The issue is that distance matrix methods don't "stick" to the model and the clustering will introduce a layer of inaccuracy. Normally presentation of a "corrected" distance matrix combined with a clustering algorithm will require bootstrapping (resampling with replacement) to assess whether a given cluster is supported. Parameterised distance matrices, using neighbour-joining clustering in in-conjunction with a bootstrap are considered okay.

The @user172818 mentioned parsimony and this method is considered less reliable because it can't implement a JC correction. IMO it is possible weighted parsimony could make a "come-back", but it would be really complicated to implement a biological weighting method and would require extensive, independent calculations.

user172818 · Answer

A great question, though a little ambiguous. I don't know what "general clustering algorithms" refer to. For biological sequences, building a tree can be thought as a way of clustering. Anyway...

There are different tree building algorithms. Max parsimony (MP), max likelihood (ML) and bayesian algorithms directly take sequences as input. They are distinct from distance based clustering.

Then there is a class of distance based algorithms in phylogenetics. They start from an all-pair distance matrix and aim to find a tree that is most compatible with the matrix. Among them, UPGMA is largely hierarchical clustering. Neighbor-joining is somewhat similar to hierarchical clustering, but it builds unrooted trees. FastME uses a very different approach. Briefly, for each topology it tries to find the best branch lengths by weighted least square (for details, see its paper); the best topology minimizes total branch lengths. FastME finds the best topology via nearest neighbor interchange (NNI), closer to some ML algorithms (e.g. PhyML).

does it brings anything new at the level of clustering algorithm?

Personally, I think the distance based tree building methods are more sophisticated than naive hierarchical clustering, provided that the real relationship follows a tree topology.

Are phylogenetic tree construction algorithms any different than general clustering algorithms?

2 Answers

Add your own answers!

Ask a Question