Motivation behind the neighbor-joining distance matrix recomputation

Bioinformatics Asked by Gogis on September 28, 2021

In each iteration of the neighbour-joining method of phylogenetic trees construction, after joining the nearest neighbours the (additive) distance matrix $D_{m times m}$ is recomputed as

$$Q_{ij} = (m-2)d_{ij} – Bigg( sum_{k neq i} d_{ik} Bigg) – Bigg( sum_{k neq j} d_{jk} Bigg)$$

As is stated in numerous sourced that I found, it prevents the joining of non-direct neighbours, but am looking for a more detailed explanation and motivation.

Sources: Wikipedia

2 Answers

Let's look at the components of the equation one by one first since it is easier to describe each part on its own. We can do this because they are added together in computing $Q_{ij}$. Component1, $(m-2)d_{ij}$ is saying that we want a quantity which would be the total penalization, if every distance for $m$ sequences was $d_{ij}$ excluding the 2 distances between themselves (it's a direct proportionality so that the $Q_{ij} propto (m-2)d_{ij}$). Component2, $left(sum_{kneq i}d_{ik}right)$, says that we look at the sum of the distances sequence $i$ has against all other sequences excluding itself (it's an inverse proportionality so that the $Q_{ij} propto left(sum_{kneq i}d_{ik}right)^{-1}$), and $left(sum_{kneq j}d_{jk}right)$ as component3 is analogous.

non-direct neighbours means that on the phylogenetic tree, 2 extant sequences (sequences in the alignment) have another sequence set that can link them together. If two sequences are definitely related to each other then on the phylogenetic tree they will produce a bifurcation and the smallest distance between themselves as being directly related, $(m-2)d_{ij} < left(sum_{kneq i}d_{ik} + sum_{kneq j}d_{jk}right)$, eg. the $d_{ij}=1$ and the rest are all greater than 1 gives us a low number. If $f_{max_{i,j}}(d_{ij})$ is used we'd be using the maximum distance for component1 so the overall measure would be positive and as large as possible.

In summary it provides a measure for placing the sequences together with the closest distance. You may say that this is 'overkill' as we can just search the $m^2/2-m$ set of distances and choose the minimum. In the goal for producing the tree, things get more interesting. Let's say look at 3 sequences $i,j,k$ of a larger set $m$, and see that sequences together have the same distance between them $d_{ij}=d_{ik}$ in respect to $i$. $d_{ij}=d_{j*}$ would mean that $j$ is equally distance to all other sequences as it is to $i$ and let's say $d_{kz}=max land d_{ky,yneq z}=d_{ij}$ (the sequence $k$ is equidistance to all sequences except to a particular sequence $z$ and can happen in some region in the alignment). Which do we join with $i$? According to the formula we choose $k$ over $j$ because $k$ is more distant from other sequence(s) as it would provide a more negative number (lower values cause of the 2nd and 3rd components although component1 is the same in both cases). Effectively we want the minimum distance to $i$ and simultaneously the maximal distance from other sequences (penalize sequences which could be closely related to the other sequences helps produce an overall better tree).

Anders Gorm Pedersen (of DTU) in his coursera videos (also on YT) provides some numerical examples to see how the numbers are produced on an example tree.

Correct answer by Vass on September 28, 2021

NJ as with all clustering methods there isn't an explanation, It is a hierarchical clustering method but there are many algorithms to cluster a distance matrix. NJ is closely associated with the concept behind UPGMA (unweighted pair group mean algorithm)

The key difference with UPGMA is that after joining two taxa by NJ the reconstructed genetic distance between the node and one taxa versus the other taxa is not assumed to be equal, i.e. the rate of mutation is heterogeneous. All other clustering algorithms I know assume a "global molecular clock" if that makes any sense, the mutation is constant against time. NJ removes that assumption and thats why we like it.

Answered by M__ on September 28, 2021

Add your own answers!

Related Questions

Extract sequences from partial Header

2  Asked on August 25, 2021 by bioinfo


Calculating Isoelectric point from a multifasta file

1  Asked on August 22, 2021 by farinelli


Kraken2 > OTU format > Phyloseq

0  Asked on August 22, 2021 by reebola95


nanopore QC measures on fastq file

0  Asked on August 22, 2021 by arun-venugopalan


Removing data from OrthoFinder output prefix

1  Asked on August 22, 2021 by michael-winter


Ask a Question

Get help from others!

© 2023 All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir