# Motivation behind the neighbor-joining distance matrix recomputation

Bioinformatics Asked by Gogis on September 28, 2021

In each iteration of the neighbour-joining method of phylogenetic trees construction, after joining the nearest neighbours the (additive) distance matrix $$D_{m times m}$$ is recomputed as

$$Q_{ij} = (m-2)d_{ij} – Bigg( sum_{k neq i} d_{ik} Bigg) – Bigg( sum_{k neq j} d_{jk} Bigg)$$

As is stated in numerous sourced that I found, it prevents the joining of non-direct neighbours, but am looking for a more detailed explanation and motivation.

Sources: Wikipedia

Let's look at the components of the equation one by one first since it is easier to describe each part on its own. We can do this because they are added together in computing $$Q_{ij}$$. Component1, $$(m-2)d_{ij}$$ is saying that we want a quantity which would be the total penalization, if every distance for $$m$$ sequences was $$d_{ij}$$ excluding the 2 distances between themselves (it's a direct proportionality so that the $$Q_{ij} propto (m-2)d_{ij}$$). Component2, $$left(sum_{kneq i}d_{ik}right)$$, says that we look at the sum of the distances sequence $$i$$ has against all other sequences excluding itself (it's an inverse proportionality so that the $$Q_{ij} propto left(sum_{kneq i}d_{ik}right)^{-1}$$), and $$left(sum_{kneq j}d_{jk}right)$$ as component3 is analogous.

non-direct neighbours means that on the phylogenetic tree, 2 extant sequences (sequences in the alignment) have another sequence set that can link them together. If two sequences are definitely related to each other then on the phylogenetic tree they will produce a bifurcation and the smallest distance between themselves as being directly related, $$(m-2)d_{ij} < left(sum_{kneq i}d_{ik} + sum_{kneq j}d_{jk}right)$$, eg. the $$d_{ij}=1$$ and the rest are all greater than 1 gives us a low number. If $$f_{max_{i,j}}(d_{ij})$$ is used we'd be using the maximum distance for component1 so the overall measure would be positive and as large as possible.

In summary it provides a measure for placing the sequences together with the closest distance. You may say that this is 'overkill' as we can just search the $$m^2/2-m$$ set of distances and choose the minimum. In the goal for producing the tree, things get more interesting. Let's say look at 3 sequences $$i,j,k$$ of a larger set $$m$$, and see that sequences together have the same distance between them $$d_{ij}=d_{ik}$$ in respect to $$i$$. $$d_{ij}=d_{j*}$$ would mean that $$j$$ is equally distance to all other sequences as it is to $$i$$ and let's say $$d_{kz}=max land d_{ky,yneq z}=d_{ij}$$ (the sequence $$k$$ is equidistance to all sequences except to a particular sequence $$z$$ and can happen in some region in the alignment). Which do we join with $$i$$? According to the formula we choose $$k$$ over $$j$$ because $$k$$ is more distant from other sequence(s) as it would provide a more negative number (lower values cause of the 2nd and 3rd components although component1 is the same in both cases). Effectively we want the minimum distance to $$i$$ and simultaneously the maximal distance from other sequences (penalize sequences which could be closely related to the other sequences helps produce an overall better tree).

Anders Gorm Pedersen (of DTU) in his coursera videos (also on YT) provides some numerical examples to see how the numbers are produced on an example tree.

Correct answer by Vass on September 28, 2021

NJ as with all clustering methods there isn't an explanation, It is a hierarchical clustering method but there are many algorithms to cluster a distance matrix. NJ is closely associated with the concept behind UPGMA (unweighted pair group mean algorithm)

The key difference with UPGMA is that after joining two taxa by NJ the reconstructed genetic distance between the node and one taxa versus the other taxa is not assumed to be equal, i.e. the rate of mutation is heterogeneous. All other clustering algorithms I know assume a "global molecular clock" if that makes any sense, the mutation is constant against time. NJ removes that assumption and thats why we like it.

Answered by M__ on September 28, 2021

## Related Questions

### differential analysis of chip-seq data

3  Asked on August 26, 2021

### Coronavirus RNA structures?

1  Asked on August 26, 2021

### Is it possible for coronavirus or SARS to be synthetic?

1  Asked on August 25, 2021 by sscheme

### Subsetting from seurat object based on orig.ident?

2  Asked on August 25, 2021

### Extract sequences from partial Header

2  Asked on August 25, 2021 by bioinfo

### How to reduce the occupied RAM when you are dealing with a very sparse matrix in a single-cell Experiment in R?

3  Asked on August 22, 2021

### Bacterial DNA at the tail of transcriptome reads. What does that mean?

2  Asked on August 22, 2021 by juanjo75es

### Chromatin accessibility level on promoters with different CpG ratio calculation

1  Asked on August 22, 2021 by krushnach-chandra

### What is the best way to get a large number of RNA seq data from SRA in Python without being denied access

1  Asked on August 22, 2021

### Trying to download a large number of files from ENA programmatically

0  Asked on August 22, 2021

### How can I obtain FTP links to studies in ENA?

3  Asked on August 22, 2021

### Calculating Isoelectric point from a multifasta file

1  Asked on August 22, 2021 by farinelli

### How to fragment genomes into non-overlapping sequences of differing sizes?

2  Asked on August 22, 2021 by stanley-ho

### Kraken2 > OTU format > Phyloseq

0  Asked on August 22, 2021 by reebola95

### downloaded software in Conda environment does not show in ls command

1  Asked on August 22, 2021 by user9196

### nanopore QC measures on fastq file

0  Asked on August 22, 2021 by arun-venugopalan

### How to use pyopenms.FeatureFindingMetabo() for peak picking in pyOpenMS?

2  Asked on August 22, 2021

### Removing data from OrthoFinder output prefix

1  Asked on August 22, 2021 by michael-winter

### How to install/import Biopython into Python 3.8/ PyCharm IDE

1  Asked on August 22, 2021 by bioisaac