# Measure the purifying selection for certain taxa along a phylogeny

Bioinformatics Asked on December 14, 2020

I am posting this message because I need clarity about the analysis I want to perform.
In my analysis, I have a homologous gene in 13 species, and I would like to evaluate the selection pressure of this gene (its dN/dS) in 3 species of the tree and test if this value is significantly different from the neutral model (dN/dS = 1).

So here is the protocol I am considering, and I would like you to tell me if I am right or not:

To calculate this selection pressure, the ratio of non-synonymous to synonymous substitution rates (dN/dS) must be calculated.

For this I use CODEML implemented in ete3.

So I need :

A phylogenetic tree of the gene (performed with the AA alignment of the gene via Iqtree)

A codon alignment (performed with the DNA alignment of the gene with pal2nal)

Mark branches in the tree.


Here is the exemple where I want to get the dN/dS value of SP5,6 and 7.:

         /-SP1
/-|
|   -SP2
|
/-|      /-SP3
|  |   /-|
|  |  |   -SP4
|   -|
|     |   /-SP5
/-|      - |  /-SP6
|  |         -|
|  |           |
|  |            -SP7
|  |
|  |   /-SP8
-|   -|
|      -SP9
|
|     /-SP10
|  /-|
| |   -SP11
|-|
|   /-SP12
-|
-SP13


And I want to calculate the dN/dS of SP5,SP6 and SP7 in order to confirme the hypotesis that the gene is under strong purifying selection on this 3 species.

I mark the SP5, SP6 and SP7, run a free-branch model and look at the Codeml output (the dN/dS value should be the same for the 3 marked species right?) Then this value is kind of the mean dN/dS value between three leafs?

Suppose I get an omega = 0.15
If I want to test that this value is significantly different from 1, I need to mark the same sheets and run the neutral model, then compare the two models (neutral vs. free) using chi2, and if the p-value is <0.5, then the dN/dS value (0.15) is significantly different from 1.
I’m right ?

The following is not the clearest explanation of dN/dS humankind has seen, but I hope you get the gist and core idea of likelihood is absolutely correct. I am wondering if you are thinking of a different test to the way I would do this calculation. This ain't a text book clarity answer.

Anyway what you are proposing is a likelihood ratio test (LRT) for selection based on the parameterisation of omega. Your idea isn't quite correct ... its not dN/dS value you are looking at to assess significance, its the likelihood. This point is critical and is the most important difference in our thinking on this test.

LRT works because its a log likelihood and subtracting two log likelihood is the same as dividing them (I forget the theory - but you subtract the likelihood to work out the p-value via Chi squared). There's some theory about why the Chi-squared distribution works but I can't remember it.

You need to work out the degrees of freedom which is the parameter differences between 'free branch model' and the neutral model and this value is LOW. If the neutral model is the "global model", then its 2 degrees of freedom - IN YOUR TREE. This is because the SP you are looking at form a single clade (monophyly). From memory you stick a #1 on the branch leading into the clade, ie. defining the node and thats your 'model' on the treefile. Here you have two dN/dS one for the whole tree excluding your clade and one for your taxa. This gives you a likelihood.

Note you could stick a #1 on each branch within the clade, i.e. the treefile. That means something different - like every branch was subject to ... in this case purifying selection ... not just the common ancestor.

You now work out the likelihood of the 'global' model... i.e. remove the #1 and make sure it gives you only 1 dN/dS value.

Subtract the likelihoods and multiple by two - if its LRT (super important).

You now need to work out the degrees of freedom. If its the same model but one has an additional node with a different dN/dS ... the df = 2. If you've dumped a #1 on every branch in the clade ... I still think df = 2 - you need to check.

If the difference in likelihood results in p < 0.05 you are correct the difference is significant ... whatever that difference might be.

If the neutral model is some different calculation of omega within the population genetics process (see here) then it will be the parameter differences between your model and the 'neutral' model.

To reiterate I like simulations so much I type in reiterations ... So again the LRT usual will usually subtract one likelihood from another and THEM multiply the probability by two ... and compare the result on the Chi squared table for the difference in degrees of freedom. And again what you are looking at is not the value of dN/dS but the difference in LIKELIHOOD. That is very important. You need to be careful with the degrees of freedom on an LRT because they can introduce an additional subtraction constant, test dependings (its ages since I've done one so I can't remember).

If the test is significant and one dN/dS is x for the clade and y for the rest of the tree, then that differences is significant and if y is >1 (bit unlikely) and x < 1 then sure thats purifying selection. Usually clades evolve via positive selection rather than purifying selection.

From memory that is my understanding ... but if you have a neutral model of omega and a purifying model of omega ... yeah sort of, but usually thats for a 'global' model, i.e. no differences within the tree. In your case your interested in a specific clade. The idea is still the same its measured through likelihood.

Its so long since I've done this, but I do caution not to confuse two different tests.

I do understand Yang et al did an MCMC for dN/dS ... what I'm thinking of here is maximum likelihood. You'd need to watch for that carefully and thats what you are referring to when you say Chi-squared, because that is LRT 100%. LRT is maximum likelihood.

Correct answer by M__ on December 14, 2020

## Related Questions

### How do I find all supplementary mappings of a chimeric long (ONT) read mapped with NGMLR?

0  Asked on March 10, 2021 by liam-mcintyre

### Python – Finding a motif – input: a txt file with 10 sequences and 10 motifs

2  Asked on March 7, 2021 by beeta

### How can I get properties from each one letter amino acid?

0  Asked on March 7, 2021 by mh-lee

### error in random forest analysis

1  Asked on March 7, 2021

### Upgrading guppy basecaller on Oxford Nanopore Gridion via PPA

1  Asked on March 5, 2021 by bcarg

### Explanation for RNA-seq samples not clustering in PCA as expected

1  Asked on March 5, 2021 by user260878

### How to find possible viral subsequences in a Eukaryote genome?

2  Asked on March 4, 2021

### How to infer the number of chaperones involved in protein folding?

0  Asked on March 3, 2021

### How to call this : some dots in volcano plot aligned a line after add a very small value to padj

0  Asked on March 1, 2021 by chu

### How to analyze the correlation of two different tissue gene expression under three conditions?

0  Asked on March 1, 2021 by dowithr

### Metagenomics: Identifying most common sequences

1  Asked on March 1, 2021 by dumbledorethegrey

### Query on htseq count

0  Asked on February 28, 2021 by lavanya-c

### Why do I get cytosine to guanine/adenine transitions in bisulphite treated sequences?

0  Asked on February 27, 2021 by alimm94

### Can’t locate object method “new” via package Bio::Tools::Primer3

1  Asked on February 25, 2021 by shawnc

### Multi Omics comparison RNA seq : Condition specific gene signature

1  Asked on February 25, 2021

### Get nucleic acids’ chain names from PDB API

0  Asked on February 21, 2021 by rtviii

### How to read E-value annotation on NCBI BLAST?

0  Asked on February 21, 2021 by user12256545

### How are the scores of GeneSplicer, MaxEntScan and SpliceRegion interpreted from VEP annotations?

0  Asked on February 20, 2021 by law

### Kraken2 or metaphlan2 report to phyloseq

0  Asked on February 20, 2021

### Genome assembly of SRR12196449 with SPAdes

2  Asked on February 19, 2021