How to modify DNA evolution model to fit actual data?

Question

I'm working on understanding the evolution of a gene in a phylogenetic tree. I know the rates of evolution of each organism (from the tree).
I am taking a random DNA sequence with my gene, evolving it along the tree with the said rates and comparing this to the actual gene in these various organisms.
I am 'evolving' the gene if randomly at a probability determined by the rate at each step in the tree.
The gene from my simulation is different from what is actually in these organisms. How do I modify my model, so that my simulation gives the same results?

Chris_Rands · Answer

Mutation is not a random process because there are many mutational biases, e.g. the transition vs. transversion ratio. Ideally use a proper evolutionary model. Maybe try a tool that incorporates substitution and indel models. I'm not up-to-date on the best packages, but e.g.

INDELible http://abacus.gene.ucl.ac.uk/software/indelible/
Evolver https://www.drive5.com/evolver/ (not to be confused with the PAML package of the same name)

Maximilian Press · Answer

I decided that this was too involved for a comment, so I'm making it an answer.
It is still quite difficult to answer the question given the absence of methods explanation; but I think that there's some misunderstanding of phylogenetic modeling going on here, and I can at least address that. I'm going to try to break it down into its components, starting with the basic statistical idea of simulation and then going back to phylogenies.
Statistical reasoning
Let's say that I flip 10 coins (2 outcomes: head (H) and tails (T)), and I observe the sequence HTHTTTHHTT. I then fit a maximum likelihood model, and find that the best model for the data has
$Pr(H) = 0.4$,
$Pr(T) = 0.6$,
and that these two outcomes are mutually exhaustive:
$Pr(H) + Pr(T) = 1.0$
I can then use a binomial model to re-simulate data based on these parameters. I will observe various sequences, for example:
HTTTTTTHHT
TTTTHHTHHH
HTHHTHHTHH...
However, based purely on the parameters of the model, I don't expect every new simulation to recapitulate the exact same HTHTTTHHTT sequence, because I have reduced its information to some rate parameters for the simulation.
Each of those new sequences will yield model parameters ($Pr(H), Pr(T)$) that are similar to the training data, but they will not necessarily be particularly similar to the training data in sequence.
Phylogenetic modeling
Phylogenetic modeling works in the same way, such that you reduce the observations at the tips of the tree into some model with some parameters. Most of the time you are estimating e.g. a rate matrix of parameters for transition between states for a set of residues (in simplest case for DNA 4 nucleotide states and for proteins, 20 amino acid states usually). If you are working with gene sequences you are probably using the 4 nucleotide states or some derivation thereof.
Simulations usually work by starting with some fixed ancestral state at the root (which can take various states), and then evolving forward from there along each branch of the tree using the rate matrix in a continuous-time Markov chain to "replay the tape of life", by assuming that the model you've fit somewhat reflects truth.
While opinions differ somewhat on what to expect from "replaying the tape of life" (see that linked review for more), basically no one expects to get the same results as the input data, or even 90% similar to the input data. That would indicate that evolution is highly deterministic, which just isn't reliably true. Of course there are situations where such determinism happens to be the case, but they are relatively rare, and I don't think that sequence evolution is considered to be very deterministic at all.
It's a little different in that the tree defines a correlation structure of the tips, which constrains the simulation results to be somewhat more similar to the training data than we expect from coin-flipping. But there's still no reason to expect particularly similar results on the tips.
But you can also model discrete characters on a tree, which can be even simpler. For example, I used the R APE package to simulate the following character states (A or B for each tip) on a simulated tree of 10 tips with the following commands:
tr = rtree(10)
c_tr = rTraitDisc(tr)

I can then build a model for this character and this tree using the following:
rates = ace(c_tr, tr, type="discrete")

and use it to simulate new sets of states:
# A (index 1) is estimated to be ancestral state with 99% confidence
new_sim = rTraitDisc(tr, rate=rates$rates, states=c("A", "B"), root.value=1)
plot(tr)
tiplabels(new_sim, adj = c(1,0))

Look! Not a single observation of state B! Let's try it again:
new_sim2 = rTraitDisc(tr, rate=rates$rates, states=c("A", "B"), root.value=1)
plot(tr)
tiplabels(new_sim2, adj = c(1,0))

Ok, this is more similar, but there's still another state B on the tips.
Conclusion
Evolution is not very deterministic. The correlation structure of the tree makes the outcomes less random than independent trials as in coin-flipping, but it is not at all expected that simulation results from phylogenetic models will match the training observations well. The best that can be said is that each simulated dataset can hopefully be reduced to the same model parameters as the input data, but that's a very different statement from yielding the exact same outcomes.
If what you are looking for from simulation is to yield the exact same sequence states at the tips, then you are unlikely to get that from phylogenetic simulation. You might occasionally get pretty close, but that is not at all the expectation.

How to modify DNA evolution model to fit actual data?

2 Answers

Add your own answers!

Ask a Question