TransWikia.com

How to model RNA-DNA and DNA-DNA binding as a “string matching" problem in statistical physics?

Physics Asked on July 7, 2021

I am trying to model the binding of a guide-RNA molecule with a target-DNA strand as it occurs before the CRISPR-Cas9 protein cuts the DNA.

Both the RNA ($S_1 = Pi_i sigma_1(i)$) and the DNA ($S_2 = Pi_i sigma_2(i)$) sequences are of the same length $N$, where the beads’ letters $sigma_1(i), sigma_2(i) in {text{A},text{C},text{T},text{G}}$, where $text{T}$ represents Uracil in the case of RNA and Thymine in the case of DNA).

Each DNA bead $sigma_1(i)$ and RNA bead $sigma_2(i)$ may be at one of four possible nondegenerate energetic states, given a pre-defined energetic constant $epsilon_0$:

  1. Bead $sigma_1(i)$ or $sigma_2(i)$ is open with energy cost $E = omega epsilon_0 $
  2. Bead $sigma_1(i)$ matches and is bound to $sigma_2(i)$ with energy $E=0$ (if both letters agree namely )
  3. Bead $sigma_1(i)$ mismatches yet is bound to $sigma_2(i)$ with an energy $E=epsilon_0 mathbf{M}_{sigma_1(i),sigma_2(i)}$, where $mathbf{M}$ is a $4 times 4$ matrix of the energy penalties in a single letter mismatch.
  4. Bead $sigma_1(i)$ bulges and attach to bead $sigma_2(i)$ with energy $E=betaepsilon_0.$

All cases above are depicted in this illustration:

enter image description here

Now, for a given $N$ and a fixed temperature $T$ (mean field approximation), I would like to write the partition function:

$$
Z = sum_i^N expleft(-frac{E(sigma_1(i)) + E(sigma_2(i))}{k_BT}right)
$$

Once I have the $Z$ partition function, the free energy is given by:$F = -k_BT ln{Z}$ and one should be able to apply the max-likelihood/max-entropy principle to find the best fit for the parameters $omega, mathbf{M}, beta$ given a set of observations from simulations or experiments.

I would appreciate some help to discuss if this idea is valid, and how to develop from the partition function an actual equation that I can solve using max-likelihood estimation, given an observation list of DNA-RNA pairs.

As a demo, I thought to generate random data for $N=6$, where the configuration space of all possible $S_1$ and $S_2$ combinations is digestible: $4^6 cdot 4^6 = 2^{24} approx 1.6 times 10^7$. Please note that my goal is to have a statistical physical model for which I can “fit” observations from experimental data.


Edit:

Although the current answer is fantastic and very helpful, it doesn’t quite address the essence of my misunderstanding, which is how to find the best-fit parameters that "maximize the likelihood" given (a) the partition function and (b) a list of observed RNA-DNA pairs.

One Answer

There exist extensive literature on the statistical mechanics of DNA, RNA and proteins, so extensive literature search is a must here (there is a lot in Physical Reviews, but there is even more in biology-minded publications NCBI/PMC is your friend here). I would not be surprized, if your problem has been studied and described as well, although it is a bit more exotic (just a bit, after all, RNA aligns to DNA during the transcription process, which is commonplace). I will give only a few directions for search centered on my experience with RNA structures. (Btw, note that RNA has U-s instead of T-s, which may make some difference when studying RNA-DNA coupling).

The classical RNA folding approaches based on base pairing are grounded in the Nussinov's algorithm, although there have been many refinements (see, e.g., here, here and here). The most crucial refinement is that no serious algorithm actually calculates the energy of an RNA or DNA on the basis of base pairing, but rather as a stacking energy between pairs. The use of dynamical programming algorithm is necessitated by the fact that one has to deal with a huge number of possible structures and calculation the partition function directly is simply not feasible.

Another good place to look is bioinformatics textbooks, Durbin and others is a classic, and includes an introduction on modeling RNA structures using context-free grammars, a subject that has received more attention very recently.

Among more exotic approaches are inferring RNA structure on the basis of correlations in a sequence alignment (See the series of publications by Weigt et al.), as well as the approach based on field-theoretical large-N expansion by Henri Orland and collaborators (yes, the author of Negele&Orland's book on quantum statistical physics). Orland also has articles about protein folding, but it is a domain of its own, by far more advanced.

To get closer to DNA helix, helix-coil transition is the basic model here, but the literature is somewhat scarce and rather dated.

Update
Let me add a few more concrete considerations regarding the model presented in the OP. First point is purely biological: A matches to T (AT), whereas C matches to G (CG), and sometimes there is also non-canonical pair GT. CG consist of three hydrogen bonds, AT of two, and GT of one, so they have different energy. In addition, as I mentioned previously, RNA has Uracil (U) instead of Thymine (T). It is hard to judge by the question, whether youa dopted new naming conventions or simply were not aware of this.

Basic model
The energy of a chain is typically expressed not as a sum of single nucleotdie energies, but as a sum of pairing energies. Let $i$ be the index of the RNA chain and $j$ be the index of the DNA chain. If we have pairs $(i_1, j_1), (i_2, j_2),...,(i_K, j_K)$ then the energy of a this configuration can be written as $$ Eleft[{(i_k,j_k)|k=1...K}right]=sum_{k=1}^Kepsilonleft[sigma(i_k),sigma(j_k))right], $$ where $epsilon(sigma_1,sigma_2)$ is teh pairing energy between the nucletides with identities $sigma_1$ and $sigma_2$.

A more handy way of describing pairing configurations is by introducing pairing matrix $M_{ij}$ which has entries $1$ for paired nucleotides and zeros elsewhere. The we could write the energy as a sum over all the sites: $$ E(M) = sum_isum_jepsilonleft[sigma(i),sigma(j)right]M_{ij} $$

If both chains have length $N$, this gives $2^{N^2}$ different configurations. This number is usually reduced by introducing the non-crossing constraint - that is bonds not cross (how it is written in terms of indices $i$ and $j$ depends on the direction in which you enumerate the two chains). Still, the number of configurations is huge for realistic chains. This is why one ahs to resort to recirsive relations that express the partition function of longer chains in terms of those of shorter ones. Nussinov's algorithm is the simplest case, treating only the ground state energy and assuming that all the binding energies are equal. There are more advanced versions of this algorithm associated with the names of Mathews and Zucker, who have published extensively on the subject, but you may also look here and here.

Realistic models
The problem with the model described baove is that it is not realistic. The pairing energies are often much smaller than the stacking energies, i.e., the energies between the adjacent pairs. This may be especially important in your case, since the pairs in the mismatched configuration may have the same energy as the adjacent pairs. To make things worse, the stacking enery is dependent on the solution in which the molecules live, i.e., they may be different in a molecule in a cell nucleus, in cytoplasm, and in a lab.

The Nussnov-like algorithms have been developed for this case, I particularly recommend sorting through the Turner's lab page that I cited above.

String matching and maximum likelihood
String matching and maximum likelihood mean rather specific concepts in bioinformatics and biostatistics respectively. I am not sure whether these terms were used in the question in their true meaning or as figures of speech, since they may lead very far away from what seems the immediate subject. I suggest checking my post on string algorithms in bioinformatics forum for gaining some common ground.

Answered by Roger Vadim on July 7, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP