MODEL

Introduction

This year, the modeling aspect is at the heart of our project. Indeed, the construction of overlapping sequences requires the use of particularly advanced mathematical techniques.

CAMEOS (Blazejewski et al. 2019) is a software which allows one to entangle two gene sequences together, and is the central part of our iGEM team project.

In the first part of this model, we present the Hidden Markov Models (HMM) and explain how they can be used to capture the "digital fingerprint" of a family of proteins.
Once built, this fingerprint can be seen as a black box that randomly generates homologous proteins, i.e. proteins that are in the same family. It is also possible to estimate the probability that a protein belongs to a given protein family.

Morally, an HMM is a statistical model that represents a family of proteins by capturing its recurring patterns and statistical peculiarities. Thus it is possible to understand the structure of a protein family.

Thanks to this modeling, it is possible to estimate the efficiency of a given entanglement, i.e. to estimate whether the entangled proteins have retained their properties. This is essentially how the CAMEOS algorithm works

From the digital fingerprint (HMM) of the family of proteins to be entangled, it progressively builds a family of optimal sequence that binds the two target proteins.

Since a protein has a 3D structure, the CAMEOS algorithm models the long range interactions between the different amino acids of the entangled proteins. For this, these interactions are modeled using a statistical tool: the Markov Random Field (MRFs) that we will present later.

HMM, a tool to capture the fingerprint of a protein family

The first step in our modeling is to build a mathematical structure (Hidden Markov Process) that captures the digital fingerprint of a family of proteins.

Thanks to this structure, it is possible to estimate the probability that a protein is or is not in this family. It can also be used to generate new proteins in the family.

To make this construction, we start by presenting the markov processes. These processes are the essential building block of the Hidden Markov process, which corresponds to a Markov process with imperfect observation.

Then we explain how this kind of structure can be used to capture the fingerprint of a family of proteins and how we can calculate its different parameters.

Markov Process

A Markov Model is a probabilistic model composed of a finite number of states. At each instant, the state jumps to another state according to a known probability law.

The fundamental assumption of a model ensures that the law of transition from one state to another depends only on the current state. The process is then said to be without memory, i.e. it forgets its past.

We propose a formal definition of this type of process

A Markov model $M$ is a triplet $(\Omega, q, q_0)$ where

$\Omega$ is a finite set of states,
$q : \Omega \to \Delta(\Omega)$ is the transition function, where $\Delta(\Omega)$ is the set of probabilities on $S$,
$q_0 \in \Delta(\Omega)$ is the probability of the initial state.

The dynamic of a Markov Process is defined as follows:

Step 1 : We draw an initial state $\omega_1$ according to the law of probability $q_0$ and place the process on this state
Step $n \geq 2$ : Knowing the previous state $\omega_{n-1}$, we draw a random state $\omega_n$ according to the law of probability $q(\omega_{n-1})$. Then the state of the process is moved to $\omega_n$.

To explain the previous definition, we propose an example of Markov's process on three states :

A player moves between three bags. Each time he arrives on a bag, he chooses the next bag he goes to according to a predefined law of probability (shown in the diagram).

In each bag are balls of different colors.

In a Markov process, the user always knows which bag he is in.

Description of HMMs

A Hidden Markov Model (HMM) works exactly the same way as a Markov model. Except that this time, we don't know the state of the process. Instead, we only have a random observation that depends on the state of the process.

To understand what's going on, we can complete the previous Markov model. This time, we always move the state from one bag to another. However, we don't know directly which bags we are in.
To find out, we make an extra print: we draw a ball in the bag and look at its color.

If you shoot a red ball, it will be quite easy to deduce that you are in bag number 3. But what if we shoot a white ball? In this case, we can't say for sure which bag we are in. Of course, it seems more likely to be in bag number 1, since there is only one white ball in bag number 2. But this is not a certainty.

Contrary to Markov's model, in a HMM one does not work with the state of the system which remains unknown, but only with an estimation of it.

A formal definition of HMMs is proposed

A Hidden Markov Model (HMM) $\mathcal{H}$ is a 5-uplet $(\Omega, S, q, \sigma, q_0)$ where

$\Omega$ is a finite set of states,
$S$ is a finite set of signals,
$q : \Omega \to \Delta(\Omega)$ is the transition function, where $\Delta(\Omega)$ is the set of probabilities on $S$,
$\sigma : \Omega \to \Delta(S)$ is corresponds to the law of observations,
$q_0 \in \Delta(\Omega)$ is the probability of the initial state.

We propose a description of the dynamic of such type of process

Step 1 : We draw an initial state $\omega_1$ according to the law of probability $q_0$ and we place the process on this state which remains hidden. We draw a signal $s_1$ according to the law of probability $\sigma(\omega_1)$ and this signal is announced publicly.
Step $n \geq 2$ : Knowing the previous state $\omega_{n-1}$, we draw a random state $\omega_n$ according to the law of probability $q(\omega_{n-1})$. Then the state of the process is moved to $\omega_n$ and the state of the process is kept secret. Then one draws randomly a signal $s_n$ according to the law of probability $\sigma(\omega_n)$. This signal is announced publicly.

The use of HMMs

Once you've built an HMM, there are a multitude of questions you can ask about it : (Rabiner et al. 1989). The three most classic questions are :

Knowing all the parameters of the model $\mathcal{H}$, as well as a sequence of observations $\mathcal{O} = (\mathcal{O}_1, \cdots, \mathcal{O}_n)$, what is the probability $\mathbf{P}(\mathcal{O} | \mathcal{H})$ that this sequence was generated by $\mathcal{H}$.
Given an observation sequence $\mathcal{O} = (\mathcal{O}_1, \cdots, \mathcal{O}_n)$, can we find the "best" sequence of states having generated this observation?
How to choose the parameters of the model in order to optimize the quantity $\mathbf{P}(\mathcal{O} | \mathcal{H})$.

The interested reader will find a method for resolving these three questions in (Rabiner et al. 1986).

Protein Application - Building a Profile HMMs

Overlapping genes involves constructing a DNA sequence that "contains" the two target proteins. Unfortunately, most of the time it is impossible to perfectly obtain the starting proteins.

Nevertheless, it is possible to get around this difficulty. Indeed, there are many DNA sequences that give proteins "homologous", i.e. proteins that have similar physico-chemical properties. Therefore, all we need to do is to construct the entangled sequence so that it contains proteins homologous to our target proteins.

To find out if these proteins belong to the same family as our target proteins, we will build an HMM to capture the patterns and statistical properties of a protein family. This type of construction has been introduced for the first time in (Krogh et al. 1994)

In simple words, a Profile HMMs is like a big black box that randomly generates DNA sequences from the same family.
For a given sequence, it is possible to study the probability that it is in the family.

Let's assume that we have the following multiple sequence alignment of proteins or DNA sequences (such aligment can be found on large public database). $$ \begin{array}{llllll} S & A & C & L & A & Y \\ S & E & C & L & L & T \\ S & A & D & L & U & T \\ S & E & C & L & P & I \\ \end{array}$$

Hypothesis : Fist, It is assumed that there are no gaps and no deletions in the alignment.

We can consider the HMM with 6 states and where each state $(X_i)_{i = 1}^6$ corresponds to one of the amino acids. $$ \begin{array}{c|c|c|c|c|c} X_1 & X_2 & X_3 & X_4 & X_5 & X_6 \\ \hline S & A & C & L & A & Y \\ S & E & C & L & L & T \\ S & A & D & L & U & T \\ S & E & C & L & P & I \\ \end{array} $$

This first model is quite simple, since it considers that there is no insertion and no deletion in sequence alignment. Nevertheless, this assumption is generally false when working with real sequences. Indeed, many insertions and deletions have appeared during the evolution of the different protein families. $$ \begin{array}{llllll} S & A & C & L & A & Y \\ S & E & C & - & L & T \\ S & A & - & - & U & T \\ S & E & C & - & P & I \\ \end{array} $$ To remove it, the HMM model is modified below to take into account insertions and deletions.

Parameter's Estimation

But how can we know the parameters of the different transitions in the model?
In other words, once we have been able to recover many sequences encoding a protein, how can we find the parameters of the HMMs model in order to optimize it to model this family?

Let's take again our artificial example of sequence alignment. $$ \begin{array}{llllll} S & A & C & L & A & Y \\ S & E & C & - & L & T \\ S & A & - & - & U & T \\ S & E & C & - & P & I \\ \end{array} $$

To choose the number of states of the associated HMM, one chooses to consider only the columns that have kept an amino acid for at least half of the sequences.
In our example, we keep only columns 1, 2, 3, 5 and 6.
To choose the probability of observations in each state, the frequency of each amino acid in the starting alignment is calculated. Here $$ \begin{array}{c} \sigma_{x_1}(S) = 1 \\ \sigma_{x_2}(A) = \frac{1}{2} , \sigma_{x_2}(E) = \frac{1}{2} \end{array} $$
To choose the probabilities of transitions from one state to another, again based on the observed frequency.
In our example, there is no gap after state 1, and there is one after state 2. $$ \begin{array}{c} q_{M_1}(M_2) = 1 \\ q_{M_2}(M_3) = \frac{3}{4} , q_{M_2}(D_3) = \frac{1}{4} \end{array} $$

The score of a new sequence

HMMs allow to build a "digital fingerprint" of the protein family. This fingerprint allows to associate to each protein a score that represents its probability to be in the represented family.

We call negative log likelihood (NLL)-score of a protein $S$ the value $$ - \log \mathbf{P}(S | \mathcal{H})$$ where $\mathbf{P}(S | \mathcal{H})$ corresponds to the probability that it is generated by the model $\mathcal{H}$.

Unfortunately, this score is particularly sensitive to the size of the protein (Krogh et al. 1994)
It can be improved by considering the Z-score which corresponds to the difference between the NLL-score of a sequence and the average NLL-score of typical sequences of that same length.

Morally, the lower the Z-score of a protein is, the more likely it is to belong to the target family.

CAMEOS: The mathematical principles of the entanglement algorithm

Sequence Homology - Similar Sequence

Let's consider the following two proteins

P1 : HKTSTE
P2 : ITQAP

It is easy to see that these proteins can be recovered from the amino acid sequence.

P1 : CAT AAC ACA AGC ACC GAA
P2 : ATA ACA CAA GCA CCG

These proteins can be entangled into a single protein.

Pe : CATAACACAAGCACCGAA

Unfortunately, it is not always possible to make a tangle without changing the codons of our proteins. Indeed, it is impossible to directly entangle the amino acids $F \in \{TTT, TTC\}$ and $G \in \{GGA, GGC, GGT, GGG\}$.

This raises the question of the extent to which it is possible to substitute one amino acid for another in order to preserve the functional aspect of a protein.

The score calculation of a entanglement sequence

In order to build an entangled sequence optimal, the CAMEOS algorithm must first determine that it is the meaning of the optimal word.

Let's consider two target proteins $p_1$ and $p_2$ that we would like to entangle. The first step of the algorithm consists in building two HMMs that store the digital fingerprint of the $p_1$ and $p_2$ family.

Once these footprints have been constructed, a score can be constructed that represents the efficiency of the substitution of one amino acid by another within the different families.

This allows to establish a score for each tetra-nucleotide of the entangled sequence, which represents the impact of a base change on the function of the entangled proteins.

Since we are working on Markov models, the score of a substitution is independent of the rest of the residuals in the sequence. Therefore, we can apply the principle of dynamic programming to calculate the optimal entanglement.

The Optimal Tangle Calculation

The number of ways to entangle two sequences of length $n$ is very huge : $$ Nbr(n) = 4^n.$$ For two sequences of 1kb size, there are more possible combinations than atoms in the universe. Therefore, it is inconceivable to test the score of each combination to determine the best one.

To solve this problem, a general optimization method called dynamic programming is used. This method consists of following the old popular adage divide and regnant. In practice, it is sufficient to divide the main problem into sub-problems and solve each of the sub-problems.

Let $A = a_1 \cdots a_n$ and $B = b_1 \cdots b_n$ be two gene sequences to tangle. The idea is to progressively build the tangle of sub-sequences that are longer and longer.
To do this, we start from the following observation

Any subsequence of an optimal sequence is optimal.

To build the optimal entanglement, a $T$ table is built with a size of $n \times 4$. The value of the $T(i, b_i)$ box contains the optimal score of the entanglement of the first $i$ nucleotides of $A$ and $B$ that fixes the last nucleotide to $b_i$.

To calculate $T(i+1,b_{i+1})$, just take the maximum of the four scores $$ T(i+1, b_{i+1}) = \max \left\{ \begin{array}{l} T(i,A) + \sigma(A, b_{i+1}, i+1), \\ T(i,C) + \sigma(C, b_{i+1}, i+1), \\ T(i,G) + \sigma(G, b_{i+1}, i+1), \\ T(i,T) + \sigma(T, b_{i+1}, i+1). \end{array}\right. $$ In this way, an optimal overlapping sequence can be built up gradually.

Here a deterministic method is constructed to build an optimal solution. Nevertheless, it is interesting to consider a larger set of sub-optimal solutions in order to have more variability on the entanglements.

A little Random

The previous discussion proposes a general method for determining an optimal solution to the entanglement problem. If this solution is very satisfactory from a theoretical point of view, in practice it is preferable to have a set of satisfactory sequences, even if this weakens a little the optimal character of the solution.

To do this, a random disturbance is introduced during the construction stage of the dynamically programmed tangled sequence. In fact, instead of choosing the optimal sub-sequence in a deterministic way, $10\%$ of the time we lengthen the entanglement by randomly drawing one of the proposed sub-sequences.

This method has the advantage of proposing a large number of sequences on which it is possible to make a posteriori optimization in order to determine the best entanglement.

Long Range Interactions

To entangle two proteins, the first step of the CAMEOS algorithm consists in building a digital fingerprint of the target protein family in order to construct an optimal entanglement.

This digital footprint is built from HMMs. Nevertheless, a protein is not just a linearly written sequence of amino acids. In reality, proteins have 3D structures and different distant codons on the sequence can be close in reality.

Unfortunately, Markov models make a very strong assumption about the sequence: namely that there is no correlation between remote regions. In other words, HMMs cannot capture the 3D structure of a protein. However, it is easy to imagine that the 3D structure of entangled sequences must play an important role if functional proteins are to be successfully obtained.

To take this structure into account, it is necessary to improve the numerical fingerprint of protein families using another type of statistical model: the Markov Random Fields (MRFs). (Balakrishnan et al. 2011).

Representing the protein's family as an MRF

The first step of the optimization process allowed us to obtain a multitude of tangled sequences.

Let $X_i$ be the random variable that represents the i-th amino acid of the alignment. We note $X = \{X_1 , \cdots, X_n \} $ the random vector which describes the composition of the protein of size $n$.

The $\P(X)$ law corresponds to the amino acid distribution of the entangled protein. The main idea is that if we know the $X$ law, we can look for an optimal solution with maximum probability.

To describe the law of $X$, we rely on a classical probabilistic model in mathematics: MRFs.

A Markov random field is a quadruplet $(X, \mathcal{E}, \Phi, \Psi)$ where

$(X,\mathcal{E})$ is a non-oriented graph across all variables.
$\Phi$ and $\Psi$ are the potentials on the vertices and edges.

We have the following property

The probability that a sequence $x = \{x_1, \cdots, x_n \}$ comes from of $\mathcal{M}$ is given by $$ \mathbf{P}_{\mathcal{M}(x)} = \frac{1}{Z} \prod_{s \in V} \Phi_s(x_s) \prod_{(s,t) \in \mathcal{E}} \Psi_{s,t}(x_s, x_t), $$ where $Z = \sum_{x \in X} \prod_{s \in V} \phi_s(X_s) \prod_{(s,t) \in E} \Psi_{s,t} (x_s, x_t)$

The construction of the MRF associated with our solutions

Thus, we know that it is possible to represent the protein distribution of entangled sequences using an MRF process.

Question : How to build the MRF from the sequences obtained?

The previous equation describes the amino acid distribution of the $x$ sequences generated by the $\mathcal{M}$ model.

Now we define the log-likehood of model.

Let $(X^i)_{i = 1}^n$ a collection of independent sequences. The log-likehood of model parameters $\Theta = (\mathcal{E}, v,w)$ is defined by $$ \mathcal{I} (\Theta) = \frac{1}{n} \sum_{i = 1}^n \left[ \sum_{s \in V} \log \Phi_s (X^i _s) + \sum_{(s,t) \in E} \log \Psi_{s,t}(X_s^i, X_t^i) \right] - \log Z. $$

The parameterization of the MRF can be seen as a problem of maximizing the log-likelihood function.

When a large amount of data is available, maximizing the likelihood function makes it possible to find the real parameters of the model.

IMC: An easy first explanation on image processing

Now that we have a trained MRF model, we are looking for an iterative method to be able to find an optimal entanglement.

For this, the CAMEOS algorithm is based on an adapted version of the ICM algorithm. The principle of this algorithm consists in building an increasing maximizing sequence that converges towards a locally optimal entanglement.

In order to make the reader understand the operation of the algorithme, we will briefly leave the world of biology to explain the algorithm (ICM) on a rather simple example from the image processing theory.

In general, an image is described by a table of numbers. For simplicity, we will only consider black and white images.

A image of size $n\times n$ is a table of size $n\times n$ composed of 0 or 1.

In everyday life, many images are captured. Unfortunately, it is very frequent that these images are altered, noisy with a random term.

Hypothesis :

It is assumed that each pixel can be modified with a probability $p \in (0, 1)$.
It is assumed that these modifications are independent.

The ICM Algorithm is based on a rather simplistic hypothesis :

Hypothesis : On a healthy image, neighboring pixels tend to have the same color.

This last hypothesis is particularly important for image reconstruction. Indeed, suppose for a moment that a white pixel is discovered in the middle of a group of black pixels, the hypothesis assures that it is highly probable that this pixel was originally a black pixel.

The ICM algorithm therefore works as follows. Each pixel is compared to its neighbors, and it is determined whether or not it is likely to have been altered.

To do this, a $C$ function is introduced which corresponds to the cost of restoration. $$ C(y_p) = \alpha(1 - \delta_{x_p, y_p}) + \beta \sum_{q \in \mathcal{N}(p)} (1 - \delta_{y_p, q}) $$

The term $\alpha(1 - \delta_{x_p, y_p})$ corresponds to the gain of reconstruction. Here we pay $\alpha$ if the reconstruction does not leave the pixel $p$ unchanged, and nothing else.
The term $\beta \sum_{q \in \mathcal{N}(p)} (1 - \delta_{y_p, q})$ corresponds to the cost of the reconstructed pixel $p$ with its neighbors. Here we pay $\beta$ per different neighbor.

At each iteration of the algorithm, a restored image is constructed by calculating a $C$ minimizer. $$ y_p^{(k+1)} \longleftarrow argmin(C(y)).$$ Then repeat the algorithm several times in a row.

The ICM algorithm can be described as follows $$ \left\{\begin{array}{c} C(y_p) = \alpha(1 - \delta_{x_p, y_p}) + \beta \sum_{q \in \mathcal{N}(p)} (1 - \delta_{y_p, q}) \\ y_p^{(k+1)} \longleftarrow argmin(C(y)). \end{array}\right. $$

ICM : Back to overlapping proteins

Now, we have everything we need to understand the last step of the modeling process carried out by CAMEOS. The algorithm works in much the same way, except that this time, the proximity between two amino acids is constructed by the MRF generated.

Some improvements

One of the main weakness of the CAMEOS software is that it relies only on data retrieved from databases.

When used in practice, this characteristic is not necessarily satisfactory.

Undesirable mutations

Previously, we have seen that CAMEOS software entangles genes in order to minimize a score function that it defines using the digital fingerprint of the target proteins that it generates using HMMs.

From then on, it can make any mutation appear as long as the score is minimized. In practice, it may be desirable to eliminate certain mutations in the final sequences.

A first naive approach consists in manually eliminating the sequences presenting these mutations from the results provided by CAMEOS. If this method has the advantage of being easy to implement in practice, it is not optimal from a mathematical point of view.

Indeed, it would have been desirable for the algorithm to directly eliminate these mutations in order to perform the optimization steps on these proteins. not having the mutation

To remedy this, we propose a way to modify the step of dynamic programming (link)
The general idea is to manually modify the score function when one of these mutations occurs.

Let $M$ be the set of undesirable mutation. We denote by $l(M)$ the size of the mutation. To calculate $T(i+1,b_{i+1})$, just take $$ T(i+1, b_{i+1}) = \left\{ \begin{array}{l} \max \left\{ \begin{array}{l} T(i,A) + \sigma(A, b_{i+1}, i+1), \\ T(i,C) + \sigma(C, b_{i+1}, i+1),\\ T(i,G) + \sigma(G, b_{i+1}, i+1), \\ T(i,T) + \sigma(T, b_{i+1}, i+1), \end{array}\right. \quad \text{if } \; \forall n, \; b_{i+1-n} \cdots b_{i+1} \notin M \\ -\infty \qquad \text{else} \end{array}\right. $$

Here we can see that undesirable mutations are highly penalized. The algorithm will therefore not be able to select them.

Desirable mutations

Similarly, one may want to conserve certain mutations in the target genes.

To do this, the score is modified so that it is favorable to such mutations.

Let $M$ be the set of desirable mutation. We denote by $l(M)$ the size of the mutation. $$ T(i+1, b_{i+1}) = \left\{ \begin{array}{l} +1000 \qquad \exists n, \; b_{i+1-n} \cdots b_{i+1} \in M \\ \max \left\{ \begin{array}{l} T(i,A) + \sigma(A, b_{i+1}, i+1), \\ T(i,C) + \sigma(C, b_{i+1}, i+1),\\ T(i,G) + \sigma(G, b_{i+1}, i+1), \\ T(i,T) + \sigma(T, b_{i+1}, i+1), \end{array}\right. \quad \text{else}\\ \end{array}\right. $$

These modifications take into account the additional information available to a biologist who wants to tangle two genes.

Bibliography

Blazejewski and al. (2019). Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science, 365(6453), 595-598.
Rabiner (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.
Rabiner and al. (1986). An introduction to hidden Markov models. ieee assp magazine, 3(1), 4-16.
Krogh and al. (1994). Hidden Markov models in computational biology. Applications to protein modeling. Journal of molecular biology, 235(5), 1501-1531
Balakrishnan, S., Kamisetty, H., Carbonell, J. G., Lee, S. I., & Langmead, C. J. (2011). Learning generative models for protein fold families. Proteins: Structure, Function, and Bioinformatics, 79(4), 1061-1078

Team:GO Paris-Saclay/Model

The Mathematical Model

Contents