Cas9 Folding Rate Prediction
Introduction
To couple the expression and function of Cas9 accurately with cell division, we need to predict how long Cas9 takes to be expressed. Going from DNA to functional protein involves three main processes: transcription, translation, and folding. From the literature, we estimate that transcription and translation together take about 14 minutes.
The DNA sequence of Cas9 is 4104 bp long. The transcription rate in mammalian cells is roughly 1000 nucleotides per minute, so transcription takes about 4 minutes. The Cas9 protein is 1368 amino acids long. The translation rate is roughly 140 amino acids per minute, so translation takes about 10 minutes.
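The arithmetic behind the 14-minute estimate can be checked with a short sketch. It uses only the lengths and rates quoted above; the function and variable names are illustrative, and constant polymerization rates are assumed.

```python
# Back-of-the-envelope expression time for Cas9, using the figures quoted above.
CAS9_GENE_LENGTH_NT = 4104            # length of the Cas9 DNA sequence (bp)
CAS9_PROTEIN_LENGTH_AA = 1368         # length of the Cas9 protein (amino acids)
TRANSCRIPTION_RATE_NT_PER_MIN = 1000  # approximate rate in mammalian cells
TRANSLATION_RATE_AA_PER_MIN = 140     # approximate translation rate


def synthesis_minutes(length, rate_per_min):
    """Time in minutes to polymerize `length` units at a constant rate."""
    return length / rate_per_min


transcription_min = synthesis_minutes(CAS9_GENE_LENGTH_NT, TRANSCRIPTION_RATE_NT_PER_MIN)
translation_min = synthesis_minutes(CAS9_PROTEIN_LENGTH_AA, TRANSLATION_RATE_AA_PER_MIN)
total_min = transcription_min + translation_min

print(f"transcription: {transcription_min:.1f} min")  # 4.1 min
print(f"translation:   {translation_min:.1f} min")    # 9.8 min
print(f"total:         {total_min:.1f} min")          # 13.9 min
```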
In general, protein folding times range from microseconds to minutes. However, we could not find folding-rate data for Cas9 in the literature, so we decided to estimate it by modeling. After an extensive literature search and analysis, we chose to use neural networks to predict the Cas9 folding rate.
To improve prediction accuracy, we combined deep neural networks (DNN) with a Fuzzy Cognitive Map (FCM) and the Levenberg-Marquardt (LM) algorithm [1].
Models and Results
Our method is summarized in Figure 1.
Figure 1. Illustration of our model
First, the FCM was used to learn the degree of correlation between the known folding rates of 86 proteins and nine properties derived from their amino acid sequences (Figure 2).
Figure 2. FCM of protein folding rate (Liu L, et al., 2017)
Parameter explanations: alpha helix of the C terminus; beta sheets; compression capability of the protein sequence; reactivity of proteins in solvents; contact surface between the unfolded chain and solvent, ∆ASA; Gibbs free energy of hydration in denatured proteins; the average range of amino acid contacts; the polarity value of amino acids, P; and the number of side-chain torsion angles, n.
We then used the adjacency matrix W of the FCM as the initial weights of the neural network and trained the network.
We set up the neural network as follows (Figure 3):
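The idea of seeding network weights from the FCM, rather than from random values, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the matrix values are invented, only three property concepts are shown instead of nine, and a single sigmoid neuron stands in for the actual network architecture, which is not specified here.

```python
import math

# Hypothetical FCM adjacency matrix W for 3 property concepts plus 1
# folding-rate concept (the real model uses 9 properties). W[i][j] is the
# influence of concept i on concept j. All values are illustrative only.
W = [
    [0.0, 0.2, 0.0, 0.6],
    [0.1, 0.0, 0.0, -0.4],
    [0.0, 0.3, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
N_PROPERTIES = 3
RATE_INDEX = 3  # column index of the folding-rate concept


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def initial_weights_from_fcm(adjacency, n_inputs, output_index):
    """Use the property -> folding-rate edges of the FCM as starting weights."""
    return [adjacency[i][output_index] for i in range(n_inputs)]


def forward(weights, bias, features):
    """Single sigmoid neuron: the simplest network that can take these weights."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


weights = initial_weights_from_fcm(W, N_PROPERTIES, RATE_INDEX)
prediction = forward(weights, bias=0.0, features=[0.5, 0.2, 0.8])
print(weights)     # starting point for training, not the trained weights
print(prediction)  # untrained output for one hypothetical feature vector
```

Training would then adjust these weights from their FCM-derived starting point, so the network begins from correlations already learned from the data.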
Figure 3. Structure of neural network
In the two reference studies, the authors used 116 and 99 samples respectively and obtained reliable results. For our project, after excluding redundant data, we selected 86 samples for our model.
In addition, when training the neural network, we found that prediction accuracy was very low when amino acid positions were ignored. We therefore speculated that both the identity and the position of amino acids strongly influence the folding rate, and we modified the published method accordingly. Instead of averaging amino acid quantities as in the literature, we used the Monte Carlo method to select sequence features: it gives the average prediction accuracy for each candidate set, and we keep the sequence features with the highest accuracy. In this way, both the interactions between amino acid residues and the influence of sequence order on the folding rate are considered.
(Guo J, et al., 2010)
Figure 4. A schematic drawing to show the sequence order correlation mode along a protein sequence.
The first-tier panel (a) reflects the correlation mode between all the most contiguous residues, the second-tier panel (b) that between all the second-most contiguous residues, and the third-tier panel (c) that between all the third-most contiguous residues.
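The tiers in Figure 4 can be made concrete with a short sketch. It uses a simplified, commonly used form of a sequence-order correlation factor: the mean squared difference of a residue property between residues that are k positions apart. This form, the choice of the Kyte-Doolittle hydropathy scale as the property, and the example peptide are assumptions for illustration, not necessarily the definitions used in our model.

```python
# Kyte-Doolittle hydropathy scale, used here only as an example residue property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def correlation_factor(sequence, tier, scale=HYDROPATHY):
    """Mean squared property difference between residues `tier` positions apart.

    tier=1 pairs adjacent residues, tier=2 pairs residues with one residue
    between them, tier=3 pairs residues with two residues between them.
    """
    n = len(sequence)
    if not 1 <= tier < n:
        raise ValueError("tier must satisfy 1 <= tier < len(sequence)")
    diffs = [
        (scale[sequence[i + tier]] - scale[sequence[i]]) ** 2
        for i in range(n - tier)
    ]
    return sum(diffs) / (n - tier)


peptide = "MDKKYSIGLD"  # arbitrary short example peptide
for tier in (1, 2, 3):
    print(tier, correlation_factor(peptide, tier))
```

Each tier yields one number per protein, so a sequence of any length is reduced to a fixed-size set of features that still carries order information.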
To improve prediction accuracy, we use a set of sequence-order correlation factors to represent positional information in the protein, with Ri denoting the i-th amino acid residue.
The first-tier correlation factor reflects the sequence-order correlation between adjacent residues, as shown in (1); the second-tier correlation factor reflects the correlation between residues separated by one intervening residue, as shown in (2); and the third-tier correlation factor reflects the correlation between residues separated by two intervening residues, as shown in (3). Higher-tier factors are defined analogously.
We want to find the factors that are related to the folding rate, that is, the factors that give high prediction accuracy. Checking every factor one by one would be very laborious, so we propose the following procedure. First, randomly select m characteristic factors and use the Monte Carlo method to obtain the average prediction accuracy for that m. Then repeat this for every possible value of m. This identifies the number of characteristic factors that achieves the highest prediction accuracy. Compared with averaging over all amino acids as in the original literature, this method gives more accurate results.
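The search over the number of characteristic factors can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_evaluate` is a hypothetical stand-in for "train the network on this subset of factors and return its prediction accuracy", and the function names, trial count, and seed are all invented for the example.

```python
import random
from statistics import mean


def average_accuracy(n_features, m, evaluate, n_trials, rng):
    """Monte Carlo estimate of the mean accuracy over random subsets of size m."""
    scores = []
    for _ in range(n_trials):
        subset = rng.sample(range(n_features), m)  # m distinct factor indices
        scores.append(evaluate(subset))
    return mean(scores)


def best_feature_count(n_features, evaluate, n_trials=200, seed=0):
    """Try every subset size m and return (best m, {m: mean accuracy})."""
    rng = random.Random(seed)
    results = {
        m: average_accuracy(n_features, m, evaluate, n_trials, rng)
        for m in range(1, n_features + 1)
    }
    best_m = max(results, key=results.get)
    return best_m, results


def toy_evaluate(subset):
    """Hypothetical score: factors 0-3 help, and large subsets are penalized."""
    informative = sum(1 for f in subset if f < 4)
    return informative / 4 - 0.01 * len(subset) ** 2


best_m, results = best_feature_count(n_features=10, evaluate=toy_evaluate)
print("best number of factors:", best_m)
for m, score in results.items():
    print(m, round(score, 3))
```

Averaging over many random subsets for each m avoids training on every possible combination of factors, which is what makes the exhaustive check impractical.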
After repeated training, the predicted folding rate constant of Cas9 settled between 120 and 140 s⁻¹. Taking the reciprocal of the rate constant, we estimate that Cas9 folding takes roughly 7.1 to 8.3 ms.
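The conversion from rate constant to folding time is a simple reciprocal. The sketch below assumes the rate constant is expressed in s⁻¹, which is what the millisecond figures above imply; the function name is illustrative.

```python
def folding_time_ms(rate_per_second):
    """Characteristic folding time in ms, as the reciprocal of the rate constant."""
    return 1000.0 / rate_per_second


slowest_ms = folding_time_ms(120)  # lower bound of the predicted rate
fastest_ms = folding_time_ms(140)  # upper bound of the predicted rate
print(f"{fastest_ms:.1f} to {slowest_ms:.1f} ms")  # 7.1 to 8.3 ms
```

On this estimate, folding contributes milliseconds on top of the roughly 14 minutes needed for transcription and translation.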
Discussion
By modeling the folding rate of Cas9, we found that the time needed to express Cas9 is much shorter than the cell cycle. This provides preliminary support for the feasibility of our design.
References
[1] Liu, L., Ma, M., & Cui, J. (2017). A novel model-based on FCM-LM algorithm for prediction of protein folding rate. Journal of bioinformatics and computational biology, 15(4), 1750012.
[2] Lv, J., & Luo, L. (2014). Statistical analyses of protein folding rates from the view of quantum transition. Science China. Life sciences, 57(12), 1197–1212.
[3] Cheng, X., Xiao, X., Wu, Z. C., Wang, P., & Lin, W. Z. (2013). Swfoldrate: predicting protein folding rates from amino acid sequence with sliding window method. Proteins, 81(1), 140–148.