Team:SYSU-CHINA/Model

Model
Due to the impact of COVID-19, we were unable to carry out wet-lab experiments until September 2020. To speed up the progress of directed evolution and reduce the experimental workload of this large project, we put great effort into the Model part. We carried out this part of the project by doing the following:
(1) Building models of the pre-experiment and the formal experiment to help test the reliability of the experimental results
(2) Developing machine-learning and feature-extraction algorithms to predict which types of RNA may have better affinity to ADAR1
Experiment Model
PART A. Pre-experiment
  • Toxic Gene Test
  • (1) Expression Test
    DOX promotes the binding of rtTA to the TRE promoter in the Tet-On system and thereby increases the expression of toxic genes. In this step, we replaced the toxic gene with GFP to characterize the expression of the target gene in the Tet-On system under gradient DOX addition.
    Since this step has not yet been performed, we took from a research article a set of measurements of the mean fluorescence intensity (MFI) of GFP expressed after stable transfection of Tet-On systems (including pCMV-rtTA and pTet-GFP) into HeLa cells under gradient addition of DOX.[1]
    Figure 1. MFI-DOX relationship in HeLa cells (data estimated from the M2 curve in the left panel of C)
    We expect that the mean fluorescence intensity of GFP measured after gradient addition of DOX in our own experiment will resemble the left panel of C in the figure above. However, since the original data are not given in the article, we estimated the MFI values of the M2 variant (the same rtTA variant used in our experiment) from its curve (triangular points) and fitted them as a function of DOX concentration, with MFI as the dependent variable and DOX concentration as the independent variable.
    The images are as follows:
    Figure 2.MFI-DOX scatter plot
    To view the code behind this figure, please click https://github.com/sysu-china2020/igem_project to visit our code file on github.

    The scatter plot resembles points sampled from a logarithmic curve, so a logarithmic model was used for the fitting.
    The custom equation was set as: y = a*log(x + b) + c
    We fitted the data with the Curve Fitting Toolbox (curve-fitting tool) in MATLAB.
    Figure 3. MATLAB curve-fitting result of the MFI-DOX scatter
    To view the code behind this figure, please click https://github.com/sysu-china2020/igem_project to visit our code file on github.

    result:

    f(x) = a*log(x+b)+c
    Coefficients (with 95% confidence bounds):
    a = 4.006 (2.935, 5.077)
    b = 96.24 (30.09, 162.4)
    c = -13.44 (-20.77, -6.115)


    Goodness of fit:
    SSE: 0.1275
    R-square: 0.9983
    Adjusted R-square: 0.9971
    RMSE: 0.2062
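The same logarithmic fit can be reproduced outside MATLAB. Below is a minimal sketch in Python using scipy's `curve_fit`; since the digitized data points live in our GitHub repository, the example generates synthetic points from the reported coefficients purely to illustrate the procedure:

```python
import numpy as np
from scipy.optimize import curve_fit

def mfi_model(x, a, b, c):
    # Logarithmic model used for the MFI-DOX fit: f(x) = a*log(x + b) + c
    return a * np.log(x + b) + c

# Synthetic example points generated from the reported coefficients
# (a = 4.006, b = 96.24, c = -13.44); the real digitized data are on GitHub.
dox = np.array([0.0, 10.0, 50.0, 100.0, 500.0, 1000.0])
mfi = mfi_model(dox, 4.006, 96.24, -13.44)

# Bounds keep b positive so log(x + b) stays defined during optimization
popt, pcov = curve_fit(mfi_model, dox, mfi, p0=[1.0, 50.0, 0.0],
                       bounds=([0.0, 1e-3, -100.0], [100.0, 1000.0, 100.0]))
a, b, c = popt
print(a, b, c)
```

With exact synthetic data the optimizer recovers the generating coefficients; on real digitized points the confidence bounds would match the MATLAB output above.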

    The fitting results show that R-square is about 0.998, while the SSE and RMSE values are small and close to 0, indicating a good fit.

    As shown in the figure above, the higher the DOX concentration, the higher the expected GFP fluorescence. (Under actual experimental conditions, excessive DOX can lead to cell necrosis, which still needs to be verified experimentally.) Within a certain concentration range, we can therefore control the expression of toxic genes by adjusting the dosage of DOX, so as to achieve the purpose of experimental screening.
    (2) Measurement of the virulence of toxic genes
    Toxic gene:
    We chose the apoptin protein gene as the toxic gene. Apoptin localizes to the cytoplasm in normal cells and to the nucleus in many tumor cell lines, where it can specifically induce apoptosis. According to current research, the molecular mechanism by which it induces tumor cell apoptosis is as follows:[2]
    Figure 4. Molecular mechanism of apoptin inducing tumor cell apoptosis
    Calculation of the efficiency of virulence gene:
    We plan to select apoptin as the toxic gene. Since our purpose is directed evolution, the killing efficiency of the toxic gene should ideally be known and controllable, so we need to model it.

    Under our current experimental conditions, HeLa cells transfected with the empty pCMV plasmid are used as the control group, and HeLa cells transfected with the pCMV-apoptin recombinant plasmid as the experimental group. Mortality is measured as the difference in absorbance after trypan blue staining, which characterizes the virulence of the toxic gene.

    However, we found that measuring the rate of cell death with flow cytometry would give better results. The 2013 SYSU-China project gave us inspiration, and our future work is as follows:
    HeLa cells transfected with the pCMV-GFP recombinant plasmid will be used as the control group, and HeLa cells transfected with the pCMV-apoptin recombinant plasmid as the experimental group.
    Generally, the killing efficiency of apoptin can be calculated by comparing the mortality of GFP-positive cells in the control group with that in the experimental group using two-color flow cytometry. In this part, however, we will try to expand our data set under the same experimental conditions by introducing the transfection rate and, based on several basic assumptions, including untransfected cells in the calculation, in order to obtain more accurate results.
    The specific calculation method is as follows, where:
    N0: initial cell number before transfection
    n0: the number of PI stained cells in the control group within a certain period of time measured by flow cytometry
    n1: the number of green fluorescent cells in the control group measured by flow cytometry within a certain period of time
    n2: the number of PI stained cells in the experimental group within a certain period of time measured by flow cytometry
    r1: the inherent mortality of control group cells after culturing for a period of time
    r2: mortality of pCMV-GFP-apoptin transfected cells after culturing for a period of time
    a: transfection efficiency
    k: the dimensionless ratio between the number of cells counted by flow cytometry and the total number of such cells within a certain period of time

    Through experiments and flow cytometry, we can obtain the values of n0, n1 and n2. Therefore, by substituting the corresponding value of a, i.e. the transfection efficiency, we can calculate the corresponding r2, that is, the inherent mortality of pCMV-GFP-apoptin transfected cells after a period of culture.
    On this basis, the relationship between apoptin and these quantities can be simulated by changing the types of transfected cells, cell culturing time and the amount of plasmid added during transfection.
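As an illustration, the relations implied by the definitions above can be coded directly. The model below is our own reading of the setup (the control-group counts determine r1 and the product k·N0; the experimental-group count then yields r2), so treat it as a hypothetical sketch rather than the exact published formula:

```python
def apoptin_mortality(n0, n1, n2, a):
    """Recover r1 and r2 from flow-cytometry counts.

    Hypothetical reading of the definitions in the text:
      control group:      n0 = k*N0*r1,      n1 = k*N0*a*(1 - r1)
      experimental group: n2 = k*N0*((1 - a)*r1 + a*r2)
    """
    # Inherent mortality r1 follows from the two control-group counts
    r1 = a * n0 / (n1 + a * n0)
    # The product k*N0 is recovered from n0 = k*N0*r1
    k_n0 = n0 / r1
    # Mortality of transfected cells in the experimental group
    r2 = (n2 / k_n0 - (1 - a) * r1) / a
    return r1, r2

# Example: a = 0.5, n0 = 100, n1 = 450, n2 = 350 gives r1 = 0.1, r2 = 0.6
r1, r2 = apoptin_mortality(100, 450, 350, 0.5)
print(r1, r2)
```

Note that k·N0 cancels out of the final expression, so the absolute counting efficiency never needs to be measured separately.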

    (3)Changes of cell death rate with time:
    The following schematic diagram was drawn in the SimBiology toolbox of MATLAB. It shows that, in the cell culture medium, cells transfected with the plasmid carrying the toxic gene gradually die as the toxic gene is expressed.
    Figure 5.The process of cell death in the cell culture media
    The chemical reaction process is shown as follows:
    According to the transformation of cell state, the ordinary differential equations (ODE) are listed as follows:
    Suppose the initial population is 10000 cells carrying the toxin gene and K1 = K2 = 1; the results are as follows:
    Figure 6.The number of dead cells change over time

    Under ideal conditions, all cells that have been transfected with the toxic gene and given an appropriate concentration of DOX will eventually die unless they are "rescued" by dsRNA.
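For readers without SimBiology, the same process can be sketched with scipy. The two-step first-order chain live → toxin-expressing → dead with K1 = K2 = 1 is our reading of the diagram, so this is an illustrative model rather than the SimBiology file itself:

```python
from scipy.integrate import solve_ivp

K1, K2 = 1.0, 1.0  # rate constants, as in the text

def death_chain(t, y):
    # live -> expressing (K1), expressing -> dead (K2): a two-step
    # first-order chain matching our reading of the SimBiology diagram
    live, expressing, dead = y
    return [-K1 * live,
            K1 * live - K2 * expressing,
            K2 * expressing]

# 10000 cells carrying the toxin gene at t = 0
sol = solve_ivp(death_chain, (0, 10), [10000.0, 0.0, 0.0])
live, expressing, dead = sol.y[:, -1]
print(dead)  # nearly all cells are dead by t = 10
```

The dead-cell curve rises sigmoidally toward 10000, matching the qualitative shape of Figure 6.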


    PART B. Formal experiment
    (1) Stimulation of the expression of cytotoxic genes:
    We constructed a model in the SimBiology toolbox of MATLAB to show how the designed pathways are expressed in cells.
    Figure 7.The pathways in the cell
    Firstly, the transfected cells are cultured for a period of time, during which a certain amount of dsRNA is expressed. Then IFN and DOX are added to induce the expression of ADAR1 and the stem-loop, respectively. If dsRNA binds to ADAR1, the cells survive; if the stem-loop binds to ADAR1, the stem-loop is opened and the toxic gene is expressed, resulting in cell death.

    According to this biological process, we assign a reaction rate constant to each reaction; the reaction formulas are as follows:
    The ordinary differential equations (ODE) are as follows:
    Run the model above:
    (Note: in order to run the model and get output images, the following settings were applied:
    1) All reactions follow the law of mass action by default;
    2) To simplify the model, IFN and DOX are added at the same time after the cells have been cultured for a period of time, hence the settings in 3);
    3) The initial concentrations of IFN, DOX and dsRNA were set to 4, 5 and 3, respectively;
    4) All reaction rate constants are set to 1 by default, i.e. K1 = K2 = K3 = K4 = 1;
    5) To show the influence of the input IFN and DOX on the result, their values are changed to 6 and 3, respectively, and the model is run again while keeping the former result.)

    The following images are obtained:
    In order to make the image clear, only the IFN, DOX and [toxin-die] curves are shown.
    a.
    b.
    Figure 8.Different amounts of toxic gene expression under different IFN and DOX dosage

    The images show that when the input values of IFN and DOX are changed (i.e. the amounts of IFN and DOX added in the experiment), the expression of the toxic gene changes, and it can be inferred that the mortality rate will change accordingly.
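A rough equivalent of this SimBiology run can be written in Python with scipy. The mass-action network below is our reading of the pathway description (IFN → ADAR1, DOX → stem-loop, plus the two competing binding reactions), with the same initial values and rate constants as in the settings above; treat it as a sketch, not the model itself:

```python
from scipy.integrate import solve_ivp

K1 = K2 = K3 = K4 = 1.0  # all rate constants 1, mass action

def pathway(t, y):
    ifn, dox, dsrna, adar1, stemloop, complex_, toxin = y
    v1 = K1 * ifn               # IFN -> ADAR1 induction
    v2 = K2 * dox               # DOX -> stem-loop induction
    v3 = K3 * dsrna * adar1     # dsRNA + ADAR1 -> inactive complex (survival)
    v4 = K4 * stemloop * adar1  # stem-loop + ADAR1 -> toxin expression (death)
    return [-v1, -v2, -v3, v1 - v3 - v4, v2 - v4, v3, v4]

# First run: IFN = 4, DOX = 5, dsRNA = 3
sol = solve_ivp(pathway, (0, 20), [4.0, 5.0, 3.0, 0.0, 0.0, 0.0, 0.0],
                rtol=1e-8)
toxin_a = sol.y[6, -1]

# Second run: IFN raised to 6, DOX lowered to 3, as in setting 5)
sol2 = solve_ivp(pathway, (0, 20), [6.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0],
                 rtol=1e-8)
toxin_b = sol2.y[6, -1]
print(toxin_a, toxin_b)
```

Changing the IFN and DOX inputs shifts how much ADAR1 is captured by the stem-loop versus the dsRNA, and hence the final toxin level, which is the qualitative behaviour shown in Figure 8.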

    (2) Estimation of cell mortality
    In the experiment, we will use the curve-fitting tool of MATLAB to perform surface fitting directly, using the data available from the experiment to fit the relationship between the affinity of dsRNA to ADAR1 and the survival rate under different IFN and DOX concentrations.

    However, the "inhibition capability of dsRNA to ADAR1" we want cannot be obtained directly from experiments. In order to directly link the affinity of dsRNA with the survival rate obtained in the experiment, and to make the progress of the experiment intuitive, our team member GJM defined a value, the "evolution percentage", as the ratio of the current dsRNA's inhibition capability to the ideal inhibition capability.

    The "inhibition capability" is quite abstract. Therefore, we can fix a relatively high concentration of IFN and DOX according to our own needs. Under these conditions, the affinity of a dsRNA that makes the cell survival rate reach a specified value is considered ideal, and the evolution percentage of dsRNA in this state is defined as 1.

    The functional relationship between the affinity of a dsRNA and its inhibition capability is not clear to us, but if it is roughly considered close to linear, then the evolution percentage is approximately equal to the ratio of the affinity to the ideal affinity. In this way, all the data we use for fitting are available from the experiments.

    The next step is to carry out directed evolution. Through the data obtained in the experiment, we can gradually make our model more substantial and more perfect.

    For now, due to the low degree of completion of the experiment, this model cannot yet be built and run, but it will certainly help us a lot in our future experiments.

    References:
    [1].Das, A.T., et al., Selecting the optimal Tet-On system for doxycycline-inducible gene expression in transiently transfected and stably transduced mammalian cells. Biotechnol J, 2016. 11(1): p. 71-9.
    [2].Los M, Panigrahi S, Rashedi I, et al. Apoptin, a tumor-selective killer. Biochim Biophys Acta. 2009;1793(8):1335-42.
    Feature extraction
    Semi-rational design refers to using feature extraction and deep machine-learning algorithms to predict the sequences of dsRNAs with high affinity to ADAR1.

    Therefore, we need to extract features of the dsRNAs for analysis, and the simplest feature is the primary structure: the sequence of the dsRNAs.

    We first downloaded the GRCh37.p13 human genome assembly (chromosomes 1-22, X, Y and mtDNA; Genome Reference Consortium release of June 2013) from NCBI and used Python 3.7.3 to extract dsRNA sequences for which laboratory ADAR1 affinity data were available. Sequences longer than 15 bp were divided into two groups, endogenous and exogenous: 1819 endogenous dsRNAs and 1926 exogenous dsRNAs.

    Subsequently, we counted the editing sites in the dsRNA sequences. In the endogenous dsRNAs we found 362 editing sites distributed over 100 dsRNAs (some editing sites lie on the same dsRNA), and in the exogenous dsRNAs a total of 975 editing sites distributed over 235 dsRNAs. Although there is a connection between editing sites and binding sites, they are not exactly the same motif, so this part of the data may be used in subsequent steps to verify motifs at editing sites.

    We are more inclined to look for the binding site. Therefore, we applied a new method as follows. We normalized the affinity data using the Sigmoid function:
    After that, according to the affinity interval distribution diagram we drew, data with roughly central affinity and sparse distribution were placed in an intermediate data set, and the endogenous dsRNAs were divided into positive and negative training sets: the positive training set has y > 0.62 and the negative training set y < 0.58.
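The normalization and split described above can be sketched in a few lines. Whether the raw affinity values are shifted or scaled before the Sigmoid is not stated, so the function below applies the Sigmoid directly, which is an assumption on our part:

```python
import math

def sigmoid(x):
    # Standard logistic function, mapping any affinity value into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def label(affinity):
    """Assign a raw affinity value to a training set.

    Thresholds from the text: y > 0.62 -> positive set, y < 0.58 ->
    negative set; the sparse middle band becomes the intermediate set.
    """
    y = sigmoid(affinity)
    if y > 0.62:
        return "positive"
    if y < 0.58:
        return "negative"
    return "intermediate"

print(label(0.5), label(0.3), label(0.4))
```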

    We used k-mers to segment and count the endogenous dsRNA sequences: we cut each sequence with k as the window length and 1 as the step length, and counted the subsequences of lengths 5 and 6 in the positive and negative training sets, namely the 5-mers and 6-mers (see Figures 1 and 2). To help the analysis, we introduced the richness:[1]
    Here a is the number of occurrences of the sequence in the positive training set, and b the number of occurrences in the negative training set. When y ≥ 0, we believe the sequence is more likely to appear at strong binding sites; when y < 0, at weak binding sites. The resulting plots are drawn below, with the top ten data points for y > 0 and y < 0 marked respectively:
    Figure 1. Richness of 5-mers. The ten 5-mers with the highest richness and the ten with the lowest richness are shown as orange and blue dots, respectively; purple dots represent all other 5-mers.
    Figure 2. Richness of 6-mers. The ten 6-mers with the highest richness and the ten with the lowest richness are shown as orange and blue dots, respectively; purple dots represent all other 6-mers.
    Tables 1 and 2. The top ten 5-mers and 6-mers marked in the figures above.
    From the tables we can extract useful information. For example, ATAGGG is more likely to appear in the positive training set while ATACGG is more likely to appear in the negative training set; this suggests that replacing ATACGG with ATAGGG may yield a dsRNA with higher affinity to ADAR1. Furthermore, Table 2 shows that runs of five or six identical bases such as AAAAA, TTTTT and TTTTTT tend to appear in the negative training set, and for the 5-mers the main difference between the top ten sequences of the two sets is that the negative set is dominated by repetitive or highly regular base patterns. However, the information we can obtain here is limited, because the mers are cut at arbitrary positions.
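The k-mer counting step is straightforward to reproduce. The richness formula itself comes from [1] and is not restated here, so the log-ratio score below (with a pseudocount to avoid division by zero) is a stand-in of our own with the same sign convention: positive means enriched in the positive training set.

```python
import math
from collections import Counter

def count_kmers(sequences, k):
    # Slide a window of length k with step 1 over every sequence
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def richness(kmer, pos_counts, neg_counts):
    """Stand-in richness score: log-ratio of positive-set to negative-set
    occurrences. y >= 0 suggests enrichment at strong binding sites."""
    a = pos_counts[kmer]  # occurrences in the positive training set
    b = neg_counts[kmer]  # occurrences in the negative training set
    return math.log2((a + 1) / (b + 1))  # pseudocount of 1 avoids log(0)

# Toy example with made-up sequences
pos = count_kmers(["ATAGGGAT", "GGATAGGG"], 5)
neg = count_kmers(["ATACGGAT", "TTTTTTTT"], 5)
print(richness("ATAGG", pos, neg), richness("TTTTT", pos, neg))
```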
    (To view the code behind this figure, please click https://github.com/sysu-china2020/igem_project to visit our code file on github).

    As a result, we introduced a new method, described below.
    Machine Learning Algorithm
  • Background
  • It is a huge workload to design dsRNA sequences by enumeration and verify them through experiments one by one. If prior knowledge of the affinity between a dsRNA sequence and its substrate can be obtained, our workload will be greatly reduced. Therefore, we introduced machine learning into our project. Using the affinity data between dsRNA sequences and their substrate obtained from previous experiments, we trained a neural network and verified its performance. Finally, given a new dsRNA sequence as input, the model can predict the degree of affinity between the sequence and its substrate.

  • Model building
  • Considering that a dsRNA is a sequence, meaning that the order in which the bases appear determines its specificity, we chose a one-dimensional Convolutional Neural Network (CNN), which is well suited to processing sequence data. Compared with other types of neural networks, a CNN is very good at capturing local features of the data, and the local features it learns have little dependence on where they were learned. In other words, if the network learns a valid base motif from a certain dsRNA sequence, it can also recognize that motif when it appears at any position in any other dsRNA sequence. This characteristic matches our needs well, since we do not want to miss a valid base motif just because it appears at a different position.

    Let us first introduce the existing data set. We measured 2080 data points, namely dsRNA sequences and their affinity with their substrate. A dsRNA sequence is a base sequence, and the corresponding affinity level is a specific float value. Both the dsRNA sequences and the affinity values need to be transformed before they can be fed into our CNN model.

    We selected the median affinity of all data as the threshold, marking dsRNAs with affinity above the threshold as 1 and those below it as 0. In other words, 1 and 0 represent strong and weak affinity between a dsRNA sequence and its substrate, respectively. We are thus faced with a binary classification problem: given a dsRNA sequence, label it with 0 or 1, i.e. judge whether its affinity with its substrate is strong or weak.

    For the dsRNA sequences, the original character sequences obviously cannot participate in numerical operations, so they need to be converted. In our model, each dsRNA sequence is represented by a (777, 4) two-dimensional matrix, where 777 is the maximum length of a dsRNA in the data set. Each base is represented by a 4-dimensional vector whose dimensions correspond to the bases A, C, G and T: the position corresponding to the actual base is set to 1, and the other positions to 0.
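This one-hot encoding can be sketched as follows (zero-padding sequences shorter than 777 bases is our assumption about how the fixed matrix size is filled):

```python
import numpy as np

MAX_LEN = 777  # maximum dsRNA length in the data set
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq, max_len=MAX_LEN):
    """Encode a base sequence as a (max_len, 4) matrix: each row is a
    4-dimensional indicator vector for A, C, G, T, and rows past the end
    of the sequence stay all-zero (zero padding)."""
    mat = np.zeros((max_len, 4), dtype=np.float32)
    for i, base in enumerate(seq[:max_len]):
        mat[i, BASES[base]] = 1.0
    return mat

x = one_hot("ACGT")
print(x.shape)  # (777, 4)
```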

    For researchers without a computer science background it is difficult to understand the principles behind the specific details of the model, but it is necessary to understand the basic function of each layer, the building block of our model. The network structure we established is as follows:
    Layer (type)                 Output Shape              Param #   
    ===========================================================
    conv1d_1 (Conv1D) (None, 773, 256) 6656
    _________________________________________________________________
    max_pooling1d_1 (MaxPooling1 (None, 386, 256) 0
    _________________________________________________________________
    conv1d_2 (Conv1D) (None, 382, 128) 163968
    _________________________________________________________________
    global_max_pooling1d_1 (Glob (None, 128) 0
    _________________________________________________________________
    dense_1 (Dense) (None, 1) 129
    ===========================================================
    Total params: 170,753
    Trainable params: 170,753
    Non-trainable params: 0
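For reference, such a network could be assembled in Keras roughly as follows. The kernel size of 5 is inferred from the printed output shapes (777 → 773 and 386 → 382); the remaining hyperparameters (activations, optimizer) are assumptions on our part, so exact parameter counts may differ slightly from the summary above:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the architecture implied by the model summary
model = keras.Sequential([
    keras.Input(shape=(777, 4)),          # one-hot encoded dsRNA sequence
    layers.Conv1D(256, 5, activation="relu"),
    layers.MaxPooling1D(2),               # 773 -> 386
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),          # collapse to a 128-dim vector
    layers.Dense(1, activation="sigmoid"),  # probability that the label is 1
])
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",  # the loss monitored during training
              metrics=["accuracy"])
model.summary()
```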


    The model is a multi-layered network. A Conv1D layer is a convolution layer: it convolves the input, which amounts to extracting local features and condensing them into more effective representations as output. Each Conv1D layer is followed by a pooling layer (the second layer MaxPooling1D and the fourth layer GlobalMaxPooling1D). A pooling layer downsamples the output of the convolution layer, reducing the dimensionality of the data; this not only reduces the amount of computation but also prevents the model from learning redundant noise when the dimensionality is too large. The last layer is the fully connected layer Dense, which converts the output of the former layers into a probability value in the range [0, 1]; this value can be intuitively understood as the probability that the true label of the input sequence is 1. In addition, each Conv1D and Dense layer uses an activation function to "activate" its output. Activation introduces nonlinearity into the model, which is exactly what enables it to learn much more complex relationships in the data.

    In order to train and evaluate the model, we divided the data in a 9:1 ratio: the larger part is used to train the model, and the smaller part to evaluate it. The training data is further divided into a training set and a validation set in a 4:1 ratio.

    The learning process of a neural network is actually the updating process of the model parameters. In each iteration, the model calculates its binary cross-entropy on the training set (a commonly used index measuring the error between predicted values and true labels, also known as the loss) and updates the network weights along the direction in which the loss descends, via backpropagation (the chain rule). During training it is also necessary to calculate the binary cross-entropy loss on the validation set from time to time, in order to monitor the generalization ability of the model, i.e. its predictive ability on new data. As the model is updated, the network gradually learns more and more about the training data: at the beginning it learns the genuinely important regularities, the ones that improve its generalization ability, but as learning deepens it slowly starts to learn noise, resulting in over-fitting. The result of over-fitting is that while the performance on the training set keeps improving, the performance on new data gets worse, because the network has learned noise exclusive to the training set.

    By monitoring the performance of the model on the validation set, we can find the critical point at which the model begins to learn noise, and stop training at this critical point, where the generalization ability of the model is at its best. We visualized the training process and obtained the following results (see Figure 1).
    Figure 1. The results of training
    In Figure 1, the left panel is the core of our analysis, and the right panel shows the prediction accuracy of the model, which can be used as a reference. From the left panel we can see that the loss of the model on the training set keeps decreasing, which means the network is learning constantly. However, the loss on the validation set first decreases and then rises, which is exactly the over-fitting described above. At the beginning, the model learns the inherent regularities of the data, so its generalization ability improves; as learning deepens, it starts to learn noise that belongs only to the training set and is definitely not a common characteristic of all data, so the loss on new data (the validation set) rises. From this we can determine the final training schedule: after about 6 epochs, the generalization ability of the model is optimal. Evaluating this optimal model on the test set, we obtained an accuracy of 59.6%, i.e. the probability that the model correctly classifies a dsRNA sequence is 59.6%. This is higher than the 50% accuracy of random classification, which means the model can successfully provide prior knowledge about the degree of affinity.

    From the perspective of neural networks, if the amount of training data could be greatly increased, the accuracy of the model would be expected to reach about 90%. Still, 59.6% accuracy is already an acceptable result, considering that we used only about 2000 pieces of data, far fewer than the tens of thousands typically needed to train an excellent neural network. In addition, there are many optimization methods in the field of neural networks that could further improve the generalization ability of the model.
    (To view the code behind this figure, please click https://github.com/sysu-china2020/igem_project to visit our code file on github).

    References:
    [1]Adjeroh D, Allaga M, Tan J, et al. Feature-Based and String-Based Models for Predicting RNA-Protein Interaction. Molecules. 2018;23(3):697. Published 2018 Mar 19. doi:10.3390/molecules23030697