Team:SYSU-Software/Model

Overview

This year, we collaborate with Tongji-Software throughout the whole season. Despite the COVID-19 epidemic, we found a way to work closely to each other through online meetings and conferences. Our partnership has shaped our project in a fundamental and profound way.

Automated Design

To generate the desired gene pathways, we first use python's GeneNet module to generate the topology of the gene pathways. Then, according to Autofil algorithm prepared by us, a set of transcription factors and proteins that meet the topological structure and have the highest score in the algorithm are automatically searched in the database. As the optimal scheme, they are filled into the topological structure to realize the automatic design of gene pathway.

GeneNet

The “GeneNet” module is a gradient descent optimization algorithm ranging from machine learning to rapid screening and design of genetic circuits that can quickly design circuits capable of performing a range of different functions^[1]. On this basis, we improve the algorithm to make it easier to use and user-oriented.

Based on the user-supplied change function of the desired target gene expression, we first consider a simple and universal model of the transcriptional gene circuit that has been fully studied in a series of biological contexts (Formula 1) :

The model comprises N transcription factors whose expression concentration is expressed by Y. We assume that all the interactions between genes are possible, then let W be a N*N matrix , where each element W_ij specifies that the transcription of gene I is affected by gene J (if W_ij is positive, then gene J activates gene I; If W_ij is negative, J suppresses I). ϕ(x)is a nonlinear function, to ensure that the transcription rate is always positive, and at a high level of input saturation. We further assume that each gene degrades at rate k_i. Thus these N ordinary differential equations (ODEs) specify the gene network model. Therefore, GeneNet's task is to find the appropriate parameters W and k through machine learning and gradient descent algorithm to make the network operate as expected and convert the input I to the specified output.

After obtaining the W matrix, we can generate the topology of the gene network according to the promotion and inhibition relations between genes in the matrix.

Autofill

In Autofill algorithm, we first take Regulon Database as the data source and use Depth first search and Pruning algorithm to select a group of promoters and translation factors (TFs) whose interaction relationship satisfies the topology output by GeneNet. RegulonDB is a model of the regulation of transcription initiation and the regulatory network in Escherichia coli K-12. It includes knowledge of the organization of the genes in transcription units, operons and simple and complex regulons. Then, through the scoring mechanism, 10 schemes with the highest score are selected and recommended to users.

The overall framework of the scheme search is a Depth first search, pruning with greedy thoughts and using priority queue to save several local optimal solutions. Although the global optimal solution cannot be guaranteed, it can provide users with several appropriate solutions within an acceptable time.

The key factors in the scoring mechanism are the total number of citations of TFs and promoters in a scheme, and the length of the constructed plasmid for the scheme.

The criteria established are as follows:

1. If the sequence of elements within a plasmid is within an ideal length, it is said to be an ideal state. Once the total sequence length exceeds the maximum ideal length, acceptability decreases rapidly. The total sequence length must be less than or equal to the threshold maximum acceptable length, and the acceptability is 0 when the total sequence length is bigger than or equal to the maximum acceptable length.

2. The number of references related to the elements contained in the scheme should be as high as possible.

3. Try to avoid the short-board effect, that is, the number of references for the element with the least references should be as large as possible.

Specific implementation of scoring standards:

For standard 1, we adopt a strictly convex function. After testing and adjustment, we choose to map [maximum ideal length, maximum acceptable length] to [0, 1], and then use 1-x³ as the evaluation function.

For standard 2 and 3, in order to strike a balance between them, we have two options:

Option 1: Preprocess the elements in all schemes, storing the ranking of the number of references for each element, so that the total reference ranking of the components used is as high as possible.

Option 2: Calculate the number of references and use some transformation to normalize them into the interval [0, 1], as the score. After testing linear transformation, standard transformation, arctangent transformation and logistic transformation, we found that logistic transformation had the best effect, so we chose logistic transformation as the evaluation function.

The scoring algorithm flow is as follows:

According to the citation times of relevant literature of the scheme, the initial score of the scheme is set as:

Suppose the total sequence length of the elements contained in the plasmid constructed by this scheme:

real length = plasmid length + inserted gene sequence length.

The default maximum ideal length is 10kbp and the maximum acceptable length is 20kbp.

If the total length of the sequence is greater than the maximum accepted length, then： grade = grade -1;

If the total length of the sequence is greater than the maximum ideal length and less than the maximum accepted length, then:

If the total length of the sequence is less than the maximum ideal length, no points are deducted.

In the end, the top ten solutions with the highest scores are recommended to the user, in order of their scores.

Reference

[1] Hiscock TW. Adapting machine-learning algorithms to design gene circuits. BMC Bioinformatics. 2019;20(1):214. Published 2019 Apr 27. doi: 10.1186/s12859-019-2788-3

TF Binding Sites Affinity Prediction

Introduction

While users are constructing functional circuits with genenet algorithm and our auto-fill algorithm, there are still something missing. As the calculated topological structure of the genes not only represents the qualitative activation parameters of each gene part, but more importantly, their activation parameters have also been given. Here, we designed a deep learning algorithm based on ChIP-seq data set to predict the affinity between transcription factors and binding sites, and attempted to use the affinity information to provide users with a reference to the activation parameters.

Alogorithm details

Data resource

Since there are few ChIP-seq data in prokaryotes at present, we adopted the method of mixed learning of eukaryotic and prokaryotic transcription factor data. We obtained ChIP-seq data from Mycobacterium Tuberculosis from TB Database ^[1], and Rattus Norvegicus, Saccharomyces Cerevisiae, Arabidopsis Thaliana, Caenorhabditis Elegans ChIP-seq data from GTRD ^[2]; Finally, about 20 million pieces of TF-BS data of 821 transcription factors were obtained. TF amino acid sequence and binding site DNA sequence were used as double independent variables, and enrichment score as dependent variable for model training.

pretreatment

Since each unique TF sequence and DNA sequence represents/influences its biological functions, We use one-hot encoding to preprocessing the input data. One-hot encoding Digitize the characteristics of the classified values. It solves the problem that the classifier can not deal with attribute data and extends the feature to some extent.

Fig1: DNA sequence is processed like this

Convolutional Neural Network (CNN)

Fig 2: the structure of CNN

Function description

ReLU0: Rectified Linear Unit. it is a kind of activation function commonly used in artificial neural network. represented by slope function(f(x=max(0,x)), since its derivative= 1, it perfectly reduced gradient disappearance problem(GDP).

MaxPool2d: Which applies a 2D max pooling over an input signal composed of several input planes Parameters used in our model.

Gradient descent: use Adaptive momentum (Adam), which has high computational efficiency and can adapt to large data set.

Cost function: With a large amount of data, we may get a large loss in the beginning. Thus, MSE is a smart choice, it is ideal for regression problem.

Reference

1.Reddy TB, Riley R, Wymore F, Montgomery P, DeCaprio D, Engels R, Gellesch M, Hubble J, Jen D, Jin H, Koehrsen M, Larson L, Mao M, Nitzberg M, Sisk P, Stolte C, Weiner B, White J, Zachariah ZK, Sherlock G, Galagan JE, Ball CA, Schoolnik GK. TB database: an integrated platform for tuberculosis research. Nucleic Acids Res. 2009 Jan;37(Database issue):D499-508. doi: 10.1093/nar/gkn652. Epub 2008 Oct 3. PMID: 18835847; PMCID: PMC2686437.
2.Ivan Yevshin, Ruslan Sharipov, Semyon Kolmykov, Yury Kondrakhin, Fedor Kolpakov, GTRD: a database on gene transcription regulation—2019 update, Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D100–D105, https://doi.org/10.1093/nar/gky1128

Image search Algorithm

Input: image processed by YOLOv4 and OCR

Preprocessing

Name correction

Since the identification result of Yolo + OCR has certain error, that is to say, the input component name has certain error, so it is necessary to do a correction of name.

The fuzzy matching algorithm is used to search the most similar name of the input component name in the database to correct the component name.(UniProt information in the database to be supplemented)

The concrete implementation of fuzzy matching algorithm is to take Levenshtein Distance in string editing distance as the standard to find the component with the shortest Levenshtein Distance of input component name in database. The calculation of Levenshtein Distance makes dynamic programming come true. The complexity is O(n), and N is the length of component name.

Because the component names of iGEM registry are encoded by numbers, and the meaning of fuzzy matching is often limited, the default component names from iGEM registry are accurate.

Functional correction

After the name of input element is corrected, the most likely corresponding accurate element in the database is known, so that the function of the element can be corrected. For names with functional prefixes such as "pro" and "RBS", special treatment will be applied, and the function corresponding to the name shall prevail.

Searching for similar structures

Purpose

The structure of the path graph is extracted, and the path graph which is closest to the input path graph in the database is searched.

option

The sub element of a compound may also be a compound, which means that the sub element of an element also has its own component tree.

Therefore, two alternative search preferences are provided:

• Recursive type: the element that does not expand the complex. For example, the sub components B and C of compound a are all compounds. The pathway structure of a is BC

• Recursive complex element. For example, if the sub-element B of complex a is a complex, the sub component C non-complex of a, and the sub component D and E non complex of B, the underlying channel structure of a is DEC.

Specific implementation

Extracting the genetic circuit

There may be multiple path graphs in the graph of Zhang input, so the partition of different path graphs can be realized by merging and subtracting sets. The structure extraction of specific path graph is realized by using linked list.

Searching the genetic circuit

If the path structure of the path graph is extracted and the component functions are mapped to characters, each path graph can be converted into a string corresponding to its path structure. The fuzzy matching algorithm based on Levenshtein Distance is used to obtain the most similar path graph.

For the search results in iGEM registry, the path diagram on synbiohub is provided. Otherwise, query the corresponding information of the path map in the database, and provide the path map and related information.

example

Make the path graph BBA on iGEM registry_B3202 search:
https://synbiohub.org/public/igem/BBa_B3202/1

BBa_Element BBA of b3202_I13516, BBa_B1202 is a complex, which is as follows:
https://synbiohub.org/public/igem/BBa_I13516/1

https://synbiohub.org/public/igem/BBa_B1202/1

By using recursive structure search, we can find the similar path map as follows:
https://synbiohub.org/public/igem/BBa_B3103/1
https://synbiohub.org/public/igem/BBa_B3106/1
https://synbiohub.org/public/igem/BBa_B3204/1
wait.
Specifically, let's look at BBA_B3103:

BBa_BBA of B3103_I13506, BBa_B1103 is a complex, which is as follows:
https://synbiohub.org/public/igem/BBa_I13506/1

https://synbiohub.org/public/igem/BBa_B1103/1

BBa_Element BBA of b1103_I3507 is a complex
https://synbiohub.org/public/igem/BBa_I13507/1

The structure of the first floor and the BBA of the system_B3202 is the same.

Searching similar parts

Purpose

The structure of the path graph is extracted, and the path graph which is closest to the components of the input path graph in the database is searched.

option

As mentioned above, the sub element of a certain complex may also be a compound, which means that the sub element also has its element of itself, thus forming a component tree.

Therefore, two alternative search preferences are provided:

• Recursive type: the element that does not expand the complex.

• Recursive type: recursively unfolds the sub element of the complex.

Specific implementation

Causes the priority queue to maintain the closest component queue.Make the greedy algorithm search pruning.

For each element in the input path graph, the fuzzy matching algorithm based on Levenshtein Distance is used to obtain the closest component. If the opposite number of edit distance between the two is taken as the score of the component, the closer the score is, the closer the score is. The adjacent components are placed in the priority queue and the total degree of the priority queue is maintained within the threshold.

The greedy idea favors the path graph containing the closest component queue maintained by the priority queue. The score of the path graph is obtained by summing the scores of the components in the priority queue contained in the path graph. The output of the path graph is selected as the output if the path graph has the highest score and its corresponding components.

Simulation for Gene Circuit

Introduction

To help researchers better understand how the gene circuit they created works, we build a series of models to simulate the dynamic behaviors of it. Most of the biochemical processes are made up of activation or repression, so we constructed activation and inhibition formulas based on Hill equation. After the user constructs the gene circuit in the designer, our software can create a corresponding ODE system automatically.

Model

Assumptions

In order to achieve a balance between efficiency and accuracy in the simulation, we simplify the model through some assumptions.

To construct an ODE system, we need to assume that the concentration of intracellular substances is evenly distributed. So we assumed that RNA polymers, ribosomes and chemicals are even in cells, and we neglect the effect of cell growth and division on the concentration of them.

In our ODE model, we only consider mRNA, protein and chemicals that can promote or inhibit promotors. We assumed that post transcriptional modification does not affect the quantity of mRNA and the efficiency of translation, and we don’t consider the effect of protein incorrect folding on protein concentration. In addition, we mainly consider the effects of promoters and RBSs on expression.

Basic Form of ODEs

We set the basic Concentration-Time ODEs form of a certain mRNA m or protein p as below^[1]:

[m] represents the concentration of mRNA, α is the basic transcription rate, d₁ is the degradation rate of mRNA; [P] represents the concentration of protein, β is the binding efficiency of ribosome to mRNA (translation rate), and d₂ is the degradation rate of protein.

If a circuit is promoted by y, we set the concentration-Time ODEs form as blow:

Where v represents the maximum transcription rate.

If a circuit is inhibited by z, we set the concentration-Time ODEs form as blow:

If a circuit is promoted by substance y and inhibited by substance z, we set the concentration-Time ODEs form as blow:

And:

Reference:

[1] Yeoh J W , Ng K B I , Teh A Y , et al. An Automated Biomodel Selection System (BMSS) for Gene Circuit Designs[J]. ACS Synthetic Biology, 2019.
[2] https://2018.igem.org/Team:SYSU-Software?Simulation-Circuits

Bayesian Optimization

Introduction

The optimization of biological systems is an important issue in the field of synthetic biology. At present, many solutions are based on bottom-up approach. However, biological modeling based on mechanism often has a large deviation from the actual results, and the large amount of prior knowledge required makes this method less operable. These problems have also been confirmed by the results of our questionnaire. This year, we take the expression level of genes in cells as the system parameters to be adjusted, and apply Bayesian optimization algorithm to these problems , by the top-down approach, hoping to assist users to use the least experimental cost to achieve the optimization of system parameters, and finally obtain the optimized biological system^[1].

Algorithm flow

Bayesian Optimization (BO) ^[2,3,4]

BO is an algorithm widely used in neural network hyperparameter Optimization. In essence, BO is a global optimization algorithm, which iteratively obtains the optimal parameters of the system by sampling the parameters selectively and learning the corresponding experimental results. Compared with other global optimization algorithms, BO has two distinct advantages:

1.High sample efficiency BO can use very little sampling cost to find a set of parameters with good performance;
Because of the high cost of biological experiments, this condition is necessary for algorithms to be applied to biological problems

2.Gradient-free BO does not need to assume that the objective model satisfies convex optimization conditions and can solve the optimization problem that cannot find the gradient.
Since most biological systems do not meet convex optimization conditions, and it is difficult to obtain and solve the corresponding high-precision model, this feature is also applicable to biological problems.

where x^* is the Optimized parameter vector we want, f(x) is the system function,X belongs Nⁿ,n is the number of parameters users choose to Optimize.

Gaussian Process (GP)

Given an input x_n, we suppose the corresponding function value f(x_n) comes from a Gaussian process, that is, every {f(x_n)^N_n=1 } corresponds to Gaussian distribution. where μ is the mean vector, Σ is the Covariance matrix of Gaussian Distribution of objective function, The covariance between the two function values f(x) and f(x') is defined by kernel function K(x, x'), where γ is a free parameter.

Acquisition Function (AF)

The aim of acquisition function is to decide which point x to try next using the existing knowledge of f(x). The policy is to choose x_next = argmax_xa(x), in the other word, the x that maximizes acquisition function a(x).

a(x) represents the utility value of a parameter vector x, the higher it is ,the more possibly a good result will achieve.

There are many types of acquisition function, including EI, CI, LCB. We choose EI as our acquisition function, partly because it performs better in preventing the algorithm being trapped around some local extreme point.

Expected Improvement (EI)

utility function u(x):

where f_t-1' represents the highest output value that we observed during last t-1 iteration.

EI function a_EI,t(x):

Where, a_EI,t(x) represents the a(x) in t^th iteration, Φ and φ respectively are the cumulative distribution function (CDF) and probability density function (PDF) of standard Gaussian distribution.

the parameter vector chosen in t^th iteration:

Reference

1.HamediRad M, Chao R, Weisberg S, Lian J, Sinha S, Zhao H. Towards a fully automated algorithm driven platform for biosystems design. Nat Commun. 2019 Nov 13;10(1):5150. doi: 10.1038/s41467-019-13189-z. PMID: 31723141; PMCID: PMC6853954.
2.Snoek, J., Larochelle, H. & Adams, R. P. Practical Bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process. Syst. 25, 2951–2959(2012).
3.Rasmussen C.E. (2004) Gaussian Processes in Machine Learning. In: Bousquet O., von Luxburg U., Rätsch G. (eds) Advanced Lectures on Machine Learning. ML 2003. Lecture Notes in Computer Science, vol 3176. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28650-9_4
4.Brochu, E., Vlad M. Cora and N. D. Freitas. “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.” ArXiv abs/1012.2599 (2010): n. pag.

Team:SYSU-Software/Model

Overview

Automated Design

GeneNet

Autofill

The criteria established are as follows:

Specific implementation of scoring standards:

Reference

TF Binding Sites Affinity Prediction

Introduction

Alogorithm details

Data resource

pretreatment

Fig1: DNA sequence is processed like this

Convolutional Neural Network (CNN)

Fig 2: the structure of CNN

Function description

Reference

Image search Algorithm

Preprocessing

Name correction

Functional correction

Searching for similar structures

Purpose

option

Specific implementation

Extracting the genetic circuit

Searching the genetic circuit

example

Searching similar parts

Purpose

option

Specific implementation

Simulation for Gene Circuit

Introduction

Model

Assumptions

Basic Form of ODEs

Reference:

Bayesian Optimization

Introduction

Algorithm flow

Bayesian Optimization (BO) [2,3,4]

Gaussian Process (GP)

Acquisition Function (AF)

Expected Improvement (EI)

Reference

Bayesian Optimization (BO) ^[2,3,4]