Team:SYSU-Software/Engineering

Engineering Success

The following picture shows the engineering process of our software.

1. Recognition of need

When searching for synthetic biology information with traditional search engines such as PubMed or Web of Science, we found it inconvenient to find appropriate genetic circuits (Question 17), which are the key to design. To find out whether this problem was widespread, we sent questionnaires to biological labs and iGEM teams. To our surprise, it was indeed a major problem. Predictability of results (Question 2) turned out to be another problem. In addition, too much effort is spent on searching and on repeated experiments, which urgently needs to be addressed (Question 6).

2. Requirement

Based on the requirement analysis above, we concluded that our software should inspire synthetic biologists and help them design in less time. Furthermore, we could introduce a parameter optimization module to improve users' designs with less experimental workload. In short, we could build a more powerful design platform.

3. Specification

We knew that the quickest way to understand a synthetic biology design is to look at its genetic circuit diagram. Therefore, we wanted to use an image-search method to find the most closely related circuit designs. In this way, users can improve their designs according to the retrieved results.

Also, when biologists think of an interesting biological function, they do not always know how to construct a genetic circuit to implement it. We came up with the idea that if the biological function could be converted into a mathematical function, we would be able to output a genetic network, thus helping users design.

For parameter optimization, our software first gives a theoretical set of parameters. Based on these parameters, biologists conduct experiments and then enter the results back into the software, which updates its suggestions for the next round. The parameters are thus optimized continuously across experiments, forming a cycle.
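The suggest, measure and update cycle above can be sketched with Bayesian optimization. This is only a minimal illustration under stated assumptions, not our implementation: a single parameter scaled to [0, 1], a zero-mean Gaussian process surrogate with an RBF kernel, an upper-confidence-bound rule for picking the next experiment, and a hypothetical `run_experiment` function standing in for the wet-lab measurement. All names and constants are illustrative.

```python
import math


def rbf(a, b, length=0.2):
    """RBF kernel between two scalar parameter values (length scale is assumed)."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def posterior(xs, ys, x_new, noise=1e-4):
    """Gaussian process posterior mean and variance at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    mean = sum(ki * ai for ki, ai in zip(k, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, max(var, 0.0)


def suggest(xs, ys, candidates, kappa=2.0):
    """Pick the candidate with the highest upper confidence bound."""
    def ucb(x):
        mean, var = posterior(xs, ys, x)
        return mean + kappa * math.sqrt(var)
    return max(candidates, key=ucb)


def run_experiment(x):
    """Hypothetical stand-in for a wet-lab measurement (higher is better)."""
    return -(x - 0.3) ** 2


def optimize(rounds=10):
    """Run the suggest -> experiment -> update cycle; return the best (x, y)."""
    candidates = [i / 50 for i in range(51)]
    xs = [0.0, 1.0]
    ys = [run_experiment(x) for x in xs]
    for _ in range(rounds):
        x_next = suggest(xs, ys, candidates)
        xs.append(x_next)
        ys.append(run_experiment(x_next))  # result fed back into the model
    best_y, best_x = max(zip(ys, xs))
    return best_x, best_y


if __name__ == "__main__":
    print(optimize())
```

In a real cycle, `run_experiment` would be replaced by the measured result that the biologist enters after each round.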

4. Research

We first looked at the Google image search engine and found that its results could not meet our needs. After consulting an associate professor and reading some papers, we realized that the current Google image recognition technique and biomedical image search engines use Content-based Image Retrieval, which is based on an image’s color or shape. However, they ignore the parts’ information and relationships in a genetic circuit, which are important in synthetic biology design. We wanted to build a new image search engine suited to synthetic biology, in which matching is based on parts.

In some papers, scientists used a machine learning algorithm called GeneNet to implement top-down design, which suits our need-to-design idea. This algorithm takes an ODE or PDE as input and returns a matrix of interactions between genes, which meant we needed a front end that could convert users’ biological function into a mathematical function, and a back end that could convert the matrix into a visual genetic circuit.
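As a rough illustration of the back-end step, the sketch below turns an interaction matrix into a list of regulatory edges that a drawing layer could render. It assumes a convention that is not taken from GeneNet itself: `matrix[i][j]` is the effect of gene `j` on gene `i`, positive for activation and negative for repression, and weights below a hypothetical `threshold` are ignored. The gene names are placeholders.

```python
def matrix_to_edges(matrix, genes, threshold=0.1):
    """Convert a signed interaction matrix into (regulator, target, kind) edges.

    Assumed convention: matrix[i][j] is the effect of genes[j] on genes[i].
    """
    edges = []
    for i, target in enumerate(genes):
        for j, regulator in enumerate(genes):
            weight = matrix[i][j]
            if abs(weight) < threshold:
                continue  # treat weak interactions as absent
            kind = "activation" if weight > 0 else "repression"
            edges.append((regulator, target, kind))
    return edges


# Placeholder three-gene ring in which each gene represses the next one.
genes = ["A", "B", "C"]
ring = [
    [0.0, 0.0, -1.0],  # A is repressed by C
    [-1.0, 0.0, 0.0],  # B is repressed by A
    [0.0, -1.0, 0.0],  # C is repressed by B
]

if __name__ == "__main__":
    for edge in matrix_to_edges(ring, genes):
        print(edge)
```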

After discussing with programmers, we learned that Bayesian Optimization performs well in the field of automated hyperparameter search for networks. So we formed a group and studied it in depth to find out how it could fit into our project.

5. Imagination

In our survey, most people preferred all-in-one software (Question 14). We had introduced some new ideas into synthetic biology design, but they were only loosely connected. How could we make an integrated platform?

We learned about a workflow named DBTL, short for Design, Build, Test and Learn. Why not create an automated design platform that follows the DBTL workflow, we thought.

First, biologists can either input a target gene expression demand, for which we return an initial genetic circuit, or complete their own circuit design on our platform, for which we return some closely related circuits from published papers or synthetic biology communities. Next, we provide a simulation, which is the Test stage. Finally, we return a calculated set of parameters for biologists to use in experiments. Once the experimental results are entered, the cycle begins and searches for the best genetic circuit design. With our software, you only need to provide your demand or a preliminary design, and an optimized design is exported automatically. You can start at any phase and go back to whichever phase you want.

6. Design

We first completed the design module. For the image-search engine, we conceived that we could extract the parts’ information and the overall structure from a genetic circuit diagram. Based on these, any two circuits can be compared, giving a quantified value under a defined distance function. In the end, we can rank the circuits according to these scores. For GeneNet, we not only export a gene-gene interaction topology but also use an autofill algorithm to search the database we integrated for a set of transcription factors and promoters and place them into the genetic circuits. After rating these circuits based on some principles, we export the one or two best ones.
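To make the comparison step concrete, here is a minimal sketch of one possible distance function and ranking. It is an assumption for illustration, not the distance function we defined: each circuit is a list of `(part_type, part_name)` pairs, part information is compared with a Jaccard distance on part names, structure is approximated by a normalized edit distance on the sequence of part types, and the two terms are mixed with a hypothetical weight `w_parts`. The part names in the usage example are illustrative.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| on two collections of part names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        curr = [i]
        for j, y in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def circuit_distance(c1, c2, w_parts=0.5):
    """Weighted mix of part-name distance and part-type order distance."""
    names1 = [name for _, name in c1]
    names2 = [name for _, name in c2]
    types1 = [ptype for ptype, _ in c1]
    types2 = [ptype for ptype, _ in c2]
    part_term = jaccard_distance(names1, names2)
    longest = max(len(types1), len(types2))
    struct_term = edit_distance(types1, types2) / longest if longest else 0.0
    return w_parts * part_term + (1.0 - w_parts) * struct_term


def rank(query, database):
    """Sort (label, circuit) entries from most to least similar to the query."""
    return sorted(database, key=lambda entry: circuit_distance(query, entry[1]))


if __name__ == "__main__":
    query = [("promoter", "pTet"), ("rbs", "B0034"),
             ("cds", "gfp"), ("terminator", "B0015")]
    database = [
        ("two-part circuit", [("promoter", "pLac"), ("cds", "lacI")]),
        ("rfp variant", [("promoter", "pTet"), ("rbs", "B0034"),
                         ("cds", "rfp"), ("terminator", "B0015")]),
    ]
    for label, circuit in rank(query, database):
        print(label, circuit_distance(query, circuit))
```

A smaller value means a closer match, so the ranking is simply a sort by distance.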

Then we combine these two functions with other functions to make a complete design platform. Next, we add the simulation module, which is based on the Hill function. Finally, we will integrate Bayesian Optimization as the Learn stage.
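A Hill-function-based simulation can be sketched as follows. This is a simplified example under assumptions, not the model used in our module: a ring of genes in which each gene is repressed by the previous one, production given by a repressive Hill function, first-order degradation, forward Euler integration, and illustrative parameter values (`alpha`, `K`, `n`, `delta`, `dt`).

```python
def hill_repression(x, alpha=10.0, K=1.0, n=2.0):
    """Production rate of a gene repressed by a regulator at concentration x."""
    return alpha / (1.0 + (x / K) ** n)


def simulate_ring(x0, steps=2000, dt=0.01, delta=1.0):
    """Integrate dx_i/dt = hill_repression(x_{i-1}) - delta * x_i with Euler steps.

    Gene i is repressed by gene i-1; index -1 wraps around to the last gene.
    Returns the list of states, including the initial one.
    """
    x = list(x0)
    trajectory = [tuple(x)]
    for _ in range(steps):
        production = [hill_repression(x[i - 1]) for i in range(len(x))]
        x = [xi + dt * (p - delta * xi) for xi, p in zip(x, production)]
        trajectory.append(tuple(x))
    return trajectory


if __name__ == "__main__":
    trajectory = simulate_ring((1.0, 2.0, 3.0))
    # Print every 200th state to get a coarse view of the dynamics.
    for state in trajectory[::200]:
        print(["%.3f" % value for value in state])
```

Whether such a ring oscillates depends on the chosen parameters, so the values here should be treated as placeholders to be replaced by fitted or measured ones.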

7. Coding&Build

To build the search engine, we first trained a CNN model to select the circuit diagrams we needed from open source papers. Then we used OpenCV (Computer Vision) and OCR (Optical Character Recognition) to extract parts information, and Yolo v3 (You Only Look Once) to extract the circuit structure. Using this information, we built a database following the SBOL (Synthetic Biology Open Language) standard. Under a self-defined distance function, we can calculate the similarity of two genetic circuits.

To implement top-down design, we combined GeneNet with an autofill algorithm and made it more user-friendly.

8. Testing

We first downloaded 500 papers and used the CNN model to select the wanted images to form a training dataset. After applying OpenCV and OCR, we examined the accuracy and found it was only a little over 50%. In addition, Yolo v3 could only recognize CDS (coding sequence) parts.

For GeneNet, we decided to conduct experiments for validation. We input an oscillatory equation with three nodes (genes) and got a gene-gene interaction matrix. Based on the matrix, we selected appropriate promoters and genes coding for transcription factors. We plan to construct plasmids according to the genetic circuit output and transform them into E. coli. We will measure the expression of GFP to see whether the circuit constructed by our auto-designer is accurate.

9. Learning

Because the image-search engine’s accuracy was poor, we wondered whether this was due to our small training dataset. We expanded the dataset and trained again. Together with other improvements, we reached around 80% accuracy, and Yolo v3 could recognize promoters and other types of parts. Combining all these, we tested the whole process by inputting several initial genetic circuits and received well-matched retrieval results.

10. Improvement

For the search engine, we will look for more ways to improve precision, for example a more effective algorithm to replace Yolo v3. In the simulation module, a model based on global metabolic simulation may be more accurate than the traditional model, and more factors should be taken into account, such as the physiological state of the host. FBA (Flux Balance Analysis) is a good algorithm for that. If we can combine Bayesian Optimization with FBA, our auto-designer could have a great impact on metabolic engineering.

11. Maintenance

In the future, we will continually integrate synthetic biology designs from papers and communities. Our team will fix problems in our database and software as soon as possible. We will do our best to promote our software to more researchers. More importantly, we will keep making our software easier to use and creating more value.