Design
1.Introduction
The focus of our project is the metabolic pathway. A metabolic pathway is a set of related cellular metabolic reactions. These reactions are so interconnected in various ways that given the number of metabolic reactions an organism has, it is almost impossible for a researcher to analyze them comprehensively without the extra help of various software.
2.Overview
This year, our software is meant to be a toolbox for both iGEMers and those who work in synthetic biology, which inevitably involves the modification of cell metabolism. The toolbox we provide is designed to help researchers engineer, modify, and evaluate the metabolic pathways, and it consists of three tools.
The Metabolism Simulation tool allows users to observe how metabolism proceeds in a cell under different circumstances. for example, in a solution with a different concentration of glucose or when a gene is knocked out. The other tool is a Pathway Finder, which analyzes the target compound and its related pathways much easier than before. The potential precursors and their pathways to a target compound can be found using our tool, and after some design and modification, researchers can evaluate their engineering by utilizing the Metabolism Simulation, which can help them avoid undesirable mistakes. Our last tool is Synthetic Bay, which provide data support for the first two. Meanwhile, you can also download our integrated database to analyze other information or find more interesting results.
Most importantly, our toolbox has a pipeline for your analysis. you can search the ID from database to get what you want, and then use these IDs to search for the best pathway. finally, according to the reactions on the pathway, we set initial states value, simulate metabolic reactions and predict what will be changed.
3.Details
Database
To provide data support for the Synthesis Navigator software's pathway search function and metabolic simulation function, we integrated several databases, including MetaCyc, ChEBI, BRENDA, eQuilibrator, and KEGG.
Finally, we get vast amount of data. Through some data processing and cleaning methods, we obtained three data tables: Compound, Reaction and Enzyme. We provide some attributes, associate the three tables with some features, and then integrate them into the database 'sql' file. After the database is complete, our data properties are improved into 21(5+6+10) and we add more than 310k records for this database.
figure1. Our database and several other databases.
Preparation
To use computers in the analysis of interconnected metabolic reactions, a concept in computer science named graph is adopted in our project.
In computer science, a graph is an abstract data type that is meant to implement the undirected graph and directed graph concepts from the field of graph theory within mathematics.
A graph data structure consists of a finite (and possibly mutable) set of vertices (also called nodes or points), together with a set of unordered pairs of these vertices for an undirected graph or a set of ordered pairs for a directed graph. These pairs are known as edges (also called links or lines), and for a directed graph are also known as arrows. The vertices may be part of the graph structure or may be external entities represented by integer indices or references.
A graph data structure may also associate some edge value to each edge, such as a symbolic label or a numeric attribute (cost, capacity, length, etc.).
We can integrate interconnected metabolic reactions into a metabolism graph with the graph concept, with edges representing reactions and nodes representing compounds. In this way, the biology concept is well suited for computers, and algorithms can be applied.
Algorithm & Application
·Random Walk in Metabolism Simulation
Our Metabolism Simulation tool’s core is random walking, which is a powerful tool in graph analysis. Random walking is like the Monte-Carlo method, and the difference is that random walking is applied to a graph. If a molecule 'walk' through an edge, the molecule undergoes a reaction in the metabolism graph. In our tool, we put many different compound molecules into the metabolism graph and let them walk randomly. In the process, changes in the quantity of each kind of molecule are recorded as the simulation result, which approximates the actual cell metabolism process.
Other methods, even those widely used in metabolism analysis like ODE/PDE based dynamic models and flux-based models, are not that practical in real life scene because many data needed for those methods are not available, and the methods themselves are not robust enough for real scene application. Luckily, the metabolism network itself has its information about how cell metabolism is carried out, and for most prokaryotic cells like E.coli, the pathway networks are robust enough. Instead of precise ODE/PDE based dynamic models or rough FBA or FVA models, our random walking model can be an ideal choice.
·Graph Algorithms in Pathway Finder
The core of our Pathway Finder is traditional graph algorithms. A*, Dijkstra, and Yen’s k shortest way algorithm are used. Here the metabolism graph is slightly different from that in the Metabolism Simulation tool, in that the nodes are divided into two types: reactions and compounds, and edges now are half-reactions. This kind of graph is called a Bipartite Graph.
We have changed our path search algorithm thoroughly this year with NetworkX, a python library to replace the handwritten DFS and Greedy search in the past year’s project since standard algorithm libraries are more stable and fine-tuned.
A*, Dijkstra, and Yen’s k shortest are used in the simple search function, where the starting and target compound is given. If you want to get Detailed algorithm introduction, please visit ourmodelspage.
The A* algorithm defines a loss function f(x) = g(x) + h(x) for the current state x, where g(x) is the actual cost of getting from the initial state to the current state, and h(x) is the estimated cost of the optimal path from the current state to the target state. Through a repeated operation for getting lower f(x), we will minimize the sum of costs.
Dijkstra algorithm is only suitable for non-negative weights, but the time complexity is excellent, which is O(nlogn) with a priority queue. It is also an algorithm for finding a single source shortest path. Dijkstra algorithm is also used in the reverse search when only the target compound is given.
K-th shortest path algorithm in NetworkX based on Yen’s algorithm, reported in Finding the K Shortest Loopless Paths in a Network, 1971. In his algorithm, "its computational upper bound increases only linearly with the value of K". We only need to add a very large penalty to the nearest path and then repeat the A* or Dijkstra algorithm. Finally, we can get the k-th shortest path.
·Other Applied Technologies
To make a user-friendly toolbox, apart from the core algorithms, we also used some open-source libraries and tools in our project, and we appreciate their contribution to the science community. These tools, software, and libraries are listed below:
figure2. Other Applied Technologies.