Team:Shanghai SFLS SPBS/Model

We construct a mathematical model based on an artificial neural network (ANN) that takes environmental factors as input and outputs the parameters of a logistic function, which in turn predicts the growth curve of two kinds of bacteria (E. coli and Vibrio natriegens).

Data recording

The first step is to alter environmental factors in the laboratory and record the growth data. To estimate the bacterial count, we use a microplate reader to measure absorbance, or optical density (OD), values. For each bacterial strain, we varied 8 environmental factors: growth medium (LB, LBv2, or M9), lab shaker setting (37°C, 220 rpm or 20°C, 120 rpm), glucose concentration (0, 1, 2, 4, 8, 16, or 32 g/L), and the concentrations of 5 ions (Na+, Ca2+, Mg2+, K+, and Mn2+). By altering these factors and measuring the resulting bacterial growth curves, we obtained a set of data to train our model.

The measurements continued for approximately three weeks. Each factor alteration took 12 hours so that the growth curve could reach its environmental capacity. For each set of environmental factors, we ran 2 to 3 biological repeats and 3 mechanical repeats to make sure the collected data is reliable. Moreover, all environmental factors were recorded in two growth media, LB and LBv2. First, the "basic" growth curve was recorded with no added glucose or ions at 37°C, 220 rpm. Next, the 20°C, 120 rpm option was recorded. Next, to work with a known nutrient content (as opposed to the unknown composition of LB medium), a glucose concentration gradient was created in M9 medium. Then, the same gradient was created in the regular LB and LBv2 media. After that, each of the 5 ions was given a gradient of concentrations and tested in LB and LBv2 media. In the end, we obtained 289 sets of growth curve data recorded under different environmental factors.

Model Construction

Logistic growth model

The logistic growth function we use has three parameters: "a" represents the environmental capacity, "b" represents the starting population, and "c" represents the steepness of the curve. The two kinetic parameters that we focus on are "a" and "c", which signify the maximum amount of bacteria and the growth rate, respectively.
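One common three-parameter logistic form that matches these parameter descriptions is sketched below. The exact parameterization is an assumption, since several equivalent forms exist; y(t) denotes the OD value at time t.

```latex
y(t) = \frac{a}{1 + b\,e^{-c t}}, \qquad
y(0) = \frac{a}{1 + b}, \qquad
\lim_{t \to \infty} y(t) = a
```

In this form, "a" is the plateau of the curve, "c" controls how quickly the plateau is approached, and "b" is tied to the starting population through y(0) = a / (1 + b).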

The figure shows the measured growth (blue line) and the fitted logistic curve (orange line) of E. coli (left) and V. natriegens (right). The fitted curve closely follows the original growth data.

Estimation of Kinetic Parameters

Before constructing the model in Python, we need to pre-process our recorded data and match it to the neural network structure. First, we convert each measured growth curve into the three kinetic parameters with the help of the curve_fit function in the scipy library (scipy.optimize.curve_fit). The arrays of fitted parameters become our network output, whereas the input is a feature layer that encodes the alterations in the environmental factors.
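The fitting step can be sketched as follows. This is a minimal illustration, not the team's script: the logistic form a / (1 + b·exp(-c·t)), the synthetic OD series, and the initial guesses are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit


def logistic(t, a, b, c):
    """Assumed form: a = capacity, b sets the start value a / (1 + b), c = steepness."""
    return a / (1.0 + b * np.exp(-c * t))


# Synthetic 12-hour OD series standing in for one microplate well (not real data).
t = np.linspace(0.0, 12.0, 49)
rng = np.random.default_rng(0)
od = logistic(t, 1.2, 30.0, 0.9) + rng.normal(0.0, 0.01, t.size)

# Rough initial guesses: the plateau is near the largest OD reading.
p0 = [od.max(), 10.0, 0.5]
(a_fit, b_fit, c_fit), _ = curve_fit(logistic, t, od, p0=p0, maxfev=10000)

print(f"a={a_fit:.3f}, b={b_fit:.3f}, c={c_fit:.3f}")
```

Repeating this fit for every recorded curve would give one (a, b, c) triple per curve, which is the target the network is trained to predict.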

Factor | Values
Glucose concentration | 0, 1, 2, 4, 8, 16, 32 g/L
Na+ concentration | 0, 0.05, 0.1, 0.5, 1 M
K+ concentration | 0, 0.001, 0.01, 0.05, 0.1 M
Mg2+ concentration | 0, 0.001, 0.01, 0.05, 0.1 M
Mn2+ concentration | 0, 0.0001, 0.001, 0.01, 0.1 M
Ca2+ concentration | 0, 0.0001, 0.001, 0.01, 0.1 M
M9 | 0/1 (binary)
LB | 0/1 (binary)
LBv2 | 0/1 (binary)
37°C 220rpm | 0/1 (binary)
20°C 120rpm | 0/1 (binary)
E. coli | 0/1 (binary)
V. natriegens | 0/1 (binary)

For every set of 3 parameters, the 13 factor values listed above (6 concentrations and 7 binary flags) serve as the input of the network. By pairing each set of parameters with the factors it was measured under, the network can learn connections between them and devise a generalized pattern, which enables us to predict the outcome of untested factor combinations.
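As an illustration of how one experimental condition could be turned into such a 13-value input, here is a small sketch. The feature names, their order, and the encode helper are hypothetical and chosen only for this example.

```python
# Hypothetical feature order: 6 concentrations followed by 7 binary flags.
CONCENTRATIONS = ["glucose_g_per_L", "Na_M", "K_M", "Mg_M", "Mn_M", "Ca_M"]
FLAGS = ["M9", "LB", "LBv2", "37C_220rpm", "20C_120rpm", "E_coli", "V_natriegens"]
FEATURES = CONCENTRATIONS + FLAGS


def encode(strain, medium, shaker, concentrations=None):
    """Return the 13-value input vector for one experimental condition."""
    row = dict.fromkeys(FEATURES, 0.0)
    for name, value in (concentrations or {}).items():
        if name not in CONCENTRATIONS:
            raise ValueError(f"unknown concentration feature: {name}")
        row[name] = float(value)
    for flag in (strain, medium, shaker):
        if flag not in FLAGS:
            raise ValueError(f"unknown binary feature: {flag}")
        row[flag] = 1.0  # switch on the matching binary flag
    return [row[name] for name in FEATURES]


# E. coli in LB at 37°C, 220 rpm with 0.01 M Mg2+ added.
x = encode("E_coli", "LB", "37C_220rpm", {"Mg_M": 0.01})
print(x)
```

Each call yields one input row; stacking the rows for all recorded curves gives the input matrix for training.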

Erroneous data

One problem we encountered in data processing was the irregularity of some sets of data, for which the three fitted parameters were either negative or too large to be correct. In those cases, we first set bounds on the fit to restrict the parameters to a reasonable interval. If that did not work, we deleted those data from the set. This happened to the E. coli dataset in the 20°C 120rpm environment, presumably because the bacteria did not have enough time to grow to the maximum environmental capacity, so the measurement ended while the curve was still increasing and concave up.
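A bounded fit with a rejection rule could look like the sketch below. The logistic form, the bound values, and the fit_or_none helper are assumptions for illustration; the bounds argument of curve_fit is the only library feature relied on.

```python
import numpy as np
from scipy.optimize import curve_fit


def logistic(t, a, b, c):
    """Assumed logistic form: a = capacity, b sets the start value, c = steepness."""
    return a / (1.0 + b * np.exp(-c * t))


def fit_or_none(t, od, a_max=5.0, b_max=1e4, c_max=5.0):
    """Return (a, b, c), or None if the fit fails or ends up on an upper bound."""
    lower = [0.0, 0.0, 0.0]          # negative parameters are not meaningful
    upper = [a_max, b_max, c_max]    # illustrative limits, not measured values
    p0 = [min(max(float(np.max(od)), 1e-3), 0.99 * a_max), 10.0, 0.5]
    try:
        params, _ = curve_fit(logistic, t, od, p0=p0, bounds=(lower, upper))
    except (RuntimeError, ValueError):
        return None
    # A parameter sitting on its upper bound suggests the curve never plateaued.
    if np.isclose(params, upper, rtol=1e-3).any():
        return None
    return tuple(float(p) for p in params)


t = np.linspace(0.0, 12.0, 49)
clean = logistic(t, 1.2, 30.0, 0.9)
print(fit_or_none(t, clean))
```

Curves for which the helper returns None would be the candidates for removal from the dataset.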

Artificial Neural Network

Using TensorFlow in Python, we constructed a simple artificial neural network (ANN) consisting of 4 layers: the 13 environmental factors form the input layer, the 3 kinetic parameters form the output layer, and between them are 2 hidden layers with 20 perceptrons each.

In this structure, every perceptron is linked to all perceptrons in the next layer (except for the three in the output layer, which generate the output). A weight is assigned to each connection to increase or decrease its importance, and these weights are adjusted during training. Each perceptron sums the weighted outputs of the previous layer and applies an activation function, or "threshold" function, to determine its own output. In this case, we use the popular "ReLU" function. Finally, the weighted values are passed to the three final perceptrons. Because we are working with a regression model, in which specific numeric values are required (as opposed to a classification model, where a class label is needed), the three predicted parameters are the final output of the model.
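The forward pass through a 13-20-20-3 network of this kind can be sketched in plain NumPy. This is an illustration of the structure only, not the TensorFlow model itself: the random initial weights and the bias terms are assumptions, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [13, 20, 20, 3]  # input, two hidden layers, output

# One weight matrix and one bias vector per pair of consecutive layers.
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]


def relu(z):
    """ReLU activation: negative values become 0, positive values pass through."""
    return np.maximum(z, 0.0)


def forward(x):
    """Map 13 environmental factors to 3 predicted kinetic parameters."""
    h = np.asarray(x, dtype=float)
    for w, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ w + b)                 # weighted sum, then activation
    return h @ weights[-1] + biases[-1]     # linear output for regression


x = np.zeros(13)
x[7] = x[9] = x[11] = 1.0  # example flags switched on (positions are illustrative)
print(forward(x))
```

The output layer has no activation function, which is the usual choice when the network must return unbounded numeric values rather than class probabilities.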

Model Evaluation

Cost Function

In order to evaluate our model, we also need a cost function to calculate how close the model's predictions are to the actual results. In this case, we choose the Mean Absolute Error (MAE), a popular regression cost function defined as the mean of the absolute differences between the target and predicted values: MAE = (1/n) Σ |y_i − ŷ_i|, where n is the number of values, y_i is a target value, and ŷ_i is the corresponding prediction.

During training, the model aims to lower the cost function by repeatedly adjusting the weights so that its outputs move closer to the actual values.
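For reference, the MAE can be computed in a few lines. The numbers below are made up for the example and are not taken from our dataset.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Two illustrative samples, each with three parameters (a, b, c).
target = [[1.20, 30.0, 0.90], [0.80, 25.0, 0.60]]
predicted = [[1.10, 28.0, 1.00], [0.90, 27.0, 0.50]]

mae = mean_absolute_error(target, predicted)
print(mae)  # about 0.733
```

Because the three parameters can have very different magnitudes, the largest one dominates an unscaled MAE, as the example shows.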

Cross-Validation and Train Test Split

Because our dataset is rather limited and a neural network usually requires a large number of samples to train well, we employ the K-Fold cross-validation method. It splits the training set into 10 folds and runs 10 iterations, each holding out a different fold for validation, so that the model is exposed to the entire training set.

The 10 models generated after the 10 iterations are averaged into a single model, which combines the advantages of all 10 models. This is a useful way to train on a limited dataset, as well as a good way to reduce overfitting.

Before cross-validation, we used train-test-split to split our original data into a training set (80%) and a testing set (20%). After running K-Fold validation, the model attempts to predict the 3 parameters from the 13 factors in the testing set.
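The splitting scheme described above can be sketched with NumPy alone. The helper names, the random seed, and the use of 289 samples are illustrative assumptions; the number of curves that remained after removing erroneous data is not stated here.

```python
import numpy as np


def train_test_indices(n, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing."""
    order = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_fraction))
    return order[n_test:], order[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index arrays; each fold validates exactly once."""
    folds = np.array_split(indices, k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]


train_idx, test_idx = train_test_indices(289)
splits = list(k_fold_indices(train_idx, k=10))

print(len(train_idx), len(test_idx), len(splits))
```

One model would be trained per (train, validation) pair, and the held-out test indices are only used after cross-validation is finished.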

Model Evaluation

The accuracy of the model is evaluated through the Pearson correlation coefficient, which measures the linear correlation between two variables X and Y: r = Σ (x_i − x̄)(y_i − ȳ) / sqrt( Σ (x_i − x̄)² · Σ (y_i − ȳ)² ), where x̄ and ȳ are the means of X and Y.

The first and third parameters ("a" and "c") achieved correlations of 0.722 and 0.702, respectively, which indicates a strong correlation between the predicted and actual values. The second parameter ("b"), however, only achieved 0.339, presumably because it reflects the starting population, which is subject to more sources of variation. Fortunately, it is not one of the kinetic parameters we focus on in our experiment.
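The Pearson correlation between predicted and actual values can be computed as in the sketch below. The arrays are made-up examples, not our test results.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))


actual = [0.8, 1.0, 1.1, 1.3, 1.5]        # illustrative "a" values
predicted = [0.9, 0.95, 1.2, 1.25, 1.4]   # illustrative model outputs

r = pearson_r(actual, predicted)
print(r)
```

A value of 1 means the points lie exactly on an increasing straight line, 0 means no linear relationship, and -1 means an exactly decreasing line.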

After running our test data through the first and third models, we graphed a best-fit line for each to see how well the predictions fit the actual values:

The theoretical optimal line is y=x, on which every predicted value equals the corresponding real value, that is, 100% accuracy. With our first model, the x coefficient is 1.0429, extremely close to the expected value of 1, which shows a good line fit. The third model shows an x coefficient of 0.7837, which is somewhat below the expected value but still a reasonable prediction.
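The x coefficient of such a best-fit line can be obtained with a degree-1 polynomial fit. In this sketch the data are synthetic, and placing the actual values on the x axis and the predictions on the y axis is an assumption.

```python
import numpy as np

# Synthetic example: predictions that follow the actual values with slope 0.8.
actual = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 1.8])
predicted = 0.8 * actual + 0.1

# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
slope, intercept = np.polyfit(actual, predicted, deg=1)

print(f"y = {slope:.4f} x + {intercept:.4f}")
```

A slope near 1 together with an intercept near 0 corresponds to the ideal line y=x.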

With this model, we will be able to optimize the growth conditions of E. coli, Vibrio natriegens, or any other type of bacteria, after having collected some data. We can also assign different weights to the maximum capacity and growth rate of these bacteria as needed.

Contact us

Email: igem_modu@126.com

WeChat public account: iGEM2020Modu