# Team:GZ HFI/Model

Model | iGEM GZ_HFI

# Model

In this section we propose a capsule in which three types of E.colis consuming ammonia, hydrogen sulfide and producing mycrene compete with each other and bacteria in the intestine. This is because they take up a similar niche with bacteria in the intestine environment. Consequently, some of our engineered E.colis may become extinct during competition after introduction, leading to the loss of the function of our product.

# Preliminaries

We are inspired by the general logistic growth model, widely used to describe the population growth in biology, because it depicts the growth accurately in previously study. In our data, the growth curve is decided by the growth rate without inhibition and inhibition rate. The inhibition rate here refers to many inhibitive factors, such as the inhibition of the environment, the inhibition of population density, the inhibition of other bacteria... With the increase of the total population, the inhibition rate increase meanwhile, slowing down the marginal growth rate. When the inhibition rate is same as growth rate, meaning the marginal growth rate is equal to zero, the population reaches the highest capacity.

Figure 1: Some sample logistic functions

Here are some basic assumptions of our model. First, assuming no other bacteria in the intestine, this model only considers competition among three engineered E.coli. Second, the growth rate of our E.coli in intestine is same as the data tested in vivo with lax anaerobic control due to technical limitation. Third, the environment of intestine--its pH value, temperature, humidity--is same to the LB medium we used in the experiment. Fourth, the growth rate of each E.coli would not be promoted or inhibited by the specific substances secreted by other bacteria or intestine itself. Fifth, the total population capacity of the bacteria in intestine keeps the same. Sixth, all our E.colis and nutrients distribute evenly in intestine and would not be excreted outside body during a certain period of time.

Therefore, by looking out for the growth rate of each single E.coli, and their respective inhibition rate, we can then calculate and draw curves of concentrations rate and finally find the real-time ratio of each E.coli. Fortunately, with the best-fit logistic curve of population growth of each E.coli, we can calculate the growth rate and inhibition rate of each E.coli.

Based on our experimental results, we trained such logistic regression model that predicts concentration with a given time period as input,

$$\frac {dA} {dt} = kP(1-\frac{P}{N_m})$$ taking the integral, we get $$A(t) = \frac{N_m}{1+\frac{N_m-P_0}{P_0} \cdot e^{-kt}} = \frac{N_m}{1+a \cdot e^{-kt}}$$

Here, noticing that P_0 can be trivial, we denote (N_m-P_0)/P_0 as an extra parameter a.

# Data

Before training our model, we have to tidy up and pre-process our data. Here we use pandas.

import pandas as pd
df


Figure 2: Raw data

From the raw data, we observe that there exists 68 columns, for 68 recording periods, as well as 39 groups of data, including 3 for the control groups. We notice that temperatures and cycle number should not taken account as features, nor should the control groups should be our training sets.

## Data Cleaning and EDA

To clean our data, we use pandas and list operations to remove some rows and indices. After cleaning, we end up with 36 groups of data. In every group of data, we have $68$ features, namely $68$ points of time, each labeled with a corresponding concentration rate. Since for every E.coli in every scenarios, there exist three groups of data and each points of time should be mapped to only one output, namely one concentration rate, we manually pair our groups in three for each scenarios, and calculate the respective average value as our training sets.

import itertools
df.columns = df.loc[0].values
df.drop(df.index[0:2],inplace=True)
df.drop(df.index[36:],inplace=True)
total = df.groupby('Trial').sum().max(axis=1)
all_sets = total.index.to_list()
all_data =  pd.DataFrame()
for set in all_sets:
temp = df[df['Trial'] == set][df.columns[1:]].T.sum(axis=1)
temp = temp.to_frame(set)
all_data = pd.concat([all_data, temp], axis=1)
all_data['$argA^fbr$ <antibio>'] = all_data[['A01', 'B01','C01','A07', 'B07','C07']].mean(axis=1)
all_data['$cysE-mut$ <antibio>'] = all_data[['A02', 'B02','C02','A08', 'B08','C08']].mean(axis=1)
all_data['$myrcene$ <antibio>'] = all_data[['A03', 'B03','C03','A09', 'B09','C09']].mean(axis=1)
all_data['$argA^fbr$'] = all_data[['A04', 'B04','C04','A10', 'B10','C10']].mean(axis=1)
all_data['$cysE-mut$'] = all_data[['A05', 'B05','C05','A11', 'B11','C11']].mean(axis=1)
all_data['$myrcene$'] = all_data[['A06', 'B06','C06','A12', 'B12','C12']].mean(axis=1)


Now the tidy data is visualized as follows,

all_data


Figure 3: Clean data

We can also visualize some of our experimental as follows, by mapping all the time periods with the corresponding concentrations onto numpy arrays and plot them in matplotlib (Harris, Hunter). After connecting all the data points, we see that, in general, the curves are in S shape.

import numpy as np
from matplotlib import pyplot as plt
import matplotlib.patches as patches
plt.style.use("tableau-colorblind10")
start = 0
ax = all_data[['A01', 'B01','C01','A07','B07','C07','$argA^{fbr}$ <antibio>']][start:].plot(style='-', figsize=(15,9))
markers = itertools.cycle(("o", "v", "^", "<", ">", "s", "p", "P", "*", "h", "X", "D", '.'))
for i, line in enumerate(ax.get_lines()):
marker = next(markers)
line.set_marker(marker)
_ = ax.legend()


Figure 4: We plot the change of concentration of A01, B01, C01, A07, B07, C07 and argA^{fbr} with antibacterial. We observe that, in general, the points are logistically distributed.

# Nonlinear Least Squares for argA^{fbr} with Antibacterial

When looking at the data, we only have the concentration rates per time period. We also have the formula that we want to apply, but we do not yet have the correct values of the parameters $a$, $k$ and $N_m$ in the formula.

Unfortunately, it is not possible to rewrite the Logistic Function as a Linear Regression, as was the case for the Exponential model. We will therefore need a more complex method: Nonlinear Least Squares estimation.

Define the logistic function that has to be fitted. First, we define the logistic function with input point of time t and parameters a, k, N_m and an offset to accommodate our model with the non-zero concentration rate at the begining.

def logistic(t, a, k, N_m, offset):
return N_m / (1 + a * np.exp(-k*t)) + offset


Random Initialization of parameters and upper, lower bounds set up. Next, we use np.random.random to initialize our parameters, set up a relatively high bounds to let the model free.

p0 = np.random.random(size=4)
bounds = (0., [10017.,3.,10019834.,10000.])


Use SciPy's Curve Fit for Nonlinear Least Squares Estimation. In this step, Scipy (Virtanen) does a Nonlinear Least Squares optimization, which minimizes the following lost function L,

$$\ell(a,k,N_m,\textrm{offset}) = \sum_{i=0}^T |r_i|^2,$$

where a residual |r_i|, the error distance matrix between ground-truth label concentraton rate and the predicted one, is given by,

$$r_i = y^{true}_i - f(t_i :a,k,N_m,\textrm{offset})$$

Here, f refers to the logistic model in its training state.

import scipy.optimize as optim
x = np.array([float(x) for x in all_data.index])
y = np.array([float(x) for x in all_data['$argA^fbr$ <antibio>']])
(a,k,N_m,offset),cov = optim.curve_fit(logistic, x, y, bounds=bounds, p0=p0)
a,k,N_m,offset

(81.26777631099041,
0.00022831181675420664,
0.5963896442822748,
0.03403317666750181)


Plot the fitted function vs the real data. As shown in the graph below, our Logistic model is very close to the ground truths.

test_logistic = lambda t : N_m / (1 + a * np.exp(-k*t)) + offset
plt.scatter(x, y)
plt.plot(x, test_logistic(x))
plt.title('Logistic Models v. Real Observation of group $argA^fbr$ <antibio>')
plt.title('Logistic Models v. Real Observation of group ')
plt.legend(['Logistic Model','Experimental Data'])
plt.xlabel('Time/(s)')
plt.ylabel('Concentration')


Figure 5: We plot the change of concentration of argA^{fbr}, both the actual and the predicted.

# Nonlinear Least Squares for all cases

In this section, we apply the same training algorithm, as we did to argA^{fbr} with antibacterial, to all 6 scenrios. We also visualize all the fitted curves alongside our original data. We observe that none of the curves are obviously under-fitted.

ret_data = {
'group_name':[],
'a':[],
'k':[],
'N_m':[],
'offset':[],
}
for group in all_data.columns[all_data.columns.get_loc('$argA^fbr$ <antibio>'):]:
p0 = np.random.random(size=4)
y = np.array([float(x) for x in all_data[group]])
(a,k,N_m,offset),cov = optim.curve_fit(logistic, x, y, bounds=bounds, p0=p0)
test_logistic = lambda t : N_m / (1 + a * np.exp(-k*t)) + offset
plt.scatter(x, y)
ret_data['group_name'].append(group)
ret_data['a'].append(a)
ret_data['k'].append(k)
ret_data['N_m'].append(N_m)
ret_data['offset'].append(offset)
plt.plot(x, test_logistic(x),marker = next(markers))
plt.title(f'Logistic Models v. Real Observation of all groups')
plt.xlabel('Time/(s)')
plt.ylabel('Concentration')


Figure 6: We plot the change of concentration of all scenarios, both the actual and the predicted.

We also display all the trained parameters in a pandas.DataFrame.

output_df = pd.DataFrame(ret_data)
output_df.to_csv('./outputs/out.csv')
output_df


Figure 7: All the resultant parameters

# Applying the Parameters into Three E.colis Coexisting Scenarios

For antibacterial settings, given the formula of derivatives of concentrations of the three crafted e.colis, and denote argA^fbr as A, cysE-mut as B, and myrcene as C, we have

$$\frac {dA}{dt} = k_A \cdot A (1 - \frac {A}{N_1} - \frac{k_B}{k_A} \cdot \frac{B}{N_2} - \frac{k_C}{k_A} \cdot \frac{C}{N_3}) = k_A \cdot A (1 - \frac {A}{N_1}) - \frac{k_B}{N_2} \cdot AB - \frac{k_C}{N_3} \cdot AC \$$ $$\frac {dB}{dt} = k_B \cdot B (1 - \frac {B}{N_2} - \frac{k_A}{k_B} \cdot \frac{A}{N_1} - \frac{k_C}{k_B} \cdot \frac{C}{N_3}) = k_B \cdot B (1 - \frac {B}{N_2}) - \frac{k_A}{N_1} \cdot BA - \frac{k_C}{N_3} \cdot BC \$$ $$\frac {dC}{dt} = k_C \cdot C (1 - \frac {C}{N_3} - \frac{k_A}{k_C} \cdot \frac{A}{N_1} - \frac{k_B}{k_C} \cdot \frac{B}{N_2}) = k_C \cdot C (1 - \frac {C}{N_3}) - \frac{k_A}{N_1} \cdot AC - \frac{k_B}{N_2} \cdot BC$$ in which the values of K_a,K_b,K_c, N_1, N_2, N_3 are pre-defined constants in Fig.7.

From the previous results in Fig.14, we observe that

$$\frac{k_A}{N_1} = 0.0003823008137306356$$ $$\frac{k_B}{N_2} = 0.00021057979637267946$$ $$\frac{k_C}{N_3} = 0.00031453301225957343$$ and therefore $$\frac{k_A}{N_1} > \frac{k_C}{N_3} > \frac{k_B}{N_2}.$$

Therefore, given that k_A > k_C > k_B, we can deduce that not only is B the slowest growing group, so is it the most prone to the disruption of other groups. Thus, B's constration should constantly be the minimum of the three. On the other hand, A is the rapidest growing group and is the least effected by other groups, so it outlives the other two e.colis under antibacterial settings.

Similarly, for antibacterial-free settings, we have

$$\frac{k_A}{N_1} = 0.0002476434970892165$$ $$\frac{k_B}{N_2} = 0.00032405774932581086$$ $$\frac{k_C}{N_3} = 0.00026389727700023636$$ and therefore $$\frac{k_B}{N_1} > \frac{k_C}{N_3} > \frac{k_A}{N_1}.$$ Therefore, given that k_B > k_C > k_A, we also can predict that A this time drops to zero first and B will be monotonically increasing.

These assumptions are supported by the following data.

t = np.arange(0,80000,1000)
def deriv(w,t,ka,kb,kc,n1,n2,n3):
x,y,z = w
return np.array([ka*(1-x/n1)*x-kb*y*x/n2-kc*x*z/n3,
kb*(1-y/n2)*y-ka*y*x/n1-kc*y*z/n3,
kc*(1-z/n3)*z-ka*z*x/n1-kb*y*z/n2])
fig = plt.figure(figsize=(10, 18))
p = ret_data['k'][:3] + ret_data['N_m'][:3] + [0.06,0.06,0.06]
ka,kb,kc,n1,n2,n3,x0,y0,z0=p
yinit = np.array([x0,y0,z0]) # initial vals
yyy = odeint(deriv,yinit,t,args=(ka,kb,kc,n1,n2,n3))
ax1.plot(t,yyy[:,0],marker = next(markers),label="$argA^{fbr}$")
ax1.plot(t,yyy[:,1],marker = next(markers),label="$cysE-mut$")
ax1.plot(t,yyy[:,2],marker = next(markers),label="$myrcene$")
ax1.set_xlabel('time/(s)')
ax1.set_ylabel('concentration')
ax1.set_title('Three engineer e.colis\' real-time concentrations comparison antibacterial')
_ = ax1.legend(loc=2)
p = ret_data['k'][3:] + ret_data['N_m'][3:] + [0.06,0.06,0.06]
ka,kb,kc,n1,n2,n3,x0,y0,z0=p
yinit = np.array([x0,y0,z0]) # initial vals
yyy = odeint(deriv,yinit,t,args=(ka,kb,kc,n1,n2,n3))
ax2.plot(t,yyy[:,0],marker = next(markers),label="$argA^{fbr}$")
ax2.plot(t,yyy[:,1],marker = next(markers),label="$cysE-mut$")
ax2.plot(t,yyy[:,2],marker = next(markers),label="$myrcene$")
ax2.set_xlabel('time/(s)')
ax2.set_ylabel('concentration')
ax2.set_title('Three engineer e.colis\' real-time concentrations comparison no antibacterial')
_ = ax2.legend(loc=2)


Figure 8: Predicted population concentrations of the three engineered e.colis with antibacterial(upper) and without antibacterial(lower) 80,000 seconds

As shown above, using the SciPy's integrate.odeint (Virtanen), we can solve the aforementioned system of differential equations and compute the integral of the upper six curves, which gives us the predicted concentration. we plot this resultant concentrations of three crafted e.coli with and without the effect of antibacterial. In addition, for e.coli, we also mark the effective time on the plot.

Also, if extending the time period to 100,000, we can observe that concentration of A converges to 0.6 and the other two to zeros in antibacterial environment. For an antibacterial-free setup, we observe that concentration of B converges to 0.8 and the others' to zeros.

Figure 9: Predicted population concentrations of the three engineered e.colis with antibacterial(upper) and without antibacterial(lower).

After predicting the growth curve of three E.coli containing pTYT-argA^fbr, pTYT-cysE-5-11-2, and mycrene production plasmids(pTYT-GPPS-MS, pMevT, and pMBIS), we want to further elongate the effective time when three kinds of E.coli can function regularly.

It is impossible for us to modify the growth rate of the E.coli in a short run without affacting their function, so we decide to change their ratio in the capsule to maximize the effective time. The original model showed above is when the ratio of three kinds of E.coli is equal, or, in other words, 1:1:1.

To generalize the modeling, we assumed that at the very begining when the capsule is eaten, the initial ratio of the E.coli is 1 : 1+x : 1+y. That is, if the population of the E.coli with gene argA^fbr and argR knockout is A, the population of cysE-5-11-2 plasmid contained E.coli and myrcene production plasmids contained E.coli are (1+x)*A and (1+y)*B respectively. By modifying the magnitude of x and y, we can gain the optimal ratio of three E.coli that maximize the effective time.

Below is the visualization of effective time of the magnitude of x and y.

THRESHOLD = 0.1
t = np.arange(0,2000000,1)
def deriv(w,t,ka,kb,kc,n1,n2,n3,offset1,offset2):
x,y,z = w
y+=offset1
z+=offset2
return np.array([ka*(1-x/n1)*x-kb*y*x/n2-kc*x*z/n3,
kb*(1-y/n2)*y-ka*y*x/n1-kc*y*z/n3,
kc*(1-z/n3)*z-ka*z*x/n1-kb*y*z/n2])
o1,o2 = params
p = ret_data['k'][3:] + ret_data['N_m'][3:] + [0.06,0.06,0.06]
ka,kb,kc,n1,n2,n3,x0,y0,z0=p
yinit = np.array([x0,y0,z0])
yyy = odeint(deriv,yinit,t,args=(ka,kb,kc,n1,n2,n3,o1,o2))
start = 0
cnt = 0
for i in range(1,yyy.shape[0]):
j = min(yyy[i])
if (j <= threshold) and (min(yyy[i-1]) > j):
return float(t[i] -t[start])
break
elif j > threshold:
cnt += 1
if cnt > 1:
continue
else:
start = i

from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
fig = plt.figure(figsize=(15,15))
ax = fig.gca(projection='3d')
X = np.arange(-0.1, 0.1, 0.0075)
Y = np.arange(-0.1, 0.1, 0.0075)
Z = np.array([add_offset(i) for i in tqdm(np.array(np.meshgrid(X, Y)).reshape(729,2))]).reshape(27,27)
X, Y = np.meshgrid(X, Y)
surf = ax.plot_surface(X, Y, Z, cmap=cm.viridis,edgecolor='none')
ax.plot([0,0],[0,0],[0,60000],'b',linewidth=8)
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("effective time")
fig.colorbar(surf, shrink=0.3, aspect=10)
plt.show()


Figure 10: Changes of effective time for different x and y. The blue verticle line refers to the point as x = 0 and y = 0. We can observe that the intersection of the line and the surface is behind the peek.

Using tf.keras.optimizers (Abadi, 2016), we can minimize the function lambda params : - add_offset(params), thereby performing a gradient ascent to the above surface. Here, we adopt Momentum(Polyak, 1964) as optimizer, 0.005 as step size and 0.9 as momentum. The resultant highest point in the graph is at point (-0.03, 0.025, 439431), which means that when the ratio of three E.coli is 8:4:7, we can get the maximum effective time 439431s or 122 hours.

In this condition, we presume that our engineered E.coli is able to function as long as the population is larger than 0. However, it is obviously that this setting is unreasonable because, in the reality, there must be a certain amount of same E.coli to maintain their function. Therefore, to rationalize our model, we then set a threshold of the population of our engineered E.coli. If the population of certain group is below than that threshold, we define this situation as nonfunctional. Only if all three population is higher than that specifc floor, can our product function.

Figure 11: Changes of effective time for different x and y (Bird's-eye View).

After setting a proper threshold, the original effective time with ratio 1 to 1 to 1 now downs to 5 hours. However, when the optimum ratio is used, the effective time still remains higher than 80 hours, showing that by modifying the initial ratio of three E.coli in the capsule.

This model instructs our design of the final product. By applying the proper ratio of three different E.coli in the capsule, we can maximize the effective time so that individuals are not required to take the capsule too often, which declinesthe fee burdened by them.

## References

Abadi, Martín, et al. "Tensorflow: A system for large-scale machine learning." 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16). 2016.

Harris, Charles R., et al. "Array programming with NumPy." Nature 585.7825 (2020): 357-362.

Hunter, John D. "Matplotlib: A 2D graphics environment." Computing in science & engineering 9.3 (2007): 90-95.

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.

Virtanen, Pauli, et al. "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python." Nature Methods, vol. 17, no. 3, 3, Nature Publishing Group, Mar. 2020, pp. 261–72. www.nature.com, doi:10.1038/s41592-019-0686-2.

🧬 P-Erfume Wiki ©2020 Team GZ_HFI.

Authored and maintained by Team GZ_HFI.