Note: This description of the project reflects its state after the iGEM competition on November 16th, 2020. The project will continue to be developed and updated. For the current state of the project, please refer to the GitHub link in the "Code Section" below; the README in the GitHub repository contains the current project description.
To decide which microbes were worth training a generative model on to produce protein-coding gene sequences, we first trained a classification model using logistic regression in TensorFlow 2.3.0. We defined this model as a keras.Sequential() with two dense layers: a hidden layer of 64 units and an output layer of 1 unit. The model was trained for 10 epochs.
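As an illustration of the architecture described above, the following is a minimal sketch of the forward pass and loss in plain Python, so that it runs without TensorFlow. It is not the project's code. The ReLU hidden activation, the sigmoid output, the binary cross-entropy loss, the random weight initialization, and the feature count `N_FEATURES` are all assumptions, since the description only specifies the layer sizes.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weight vector and one bias per output unit."""
    return [sum(x * w for x, w in zip(inputs, unit_w)) + b
            for unit_w, b in zip(weights, biases)]


def relu(values):
    """Assumed hidden activation (not stated in the project description)."""
    return [max(0.0, v) for v in values]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def binary_cross_entropy(label, prob, eps=1e-12):
    """Loss minimized by logistic regression for a single example."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(features, hidden_layer, output_layer):
    """Dense(64) followed by Dense(1), returning a probability."""
    hidden = relu(dense(features, *hidden_layer))
    logit = dense(hidden, *output_layer)[0]
    return sigmoid(logit)


rng = random.Random(0)
N_FEATURES = 8  # hypothetical: the real feature count is not given
hidden_layer = init_layer(N_FEATURES, 64, rng)   # 64-unit hidden layer
output_layer = init_layer(64, 1, rng)            # 1-unit output layer

features = [rng.random() for _ in range(N_FEATURES)]
prob = predict(features, hidden_layer, output_layer)
loss = binary_cross_entropy(1, prob)
print(f"predicted probability: {prob:.4f}, loss for label 1: {loss:.4f}")
```

Training would repeatedly adjust the weights to reduce this loss over the dataset; in Keras that is handled by `compile` and `fit`, with the epoch count set to 10 as described above.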
Our Generative Adversarial Network (GAN) model was taken from Gupta et al. (1) and implemented in PyTorch. Since the original network was built to generate sequences coding for antimicrobial peptides (AMPs) of 5-50 amino acids (15-150 base pairs in length), we had to change some parameters: the learning rate, which we set to 0.000001 because larger values failed to converge, and the maximum number of examples, which we set to 2377 (the size of the Staphylococcus epidermidis protein-coding corpus). The model is a Wasserstein GAN, a formulation that has the added benefit of stabilizing training. It consists of a generator and a discriminator, each containing five residual blocks of two 1D convolutions, with a gumbel_softmax at the end. We chose to train the model on the protein-coding sequences of Staphylococcus epidermidis because this microbe affected nearly 100 tumor-suppressing pathways. The model was trained for 80 epochs at a sequence length of 156 with a batch size of 64. Although the model was able to produce gene sequences, they unfortunately did not match the original sequences. Improving this model is a definite future direction that we would like to pursue.
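To make the role of the gumbel_softmax step more concrete, here is a minimal sketch of Gumbel-softmax sampling in plain Python. It is not the project's PyTorch code. The four-letter `ALPHABET`, the temperature `tau`, and the randomly generated logits are assumptions for illustration; the real model's vocabulary may also include padding or other symbols. The idea is that Gumbel noise is added to the per-position logits and a temperature-scaled softmax turns them into a near-one-hot distribution over characters.

```python
import math
import random

ALPHABET = "ACGT"  # hypothetical vocabulary; the real model's may differ
SEQ_LEN = 156      # sequence length reported for training


def sample_gumbel(rng, eps=1e-20):
    """Draw one Gumbel(0, 1) sample via the inverse-CDF transform."""
    u = rng.random()
    return -math.log(-math.log(u + eps) + eps)


def gumbel_softmax(logits, tau, rng):
    """Perturb logits with Gumbel noise, then apply a softmax at temperature tau."""
    perturbed = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    peak = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def decode(position_logits, tau, rng):
    """Pick the most probable character at each position of the sequence."""
    chars = []
    for logits in position_logits:
        probs = gumbel_softmax(logits, tau, rng)
        chars.append(ALPHABET[probs.index(max(probs))])
    return "".join(chars)


rng = random.Random(0)
# Stand-in for generator output: one logit per character per position.
position_logits = [[rng.gauss(0.0, 1.0) for _ in ALPHABET]
                   for _ in range(SEQ_LEN)]

example_probs = gumbel_softmax(position_logits[0], tau=0.5, rng=rng)
sequence = decode(position_logits, tau=0.5, rng=rng)
print(sequence[:30], "...")
```

Lower values of `tau` push each position's distribution closer to a one-hot vector, while higher values keep it smoother. In the PyTorch setting this relaxation is what lets gradients flow through an otherwise discrete choice of character.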
Code for this model can be found here.