Our project can be summarized in three sections: Database, Phylogenetics, and Antibodies. The flowchart below shows how the components fit together.
The Viralizer database stores point mutation data across the spike protein from the databases. Every amino acid position on the spike protein is mutated to every other amino acid, which gives a total of around 20,000 spike protein models. This is a very large computational problem, so the pipeline was heavily optimized to complete the computations in a reasonable amount of time. The modeling process is briefly described below.
For the spike protein mutation modeling, we used PyRosetta as the base modeling tool and made several simplifying assumptions. First, we used the closed (unbound) state of the spike protein as the base structure, because it is more compact and requires less computational time. Second, because the spike protein is a trimer whose three subunits share the same sequence, we fold only one subunit at a time and duplicate it three times. Refolding the whole protein for every mutation would be impractical, so we combine loop modeling with folding restricted to the region around the mutation. Loop modeling is done by cutting out a 7-10 amino acid region around the mutation and folding it from one end to the other. For the regional folding, we fold the amino acids within about 10 Å of the mutation site. Combining the two methods gives decent accuracy at a low folding time: folding all 1273 amino acid sites took about 2 days.
We also compared our results with SWISS-MODEL, an online protein homology modeling service that represents the current state of the art. Our model is significantly more compact and has a lower REU (Rosetta energy unit) score than the SWISS-MODEL results, which can be observed in the image below. A heat map was also generated from the folding data to show how stable the structure is after each mutation. This gives insight into which mutations deserve further research, as they have the potential to induce large structural changes.
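The scan and the two region selections described above can be sketched in plain Python. This is only an illustration under stated assumptions, not our PyRosetta pipeline: the function names, the `flank` parameter, and the choice of C-alpha atoms as the distance reference are all hypothetical. It also makes the size of the scan concrete: 1273 positions times 19 alternative residues is 24,187 single substitutions, the same order of magnitude as the figure quoted above.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def point_mutants(sequence):
    """Yield (position, wild_type, mutant) for every single substitution.

    Positions are 1-based, matching the usual spike numbering.
    """
    for pos, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield pos, wild_type, mutant


def loop_window(site, seq_len, flank=4):
    """Return the 1-based inclusive (start, end) of a loop centred on `site`.

    flank=4 gives a 9-residue loop, inside the 7-10 residue range; the
    window is clipped at the chain termini. `flank` is an assumed parameter.
    """
    return max(1, site - flank), min(seq_len, site + flank)


def residues_within(ca_coords, site, radius=10.0):
    """Return 1-based residue numbers within `radius` angstroms of `site`.

    `ca_coords` holds one (x, y, z) tuple per residue, in chain order.
    Measuring between C-alpha atoms is an assumption of this sketch.
    """
    centre = ca_coords[site - 1]
    return [
        i
        for i, xyz in enumerate(ca_coords, start=1)
        if math.dist(centre, xyz) <= radius
    ]


if __name__ == "__main__":
    toy_sequence = "MFVFLVLLPLVS"  # short toy fragment, not the full spike
    print(sum(1 for _ in point_mutants(toy_sequence)))  # 19 per position
    print(loop_window(614, 1273))
```

In a real run, the residue lists returned by `loop_window` and `residues_within` would decide which parts of the subunit are allowed to move, while the rest of the structure stays fixed.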
The database tab lets users search for the PDB files of the spike proteins, and it will also store the antibody PDB files. The antibody design is introduced later.
Phylogenetics and Models:
Over the course of an epidemic, pathogens naturally accumulate random mutations in their genomes. Since different genomes typically pick up different mutations, these mutations can be used as markers of transmission: closely related genomes indicate closely related infections. The molecular phylogeny of a virus is important for understanding epidemiological parameters such as spatial spread, introduction timings, and epidemic growth rate. We display the model and the phylogenetic tree of the virus together to visualize its temporal and spatial transmission. The GISAID database contains the human coronavirus (hCoV-19) sequences. The point mutations in the spike protein found in this sequence collection can be analyzed and used to construct a phylogeny via Nextstrain. Finally, we used Snakemake to automate our pathogen build, giving a more streamlined and scalable process that can run the same commands on different datasets.
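As a small illustration of how spike point mutations can be read out of a sequence collection, the sketch below compares aligned amino acid sequences against a reference and tallies the substitutions. It is not how Nextstrain works internally; the function names are hypothetical, and the input is assumed to be already aligned and translated.

```python
from collections import Counter


def call_substitutions(reference, sample, gap="-", unknown="X"):
    """List amino acid substitutions in `sample` relative to `reference`.

    Both sequences must already be aligned to the same length. Positions
    are 1-based and counted on the reference, so alignment gaps in the
    reference do not shift the numbering. Gaps and ambiguous residues are
    skipped. Labels look like D614G: wild type, position, mutant.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    subs = []
    pos = 0
    for ref_aa, alt_aa in zip(reference, sample):
        if ref_aa != gap:
            pos += 1
        if ref_aa == alt_aa or gap in (ref_aa, alt_aa) or alt_aa == unknown:
            continue
        subs.append(f"{ref_aa}{pos}{alt_aa}")
    return subs


def tally_substitutions(reference, samples):
    """Count how many samples carry each substitution.

    `samples` maps a sample name to its aligned sequence.
    """
    counts = Counter()
    for sequence in samples.values():
        counts.update(call_substitutions(reference, sequence))
    return counts


if __name__ == "__main__":
    reference = "MFVD"  # toy reference, not the real spike sequence
    samples = {"a": "MFVG", "b": "MLVG", "c": "MF-D"}
    for label, n in tally_substitutions(reference, samples).most_common():
        print(label, n)
```

A tally like this shows which substitutions recur across the collection and are therefore worth looking up in the mutation model database.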
For the antibody design, we initially built a Gaussian process regression model, hoping it would generate several predicted iterations of sequences based on the binding energies of the antibody to various spike protein mutants. This did not work, because the model could not generate properly mutated sequences from numerical binding energy values. After several discussions with experts, we came across the concept of the genetic algorithm, a computational simulation of natural selection.
We then designed a simple genetic algorithm that applies random mutations to several positions on the antibody sequence. The algorithm was run on both the heavy and light chain sequences, generating several newly mutated sequences. PDB files were then generated for these sequences and tested in PyRosetta for binding with the spike proteins. Once the REU values are obtained from Rosetta, the dominant sequences that can bind to the spike protein are kept in the population. A flowchart that briefly describes the process is attached below. To reduce the time spent on folding, we applied several optimizations, which are described below:
Any sequence with more than 4 mutations is killed, as too many mutations greatly reduce the quality of the protein model.
Only key mutations are kept.
Distances between mutations are kept as large as possible to reduce interference between mutations.
Before a protein is folded, a stability test is run to determine at what quality the protein should be folded.
Higher-scoring sequences make up the majority of the population, so we do not have to repeat duplicate mutations.
Mutation scans were also run on the antibody and spike protein files to show which amino acid sites are worth researching and which spike protein mutations will cause problems. The heat maps are analyzed in the results section. Finally, 3 light chain and heavy chain sequences with the best scores for binding to the different variants in the spike protein mutation scan were picked out of the population. We uploaded the mas coding parts and composite parts for future iGEM teams to test as a vaccine neutralizing agent against COVID. Our antibody is also used by the NEGEM team in their project design.
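The mutate, filter, score, and select loop of the genetic algorithm can be sketched as follows. This is a minimal illustration, not our actual code: `score_fn` is a hypothetical stand-in for the PyRosetta binding score (lower REU is treated as better), and `pop_size`, `keep`, and `generations` are assumed parameters. Only the limit of 4 mutations is taken from the description above; the spacing between mutations and the stability pre-test are not modeled here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MAX_MUTATIONS = 4  # sequences with more than 4 mutations are discarded


def mutation_sites(parent, child):
    """Return the 0-based positions where `child` differs from `parent`."""
    return [i for i, (a, b) in enumerate(zip(parent, child)) if a != b]


def mutate(seq, rng):
    """Return a copy of `seq` with one random substitution."""
    pos = rng.randrange(len(seq))
    choices = [aa for aa in AMINO_ACIDS if aa != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]


def evolve(parent, score_fn, generations=10, pop_size=20, keep=5, seed=0):
    """Minimal genetic algorithm: mutate, filter, score, keep the best.

    `score_fn(seq)` is a hypothetical stand-in for the binding score;
    lower values are treated as better binders.
    """
    rng = random.Random(seed)
    population = [parent]
    for _ in range(generations):
        # Survivors carry over unchanged; the set drops duplicate sequences
        # so the same mutant is never scored twice in one generation.
        children = set(population)
        attempts = 0
        while len(children) < pop_size and attempts < 50 * pop_size:
            attempts += 1
            child = mutate(rng.choice(population), rng)
            if len(mutation_sites(parent, child)) <= MAX_MUTATIONS:
                children.add(child)
        # Sort by score, breaking ties on the sequence for reproducibility.
        population = sorted(children, key=lambda s: (score_fn(s), s))[:keep]
    return population


if __name__ == "__main__":
    toy_parent = "ACDEFGHIKLMNPQRS"  # toy sequence, not a real antibody chain

    def toy_score(seq):
        return -seq.count("W")  # toy objective: reward tryptophans

    for seq in evolve(toy_parent, toy_score):
        print(seq, toy_score(seq))
```

Because survivors are carried into the next generation unchanged, the best score never gets worse from one generation to the next. In practice `score_fn` would wrap the expensive folding and docking step, which is why avoiding duplicate evaluations matters.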