Our protein folding algorithms and our spike protein mutation data yield a good deal of valuable information. First, we can see where the mutation hotspots are: on the right of the image is a PDB model showing the frequency of mutation along the spike protein, colored red (<10), green (<100), blue (<1000), or yellow (<10000) according to the number of mutations found at each position across the 115’000 spike protein sequences we have. In the lower middle is part of the stability heat map that we generated with PyRosetta during the mutation process. Together these give valuable insight into which mutations on the spike protein should be watched. All the heatmaps and PDB files can be seen on our external website.
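The color binning described above can be sketched as follows. This is a minimal illustration, not our actual pipeline: it assumes the spike sequences are already aligned to a reference of the same length, and the function names `mutation_counts` and `frequency_color` are hypothetical.

```python
from collections import Counter


def mutation_counts(reference, sequences):
    """Count, per 1-based position, how many sequences differ from the reference.

    Assumes every sequence is pre-aligned to the reference (same length, no gaps).
    """
    counts = Counter()
    for seq in sequences:
        for pos, (ref_aa, aa) in enumerate(zip(reference, seq), start=1):
            if aa != ref_aa:
                counts[pos] += 1
    return counts


def frequency_color(count):
    """Map a per-position mutation count to the color bins quoted in the text."""
    if count < 10:
        return "red"
    if count < 100:
        return "green"
    if count < 1000:
        return "blue"
    if count < 10000:
        return "yellow"
    return None  # the text does not define a color for 10000 or more


# Toy example with a 4-residue "reference" (not real spike data).
counts = mutation_counts("MFVF", ["MFVF", "MLVF", "MLVA"])
colors = {pos: frequency_color(n) for pos, n in counts.items()}
```

The resulting position-to-color mapping is what would then be applied to the residues of the PDB model in a molecular viewer.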
Below is a list of sites where mutation changes the spike protein stability drastically: G35, V62, G89, G103, G107, V193, A222, G283, I285, A288, T299, C336, A363, G381, V401, I402, G341, S443, P507, G526, G468, A672, V781, G857, S884, I909, S1003, A1025, C1043, G1059, A1080, A1087
We have also generated a heat map of the occurrences of different mutations in the 115’000 spike protein sequences we have; the link to the heatmap is given here. Many interesting conclusions could be drawn from it, but we were too limited in time to extract more of them, so we leave that to future iGEM teams. https://drive.google.com/file/d/1dg8NG7_URXwwBbQBeIfvSVVOTTiTUQTc/view?usp=sharing
One successful result was the predictive point mutation scan performed on the original antibody sequence. The goal of this scan was to identify the parts of the antibody sequence that, if mutated, would produce the largest differences in structure and binding energy. Our hope was that the scan would let us selectively mutate the antibody sequence to give it high binding energies, so that it could bind to most mutated spike protein sequences.
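The idea of a point mutation scan can be sketched in a few lines. This is an illustrative outline only: the `score` argument is a hypothetical stand-in for whatever energy evaluation is used (in our case a PyRosetta calculation), and the toy scoring function at the end exists purely to make the example runnable.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def point_mutants(sequence):
    """Yield (position, original, substitute, mutant) for every single substitution."""
    for i, original in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield i + 1, original, aa, sequence[:i] + aa + sequence[i + 1:]


def scan(sequence, score):
    """Return {position: largest absolute score change over all substitutions there}.

    `score` is any callable mapping a sequence to a number (hypothetical stand-in
    for an energy calculation).
    """
    baseline = score(sequence)
    impact = {}
    for pos, _, _, mutant in point_mutants(sequence):
        delta = abs(score(mutant) - baseline)
        impact[pos] = max(impact.get(pos, 0.0), delta)
    return impact


# Toy score: only a tryptophan at position 2 matters.
impact = scan("AWG", lambda s: float(s[1] == "W"))
most_sensitive = max(impact, key=impact.get)
```

Positions with the largest impact are the candidates for selective mutation; a sequence of length n gives 19 × n single mutants to evaluate.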
The genetic algorithm that we developed generated several mutated antibody sequences, the majority of them mutated purely at random. Each newly mutated sequence was then tested by binding it to several mutated spike proteins in PyRosetta and evaluating the binding energies. The goal of this test was to determine which antibody sequences can successfully bind to a majority of spike proteins. In the end we achieved this, and we selected three heavy chain sequences and three light chain sequences both to test for binding with mutant spike proteins and to submit to the iGEM Parts Registry.
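A loop of this kind might look like the sketch below. It is a simplified illustration under stated assumptions, not the algorithm we ran: `binding_score` is a hypothetical callable standing in for the PyRosetta evaluation, a lower score is treated as better (as with Rosetta energy units), and the mutation rate, population size and number of survivors are arbitrary example values.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(sequence, rng, rate=0.05):
    """Replace each residue by a random amino acid with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in sequence
    )


def evolve(parent, spikes, binding_score, generations=10, population=20, keep=5, seed=0):
    """Evolve `parent` against several spike variants and return the best sequences.

    Fitness is the mean of `binding_score(antibody, spike)` over all spikes;
    lower is assumed to be better.
    """
    rng = random.Random(seed)

    def fitness(antibody):
        return sum(binding_score(antibody, s) for s in spikes) / len(spikes)

    pool = [parent]
    for _ in range(generations):
        children = [mutate(rng.choice(pool), rng) for _ in range(population)]
        # Keep parents in the candidate set so the best score never gets worse.
        candidates = list(dict.fromkeys(pool + children))
        pool = sorted(candidates, key=fitness)[:keep]
    return pool


# Toy score: Hamming distance to the "spike" string (purely illustrative).
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


best = evolve("AAAAAAAA", ["CCCCCCCC"], hamming)
```

Averaging the score over several spike variants is what pushes the search toward antibodies that bind a majority of mutants rather than a single one.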
Below is the scoring across generations for the light chain and heavy chain genetic algorithms. It can be seen that the heavy chain and the light chain score differently, as the heavy chain is closer to the spike protein. After the genetic algorithm, a mutation scan is conducted on the spike protein to analyze the ability of the antibody to cope with mutations in the spike protein.
The resulting heavy chains and light chains were submitted as iGEM parts and are described below:
Here are the links to the heatmaps for the antibody mutation scan, the spike protein mutation scan, and an extra ACE2 mutation scan. We conducted the ACE2 scan so that future iGEM teams can use it to design a neutralizing agent from the ACE2 receptor, which could potentially be even better than the antibody. As there are labs testing ACE2 mutations, it could be validated in the future. https://drive.google.com/file/d/1JhHCYvvnuQ441kRuSGqf0Sr_0E6IaLkJ/view?usp=sharing
https://drive.google.com/file/d/1VCt3bwdLnjIvOy8VB01WY42e6-6g4l83/view?usp=sharing
All the mutated antibody sequences generated by the algorithm are available to view on the VIRALIZER webpage.
Phylogenetics of Spike protein mutation:
Nextstrain includes augur, a bioinformatics toolkit that reconstructs the phylogeny via maximum-likelihood phylodynamic analysis (Sagulenko et al., 2018). The phylogeny of the virus was built from 558 sequences from the GISAID database. Our GISAID sequence selection consists of all sequences with transmission dates in October (the most recent strains) and all sequences that have at least 4 transmission occurrences in the GISAID database (the most frequent strains).
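The selection rule can be sketched as a simple filter. This is only an illustration: the record layout (dictionaries with `strain`, `date` and `sequence` keys) is hypothetical, and "at least 4 occurrences" is interpreted here as the same sequence appearing at least four times in the dataset, which is an assumption about how the count was made.

```python
from collections import Counter
from datetime import date


def select_sequences(records, year, month, min_occurrences=4):
    """Keep records that are recent (given year and month) or frequent.

    Each record is a dict with hypothetical keys "strain", "date" (datetime.date)
    and "sequence". A sequence counts as frequent if it occurs at least
    `min_occurrences` times among all records.
    """
    counts = Counter(r["sequence"] for r in records)
    selected = []
    for r in records:
        recent = r["date"].year == year and r["date"].month == month
        frequent = counts[r["sequence"]] >= min_occurrences
        if recent or frequent:
            selected.append(r)
    return selected


# Toy records (not real GISAID entries).
records = [{"strain": "oct1", "date": date(2020, 10, 3), "sequence": "MFVF"}]
records += [
    {"strain": f"freq{i}", "date": date(2020, 3, 1), "sequence": "MLVF"}
    for i in range(4)
]
records.append({"strain": "rare", "date": date(2020, 3, 2), "sequence": "MAVF"})

chosen = [r["strain"] for r in select_sequences(records, year=2020, month=10)]
```

Combining both criteria with a logical "or" keeps the tree focused on the strains that are either currently circulating or widely observed.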
Nextstrain constructs a phylogeny and translates the sequences to amino acids in order to analyze sequence divergence and the overall diversity of both the amino acid and nucleotide sequences. The output JSON file can then be visualized as a tree and a map using auspice. Using these tools, the steps of a build consist of: preparing pathogen sequences and metadata files, aligning sequences, constructing a phylogeny, annotating the phylogeny with dates and traits, and exporting the phylogeny into a visualization-friendly format. Here is a list of exemplary insights and conclusions from the tree, the mutation counts, and the rendered PDB model:
37.6% of the new mutations were located at the genotype at A site codon 222.
49.6% of the new mutations were located at the genotype at ORF1a codon 265.
28.9% of the sampled genomes had a D614G mutation in the spike protein region.
The D614G mutation is inferred to have been introduced on January 10th, 2020, and became prevalent in Europe and North America.
Range in divergence: 6.654e-5 (China, Jan 29, 2020) to 9.660e-4 (Scotland, Aug 4, 2020). This suggests that divergence increases over time.
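A percentage such as the share of genomes carrying D614G can be computed from translated sequences as sketched below. This is a hedged illustration rather than how Nextstrain reports the figure: it assumes the spike protein sequences are aligned so that position numbers match the reference, and `fraction_with_mutation` is a hypothetical helper name.

```python
def fraction_with_mutation(sequences, position, mutant_aa):
    """Return the fraction of sequences carrying `mutant_aa` at 1-based `position`.

    Assumes the sequences are aligned so that `position` refers to the same
    residue in each of them; sequences shorter than `position` count as misses.
    """
    if not sequences:
        return 0.0
    hits = sum(
        1 for s in sequences if len(s) >= position and s[position - 1] == mutant_aa
    )
    return hits / len(sequences)


# Toy example: a 5-residue stand-in where position 3 plays the role of site 614.
toy = ["AADAA", "AAGAA", "AAGAA", "AADAA"]
share = fraction_with_mutation(toy, position=3, mutant_aa="G")
```

For real data the call would use position 614 and amino acid "G" on the full-length aligned spike sequences.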