Poster: UIUC_Illinois

VIRALIZER – Modeling COVID-19 mutations and binding energies to design potential antibodies

UIUC iGEM Team 2020

Mary Cook, Sachin Jajoo, Yan Luo, Suva Narayan, Royal Shrestha, Angela Yoon

Team Mentors

Anna Fedders, Carl Schultz, Christopher Rao, Daniel Ryerson, Matthew Waugh, William Woodruff

Abstract

COVID-19 is caused by the SARS-CoV-2 virus, a highly mutative virus for which developing effective antibodies is difficult. Thousands of spike protein mutations have been detected, but fewer than 1% of them have solved crystal structures. To address this shortage, we created VIRALIZER: an interactive online database that contains over 26,000 mutated spike protein sequences and their corresponding structures, developed using homology modeling in PyRosetta. This allows the user to analyze the effects of spike protein mutations on the functionality of the protein and its binding with potential antibodies. The spike protein database is paired with a phylogenetic tree that characterizes the propagation of the virus over space and time. We also developed a genetic algorithm that uses spike protein structures and their binding energies to design hundreds of potential antibodies that bind to the spike protein and reduce its binding affinity to the ACE2 receptor on cells.

Project Goals

1) Create a database of spike protein sequences

2) Characterize the propagation of the virus

3) Allow researchers to use our tool to visualize novel viral protein models to analyze structural effects of mutations

See the structural effects of mutations
Analyze how these mutations will affect antibody binding

4) Develop an antibody/neutralizing agent via machine learning algorithms

Motivation

The pandemic has changed the way most of the world lives, and our team members felt that we should contribute to the fight against the coronavirus. Since we didn’t have access to a wet lab, we resolved to help researchers from outside the lab. We saw that software modeling tools could model the thousands of mutations that exist, and decided to use homology modeling to predict their respective structures. We also wanted to contribute directly to the fight by designing our own antibodies that can neutralize the spike protein of the virus.

Problem

COVID-19 is the disease caused by the SARS-CoV-2 virus and is responsible for the 2020 coronavirus pandemic. As of November 8, 2020, the pandemic has resulted in over 50 million cases and over 1.25 million deaths worldwide. The virus infects host cells through their ACE2 receptor: binding of the spike protein to ACE2 is the essential step that allows SARS-CoV-2 to enter the host cell. Despite access to modern technology, effective treatments and vaccines have been hard to develop because the virus is highly mutative, and mutations in the spike protein are especially important. Because of the high mutation rate, researchers have been unable to find a reliable treatment that can neutralize the various mutated spike proteins that exist.

New mutations of the virus pop up almost every day, though most aren’t very prevalent. Despite the speed with which these new mutations can be sequenced, obtaining the crystal structure of the protein is a different matter as it is expensive and time consuming. Due to this, it’s difficult for researchers to determine how mutations could potentially change the structure of the spike protein. Without understanding what the spike protein will look like, it is impossible for researchers to know whether a potential antibody or vaccine will be effective against the virus.

Idea

Our project, VIRALIZER, aims to tackle many of the issues presented by SARS-CoV-2.

The first component of our project is a database of more than 20,000 mutated spike protein sequences as well as their corresponding protein models. Using this data, we designed antibodies that can bind to the majority of spike protein sequences we analyzed.

The database also contributed to the development of a phylogenetic tree that characterizes the propagation of the virus over the course of the pandemic. The tree allows researchers to track particular strains as well as sort the mutations by location and date.

Engineering Success

This project was done entirely remotely and online, as we had no access to a wet lab to test our predictions. As a result, we advanced our project entirely in simulation-based environments, including PyRosetta, SWISS-MODEL, and FoldX. We used these environments to simulate protein folding, binding, and analysis. More specifically, we compared the results of our PyRosetta folding methods against those produced by SWISS-MODEL and FoldX, to show that our predicted folding procedure and structures are accurate. In the future, we hope to gain access to a wet lab to test whether the mutation structures predicted by our PyRosetta methods do occur.

In general, our tool will allow anyone to create their own antibody for any protein, as long as they have a PDB model of a base protein to mutate from. Our pipeline can also work on many other viruses, such as HIV, influenza, or Ebola. The virus and protein models will save researchers time and effort when picking potential antibodies to work on. The antibodies can be used in their projects as a therapeutic agent, such as a neutralizing agent or an mRNA vaccine. Already this year, two teams used our antibody sequence: Technion and Harvard NEGEM. We have also submitted 6 coding parts and 3 composite parts. Each part is folded and characterized in depth using simulation and folding data. The parts are ready for future teams to test in their research against the coronavirus.

Database

The VIRALIZER database contains 20,000 mutated spike protein sequences along with their corresponding modeled protein structures. PyRosetta was used for modeling the spike protein mutations. To do this, we employed loop modeling and folding. Loop modeling was achieved by cutting a 7-10 amino acid region around the mutation and then folding from one end to the other. Folding was restricted to the region of the mutation, where the amino acids surrounding the mutation site were folded. We also generated heat maps from the folding data to show how stable the structure is after each mutation, which is useful in determining which mutations could have significant effects on the structure of the spike protein.
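As a rough illustration of the "region around the mutation" idea, the sketch below picks a residue window centered on a mutation site and clamps it to the ends of the chain. The function name `loop_window`, the default width of 9, and the 1-based inclusive convention are assumptions made for this example; the actual window selection in our PyRosetta scripts may differ.

```python
def loop_window(mutation_pos, seq_len, width=9):
    """Return a 1-based inclusive (start, end) residue window around a mutation.

    Hypothetical helper: `width` is the number of residues to remodel
    (the text mentions 7-10); the window is shifted inward when the
    mutation sits near either end of the chain.
    """
    if not 1 <= mutation_pos <= seq_len:
        raise ValueError("mutation_pos must lie inside the sequence")
    width = min(width, seq_len)
    start = mutation_pos - width // 2
    end = start + width - 1
    if start < 1:
        # Too close to the N-terminus: slide the window right.
        start, end = 1, width
    if end > seq_len:
        # Too close to the C-terminus: slide the window left.
        start, end = seq_len - width + 1, seq_len
    return start, end


# Example: a 9-residue window around position 614 of a 1273-residue chain.
print(loop_window(614, 1273))
```

The returned pair could then be handed to whatever loop-modeling routine is used to refold only that segment.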

Phylogenetics

Over the course of an epidemic, pathogens inevitably accumulate random mutations in their genomes. Since different genomes typically pick up different mutations, these mutations can be used as markers of transmission: closely related genomes indicate closely related infections. The molecular phylogeny of viruses is important for understanding epidemiological parameters such as spatial spread, introduction timings, and epidemic growth rate. We created a phylogenetic tree using sequences obtained from the GISAID database, which contains human SARS-CoV-2 sequences. The phylogeny was created using Nextstrain, which allowed us to learn more about transmission over space and time. Snakemake was used to automate our pathogen build, which will allow us to run the same commands on different datasets.
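The idea that closely related genomes share most of their mutations can be illustrated with a simple pairwise distance count. This is only a toy sketch: it assumes the sequences are already aligned to equal length, the sample names and sequences are made up, and it is not the method Nextstrain uses to build a tree.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(seqs):
    """Map each pair of sample names to the number of differing sites."""
    return {
        (i, j): hamming(seqs[i], seqs[j])
        for i, j in combinations(sorted(seqs), 2)
    }


# Made-up aligned fragments; smaller distances suggest closer relatives.
samples = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGTACGA"}
for pair, dist in pairwise_distances(samples).items():
    print(pair, dist)
```

Here sample B sits one mutation away from both A and C, which is the kind of signal a phylogeny turns into a chain of transmission.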

Antibody Design

This is the most ambitious aspect of our project. We began with an already-researched antibody sequence that binds to the spike protein, the S309 neutralizing agent (PDB ID 6WPT). We designed a genetic algorithm whose random mutation step is applied to several positions on the antibody sequence. The algorithm was run on both the heavy and light chain sequences, generating several newly mutated sequences. PDB files were then generated for these sequences and tested in PyRosetta for binding with the spike proteins. After the REU (Rosetta Energy Unit) values are obtained from PyRosetta, the dominant sequences that can bind to the spike protein are kept in the population. A flow chart is also attached, which briefly describes the process.
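The mutate, score, and select loop described here can be sketched in a few lines. This is a minimal sketch under stated assumptions, not our production code: `score_fn` stands in for the PyRosetta step that builds a structure and returns an REU value (lower is treated as better), `toy_score` is a made-up placeholder so the example runs without PyRosetta, and the population sizes and the example sequence are arbitrary.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, positions, rng):
    """Substitute one residue at a randomly chosen allowed position (0-based)."""
    pos = rng.choice(positions)
    new_aa = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]


def evolve(parent, positions, score_fn, generations=10, pop_size=20, keep=5, seed=0):
    """Keep the `keep` best-scoring sequences after each round of mutation."""
    rng = random.Random(seed)
    population = [parent]
    for _ in range(generations):
        children = [
            mutate(rng.choice(population), positions, rng) for _ in range(pop_size)
        ]
        # Survivors compete with their children, so the best score never worsens.
        candidates = set(population) | set(children)
        population = sorted(candidates, key=lambda s: (score_fn(s), s))[:keep]
    return population


def toy_score(seq):
    """Placeholder for a binding score; NOT a real energy function."""
    return float(sum(aa != "W" for aa in seq))


survivors = evolve("QVQLVQSGAE", positions=[2, 3, 4], score_fn=toy_score)
print(survivors[0], toy_score(survivors[0]))
```

In a real run, `score_fn` would wrap the structure generation and binding evaluation, which is by far the most expensive step, so caching scores per sequence would be worthwhile.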

Mutation scans were also conducted with the antibody and spike protein files to give insight into which amino acid sites merit further research and which spike protein mutations will cause problems. The heatmaps generated in this step are analyzed in the results section. Finally, 3 light chain and heavy chain sequences with the best binding scores against different spike protein mutants were picked out of the population. We uploaded them as coding parts and composite parts for future iGEM teams to test as a vaccine or neutralizing agent against COVID-19. Our antibody is also used by the NEGEM team in their project design.

Results

Protein Folding:

We were able to collect valuable information from our protein folding algorithms and our spike protein mutation data. First, we can see the mutation hotspots: on the right of the image is a PDB model that shows the frequency of mutation at each site on the spike protein, colored red (<10), green (<100), blue (<1,000), or yellow (<10,000) by the number of mutations across all 115,000 spike protein sequences we have. On the lower middle is part of the stability heat map that we generated using PyRosetta during the mutation process. These maps offer insight into which mutations on the spike protein should be monitored. All the heatmaps and PDB files can be seen on our external website.
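The color scheme above amounts to binning each site's mutation count by order of magnitude. A minimal sketch follows; the function names are hypothetical, and the label for counts of 10,000 or more is an assumption, since the legend does not cover that range.

```python
from collections import Counter

# Upper bounds (exclusive) and colors taken from the legend in the text.
BINS = [(10, "red"), (100, "green"), (1000, "blue"), (10000, "yellow")]


def color_for_count(count):
    """Return the legend color for a site's mutation count."""
    for upper, color in BINS:
        if count < upper:
            return color
    return "unbinned"  # assumption: the legend gives no color for >= 10,000


def site_colors(mutated_positions):
    """Count how often each position is mutated and assign it a color."""
    counts = Counter(mutated_positions)
    return {pos: color_for_count(n) for pos, n in counts.items()}


# Made-up observations: position 614 mutated 150 times, position 5 three times.
print(site_colors([614] * 150 + [5] * 3))
```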

Below is a list of sites that change the spike protein stability drastically: G35, V62, G89, G103, G107, V193, A222, G283, I285, A288, T299, C336, A363, G381, V401, I402, G341, S443, P507, G526, G468, A672, V781, G857, S884, I909, S1003, A1025, C1043, G1059, A1080, A1087

Antibody Design:

One successful result was the predictive point mutation scan performed on the original antibody sequence. The goal of this mutation scan was to identify the parts of the antibody sequence that, if mutated, would result in the largest differences in structure and binding energy.
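A point mutation scan of this kind can be sketched as an exhaustive loop over every single-residue substitution. The code below is illustrative only: `score_fn` is a hypothetical stand-in for the structure and binding-energy evaluation, and the four-residue sequence and toy scoring rule are made up so the example runs on its own.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def point_mutation_scan(seq, score_fn):
    """Score every single substitution relative to the unmutated sequence.

    Returns {(position, wild_type, mutant): score change}, with 1-based positions.
    """
    base = score_fn(seq)
    deltas = {}
    for i, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue
            mutant = seq[:i] + aa + seq[i + 1:]
            deltas[(i + 1, wt, aa)] = score_fn(mutant) - base
    return deltas


def most_sensitive_sites(deltas, top=3):
    """Rank positions by the largest absolute score change any mutation causes."""
    worst = {}
    for (pos, _wt, _aa), delta in deltas.items():
        worst[pos] = max(worst.get(pos, 0.0), abs(delta))
    return sorted(worst, key=lambda p: (-worst[p], p))[:top]


# Toy scoring rule: only the cysteine at position 2 contributes to the score.
toy = lambda s: -3.0 * (s[1] == "C")
scan = point_mutation_scan("ACDK", toy)
print(most_sensitive_sites(scan, top=1))
```

The per-position, per-residue changes in `scan` are the values that a heat map of the scan would display.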

Below are the scoring plots for the different generations of light chain and heavy chain genetic algorithms, based on their Rosetta Energy Units. There are differences in the scoring of the heavy chain and the light chain as the heavy chain is closer to the spike protein.

Above is the QR code for the Viralizer database.

References and Acknowledgements

Project Support and Advice: Weekly meetings with mentors

Funding: Center for Advanced Bioenergy and Bioproducts Innovation (CABBI) and The Carl R. Woese Institute for Genomic Biology (IGB)

Presentation Coaching: Anna Fedders, Christopher Rao, Carl Schultz, William Woodruff

Faculty at UIUC who aided in the development of the project:

Brenda Wilson: Professor of Microbiology

Think of using the software as a proof of concept using other data
Introduced us to GISAID database, which we used to build our database

Diwakar Shukla: Blue Waters Assistant Professor

Modify the antibody structure to build different antiviral agents that bind to as many mutated spike proteins as possible
Introduce point mutations of spike proteins and the corresponding changes in spike protein structures
Utilize machine learning concepts like the genetic algorithm for point mutations

Erik Procko: Professor of Biophysics and Quantitative Biology

End-user implementation: how current researchers of COVID-19 are already looking into the mutant diversity of the virus
Possibility of noise/sequencing errors in mutation research; look into how this could affect the construction of a sequence database for spike proteins

Huimin Zhao: Steven L. Miller Chair Professor of Chemical and Biomolecular Engineering

End-user possibilities: discussed how the implementation of a new sequence database could help current COVID researchers design a new vaccine/treatment a lot faster
Various parameters could affect antibody design and vaccine development, such as specific mutations and multiple unknowns/assumptions that have to be made

Mohammed El-Kebir: Assistant Professor of Computer Science

Nonsynonymous mutation visualization can be performed using PyRosetta and a phylogenetic tree
Generate multiple sequence alignments using Nextstrain, since it’s important to index the sequence

Nicholas Wu: Assistant Professor at the School of Molecular and Cellular Biology

Discussed general research with antibody design and how effective a new antibody would be against a rapidly mutating pathogen (i.e. the coronavirus)
Evaluate what parts of the spike protein should be targeted for an effective treatment to COVID-19
