Machine Learning


When we first began work on machine learning, we did a deep dive into literature to find which techniques are commonly used and how we should approach the problem. We found that using machine learning for antibody optimization was a fairly new field that had grown in publications recently in light of COVID-19. In the early phases, we did a broad overview of the type of databases available. Our logic was that the type of model we would use would mainly be determined by the amount and type of data that we were able to access. After reading “Potential Neutralizing Antibodies Discovered for Novel Corona Virus Using Machine Learning”, we found that the research group put together a variety of viral antibody databases into a singular database they termed VirusNet (Magar et. al, 2020). We reached out several times, but we did not receive a response. However, we still gained a lot of inspiration from their paper, specifically their use of graph theory to represent each molecule in the antibody. We decided to further explore the broad amount of pathways we could have taken. Using “A Review of Deep Learning Methods for Antibodies” as inspiration, we combed through about 15 different databases/corresponding ML methods (Graves, et. al, 2020). These included databases such as CATNAP, SKEMPI, AB-Bind, and SAbDab. Given the scope of our project and our projected timeline, we decided to focus on databases that mainly had SARS-CoV-2 antibodies or coronavirus-related antibodies. We felt we would not be able to capture the complexities of antibody optimization for SARS-CoV-2 using HIV antibodies, for example. Therefore, we decided to settle on CoV-AbDab, a coronavirus antibody database by the Oxford Protein Informatics Group. The database keeps track of every antibody published/patented in literature that neutralizes coronaviruses, including SARS-CoV-1, SARS-CoV-2, and MERS-CoV. The database has approximately 1,300 unique antibodies, where it highlights the complementarity determining region of the third heavy chain (CDRH3).

Image of the Oxford CoV-AbDab Database logo. Can be reached here.

Selection of Region of Interest

Literature has shown that the CDRH3 region is the most important for neutralizing SARS-CoV-2, since it attacks the ACE2 receptor binding domain on the SARS-CoV-2 spike protein (Barnes, et. al, 2020). Antibodies are unique such that their variable regions create their specificity for attacking specific antigens. Therefore, we focused on model specifically on the CDRH3 region to hopefully improve performance since there were less factors at play and to reduce runtime.

Cartoon of SARS-CoV-2 and an antibody (Y-shaped) interacting near the spike protein of the virus and variable domains of the antibody.


The next step was to determine what we wanted as our model parameters. Originally, we wanted one of our parameters to be quantitative binding assessments for each antibody. However, we quickly found that binding affinities across the papers featured in CoV-AbDab were inconsistent in terms of values and units due to a wide array of experimental techniques and conditions. Therefore, we found it would make the model extremely biased to include these calculations.

We next tried to implement a pipeline to computationally calculate binding affinities for each antibody in the database using molecular dynamic software. During our literature review, we found that many papers had used Rosetta. Later on, we also found out through our collaboration that UIUC was also using Rosetta to help determine binding affinities for their antibody sequences. After researching, however, we decided to use FoldX. We reached out to the FoldX team, and they were gracious enough to let us use their software free of charge. After learning the basics on how to operate the software, we began trying to implement some of the sequences in the paper. We quickly ran into several issues, however, and realized that for FoldX to perform properly, the input must be a PDB file of the antibody-antigen complex. That means the PDB files must also include SARS-CoV-2 and how the antibody interacts with the RBD of the virus. After combing through the CoV-AbDab database, we found that only approximately 100 of the antibodies had PDB files with antibody-antigen binding domains. Therefore, we decided to shift gears.

An example of our team’s antibody folding visualization. We used FoldX and the YASARA molecular graphics software to initially analyze the structures of antibody sequences we found in the literature.

The CoV-AbDab database has a qualitative assessment of the binding and neutralization strength of each antibody. These labels were designated by the Oxford Protein Informatics Group, and so we felt confident that their assessment was well educated. Obviously, there is definitely bias in any qualitative assessment, but given time constraints and resources, we decided that this was the best course of action. Therefore, we converted each qualitative assessment into an ordinal scale to describe binding affinity.

However, there’s obviously a lot more going on for antibody binding than just its relative binding strength and its sequence. Antibodies are made up of a diverse group of amino acid, each with their own individual properties that affect secondary and tertiary structures and neutralization potentials. Originally, we tried to encapsulate this information using RDKit (Duvenaud et. al, 2015) and MoleculeNet (Wu, et. al, 2018). These packages are cheminformatic packages that perform machine learning methods on chemical structures by converting each atom to a node with bonds as edges, each with distinct properties. We tried implementing these packages, but found there was a steep learning curve, and found that the model was too finely grained for what we wanted to achieve. We needed a package that would allow us to describe more macromolecular properties in the antibody.

After much research, the team eventually stumbled upon Interpol. This is an R package that processes protein amino acid sequences for machine learning purposes that converts each amino acid into a numerical value (Heider & Hoffman, 2017). These values are determined by AAIndex, which is a database of 533 properties for each amino acid in matrix form. Additionally, Interpol has a powerful feature: it allows users to normalize a particular protein amino acid sequence to a desired length, using mathematical interpolation methods based on the AAIndex numerical values. Therefore, this massively simplifies the machine learning task of reading in sequence data, since we can first normalize each CDRH3 region to a particular length. This would be coupled with a one-hot continuous encoding of each amino acid residue position in the sequence.

Therefore, we plotted a histogram of the CDRH3 lengths in the database, and found that the mode length was about 15 amino acids long. We selected 15 as our length throughout the model as a result of this, since it’s relatively long since it contains a lot of sequence information, but is still relatively short when compared to other existing CDRH3 regions. We wanted a shorter region simply because that would mean a shorter mRNA sequence, making it easier for an mRNA sequence to fit into a DNA origami nanostructure. As such, we utilized Interpol and normalized all the CDRH3 regions to a length of 15. We ran this pipeline for all 533 AAIndex properties.

Histogram of the distribution of CDRH3 lengths in the CoV-AbDab database. The mode is a length of 15, and it seems relatively normally distributed.

Issues, Resolutions, & Pipeline

With our data, we knew we needed to use this data to construct a model to produce an optimized antibody. However, we needed a way to generate new antibodies, and a way to assess this new antibody. Therefore, we knew that we needed a stacked ensemble model. After assessing our data, we decided that an ensemble model such as a Random Forest Regressor would work best for encapsulating the intricacy of the AAIndex. A Random Forest algorithm uses an ensemble of decision trees, which are essentially a series of if statements, in order to make decisions about how the numerical should be regressed. However, a single decision tree can be biased by the number of layers and the starting conditions. Therefore, a Random Forest takes the average of thousands of these iterations, thus making the model significantly better at prediction.

Schematic representation of a Random Forest, which is the model we used for our scoring function. As explained above, the RF algorithm generates an ensemble of decision trees, which are then averaged together in some fashion (“majority voting” in this figure) to produce a final prediction scheme. Venkata Jagannath, CC BY-SA 4.0, via Wikimedia Commons

We then implemented a differential evolution algorithm in order to propose new sequences and mutate them to get an optimized sequence. In order to do this, we use the Random Forest Regressor as our scoring function for the differential evolution algorithm. This means that each proposed sequence would have to be converted to its corresponding AAIndices to serve as an input into the scoring model. To run the R package Interpol in Python, we utilized RPy2, which is a package that allows users to run R scripts through Python, and allowed us to get outputs directly back into Python. We began running our model and implementing this method for the RandomForest as well. However, when we tried to run the model using RPy2, the script took more than 20 hours to run! Clearly, we needed to reduce this runtime to make testing our model efficient and feasible.

To do this, we used PCA to reduce the amount of data and its high dimensionality. We also used a Randomized Search CV to optimize our hyperparameters of the Random Forest Regressor. However, the limiting factor was running the R scripts through Python. We had to figure out a way to more efficiently and locally run the Interpol functions in our Python script. Therefore, we had to implement a Python function to basically perform the AAIndex function Interpol, where it converts an amino acid sequence into a vector of AAIndex numerical values. Due to time restrictions, we did not implement the interpolation functionality of Interpol, but that was necessary since all of our generated sequences in the DE algorithm were made to be length 15.

A visualization of the cumulative variance in the data explained by our PCs. The important observation is that most of the variance can be explained by the first few PCs, so by only focusing on those in the RF algorithm we were able to drastically reduce the dimensionality of our data (thereby improving runtime) without sacrificing much quality.

Overall, as discussed in the Motivations section, our design pipeline was governed by four steps:

  1. Training: We give the system input data in the form of antibody sequences that have varying capabilities in binding and neutralizing SARS CoV-2. We also train it on accessory databases that link an antibody’s sequence to its essential biochemical properties. This essentially “teaches” the system about what sorts of antibodies tend to be successful against SARS CoV-2 and which are not.
  2. Validation: We cross-reference the model’s scores with existing lab-validated data to ensure that our system can correctly apply its learned patterns to unknown sequences.
  3. Predictions: We use our validated model to generate predictions about how well a given antibody sequence will perform against SARS CoV-2.
  4. Optimization: Now that we have a way to generate a “score” for each sequence, we optimize a given set of sequences by applying targeted mutations to improve SARS CoV-2 binding and neutralization. We use a differential evolution (DE) approach to accomplish this.

Diagram of the machine learning antibody optimization pipeline.

DNA Origami

The design process for our DNA origami delivery vehicle was driven by two main requirements: (a) the structure should carry out the desired programmable behaviors and (b) the structure should be able to fit several antibody mRNA molecules inside so each vehicle can deliver multiple payloads.

In terms of programmable behaviors, we needed our structure to be able to respond to two different stimuli:

  1. Bind only to B cell receptors and no other cells.
  2. Disassemble automatically when taken up by the cell and release the mRNA payload.

We addressed each of these requirements using the framework and constraints of DNA origami and nanotechnology.

  1. Bind only to B cell receptors and no other cells. DNA origami structures can be easily functionalized and decorated with proteins using chemical modifications and attachment chemistry. After conducting a literature review of B cell receptors and CD proteins, we determined that the best receptor to target would be CD27, since it is highly expressed on all plasma B cells (as described by Jung et al.), which are professional antibody-producing cells. CD27 is a B cell receptor which controls B cell differentiation and activation; stimulation of CD27 also promotes antibody production, which could boost the translation of our delivered antibody sequence (Jung et al.). This means that our delivery vehicle would have to be decorated with CD70, which is the binding/activating ligand for CD27.
  2. Disassemble automatically when taken up by the cell and release the mRNA payload. There were a number of options we considered for this response, including strand displacement and temperature-based disassembly. However, we decided to use DNA i-motifs in our structure instead. DNA i-motifs are structures that form when 4 cytosine-rich DNA strands hydrogen bond with each other to form a quadruplex structure. These structure are useful because they respond to changes in pH. i-motifs only form at low pHs (<5.5) and act like normal DNA strands at higher pHs. This property allows for payload release in the cellular endosome, as the endosome is held at an acidic pH of around 5. By incorporating i-motif staples into our delivery vehicle, these staples would unbind from the vehicle scaffold and refold into i-motifs when the vehicle is taken up into the endosome. When these staples refold into i-motifs in response to the decrease in pH, this should cause the structure to disassemble and release the antibody mRNA into the cellular environment.

A flyaround animation of a DNA i-motif. Cytosine nucleotides are represented as yellow slabs, thymine as blue slabs and adenine as red slabs. Adapted from PDB entry 1A83.

Our structure also needed to be large enough to accommodate several of the relatively long antibody mRNAs (upwards of 1300nt) that it would be carrying. We wanted each of these antibody mRNAs to be mostly unfolded and linear inside the box since complex mRNA secondary structure is known to inhibit translation. Since the length of the cargo mRNA would be around 80nm when unfolded, we decided on making our structure a box with a side length of 80nm. Our inspiration for the box design came from a study by Burns et al., which presented a DNA origami box with a lid that opened in acidic environments using DNA i-motif staples. However, our box design needed to be much larger than that of Burns et al. to accommodate the long RNA molecules inside. Therefore, we decided to split the box into two identical C-shaped subunits that attach to each other using i-motif staples. A visualization of the two C-shaped subunits combining to form the complete box can be seen below. We also wanted to minimize the number of scaffolds being used to assemble the structure since including multiple scaffolds significantly increases the probability of misfolding and side product formation (Wagenbauer et al.). To this end, we additionally added cutouts to the faces of the box to minimize the overall amount of DNA needed to create each subunit; with the cutouts, our structure could be assembled using only 2 scaffolds, the p8634 mutated phage scaffold and the phiX174 circular phage genome. The size of each cutout (roughly 20nm on a side for each cutout) was determined using information from Bujold et al. to minimize the risk of nuclease entry as well as the amount of DNA needed to construct the box.

A depiction of two identical DNA origami C-shaped subunits combining to form a single box structure.

Once we had decided on a final design for our DNA origami delivery vehicle, we used a DNA origami CAD program called caDNAno2 to model and visualize the structure as well as route the scaffold and staples that would hold the structure together. caDNAno is the standard CAD design program used by researchers as a starting point to model and visualize DNA origami structures as well as create a computational representation of them for downstream applications. We first replicated the C shape with cutouts by routing the scaffold in the corresponding pattern. We then generated staples for the scaffold using the following guidelines:

  1. No staple could be shorter than 16nt or longer than 60nt.
  2. Staples and scaffolds could only cross over each other in positions predetermined to be stable by caDNAno based on the turning of the DNA helix.
  3. Crossovers could not overlap with each other, but as many crossovers should be used as possible to maximize stability.

Our initial design, made entirely from scratch, is shown below. This design would later be iterated and improved significantly based on the results from simulations.

An image of the entire caDNAno design for a single C-shaped subunit.

Additionally, we later identified staples that could be used to anchor the mRNA payload inside the box. These staples were identified because their 5’ or 3’ end pointed towards the inside of the box and could therefore be extended into a handle that would anchor the mRNA inside the box. By carefully choosing the handle lengths that hybridize to the mRNA payload, the box could be designed to release the RNA at cellular temperatures by tuning the melting temperature of each handle. These so-called “anchor” staples are highlighted in red and yellow in the image below. We also identified staples that could be extended into i-motifs that would attach to another subunit. These staples and their corresponding attachment points are highlighted in green in the image.

An example portion of the caDNAno design showing the staples identified to be potential handle locations for anchoring the antibody mRNA payload. Staples whose handles should be located on the 5' end are in red, those whose handles should be located on the 3' end are in red, and all other staples without handles are in gray and black.

With our DNA origami design complete, we then proceeded to simulate the design using finite element analysis and molecular dynamics (see the “Model” page for more information).