Team:Harvard/Model

Modeling

Machine Learning

Stacked Ensemble Approach

While surveying the methods described in the review paper “A Review of Deep Learning Methods for Antibodies”, we noticed a common theme in our literature review sessions. Many successful methods for antibody design use an ensemble approach, in which the results of multiple component models are ‘merged’ in some way to produce a final output. In these studies, the ensemble model often consistently outperformed each of its component models, ostensibly because the biases and potential systematic variations of the components “cancel each other out” (Graves et al. 2020). Inspired by this, we decided to use an ensemble approach in our own model. In particular, we used a stacked ensemble approach, in which the components of the model are hierarchically “stacked”: the outputs of one model are used directly as the inputs to the next.

This figure depicts a basic ensemble schema. Predictions from previous models are incorporated into a “meta-learning” model. If an entire model is used within the next successive model, the structure is called a “stacked-ensemble” model. Supun Setunga, CC BY-SA 4.0, via Wikimedia Commons

In our project, we built a two-layer stacked ensemble model. The first layer was our scoring function, a random forest (RF) model trained on lab-validated data from the CoV-AbDab database as described above. The scoring function takes a CDR-H3 sequence and outputs a score from 0 to 1 that measures its efficacy against SARS-CoV-2. The second layer was our optimization model, which used Differential Evolution (see the section below for details) to optimize antibody sequences with respect to the scoring function. Crucially, the second layer used the outputs of the first layer (the scoring function) to perform the optimization. This is what makes our model a “stacked ensemble”.
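To make the coupling concrete, here is a minimal sketch of the two-layer handoff in Python. The data and function names are illustrative stand-ins, not our actual modules: the point is that the optimizer never touches the training data directly, it only queries the fitted layer-1 scorer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-ins for our encoded CoV-AbDab training data: length-15
# CDR-H3 sequences as integer vectors, with synthetic 0-1 labels.
X_train = rng.integers(0, 20, size=(200, 15))
y_train = rng.random(200)

# Layer 1: the scoring function (random forest regressor).
scorer = RandomForestRegressor(n_estimators=100, random_state=0)
scorer.fit(X_train, y_train)

# Layer 2 sees only this black-box fitness function, never the raw
# training data -- that handoff is what makes the ensemble "stacked".
def fitness(seq_vector):
    return scorer.predict(np.asarray(seq_vector).reshape(1, -1))[0]

print(fitness(rng.integers(0, 20, size=15)))  # score in [0, 1]
```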

It has been demonstrated in practice that stacked ensemble approaches tend to perform exceptionally well on optimization problems. They not only tend to converge quickly, but also yield improved accuracy in comparison to other approaches in many settings. The paper Stacked Ensemble Models for Improved Prediction Accuracy, which we credit as the primary inspiration for our stacked-model approach, highlighted some of these results (Gunes et al. 2017).

Interpol

Interpol is a freely available R package that interpolates amino acid sequences to normalize them to a common length and converts each residue into a numerical value using AAindex. This was immensely helpful in giving our model more information with which to make educated predictions. Initially, we used RPy2 to run Interpol through Python, but this proved extremely inefficient: at one point our model took 20 hours to run. As we discuss in our Design and Contributions sections, we therefore wrote our own Python function to convert amino acid sequences into AAindex values, and we also implemented a one-hot continuous encoding for amino acid sequences. We have made the Python AAindex implementation publicly available as a separate section of our GitHub. This reduced our model's runtime by hours, allowing us to run many more iterations of the model and make many more tweaks.

The R code string we originally used to run the Interpol functions through Python via RPy2. This block of code took about 20 hours to run before we made our switch.
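The sketch below illustrates the pure-Python encoding idea we switched to. The property table is a toy excerpt (the Kyte-Doolittle hydropathy scale); our actual encoder looks up many AAindex properties per residue, and the interpolation step mirrors what Interpol does to normalize sequences to a common length.

```python
import numpy as np

# Toy AAindex-style property table: Kyte-Doolittle hydropathy values.
# Our real encoder uses many AAindex properties; one shows the idea.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode(seq, table=HYDROPATHY):
    """Map an amino acid sequence to a numeric property profile."""
    return np.array([table[aa] for aa in seq])

def encode_fixed_length(seq, length=15, table=HYDROPATHY):
    """Interpol-style normalization: resample the profile to a fixed
    length by linear interpolation so all vectors are comparable."""
    values = encode(seq, table)
    x_old = np.linspace(0.0, 1.0, len(values))
    x_new = np.linspace(0.0, 1.0, length)
    return np.interp(x_new, x_old, values)

print(encode_fixed_length("CARDYGDYW"))  # illustrative CDR-H3-like input
```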

Random Forest Regressor

A Random Forest Regressor (RFR) is a type of ensemble model that trains many decision trees, whose outputs are then averaged to make the best prediction possible. The RFR served as the scoring function for our Differential Evolution model. Its input was a flattened numpy array of the AAindex values for each antibody. However, because these feature vectors were very high-dimensional, we applied a PCA reduction and ran the PCA-transformed data through our model to improve performance.
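A minimal sketch of this stage, assuming synthetic stand-in data (our real flattened AAindex vectors had around 8,000 dimensions, and we kept the leading principal components shown in the figures below):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Stand-ins for flattened AAindex feature vectors (~8,000 dims in our
# real data) and 0-1 efficacy labels from CoV-AbDab.
X = rng.random((300, 8000))
y = rng.random(300)

# Project onto the first 9 principal components, then regress.
model = make_pipeline(
    PCA(n_components=9),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(X, y)
print(model.predict(X[:3]))  # scores in [0, 1]
```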

Heatmap of the first 9 principal components from our PCA analysis. Most of the variance is explained by this small number of PCs (the original feature space had around 8,000 dimensions).

Bar graph of how much each PC contributed to the model output. These results helped us determine which AAindices we wanted to focus on.

After we ran our Random Forest Regressor, the model results were decent. However, we had chosen our RFR model parameters fairly arbitrarily, so we wanted to determine rigorously which hyperparameters would produce the best results. To do this, we used the hyperparameter optimization methods RandomizedSearchCV and GridSearchCV. We found that the randomized search performed slightly better, so we used its generated hyperparameters in our final model.
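For readers who want to reproduce this step, the sketch below shows a typical RandomizedSearchCV setup; the parameter grid here is a common random-forest grid, not necessarily the exact one we searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X, y = rng.random((200, 9)), rng.random(200)  # stand-in PCA features/labels

# Illustrative search space; our exact grid may have differed.
param_distributions = {
    "n_estimators": [100, 200, 400, 800],
    "max_depth": [None, 10, 20, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=50,     # evaluates 50 random combinations instead of all
    cv=5,          # 5-fold cross-validation per combination
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```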

Image of the runtime and execution of the RandomizedSearchCV, and its produced hyperparameters.

Differential Evolution Model

Once we developed our scoring function, the next step was to use it to iteratively “optimize” a given CDR-H3 sequence with respect to that scoring function. For our proof-of-concept, we chose to look at CDR-H3 sequences of length 15 since the CoV-AbDab database had the most data corresponding to this length after pre-processing. Since there are 20 possibilities for each amino acid in the sequence, there are 20^15 = 3.28*10^19 possible sequences to search, which is far too many to look at using a naive approach.

We first tried to do some first-order optimizations to aid efficiency of the search, such as using similarities and differences in biochemical properties of different amino acids to direct the search. However, it soon became clear that the sheer size of the sample space made it impossible to reach convergence within a reasonable amount of time. Therefore, we looked for a more robust approach used for solving this type of optimization problem.

Differential Evolution (DE) describes a family of algorithms that efficiently search a sample space for local maxima or minima of a given function defined on that space. In our case, the function is the Random Forest Regressor scoring function, which produces a score from 0 to 1 for any length-15 CDR-H3 sequence. The method works analogously to how actual evolution works in an idealized population. The approach we used is outlined in the following steps:

  1. Initialize a starting “population” of seed sequences. Our model starts with a small set of CDR-H3 seed sequences. These were chosen from the CoV-AbDab database and scored by our scoring function to establish an initial baseline for the “fitness” of each member of the population. As done in “An Investigation in Optimal Encoding of Protein Primary Sequence for Structure Prediction by Artificial Neural Networks”, we encode each amino acid with a unique number from 0 to 20, thereby encoding each CDR-H3 sequence as a vector of length 15 (Hein et al. 2020). We optimize the sequences in continuous space (allowing each entry to be a real number, rather than just an integer), projecting to discrete space by rounding each entry to the nearest integer and then transforming it back to an amino acid using the same encoding scheme in order to score the sequence. This process of continually projecting between continuous and discrete spaces was inspired by the paper “Antibody complementarity determining region design using high-capacity machine learning” (Magar et al. 2020).

    The typical pipeline for many data analysis models. Data initialization and pre-processing were important components of our model, as we had to make sure our data was validated. Jforgo1, CC BY-SA 4.0, via Wikimedia Commons

  2. Choose model hyperparameters. As explored in the paper “On the Performance of Differential Evolution for Hyperparameter Tuning”, there are two fundamental hyperparameters in any differential evolution model (Schmidt et al. 2019). The first is the mutation factor, which indicates how significantly our simulated mutations change the item being mutated. According to the above paper, values for this parameter are generally chosen in the interval [0.5, 2]. Larger values for the mutation factor introduce more variability in the results and thereby ensure that the sequences searched are relatively heterogeneous; the tradeoff is that larger values can also make the algorithm take a very long time to converge. The second important hyperparameter is the crossover probability, which is essentially the in silico analogue of recombination frequency. Since this is a probability, it is selected in the interval [0, 1], and larger values indicate higher probabilities that a mutation will actually occur.

    To search for optimized hyperparameters, we used a similar approach as for the Random Forest hyperparameter optimization (randomized search across the intervals of possible parameter values, described above) and compared model outputs. Ultimately, we found no significant differences among mutation factor choices in the interval [0.5, 2]. Crossover probabilities likewise made no significant difference to model outputs, but values in the interval [0.5, 0.9] generally yielded the fastest convergence.

    A visualization of the randomized search algorithm we used to optimize hyperparameters. Random points within the intervals x1, x2 (the ranges of acceptable values for our two parameters, mut and crossp) are evaluated first, and then level curves of model performance are computed to further narrow in on the best choice of hyperparameters. Alexander Elvers, CC BY-SA 4.0, via Wikimedia Commons

  3. Implement a computationally inexpensive method to generate mutations. To intelligently speed up the convergence of the algorithm, differential evolution approaches use a way of generating mutations targeted towards the end goal. It is important that this method run almost instantaneously, because our algorithm calls it tens of thousands of times at the very least. To do this, we first select three random sequences (encoded as vectors) from the population. For each of the 15 positions, we compute the mutation value a + mut*(b-c), where a, b, c are the entries in the three sequence vectors and mut is the mutation factor described in the previous section. We then linearly transform these values to the interval [0, 20] and project them down to discrete space to get a candidate point mutation for each position in the vector. Finally, we independently decide, with probability equal to the crossover probability, whether to mutate each position in a given sequence. This is a classical approach commonly used in differential evolution for its ease of implementation and relatively good performance (a code sketch of this step, together with the epoch loop from step 4, appears after this list).

    A visualization of the prototypical optimization problem that differential evolution aims to solve. We want to find the lowest point on the surface, but it’s computationally expensive to check every single point. So, we sequentially check a subset of points by using information that we learn at each step to induce targeted mutations that get us closer to our end goal. By picking seed sequences at various points in the sample space, we minimize the chances that the entire population gets stuck in a local extremum and never finds the global extremum.Pablormier, CC BY-SA 4.0, via Wikimedia Commons

  4. Simulate ‘epochs’ of evolution. Each “epoch” consists of a number of iterations equal to the population size. In each iteration, we select one of the sequences in the population, apply the above mutation procedure to it, and evaluate both the original and mutated sequences with our scoring function. If the mutated sequence scores higher (i.e. it has greater “fitness” for binding/neutralizing SARS-CoV-2), we replace the original sequence with the mutated sequence in the population. At the end of each epoch, this procedure has been repeated for every sequence in the population, so every sequence after the epoch scores greater than or equal to the corresponding sequence before the epoch. After running this algorithm for a predetermined number of epochs (usually on the order of hundreds), we arrive at a final population of optimized sequences.

    In the Design section, we discussed how a stacked-model approach can improve model efficiency and performance via modularizing our pipeline and training/testing models on each component. In addition to using this approach on our overall model, we experimented with substacking within the Differential Evolution component itself. In particular, we tried performing two layers of DE optimization (each with the same 4 steps described above), with the optimized sequences obtained from the first layer being used as the population input for the second layer. However, we did not observe a significant difference in the runtime or model performance in this case.
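The sketch below ties the four steps together. It is a minimal illustration, not our exact implementation: for concreteness it codes the 20 amino acids as integers 0-19, and a toy deterministic fitness function stands in for our Random Forest scorer. The hyperparameter values are drawn from the intervals discussed in step 2.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 residues, coded 0..19 here
SEQ_LEN, MUT, CROSSP = 15, 0.8, 0.7    # values from the intervals in step 2
rng = np.random.default_rng(0)

# Toy stand-in for the layer-1 RF scorer: higher when closer to a
# hidden target vector, so the search has something to optimize.
TARGET = rng.uniform(0, 19, SEQ_LEN)
def fitness(int_seq):
    return 1.0 / (1.0 + np.abs(int_seq - TARGET).mean())

def decode(vector):
    """Project a continuous vector to discrete space (step 1's scheme):
    round, clip, and map each integer back to its amino acid."""
    ints = np.clip(np.rint(vector), 0, 19).astype(int)
    return ints, "".join(AMINO_ACIDS[i] for i in ints)

# Step 1: seed population in continuous space.
pop = rng.uniform(0, 19, size=(30, SEQ_LEN))
scores = np.array([fitness(decode(p)[0]) for p in pop])

# Step 4: epochs of greedy mutate-and-replace (step 3's rule inside).
for epoch in range(100):
    for i in range(len(pop)):
        others = [j for j in range(len(pop)) if j != i]
        a, b, c = pop[rng.choice(others, 3, replace=False)]
        mutant = np.clip(a + MUT * (b - c), 0, 19)   # step 3 mutation
        cross = rng.random(SEQ_LEN) < CROSSP         # per-position crossover
        trial = np.where(cross, mutant, pop[i])
        trial_score = fitness(decode(trial)[0])
        if trial_score > scores[i]:                  # keep only if fitter
            pop[i], scores[i] = trial, trial_score

best_ints, best_seq = decode(pop[np.argmax(scores)])
print(best_seq, scores.max())
```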

Results of all of the above approaches are discussed on the Results page.

DNA Origami

Brownian Motion and Diffusion of Enzymes through Cutouts

Since we had to introduce cutouts to our DNA origami structure to improve its chances of assembling, it is important that RNases and other enzymes that could harm the mRNA payload not enter the structure through one of the cutouts. To estimate the probability of this happening, we used two calculations.

First, we used a simple geometric calculation to estimate the probability that a randomly moving protein could enter through one of the cutouts, dividing the total area of all the cutouts on one face, 3802 nm^2, by the total area of one face, 6400 nm^2, which gives 3802/6400 ≈ 0.59. This calculation assumes that the radius of a protein is negligible compared to the size of the cutouts and that any contact with the DNA structure will prevent a protein from entering. Under these assumptions, the chance of a protein entering through a cutout is around 60%. This may seem high, but the box will be moving through the bloodstream, possibly away from any enzyme that is randomly diffusing towards it. Moreover, even once an enzyme has passed through a cutout, it must make contact with the mRNA to break it down, which is difficult because the mRNA is anchored to the inside of a face of the box and is not easy for an enzyme to access; in that time, the enzyme could diffuse out of the box. The fact that each box will carry multiple copies of the mRNA further improves the chances of at least one mRNA molecule reaching its destination intact.

To improve the accuracy of this calculation, we then conducted a Brownian motion simulation in Python to model the diffusion of an RNase enzyme. Our simulation modeled the enzyme as a sphere of uniform diameter that diffuses by moving in a random direction at each discrete timestep, with the rate of diffusion given by the Einstein-Stokes equation. Because of computational limitations, we made the following assumptions in our model: the enzyme started 0.006 nm away from the structure (extremely close compared to the size of the enzyme), any contact with the structure would prevent the enzyme from entering it, and the concentration of enzyme molecules in blood is uniform. For simplicity, we also assumed that there were no attractive or repulsive forces between the enzyme and the structure. (Further details about the simulation parameters can be found in the Python code.) We obtained the histogram below, which shows the frequency with which an enzyme enters the structure after a given number of seconds. We found that the enzyme collides with the structure within 1 timestep only 40% of the time, moving away from the structure in the remainder of the trials. Considering that most enzymes would begin much further away from the structure, this probability is likely an overestimate. This suggests that the probability of an enzyme entering the structure is low enough that our mRNA payload is safe even with the cutouts on the box faces.
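The following is a minimal sketch of this kind of simulation, not our exact script: the enzyme radius, timestep, and collision geometry are illustrative placeholders, and the diffusion coefficient comes from the Einstein-Stokes relation D = kT / (6πηr).

```python
import numpy as np

# Einstein-Stokes diffusion coefficient: D = kT / (6*pi*eta*r).
# All physical values below are illustrative placeholders.
kB, T = 1.380649e-23, 310.0            # J/K and body temperature (K)
eta = 1.0e-3                           # Pa*s, roughly water's viscosity
radius = 2.0e-9                        # m, approximate enzyme radius
D = kB * T / (6 * np.pi * eta * radius)

dt = 1e-9                              # s, simulation timestep
sigma = np.sqrt(2 * D * dt)            # per-axis step size (std dev)

rng = np.random.default_rng(0)

def steps_to_contact(start_z=0.006e-9, max_steps=1000):
    """Random walk in 3-D until the enzyme crosses the plane of the
    face (z <= 0) or wanders off; returns the step count or None."""
    pos = np.array([0.0, 0.0, start_z])
    for step in range(max_steps):
        pos += rng.normal(0.0, sigma, size=3)
        if pos[2] <= 0.0:
            return step + 1
    return None

trials = [steps_to_contact() for _ in range(500)]
hits = [s for s in trials if s is not None]
print(f"contact within 1 step: {sum(s == 1 for s in hits) / len(trials):.0%}")
print(f"contact at all:        {len(hits) / len(trials):.0%}")
```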

A histogram showing the frequency that an enzyme collides with the structure after a given amount of time.

Finite Element Analysis and Staple Optimization

One of the most basic ways to analyze the stability of a DNA origami structure (as well as other structures) is to analyze its mechanical rigidity and deformation. We accomplished this initial simulation by conducting finite element analysis on our DNA origami structures, which involves breaking each structure up into small pieces and simulating the effects of a force on each of the pieces. By combining the effects from each of the small pieces, a broader effect on the entire structure can be measured and visualized.

This simulation does not account for the molecular stability of the structure or the atomic-scale forces within it. It assumes that DNA can be modeled like a uniform material with certain mechanical properties that have been measured empirically, and simulates the stresses and deformations on a structure using these mechanical properties only. It cannot give a broader view of the molecular stability of the structure, or its stability in various environmental/solvent conditions.

However, finite element analysis is useful as a starting point for evaluating the mechanical stability of the structure, especially when the structure is subjected to the random thermal forces and fluctuations that can have a large effect at the nanoscale. For our project, we used finite element analysis to optimize the staples and crossovers within a single C-shaped subunit of our DNA origami structure. Since the staple layout and design are primarily what determine the structural rigidity of a DNA origami structure, we used the finite element analysis results to optimize our staple designs and increase structural rigidity. We ran our simulations at 37 degrees Celsius to simulate body temperature and used the default empirically determined material properties of DNA. We used the open-access webserver CanDo to run our simulations, with the output from each simulation being used to guide changes and improvements to the staple layout, which was then simulated again to inform further changes in an iterative process. Staple improvements were made manually in caDNAno after each iteration and ranged in complexity from minor length changes to entire reworkings and redesigns of the staple layout. In total, 13 iterations were conducted. The figure below shows selected iterations, the simulation results for those iterations, and a snapshot of the staple layout for each iteration. The maximum structural deformation from an initial resting position was calculated for each iteration. We decided that a 33% decrease in maximum deformation would be suitable for downstream applications; by the 13th iteration, deformation had been reduced by 37%, and the final result was deemed suitable for more in-depth simulations (see the “Molecular Dynamics” section below).

Selected iterations of the CanDo simulations and staple optimizations. Each column shows, in order from top to bottom: the iteration number; a color-coded image showing the deformation of the DNA origami structure, with blue being the least deviation from the original structure and red being the greatest deviation; the maximum deformation of the structure in arbitrary units (AU), as determined from the simulation video; and a representative snapshot of the staples in each iteration. Note the increasing complexity and number of staple crossovers as the design is iterated and the max deformation decreases.

A movie showing the thermal fluctuations of Iteration 13 from the figure above. The colorbar at the bottom of the movie indicates the deformation of the structure from the original structure in nanometers.

Molecular Dynamics

The fully optimized structure was then evaluated using molecular dynamics, a more comprehensive simulation technique that simulates the structure on a much finer scale than finite element analysis. Molecular dynamics analyzes the forces between atoms or molecules within a structure as well as the electrostatic forces from outside the structure (i.e. from the solvent or nearby ions) to determine whether the forces holding a structure together are enough to counteract opposing forces and random thermal fluctuations. The purpose of using molecular dynamics was to simulate the stability of the structure at a more detailed level, since molecular dynamics can operate at the molecular or atomic scale, and our hope was that the results from these simulations would not only tell us about the stability of the structure in a physiological environment but also shed light on whether the structure would self-assemble properly in the first place. Please note that these simulations are still a work-in-progress, as will be explained later.

The specific molecular dynamics simulation we used is called oxDNA, which is a simplified molecular dynamics program designed for DNA origami nanostructures. Because of the large size and complexity of DNA origami structures, oxDNA makes some simplifying assumptions and coarse-graining of calculations to ensure that the simulation can be completed in a reasonable amount of time. Firstly, oxDNA focuses mostly on the forces and potential energy within individual DNA double helices in a structure. It treats each sugar-phosphate backbone as a single entity (since their interactions with the rest of the structure are likely limited) and each nitrogenous base as a single entity with certain electrostatic properties. The oxDNA model then simulates the strength of hydrogen bonding, base stacking and backbone stacking, all of which are favorable interactions that increase the stability of the helix. It also assumes that each backbone entity and base entity have a fixed volume and accounts for this volume by ensuring that no two molecules are in the same place or overlapping at a given time. Then, using a given input temperature, the model simulates the thermal energy and entropy that the structure would have at that temperature, and determines what would happen to each helix in the structure.

Molecular dynamics simulations are discrete: they calculate the forces on a structure at a certain point in time, advance the simulation by a fixed timestep, calculate the new forces based on the new positions of the molecules, and so on. Because they are not continuous, some of the interactions measured may be exaggerated or unrealistic depending on the timestep chosen. Also, because of further simplifications that the oxDNA model uses in modeling diffusion, it is difficult to match time in the simulation with time in real life (i.e. it cannot really be said that one timestep in the simulation corresponds to a certain number of seconds in real life). Finally, oxDNA is a Monte Carlo-type molecular dynamics simulation, meaning it is a stochastic model that employs a certain amount of randomness to improve its performance, including randomness in determining the precise starting configuration of the structure. Despite these simplifications, oxDNA simulations have been shown to match experimental/empirical data for the folding and dynamics of a wide variety of DNA origami structures, so we decided to use it to gain a deeper understanding of the stability of our own structure. Additionally, given the simplifications that the oxDNA model makes and the fact that it operates on a structure that is already folded, the simulation gives the structure the best possible chance to demonstrate its stability. If the pre-folded structure is shown to be unstable in these simulations, then it is unlikely to assemble correctly in a laboratory setting.

We used the freely available oxDNA webserver to simulate one C-shaped subunit of our larger box structure. All simulations were run at body temperature (37C) and a salt concentration similar to that found in blood (~0.2M). Because caDNAno only produces a graphical representation of the structure that does not provide the real-life location of each molecule in the structure, we used the tacoxDNA webserver to convert our caDNAno structure to a physical representation of the real-life structure. We then relaxed the structure using the oxDNA webserver to provide a more realistic starting configuration for the molecular dynamics simulations. Finally, we ran the molecular dynamics simulation for our structure and obtained the trajectory for our structure shown in the video below. (All parameters for simulations are listed in the Appendix page.)
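The webserver sets most of these parameters through its interface; for readers running the standalone oxDNA distribution instead, an input file along the following lines reproduces the conditions we describe. The key names follow the standalone oxDNA documentation, but the specific values and file names here are illustrative placeholders; our exact parameters are listed in the Appendix.

```python
# Illustrative oxDNA MD input file for our conditions (37 C, ~0.2 M
# salt). Key names follow the standalone oxDNA documentation; values
# are placeholders -- see the Appendix for the parameters we used.
params = {
    "sim_type": "MD",
    "backend": "CPU",
    "interaction_type": "DNA2",       # DNA model with salt dependence
    "T": "37C",
    "salt_concentration": 0.2,
    "steps": 1000000,
    "dt": 0.003,
    "thermostat": "john",
    "topology": "subunit.top",        # from the tacoxDNA conversion
    "conf_file": "relaxed.dat",       # relaxed starting configuration
    "trajectory_file": "trajectory.dat",
    "energy_file": "energy.dat",
    "print_conf_interval": 10000,
    "print_energy_every": 1000,
}

with open("input.md", "w") as f:
    f.writelines(f"{key} = {value}\n" for key, value in params.items())
```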

The result of the oxDNA molecular dynamics simulation being run on Iteration 13 (the final iteration) from the previous simulation step. The free-floating strands are likely staple strands that have fallen off the structure due to thermal fluctuations.

A graph of the potential energy of the structure in the video above over time. Both energy and time are in simulation units (SU). Note the initial steep increase in energy, which then begins to fluctuate around a mean value, possibly indicating convergence of the structure to some mean configuration.

Judging by the trajectory obtained from this first simulation, it seems that our DNA origami structure is not very stable and tends to unravel at body temperature. Upon closer examination, we determined that this is because of the short length of many of the staples in the structure (16nt), which means that not much thermal energy is required to unbind them from the scaffold. Because many of the short staples that hold the structure together ended up falling off over the course of the simulation, the structure began to unravel and lose its shape by the end of the video. The graph of energy versus time above also shows that the structure has indeed converged to some average configuration and is merely oscillating/fluctuating around this mean configuration. This also shows that our structure in its current form would likely not assemble correctly, since the same thermal energy that causes the staples to unbind from the structure would also prevent them from binding in the first place during the assembly process. We simulated a second iteration of the structure in which many of the staples were lengthened or connected to other staples to increase the melting temperature, but most short staples could not be extended due to design constraints. As a result, the second oxDNA simulation produced results that were very similar to the first simulation depicted above, in which many of the staples again fell off the scaffold due to their low melting temperatures, causing the structure to unravel.

To ensure that the simulation was producing results that reflect real-life experiments, we additionally simulated a hexagonal bundle of 6 DNA helices bound together by staples. This 6-helix bundle has previously been shown by multiple research groups to be highly stable in vitro and to assemble with high efficiency in a laboratory environment. We simulated the 6-helix bundle using the same parameters that were used for the C-shaped subunit (exact parameters can be found in the Appendix). The image below shows the 6-helix bundle in caDNAno, and the video shows the simulation trajectory for this structure. While there is some minor unraveling at the ends of the bundle, it holds together with no major loss of structural integrity or shape over the course of the simulation, as expected based on past laboratory results. The energy plot below the video shows the convergence of the 6-helix bundle towards a mean configuration. These results confirm that our simulation results for the C-shaped subunit are likely to reflect reality and are not a flaw arising from the simulation parameters.

A snapshot of the 6-helix bundle structure in caDNAno with a representative closeup of the staple layout near the middle of the bundle.

The result of the oxDNA molecular dynamics simulation being run on the 6-helix bundle using the same parameters that were used to simulate the C-shaped subunit above. The fact that the 6-helix bundle maintains its integrity as expected helps validate the simulation results for the C-shaped subunit.

A graph of the potential energy of the 6-helix bundle in the video above over time. Both energy and time are in simulation units (SU). Note the initial steep increase in energy, which then begins to fluctuate around a mean value, possibly indicating convergence of the structure to some mean configuration.

We are currently working on tweaking the staples in the structure to make them longer and more resistant to thermal fluctuations. While our current structure may not have had a favorable result in these simulations, we believe that with further staple optimization and redesigns our structure can be made stable at body temperature for a reasonable amount of time. We will update this page as progress is made on optimizing our structure and improving its stability.

In case our structure proves not to be viable, there are multiple designs in the literature that could be adapted to carry an mRNA payload. In particular, the large box described by Andersen et al. and the nanocapsule described by Anastassacos are both large-format DNA origami structures that could be used to deliver a long mRNA antibody sequence.