Team:Harvard/Results

Results

Project Achievements

After several months of hard work, our team is excited to announce the progress that we have made so far:

  • Based on the experimentally validated data from the Oxford Protein Informatics Group, we were able to use machine learning to create a CDR-H3 Scoring Function that provides a reasonable metric for an antibody’s efficacy against the SARS-CoV-2 virus. In particular, this model is conservative and works well at the upper tail: every antibody scored above a certain threshold by our model was also identified as having “strong binding affinity and neutralization activity” by the Oxford Protein Informatics Group.
  • Even with limited input data, our Differential Evolution-based Optimization Algorithm is capable of significantly optimizing antibody sequences with respect to a scoring function. As the CoV-AbDaB database continues to grow, we’re excited to incorporate the new data into our model.
  • We designed a novel DNA origami nanostructure from scratch that can carry our antibody mRNA to its target. Our nanostructure is designed to specifically release the mRNA only inside plasma B cells, and employs a variety of modifications and programmable behaviors to do so.
  • We improved the mechanical integrity of our structure iteratively using finite element analysis and obtained valuable information about its stability using molecular dynamics simulations.

In the interest of scientific honesty and the spirit of consistent progress, here are some things that haven’t been as successful as we hoped. As our project continues to evolve, we’re excited to continue making headway in these areas.

  • Though our scoring function does create general separation that works well enough for our proof of concept, there are still some isolated instances where its evaluations significantly differ from those of the Oxford Protein Informatics Group. We expect this to be less of an issue as more data gets added to the CoV-AbDaB database (and subsequently incorporated into our model), but we’re currently exploring other avenues to resolve this problem.
  • Our model incorporates some of the most important AAIndices (as determined by our literature reviews with our mentors), but since the runtime of our model increases significantly with each additional AAIndex, we were not able to include some factors which could be relevant. We’re working on further runtime optimizations such as substacking within the DE algorithm, as explained in the Model section.
  • Due to the design constraints of our DNA origami structure, some of the staple strands in the structure are too short to withstand the thermal energy from body temperature, which causes the structure to unravel over time. We hope to remedy this by lengthening the staples or modifying our design to accommodate longer staples. Alternatively, there are large DNA origami structures described in literature that could be adapted to carry an mRNA payload.

Machine Learning

We’ll now walk you through some figures that outline the key results of our project.

The heat map displays the variances of the first 9 principal components (PCs) calculated by the PCA reduction. These first 9 PCs account for a large share of the variance, so we can reduce the amount of data our model has to process by keeping only a few of the PCs. This made our algorithm run significantly faster, which was crucial for obtaining the results we observed. See the Design section for more on this.
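As a rough illustration of the reasoning above, the sketch below shows how the share of variance carried by each component, and the number of components needed to reach a chosen cumulative share, can be computed from a list of per-component variances. The function names, the 85% target, and the example variances are all assumptions made for illustration; they are not values from our PCA.

```python
from itertools import accumulate


def explained_variance_ratio(variances):
    """Return each component's share of the total variance."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return [v / total for v in variances]


def components_needed(variances, target=0.85):
    """Smallest number of leading components whose cumulative share reaches target."""
    ordered = sorted(variances, reverse=True)
    cumulative = accumulate(explained_variance_ratio(ordered))
    for count, share in enumerate(cumulative, start=1):
        if share >= target:
            return count
    return len(ordered)


# Illustrative variances only (hypothetical, not taken from our data).
example_variances = [5.0, 3.0, 1.0, 0.5, 0.5]
ratios = explained_variance_ratio(example_variances)
kept = components_needed(example_variances, target=0.85)
print(ratios, kept)
```

If a small number of components already reaches the target share, the remaining components can be dropped with little loss of information, which is what keeps the downstream model fast.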

This is an example visualization of one randomly selected decision tree from our Random Forest Algorithm. The x-axis represents the true ordinal classification, and the y-axis represents the prediction output by the decision tree. The algorithm works by generating hundreds of decision trees like this and then employing an ensemble method to "average" the results together. This particular decision tree tended to underestimate the scores of most sequences. In fact, this holds true for our scoring model in general, as explained later in this section. This conservative approach is by design: we wanted to minimize the false positive rate so that our limited lab resources go towards validating antibodies that have already passed a very strict test.
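To make the ensemble idea and the "conservative" property concrete, here is a minimal sketch. It assumes the ensemble step is a plain average of per-tree predictions, and it measures conservativeness as the fraction of sequences above a score threshold that are truly strong binders. The function names, the threshold, and all numbers are hypothetical and are not drawn from our model or from the CoV-AbDaB data.

```python
def ensemble_score(tree_predictions):
    """Average the predictions that the individual trees made for one sequence."""
    return sum(tree_predictions) / len(tree_predictions)


def precision_above_threshold(scores, is_strong_binder, threshold):
    """Fraction of sequences scored at or above threshold that are true strong binders.

    Returns None when no sequence clears the threshold.
    """
    selected = [label for s, label in zip(scores, is_strong_binder) if s >= threshold]
    if not selected:
        return None
    return sum(selected) / len(selected)


# Hypothetical per-tree predictions for four sequences (three trees each).
per_tree = [
    [0.9, 0.8, 1.0],
    [0.4, 0.5, 0.3],
    [0.7, 0.6, 0.8],
    [0.2, 0.1, 0.3],
]
labels = [True, False, True, False]  # hypothetical ground truth

scores = [ensemble_score(p) for p in per_tree]
precision = precision_above_threshold(scores, labels, threshold=0.65)
print(scores, precision)
```

A precision of 1.0 above the threshold corresponds to the situation described here: nothing that clears the strict cutoff is a false positive, even if some good sequences are scored too low.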

The above figure illustrates the convergence of our differential evolution model, as measured by the mean score of our population at each epoch. Due to the large number of optimizations done within each epoch (see Model page), we saw in some cases that there could be substantial improvement within just a few epochs. This “jumping” behavior motivated our substacking approach: running many small ‘blasts’ with a low number of epochs might provide output populations with relatively high scores. Since most of the rapid improvement seems to happen within the earlier epochs, we hoped that using these output populations as the input for a second round would help produce another burst in optimization.
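The sketch below shows, under simplifying assumptions, what tracking mean population score per epoch looks like, and how the output of one short run can be fed in as the input of a second. It uses a textbook rand/1/bin differential evolution step on a toy continuous objective. The toy score, the parameter values, and all function names are hypothetical stand-ins; our actual scoring function and the details of substacking are described in the Model section.

```python
import random


def toy_score(vector):
    """Hypothetical stand-in for the scoring model: higher is better."""
    return -sum(x * x for x in vector)


def mean_score(population, score):
    return sum(score(member) for member in population) / len(population)


def de_epoch(population, score, rng, weight=0.8, crossover=0.9):
    """One synchronous epoch of rand/1/bin differential evolution."""
    dim = len(population[0])
    next_population = []
    for i, target in enumerate(population):
        others = [p for j, p in enumerate(population) if j != i]
        a, b, c = rng.sample(others, 3)
        forced = rng.randrange(dim)  # guarantees at least one mutated coordinate
        trial = [
            a[d] + weight * (b[d] - c[d])
            if (d == forced or rng.random() < crossover)
            else target[d]
            for d in range(dim)
        ]
        # Greedy selection: keep the trial only if it scores at least as well.
        next_population.append(trial if score(trial) >= score(target) else target)
    return next_population


def run_de(population, score, epochs, seed=0):
    """Run several epochs and record the mean score before and after each one."""
    rng = random.Random(seed)
    history = [mean_score(population, score)]
    for _ in range(epochs):
        population = de_epoch(population, score, rng)
        history.append(mean_score(population, score))
    return population, history


init_rng = random.Random(42)
seeds = [[init_rng.uniform(-5, 5) for _ in range(4)] for _ in range(12)]

# Two short "blasts": the second starts from the output of the first.
round_one, history_one = run_de(seeds, toy_score, epochs=10, seed=1)
round_two, history_two = run_de(round_one, toy_score, epochs=10, seed=2)
print(history_one[0], history_one[-1], history_two[-1])
```

Because selection is greedy, no member's score ever decreases, so the mean score can only stay flat or rise from one epoch to the next. Flat stretches followed by sudden rises are what show up as "jumps" in a convergence plot.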

Finally, the 24% overall improvement in mean population fitness (evaluated by our conservative scoring model!) over 100 epochs shows that our algorithm is able to produce meaningful improvement in a practical timeframe. Note that the input population had a mean of approximately 0.37; the initial value of 0.40 in the figure already reflects the improvement after the first epoch.

Here is a similar visualization of our model’s convergence after 500 epochs. Once again, we see that convergence tends to happen in “jumps”, as is characteristic of these differential evolution approaches. Basically, once the model causes a mutation in one sequence that turns out to be favorable, that sequence motif quickly gets propagated to the rest of the population, causing a “jump” in the overall population fitness. We were able to consistently achieve a 27% improvement in mean fitness over 500 epochs.

In order to visualize how our algorithm acted on individual sequences, we plotted the scores of the seed (input) sequences compared to the scores of the output sequences. Now we see separation in not just the population means, but in every sequence that we put in.

This is a similar visualization of our model’s sequence optimization over 500 epochs. Many more sequences are clustered at the upper end of the score range, so the optimization algorithm continues to be effective across long timescales.

Overall, the results above demonstrate that our differential evolution algorithm was able to consistently and meaningfully optimize seed sequences in a computationally efficient manner. Optimizing any objective function over a sample space of size 20^15 is not an easy task, and our method readily generalizes to other objective functions (although, of course, results may vary). Our method of projecting encoded sequences back and forth between discrete and continuous space allows for additional intermediate optimization and also makes it compatible with most numerical data analysis techniques. We hope that this approach to antibody optimization will be useful to teams working on similar projects in the future.
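The following sketch illustrates the size of the search space and the general idea of moving between discrete sequences and continuous vectors. The encoding here is a deliberately simple ordinal one (each residue maps to its position in a fixed alphabet), chosen only so the round trip is easy to follow. It is an assumption made for illustration and not our actual encoding, which is built from AAIndex properties. The example sequence is also hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def search_space_size(length):
    """Number of distinct sequences of the given length."""
    return len(AMINO_ACIDS) ** length


def encode(sequence):
    """Map each residue to a float coordinate (hypothetical ordinal encoding)."""
    return [float(AMINO_ACIDS.index(residue)) for residue in sequence]


def decode(vector):
    """Project a continuous vector back to the nearest valid residue per position."""
    last = len(AMINO_ACIDS) - 1
    return "".join(AMINO_ACIDS[min(max(round(x), 0), last)] for x in vector)


seed_sequence = "ARDYYGSSYWYFDVW"  # hypothetical 15-residue example
continuous = encode(seed_sequence)

# A small continuous nudge still projects back onto the same sequence,
# while out-of-range values are clamped to the nearest valid residue.
nudged = [x + 0.3 for x in continuous]
print(search_space_size(15))
print(decode(nudged))
print(decode([-3.0, 25.7]))
```

Working in continuous space lets standard numerical tools (such as differential evolution) propose candidate vectors, while the projection step guarantees that every candidate is evaluated as a valid amino acid sequence.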

DNA Origami

To summarize some of the key DNA origami results:

We designed a bespoke DNA origami nanostructure from scratch that is capable of carrying a long antibody mRNA. Our nanostructure takes the form of a box that has been split into two identical C-shaped subunits. The subunits bind to each other using special staple strands that refold at acidic pH, allowing the box to specifically release the mRNA in the acidic endosome environment. Additionally, we designed our nanostructure to deliver the antibody mRNA specifically to plasma B cells by decorating the outside of the structure with CD70 proteins. CD70 is the ligand for CD27, a protein that is highly expressed on plasma B cell surfaces.

A depiction of two identical DNA origami C-shaped subunits combining to form a single box structure.

We simulated the mechanical integrity of the nanostructure using finite element analysis, which uses the material properties of DNA to simulate its deformations and thermal fluctuations. In order to enhance its structural rigidity, we used an iterative process to simulate a structure and then modify and improve the staples based on the simulation results. Over 13 iterations, we were able to reduce the maximum deformation of the structure by 37%, exceeding our goal for improvement of structural rigidity.

Selected iterations of the CanDo simulations and staple optimizations. Each column shows, in order from top to bottom: the iteration number; a color-coded image showing the deformation of the DNA origami structure, with blue being the least deviation from the original structure and red being the greatest deviation; the maximum deformation of the structure in arbitrary units (AU), as determined from the simulation video; and a representative snapshot of the staples in each iteration. Note the increasing complexity and number of staple crossovers as the design is iterated and the max deformation decreases.

Finally, we used molecular dynamics to simulate the DNA origami nanostructure on a fine scale. We found that our design is likely not stable at body temperature because some of the staples are too short, which causes the structure to unravel in physiological conditions. A second iteration of the structure with some lengthened staples did not significantly improve the simulation outcome. We plan to continue improving the stability of the structure by lengthening more of the staples and making changes to the DNA origami design if necessary.

A movie showing the thermal fluctuations of Iteration 13 from the figure above. The colorbar at the bottom of the movie indicates the deformation of the structure from the original structure in nanometers.