Short and Sweet
GausHaus is a software developed in R for parameterizing molecular dynamic simulations. This is done by fitting a Gaussian process dynamic model to each principal axis of the movement of the protein. This provides the user with a quantitative value indicating the proteins variance of movement in the x, y, and z plane. This provides users with an additional metric to understand the dynamics of their protein, where they can compare and understand the changes in dynamics associated with changes in sequence.
GausHaus was inspired by iGEM Calgary 2019s work in molecular dynamic simulation. From the vast amounts of data and simulations generated, there were still questions and curiosities left unanswered. One of the big issues left that most intrigued our resident statistician was how these simulations could be fit and understood via Gaussian Process Dynamic Models.
Figure 1. A drylab member in utter aweh and confusion.Now we will not blame you if this inquiry seems odd to you, but this presented a unique challenge for our drylab to tackle. Gaussian process dynamic models are models that use normally distributed variance around an average point over time to predict the movement of a system. In the broad sense, this can make perfect sense as each amino acid can be approximated to a series of points moving in melody with its bonds around its equilibria. This is similar to models of Brownian motion. These parallels paint a promising light for Gaussian processes in this use case.
What is the secret sauce?
The first step is to generate dynamic simulation data. This was completed by our team using GROMACS version 18 on the UCalgary ARC computing cluster. This data can also be created via outsourcing or patience and personal computers.
With these files generated, it is time to open the structure and trajectory files in Pymol. Once opened, use the Excel Exporter plugin to generate an excel that can be turned into a comma-separated file (csv). It is important that this file is generated for all states and does not include solvent molecules. The program will crash if there are and extra few thousand water molecules hanging around.
This csv can now be read into R on line 14 of the script it will take the variable name ssn, which is named such after a legacy use case. After reading in the csv file for the dynamics in R, the next subset of the script parses down the information into a large matrix that can easily undergo matrix manipulations, and act as a lookup table. This matrix will then be used to generate the design matrix and simulator outputs for the Gaussian process fit.
This fit (in the current state) will be conducted using the first ten data points in every dimension for the design matrix. Then the eleventh data point will be used as the simulator output. This can be changed by modifying the ten present in the 'aCarb_.df' register in the lower half of the program. Eleven was selected as it provided a compromise between accuracy and computational load.
Below is a kind simplification of the mathematics being carried out during the model fitting. In each dimension, the software trains a Gaussian process Eta x, y, and z. Then from the Z(x) portion of the model, we extract its variance. The variances are then combined into the total variance vector for analysis by the user.
Fitting of the Gaussian processes is completed in order x, y, then z on lines 79 through 104. This is done using the GP_fit function available through the GPfit package available on the CRAN repository.
The protein ends with a print out of every Gaussian process and the variance corresponding to these measurements.
Little Haus on the Prairie
GausHaus was developed to quantify molecular dynamic variance along individual dimensions. That is what GausHaus does, and it does it quite effectively while also leaving the potential for vast improvements by either our team or others in the future. This year the measuring software was used on the dynamics provided by the SEGI-8 protein. This lead to the knowledge that the modifications conducted did not induce new variance into the movement of the protein.
The future of GausHaus can be seen in the sprawling, and turmoilous GausHausWIP.R script available on the GitHub. Please do not use this as a resource; it is merely a sneak peek into what is to come.
Goal 1: Predict future dynamic data points. This has actually been completed with great success. Despite this success, it is yet to be friendly for other users and therefore was not included in the currently released version. Though keep your ears open for its release in 2021!
Goal 2: Modular/functional design GausHaus is currently being overhauled to a modular/functional form for more mixing and matching such as in iGAM 2.0. This can be seen with the many harry potter inspired functions available in the WIP script
Goal3: Integrate feedback from other teams to improve the usability of the software. We want GausHaus to grow into a tool that can help the iGEM community and for this to be the case, we need to ensure that iGEM teams have a voice and part in its development.
R Core Team (2019). R: A language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Osorio, D., Rondon-Villarreal, P. & Torres, R. Peptides: A package for data mining of antimicrobial peptides. The R Journal. 7(1), 4-14 (2015)
Blake MacDonald, Pritam Ranjan, Hugh Chipman (2015). GPfit: An R Package for Fitting a Gaussian Process Model to Deterministic Simulator Outputs. Journal of Statistical Software, 64(12), 1-23. URL http://www.jstatsoft.org/v64/i12/.
Yihui Xie (2013). animation: An R Package for Creating Animations and Demonstrating Statistical Methods. Journal of Statistical Software, 53(1), 1-27. URL http://www.jstatsoft.org/v53/i01/.
Karline Soetaert (2019). plot3D: Plotting Multi-Dimensional Data. R package version 1.3. https://CRAN.R-project.org/package=plot3D