Revision as of 09:23, 24 October 2020


Why Rosetta?

Rosetta is a software suite for solving a wide range of computational macromolecular problems, such as de novo protein design, enzyme design, ligand docking and structure prediction of biological macromolecules or macromolecular complexes. The Rosetta energy function enables relatively precise solutions across these applications by considering many energy terms relevant for protein folding, such as solvation, electrostatic effects and hydrogen bonding. Its applications range from modelling macromolecular structures, interactions and RNA or fibril structures up to the de novo design of a fully functioning enzyme. It was originally developed by the David Baker Lab. ^[1] Rosetta is free for academic users and a very powerful tool for many of the problems that arise when working on a synthetic biology project. Although a lot of information is available in the Rosetta Documentation, getting started with Rosetta is hard, especially for people without experience with console-based applications. Nevertheless, Rosetta is a versatile tool for synthetic biology. To lower the entry barrier, we provide a guide for Rosetta on our Wiki.

We used multiple Rosetta applications to model the properties of our enzymes. The collected data allow us to predict, among other things, enzyme functionality when immobilized in our biofilm with TasA and binding affinity towards different pharmaceuticals or pollutants.
RosettaCM was used to generate structure predictions of the azithromycin-transforming enzyme EreB and of fusion proteins consisting of the matrix protein TasA and our enzymes CueO, CotA and EreB ^[2].
Rosetta Ligand Docking was used to study the binding affinity of various substrates towards the enzymes' active sites ^[3].
Protein Design was used to enhance the binding affinity of our target molecules to the corresponding enzymes' active sites by introducing mutations ^[4].

To find out more about the different aspects of modelling with Rosetta, you can read the articles covering these applications. If you want to use Rosetta on your own, you can use our Rosetta Guide to get started with the program.

Laccase Docking

EreB CM-Modeling

By carrying out structure prediction calculations with the Rosetta comparative modelling application RosettaCM, we aim to create a precise 3D model of the enzyme EreB. RosettaCM is based on homology modelling: the target protein is compared to known crystal structures of proteins with high sequence homology. Structures of unaligned sequence regions showing no or low homology to the given template structures are generated using the Rosetta ab initio protocol. Low homology is a consequence of mismatches or gaps in the alignments. The ab initio protocol uses a library of nine-mer and three-mer fragments of known protein structures to predict possible folds. ^[5]
EreB possesses high similarity in its active site to other esterases, such as the Protein Data Bank (PDB) entry succinoglycan biosynthesis protein. This improves the accuracy of homology modelling, since the similar domains can be used as templates for the structure. Nevertheless, the highest homology found to another PDB entry is only 25.90% (succinoglycan biosynthesis protein 2QGM, chain A). Since the structure prediction relies on the stochastic Monte Carlo method, multiple modelling runs are necessary to obtain a precise structure.

Proteins with high sequence homology were found by searching the PDB with the EreB sequence using the NCBI BLAST application (blastp). Three protein structures were identified as related in sequence to EreB and were threaded onto the query sequence. Rosetta's partial thread application assigns the templates' structural data to the aligned sequences of the target structure to prepare the structure prediction run.

Succinoglycan biosynthesis protein (2QGM_A: X-ray diffraction, 1.70 Å; 2RAD_A: X-ray diffraction, 2.75 Å) from Bacillus cereus ATCC 14579, expressed in E. coli, and Q81BN2_BACCR protein, also from Bacillus cereus ATCC 14579 (3B55_A: X-ray diffraction, 2.30 Å), were used as templates.

The resulting threaded models are aligned in a single global frame and then used to create a full-chain model of the protein's 3D structure by Monte Carlo sampling. Monte Carlo sampling uses random sampling of variants to solve a problem that is in principle deterministic, for example protein folding.

Fragment files were generated using the old Robetta fragment server. It outputs a three- and nine-mer file by aligning fragments to PDB entries. The fragment files are then used as input structures to enhance the model’s precision.

The structural changes are then scored using the Rosetta low-resolution energy function. This approach relies on the assumption that the desired 3D structure of the protein corresponds to the minimum of its free energy function. Rosetta's energy function takes into account hydrophobic interactions between non-polar residues, van der Waals interactions between buried atoms and the strong size dependence of forming a cavity in the solvent to accommodate the folded protein. In this process, atom-atom interactions are calculated as Lennard-Jones potentials, hydrophobic interactions and the electrostatic desolvation of polar residues inside the molecule are treated with implicit solvation models, and hydrogen bonds are described by an explicit hydrogen bonding potential. ^[2,6] The calculated energy is then assessed by the Metropolis acceptance criterion: if ΔE < 0, the structure is accepted; otherwise the newly proposed structure is accepted with probability p.
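The Metropolis step can be sketched in a few lines of Python. This is a minimal illustration and not Rosetta's implementation: the text does not specify p, so the sketch assumes the standard Boltzmann form p = exp(−ΔE/kT). The function names and the temperature parameter `kT` are hypothetical.

```python
import math
import random


def acceptance_probability(delta_e, kT=1.0):
    """Probability of accepting a move with energy change delta_e (assumed Boltzmann form)."""
    if delta_e < 0:
        return 1.0
    return math.exp(-delta_e / kT)


def metropolis_accept(delta_e, kT=1.0, rng=random):
    """Return True if the proposed structure is accepted."""
    if delta_e < 0:
        # Downhill moves are always accepted.
        return True
    # Uphill moves are accepted with probability p = exp(-delta_e / kT).
    return rng.random() < math.exp(-delta_e / kT)


if __name__ == "__main__":
    rng = random.Random(0)
    trials = 10000
    accepted = sum(metropolis_accept(1.0, kT=1.0, rng=rng) for _ in range(trials))
    print("accepted fraction of uphill moves:", accepted / trials)
```

Accepting some uphill moves is what allows the sampling to escape local energy minima instead of getting stuck in the first one it finds.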

Structures are built using the Rosetta “fold tree”.^[6] Backbone and side chain conformations are represented in torsion space (bonded interactions are mostly treated with ideal bond lengths and angles). Additionally, the position of each residue is represented in Cartesian space. Torsion angles are changed according to the fragment files or the provided templates, and positions of homology fragments are substituted, combining an ab initio approach with a homology modelling approach. This combination of template-derived fragments in Cartesian space with torsion angles and residue positions derived from PDB fragments should ideally converge to the correctly folded protein topology. ^[2]

Clashes, distorted peptide bonds and poor backbone hydrogen bond geometry often arise from this CM approach, so further improvements are necessary. In a second step the structure is improved by Monte Carlo replacement of backbone segments with either segments taken from the PDB that span the region and can roughly be superimposed on the selected residues, or segments from the template structures that superimpose the complete segment. Afterwards the structure's energy is minimized using a smoothed version of Rosetta's low-resolution energy function. In a third step, side chains are added and structure refinement is carried out using a physically realistic energy function.^[2]

Since comparative modelling is a statistical approach to protein folding based on the Monte Carlo method and comparison to related structures, a large number of structures has to be generated to ensure that the predicted structure is as close as possible to the actual protein structure. In this context, 20,000 structures were generated using the “Lichtenberg high performance computer” of the TU Darmstadt. After the run finished, the structures were sorted by their total score calculated by RosettaCM and the best structure (S_17070.pdb) was used for further calculations. The total run was analysed using the Biotite Python package and its superimpose and RMSD features. ^[7]
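Ranking the generated models by total score can be sketched as follows. The sketch assumes a Rosetta-style score file in which data lines start with `SCORE:` and the first such line names the columns, including `total_score` and `description`; the function name and the sample values are hypothetical and are not taken from our run.

```python
def best_models(score_text, n=1):
    """Return the n lowest-scoring (total_score, description) pairs."""
    header = None
    rows = []
    for line in score_text.splitlines():
        if not line.startswith("SCORE:"):
            continue  # skip SEQUENCE: and other non-score lines
        fields = line.split()[1:]
        if header is None:
            header = fields  # the first SCORE: line holds the column names
            continue
        row = dict(zip(header, fields))
        rows.append((float(row["total_score"]), row["description"]))
    rows.sort()  # lower total score is better
    return rows[:n]


if __name__ == "__main__":
    # Hypothetical example values for illustration only.
    sample = "\n".join([
        "SEQUENCE: ",
        "SCORE: total_score fa_atr description",
        "SCORE: -310.5 -900.1 S_0001",
        "SCORE: -352.2 -950.0 S_0002",
        "SCORE: -298.0 -880.3 S_0003",
    ])
    print(best_models(sample, n=2))
```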

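The RMSD value is the root-mean-square deviation of atomic positions and represents the structural similarity of two molecules. For N pairs of equivalent atoms with positions r_i and r'_i after superposition, it is calculated using the following formula:

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\lVert \mathbf{r}_i - \mathbf{r}'_i \rVert^{2}}$$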
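As a minimal sketch, the RMSD of two equally sized coordinate sets can be computed in plain Python as shown here. The sketch assumes that the structures are already superimposed (Biotite performs this step with its superimpose feature); the function name `rmsd` is only illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two equivalent atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmsd(model, shifted))  # 1.0: every atom is displaced by one unit
```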

For structure refinement, an additional Relax run with an output of 100 structures was carried out to ensure realistic torsion angles. Relax is an all-atom structure refinement application working in the structure's local conformational space. ^[8] High-energy regions of the protein structure are optimized under backbone and side chain restraints to minimize structural deviation. Optimization follows the usual methods, such as torsion-space side chain minimization, torsion-space backbone minimization and re-sampling of side chain rotamers. ^[9]

To evaluate the obtained protein structure, Ramachandran plots were created with the Procheck webserver, which compares them to Ramachandran plots derived from a broad variety of protein crystal structures. We checked whether the dihedral angles of the modelled structure show the typical distribution to validate the model's accuracy.^[10] The evaluation showed that 328 residues are located in the most favoured regions, 44 in the additional allowed regions and 2 residues in the disallowed regions. Glycine and proline residues were excluded since they show no predictable dihedral angle distribution. Thus 87.7% of the structure's dihedral angles are located in the most favoured regions and 99.5% in allowed regions in total. Therefore, the structure is expected to be a good model of EreB's crystal structure.
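The quoted percentages follow directly from the residue counts, as this short check shows. The function and key names are illustrative only; the counts are the ones reported above.

```python
def ramachandran_summary(most_favoured, additionally_allowed, disallowed):
    """Percentages of residues per Ramachandran region (Gly and Pro already excluded)."""
    total = most_favoured + additionally_allowed + disallowed
    return {
        "total": total,
        "most_favoured_pct": 100.0 * most_favoured / total,
        "allowed_pct": 100.0 * (most_favoured + additionally_allowed) / total,
    }


summary = ramachandran_summary(328, 44, 2)
print(summary["total"])                        # 374
print(round(summary["most_favoured_pct"], 1))  # 87.7
print(round(summary["allowed_pct"], 1))        # 99.5
```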

To summarize, we used the RosettaCM application to predict the structure of our target enzyme EreB by creating 20,000 structures on the Lichtenberg server cluster. We then relaxed the best-scoring structure and validated its dihedral angles using a Ramachandran plot. For further investigation of enzyme stability, MD simulations will be performed on the obtained structure.

EreB MD Simulation

Molecular dynamics (MD) simulation simulates the behaviour of a molecule in a small space that can be filled with solvent molecules (mostly H₂O). It is therefore suited to studying the stability of our enzymes dissolved in an aqueous environment. MD is a far more dynamic and physical approach than comparative modelling.

Comparative modelling only creates a static snapshot of a dynamic biomolecule without simulating the molecule's behaviour within a force field, even though the dynamic motion of the protein is crucial for enzymatic activity. Molecular dynamics offers a physical approach to structure evaluation over a simulated period of time, in contrast to statistical approaches like homology modelling, which neglect dynamic physical forces between residues and interactions with solvent molecules. To validate the modelled structures of EreB and of the fusion proteins of our enzymes with the matrix protein TasA, MD simulations were executed with the GROMACS (Groningen Machine for Chemical Simulations) software suite.

Although the structures obtained from our CM run were relaxed using the Rosetta Relax application, Monte Carlo based structure prediction tools do not always output fully relaxed protein structures and only implicitly simulate the interactions with water molecules. Water molecules are a main cause of interactions that are highly important for a protein's structure, such as hydrophobic interactions, mostly between a protein's inner residues, or hydrophilic interactions of the protein's surface residues with water molecules. Interaction with water is also one of the primary driving forces of protein folding and consequently has to be considered for structure prediction or validation.

These interactions can be considered in MD simulation by providing a set of explicit water molecules interacting with the molecule and by a physical, time-dependent calculation of the system's forces using Newton's equations of motion. To limit the required computing power, only a small system is created, containing one protein and enough water that interactions between different proteins can be neglected, and the results are transferred to a realistic system. Periodic boundary conditions allow this transfer: the system is assumed to consist of many small, identically behaving systems, so when a molecule moves out of the simulated box it enters again on the other side. This keeps the particle number constant and lets the box act as a subsystem of a large system. By showing the stability of our predicted (fusion) proteins we can also forecast their functionality: if the residues catalysing the ester cleavage of EreB form their already described catalytic centre and the active site is accessible for azithromycin, catalytic activity can most likely be assumed.^[11] Likewise, catalytic activity of the laccases can be assumed if the copper sites are correctly folded to coordinate copper ions for oxidation catalysis and the substrate binding site matches that of the obtained crystal structures.^[12]
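The re-entry of a particle on the opposite side of the box can be illustrated with a short sketch for a rectangular box. This is a simplified illustration, not GROMACS code; the function names and the box dimensions in the example are hypothetical.

```python
def wrap(position, box):
    """Map a position back into the box [0, L) along every axis."""
    return tuple(x % length for x, length in zip(position, box))


def minimum_image(a, b, box):
    """Shortest displacement from a to b, taking periodic images into account."""
    displacement = []
    for xa, xb, length in zip(a, b, box):
        dx = xb - xa
        # Shift by whole box lengths so that |dx| <= length / 2.
        dx -= length * round(dx / length)
        displacement.append(dx)
    return tuple(displacement)


if __name__ == "__main__":
    box = (8.0, 5.0, 6.0)  # hypothetical box edge lengths in nm
    # A particle leaving through one face re-enters through the opposite face.
    print(wrap((8.5, -0.5, 3.0), box))  # (0.5, 4.5, 3.0)
    # Two particles near opposite faces are close neighbours via the periodic image.
    print(minimum_image((0.5, 0.0, 0.0), (7.5, 0.0, 0.0), box))  # (-1.0, 0.0, 0.0)
```

The minimum-image displacement is the reason the box must be large enough: if it were too small, the protein could interact with its own periodic copy.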

The GROMACS software suite combines many tools for chemical and biochemical calculations. For MD simulation an external force field is integrated into the application. MD simulations calculate the forces exerted on all atoms of a biomolecular system, mostly a target molecule solvated in water. Newtonian equations of motion are then used to predict the position and velocity of every atom in the system in small time steps (femtoseconds). The forces are calculated using a force field, for example GROMOS or AMBER. “Force fields are sets of potential functions and parametrized interactions that can be used to study physical systems.” ^[13] Combined with the equations of motion, they introduce the time dependency of the system. The force field consists of three types of interactions:
1. Bonded interactions between 2, 3 or 4 particles, including harmonic, cubic and Morse potentials for 2-particle systems and harmonic interactions for 3-particle systems.
2. Nonbonded interactions between different molecules, described by a repulsion and dispersion term, e.g. the Lennard-Jones potential, and a Coulomb term.
3. Special interactions defined by restraints imposed on a given system, e.g. distance restraints obtained from Nuclear Overhauser Effect data from high-resolution NMR.^[14]
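The functional forms behind the first two interaction types can be sketched as plain Python functions. This is a minimal sketch using the common textbook forms (harmonic bond, 12-6 Lennard-Jones, Coulomb) and assuming GROMACS-style units (nm, kJ/mol, elementary charges); the parameter values in the example are hypothetical and not taken from CHARMM27, GROMOS or AMBER.

```python
COULOMB_FACTOR = 138.935458  # electric conversion factor in kJ mol^-1 nm e^-2


def harmonic_bond(r, r0, k):
    """Bonded 2-particle term: V = 1/2 * k * (r - r0)^2."""
    return 0.5 * k * (r - r0) ** 2


def lennard_jones(r, sigma, epsilon):
    """Nonbonded repulsion and dispersion: V = 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2):
    """Nonbonded electrostatics between two point charges: V = f * q1 * q2 / r."""
    return COULOMB_FACTOR * q1 * q2 / r


if __name__ == "__main__":
    sigma, epsilon = 0.3, 0.5  # hypothetical parameters (nm, kJ/mol)
    r_min = 2.0 ** (1.0 / 6.0) * sigma  # position of the Lennard-Jones minimum
    print(lennard_jones(sigma, sigma, epsilon))  # 0.0 at r = sigma
    print(lennard_jones(r_min, sigma, epsilon))  # approximately -epsilon
    print(coulomb(1.0, 1.0, -1.0))               # -138.935458
```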

Molecular dynamics simulation considers Newtonian forces on every atom in the simulated system and can therefore deliver precise results, depending on the accuracy of the force field.^[16] Nevertheless, it is still an approximation and thus subject to imprecision: the calculated forces are cut off beyond a defined distance to limit the required computing power. Also, the force field calculates forces at the atomic level without taking quantum mechanics into account. This simplification is based on the Born-Oppenheimer approximation, which separates the energy of an atom into a nuclear and an electronic part, since the electrons' dynamics do not directly influence the nuclei. The kinetic energy of the nuclei can then be treated classically with Newton's laws of motion. The quantum mechanical part of the system, the electrons' wave functions, is not considered in classical MD simulation. ^[17] Thus, MD simulations are accurate for most large systems but still based on approximations. The obtained results are therefore reliable but always have to be verified by experiments in the laboratory.
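How Newton's equations of motion are advanced in small time steps can be illustrated with a one-dimensional toy system. The sketch uses the velocity Verlet scheme on a harmonic oscillator; this integrator is chosen here for illustration and is not necessarily the one used in our GROMACS runs. All names and parameters are hypothetical.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate position x and velocity v for n_steps of length dt."""
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass  # half-step velocity update
        x = x + dt * v_half               # full-step position update
        f = force(x)                      # force at the new position
        v = v_half + 0.5 * dt * f / mass  # second half-step velocity update
    return x, v


if __name__ == "__main__":
    k, mass = 1.0, 1.0
    x, v = velocity_verlet(1.0, 0.0, lambda pos: -k * pos, mass, dt=0.01, n_steps=1000)
    energy = 0.5 * mass * v ** 2 + 0.5 * k * x ** 2
    print("position:", x, "analytical:", math.cos(10.0))
    print("total energy:", energy, "initial energy:", 0.5)
```

With a sufficiently small time step the total energy stays close to its initial value, which is why MD simulations of proteins use steps on the femtosecond scale.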

The system used is a cuboid box filled with explicit water molecules (TIP3P model) with the target enzyme centred in the box. Ions and counter ions are added to simulate realistic conditions and to neutralize the enzyme's charge in solution resulting from (de)protonation. The system topology was created using the CHARMM27 force field with the TIP3P water model, which specifies a 3-site rigid water molecule with charges and Lennard-Jones parameters assigned to each of the 3 atoms. ^[18,19] Afterwards a cuboid box with at least 1.2 nm distance between its borders and the protein is created and filled with water molecules and ions countering the protein's charge (system size: 8.000 × 5.425 × 5.921 nm). The system's energy is then minimized. Afterwards, NVT (constant number of particles, volume and temperature) and NPT (constant number of particles, pressure and temperature) equilibration is performed using a position restraint file to preserve the enzyme's structure while the temperature and pressure of the system equilibrate. Both NVT and NPT are carried out for 100 ps to ensure stabilization of the parameters. After equilibration the main simulation can be started. The simulation was carried out for 100 ns with a total of 50,000,000 steps (2 fs each). ^[20,21,22]
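The number of steps follows from the simulated time and the time step, as this small check shows. The helper name `steps_for` is hypothetical; integer femtoseconds are used to avoid rounding errors.

```python
FS_PER_NS = 1_000_000  # femtoseconds per nanosecond


def steps_for(duration_ns, dt_fs):
    """Number of integration steps needed to cover duration_ns with a time step of dt_fs."""
    total_fs = duration_ns * FS_PER_NS
    if total_fs % dt_fs != 0:
        raise ValueError("duration is not a whole multiple of the time step")
    return total_fs // dt_fs


print(steps_for(100, 2))  # 50000000
```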

Root-mean-square deviation (RMSD), root-mean-square fluctuation (RMSF) and radius of gyration (Rg) can be used to analyse and validate an MD simulation. These values give an important first insight into the stability of the structure obtained from CM modelling. Converging RMSD and radii of gyration can therefore be used as primary indicators of stable protein structures. RMSF describes the average deviation of each particle over time from its reference position and therefore shows which regions of the protein are most dynamic and which domains are structurally preserved. The radius of gyration is the root-mean-square distance of the atoms of a protein from its centroid and can be used to assess whether a protein is denatured or retains a compact fold rich in secondary structure motifs. RMSD was calculated relative to both the relaxed and the crystal structure, RMSF was calculated for the C-alpha backbone atoms and the radius of gyration for the whole protein. The specific value of convergence depends on the size of the protein that is subject to MD simulation. The RMSD and radius of gyration plots were analysed after the 100 ns simulation and show a clear trend of convergence. Also, the RMSF plot shows very small movement at the residues essential for the catalytic process of EreB: E43, H46, R55, R74, H285 and H288.^[11]
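RMSF and radius of gyration can be sketched in plain Python for a trajectory given as a list of frames, each a list of (x, y, z) positions. This is an illustrative sketch and not the GROMACS implementation; it assumes that the frames are already fitted to a common reference and uses the time-averaged position of each atom as the RMSF reference. The function names are hypothetical.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Root-mean-square (optionally mass-weighted) distance of the atoms from their centre."""
    if masses is None:
        masses = [1.0] * len(coords)
    m_total = sum(masses)
    centre = [sum(m * c[k] for m, c in zip(masses, coords)) / m_total for k in range(3)]
    weighted_sq = sum(
        m * sum((c[k] - centre[k]) ** 2 for k in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted_sq / m_total)


def rmsf(trajectory):
    """Per-atom fluctuation around the time-averaged position."""
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(frame[i][k] for frame in trajectory) / n_frames for k in range(3)]
        msd = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
            for frame in trajectory
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


if __name__ == "__main__":
    print(radius_of_gyration([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]))  # 1.0
    # Atom 0 moves between two positions, atom 1 stays fixed.
    frames = [
        [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)],
        [(2.0, 0.0, 0.0), (5.0, 5.0, 5.0)],
    ]
    print(rmsf(frames))  # [1.0, 0.0]
```

Residues with low RMSF values, such as a rigid active site, appear in this picture like the fixed atom in the example.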

For further analysis, principal component analysis (PCA) was performed on the simulation trajectory. PCA allows us to separate global collective movement from local, fast movement to further visualize and study the dynamics of a protein. The GROMACS covar tool is used to calculate a covariance matrix of the protein's atomic fluctuations, which can be diagonalized to obtain a set of eigenvalues and eigenvectors describing the protein's modes of fluctuation. The covariance matrix describes how strongly each fluctuation depends on the other movements. The eigenvectors with the largest eigenvalues represent the largest-amplitude correlated motions and are called principal components or essential modes.^[23] The GROMACS anaeig tool can be used to visualize these principal components by projecting the protein's trajectory onto the given eigenvectors (here 1-4). ^[24] As shown in the animation, the molecule shows strong internal movement, but the catalytically important residues and the azithromycin binding pocket stay structurally preserved, suggesting activity of the enzyme.
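The idea behind the covariance analysis can be sketched without external libraries. The sketch builds the covariance matrix of flattened coordinate vectors and extracts only the leading eigenvector by power iteration, a simplification swapped in here for the full diagonalization; it is not the GROMACS covar algorithm, it omits fitting and mass weighting, and all names are hypothetical.

```python
import math


def covariance_matrix(frames):
    """Covariance of coordinate fluctuations; each frame is a flat list of coordinates."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / n for i in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for frame in frames:
        dev = [frame[i] - mean[i] for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += dev[i] * dev[j] / n
    return cov


def leading_mode(cov, iterations=500):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    dim = len(cov)
    vec = [1.0 + 0.1 * i for i in range(dim)]  # arbitrary start vector
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break  # no fluctuation at all
        vec = [x / norm for x in new]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    eigenvalue = sum(vec[i] * cov[i][j] * vec[j] for i in range(dim) for j in range(dim))
    return eigenvalue, vec


if __name__ == "__main__":
    # One atom oscillating along x only: the first mode points along x.
    frames = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    print(leading_mode(covariance_matrix(frames)))
```

For a real protein the vectors contain three coordinates per backbone atom, and a numerical library would be used to obtain all eigenvectors at once.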
In conclusion, we used an MD simulation run of 100 ns to validate the structure determined by homology modelling in a physical, time-dependent force field. The simulation was done in a cuboid box with at least 1.2 nm distance between the centred protein and the borders of the box. The system was minimized and equilibrated by NVT and NPT simulations for 100 ps each. We analysed the system's temperature and pressure afterwards, which showed small fluctuations around the expected values, so the MD production run of 100 ns could be started. Analysis of the production run showed converging RMSD and radius of gyration values as well as small RMSF values at the active site residues. Consequently, we validated the obtained structure as a possible crystal structure of the erythromycin esterase type II EreB.

**Figure 6:** Visualization of the extremes of the first four essential modes. The covariance matrix was analysed using the GROMACS anaeig tool and compared to the CM-modelling-derived structure candidate S_17070.pdb. To limit computational cost, and because they allow good conclusions about the protein structure, only backbone atoms were considered. The first two principal modes (PMs), containing the strongest fluctuation, are shown on the left. On the right, the 3rd and 4th PMs are displayed.
Left: 1st PM: dark blue, 2nd PM: light blue. Right: 3rd PM: purple, 4th PM: light blue.

TasA fusion proteins

Our selected enzymes for pharmaceutical transformation are to be embedded into the extracellular polymeric matrix of our B. subtilis biofilm by fusing them to the matrix protein TasA (PDB entry 5OF2). For this, we follow the method described in Programmable and printable Bacillus subtilis biofilms as engineered living materials by Huang et al. (2018), who fused TasA to various proteins or protein domains, such as mCherry or MHETase, to introduce new functions to the biofilm ^[25]. The exact method for fusion protein construction is documented in our corresponding wiki text in the biofilm category.

We also planned a fusion protein by integrating TasA into a surface loop of CotA, as described in Engineering Bifunctional Laccase-Xylanase Chimeras for Improved Catalytic Performance by Ribeiro et al. (2011)^[26]. For this, we had to move the signal peptide to the N-terminus of CotA. CM modelling indicates that the signal peptide is still accessible for cell export and that both protein domains are correctly folded. Nevertheless, we discarded this approach, because end-to-end linkage to TasA also promises functioning enzymes and is far more modular. For integration of TasA into an enzyme's sequence, we would always have to find surface loops that are not essential for the enzyme's functionality, that is, loops that are neither required for correct protein folding nor include residues essential for substrate binding or active site residues. This would contradict our idea of a modular biofilm to which new enzymes can easily be added.

First, the structure of the fusion proteins was predicted using the RosettaCM application. Comparative modelling is a homology modelling approach that uses known structures with high sequence homology to determine the enzyme's 3D structure, combined with an ab initio approach for regions that cannot be aligned. Changes in secondary structure are randomly introduced and evaluated using the Rosetta energy function to approach the protein's lowest free energy state. Comparative modelling is well suited to fusion proteins, because the structures of the enzymes (CueO: PDB 5B7E, CotA: PDB 1GSK, EreB) and of TasA (PDB 5OF2) were already determined and match the corresponding protein domains to 100%. ^[27] The domain structures were aligned and threaded onto the fusion protein sequence using the Rosetta partial thread application. Fragment files were generated using the old Robetta fragment server. It outputs a three-mer and a nine-mer file by aligning fragments to PDB entries. The fragment files are then used as input to enhance the precision of the ab initio model. Detailed information on the RosettaCM algorithm can be found in the modelling section for EreB.

To investigate the functionality of the protein domains and the structural stability of the fusion proteins, MD simulations were carried out using GROMACS and the CHARMM27 force field with explicit TIP3P water. The force field calculates the forces acting on all atoms in the system, considering bonded, nonbonded and special interactions. MD is a useful tool to validate the protein's folding in an aqueous environment and to study the enzyme's movements in a time-dependent physical force field governed by Newtonian equations of motion. The structure's energy is minimized in a first step. Afterwards the system is equilibrated in an NVT and an NPT simulation step. All of these steps are carried out using a restraint file to maintain the target's structure during the preparation steps. After equilibration is finished, the main simulation of 100 ns can be started. If the MD simulations yield conserved tertiary structures of the protein domains, we assume that the immobilization mediated by the TasA matrix protein is functional. A preserved structure also indicates successful transformation of azithromycin, diclofenac etc. by our enzymes (CotA, CueO and EreB). All steps were carried out in the same way as in the previously described MD simulation of our EreB crystal structure candidate.

Root-mean-square deviation (RMSD), root-mean-square fluctuation (RMSF) and radii of gyration (Rg) were plotted to analyse the simulation and validate the structure's stability. Converging RMSD and Rg are primary indicators of stable protein structures. Analysis of the EreB-TasA fusion protein shows both values fluctuating around a constant value, indicating a clear trend of convergence. RMSF analysis shows low deviation of the internal residues, especially those relevant for catalytic activity, suggesting stable protein domains and thus preserved catalytic activity and enzyme immobilization, respectively. It was already shown that the active site residues of the unfused EreB protein show low atomic fluctuation in the EreB MD simulation. Low Rg values are also an indicator of preserved secondary structure, so we can assume that the protein did not denature during the simulation.

For further analysis, principal component analysis (PCA) was performed to analyse the internal movement of the protein. With PCA we are able to separate global collective movement from local movement to study the enzyme's dynamics. The covariance matrix of the MD simulation run was generated and diagonalized using the GROMACS covar tool. The resulting eigenvectors were visualized using the GROMACS anaeig tool and are presented here as a 3D visualization of the first eigenvectors, which show the strongest protein fluctuation. As visible in the simulation, the protein shows internal movements of outer residues, but the functional internal domains stay structurally preserved, as already visible in the RMSF graph. In particular, the azithromycin binding pocket and the residues E43, H46, R55, R74, H285 and H288 of the EreB enzyme domain (highlighted in the simulation), which are important for catalytic activity, remain stable and show weak movement. The most relevant principal mode (PM) shows an increasing distance between the EreB and TasA protein domains during the simulation. This increases the accessibility of both domains and suggests that the linker peptide functions as a dynamic connection between them. In summary, the PMs describing internal movements reinforce our method of immobilizing the transforming enzymes with the extracellular matrix protein TasA and support our assumption of correct protein folding after fusion, based on the publication of Huang et al. (2018). ^[25]

**Figure 11:** Visualization of the extremes of the first four essential modes. The covariance matrix was analysed using the GROMACS anaeig tool and compared to the CM-modelling-derived structure candidate S_0484.pdb. Only backbone atoms were considered. The TasA domain is coloured in light blue, the linker peptide in red and the EreB domain in blue. **Top left:** first (most relevant) PM **Top right:** second PM **Bottom left:** third PM **Bottom right:** fourth PM

We used an MD simulation run of 100 ns to validate the structure determined by Rosetta comparative modelling in a physical, time-dependent force field. A cubic box with at least 1.2 nm distance between the centred protein and the corners of the cube was used for the simulation run. The system was minimized and equilibrated by NVT and NPT simulations for 100 ps each. Temperature and pressure were analysed and showed only small fluctuations, so the system could be assumed to be stable, allowing the production run of 100 ns. The structure of the fusion protein remained stable during the simulation, suggesting functionality of both protein domains, TasA and EreB, since they are structurally preserved and accessible. The signal peptide (TasA N-terminus) is also accessible, allowing transport of the fusion protein into the extracellular matrix and thus immobilization of the enzyme in our biofilm.

References

Difference between revisions of "Team:TU Darmstadt/Model/Enzyme Modeling"