Why did we do it?

With the delivery method of summer work undetermined, the team initially set out to create a neural network capable of predicting chi angles to aid in homology modelling. We reasoned that homology modelling would be especially useful this year, given the lack of lab access. When looking for data to train this neural network, we could not find resources containing protein data that suited our specific needs. Databases are a cornerstone of computational biochemistry, providing data that are both reliable and accessible (Xu & Xu, 2004). To fill this gap, Bellatrix was born. Bellatrix enables the creation of customizable libraries of protein structural information. It stores this information in a novel, standardized file type called Stars, which is suitable for machine learning.


How did we do it?

To design a useful file format, we analyzed current repositories of protein data available on the internet by their size and content. Data banks like PDBREPORT, PDBFINDER, and PDB_REDO are summaries of PDB file content, and these in particular attempt to represent that content in a more organized manner. These data banks showed that raw protein structural data can be found through an internet search, but what was lacking was a platform for producing custom, standardized data banks meant for machine learning applications.

A Star file is an n × n matrix, where n is the number of residues, whose (i, j)th element is the vector from amino acid i to amino acid j, as numbered in the protein sequence. The Star file also includes important metadata on the protein, such as its method of structural characterization and the amino acids for which no coordinates were given. To minimize the size of a Star file while keeping it as representative of protein structure as possible, only the coordinates of the alpha carbons of residues are used. The matrix and metadata are then written to a comma-separated file for users to apply as needed. Figure 1 illustrates the anatomy of a Star file.







| Amino acid | Ala 1 | Gly 2 | Ser 3 | Thr 4 |
|---|---|---|---|---|
| Ala 1 | | Ala 1 → Gly 2 | Ala 1 → Ser 3 | Ala 1 → Thr 4 |
| Gly 2 | Gly 2 → Ala 1 | | Gly 2 → Ser 3 | Gly 2 → Thr 4 |
| Ser 3 | Ser 3 → Ala 1 | Ser 3 → Gly 2 | | Ser 3 → Thr 4 |
| Thr 4 | Thr 4 → Ala 1 | Thr 4 → Gly 2 | Thr 4 → Ser 3 | |

Figure 1: Star file anatomy. Cells in green represent the coordinate matrix. The red cells indicate the amino acid sequence. Blue cells represent metadata collected from the PDB. Yellow cells are for figure interpretation and are not included in Star files. “→” denotes a vector from one amino acid to another.
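The coordinate matrix described above can be sketched in a few lines of Python. This is a minimal illustration, not Bellatrix's actual code: the function name `star_matrix` and the use of `None` for residues without coordinates are assumptions, and the diagonal is left empty only to mirror Figure 1.

```python
from typing import List, Optional, Tuple

Vector = Tuple[float, float, float]


def star_matrix(ca_coords: List[Optional[Vector]]) -> List[List[Optional[Vector]]]:
    """Build an n x n matrix whose [i][j] entry is the vector from residue i to residue j.

    `ca_coords` holds one alpha-carbon coordinate per residue, in sequence order.
    A residue with no coordinates is given as None, and every vector that
    involves it is also None. Diagonal entries are left as None.
    """
    n = len(ca_coords)
    matrix: List[List[Optional[Vector]]] = []
    for i in range(n):
        row: List[Optional[Vector]] = []
        for j in range(n):
            a, b = ca_coords[i], ca_coords[j]
            if i == j or a is None or b is None:
                row.append(None)
            else:
                # Vector pointing from residue i to residue j.
                row.append((b[0] - a[0], b[1] - a[1], b[2] - a[2]))
        matrix.append(row)
    return matrix
```

Each row of the returned matrix describes the whole structure as seen from one residue, which is why the vector from i to j is the negative of the vector from j to i.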

Each row in a Star file represents the entire structure of the protein. The rows differ in that the vectors composing each row originate from a different residue.

The generation of Star files requires structural data. To fulfill this requirement, Bellatrix was developed to harness the structural information in PDB files. Bellatrix is a translation tool, run with Python 3.7.6, that takes data in the PDB file format and converts it into accessible Star files (Van Rossum, 2019). The following Python packages were integral to Bellatrix’s development: csv, pandas, NumPy, tkinter (GUI), urllib, and BioPandas (McMaster et al., 2020; Walt, Colbert, & Varoquaux, 2011; Lundh, 1999; Open Source, 2020; Raschka, 2017). PDB files were queried and retrieved from the Research Collaboratory for Structural Bioinformatics (RCSB) data bank. After translation, the Star files were read and displayed in Microsoft Excel. Development and testing of Bellatrix were conducted using PDB files for cellulases from multiple organisms. These PDB files were selected for their abundance and their possible use cases in synthetic biology.

Some special considerations were made while working with PDB files. Inconsistent residue numbering and variation in the header and peripheral sections were handled by a custom text reader that parses this data by finding keywords and applying logic. In PDB files created through X-ray crystallography, a residue is often represented by multiple sets of coordinates because the experiment could not pinpoint its location. To simplify working with Star files, we include the coordinates of the residue only at its most probable location, making the Star files as representative of the structure as possible. In addition, an aberrance detection function was created to notify the user of anomalous data. Bellatrix classifies as anomalous any inconsistency in residue numbering or deviation from the standard PDB file format. These techniques were combined into a PDB troubleshooting algorithm that runs automatically during the creation of every Star file. This was considered important because it tells users how the PDB file is formatted.
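Two of the steps above, keeping only the most probable location of a residue and flagging inconsistent numbering, can be sketched as follows. This is an illustration under stated assumptions rather than Bellatrix's own reader: the function names are hypothetical, "most probable" is taken to mean the alternate location with the highest occupancy, insertion codes are ignored, and the column positions follow the fixed-width ATOM record layout of the PDB format.

```python
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[float, float, float]


def select_ca_atoms(pdb_lines: Iterable[str]) -> Dict[Tuple[str, int], Coord]:
    """Map (chain, residue number) to one alpha-carbon coordinate.

    When a residue has several alternate locations, the one with the
    highest occupancy is kept (assumed meaning of "most probable").
    """
    best: Dict[Tuple[str, int], Tuple[float, Coord]] = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skip header, HETATM, and other records
        if line[12:16].strip() != "CA":
            continue  # keep alpha carbons only
        chain = line[21]
        resseq = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        occ_field = line[54:60].strip()
        occupancy = float(occ_field) if occ_field else 1.0
        key = (chain, resseq)
        if key not in best or occupancy > best[key][0]:
            best[key] = (occupancy, xyz)
    return {key: value[1] for key, value in best.items()}


def numbering_gaps(residue_numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (previous, next) pairs where residue numbering is not sequential."""
    ordered = sorted(residue_numbers)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a != 1]
```

A gap reported by `numbering_gaps` may mean either missing coordinates or unusual numbering in the source file, so a sketch like this would only notify the user rather than attempt a correction.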

Figure 2. Bellatrix circular workflow diagram

Bellatrix was then expanded to run not only on a single protein but also on a user-defined interrogation set. Interrogation sets are text or comma-separated files that contain a list of proteins. The proteins in an interrogation set are translated in series and aggregated to form a Star library (a CSV file composed of multiple Star files).
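The interrogation-set workflow can be sketched as below. This is a hedged illustration: `read_interrogation_set` and `build_star_library` are hypothetical names, and `translate` stands in for whatever routine turns one protein into the rows of its Star file. Duplicate entries are kept on purpose, since a set may list the same protein more than once.

```python
import csv
import io
from typing import Callable, Iterable, List, Sequence


def read_interrogation_set(text: str) -> List[str]:
    """Parse protein identifiers from newline- and/or comma-separated text."""
    identifiers: List[str] = []
    for row in csv.reader(io.StringIO(text)):
        for cell in row:
            identifier = cell.strip()
            if identifier:  # ignore blank cells and blank lines
                identifiers.append(identifier)
    return identifiers


def build_star_library(
    protein_ids: Iterable[str],
    translate: Callable[[str], Sequence[Sequence[object]]],
) -> List[List[object]]:
    """Translate each protein in series and stack the rows into one library.

    Each output row is prefixed with the protein identifier so that the
    individual Star files can still be told apart in the combined CSV.
    """
    library: List[List[object]] = []
    for protein_id in protein_ids:
        for row in translate(protein_id):
            library.append([protein_id] + list(row))
    return library
```

The resulting list of rows could then be written out with `csv.writer` to produce a single library file.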

Once the basic functionality was completed, we built a graphical user interface (GUI) to increase the accessibility of Bellatrix.

Figure 3. Bellatrix graphical user interface

File handling was also implemented so that the user can input a text file listing the proteins to include in a Star library. This feature also allows users to run Bellatrix on PDB files that do not come directly from the Protein Data Bank. For example, PDB files obtained from a molecular dynamics simulation can be represented as Star files.


We researched using Bellatrix for clustering, that is, grouping proteins by their structural similarities. We foresee potential use of this as a secondary BLAST. BLAST is an algorithm for comparing primary biological sequences, and Star files could be used to cluster the list of proteins returned by a BLAST search based on their structural similarities. Although this was not accomplished, a future direction is to extract 3D descriptors from Star files and use them to cluster Star files with k-medoids.


When Bellatrix was run on a single protein, it generated a Star file in 59.32 seconds. The Star files were then verified empirically using Microsoft Excel. Working on an interrogation set of 20 identical 1EG1 proteins (Kleywegt et al., 1997), Bellatrix constructed a Star library in 18.16 minutes, averaging 54.5 seconds per protein. A similar result was obtained on a set of dynamically obtained PDB files of similar size. PCA, a common and powerful machine learning technique, has been run successfully on Stars as a proof of concept illustrating the potential of these files in machine learning. Bellatrix has also undergone user testing with a diverse group of undergraduate students from varying backgrounds. After testing and tweaking of the user interface, these users successfully used the finished product and identified plausible ways to implement it in their workflows. This testing was instrumental in refining the file format and in writing intuitive program documentation.

There were a handful of challenges we faced when creating Bellatrix. First of all, .pdb files come in many different forms. Due to this, we had to write the program with many different input formats in mind, making coding more difficult. However, in the end, we were able to reliably create a standardized output with a variety of inputs. Secondly, Bellatrix is limited by the data that it is based on. Oftentimes .pdb files lack coordinates for certain residues in a protein. This is due to the limitations of the experiments used to determine the structures. As a result, the Star files inherit these missing coordinates in their structure. We hope to remedy this problem in the future by implementing a machine-learning algorithm to predict the missing coordinates. It is also important to note that Bellatrix does not construct a Star file for PDB files consisting of nuclear magnetic resonance spectroscopy data.


The true power of Bellatrix lies not in the Star files it generates, but in the imagination and function awarded to it by its users. We have seen Bellatrix become increasingly impactful as we discover new, unintended uses for it. One of the most potent use cases we foresee is clustering through machine learning. Star files enable direct comparison of structural relationships in proteins and therefore provide a basis for structure-based clustering. Coordinate matrices are already used in structure-based protein clustering methods, such as k-medoids (Polychronidou et al., 2018). Bellatrix can supply scientists with new data architectures that can be exploited.

Another promising use of Bellatrix is as a stability criterion for proteins that have undergone molecular dynamics simulation. From the simulation data, multiple PDB files can be generated at successive time points. These files can then be sorted temporally and run through Bellatrix as an interrogation set. The resulting Star library can then be manipulated to observe the entire protein's movements with respect to any amino acid, and the result can be quantified by a variety of methods. This information could serve as an alternative to the commonly used root-mean-square deviation, allowing for more specialized indicators of protein stability. We expect this to have a lasting impact on how protein modelling is carried out and analyzed.
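One possible way to quantify movement with respect to a chosen amino acid is sketched below. It is an assumption-laden example, not a method the project implemented: the name `anchored_deviation` is hypothetical, frames are given as lists of alpha-carbon coordinates already sorted in time, and the measure is the root-mean-square change of the vectors from the anchor residue relative to the first frame. No structural superposition is performed, so the value is unaffected by translation of the whole protein but not by rotation.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def anchored_deviation(frames: Sequence[Sequence[Coord]], anchor: int) -> List[float]:
    """Per-frame RMS change of the anchor-to-residue vectors versus the first frame.

    The vectors from `anchor` to every residue correspond to one row of a
    Star file, so each frame contributes one such row.
    """

    def star_row(frame: Sequence[Coord]) -> List[Coord]:
        ax, ay, az = frame[anchor]
        return [(x - ax, y - ay, z - az) for (x, y, z) in frame]

    reference = star_row(frames[0])
    deviations: List[float] = []
    for frame in frames:
        current = star_row(frame)
        squared = [
            (c[0] - r[0]) ** 2 + (c[1] - r[1]) ** 2 + (c[2] - r[2]) ** 2
            for c, r in zip(current, reference)
        ]
        deviations.append(math.sqrt(sum(squared) / len(squared)))
    return deviations
```

Choosing a different anchor residue gives a different view of the same trajectory, which is the kind of specialized indicator the paragraph above has in mind.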

In our project, Bellatrix supported our modelling work. In two of our models, SEGI-8 and Penny, Bellatrix helped with structure characterization, making our modelling more accurate and efficient. You can read more about these models and how Bellatrix was integrated on their respective pages.


Xu, D., & Xu, Y. (2004, November). Protein databases on the internet. Retrieved August 24, 2020, from

Van Rossum, G. (2019, Winter). Python 3.7.6 [Computer software]. Retrieved from

McMaster, A., Saxton, D., Goddard, E., Li, F., Virshup, I., Van den Bossche, J., . . . S. (2020). Pandas (Version 1.1.1) [Computer software]. Retrieved 2020, from

Oliphant, T. E. (2006). A guide to NumPy. USA: Trelgol Publishing.

Walt, S. V., Colbert, S. C., & Varoquaux, G. (2011). The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13(2), 22-30. doi:10.1109/mcse.2011.37

Lundh, F. (1999). An introduction to tkinter. URL: www.pythonware.com/library/tkinter/introduction/index.htm

Open Source. (2020). Urllib. Retrieved 2020, from

Raschka, S. (2017). BioPandas: Working with molecular structures in pandas DataFrames. The Journal of Open Source Software, 2(14). doi:10.21105/joss.00279

Polychronidou, E., Kalamaras, I., Agathangelidis, A., Sutton, L., Yan, X., Bikos, V., . . . Tzovaras, D. (2018). Automated shape-based clustering of 3D immunoglobulin protein structures in chronic lymphocytic leukemia. BMC Bioinformatics, 19(S14). doi:10.1186/s12859-018-2381-1

Open Source. (2020). Time. Retrieved 2020, from

Kleywegt, G. J., Zou, J., Divne, C., Davies, G. J., Sinning, I., Ståhlberg, J., . . . Jones, T. (1997). The crystal structure of the catalytic core domain of endoglucanase I from Trichoderma reesei at 3.6 Å resolution, and a comparison with related enzymes. Journal of Molecular Biology, 272(3), 383-397. doi:10.1006/jmbi.1997.1243