During the COVID-19 pandemic, we had no access to labs and could not run experiments, so we could not build mathematical models from experimental data. Hence, we chose bioinformatics modelling to validate our design. It serves as a strong proof-of-concept of our fusion protein design, and we believe that it will closely resemble the results expected from experiments.
What is protein modelling?
Bioinformatics tools can model a protein's structure when its sequence is known. Computational protein structure prediction relies on principles derived from structures solved by X-ray crystallography and NMR spectroscopy, together with physical energy functions, to predict the three-dimensional structures of proteins. Many tools also use machine learning algorithms to build the models. There are three methods for modelling proteins:
- Homology Modelling:
This is used when we have a protein of unknown structure and a similar protein of known structure. The known structure is used as a template to predict the structure of the unknown protein. A primary BLAST search is performed against the Protein Data Bank (PDB) to find a template protein whose sequence resembles that of the unknown protein. High percentage identity, high query coverage, a high alignment score and a low e-value are desired for the template sequence. Once a matching template is found, it is used to model the unknown structure. Several tools can be used for homology modelling. SWISS-MODEL, an automated web server for protein structure prediction, is commonly used.
- Threading/Fold-recognition method:
This method predicts the structure of the target protein using known protein folds of similar proteins found in different databases. The I-TASSER web server was used to model our protein using this method.
- Ab-Initio method:
When no structural information is available for the protein or for similar proteins, this method is used to model the structure from scratch. The most energetically favourable conformations of the protein are taken into account during modelling. The Baker Lab's Robetta online modelling server was used to model our protein using this method.
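The template criteria listed under homology modelling (high identity, high query coverage, high alignment score, low e-value) can be sketched as a small filter-and-rank step. This is only an illustration: the `BlastHit` record, the `rank_templates` helper, the cut-off values and the example hits are all assumptions made for the sketch, not values or code used by SWISS-MODEL or by us.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlastHit:
    """One candidate template from a BLAST search (hypothetical record)."""
    template_id: str
    percent_identity: float  # 0..100
    query_coverage: float    # fraction of the query covered, 0..1
    bit_score: float         # alignment score
    e_value: float


def rank_templates(hits: List[BlastHit],
                   min_identity: float = 30.0,   # assumed cut-off
                   min_coverage: float = 0.5,    # assumed cut-off
                   max_e_value: float = 1e-5) -> List[BlastHit]:
    """Drop weak hits, then sort the rest from best to worst."""
    usable = [
        h for h in hits
        if h.percent_identity >= min_identity
        and h.query_coverage >= min_coverage
        and h.e_value <= max_e_value
    ]
    # Higher identity, coverage and score are better; lower e-value is better.
    return sorted(
        usable,
        key=lambda h: (-h.percent_identity, -h.query_coverage,
                       -h.bit_score, h.e_value),
    )


# Made-up hits, for illustration only.
hits = [
    BlastHit("hypo_A", 63.6, 0.96, 310.0, 1e-90),
    BlastHit("hypo_B", 41.2, 0.80, 150.0, 1e-40),
    BlastHit("hypo_C", 25.0, 0.95, 60.0, 1e-8),   # identity too low
    BlastHit("hypo_D", 70.0, 0.20, 45.0, 1e-3),   # coverage too low
]
ranked_ids = [h.template_id for h in rank_templates(hits)]
print(ranked_ids)
```

In practice the server applies its own scoring, so a ranking like this is best treated as a quick sanity check on a hit list rather than a replacement for the server's template selection.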
The R2 pyocin tail fiber protein sequence comprises 691 amino acids. The AP22 bacteriophage tail fiber protein that targets A. baumannii is 271 amino acids in length. However, a literature review showed that the R2 pyocin NTF (N-truncated fragment, G443 to R691) is sufficient to bind to the bacterial surface. Therefore, only the N-truncated fragment was modelled.
Through preliminary bioinformatics analysis during fusion protein design, we identified the restriction sites needed to create an R2 pyocin-AP22 fusion protein. It was inferred that removing the last 134 amino acids of the R2 pyocin tail fiber protein sequence and ligating the last 137 amino acids of the AP22 bacteriophage sequence to this truncated pyocin sequence would create a functional fusion product.
>R2-NTF tail fiber Sequence
>AP22 tail fiber protein
The last 134 amino acids of the R2-NTF pyocin sequence were removed, and the last 137 amino acids of the AP22 phage sequence were appended to the remaining sequence to generate the R2-NTF-AP22 fusion tail fiber protein sequence.
>R2-NTF-AP22 Fusion tail fiber
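The construction described above can be sketched in a few lines of Python. The `build_fusion` helper is a hypothetical name, and the two sequences are placeholder strings of the stated lengths (249 residues for G443 to R691, 271 for AP22), not the real sequences. The sketch only checks the arithmetic: 249 - 134 + 137 = 252 residues in the fusion.

```python
def build_fusion(scaffold: str, donor: str,
                 n_removed: int, n_added: int) -> str:
    """Remove the last n_removed residues of scaffold and append
    the last n_added residues of donor."""
    if not 0 < n_removed <= len(scaffold):
        raise ValueError("n_removed must be between 1 and len(scaffold)")
    if not 0 < n_added <= len(donor):
        raise ValueError("n_added must be between 1 and len(donor)")
    return scaffold[:-n_removed] + donor[-n_added:]


# R2-NTF spans G443 to R691 of the 691-residue tail fiber.
NTF_START, NTF_END = 443, 691
ntf_length = NTF_END - NTF_START + 1          # 249 residues

# Placeholder sequences (NOT the real ones), used only to check lengths.
r2_ntf_placeholder = "G" * ntf_length
ap22_placeholder = "A" * 271

fusion = build_fusion(r2_ntf_placeholder, ap22_placeholder,
                      n_removed=134, n_added=137)
print(len(fusion))  # 252
```

Replacing the placeholders with the actual FASTA sequences would give the fusion sequence submitted to the modelling servers, assuming the truncation and the added segment are both contiguous C-terminal blocks as described.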
We modelled the fusion tail fiber sequence using all three modelling methods discussed above.
For homology modelling, we entered the fusion protein sequence in FASTA format on SWISS-MODEL and searched for templates. 6cl6 was chosen as the template structure because it had a GMQE score of 0.71 and a sequence identity of 63.64%, and the target model was predicted to be a homo-trimer. The GMQE (Global Model Quality Estimation) score is a quality estimate that combines properties from the target–template alignment and the template search method. It is expressed as a number between 0 and 1, reflecting the expected accuracy of a model built with that alignment and template, and the coverage of the target. Higher numbers indicate higher reliability. Once a model is built, the GMQE also takes the QMEAN score into account to increase the reliability of the quality estimation.
For threading and ab-initio modelling, we submitted our target sequence to the I-TASSER server and the Baker Lab's Robetta server, respectively, and received the results via email.
The overall structure of the trimeric R2-NTF is a barbell-like protein, with a three-domain organization consisting of a “head”, medial “shaft”, and “foot”. The head (G443-M525) and foot domains (P598-R691) are globular and connected by an intertwined, helical, and fibrous-looking shaft (W529-V597).
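The domain boundaries above lend themselves to a small lookup table. The sketch below is an illustration only: `DOMAINS` and `domain_of` are hypothetical names, and it assumes residues are numbered as in the full-length 691-residue tail fiber. Under that assumption, and assuming the 134 removed residues form one contiguous C-terminal block, the last retained R2 residue would be 691 - 134 = 557, which falls in the shaft.

```python
# Domain boundaries of R2-NTF as given in the text (inclusive residue numbers).
DOMAINS = {
    "head": (443, 525),
    "shaft": (529, 597),
    "foot": (598, 691),
}


def domain_of(residue: int) -> str:
    """Return the domain containing a residue, or 'unassigned'."""
    for name, (start, end) in DOMAINS.items():
        if start <= residue <= end:
            return name
    return "unassigned"


# Number of residues in each domain.
domain_lengths = {name: end - start + 1
                  for name, (start, end) in DOMAINS.items()}

# Assumption: the 134 removed residues are a contiguous C-terminal block.
last_retained = 691 - 134  # residue 557

print(domain_lengths)
print(last_retained, domain_of(last_retained))
```

Residues 526 to 528 sit between the stated head and shaft ranges, so the lookup reports them as unassigned rather than guessing a domain.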
We modelled our protein using all three methods and compared the resulting models. The models obtained through threading were monomers; since the native structure is a trimer, we did not consider them further. The models obtained through homology modelling and ab-initio modelling were trimers, so these were compared to choose the best model.
The SWISS-MODEL template library (SMTL version 2020-09-09, PDB release 2020-09-04) was searched with BLAST and HHBlits for evolutionarily related structures matching the target sequence of the fusion protein. In total, 43 templates were found and the best one was chosen: 6cl6.1.A, with a sequence identity of 63.64%, a sequence similarity of 0.49 and a coverage of 0.96. The QSQE score is a number between 0 and 1, reflecting the expected accuracy of the interchain contacts for a model built from a given alignment and template. In general, a higher QSQE is better, and a value above 0.7 can be considered reliable enough to follow the predicted quaternary structure in the modelling process. The chosen template had a QSQE score of 0.79, indicating that it is a good template for modelling.
Template Parameters -
|Template|Seq Identity|Oligo-state|QSQE|Found by|Method|Resolution|Seq Similarity|Range|Coverage|Description|
|---|---|---|---|---|---|---|---|---|---|---|
|6cl6.1.A|63.64|homo-trimer|0.79|BLAST|X-Ray|1.90 Å|0.49|1 - 251|0.96|Tail fiber protein|
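As a minimal sketch, the QSQE rule of thumb (values above 0.7 are considered reliable) can be applied to the template row above. The dictionary keys and the `quaternary_structure_reliable` helper are hypothetical names chosen for this example; only the numbers come from the table.

```python
QSQE_RELIABLE_ABOVE = 0.7  # rule of thumb quoted in the text


def quaternary_structure_reliable(qsqe: float) -> bool:
    """True if the QSQE is above the 0.7 rule-of-thumb threshold."""
    if not 0.0 <= qsqe <= 1.0:
        raise ValueError("QSQE is defined between 0 and 1")
    return qsqe > QSQE_RELIABLE_ABOVE


# Values copied from the template table; key names are illustrative.
template = {
    "id": "6cl6.1.A",
    "seq_identity": 63.64,
    "oligo_state": "homo-trimer",
    "qsqe": 0.79,
    "coverage": 0.96,
}

reliable = quaternary_structure_reliable(template["qsqe"])
print(template["id"], "reliable quaternary structure:", reliable)
```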
Model Parameters -
|Built with|Oligo-state|Ligands|GMQE|QMEAN|
|---|---|---|---|---|
|ProMod3 3.1.1|homo-trimer (matching prediction)|None|0.77|-3.42|
The fusion protein sequence was submitted to the I-TASSER web server, and the modelling results were obtained via email. I-TASSER predicts the top five models using the threading approach. The confidence of each model is measured quantitatively by a C-score, which is calculated from the significance of the threading template alignments and the convergence parameters of the structure assembly simulations. The C-score is typically in the range [-5, 2], where a higher value signifies a model with higher confidence and vice versa.
All five models obtained were monomers, and a comparison of C-scores showed model 1 to be the best model predicted by this method.
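The C-score comparison amounts to picking the model with the highest value. The sketch below illustrates this with placeholder scores: the numbers are invented for the example and are not the C-scores of our models, and `best_model` and `outside_typical_range` are hypothetical helper names.

```python
from typing import Dict, List

C_SCORE_TYPICAL_RANGE = (-5.0, 2.0)  # typical range quoted in the text


def best_model(c_scores: Dict[str, float]) -> str:
    """Return the name of the model with the highest C-score."""
    if not c_scores:
        raise ValueError("no models supplied")
    return max(c_scores, key=lambda name: c_scores[name])


def outside_typical_range(c_scores: Dict[str, float]) -> List[str]:
    """List models whose C-score falls outside the typical [-5, 2] range."""
    low, high = C_SCORE_TYPICAL_RANGE
    return [name for name, score in c_scores.items()
            if not low <= score <= high]


# Placeholder values only; NOT the scores reported for our models.
placeholder_scores = {
    "model1": -1.2,
    "model2": -2.0,
    "model3": -2.9,
    "model4": -3.4,
    "model5": -4.1,
}

print(best_model(placeholder_scores))
print(outside_typical_range(placeholder_scores))
```

Because the range is only typical, the second helper flags unusual values for inspection instead of rejecting them.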
The fusion protein sequence was submitted to the Baker Lab's Robetta web server. The modelling results were obtained within a couple of days, and the five best models were predicted for the fusion protein sequence. All the suggested models were homo-trimers, similar to the native R2-NTF pyocin structure.