Team:Lund/Model

iGEM Lund 2019

Mathematical Models

STAAC

We here present our machine learning algorithm STAAC. STAAC combines two neural networks and a tree-based model; when integrated into a genetic algorithm, it can generate peptides with different antimicrobial properties by classifying them into the categories antifungal (AFP), non-antifungal antimicrobial (NoAFP-AMP), and non-antimicrobial (NoAMP). The categories are illustrated in the Venn diagram below:

The objective of STAAC is to generate new, previously unseen peptides suited to our project of producing anti-oomycete peptides in bacteria. This means that the peptides should have strong antifungal, and therefore anti-oomycete, properties, while at the same time having weak antibacterial properties. STAAC is therefore trained to maximize the harmonic mean (HM) of the AFP classification precision and the NoAFP-AMP classification recall. Overall, STAAC can classify unseen peptides with an accuracy of 88 %, AFPs with a precision of 88 %, and NoAFP-AMPs with a recall of 65 %. We would like to stress that there are models with much better accuracy and precision if one only wants to answer the binary question: “Is this peptide antifungal?”. Indeed, when we converted our model to answer this binary question, we obtained a precision of 93.5 % and an overall accuracy of 94 %. The code is available for download, inspiration, or critique on GitHub.

Theory

The STAAC model combines a Random Forest Classifier and two neural networks, merged using a Support Vector Machine.


Random Forest Classifiers:

The Random Forest Classifier (RFC) is a so-called ensemble learning algorithm, meaning it consists of several preliminary algorithms merged into one. In the case of the RFC, these preliminary algorithms are decision tree classifiers. A decision tree classifier is a binary tree where each node is a binary condition (e.g. is the peptide longer than 20 residues?) and each leaf is a predicted label. A prediction is generated by passing the features through the tree, from the root to a leaf, by answering these binary conditions. The RFC trains each of its decision trees on a random subset of the training data and features and then produces a prediction from the majority vote of the individual trees. This randomization goes a long way towards tackling the overfitting that a single decision tree is prone to. A more in-depth explanation of the RFC can be found on Wikipedia.
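
A toy sketch of this majority-vote behaviour (the feature values, labels, and feature choices below are invented purely for illustration):

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier

  # Hypothetical features: [peptide length, net charge, fraction of hydrophobic residues]
  X = np.array([[12, 3, 0.40], [45, 1, 0.20], [30, 5, 0.60], [18, 0, 0.10]])
  y = np.array(["AFP", "NoAMP", "AFP", "NoAMP"])

  forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

  print(forest.predict([[25, 4, 0.50]]))        # aggregated (majority-style) prediction
  print(forest.predict_proba([[25, 4, 0.50]]))  # class probabilities averaged over the trees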

Neural Networks:

Neural networks are becoming increasingly popular in bioinformatics. This is due to their ability to handle a huge number of features without overfitting. They are also very capable of finding non-linear patterns in the data. The input is passed through layers of nodes by edges connecting one layer to the next. Each edge has a weight and each node has an activation function. This means that the number of fitted parameters often becomes very large, and it is seldom possible for a human to understand the operations performed by the network. This leads to the obvious downside of neural networks: it is often very hard to figure out what kind of pattern the network finds. There are several great resources to further deepen your understanding of neural networks, for example Wikipedia.

SVM:

The Support Vector Machine (SVM) is a common supervised machine learning algorithm for classification problems. The algorithm works by constructing hyperplanes that maximize the margin to the data points of each class lying closest to the boundary, the so-called support vectors. This is done either with a linear plane or, when the data is not linearly separable, by mapping the data into a higher-dimensional space. A great explanation can be found on DataCamp.

Metrics:

In our case, we had an issue with an unbalanced data set. While having about 8000 peptides labelled as non-antimicrobial, we had about 3000 AMPs, of which only about 500 were labelled as non-antifungal antimicrobial. If one used the overall accuracy of the model to evaluate performance, an algorithm classifying everything as NoAMP would get a 70 % accuracy. If it managed to correctly differentiate between AFPs and NoAFPs but completely ignored the NoAFP-AMPs, it would reach an overall accuracy of 95 %! To overcome this, it is often good practice to use precision (Eq. 1, Eq. 4), recall (Eq. 2, Eq. 5), or F1-score (Eq. 3) instead, all of which can be defined through the confusion matrix:

              True 1                True 0
Predicted 1   True positive (TP)    False positive (FP)
Predicted 0   False negative (FN)   True negative (TN)


              True A   True B   True C
Predicted a   Aa       Ba       Ca
Predicted b   Ab       Bb       Cb
Predicted c   Ac       Bc       Cc

Multilabel metrics:

Which metric to use depends on the objective of the model. When dealing with a multilabel dataset, one gets even more alternatives. One could look at global metrics such as the macro-average or micro-average of multilabel metrics such as precision and recall, or individual metrics of each label.

Given the stated issues and challenges in choosing a metric, as well as the purpose of STAAC, generating new peptides with strong antifungal and weak antibacterial properties, we decided to mainly monitor the precision of the AFP classification and the recall of the NoAFP-AMP classification. A logical way of combining the two metrics into one is to take their harmonic mean (Eq. 6), similar to how one computes the commonly used F1-score.
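
A minimal sketch of the monitored metric (the function name and label encoding are ours, not the exact code used):

  from sklearn.metrics import precision_score, recall_score

  def harmonic_mean_metric(y_true, y_pred, afp_label="AFP", noafpamp_label="NoAFP-AMP"):
      # Harmonic mean of the AFP precision and the NoAFP-AMP recall (Eq. 6).
      p_afp = precision_score(y_true, y_pred, labels=[afp_label], average=None)[0]
      r_noafpamp = recall_score(y_true, y_pred, labels=[noafpamp_label], average=None)[0]
      if p_afp == 0 or r_noafpamp == 0:
          return 0.0
      return 2.0 / (1.0 / p_afp + 1.0 / r_noafpamp)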

Optimizing the precision of the AFP classification results in a model that strongly punishes “risky” patterns that may misclassify peptides as AFPs. We hypothesize that this is analogous to the model promoting only strong antifungal properties in peptides. Optimizing the recall of the NoAFP-AMP classification results in a model that more readily classifies peptides as NoAFP-AMPs. We hypothesize that this is analogous to the model promoting non-antifungal, antimicrobial properties, classifying even peptides with weak such properties as NoAFP-AMP. Knowing that the model is more conservative when promoting AFPs and more generous when promoting NoAFP-AMPs, we hypothesize that if we then maximize the predicted probability of a given peptide being an AFP and minimize the probability of it being a NoAFP-AMP, we will be able to generate peptides with strong antifungal properties and weak antibacterial properties.

Method

Data Management:

Peptides were imported, cleaned, and sorted into the groups “Antifungal peptides” (AFPs), “Non-antifungal, antimicrobial peptides” (NoAFP-AMPs), and “Non-antimicrobial peptides” (NoAMPs). Peptides with a length between 10 and 50 residues, inclusive, were selected. Peptides with missing information on their antifungal properties were not removed but were classified as not having antifungal properties. Not removing peptides with missing data was a mistake that probably resulted in a worse model; due to our limited schedule, this issue was never corrected.

120 AFPs, 20 NoAFP-AMPs, and 100 NoAMPs were randomly chosen and moved to a holdout data set. The remaining 11563 peptides were randomly split into two groups: 80 % to a data set used to train the three independent classifiers (Sequential Model, Ngrams Model, PseAAC Model), and 20 % to a data set used to train the SVM. The complete distribution of peptides can be seen in the table below.


                      AFP    NoAFP-AMP   NoAMP
Classifier Training   4317   506         5995
SVM Training          1079   126         1499
Holdout Data          120    20          100

Feature extraction:

Features were extracted from the peptide strings in three different ways:

Sequential Transformation:

Each given string of amino acids was transformed into a 50 x 20 binary matrix where the 50 rows represented each position in the string and the 20 columns represented the 20 possible values at each position. The string “ABD” would hence be transformed into a 50 x 20 matrix of zeros with ones at the positions (0,0), (1,1), and (2,3). The matrices were then flattened into a single list of binary values. Using a sklearn Standard Scaler, the data was normalized.
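
A minimal sketch of this transformation (the alphabet ordering and helper name are our assumptions, not necessarily the exact ones used):

  import numpy as np
  from sklearn.preprocessing import StandardScaler

  ALPHABET = "ACDEFGHIKLMNPQRSTVWY"              # the 20 standard one-letter amino acid codes
  AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
  MAX_LEN = 50

  def sequential_transform(peptide):
      # Encode a peptide (length 10-50) as a flattened 50 x 20 one-hot matrix.
      matrix = np.zeros((MAX_LEN, len(ALPHABET)))
      for position, aa in enumerate(peptide):
          matrix[position, AA_INDEX[aa]] = 1
      return matrix.flatten()

  # peptides: list of training sequences (placeholder)
  # X = StandardScaler().fit_transform(np.array([sequential_transform(p) for p in peptides]))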

Ngrams Transformation:

Mono-, bi-, and trigrams were extracted from the strings using a sklearn CountVectorizer with a character analyzer and a fixed vocabulary. The three transformations were concatenated. Using a sklearn Standard Scaler, the data was normalized.
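
A minimal sketch of this transformation (the vocabularies and helper name are placeholders):

  import numpy as np
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.preprocessing import StandardScaler

  def ngram_transform(peptides, vocabularies):
      # vocabularies: dict mapping n (1, 2, 3) to the fixed n-gram vocabulary for that n.
      blocks = []
      for n in (1, 2, 3):
          vectorizer = CountVectorizer(analyzer="char",
                                       ngram_range=(n, n),
                                       vocabulary=vocabularies[n])
          blocks.append(vectorizer.fit_transform(peptides).toarray())
      return StandardScaler().fit_transform(np.hstack(blocks))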

PseAAC:

The pseudo amino acid composition (PseAAC) of each peptide was computed using lambda = 2, for each of the 63 non-empty combinations of the six amino acid properties 'hydrophobicity', 'hydrophilicity', 'residue mass', 'pK1', 'pK2', and 'pI', using PyBioMed.

This resulted in a 63 x 22 matrix that was then flattened into a single list. Using a sklearn Standard Scaler, the data was normalized. Similar to the SVM-model by Mousavizadegan and Mohabatkar, features were then selected using recursive feature elimination, removing two features at a time. Here sklearn RFE was used with an inversely weighted sklearn Random Forest Classifier. 400 of the 1386 features were selected.
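
A minimal sketch of the feature selection step (the data variable names are placeholders):

  from sklearn.ensemble import RandomForestClassifier
  from sklearn.feature_selection import RFE

  # Recursive feature elimination with a class-weighted random forest,
  # removing two features per iteration and keeping 400 of the 1386 PseAAC features.
  selector = RFE(estimator=RandomForestClassifier(class_weight="balanced"),
                 n_features_to_select=400,
                 step=2)
  # X_selected = selector.fit_transform(X_pseaac, y)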

Model creation:

Four different models were created: two neural networks, trained on the sequentially transformed data and on the mono-, bi-, and trigram data respectively; a random forest classifier, trained on the PseAAC data; and an SVC, used to combine the three previously mentioned models.

Sequential Model:

 A neural network was built using TensorFlow.Keras:

  • Layers: 3 dense layers
  • Nodes: 10 per layer
  • Activation function: ReLU
  • Regularization: L2, L=0.05
  • Output layer: 3 nodes, activation: sigmoid
  • Optimizer: Adam
  • Loss function: Binary Crossentropy

The model was trained on the designated 80 % of the sequentially transformed data, inversely weighted, for 12 epochs.
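
A minimal sketch of a Keras model matching these settings (the data variables and class-weight dictionary are placeholders, not the team's exact code):

  import tensorflow as tf

  l2 = tf.keras.regularizers.l2(0.05)
  sequential_model = tf.keras.Sequential([
      tf.keras.layers.Dense(10, activation="relu", kernel_regularizer=l2, input_shape=(1000,)),  # 50 x 20 flattened input
      tf.keras.layers.Dense(10, activation="relu", kernel_regularizer=l2),
      tf.keras.layers.Dense(10, activation="relu", kernel_regularizer=l2),
      tf.keras.layers.Dense(3, activation="sigmoid"),  # AFP, NoAFP-AMP, NoAMP
  ])
  sequential_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
  # sequential_model.fit(X_seq_train, y_train, epochs=12, class_weight=inverse_class_weights)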


Ngrams Model:

 A neural network was built using TensorFlow.Keras:

  • Layers: 3 dense layers with 30 % dropout
  • Nodes: 10 per layer
  • Activation function: ReLU
  • Regularization: L2, L=0.01
  • Output layer: 3 nodes, activation: sigmoid
  • Optimizer: Adam
  • Loss function: Binary Crossentropy

The model was trained on the designated 80 % of the mono-, bi- and trigrams transformed data, inversely weighted, for 4 epochs.


PseAAC Model:

The designated 80 % of the PseAAC data was used to train a Random Forest Classifier (200 estimators, maximum tree depth: 20, maximum number of features considered at each split: 9, minimum samples per leaf: 4, criterion: entropy). The training data was inversely weighted to account for label imbalance.
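
A minimal sketch of this classifier (the data variables are placeholders):

  from sklearn.ensemble import RandomForestClassifier

  pseaac_model = RandomForestClassifier(n_estimators=200,
                                        max_depth=20,
                                        max_features=9,
                                        min_samples_leaf=4,
                                        criterion="entropy",
                                        class_weight="balanced")
  # pseaac_model.fit(X_pseaac_train, y_train)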


SVM Model:

The designated 20 % of the peptides were used to generate predictions from the three previously described models. For each peptide, the models’ class probabilities were stacked into a 3 x 3 matrix (one row per model, one column per class). The column describing the predictions of NoAFP-AMP was dropped to reduce the number of features, resulting in a 3 x 2 matrix that was then flattened into a list of six features. The data was inversely weighted and used to train a sklearn SVC with an RBF kernel (gamma = 0.1).
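
A minimal sketch of this stacking step (variable and function names are ours; probability output is assumed since predicted probabilities are used later):

  import numpy as np
  from sklearn.svm import SVC

  def stacked_features(seq_probs, ngram_probs, pseaac_probs, noafpamp_col=1):
      # Each *_probs array has shape (n_peptides, 3): P(AFP), P(NoAFP-AMP), P(NoAMP).
      stacked = np.stack([seq_probs, ngram_probs, pseaac_probs], axis=1)  # (n, 3, 3)
      reduced = np.delete(stacked, noafpamp_col, axis=2)                  # drop the NoAFP-AMP column -> (n, 3, 2)
      return reduced.reshape(len(stacked), -1)                            # flatten to six features per peptide

  svm_model = SVC(kernel="rbf", gamma=0.1, class_weight="balanced", probability=True)
  # svm_model.fit(stacked_features(seq_p, ngram_p, pseaac_p), y_svm_train)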



The performance of the individual models, the combined SVM, and the final evaluation on the holdout data is summarized below:

                   Training Accuracy   Test Accuracy   AFP Precision   NoAFP-AMP Precision   AFP Recall   NoAFP-AMP Recall   HM
Sequential Model   0.74                0.70            0.65            0.17                  0.65         0.73               0.69
Ngrams Model       0.89                0.77            0.70            0.24                  0.76         0.60               0.65
PseAAC Model       0.97                0.86            0.78            0.44                  0.80         0.25               0.38
SVM                0.85                0.84            0.83            0.29                  0.72         0.67               0.74
Holdout            -                   0.78            0.88            0.3                   0.69         0.55               0.65

Creation of new peptides:

New peptides were created using a genetic algorithm. Several different metrics were used as loss functions in order to generate 10 distinctly different peptides, two of which are presented below. Due to limited time, we were not able to experimentally test the efficacy of the generated peptides.


Peptide name   Sequence                         P(AFP)   P(NoAMP)   P(NoAFP-AMP)
WHIPR          KATRIVWWRCEKKIKLLLLEFWHIPRPRFH   0.91     0.05       0.04
MAFP-1         IGKHWKHWAKR                      0.87     0.08       0.05
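
A minimal sketch of such a genetic algorithm (not the team's exact implementation; the mutation scheme and the particular fitness shown here are illustrative assumptions) that maximizes P(AFP) while minimizing P(NoAFP-AMP):

  import random

  ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

  def fitness(peptide, predict_proba):
      # predict_proba returns (P(AFP), P(NoAFP-AMP), P(NoAMP)) for one peptide;
      # this fitness rewards a high P(AFP) and punishes a high P(NoAFP-AMP).
      p_afp, p_noafpamp, _ = predict_proba(peptide)
      return p_afp - p_noafpamp

  def mutate(peptide, rate=0.1):
      # Randomly substitute residues with probability `rate`.
      return "".join(random.choice(ALPHABET) if random.random() < rate else aa
                     for aa in peptide)

  def evolve(population, predict_proba, generations=100, keep=20):
      for _ in range(generations):
          population.sort(key=lambda p: fitness(p, predict_proba), reverse=True)
          parents = population[:keep]
          offspring = [mutate(random.choice(parents)) for _ in range(len(population) - keep)]
          population = parents + offspring
      return max(population, key=lambda p: fitness(p, predict_proba))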

Derived Equations:

Equation (1) describes the precision in the binary confusion matrix.

Precision = \frac{TP}{TP + FP}    (1)

Equation (2) describes the recall in the binary confusion matrix.

Recall = \frac{TP}{TP + FN}    (2)

Equation (3) describes the F1 score in the binary confusion matrix.

F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}    (3)

Equation (4) describes the precision of A in the multilabel confusion matrix.

Precision_{A} = \frac{Aa}{Aa + Ba + Ca}    (4)

Equation (5) describes the recall of A in the multilabel confusion matrix.

Recall_{A} = \frac{Aa}{Aa + Ab + Ac}    (5)

Equation (6) defines the harmonic mean of the AFP precision (here A) and the NoAFP-AMP recall (here B). The values for the NoAMP class are denoted by C. See the multilabel confusion matrix for reference.

HM = \frac{2}{Precision_{A}^{-1} + Recall_{B}^{-1}} = \left(1 + \frac{Ba + Bc}{2Bb} + \frac{Ba + Ca}{2Aa}\right)^{-1}    (6)