Team:IIT Roorkee/ML Model

Model

Machine learning is the study of computer algorithms that improve their performance automatically with experience, without being explicitly programmed. Over the past decade, machine learning has been applied to complex tasks such as image classification and object recognition. We decided to explore its application to the problem of antibiotic resistance, a major threat to human health. The availability of large public datasets makes it both possible and important to use machine learning to understand and predict biological phenomena such as antibiotic resistance.

To this end, we used a class of machine learning algorithms called Support Vector Machines to uncover genetic interactions when a bacterial strain is treated with a particular antibiotic. We chose Acinetobacter baumannii as our target pathogen since it is a critical priority pathogen according to the World Health Organisation.

Step 1: Data Collection

We selected 1360 strains of A. baumannii whose AMR phenotypic data were available in the PATRIC database [1]. The testing data record the outcome when a strain is treated with a particular antibiotic. The outcome is binary, i.e. a strain is either resistant or susceptible to the antibiotic. Only strains with laboratory-verified phenotypes were selected for the analysis; strains whose phenotypes were assigned via computational methods were excluded.

We chose 10 antibiotics to study how genetic information influences the resistance phenotype when different strains are treated with these drugs. The antibiotics are as follows:

Antibiotic	Mechanism
Ciprofloxacin	DNA Replication
Levofloxacin	DNA Replication
Gentamicin	Protein Synthesis
Tobramycin	Protein Synthesis
Amikacin	Protein Synthesis
Ceftriaxone	Cell Wall Synthesis
Imipenem	Cell Wall Synthesis
Ceftazidime	Cell Wall Synthesis
Trimethoprim + Sulfamethoxazole	Folate disruption
Ampicillin + Sulbactam	Cell Wall Synthesis

Step 2: Genome Annotation

The genomes of the strains were then annotated to build a pan-genome, which is the entire set of genes found across all the selected strains. The annotation was carried out using Prokka [2], a tool for prokaryotic genome annotation, which allowed us to identify and annotate alleles as well as their respective genes. The software is publicly available; we installed and ran it from the bioconda channel using conda.

Step 3: Binarization

After forming the pan-genome and obtaining the list of all genes and alleles present across the strains, we created a binary matrix in which each row represents a strain and each column represents a gene/allele. If a strain carries a given gene/allele, the value at that position in the matrix is 1; otherwise it is 0. In simpler terms, if there are ‘n’ genes/alleles in the pan-genome, we represent each strain as a vector of ‘n’ dimensions in which each index corresponds to one gene/allele. The strains are referred to as examples, while the genes/alleles are referred to as features.
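The binarization step can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `binarize`, the input format (a mapping from strain name to the set of genes/alleles it carries) and the toy identifiers are hypothetical and not taken from our pipeline.

```python
def binarize(strain_genes):
    """Build a presence/absence matrix from a mapping strain -> set of genes/alleles."""
    # Pan-genome: sorted union of every gene/allele seen in any strain.
    features = sorted(set().union(*strain_genes.values()))
    strains = sorted(strain_genes)
    # One row per strain, one column per gene/allele; 1 = present, 0 = absent.
    matrix = [[1 if f in strain_genes[s] else 0 for f in features] for s in strains]
    return strains, features, matrix


# Hypothetical toy input; real identifiers would come from the annotation step.
toy = {
    "strain_1": {"geneA_1", "geneB_1"},
    "strain_2": {"geneA_2", "geneB_1", "geneC_1"},
    "strain_3": {"geneA_1"},
}
strains, features, matrix = binarize(toy)
print(features)
for strain, row in zip(strains, matrix):
    print(strain, row)
```

Each printed row is the ‘n’-dimensional vector of one strain, with columns in the order given by `features`.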

Along with representing the strains as binary vectors, we also collected the phenotype of each strain for a particular antibiotic. At this stage, we therefore have the vector representation of each strain (the input) and its phenotype (the output). The machine learning algorithm is trained to predict the phenotype of a strain from its gene/allele vector. We used Support Vector Machines (SVMs) as the machine learning algorithm.

Step 4: SVM Training

Support Vector Machines (SVM)

SVM [3] is a supervised machine learning algorithm that is mainly used for classification. The algorithm represents the labelled examples as points in a high-dimensional space whose number of dimensions equals the number of features. In our case, the SVM represents the strains in ‘n’-dimensional space, where ‘n’ is the number of genes/alleles in the pan-genome, and each dimension represents a particular gene/allele.

After representing the strains in ‘n’-dimensional space, the algorithm searches for the plane that best separates the two labels, i.e. Resistant and Susceptible. This plane is referred to as a hyperplane. The hyperplane is constructed such that the distance between the hyperplane and the nearest example in the space is maximized.

The working of SVM is illustrated in the figure, wherein the samples are represented in two-dimensional space using two alleles for the sake of simplicity. In reality, the space has ‘n’ dimensions. Different hyperplanes are shown which act as the decision boundary for predicting the phenotype, i.e. examples on opposite sides of this boundary receive different labels. The decision boundary is rarely perfect, but SVM tries to find the best decision boundary for the examples given.
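To make the idea of a separating hyperplane concrete, here is a minimal sketch of a linear SVM trained by stochastic subgradient descent on the regularised hinge loss. It is a stand-in written with the Python standard library only, not the implementation we used; the function names, the hyperparameters (`lam`, `lr`, `epochs`) and the toy data are all assumptions made for illustration.

```python
import random


def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=2000, seed=0):
    """Fit (w, b) by stochastic subgradient descent on the L2-regularised hinge loss.

    X: list of 0/1 feature vectors (one per strain).
    y: labels, +1 for Resistant and -1 for Susceptible.
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # The L2 penalty shrinks every weight slightly at each step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Examples inside the margin (or misclassified) pull the hyperplane.
            if y[i] * score < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 (Resistant) or -1 (Susceptible) from the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical toy data: columns are geneA, geneB, geneC; geneA drives resistance.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
y = [1, 1, -1, -1, 1, -1]
w, b = train_linear_svm(X, y)
print([round(wj, 2) for wj in w], round(b, 2))
print([predict(w, b, x) for x in X])
```

In this toy setting the weight learned for geneA should dominate, since it is the only feature that distinguishes the two phenotypes.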

Step 5: Computation of Weights

The type of SVM used in our case is a linear SVM, i.e. the hyperplane is a linear function of the features. Since ‘n’ features represent a particular strain (or example), the equation of the hyperplane can be written as,

w_1 x_1 + w_2 x_2 + ... + w_n x_n + b = 0

wherein, x_i refers to the i^th gene/allele (1 if present, 0 if absent), w_i refers to the linear coefficient of the i^th gene/allele, and b is the constant offset (intercept) of the hyperplane

This linear coefficient is referred to as the weight of the gene/allele, and it quantifies how much the presence or absence of that gene/allele contributes to a prediction. The coefficient can be positive or negative, and its sign determines the direction in which the gene/allele shifts the final prediction.

We trained the SVM over multiple iterations, since the outcome of training can differ from one run to the next and a single run may not be representative. Running more iterations helps in achieving more stable and reliable results. For each iteration, we find the hyperplane, extract the weight of every gene/allele from it, and collect the weights in a matrix as shown in the figure. Each row of the matrix represents a particular gene/allele while each column represents a particular iteration. The value at a given position is the weight of that row's gene/allele during that column's iteration.
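The weight matrix can be sketched as follows. The text above does not specify how one iteration differs from the next, so this example assumes that each iteration trains on a bootstrap resample of the strains; that choice, the compact trainer `fit_linear_svm`, and all names and toy data are hypothetical.

```python
import random


def fit_linear_svm(X, y, rng, lam=0.01, lr=0.01, epochs=300):
    """Hinge-loss subgradient descent; returns one weight per gene/allele."""
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            w = [wj * (1.0 - lr * lam) for wj in w]
            if y[i] * score < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w


def weight_matrix(X, y, n_iter=5, seed=0):
    """Rows are genes/alleles, columns are iterations."""
    rng = random.Random(seed)
    columns = []
    while len(columns) < n_iter:
        # Assumption: each iteration trains on a bootstrap resample of the strains.
        idx = [rng.randrange(len(X)) for _ in X]
        if len({y[i] for i in idx}) < 2:
            continue  # skip resamples that contain only one phenotype
        columns.append(fit_linear_svm([X[i] for i in idx], [y[i] for i in idx], rng))
    # Transpose: one row per gene/allele, one column per iteration.
    return [list(row) for row in zip(*columns)]


# Hypothetical toy data: columns are geneA, geneB, geneC.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
y = [1, 1, -1, -1, 1, -1]
W = weight_matrix(X, y)
for name, row in zip(["geneA", "geneB", "geneC"], W):
    print(name, [round(v, 2) for v in row])
```

Each printed row corresponds to one row of the matrix described above: the weights of a single gene/allele across all iterations.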

Step 6: Top AMR alleles

As mentioned above, every gene/allele is assigned a weight when the hyperplane is constructed. The larger the magnitude of the weight, the more important that gene/allele is in predicting the phenotype of the strain. The sign (+/-) of the weight merely indicates the direction of impact: a negative sign means that the gene/allele shifts the prediction towards Susceptible, and a positive sign means that it shifts the prediction towards Resistant. It is therefore the magnitude of the weight that determines the relative importance of different genes/alleles.

For each gene/allele, we calculated the sum of the absolute values of its weights over all iterations. The higher this sum, the higher the relative importance of that gene/allele. We sorted the genes/alleles by this sum to obtain the list of top AMR genes/alleles. These genes/alleles do not necessarily confer resistance to the antibiotic; they may confer susceptibility instead, since summing absolute weights discards the direction of impact. It must also be noted that the absolute weights have no standalone mathematical, physical, or biological significance; they only indicate the relative importance of different genes/alleles in predicting resistance or susceptibility.
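The ranking described above can be sketched as follows. The function name `rank_genes`, the gene names and the weight values are hypothetical; the direction label is derived from the sign of the summed weights purely for illustration, since the ranking itself uses only absolute values.

```python
def rank_genes(names, W):
    """Rank genes/alleles by the sum of absolute weights over all iterations.

    W[i][j] is the weight of gene/allele i in iteration j.
    """
    ranked = []
    for name, row in zip(names, W):
        importance = sum(abs(v) for v in row)
        # The sign of the summed weights is kept separately as the direction of impact.
        direction = "Resistant" if sum(row) > 0 else "Susceptible"
        ranked.append((name, importance, direction))
    # Most important gene/allele first.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical weights for three genes over three iterations.
names = ["geneA", "geneB", "geneC"]
W = [
    [2.0, 1.5, 2.5],
    [-0.25, 0.5, -0.5],
    [-1.5, -1.0, -2.0],
]
ranked = rank_genes(names, W)
for name, importance, direction in ranked:
    print(name, importance, direction)
```

Note that geneC ranks above geneB even though all of its weights are negative, which is exactly why a top-ranked gene/allele may point towards susceptibility rather than resistance.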

Step 7: Correlation analysis

We selected the top 40 genes/alleles based on the sum of their absolute weights across the iterations. Since we also have the weights of these genes/alleles for each iteration, we calculated the pairwise correlation between the weights of these top 40 genes/alleles. For example, if there are ‘k’ iterations, every gene/allele has ‘k’ weights, i.e. it can be represented as a vector of ‘k’ dimensions. To find the correlation between two genes/alleles, we calculated the Pearson correlation between their corresponding vectors. A positive correlation means that an increase in the weight of one gene/allele is accompanied by an increase in the weight of the other, whereas a negative correlation means that their weights move in opposite directions.
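A minimal sketch of the pairwise Pearson correlation between weight vectors is given below. The helper names and the toy weights are hypothetical, and the correlation is written out by hand so that the example depends only on the standard library.

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equal-length weight vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    # Undefined (division by zero) if either vector is constant.
    return cov / (su * sv)


def pairwise_correlations(names, W):
    """Map each unordered pair of genes/alleles to the correlation of their weights."""
    return {
        (names[i], names[j]): pearson(W[i], W[j])
        for i, j in combinations(range(len(names)), 2)
    }


# Hypothetical weights: one row per gene/allele, one column per iteration (k = 3).
names = ["geneA", "geneB", "geneC"]
W = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 2.0]]
corr = pairwise_correlations(names, W)
# Pairs ordered by the magnitude of their correlation, strongest first.
top = sorted(corr, key=lambda pair: abs(corr[pair]), reverse=True)
for pair in top:
    print(pair, round(corr[pair], 3))
```

Sorting by the absolute value of the correlation gives the order in which pairs would be taken forward for closer inspection.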

These correlation analyses give us an idea of the relationship between two genes/alleles, which is explored further when analyzing the impact of mutations in particular genes on the resistance phenotype of the strain.

Step 8: Mutational Analyses

Just as the sign of a weight merely indicates the direction of a gene/allele's impact on the resistance phenotype, the sign of the correlation between two genes/alleles indicates whether their weights vary in the same or in opposite directions. We selected the top pairs of genes/alleles based on the magnitude of their correlation and analyzed how mutations in the respective genes affect the resistance phenotype. We mainly looked for cases where, for example, a mutation in gene A was responsible for resistance to a particular antibiotic, but not when another gene B was also present along with the mutated gene A. We performed these analyses for the pairs with the highest correlation values. They help us draw better conclusions about the relationship between a particular pair of genes/alleles.
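One simple way to look for such cases is to tabulate the phenotypes of strains by which members of a gene pair they carry. The sketch below is an assumption about how this could be done, not a description of our exact procedure; the function name, the feature names `geneA_mut` and `geneB`, and the toy strains are hypothetical.

```python
from collections import defaultdict


def resistance_by_genotype(strains, allele_a, gene_b):
    """Count resistant strains for each presence/absence combination of a pair.

    strains: list of (set_of_features, phenotype), phenotype being "R" or "S".
    Returns {(has_a, has_b): (n_resistant, n_total)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for features, phenotype in strains:
        key = (allele_a in features, gene_b in features)
        counts[key][0] += 1 if phenotype == "R" else 0
        counts[key][1] += 1
    return {key: tuple(value) for key, value in counts.items()}


# Hypothetical strains: the mutated allele of gene A alone goes with resistance,
# but not when gene B is also present.
toy_strains = [
    ({"geneA_mut"}, "R"),
    ({"geneA_mut"}, "R"),
    ({"geneA_mut", "geneB"}, "S"),
    ({"geneA_mut", "geneB"}, "S"),
    ({"geneB"}, "S"),
    (set(), "S"),
]
table = resistance_by_genotype(toy_strains, "geneA_mut", "geneB")
for (has_a, has_b), (n_res, n_total) in sorted(table.items(), reverse=True):
    print(f"A_mut={has_a} B={has_b}: {n_res}/{n_total} resistant")
```

A pair is worth a closer look when the resistant fraction for "mutated A without B" differs clearly from that for "mutated A with B", as in this toy table.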

References

[1] Wattam, A. R. et al. 2013. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Research, 42 (Database issue), pp. D581-D591
[2] Seemann, T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), pp. 2068-2069
[3] Cortes, C., Vapnik, V. 1995. Support-Vector Networks. Machine Learning, 20, pp. 273-297