Graphic Interpretation

Intuition

In an experiment to improve targeting specificity through directed evolution for the Cas9/dCas9 protein, error-prone PCR is a major method to generate some point mutations in the products ^[1]. Several low-fidelity DNA polymerases are usually used to induce variation in DNA sequence randomly to imitate the natural evolution. The mutation frequency from AT to GC is apparently higher than the frequency from GC to AT^[2] in the initial error-prone PCR system proposed by Leung et al. Currently widely-used error-prone PCR kits have narrowed the difference in mutation frequencies between different bases, but we can still observe mutation tendency to some extent.

The single-base mutation preference leads to the difference in the mutation probability between the corresponding amino acids, and finally affect the polypeptide chains in the mutation library. It is hard for us to quantitatively describe the differences in amino acids or proteins caused by mutation preference. In order to visually depict the relation about mutation probability between amino acids, graph machine learning algorithm is adopted to transform the single base mutations according to the error-prone PCR kit into a kind of distance. As a result, sequence differences among a batch of protein products obtained in directed evolution experiments can be measured by the mutation distance we have defined. In addition, the comparison between the phylogenetic tree and hierarchical clustering result based on our mutation distance may be an effective way to measure the similarity between natural evolution and directed evolution.

Codon Network

Codon, a sequence of triplet nucleotide residues on mRNA (or DNA) that encodes a specific amino acid, is known as the basic way DNA transmits genetic information to proteins. Therefore, we bridge between mutation probability among bases and mutation differences at the amino acid level through the representation for codons. Meanwhile, the network model is a popular approach in biology to describe the relationship between genes, proteins, and other biological substances. We define our codon network with reference to the structure and properties of other biological networks^[3].

In our directed evolution experiment, Mutazyme II DNA polymerase in GeneMorph II Random Mutagenesis Kit is responsible for producing low-fidelity DNA strands. We have found that the mutational spectra of the mutazyme exhibit a somewhat higher tendency to create transversions over transitions which is consistent with our intuition. We assume the influence by insertion and deletion can be neglected, and integrate other information about single base mutation frequency into the codon graph. We construct the network $G(V,E)$ where the vertex set $V$ comprise all 64 codons including 3 stop codons, and the edge set $E$ is composed of single base mutation like (A$\rightarrow$T, G$\rightarrow$C) including transitions and transversions. Next, we annotate the nodes with corresponding amino acids and define the edge attribution as the mutation frequency for the mutazyme.

NetworkX, a python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks^[4] help us to construct our codon graph. We also conduct visualization basic analysis of the network with Cytoscape^[5], an open-source software platform for visualizing complex networks.

Codon network and the betweenness centrality distribution in cytoscape.

In graph theory, betweenness centrality is a measure of centrality in a graph base on shortest path. For every pair of vertices in a connected graph, there exists at least one the shortest path between the vertices such that either the number of edges that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized.

The shortest path from node $u$ to $v$ is defined as: $$P_{min}(u,v) = \arg\min_{P(u,v)}\sum_{e_i \in \{P(u,v)\}} W(e_i)$$

where $\{P(u,v)\}$ is the set of all directed path from node $u$ to $v$, and $W$ represent all weights of these edges. $$\text{Betweenness Centrality}(v) = \frac{\text{number of shortest paths including }v}{\text{Total number of shortest paths}}$$

Betweenness centrality measures the ability of a node to transmit information to other nodes^[6], and indicates the probability to turn into other codons from the initial one. Interestingly, the codons encoding Gly and Ala (GC-rich) show relatively low betweenness centrality due to the ratio of AT→GC/GC→AT equals 0.6 for Mutazyme II DNA polymerase.

Mutation Distance

Currently, we can measure the centrality for each codon in the mutation relation network which indicates the importance of the node. Besides the centrality metrics, the distance based on mutation frequencies is also a key indicator to explain the process of directed evolution. A naive way to define the distance between codons is the negative logarithmic transformation of the probabilities from the perspective of information theory and entropy. In addition, the distance between amino acids is the average of the distance between the corresponding codons.

$$d(u,v) = -\log(p_{uv}), D(AA_1,AA_2) = \sum_{u \in AA_1}\sum_{v \in AA_2}d(u,v)$$ where $AA_1$ and $AA_2$ represent two different amino acids.

We intend to map the codons (nodes) to points in the Euclidean space to purse a more straightforward representation for codons. That is to say, we need to select an algorithm to transform the node into a vector. The classical word2vec model trains the combination of words(corpus) to realize word embeddings^[7], which is a milestone achievement in NLP. Therefore, we can learn from the word2vec model by training the combination of nodes in the network to generate node embeddings. DeepWalk[8] and node2vec^[9] algorithm use random walk strategy starting from each node to represent the nodes by vectors and have made success in practical applications like node classification and other tasks.

We adopted the random walk strategy from Deepwalk, and apply the algorithm to our codon graph by modifying the random walks to walking by normalized mutation frequencies (Codon DeepWalk). We build the Deepwalk corpus with 100 random paths starting from each node, and the length of walks equals 4. The node embeddings model is trained in a very similar way to the word2vec model, and 50-dimension vectors are generated for all codons.

We continue to analyze the distance between codons based on mutation after node embeddings, so hierarchical clustering, PCA, and t-SNE are used to visualize the mutation distance and relationship between codons. Three clusters are categorized according to the node embedding vectors, and we can obviously notice the difference between the clusters.

Principal component analysis of codon encoding vectors.

T-SNE visualization for codon encoding vectors.

The result of clustering come from mutation characteristics is different from the classification of the amino acids the codons encode, which is actually evaluation based on sequence and specific DNA polymerase.

When the impact of mutations produced by error-prone PCR rises to the amino acid level, we create the representatin for the distance between amino acids if the nucleic acid sequence encoding the amino acid is unknown. All corresponding codon's embeddings are involved in the computation of the average matation distance for amino acids. Amino acids whose corresponding codons are more closely connected in the network are more likely to be converted among each other on average during directed evolution.

Hierarchical clustering and heatmap about amino acids' mutation distance

We now can regard the polypeptide chain as the translation for the splicing result of multiple codon vector so that we can measure the distance for the Cas9/dCas9 variants in the mutation library during the directed evolution experiment.

Reference

[1]McCullum, E.O., Williams, B., Zhang, J., & Chaput, J. (2010). Random mutagenesis by error-prone PCR. Methods in molecular biology, 634, 103-9.
[2] Cadwell, R., & Joyce, G.F. (1992). Randomization of genes by PCR mutagenesis. PCR methods and applications, 2 1, 28-33 .
[3] Yu, D., Kim, M., Xiao, G., & Hwang, T.H. (2013). Review of Biological Network Data and Its Applications. Genomics & Informatics, 11, 200 - 210.
[4] Hagberg, A., Schult, D., & Swart, P.J. (2008). Exploring Network Structure, Dynamics, and Function using NetworkX.
[5] Shannon, P., Markiel, A., Ozier, O., Baliga, N., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., & Ideker, T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research, 13 11, 2498-504 .
[6] Wang, H., Hernández, J., & Mieghem, P.V. (2008). Betweenness centrality in a weighted network. Physical review. E, Statistical, nonlinear, and soft matter physics, 77 4 Pt 2, 046105 .
[7] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. ArXiv, abs/1310.4546.
[8] Perozzi, B., Al-Rfou, R., & Skiena, S. (2014). DeepWalk: online learning of social representations. KDD '14.
[9] Grover, A., & Leskovec, J. (2016). node2vec: Scalable Feature Learning for Networks. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Acknowledgements

SJTU-BioX-Shanghai

contact us

Contact us: sjtuigem@gmail.com or sjtuigem@gmail.com

Bio-X Institute, Shanghai Jiao Tong University, Dongchuan Rd. 800