Contribution
Introduction
While exploring the field of genomic stability, we realized that many labs and iGEM teams encounter different problems when trying to estimate or improve the stability of their given plasmid utilizing existing tools.
One of the most known web tools that assists in such analysis is the Evolutionary Failure Mode (EFM) calculator [1] found here. This calculator, developed in 2015 by the University of Texas iGEM team, is meant to analyze a DNA sequence and find evolutionary hotspots – sites that are much more likely to undergo mutation. Using this calculator, one could find these sites, and modify or erase them as possible to reduce mutation likelihood significantly.
As our project focuses on analysis of genetic stability, Prof. Tuller’s lab requested that we examine many configurations of a COVID-19 vaccine that they are developing. We decided that our best approach for a rapid response would be quickly developing a platform that performs the EFM calculation for many files at a time, instead of inserting them one by one to the web tool. In addition, as the vaccine is developed on Chow cells, we also added a mechanism for recognizing methylation sites – addition of a methyl group to adenine(A) or cytosine(C) which affects the sequence properties, in some cases can enhance mutation probability.
Further engagement with iGEM teams has provided us with significant insights into the growing need for a thorough solution regarding genomic stability.
One such interaction was the Stability Workshop we led in the German iGEM Meetup. In this workshop we introduced teams with possible strategies for improving the stability of a given construct, including optimization of codon usage, utilizing strong promoters, and minimizing mutational hotspots recognized by EFM. The discussion at the second part of our workshop led us to understand that many iGEM teams would have use for such strategies if they find them conveniently accessible. Our friend George from Gaston Day School iGEM team, who received our help in the past, came to visit us during the Meetup and shared his thoughts about the benefits of an EFM optimizer that will help teams to increase their construct stability.
Furthermore, following the Israeli iGEM Meetup we established a collaboration with BGU team, and addressed their need for a stable plasmid.
All in all, we understood that an improvement of the detection mechanism provided by the first-generation EFM calculator should include three important aspects:
- Automatization of the EFM calculator algorithm and implementation on large sets of input files and sequences from different folders in a convenient way for the user.
- Consideration of Methylation sites that tend to develop mutations.
- Optimization of the sequences, considering not only the EFM parameters but other criteria that promotes stability, as we described in our workshop.
We decided to develop a user-friendly software based on the EFM calculator, that will see to these three aspects. Our goal was to create an automated mechanism for a large number of input files that provides a sequence-level detection of mutational hotspots, and an optimization engine of these input sequences by avoiding the identified mutational patterns, reducing the GC content and increasing the frequency of optimal codons. Thus, our software provides an end-to-end solution, a concept that is yet to exist in the field of genomic stability analysis.
We wrapped this software in a friendly Graphical User Interface (GUI), downloadable as an application to the user’s computer, thus allowing greater computational capabilities.
The original calculator
The original EFM calculator considers three forms of mutation – Simple Sequence Repeats (SSR), Recombination-Mediated Deletions (RMD), and Base-Pair Substitution (BPS) which gives a baseline to compare with. From these, a Relative Instability Prediction (RIP) score is calculated as follows:
$$RIP=\frac{SSR+RMD+BPS}{BPS}$$
This score gives a measure of how unstable a sequence is, where its minimum is 1 for the case of no SSR or RMD mutational hotspots.
SSR are sites where there is a repeating short sequence, causing potential polymerase slippage. For instance, the following sequence is an SSR: ATGATGATGATGATG, it has a length of base unit (L) of 3 and number of units (N) of 5. The calculator considers SSR sites if they have N≥3, L≥2, or N≥4, L=1. Marking generational mutation rate as , the SSR score of a site is calculated as follows:
$$(\mu)=-12.9+0.729N, \quad L=1-4.749+0.063N, \quad L \geq 2$$
All these calculations (mutation rate and considered sites) are based on empirical data [1].
RMDs are long (L≥16), identical sites appearing in different locations in the sequence, causing potential recombination faults. The recombination probability between two sites is based on their length L and the distance between them Ls, and is calculated as follows:
$$\mu=(A+L_{s})^{-\frac{\alpha}{L}} \cdot \frac{L}{1+B \cdot L_{s}}$$
Where A=5.8±0.4, B=1465.6±50.0, α=29.0±0.1 were found empirically [2].
BPS is the probability of spontaneous mutation and is empirically estimated in [3] from genome sequencing of E. coli mutation accumulation lines, as $\mu=2.2 \cdot 10^{-10}$ .
Note that all empirical findings were estimated for E. coli. While the probability of mutation will be different for other organisms, the comparative order of hotspot sites is approximately maintained.
Methylation Sites
For the purpose of optimizing mammalian and insectoid cells (Chow for example), methylation sites must be removed as well.
We found this comprehensive database containing 313 reported methylation motifs by Wang. et al 2019 [4] in MEME format. A motif is a sequence pattern that occurs repeatedly in a group of related sequences. Motifs in the MEME Suite are represented as position-dependent letter-probability matrices that describe the probability of each possible letter at each position in the pattern.
We wished to report to the user whether his sequence contains any of the known motifs. However, after discussing the matter with Wang's Lab, there was no clear way to compare the strengths of the methylation sites, or to compute the likelihood of methylation. We decided to find the next best thing – the probability of a site being a specific methylation site (times a constant). As we have no prior on the relative probabilities of a methylation site, we can assume equal probabilities, turning $\frac{p(certain\,methylation\,site)}{p(our\,site)}$pcertain methylation sitepour siteinto a constant. From Bayesian theory:
$$p(certain\,methylation\,site|our\,site)=\frac{p(certain\,methylation\,site)}{p(our\,site)} \cdot p(our\,site|certain\,methylation\,site)$$
We can calculate $p(certain\,methylation\,site|our\,site)$ by simply multiplying the probabilities per nucleotide in this methylation site. We find the methylation site maximizing this likelihood and sort all sites by this score.
Stability Optimizer
The user of the software can choose to optimize the input sequences if they are divisible by 3 and match the required reading frame (target amino acid translation). This option is available for the following organisms: B. subtilis, C. elegans, D. melanogaster, E. coli, G. gallus, H. sapiens, M. musculus, M. musculus domesticus, S. cerevisiae.
As mentioned before, the optimization engine involves three routes:
- Avoiding the identified mutational patterns (with tendency of recombination, slippage, or methylation) by the EFM calculator.
- Reducing the GC content to a requested range (minimum and maximum) specified by the user. The term GC content refers to the percentage of G/C nitrogen base pairs in the sequence, affecting the strands connection as G-C pairs form 3 hydrogen bonds and A-T pairs form 2 hydrogen bonds. The algorithm will split the sequence to windows of a specified size and optimize each window.
Regarding stability, the assumption is that the lower the GC content, the more stable is the sequence, since it has been proved that genes with high GC had a substantially elevated rate of mutations, both single-base substitutions and deletions [5].
- Codon optimization - each codon is optimized based on its relative frequency in the host genome (specified by the user as “organism name” input). The frequency data is from the Kazusa database [6].
The assumption behind codon optimization procedure regarding the stability of the sequence is that matching the codon usage of a host organism helps the sequence to adapt to the entire translation mechanism of the cell. This adaptation can potentially lead to the preservation of the mRNA and later the protein of interest in the host cell for a longer duration of time. In literature, codon optimization is often associated with higher production of proteins, likely (but not solely) due to the assumption above.
There are three common methods for codon optimization:
- "Use best codon": in this method, every codon will be replaced by the "optimal" (i.e. most frequent) synonymous codon in the target organism. This is equivalent to Codon Adaptation Index (CAI) optimization that is described often in literature.
- “Match codon usage”: the final sequence's codon usage will match as much as possible the codon usage profile of the target species. This method is used throughout the literature, for instance - Hale and Thomson 1998 [7].
- “Harmonize RCA”: each codon will be replaced by a synonymous codon whose usage in the target organism matches the usage of the original codon in its host organism [8].
The optimization algorithm we used is provided by the DNA chisel optimization framework [9] (complete documentation here). DNA Chisel is an easy-to-use, easy-to-extend Python library used in many bioinformatic tools (such as Benchling) for optimizing DNA sequences with respect to a set of constraints and optimization objectives. It can also be used via a command-line interface, or a web application.
We discovered this library by examining the codon optimization tool in Benchling as we prepared for the workshop we led in the German Meetup. This Benchling tool referred to DNA-chisel as the origin of the optimization algorithm, and we decided to give it a try as well (and as you can see it paid off!).
The DNA-chisel algorithm has two main steps:
- Resolution of all hard constraints, ignoring optimization objectives.
- Objectives maximization with respect to the constraints.
This steps separation allows to detect incompatible constraints early, and quickly reach satisfactory solutions.
What is a constraint and what is an objective?
Constraints are specifications (or rules) that must be verified in the final sequence (“at all cost” approach), and objectives are specifications that must be as close to verified as possible in the final sequence. The objectives are defined by a score, that is maximized during the process.
Thus, the objective in our problem is to codon-optimize the sequence, and the constraints are the required GC content, patterns to be avoided and preserving the amino-acid translation of the sequence.
The algorithm also defines a Mutation Space, used by the solver to explore new sequence variants. This space stores the changes that can be made to the sequence - it indicates for each nucleotide or sub-fragment the different acceptable values and is determined by the problem’s nucleotide-restricting constraints. For example, when an Enforce Translation constraint is applied to a stop codon, it constrains the nucleotides triplet to only 3 possible choices, TAG, TGA, and TAA. For further reading one can refer to the paper [9] or the DNA-chisel documentation (link above).
The optimization process we decided to implement involves two steps. First, we optimize the input sequences without the EFM constraint (the remaining specifications are translation enforcement, GC content and codon optimization). Then, we apply all the constraints (including EFM patterns) on the new half-optimized sequence.
This two-step strategy allows the algorithm to reach a sequence that is closer to optimum, and only then to deal with recombination, slippage, and methylation sites. Thus, the probability that new problematic sites will appear after optimization decreases dramatically, as opposed to direct implementation of all constraints (including EFM patterns) that can lead to creation of new mutational hotspots. Of course, this scenario is still possible, but is less likely with our two-step strategy that aims to balance the need to avoid EFM constraints with the rest of the specifications (each of them may also influence the stability of the input construct).
Additional feature of the DNA optimization engine is a report (in ZIP format) that is produced at the end of the process, containing for each input sequence:
- Annotated GenBank file of the optimized sequence with all the changes that has been made. The user can track the changes, and if not satisfied with one of them, he can change the sub-fragment back to the original sequence.
- A list of the successive constraints and objectives in a CSV format and in a PDF format. The PDF report also includes the objective score, and a Sequenticon.
The Objective score indicates whether and how well the sequence complies with the specification (the more compliant, the higher the value). By convention, a negative score indicates that the specification is breached, while a score of 0 or more indicates compliance.
Sequenticon (documentation here) is an icon that is unique to the final output sequence. This is an important feature especially when dealing with large sets of input sequences (which are often renamed or updated), because it enables the user to differentiate between sequences that otherwise might get confused with one another. Sequenticons provide a simple visual way to know that two sequences are different (different identicons) or very probably the same (same identicon).
Software
We designed our software to be user-friendly and accessible, taking into account that it is used against large sets of input sequences and thus the number of input arguments should be minimized, and the inputs and outputs should be concise.
The user may choose a path of a folder in his/her computer, and all the subfolders within are searched for FASTA files.
The subfolder structure is copied to the output path, where each FASTA file is replaced by a subfolder with the same name, containing three CSV files of SSR, RMD, and methylation sites (if relevant). If the user chooses to implement optimization (agreeing that all the input sequences are divisible by three and match the target reading frame), then each subfolder will also include a ZIP report (described above) for the sequence.
If a FASTA file contains more than one sequence, the three CSV files will include a column that indicates the serial number of the sequence within, and the ZIP report’s name will contain the serial number.
Maintaining this structure allows users to insert many files at once and keep logical orderings between them. One can also open a User Guide from within the software and specify how many hotspots to show from each type (only the worst will be presented in the output).
The software is downloadable from this link.
Check out our User Guide:
Bibliography
[1] Jack, B. R., Leonard, S. P., Mishler, D. M., Renda, B. A., Leon, D., Suárez, G. A., & Barrick, J. E. (2015). Predicting the genetic stability of engineered DNA sequences with the EFM calculator. ACS synthetic biology, 4(8), 939-943.
[2] Oliveira, P. H., Lemos, F., Monteiro, G. A., & Prazeres, D. M. (2008). Recombination frequency in plasmid DNA containing direct repeats—predictive correlation with repeat and intervening sequence length. Plasmid, 60(2), 159-165.
[3] Lee, H., Popodi, E., Tang, H., & Foster, P. L. (2012). Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichia coli as determined by whole-genome sequencing. Proceedings of the National Academy of Sciences, 109(41), E2774-E2783.
[4] Wang, M., Zhang, K., Ngo, V., Liu, C., Fan, S., Whitaker, J. W., ... & Zheng, L. (2019). Identification of DNA motifs that regulate DNA methylation. Nucleic acids research, 47(13), 6753-6768.
[5] Kiktev, D. A., Sheng, Z., Lobachev, K. S., & Petes, T. D. (2018). GC content elevates mutation and recombination rates in the yeast Saccharomyces cerevisiae. Proceedings of the National Academy of Sciences, 115(30), E7109-E7118.
[6] Nakamura, Y., Gojobori, T., & Ikemura, T. (2000). Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic acids research, 28(1), 292-292.
[7] Hale, R. S., & Thompson, G. (1998). Codon Optimization of the Gene Encoding a Domain from Human Type 1 Neurofibromin Protein Results in a Threefold Improvement in Expression Level inEscherichia coli. Protein Expression and Purification, 12(2), 185-188.
[8] Claassens, N. J., Siliakus, M. F., Spaans, S. K., Creutzburg, S. C., Nijsse, B., Schaap, P. J., ... & Van Der Oost, J. (2017). Improving heterologous membrane protein production in Escherichia coli by combining transcriptional tuning and codon usage algorithms. PloS one, 12(9), e0184355.
[9] Zulkower, V., & Rosser, S. (2019). DNA Chisel, a versatile sequence optimizer. bioRxiv.