Team:TAU Israel/Software

sTAUbility

Software

Introduction

Our software products address the general need for evolutionarily stable constructs, each with a distinct approach. When we were introduced to the problem of genetic stability, we found no existing tools for designing stable constructs. Specifically, previous iGEM projects focused on identifying sequences that contribute to the relative instability of genetic constructs but did not suggest a proper solution for avoiding these patterns. In addition, many commercially available optimization tools do not consider stability levels when offering an optimized sequence.

We addressed this issue by creating two software products - Staubility Enhancer and Staubility EFM Optimizer. Each was designed with the following guidelines:

  1. Providing useful, empirically based outputs for our users.
  2. Being as intuitive and user-friendly as possible.
  3. Interfacing easily with workflows. This includes using standard data formats and providing outputs that can be used as is, without further manipulation.
  4. Utilizing existing, established platforms, in order to facilitate higher confidence in our product.

For this purpose, we created a robust, versatile GUI framework.


Staubility Enhancer

In order to address the difficulty of long-term expression of engineered constructs in a foreign organism, we proposed interlocking a target gene to the N terminus of an essential gene, under the same promoter. However, due to the variety of linkage options, and given that each conjugated essential gene will provide a different stability level when attached to a specific target gene, our solution requires extra information for effective implementation. This is why we decided to create software that acts as an interface between the user and the proposed solution.

The Staubility Enhancer is a tool created for designing embeddings of target genes into host genomes, with much higher evolutionary stability and expression levels. In-depth explanations of the theory and models behind this tool are available on our Model page.

The modular design simplifies the design process and allows the user to make intelligent decisions regarding their constructs.

Figure 1. The Staubility Enhancer scheme


We decided to program the software in Python, as it is a widely documented, open-source language. However, some of the packages we used are neither trivial nor intuitive. This is why we packaged the software with a GUI (graphical user interface) as an exe file, which allows the user to run it regardless of their programming knowledge.

The Staubility Enhancer is a three-phase software, and it allows you to:

  1. Choose the best conjugated gene to increase your target gene's stability.
  2. Select the best linker for your construct's purpose, considering protein folding in the case of fusion linkers.
  3. Optimize the combined construct for efficient translation and increased stability.

The software's main window allows the user to specify the target gene of interest, the preferred linker type, and many optimization parameters that are described later.

Figure 2. Staubility Enhancer main window.


First Stage – Conjugated Gene Predictor

Given a target gene in string format, along with other optimization and biological preferences, our model first predicts the stability of all possible conjugated genes.

This prediction is based on a Random Forest model. It was trained and validated using the empirical results for the SWAT library created by Prof. Schuldiner; see our Experiments page for more details regarding this library.

In addition, we designed our Gene-SEQ experiment to gather large-scale empirical data that will improve our model. Due to Covid-19 related delays, we received only some of our deep-sequencing results and did not have enough time to analyze this data and integrate it into our model; see our Measurements page for more details. In a future version of this software, we hope to utilize this data for the training and validation of our prediction model.

Following this prediction process, the user is presented with the top conjugated gene candidates, ranked according to their predicted stability.

Figure 3. Conjugated Gene Selection.


Originally, we had planned to provide the user with a single, optimal construct. However, during our academic and industry consultations, we received feedback that providing users with multiple sequences would increase their confidence significantly, as it will allow them to empirically test various possible constructs. The users can now parallelize their development process, improving our product’s output for existing workflows.

The ranking of the genes is normalized with the score of the best candidate, in order to indicate the relative difference between the candidates.
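The normalization described above can be sketched in a few lines of Python. This is only an illustration, assuming that each candidate has a single positive stability score and that a higher score is better; the function name, gene names, and score values are hypothetical and are not taken from our software.

```python
def normalize_scores(scores):
    """Rank candidates and scale each score by the best (highest) score."""
    best = max(scores.values())
    if best <= 0:
        raise ValueError("best score must be positive to normalize")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(gene, score / best) for gene, score in ranked]


# Hypothetical gene names and scores, for illustration only.
example = {"GENE_A": 0.8, "GENE_B": 0.4, "GENE_C": 0.6}
ranking = normalize_scores(example)
```

With this scaling, the best candidate always receives a score of 1, and every other score reads directly as a fraction of the best one.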

Second Stage – Linker Selection

Following the selection of a conjugated gene, and in case a fusion linker was chosen in the main window, the software estimates the disorder profile induced by many possible fusion linkers. This disorder profile provides an indication of a protein’s 3D folding.

Using the linker database provided by The Centre for Integrative Bioinformatics VU (IBIVU), and the IUPred2A library, we predict the change in the target and conjugated gene’s disorder profile. This allows the user to select a linker that causes a minimal change. This change is measured as Linker Cost. The lower this cost, the better the linker.
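One possible way to turn a change in disorder profile into a single number is sketched below. It is an assumption made for illustration, not the exact definition of Linker Cost used in our software: here the cost is taken as the mean absolute change in per-residue disorder between each protein on its own and the same residues inside the fused construct. The disorder values themselves would come from a predictor such as IUPred2A; in this sketch they are plain lists of floats, and the function name is hypothetical.

```python
def linker_cost(target_alone, partner_alone, fused, linker_len):
    """Mean absolute change in per-residue disorder caused by the fusion.

    `fused` is the profile of target + linker + partner. The linker's own
    residues are skipped, so only the two original proteins are compared.
    """
    n_target, n_partner = len(target_alone), len(partner_alone)
    if len(fused) != n_target + linker_len + n_partner:
        raise ValueError("fused profile length does not match its parts")
    fused_target = fused[:n_target]
    fused_partner = fused[n_target + linker_len:]
    diffs = [abs(a - b) for a, b in zip(target_alone, fused_target)]
    diffs += [abs(a - b) for a, b in zip(partner_alone, fused_partner)]
    return sum(diffs) / len(diffs)


# Hypothetical disorder scores, for illustration only.
cost = linker_cost([0.2, 0.4], [0.6, 0.8], [0.3, 0.4, 0.9, 0.9, 0.6, 0.7], linker_len=2)
```

Under this definition, a linker that leaves both profiles unchanged has a cost of zero, which matches the rule that a lower cost indicates a better linker.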

Figure 4. Linker selection.


Once more, users are presented with several linkers and can select the one with which to proceed. In the case of a 2A linker, the software presents the four available options and allows the users to select their preferred one, considering length and efficiency.

It should be noted that in future versions of our software we plan to integrate more types of linkers, as described on our Design page.

Third Stage – Optimization

Following the selection of a linker, the user has effectively created a complete construct. At this stage, our software uses various models in order to optimize this construct for higher stability and expression levels. This resulted, in part, from academic and industry consultations that clarified that mutational stability is not enough for users – higher expression levels are equally important.

For this optimization, the amino acid sequence of the user’s construct is maintained.
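A simple way to check that the amino acid sequence is maintained is to translate the sequence before and after optimization and compare the results. The sketch below assumes the standard genetic code and a coding sequence whose length is a multiple of three; the helper names are hypothetical. In our software this constraint is handled by the optimization library rather than by code like this.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into a one-letter amino acid string."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def same_protein(original, optimized):
    """True if both DNA sequences encode the same amino acid sequence."""
    return translate(original) == translate(optimized)
```

For example, `same_protein("ATGGCTAAA", "ATGGCCAAG")` holds because both sequences encode Met-Ala-Lys, while changing `GCT` to `GAT` would replace alanine with aspartate and fail the check.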

The first framework we utilized for this optimization is the EFM Calculator developed by the 2015 UTexas iGEM team. Their EFM Calculator is a web tool, currently integrated into Benchling, that finds mutational hotspots in input sequences. For more details, see our Contribution page. We implemented its calculation and corroborated our results with the original calculator. Our software uses this detection to find the mutational hotspots in a sequence and avoid them.

There is much evidence that weak mRNA folding in the region encoding the first ~15 amino acids of a protein is correlated with high expression levels. The second framework we utilized for this optimization is the seqfold library. This Python library predicts the minimum free energy structure of an RNA sequence, using a thermodynamic approach. The more folded a sequence is, the lower (more negative) its minimum free energy. Thus, in order to induce weak mRNA folding, our software maximizes the sequence’s minimum free energy in this starting region. This allows the pre-initiation ribosomal complex to easily identify the start codon.

Our software optimizes for GC content of the genetic sequence as well. Our users define the range of permissible values, as this is often dictated by other biological needs.
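The GC content check itself is straightforward, as the following sketch shows. The function names are hypothetical, and the optional sliding window is an assumption added for illustration; the text above only states that the user defines a permissible range.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_within_range(seq, minimum, maximum, window=None):
    """Check GC content globally, or in every sliding window if `window` is set."""
    if window is None or window >= len(seq):
        return minimum <= gc_content(seq) <= maximum
    return all(
        minimum <= gc_content(seq[i:i + window]) <= maximum
        for i in range(len(seq) - window + 1)
    )
```

A sequence can satisfy the range globally while still containing a local stretch that violates it, which is why a windowed check is sometimes preferred.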

Finally, our software optimizes for codon usage bias, using one of several methods often used in the literature. For more details on these methods, please refer once more to our model page.

It is also worth mentioning that the optimization process involves two steps. First, we optimize the input with respect to mRNA folding at the start of the sequence, the required GC content, and codon usage. Then, we avoid the mutational patterns detected by the EFM calculation in the partially optimized sequence, while optimizing the codons and keeping the start of the sequence as is. This two-step strategy allows the algorithm to generate a sequence that is closer to the optimum, and only then deal with mutational hotspots. Thus, the probability that new problematic sites will appear after optimization decreases dramatically.

All these optimizations are performed utilizing the DNAChisel library. This platform is used by Benchling as well. Using it, our software quickly and efficiently resolves all constraints presented, including amino acid sequence maintenance, GC content, mutational hotspot avoidance, and weak mRNA folding. Afterwards, it optimizes the objectives presented by the codon usage bias methods.

Output

For ease of use, we provide the output in several formats.

First, we provide the output sequence in the GenBank format. We also provide the construct’s separate components in FASTA format.

Second, we report the mRNA secondary structure and minimum free energy after the optimization process, for the user to track the changes.

Finally, we provide a sequenticon of the sequence. A sequenticon is an icon that is unique to each sequence, resembling in concept a QR code. This is an important feature, especially when dealing with large sets of input sequences (which are often renamed or updated), as it enables the user to easily, visually differentiate between sequences that otherwise might be confused with one another.

Figure 5. An example of a sequenticon.


For more information about the theory and models behind this tool, please visit our model page.

For more details regarding our code and installation, please visit our Github repository.

For more details regarding our user interface, please refer to our user guide.



Staubility EFM Optimizer

The Staubility EFM Optimizer is a tool created for designing multiple genetic sequences in tandem, optimized for mutational stability and expression levels. For mammalian and insect cells, this tool also removes epigenetic inheritance hotspots.

We initially decided to develop this tool in response to a collaboration request from Prof. Tamir Tuller’s lab. They are working on the development of a Covid-19 vaccine, and requested our aid in improving its stability. Many of the features of this tool were developed in direct response to this need, providing immediate research support and yielding further feedback. Afterwards, we used this tool to aid other iGEM teams, such as Ben Gurion University. As this tool was designed alongside user testing, we constantly refined it for seamless integration with their workflows and ease of use.

In-depth explanations of the theory and models behind this tool are available on our Contribution page.

The Original Calculator

Our EFM Optimizer is based on the 2015 UTexas iGEM team’s EFM Calculator. Their calculator, already embedded into Benchling, allows you to easily find and rank Simple Sequence Repeats (SSR) and Repeat Mediated Deletions (RMD), using the equations derived in this paper, based on empirical data. These sites are significant mutational hotspots, and by modifying or removing them, users can increase their gene’s evolutionary stability significantly.

Simple Sequence Repeats are short sequences repeated back to back, for example AGAGAGAG. When such a repeating sequence occurs, there is an increased chance of polymerase slippage, causing the addition or removal of one of these units.

Figure 6. Simple Sequence Repeats illustration.
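To make the idea of a Simple Sequence Repeat concrete, the sketch below finds tandem repeats with a regular expression. It is a simplified illustration and not the EFM Calculator's algorithm: the function name and the thresholds (`max_unit`, `min_repeats`) are hypothetical, and no mutation rate is estimated.

```python
import re


def find_ssrs(seq, max_unit=4, min_repeats=3):
    """Return (start, unit, repeat_count) for each simple sequence repeat."""
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        # A unit of `unit_len` bases followed by at least min_repeats - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves repeats of a shorter unit,
            # e.g. "AGAG" is just two copies of "AG".
            if any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


ssrs = find_ssrs("TTAGAGAGAGCC")
```

For the input above, the only repeat reported is the unit `AG` repeated four times starting at index 2, which corresponds to the AGAGAGAG example in the text.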


Repeat Mediated Deletions arise from long, identical sequences that appear at different locations in the construct. These repeats can cause recombination errors, which lead to a deletion of the sequence between them.

Figure 7. Repeat Mediated Deletions illustration.
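The repeats behind Repeat Mediated Deletions can be located by indexing every substring of a fixed length and keeping those that occur more than once. The sketch below is illustrative only: the function name and the minimum length of 8 bases are hypothetical choices, not the thresholds of the EFM Calculator.

```python
from collections import defaultdict


def find_repeats(seq, min_len=8):
    """Map each substring of length `min_len` seen more than once to its start positions."""
    seq = seq.upper()
    positions = defaultdict(list)
    for start in range(len(seq) - min_len + 1):
        positions[seq[start:start + min_len]].append(start)
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}


# The 9-base block GATTACAGG appears twice, separated by TTTT.
repeats = find_repeats("GATTACAGG" + "TTTT" + "GATTACAGG")
```

Each reported pair of positions marks a region that could be lost if recombination occurs between the two copies.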


Our improvements

  1. The original calculator is a webtool, in which one sequence is inserted at a time.
    For many realistic applications, there are multiple sequences to be analyzed. Entering each sequence individually and saving its results separately takes considerable time. In addition, with many files, this can lead to lost files or, worse, confusion between sequences.
    In our software, the user selects an input directory, and all sequences within it are analyzed. The sequences keep their internal ordering, which reduces both time and confusion.
  2. Some users might need to analyze sequences for mammalian or insect cells. This was the case with Prof. Tuller’s research.
    For these cells, methylation sites act as epigenetic inheritance hotspots, causing genetic sequences to be more tightly folded and reducing expression levels. This is a greater issue than the mutational hotspots previously mentioned.
    In our software, we utilize a methylation database, which is an empirically-based database of methylation sites and their probabilities of appearing for different genetic sequences. Using this data, we scan input sequences, and find sites with high probability of methylation.
  3. In the original calculator, the user is provided with a ranked list of mutational hotspots. At this point, they would need to manually alter or delete their genetic sequences in order to avoid these hotspots. This could be painstaking work, consuming much time and effort.
    Using the tools we developed for our main software, we decided to provide the user with an optimized sequence. This sequence would avoid these hotspots and be optimized for stability and expression to boot.
    For this optimization, our first constraint is to maintain the construct’s amino acid sequence. Our second constraint is avoidance of the hotspots found.
    Our software optimizes for GC content of the genetic sequence as well. Our users define the range of permissible values, as this is often dictated by other biological needs.
    Finally, our software optimizes for codon usage bias, using one of several methods often used in the literature. For more details on these methods, please refer once more to our Model page.
    All these optimizations are performed using the DNAChisel library. This platform is used by Benchling as well. Using it, our software quickly and efficiently resolves all constraints presented, including amino acid sequence maintenance while optimizing codon usage, GC content, and mutational hotspot avoidance.

Empirical Testing

As previously mentioned, this product has been used by labs and iGEM teams. It integrated well into their workflows, and received positive reactions regarding its usefulness. Thus, we believe we met our goal of creating a user-friendly, easy-to-use platform that integrates well into actual research.

Unfortunately, evolutionary experiments take a very long time. Due to Covid-19 related delays, we could not empirically ascertain that our design improves mutational stability.

Thus, we needed to find some other validation for our outputs. We first validated that our EFM Optimizer and the original EFM Calculator return the same results, regarding the recognition of mutational hotspots.

Afterwards, we used data and known measures of evolutionary conservation. Genes from different orthologous groups are compared, using empirical data. Different regions and codons receive a conservation score, which indicates how well the codon is conserved through evolution. We compared the scores of the mutational hotspots we found with those of the rest of the genome, and found that the hotspots have a lower conservation score. Repeating this analysis for genes before and after optimization demonstrates an improvement in the total conservation score, giving our software empirical validation.
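The comparison between hotspots and the rest of the sequence can be sketched as follows. This is a minimal illustration, assuming one conservation score per position and a known set of hotspot positions; the function name and the numbers are hypothetical and are not our measured data.

```python
from statistics import mean


def compare_conservation(scores, hotspot_positions):
    """Return (mean score inside hotspots, mean score elsewhere)."""
    hotspot_set = set(hotspot_positions)
    inside = [s for i, s in enumerate(scores) if i in hotspot_set]
    outside = [s for i, s in enumerate(scores) if i not in hotspot_set]
    # statistics.mean raises StatisticsError if either group is empty.
    return mean(inside), mean(outside)


# Hypothetical per-position scores; positions 1 and 2 are treated as a hotspot.
hotspot_mean, background_mean = compare_conservation([0.9, 0.2, 0.4, 0.8], [1, 2])
```

A lower mean inside the hotspots than outside them is the pattern described above.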

User Interface

Figure 8. Staubility EFM optimizer main page.


In this screen, our users select their input directory, and all FASTA files within it are analyzed. They also choose whether to scan for methylation sites and whether to run the optimization process.
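Reading every FASTA file in a directory while preserving the order of the sequences can be sketched with the Python standard library alone. The function names and the list of accepted file extensions are assumptions made for this illustration and may differ from what our software accepts.

```python
from pathlib import Path

# Assumed set of FASTA file extensions, for illustration.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}


def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples, in file order."""
    records, header, chunks = [], None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def read_directory(directory):
    """Read every FASTA file in a directory, sorted by file name."""
    files = sorted(
        p for p in Path(directory).iterdir() if p.suffix.lower() in FASTA_SUFFIXES
    )
    return {p.name: read_fasta(p) for p in files}
```

Sorting the files by name and keeping records in file order gives a stable ordering, so results can be matched back to their inputs without confusion.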


Figure 9. Optimization settings.


Once the optimization process is selected, users are requested to insert the following configurations:

  1. The permissible range of GC content, whose purpose was previously described.
  2. The organism name, as the codon usage bias will be optimized to match this organism. If none_specified is selected, codon usage bias will not be considered.
  3. The codon optimization method previously mentioned.
  4. The ORF coding region of each sequence, as this range defines the area optimized for codon usage. It is especially important, as it determines the correct division into codons.

Output

Our users are provided with CSV files, detailing a ranked list of the hotspots described. This allows both validation of our findings with the previously mentioned EFM Calculator, and manual corrections if our users so desire.
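A ranked hotspot list of this kind can be written with Python's built-in csv module, as sketched below. The column names and the example rows are hypothetical and do not reproduce the exact layout of our output files; the score stands in for whatever ranking measure the hotspot detection produces.

```python
import csv
import io

# Hypothetical column layout, for illustration.
FIELDS = ["rank", "type", "start", "sequence", "score"]


def hotspots_to_csv(hotspots):
    """Return CSV text listing the hotspots, highest score first."""
    ranked = sorted(hotspots, key=lambda h: h["score"], reverse=True)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rank, hotspot in enumerate(ranked, start=1):
        writer.writerow({"rank": rank, **hotspot})
    return buffer.getvalue()


# Hypothetical hotspots and scores, for illustration only.
csv_text = hotspots_to_csv([
    {"type": "SSR", "start": 2, "sequence": "AGAGAGAG", "score": 0.5},
    {"type": "RMD", "start": 0, "sequence": "GATTACAG", "score": 2.0},
])
```

A plain CSV file opens directly in spreadsheet software, which makes it easy to compare the list against the original calculator or to plan manual corrections.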

If our users select the optimization option, they also receive the optimized output sequence.

First, we provide the output sequence in the GenBank format.

In addition, we provide a sequenticon of the sequence (explained previously; see Figure 5).

For more information about the theory and model behind this tool, please visit our Contribution page.

For more details regarding our code and installation, please visit our Github repository.

For more details regarding our user interface, please refer to our user guide.




GUI Framework

At an early stage of our work, we realized that we would need to create a GUI framework.

This was clear to us due to several factors:

  1. We needed a flexible framework in which we could change our requirements, use various kinds of input, and allow different displays as needed. Given the complicated nature of our project, we did not want trying out different options to cost us a lot of time, as this would have limited our experimentation.
  2. We had two software products, the Staubility Enhancer and the EFM Optimizer. We wanted to easily share their features, and create a similar look for ease of transition between them.
  3. Our code is built on many different libraries, models, and frameworks, and would be very difficult to run as a script. Thus, it was extremely important to be able to compile it properly for our users.
  4. We had talented members in our group who had done it before and could do it again efficiently.

We built our framework using the PyQt5 Python library.