Design
Analyze our vision
In our Inspiration and Description page, we described our initial idea and how we came up with our new, innovative approach for addressing the problem of genetic stability. We propose interlocking a target gene of interest to the N-terminus of an essential gene of the host organism, under the same promoter. We hypothesize that each conjugated essential gene will generate different stability levels when attached to a specific target gene.
After careful characterization of the problem, we concluded that a generalized solution is needed in order to serve a wider range of synthetic biology procedures. Thus, our end goal is the creation of software that will generate a customized construct for each target gene, composed of the best matching essential gene and linker for improved stability and higher expression.
Stating our vision is the first step of the design process, but clearly it is not enough. We still needed to overcome four challenges in our design:
- We would like to advise synthetic biologists on which essential gene to link to their target gene. In addition, it is possible that a non-essential gene will better promote a target gene’s stability, perhaps because it is conserved in evolution even though it is not defined as "essential". Therefore, we need to characterize all the genes in the host organism.
The questions that arise are:
- Which features influence the genetic stability of genes?
- What available databases can we use to predict the best fitting conjugated gene?
- How do we establish quantifiable measures of success?
- What experiments can we conduct to support such prediction models?
- Do linkage options differ from one possible construct to another, and if so, how do we choose the best linker?
- How do we measure the stability of so many constructs, regardless of the target gene's type? Specifically, many stability-measurement procedures are performed at small scale, and they often involve fluorescent genes, since it is easier to quantify their evolutionary half-life (the time it takes until the target gene is lost in the population). So, we had to ask ourselves:
- What is our measure of stability?
- How can we expand our solution for non-reporter genes and still propose this measure for stability?
- A valid solution is one that has proven to work. How can we provide an empirical proof-of-concept?
- How do we preserve high-level expression while increasing the stability of genes?
Following those questions, we fine-tuned our design and came up with a final scheme. Before we present this scheme, let us introduce the theoretical background that will help us answer the linkage question: how do we interlock a target gene to an essential gene?
Linkage options between target and essential genes
A crucial aspect in our design is the linkage between the target and essential genes. Strategies for multigene co-expression include fusion proteins, “self-cleaving” 2A peptides, ribosomal frameshifts, in-vivo proteolytic cleavage sites between genes, and signal peptides combined with ribosomal frameshifts. We will now describe each one of the above methods and how they are incorporated into our final design.
Please note that not all the described options are integrated into our final software, since we chose to focus on methods that fit our time frame while still providing a valid proof of concept for our main idea. Still, we encourage future teams to accept the challenge and continue our work!
Fusion Linkers
Protein chains are generally long and consist of multiple "domains": distinct structural units of a protein that can evolve and function independently. Naturally occurring multi-domain proteins are composed of functional domains joined by linker peptides [1], called "fusion linkers". These linkers covalently join functional domains together so that they act as one molecule throughout in vivo processes. They provide the conformation, flexibility and stability needed for the protein's biological function in its natural environment.
Fusion linkers are divided into two main categories [1]:
- Flexible linkers: short sequences that are rich in small or polar amino acids such as Gly and Ser that provide good flexibility and solubility. They are suitable choices when certain movements or interactions are required for fusion protein domains. Although flexible linkers do not have rigid structures, they can serve as passive linkers in order to keep a distance between functional domains. The length of flexible linkers can be adjusted to allow proper folding or to achieve optimal biological activity of the fusion proteins.
- Rigid alpha-helical linkers: exhibit relatively stiff structures by adopting alpha-helical structures or by containing multiple Pro residues. Rigid linkers are chosen when the spatial separation of the domains is critical for preservation of the stability or bioactivity of the fusion proteins, since they maintain the distance between domains more efficiently than the flexible linkers.
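To make the two linker categories concrete, here is a minimal sketch that builds a fusion sequence with an adjustable linker length. It assumes the repeat motifs (GGGGS)n for a flexible linker and (EAAAK)n for a rigid alpha-helical linker, which are commonly described in the linker literature [1]. The function names and the short peptide fragments are hypothetical placeholders and are not taken from our software.

```python
def flexible_linker(repeats: int) -> str:
    """Return a Gly/Ser-rich flexible linker, assumed here to be (GGGGS)n."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return "GGGGS" * repeats


def rigid_linker(repeats: int) -> str:
    """Return an alpha-helical rigid linker, assumed here to be (EAAAK)n."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    return "EAAAK" * repeats


def fuse(target: str, linker: str, essential: str) -> str:
    """Join target and essential protein sequences, target at the N-terminal side."""
    return target + linker + essential


if __name__ == "__main__":
    # Toy peptide fragments, used only to show the layout of the fusion.
    target_fragment = "MSKGEE"
    essential_fragment = "MTEYKL"
    for n in (1, 2, 3):
        print(n, fuse(target_fragment, flexible_linker(n), essential_fragment))
```

Changing `repeats` is the simplest way to explore how linker length affects the distance between the two domains. Choosing the actual motif and length still requires a structural model or experimental data.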
Inter-domain linker peptides of natural multi-domain proteins provide an ample source of potential linkers for novel fusion proteins [2]. Their main advantage in the context of our objective is their simplicity of design and usage, and their ability to work for any in-vivo production. Such procedures involve production of a target molecule in the cell, such as biofuel creation or expression of genes that give E. coli the ability to metabolize carbon dioxide.
The main challenge when using such a linker is predicting the spatial disturbance it causes to the target and essential proteins it is attached to.
An online program, LINKER, was designed to automatically generate linker sequences for fusion proteins [3]. Unfortunately, the server website is no longer accessible, most probably due to lack of maintenance. Other similar tools such as SynLinker [4] are also not available.
George and Heringa (2002) [2] developed a web-based linker database (http://www.ibi.vu.nl/programs/linkerdbwww/) that provides a group of linker candidates satisfying the user-specified queries such as the length, sequence and secondary structure of the linker. Although it hasn't been updated since it was released, it is still frequently used for designing fusion proteins and thus it can be used for our models.
We plan to integrate the fusion linker into our design, and advise the user on the best linker to use, based on unique models that we developed.
“Self-cleaving” 2A peptides
2A peptides are viral oligopeptides, 18–22 amino acids long, that mediate “cleavage” of polypeptides during translation in eukaryotic cells [5]. They help break apart polyproteins by causing the ribosome to fail at making a peptide bond. In general, conventional approaches for co-expression have several limitations, most notably imbalanced protein expression and large size. The use of 2A peptide sequences alleviates these concerns. In addition, adding the optional linker “GSG” (Gly-Ser-Gly) at the N-terminus of a 2A peptide increases its efficiency [6].
There are four main 2A sequences. In order to obtain a desired ratio of protein expression, it is important to select the right 2A construct.
The P2A and T2A are considered to be the most efficient 2A peptides [6,7]. The conventional way to rank the 2A peptides is from the most efficient P2A, followed by T2A, E2A and F2A [7].
Regarding our objective, the 2A system has been used before in biotechnology companies, and therefore will be easy to implement. It is an easy solution if the fusion of the target protein in some way hinders its catalytic activity. However, the cleavage leaves a 'left-over' of about 20-22 amino acids (depending on the kind of 2A used) on the N-terminal protein (in our case, the target gene), making it a bit less appealing if the purpose is to manufacture and purify the protein itself.
We plan to integrate the 2A into our design, and allow users to choose the 2A that suits their needs based on data collected from the literature.
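The left-over described above can be illustrated with a short sketch. It assumes the commonly cited amino-acid sequences of the four 2A peptides and that ribosome skipping occurs between the final Gly and Pro of the 2A motif; both should be verified against [5,6] before use. The function name `two_a_products` and the two-residue toy proteins are hypothetical.

```python
# Commonly cited 2A peptide sequences (assumption: verify against the literature).
TWO_A_PEPTIDES = {
    "P2A": "ATNFSLLKQAGDVEENPGP",
    "T2A": "EGRGSLLTCGDVEENPGP",
    "E2A": "QCTNYALLKLAGDVESNPGP",
    "F2A": "VKQTLNFDLLKLAGDVESNPGP",
}

# Conventional efficiency ranking reported in [7], most efficient first.
EFFICIENCY_RANK = ["P2A", "T2A", "E2A", "F2A"]


def two_a_products(target: str, essential: str, kind: str = "P2A", gsg: bool = True):
    """Return the (upstream, downstream) proteins produced from target-2A-essential.

    The peptide bond between the last Gly and Pro of the 2A peptide is skipped,
    so the upstream protein keeps the linker minus its final Pro, and the
    downstream protein starts with that Pro.
    """
    linker = ("GSG" if gsg else "") + TWO_A_PEPTIDES[kind]
    upstream = target + linker[:-1]
    downstream = linker[-1] + essential
    return upstream, downstream


if __name__ == "__main__":
    for kind in EFFICIENCY_RANK:
        up, down = two_a_products("MK", "MT", kind)
        # Number of extra residues left on the target protein.
        print(kind, len(up) - len("MK"), up, down)
```

The left-over length printed for each peptide depends on whether the optional GSG linker is included, which is one of the trade-offs a user would weigh when the target protein must be purified.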
Ribosomal Frameshifts Using Pseudoknots (PKs)
Ribosomes typically translate mRNA without shifting the translational reading frame. However, viruses have evolved mechanisms to cause site-specific or programmed frameshifting of the ribosome in either +1 or −1 direction. This ribosomal frameshift is facilitated by RNA structures called pseudoknots [8].
The frameshift occurs in a fixed percentage of translation events (the percentage differs for each pseudoknot, but all fall in the range of 1%-10%) and correlates with the strength of the pseudoknot structure [9]. In those events, the ribosome slides back one nucleotide and changes the reading frame, thus altering the entire amino acid sequence downstream of the slip site.
The pseudoknot-induced frameshift system can be applied to express two versions of the protein from one DNA sequence. In the main translational frame, only the target protein will be translated, possibly with a leader sequence, allowing for extracellular extraction; in the other frame, the target and the essential protein will be translated as a fusion.
This method will help significantly in upregulating the expression level of the target gene. It leaves the smallest left-over, with a possibility of no left-over at all, which enables obtaining a purified extracellular protein. This feature makes the system appealing to therapeutics and pharmaceutical companies.
In the future, we hope to offer a more sophisticated recommendation engine that will allow us to match the PK such that its left-over will fit the C-terminus of the target protein.
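The dual-frame idea can be shown with a toy translation routine. This is a simplified sketch, not a model of the real mechanism: it assumes the ribosome steps back exactly one nucleotide at a given index and ignores the slippery site and pseudoknot structure. The sequence, the `slip_at` parameter and the 5% efficiency are invented for illustration; only the 1%-10% range comes from the text above.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq: str, slip_at=None) -> str:
    """Translate a coding sequence until the first stop codon.

    If slip_at is given, the ribosome moves back one nucleotide when it
    reaches that index (a -1 frameshift) and continues in the new frame.
    """
    protein = []
    pos = 0
    while pos + 3 <= len(seq):
        if slip_at is not None and pos == slip_at:
            pos -= 1          # slide back one nucleotide
            slip_at = None    # the shift happens only once
        amino_acid = CODON_TABLE[seq[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
        pos += 3
    return "".join(protein)


def expected_counts(events: int, frameshift_efficiency: float):
    """Split translation events into (main-frame, frame-shifted) products."""
    shifted = round(events * frameshift_efficiency)
    return events - shifted, shifted


if __name__ == "__main__":
    toy_sequence = "ATGGCTAAATAAGGTTGTAA"  # invented example sequence
    print("main frame:   ", translate(toy_sequence))
    print("frame-shifted:", translate(toy_sequence, slip_at=9))
    print("products per 1000 events:", expected_counts(1000, 0.05))
```

In the toy sequence, the main frame ends at a stop codon directly after the short product, while the shifted frame reads through that position and yields a longer product, which mirrors the target-only versus target-essential fusion described above.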
In-vivo proteolytic cleavage sites between genes
By using in-vivo protease cleavage sites as linkers, one can separate the two joined proteins without any side effect or left-over [10,11]. While highly appealing, the site-specificity and the low number of known proteases are limiting, and thus this approach will only work for a small fraction of cases.
We hope to integrate this linker in future versions of our software.
The "Super Linker": Signal peptides (SP) combined with PK ribosomal frameshifts
In this innovative linker, we are trying to combine two biological processes to our advantage: the signal peptide and the pseudoknot.
A signal peptide is a short sequence of amino acids, usually located at the N-terminus of the protein, but it can also be found at the C-terminus. Tunnel proteins embedded in the cell membrane can recognize this sequence. After recognition, the protein can be sent to any organelle in the cell, or outside of the cell completely. During secretion, the tunnel protein cleaves the signal peptide. A variety of signal peptides have already been characterized, and many are currently used in biotechnology manufacturing.
As explained above, the pseudoknot is an RNA secondary structure that evolved during millions of years of evolution in viruses. This structure grants viruses the ability of dual coding: encoding more than one protein in a single DNA sequence, by inducing a frameshift that alters the reading frame and produces a new amino-acid sequence.
Because signal peptides contain relatively flexible recognition sites for the tunnel protein, and a pseudoknot is simply an RNA secondary structure, we can quite easily create a linker sequence that serves both as a pseudoknot and a signal peptide at the same time. Then:
- Under the main reading frame, there will be a stop codon right after the pseudoknot/signal peptide sequence. Thus, most of the time we will get an amino acid sequence of the target gene with the signal peptide exposed and recognizable, allowing for the secretion of the target gene, and for the cleavage of the linker in the process.
- When frame-shifted, the ribosome will translate a fused protein composed of the target and essential genes that will stay in the cell.
This linkage method is complicated and requires bioinformatic models to understand how to construct the signal peptide along with the PK. We hope to integrate this linker in a future version of our software.
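As a first sanity check for such a design, the two reading-frame rules listed above can be tested directly on a candidate DNA sequence. The sketch below is a hypothetical helper, not part of our software. It assumes a single -1 slip at a known index, and it does not evaluate whether the linker actually folds into a pseudoknot or is recognized as a signal peptide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop(seq: str, start: int):
    """Return the index of the first stop codon read in frame from start, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def check_dual_frame(construct: str, linker_end: int, slip_at: int):
    """Check both reading-frame rules for a target-linker-essential construct.

    construct  : DNA from the start codon to the end of the essential gene
    linker_end : index of the first nucleotide after the linker (main frame)
    slip_at    : index at which the ribosome is assumed to step back one nucleotide

    Returns (main_ok, shifted_ok):
      main_ok    - the first main-frame stop codon sits right after the linker
      shifted_ok - the shifted frame has no stop before the final codon
    """
    main_ok = first_stop(construct, 0) == linker_end
    shifted_stop = first_stop(construct, slip_at - 1)
    shifted_ok = shifted_stop is None or shifted_stop >= len(construct) - 3
    return main_ok, shifted_ok


if __name__ == "__main__":
    good = "ATGGCTAAATAAGGTTGTAA"  # invented toy construct
    bad = "ATGGCTAAATAAGGTGATAA"   # same, with a premature stop in the shifted frame
    print(check_dual_frame(good, linker_end=9, slip_at=9))
    print(check_dual_frame(bad, linker_end=9, slip_at=9))
```

A candidate linker that fails either check can be discarded before any structural modelling is attempted.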
Using plasmids for a better implementation of our solution
In a future version of our product, we would like to address the integration of plasmids into a host organism. In our vision, when a user of our software enters their target gene, the purpose of insertion and the host organism they wish to work with, the software will output three sequences:
- The target gene with the suitable linker and an essential gene (all optimized by our software) within a high-copy-number plasmid.
- A DNA fragment containing a selection marker (according to the user's selection) with homology to the flanking region of the essential gene, adjacent to its 5′ end. This way, the essential gene is deleted from the genome, reducing evolutionary instability by removing unstable genetic elements from the host genome.
- A guide RNA (sgRNA) that directs the CRISPR-associated endonuclease Cas9 to the essential gene in the genome (but not to the optimized essential gene in the plasmid). This will allow robust transformation even if the organism is characterized by transformation difficulty (for example, mammalian cells).
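The requirement that the sgRNA targets the genomic essential gene but not the optimized copy on the plasmid can be sketched as a simple sequence filter. This example assumes the Cas9 variant recognizes a 20-nucleotide protospacer followed by an NGG PAM. It scans only the forward strand of the genomic gene and uses exact matching, so it ignores mismatch tolerance and off-targets elsewhere in the genome. The function names and toy sequences are hypothetical.

```python
import re


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def genome_only_guides(genomic_gene: str, plasmid_gene: str):
    """List 20-nt protospacers found in the genomic gene but not in the plasmid copy.

    A site is a 20-nt protospacer followed by an NGG PAM. A guide is kept only
    if no protospacer+PAM match exists on either strand of the plasmid copy.
    """
    plasmid_strands = (plasmid_gene, reverse_complement(plasmid_gene))
    guides = []
    # The lookahead lets overlapping candidate sites be reported.
    for match in re.finditer(r"(?=([ACGT]{20})[ACGT]GG)", genomic_gene):
        protospacer = match.group(1)
        site = re.compile(protospacer + "[ACGT]GG")
        if not any(site.search(strand) for strand in plasmid_strands):
            guides.append(protospacer)
    return guides


if __name__ == "__main__":
    genomic = "ACGTACGTACGTACGTACGTAGG"   # toy genomic fragment
    recoded = "ACGTACGTACGTACGTACGAAGG"   # toy plasmid copy with one base changed
    print(genome_only_guides(genomic, recoded))
```

In practice, recoding the plasmid copy with synonymous codons is what creates the sequence difference that this filter relies on.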
Design
As part of the synthetic-biology community, we understand the importance of following engineering principles in order to turn our idea from a vision into reality. In order to create a valid, reliable solution, we decided upon a modular design that addresses all the questions above.
This design is composed of the following elements:
Model
- Creation of a model that predicts the genetic stability of all genes in yeast with respect to GFP or RFP as target genes. This model will help us rank the stability of conjugated genes for a given target gene in our final software. Developing such a model will involve thorough characterization of all the genes in yeast. Initially, the model will train on empirical data from the SWAT database as a stability measure, but later it will receive the results of our own Gene-SEQ large-scale experiment.
- Development of another model, which will simulate the spatial disturbance that a fusion linker causes to its attached genes, utilizing known tools that calculate the disordered residues of proteins. This way, we can evaluate different fusion linkers, quantify their influence, and select the best one, considering the context of the target and essential genes.
- Building an optimization model that maintains high expression levels while increasing the stability of the mutual construct.
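One simple way to picture how the stability and expression models could be combined is a filter-then-rank step. This is only a sketch under stated assumptions: the scores, the 0.8 expression threshold, the gene names and the `Candidate` class are all invented for illustration and do not describe our actual optimization model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str          # name of the candidate conjugated gene (placeholder)
    stability: float   # predicted stability score, higher is better (hypothetical)
    expression: float  # predicted expression relative to the unfused target (hypothetical)


def rank_candidates(candidates, min_expression: float = 0.8):
    """Drop candidates below the expression threshold, then sort by stability."""
    kept = [c for c in candidates if c.expression >= min_expression]
    return sorted(kept, key=lambda c: c.stability, reverse=True)


if __name__ == "__main__":
    toy_candidates = [
        Candidate("geneA", stability=0.9, expression=0.5),
        Candidate("geneB", stability=0.7, expression=0.95),
        Candidate("geneC", stability=0.8, expression=0.85),
    ]
    for candidate in rank_candidates(toy_candidates):
        print(candidate.gene, candidate.stability, candidate.expression)
```

A hard threshold keeps the trade-off explicit; a weighted score would be an alternative if a small loss in expression is acceptable in exchange for a large gain in stability.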
Software
Development of user-friendly software that combines and applies all the modules for a newly given target gene. This software generates a customized construct for each target gene, composed of the best matching essential gene and linker for improved stability and higher expression.
Measurement
Creation of a novel technique for large-scale measurement of genetic stability called "Gene-SEQ": stability enhancing quantifier.
- Use the Gene-SEQ libraries to prepare two co-cultures, GFP culture and RFP culture. Each culture contains thousands of constructs which are all the library’s variants.
- Operate a state-of-the-art robotic platform called Chi.Bio for the evolution experiment, which enables the user to grow these cultures for many generations and gather data.
- Analyze the mutational footprints of each target-essential construct by Deep Sequencing, in order to test our hypothesis and eventually predict the evolutionary preservation of the target gene.
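Once sequencing data from several time points are available, the evolutionary half-life of a construct could be estimated along the following lines. This sketch assumes that, for each construct, the fraction of reads carrying an intact target gene has already been computed per time point, and it takes the half-life to be the generation at which that fraction first drops below 50%, found by linear interpolation. The threshold, function name and toy numbers are assumptions and not our established analysis pipeline.

```python
def evolutionary_half_life(generations, fraction_intact, threshold: float = 0.5):
    """Estimate the generation at which the intact fraction crosses the threshold.

    generations     : increasing list of sampled generation numbers
    fraction_intact : fraction of the population with an intact target gene
    Returns the interpolated generation, or None if the threshold is never crossed.
    """
    points = list(zip(generations, fraction_intact))
    for (g0, f0), (g1, f1) in zip(points, points[1:]):
        if f0 >= threshold > f1:
            # Linear interpolation between the two neighbouring samples.
            return g0 + (f0 - threshold) * (g1 - g0) / (f0 - f1)
    return None


if __name__ == "__main__":
    sampled_generations = [0, 100, 200]       # toy sampling schedule
    intact_fractions = [1.0, 0.75, 0.25]      # toy measurements
    print(evolutionary_half_life(sampled_generations, intact_fractions))
```

Comparing this value across constructs that share a target gene but differ in the conjugated gene gives one quantifiable measure of how much each essential gene prolongs stability.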
Proof of Concept (POC)
Provide a preliminary empirical POC, showing that the fusion of a target gene to an essential gene would likely prolong its evolutionary half-life. In addition, demonstrate that different essential genes provide varying stability levels for the target-essential construct.
Integrated Human practice:
Throughout our project, we received feedback from our Human Practice engagements that further improved our design. The discussions we had with companies and other iGEM teams emphasized the need for a generalized solution that takes into account the expression level. Moreover, academic experts and industry executives advised us on how to characterize genes, which linkers to use, and how to approach our main users.
References
[1] Chen, X., Zaro, J. L., & Shen, W. C. (2013). Fusion protein linkers: property, design and functionality. Advanced drug delivery reviews, 65(10), 1357-1369.
[2] George, R. A., & Heringa, J. (2002). An analysis of protein domain linkers: their classification and role in protein folding. Protein Engineering, Design and Selection, 15(11), 871-879.
[3] Crasto, C. J., & Feng, J. A. (2000). LINKER: a program to generate linker sequences for fusion proteins. Protein engineering, 13(5), 309-312.
[4] Liu, C., Chin, J. X., & Lee, D. Y. (2015). SynLinker: an integrated system for designing linkers and synthetic fusion proteins. Bioinformatics, 31(22), 3700-3702.
[5] Liu, Z., Chen, O., Wall, J. B. J., Zheng, M., Zhou, Y., Wang, L., ... & Liu, J. (2017). Systematic comparison of 2A peptides for cloning multi-genes in a polycistronic vector. Scientific reports, 7(1), 1-9.
[6] Szymczak-Workman, A. L., Vignali, K. M., & Vignali, D. A. (2012). Design and construction of 2A peptide-linked multicistronic vectors. Cold Spring Harbor Protocols, 2012(2), pdb-ip067876.
[7] Kim, J. H., Lee, S. R., Li, L. H., Park, H. J., Park, J. H., Lee, K. Y., ... & Choi, S. Y. (2011). High cleavage efficiency of a 2A peptide derived from porcine teschovirus-1 in human cell lines, zebrafish and mice. PloS one, 6(4), e18556.
[8] Staple, D. W., & Butcher, S. E. (2005). Pseudoknots: RNA structures with diverse functions. PLoS Biol, 3(6), e213.
[9] Hansen, T. M., Reihani, S. N. S., Oddershede, L. B., & Sørensen, M. A. (2007). Correlation between mechanical strength of messenger RNA pseudoknots and ribosomal frameshifting. Proceedings of the National Academy of Sciences, 104(14), 5830-5835.
[10] Volkmann, G., Volkmann, V., & Liu, X. Q. (2012). Site-specific protein cleavage in vivo by an intein-derived protease. FEBS letters, 586(1), 79-84.
[11] Pacini, L., Vitelli, A., Filocamo, G., Bartholomew, L., Brunetti, M., Tramontano, A., ... & Migliaccio, G. (2000). In vivo selection of protease cleavage sites by using chimeric Sindbis virus libraries. Journal of virology, 74(22), 10563-10570.