Engineering

ENGINEERING SUCCESS

The following cycle chart describes our procedure for engineering new entangled sequences. To run CAMEOS, we need two gene sequences to entangle as input, as well as a protein Multiple Sequence Alignment (MSA) for each of the two sequences. CAMEOS then randomly generates hundreds of sequences for this specific entanglement, which we call entanglement variants. We select optimal variants in the hope of making new BioBricks that are useful to the iGEM community. The failures we encounter when entangling sequences, selecting variants, and running lab experiments help us improve our methodology for the next sequence entanglements.

We successfully generated CAMEOS entanglements and selected optimal variants with Pareto optimization for the following pairs of sequences, among others:

luc x gfp [FASTA]	knt¹ x ccdB [FASTA]	knt² x gfp [FASTA]
luc x knt¹ [FASTA]	knt² x ccdB [FASTA]	luc x ccdB [FASTA]

Yarrowia lipolytica entanglements:

hph x nat [FASTA]	hph x Turquoise [FASTA]	RedStar x nat [FASTA]
hph x RedStar [FASTA]	hph x yfp [FASTA]	Turquoise x nat [FASTA]

The links in the table above should redirect you to a FASTA file containing the optimal entangled sequences.
Additionally, here is a ZIP file containing plots of CAMEOS' scores annotated with Pareto optimal sequences for all entanglements.
Plots and optimal sequences were computed from CAMEOS outputs using our scripts pareto.py and extract_data.jl.

Look for ideas
(Research & Imagine)

We were first inspired by the ideas proposed by Blazejewski et al. in their article. These include entangling a gene of interest with a toxin gene, to contain it within a chassis that provides the antitoxin, and entangling a biosynthetic gene with an essential gene, to restrict the accumulation of mutations.

We also considered reproducing the entangled sequences of bacteriophage phiX174, one of the few natural examples of the kind of entangled sequences that CAMEOS generates.

To keep the lab experiments feasible within iGEM, we finally opted for genes that mostly encode reporter proteins and antibiotic resistance proteins.

Design (Design)

For the first set of entanglements, the coding and protein sequences were retrieved from UniProt. Our sequences include luc, encoding the firefly luciferase; two knt sequences, encoding kanamycin resistance; ccdB, encoding the bacterial toxin CcdB; and gfp, which strictly speaking should be called sfgfp, since it encodes the superfolder Green Fluorescent Protein.

For the second set of entanglements, the sequences were provided to us by Tristan Rossignol (a researcher at INRAE in Jouy-en-Josas) as part of a Human Practices collaboration aiming to entangle a sequence in the yeast Yarrowia lipolytica. They include hph, conferring hygromycin resistance; nat, encoding streptothricin acetyltransferase; and various colored GFP variants.

All sequences used in these entanglements are provided in the proteins.fasta and cds.fasta files.
From these same sequences, we ran BLAST alignments to retrieve the protein MSAs given as input to CAMEOS.

Entangle (Build)

Using the protein and coding sequences of the two genes to entangle, and the protein MSA retrieved for each gene, CAMEOS generated sets of 300 variants.

We used our bash script to generate the CAMEOS inputs from the MSAs, which sped up the process of entangling our own sequences.

All of the entanglements presented here follow this format: aaa...aaBBB...BBBaa...aaa, where sequence a is in frame 1 and sequence B is in frame 2 or 3.

In addition, we always picked a short and a long sequence to entangle: the short sequence ends up embedded inside the long one.
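
The nested layout described above can be illustrated with a minimal Python sketch. The toy sequence, the offset of the inner sequence, and the function names are hypothetical and chosen only for illustration; they are not CAMEOS outputs, and this is not how CAMEOS itself represents entanglements.

```python
def inner_frame(inner_start):
    """Reading frame (1, 2 or 3) of the inner sequence relative to the outer one.

    inner_start is the 0-based index of the inner start codon in the outer sequence.
    """
    return inner_start % 3 + 1


def split_entangled(outer, inner_start, inner_len):
    """Split an entangled sequence into the aaa / BBB / aaa parts of the layout."""
    inner_end = inner_start + inner_len
    if inner_frame(inner_start) == 1:
        raise ValueError("inner sequence must be in frame 2 or 3")
    if inner_len % 3 != 0 or inner_end > len(outer):
        raise ValueError("inner sequence must be whole codons inside the outer one")
    return outer[:inner_start], outer[inner_start:inner_end], outer[inner_end:]


# Toy example (hypothetical): outer frame 1 reads ATG CAT GAA CTG ACC TAA,
# while the embedded sequence, starting at index 4, reads ATG AAC TGA.
outer = "ATGCATGAACTGACCTAA"
print(inner_frame(4))
print(split_entangled(outer, 4, 9))
```

The same stretch of DNA is read in two different frames, which is why a mutation inside the BBB region generally affects both proteins at once.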

Select variants (Test)

Optimal entangled sequences were selected with Pareto optimization using our scripts, as described on the Software page and in the CAMEOS Course.

This step is important: we want to eliminate all suboptimal variants and then check, using BLASTp identity scores, whether our optimal variants remain close to the original proteins.
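
As a rough illustration of what an identity score measures, here is a minimal sketch that computes percent identity between two sequences that are already aligned (same length, with "-" marking gaps). It is a simplification and not a replacement for BLASTp, which also computes the alignment itself; the function name and the toy sequences are hypothetical. The sketch assumes the convention of dividing the number of identical positions by the alignment length.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns holding the same residue in both sequences.

    Assumes both inputs come from the same pairwise alignment, so they have
    equal length and use '-' for gaps. Gap columns never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# Toy aligned fragments (hypothetical): 4 identical columns out of 6.
print(round(percent_identity("MKT-LV", "MKSALV"), 1))
```

A variant would then be kept for further inspection only if its score against each original protein stays above a threshold chosen by the user.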

Lab experiments (Test)

Some of these entanglements, colored in red, were used for lab experiments.

Among these, we selected the sequences whose BLASTp identity scores were high enough to be considered for the lab. Unfortunately, most of these lab experiments did not turn out well.

We also used Phyre² to model some of the entangled proteins. Here are two examples of .pdb files obtained by running Phyre² in intensive mode on gfp and knt² for one of our BioBricks: knt.pdb and gfp.pdb.

Note: the knt-gfp BioBrick did not come from the FASTA file presented above as knt² x gfp, but from a previous knt² x gfp entanglement.

Improve chances of success
(Learn & Improve)

More often than not, we did not perform any lab experiments, but we still found improvements to make to our entanglement and selection process.

→ Selecting sequences with Phylogeny analysis and BLASTp

It took us a long time to become familiar with CAMEOS. At first, we arbitrarily selected a few sequences and ran them through a phylogenetic analysis to find the ones that most closely resembled the original proteins.

However, we found their BLASTp scores to be low, which makes sense given that we had not selected the best entangled sequences of the lot.

→ Selecting sequences with Pareto optimization of CAMEOS' scores

Then we discovered CAMEOS' own scoring system, which already includes scores ranking the entangled sequences by resemblance to the first and second protein sequences. Wanting to go further with these scores, we developed a script that uses Pareto optimization to select the optimal entangled sequences.

→ Taking three-dimensional structure into consideration

Fearing that our entangled proteins would end up with dysfunctional three-dimensional structures, we also decided to include structural analysis in the selection process.

In particular, this led us to use Phyre² as explained above, and also to check whether our protein sequences carry mutations at important binding sites, some of which are described on the protein's UniProt page.

In many cases, the entanglement variants actually had mutations at very important positions, which reduced their likelihood of being functional in the lab.
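
The check for mutations at important positions can be sketched as follows. This is a minimal illustration, assuming the variant protein has the same length as the original (no insertions or deletions) and that the positions of interest are given as 1-based residue numbers; the function name, sequences, and positions below are hypothetical.

```python
def mutated_sites(original, variant, sites):
    """Return {position: (original_residue, variant_residue)} for changed sites.

    Positions are 1-based. Assumes both proteins have the same length, so
    position i refers to the same residue in each sequence.
    """
    if len(original) != len(variant):
        raise ValueError("this sketch only handles sequences of equal length")
    changed = {}
    for pos in sites:
        a, b = original[pos - 1], variant[pos - 1]
        if a != b:
            changed[pos] = (a, b)
    return changed


# Toy example (hypothetical sequences and sites of interest).
print(mutated_sites("MKTAYIAK", "MKSAYIAR", [3, 5, 8]))
```

If the variant contains insertions or deletions, the two proteins would first need to be aligned so that positions can be mapped between them.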

→ Checking Open Reading Frames and Ribosome Binding Sites

For our BioBrick knt² x gfp, there was an ATG codon upstream of the second coding sequence (the one that ends up in the middle of the other sequence) with a better RBS. This is a major problem: we do not want the ribosome to translate the wrong reading frame. It can be avoided by replacing such start codons or by introducing synonymous mutations, and it underlines how important it is to check all frames for ORFs with a tool such as ORF Finder.
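
A basic version of this check can be sketched in Python. The function below is a simplified stand-in for a dedicated ORF tool: it scans only the three forward frames, considers only ATG as a start codon, and does not evaluate ribosome binding sites. The function name, the `min_codons` parameter, and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """List (frame, start, end) for ATG...stop ORFs in the three forward frames.

    frame is 1, 2 or 3; start and end are 0-based, with end exclusive and
    including the stop codon. min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i  # first in-frame start codon of a candidate ORF
            elif codon in STOP_CODONS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame + 1, start, i + 3))
                start = None
    return orfs


# Toy entangled sequence (hypothetical): one ORF in frame 1, one in frame 2.
print(find_orfs("ATGCATGAACTGACCTAA"))
```

Any ORF reported in a frame where none is expected, or starting upstream of the intended start codon, is a candidate for removal by synonymous mutation.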

→ Find out which MSAs result in the best entanglements

Finally, in most cases, the main obstacle to trusting the entangled sequences as good BioBricks was their BLASTp identity scores, which are especially low for the second coding sequence: it very rarely reaches 70% identity.

For example, in the first optimal sequence of the hph x nat entanglements provided above, the hph sequence has an identity score of about 92% with hygromycin phosphotransferase, while the nat sequence only has an identity score of about 42% with N-acetyltransferase.

To remedy this problem, we hypothesized that the BLASTp identity scores obtained depended purely on the quality of the protein MSAs given to CAMEOS, and we decided to run the following investigation.

→ Case study: Comparing MSAs from several genomic databases

For the pair knt¹ and ccdB, we ran more than one CAMEOS execution. Since CAMEOS depends on the MSAs given as input, we selected 5 MSAs from different databases (BLAST, UniProt, etc.) for each of knt¹ and ccdB and generated a total of 25 entanglements, one for each MSA combination.
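
The 5 x 5 design can be enumerated with a short Python sketch. The MSA labels below are hypothetical placeholders, since only some of the source databases are named here; each pair would correspond to one CAMEOS run.

```python
from itertools import product

# Hypothetical placeholder labels for the 5 MSAs of each gene.
knt_msas = [f"knt1_msa_{i}" for i in range(1, 6)]
ccdb_msas = [f"ccdB_msa_{i}" for i in range(1, 6)]

# One entanglement run per (knt1 MSA, ccdB MSA) pair.
runs = list(product(knt_msas, ccdb_msas))

for knt_msa, ccdb_msa in runs[:3]:
    print(knt_msa, "x", ccdb_msa)
print(len(runs), "combinations")
```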

The entanglements in question are not provided here, but their analysis and comparison led to interesting results, which are presented in detail in the CAMEOS Case Study.

In particular, we generated HMM consensus sequences for each MSA. The results suggest that the closer the HMM consensus sequence is to the original protein sequence, the better the CAMEOS scores for the outer sequences are. This means that including distant homologues in MSAs to add diversity to our HMMs is not always the right choice.

To learn more about how CAMEOS uses HMMs and MRFs generated from the protein MSAs given as input, see the Model description.

In conclusion, we engineered a large number of CAMEOS entanglements and provided an engineering methodology for sequence entanglement software. First, we carefully choose the sequences to entangle with CAMEOS and their corresponding MSAs. Second, we select the entanglement variants using Pareto optimization of CAMEOS' scores, BLASTp identity scores, and structural analysis tools. If the optimal sequences are not acceptable, we look for possible improvements to the chosen sequences and MSAs. Otherwise, we select some entangled sequences for the lab, and the observations made there give us insights for future sequence entanglements. Further details can be found in the CAMEOS tutorial course.

Pareto optimization

Our selection technique is based on Pareto optimization, also known as multi-objective optimization.
In the figure below, we explain in a nutshell how Pareto optimization applies to entangled sequences.
We have two objectives to optimize: we want to select the sequences that minimize the divergence from both Protein A and Protein B.
In this case, the sequences shown in yellow on the graph are said to be Pareto-optimal. Together they constitute the Pareto front.

A sequence is Pareto-optimal if it is not dominated by any other sequence, that is, if no other sequence scores at least as well on both objectives and strictly better on at least one.
Intuitively, for each sequence that does not belong to the Pareto front, there is a sequence on the front that resembles both Protein A and Protein B at least as well and is strictly better for at least one of them, making it a better choice.
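
The definition above can be turned into a short Python sketch. This is an illustration of the principle, not our pareto.py script: the function names and the toy divergence values are hypothetical, and the sketch assumes that lower values are better for both objectives.

```python
def dominates(p, q):
    """True if p is at least as good as q on every objective and strictly
    better on at least one (lower divergence is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(scores):
    """Names of the variants that no other variant dominates, sorted by name.

    scores maps a variant name to (divergence from A, divergence from B).
    """
    return sorted(
        name
        for name, p in scores.items()
        if not any(dominates(q, p) for q in scores.values())
    )


# Toy scores (hypothetical): v3 is dominated by v2, the others are not.
variants = {
    "v1": (1.0, 5.0),
    "v2": (2.0, 3.0),
    "v3": (3.0, 4.0),
    "v4": (4.0, 1.0),
}
print(pareto_front(variants))  # ['v1', 'v2', 'v4']
```

This pairwise comparison is quadratic in the number of variants, which is more than fast enough for sets of a few hundred sequences.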

Team:GO Paris-Saclay/Engineering
