Our project builds on CAMEOS, the software developed by Blazejewski et al., Science 2019, DOI: 10.1126/science.aav5477.
Our team first improved the CAMEOS workflow with a Bash script named bash.sh, which makes it easier to design entanglements of our own proteins.
The script takes a Multiple Sequence Alignment (MSA) as input and generates from it all the inputs that the CAMEOS algorithm requires.
This greatly reduced the computational work required to design entanglements, as we describe on the Engineering page.
We then developed a Python script named pareto.py, which adapts Pareto optimization to find the optimal entanglements after a CAMEOS run.
This script takes the outputs of CAMEOS as input and displays a graph of the Pareto-optimal sequences (see the graph below). It also outputs the indices of the Pareto-optimal sequences.
CAMEOS is written in Julia and returns Julia data structures as output.
Consequently, examining these structures requires Julia code.
We therefore wrote a third script, extract_data.jl, which is meant to be used right after pareto.py.
It takes the indices of the Pareto-optimal sequences as arguments and returns the optimal entangled coding sequences together with the corresponding protein sequences.
A complete explanation of how to use these two scripts can also be found in the README.md and in the CAMEOS Course.
We provide the three scripts in a zip file, together with a README.md file that serves as their documentation and gives usage instructions for each script.
- bash.sh - Bash script
- pareto.py - Python script
- extract_data.jl - Julia script
- README.md - Markdown Documentation
- LICENSE.txt - MIT License
Our scripts are open source and free to use under the MIT License.
Pareto optimization
Our selection technique is based on Pareto optimization or multiobjective optimization.
The figure below summarizes how Pareto optimization applies to entangled sequences.
We have two objectives to optimize: we want to select the sequences that minimize the divergence from both Protein A and Protein B.
In this case, the sequences that appear on the graph in yellow are said to be Pareto-optimal. They constitute a Pareto front.
A sequence is dominated when another sequence scores at least as well on both objectives and strictly better on at least one of them. Pareto-optimal sequences are those that no other sequence dominates.
Intuitively, for each sequence that does not belong to the Pareto front, there is another sequence on the front that resembles both Protein A and Protein B at least as well, and at least one of them more closely, which makes it a better choice.
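
To make the dominance rule concrete, here is a minimal Python sketch of how a Pareto front can be computed for two objectives that are both minimized. It is an illustration only and not necessarily how pareto.py is implemented: the function name `pareto_front`, the `(divergence_a, divergence_b)` tuple layout, and the example scores are all assumptions made for this sketch.

```python
def pareto_front(scores):
    """Return the indices of the Pareto-optimal points.

    `scores` is a list of (divergence_a, divergence_b) pairs, one per
    candidate sequence (hypothetical layout). Lower is better on both axes.
    """
    front = []
    for i, (a_i, b_i) in enumerate(scores):
        # Point i is dominated if some other point j is at least as good
        # on both objectives and strictly better on at least one.
        dominated = any(
            a_j <= a_i and b_j <= b_i and (a_j < a_i or b_j < b_i)
            for j, (a_j, b_j) in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Made-up divergence scores for five candidate sequences (illustrative only).
example_scores = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 5.0)]
example_front = pareto_front(example_scores)
print(example_front)  # indices of the non-dominated sequences
```

With these made-up scores, the sequence at index 2 is dominated by the one at index 1, and the sequence at index 4 is dominated by the one at index 0, so neither is on the front. This pairwise comparison is quadratic in the number of sequences, which is simple to read but may be slow for very large CAMEOS outputs.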