Software

Our project builds on CAMEOS, the software developed by Blazejewski et al. (Science, 2019; DOI: 10.1126/science.aav5477).

Our team first improved the CAMEOS workflow with a Bash script named bash.sh, which makes it easier to design entanglements of our own proteins.

The script takes a Multiple Sequence Alignment (MSA) as input and generates from it all the inputs the CAMEOS algorithm requires.

This greatly reduced the computational work required to design entanglements, as we describe on the Engineering page.

The CAMEOS algorithm is explained in detail on the Model page, and a complete explanation of the script can be found in the second part of the CAMEOS Course.

We then developed a Python script named pareto.py, which adapts Pareto optimization to find the optimal entanglements after a CAMEOS run.

This script takes the outputs of CAMEOS as input and displays a graph of the Pareto-optimal sequences (see the graph below). It also outputs the indices of the Pareto-optimal sequences.

CAMEOS is written in Julia and returns Julia data structures as output.
Consequently, examining these structures requires Julia code.

We wrote a third script, extract_data.jl, which is meant to be run right after pareto.py.

It takes the indices of the Pareto-optimal sequences as arguments and returns the optimal entangled coding sequences together with the corresponding protein sequences.

A complete explanation of how to use these two scripts can also be found in the README.md and in the CAMEOS Course.

We provide the aforementioned scripts in a zip file, along with a README.md file that documents all three scripts and gives instructions for using them.

bash.sh - Bash script
pareto.py - Python script
extract_data.jl - Julia script
README.md - Markdown Documentation
LICENSE.txt - MIT License

Our scripts are open source and free to use under the MIT License.

Pareto optimization

Our selection technique is based on Pareto optimization, also known as multi-objective optimization.
The figure below summarizes how Pareto optimization applies to entangled sequences.
There are two objectives to optimize: we want to select the sequences that minimize the divergence from both Protein A and Protein B.
The sequences shown in yellow on the graph are said to be Pareto-optimal; together they constitute the Pareto front.

A sequence is Pareto-optimal if it is not dominated by any other sequence, that is, if no other sequence is at least as close to both Protein A and Protein B and strictly closer to at least one of them.
Intuitively, for each sequence that does not belong to the Pareto front, there is a sequence on the front that scores at least as well in resemblance to both Protein A and Protein B and strictly better for at least one of them, which makes it a better choice.
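
The dominance rule described above can be illustrated with a minimal Python sketch. This is not the code of pareto.py: the function name pareto_front, the variable names and the example scores are hypothetical and chosen only for illustration. It assumes each sequence is summarized by two numbers to minimize, its divergence from Protein A and its divergence from Protein B.

```python
from typing import List, Sequence, Tuple


def pareto_front(scores: Sequence[Tuple[float, float]]) -> List[int]:
    """Return the indices of the non-dominated points (both scores minimized).

    Point j dominates point i if j is at least as good on both objectives
    and strictly better on at least one of them.
    """
    front = []
    for i, (a_i, b_i) in enumerate(scores):
        dominated = any(
            a_j <= a_i and b_j <= b_i and (a_j < a_i or b_j < b_i)
            for j, (a_j, b_j) in enumerate(scores)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical (divergence from Protein A, divergence from Protein B) pairs;
# these values are made up and do not come from a CAMEOS run.
example_scores = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (4.0, 4.0)]
front_indices = pareto_front(example_scores)
print(front_indices)  # [0, 1, 3]
```

In this toy data set, sequences 2 and 4 are both dominated by sequence 1, which is closer to Protein A and to Protein B at the same time. The pairwise comparison is quadratic in the number of sequences, which keeps the sketch short; a sort-based approach would scale better for large runs.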

Team:GO Paris-Saclay/Software
