Team:GO Paris-Saclay/Course/Part 3



Analyzing the results

Once done with the CAMEOS execution, we are left with one issue. Which entangled sequence from the 100 or so we have generated should we select? In order to answer this, many tools are at our disposal.

We will be looking into each one of them and try to see what they can bring to the table. The original idea was the following: if a sequence manages to survive through many selection criteria, we can objectively declare it to be suitable to the iGEM lab.

In practice though I just quickly select the optimal sequences with the scripts I created and pray for the BLAST identity scores to not be terrible. They usually are, I curse, and I assume the problem comes from the MSA file (see Part 2 and Case study), meaning I have to start all over again! Anyway, let's delve into it!

Selecting the top results from CAMEOS

In fact, CAMEOS uses its own scoring system for deciding which entangled sequences are the best. Remember, this is the one it uses in order to decide which sequences are the top hits in the file top_twelve_barcode.fa.

Let's simply call gene A and gene B the two genes we are trying to entangle with CAMEOS. For each entangled sequence, CAMEOS returns two scores, one that represents the divergence from protein A, the Mark Score and one that represents the divergence from protein B, the Deg Score.

These divergence scores are reported in the CAMEOS output file all_final_fitness_barcode.txt that we presented back in Part 1 of this course.

As you'd expect, we'd want the two protein sequences issued from our entanglement to be as close as possible from the original protein sequences. That means we have to minimize the divergence from protein A and protein B, in the form of the Mark Score and Deg Score.

Figure obtained from entanglement infA x aroB using script

However, minimizing several parameters at once is actually not trivial at all. In fact, multi-objective optimization, or pareto optimization, is a whole existing domain of research!

For their top results, the authors of CAMEOS took the sequences which are minimal in respect to the sum of the Deg Score and Mark Score*. In reality, solutions that are minimal in the sum of both parameters may not actually be pareto optimal.

*To be exact, the CAMEOS top 12 consists of the 3 best ranking sequences according to Deg Scores, the 3 best ranking sequences according to Mark Score and the 6 best ranking sequences in terms of the sum Mark Score plus Deg Score.

As such, as part of our iGEM contribution, I made a Python script called computing the pareto optimal solutions from CAMEOS' output as well as a nice graph showcasing them.

Make sure to download it from our Software page and put it in your working directory, which at this point should be CAMEOS/main/output/. The script requires Python 3 with the modules pandas and matplotlib so you need to have those installed.

Here is what the graph looks like with the CAMEOS output I obtained in Part 1 for the entanglements of infA and aroB. The random barcode I obtained was oVwNsNbF.

$> python infA_aroB_p1/all_final_fitness_oVwNsNbF.txt > pareto_sequences.txt
Figure obtained from entanglement infA x aroB using script
Annotated with the indices of optimal sequences

This python command outputs all optimal sequences indices into a file called pareto_sequences.txt.

I coupled it with some Julia code to extract the pareto optimal entangled coding sequences and protein sequences from the saved_pop_barcode.jld output file.

$> julia extract_data.jl infA_aroB_p1/saved_pop_oVwNsNbF.jld pareto_sequences.txt

The Julia script, extract_data.jl generates three files:

  • pareto_entangled_genes.fasta: coding entangled sequences
  • pareto_protein_A.fasta: protein sequences for the 1st gene
  • pareto_protein_B.fasta: protein sequences for the 2nd gene

These two scripts are available along with the bash script in the Software page of this wiki.

Make sure to run the following command on your files pareto_protein_A.fasta and pareto_protein_B.fasta. This will remove all dashes that sometimes appear in these files.

$> sed -i 's/-//g' pareto_protein_A.fasta

Now let's look at the annotated graph of infA x aroB entangled sequences ranked by Mark and Deg Scores. Notice that the script only annotated pareto optimal sequences, and that such sequences are connected by a dotted line, the pareto front.

These pareto optimal sequences are the same that we retrieve with the Julia script. In this case, there are 11 pareto optimal sequences on the pareto front. Due to their optimality, these sequences are guaranteed to be the best Mark and Deg scores compromises of the dataset.

In the case of the entanglement infA x aroB, Mark Score corresponds to the divergence from protein InfA, and Deg Score to the divergence from protein AroB.

Considering this, here is how to read the graph. Sequence 26 is the closest from InfA and the farthest from AroB. Sequence 23 is the closest from AroB and the farthest from InfA. Sequence 25 is both one of the closest to InfA and to AroB.

Then, we can go look at the files pareto_protein_A.fasta and pareto_protein_B.fasta to see if the protein sequences are indeed close or far from the original sequences.

This pareto method allows for a proper optimization technique to replace the previously used top twelve. In a way this is still arbitrary, since the size of the pareto front may vary with each execution of CAMEOS.

Phylogeny analysis

Now you may be thinking, what if CAMEOS' scores aren't accurate? Wouldn't selecting the sequences this way be a waste of time? Let us look at other tools.

View of the site for the alignment of infA and entanglement variants

We are going to use the online tool We will provide a file containing the protein sequences issued from our entanglements along with the original protein sequence to in One-Click Mode. The website will perform an alignment, a curation and construct a phylogenic tree.

Phylogeny for aroB and entanglement variants
Phylogeny for infA and entanglement variants

Here is how to read the phylogeny trees presented above. The length of each branch represents the distance between the sequences, so for example the variant 23 is the closest to aroB and the variant 38 is the closest to infA. Variant 26 seems to be far from both sequences surprisingly.

Earlier, we took the two extrema from the pareto graph, variants 23 and 26, as well as variant 25. Let's also pick variant 38 since the phylogeny seems to indicate it would also do very well. Here are their sequences:

>gene_23 (farthest)
>gene_25 (closer to infA than 38)
>gene_26 (closest)
>gene_23 (closest)
>gene_26 (farthest)
>gene_38 (closer to aroB than 25)

An important thing to note, by aligning the entanglement variants with the original proteins, we are missing a large set of information that is given in input to CAMEOS: the MSAs (see more about this in Part 2).

This means for example that variant 26 might still be very close to AroB, only it is not close to the original AroB sequence, it is close to one of the homologous sequences we gave to CAMEOS in the input MSAs.

In the end, a phylogeny analysis is a great tool for selecting optimal sequences, but it does not consider as much information as CAMEOS' scores.

Interpreting the BLAST identity scores

While CAMEOS scores might give us a good indicator of the quality, there are many factors that might invalidate an entanglement. To validate an entanglement, we can also use the tool BLASTp from NCBI and take a look at identity percentages.

The tool BLASTp returns a protein sequence alignment, along with indicators such as the identity percentage, the positives percentage and the number of gaps.

Intuitively, an identity score or identity percentage gives us the similarity of the query sequence with the original sequence. If protein A from our entanglement has an identity score of 90% on the alignment with the original protein A, then about 90% of the amino acids are the same than in the original protein.

The positives percentage is the same kind of idea, except it is more lenient: amino acids that are very similar in structure are considered the same in the scoring.

We simply use BLASTp of our entangled protein sequences against the NCBI database. This means that unlike with the phylogeny analysis, we will not be directly comparing our variants to the original protein sequence. We will retrieve the top BLASTp identity score, and that score might belong to a protein which is very different from the original protein.

This is fair because the HMMs and MRFs might return parts of the sequences from the other protein sequences in the alignment. The algorithm depends just as much if not more on the MSAs than on the original sequences.

Otherwise, a way to check the similarity of our entangled protein sequences with the original protein sequences with BLASTp would be by using the "Align two or more sequences" checkbox option. We won't be using this option for now.

Remember, BLASTp does not actually sort the top hits by the identity percentages, so the top scoring hit may not be the protein with the higher identity percentage. Other indicators can of course be taken into consideration to check if our proteins are close to the originals, such as the query coverage.

As an example, here are the results of a BLASTp against the NCBI database for entanglement variant 25. In the end, this sequence has 87.05% identity with AroB and 61.11% identity with InfA.

BLASTp result for variant 25 protein AroB
BLASTp result for variant 25 protein InfA

Secondly, here are the results for entanglement variant 38. This sequence has 85.67% identity with AroB and 68.06% identity with InfA.

BLASTp result for variant 38 protein AroB
BLASTp result for variant 38 protein InfA

We retrieve that variant 25 is closer to AroB than variant 38, and conversely variant 38 is closer to InfA than variant 25. This could be evidenced before by the pareto graph and the phylogeny tree.

From different tests we have confirmed that the score provided by CAMEOS actually gives a really solid idea of what the BLASTp identity percentage will be. A protein with a lower Mark Score will have a higher identity percentage for that protein, and so on.

As such, if as explained earlier we take the entanglement which point is right at the middle of the graph of CAMEOS scores, we should likely have a really good compromise of scores, and the identity percentages found with BLASTp should be a good compromise too.

However, it doesn't give an idea of the score percentage itself, so it is still extremely important to check the BLAST identity scores as early as possible. Low BLAST identity scores reduce likelihood of the protein being functional.

We considered an identity score to be low if it is below 75% for the outer protein sequence and below 60% for the inner protein sequence. This is really permissive, however with CAMEOS we often don't have much better identity scores.

Finally, once we have selected the best looking entangled sequences, we have to verify that the protein sequences derived from the nucleotide sequence actually match the protein sequences retrieved from CAMEOS.

Remember, entangled coding sequences are in the file pareto_entangled_genes.fasta generated earlier by our script. Here is what the variant 25 looks like.


Notice that the entangled part starts extremely soon, can a ribosome bind on such a site? I suggest keeping Ribosome Binding Sites in mind when reviewing an entangled sequence.

Additionally, we should add a stop codon at the end of the nucleotide sequence. CAMEOS does not automatically complete the nucleotide sequence with a stop codon.

Finally, it is advised to look for the coding sequences of the selected nucleotide sequences. A good tool for that purpose is ORF Finder. The coding sequence for the inner sequence found by ORF Finder may not be the right one.

Result of ORF Finder for a knt x ccdB entanglement.
Notice the ORF number 1 in frame 2 starts with two methionines, as a result of the subsequence "gatGATG".
The first "atG" is part of the outer coding sequence.
This can be avoided by using a synonymous codon in frame 1 (gat -> gac), replacing "gatGATG" by "gacGATG".

While the outer sequence is just the frame 1 of the whole nucleotide sequence, the inner sequence is on frame 2 or 3. CAMEOS unfortunately does not check these frames for other start codons so one of the coding sequences found may be starting before the actual intended inner sequence. This results in several additional amino acids at the start of the protein if on the same frame, and a totally different nosensical protein if on a different frame.

In that case, we have to hope the upstream start codon on frame 2 or 3 can be altered by replacing a codon on frame 1 by a synonymous codon. We wouldn't want the ribosome to translate the wrong ORF. Most codons have synonymous codons so this is not usually a problem.

Phyre2 and structural analysis

In addition, we can use the tool Phyre2 to perform a protein structural analysis. Here are some great resources to understand what a Phyre2 analysis is all about: Phyre 2 Interpreting Results, Phyre2 Visualisation tools.

Phyre2 submission box

Submitting a protein sequence to Phyre2 might take a long time, but it is very straightforward. For entangled proteins, we suggest selecting the intensive search option, which generates a new 3D model for our protein using the top matching proteins. Once the analysis done, Phyre2 will email you the results.

Phyre2 intensive analysis for sfGFP

Phyre2 returns us an identity score just like BLAST. In fact, it also aligns the proteins, but this time taking structures into consideration. The matches found by Phyre2 are models of our protein based on protein structure templates from databases such as PDB. They are all downloable in .pdb format and you can have detailed overview of each alignments.

There are other useful analysis done by Phyre2 such as secondary structure prediction. What we suggest doing is downloading the generated .pdb 3D model. This can be an useful way to analyze protein structure for structural biology specialists. PDB files must be visualized with an appropriate software, for example VMD.

Another step in the structural analysis of our resulting proteins can be done by simply looking at Uniprot. In particular, Uniprot sometimes reports the binding site positions and regions, as well as known defective mutations. Here are some examples.

Uniprot known mutations for ccdB
Uniprot binding regions for aroB

Let's try to see if our aroB sequence from earlier have mutations in the nucleotide binding regions (71-76, 105-109, 129-130, 169-172). I'm using Julia to stay in the theme.

julia> aroB
julia> variant_25

The sequences are offset by one position at one point, hence we index the variant one amino acid farther (I highlighted in blue protein subsequences that match, notice how the variant is 1 amino acid longer).

julia> aroB[71:76]
julia> variant_25[72:77]
julia> aroB[129:130]
julia> variant_25[130:131]
julia> aroB[169:172]
julia> variant_25[170:173]

Unfortunately, we notice that the 71-76 nucleotide binding region is different in our variant 25, three amino acid residues differ (highlighted in red).

That means we might have to select another entanglement. It is possible to automate a bit the search of positional mutations, though, remember, this must be specifically tailored for each protein.

This is a good note to conclude this part on. In the end, if all obtained entangled sequences were altered on several essential positions, there is not much else to do. Sometimes it comes from the MSA proteins containing unusual AAs at essential positions, sometimes it is because CAMEOS cannot find a place in the sequence to entangle without modifying the position.

To help with this issue, in the Case Study we propose to find which MSAs would give us the proteins that are the closest possible to the original ones.

Back to the top:
Faculté des Sciences d'Orsay- Université Paris-Saclay-Logo
Team GO Paris-Saclay
Université Paris-Saclay
Faculté des Sciences d'Orsay
Building n°400
91 405 Cedex, Orsay
GO Paris-Saclay logo - like Eiffel Tower with a DNA strand

Thank you very much to our generous Sponsors