Analyzing the results
Once done with the CAMEOS execution, we are left with one issue. Which entangled sequence from the 100 or so we have generated should we select? In order to answer this, many tools are at our disposal.
We will be looking into each one of them and try to see what they can bring to the table. The original idea was the following: if a sequence manages to survive through many selection criteria, we can objectively declare it to be suitable to the iGEM lab.
In practice though I just quickly select the optimal sequences with the scripts I created and pray for the BLAST identity scores to not be terrible. They usually are, I curse, and I assume the problem comes from the MSA file (see Part 2 and Case study), meaning I have to start all over again! Anyway, let's delve into it!
Selecting the top results from CAMEOS
In fact, CAMEOS uses its own scoring system for deciding which entangled
sequences are the best. Remember, this is the one it uses in order to
decide which sequences are the top hits in the file
Let's simply call gene A and gene B the two genes we are trying to entangle with CAMEOS. For each entangled sequence, CAMEOS returns two scores, one that represents the divergence from protein A, the Mark Score and one that represents the divergence from protein B, the Deg Score.
These divergence scores are reported in the CAMEOS output file
all_final_fitness_barcode.txt that we presented
back in Part 1 of this course.
As you'd expect, we'd want the two protein sequences issued from our entanglement to be as close as possible from the original protein sequences. That means we have to minimize the divergence from protein A and protein B, in the form of the Mark Score and Deg Score.
However, minimizing several parameters at once is actually not trivial at all. In fact, multi-objective optimization, or pareto optimization, is a whole existing domain of research!
For their top results, the authors of CAMEOS took the sequences which are minimal in respect to the sum of the Deg Score and Mark Score*. In reality, solutions that are minimal in the sum of both parameters may not actually be pareto optimal.
*To be exact, the CAMEOS top 12 consists of the 3 best ranking sequences according to Deg Scores, the 3 best ranking sequences according to Mark Score and the 6 best ranking sequences in terms of the sum Mark Score plus Deg Score.
As such, as part of our iGEM contribution, I made a Python script
pareto.py computing the
pareto optimal solutions from CAMEOS' output as well as a nice
graph showcasing them.
Make sure to download it from our Software page and put it in your working directory, which at this point should be
The script requires Python 3 with the modules
matplotlib so you need to have those installed.
Here is what the graph looks like with the CAMEOS output I obtained in
Part 1 for the entanglements of
infA and aroB. The random barcode I obtained was
This python command outputs all optimal sequences indices into
a file called
I coupled it with some Julia code to extract the
entangled coding sequences and protein sequences from the
saved_pop_barcode.jld output file.
The Julia script,
extract_data.jl generates three files:
pareto_entangled_genes.fasta: coding entangled sequences
pareto_protein_A.fasta: protein sequences for the 1st gene
pareto_protein_B.fasta: protein sequences for the 2nd gene
These two scripts are available along with the bash script in the Software page of this wiki.
Make sure to run the following command
on your files
This will remove all dashes that sometimes appear in these files.
Now let's look at the annotated graph of infA x aroB entangled sequences ranked by Mark and Deg Scores. Notice that the script only annotated pareto optimal sequences, and that such sequences are connected by a dotted line, the pareto front.
These pareto optimal sequences are the same that we retrieve with the Julia script. In this case, there are 11 pareto optimal sequences on the pareto front. Due to their optimality, these sequences are guaranteed to be the best Mark and Deg scores compromises of the dataset.
In the case of the entanglement infA x aroB, Mark Score corresponds to the divergence from protein InfA, and Deg Score to the divergence from protein AroB.
Considering this, here is how to read the graph. Sequence 26 is the closest from InfA and the farthest from AroB. Sequence 23 is the closest from AroB and the farthest from InfA. Sequence 25 is both one of the closest to InfA and to AroB.
Then, we can go look at the files
pareto_protein_B.fasta to see if the protein sequences are indeed
close or far from the original sequences.
This pareto method allows for a proper optimization technique to replace the previously used top twelve. In a way this is still arbitrary, since the size of the pareto front may vary with each execution of CAMEOS.
Now you may be thinking, what if CAMEOS' scores aren't accurate? Wouldn't selecting the sequences this way be a waste of time? Let us look at other tools.
We are going to use the online tool phylogeny.fr. We will provide a file containing the protein sequences issued from our entanglements along with the original protein sequence to phylogeny.fr in One-Click Mode. The website will perform an alignment, a curation and construct a phylogenic tree.
Here is how to read the phylogeny trees presented above. The length of each branch represents the distance between the sequences, so for example the variant 23 is the closest to aroB and the variant 38 is the closest to infA. Variant 26 seems to be far from both sequences surprisingly.
Earlier, we took the two extrema from the pareto graph, variants 23 and 26, as well as variant 25. Let's also pick variant 38 since the phylogeny seems to indicate it would also do very well. Here are their sequences:
>infA MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR >gene_23 (farthest) MSSRGSIEVEGVVEESLPSGQWTVRDDDGSVLTAHASGQIRRFRIRVLSGDRVTGELSPADLTRGRITFRLS >gene_25 (closer to infA than 38) MAGNAGIEVSGVVREVLPSSRFRVELQTGHTVTARVSGKLRKHRIRILAGDTVVLELSPCDLGRGRITFRGK >gene_26 (closest) MANNDRIEVSGVVVEALPNTLFRVKLQTGQEILAHLSGKLRIHRIRVLAGDKVLVELSPYYGGRGRITRRGK >gene_38 MSKEELIELEAVVKESLRDARFRLELESGFVVLGHISGRIRRNHIRILPGDRVRIEQSPYDLTKGRIVSRLK
>aroB MERIVVTLGERSYPITIASGLFNEPASFLPLKSGEQVMLVTNETLAPLYLDKVRGVLEQAGVNVDSVILPDGEQYKSLAVLDTVFTALLQKPHGRDTTLVALGGGVVGDLTGFAAASYQRGVRFIQVPTTLLSQVDSSVGGKTAVNHPLGKNMIGAFYQPASVVVDLDCLKTLPPRELASGLAEVIKYGIILDGAFFNWLEENLDALLRLDGPAMAYCIRRCCELKAEVVAADERETGLRALLNLGHTFGHAIEAEMGYGNWLHGEAVAAGMVMAARTSERLGQFSSAETQRIITLLKRAGLPVNGPREMSAQAYLPHMLRDKKVLAGEMRLILPLAIGKSEVRSGVSHELVLNAIADCQSA >gene_23 (closest) MERIVVTLGERSYPITIASGLFNEPASFLKPLKSGEQVMLVTNETLAPLYLDKVRGVLEQAGVNVDSVILPDGEQYKSLAVLDTVFTALLQKPHGRDTTLVALGGGVVGDLTGFAAASYQRGVRFIQVPTTLLSQVDSSVGGKTAVNHPLGKNMIGAFYQPASVVVDLDCLKTLPPRELASGLAEVIKYGIILDGAFFNWLEENLDALLRLDGPAMAYCIRRCCELKAEVVAADERETGLRALLNLGHTFGHAIEAEMGYGNWLHGEAVAAGMVMAARTSERLGQCPAEVLSRLRALLRRASLPVSGPSEMTTEAYLPHMLRDKYVVSGSVCLVVIVSLGNSALRTSLAAELLLDSLRDCQSA >gene_25 MERWQVTLGSRSVELSVRSSLLADSGSSFRPVIRSPRVFLVSSESTASAFLPATRSSLSSARVTLDEVVLPSGESSKSLAVLDTVFTALLQKPHGRDTTLVALGGGVVGDLTGFAAASYQRGVRFIQVPTTLLSQVDSSVGGKTAVNHPLGKNMIGAFYQPASVVVDLDCLKTLPPRELASGLAEVIKYGIILDGAFFNWLEENLDALLRLDGPAMAYCIRRCCELKAEVVAADERETGLRALLNLGHTFGHAIEAEMGYGNWLHGEAVAAGMVMAARTSERLGQFSSAETQRIITLLKRAGLPVNGPREMSAQAYLPHMLRDKKVLAGEMRLILPLAIGKSEVRSGVSHELVLNAIADCQSA >gene_26 (farthest) MERWQITTGSRSVELSLRPSLIRSSASSFRPAKRSSRTFLVSSESTASAFLPAIRSSLSSARITVDEVELPDGESDKSLAVLDTVFTALLQKPHGRDTTLVALGGGVVGDLTGFAAASYQRGVRFIQVPTTLLSQVDSSVGGKTAVNHPLGKNMIGAFYQPASVVVDLDCLKTLPPRELASGLAEVIKYGIILDGAFFNWLEENLDALLRLDGPAMAYCIRRCCELKAEVVAADERETGLRALLNLGHTFGHAIEAEMGYGNWLHGEAVAAGMVMAARTSERLGQFSSAETQRIITLLKRAGLPVNGPREMSAQAYLPHMLRDKKVLAGEMRLILPLAIGKSEVRSGVSHELVLNAIADCQSA >gene_38 (closer to aroB than 25) MERIVVTLGERSYPITIASGCLRKSSSSWRPLLRNPSVMLVSDSNLSPALLSSVTSLAESAGITSASYQVTASESSKALTTLQKVVSFLASNRHGRDTTLVALGGGVVGDLTGFAAASYQRGVRFIQVPTTLLSQVDSSVGGKTAVNHPLGKNMIGAFYQPASVVVDLDCLKTLPPRELASGLAEVIKYGIILDGAFFNWLEENLDALLRLDGPAMAYCIRRCCELKAEVVAADERETGLRALLNLGHTFGHAIEAEMGYGNWLHGEAVAAGMVMAARTSERLGQFSSAETQRIITLLKRAGLPVNGPREMSAQAYLPHMLRDKKVLAGEMRLILPLAIGKSEVRSGVSHELVLNAIADCQSA
An important thing to note, by aligning the entanglement variants with the original proteins, we are missing a large set of information that is given in input to CAMEOS: the MSAs (see more about this in Part 2).
This means for example that variant 26 might still be very close to AroB, only it is not close to the original AroB sequence, it is close to one of the homologous sequences we gave to CAMEOS in the input MSAs.
In the end, a phylogeny analysis is a great tool for selecting optimal sequences, but it does not consider as much information as CAMEOS' scores.
Interpreting the BLAST identity scores
While CAMEOS scores might give us a good indicator of the quality, there are many factors that might invalidate an entanglement. To validate an entanglement, we can also use the tool BLASTp from NCBI and take a look at identity percentages.
The tool BLASTp returns a protein sequence alignment, along with indicators such as the identity percentage, the positives percentage and the number of gaps.
Intuitively, an identity score or identity percentage gives us the similarity of the query sequence with the original sequence. If protein A from our entanglement has an identity score of 90% on the alignment with the original protein A, then about 90% of the amino acids are the same than in the original protein.
The positives percentage is the same kind of idea, except it is more lenient: amino acids that are very similar in structure are considered the same in the scoring.
We simply use BLASTp of our entangled protein sequences against the NCBI database. This means that unlike with the phylogeny analysis, we will not be directly comparing our variants to the original protein sequence. We will retrieve the top BLASTp identity score, and that score might belong to a protein which is very different from the original protein.
This is fair because the HMMs and MRFs might return parts of the sequences from the other protein sequences in the alignment. The algorithm depends just as much if not more on the MSAs than on the original sequences.
Otherwise, a way to check the similarity of our entangled protein sequences with the original protein sequences with BLASTp would be by using the "Align two or more sequences" checkbox option. We won't be using this option for now.
Remember, BLASTp does not actually sort the top hits by the identity percentages, so the top scoring hit may not be the protein with the higher identity percentage. Other indicators can of course be taken into consideration to check if our proteins are close to the originals, such as the query coverage.
As an example, here are the results of a BLASTp against the NCBI database for entanglement variant 25. In the end, this sequence has 87.05% identity with AroB and 61.11% identity with InfA.
Secondly, here are the results for entanglement variant 38. This sequence has 85.67% identity with AroB and 68.06% identity with InfA.
We retrieve that variant 25 is closer to AroB than variant 38, and conversely variant 38 is closer to InfA than variant 25. This could be evidenced before by the pareto graph and the phylogeny tree.
From different tests we have confirmed that the score provided by CAMEOS actually gives a really solid idea of what the BLASTp identity percentage will be. A protein with a lower Mark Score will have a higher identity percentage for that protein, and so on.
As such, if as explained earlier we take the entanglement which point is right at the middle of the graph of CAMEOS scores, we should likely have a really good compromise of scores, and the identity percentages found with BLASTp should be a good compromise too.
However, it doesn't give an idea of the score percentage itself, so it is still extremely important to check the BLAST identity scores as early as possible. Low BLAST identity scores reduce likelihood of the protein being functional.
We considered an identity score to be low if it is below 75% for the outer protein sequence and below 60% for the inner protein sequence. This is really permissive, however with CAMEOS we often don't have much better identity scores.
Finally, once we have selected the best looking entangled sequences, we have to verify that the protein sequences derived from the nucleotide sequence actually match the protein sequences retrieved from CAMEOS.
Remember, entangled coding sequences are in the file
pareto_entangled_genes.fasta generated earlier by our script.
Here is what the variant 25 looks like.
Notice that the entangled part starts extremely soon, can a ribosome bind on such a site? I suggest keeping Ribosome Binding Sites in mind when reviewing an entangled sequence.
Additionally, we should add a stop codon at the end of the nucleotide sequence. CAMEOS does not automatically complete the nucleotide sequence with a stop codon.
Finally, it is advised to look for the coding sequences of the selected nucleotide sequences. A good tool for that purpose is ORF Finder. The coding sequence for the inner sequence found by ORF Finder may not be the right one.
While the outer sequence is just the frame 1 of the whole nucleotide sequence, the inner sequence is on frame 2 or 3. CAMEOS unfortunately does not check these frames for other start codons so one of the coding sequences found may be starting before the actual intended inner sequence. This results in several additional amino acids at the start of the protein if on the same frame, and a totally different nosensical protein if on a different frame.
In that case, we have to hope the upstream start codon on frame 2 or 3 can be altered by replacing a codon on frame 1 by a synonymous codon. We wouldn't want the ribosome to translate the wrong ORF. Most codons have synonymous codons so this is not usually a problem.
Phyre2 and structural analysis
In addition, we can use the tool Phyre2 to perform a protein structural analysis. Here are some great resources to understand what a Phyre2 analysis is all about: Phyre 2 Interpreting Results, Phyre2 Visualisation tools.
Submitting a protein sequence to Phyre2 might take a long time, but it is very straightforward. For entangled proteins, we suggest selecting the intensive search option, which generates a new 3D model for our protein using the top matching proteins. Once the analysis done, Phyre2 will email you the results.
Phyre2 returns us an identity score just like BLAST. In fact, it also aligns the proteins, but this time taking structures into consideration. The matches found by Phyre2 are models of our protein based on protein structure templates from databases such as PDB. They are all downloable in .pdb format and you can have detailed overview of each alignments.
There are other useful analysis done by Phyre2 such as secondary structure prediction. What we suggest doing is downloading the generated .pdb 3D model. This can be an useful way to analyze protein structure for structural biology specialists. PDB files must be visualized with an appropriate software, for example VMD.
Another step in the structural analysis of our resulting proteins can be done by simply looking at Uniprot. In particular, Uniprot sometimes reports the binding site positions and regions, as well as known defective mutations. Here are some examples.
Let's try to see if our aroB sequence from earlier have mutations in the nucleotide binding regions (71-76, 105-109, 129-130, 169-172). I'm using Julia to stay in the theme.julia> aroB
"MERIVVTLGERSYPITIASGLFNEPASFLPLKSGEQVMLVTNETLAPLYLDKVRGVLEQAGVNVDSVILPDGEQYKSLAVLDTVFTALLQKPHGRDTTLVALGGGVVGDLTGFAAASYQRGVRFIQVPTTLLSQVDSSVGGKTAVNHPLGKNMIGAFYQPASVVVDLDCLKTLPPRELASGLAEVIKYGIILDGAFFNWLEENLDALLRLDGPAMAYCIRRCCELKAEVVAADERETGLRALLNLGHTFGHAIEAEMGYGNWLHGEAVAAGMVMAARTSERLGQFSSAETQRIITLLKRAGLPVNGPREMSAQAYLPHMLRDKKVLAGEMRLILPLAIGKSEVRSGVSHELVLNAIADCQSA" julia> variant_25
The sequences are offset by one position at one point, hence we index the variant one amino acid farther (I highlighted in blue protein subsequences that match, notice how the variant is 1 amino acid longer).julia> aroB[71:76]
Unfortunately, we notice that the 71-76 nucleotide binding region is different in our variant 25, three amino acid residues differ (highlighted in red).
That means we might have to select another entanglement. It is possible to automate a bit the search of positional mutations, though, remember, this must be specifically tailored for each protein.
This is a good note to conclude this part on. In the end, if all obtained entangled sequences were altered on several essential positions, there is not much else to do. Sometimes it comes from the MSA proteins containing unusual AAs at essential positions, sometimes it is because CAMEOS cannot find a place in the sequence to entangle without modifying the position.
To help with this issue, in the Case Study we propose to find which MSAs would give us the proteins that are the closest possible to the original ones.