Case study: Comparing Multiple Sequence Alignments from several genomic databases

As you may remember from the second part of this course, the most intricate part of getting good entangled sequences is probably choosing the Multiple Sequence Alignments (MSAs) appropriately.

We are therefore going to look in detail at the various databases from which we can get multiple sequence alignments. Which databases give us the best results in terms of entanglement?

While the authors of the CAMEOS paper recommend using protein families, such as those from Pfam and InterPro, our hypothesis is that alignments with very little sequence variation, such as those from BLAST and ENSEMBL, would give us better results.

Throughout this analysis, we consider an entanglement valuable if both the protein similarity scores reported by CAMEOS and the BLAST identity scores are high.

Presentation of the different genomic databases

For the purpose of this analysis, we chose several well-known tools and databases to retrieve MSAs from, namely NCBI BLAST, UniProt, ENSEMBL and InterPro.

The sequences we are trying to entangle this time are the toxin ccdB and the kanamycin resistance gene knt. This is one of the entanglements we chose for our final BioBricks, although it didn't go as well as planned.

We will not go into detail on how to retrieve an MSA from each database, since the process is usually very similar between tools: we submit the protein sequence, the homology search algorithm returns its top results, and we download these top results as an alignment.

In the case of InterPro, the sequence search usually redirects us to a protein family, which can contain up to thousands of sequences that we can download as an alignment. However, InterPro distinguishes between reviewed and unreviewed sequences.

  InterPro family entry for protein ccdB

For our purposes, reviewed sequences are more interesting because they help avoid errors in our entanglements, but it turns out that for most proteins the number of reviewed sequences is in the tens, while the number of unreviewed sequences is in the thousands.

Let us note here that all tools may provide us unreviewed sequences in our alignments. That even includes UniProt. Although protein sequences from the Swiss-Prot database are curated and annotated, protein sequence alignments retrieved from UniProt may include sequences from TrEMBL, which are uncurated.

One option would be to always remove all unreviewed sequences from all alignments, but for less well-known proteins this is often not possible. Even the toxin ccdB, which is fairly well known, has only 5 reviewed sequences on InterPro.
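To illustrate what removing unreviewed sequences could look like, here is a minimal sketch. It assumes the sequences carry UniProtKB-style FASTA headers, where Swiss-Prot (reviewed) entries start with `sp|` and TrEMBL (unreviewed) entries start with `tr|`. Alignments exported from other tools may use different headers, and the accessions and sequences below are made up for the example.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_reviewed(records):
    """Keep only records whose header starts with the Swiss-Prot prefix 'sp|'."""
    return [(h, s) for h, s in records if h.startswith("sp|")]


# Hypothetical entries: the accessions and sequences are invented.
example = """>sp|P00001|EXAMPLE_A hypothetical reviewed entry
MQFKVYTYKR
>tr|A0A000|EXAMPLE_B hypothetical unreviewed entry
MQFKVYSYKR
"""
reviewed = keep_reviewed(parse_fasta(example))
print(len(reviewed), "reviewed sequence(s) kept")
```

With only a handful of reviewed sequences left, as for ccdB, the filtered alignment may be too small to build a useful HMM, which is why we did not apply this filter systematically.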

Let us detail a bit more what we mean by our starting hypothesis, namely that alignments with little sequence diversity could give better results than alignments with a lot of diversity.

Because they include many unreviewed proteins and rely on high-level homology search techniques, the multiple sequence alignments that come from protein families, especially InterPro families, may be too diverse to yield good entangled sequences.

AliView visualization of the ccdB MSA obtained from InterPro

In contrast, NCBI BLAST, whose databases contain the largest amount of data, might give us sequences that are very close to the original, and thus entangled sequences with a higher identity percentage. However, this must be nuanced by the fact that NCBI may include sequences of defective mutants of the protein, which would mean defective entanglements as well. This is why we are also looking at UniProt and ENSEMBL.

AliView visualization of the ccdB MSA obtained from BLAST

After downloading an alignment, it is always essential to look at it to decide whether it is acceptable or not*. A lot can be told from what an alignment looks like. A great tool we use to visualize alignments is AliView.

*For example, in the case of noisy alignments such as the ccdB one from InterPro, you might want to look into trimming algorithms.

In addition to being great for visualizing alignments, AliView can also realign the sequences with the MUSCLE alignment algorithm. This is very useful and we have used it quite a few times, since it sometimes gives us a better alignment than the one we downloaded.
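To give an idea of what trimming does, here is a minimal sketch of the simplest possible criterion: dropping every column in which more than a chosen fraction of sequences have a gap. The function name and the 0.5 threshold are our own choices for illustration. Dedicated tools such as trimAl or Gblocks use more elaborate criteria.

```python
def trim_gappy_columns(aligned_seqs, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    n = len(aligned_seqs)
    # Indices of the columns that are kept ('-' and '.' both count as gaps).
    keep = [
        col for col in range(length)
        if sum(s[col] in "-." for s in aligned_seqs) / n <= max_gap_fraction
    ]
    return ["".join(s[col] for col in keep) for s in aligned_seqs]


# Toy alignment (invented): the third column is a gap in 3 of 4 sequences.
toy = ["MK-LV", "MK-LV", "MKALV", "M--LV"]
print(trim_gappy_columns(toy))
```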

Now let us begin the analysis process. For each of the proteins ccdB and knt, we retrieved 5 MSAs from the aforementioned databases. Here are the names we have given them:

  • ccdB_dirty, knt_dirty: MSA from an NCBI BLAST alignment, without curation.
  • ccdB_lax, knt_lax: MSA from a UniProt BLAST alignment, without curation.
  • ccdB_stringent, knt_stringent: MSA from a UniProt BLAST alignment, with manual curation.
  • ccdB_ensembl, knt_ensembl: MSA from an ENSEMBL BLAST alignment, without curation.
  • ccdB_interpro, knt_interpro: MSA from an InterPro family, without curation.

Analyzing the impact of MSAs on entangled genes

In order to analyze the impact of MSAs on entangled genes, we looked for a good indicator of the quality of an alignment.

It turns out such an indicator exists, in the form of HMM consensus sequences. Remember, from every MSA, we generate an HMM which roughly contains probabilities for every amino acid to appear at each position.

Since the HMM is aware of these probabilities, we can use it to generate the most probable sequence. This is called the consensus sequence.
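To make the idea concrete, here is a simplified sketch that works directly on the alignment: it computes residue frequencies for each column and keeps the most frequent residue. This is only a stand-in for what a profile HMM does. A real HMMER profile also applies sequence weighting and priors, and distinguishes match and insert states, so its consensus can differ from this naive majority vote. The toy alignment is invented.

```python
from collections import Counter


def column_frequencies(aligned_seqs):
    """Return per-column residue frequencies, ignoring gap characters."""
    freqs = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res not in "-.")
        total = sum(counts.values())
        freqs.append({r: c / total for r, c in counts.items()} if total else {})
    return freqs


def consensus(aligned_seqs):
    """Most frequent residue per column; all-gap columns are skipped."""
    out = []
    for col in column_frequencies(aligned_seqs):
        if col:
            # Deterministic tie-break: highest frequency, then alphabetical.
            out.append(min(col, key=lambda r: (-col[r], r)))
    return "".join(out)


# Toy alignment (invented, not a real ccdB or knt alignment).
toy = ["MKALV", "MKSLV", "MRALV", "MK-LI"]
print(consensus(toy))
```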

The command we will use to generate the consensus sequence from an HMM file prot.hmm is hmmemit -c prot.hmm

This is a convenient tool since it is provided along with hmmbuild and hmmpress in the HMMER suite.

$> cat ccdB.fasta
$> hmmemit -c ccdB_dirty.hmm
$> hmmemit -c ccdB_ensembl.hmm
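If you have many MSAs to process, the two steps (hmmbuild, then hmmemit -c) can be scripted. The sketch below is one possible way to do it, assuming HMMER is installed and on your PATH. The helper names and the file names (such as ccdB_dirty.sto) are hypothetical.

```python
import shutil
import subprocess


def hmmer_commands(msa_path, hmm_path):
    """Return the hmmbuild and hmmemit command lines as argument lists."""
    return [
        ["hmmbuild", hmm_path, msa_path],  # build a profile HMM from the MSA
        ["hmmemit", "-c", hmm_path],       # emit the consensus sequence
    ]


def consensus_from_msa(msa_path, hmm_path):
    """Run both commands and return hmmemit's FASTA output as text."""
    if shutil.which("hmmbuild") is None or shutil.which("hmmemit") is None:
        raise RuntimeError("HMMER does not appear to be installed or on PATH")
    build_cmd, emit_cmd = hmmer_commands(msa_path, hmm_path)
    subprocess.run(build_cmd, check=True, capture_output=True)
    result = subprocess.run(emit_cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical file names; nothing is executed here, we only show the commands.
for cmd in hmmer_commands("ccdB_dirty.sto", "ccdB_dirty.hmm"):
    print(" ".join(cmd))
```

Calling consensus_from_msa("ccdB_dirty.sto", "ccdB_dirty.hmm") would then return the consensus in FASTA format, provided both files paths are valid on your machine.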

Remember, the first step of CAMEOS is evaluating HMM seeds. HMMs are usually used as profiles for finding similar sequences, and in that case having a lot of variety in the consensus sequence is great, since it means finding many homologues.

In our case, though, we might not want the consensus sequence to be too distant from the original protein sequence, since this would imply that the randomly generated sequences evaluated by CAMEOS are themselves distant from the original protein sequence.

We theorize that a good alignment might be one that has enough variety while still staying generally close to the original sequences.

To check whether our consensus sequences are close to the original sequences, we performed a BLAST alignment of the sequences using the option "Align two or more sequences" (see Part 3 for more details on NCBI BLASTp).
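For readers who want to see what an identity score measures, here is a minimal sketch that computes the percentage of identical positions over the length of a pairwise alignment. It assumes the two sequences are already aligned (same length, gaps written as '-'). BLAST itself first computes a local alignment and reports identity over the aligned region, so its numbers can differ from those of a naive end-to-end comparison. The two toy sequences are invented.

```python
def percent_identity(aligned_a, aligned_b):
    """Identical positions divided by alignment length, as a percentage."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A gap never counts as a match, even when facing another gap.
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


# Toy pair (invented): 8 identical positions out of 10 aligned columns.
original_seq = "MQFKVYTYKR"
consensus_seq = "MQFKVY-YRR"
print(round(percent_identity(original_seq, consensus_seq), 1))
```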

Blast identity scores for the knt consensus sequences
Blast identity scores for the ccdB consensus sequences

We immediately notice that, despite potentially containing defective mutants, the ccdB_dirty and knt_dirty MSAs take the top position.

As the curated alignments from UniProt, ccdB_stringent and knt_stringent are the ones we would expect to be sufficiently diverse without being too far from the original sequences. For knt, the identity score is really solid, while it is decent for ccdB.

Quite surprisingly, we observe a really good alignment for ccdB_ensembl, with the best E-value, while the identity score for knt_ensembl is just mediocre.

The alignments ccdB_lax and knt_lax perform worse than ccdB_stringent and knt_stringent, which is to be expected since the stringent alignments are the lax ones after manual curation.

Unfortunately, in both cases we observe terrible scores for ccdB_interpro and knt_interpro, which may be due to the number of unreviewed proteins in the MSAs, as well as the lack of curation for such a large number of sequences.

Final results

In the end, we ran 25 entanglements, one for each pair of knt and ccdB MSAs (5 x 5 conditions). We requested 300 entangled sequences for each CAMEOS execution.
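The 25 conditions are simply every pairing of one knt MSA with one ccdB MSA. As a small sketch, they can be enumerated as follows, using the MSA names introduced earlier and the "knt x ccdB" labelling used in this section:

```python
from itertools import product

suffixes = ["dirty", "lax", "stringent", "ensembl", "interpro"]
knt_msas = [f"knt_{s}" for s in suffixes]
ccdb_msas = [f"ccdB_{s}" for s in suffixes]

# One condition per (knt MSA, ccdB MSA) pair.
conditions = [f"{k} x {c}" for k, c in product(knt_msas, ccdb_msas)]
print(len(conditions))
print(conditions[0])
```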

Spreadsheet describing all 25 CAMEOS executions

For each condition, we retrieved the mean and standard deviation of the Mark and Deg scores. This was compiled in the barplot below.
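As an illustration of this summary step, here is a minimal sketch that computes the mean, the sample standard deviation and a confidence interval for one list of scores. The interval uses a normal approximation (mean ± 1.96 standard errors), which is an assumption made for the example and not necessarily how the intervals in our barplot were computed. The scores below are invented.

```python
import math
import statistics


def summarize(scores, z=1.96):
    """Return mean, sample standard deviation and an approximate 95% CI."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = z * sd / math.sqrt(len(scores))
    return {"mean": mean, "sd": sd, "ci": (mean - half_width, mean + half_width)}


# Invented Deg scores for a single condition, for illustration only.
toy_deg_scores = [100.0, 110.0, 90.0, 105.0, 95.0]
summary = summarize(toy_deg_scores)
print(summary["mean"], round(summary["sd"], 2))
```

In practice, one such summary would be computed for the Mark scores and one for the Deg scores of each of the 25 conditions.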

Barplot of the mean Mark and Deg scores indicating divergence from CcdB and Knt with confidence intervals

We observe that the population that did the worst overall is knt_interpro x ccdB_interpro, by a large margin. In addition, all entanglements that involve either InterPro MSA end up with really high scores (that is, high divergence from the original proteins), with a sum of over 500.

According to the consensus sequences, the population that was supposed to have the best scores was knt_dirty x ccdB_ensembl.

Figure obtained from entanglement knt_dirty x ccdB_ensembl using script

Notice that we can clearly see two populations on this graph: one with excellent Deg Scores, and another with mediocre Deg Scores. That explains the high standard deviation of this population, as shown by the confidence intervals on the barplot.

Instead, the population with the best overall scores is knt_dirty x ccdB_lax. Quite surprisingly, ccdB_lax did better than ccdB_stringent, despite being the same alignment but uncurated.

Figure obtained from entanglement knt_dirty x ccdB_lax using script

As expected from the entanglement with the best sum of mean Mark Score and mean Deg Score, all sequences show great scores overall. However, knt_dirty x ccdB_ensembl had individual sequences with better Deg Scores, at the cost of worse Mark Scores, so if we wanted to maximize the resemblance to ccdB, that is something to take into consideration.

In conclusion, the results observed in the barplot for Knt agree with what we saw earlier with the consensus sequences: the order is exactly what the consensus identity scores would predict.

For CcdB, however, ccdB_lax surprisingly ends up in first place, despite its really low consensus identity score. This might be explained by the fact that ccdB is the smallest sequence and ends up being the inner sequence of the entanglement in every case. As the inner sequence, it is subject to a lot more variety and may benefit from a diverse alignment.

From the results of this analysis, it seems that MSAs from BLAST, ENSEMBL and UniProt are considerably more likely to resemble the original proteins than protein families from InterPro.

This finding seems to support our starting hypothesis: alignments with proteins as close as possible to the original yield better CAMEOS entangled sequences overall than alignments with distant homologues.

This seems to be especially true for the outer sequence, the one that is entangled into, and less so for the inner sequence, which can benefit from variety in the MSA.

To truly test this hypothesis, we think this analysis should be repeated on other proteins and datasets. Additional databases and protein families we could have tested include eggNOG, TIGRFAMs, SUPERFAMILY, KEGG, and more.

In closing, although we do not know whether this hypothesis has some validity or whether this is just coincidence, we think that our analysis of MSAs leads to very interesting results.

That concludes this course and case study. I hope this was enjoyable to read! It was our first experience with software that has great potential for the future of synthetic biology. I hope our tutorial will ease the road to generating your own entangled genes.

Please let us know if you have any comments. We would really appreciate discussing our results!

- Maxime Mahout
for iGEM GO Paris-Saclay 2020

Team GO Paris-Saclay
Université Paris-Saclay
Faculté des Sciences d'Orsay
Building n°400
91 405 Cedex, Orsay