CAMEOS Course
Case study: Comparing Multiple Sequence Alignments from several genomic databases
As you may remember from the second part of this course, the most delicate step in obtaining good sequence entanglements is probably choosing appropriate Multiple Sequence Alignments (MSAs).
We are therefore going to look in detail at the various databases from which we can retrieve multiple sequence alignments. Which databases give us the best results in terms of entanglement?
While the authors of the CAMEOS paper recommend using protein families, such as those from Pfam and InterPro, our hypothesis is that alignments with very little sequence variation, such as those from BLAST and ENSEMBL, would give us better results.
Throughout this analysis, we consider an entanglement to be valuable if both the protein similarity scores obtained from CAMEOS and the BLAST identity scores are high.
Presentation of the different genomic databases
For this analysis, we chose several well-known tools and databases to retrieve MSAs from, namely NCBI BLAST, UniProt, ENSEMBL and InterPro.
The sequences we are trying to entangle this time are the toxin ccdB and the kanamycin resistance gene knt. This is one of the entanglements we chose for our final BioBricks, although it didn't go as well as planned.
We will not go into detail on how to retrieve an MSA from each database, since the process is usually very similar between tools: we submit the protein sequence, the homology search algorithm returns the top results, and we download these top results as an alignment.
In the case of InterPro, the sequence search usually redirects us to a protein family which contains up to thousands of sequences, and we can download these as alignments. However, InterPro distinguishes between reviewed and unreviewed sequences.
For our purposes, reviewed sequences are obviously preferable, since they help avoid errors in our entanglements. It turns out, however, that for most proteins the number of reviewed sequences is in the tens, while the number of unreviewed sequences is in the thousands.
Note that any of these tools may include unreviewed sequences in the alignments it provides. That even includes UniProt: although protein sequences from the Swiss-Prot database are curated and annotated, protein sequence alignments retrieved from UniProt may include sequences from TrEMBL, which are uncurated.
An option would be to always remove all unreviewed sequences from all alignments, but with less well-known proteins it is often not possible. The toxin ccdB, which is pretty well-known, only has 5 reviewed sequences on InterPro.
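As an illustration of what removing unreviewed sequences could look like, here is a minimal Python sketch. It assumes the alignment is a FASTA file with UniProt-style headers, where Swiss-Prot (reviewed) entries start with `sp|` and TrEMBL (unreviewed) entries start with `tr|`. Alignments downloaded from other tools may use different headers, and the function names and example entries below are made up for the illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_reviewed(records):
    """Keep only entries whose header starts with 'sp|' (Swiss-Prot, reviewed)."""
    return [(h, s) for h, s in records if h.startswith("sp|")]


# Made-up entries, only to show the header convention.
example = """>sp|X00000|FAKE_A made-up reviewed entry
MKV--LA
>tr|X00001|FAKE_B made-up unreviewed entry
MKI--LA
"""
records = parse_fasta(example)
reviewed = keep_reviewed(records)
print(f"{len(reviewed)} reviewed out of {len(records)} sequences")
```

Printing the counts first is a quick way to see whether enough reviewed sequences remain to build a usable alignment.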
Let us explain in more detail what we mean by our initial hypothesis, namely that alignments with little sequence diversity could give better results than alignments with a lot of diversity.
Because they include many unreviewed proteins and rely on high-level homology search techniques, the multiple sequence alignments coming from protein families, especially InterPro families, may be too diverse to yield good entangled sequences.
In contrast, NCBI BLAST, whose databases contain the largest amount of data, might give us sequences that are very close to the original, and thus entangled sequences with a higher identity percentage. However, this must be weighed against the fact that NCBI may include sequences of defective mutants of the protein, which would mean defective entanglements as well. This is why we are also looking at UniProt and ENSEMBL.
After downloading an alignment, it is always essential to look at it and decide whether it is acceptable*. A lot can be told from what an alignment looks like. A great tool we use to visualize alignments is AliView.
*For example, in the case of noisy alignments such as the one of ccdB from Interpro, you might want to look into trimming algorithms.
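To give an idea of what trimming means, here is a deliberately simple Python sketch that drops alignment columns made up mostly of gaps. It is not one of the published trimming algorithms. The function name and the 0.5 threshold are assumptions chosen for the example, and dedicated tools apply more refined criteria.

```python
def trim_gappy_columns(sequences, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    for col in range(length):
        # Count '-' and '.' as gap characters in this column.
        gaps = sum(1 for s in sequences if s[col] in "-.")
        if gaps / len(sequences) <= max_gap_fraction:
            keep.append(col)
    return ["".join(s[c] for c in keep) for s in sequences]


# Toy alignment: the third column is a gap in 3 of 4 sequences.
aligned = ["MK-VL", "MK-IL", "MKAVL", "M--VL"]
print(trim_gappy_columns(aligned))
```

In this toy alignment only the third column is removed, since the second column has a gap in just one of the four sequences.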
In addition to being great for visualizing alignments, AliView can also realign the sequences with the MUSCLE alignment algorithm. This is very useful and we have used it quite a few times, since it sometimes gives us a better alignment than the one we downloaded.
Now let us begin the analysis. For each of the two proteins, ccdB and knt, we retrieved 5 MSAs from the aforementioned databases. Here are the names we have given them:
- ccdB_dirty, knt_dirty: MSA issued from a NCBI BLAST alignment without curation.
- ccdB_lax, knt_lax: MSA issued from a Uniprot BLAST alignment without curation.
- ccdB_stringent, knt_stringent: MSA issued from a Uniprot BLAST alignment with manual curation.
- ccdB_ensembl, knt_ensembl: MSA issued from an ENSEMBL BLAST alignment without curation.
- ccdB_interpro, knt_interpro: MSA issued from an InterPro family without curation.
Analyzing the impact of MSAs on entangled genes
In order to analyze the impact of MSAs on entangled genes, we looked for a good indicator of the quality of an alignment.
It turns out such an indicator exists, in the form of HMM consensus sequences. Remember, from every MSA, we generate an HMM which roughly contains probabilities for every amino acid to appear at each position.
Since the HMM is aware of these probabilities, we can use it to generate the most probable sequence. This is called the consensus sequence.
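To make the idea of a "most probable sequence" concrete, here is a small Python sketch that computes a simple majority-rule consensus directly from the columns of an alignment. This is a simplified stand-in and not what HMMER does internally: a profile HMM also involves sequence weighting, priors and insert/delete states, so its consensus can differ. The function name and toy alignment are made up for the example.

```python
from collections import Counter


def majority_consensus(sequences):
    """Return the most frequent non-gap residue of each alignment column.

    Columns that contain only gaps are skipped.
    """
    if not sequences:
        return ""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for col in range(length):
        counts = Counter(s[col] for s in sequences if s[col] not in "-.")
        if counts:
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Toy alignment, not real data.
aligned = ["MKVL", "MKIL", "MRVL", "M-VL"]
print(majority_consensus(aligned))
```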
The command we will use to generate HMM consensus sequences from a HMM file prot.hmm is hmmemit -c prot.hmm. This is a convenient tool since it is provided along with hmmbuild and hmmpress in the HMMER suite.
>ccdB
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDES
WRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI
$> hmmemit -c ccdB_dirty.hmm
>ccdB_dirty-consensus
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDES
YRLLTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI
$> hmmemit -c ccdB_interpro.hmm
>ccdB_interpro-consensus
MAQFDVYRNPGKRSAIPYLVDVQSDLLDDLATRVVIPLVPAEQLSPRPQRLTPVIEIEGE
SYVLLTQELASVPAKELGEPVASLSAERDEIIAAIDLLFQGI
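When there are several MSAs to process, the hmmbuild and hmmemit steps above can be scripted. The following Python sketch is one possible way to do it, assuming the HMMER binaries are on the PATH and that the alignment is in a format hmmbuild accepts. The function names and the file name ccdB_dirty.fasta are hypothetical.

```python
import subprocess


def build_commands(msa_path, hmm_path):
    """Return the hmmbuild and hmmemit command lines as argument lists."""
    return [
        ["hmmbuild", hmm_path, msa_path],  # build the profile HMM from the MSA
        ["hmmemit", "-c", hmm_path],       # emit the consensus sequence
    ]


def consensus_from_msa(msa_path, hmm_path):
    """Run both commands and return the consensus in FASTA format.

    Requires HMMER to be installed; raises CalledProcessError on failure.
    """
    build_cmd, emit_cmd = build_commands(msa_path, hmm_path)
    subprocess.run(build_cmd, check=True, capture_output=True)
    result = subprocess.run(emit_cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Only print the commands here, so the example runs without HMMER installed.
for cmd in build_commands("ccdB_dirty.fasta", "ccdB_dirty.hmm"):
    print(" ".join(cmd))
```

Keeping the command construction separate from the execution makes it easy to check what will be run before launching it on every alignment.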
Remember, the first step of CAMEOS is evaluating HMM seeds. HMMs are usually used as profiles for detecting similar sequences, and in that case a lot of variety in the consensus sequence is desirable, since it means finding many homologues.
In our case, though, we might not want the consensus sequence to be too distant from the original protein sequence, since this would imply that the randomly generated sequences evaluated by CAMEOS are themselves distant from the original protein sequence.
We theorize that a good alignment is one that has enough variety while still staying generally close to the original sequence.
To check whether our consensus sequences are close to the original sequences, we performed a BLAST alignment of the sequences using the option "Align two or more sequences" (see Part 3 for more details on NCBI BLASTp).
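For a quick sanity check before going to BLAST, the identity between two sequences of the same length can be estimated position by position. The Python sketch below does this for ccdB and the ccdB_dirty consensus printed earlier. It is only a rough approximation: BLAST computes a local alignment that can contain gaps, so its identity percentages may differ, and the function name is our own.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


# ccdB and the ccdB_dirty consensus, as printed earlier in this section.
ccdb = (
    "MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDES"
    "WRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI"
)
ccdb_dirty_consensus = (
    "MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDES"
    "YRLLTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI"
)
print(f"{percent_identity(ccdb, ccdb_dirty_consensus):.1f}% identity")
```

This shortcut only applies when the two sequences have the same length and need no gaps, which is not the case for every consensus.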
We immediately notice that, despite potentially containing defective mutants, the ccdB_dirty and knt_dirty MSAs take the top position.
Since they come from UniProt and are manually curated, we would expect ccdB_stringent and knt_stringent to be the alignments that are sufficiently diverse without being too far from the original sequences. In the case of Knt, the identity score is really solid, while it is decent for ccdB.
Quite surprisingly, we observe a really good alignment for ccdB_ensembl, with the best e-value, while the identity score for knt_ensembl is just mediocre.
The alignments ccdB_lax and knt_lax perform worse than ccdB_stringent and knt_stringent, which is to be expected since stringent is lax but manually curated.
Unfortunately, in both cases we observe terrible scores for ccdB_interpro and knt_interpro, which may be due to the number of unreviewed proteins in the MSAs, as well as the lack of curation for such a large number of sequences.
Final results
In the end, we performed 25 entanglements, one for each condition (each pairing of a knt MSA with a ccdB MSA). We requested 300 entangled sequences for each CAMEOS execution.
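The 25 conditions are simply every pairing of the five knt MSAs with the five ccdB MSAs. As a small illustration, they can be enumerated in Python as follows. The variable names are ours, and the MSA names follow the naming used in this case study.

```python
from itertools import product

# The five MSA sources used in this case study.
SOURCES = ["dirty", "lax", "stringent", "ensembl", "interpro"]

# Every (knt MSA, ccdB MSA) pairing: 5 x 5 = 25 conditions.
conditions = [(f"knt_{k}", f"ccdB_{c}") for k, c in product(SOURCES, repeat=2)]

print(len(conditions))
for knt_msa, ccdb_msa in conditions[:3]:
    print(f"{knt_msa} x {ccdb_msa}")
```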
For each condition, we retrieved the mean and standard deviation of the Mark and Deg scores. This was compiled in the barplot below.
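The summary step can be sketched in a few lines of Python. The data structure and all the numbers below are made up for the illustration and are not our measured scores. We also assume here that the scores of each condition have already been parsed from the CAMEOS output into plain lists.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return mean(scores), stdev(scores)


# Hypothetical scores for a single condition (made-up values).
results = {
    ("knt_dirty", "ccdB_lax"): {
        "mark": [100.0, 110.0, 120.0],
        "deg": [90.0, 100.0, 110.0],
    },
}

for (knt_msa, ccdb_msa), scores in results.items():
    mark_mean, mark_sd = summarize(scores["mark"])
    deg_mean, deg_sd = summarize(scores["deg"])
    print(f"{knt_msa} x {ccdb_msa}: "
          f"Mark {mark_mean:.1f} +/- {mark_sd:.1f}, "
          f"Deg {deg_mean:.1f} +/- {deg_sd:.1f}")
```

Note that `stdev` is the sample standard deviation; `pstdev` would be the choice if the 300 sequences were treated as the whole population.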
We observe that the population that did the worst overall is knt_interpro x ccdB_interpro, losing by a large margin. In addition, all entanglements that incorporate either InterPro MSA end up with really high scores, with a sum totaling over 500.
According to the consensus sequences, the population that was supposed to have the best scores was knt_dirty x ccdB_ensembl.
Notice that we can clearly observe two populations on this graph: one with excellent Deg Scores, and another with mediocre Deg Scores. This explains the high standard deviation of the population, as shown by the confidence intervals on the barplots.
Instead, the population with the best overall scores is knt_dirty x ccdB_lax. Quite surprisingly, ccdB_lax did better than ccdB_stringent, despite being the same alignment but uncurated.
As expected from the entanglement with the top combined mean Mark Score and mean Deg Score, all sequences show great scores overall. However, knt_dirty x ccdB_ensembl had individual sequences with better Deg Scores, at the cost of worse Mark Scores, so if we wanted to maximize the resemblance to ccdB, that is something to take into consideration.
In conclusion, the observed results in the barplot for Knt are in accordance with what we observed with the consensus sequences earlier, with the order being exactly what would be expected from the consensus sequences identity scores.
For CcdB, however, ccdB_lax surprisingly ends up in first place, despite its really low consensus identity score. This might be explained by the fact that ccdB is the shorter sequence and ends up being the inner sequence of the entanglement in every case. As the inner sequence, it is subject to a lot more variety and it may benefit from a diverse alignment.
From the results we observed, it seems that MSAs from BLAST, ENSEMBL and UniProt stay much closer to the original proteins than protein families from InterPro.
This finding seems to support our starting hypothesis: alignments with proteins as close as possible to the original yield better CAMEOS entangled sequences overall than alignments with distant homologues.
This seems to be especially true for the outer sequence that is entangled into, and less for the inner sequence which can benefit from having variety in the MSA.
To properly test this hypothesis, we think this analysis should be repeated on other proteins and datasets. Additional databases and protein families we could have tested include eggNOG, TIGRFAMs, SUPERFAMILY, KEGG, and more.
In closing, although we do not know whether this hypothesis has some validity or whether it is just a coincidence, we think that our analysis of MSAs leads to very interesting results.
That concludes this course and case study. I hope this was enjoyable to read! It was our first experience with software that has great potential for the future of synthetic biology. I hope our tutorial will ease the way to generating your own entangled genes.
Please let us know if you have any comments. We would really appreciate discussing our results!