Team:GO Paris-Saclay/Course/Part 2



Designing your own entanglements

Now let's get to the fun part. While applying CAMEOS to generate entanglements is nice, we would like to entangle our own proteins instead! However, as it turns out, the process for entangling our own proteins is way more complicated than it may seem.

To begin, one must have a bit of understanding of how entanglements work. As we have seen during the previous part, launching an entanglement requires many inputs. Let's go over them in detail.

First, we need to retrieve protein and nucleotide sequences for our two sequences of interest. These sequences are added to the files cds.fasta and proteins.fasta.

Secondly, for each sequence we need a Multiple Sequence Alignment, Hidden Markov Models, Markov Random Fields, and finally protein energies and pseudolikelihoods.

Finally, we need a file describing the execution. This is the file we described along with CAMEOS earlier in the tutorial, the example.txt file.

Now you may be thinking, that seems like a lot of parameters to take into account. And you would be right. However, as it turns out, the user only has three of those inputs to consider in practice:

  • the original protein and nucleotide sequences
  • Multiple Sequence Alignments of both protein sequences
  • the file describing the execution

This is because Hidden Markov Models, Markov Random Fields, energies and pseudolikelihoods can all be derived from a single protein Multiple Sequence Alignment.

Thankfully, this means we can automate their generation with a clever bash script. On the other end, this also means Multiple Sequence Alignments have a lot of impact on the end result.

For the sake of completeness, we will attempt to go over every command in the bash script provided with this course, although you likely will not use them as such. If you are only interested in running the script, we host it on the Software page of this wiki.

Understanding HMMs and Multiple Sequence Alignments

In this part I assume that you are familiar Multiple Sequence Alignments (MSA) of protein sequences. For more information on how to choose and retrieve MSAs, check out the Case Study.

To illustrate, let's look up a Multiple Sequence Alignment of protein CcdB on the EGGNOG orthology database. Download this fasta file and rename it ccdB.msa.

The MSA file is in the format FASTA + GAPS. It is mandatory to use this format and to have your sequences aligned for the next steps, so if your file contains raw unaligned FASTA sequences or badly aligned sequences, you might want to use a Multiple Sequence Aligner online such as MUSCLE or ClustalW and retrieve the result.

At this point I suggest creating a new folder to run the commands in, because quite a few files will be generated from the ccdB.msa file. Personally the folder I'll be using is ~/Design.

Hidden Markov Models or HMMs for short are probabilistic data structures. They are commonly used in biology as profiles for modeling MSAs.

Source : EMBL-EBI Training, Copyright Creative Commons CC

HMM profiles can be used for homology search, similar to a BLAST, but with more accurate results. HMMs profiles are used to describe protein domains.

To gain access to HMM command line tools, use the following installation command. This will install the HMMER suite, which is better explained by this user guide.

$> sudo apt-get install hmmer

With HMMER installed, you now have access to many new commands, including the ones we will be using, hmmbuild and hmmpress.

Let's generate a HMM profile from our ccdB.msa MSA file. This command should generate a file named ccdB.hmm

$> hmmbuild ccdB.hmm ccdB.msa

Now, by using hmmpress, we will generate binary indices, supposedly to help searching proteins against a HMM profile database.

$> hmmpress ccdB.hmm

This command generates the following files: ccdB.hmm.h3m, ccdB.hmm.h3i, ccdB.hmm.h3f, ccdB.hmm.h3p.

Actually, I took this command from CAMEOS' manual, but I don't think these files are used anywhere? They apparently help with the command hmmscan, which is not ever run.

Installing and using CCMPred

Now that we are done with generating HMMs, we will make Markov Random Fields (MRFs) models to accurately predict long-range protein interactions for every protein sequence of the MSA. Long-range protein interactions are vital to protein structure and function.

Markov Random Fields is actually the name of the artificial intelligence method used to predict the interactions. The prediction is directly done from the MSA, which is why we only need the MSA to generate it.

The software used by CAMEOS for these interactions predictions is CCMPred by Seemayer S. et al. (2014) in Bioinformatics. There are of course other similar prediction software available, with different models and parameters considered, which may give other accuracy results.

CCMPred must be manually compiled. Here are the steps I personally recommand to build CCMPred for a light usage (without CUDA). Make sure you have git, gcc and cmake installed on your computer.

$> sudo apt install git
$> sudo apt install gcc
$> sudo apt install cmake
$> cd ~
$> git clone --recursive
$> cd ~/CCMpred ; cmake . -DWITH_CUDA=OFF
$> make

Hopefully this worked. If not, feel free to ask us for a precompiled CCMPred binary.

Use the same one-liner as before with Julia to add the binary to your PATH and go back to your working directory.

$> sudo ln -s ~/CCMpred/bin/ccmpred /usr/local/bin/ccmpred
$> cd ~/Design

CCMPred will generate a MRF structure from the MSA, however the computational power required depends heavily on the number of sequences and on the length of sequences.

Here is a formula to calculate the memory required in bytes, with L being the number of characters of the aligned sequences and N the number of sequences:

4*(4*(L*L*21*21 + L*20) + 23*N*L + N + L*L) + 2*N*L + 1024

Let's try this on our ccdB.msa alignment, which contains 178 proteins and 165 characters per sequence. We obtain about 186 MB of RAM required. This should be fine for most computers, thankfully.

As an advice, do not attempt to run CCMPred if the requirements do not seem feasible enough. On the computer I used for example, computing the MRF for a protein of length > 1000 took me about 12 hours. In any case, we were told that the entangling process of CAMEOS also works better with relatively small protein sequences.

Now let's run the following commands. The grep part just adapts the MSA to CCMPred's input format.

$> grep -v ">" ccdB.msa > ccdB.ccm
$> ccmpred -r ccdB.raw -n 100 ccdB.ccm ccdB.mat

CCMPred generates the MRFs into an output .mat file and into a .raw file containing the full prediction matrix.

We now want to convert the raw output .raw file into a Julia .jld structure using one of CAMEOS' scripts.

The script requires the HDF5 and GZip packages to be installed. Both as a Julia module and a Linux package.

$> sudo apt-get install hdf5-tools # HDF5
$> sudo apt-get install zlib1g-dev # GZip
julia> using Pkg
julia> Pkg.add("HDF5")
julia> Pkg.add("GZip")
julia> using HDF5, GZip

The CAMEOS script is located into the CAMEOS/prepare/ folder.
Run the following command.

$> julia ~/CAMEOS/prepare/convert_ccm_to_jld.jl ccdB.raw ccdB.jld

At this point, you should have in your working directory: a .msa MSA file, a .hmm HMM file (along with HMM binary indices), and a .jld MRF file. The other files are intermediates, feel free to delete them whenever you want.

Preparing CAMEOS entanglements

From there on, another CAMEOS Julia script predicts energies and pseudolikelihoods for each protein from the MSA file. It needs in input the MSA file and MRF .jld file.

Both energies and pseudolikelihoods are really important and take part into the many optimizations done by CAMEOS to find overlapping sequences.

As with the other one, the Julia script is in CAMEOS/prepare. This time we have to be in the CAMEOS/prepare directory for it to work. So for example if your current working directory was called ~/Design, use the following commands:

$> cd ~/CAMEOS/prepare/
$> julia energies_and_psls.jl ccdB ~/Design/ccdB.jld ~/Design/ccdB.msa
$> cd ~/Design

This Julia command generates files into the CAMEOS/main/energies and CAMEOS/main/psls folders.

Back in your working directory, move all files created during the previous steps into the ~/CAMEOS/main folder.

$> mv ccdB.hmm* ~/CAMEOS/main/hmms
$> mv ccdB.jld ~/CAMEOS/main/jlds
$> cp ccdB.msa ~/CAMEOS/main/msas

And that's it! You should see all inputs derived from the MSA currently in their right place! Let me illustrate this with a command quickly.

$> cd ~/CAMEOS/main
$> find -type f -name "*ccdB.*"

Perfect! These are all the inputs required for the first protein. Remember, we needed a Multiple Sequence Alignment, Hidden Markov Models, Markov Random Fields, energies and pseudolikelihoods.

Now we only have left to do the same process with the other protein, put both protein and coding sequences into proteins.fasta and cds.fasta respectively, and create a file describing the execution!* Don't worry, you have done the hardest part!

*In the execution file I suggest changing the number of variants to 300 so you have more material to analyze. This is what I did for the BioBricks design and the Case Study.


Now, if you were to run my script, all the previous steps would have been done automatically from the file ccdB.msa. Here is how to configure and run the script.

First, you need to put the script and the alignment ccdB.msa in the same folder, let's say ~/Design. Then, in the script, modify the value protein and replace it by ccdB. Modify the value CAMEOS and replace it by the path to your actual CAMEOS folder, for example ~/CAMEOS. And then you are ready to go with command ./!

Now that we know what we are dealing with inside CAMEOS, the software should be a little more clear! (but if not, don't worry, I don't really get the inner workings of the algorithm that much either!).

CAMEOS generates sequences by using HMMs and a stochastic backtracking, then it estimates and optimizes HMMs/MRFs pseudolikelihoods iteratively.

These pseudolikelihoods are then used to rank each entangled sequence. In the next part, we are going to try to make use of these scores and find out ways to select the best sequences. If you want to dive into the model in detail, please check our Model page!

Back to the top:
Faculté des Sciences d'Orsay- Université Paris-Saclay-Logo
Team GO Paris-Saclay
Université Paris-Saclay
Faculté des Sciences d'Orsay
Building n°400
91 405 Cedex, Orsay
GO Paris-Saclay logo - like Eiffel Tower with a DNA strand

Thank you very much to our generous Sponsors