CAMEOS Course
Designing your own entanglements
Now let's get to the fun part. While applying CAMEOS to generate entanglements is nice, we would like to entangle our own proteins instead! However, as it turns out, the process for entangling our own proteins is way more complicated than it may seem.
To begin, one must have a bit of understanding of how entanglements work. As we have seen during the previous part, launching an entanglement requires many inputs. Let's go over them in detail.
First, we need to retrieve protein and nucleotide sequences for our two
sequences of interest. These sequences are added to the files
cds.fasta
and proteins.fasta
.
Secondly, for each sequence we need a Multiple Sequence Alignment, Hidden Markov Models, Markov Random Fields, and finally protein energies and pseudolikelihoods.
Finally, we need a file describing the execution. This is the file we
described along with CAMEOS earlier in the tutorial, the
example.txt
file.
Now you may be thinking, that seems like a lot of parameters to take into account. And you would be right. However, as it turns out, the user only has three of those inputs to consider in practice:
- the original protein and nucleotide sequences
- Multiple Sequence Alignments of both protein sequences
- the file describing the execution
This is because Hidden Markov Models, Markov Random Fields, energies and pseudolikelihoods can all be derived from a single protein Multiple Sequence Alignment.
Thankfully, this means we can automate their generation with a clever bash script. On the other end, this also means Multiple Sequence Alignments have a lot of impact on the end result.
For the sake of completeness, we will attempt to go over every command in the bash script provided with this course, although you likely will not use them as such. If you are only interested in running the script, we host it on the Software page of this wiki.
Understanding HMMs and Multiple Sequence Alignments
In this part I assume that you are familiar Multiple Sequence Alignments (MSA) of protein sequences. For more information on how to choose and retrieve MSAs, check out the Case Study.
To illustrate, let's look up a Multiple Sequence Alignment of protein
CcdB on the EGGNOG orthology database. Download this
fasta file
and rename it ccdB.msa
.
The MSA file is in the format FASTA + GAPS. It is mandatory to use this format and to have your sequences aligned for the next steps, so if your file contains raw unaligned FASTA sequences or badly aligned sequences, you might want to use a Multiple Sequence Aligner online such as MUSCLE or ClustalW and retrieve the result.
At this point I suggest creating a new folder to run the commands in, because quite
a few files will be generated from the ccdB.msa
file. Personally the folder I'll be using is ~/Design
.
Hidden Markov Models or HMMs for short are probabilistic data structures. They are commonly used in biology as profiles for modeling MSAs.
HMM profiles can be used for homology search, similar to a BLAST, but with more accurate results. HMMs profiles are used to describe protein domains.
To gain access to HMM command line tools, use the following installation command. This will install the HMMER suite, which is better explained by this user guide.
$> sudo apt-get install hmmer
With HMMER installed, you now have access to many new commands,
including the ones we will be using,
hmmbuild
and hmmpress
.
Let's generate a HMM profile from our ccdB.msa
MSA file.
This command should generate a file named ccdB.hmm
Now, by using hmmpress
, we will generate binary indices, supposedly to
help searching proteins against a HMM profile database.
This command generates the following files:
ccdB.hmm.h3m
, ccdB.hmm.h3i
, ccdB.hmm.h3f
, ccdB.hmm.h3p
.
Actually, I took this command from CAMEOS' manual, but I don't think these files
are used anywhere? They apparently help with the command hmmscan
, which is not ever run.
Installing and using CCMPred
Now that we are done with generating HMMs, we will make Markov Random Fields (MRFs) models to accurately predict long-range protein interactions for every protein sequence of the MSA. Long-range protein interactions are vital to protein structure and function.
Markov Random Fields is actually the name of the artificial intelligence method used to predict the interactions. The prediction is directly done from the MSA, which is why we only need the MSA to generate it.
The software used by CAMEOS for these interactions predictions is CCMPred by Seemayer S. et al. (2014) in Bioinformatics. There are of course other similar prediction software available, with different models and parameters considered, which may give other accuracy results.
CCMPred must be manually compiled. Here are the steps I personally recommand to build CCMPred for a light usage (without CUDA). Make sure you have git, gcc and cmake installed on your computer.
$> sudo apt install git$> sudo apt install gcc
$> sudo apt install cmake
$> cd ~
$> git clone --recursive https://github.com/soedinglab/CCMpred.git
$> cd ~/CCMpred ; cmake . -DWITH_CUDA=OFF
$> make
Hopefully this worked. If not, feel free to ask us for a precompiled CCMPred binary.
Use the same one-liner as before with Julia to add the binary to your PATH and go back to your working directory.
$> sudo ln -s ~/CCMpred/bin/ccmpred /usr/local/bin/ccmpred$> cd ~/Design
CCMPred will generate a MRF structure from the MSA, however the computational power required depends heavily on the number of sequences and on the length of sequences.
Here is a formula to calculate the memory required in bytes, with L being the number of characters of the aligned sequences and N the number of sequences:
4*(4*(L*L*21*21 + L*20) + 23*N*L + N + L*L) + 2*N*L + 1024
Let's try this on our ccdB.msa
alignment, which contains 178 proteins and 165 characters per sequence.
We obtain about 186 MB of RAM required. This should be fine for most computers, thankfully.
As an advice, do not attempt to run CCMPred if the requirements do not seem feasible enough. On the computer I used for example, computing the MRF for a protein of length > 1000 took me about 12 hours. In any case, we were told that the entangling process of CAMEOS also works better with relatively small protein sequences.
Now let's run the following commands. The grep
part just adapts the MSA to CCMPred's input format.
$> ccmpred -r ccdB.raw -n 100 ccdB.ccm ccdB.mat
CCMPred generates the MRFs into an output .mat file and into a .raw file containing the full prediction matrix.
We now want to convert the raw output .raw file into a Julia .jld structure using one of CAMEOS' scripts.
The script requires the HDF5 and GZip packages to be installed. Both as a Julia module and a Linux package.
$> sudo apt-get install hdf5-tools # HDF5$> sudo apt-get install zlib1g-dev # GZip julia> using Pkg
julia> Pkg.add("HDF5")
julia> Pkg.add("GZip")
julia> using HDF5, GZip
The CAMEOS script is located into the CAMEOS/prepare/
folder.
Run the following command.
At this point, you should have in your working directory: a .msa
MSA file,
a .hmm
HMM file (along with HMM binary indices), and a .jld
MRF file. The other files
are intermediates, feel free to delete them whenever you want.
Preparing CAMEOS entanglements
From there on, another CAMEOS Julia script predicts energies and pseudolikelihoods for each protein from the MSA file. It needs in input the MSA file and MRF .jld file.
Both energies and pseudolikelihoods are really important and take part into the many optimizations done by CAMEOS to find overlapping sequences.
As with the other one, the Julia script is in CAMEOS/prepare
.
This time we have to be in the CAMEOS/prepare
directory for it to work.
So for example if your current working directory was called ~/Design
, use the following commands:
$> julia energies_and_psls.jl ccdB ~/Design/ccdB.jld ~/Design/ccdB.msa
$> cd ~/Design
This Julia command generates files into the CAMEOS/main/energies
and CAMEOS/main/psls
folders.
Back in your working directory, move all files created during the previous steps into the ~/CAMEOS/main
folder.
And that's it! You should see all inputs derived from the MSA currently in their right place! Let me illustrate this with a command quickly.
$> cd ~/CAMEOS/main$> find -type f -name "*ccdB.*"
./energies/energy_ccdB.txt
./jlds/ccdB.jld
./hmms/ccdB.hmm
./psls/psls_ccdB.txt
./msas/ccdB.msa
Perfect! These are all the inputs required for the first protein. Remember, we needed a Multiple Sequence Alignment, Hidden Markov Models, Markov Random Fields, energies and pseudolikelihoods.
Now we only
have left to do the same process with the other protein, put both protein and coding sequences
into proteins.fasta
and cds.fasta
respectively, and create a file
describing the execution!* Don't worry, you have done the hardest part!
*In the execution file I suggest changing the number of variants to 300 so you have more material to analyze.
This is what I did for the BioBricks design and the Case Study.
Now, if you were to run my script, all the previous steps
would have been done automatically from the file ccdB.msa
. Here is how to configure and run the script.
First, you need to put the script bash.sh
and the alignment
ccdB.msa
in the same folder, let's say ~/Design
. Then, in the script, modify the value protein
and
replace it by ccdB
. Modify the value
CAMEOS
and replace it by the path to your actual CAMEOS
folder, for example ~/CAMEOS
. And then you are ready to go
with command ./bash.sh
!
Now that we know what we are dealing with inside CAMEOS, the software should be a little more clear! (but if not, don't worry, I don't really get the inner workings of the algorithm that much either!).
CAMEOS generates sequences by using HMMs and a stochastic backtracking, then it estimates and optimizes HMMs/MRFs pseudolikelihoods iteratively.
These pseudolikelihoods are then used to rank each entangled sequence. In the next part, we are going to try to make use of these scores and find out ways to select the best sequences. If you want to dive into the model in detail, please check our Model page!