Team:GO Paris-Saclay/Course/Part 1

CAMEOS Course

Installing and using CAMEOS

In this tutorial you will learn how to use CAMEOS, a brand-new piece of software developed by a team of talented researchers: Tomasz Blazejewski, Hsing-I Ho and Harris Wang (see their 2019 article in Science).

This software allows one to entangle two gene sequences together, and is the central part of our iGEM team project.

Throughout this tutorial we will explore the various tools required to build and run CAMEOS, as well as a number of variables to take into consideration when aiming for theoretically optimal entangled sequences.

Installing and using Julia

The first thing to note about CAMEOS is that it is written in the Julia programming language. While Julia may not be the first programming language that comes to mind, it is a language of choice for complex matrix computations, such as the ones featured in the CAMEOS algorithm.

In computational biology, Julia is a language that is often overshadowed by its peers such as Python and MATLAB. Fortunately, there is a community project called BioJulia, which attempts to gather all computational biology applications written in Julia, though CAMEOS is not a part of it (yet?).

In any case, if you are a computational biologist, then studying CAMEOS would be the perfect opportunity to learn the language Julia.

Let's get started. This tutorial assumes you are using a Linux environment with administrator rights. Here is the download page for all versions of Julia: we chose Linux on x86.

We have encountered issues running CAMEOS with versions of Julia later than 1.3.1, so we suggest sticking to the two versions that our team succeeded in entangling genes with: Julia 1.0.5 and 1.3.1. Beware though, version 1.3.1 of Julia is unmaintained.

Your download of Julia should now be starting. Once it is done, you should find a file called julia-1.0.5-linux-x86_64.tar.gz in your ~/Downloads folder. Extracting it with tar -zxvf will give you a folder named ~/Downloads/julia-1.0.5.

I suggest this quick one-liner, which creates a symbolic link to the Julia binary in /usr/local/bin, your local binaries folder.

sudo ln -s ~/Downloads/julia-1.0.5/bin/julia /usr/local/bin/julia
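The extraction and linking steps above can also be collected into one small script. This is only a sketch: install_julia is a hypothetical helper of my own (not part of Julia or CAMEOS), and the three paths are arguments you choose yourself.

```shell
#!/usr/bin/env bash
# install_julia <tarball> <unpack-dir> <bin-dir>
# Hypothetical helper: unpack a Julia tarball and link its binary into <bin-dir>.
install_julia() {
    local tarball="$1" prefix="$2" bindir="$3" topdir
    [ -f "$tarball" ] || { echo "tarball not found: $tarball" >&2; return 1; }
    mkdir -p "$prefix" "$bindir" || return 1
    # Make the unpack path absolute so the link resolves from any directory.
    prefix="$(cd "$prefix" && pwd)"
    tar -zxf "$tarball" -C "$prefix" || return 1
    # The archive unpacks into a single top-level folder; read its name from the listing.
    topdir="$(tar -ztf "$tarball" | sed -n '1s|/.*||p')"
    # -f replaces an existing link, -n avoids following it if it points to a folder.
    ln -sfn "$prefix/$topdir/bin/julia" "$bindir/julia" || return 1
    echo "linked $bindir/julia -> $prefix/$topdir/bin/julia"
}
```

For example, calling install_julia with ~/Downloads/julia-1.0.5-linux-x86_64.tar.gz, ~/Downloads and ~/.local/bin avoids sudo, provided ~/.local/bin is on your PATH. Linking into /usr/local/bin, as in the command above, requires administrator rights.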

Now you just need to close and reopen your terminal, type julia, and the command should work! Here is what the Julia console looks like:

$> julia
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.5 (2019-09-09)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
 
julia>

And because no programming tutorial is complete without it, here is how to print a "Hello world!" in Julia:

julia> print("Hello world!")
Hello world!

Let us now consider two DNA sequences seq_A and seq_B of 9 nucleotides each. Here I use a quick loop to interleave seq_A and seq_B, one character at a time.

julia> seq_A = "atgtcgtag"
"atgtcgtag"

julia> seq_B = "ATGCGGTAA"
"ATGCGGTAA"

julia> for i in 1:9
           print(seq_A[i], seq_B[i])
       end
aAtTgGtCcGgGtTaAgA

This isn't quite a sequence entanglement yet. For that, we will need to insert a sequence into another, such as seq_B into seq_A, while making sure both Open Reading Frames remain functional.

Here is an example of concatenating the strings in such a way that seq_B is entangled into seq_A. seq_B keeps its own Open Reading Frame unchanged, while the Open Reading Frame of seq_A now contains the full sequence of seq_B.

julia> string(seq_A[1:4], seq_B[1:9], seq_A[5:9])
"atgtATGCGGTAAcgtag"

Now this really looks like a sequence entanglement. At this point, though, there is no guarantee that seq_A is still functional.
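One thing that can at least be checked by hand is whether a reading frame runs into a stop codon too early. The sketch below splits a sequence into codons from a chosen offset and reports the first in-frame stop codon (TAA, TAG or TGA). It uses bash with the standard fold and awk tools rather than Julia, codons and first_stop are hypothetical helper names, and it says nothing about whether the encoded protein actually works.

```shell
#!/usr/bin/env bash
# codons <sequence> [offset]
# Print the codons of <sequence>, one per line, starting at a 0-based offset.
codons() {
    local seq="$1" offset="${2:-0}"
    printf '%s\n' "${seq:offset}" | fold -w 3
}

# first_stop <sequence> [offset]
# Print the 1-based index of the first in-frame stop codon, or nothing if there is none.
first_stop() {
    codons "$@" | awk '!found && tolower($0) ~ /^(taa|tag|tga)$/ { print NR; found = 1 }'
}
```

On the toy entanglement atgtATGCGGTAAcgtag, first_stop prints 6 for offset 0: the stop is the last of six codons, so the frame of seq_A is still closed by its original tag. With offset 4 it prints 3, which is the TAA that ends seq_B.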

As you can imagine, the sequence entanglement process done by CAMEOS is much more complicated, given that it takes functionality into account.

Now that these examples have given you an idea of how Julia works and what we will be doing with CAMEOS, we are ready to move on to the next section.

Your first CAMEOS entanglements

Let's begin by downloading the software. CAMEOS is open source and hosted on GitHub.

This means you can use git clone to download it if you have git installed on your computer, or alternatively, you can use the Download Zip button.

Once the download is complete, you should have a folder called CAMEOS/ with three subfolders, one of them called main/. We won't use the other subfolders, so you can go ahead and change your current directory to CAMEOS/main/. This is where you will run terminal commands.

There is a catch, though: some of the files are very large, and as a result the copies that GitHub provides are corrupt.

The large files in question are main/jlds/infA.jld and main/jlds/aroB.jld. Use the following links to download the raw files: infA.jld and aroB.jld. Make sure to replace the corrupt files in the jlds/ folder with the raw files you just downloaded.

Before starting the entanglement process, I suggest checking the integrity of the files just to be sure. Here is the command you can use:

$> md5sum jlds/*.jld
af0a9df70c0d408476ae1c17b3261852 jlds/aroB.jld
8b8eaf1f0459df87e088d6da887712d5 jlds/infA.jld

If your MD5 checksums are not exactly the same as these, then the corresponding file is corrupt. In general, MD5 is a really useful tool for checking the integrity of a file.
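Rather than comparing the checksums by eye, you can let md5sum do the comparison with its -c option. Here is a minimal sketch that embeds the two checksums quoted above; check_md5 and check_jlds are hypothetical helper names of my own, not CAMEOS commands.

```shell
#!/usr/bin/env bash
# check_md5 <directory>
# Read "<hash>  <filename>" lines on standard input and verify each named file
# inside <directory>. md5sum -c returns a non-zero status if any file is
# missing or does not match its checksum.
check_md5() {
    local dir="$1"
    ( cd "$dir" && md5sum -c - )
}

# check_jlds [directory]
# Check the two large files against the checksums quoted in this tutorial
# (default directory: jlds).
check_jlds() {
    check_md5 "${1:-jlds}" <<'EOF'
af0a9df70c0d408476ae1c17b3261852  aroB.jld
8b8eaf1f0459df87e088d6da887712d5  infA.jld
EOF
}
```

Run check_jlds from CAMEOS/main/. It prints one OK or FAILED line per file, and its exit status can be used in a script to stop before launching a long run on corrupt inputs.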

Let's now install all the Julia packages needed to run CAMEOS. Open the Julia shell and type the following commands:

julia> using Pkg
julia> Pkg.add("BioAlignments")
julia> Pkg.add("BioSymbols")
julia> Pkg.add("Logging")
julia> Pkg.add("StatsBase")
julia> Pkg.add("JLD")
julia> Pkg.add("Distributions")
julia> Pkg.add("ArgParse")
julia> Pkg.add("NPZ")
julia> using BioAlignments, BioSymbols, Logging, NPZ
julia> using StatsBase, JLD, Distributions, ArgParse

Hopefully this part went through without any package-related errors. The next time you open the Julia shell, these packages will already be precompiled.

For CAMEOS to work, we need to take two additional steps. First, create the output folder output/ with the command mkdir output. Second, install the HMMER command-line tools with the following command.
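If you would rather install the packages without opening the Julia shell, the same calls can be passed to julia -e from the terminal. The sketch below only builds the Julia expression; pkg_add_expr is a hypothetical helper name, and it relies on Pkg.add accepting a list of package names.

```shell
#!/usr/bin/env bash
# pkg_add_expr <package>...
# Hypothetical helper: print a Julia expression that installs the given packages,
# for example: using Pkg; Pkg.add(["JLD", "NPZ"])
pkg_add_expr() {
    local list="" pkg
    for pkg in "$@"; do
        # Append each name in double quotes, separated by ", ".
        list="${list:+$list, }\"$pkg\""
    done
    printf 'using Pkg; Pkg.add([%s])\n' "$list"
}

# To actually run the installation (requires julia on your PATH):
#   julia -e "$(pkg_add_expr BioAlignments BioSymbols Logging StatsBase JLD Distributions ArgParse NPZ)"
```

Keeping the package list in one command makes it easy to repeat the installation on another machine.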

$> sudo apt-get install hmmer
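Before launching a long run, it can be worth checking that everything is in place. The following sketch creates the output folder if needed and reports which command-line tools are found on your PATH. preflight is a hypothetical helper name. hmmsearch is one of the programs shipped with HMMER, but this tutorial does not say which of them CAMEOS calls, so treat the tool list as an assumption.

```shell
#!/usr/bin/env bash
# preflight <tool>...
# Hypothetical helper: create output/ in the current folder and check that each
# named tool can be found on the PATH. Returns non-zero if any tool is missing.
preflight() {
    local missing=0 tool
    mkdir -p output || return 1
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "missing  $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example, to be run from CAMEOS/main/:
#   preflight julia hmmsearch
```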

As it turns out, CAMEOS is provided as a Julia script named main.jl in the main/ folder, which means we will have to call it from the terminal with julia main.jl.

The script also takes a file as its argument. Let's take a look at the file example.txt.

This file describes the execution. Here is how to read this file: all outputs will be stored into the subfolder main/output/. We are entangling infA and aroB.

The number 100 indicates the number of entanglements generated. That's right, one execution of CAMEOS actually generates several entanglements* to choose from!

*Also called variants of the entanglement, depending on whether you use "entanglement" to mean the process or the entangled sequence.

The necessary Hidden Markov Models are in the hmms/ folder and the Markov Random Fields are in the jlds/ folder. These are some of the really important CAMEOS inputs we will look into in the later sections.

Let's go ahead and start running CAMEOS with this file! This might take a bit of time. If runtime errors occur, they will be stored in the file problem_runs_0.txt.

$> julia main.jl example.txt
┌ Warning: `@add_arg_table` is deprecated, use `@add_arg_table!` instead
└ @ ArgParse :-1
The random barcode on this run is: oVwNsNbF
CAMEOS tensor built
Evaluating HMM seeds
Beginning long-range optimization.
Step 0 of 250...
Step 50 of 250...
Step 100 of 250...
Step 150 of 250...
Step 200 of 250...

If everything worked out perfectly, you should see the exact same terminal output except for the barcode which is randomly generated.
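Since the barcode changes on every run, it is handy to save the terminal output and pull the barcode out of it automatically. This sketch assumes the log contains the line "The random barcode on this run is: ..." exactly as shown above; barcode_from_log is a hypothetical helper name and run.log is a file name I picked.

```shell
#!/usr/bin/env bash
# barcode_from_log
# Hypothetical helper: read a saved CAMEOS log on standard input and print
# the barcode announced at the start of the run.
barcode_from_log() {
    sed -n 's/^The random barcode on this run is: //p'
}

# Example (requires CAMEOS; tee shows the output and saves a copy):
#   julia main.jl example.txt 2>&1 | tee run.log
#   barcode="$(barcode_from_log < run.log)"
```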

In addition, the execution generated an output folder named output/infA_aroB_p1/ with a few files.

The barcode is used in the name of the output files. For example, here are some files from the folder output/infA_aroB_p1/ we may be interested in:

  • all_final_fitness_barcode.txt: a text file describing the CAMEOS rating scores of all 100 entanglements.
  • saved_pop_barcode.jld: a Julia data structure containing the entanglements. The Julia code for exploring this structure is detailed in CAMEOS' manual.
  • top_twelve_barcode.fa: the protein sequences of the 12 top-rated entanglements, according to CAMEOS.

I suggest only looking at the top hits for now. Let's look at them. Below are my results. It is fully expected that yours will be entirely different, since CAMEOS generates potential entanglements randomly.
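If you have run CAMEOS several times, the output folder holds files from several barcodes. A quick way to list only the files of one run is to filter on the barcode, as in this sketch; list_run_files is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# list_run_files <folder> <barcode>
# Hypothetical helper: list the files directly inside <folder> whose name
# contains <barcode>, in alphabetical order.
list_run_files() {
    local dir="$1" barcode="$2"
    find "$dir" -maxdepth 1 -type f -name "*${barcode}*" | sort
}

# Example, to be run from CAMEOS/main/ with your own barcode:
#   list_run_files output/infA_aroB_p1 oVwNsNbF
```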

And there we go. Here are our first entanglements! Well, the protein translated sequences of our first entanglements. Let's quickly check the resemblance of the very top result with the original protein sequences.

The original protein and nucleotide sequences are respectively listed in the files proteins.fasta and cds.fasta in CAMEOS/main/.

Here is a simple comparison with the original protein sequences for infA and aroB. We can clearly see drastic changes in both sequences. That's because our proteins were entangled!

Now for the actual entangled nucleotide sequence, we have to use Julia. This is a bit trickier. Notice the (ind 23) in the fasta header of my top result? That means this exact entanglement is at index 23 in the data structure from my file saved_pop_barcode.jld.

Change your directory to output/infA_aroB_p1/ and type the following commands in the Julia shell. Make sure to replace the barcode (oVwNsNbF) and the index (23) with the values from your own outputs.

julia> using JLD
julia> vars = load("saved_pop_oVwNsNbF.jld")["variants"];
julia> vars[23].full_sequence
"atggagaggatagtggtcaccctgggtgagcggtcttatccaatcacgatcgcttcggggttatttaatgagcctgcttcgttcctgaaaccgctgaaatcgggggagcaggtaatgctcgtgacaaatgagacactggcccccctgtatttggataaagttaggggtgtcttggagcaggcaggtgtgaacgtcgatagtgtcatcctgcctgatggcgagcagtataaatctttagcggtgctggacaccgtgttcaccgcactgctacaaaaaccgcatgggcgcgacactactttagtcgcactggggggcggtgtcgttggtgatctgacaggcttcgcggccgcgagttatcaacggggcgtgcggtttattcaggttccgaccactctactgagtcaagttgattctagtgtaggcggcaaaacggcagtcaaccatccattgggcaaaaatatgattggtgcgttctaccaaccagcttcggtcgtcgtggatctggattgcctgaagacgctgccgccccgggaactagcaagtggtctggctgaggttatcaaatatggcatcattctggacggcgcattctttaactggcttgaagagaatctggatgccctactaagactggacggtccagctatggcctattgtattagacggtgctgtgagctaaaagccgaggttgttgctgcggacgaacgagagactggtctacgtgctttgttgaatctaggtcacacgtttggtcacgccatcgaagccgaaatgggttatggaaattggcttcacggggaggcggtggcagcggggatggtcatggccgcacggacatctgagcggctgggacaATGTCCAGCAGAGGTTCTATCGAGGTTGAGGGCGTTGTTGAGGAGAGCCTCCCTTCCGGTCAGTGGACCGTCCGAGATGACGACGGAAGCGTACTTACCGCACATGCTTCGGGACAAATACGTCGTTTCAGGATCCGTGTGCTTAGTGGTGATCGTGTCACTGGGGAACTCAGCCCTGCGGACCTCACTCGCGGCAGAATTACTTTTAGACTCTCTTAGAgactgtcaatcagcg"
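The sequence above mixes lowercase and uppercase letters. Assuming, as in the toy example from the Julia section, that the uppercase stretch marks the embedded gene, this bash sketch reports where that stretch starts, how long it is, and how far its reading frame is shifted relative to the outer one. embedded_region is a hypothetical helper name, not a CAMEOS function.

```shell
#!/usr/bin/env bash
# embedded_region <sequence>
# Hypothetical helper: locate the first uppercase stretch of <sequence> and print
# its 1-based start, its length, and the prefix length modulo 3 (frame_shift).
embedded_region() {
    local seq="$1" prefix rest upper
    case "$seq" in
        *[[:upper:]]*) ;;
        *) echo "no uppercase region" >&2; return 1 ;;
    esac
    prefix="${seq%%[[:upper:]]*}"     # everything before the first uppercase letter
    rest="${seq#"$prefix"}"           # from the first uppercase letter onwards
    upper="${rest%%[![:upper:]]*}"    # cut at the first letter that is not uppercase
    echo "start=$(( ${#prefix} + 1 )) length=${#upper} frame_shift=$(( ${#prefix} % 3 ))"
}
```

On the toy sequence atgtATGCGGTAAcgtag this prints start=5 length=9 frame_shift=1: the uppercase gene starts at the fifth nucleotide and is read one position out of step with the outer frame. You can paste your own full_sequence in place of the toy sequence.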

With all these commands said and done, you can now admire one of the first 100 sequence entanglements you've done with CAMEOS.

That marks the end of this first part of the course. In the second section we will be looking into how to create our own entanglements, and in the third section we will look more into the scores computed by CAMEOS and how to find the best entanglements.

Both sections are independent of each other so you can jump directly to the third part if you want to analyze your entanglements.


Team GO Paris-Saclay
Université Paris-Saclay
Faculté des Sciences d'Orsay
Building n°400
91 405 Cedex, Orsay