Model
Overview
S.cerevisiae has approximately 1200 essential genes, out of approximately 6000 genes. An important characteristic of essential genes is that almost any mutation to them would be lethal to the mutated organism.
In our project, we assume that by transcriptionally interlocking a target gene (a component of our choice in the genetic circuit) to the N terminal of an essential gene, we prevent mutations such as non-sense and frameshift. This would also prevent many mutations that lead to misfolding of the target gene, as they would likely cause misfolding of the essential gene as well. The large variety of essential proteins gives us the opportunity to engineer many different constructs, creating many different genetic circuits while maintaining genetic and evolutionary stability. Also, it is possible that that a non-essential gene will better promote a target gene’s stability, perhaps because it is preserved in evolution even though it is not defined as "essential". Therefore, we need to characterize all the genes in the host organism in order to understand their contribution to the co-stability.
A main goal of our project is to develop an automated recommendation engine (as a software) that, given an input gene target gene, would match the best fitting essential gene to it, based on bioinformatic tools and empirical data. Next, our software would choose the best linker candidates considering their tendency to fold and interfere with the natural folding of the fused genes. And as a final step, the software would optimize the mutual construct (composed of essential gene, target gene, and linker) for efficient translation and increased stability.
However, this modular design raises some fundamental questions:
- Which features influence the genetic stability of genes, and how can it be predicted?
- In case of fusion linkers, how to simulate the folding disorder caused to the attached genes?
- Can genetic stability go hand-in-hand with the need for high expression levels?
In order to answer these questions, simulate the co-stability of interlocked genes and understand their behavior, we devised novel prediction and optimization models. For each model, we establish quantifiable measures of success and utilize available databases for the development.
So basically, we are developing three modules:
- Stability predictor that will help us select the best matching conjugated gene (not necessarily essential, as we explained above) to an input target gene.
- Linker folding predictor, that will indicate which linker will not cause a spatial disturbance in the mutual construct.
- Translation optimizer of the combined construct, in terms of mRNA folding energy, codon usage, GC content and avoidance of mutational hotspots.
In the next tabs of this page, we describe our efforts in developing three models that inform on the stability-related design principles of our solution. The first 4 tabs are related to our first model - the stability predictor - since it is composed of many sub-steps, each requires special attention. An introduction is waiting for you in the "stability predictor introduction" tab!
Stability predictor Introduction
Regarding the stability prediction of the best essential gene, our modeling approach is based on three steps:
- Develop a predictive machine-learning model of stability levels, when referring to GFP as a target gene. We predict the co-stability of all ORFs in yeast when attached to a given target gene. We initially designed our model to predict the genomic stability data obtained in our Gene-SEQ experiment, determined by the distribution of mutations within the target gene. We recieve the sequencing results too close to the competition due to COVID-19 related issues. Hence, the model is designed to predict similar data from Schuldiner's database, that contains measurements of the fluorophore intensity of ~90% of the genes in yeast when tagged with GFP in their N terminus. The libraries on which the measurement were conducted are described in our Experiments page.
- NOP1pr-GFP intensity: Median of GFP intensity of all strains tagged with GFP under the synthetic promoter NOP1.
- NATIVEpr-GFP intensity: Median of GFP intensity of SWAT strains under their native promoter, after swap with Seamless GFP donor (Native promoter-GFP).
- Gain information from the above model about the important features for the prediction and choose the configuration that yields the best results.
- Test the performance of this model on a similar task for RFP empirical data, to simulate the implementation process of the developed model on a new target gene.
The predicted values are:
We refer to the fluorescence levels as the model’s ground truth, or labels (= the thing we wish to predict with high accuracy), assuming they are highly correlated to the stability levels. This assumption makes sense because the fluorescence values indicate the amount of GFP-tagged proteins; in other words, it is the amount of genes that kept their target gene up until a specific time point. Therefore, the fluorescence levels can be used as a measure of stability when dealing with GFP.
The prediction in this step is based on many features we created from different data sources as described below.
In a future version of this model, we will use the data from our own Gene-SEQ experimental results to improve the stability prediction and gain more information regarding the features that influence the stability of the construct (steps A-B).
Feature Generation
The input data for machine learning (ML) algorithms consists of instances, which are samples of specific population (in our case, these instances are equivalent to the genes in yeast), and features, which are a set of characteristics that describe the instance. Features can be continuous or categorical. In any case, one converts these features to be numerical, as they are more convenient to use and adapt to existing ML methods. Another important input is the label of each instance, which is the property you wish to predict - fluorescence levels, stability, or anything that you have the data for.
Since we don't know in advance what are the characteristics of stable genes, we generated many descriptive features for each of the yeast's ORF - individual features that are related to genomic stability, and features with respect to the target gene as well, since our aim is to predict the co-stability (i.e. the stability of the combined construct composed of yeast's gene and a target gene).
For this purpose we consulted with experts, read much literature and collected data from a variety of sources (academic publications, empiric measurements, etc.). We divided the features into groups:
- Protein to protein interaction networks
- Transcription factors binding
- Protein folding disorder
- Codon usage bias features
- Target-gene based features
- Chemical properties features
- Local mRNA folding energy
- Co-expression mRNA network
- ATG context score
- Number of shifted ORFs in the coding sequence
- evolutionary conservation
- Features from the "selection of 10 essential genes" project
Extracting features from PPI networks and graphs
Essential proteins are those that are indispensable to cellular survival and development. Existing methods for essential protein identification generally rely on knock-out experiments and/or the relative density of their interactions (edges) with other proteins in a Protein-Protein Interaction (PPI) network [1].
Essential proteins are usually more stable.
We used 4 different data sources:
STRING network
Source: STRING website.
Files: 4932.protein.links.v11.0.txt, this table contains 2 columns of genes that exist in S. cerevisiae and a column of combined score. This score is the strength of interaction between 2 genes.
Features:
- Rank of gene- number of interactions that each gene creates with other genes. The rank of the gene is an important feature because genes that create a larger number of interactions are probably evolutionarily older and more conserved. This is because if a gene is related to more genes, then it tenda to develop fewer mutations. For these reasons there may be a high correlation between high rank and high levels of gene expression.
- $\textbf{Weighted}\ \textbf{rank}_{i}=\sum_{i}\frac{score_{i}}{(max\ score\ in\ the\ graph)*M}$
Where M is the rank of the gene, and the weight is the strength of interaction.
The weighted rank feature takes into account not only the number of interactions in which each gene is involved, but also the weight of each one of its edges and generates an average measure of the strength of interaction in which the gene is involved. Using this feature, we examined if there is a high correlation between the average interaction strength of a gene and its expression levels.
CellMap network
Source: TheCellMap website.
Files: SGA_ExE.txt, SGA_NxN.txt, SGA_ExN_NxE.txt tables.
This dataset includes three different genetic interaction maps organized in 3 different tables. In each network a query (gene) and array mutant strains are represented in the genetic interaction network along with their corresponding fitness estimates. The networks are different, as each contains a different type gene pairings: nonessential x nonessential (NxN) network, essential x nonessential (ExN) network and an essential x essential (ExE) network [2].
Each table contains 2 columns for the pair of mutant genes that create an interaction between them, a column for genetic interaction score which is the strength of positive or negative interaction between the two genes (see details below), additional columns of the fitness value of each gene, and a column of the fitness of their interaction.
Each genetic interaction between a pair of genes can be negative or positive. Negative genetic interactions describe double mutants that exhibit a lesser phenotype than expected based on the phenotypes of the corresponding single mutants. Conversely, positive genetic interactions describe double mutants exhibiting a larger phenotype [3].
After combining and sorting the 3 networks, we have collected information about 5707 genes in yeast.
Features:
- Rank of gene - number of interactions that each gene creates with other genes.
- $\textbf{Positive}\ \textbf{rank}_{i}=\sum_{i}\frac{positive\ score_{i}}{max\ positive\ score\ in\ the\ graph}$
- $\textbf{Negative}\ \textbf{rank}_{i} = \sum_{i}\frac{negative\ score_{i}}{|max\ negative\ score\ in\ the\ graph|}$Where i is the gene index. The summation is just on the negative scores i.e. negative interaction of gene.
- $\textbf{Weighted}\ \textbf{rank}\ \textbf{positive}_{i} = \sum_{i}\frac{positive\ score_{i}}{(max\ positive\ score\ in\ the\ graph)*M_{i}}$Where i is the gene index. In fact, this feature is the “positive rank” feature after normalization by M, which is the rank of the ith gene.
- $\textbf{Weighted}\ \textbf{rank}\ \textbf{negative}_{i} = \sum_{i}\frac{negative\ score_{i}}{|max\ negative\ score\ in\ the\ graph|*M_i}$In fact, this feature is the “negative rank” feature after normalization by M, which is the rank of the ith gene.
- $\textbf{Absolute}\ \textbf{weighted}\ \textbf{rank}_{i}= Weighted\ rank\ positive_{i}+|Weighted\ rank\ negative|_{i}$
- Single Mutant Fitness (Smf) - the average fitness per mutation strain.
- $\textbf{Weighted}\ \textbf{dmf}_{i} = \sum_{i}\frac{dmf_{i}}{(max\ dmf\ value\ in\ the\ graph)*M_i}$Where i is the gene index, and dmf score is the double mutant fitness of the interaction between two genes.
YeastMine network
Source: YeastMine
Files: Lists-> ALL_Verified_Uncharacterized_Dubious_ORFs->Interactions
5000 rows that depict how many interactions each yeast's protein is involved in. The data contains 3 columns –
- secondaryIdentifier gives information about the protein's name.
- name describes the protein's role. Some of the rows in this column have the value – none.
- Interactions- a numeric column that tells how many interactions a protein is involved in [4].
Features: We used the data to create a feature that depicts in how many interactions does a protein participate (a rank).
BioGRID network
Source: BioGRID
BioGRID is an interaction repository with data compiled through comprehensive curation efforts. Their data includes publications for over a million protein and genetic interactions, chemical associations and post-translational modifications from major model organism species [5].
Files: BIOGRID-ALL-3.5.184.tab3.zip
We used a MATLAB script to calculate some features from the yeast-yeast interactions’ data.
Features:
- Rank- how many interactions does each protein have with other proteins. We describe in the code how do we count the interactions and which interactions we consider as duplications. We noticed that sometimes we would have the same interaction multiple times and we decided to count it once. Moreover, we found out that sometimes an interaction is counted twice- once A and B are the interactors and once B and A, we removed these duplicates as well.
- Interaction type- How many of the protein’s interactions are 'genetic' and how many are 'physical', as seen in the column Experimental System Type.
- Experimental System Name – How many of the protein’s interactions have each one of the possible values in the column Experimental System Name in the original data. The experimental systems are listed and explained here.
- Interaction Throughput- How many of the protein’s interactions have each one of the possible values in the column Interaction Throughput (High, Low, High and Low).
Transcription factors
The process of transcription is the first stage of gene expression, resulting in the production of a primary RNA transcript from the DNA of a particular gene. Both basal transcription and its regulation are dependent upon specific protein factors known as transcription factors. These bind to specific DNA sequences in gene regulatory regions and control their transcription [2]. The number of transcription regulatory sites is an important feature because the transcription factors regulate the expression levels of the genes, and thus can help us to predict the intensity of fluorescence that we are interested in.
Source: YEASTRACT website
Files: TF_merged_table.csv, this is a table contains two columns: the first column contains the gene name, and the second column contains the transcription regulatory sites in each specific target gene in yeast.
Features: The number of transcription regulatory sites in each target gene in yeast.
Protein's folding disorder
Disordered proteins [7] follow unique sets of biophysical characteristics that are very distinct from that of well-structured globular proteins. At the primary structure level, Intrinsically Disordered Proteins (IDPs) are enriched by the presence of numerous uncompensated charged groups, resulting in low mean hydrophobicity and a high net charge at neutral pH . Disordered regions are found to be encoded mainly by polar and charged (specifically, alanine, glycine, arginine, glutamine, serine, glutamic acid, lysine, and proline) amino acids and are devoid of hydrophobic and aromatic amino acids. Due to the relatively higher rates of amino acid substitutions and fixation of insertion and deletions, disordered regions are known to evolve at significantly higher rates than ordered regions.
Files: In the work detailed in the Tuller et.al [7], 5500 files were created. Each file is a prediction result for one Saccharomyces Cerevisiae’s genes. In each file, there are per-residue prediction results from disorder prediction algorithms and ANCHOR disordered binding sites.
In every file, for each amino acid in the gene there is a prediction whether it is disordered or not using two approaches:
- Consensus approach #1- calculate the score per amino acid. This score method is called per-residue disorder score. The score is computed with 4 different algorithms:
IUpred, VLS2B, MoreRONN, DisEMBL. Each algorithm utilizes a different approach or architecture. These algorithms were chosen because they aren't based on homology and therefore are considered unbiased. In order to decide whether a position is ordered or not, the following rule was used:
If in a specific position we have score>0.5 for at least 3 out of 4 algorithms, the position is considered disordered. - IUPRED predicts disorder score corresponding to interactions between amino acids in the sequence. There are two types of IUPRED: IUpred-L- for long disordered sequences and IUpred-S- for short disordered sequences.
- VLS2B uses a neural network to predict a score that is a weighted average between 2 different algorithms – one is trained on short sequences and the other on long ones.
- MoreONN is also based on a neural network and is an adjusted calculation of an older method called RONN.
- DisEMBL-465, a neural network algorithm that predicts disorder according to interruptions named "missing coordinates". It does not give a score; it gives a result of "ORDERED" or "DISORDERED".
The following data is compiled into a file: - Codon- in which position on the protein are we.
- Position_amino_acid- the name of the codon and the amino acid it codes. For example, ATG_M.
- IUPRE- the score calculated from IUPRED algorithm (score>0.5 = disorder).
- VSL2B- the score calculated from VSL2B algorithm (score>0.5 = disorder).
- 14_Rem465—the decision from this algorithm whether the position is ordered or disordered.
- MoreRONN- the score calculated from MoreRONN algorithm (score>0.5 = disorder).
- Predicted_by_algorithms- how many algorithms out of the four mentioned above classified the position as disordered. (values between 0-4)
If 3 or 4 algorithms classified a position as disordered, then it is predicted to be disordered. - Consensus approach #2- In this approach, three newer algorithms were used:
- IUPred2A- this algorithm is trained on both short and long sequences, as opposed to the older algorithms.
- ESpritz- predicts disorder areas with a bidirectional and recursive neural network. This algorithm is considered state of the art. It has three variants, where each is trained on a different kind of interruption.
- SPOT-Disorder-Single- uses an ensemble of several neural networks to decide whether a position is disordered or not.
A position is considered disordered by a majority vote of these algorithms.
The following data is added:
- Consensus_3_algorithms- a position is defined as ordered or disordered by a majority vote. The score calculated for each algorithm is not returned.
- Dis_30- for each position, it is determined whether it is a part of a block of at least 30 amino acids that are disordered. If it is a part of a such a block, it is considered disordered, otherwise it is considered ordered.
It is known that proteins can communicate with each other and create interactions.
Anchor is used to identify the residues along the proteins that may be important for protein binding. Anchor predicts the elements within disordered regions who play crucial roles in molecular recognition and protein-protein interactions. To predicts such residues, it utilizes the same general principle as the IUPred disorder protein algorithm.
However, the basic difference between these algorithm is that while IUPred predicts all the potential disordered regions within a protein ,ANCHOR predicts the disordered regions which could act as potential protein-protein binding sites. Thus, the scores predicted by Anchor were suggested to be independent of the IUPred algorithm.
Similar to other methods, if the Anchor score is above 0.5, the position is considered disordered with a potential to interact with other proteins.
The following data is generated:
- The position of the amino acid (identical to column 1).
- The name of the amino acid (identical to column 2).
- The Anchor score.
- Anchor’s prediction - if in column 12 the score is above 0.5 than the position is considered disordered with potential to be a bonding site and the value will be 1. Otherwise, it will be considered ordered and the value will be 0 [7].
Features: This data has been processed by MATLAB. We computed the following features based on the file compiled:
- Gene_name- the gene's name as extracted from the file's name.
- Mean_score- the mean of all the numerical columns on all the positions (columns 3,4,6,12 in the original file).
- Disorder_percentage1- using consensus approach #1, we counted the number of disordered sites in the gene (3/4 in column 7) and divided by the number of positions.
- Disorder_percentage2- using consensus approach #2, we counted the number of disordered sites in the gene ("DISORDERED" in column 8) and divided by the number of positions.
- Disorder_percentage_both_algorithms- we counted the number of disordered sites in the gene where both consensus approaches determined the site as disordered ("DISORDERED" in column 8 and 3/4 value in column 7) and divided by the number of positions.
- Anchor_disorder_percentage- we counted the number of disordered sites in the gene (1 in column 13) and divided by the number of positions.
- Anchor_percentage- we averaged the Anchor score on all positions (column 12).
- IUPRED_percentage- we averaged the IUPRED score on all positions (column 3).
- VSL2B_percentage- we averaged the VSL2B score on all positions (column 4)
- DisEMBL_percentage- we counted the number of disordered sites in the gene ("DISORDERED" in column 5) and divided by the number of positions.
- MoreRONN_percentage- we averaged the MoreRONN score on all positions (column 6)
- Disorder30_percentage- we counted the number of disordered blocks of size 30 in the gene ("DIS_30" in column 9) and divided by the number of positions.
- Disorder_percentage1_window1-disorder_percentage1_window30- We calculated the disorder average across sliding windows. We took the windows starting in position 1 through 30, each of size 50. For example, window 17 is from position 17 until position 66. For each window, we counted the number of disorders according to consensus #1 and divided by 50, the window size. If the gene's length is smaller than 80, some windows aren't full, and these windows’ values will be NaN.
- Disorder_std1_window1-disorder_std1_window30- for each window in xiii, we calculated the standard deviation. If this is not possible due to incomplete window, the value assigned will be NaN.
- Disorder_percentage2_window1-disorder_percentage2_window30- same as described in xiii, using consensus #2. If not possible due to incomplete window, the value assigned will be NaN.
- Disorder_std2_window1-disorder_std2_window30- same as described in xiv using consensus #2. If not possible due to incomplete window, the value assigned will be NaN.
- Disorder30_percentage_window1- disorder30_percentage_window30- for the windows described in xiii, we counted the number of disordered blocks ("DIS_30") and divided by 50. If not possible due to incomplete window, the value assigned will be NaN.
- Disorder30_std_window1- disorder30_std_window30- for each window in xiii, we generated a binary vector, it is 1 when a block is disordered ("DIS_30") and 0 when ordered ("ORD_30"). Afterwards, we calculated the vector's std. If not possible due to incomplete window, the value assigned will be NaN.
- Disorder_anchor_percentage_window1 disorder_anchor_percentage_window30- for each window described on xiii, we calculated the Anchor percentage as described in vi. If not possible due to incomplete window, the value assigned will be NaN.
- Disorder_anchor_std_window1- disorder_anchor_std_window30- the Anchor's values are binary. Therefore, std was easily computed for each window. If not possible due to incomplete window, the value assigned will be NaN.
- Max_disorder_percentage_window1-
max_disorder_percentage_window30- for each window, we calculated the maximal score between all positions and all score columns in the original file (3,4,6,12). If not possible due to incomplete window, the value assigned will be NaN. - Disorder_percentage1_lastwindow, disorder_std1_lastwindow, disorder_percentage2_lastwindow, disorder_std2_lastwindow, disorder30_percentage_lastwindow,disorder30_std_lastwindow, disorder_anchor_percentage_lastwindow, disorder_anchor_std_lastwindow, max_disorder_percentage_lastwindow- the following were calculated in the same way described in xiii-xxi, but for the last 50 positions in the sequence.
Codon usage bias (CUB) features
The sequences analyzed for the calculation of codon usage bias features are only those that can be divided into codons, i.e. the length of the gene ORF is divisible by 3.
The motivation behind the generation of codon usage features is that they can indicate genes that are highly biased in terms of their codon usage, meaning they do not “choose” their codons randomly. Often, this means that those genes are regulated by the cell or evolution to maintain specific expression values or a unique function; genes that are highly biased are hypothesized to include preserved areas, as they are biased towards signals that promote (or inhibit) their translation or stability. Thus, they can be related to the evolution rate or stability of the gene.
Data for calculation of codon usage bias features
- mRNA levels in RPKM
Source: //www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE75897
The mRNA levels are calculated from the RNA-seq experiment (Ribo-Zero treated from log-phase S. cerevisiae).
RNA-seq is an experiment used for estimation of mRNA levels of specific genes in the organism. Following is the experiment’s procedure:
- Isolate the RNA from the organism and break the RNA into small fragments.
- Convert the RNA fragments into a double-stranded copy-DNA (ds-cDNA)
- Add sequencing adapters, that allow the sequencing machine to recognize the fragments. They allow us to sequence different fragments at the same time, since different samples can use different adapters.
- PCR amplification followed by high-throughput, short-read sequencing methods.
- Count the reads per gene (by aligning the fragments to the reference genome).
The units of the levels are RPKM (Reads Per Kilobase per Million mapped), meaning that for each gene:
$$(1) RPKM_{g}=\frac{r_{g}*10^9}{L_{g}*R}$$
Where rg is the number of reads per gene, Lg is the length of the encoding region (in codons), R is the total number of reads in the experiment (sum on all genes).
Note: This is the same feature as in the Shuldiner’s Loqate database. The reason it appears here again is for the calculation of the NTE index (see below).
- Protein abundance (PA)
Source: https://pax-db.org/dataset/4932/450/
The protein abundance for each gene is reported in the GPM database, 2012. It is used as a measure of expression level of the gene. The protein abundance of a gene is related to its transcription rate/mRNA levels, its translation rate, and the degradation rate of the corresponding mRNA molecules and proteins.
It is used as an individual feature, but also in the context of CUB. This feature is utilized for the extraction of the 2% most highly expressed genes’ sequences. These sequences are set as a reference genome for the calculation of CAI and RCA indices.
Measures of codon usage bias
- Effective Number of Codons (ENC)
The effective number of codons (ENC) [8] measures the deviation from uniform use of synonymous codons. The ENC is the number of codons that, if used equally, would generate the level of codon bias observed. This is a statistical measure that is independent of gene length and amino acid composition. Also, knowledge of the optimal codons or the “reference set” of highly expressed genes is not required.
ENC takes values from 20.0 (maximum bias) to 61.0 (no bias).
The relevant equations:
$$(2)ENC = 2+\frac{9}{F_{2}}+\frac{1}{F_{3}}+\frac{5}{F_{4}}+\frac{3}{F_{6}}$$
$$(3)F_{k}=\frac{1}{k} \sum_{j}^{k}H_{j}\ for\ k=2,3,4,6 (k is the degeneration degree)$$
$$(4)H_{j}=\frac{n_{j}\sum_{i=1}^{k}p_{i}^2-1}{n_{j}-1}$$
$$(5)p_{i}=\frac{n_{i}}{n_{j}}$$
Where:
- $F_{k}$ is the average homozygosity ($H_{j}$) of amino acids 'j' with degeneration degree k.
- $H_{j}$ is the homozygosity of amino acid 'j' and is derived from the square codon frequencies.
- $p_{i}$ is the relative frequency of the codon ‘i’ that encodes the amino acid ‘j’, e.g. the number of appearances of the codon, divided by the number of appearances of the amino acid that the codon encodes.
Features:
- ENC number.
- CAI index
Source: for the calculation of this index, we used the CAI package (python), written by Benjamin Lee [9], version 1.0.3. This package is available online, and it’s documentation is also available here. It is an implementation of Sharp and Li's 1987 formulation [10] of the codon adaptation index.
The codon adaptation index (CAI) for a gene estimates the extent of its bias towards codons that are known to be favored in a reference set of genes. These are typically highly expressed genes (whereas similarity indicates highly regulated genes) or the whole genome of the organism (whereas similarity indicates preserved signals). The index uses this set of genes to assess the weights of each codon, called the “relative adaptedness”. Then, a score for a gene is calculated from the usage frequency of all codons in that gene (geometric mean).
Relevant equations:
- A “relative adaptedness” value or the “weight” of the codon , is calculated from its relative frequency of use in a species-specific reference set of genes:
Where $x_{i}$ is the number of appearances of the codon in the reference set, and $x_{max}$ is the number of appearances of the most frequently used codon for an amino acid (from the synonymous codons). This is equal to the RSCU ratio, which is defined as the observed appearance of the codon divided by the expected appearance under equal use of synonymous codons.
In case a codon i is not observable in the reference set, we assign its count to 0.5, so that its weight is set to:
$$(7)w_{i}=\frac{0.5}{x_{max}}$$
This calculation ignores stop codons.
- From these weights we create a profile for the gene: a list of the weights for the sequence, not counting codons without synonyms. Also, the number of ATG and TGG codons are subtracted from L, since the RSCU values for AUG and UGG are both fixed at 1.0, and so do not contribute to the CAI.
$$(8)CAI=(\prod_{i=1}^{L}w_{i})^\frac{1}{L}$$Like the ENC, this is a statistical (non-biophysical) approach for CUB.
CAI takes values from 0.0 (no bias) to 1.0, where 1.0 indicates maximum bias, meaning the gene uses the codons of the reference set. The higher the CAI, the greater the similarity between the gene and the reference set.
Features:
- CAI index with the whole genome as a reference set.
- CAI profile’s standard deviation (STD) with the whole genome as a reference set.
- CAI index of the first 50 codons of the gene with the whole genome as a reference set.
- CAI index of the last 50 codons of the gene with the whole genome as a reference set.
- CAI index with top 2% highly expressed gene (HEG) as a reference set.
- CAI profile’s standard deviation (STD) with the HEG as a reference set.
- CAI index of the first 50 codons of the gene with the HEG as a reference set.
- CAI index of the last 50 codons of the gene with the HEG as a reference set.
The motivation behind calculating the index in the first and last 50 codons: The efficiency of translation is universally lower across the first ∼50 codons. Additionally, in eukaryotes, the last ∼50 codons show the highest efficiency over the full coding sequence [11].
We also calculated the standard deviation of the profile. The STD can be used as a measure of noise in the signal. We do not expect a uniform distribution of the codon usage along a sequence. On the contrary, highly expressed genes have a higher selection for a specific usage of codons that promotes translation efficiency. Under this assumption, one can say that highly expressed genes may present a higher STD (a more noisy signal) in a given usage index.
- tAI (tRNA adaptation index):
The tAI is a biophysical measure of codon usage bias. The assumption behind this measure is that highly expressed genes are more adapted to the tRNA pool, i.e., abundance and availability of the tRNA types that recognize these genes’ codons. Thus, codons with available tRNA are considered more optimal: More codon-recognizing tRNAs in the cell result in faster translation elongation for the codon.
This index uses estimation of tRNA pools to assess the weights of each codon (relative adaptedness to the pool). A score for a gene is calculated from the usage frequency of all codons in that gene (geometric mean).
Relevant equations:
- Estimate the tRNA pool:
Measuring tRNA levels is currently still a challenge, and measurements are only available for several organisms. However, tRNA expression is correlated to the number of copies in the genome of each tRNA gene. So, for each tRNA in a genome we have the tRNA Gene Copy Number (tGCN) from http://gtrnadb.ucsc.edu/genomes/eukaryota/Scere3/. We did not extract the tCGN from this source directly, because it is already available in the Supplemental Information of Tuller et.al. 2010 article [11], along with the tAI weight of each codon. We checked that the values matched the above source, and the sources are in agreement.
Figure 1. An example of tRNA Gene Copy Number (tGCN) as appear in the Genomic tRNA database. - Assign each codon a score (weight):
For each codon, consider all tRNAs that recognize the codon (through Watson-Crick or wobble interactions):
$$(9)W_{i}=\sum_{j=1}^{n}(1-s_{ij})tGCN_{j}$$Where n is the number of tRNA molecules that recognize the codon i, and S-values denote the effectivity of the tRNA(j) to codon(i) interaction.
The S values used for S. cerevisiae:
Figure 2. The effectivity scores of tRNA interactions with matching codons, based on Watson Crick and Wobble interactions for S. cerevisiae. Finally, the normalized weights with respect to the most optimal codon out of all 61 codons:
$$(10)w_{i}=\frac{W_{i}}{max(W_{i})}$$
As mentioned above, the tAI weights for each codon are available in the Tuller et.al. 2010 article [11], so we did not calculate them ourselves. We tested several to see that they correspond to the equation.
- From these weights, we create a profile for each gene: a list of the weights for the sequence, with L being the sequence length in codons:
tAI takes values from 0.0 to 1.0, where 1.0 indicates maximum elongation efficiency in terms of codon-anticodon interactions, meaning the gene uses the codons that are most adapted to the tRNA pool.
Features:
- tAI index.
- tAI profile’s standard deviation (STD).
- tAI index of the first 50 codons.
- tAI index of the last 50 codons.
- NTE (normalized translation efficiency)
The NTE measure [12] is an improvement of the previous approach for estimating codon optimality. It considers not only the supply of tRNA but also the demand, i.e. the number of codons and mRNA copies that need to be translated by the same tRNA.
An optimal codon is characterized with supply/demand ratio greater than 1.
The index supply/demand ratio is used to assess the weights of each codon (translational efficiency). Then, a score for each gene is calculated from the usage frequency of all codons in that gene (geometric mean).
Formulas:
- Compute the normalized tAI weight for each codon as an estimation of the supply of tRNA.
- Compute the demand for each codon:
Given all transcript sequences (named the “`transcriptome”), and their mRNA levels, the total codon count across all the transcriptome is:
$$(12)U_{i}=\sum_{g}n_{g,i}*m_{g}$$
Where $n_{g,i}$ is the number of appearances of codon ‘i’ in gene ‘g’, and $m_{g}$ is the mRNA level of the gene (copies of the gene transcript).
The normalized value is:
$$(13)cu_{i}=\frac{U_{i}}{U_{max}}$$- The final weight of the codon is the supply to demand ratio:
This value is normalized:
$$(15)c=\frac{W_{i}}{W_{max}}$$- The NTE index of a gene:
The division of the individual translational efficiencies $W_{i}$ by the maximum translational efficiency $W_{max}$ linearly rescales all translational efficiencies so that the maximal value is 1. NTE takes values from 0.0 to 1.0, where 1.0 indicates maximum translational efficiency, meaning the gene uses the codons that are most adapted to the tRNA pool and the demand.
Features:
- NTE index.
- NTE profile’s standard deviation (STD).
- NTE index of the first 50 codons.
- NTE index of the last 50 codons.
- RCBS (Strength of Relative Codon Bias)
The RCBS [13] calculates codon usage bias with respect to mutation bias, assuming the nucleotide bias induces the bias in the codon distribution.
Relevant equations:
- For each codon in a specific gene we calculate the RCBS weight, which is defined as:
where f(x,y,z) is the normalized frequency of the codon xyz (number of appearances divided by length of the sequence). $f_{i}(x)$ is the normalized frequency of the nucleotide x in the i’th position within the codons in the sequence. It should be noted that the weight of a codon is not calculated from a reference set of genes, rather it is unique to each specific gene.
- From these weights (which are unique for the gene), we create a profile of $d_{xyz}+1$, calculate the geometric mean, and subtract 1. This is the RCBS index of the sequence.
Intuitively, codons deviating from what is expected based on the distribution of their individual base frequencies will contribute to larger RCBS values, whereas codons following the expected distribution will minimize the index.
In most organisms there is a correlation between tAI and RCBS, though RCBS is a statistical measure. The results of this new index when applied to E. Coli and S. Cerevisiae led the authors of the original publications to conclude that natural selection favors higher expression and enhanced codon usage optimization in short genes.
Features:
- RCBS
- Relative Codon Adaptation (RCA)
This measure is similar to RCBS, only it uses a reference set to calculate the weights for each codon. In Fox and Erril's 2010 [14], the researchers showed that the positive results reported by Roymondal et al. are based on built-in biases within the small data sets used for validation. Furthermore, they demonstrated that the conclusion that selection favors codon optimization in short genes is erroneous; it stems from systematically ignoring an intrinsic bias in their method due to undersampling, which is implicit in short sequences.
Owing to its reliance on the query sequence alone, the RCBS index has a strong bias towards short sequences. This length dependency prevents its application to sequences shorter than 1000 bp. The length dependency of RCBS can be corrected by using a reference set and calculating the relative codon adaptation as following:
- The weight for each codon is calculated from the reference set and is defined as:
where $\chi_(xyz)$ is the observed frequency of codon xyz in any particular reference gene set, and $\chi_{n}(m)$ the observed frequency of base m at codon position n in the same reference set.
- Like CAI and RCBS, RCA is computed as the geometric mean of the RCAxyz term for each codon xyz in the sequence of interest, where L is the length in codons of the query sequence.
Features:
- RCA index with the whole genome as a reference set.
- RCA profile’s standard deviation (STD) with the whole genome as a reference set.
- RCA index of the first 50 codons of the gene with the whole genome as a reference set.
- RCA index of the last 50 codons of the gene with the whole genome as a reference set.
- RCA index with top 2% highly expressed gene (HEG) as a reference set.
- RCA profile’s standard deviation (STD) with the HEG as a reference set.
- RCA index of the first 50 codons of the gene with the HEG as a reference set.
- RCA index of the last 50 codons of the gene with the HEG as a reference set.
- Sliding windows' features
We used 100 sliding windows with a window size of 48 nucleotides and a slide of 1 nucleotide, and calculated for each window the following features:
- CAI with the whole genome as a reference set.
- CAI with the HEG as a reference set.
- tAI
- NTE
- RCBS
- RCA with the whole genome as a reference set.
- RCA with the HEG as a reference set.
This way we generated additional 700 features for each gene. For genes shorter than 150 nucleotides, the window value is set at NaN. The window size is chosen to be 48 nucleotides because it usually contains signals related to translation efficiency (such as ramps or mRNA folding) in the start of the sequence [11] but also in the whole sequence [15].
- FOP
We took this feature from the SGD website. This index measures the frequency of optimal codons (FOP) in a gene. It is a species-specific measure of bias towards certain codons that appear to be translationally optimal in a species.
Target gene features - GFP based
The features we generate are to become part of the recommendation engine, that outputs the recommended gene to be interlocked with a given target gene, by predicting the stability of the mutual construct.
Thus, some features must take into account the interactions between the target gene and the analyzed genes (all ORFs in yeast).
We decided to generate the CAI and RCA features with the target gene as a reference. These features will estimate for each gene its extent of bias toward codons (CAI) or nucleotides (RCA) that are known to be favored in the target gene.
The calculations are the same as above, with the reference sequence being the target gene.
Using another data source, we calculated the Chimera ARS score for the target gene [16]. The score estimates the suitability of the target gene for the desired gene, by finding the longest common sub-strings.
We also calculated distances features between the target gene and each of the yeast genes. All features described below are calculated based on profiles we created for each subcategory. For each profile, we applied several distance functions between it and the target gene’s profile. The main advantage in these calculations is that the comparisons are local and can help find regions of similarity between the target gene and the essential gene. Similarity in specific regions may indicate similarity in the regulatory systems of both genes and similarity in the formation of mutations, thus indicating better stability for the combined genes.
- Codon relative frequency
- Euclidian distance
- L1 distance
- Spearman correlation
- Pearson correlation
- KS distance
- Cartesian product
- Amino acid frequency
- Euclidian distance
- L1 distance
- Spearman correlation
- Pearson correlation
- KS distance
- Cartesian product
- GC content profiles
- String PPI
- mRNA folding energy
For each gene, we calculated the relative frequency of each codon – the number of appearances of a codon divided by the number of appearances of the amino acid that the codon encodes. The result is a vector of length 64 that contains each relative frequency.
This vector was also calculated for the target gene.
The features that were calculated based on the target gene’s vector and each yeast gene’s vector:
For each gene, we calculated the relative frequency of each amino acid – the number of appearances of an amino acid divided by the length of the amino-acid sequence.
This vector was also calculated for the target gene.
The features that were calculated based on the target gene’s vector and each yeast gene’s vector:
We created a GC content profile based on 10 windows of 30 nucleotides (nt) in a 300 nt window (the first 300 nt). For each sub-window, the GC content was calculated.
As for the target gene – the same calculation was done using the 300 nt from the end of the sequence.
The features that were calculated based on the target gene’s vector and each yeast gene’s vector:
- Euclidian distance
- L1 distance
- Spearman correlation
- Pearson correlation
- KS distance
- Cartesian product
* for genes with a sequence length less than 300, the rest of the profile vector was filled with the average of the existing GC content in windows of 30 nt. For example – for a gene with sequence length of 100 nt – a 1X3 vector was created which contains 3 GC content results of 3 windows of 30 nt. The rest of the vector was filled with the average of this 1X3 vector in order to create a 1X10 vector.
We used the string PPI in order to create for each gene a vector containing its 5 strongest connections (5 highest scores from the string PPI).
For each gene we calculated the mean codon relative frequency, the mean amino acid frequency and the mean GC content profile of its strongest connections and then calculated the following distances between these vectors and the target gene’s vectors of codons relative frequency and amino acid frequency:
- Euclidian distance
- L1 distance
- Spearman correlation
- Pearson correlation
- KS distance
- Cartesian product
* for genes with no connections the distance is NaN.
* if a gene has less than 5 connections, then the mean is calculated based on the connections it has available.
We created mRNA folding energy profiles using the function RNAfold from the package ViennaRNA. The profiles were calculated for each essential gene, based on 12 windows of 40 nucleotides (nt) with a 10 nt shift, resulting in a total window of the first 150 nt window. For each sub-window, the folding energy was calculated.
The target gene had the same calculation using the 150 nt from the end of the sequence.
The features that were calculated based on the target gene’s vector and each conjugated gene’s vector:
- Euclidian distance
- L1 distance
- Spearman correlation
- Pearson correlation
- KS distance
- Cartesian product
* for genes with a sequence length less than 150, the rest of the profile vector was filled with the mean on the existing folding energy value. For example – for a gene with sequence length of 100 nt, a vector was created containing 7 folding energy values resulting from 7 windows of 40 nt with a 10 nt shift each time. The rest of the vector was filled with the average of this 1X7 vector in order to create a 1X12 vector.
These features were defined under the following assumption – if the target gene and an essential gene exhibit similar profiles (small distances), then they are functionally or structurally similar. Thus, they could create more stable constructs
We first generated these features for GFP, which is used for developing and training the model. Then, the software will calculate these genes again for a different target gene and use transfer learning to adapt the model to the changed features’ values. Further explanation of this technique appears in the “stability predictor” section.
Chemical properties features
The following features were taken from SGD website.
- Aliphatic index- The aliphatic index of a protein is defined as the relative volume occupied by aliphatic side chains (alanine, valine, isoleucine, and leucine). Aliphatic index plays a role in protein thermal stability. Proteins with a high Aliphatic index are more thermally stable. Aliphatic amino acids are also hydrophobic in nature.
- Gravy score- The GRAVY number of a protein is a measure of its hydrophobicity or hydrophilicity. The two measures are combined in a hydropathy scale or hydropathy index. The more positive the value, the more hydrophobic are the amino acids located in that region of the protein. These scales are commonly used to predict the transmembrane alpha helices of membrane proteins.
Local folding energy (lfe) for mRNA
Folding of mRNA affects the efficiency of gene expression. The stronger the fold, the more energy must be expended to release the fold and it will be more difficult to perform the translation process. This can impair the level of expression of the gene.
We used data from [17]. The data contains 3 tables with information for 5797 yeast’s genes. We used each of the tables as a feature:
- Table 1- native local folding energy (empirical results).
- Table 2- shuffled lfe (based on a random model).
- Table 3- dlfe, the difference between the results in table 1 and table 2.
- In addition, we calculated the mean value per gene for each table.
The calculation for the first two tables was made in 40 nucleotides (nt) windows with 10 nt shifts for the first 300 nt.
Co-expression MRNA network
We first created a calculator, which derives a correlation score between each pair of genes according to their mRNA levels from different databases. Using it, we calculate the weighted rank feature for each gene.
The matrix was taken from [18].
The weighted rank is computed according to the following equation:
$$(21)Weighted\ rank=\frac{sum(scores)}{(max\ score\ in\ graph)*(\# of\ scores\ per\ gene)}$$The higher the rank, the more connections the gene has and is therefore likely to be significant [18].
ATG context score
We used a table from [19] which contains ATG context score for 5861 genes.
The score was calculated for the main ATG, for each alternative ATG in the 150 nucleotides of the untranslated region on the 5’ side (5UTR) and for each alternative ATG in the 150 nucleotides of the open reading frame (ORF).
For each alternative ATG, an absolute score and a relative score (absolute score divided by the main ATG score) were both calculated.
The features calculated:
- The number of alternative ATG in the following windows (default value=0):
- ORF
- 5UTR
- Total
- 30 codons in the ORF
- 200 codons in the ORF
- Main ATG context score.
- Mean and maximum values for the scores in the following windows (default value=10 orders of magnitude less than the worst score):
- 30 codons in the ORF
Alternative ATG can affect translation and impair it.
These features were calculated for both absolute and relative scores.
* The scores are all in [19].
Number of shifted ORFs in the coding sequence
sORFs are shifted Open Reading Frames beginning at alternative ATGs downstream in the ORF from the main START codon and terminating with a stop codon.
Shifted ORFs can impair the translation.
The features we calculated using the gene's sequence:
- Number of shifted ORFs
- Number of shifted ORFs 30 codons
- Number of shifted ORFs 200 codons
- Max length
- Max length 30 codons
- Max length 200 codons
- Mean length
- Mean length 30 codons
- Mean length 200 codons
HiC
A matrix of 'smoothed Hi-C graph distance between S. cerevisiae genes' (6664 genes) that was taken from [18].
HiC data calculation involves several experimental steps. Therefore, the matrix contained some genes for which there was no information (each distance was zero). We didn't take these genes into account while calculating the features [20].
We took 5% of the connections, each connection is a distance value for each pair of genes.
we calculated for each gene:
- Rank - number of connections
- Mean value of distances
- Median value of distances
- Min value of distances
- Mean mRNA levels of each gene and its connected genes
Conservation features
Conservation indicates that a sequence has been preserved by natural selection. It means that a highly conserved sequence is one that has remained relatively unchanged far back up the phylogenetic tree. In this project we examined conservation in amino acid sequences of orthologous genes of S. cerevisiae.
Amino acid sequences can be conserved to maintain the structure and function of a protein or domain. Conserved proteins undergo fewer amino acid replacements or are more likely to substitute amino acids with those of similar biochemical properties. Within a sequence, amino acids that are important for folding, structural stability, or that form a binding site may be more highly conserved.
We used Multiple Sequence Alignment (MSA) in order to visualize conserved sequences and estimate them.
First part of the project
Input / data files:
- Clw files - This kind of file contains the MSAs of genes in cerevisiae and their orthologous genes from up to 5 Orthologous Groups (OGs) ( Saccharomycetaceae, Saccharomycetes, Ascomycota, Fungi, Eukaryota) in ClustalW format.
OGs for S. cerevisiae genes were obtained from OrthoDB [21]. Amino-acid MSA for each OG was performed using Muscle with the option - maxiters 4 [22]. The coding sequences for each genome were obtained from Ensembl FTP. For each amino acid position in each S. cerevisiae protein, we examined the MSAs for all OGs which include it and analyzed the codons in that position.
Output / results file: entropy features gene names converted.csv. This table contains 417 columns: the first column is the gene name, while the other 416 columns are extracted features- see details below. In the output file there are 4691 genes of S. cerevisiae.
Features
The first kind of feature was a calculation of the mean and median of the percentage of indels f(gaps) in each position (column) within a multiple sequence alignment. The second kind of feature was a calculation of entropy measures. The Shannon Entropy (SE) score was calculated for the amino acid distribution of each column within a MSA, since it was the first and the most commonly used method for estimating conservation [23]. The entropy score is important for conservation prediction because it is directly proportional to the rate of conservation of the different positions in the gene, i.e. if positions in the MSA have higher entropy it signifies that they are more conserved.
Shannon entropy is defined as follows:
$$(22)SE(i)=\sum_{i=1}^{N}p_{i}\cdot log_{2}p_{i}$$where $p_{i}$ is the probability of a given amino acid and N is the number of possible amino acids in a specific position (i.e. column) of the MSA [24]. The entropy range lies between 0 and $\log2N$. If the SE score in a specific position of the MSA is very low (close to zero), it means that the distribution of the amino acids is uniform, namely this position is not conserved. In contrast, if the SE in a specific position of the MSA is very high, then that position is very conserved and apparently has an important impact on the proper functioning of the protein.
The calculation of the SE score (the features) was performed using two different methods: the first time we omitted the indels from the SE calculation, i.e. we took into account the 20 amino acids only, while in second time we recalculated the SE score including indels as one of the possible options, i.e. we considered 21 options overall in this SE calculations.
In each of the different methods for SE calculation, the mean and median values of SE were calculated based on the entropy values of all the gene's sequence. In addition, we calculated the SE in the first and the last 100 sliding windows of the MSA, where each window contains 30 codons / positions. Afterwards, we added more features that calculate the indices of the first window with a minimal and maximal entropy score, based on the values of the SE score in the first and the last 100 windows within the MSA.
The second part of project
Input / data files: 5310 csv files based on the MSA files (see data files of the first part of the conservation project). Each file contains the frequency of each of the amino acids that make up the gene and the frequency of each of the nucleotides in the codon. In addition, the maximal value found in any of the OGs in a specific position was recorded for each amino acid and for each nucleotide position in the codon.
Features
The calculations of the genetic preservation measures (the features) were performed on two sets of creatures: one set is based on data from all creatures that have orthological genes for S. cerevisiae, and the other set contains creatures that are evolutionarily closest to S. cerevisiae.
In each of the sets, the geometric mean and median values of the frequencies were calculated for both amino acids and nucleotides. Each time we focused on different areas of the sequence for each gene: the beginning, the end and the whole sequence of the gene. The beginning and end of the gene were defined as the first or last 50 amino acids respectively [25], [26].
The geometric mean was calculated as follows:
$$(23)geometric\ mean=(\prod_{k=1}^{L}C_{k})^\frac{1}{L}=exp(\frac{1}{L}\sum_{k=1}^{L}ln(C_{k}))$$
where L is the length of the amino acid sequence, and $C_{k}$ is frequncy for the k'th amino acid.
Additional features
- Instability index- The Instability index is a measure of proteins, used to determine whether a gene will be stable in a test tube. It will probably bbe stable for values less than 40.
- Features from 10 genes project- We included all the features from the Ten Genes project. Originally, we focused only on essential genes. We applied these features to all 6,000 genes, except for a few individual features like dn/ds and CAI as they had many missing values.
We recalculated the CAI feature values for all the genes. Instead of the dn/ds feature we created several evolutionary conservation features, as we know this type of feature is important for our predictor.
In addition, we used a more up-to-date version of the PPI network from the STRING website to create the protein rank feature. We also added some new features based on this data (for more details see project "Extracting Features from PPI networks And Graphs" *)
It should be noted that both the fluorescence intensity and the protein location were in the 10 genes project as features, while in the new model they functioned as the predictor labels.
References
[1] Wang, Y., Sun, H., Du, W., Blanzieri, E., Viero, G., Xu, Y., & Liang, Y. (2014). Identification of essential proteins based on ranking edge-weights in protein-protein interaction networks. PloS one, 9(9), e108716. http://doi.org/10.1371/journal.pone.0108716
[2] Latchman, D. S. (1993). Transcription factors: an overview Function of transcription factors. In Int. J. Exp. Path (Vol. 74).
[3] STRING: functional protein association networks. (n.d.). http://string-db.org/cgi/input.pl?sessionId=erLCrPhS5PXf&input_page_show_search=on
[4] Cherry JM, Hong EL, Amundsen C, Balakrishnan R, Binkley G, Chan ET, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hitz BC, Karra K, Krieger CJ, Miyasato SR, Nash RS, Park J, Skrzypek MS, Simison M, Weng S, Wong ED (2012) Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. Jan;40(Database issue):D700-5. [http://doi.org/10.1093/nar/gky1079
[6] [Usaj, M., Tan, Y., Wang, W., VanderSluis, B., Zou, A., Myers, C. L., Costanzo, M., Andrews, B., & Boone, C. (2017). TheCellMap.org: A web-accessible database for visualizing and mining the global yeast genetic interaction network. G3: Genes, Genomes, Genetics, 7(5), 1539–1549. http://doi.org/10.1534/g3.117.040220
[7] Arup Panda, Tamir Tuller (2020), Exploring Potential Signals of Selection for Disordered Residues in Naturally Occurring Prokaryotic and Eukaryotic Proteins, bioRxiv 2020.03.10.979443; http://doi.org/10.1101/2020.03.10.979443
[8] Wright, F. (1990). The ‘effective number of codons’ used in a gene. Gene, 87(1), 23-29. link.
[9] Lee, B. D. (2018). Python implementation of codon adaptation index. Journal of Open Source Software, 3(30), 905. link.
[10] Sharp, P. M., & Li, W. H. (1987). The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic acids research, 15(3), 1281-1295. link.
[11] Tuller, T., Carmi, A., Vestsigian, K., Navon, S., Dorfan, Y., Zaborske, J., ... & Pilpel, Y. (2010). An evolutionarily conserved mechanism for controlling the efficiency of protein translation. Cell, 141(2), 344-354. link.
[12] Pechmann, S., & Frydman, J. (2013). Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding. Nature structural & molecular biology, 20(2), 237. Link.
[13] Roymondal, U., Das, S., & Sahoo, S. (2009). Predicting gene expression level from relative codon usage bias: an application to Escherichia coli genome. DNA research, 16(1), 13-30. link.
[14] Fox, J. M., & Erill, I. (2010). Relative codon adaptation: a generic codon bias index for prediction of gene expression. DNA research, 17(3), 185-196. link.
[15] Zur, H., & Tuller, T. (2013). Transcript features alone enable accurate prediction and understanding of gene expression in S. cerevisiae. BMC bioinformatics, 14 Suppl 15(Suppl 15), S1. BioMed Central. link.
[16] Hadas Zur and Tamir Tuller (2014) Exploiting hidden information interleaved in the redundancy of the genetic code without prior knowledge. Bioinformatics, Doi: 10.1093/bioinformatics/btu797. PDF
[17] Michael Peeri and Tamir Tuller (2020) High-resolution modeling of the selection on local mRNA folding strength in coding sequences across the tree of life. Genome Biology, DOI: 10.1186/s13059-020-01971-y. PDF
[18] Alon Diament and Tamir Tuller (2017) Tracking the evolution of 3D gene organization demonstrates its connection to phenotypic divergence. Nucleic Acids Research, DOI: 10.1093/nar/gkx205. PDF
[19] Tuval Ben-Yehezkel, Hadas Zur, Tzipy Marx, Ehud Shapiro, Tamir Tulller (2013) Mapping the translation initiation landscape of an S. cerevisiae gene using fluorescent proteins. Genomics, DOI: 10.1016/j.ygeno.2013.05.003. PDF
[20] Ofir Hakim and Tom Misteli (2019) SnapShot: Chromosome Conformation Ca
[21] E. M. Zdobnov et al., “OrthoDB v9.1: Cataloging evolutionary and functional annotations for animal, fungal, plant, archaeal, bacterial and viral orthologs,” Nucleic Acids Research, vol. 45, no. D1, pp. D744–D749, Jan. 2017, doi: 10.1093/nar/gkw1119.
[22] R. C. Edgar, “MUSCLE: Multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Research, vol. 32, no. 5, pp. 1792–1797, 2004, doi: 10.1093/nar/gkh340.
[23] Durbin and Richard, Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
[24] S. Sen, A. Dey, S. Chowdhury, U. Maulik, and K. Chattopadhyay, “Understanding the evolutionary trend of intrinsically structural disorders in cancer relevant proteins as probed by Shannon entropy scoring and structure network analysis,” BMC Bioinformatics, vol. 19, no. S13, p. 549, Feb. 2019, doi: 10.1186/s12859-018-2552-0.
[25] T. Tuller et al., “An evolutionarily conserved mechanism for controlling the efficiency of protein translation,” Cell, vol. 141, no. 2, pp. 344–354, Apr. 2010, doi: 10.1016/j.cell.2010.03.031.
[26] S. Karlin, J. Mrázek, and A. M. Campbell, “Codon usages in different gene classes of the Escherichia coli genome,” Molecular Microbiology, vol. 29, no. 6, pp. 1341–1355, Sep. 1998, doi: 10.1046/j.1365-2958.1998.01008.x.re. Doi: 10.1016/j.cell.2012.02.019. PDF
Feature Selection
Motivation:
The input data for machine learning (ML) algorithms consists of instances, which are samples of specific population (in our case, these instances are equivalent to the genes in yeast), and features, which are a set of characteristics that describe the instance. Features can be continuous or categorical. In any case, one converts these features to be numerical, as they are more convenient to use and adapt to existing ML methods. Another important input is the label of each instance, which is the property you wish to predict - fluorescence levels, stability, or anything that you have the data for.
Basically, each instance or gene is represented as a single dot in what is called the “feature space”. This space’s dimension is equal to the number of existing features. For example, you could describe a gene by its length and GC content, so these two features form a two- dimensional feature space. Understanding your genes and exploring additional datasets is the best way to generate features, and thus reaching a more optimal predictor for your labels.
However, when one tries to use too many features, he may encounter several problems. First, each feature we add creates another axis and therefore increases the computation and memory required for our algorithm exponentially. A bigger problem is that each redundant feature also inserts noise, and increases the chance of overfitting. In overfitting, the model that we are developing begins to describe the random error in the data rather than the actual relationships between variables.
Let us visualize this phenomenon. Suppose we have data over time, and we introduce random noise. The model could give us the following output within the underfit/overfit paradigm:
The random error inherent in the data causes the data points to fall randomly around the good fit line. The over-fitted model is too complex, and it attempts to explain the random error present in the data.
The major cause for overfitting is “the curse of dimensionality” –it implies that one could create hyperplanes to separate the data in ways which might not encapsulate intrinsic structures of the data. To reduce the risk of overfitting and computational complexity, we need to filter the features. Following our conversation with Prof. Zvika Shinar, we realized that our model probably performed overfitting, and concluded that a good way to start dealing with it is to:
- Decrease the number of missing values for each instance (gene).
- Choose approximately 300 features from our entire set of 1000+ features. The feature selector we implemented filters features that do not contribute to the prediction, based on a correlation-based measure that is explained in detail below.
It should be noted that this is only a pre-processing step, and we did apply an active feature selection following the training of the predictor, as described in the next tab.
Part A: Dealing with Missing Values
The input data for ML algorithms is often incomplete, because some features cannot be computed for all instances. There are many strategies to impute missing values, each affecting the performance of the predictor differently.
A popular approach for data imputation is to calculate a statistical value for each column (such as a mean) and replace all missing values for that column with this statistic. It is a popular approach because it is computationally easy, utilizes well the training set, and because it often results in satisfactory performance.
However, when you have too many missing values, this approach may cause the model to cluster instances that express similar missing values, even though they are not necessarily related. A better approach would be to understand why each value is missing and fill the missing values accordingly. This is not applicable when you have many features. Another option is to detect the instances and features that have above a threshold of missing values and eliminate them out of the analysis - not blindly of course, but in a controlled process.
In our model, the genes are the instances, and for each gene we generated 1000+ features as described in detail in the former section. By examining the table that contained all the features, we discovered that there is a certain group of genes that expressed many missing values, with systemic names starting with “Q” or “R”. According to the SGD, these are noncoding genes. The “Q” group often appears as a dubious ORF in the SGD website; thus, it makes sense that such genes are less characterized. ORFs encoded by mitochondrial DNA receive a systematic name beginning with the letter Q and a historically defined gene name (e.g., Q0045 defines the coding sequence for the COX1 gene that encodes subunit I of cytochrome c oxidase). Systematic names for the four annotated features of the endogenous two-micron plasmid of yeast originate with the letter R (e.g., R0020C encodes the Rep1p transcriptional regulator of the FLP1 recombinase gene) [27].
Since neither of these groups encode proteins, we decided that it would improve our predictor if we filter them from the analysis. Another filtered group is the paralogs genes, that have an extra copy in the genome. When interlocking a target gene to a paralog gene, the host cell will not try its best to preserve them because there is a duplication. Thus, these genes are irrelevant for our end goal and eliminated from the analysis. For further filtering, we set a threshold such that genes with more than 5% missing features are filtered.
Next, we evaluated the number of missing values per feature and extracted a list of the “highest nans” - the genes with the highest number of missing values. Consequently, we decided to remove features with 1000 missing values or more. This threshold was chosen because it preserved important features such as mRNA abundance, as it had around 800 missing values but was still an essential feature.
After this analysis we remained with a total of 5609 genes and 1460 features.
Part B: Characteristics of the SFS & CFS Feature Selection Algorithm
Most feature selection algorithms perform a search through the “space of feature subsets”, defined as all the possible groups of features for selection.
These algorithms choose a subgroup that is evaluated as “best” (in case of optimal solution) or “good enough” (in case of a heuristic solution). As such, they address four basic principles:
- Starting point. Selecting a point in the feature subset space from which to begin. Since our evaluation strategy is correlation-based, we decided to choose the first feature as the one with the highest correlation to the labels (fluorescence levels and localization).
There are two common kinds of correlation coefficients: Pearson, that studies linear relations between given variables, and Spearman, that indicates monotone relations between variables. We used Spearman correlation because we do not necessarily know that the labels and the features form linear connections, and if they do, Spearman will catch that too. - Search Strategy. With N initial features, there are $2^{N}$ possible subsets (as seen in figure 4), so it is impractical to search the whole space exhaustively unless N is small, which is hardly the case for our problem. Heuristic search strategies are more feasible and can give good results, although they do not guarantee finding the optimal subset. One such strategy is Sequential Forward Selection (SFS), also known as forward selection. In each iteration, the algorithm considers only subsets that include the current subset and adds features that improve the current state. It starts from the first feature defined above, adds one more feature that improves the evaluation measure, and repeats this process until the stop criterion is met.
- Evaluation strategy.
paradigms: Two main paradigms for evaluation are called “Filter” and “Wrapper”, and they are demonstrated in figure 5. - Stopping criterion. A feature selector must decide when to stop searching through the space of feature subsets. Our evaluation strategy requires an arbitrary cutoff on the number of features to be selected. We chose this threshold to be 300 features for now, acknowledging that we may continue with the feature selection depending on the chosen model and its performance.
The Filter method operates independent of any learning algorithm - undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets [29]. The main advantages of the Filter method are fast execution and generality. The latter occurs due to filters evaluating the intrinsic properties of the data, rather than their interaction with a specific predictor. Thus, the solution will perform well on a larger family of predictors, contrary to the Wrapper method. This is actually the main reason behind our use of the Filter method - we would like to try many models for the predictions, and we don’t know in advance which one of them will work best. Thus, we prefer to filter features independently.
In contrast, the Wrapper method is dependent on the predictor, as it chooses the features based on the performance of the model with those features. This is usually implemented with a re-sampling technique such as cross validation. Wrappers usually achieve better recognition rates than filters, since they are tuned to the specific interactions between the predictor model and the dataset. However, as mentioned before, the purpose of our feature selection process is to reduce the number of features without considering a specific predictor. This is due to our approach - try many models until we get satisfying results, and thus we need a mechanism that will be robust for all the predictors we test. Therefore, we chose the Filter approach.
Evaluation measure (merit): The evaluation measure helps us score sets of features. We specify a merit, or an evaluation function, for each feature which indicates its contribution to the current subset. The feature that provides the highest merit is chosen. There are many ways to define this merit - mutual information, RELIEF etc., and we chose to use the CFS (correlation-based feature selector) measure [29].
CFS is a simple filter algorithm that ranks feature subsets according to a correlation-based heuristic evaluation function:
$$(24)M_{s}=\frac{k\underline{r}_{cf}}{\sqrt{k+k(k-1)\underline{r}_{ff}}}$$Where $M_{s}$ is the heuristic merit of a feature subset S containing k features, $\underline{r}_{cf}$ is the mean feature-label correlation, and $\underline{r}_{ff}$ is the mean feature-feature correlation.
The numerator of the above equation provides an indication of how predictive of the label a set of features are; the denominator of how much redundancy there is among the features. Since chosen feature are the one with high $M_{s}$ values (low denominator, high numerator), redundant features should be filtered as they will be highly correlated with one or more of the remaining features, and the included features will be highly correlated with the labels.
References
[27] Chen, M., & Rymond, B. (2012). Appendix A1: Yeast Nomenclature Systematic Open Reading Frame (ORF) and Other Genetic Designations. Alternative pre-mRNA Splicing, 1, 605.
[28] Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1-2), 273-324.
[29] Hall, M. A. (1999). Correlation-based feature selection for machine learning
Stability Predictor
The predictor
We devised an innovative prediction model for genetic co-stability, i.e the stability of two genes interlocked together, using bioinformatic tools and empirical data collected from large-scale genomic experiments. We designed this model to predict the genomic stability obtained in our Gene-SEQ experiment, determined by the distribution of mutations within the target gene. However, these results are still being analysed, so in the meantime we trained our model on similar experimental data – GFP fluorescence measurements of each variant in Prof. Schuldiner's library.
The stability predictor presented here is a Regression model, predicting the stability of fluorescent protein expression in yeast's genes, based on many features.
Introduction
Regression models have been around for many years and have proven very useful in modeling real-world problems and providing useful predictions, both in science, industry, and business environments. In parallel, neural networks and machine-learning algorithms are growing in adoption and can model complex problems and provide predictions for real-world problems. In the following chapter, we will introduce the regression methods we used, provide a short intro for regression using Neural Networks and Machine Learning, and introduce our model.
Regression Analysis
Regression analysis is a set of statistical techniques that serve as a basis for inferring the relationships between interrelated variables. Since these techniques are applicable in almost every field of study, including the social, physical, and biological sciences, business and engineering, regression analysis is now perhaps the most used of all data analysis methods. The most common form of regression analysis is linear regression, in which a researcher finds the line (or hyperplane) that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique hyperplane that minimizes its sum of squared distances from the true data. This permits estimation of the conditional expectation (or population average value) of the dependent variable, when the independent variables take on a given set of values.
Types of regression:
- Linear Regression - suitable for dependent variables that are continuous and can be fitted with a linear function.
- Polynomial Regression - suitable for dependent variables that are best fitted by a curve or a series of curves. Polynomial models are prone to overfitting, so it is important to remove outliers which can distort the prediction curve.
- Logistic Regression - suitable for binary dependent variables. Binary variables follow a binomial distribution and cannot be fitted with a linear regression function.
- Stepwise Regression - an automated regression technique that can deal with high dimensionality—many independent variables. Stepwise regression observes statistical values to detect which variables are significant and drops or adds co-variates one by one to see which combination of variables maximizes prediction power [30].
Random Forest
Random Forest (RF) is a supervised machine learning algorithm, which uses many decision trees. It is a way of averaging multiple decision trees, which are trained on different parts of the same training set, in order to reduce the variance.
The "forest" it builds, is an ensemble of decision trees, usually trained with the “bagging” method. The general idea of the bagging method is that a combination of learning models improves the overall result. This is calles the "wisdom of the crowd" approach. RFs add additional randomness to the model while growing these trees, building distinct decision trees and averaging the results. Instead of searching for the most important feature among all features when splitting a node, it searches for it among a random subset of features. This results in a wide diversity that generally results in a better model.
Neural Networks
Artificial Neural Networks (ANN) are comprised of simple elements called neurons, each of which can make simple mathematical decisions. Together, these neurons can be used to analyse complex problems, approximate almost any function including very complex ones, and provide accurate answers. A shallow neural network has three layers of neurons: an input layer, a hidden layer, and an output layer. A Deep Neural Network (DNN) has more than one hidden layer, which increases the complexity of the model and can significantly improve prediction power [31].
Neural networks are reducible to regression models—a neural network can estimate any type of regression model. For example, a very simple neural network, with only one input neuron, one hidden neuron, and one output neuron, is equivalent to logistic regression. It takes several dependent variables (or input parameters), multiplies them by their coefficients (or weights), and runs them through an activation function and a unit step function, which closely resembles the logistic regression function with its error term.
Our Predictor Mission
Our predictor is designed to predict the fluorescent protein expression in yeast's genes as a measure of co-stability. It considers features related to the conjugated gene alone, and features describing its relation to the target gene as well.
Training
We trained 2 different models, one based on the RF algorithm and the other based on Neural Networks, to predict fluorescence values of 6000 genes in yeast when tagged to GFP in their N terminal. We refer to GFP as the target gene and the yeast's genes as the candidates for linkage. For this task, the features related to the target gene were calculated according to the GFP sequence.
Regarding the GFP, we had two databases to train on as the target values:
- "NOP1" database, that contains fluorescence values of yeast's genes when tagged to GFP in their N terminal under the synthetic promoter NOP1.
- "Native" database, where the fluorescence values of the yeast's genes were measured under the native promoter.
For each model we tested several architectures – neural network regression or RF regression.
Neural Network Predictor
We designed a fully connected ANN with 5 layers: an input layer, 3 hidden layers, and an output layer. The simplest model uses normalized and pre-selected features as input. The size of the hidden layers is the input size divided by 2,4 and 8 respectively, and the output layer size is 1, as we are computing a single value as an output. Our network uses the leakyRelu activation function and several methods of the loss function to obtain the best prediction.
While training, we apply several methods of regularization, like dropout and batch normalization.
Dropout is a regularization method that approximates training many neural networks with different architectures in parallel. During training, some of the layer outputs are randomly ignored or “dropped out.” This has the effect of making each layer change its number of nodes and connectivity to the previous layer. In effect, each update to a layer during training is performed with a different “view” of the configured layer [32].
Batch normalization is a technique for training deep neural networks that standardizes the a layer’s inputs for each mini-batch. This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks. Batch normalization is achieved through a step that normalizes the means and variances of each layer's inputs. Ideally, this normalization would be conducted over the entire training set [33].
Neural Network Predictor
Our second predictor is an implementation of the RF algorithm. As explained, RFs average the decisions of many decision trees, called "estimators". The number of estimators determines the model's size. To select the best model size, we scanned various numbers of estimators to achieve the best RF architecture. The best results of the 2 models were with 300 estimators for the Native dataset and 900 estimators for the NOP1 dataset.
As part of the RF training, we applied an active feature selection using the feature importance table of our model. Each feature’s importance is measured by calculating how much tree nodes using that feature reduce impurity across all trees in the forest, after every training procedure. We deleted the 5% least significant features. After this procedure, we create our final feature table which contains the 201 most significant features.
Testing and Choosing The Best Model
Finally, after achieving the best model for each of the 4 trained models: ANN- NOP1, ANN-Native, RF- NOP1, RF-Native, we evaluated each model using three different metrics: $R^2$, agreement, and accuracy on each model's test set (all of the metrics and results are explained in the Results page). Additionally, we tested the generalization of each model for a new target gene using fluoresence measurements of 4000 genes in yeast RFP is attached to their N’ terminal.
The results clearly demonstrated that both of the RF models (RF- NOP1, RF-Native) are much more accurate and much more efficient when testing on the RFP dataset. Similar results were obtained on the GFP test set. This is probably due to the RF's unique "wisdom of the crowd" decision-making approach, which usually results in much more stable models when compared to ANNs.
Finally, we chose the RF-NOP1 model which achieved the best results on the GFP test set based on the 3 evaluation metrics. It got an $R^2$ score improved by 0.2 compared to the related RF-Native, it provided better results in predicting and ranking the 20 most likely genes to get the highest fluorescent protein expression, and it was more accurate on the RFP prediction.
Moreover, there is a biological preference for selecting the NOP1-trained model over the NATIVE-trained. In future versions of our software, we would like to propose synthetic promoters for increasing the stability and expression levels of the given construct. The use of synthetic promoters adds a degree of stochasticity to the system; theoretically, we can expect that a native and a synthetic promoter to return similar rankings of genes that improve stability. That is, genes predicted to be highly stable, will present high levels of stability in the second model. However, in reality the behavior of constructs under a synthetic promoter is less predictable. A promoter replacement can cause some of the genes to be in over-expression or under-expression. We do not know in advance how the genes behave in such a situation, but clearly it is preferable to use a prediction based on synthetic promoters, as opposed to a native promoter that does not address this uncertainty.
Implementation
After selection and optimization the RF-NOP1 predictor, we saved the model's weights and imported it to our software. When a user initiates our product and inserts his gene of interest, the target-gene related features are calculated and combined with the other pre-calculated features describing the yeast's genes. Then, our software provides a solution using a real-time model based on the Scikit-learn RF model, which loads the saved weights file and performs predictions on the user’s data.
Our real-time software is optimized for performing fast predictions by loading the weights in less than 0.5sec and getting an inference time of 0.707sec for 4035 genes, or predicting more than 5700 genes per second.
References
[30] Golberg, Michael & Cho, Hokwon. (2010). Introduction to Regression Analysis.
[31] Subana Shanmuganathan, Sandhya Samarasinghe (2016) Artificial Neural Network Modelling. DOI: 10.1007/978-3-319-28495-8
[32] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.(2014). Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (January 2014), 1929–1958.
[33] Sergey Ioffe and Christian Szegedy. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (ICML'15). JMLR.org, 448–456.
Fusion Linkers
Optimizing Selection of Linker
In this model, we focused on two types of linkers- 2A and fusion linkers. The users will select which kind of linker they would like to use, and we will provide them with the best linkers of this type.
2A linkers
There are four common 2A sequences. In order to obtain a desired protein expression level, it is important to select the correct sequence among them. The conventional way to rank the 2A peptides is from the most efficient P2A, followed by T2A, E2A and F2A [34].
If users selects this kind of linker, we will display these 4 options – their names, sequences and scores in descending order of efficiency, letting them choose which sequence they would like to use. This way, we enable the users to consider factors other than efficiency.
Fusion linkers
This model’s purpose is to select the linker that will best fuse the target and essential genes. We used a linkers' database provided by The Centre for Integrative Bioinformatics VU (IBIVU) containing 1280 linkers. The linkers in this database are inter-domain linker peptides of natural multi-domain proteins. Therefore, they provide an ample source of potential linkers for novel fusion proteins. These linkers provide the conformation, flexibility and stability needed for a protein’s biological function in its natural environment [35].
The first research question we dealt with is: what is the ideal way to choose a linker?
Our approach was to use a genetic sequence’s disorder profile. Folded proteins usually have a well-defined structure, and therefore tend to be ordered. As disorder indicates a lack of defined structure and is much easier to predict, we utilize this as an indicator of folding likelihood. This analysis is usually done per residue (amino acid).Our initial hypothesis was that an ideal linker has two properties:
- Its tendency for 3D folding is minimal, i.e. it contains more disordered residues.
- It minimally effects the essential and target proteins' natural 3D folding, i. e. the change in disorder profile is minimal.
We searched for an existing tool that would model a protein's 3D folding /disorder profile, and provide a folding score. Most of the programs we found either had a long, infeasible running time (we need to test ~1300 possible linkers) or returned a complex folding prediction that couldn’t be analyzed automatically.
We found an efficient tool called IUPred2A. The main aim of IUPred2A is identification of Intrinsically Disordered Protein Regions (IDPRs, i.e. proteins that have no single well-defined tertiary structure under native conditions) based on a biophysics-based model. This tool enables the user to input any protein sequence and IUPred2A would return a score between 0 and 1 for each residue, corresponding to the probability of the given residue being part of a disordered region.
The disordered nature of a protein segment can be context dependent: certain protein regions can switch between an ordered and a disordered state depending on various environmental factors. Currently, the IUPred2A server is able to detect such context-dependent disorder in the case where the environmental factors are either a change in the redox state or the presence of an ordered binding partner (for more details see here).
There are three different disorder prediction types offered, each using different parameters optimized for slightly different applications. These are: long disorder, short disorder, and structured domains.
Long disorder (default option):
The main purpose of IUPred2A is prediction of global structural disorder that encompasses at least 30 consecutive residues of the protein. The long disorder option is optimized for this task.
Short disorder:
In this setting, IUPred2A uses a parameter set best suited for predicting short disordered regions, such as missing residues in the X-ray structure of an otherwise globular protein. For this application, a smaller sequential neighborhood of residues is considered for the calculation of the IUPred score. As chain termini of globular proteins are often disordered in X-ray structures, this is considered using an end-adjustment parameter which favors disorder prediction at the ends.
Structured domains:
The reliable identification of ordered protein regions is a crucial step in target selection for structural studies and structural genomics projects. Finding putative structured domains suitable for structure determination is another potential application of IUPred2A. In this case, the algorithm aims to find continuous regions predicted ordered with high confidence. Neighboring regions close to each other are merged, while regions shorter than the minimal domain size of at least 30 residues are ignored. When this prediction type is selected, the region(s) predicted to correspond to structured/globular domains are returned [36].
We found the long disorder prediction type suitable for us. We downloaded the IUPRED2A python package and used it offline.
The second research question we dealt with in this model is: which property of the ideal fusion linker is more important?
After computing both the linker's disorder profile alone and the construct's (target gene + linker + essential gene) disorder profile we computed 3 scores based on Euclidean distance for each of the ~1300 linkers:
- Essential gene's score- a Euclidian distance between the essential gene disorder profile before and after fusion.
- Target gene's score- a Euclidian distance between the target gene disorder profile before and after fusion.
- Linker's score- a Euclidian distance between the linker disorder profile after fusion and the constant 1s vector of the same length (representing a completely disordered, unfolded protein).
We had to weigh each of these scores according to their importance. We decided that our initial hypothesis would be to give each score the same importance, and therefore we should give equal weight to each one as a default option. The users would have the ability to change these weights according to their wishes.
Therefore, our final score was computed as:
$$(25)Final\ score=\frac{1}{3}*Essential\ gene's\ score+\frac{1}{3}*Target\ gene's\ score+\frac{1}{3}*Linker's\ score$$
So how is the best linker selected?
The chosen linker is the one with the minimal final score.
Should each one of the scores be normalized by its length?
By modeling some disorder profiles before and after the fusion, we noticed that the score change usually occurs in a limited number of amino acids, those closest to the linker. This matched our intuition, that this is a local phenomenon. Therefore, we concluded that we should not normalize the Euclidian distance of the target and essential genes by their length, as most of their sequence is unchanged.
However, as previously described, the linker's score is computed as Euclidean distance between the disorder profile and the 1s vector. Thus, the longer the linker is, the larger score it receives. This created a bias towards short linkers. We realized that if we want to mitigate this bias, the linker’s score should be normalized.
Were our hypotheses validated?
We conducted few statistical tests in order to check whether the essential genes’, target genes’ and linkers' scores are of the same scale.
We tested ~900 different combinations of GFP (target gene) and conjugated genes (~900 options, randomly chosen from yeast's genes).
For each combination, we tested all linkers in the database (~1300 options). For each linker, we calculated the score for each component. Thus, for each combination, we calculated 3 scores vectors (essential gene, target gene and linker) containing ~1300 values (the scores for each linker). For each vector, we found the mode (most common value), giving us a single score per combination.
This way, we obtained 3 vectors (essential gene, target gene and linker) containing ~900 values (mode of scores for each possible essential gene).
*Only in this analysis, the linker's score is a Euclidian distance between its disorder profile and 0s vector. Therefore, we would like to have maximal score for the most unfolded linker.
*The scores presented are multiplied by a weight of 1/3.
We calculated these vectors' mean and standard deviation and found that:
The essential gene's score = 0.0735± 0.0710
The target gene's score = 0.115±0.0339
The linker's score = 0.0945±0.0762
In order to further analyze these vectors, we created a histogram for each, representing the most frequent scores' distribution:
We observed that both the essential and target genes' scores are very low. These scores describe the difference between their folding before and after fusion - we desire lower scores.
However, the essential genes’ change is smaller than the target genes’.
In the linker's case, we measured their average disorder - a value of 1 means the linker is completely unfolded. Thus, we desire higher scores. We noticed an unexpected behavior in the linkers; when inserted into a construct, most of the linkers have very low scores. This was contrary to our intuition, where we expected linkers to be disordered, or unfolded, meaning their score will be close to 1.
We also presented these plots in log scale (without the 1/3 weights) in order to better determine the scores' scale:
When viewed in log scale, we can infer the order of magnitude in these changes. The change in essential genes is approximately $10^{-2.25}$ where in the target genes it is approximately $10^{-3.5}$.
In addition, many of the linkers have very low scores, order of $10^{-5}$, as seen in the log scale graph.
Due to the surprising finding about the linkers, we wanted to check their score's distribution without being fused.
For each linker, a disorder profile was obtained, and we took the mode of its profile. Then, we created a histogram of these modes:
We concluded that indeed the linkers are naturally folded. This finding raised a doubt whether our assumption that the best linker should be unfolded is true.
The final method of linker's choice
Due to our findings about the linkers' disorder profile, and with additional consultation regarding the importance of the linker's folding or lack thereof, we decided to not to use this measure. In the final score, we considered only the change in the disorder profile of the essential and target genes, with equal weights. The selected linker would be the one with minimal score.
$$(26)Final\ score=\frac{1}{2}*Essential\ gene's\ score+\frac{1}{2}*Target\ gene's\ score\ Chosen\ linker=linkers[argmin(final score)]$$
$$(27)\ Chosen\ linker=linkers[argmin(final score)]$$
We decided to present the user with the ten best linkers (sorted from the lowest to highest scores) including their sequence, name, score and a graphic view of the change in their disorder profile. This way, users can take into consideration additional factors relevant for their purposes.
References
[34] Kim, J. H., Lee, S. R., Li, L. H., Park, H. J., Park, J. H., Lee, K. Y., ... & Choi, S. Y. (2011). High cleavage efficiency of a 2A peptide derived from porcine teschovirus-1 in human cell lines, zebrafish and mice. PloS one, 6(4), e18556.
[35] George RA. and Heringa J. (2002) An analysis of protein domain linkers: their classification and role in protein folding. Protein Engineering 15, 871-879.
[36] Bálint Mészáros, Gábor Erdős, Zsuzsanna Dosztányi, IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W329–W337, https://doi.org/10.1093/nar/gky384
Translation and Stability Optimization
Translation Optimization
Recent developments in the genetic engineering field have increased the demand for high-level expression of heterologous genes. Factors that influence the gene expression are explored widely in literature, and one can divide them into two categories:
- Factors that are related to synonymous mutations of a gene, such as codon bias, mRNA secondary structure and GC content.
- External factors that have no relationship with synonymous mutations, such as promoter strength.
The features in the first family have been deemed as primary mechanisms in modulating the protein abundance within synonymous mutations [37]. For this reason, we built an optimization module of the mutual construct that focuses on four key aspects:
- mRNA folding of the 3’ ORF.
- Codon usage bias.
- GC content.
- Avoidance of mutational hotspots.
While other platforms focus solely on codon usage tables when optimizing gene expression, our algorithm considers a variety of critical factors involved in different stages of protein expression, such as codon adaptability and mRNA structure.
The mechanism and motivation behind each optimization will be described below.
Optimization framework
The optimization framework we used is provided by DNA chisel [38] (complete documentation here). DNA Chisel is an easy-to-use, easy-to-extend Python library used in many bioinformatic tools (such as benchling) for optimizing DNA sequences with respect to a set of constraints and optimization objectives. It can also be used via a command-line interface, or a web application. We also used DNA Chisel for our contribution software.
The DNA-chisel algorithm has two main steps:
- Resolution of all hard constraints, ignoring optimization objectives.
- Objective optimization, while respecting the constraints.
This step separation permits early detection of incompatible constraints, and quick convergence to satisfactory solutions.
What is a constraint and what is an objective?
Constraints are specifications (or rules) that must be verified in the final sequence (an “at all cost” approach), and objectives are specifications that must be as close to verified as possible in the final sequence. The objectives are defined by a score, that is maximized during the optimization process.
The objective in our problem is to codon-optimize the sequence. The constraints are the required GC content, mRNA folding energy in the start of the sequence, and preserving the amino-acid translation of the sequence.
The algorithm also defines a Mutation Space, used by the solver to explore new sequence variants. This space stores the changes that can be made to the sequence - it indicates for each nucleotide or sub-fragment the different acceptable values, and is determined by the problem’s nucleotide-restricting constraints. For example, when an Enforce Translation constraint is applied to a stop codon, it constrains the nucleotides triplet to only 3 possible choices, TAG, TGA, and TAA. For further reading one may refer to [38] or the documentation (link above).
Additional feature of the DNA optimization engine is a report (in ZIP format) that is produced at the end of the process, containing for each input sequence:
- Annotated GenBank file of the optimized sequence with all the changes that have been made. The user can track the changes, and if not satisfied with any of them, he can change the sub-fragment back to the original sequence.
- A list of the successive constraints and objectives in a CSV format and in a PDF format. The PDF report also includes the objective score, and a Sequenticon.
The objective score indicates whether and how well the sequence complies with the specification (the more compliant, the higher the value). By convention, a negative score indicates that the specification is breached, while a score of 0 or more indicates compliance.
A Sequenticon (documentation here) is an icon that is unique to the final output sequence. This is an important feature, especially when dealing with large sets of input sequences (which are often renamed or updated), because it enables the user to differentiate between sequences that otherwise might get confused with each another. Sequenticons provide a simple visual way to know that two sequences are different (different identicons) or very likely the same (same identicon).
mRNA folding
RNA is mostly found as a single-strand molecule, such that its nucleotides are free for base pairing via hydrogen bonds between complementary bases on the same strand. This quality enables the RNA molecules to form into two-dimensional, mostly local secondary structures. There are two common recognizable "domains" of secondary structure for such a molecule:
- Stem - a region where nucleotides are paired together and form a helix sheet.
- Loops - unpaired regions, where the nucleotides form a two-dimensional circle-like structure. A loop can be an interior multi-loop, a bulge-loop or a hairpin loop (also called step-loop).
For example, take the following RNA sequence:
CGCUUCAUAUAAUCCUAAUGAUAUGGUUUGGGAGUUUCUACCAAGAGCCUUAAACUCUUGAUUAUGAAGUG
Its predicted secondary structure is the following:
This structure contains six elements: three stems (marked ‘s’ in the figure), two hairpin loops (marked ‘h’), and one multiloop (consisting of 3 multi-loops segments, each marked ‘m’).
Most algorithms that predict the secondary structure of a given RNA sequence use the binary space-efficient bracket notation, where each dot ‘.’ represents an unpaired base as part of a loop, and a base pair between base i and j is represented by an opening bracket '(' at position i and a closing bracket ')' at position j.
For example, the above sequence’s annotation is:
'(((((((((...((((((.........))))))........((((((.......))))))..)))))))))'
Why do RNA structures need to be predicted?
The structure of any molecule, not just RNA, is dynamic and depends on factors such as temperature, salt concentration, stress conditions applied on the organism, and interactions with other molecules. Thus, the RNA secondary structure is not unique per sequence, and the RNA does not always hold to its most thermodynamically stable structure.
Thermodynamically speaking, the RNA structure can be characterized with a folding free energy (known as ΔG, negative), describing the molecule’s tendency toward its most stable state. For a given mRNA length, a lower folding free energy corresponds to a more stable secondary structure.
Moreover, a lower (more negative) ΔG is correlative to a stable, strongly folded RNA, and a high (closer to zero) ΔG is correlative to weakly folded RNA. This is just a correlation, and there are many structures with a similar free energy for a given sequence. And yet, the energy function can be used to properly model the folded RNA structure.
Given the close link between structure and function, secondary structure is routinely used as a basis to design new and optimal constructs, utilizing exact dynamic programming algorithms.
How do we optimize a genetic construct in terms of mRNA structure?
We previously mentioned that a secondary structure with low ΔG more stable, thus one can refer to it as “optimal structure”, assuming we wish for a folded (and stable) RNA. Indeed, the structure that yields the minimum free energy (MFE) for a given sequence is a subject of interest, but that is not always the case. We previously mentioned, as well, that the secondary structures are mostly local, and therefore the considerations also should be local.
At the open reading frame (ORF) 5’ end, there is a well-conserved signal for weak folding of RNA [3]. The beginning of the ORF is a late initiation region, which couples the initiation and elongation steps; meaning, this region is part of the coding sequence but also includes signals that regulate the initiation step. Weak mRNA folding at the 5′ end promotes recognition of the start codon by the pre-initiation ribosomal complex: when the mRNA is not folded near the start codon (figure 23A, top) the pre-initiation complex recognizes the start codon more easily.
The region under selection for weak folding includes the first ∼10–13 codons (figure 23B); it is followed by a ‘ramp’ - a region of selection for strong mRNA folding and slower codons, enables coupling initiation and elongation; and after ∼40 codons there is no specific local signal of selection for mRNA folding.
For the optimization problem, we defined a new constraint that maximizes the local free energy (LFE) of the most folded structure in the first 15 codons. As illustrated in figure 23B, the energy function plays a wider role than simply defining energies from thermodynamic studies - it is also used to enforce certain folding rules such as the minimum size of hairpin loops and the allowable positions of G-U base pairings [40].
Why did we use the free energy of the MFE structure (the most negative ΔG) as a measure of secondary structure formation?
As mentioned earlier, for a given sequence there may be other structures with similar ΔG. Yet, we are interested in gaining a weak folding via maximizing the G, and we assume that by maximizing the LFE of the most folded structure we ensure that other secondary structures of this sequence will also be weakly-folded.
The MFE, used as an objective function for the new constraint, was calculated using the seqfold tool with (T = 37 °C). The reason we preferred this tool over the more commonly-used Vienna RNA package [41] is that Vienna is not an open-source code, and does not integrate easily with Python. On the other hand, seqfold is meant to be an open-source, minimalist alternative for predicting minimum free energy secondary structure, allowing easy implementation for our needs.
This tool computes minimum folding energy and corresponding secondary structure from RNA sequences, using the thermodynamic nearest-neighbor approach. The MFE structure of an mRNA sequence is predicted using a loop-based energy model and the dynamic programming algorithm introduced by Zuker et al. [40] and recently improved by Turner et al. [42]. The implementation of the Turner free-energy model incorporates hundreds of parameters which have been adjusted over the years. This model involves the calculation of the entropic contribution for general loops and for stacks.
Simplifying assumptions for structure prediction include:
- RNA folds into one minimum free-energy structure.
- There are no non-canonical interactions (pseudoknots and hairpins)
- The energy of a particular base pair in a double stranded region is sequence independent, meaning neighbors have no influence.
While incorporating the MFE prediction into the optimization framework, we discovered that the algorithm converges to the optimal solution of ΔG=0 kcal/mol in the first 15 codons when we use it as a hard constraint instead of a simple objective. In cases where the system finds it hard to gain zero LFE, we raised the threshold of the optimal solution to ΔG = -2 kcal/mol.
In summary:
- There is a regulatory signal in the start of the ORF for weak mRNA folding, which promotes efficient translation and recognition of the start codon.
- As part of the optimization engine, we offer to maximize the local folding energy (LFE) of the given structure at the start of the ORF (approximately 15 first codons). The most negative free energy G, also called the estimated minimum free energy (MFE), is used as the objective function for this new constraint and is being maximized during optimization.
- For a given RNA sequence there may be other secondary structures with similar ΔG. The definition of the objective function serves our purpose under the assumption that by maximizing the LFE of the most stable (folded) structure for a given sequence, we ensure that other possible structures will also be weakly-folded.
- The seqfold package was used for prediction of MFE (Zuker algorithm for the rules, turner algorithm for the parameters).
- The initial objective score of the new constraint is set to G=0 kcal/mol. However, for non-convergence cases, the optimality threshold is set to G=-2kcal/mol.
GC content
Our optimization software enforces the GC content to a requested range (minimum and maximum) specified by the user. The term GC content refers to the percentage of nitrogenous bases in the sequence. The algorithm will split the sequence to windows of a specified size and optimize each window.
The GC content has proven to be a factor that affects the translation efficiency [43], and it is also correlated to the mRNA folding energy of the entire sequence, since G-C pairs stabilize the secondary structure (lower G) [44].
However, regarding genomic stability, it has been proven that genes with high GC content have a substantially elevated rate of mutations, both single-base substitutions and deletions [45].
We recommend a default value of 30%-70% GC content, which is considered a medium range and thus may balance the need for genomic stability along with high expression levels.
Codon bias optimization
Proteins can be encoded using numerous codon combinations due to the redundancy of the genetic code, where multiple codons encode the same amino acid.
The term "codon optimization" refers to experimental and bioinformatic approaches designed to improve the codon composition of a recombinant gene based on various criteria, without altering the amino acid sequence. Codon optimization is required when one wants to efficiently express a cDNA in a species different from which it was cloned.
Unbiased, uniform mutations result in uniform distribution of codons; however, many genes are the subject of selective pressure towards certain mutations. For example, in highly expressed genes, the evolutionary pressure leads to codon preference or avoidance in a way that promotes the gene translation mechanism and expression.
The codon usage can be viewed in different contexts:
- Genomic level: for example, GC-rich organisms could be biased towards GC-rich codons.
- Region level: Specific sites in genes may evolve under different constraints. For example, mRNA folding can constrain codon usage.
- Gene level: Different genes display varying degrees of bias. Highly expressed genes tend to have stronger codon usage bias.
Here we optimize at the genomic level.
Codon optimization algorithms rely on “codon usage tables” that map a weight for each codon. The weight of a codon is a measure of its optimality, and it ranges from 0 (non-optimal) to 1 (most optimal codon). Most known algorithms for codon optimization, such as benchling or GenScript, provide a single table for a specific host based solely on the relative frequency of each codon in the host’s genome. These algorithms operate under the assumption that a codon frequency in the target gene should match the relative frequency in the host, or should be replaced by the optimal synonymous codon.
Unlike these approaches, we propose an optimization paradigm based on many biophysical parameters and not just the codon frequency in the host organism. These parameters are based on empirical findings - supply and demand of tRNA molecules and more - and thus they provide a more comprehensive solution that addresses all steps in the translation process. The optimization parameters are described below. We also let the user choose his preferred optimization method.
Optimization parameters (codon usage tables):
Each codon is optimized based on one of the following parameters:
- “Fraction”: codon relative frequency in the host genome (specified by the user as “organism name” input). The frequency data is from the Kazusa database [46].
- “tAI”: tRNA Adaptation Index (tAI) is a biophysical measure of codon usage bias [47]. The assumption behind this measure is that highly expressed genes are more adapted to the tRNA pool, i.e., the abundance and availability of the tRNA types that recognize these genes’ codons. Thus, codons with available tRNA are considered more optimal: More codon-recognizing tRNAs in the cell lead to faster translation elongation for the codon.
This index uses estimations of the tRNA pool in order to assess the weight of each codon (the relative adaptedness to the pool). The tAI table for yeast is taken from Tuller et al [48].
- “nTE”: the normalized translation efficiency (nTE) is an improvement of the previous approach for estimating codon optimality [49], by considering not only the supply of tRNA but also the demand, i.e. the number of codons and mRNA copies that are translated by the same tRNA.
An optimal codon is characterized with a supply/demand ratio greater than 1.
We calculated the supply/demand ratio to assess the weight of each codon (translational efficiency). The calculation is performed similarly to the NTE feature presented in the "feature generation" tab.
- “TDR” (typical decoding rate): for each codon, this empirical measure is defined as the codon decoding time’s reciprocal.
The codon decoding time is an estimation of how long a ribosome typically stays on each of the 61 codons during translation based on ribo-seq experiments. The longer the ribosome stays, the slower the translation. This is like other optimality measures that we described but based on direct experimental measurements. For the calculation of TDR, statistics for each codon are collected from all transcripts. Afterwards, we calculate the codon distribution and extract the expected value. This value is estimated as the typical decoding time of the codon and is the reciprocal of the translation rate.
The typical decoding time of each codon was taken from this database (S. cerevisiae, Exponential) [50-51]. Then, it was converted to rate and normalized to the range of 0-1. Since this measure is not defined for stop codons, the weights of these codons are set to their frequency in the host genome (the “fraction”).
There are three common methods for codon optimization:
- "Use best codon": in this method, every codon will be replaced by the "optimal" (i.e. most frequent) synonymous codon in the target organism. This is equivalent to the Codon Adaptation Index (CAI) optimization that is described often in literature.
- “Match codon usage”: the final sequence's codon usage will try to match the codon usage profile of the target species. This method is used throughout the literature, for instance Hale and Thomson 1998 [52].
- “Harmonize RCA”: each codon will be replaced by a synonymous codon, whose usage in the target organism matches the usage of the original codon in its host organism [53].
The codon optimization serves two purposes: first, it promotes higher production of proteins as we required. Second, regarding the stability of the sequence, we assume that matching the codon usage of a host organism helps the sequence adapt to the entire gene expression mechanism of the cell. This adaptation can potentially lead to the preservation of the mRNA and later the protein of interest in the host cell for a longer duration of time. In literature, codon optimization is often associated with higher production of proteins, likely (but not solely) due to this assumption.
This specification is implemented in DNA Chisel, but we added the additional optimization parameters of tAI, nTE and TDR.
Avoidance of mutational hotspots
Another framework we utilized for this optimization is the EFM Calculator developed by the 2015 UTexas iGEM team. Their EFM Calculator is a webtool, currently integrated into Benchling, that finds mutational hotspots in input sequences. For more details, see our contribution page. We implemented its calculation and corroborated our results with the original calculator. Using this hotspot detection, our optimization model detects these mutational hotspots in sequences, and avoids them. Thus, we generete a sequence that is more evolutionary stable.
Trade-off between stability and translation efficiency
The optimization process involves two steps: first, we optimize the input concerning mRNA folding at the start of the sequence, required GC content and codon optimization. Then, we avoid mutational patterns detected by the EFM in the half-optimized sequence, while optimizing the codons and keeping the start of the sequence as-is. This two-step strategy allows the algorithm to generate a sequence that is closer to optimum, and only then deal with mutational hotspots. Thus, the probability that new problematic sites will appear after optimization decreases dramatically.
References
[37] Zur, H., & Tuller, T. (2013, October). Transcript features alone enable accurate prediction and understanding of gene expression in S. cerevisiae. In BMC bioinformatics (Vol. 14, No. S15, p. S1). BioMed Central.
[38] Zulkower, V., & Rosser, S. (2019). DNA Chisel, a versatile sequence optimizer. bioRxiv.
[39] Tuller, T., & Zur, H. (2015). Multiple roles of the coding sequence 5′ end in gene expression regulation. Nucleic acids research, 43(1), 13-28. link.
[40] Zuker, M., & Stiegler, P. (1981). Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic acids research, 9(1), 133-148. link.
[41] Lorenz, R., Bernhart, S. H., Zu Siederdissen, C. H., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for molecular biology, 6(1), 26. link.
[42] Mathews, D. H., Disney, M. D., Childs, J. L., Schroeder, S. J., Zuker, M., & Turner, D. H. (2004). Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences, 101(19), 7287-7292. link.
[43] Lercher, M. J., Urrutia, A. O., Pavlíček, A., & Hurst, L. D. (2003). A unification of mosaic structures in the human genome. Human molecular genetics, 12(19), 2411-2415.
[44] Wolfsheimer, S., & Hartmann, A. K. (2010). Minimum-free-energy distribution of RNA secondary structures: entropic and thermodynamic properties of rare events. Physical Review E, 82(2), 021902.
[45] Kiktev, D. A., Sheng, Z., Lobachev, K. S., & Petes, T. D. (2018). GC content elevates mutation and recombination rates in the yeast Saccharomyces cerevisiae. Proceedings of the National Academy of Sciences, 115(30), E7109-E7118.
[46] Nakamura, Y., Gojobori, T., & Ikemura, T. (2000). Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic acids research, 28(1), 292-292.
[47] Reis, M. D., Savva, R., & Wernisch, L. (2004). Solving the riddle of codon usage preferences: a test for translational selection. Nucleic acids research, 32(17), 5036-5044.
[48] Tuller, T., Carmi, A., Vestsigian, K., Navon, S., Dorfan, Y., Zaborske, J., ... & Pilpel, Y. (2010). An evolutionarily conserved mechanism for controlling the efficiency of protein translation. Cell, 141(2), 344-354. link.
[49] Pechmann, S., & Frydman, J. (2013). Evolutionary conservation of codon optimality reveals hidden signatures of cotranslational folding. Nature structural & molecular biology, 20(2), 237. Link
[50] Dana A. and Tuller T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 2014.
[51] Dana A. and Tuller T. Mean of the typical decoding rates: a new translation efficiency index based on ribosome analysis data. 3G. 2014.
[52] Hale, R. S., & Thompson, G. (1998). Codon Optimization of the Gene Encoding a Domain from Human Type 1 Neurofibromin Protein Results in a Threefold Improvement in Expression Level inEscherichia coli. Protein Expression and Purification, 12(2), 185-188.
[53] Claassens, N. J., Siliakus, M. F., Spaans, S. K., Creutzburg, S. C., Nijsse, B., Schaap, P. J., ... & Van Der Oost, J. (2017). Improving heterologous membrane protein production in Escherichia coli by combining transcriptional tuning and codon usage algorithms. PloS one, 12(9), e0184355.