Engineering Success
Introduction
Our work in sTAUbility is divided into four main planes:
- Model – predict the stability of different constructs and decide upon the best conjugated gene and linker for a given target gene, considering expression levels.
- Software – a recommendation engine that combines the developed models and outputs the best construct for increased stability and expression levels. The software is our way to communicate our solution for potential users.
- Measurement – quantify the stability of each variant in large-scale libraries or part collections; the data from such experiments can be used to train our models further and create a more generalized software.
- Validation – make sure that the basic idea is valid and reliable.
This modular design enables us to develop and improve each element separately, and quickly adapt to changes according to feedbacks we receive. However, some of these steps required preliminary assessment.
In order to solve some of our project's problems we used synthetic biology tools to generate expected results. We tested design principles through modelling and statistical analysis. In this page, we will elaborate on five different projects that are part of our engineering success:
- Probability analysis of overlapping frame-shifted sequences, that helped us analyze our initial vision and improve our design.
- Co-culture statistical analysis, to define the fitness-related limitations of using FACS in the initial Gene-SEQ experimental design.
- Chi.Bio setup, to be able to use the sophisticated hardware for our Gene-SEQ experiment.
- Restriction enzyme analysis, to choose the restriction enzymes that will enable us to identify the different strains in the co-culture, while preserving as much of the target gene (GFP or RFP) as possible for mutational-footprint analysis.
- Selection of 10 essential genes, for the Proof-of-concept experiment. The experiment's strategy includes measuring stability of 10 diverse essential genes interlocked to GFP, in comparison to GFP alone, thus proving that essential genes affect the stability of the GFP, but not in the same way. The selection model helped us define what is meant by "diverse genes".
Probability analysis of overlapping frame-shifted sequences
Our initial idea for the project (see inspiration page) was inspired by the 2013 Lethbridge team, who introduced the use of pseudoknots as a frame changer. Pseudoknots are RNA secondary structural elements that cause the ribosome to “slip” backwards by one nucleotide, thus switching between translational frames and reading the protein that is encoded in the -1 reading frame.
One application for Pseudoknot-Induced Frameshifting is Dual-coding: By utilizing dual-coding sequences downstream of a pseudoknot, the ORFs of two proteins can be overlapped and placed in different reading frames, thereby reducing the coding space required to produce a construct.
Our team came up with another possible way of using this pseudoknot’s quality, by encoding a target gene and an essential gene in different frames, and thus increasing the stability of the gene of interest. A proof of concept already exised in literature [1], that showed when a sequence of interest was co-encoded with an essential gene in a living bacterium, its evolutionary stability substantially increased. We wished to develop an automated system that will recommend such a contrast for a given target gene, without changing the amino-acid sequence.
As a first step, we decided to run a feasibility test and analyze the probability of a given target gene to overlap a frame-shifted essential gene in a meaningful way.
For that, we developed a generic algorithm that calculates the probability that a given target gene will have an overlap longer than 5 AA (amino acids) with any frame shifted ORF in the given reference genome, with the shift being -1 nucleotides.
The function receives a target gene sequence (for instance GFP or insulin) in an amino-acid alphabet, and a path to a FASTA file that contains the reference genome (for example, a yeast’s whole genome) in amino-acid alphabet. The reference genome can also consist of a specific set of genes such as essential genes. The last input 'n' is the number of iterations.
In each iteration 'j' the function randomizes the nucleotide sequences of the genome (preserving the amino acids) and then shifts the reading frame by +1 nucleotides. This results in a new set of nucleotide sequences, or a “frame-shifted genome”, that is translated back to amino acid sequences (different from the original input sequences). We call the translated set $FS_{AA}$, as in “frame-shifted amino acid set”.
The search for overlap is based on a unique data structure called “Suffix Array” (SA) that was first presented here [2] and is used in the context engineering gene expression and analyzing the adaptability of an ORF to a reference genome [3]. SA is basically a sorted list of all the suffixes of the reference genome, without storing the actual strings. Thus, it does not require a lot of memory and is easily searched. SA can be used as an efficient way to find the longest substring (or overlap) starting at position ‘i’ in the target gene, that appears somewhere in the reference genome [3].
Sorted arrays, such as suffix arrays, can be searched efficiently, using binary search - a search algorithm that compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found.
So, after generating a suffix array from $FS_{AA}$, the function calculates the length of the longest common prefix between the frame-shifted genome and the so-called "key" (the target gene starting from position 'i' in the sequence). Thus, for each position 'i' in the target gene we get a score of the longest common prefix, marked $S_{ij}$ where ‘i’ is the position and ‘j’ is the iteration.
The summarized "iteration score" $S_{j}$ for the specific 'j' iteration is calculated as the number of prefixes longer than 5. Mathematically:$S_{j}=count(S_{ij}>5)$ , where $S_{ij}$ is the longest prefix starting from the i-th position in the j-th iteration as described above.
The final probability is calculated as sum on the iteration scores, divided by the number of iterations and the length of the target gene. Mathematically:
$$(1) p=\frac{1}{n*L}\sum_{j=1}^{n}S_{j}$$
where $S_{j}$ is the iteration score, 'n' number of iterations, 'L' length of target gene (in AA).
For our analysis, we inserted the following inputs:
- Yeast essential genes
- GFP as a target gene
We also used e-coli whole genome and insulin as a gene of interest but achieved similar results, therefore we will only show the results of the yeast analysis.
After 50 iterations, the calculated probability was 0.01297.
The longest overlap was of 9 amino acids. Also, there is a maximum of 7 substrings per iteration that produce overlap longer than 5 amino acids, as can be seen in the histogram below, which is hardly enough to maintain a stable construct and form a meaningful overlap between the two genes.
This analysis led us to the conclusion that our current solution (see figure 2) has a major limitation, since we will not be able to produce a generalized construct for any given target gene.
Thus, we came up with other possible ways to interlock a target gene to an essential gene, as described in our project description page.
This kind of process reflects the engineering cycle:
- We researched and read about other iGEM teams, and this research inspired us to use the pseudoknot to solve the stability problem of genetically modified constructs.
- We Imagined how it can be used in a system by creating an overlap with an essential gene
- We designed an algorithm to test the feasibility of the solution
- We built a platform that does the calculations
- We tested this platform with many target genes and reference genomes.
- We learned that the probability of an overlap between gene of interest and an essential gene is not as high as we wanted
- We improved our methods accordingly and designed new ways to interlock the essential and target genes by using many different linkers.
Gene-SEQ statistical analysis
One of the experiments planned by our team is the gene-SEQ experiment. In this experiment, we used 6-7k variants of saccharomyces cerevisiae, each variant, has a fluorescent gene (GFP/RFP) was attached attachedN-terminally linked to another known yeast gene. All variants were grown as a mixed culture in the Chi.Bio machine, and fluorescence was monitored for 300 generations. This allowed us to analyze both the fluorescent gene’s stability, and the variant’s fitness. We were able to conduct this experiment due to the works of Prof. Schuldiner's lab, which timeprovided us their SWAT-GFP/RFP yeast libraries.
Initially, we planned to manually dilute the co-cultures every day. We referred to the volume sampled (in microliters) each day for dilution purposes as the "sample size". At the end of this evolution experiment we planned to use FACS to classify the different strains according to their fluorescence level.
In this analysis, we attempted to confirm the statistical viability and limitations of the usage of FACS for the gene-SEQ experiment, and quantify the sample size that will enable us to keep most of the variants in the population, regardless of their fitness. As it was done in early stages of our design and modelling, we did not have access to gene features or metadata and based our analysis on various simplifying assumptions. We tried different assumptions and monitored their effect on our results.
Probability of capturing all genes
In the previously described library of yeast variants, the quantity of each variant was different. We found empirically that the most prevalent variants are a hundredfold more frequent than the least prevalent ones. We wanted to estimate the effect of sample size on the probability of “losing a construct” between samples in the co-culture experiment.
We estimated the probability of not capturing all variants as a function of sample size. Our estimation was based on different assumptions of prior distribution on variant population quantities.
As a baseline, we calculated this probability for equal population sizes, in which case we can easily use Poisson’s formula. The probability of finding k samples of a specific variant in a sample is:
$$(2) P(k)=\frac{\lambda^{k}\exp (\lambda)}{k!}$$
In our case, $\lambda=\frac{sample\,size}{6000}$, as we assume 6k genes and equal populations, the probability of missing a gene is $1-P(0)$, and the probability of missing any gene is $(1-P(0))^{6000}$. Thus, our probability equals to:
$$(3) P(missing\:any\:gene)=(1-\exp(\frac{sample\:size}{6000}))^{6000}$$
The second distribution we tested is a uniform distribution on population sizes, along a 100-fold factor. We carried out a similar calculation, where we assumed the ratios between populations fit a uniform distribution, plugged the appropriate $\lambda$ for each population size, and found $P(missing\:any\:gene)$.
The third distribution we tested was a Gaussian distribution on population sizes. We randomly generated population sizes 100 times, with mean of 50 and STD of 13.6 – values which as expected value allow one variant to have a population size of 100 and one to have population size of 1. We used these generated population ratios, calculated appropriate values of $\lambda$, and calculated $P(missing\:any\:gene)$.
In our final and most realistic assumption, we assumed that the original populations were equal, co-culture experiment had started already in Shuldiner’s lab, and this caused the population size discrepancies. We assumed naively that the populations were not yet competing for resources (multiplying in 2 every generation), and thus their population size followed the following ratios:
$$(3) population\:ratio=2^{fitness*number\:of\:generations}$$
We assumed that their fitness is uniformly distributed between 0.75 and 1, based on a simplifying description of empirical data. Thus our 100-fold factor in population sizes allows us to estimate how many generations have passed:
$$2^{1*number\:of\:generations}=100*2^{0.7*number\:of\:generations}\rightarrow number\:of\:generations \approx 26.6 $$
Using the number of generations found and a uniform distribution on fitness, we generated population ratios and estimated $P(missing\:any\:gene)$.
The results are summarized in the following graph, for different population sizes.
Probability of all genes’ survival after 300 generations
We discovered an even more fundamental issue while calculating the above results. Given a co-culture experiment, we would not keep variants of none-similar fitness even with very large sample sizes.
To demonstrate this point, we once again assumed a uniform distribution of fitness scores among variants. We calculated the probability of survival of all constructs of fitness above 0.99, thus almost equal fitness.
We demonstrated mathematically that the probabilities of keeping almost all variants (losing just a few) is of a similar scale to keeping all variants, allowing us to estimate only the probability of keeping all variants. We generated the population sizes after 300 generations if no sampling is done along the way, and using our previous mechanism, calculated the probability of survival of all variants to the 301st generation. As one can see in the following graph, the likelihood of success is very small even for very large sample sizes, and we are ignoring the probability of losing populations until the 300th generation.
Due to these results, we realized both the possible importance of separating co-culture experiments, and the need to use very large samples in order to prevent losing variants with low fitness. This last conclusion led us to reconsider the use of FACS to separate samples. Therefore, following this analysis, we improved our design - we decided to switch to using a sophisticated robotic system called Chi.Bio. Instead of FACS separation, we will grow the cultures in the Chi.Bio machine and at the end of the experiment we will sequence them to find the mutational footprints. For more information, visit our Measurement page.
Chi.Bio setup
Chi Bio is an open-source robotic platform for experimental automation in biological science research and education. The system contains a control computer, main reactor and liquid handling (pumps and tubes). During our experimental work, we conducted a co-culture experiment using the Chi Bio system.
The engineering approach also means adopting given systems to our needs. For this purpose, we needed to make changes in the system’s main operation code. This included:
- Changing the default code that controls the liquid handling, so that we can split the pumps and use them for two different purposes. In addition, we changed the frequency of the pumps.
- Designing caps with nozzles that allows sterility and avoidance of leaks.
You can read more about the Chi.Bio system and how we implemented it in our Experiments page.
Restriction enzyme analysis
At the end of the Gene-SEQ experiment (i.e. the evolution experiment with the GFP and RFP co-cultures) we used Nanopore deep sequencing to divide fluorescent strains and non-fluorescent strains. We estimated that we were going to face 2 main challenges at that point:
- Identify a specific strain from the 6000 strains in the library.
- Identify the mutations that occurred in the fluorescent gene (RFP or GFP) and in the gene attached to it.
Unfortunately, the SWAT library we received from Schuldiner labs (Schuldiner, SWAT) [4,5] did not contain a bar code that allowed a relatively simple identification of the strains. We concluded that the best way to meet both challenges is to use restriction enzymes. Our approach using the restriction enzymes wasis to make a cut in the promoter of the fluorescent gene sequence, and another cut as far as possible in the conjugated gene, so the extracted sequence would consist the complete fluorescent gene and some of the conjugated gene. Using the same restriction enzyme for both ends would allow self-ligation of the cut sequence to create a circular DNA. This way, we could use the linker between the two genes (florescent and conjugated) to design one primer that fits to all the strains in the library and eventually make a deep-sequencing of these sequences (fig. 7).
To meet economic and biological constraints, we chose to use one or a maximum of two restriction enzymes, which would give the optimal result. Our model team had designed and implemented an algorithm that would compare and display the optimal set of enzymes.
The process
As soon as we realized the need, we started searching the internet for a tool that could help us accomplish the task. We had a specific need, going over all possible enzymes for all possible constructs and collecting statistics to find an optimal enzyme. As we had not found a method that would do this efficiently, we decided to design such a tool.
We designed a tool that would search for an optimal enzyme with the following properties:
- There is no cutting site within the promoter or fluorescent gene, which are identical for the 6000 yeast variants.
- Each restriction enzyme cuts a segment of the yeast variant, which includes the promoter, fluorescent gene, and some of the conjugated gene. Our method optimizes for a minimal amount of none-unique sequences cut by the enzyme, among the 6000 variants. This maximizes our differentiation capabilities.
- In addition, beyond a certain sequence length, the PCR may yield inaccurate results. Thus, we optimized for a minimal number of yeast variants which, when cut, create a sequence of length over 3K BP.
At first, we attempted two approaches:
- Search for a suitable restriction enzyme from a database of commercially available restriction enzymes [6].
- Search for a custom designed restriction site.
After comparing the results of the two approaches, we realized that there is no significant advantage to either one. Thus, so we chose to focus on commercially available enzymes. This would, of course, reduce costs significantly.
Later, we simulated the use of 2 different restriction enzymes. We saw some minor improvement of the results, but not enough to warrant the financial and biological cost. Thus, we decided to stick with one enzyme.
Instead of trying to optimize between two parameters where we do not know the relative weight (number of none-unique sequences vs. number of too-long sequences), we found the enzymes which are Pareto efficient [7]. An enzyme would be Pareto efficient if there is no way to reduce the percent of none-unique sequences without increasing the percent of sequences over length 3000. Pareto efficiency is a necessary but not enough condition for being an optimal solution. In addition, we used the fact that not many enzymes are Pareto efficient. We compared these enzymes manually and found BsrI to be the optimal enzyme for RFP fluorescent gene attached to conjugated genes in yeast. Still, due to difficulties in supply caused by the COVID-19 situation, we had to consider not only the efficiency but also the availability of the chosen enzymes.
Finally, we chose the enzymes that showed good performance in this analysis and also were availble in the lab. You can see the results in our Results page.
Selection of 10 genes for single gene experiment
During an evolution experiment, GFP (or mCherry) proteins conjugated to essential genes with certain properties will show greater preservation at the genetic sequence level than GFP expressed without being coupled to another protein. The evolution experiment was conducted by using the SWAT-tagged library of yeast genes, containing $\sim 5,500$ strains each tagged with GFP or mCherry at the N-terminus of a protein (Schuldiner, SWAT).
The purpose of the following model is to select 10 inviable genes that are as diverse as possible, in order to understand (when crossed with empirical information) which features helped maintain genomic stability of GFP or RFP proteins that are transcriptionally interlocked to it. This way, we could prove that distinctive essential genes contribute differently to the stability of the fused construct.
To achieve this goal, the model focuses on the characterization of inviable genes based on available databases and sequence analysis, followed by the clustering of genes into clusters.
Since the goal of this model is to choose ten inviable diverse genes, the first thing we had to do was to collect only the inviable genes out of the tagged genes that are available in Shuldiner's library. Thus, for the first filter we use the definition of "inviable" gene from the SGD website, and filter the genes accordingly.
Then, since the ten genes are about to be used for evolution experiment, we needed them to show adequte flourescence (so that it will be detected in a plate reader). Thus, in the second filter we took all the inviable genes and choose only the one that are flourecent above a threshold of 40 (see explanation below).
After filtering all the genes in the yeast library, the genes that remained are inviable and flourecent, and so match our purpose.
To each gene in this filtered database we generate a set of features, and then choose the 10 most diverse genes as described below.
- Filtering inviable with no overexpression genes from SGD
- Filtering fluorescent genes based on empiric database Loqate
- The GFP-tagged yeast library (C’-terminal GFP under natural promoter).
- The SWAT library (N’-terminal GFP under NOP1 promoter).
Source: The initial database is from Yeastmine search engine, a collaboration between SGD and the Intermine project.
Tab: PHENOTYPES.
Query: Gene >> Phenotype.
Gene list: ALL_Verified_Uncharacterized_Dubious_ORFs
Filters: keep only genes with:
(a): Observable: inviable, (b) Mutant Type: no overexpression
Those filters were chosen to fit the conditions of the evolution experiment.
Source: Link
Loqate is an online database that includes measurements of localization and abundance of 5,330 proteins of the yeast (single cell resolution) from two libraries:
For the fluorescence filtering we used the measurements of the SWAT library, because this is the library we eventually use in our experiment.
Condition: N’ NOP1pr-GFP in SD from SWAT library, and N' TEF2pr-mCherry in SD from SWAT library. All measurements also appear on the SWAT website.
Localization: All. We downloaded a separate file for each location in the cell and combined all files, as it gave a much more complete list. Each file includes the intensity (fluorescence) and the localization under each condition. The intensity is given in arbitrary units (a.u.).
Filters: keep only genes with above threshold fluorescence intensity in both conditions. The threshold is determined as 40, because it resulted in enough genes for the analysis, and appeared to take only those with observable intensity as can be seen in the database’s images.
Figure 10 shows the distribution of genes with a given intensity (a histogram of intensity vs. number of genes). It can be observed that below the threshold level of 40 is the 'center of mass' of the distribution, such that above this value are genes with significant fluorescence.
Figure 11 shows the genes that remain for the analysis after choosing a specific threshold. Thus we ensure that the threshold chosen does not leave us with too few genes.
For future acknowledgement, we may consider checking more thresholds, but since the goal of this model is to collect initial empirical data in an unsupervised way, and since the 10-genes evolution experiment will be followed by more experiments, this threshold is sufficient for our needs.
The filtering resulted in 600 genes. Now, for each gene we generated descriptive features, in order to divide them to "families" with distinct characteristics.
Loqate features
Source: Loqate website. The database: Link
As mentioned before, Loqate is an online database that includes measurements of localization and abundance of 5,330 proteins of the yeast (single cell resolution) [8],[9]. The experiments were conducted using the GFP-tagged yeast library (C’-terminal GFP under natural promoter).
Loqate started as a project that aimed to describe the dynamics of proteins in response to external stimuli, considering that regulation of protein levels and localization is an essential part of such responses. In case of starvation, the protein can change its subcellular localization and expression in comparison to its control condition.
The data collected from this database is the result of systematically tracking the localization and abundance of proteins under:
Normal conditions (control) - The growth experiments were conducted under normal conditions (no stress) on an SD medium (synthetic medium with dextrose) and the fluorescence intensity was measured.
Three stress conditions (external perturbations) -
-
Dithiothreitol (DTT): reducing stress, can influence gene cloning, growth and function.
-
Hydrogen Peroxide (H2O2): oxidative stress.
-
Nitrogen starvation: the yeast consumes a variety of nitrogen and carbon sources. Without it, the yeast starves.
-
Two genetic mutations -
- CCT mutant (MA3): MA3 is a mutation library first presented in this article [10].
This library includes $\sim 5,100$ strains in which the amino acid Glu is mutated to Asp in the ATP binding site of subunits 3 (MA3).
Each strain in the library expresses either wild-type or mutant CCT, and an endogenous protein fused to GFP (meaning the first library was crossed with the mutant strains). - Pup2-DAmP: Localization shift under hypomorphic mutation (partial loss of gene function) in the proteasome core, which is a protein complex in cells containing proteases; it breaks down proteins that have been tagged by ubiquitin.
Figure 13. MA3 mutation library.
- CCT mutant (MA3): MA3 is a mutation library first presented in this article [10].
Features: Under each condition mentioned above besides Pup2-DAmP, there are features:
- Median abundance: based on the fluorescence intensity of the GFP. in arbitrary units.
- Standard deviation (std) of the abundance: based on the fluorescence intensity of the GFP. in arbitrary units.
Localization: classified proteins into 13 localization categories and detected hundreds of proteins that shift between different cellular locals under stress.
-
Fold change: For each stress condition, the fold change value is calculated as the ratio of median GFP intensity measured under stress to the median GFP intensity under reference conditions.
For example: Say control median intensity = 26.65.
Say DTT median intensity = 22.74.
Then fold change value = 22.74/26.65 = 0.85.
In order to implement the clustering algorithm, we have decided to convert the categorial feature of localization to 13 numerical features (one hot encoding) and compare it to control localization. Those 13 features were assigned with low weight, so that their influence will not be too high when clustering.
SWAT features
Source: S-W-A-T website. The database: Link
The SWAp-Tag (SWAT) library is a genome-wide library of $\sim 5,500$ strains carrying the SWAT NOP1promoter-GFP module at the N terminus of proteins. This is the first whole-genome SWAT library with an N′ tag, covering $\sim 90\%$ of yeast genes. The imaging of this library allows to determine the localization of 796 yeast proteins that could not be visualized before with a C′ fluorescent tag. The researchers also constructed six additional libraries to explore many aspects of yeast cell biology: the role of promoters in regulation of protein expression, the mitochondrial protein roster, protein interactions on a whole-organelle level, and systematic assessment of protein topology.
The full libraries were first presented here [4]. The construction of this library involved using the SWAp-Tag method that was first introduced here [5], and allowed rapid creation of yeast libraries.
The database includes Information on protein localization and fluorophore intensity from this study [8] for every yeast strain in the N’ SWAT libraries: NOP1pr-GFP, NOP1pr-MTS-GFP, NOP1pr-SP-GFP, TEF2pr-mCherry, TEF2pr-MTS-mCherry, TEF2pr-SP-mCherry, NATIVEpr-GFP and TEF2pr-VC/cyto-VN.
Additional information for each yeast strain from relevant literature: protein localization and fluorophore intensity of images of C' GFP tagged strains (Breker et al., 2013; Huh et al., 2003), Flow cytometry data of the C’ GFP tagged library (Newman et al., 2006), Ribosome Profiling (Ingolia et al., 2009), Mass Spectroscopy (Picotti et al., 2013), Protein Half-life (Belle et al., 2006), mRNA abundance (Weinberg et al., 2016) and mRNA Half-life (Neymotin et al., 2014).
We haven’t used all additional information, because we wished to focus only on the relevant ones and not to overflow the feature space, but we may use the rest of them in the next model.
Features:
- NOP1pr-GFP intensity and localization: Median of GFP intensity of strains tagged with basic SWAT, SWAT-MTS or SWAT-SP cassettes (URA-Nop1p-GFP).
- TEF2pr-mCherry intensity and localization: Median of mCherry intensity of SWAT strains after swap withTef2-mCherry donor (NAT-Tef2-mCherry).
- NATIVEpr-GFP intensity and localization: Median of GFP intensity of SWAT strains after swap with Seamless GFP donor (Native promoter-GFP)
- Tef2pr-VC and Cyto-VN intensity and localization: Median of GFP intensity of SWAT strains after swap with Tef2-VC donor (Hygro-Tef2-VC), and mating with a cytosolic -VN marker strain.
- mRNA: mRNA abundance data from Weinberg et al., 2016.
- mRNA half life: mRNA turnover is the rate at which a mRNA is degraded intracellularly and is described in terms of mRNA half-life, which is the time required to degrade 50% of the mRNA. mRNA Half life data from Neymotin et al., 2014
Rates of protein evolution
Source: the article “Functional genomic analysis of the rates of protein evolution” [11]. Supporting Table 4. Link.
The above article aims to estimate the evolutionary rates over 3000 proteins in four species of the yeast genus Saccharomyces and to investigate their relationship with levels of expression and protein dispensability.
Nonsynonymous and synonymous substitution rates were obtained using multiple orthologous sequences and the following phylogeny ((S. paradoxus, S. cerevisiae), S. mikatae, S. bayanus). Codon bias (CAI) was used to derive an adjusted measure of synonymous rate that does not contain a selection effects from translational accuracy or efficiency.
Features:
- CAI - codon adaptation index, an estimation to codon bias. The CAI is a measure of synonymous codon usage bias in the direction of the bias seen in a reference set of 24 S. cerevisiae genes having high protein levels [12]. CAI takes values from 0.0 (no bias) to 1.0 (maximum bias). This estimation looks at the distribution of codons in a specific gene in relation to the reference genome, and from it learns about the similarity of the gene to the reference genome. CAI is twice as strongly associated with evolutionary rate as is mRNA abundance. This finding is consistent with the hypothesis that CAI reflects historical expression levels relevant to protein evolution, rather than expression levels in the laboratory [11].
- dS - Synonymous divergence, or synonymous substitutions per synonymous site. This is the mean amount of evolutionary silent mutations in comparison to Saccharomyces bayanus, Saccharomyces mikatae, Saccharomyces paradoxus, and S. cerevisiae.
- dN - nonsynonymous divergence.
- dN/dS - The ratio serves as a measure of evolutionary rate. High ratio means that the gene is less conserved.
- dS adjusted for codon bias - to correct for selection on synonymous sites, the writers use a measure of dS-adjusted according to the gene’s level of codon bias. Thus they derive a measure of synonymous rate that does not contain selectional effects from translational accuracy or efficiency.
- dN/dS adjusted - computes the ratio with dS adjusted.
Sequence analysis
The sequences were downloaded from the SGD archive.
File: orf_coding_all.fasta.gz. Includes a FASTA file with systemic name, description and sequence of all genes.
Source: Biopython library, package SeqUtils.
The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology. The Biopython web site provides an online resource for modules, scripts, and web links for developers of Python-based software for bioinformatics use and research. Basically, if you like to program in Python, this library makes it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and scripts. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.
The package “seqUtils” includes miscellaneous functions for dealing with sequences.
Features:
- GC content total: the total appearances of nucleotides 'G' and 'C' divided by the length of the sequence.
- Gc_content_1: GC content of first codon position, affects gene expression ( a measure of codon usage bias).
- Gc_content_2: GC content of second codon position.
- Gc_content_3: GC content of third codon position.
- Molecular_weight: Molecular weight of the protein
- MeltingPoint_NN: Melting point of the protein. Calculation based on nearest neighbor thermodynamics. Several tables for DNA/DNA, DNA/RNA and RNA/RNA hybridizations are included. Correction for mismatches, dangling ends, salt concentration and other additives are available in this module.
- Local_complexity_ coefficient_simple: Local compositional complexity is a numerical measure of repetitiveness of sequences of symbols from a finite alphabet. Highly repetitive sequences are considered simple, whereas highly non-repetitive sequences are considered complex.
- Percent_ duplicate_dna: Calculates the percentage of two-letters duplications in a sequence (two following letters are identical).
Protein analysis:
Source: Biopython library, package SeqUtils, Module ProtParam.
The “ProteinAnalysis” class contains methods for simple protein analysis.
Features:
- Aromaticity: calculates the relative frequency of aromatic amino-acids (called aromaticity) according to Lobry, 1994 [13]. It is simply the relative frequency of Phe+Trp+Tyr. In the context of genetic stability, aromatic amino-acids represent a group of amino-acids which frequency is highly variable among proteins. An interpretation is that the biosynthesis of these aminoacids is expensive for the cell, so that there is a selective pressure to reduce the aromaticity of proteins. The fact that these aminoacids are rare is consistent with this hypothesis. However, these amino-acids do not completely disappear, so that there should be an inverse tendency to maintain them in proteins. This inverse tendency could be attributed either to a simple mutational drift or more likely to a selective advantage due to a contribution to the stabilization of the three-dimensional structure of the protein.
- Isoelectric_point: calculates the isoelectric point of the protein, which is defined as the pH at which the protein carries no net electrical charge or is electrically neutral in the statistical mean.
- Helix_perc: calculates the percentage of Alpha-helix secondary structure in the protein. Amino acids in helix: V, I, Y, F, W, L
- Turn_perc: calculates the percentage of Turn secondary structure in the protein. Amino acids in Turn: N, P, G, S.
- Sheet_perc: calculates the percentage of beta-sheet secondary structure in the protein. Amino acids in sheet: E, M, A, L.
- molar_extinction _coefficient_reduced: A measurement of how strongly a protein with reduced cysteines absorbs light at a given wavelength, per molar concentration.
- molar_extinction _coefficient_oxidized: A measurement of how strongly a protein with disulfide bridges and cystine residues (oxidized form, Cys-Cys-bond) absorbs light at a given wavelength, per molar concentration.
Chromosomal location analysis (folder: length analysis)
Source: Yeastmine search engine
Tab: GENOME.
Query: Gene >> Chromosomal location.
Gene list: ALL_Verified_Uncharacterized_Dubious_ORFs
A table containing 6,600 genes of yeast and their chromosomal coordinates (location on the chromosome). We used this data to calculate the distance of each gene from the nearest and farthest chromosome end. For that purpose, we combined this data with data of chromosome length from NCBI.
Features:
- Gene length (in nucleotides): defined as the difference between the end and start coordinate of the gene.
- Gene distance to the nearest end of chromosome (in nucleotides)
- Gene distance to the farthest end of chromosome (in nucleotides)
- Gene distance to the nearest end of chromosome (in percentage from total chromosome length)
- Gene distance to the farthest end of chromosome (in percentage from total chromosome length)
STRING features: Protein-Protein Interaction Networks
Source: STRING website.
The data we used (version 10.5, organism Saccharomyces cerevisiae): Link
The STRING database [14] aims to collect and integrate information regarding functional interactions between expressed proteins, by consolidating known and predicted protein–protein association data for a large number of organisms. The associations in STRING include direct (physical) interactions, as well as indirect (functional) interactions, as long as both are specific and biologically meaningful. The 2017 version introduced hierarchical and self-consistent orthology annotations for all interacting proteins, grouping the proteins into families at various levels of phylogenetic resolution
The data file we used includes protein network data of Saccharomyces cerevisiae (scored links between proteins). The score is the weight of the interaction between the proteins.
Feature: For each protein we extracted its rank, which is defined as the number of interactions the protein is involved in.
In order to standardize the features, we performed feature normalization followed by the filling of missing values with a representative value: We applied standard-score normalization to the features (minus the average, divided by STD), since the database is largely sparse.
Empty features' values were filled with the average feature value.
As mentioned above, the purpose of this model is to select 10 inviable genes that are as diverse as possible, in order to understand which features helped maintain genomic stability of GFP or RFP proteins that are transcriptionally interlocked to it.
To achieve this goal, we would like to represent the inviable genes space as 10 groups (clusters) that are as distinct as possible, and with as much similarity within each of the 10 groups. This dividing process is called clustering, and can be accomplished through two main approaches:
- Partitional clustering approach (left image): division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
- Hierarchical clustering (right image): A set of nested clusters organized as a hierarchical tree.
To conclude, in our model each gene is represented as a single dot in the multi-dimensional feature-space, and we would like to cluster them into 10 representative subsets.
In this case we chose to use an algorithm called K-means, one of many algorithms included in the partitional clustering approach, since our genes are not likely to be “nested” in one another. In the K-means algorithm, each cluster is associated with a centroid (center point), and each point (observation) is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified before applying the algorithm. The basic algorithm is very simple and involves very few steps that are summarized in the image below, where the centroid is typically computed as the mean of the points in the cluster. (e.g. mean each coordinate), and ‘Closeness’ is measured Euclidean distance.
It is important to note that the centroid does not represent an actual gene but rather the mean value of the genes closest to him. In other words, it is the center of the group, but it is not part of the group. So, for each center Ki, we marked the gene closest to the centroid as the actual representative of this group.
The main issues when applying this algorithm are:
- How to choose initial centers? It is a well-known problem that k-means algorithm tends to converge to local minima and is often biased by the determination of initial centroids. In order to deal with this problem, we used multiple runs (n=1000) and made sure that the chosen centroids are frequent.
- How to choose the number of centers (K) effectively?
The second problem is a little more challanging, since k is influenced by the type of data and its distribution in the feature space. We can’t decide, arbitrarily, that the distribution of the data matches our need for 10 different groups.
For example, take the following feature space:
As seen that the data can be divided into 3 unique groups, but it is forced to be divided into 6 groups that does not necessarily indicate diverse properties.
To deal with this problem, we defined two metrics - the first for defining convergence criteria, and the second to evaluate the cluster's quality.
- Based on the Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem, an iterative approach for minimizing the within-cluster Sum of Squared Errors (SSE), which is sometimes also called cluster inertia:
$$(4) SSE=\sum_{i=1}^{n}\quad \sum_{j=1}^{k} \quad w_{i,j} |x_i-\mu_j|^2$$
Where $X_{i}$ is the $i$-th observation (single data point/sample), $\mu_{j}$ is the centroid for cluster $j$, and $w_{i,j}$ is the weight: it is equal to 1 where the sample $X_{i}$ is in cluster $j$ and otherwise it is set to 0.
Inertia can be recognized as a measure of how internally coherent clusters are. It is calculated under the assumption that clusters are convex and isotropic, which is not always the case. This assumption sometimes leads to poor response to elongated clusters, or manifolds with irregular shapes, but since we are not concentrating in too many clusters it is sufficient and helps us simplify the problem from an engineering perspective.
- To evaluate clusters’ quality, we used the silhouette criteria:
$$(5) s(i)=\frac{b(i)-a(i)}{max\{a(i),b(i)\}}$$
For each data-point $i$, the $a(i)$ is average dissimilarity of i with all data-points in the cluster of i; $b(i)$ is the average dissimilarity of i with data-points not in the cluster of i. Thus, negative $s(i)$ values indicate a sample that is not in the right cluster. If $s(i)=0$ then the sample is in the border of two clusters, and if $s(i)$ is close to 1 then it is associated with the correct cluster.
Our idea is to select more than 10 initial centroids and then select among these initial centroids the most widely separated genes. After defining the above formula, we performed the following steps:
- Calculation of the inertia criteria for k=1...150 clusters. We plotted the results on the following ‘elbow graph’.
- Evaluation of the differences between two following inertia values, indicating the improvement in each iteration. The results are in the graph below.
- The maximal improvement, and the elbow of the SSE value, is obtained for k=133, meaning our observations distribution is to 133 representative clusters.
- In order to choose from these 133 values, the desired 10 genes, we used two approaches:
- Iterative method, by which we implemented the k-means algorithm again on the 133 genes. This approach did not lead to good results because it converged to a very small set of final genes.
- Semi-manual method. In this approach we chose only the k-s that showed maximal Silhouette values. This led us to a list of “the best K-s”: [133,50,40,30,25,20,17,15].
The elbow method is a useful graphical tool to estimate the optimal number of clusters k for a given task. Intuitively, we can see that if k increases, the within-cluster SSE will decrease, because the samples will be closer to the centroids they are assigned to.
The idea behind the elbow method is to identify the value of k where the SSE begins to decrease most rapidly, which will become clearer if we plot it for different values of k:
We noted that for low k values we got large variability and when we reach relatively large k values, low oscillation is obtained, which means that adding k does not affect our overall distribution.
We then implemented the k-means algorithm on each k in the list, and saved the centroids. The genes that appeared in all those divisions are chosen as our final 10 genes!
References (APA)
[1] Blazejewski, T., Ho, H. I., & Wang, H. H. (2019). Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science, 365(6453), 595-598.
[2] Manber, U., & Myers, G. (1993). Suffix arrays: a new method for on-line string searches. siam Journal on Computing, 22(5), 935-948.
[3] Zur, H., & Tuller, T. (2015). Exploiting hidden information interleaved in the redundancy of the genetic code without prior knowledge. Bioinformatics, 31(8), 1161-1168.
[4] Weill U, Yofe I, Sass E, Stynen B, Davidi D, Natarajan J, Ben-Menachem R, Avihou Z, Goldman O, Harpaz N, Chuartzman S, Kniazev K, Knoblach B, Laborenz J, Boos F, Kowarzyk J, Ben-Dor S, Zalckvar E, Herrmann JM, Rachubinski RA, Pines O, Rapaport D, Michnick SW, Levy ED, Schuldiner M. (2018) Genome-wide SWAp-Tag yeast libraries for proteome exploration. Nature methods, DOI: 10.1038/s41592-018-0044-9. Pdf.
[5] Yofe I, Weill U, Meurer M, Chuartzman S, Zalckvar E, Goldman O, Ben-Dor S, Schütze C, Wiedemann N, Knop M, Khmelinskii A, Schuldiner M. (2016) One library to make them all: streamlining the creation of yeast libraries via a SWAp-Tag strategy. Nature methods, DOI: doi.org/10.1038/nmeth.3795. Pdf.
[6] Roberts, R. J., & Macelis, D. (n.d.). Monthly Formats. Retrieved from Rebase: http://rebase.neb.com/rebase/rebase.files.html
[7] CFI. (n.d.). Pareto Efficiency. Retrieved from CFI (TM): https://corporatefinanceinstitute.com/resources/knowledge/economics/pareto-efficiency/
[8] Breker M., Gymrek M. & Schuldiner M. (2013) A novel single-cell screening platform reveals proteome plasticity during yeast stress responses. Journal of Cell Biology, DOI: 10.1083/jcb.201301120. Pdf.
[9] Breker M., Gymrek M., Moldavski O. & Schuldiner M. (2013) LoQAtE—Localization and Quantitation ATlas of the yeast proteomE. A new tool for multiparametric dissection of single-protein behavior in response to biological perturbations in yeast. Nucleic acids research, DOI: 10.1093/nar/gkt933. Pdf.
[10] Nadler M*., Breker M*., Gruber R., Azia A., Gymrek M., Eisenstein M., Willison K.R., Schuldiner M. & Horovitz A. (2012) Interactions of subunit CCT3 in the yeast chaperonin CCT/TRiC with Q/N-rich proteins is revealed by high-throughput microscopy analysis. PNAS, DOI: 10.1073/pnas.1209277109. Pdf.
[11] Wall D, Hirsh A, Fraser H, Kumm J, Giaever G, Eisen M, Feldman M. (2005) Functional genomic analysis of the rates of protein evolution. PNAS, DOI: 10.1073/pnas.0501761102. Pdf.
[12] Coghlan A, Wolfe KH. (2000) Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast. DOI:10.1002/1097-0061(20000915)16:12<1131::AID-YEA609>3.0.CO;2-F. Pdf.
[13] Lobry, J. R., & Gautier, C. (1994). Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic acids research, DOI:10.1093/nar/22.15.3174. Pdf.
[14] Szklarczyk, D., Morris, J. H., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., Santos, A., Doncheva, N. T., Roth, A., Bork, P., Jensen, L. J., & von Mering, C. (2017). The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic acids research, 45(D1), D362–D368. DOI: 10.1093/nar/gkw937. Pdf.