Team:TAU Israel/Results

sTAUbility

sTAUbility

Results

Table of contents

This page includes the following results:

  1. POC experiment - raw data, plate-reader calibration and validation
  2. Restriction enzymes analysis results
  3. Gene-SEQ experiment - validation and sequencing results
  4. EFM optimizer - POC
  5. Feature selection for stability predictor
  6. Stability predictor performances on a test set

POC experiment (or "10 Genes experiment")

Our wet-lab Proof of Concept experiment and its raw-data analysis process is described in the Experiments page. In short, we compared the evolutionary half-life of ten diverse constructs against the half-life of the unattached target gene (GFP), by measuring their relative preservation of fluorescence, which is used as a measure of stability.

Validation

The ten genes experiment was conducted as detailed in the Lab Notebook and Experiments pages.

Since we used 10 variants from a large-scale library, there is a chance for a mix-up between the variants, meaning that what we think is strain X is in fact strain Y. This could be the result of an error within the library, or a human mistake from our side.

In order to prevent these mistakes, we validated the ten variants using both PCR and sanger sequencing, in detail in the lab notebook. Additionally, to avoid any mix-up that could happen during the evolution experiment itself, which could highly influence our results, we also conducted the same validation process to the strains in the end of the experiment. Altogether, nine out of the ten genes were validated using both PCR and sequencing at the beginning and end of the experiment.

The sequencing results are below.

Figure 1. Sanger sequencing results.

Plate reader calibration

We followed the plate-reader calibration protocol and recieved the results attached in this file.

Raw Data

The raw data from the first and last day of the experiment is summarized in the following zip file. It contains the plate-reader measurements of fluorescence and OD (absorbance). Following the steps of the raw-data analysis described in the Experiments page, we received the final results that are documented in our POC page. Visit us there and see how we proved our hypothesis!

Restriction enzymes analysis

For our Gene-SEQ experiment we used DNA constructs composed of a promoter, fluorescent gene (GFP or RFP), a linker and an ORF of the yeast genome (details in measurement page). When we extracted DNA from the cultures, these constructs were embedded in the yeast genome. Thus, we needed to cut these fragments with sticky-ends using restriction enzymes, both before the promoter and as upstream as possible in the conjugated gene (ORF), so that would be enough nucleotides to allow recognition of the conjugated ORF when sequencing.

The analysis is described in our Engineering Success page. In summary, the analysis is composed of the following steps:

  1. Search for a custom designed restriction enzyme.
  2. Search for a suitable restriction enzyme from a database of commercially available restriction enzymes.
  3. Test pairs of commercial enzymes (instead of a single enzyme) that can work in parallel to create the desired construct.
  4. In-depth view of the final candidates to reach a final decision.

The results of each step are summarized in a different file. Finally, 8 result files were extracted – 4 for a GFP target gene, and 4 for an RFP target gene. All files are available in the following zip file.


Step 1: Custom restrictor sites

Filenames: custom_sites_gfp/rfp

In these files, custom built sites for restriction enzymes are tested.

All possible custom sites were generated from the respective target gene, up to length 8 bp.

For each custom site, statistics were extracted for all genes in S. Cerevisiae.

Columns:

  • site: the custom site tested
  • median_total_length: median of construct length
  • percent_gene_none_unique: the percentage of constructs whose sequence is not unique. We aim to minimize this.
  • percent_total_over_3000/5000: the percentage of constructs whose length is over 3000/5000. We aim to minimize this as well; in our computations we minimized the 3000 value.
  • percent_gene_under_50/25: the percentage of cut ORFs (in the conjugated genes) with length under 25/50.

We filtered the sites presented by the Pareto efficiency principle – a site is discarded if there is another site with smaller percent_gene_none_unique and also smaller percent_total_over_3000.

Step 2: Commercial enzymes

Filenames: commercial_gfp/rfp

In these files, commercial enzymes are tested.

All enzymes from rebase which cut the target gene in a legal region are analyzed.

For each enzyme, statistics were extracted for all genes in S. Cerevisiae.

Columns:

  • enzyme: which commercial enzyme is tested
  • generalized_site: what site does this enzyme recognize
  • overhang_length: how much overhang does this enzyme create. We aim to maximize this.
  • median_total_length: median of construct length
  • percent_gene_none_unique: the percentage of constructs whose sequence is not unique. We aim to minimize this.
  • percent_total_over_3000/5000: the percentage of constructs whose length is over 3000/5000. We aim to minimize this as well; in our computations we minimized the 3000 value.
  • percent_gene_under_50/25: the percentage of cut ORFs (in the conjugated genes) with length under 25/50.

We filtered the sites presented by the Pareto efficiency principle – a site is discarded if there is another site with smaller percent_gene_none_unique, smaller percent_total_over_3000, and larger overhang_length.

Step 3: Two commercial enzymes

Filenames: commercial_2_enzymes_gfp/rfp

In these files, pairs of commercial enzymes are tested.

All pairs of enzymes from rebase that meets the following criteria were analyzed:

  • At least one enzyme cuts the target gene in a legal location.
  • None of the enzymes cut the target gene in an illegal location.

For each pair, statistics were extracted for all genes in S. Cerevisiae.

Columns:

  • Enzyme_1/2: which commercial enzymes we tested
  • generalized_site_1/2: what sites do these enzymes recognize
  • overhang_length_1/2: how much overhang do these enzymes create. We aim to maximize this.
  • median_total_length: median of construct length
  • percent_gene_none_unique: the percentage of constructs whose sequence is not unique. We aim to minimize this.
  • percent_total_over_3000/5000: the percentage of constructs whose length is over 3000/5000. We aim to minimize this as well; in our computations we minimized the 3000 value.
  • percent_gene_under_50/25: the percentage of cut ORFs (in the conjugated genes) with length under 25/50.

In the case of RFP, we filtered the sites presented by the Pareto efficiency principle – a site is discarded if there is another site with smaller percent_gene_none_unique, smaller percent_total_over_3000, and larger overhang_length.

In the case of GFP, we realized that it was infeasible to order 2 enzymes due to schedule limitations during the Covid-19 crisis. Thus, the data uploaded was not filtered.

Step 4: Commercial enzymes: a more in-depth view

Filenames: commercial_enzyme_summary_gfp/rfp

In these files, each of the final commercial enzymes is presented more in depth.

In order to make a more educated decision regarding which enzyme to choose, we created a clearer view of the final commercial enzymes. This is a PDF with a summary page per enzyme. The page title includes the statistics presented in commercial_gfp/rfp.

In addition, two graphs were added per enzyme:

  1. Total Length for PCR – displays the construct length distribution per enzyme.
  2. Group on None-unique Sizes – displays what are the sizes of sets of constructs with the same sequence after applying this enzyme. This helps to estimate the difficulty in differentiating none-unique genes.

An example of the RFP graphs is found below.

Due to supply difficulties caused by the COVID-19 situation, we had to consider not only the efficiency but also the availability of the chosen enzymes. Finally, the chosen restriction enzymes are NSP1 for RFP, and DRA1 for GFP.

If you wish to know how we used them, visit our Measurement page!

Gene-SEQ experiment

The Gene-SEQ measurement is conducted using a new protocol that we designed, thus it had to be validated before each new step.

The best way to ensure that we followed the protocol correctly is to prove that we received the desired results from the Nanopore deep-sequencing, by analyzing the sequencing data. Unfortunately, due to Covid-19 related issues, we had some major delays in the sequencing process. We received the results next to the competition deadlines, leaving us with insufficient time to analyze them. As we were concerned that the pandemic could cause such a delay, we planned another experiment to validate our Gene-SEQ method.

Performing PCR on the mixed cultures of the experiment would not serve as a proper validation of the results, because it requires running the samples on gel. Since there are millions of constructs with different sizes, it will be difficult to differentiate between them and to determine that the process was successful.

Thus, while performing the Gene-SEQ protocol on our mixed cultured libraries, as elaborated in Measurement page, we performed the same sample preparation protocol for one of the genes used in the 10 genes experiments – the SEC2 gene. This way, we could validate our method using both PCR and SANGER sequencing - as opposed to performing PCR on the mixed culture, following the same protocol for only one gene is much easier to analyze.

Be verifying the right size in the PCR, and by further validating in SANGER sequencing, we can answer the following questions:

  1. Did the restriction enzyme model work?
  2. Is our novel protocol valid and working?
  3. Is the product of our protocol in an adequate state for sequencing?

The expected size of the PCR should be around 2.5KB, as demonstrated below since Nsp1 enzyme was used in the protocol. As you can see on the right, we received a band in the expected size.

Figure 2. PCR results for SEC2, as a validation for the Gene-SEQ protocol.

We then sent the PCR fragments for SANGER sequencing. The sequence was compared against the sequence that it should contain - NOP1 promotor, GFP, and part of SEC2 - using the BLAST tool. The alignment showed a similarity in significance level of 2E-58. This means our protocol is valid and working.

In figure 3 there is an example of the deep sequencing results – one of many FASTQ files as explained in our Measurement page. We are currently analyzing the data, and hoping to get results soon!

Figure 3. Example of the Nanopore sequencing data in FASTQ format, that is currently being analyzed.

EFM optimizer - POC

To prove the efficiency and robustness of our EFM optimizer, we wanted the check the evolutionary stability of residues marked as unstable by our software.

For this purpose, we calculated conservation score for each nucleotide in many different genes of Saccharomyces cerevisiae,, by utilizing ConSurf program [1], developed at Tel - Aviv University. This software creates multiple sequence alignment (MSA) out of all the orthologous genes it can find for a given sequence. Then, it sums the appearance of each nucleotide in a specific position across all the orthologous genes.

A well- conserved nucleotide will appear across all the different orthologs genes, while an evolutionary unstable nucleotide will appear significantly less. The number of times the nucleotide appears across the MSA is counted and calculated so that every nucleotide is given a conservation score.

For this analysis, we selected genes that are evolutionary conserved based on the onservation score, and inserted them in the EFM optimizer. The conservation requirement is crucial for the ConSurf software, because if the gene is not well-conserved the software will not be able to find many orthologous genes or to assign reliable conservation score. We also selected genes with different functions in the cell and in different length.

We calculated the average conservation score at each position for each gene we picked, and compared this value to:

  • The average of the lowest conservation score from each area predicted as unstable by our EFM's SSR calculator.
  • The average score from all the areas predicted by the RMD calculator.

When addressing SSR we used only the lowest conserved nucleotide. The reason for this is because a mutation mediated by SSR will cause a single nucleotide level event (deletion, insertion of substation), while RMD mediated mutational event will influence all the predicted sites (large scale deletion of insertion).

Conservation score was calculated for Pol30, Pol3, Cdc9, Tor2, Mgs1 and Rad52, all are evolutionary conserved genes from saccharomyces cerevisiae that vary in length and function. The significancy of the difference between the conservation of all the genes and the areas marked by the EFM optimizer varied between 0.0016 in Pol30 to 0.0004 in Cdc9.

Figure 4. An illustration of the conservation scores of Pol30 sequence.

In total, we found that there is a significant difference between the conservation score of the entire gene and the conservation scores of the areas predicated by the EFM (whether it is SSR or RMD). The results prove that the EFM software our team has devised successfully predicts areas that are evolutionary unstable and automatically offers a new, optimized sequence with enhanced evolutionary stability.

The optimization process is robust and applied on multiple sequences at once with great ease for the user. We believe that this software is highly beneficial for any IGEM team, lab, or biotechnology company.

Feature Selection for the Stability Predictor

As described in our Model page, we developed a predictor of genetic stability, based on many descriptive features we generated for each of the ORFs of yeast.

We ended up with approximately 2,000 features. To reduce the model's dimensionality, we selected a small set from these features based on a heuristic approach that is explained in the "feature selection" sector of the Model page. In short, this small set is composed of 300 features that are highly correlated to the labels (the predicted values), but are less correlated with one another.

The 300 chosen features appear in the file above. The first 40 features are described within this file, so one could get a sense of the importance of some features. The other features are elaborated on the Model page.

Stability Predictor Performances on a Test Set

We evaluated the performances of our stability prediction model when implemented on a test set. The target values $y_{true}$ , to which we compare the predicted results, are the fluorescence measurements of each tested variant in Prof. Schuldiner's libraries (referring to GFP and RFP as target genes). The fluorescence predictions resulted from the model are marked $y_{pred}$ accordingly.

Our metrics of success are: $R^2$, accuracy, and agreement on 20 leading candidates.

$R^2 $ ("Coefficient of determination"):

In linear regression, we try to explain the observed target data by a set of features generated for this purpose. We would like to find the linear line that minimizes the squared distance between it and the true data. This squared distance is called the residual sum of squares $u=\sum{(y_{true}-y_{pred})^2}$.

However, minimizing u is not enough. Part of the variance in the target data $y_{true}$ is due to varying features' values, but some of it is due to reasons that are unknown to us. Thus, the variance of $y_{true}$ can be divided to explained variance and unexplained variance. Our goal in the regression process is that most of the variance of $y_{true}$ will be the explained variance, because it will indicate that our features characterize the data efficiently. This explained variance is called $R^2 $ or the Coefficient of determination.

Mathematically, $R^2 $ is defined as:

$$R^2 = 1-\frac{u}{v}.$$

where $u$ is the residual sum of squares and v is the total sum of squares $v=\sum{(y_{true}-\hat{y_{true}})^2}$.

This measure denotes the proportion of variance of $y_{true}$ that has been explained by the features in the model. For example, a constant model that always predicts the expected value of y, disregarding the input features, has no variance. Therefore, it will be assigned with $R^2=0$. Similarly, the best possible score of $R^2 $ is 1.0 and it can be negative (because the model can be arbitrarily worse).

Accuracy and median diff:

As mentioned in our Model page, our model is trying to predict fluorescence values from Prof. Schuldiner SWAT database. For each gene in the test set, we calculate the absolute prediction error $e=|y_{true}-y_{pred}|$. We then define accuracy as the number of genes with an error within a threshold of 10au, divided by the size of the test set:

$$accuracy = \frac{number-of-genes-with-e<10}{|test-set|}.$$

Also, we define the "median diff" as:

$$ Median-diff = median(e) $$

For example, if the median diff is 5, it means that for about half of the genes in the test set, the prediction error is of 5au. Similarly, if the accuracy is 0.6, it means that the predicted value of about 60% of the genes in the test set is close to the target value within a range of 10au.

Agreement:

We use the predictor to rank the co-stability of the genes in yeast when interlocked to a target gene. The ranking will be presented to the users in order to make an intelligent decision regarding the conjugated gene they prefer.

Thus, the most important measure for our software is the agreement, indicating the predictor's agreement with the original ranking of the leading candidates.

For the GFP test set, the results are:

R-sqaure Accuracy Median diff Agreement on 20 leading candidates
0.5499 0.6487 5.63 0.55


The agreement graph:

Figure 5. Predicted ranks of the leading candidates in comparison to the ground-true ranks.
For example, the gene that was originally ranked as 2nd- is predicted to be the 5th best gene.

As you can see, 11 genes that originally showed high fluorescence levels are also predicted to be of 20 leading candidates.

In order to test the generalization of our model, we also tested the performances on RFP as a new target gene. For this purpose, we used only the agreement measure since it best reflects the suitability of the model to our end goal – recommend the best candidates for increasing the target gene's stability. We received agreement on 4 out of 10 leading candidates!

We believe that following the analysis of the sequencing data from our Gene-SEQ experiment, and its integration to the prediction model, we will be able to further improve the results.

References

[1] Ashkenazy H., Erez E., Martz E., Pupko T. and Ben-Tal N. 2010. ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucl. Acids Res. 2010; DOI: 10.1093/nar/gkq399; PMID: 20478830 [ABS], [PDF]