Team:TAU Israel/Measurement

sTAUbility

Measurement

Introduction and Motivation

We have created a novel method for large-scale measurement of genomic stability. The method collects mutational data on many desired constructs from millions of microorganisms. We named it Gene-SEQ (Stability Enhancing Quantifier).

How does it help you?

In our Project Description page, we described the difficulty of attaining long-term expression of target genes in a foreign organism. Although we would like our construct to remain stable and functional, it eventually fails in the host cell. This results from the host's natural tendency to evolve toward optimal fitness, while the inserted target gene imposes an additional load. Thus, mutations that inactivate the construct (and remove the associated load) are likely to arise and take over the population.

To address this problem, we must quantify the instability of inserted constructs. Most in-lab stability measurement protocols require a reporter gene (such as the fluorescent gene GFP) that indicates whether the construct is preserved in a population. In addition, such experiments are usually performed on a small scale, involving only a few constructs at a time. A known example is the impressive protocol presented by the 2015 UTexas iGEM team, who found a way to measure the "burden" that an engineered construct imposes on its host organism. However, their method required modifying the construct to include GFP, which is not always ideal, and it was carried out on a relatively small scale.

Unlike such protocols, we define the distribution of mutations on the target construct as a measure of the "burden", or instability, that it causes to the host organism. Thus, our method, although validated on the fluorescent genes GFP and RFP, can be applied to any desired construct, since the stability analysis requires only deep sequencing. Moreover, our method enables large-scale stability measurement: you can measure the stability of several constructs simultaneously (even all the parts in the iGEM registry!) using the same experimental procedure, and get all the results at once. We implemented it on approximately 5,000 constructs, so you get the idea :)

Moreover, to the best of our knowledge, there are no previous studies that measure the co-stability of conjugated genes expressed together. Fusion proteins are commonly used in the biopharma industry for drug delivery purposes. Our technique allows very large-scale screening of the co-stability of a target gene when it is coupled with various genes.


How did our team use it?

We implemented the Gene-SEQ protocol on two libraries obtained from Prof. Schuldiner's lab, presented in our Experiments page, and conducted a preliminary Gene-SEQ experiment. The libraries contain about 6000 variants, one for each open reading frame (ORF) in the yeast genome. Each variant carries a construct composed of a fluorescent gene (GFP or RFP) fused to the N-terminus of the respective ORF. Using the Gene-SEQ protocol, we collect mutational data on each construct as a measure of stability and use it to train our stability predictor model (Fig. 1), treating GFP and RFP as the target genes.

Figure 1. An overview of our main product. The first three steps are the Gene-SEQ measurement.


The main goal of our Gene-SEQ experiment is to provide large-scale data about various target-conjugated constructs to our software. The software, using this data, will predict the most stable and optimized target-conjugated construct, given a desired target gene.

Gene-SEQ Measurement overview:

Figure 2. Gene-SEQ overview


  1. Creation of a co-culture: Preparation of a mixed yeast culture of the desired (target-conjugated) constructs. The mutational data of these constructs will be harvested for model analysis. Each construct is represented by one yeast variant, and all variants are grown together.
  2. Evolution experiment: The cultures are grown in a state-of-the-art robotic platform called Chi.Bio (Fig. 3). More information regarding the Chi.Bio machine is available in our Experiments page.

    Figure 3. Our Chi.Bio components: reactor, computer chip, turbidostat.

  3. Deep sequencing: We designed this method to extract DNA samples from the Gene-SEQ mixed culture (Fig. 4). Our novel protocol for preparing samples from millions of constructs is outlined below.
    Figure 4. Our technique for Gene-SEQ sample preparation for deep sequencing.


    1. Extract all DNA from the mixed culture.
    2. Slice the desired constructs from the DNA of all variants using a single restriction enzyme.
    3. Self-ligate all constructs with a ligase enzyme, then run a PCR amplification that succeeds only on ligated constructs.
    4. Rolling circle amplification (RCA) of the circular DNA: creates multiple consecutive copies (or "concatemers") of each construct, which are then sequenced using long-read sequencing technology.
    5. Nanopore sequencing of all constructs.
    6. Data analysis.

Before we elaborate on our own experiment, with more technical detail on the above-mentioned steps, we would like to explain the advantages of using RCA before deep sequencing. The RCA process generates long DNA molecules containing tandem copies of the desired construct, so that each sequencing read contains many copies of the same construct. Because we have many copies, we can rule out sequencing errors. For example, if a read contains 10 copies of a specific construct, we can compare these 10 fragments (by multiple alignment) and decide that a mutation appearing in 8 of the 10 repetitions is not a sequencing error but a real mutation.

Thus, although it is not mandatory to use RCA, we highly recommend it!
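The 8-out-of-10 reasoning above can be sketched as a per-position majority vote. This is only an illustrative sketch, not our actual pipeline: the function names and the 0.8 threshold are assumptions taken from the example, and the copies are assumed to be already aligned to equal length (a real analysis would first run a multiple alignment).

```python
from collections import Counter


def consensus_from_copies(copies, min_fraction=0.8):
    """Majority-vote consensus over pre-aligned copies of one construct.

    Positions where no base reaches `min_fraction` support are reported as 'N'.
    The threshold is an illustrative assumption.
    """
    if not copies:
        raise ValueError("need at least one copy")
    length = len(copies[0])
    if any(len(copy) != length for copy in copies):
        raise ValueError("copies must be pre-aligned to equal length")
    consensus = []
    for column in zip(*copies):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(copies) >= min_fraction else "N")
    return "".join(consensus)


def supported_mutations(reference, copies, min_fraction=0.8):
    """List (position, reference_base, observed_base) for well-supported changes."""
    consensus = consensus_from_copies(copies, min_fraction)
    return [
        (i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, consensus))
        if alt != "N" and alt != ref
    ]
```

With ten copies of which eight carry the same change, the change is reported as a mutation; a change seen in a single copy is outvoted and treated as a sequencing error.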

Please refer to the last section on this page to view our plan for analysing this data.

Gene-SEQ measurement as proof of concept of our hypothesis

For our first use of the Gene-SEQ measurement technique, we had two goals:

  1. Large-scale testing of our hypothesis – Target-conjugated gene linkage prolongs the evolutionary half-life of the target gene in a population. (See Project description)
  2. Analyze the mutational footprint of the target gene.

We used Gene-SEQ to collect large-scale data for our optimization model, which could then predict the optimal conjugated genes for each target gene. Here is how we followed the three steps in Fig. 2:

Culture preparation: We used the yeast SWAT-libraries (GFP/RFP, see Experiments page) to prepare two mixed cultures, a GFP culture and an RFP culture. Each culture contains a variant for every yeast gene, fused at its N-terminus to the fluorescent gene.

Evolution experiment: For the evolution experiment we used the Chi.Bio machine mentioned above (see protocol for Chi.Bio usage). We ran two biological replicates for each culture, four reactors in total. The cultures were grown and monitored by the Chi.Bio machine. We prepared fresh media every few days and connected it to the machine. We could tell whether the library's variants were still viable: if the monitored fluorescence levels had decreased significantly, it would have meant that most variants had lost the construct. We succeeded in cultivating the cultures for 21 days.

Deep sequencing: We collected culture samples at several time points: the first day, every 5 days (days 5, 10, 15), and the last day (day 21). If we had only sampled on the last day of growth, we would have lost a lot of mutational data regarding the constructs. For example, some variants' populations could disappear from the culture due to mutations that impaired the synthesis of the conjugated essential gene (the library contains ~1200 essential-gene variants), so sampling at multiple time points lets us capture these variants' mutational data.

The constructs of the yeast libraries we used for our experiments did not have pre-designed sequences for sequencing-primers. This was a significant problem, because we wanted to sequence millions of constructs. In response, our biology team constructed this novel protocol, so every construct could be sequenced. We used our Gene-SEQ sample preparation protocol listed above (see Protocols page for more specific protocols).

Sequencing sample preparation:

In Fig. 4 we presented the sample preparation technique for deep sequencing. This is how it was done in our experiment:

  1. We extracted DNA with Promega/Sigma-Aldrich kits.
  2. Our model team created a model (see Engineering Success page) that finds a single restriction enzyme (one enzyme for the GFP library and one for the RFP library) that cuts with sticky ends both before the construct and as far upstream as possible in the conjugated gene (ORF), while leaving enough nucleotides to allow recognition of the conjugated ORF when sequencing (Fig. 5). This restriction-enzyme model output two lists of compatible restriction enzymes, one each for the GFP and RFP constructs. We used the DRA1 (for RFP) and NSP1 (for GFP) enzymes to cut the constructs.
    Figure 5. Cut sites in each construct.


  3. We used T4 ligase enzyme to generate self-assembly of our wanted constructs (see ligation protocol). The sticky ends are compatible for self-assembly as the cuts were made with the same enzyme (Fig. 6).
    Figure 6. Demonstration of self-ligation.


    After the ligation we performed a PCR amplification specific to the ligated constructs, so that they would be at a high concentration for the upcoming sequencing. We designed primers that anneal to the L2 region (a constant region in the library) and drive the reaction in two directions. This way, only ligated constructs are amplified from primer to primer (Fig. 7).
    Figure 7. Primers' location for PCR.


  4. Our DNA samples were then sent to our sponsor “Cyclomics” in the Netherlands. To distinguish between real mutations and sequencing errors, which are prevalent in nanopore sequencing, they performed rolling circle amplification (RCA) on the samples (Fig. 8). This method produces long repeats of each construct, so a real mutation should appear in every repeat, whereas a sequencing error is expected to appear in only one. We use this property in the proposed data analysis, as described in the next section.
    Figure 8. Illustration of RCA, that generates long DNA concatemers that are sequenced using long-read sequencing technology. The blue ellipse indicates the specialized polymerase that activates the RCA process. The extension process covers the entire length of the circular DNA multiple times, resulting in the formation of repeated sequences of the template, called "concatemers" [1].


  5. Finally, “Cyclomics” sequenced the DNA on an Oxford Nanopore machine.

How to adapt this protocol to any new construct?

As mentioned, the Gene-SEQ measurement protocol can be implemented on any set of desired constructs. We tested the stability of a construct composed of a target gene, a conjugated gene and a linker, but you can design your own parts and use this protocol to assess their stability.

All the parts that you use should share a homologous region of at least 30 bp that is common to all the parts within the experiment. Its location does not matter, as long as it is present in the circular DNA after treatment with restriction enzymes.
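A quick way to check this requirement before starting an experiment might look like the sketch below. It is a simplification under stated assumptions: the function name is hypothetical, "homologous" is taken to mean an exactly identical stretch, only the given strand is searched (no reverse complement), and the circular junction created by ligation is ignored.

```python
def shared_regions(parts, min_length=30):
    """Return every substring of exactly `min_length` bases found in all parts.

    A non-empty result means the parts share a common region of at least
    `min_length` bp (exact match, same strand only).
    """
    if not parts:
        raise ValueError("need at least one part sequence")
    shortest = min(parts, key=len)
    # Candidate windows come from the shortest part; any shared region must occur there.
    candidates = {
        shortest[i:i + min_length]
        for i in range(len(shortest) - min_length + 1)
    }
    return {window for window in candidates if all(window in part for part in parts)}
```

For example, `bool(shared_regions(my_parts))` tells you whether a set of part sequences (here the hypothetical list `my_parts`) has a usable common region at the default 30 bp.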

Results:

We are happy to announce that our novel Gene-SEQ measurement technique was successful, and the sequencing results are arriving.

Due to COVID-19 related delays, we received the sequencing data just before the deadline, and only for two time points: the start and the end of the experiment. We are doing our best to analyse the results and to incorporate them in our optimization model and main product. See Results page for more information. Also, in the next section we discuss our plans for the analysis, so read on!

Data analysis – Quantifying Genomic Stability

Figure 9. Scheme of deep sequencing data analysis.


After execution of all previously mentioned steps, the last and perhaps most important step is data analysis of the deep sequencing results.

As stated, we are still analysing these results, due to COVID-19 delays. Detailed below is our plan, and you can follow it to perform your own analysis.

The results of deep-sequencing experiments are usually obtained in FASTQ format. It is very similar to FASTA format, but also records the quality of the sequencing process. In general, a FASTQ file can include many reads. Each read in the file is composed of four lines:

  1. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA '>' title line).
  2. Line 2 is the raw sequence letters.
  3. Line 3 begins with a '+' character and is optionally followed by notes and description similar to line 1.
  4. Line 4 encodes (in ASCII) the quality values for the raw sequence.
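The four-line layout above can be read with a few lines of Python. This is a minimal sketch rather than a full parser: it assumes each record occupies exactly four lines (no wrapped sequences), and the quality decoding assumes the common Phred+33 ASCII offset. The names `FastqRecord`, `read_fastq` and `phred_scores` are illustrative.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading '@'
    sequence: str    # line 2, raw sequence letters
    quality: str     # line 4, ASCII-encoded quality values


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line block of a FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    demo = io.StringIO("@read1 demo\nACGT\n+\nIIII\n")
    for record in read_fastq(demo):
        print(record.identifier, record.sequence, phred_scores(record.quality))
```

For real files, pass an open file handle (for example from `open(path)`) instead of the in-memory demo string.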

For an example, see our own results on the Results page.

The second line in each read is the raw sequence, in the following format: [X] [Unknown Y] [X] [Unknown Y] [X] [Unknown Y] ..., where X is the known sequence (GFP or RFP, their promoter, and the linker) and Y is the ORF fragment that remains after treatment with restriction enzymes. The repetitions in the above format are due to the RCA.
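To make the [X] [Y] [X] [Y] layout concrete, here is a toy sketch that pulls out the unknown Y fragments lying between consecutive copies of a known sequence X. It uses exact string matching, which is an assumption that will not hold for real nanopore reads containing sequencing errors; it illustrates the read structure only and is not a replacement for a dedicated tandem-repeat tool. The function name is hypothetical.

```python
from typing import List


def extract_unknown_fragments(read: str, known: str) -> List[str]:
    """Return the fragments found between consecutive exact copies of `known`.

    Anything after the last copy of `known` is skipped, because the read may
    end in the middle of a fragment.
    """
    if not known:
        raise ValueError("known sequence must not be empty")
    starts = []
    position = read.find(known)
    while position != -1:
        starts.append(position)
        position = read.find(known, position + len(known))
    return [
        read[start + len(known):next_start]
        for start, next_start in zip(starts, starts[1:])
    ]
```

If the extracted fragments of one read disagree with each other, the differences are candidates for sequencing errors rather than real mutations.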

The analysis is composed of three main steps:

  1. Identify repetitions and extract the "consensus", i.e. the repeated fragment, from each read. Specifically, a consensus results from a multiple alignment of the repetitions, so a nucleotide that appears at a particular position of the consensus also appears in all the repeats. This is a stricter version of the example we described above: a mutation is considered real only if it appears in all of the copies in the current read.
    Efficient tools for tandem repeat detection include TideHunter, concall, and INC-Seq.
    With the help of Shaked Bergman from Tuller's lab, we plan to use TideHunter and extract the consensus sequences from each read, since right now it seems to be the most efficient and suited to our data.
  2. Map the consensus sequence to the expected results.
    For this purpose, we will first need to differentiate between RFP and GFP reads, as each FASTQ file can include them both. Then, we plan to use minimap2 and adjust it to the output of the first step in order to map the consensus sequence to our expected constructs.
  3. Calculate the distribution of mutations of each consensus sequence when aligned to its mapped ground-truth (GT) sequence. This is done by local alignment to the GT, resulting in a mutational footprint of the construct – the distribution of mutations within the target-conjugated construct.
    We plan to calculate the ratio of non-silent (dN) to silent (dS) mutations as a measure of stability. The lower this ratio, the more stable the construct is.
Figure 10. Our proposed data analysis steps.
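The last step can be illustrated with a small sketch that classifies codon changes as silent or non-silent using the standard genetic code. It rests on several simplifying assumptions: the consensus and ground-truth sequences are already aligned, in frame, of equal length, free of indels, and contain only uppercase A/C/G/T. It also returns a raw count ratio; a formal dN/dS estimate additionally normalises by the number of non-synonymous and synonymous sites. The function names are illustrative.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(reference, observed):
    """Return (non_silent, silent) counts of changed codons.

    Both sequences must be in-frame coding DNA of equal length.
    """
    if len(reference) != len(observed) or len(reference) % 3 != 0:
        raise ValueError("sequences must be equal-length and a multiple of 3")
    non_silent = silent = 0
    for i in range(0, len(reference), 3):
        ref_codon, obs_codon = reference[i:i + 3], observed[i:i + 3]
        if ref_codon == obs_codon:
            continue
        if CODON_TABLE[ref_codon] == CODON_TABLE[obs_codon]:
            silent += 1      # same amino acid: silent change
        else:
            non_silent += 1  # different amino acid or stop codon
    return non_silent, silent


def non_silent_to_silent_ratio(reference, observed):
    """Raw ratio of non-silent to silent codon changes, or None if no silent changes."""
    non_silent, silent = count_codon_changes(reference, observed)
    return non_silent / silent if silent else None
```

In practice the counts would be pooled over all consensus sequences mapped to the same construct before taking the ratio, since a single read rarely carries enough mutations for a meaningful value.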


References

[1] Wyant, P. S. (2011). The use of rolling circle amplification (RCA) for diagnosis and characterization of geminiviruses.