Our project is separated into two parts.
- Computational -- Develop and use software to find transcription-factor binding motifs in fungal promoters, and automate the pipeline so that it is efficient, reliable, and reproducible.
- Experimental -- Create a comprehensive, deeply thought out experimental plan for verifying our software’s results. This plan follows the engineering design cycle, is ready for implementation, and takes into account lab safety, potential design issues, and reproducibility.
- MEME is a motif-finding software in the MEME software suite that accepts a FASTA file containing multiple DNA sequences as input, and outputs position weight matrices (PWM) and their locations in the query sequences. The PWMs describe potentially related subsequences which may indicate the presence of common putative protein-DNA binding sites in the query set.
- In this project we propose to use MEME to find transcription factor binding sites in DNA sequences extracted from one or more fungal genomes. The query sequences are extracted from regions upstream or either translational or transcriptional start sites of genes belonging to a fungal biosynthetic gene cluster and/or orthologous clusters identified in multiple genomes.
- To discover accurate binding motifs with MEME and establish confidence in our results, we must first accomplish a couple things:
- Understand how MEME parameters influence the recovery of common motifs
- Validate our workflow by testing it on biosynthetic gene clusters (BGCs) with known binding sites
- To accomplish these objectives we created a software package that can be downloaded from our GitHub repository(https://github.com/igemsoftware2020)
- Two of our programs in the package were explicitly created for testing the capabilities of MEME.
MBED: Exploring how motif size and sequence composition influence MEME performance
- The first program called “mbed” creates synthetic DNA sequences and embeds a known, unique motif within them. Then, mbed outputs these sequences as a FASTA file for MEME to read. It then outputs DNA sequences that mimic promoters, which MEME attempts to recover the embedded binding sites from.
- By default, the DNA sequences created by “mbed” are randomly generated from a simple nucleotide distribution frequency. To increase the complexity of these DNA sequences to further test MEME’s capabilities, sequences can be generated via the Markov analysis of background DNA sequence. Increasing the markov order of sequence generation allows for more distractor motifs within our synthetic ‘promoters’. Distractor motifs make it more difficult for MEME to find the correct motif sequence. Therefore, a correct result obtained from a high markov order sequence demonstrates robust parameters.
MEMESCAPE: Understanding how MEME parameters influence the recovery of common motifs
- The second program, “memescape”, automates the creation of synthetic promoter files with specific parameters, inputs the promoter files to meme, and extracts the meme parameters and outputs generated. The program iterates through many possible combinations of parameters (see Table below) and produces .csv outputs that show definitively what promoter specifications and meme parameters produce the most accurate output.
|Variable promoter file parameters|
|1. Number of promoters|
|2. Length of promoters|
|3. Motif Frequency|
|4. Markov Order|
|Variable MEME parameters|
|1. MEME model (‘oops’,’zoops’,’anr’)|
|2. Markov order (optional)|
- If someone wanted to test if MEME would work on a certain set of promoters, “memescape” would provide them the ability to create promoters with fully controlled settings. This means that “memescape” can iterate through a number of possible combinations of parameters and promoter files allowing for the best parameters for a given set of promoters to be discovered!
- It is the intention of this program to run promoter files of various specifications through MEME with different motif-finding parameters, in order to determine when meme performs best. For each iteration in memescape, a promoter file is output from “mbed” and run through MEME with a given set of parameters.
- For each iteration of “memescape”, a promoter file with a given set of file specifications (listed above) is created and read by MEME. “Memescape” parses and records the following information:
- The fasta promoter file name
- The file specifications for promoter number and size
- The position and strand of the embedded motif in each sequence
- The position weight matrix of the embedded motif itself
- “Memescape” then reads and extracts the MEME output, and uses data from both the promoter file and the MEME results to comprehensively analyze and quantify the performance of the motif finder for any given set of promoter file and MEME parameters.
- Our thorough testing and analysis of “memescape” has revealed that the number of promoters, the length of promoters, and the frequency of promoters, all impact the ability of motifs to be found by MEME. The MEME model (oops, zoops, anr) also has a large impact on the performance of MEME as a motif finder. Because all of these parameters impact the performance and success of MEME, using “memescape” to identify successful parameter combinations is essential for later steps, when we need high confidence in our motifs found by MEME.
- The output of memescape provides a comprehensive overview of each possible indicator of MEME performance. To quantitatively compare the motifs found by MEME to the expected motifs, a scoring method adapted from “Manhattan Distance” is used to determine the best possible alignment and score between the PWMs of the expected and discovered motifs. This method awards higher scores based on similarity between PWMs. Additionally, the software calculates the number of motifs found successfully by MEME, and outputs a failure rate to determine how poorly (or well) the software performs on average. For any given run of MEME, all parameters and performance metrics are displayed on the same row for easy analysis.
- For more details, see the software tutorial in our GitHub repository.(https://github.com/igemsoftware2020)
- Our theoretical approach to MEME, paired with our scientific background research, allowed us to discover that the ideal length for our promoter sequences was between 200 and 400 base pairs upstream of the translation start site. “Memescape” revealed that a large number of promoter sequences is essential for successful motif finding, especially when promoter sequences are longer, or motif sites are less frequent in the promoter file. For this reason, it was determined that for initial workflow and testing of real promoter sequences, all clusters must have a high number of genes to account for potentially low numbers of motif sites in the clusters.
- In addition to promoter specifications, memescape revealed that markov order plays little to no role in the successful location of binding sites. After multiple uses of higher order and 0-order markov sequences, 0-order models were accepted as standard, as they had little impact on the motif finding ability.
- In addition to these mandatory parameters, “zoops” was chosen as the default MEME model.
- Though mbed and memescape were specifically designed for the purposes of this project, we expect that others may find these tools useful for determining optimal parameters for their own motif inding experiments.
Cluster Processing methods
- The theoretical experiments revealed not only that motif finding in our proposed situation could be accurate, but also what parameters we needed for it to be successful. After using simulated promoters in “mbed” and “memescape” to establish our ability to find accurate binding sites, we started using real fungal BGC promoters from the Joint Genome Institute.
- With 664 filamentous fungi genomes and almost 50,000 clusters to sort through, we decided to focus on Aspergillus for our first several binding sites due to the high volume of research done on clusters in this species. This left almost 11,000 potential clusters to choose from.
- Using the information provided initially by JGI, we were able to identify all of the annotated clusters in the Aspergillus species subset. The JGI cluster annotations provided the coordinates of every single gene in a given cluster. To create a promoter file for each cluster, we extracted 1,000 base pairs of promoter sequence directly upstream from every gene in the cluster. Some genes were defined by transcription start site, while others were defined by translation start site. In either case, we applied the same method. Then, for any negative-strand genes in a FASTA file, the promoter sequence was reverse complemented, before being added to the promoter file, resulting in a FASTA file of only positive-strand promoters. This decision was intended to reduce the formatting variations our software had to account for during motif discovery.
- For every Aspergillus cluster in the JGI data set, we created an identically formatted promoter file for use in our final software program, “motifomatic”. All of these Aspergillus clusters have been made available through our GitHub, in a file called promoters.tar.gz.
Motifomatic: Using MEME to find potential motifs
- Our last program, “motifomatic” is a simple program that collects a FASTA promoter file containing positive-stranded cluster promoters and all of the desired parameters discovered in “memescape”. Motifomatic then runs MEME a single time and collects the consensus sequence, position weight matrix, E-value and and number of sites for the resulting motif. As part of our early results, these values will help determine the accuracy and usefulness of the found binding site.
- Before running our workflow and software on clusters with uncharacterized binding sites, we needed to accurately locate a characterized binding site from the literature. In order to establish proof of concept, we looked to known Aspergillus clusters producing well-studied secondary metabolites.
- After discovering our first 12 motifs, we accomplished our team goals of creating a software toolkit for parameter testing of the motif-finder MEME, establishing a workflow for quickly processing the binding sites of fungal clusters, and creating a database of candidate transcription factors and binding sites for filamentous fungi.
- However, our implementation goals could not be met within the timeframe of iGEM and with the limitations of the pandemic. To continue the improvement of this project we have several steps we would like to take to improve the success of our project.
- We would like to improve the usefulness of our software and function library to users who are attempting de novo motif-finding. We would like to continuously update the way we analyze our MEME and motif data so that users will have access to a more comprehensive comparison between expected and observed.
- We would like to automate the selection and running of our experimental clusters to scale up the motif finding capabilities to the number of available clusters. To accomplish this, we must come up with a method to choose promising clusters for our workflow.
- We would also like to come up with a more quantitative method to access the quality of our experimentally discovered binding sites. This would allow us to rank or prioritize binding sites that will be more valuable to further explore.
- A Markov model describes a scenario where the future state of a model relies on the current state of that model. In the case of DNA, the current state is a nucleotide or nucleotides at a given position in a genome, while the future state is the nucleotide at the position directly ahead of them.
- The order of a markov model describes the magnitude at which current states affect the future state. In our application, the order refers to how many previous nucleotides affect the nucleotide ahead of them. So, a 0-th order markov model tells us that the nucleotide before a given nucleotide in the sequence has no influence on what the next nucleotide will be. A first-order markov model tells us that if the previous nucleotide is a certain base, it will be more likely that the next nucleotide is a certain base. The order may continue to ascend (so long as it is a reasonable assumption) to any order ‘k’.
- Markov models of k order are determined by analysis of a background DNA sequence. In python, each possible DNA window of size k is parsed from the background sequence, and the first nucleotide immediately following the window is recorded in a python dictionary. To generate a new sequence from a markov model, the first k base pairs are randomly generated and the next nucleotide is chosen using the markov model. (we utilize this in the markov version of mbed.py) It is important to note that if the markov order selected is too high for the background sequence given, the markov model cannot be properly trained. In the markov versions of our software, if the order selected is too high, the program will not continue to run.
A quick discussion of motifs and position weight matrices
- Binding motifs are short conserved sequences found in the promoters of genes. These sites bind a transcription factor and help regulate gene expression. Though conserved, not all positions in a binding motif are consistently the same nucleotide. Therefore, representing the binding motif as a single sequence can misrepresent what the binding motif actually looks like. Consensus sequences can represent binding motifs by using an iupac symbol that represents more than one nucleotide. For example S represents C or G and R represents A or G. However, these sequences still lose information about the binding site of a transcription factor. The most informative way to store the data of a binding site is in a position weight matrix or PWM. A PWM accepts the sequences of multiple binding sites and records the frequency of each nucleotide at each position as a nucleotide distribution frequency. In a position weight matrix, the exact frequency of each nucleotide at a given position is stored.
- PWMs are visually displayed as logos in which the highest frequency nucleotides are the tallest and the most informative positions in the motif have the highest columns (see example left).
MEME models explained
- The MEME motif finder does its search of sequences using one of three models chosen by the user which are based on the content of the sequences. If the sequences contain zero or one motifs per sequence the user can use the “zoops” model (which is the default). If the sequences contain one or more motifs per sequence the user can select the “oops” model, and when the sequences contain any possible number of repetitions they can use the “anr” method.
-  Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994. [pdf]
-  Ehrlich, K C et al. “Binding of the C6-zinc cluster protein, AFLR, to the promoters of aflatoxin pathway biosynthesis genes in Aspergillus parasiticus.” Gene vol. 230,2 (1999): 249-57. doi:10.1016/s0378-1119(99)00075-x
-  Schoberle, Taylor J et al. “A novel C2H2 transcription factor that regulates gliA expression interdependently with GliZ in Aspergillus fumigatus.” PLoS genetics vol. 10,5 e1004336. 1 May. 2014, doi:10.1371/journal.pgen.1004336
-  Timothy L. Bailey, Mikael Bodén, Fabian A. Buske, Martin Frith, Charles E. Grant, Luca Clementi, Jingyuan Ren, Wilfred W. Li, William S. Noble, "MEME SUITE: tools for motif discovery and searching", Nucleic Acids Research, 37:W202-W208, 2009.
-  Inukai, Sachi et al. “Transcription factor-DNA binding: beyond binding site motifs.” Current opinion in genetics & development vol. 43 (2017): 110-119. doi:10.1016/j.gde.2017.02.007
-  Fornes O, Castro-Mondragon JA, Khan A, et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2019; doi: 10.1093/nar/gkz1001
-  Kong, Qing et al. “Identification of AflR Binding Sites in the Genome of Aspergillus Flavus by ChIP-Seq.” Journal of Fungi 6.2 (2020): n. pag. Journal of Fungi. Web.