Team:Heidelberg/Notebook

Notebook
Gel images and more

Labbooks

To see what we did in the lab, take a look at the labbooks below.

Labbook_PPR_Proteins
Labbook_Split_GFP
Labbook_Triple_Helix
Labbook_Split_Trans_Splicing_Ribozymes

Drylab

3DOC
A Python script based on PyRosetta was developed that automatically combines PDBs of RNA-binding Pumby modules for a user-supplied RNA sequence Adamala2016ProgrammableRP Chaudhury2010PyRosettaAS. After repacking of mutated amino acids deviating from the Pumby consensus sequence and relaxing the module connections, some problems involving incorrect bonds in the amino acid backbone remained.
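The module-selection step can be sketched in a few lines. This is a hedged illustration, not the actual 3DOC code: the file names are hypothetical, and the comments assume the commonly cited PUF-repeat recognition code described by Adamala et al.

```python
# Hypothetical sketch: choose one Pumby repeat PDB per RNA base.
# The comments reflect the PUF-domain recognition code (the two
# RNA-contacting residues of a repeat determine the base it binds).
PUMBY_CODE = {
    "A": "pumby_A.pdb",  # Cys/Gln repeat recognizes adenine
    "U": "pumby_U.pdb",  # Asn/Gln repeat recognizes uracil
    "G": "pumby_G.pdb",  # Ser/Glu repeat recognizes guanine
    "C": "pumby_C.pdb",  # Ser/Arg repeat recognizes cytosine
}

def modules_for_rna(rna_seq):
    """Return the ordered list of repeat PDBs for a target RNA."""
    rna_seq = rna_seq.upper().replace("T", "U")
    return [PUMBY_CODE[base] for base in rna_seq]
```

The returned list is the order in which the per-base module PDBs would then be concatenated and relaxed.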

PRISM
We discussed different neural network architectures and finally decided to implement a GAN (generative adversarial network) for the generation of RNA-binding proteins.

RISE
Inspired by our discussions with other iGEM teams at the iGEM Meetup organized by iGEM Team Marburg, we discussed within our team how the iGEM Registry could become an even more valuable resource. It was clear to us that the database suffers from a lack of clarity and flexibility, so we asked ourselves how we could improve it.

Collaboration with iGEM Manchester
At the iGEM Meetup organized by iGEM Team Marburg we got to know Miguel of iGEM Team Manchester. We got into conversation about the iGEM Registry, which we later worked on in the project RISE, and agreed to meet again some time later to exchange experiences.
3DOC
The Python script was refined by resolving the incorrect bonds between the amino acid backbones at the border of two Pumby modules. Work began on extending the script to PPR modules Coquille2014AnAP Barkan2012ACA.

PRISM
We started the development of PRISM by discussing our ideas with Prof. Gumbel from Hochschule Mannheim. With him we discussed the principles of machine learning and which problems could arise when applying it to large datasets. We identified the Swiss-Prot database for pre-training of the language model Consortium2019UniProtAW. Several databases, such as EcRBPome and RBPDP, were reviewed as data sources of RNA-binding proteins for the fine-tuning step of the language model. One of the reviewed databases only offered sequences rather than identifiers that could be mapped to UniProt; the entries of the other could be mapped, but did not offer any RNA motifs.

CoNCoRDe
We implemented ViennaRNA inverse folding as the first part of our approach to developing CoNCoRDe. We hoped to solve RNA substructures easily and tested the method on the EteRNA100 benchmark dataset. Based on the finding that the algorithm included in ViennaRNA inverse folding has a low computational cost but struggles with harder RNA structures, we developed the idea of building a deep learning network for RNA design via reinforcement learning.
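Benchmarking inverse-fold results requires comparing the base pairs implied by target and predicted dot-bracket strings. A minimal stack-based parser for this, independent of the ViennaRNA bindings:

```python
def pairs_from_dotbracket(structure):
    """Map a dot-bracket string to the set of (i, j) base-pair indices."""
    stack, pairs = [], set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)       # remember the opening position
        elif ch == ")":
            pairs.add((stack.pop(), i))  # close the most recent open pair
    return pairs
```

Two structures then agree exactly when their pair sets are equal, and partial credit can be computed from the set overlap.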
RISE
Our plans to improve the iGEM Registry slowly became more concrete. We identified two key features for a better iGEM database: searching entries not only by part name, but also by description and documentation; and filtering Registry entries by uses and availability.

CoNCoRDe
Besides ViennaRNA we started implementing a deep learning network as a policy for RNA design based on reinforcement learning.
PRISM
This week we also talked to Prof. Giunchiglia. Based on his feedback we reviewed possible biases in the data we are using and refocused our plans for PRISM on a classic language model based on a modern transformer architecture Vaswani2017AttentionIA.

RISE
Our very first idea for building our team-internal tool was to use the SQL database, which can be downloaded via the iGEM website. This plan failed: not only did we lack experience with SQL programming, but we also could not solve several issues concerning the character encoding. We realized that accessing the SQL database directly may not be a suitable approach for teams without much programming experience.
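The character-encoding problems we hit can at least be worked around in Python when reading such a dump: decode it leniently instead of letting a single bad byte abort the whole import. A sketch (read the file with `open(path, "rb")` to obtain the raw bytes first):

```python
def decode_dump(raw_bytes):
    """Decode dump bytes that are only mostly valid UTF-8,
    substituting U+FFFD for undecodable bytes instead of crashing."""
    return raw_bytes.decode("utf-8", errors="replace")
```

This loses the original bytes at the damaged positions, so it is a pragmatic workaround for text search rather than a faithful recovery of the data.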

Collaboration with iGEM Manchester
Following up on the friendship we made with iGEM Team Manchester in calendar week 29, we met them again and exchanged our experiences with iGEM software development so far. iGEM Manchester asked for our advice concerning their wiki study project. We suggested implementing a web scraper and offered to show an example suited to iGEM wiki pages some time later.

CoNCoRDe
As reinforcement learning is relatively unstable and highly sensitive to hyperparameters, we added an advantage-weighted actor-critic (AWAC) training algorithm, which was already implemented in the "torchsupport" library of our advisor Michael Jendrusch, and imported it from there.
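The core idea of AWAC is to weight the policy's log-likelihood loss by exponentiated advantages, so that actions the critic scores highly are reinforced most. A dependency-free sketch of that weighting (the actual implementation we used lives in torchsupport; the clipping constant here is an assumption for numerical stability):

```python
import math

def awac_weights(advantages, lam=1.0, clip=20.0):
    """exp(A / lambda) per sample, clipped to avoid blow-ups."""
    return [min(math.exp(a / lam), clip) for a in advantages]

def awac_loss(log_probs, advantages, lam=1.0):
    """Advantage-weighted negative log-likelihood of the taken actions."""
    w = awac_weights(advantages, lam)
    return -sum(wi * lp for wi, lp in zip(w, log_probs)) / len(log_probs)
```

Smaller lambda sharpens the weighting toward the highest-advantage actions; larger lambda approaches plain behavior cloning.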
3DOC
The initial idea was developed further by extending it to concatenate all types of proteins and protein domains, either directly or via RNA linkers. For this, a workflow to prepare DNA sequences and RNA linkers was developed, to be used in Geneious Prime or applied manually.

PRISM
Also, several types of annotations relevant to the sequences were chosen: Gene Ontology (GO terms), taxonomy, total charge, amino acid prevalence, amino acid type prevalence, and ligand binding Consortium2019TheGO Ashburner2000GeneOT. An automatic download of all databases used in the language model was implemented, including the Swiss-Prot data from UniProt, Gene Ontology annotations, and BindingDB Gilson2016BindingDBI2.

CoNCoRDe
When we finalized the training of CoNCoRDe, we got back in touch with our advisor Michael Jendrusch, who has a lot of experience with RNA design. We discussed the results of our hyperparameter analysis with him and optimized the code.
3DOC
An automatic BLASTp search of the RCSB database for all protein sequences inputted into 3DOC was implemented, along with automatic mutation and repacking of amino acid residues via PyRosetta based on the BLAST result, in order to generate PDBs of the proteins and protein domains to be fused or interconnected via a linker Berman2000ThePD. Some protein sequences showed a relatively low bitscore and high E-value after the BLASTp search, so the accuracy of the proposed method needs to be evaluated for all types of proteins.

PRISM
The functions to generate vectors with all annotations for a specific Swiss-Prot sequence were established. For this, the information from all databases was combined and interconnected. Functionality to calculate further characteristics, such as the distribution of amino acid types in a sequence, was also added. Finally, padding for the sequences was introduced.
Both a high number of taxonomies and a high number of GO terms appear in the Swiss-Prot entries, and some of them occur in only a small number of proteins. To train the network efficiently, several approaches to feature reduction were considered.
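Two of these annotation features are easy to illustrate. The sketch below shows amino-acid-type prevalence and right-padding; the grouping of residues into types is an assumption for illustration, not necessarily the grouping used in PRISM:

```python
# Hypothetical grouping of amino acids into physicochemical types.
AA_TYPES = {
    "hydrophobic": set("AVLIMFWYP"),
    "polar": set("STNQCG"),
    "charged": set("DEKRH"),
}

def type_prevalence(seq):
    """Fraction of residues falling into each amino acid type."""
    n = len(seq)
    return {t: sum(aa in members for aa in seq) / n
            for t, members in AA_TYPES.items()}

def pad(seq, length, pad_char="X"):
    """Right-pad a sequence to the fixed model input length."""
    return seq + pad_char * (length - len(seq))
```

Padding with a dedicated symbol lets sequences of different lengths share one tensor shape; the model then masks the pad positions.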

RISE
After some research on the iGEM Registry, we decided to focus our efforts on the point-in-time database dump provided on the iGEM website. This week we implemented a first version of an .xml parser, with which we want to extract all part information to build a .csv table or our own SQL database.
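A minimal sketch of such a parser using only the standard library. It assumes a simplified dump layout with `<part>` elements containing `<name>` and `<description>` children; the real dump's schema differs, so treat this as an outline of the approach:

```python
import csv
import xml.etree.ElementTree as ET

def parts_to_csv(xml_text, csv_path):
    """Extract part name/description pairs from an XML dump,
    write them to a CSV table, and return the data rows."""
    root = ET.fromstring(xml_text)
    rows = [[part.findtext("name", ""), part.findtext("description", "")]
            for part in root.iter("part")]
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["name", "description"])  # header row
        writer.writerows(rows)
    return rows
```

For a dump too large to hold in memory, `ET.iterparse` with element clearing would replace `ET.fromstring`.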

Collaboration with iGEM Manchester
We stayed in constant contact with iGEM Team Manchester concerning their wiki study project and their request to show them how to implement a web scraper to access an iGEM wiki. Following our first code examples, we decided together with iGEM Team Manchester to support their wiki study project in a more comprehensive way. We contacted Benedict Wolf, who discussed the structure of iGEM wiki pages with us and gave us advice on how to optimize our web scraper.
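A minimal sketch of the kind of scraper we suggested, using only the standard library to pull the visible text out of a wiki page's HTML (fetching the page itself, e.g. with urllib, is omitted here):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip == 0 and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

The extracted text can then be tokenized and counted for the kind of wiki analysis described above.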

CoNCoRDe
Based on the solved structures we implemented some optimizations of the CoNCoRDe algorithm.
3DOC
It was decided to implement trRosetta as an alternative to the proposed PyRosetta algorithm Yang2020ImprovedPS for cases where the E-value exceeds 0.5 and the determined bitscore is smaller than 100. For this, HHblits, a method from HH-suite3, was implemented as well, to generate multiple sequence alignments as input for trRosetta Steinegger2019HHsuite3FF.
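The fallback criterion can be expressed directly, using the thresholds stated above:

```python
def use_trrosetta(e_value, bitscore, e_max=0.5, bits_min=100):
    """Fall back from the PyRosetta template route to trRosetta
    when the best BLASTp hit is too weak to trust."""
    return e_value > e_max and bitscore < bits_min
```

A strong hit (low E-value, high bitscore) keeps the template-based PyRosetta route; only sequences failing both thresholds are sent through HHblits and trRosetta.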

PRISM
The encoder and encoder layers as well as the multi-head attention mechanism were built based on the CTRL transformer model architecture Keskar2019CTRLAC Vaswani2017AttentionIA. This week we discussed the structure of the CTRL model with Carola Fischer, a PhD student at the Technical University of Berlin. Based on her feedback we started thinking about how to validate our model.
It was decided to use Adam as the optimizer for our training Kingma2015AdamAM, with betas = (0.9, 0.98). We also implemented a learning rate scheduler as proposed by Madani et al. and decided to use 4000 warmup steps for the training.
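A sketch of a warmup schedule in the style of Vaswani et al.: the learning rate rises linearly for the warmup steps and then decays with the inverse square root of the step. The model dimension and exact constants here are illustrative assumptions, not our final values:

```python
def lr_at(step, d_model=512, warmup=4000):
    """Transformer schedule: linear warmup, then inverse-sqrt decay.
    Peaks exactly at step == warmup."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

In practice this is wrapped in the framework's scheduler API so that the optimizer's learning rate is updated every step.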
RISE
Finishing up our work on the XML parser, we decided to write all part information into a .csv file. This way we can directly provide a low-entry-level file that can be opened and manipulated by any spreadsheet program. We also decided to develop a Python program to query the .csv upon user request.
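The query side can be as simple as a case-insensitive substring search across all columns of the generated table, which already covers searching by description and documentation rather than part name alone. A sketch (the column layout is an assumption):

```python
import csv

def search_parts(csv_file, query):
    """Return all rows from an open CSV file whose fields
    contain the query string (case-insensitive)."""
    query = query.lower()
    return [row for row in csv.DictReader(csv_file)
            if any(query in (value or "").lower()
                   for value in row.values())]
```

`csv.DictReader` accepts any iterable of lines, so the same function works on an open file or an in-memory table.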

Collaboration with iGEM Manchester
In our meeting with iGEM Manchester we presented our work on analyzing the texts of iGEM wikis with our web scraper. While refining the concept of the wiki study, iGEM Manchester suggested implementing more functions into the web scraper, such as PDF and video analysis, which we worked on in the following weeks.

PRISM
The training function as well as the function for cross-validation of the pretraining step of the neural network were implemented, and we started the pretraining of our model. For this we consulted our advisor Michael Jendrusch, who gave us feedback on the model architecture. With the help of the bwForCluster we could handle the high computational cost. In parallel with some test runs, the first ideas for a hyperparameter search emerged.
CoNCoRDe
Besides the heart of CoNCoRDe, we decided to provide several accompanying tools to generate RNAs. These include the scripts "random_structures.py" and "design_simple.py", which generate fixed-length RNA sequences subject to user-inputted constraints, and RNA sequences conditioned on a given RNA secondary structure and sequence constraints. With "design_rna_triple_helix.py" we also included a tool to design the RNA for use in a DNA-RNA triple helix. The development of these tools was spread over the following weeks.
3DOC
The algorithm to interconnect the PDBs via amino acid and RNA linkers was implemented, and the input of RNA linker PDBs to 3DOC was established. Setting up the RNA de novo protocol in PyRosetta proved quite challenging; we therefore propose using the ROSIE FARFAR2 protocol instead Watkins2019FARFAR2ID Lyskov2013ServerificationOM. Secondary structures were generated using ViennaRNA Lorenz2011ViennaRNAP2.

RISE
The implementation of the Python program, which we baptized our iGEM Registry Intelligent Search Engine (RISE), took place in this and the following calendar week. During development we also decided to implement it in a Jupyter Notebook, together with Julius Upmeier zu Belzen, an expert in GO terms, who gave us feedback on how to reduce the number of GO terms by filtering and applying semantic similarity.
PRISM
We integrated the ATtRACT database into the preprocessing pipeline to generate the datasets for fine-tuning Giudice2016ATtRACTaDO. For this we mapped the identifiers of the ATtRACT database to UniProt. The RNA motifs bound by the proteins were also one-hot encoded. We met again with Carola Fischer to further discuss the influence of hyperparameters in general and on the fine-tuning process in particular. Based on her feedback we ran different tests to optimize our fine-tuning.
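One-hot encoding an RNA motif is straightforward: each base becomes a vector with a single 1 at its alphabet position. A sketch (the alphabet ordering is an assumption):

```python
RNA_ALPHABET = "ACGU"

def one_hot_motif(motif):
    """Encode an RNA motif as a list of one-hot vectors, one per base."""
    index = {base: i for i, base in enumerate(RNA_ALPHABET)}
    vectors = []
    for base in motif.upper().replace("T", "U"):  # accept DNA-style input
        vec = [0] * len(RNA_ALPHABET)
        vec[index[base]] = 1
        vectors.append(vec)
    return vectors
```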
This week we could finally start a training job. We analyzed the loss and accuracy of our model over the training steps and recognized that our model was unstable. To improve the performance and convergence speed of our model, we searched for alternatives, found the ReZero concept as an alternative transformer architecture, and implemented it.
3DOC
We decided to provide two scripts for RNA de novo and RNP structure prediction with Rosetta. These scripts prepare the files outputted by 3DOC for application in Rosetta.

PRISM
The ReZero transformer network was now implemented and we could start new training runs, observing a much more stable transformer. To increase the performance of our model, we began searching for optimal hyperparameters and consulted Carola Fischer again.
Collaboration with iGEM Manchester
Up to this week we implemented more functionality into the web scraper and discussed it with iGEM Team Manchester. iGEM Manchester showed off a draft of their analysis and results, built with the previous versions of our web scraper.

PRISM
We tried to vary our batch size but quickly ran into computational constraints. To avoid these problems we found a solution called gradient accumulation, implemented it, and it worked: we could now apply much bigger batch sizes than before. This was a huge milestone in our hyperparameter search.
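Gradient accumulation trades memory for compute: gradients from several small micro-batches are averaged before a single optimizer step, emulating one large batch that would not fit in memory. A framework-free numeric sketch with a hand-derived gradient for a one-parameter least-squares model:

```python
def grad(w, batch):
    """Gradient of the mean of 0.5*(w*x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr=0.1):
    """Average the micro-batch gradients, then apply one update,
    as if the micro-batches formed a single large batch."""
    g = sum(grad(w, mb) for mb in micro_batches) / len(micro_batches)
    return w - lr * g
```

With equal-sized micro-batches, the accumulated step is identical to the step the full batch would have produced, which is exactly why the trick is safe for hyperparameter comparisons.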
PRISM
This week we started several training jobs with different parameters such as batch size and warmup steps. We also implemented a function which creates a confusion matrix for the validation of our model, and created a user-friendly Python script for generating sequences, so that any judge or scientist interested in using the output of PRISM is able to do so.
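A confusion matrix simply counts (true, predicted) label pairs; a minimal sketch of such a validation helper:

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix
```

The diagonal holds the correct predictions; off-diagonal cells show which classes the model confuses, which is more informative than accuracy alone.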

CoNCoRDe
As our project deals with the in vivo expression of functional RNA, a common design task is the design of flanking hammerhead ribozymes (HHR) for a given target RNA sequence. We specifically provide a version of our "design_simple.py" script specialized for this task: it designs flanking HHRs for a sequence given just the 5' and 3' ends of that sequence.
PRISM
This week we implemented several functions for the generation of new RBP sequences: one that generates a sequence based on a user-inputted RNA motif and annotation vector, and one based on a random annotation vector and RNA motif. First we had to decide which sampling method to use; we found several candidates and discussed them based on the paper "The Curious Case of Neural Text Degeneration" Holtzman2020TheCC. We also consulted Michael Jendrusch again, who suggested using temperature-controlled greedy sampling. The functions were written and tested on random data.
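Temperature-controlled sampling rescales the logits before the softmax: low temperature approaches greedy argmax decoding, while high temperature flattens the distribution and increases diversity. A self-contained sketch of one decoding step (not our exact implementation):

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature).
    temperature <= 0 is treated as pure greedy argmax."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:                      # inverse-CDF sampling
            return i
    return len(exps) - 1
```

Calling this once per position, feeding each sampled token back into the model, yields a full generated sequence.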

Collaboration with iGEM Manchester
In our final meeting with iGEM Manchester we presented our final web scraper to the iGEM team Manchester and discussed with them the first results of their analysis.
PRISM
We started the final training runs. The models were saved as PyTorch state dicts and used for fine-tuning. All weights and biases were frozen except the embedding layer of our protein sequences and the prediction layer, which were unfrozen for fine-tuning. We generated a total of ten test sequences with our model, checked them via a BLAST search, and inputted them into 3DOC and trRosetta, respectively.
Of course, by now all our experiments had been completed, our work was already well documented on our wiki, and the wiki itself looked bright and shiny. Had all of this not been a dream, we would have relaxed in week 43.

References