Proof of Concept
Harnessing ML and MM for Protein Design
During our project, we pursued multiple approaches. Just as in natural selection, only a few of them were fruitful and produced results that not only stand on their own but can also serve as building blocks for further projects. Here we present the lucky ones that survived the entire iGEM year, proved the concepts we envisioned, and are ready to be used by others.
Reinforcement Learning for RNA
As part of our software project, we implemented a reinforcement learning environment for training deep neural network agents to design RNA sequences conforming to a desired secondary structure (as part of CoNCoRDe). This is a notable achievement in two ways. Firstly, RNA design is itself a provably hard optimization problem that can strongly benefit from the application of powerful machine-learning methods; reinforcement learning in particular has proven immensely powerful across application domains, from vastly improving robot control to mastering the game of Go. Secondly, reinforcement learning research itself is plagued by a scarcity of benchmark environments that test the generalization properties of a learned policy. Adding a further benchmark environment in the form of RNA sequence design for random RNA structures could thus provide opportunities for advancements in the field of reinforcement learning itself.
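The setup above can be illustrated with a minimal sketch. The environment, reward, and action encoding below are simplified assumptions for illustration, not CoNCoRDe's actual implementation: the toy reward only checks Watson-Crick and wobble complementarity at the target structure's base pairs rather than folding the designed sequence.

```python
import random

# Allowed base pairs: Watson-Crick plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def pair_table(dot_bracket):
    """Map each '(' position to its matching ')' position."""
    stack, pairs = [], {}
    for i, c in enumerate(dot_bracket):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs[stack.pop()] = i
    return pairs

class RNADesignEnv:
    """Toy episodic environment: one nucleotide is written per step, left to right."""

    def __init__(self, target):
        self.target = target
        self.pairs = pair_table(target)

    def reset(self):
        self.seq = []
        return 0  # observation: current write position

    def step(self, action):
        self.seq.append("ACGU"[action])
        done = len(self.seq) == len(self.target)
        # Terminal reward: fraction of target base pairs that hold complementary bases.
        reward = self._score() if done else 0.0
        return len(self.seq), reward, done

    def _score(self):
        if not self.pairs:
            return 1.0
        good = sum((self.seq[i], self.seq[j]) in PAIRS for i, j in self.pairs.items())
        return good / len(self.pairs)

# Rollout of a random policy on a small hairpin target.
env = RNADesignEnv("((((....))))")
obs, done = env.reset(), False
while not done:
    obs, reward, done = env.step(random.randrange(4))
print(round(reward, 2))  # fraction of correctly paired positions, in [0, 1]
```

A trained agent would replace the random action choice with a learned policy; the terminal-reward structure is what makes the task a sparse-reward RL problem.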
Category Theory in Biology
Part of our self-imposed modelling programme was to establish whether methods from category theory could bring real benefits to bioinformatics. By modelling RNA structures in a compositional manner using operads and implementing CoNCoRDe – an RNA design algorithm based on category theory and the principle of compositionality – we could answer this question affirmatively. Indeed, category theory helped us gain deeper insights into the workings of RNA secondary structure, which we could convert into real, novel, usable algorithms.
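The compositional principle can be sketched very loosely as follows. This is an illustrative string-plugging model only: the `*` hole notation and the function names are made up here, and CoNCoRDe's actual operad formalism is considerably richer. The key property it demonstrates is that well-formed parts plugged into a well-formed context always yield a well-formed whole.

```python
def compose(context, *parts):
    """Plug substructures into the '*' holes of a dot-bracket context, left to right."""
    it = iter(parts)
    return "".join(next(it) if c == "*" else c for c in context)

def is_balanced(structure):
    """Check that a dot-bracket string is well-formed (brackets match and nest)."""
    depth = 0
    for c in structure:
        depth += {"(": 1, ")": -1}.get(c, 0)
        if depth < 0:
            return False
    return depth == 0

hairpin = "(((...)))"          # a closed substructure
stem = "((*))"                 # a context with one hole
print(compose(stem, hairpin))  # -> (((((...)))))

two_holes = "(*)(*)"           # a context with two holes
print(compose(two_holes, "...", "(..)"))  # -> (...)((..))
```

Because composition preserves well-formedness, a design algorithm can solve small substructures independently and assemble the solutions, rather than searching over whole sequences at once.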
PRISM and 3DOC in action
Although PRISM is not fully trained yet, it demonstrates that a language model for RNA-binding protein sequences can be developed. We have shown that prediction accuracy improves through pretraining and fine-tuning. In the discussion of PRISM we proposed that longer training would further improve the model's performance. This assumption is supported by the literature, which states that language models for protein engineering need to be trained for much longer than the 48 hours we trained PRISM. We have also observed during training that amino acid prevalences normalize over longer training times. Although our model is limited by its short training time, it has already learnt some features of protein sequences, such as the fact that methionine appears at the beginning of over half of all generated sequences.
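Statistics like the methionine-start fraction are straightforward to compute over a batch of samples. The snippet below shows the idea on made-up toy sequences, which are NOT actual PRISM output:

```python
from collections import Counter

# Toy stand-ins for model samples; real PRISM output would go here instead.
generated = [
    "MKTAYIAKQR",
    "MSLLTEVETP",
    "GSHMKLVFFA",
    "MAEGEITTFT",
    "KVFGRCELAA",
]

# Fraction of sequences beginning with methionine ('M').
start_met_fraction = sum(seq.startswith("M") for seq in generated) / len(generated)
print(start_met_fraction)  # -> 0.6

# Overall amino acid prevalences, for comparison against natural frequencies.
aa_counts = Counter("".join(generated))
```

Tracking such summary statistics across training checkpoints is one simple way to watch the prevalences normalize as training progresses.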
The tool 3DOC was successfully implemented for the assembly of fusion proteins from RNA-binding domains such as Pumby and PPR. Integrating PyRosetta modifications, with trRosetta as an alternative pathway, yields relatively accurate models. We have successfully generated proteins interconnected with amino acid linkers. In the future, further modelling, for instance with the RNP prediction method proposed by Kalli Kappel (Kappel et al., 2019), could improve the understanding of protein-RNA complexes. With 3DOC we have laid an important foundation for the generation of the corresponding protein interface.
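At the sequence level, joining domains with amino acid linkers amounts to simple concatenation, as the sketch below shows. The domain sequences are placeholders and the GS linker is merely a commonly used flexible choice; neither represents 3DOC's actual inputs or real Pumby/PPR sequences (the structural modelling with PyRosetta/trRosetta is the part 3DOC adds on top).

```python
GS_LINKER = "GGGGS"  # a widely used flexible glycine-serine linker motif

def fuse(domains, linker=GS_LINKER, repeats=2):
    """Join domain sequences with `repeats` tandem copies of the linker."""
    return (linker * repeats).join(domains)

domain_a = "MHEAL"  # placeholder domain sequence
domain_b = "KRWQT"  # placeholder domain sequence
construct = fuse([domain_a, domain_b])
print(construct)  # -> MHEALGGGGSGGGGSKRWQT
```

Linker length and flexibility strongly affect whether the fused domains can adopt their native folds, which is why the downstream structural modelling step matters.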