
CoNCoRDe
Compositional Nested Conditional RNA Design

Introduction

On our quest for RNA sequence design, we came across the notion that RNA structure is compositional and should be treated in a compositional way. Towards a practical application of category theory and compositional methods to RNA design, we set out to conceive and implement an algorithm that takes compositionality seriously in structure-conditional RNA sequence design. Our algorithm – CoNCoRDe (Compositional Nested Conditional RNA Design) – decomposes a target RNA structure into its poset (partially ordered set, \(Bool\)-enriched category) of substructures (for more on posets and all that, take a look at our model pages), solves these substructures, and integrates their solutions into solutions for ever larger substructures, up to the full target. In the following, we present the components and implementation details of our algorithm.

Compositionality and Designability

Substructure Solvers

Having outlined the basic idea of the CoNCoRDe algorithm, we need to choose a set of algorithms to search for solutions to our substructures. Here, we have a variety of options, ranging from the many available RNA structure design algorithms ViennaRNA and simple genetic algorithms GeneticAlgorithms to our own deep learning models. In this section we explore two main approaches – using Vienna RNA InverseFold ViennaRNA, and using our own deep learning model trained via deep reinforcement learning Reinforcement.

Vienna RNA

As a first, easy-to-use approach to solving RNA substructures, we use InverseFold, which is part of the ViennaRNA suite for RNA structure prediction and design. By itself, InverseFold has difficulties dealing with the hard-to-design RNA structures contained in the EteRNA100 benchmark dataset Eterna100. However, we still chose it, alongside genetic-algorithm-based search, as our main substructure solver for its simplicity and low computational overhead.
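
For reference, calling InverseFold from Python can be as simple as the snippet below. Note that this relies on the classic RNA.inverse_fold binding returning a (sequence, distance) pair; the exact signature varies between ViennaRNA releases, so treat it as an assumption to check against your installed version, and the target structure here is purely illustrative.

import random
import RNA  # ViennaRNA Python bindings

target = "((((((....))))))"  # illustrative hairpin in dot-bracket notation
start = "".join(random.choice("ACGU") for _ in target)  # random starting sequence

# In the classic bindings, inverse_fold searches from `start` for a sequence
# whose MFE structure matches `target` and returns it together with the
# remaining structure distance (0 means the target was hit exactly).
sequence, distance = RNA.inverse_fold(start, target)
print(sequence, distance)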

Reinforcement Learning

In parallel, we train a deep neural network as a policy for RNA design Reinforcement. We train our model using reinforcement learning (RL) – an approach to training neural networks for tasks where no differentiable cost function is available Reinforcement. While RL is notorious for its training difficulties and lack of robustness, it has brought about impressive results in control and robotics, as well as the widely popular AlphaZero series of programs, which greatly advanced the playing strength of chess and Go engines Reinforcement.
Figure 1: Reinforcement learning environment
The agent is provided with a random target RNA structure and has to propose a solution sequence. Upon completion of that sequence, it is folded by Vienna RNA and the resulting structure is compared to the target structure. The environment then assigns a score of 20 if the proposed sequence folds correctly into the target structure, and otherwise a score between 0 and 1 determined by the Hamming distance.
As part of our software suite, we implement a reinforcement learning environment for RNA design. An RL environment can be seen as a game which a neural network agent should learn to play and achieve high scores in Reinforcement. In our case, the way to score is to design, for a given RNA secondary structure, RNA sequences that reliably fold into that structure. We implement this environment using a combination of the torchsupport machine learning library and Vienna RNA ViennaRNA as the RNA structure prediction engine. Our network generates RNA sequences for a given structure, which are then folded by Vienna RNA and compared to the target structure (Figure 1). The network is then assigned a score for its performance. If it solves a structure, it gets the highest possible score of 20. Otherwise, it is assigned a score between 0 and 1 determined by the Hamming distance between the target structure and the predicted structure of its output sequence. By alternately training the network to maximize its score and letting it propose sequences in the environment, we should achieve continuous improvement of model performance Reinforcement.
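
To make the scoring concrete, here is a minimal sketch of the reward computation described above. This is our own simplified illustration, not the torchsupport environment itself; the function names and the exact normalization of the distance-based score are assumptions, and only the RNA.fold call from the ViennaRNA Python bindings is taken as given.

import RNA  # ViennaRNA Python bindings

SOLVED_REWARD = 20.0

def hamming(a: str, b: str) -> int:
    # Number of positions at which two equal-length strings differ.
    return sum(x != y for x, y in zip(a, b))

def reward(sequence: str, target_structure: str) -> float:
    # Fold the proposed sequence and compare its MFE structure to the target.
    predicted, _mfe = RNA.fold(sequence)
    if predicted == target_structure:
        return SOLVED_REWARD  # exact solution: maximum reward
    # Otherwise map the Hamming distance to a score in [0, 1]:
    # fewer mismatches give a higher reward (the exact mapping is our assumption).
    return 1.0 - hamming(predicted, target_structure) / len(target_structure)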

Behavioural Cloning / AWR

That, at least, is the theory. In practice, reinforcement learning has gained notoriety for being unstable, highly sensitive to hyperparameters, immensely data-hungry and all in all rather unpleasant to work with until you get it just right. Therefore, we chose a family of reinforcement learning methods for which we had someone on hand who could help us get them just right. To implement our reinforcement learning, we used an advantage-weighted actor critic (AWAC) training algorithm implemented in our advisor's torchsupport library. AWAC AWAC is a simple but effective reinforcement learning algorithm which trains a policy \(\pi_\theta(a\vert s)\) using the following objective: $$ \arg\max_\theta \mathbb{E}_{a, s}[\log(\pi_\theta(a \vert s)) \exp(\beta A(s, a))] $$ where \(A(s, a)\) is the advantage of action \(a\) at state \(s\) and \(\beta\) is an inverse temperature hyperparameter. Intuitively, this objective maximizes the probability of past actions, weighted by how well those actions performed. In this way, actions with a high advantage are imitated more strongly than actions with a low, or even negative, advantage. This results in a policy biased towards performing advantageous actions, leading to the continuous model improvement advertised by proponents of reinforcement learning.
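
For concreteness, a minimal PyTorch sketch of this advantage-weighted objective might look as follows. This is our own simplification rather than the torchsupport implementation; log_probs and advantages are assumed to come from the policy and a critic, respectively, and the weight clamp is a common stabilization trick rather than part of the objective itself.

import torch

def awac_policy_loss(log_probs: torch.Tensor,
                     advantages: torch.Tensor,
                     beta: float = 1.0) -> torch.Tensor:
    # Negative of E[log pi(a|s) * exp(beta * A)], to be minimized.
    # The exponential weights are detached (and clamped for stability)
    # so that gradients only flow through the policy log-probabilities.
    weights = torch.exp(beta * advantages).clamp(max=20.0).detach()
    return -(log_probs * weights).mean()

# Hypothetical update step:
# loss = awac_policy_loss(policy_log_probs, advantages, beta=1.0)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
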
Figure 2: RNA reinforcement learning
We trained a simple convolutional neural network on our RNA design environment using AWAC. After an initial period of stagnating loss, our model learned to consistently produce RNA sequences that conform to a given target structure, as evidenced by the mean reward approaching 20 – the maximum reward of our environment, achievable only upon structure solution.
We trained a simple convolutional neural network, implemented in our repository using PyTorch Pytorch. And lo and behold, after thousands of training steps of stagnation, our model finally learned to generate RNA sequences folding into a desired structure, as seen in the progression of our model's average reward, which approaches 20, the maximum (solved) reward of our RNA design environment (Figure 2).

CoNCoRDe

We have implemented all these components in Python, resulting in our hierarchical RNA design software CoNCoRDe (Compositional Nested Conditional RNA Design). We provide CoNCoRDe to the iGEM and wider scientific communities in the form of a script concorde.py with a simple command-line user interface. Now you can design RNA sequences for the structures of your specific application domain by installing CoNCoRDe following the setup instructions and running it:
usage: concorde.py [-h] [--path [PATH]] [--size [SIZE]]
                   [--inner-size [INNER_SIZE]] [--fraction [FRACTION]]
                   [--random-fraction [RANDOM_FRACTION]]
                   [--max-steps [MAX_STEPS]] [--inner-steps [INNER_STEPS]]
                   [--regen [REGEN]] [--early]
                   STRUCTURE

Search RNA sequences folding a given target structure. Based on the operad of
RNA structures, decomposes the target structure into substructures which are
solved recursively.

positional arguments:
  STRUCTURE             Target RNA secondary structure.

optional arguments:
  -h, --help            show this help message and exit
  --path [PATH]         Path to save and load substructure library from.
  --size [SIZE]         size of the sequence pool.
  --inner-size [INNER_SIZE]
                        size of the sequence pool for substructure search.
  --fraction [FRACTION]
                        number of top sequences to propagate.
  --random-fraction [RANDOM_FRACTION]
                        number of sequences to randomly repopulate.
  --max-steps [MAX_STEPS]
                        maximum number of search steps.
  --inner-steps [INNER_STEPS]
                        maximum number of search steps for substructure
                        search.
  --regen [REGEN]       probability of diversifying the substructure library
                        at each step.
  --early               Finish sequence search early, if matching sequence is
                        found?
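
For example, an illustrative call for a small hairpin (the structure and parameter values here are made up for demonstration) could look like this:

python concorde.py "((((....))))" --size 100 --max-steps 1000 --early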
  
We have applied CoNCoRDe to the hard RNA design problems contained in the EteRNA100 benchmark dataset Eterna100, with very encouraging results. CoNCoRDe solved hard design problems within minutes; for some examples, see the structures in Figure 3, together with the sequences CoNCoRDe generated for them. All of these structures possess features which are hard for RNA design algorithms to handle. Figure 3 (a) contains an RNA structure with multiple short stems, which tend to be hard to design, as the sequence has to be optimized to prevent pairing interactions across stems Eterna100. CoNCoRDe circumvents this problem by choosing asymmetric sequences for the stems, resulting in a single preferred way of base pairing. Figure 3 (b) contains deeply nested multiloops, which present a similar type of optimization challenge. Figure 3 (c) contains a pair of consecutive short bulges, commonly referred to as a "zigzag" structure Eterna100. Short bulges are energetically unfavourable in general, which complicates sequence design Eterna100. CoNCoRDe is able to solve a wide variety of such hard structures due to its, to our knowledge, unique design approach of recursively subdividing the target structure into a poset of substructures (for more information, take a look at our work on RNA mathematical models).
Figure 3: Example solved structures
Some hard-to-design structures from the EteRNA100 benchmark dataset, together with their corresponding RNA sequences designed using CoNCoRDe. (a) a small tetraloop with multiple same-length short stems, which is hard because of its high symmetry. (b) highly nested multiloops, which pose a similar optimization challenge to (a). (c) nested multiloop structure containing consecutive short bulges in a "zigzag" structure, which is inherently energetically disfavoured.

Implementation Details

Substructure Solvers

We use a combination of Vienna RNA InverseFold ViennaRNA and a simple genetic algorithm to recursively solve substructures. Solutions for the full target structure are then generated by recombining substructure solutions; this recombination serves as the mutation step of the same genetic algorithm, run with a large population size.
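
To illustrate the flavour of this outer loop, here is a stripped-down sketch in Python. It is our own simplification, not the code in concorde.py: the representation of the substructure library as (position, solution sequence) pairs and all parameter values are illustrative, and only RNA.fold from the ViennaRNA Python bindings is taken as given.

import random
import RNA  # ViennaRNA Python bindings

def fitness(sequence, target):
    # Fraction of positions at which the MFE structure matches the target.
    predicted, _mfe = RNA.fold(sequence)
    return sum(p == t for p, t in zip(predicted, target)) / len(target)

def splice(sequence, solution, start):
    # Overwrite part of a candidate sequence with a solved substructure sequence.
    return sequence[:start] + solution + sequence[start + len(solution):]

def design(target, library, pool_size=50, steps=200):
    # Genetic search for the full target structure; the mutation step
    # recombines substructure solutions drawn from `library`, a list of
    # (start_index, solution_sequence) pairs for solved substructures.
    pool = ["".join(random.choice("ACGU") for _ in target) for _ in range(pool_size)]
    for _ in range(steps):
        pool.sort(key=lambda seq: fitness(seq, target), reverse=True)
        if RNA.fold(pool[0])[0] == target:
            return pool[0]                  # exact solution found
        survivors = pool[: pool_size // 4]  # keep the top quarter
        children = []
        while len(survivors) + len(children) < pool_size:
            parent = random.choice(survivors)
            start, solution = random.choice(library)
            children.append(splice(parent, solution, start))
        pool = survivors + children
    return pool[0]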

Substructure Library

To more rapidly solve families of RNA structures as well as restart design runs in a more efficient manner, we cache previously solved substructures and their solution sequences in a shared substructure library. This library can be reloaded across CoNCoRDe runs to make use of a growing pool of substructures for designing RNAs for a given application domain. The user may elect to use the same substructure library across runs, or build specific libraries for a given application for the sake of efficiency.
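
As an illustration of how simple such a cache can be, here is a minimal sketch of a persistent substructure library. It is our own mock-up; the on-disk format and class names used by concorde.py may differ.

import json
import os

class SubstructureLibrary:
    # Cache mapping dot-bracket substructures to known solution sequences.

    def __init__(self, path):
        self.path = path
        self.solutions = {}  # structure (dot-bracket) -> list of sequences
        if os.path.exists(path):
            with open(path) as handle:
                self.solutions = json.load(handle)

    def add(self, structure, sequence):
        # Record a new solution, avoiding duplicates.
        known = self.solutions.setdefault(structure, [])
        if sequence not in known:
            known.append(sequence)

    def lookup(self, structure):
        # Return all cached solutions for a substructure (possibly empty).
        return self.solutions.get(structure, [])

    def save(self):
        # Persist the library so later runs can reload it.
        with open(self.path, "w") as handle:
            json.dump(self.solutions, handle)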

Outlook

Currently, CoNCoRDe is implemented as a fully sequential algorithm. However, given a (family of) target structure(s), we can first construct the poset of substructures involved and then search for substructure solutions in a fully parallelized manner, as the substructure design problems are only loosely coupled through the shared substructure solution library. Moving the substructure library into shared memory and running multiple processes that work on substructures in parallel, asynchronously feeding their solutions back into the library, could greatly speed up our algorithm. Having shown the promise of a categorical approach to RNA bioinformatics, we leave the optimization of CoNCoRDe for another, less stressful day, when our mathematician recharges their batteries.
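
To make the idea tangible, here is a rough sketch of how such a parallel variant could be organized using Python's multiprocessing. It is purely illustrative of the outlook above and not part of the current code base; the solver call is a placeholder, and a real implementation would need proper synchronization around library updates.

from multiprocessing import Manager, Process

def solve_substructure(structure, library):
    # Placeholder worker: a real implementation would call InverseFold or a
    # genetic search here and would guard the read-modify-write with a lock.
    sequence = "A" * len(structure)  # stand-in for an actual solver result
    known = library.get(structure, [])
    library[structure] = known + [sequence]  # reassign so the proxy propagates

def parallel_design(substructures):
    manager = Manager()
    library = manager.dict()  # substructure library shared across processes
    workers = [Process(target=solve_substructure, args=(s, library))
               for s in substructures]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return dict(library)

if __name__ == "__main__":
    print(parallel_design(["((...))", "((..((...))..))"]))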

References