CoNCoRDe
Compositional Nested Conditional RNA Design
Introduction
On our quest for RNA sequence design, we came across the notion that
RNA structure is compositional
and should be treated in a compositional way. Towards a practical application of category theory and
compositional methods to RNA design, we set out to conceive and implement an algorithm that takes
compositionality seriously as applied to structure-conditional RNA sequence design. Our algorithm –
CoNCoRDe (Compositional Nested Conditional RNA Design) – is based on decomposing a target RNA structure
into its poset (partially ordered set, \(Bool\)-enriched category) of substructures (for more on posets
and all that, take a look at our model pages),
which it then proceeds to solve and integrate into solutions for larger substructures. In the following, we
present the components and implementation details of our algorithm.
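To make the decomposition step concrete, the following sketch enumerates the nested substructures of a dot-bracket target structure by matching brackets with a stack. The function and variable names are our own illustrative choices, not the actual CoNCoRDe implementation:

```python
def substructures(structure):
    """Enumerate the nested substructures of a dot-bracket string.

    Each matched pair of brackets encloses a substructure; the set of
    all such substructures, ordered by containment, forms the poset
    that a CoNCoRDe-style algorithm recurses over.
    """
    stack, subs = [], []
    for i, c in enumerate(structure):
        if c == "(":
            stack.append(i)
        elif c == ")":
            j = stack.pop()
            subs.append(structure[j:i + 1])
    return subs

def leq(x, y):
    """Containment order on substructures: x <= y iff x occurs inside y."""
    return x in y
```

For example, `substructures("((..))..(..)")` yields the inner hairpin `(..)`, the stem `((..))` enclosing it, and the second hairpin `(..)`, with `leq` recovering the partial order by containment.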
Compositionality and Designability
Substructure Solvers
Having outlined the basic idea of the CoNCoRDe algorithm, we need to fix a set of algorithms we want to use
to search for solutions to our substructures. Here, we have a variety of options, from the many available
RNA structure design algorithms ViennaRNA to simple genetic algorithms GeneticAlgorithms
or our own deep learning models. In this section we explore two main approaches –
using ViennaRNA's InverseFold ViennaRNA and using our own deep learning model trained via
deep reinforcement learning Reinforcement .
Vienna RNA
As a first, easy-to-use approach to solving RNA substructures, we use InverseFold, which is part of the
ViennaRNA suite for RNA structure prediction and design. By itself, InverseFold has difficulties
dealing with the hard-to-design RNA structures contained in the EteRNA100 benchmark dataset Eterna100 .
However, we still chose it as our main substructure solver, alongside genetic algorithm-based search, for its
simplicity and low computational overhead.
Reinforcement Learning
In parallel, we train a deep neural network as a policy for RNA design Reinforcement . We train our
model using reinforcement learning (RL) – an approach to training neural networks for tasks where no differentiable
cost function is available Reinforcement . While RL is notorious for its training difficulties and
lack of robustness, it has brought about impressive results in control, robotics, and the widely popular AlphaZero
series of programs, which greatly advanced the playing strength of chess and Go engines Reinforcement .
As a part of our software suite, we implement a reinforcement learning environment for RNA design.
An RL environment can be seen as a game which a neural network agent should learn to play and achieve high scores in
Reinforcement . In our case, the way to score is to design – given some RNA secondary structure
– RNA sequences which reliably fold into that structure. We implement this environment using a combination
of our torchsupport machine learning library and ViennaRNA, which provides
the RNA structure prediction engine. Our network generates RNA sequences for a given structure, which are then
folded by Vienna RNA and compared to the target structure (Figure 1). The network is then assigned a score for
its performance. If it solves a structure, it gets the highest possible score of 20. Otherwise, it is assigned
a score between 0 and 1 that decreases with the Hamming distance between the target structure and the predicted
structure of its output sequence. By alternately training the network to maximize its score and predicting
sequences in the environment, we should achieve continuous improvement of model performance Reinforcement .
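A sketch of this reward scheme follows; the normalization is our reading of the description above, not the exact CoNCoRDe code:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def reward(target, predicted, solved_bonus=20.0):
    """Score a predicted secondary structure against the target.

    A perfect match earns the maximum score of 20; otherwise the score
    lies in [0, 1] and shrinks as the Hamming distance grows.
    """
    if predicted == target:
        return solved_bonus
    return 1.0 - hamming(target, predicted) / len(target)
```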
That, at least is the theory. In practice, reinforcement learning has gained notoriety for being unstable,
highly sensitive to hyperparameters, immensely data-hungry and all in all rather unpleasant to work with
until you get it just right. Therefore, we chose a family of reinforcement learning methods for which
we had someone on hand who could help us get them just right. To implement our
reinforcement learning, we used an advantage-weighted actor critic (AWAC) training algorithm implemented
in our advisor's torchsupport library. AWAC
is a simple but effective reinforcement learning algorithm, which works by training a policy \(\pi(a\vert s)\)
using the following objective:
$$
\arg\max_\theta \mathbb{E}_{a, s}[\log(\pi(a \vert s)) \exp(\beta A)]
$$
where \(A\) is the advantage of action \(a\) at state \(s\), and \(\beta\) is a temperature hyperparameter.
Intuitively, this objective maximizes the probability of past actions, weighted by how well these actions
performed. In this way, actions with a high advantage are copied more strongly than actions with a low, or
even negative, advantage. This results in a policy which is biased towards performing advantageous actions,
leading to the continuous model improvement advertised by proponents of reinforcement learning.
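A minimal sketch of this objective on a batch of logged transitions, in pure Python for illustration (the actual training uses torchsupport and PyTorch; the weight clipping is a common stabilization trick, not necessarily part of the original setup):

```python
import math

def awac_loss(log_probs, advantages, beta=1.0, clip=20.0):
    """Negative advantage-weighted log-likelihood over a batch.

    Each action's log-probability log pi(a|s) is weighted by
    exp(beta * A); the weight is clipped to keep the exponential
    from exploding for large advantages.
    """
    total = 0.0
    for log_p, adv in zip(log_probs, advantages):
        weight = min(math.exp(beta * adv), clip)
        total += -log_p * weight
    return total / len(log_probs)
```

Minimizing this loss pushes probability mass towards actions with positive advantage, exactly as the objective above prescribes.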
We trained a simple convolutional neural network, implemented
in our repository
using PyTorch Pytorch . And lo and behold, after thousands of training steps of stagnation, our model
finally learned to generate RNA sequences folding into a desired structure, as seen in the progression of our model's
average reward, which approaches 20, the maximum (solved) reward of our RNA design environment (Figure 2).
CoNCoRDe
We have implemented all these components in Python, resulting in our hierarchical RNA design software
CoNCoRDe (Compositional Nested Conditional RNA Design). We provide CoNCoRDe to the iGEM and wider
scientific communities in the form of a script
concorde.py
with a simple command-line user interface. Now you can design RNA structures for your specific application
domain by installing CoNCoRDe following the setup instructions
and running it:
usage: concorde.py [-h] [--path [PATH]] [--size [SIZE]] [--inner-size [INNER_SIZE]]
                   [--fraction [FRACTION]] [--random-fraction [RANDOM_FRACTION]]
                   [--max-steps [MAX_STEPS]] [--inner-steps [INNER_STEPS]]
                   [--regen [REGEN]] [--early]
                   STRUCTURE

Search RNA sequences folding a given target structure. Based on the operad of
RNA structures, decomposes the target structure into substructures which are
solved recursively.

positional arguments:
  STRUCTURE             Target RNA secondary structure.

optional arguments:
  -h, --help            show this help message and exit
  --path [PATH]         Path to save and load substructure library from.
  --size [SIZE]         size of the sequence pool.
  --inner-size [INNER_SIZE]
                        size of the sequence pool for substructure search.
  --fraction [FRACTION]
                        number of top sequences to propagate.
  --random-fraction [RANDOM_FRACTION]
                        number of sequences to randomly repopulate.
  --max-steps [MAX_STEPS]
                        maximum number of search steps.
  --inner-steps [INNER_STEPS]
                        maximum number of search steps for substructure search.
  --regen [REGEN]       probability of diversifying the substructure library at
                        each step.
  --early               Finish sequence search early, if matching sequence is
                        found?
We have applied CoNCoRDe to the hard RNA-design problems contained in the
EteRNA100 benchmark dataset Eterna100 , with very encouraging
results. CoNCoRDe solved hard design problems within minutes, for some
examples see the structures in Figure 3, with sequences generated by CoNCoRDe.
All of these structures possess features which are hard for RNA design
algorithms to solve. Figure 3 (a) contains an RNA structure with multiple
short stems, which tend to be hard to design, as they have to be optimized
to disallow pairing interactions across stems Eterna100 .
CoNCoRDe circumvents this problem by choosing asymmetric sequences for the stems
resulting in one single preferred way of base pairing. Figure 3 (b) contains
deeply nested multiloops, which present a similar type of optimization challenge.
Figure 3 (c) contains a pair of consecutive short bulges, commonly referred to
as a "zigzag" structure Eterna100 . Short bulges are energetically
disfavourable in general, which complicates sequence design Eterna100 .
CoNCoRDe is able to solve a wide variety of such hard structures thanks to its – to
our knowledge unique – design approach of recursively subdividing the target structure
into a poset of substructures (for more info, look at our work on
RNA mathematical models
).
Implementation Details
Substructure Solvers
We use a combination of ViennaRNA's InverseFold ViennaRNA and a simple
genetic algorithm to recursively solve substructures. A solution for the main target
structure is then generated by recombining substructure solutions as the mutation step
of the same genetic algorithm, run with a large population size.
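A minimal sketch of such a genetic-algorithm loop follows. The scoring function, mutation scheme, and pool parameters here are illustrative stand-ins: where this toy version only checks that the target base pairs are realizable by complementary bases, CoNCoRDe scores candidates by actually folding them with ViennaRNA.

```python
import random

# Canonical Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def pair_table(structure):
    """Map each '(' position to its matching ')' position."""
    stack, pairs = [], {}
    for i, c in enumerate(structure):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs[stack.pop()] = i
    return pairs

def score(seq, pairs):
    """Fraction of target base pairs realizable by the sequence."""
    if not pairs:
        return 1.0
    good = sum((seq[i], seq[j]) in PAIRS for i, j in pairs.items())
    return good / len(pairs)

def mutate(seq, rng):
    """Point-mutate a single random position."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice("ACGU") + seq[i + 1:]

def design(structure, pool_size=50, top=10, steps=200, seed=0):
    """Evolve a pool of sequences towards a target structure."""
    rng = random.Random(seed)
    pairs = pair_table(structure)
    n = len(structure)
    pool = ["".join(rng.choice("ACGU") for _ in range(n))
            for _ in range(pool_size)]
    for _ in range(steps):
        pool.sort(key=lambda s: score(s, pairs), reverse=True)
        if score(pool[0], pairs) == 1.0:
            break
        # Keep the top sequences and repopulate with point mutants.
        parents = pool[:top]
        pool = parents + [mutate(rng.choice(parents), rng)
                          for _ in range(pool_size - top)]
    return pool[0]
```

In CoNCoRDe, the mutation step additionally splices cached substructure solutions into candidate sequences, which is what lets solved substructures compose into solutions for the full target.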
Substructure Library
To more rapidly solve families of RNA structures as well as restart design runs in
a more efficient manner, we cache previously solved substructures and their solution
sequences in a shared substructure library. This library can be reloaded across
CoNCoRDe runs to make use of a growing pool of substructures for designing RNAs for
a given application domain. The user may elect to use the same substructure library
across runs, or build specific libraries for a given application for the sake of
efficiency.
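The library itself can be as simple as a persistent mapping from substructure strings to solving sequences; the following sketch uses a JSON file, though the file format and class API are our illustrative choices, not the actual CoNCoRDe storage layer:

```python
import json
import os

class SubstructureLibrary:
    """Cache mapping dot-bracket substructures to solving sequences."""

    def __init__(self, path):
        self.path = path
        self.solutions = {}
        if os.path.exists(path):
            with open(path) as f:
                self.solutions = json.load(f)

    def add(self, structure, sequence):
        # Keep several distinct solutions per substructure for recombination.
        self.solutions.setdefault(structure, [])
        if sequence not in self.solutions[structure]:
            self.solutions[structure].append(sequence)

    def get(self, structure):
        return self.solutions.get(structure, [])

    def save(self):
        with open(self.path, "w") as f:
            json.dump(self.solutions, f)
```

Reloading the same file across runs then gives later design runs a head start on any substructure that has been solved before.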
Outlook
Currently, CoNCoRDe is implemented as a fully sequential algorithm. However, given
a (family of) target structure(s), we can first construct the poset of substructures
involved and then search for substructure solutions in a fully parallelized manner,
as the substructure design problems are only loosely coupled through the shared
substructure solution library. Moving the substructure library into shared memory and
running multiple processes working on substructures in parallel and feeding their
solutions back into the library asynchronously could potentially greatly speed up our
algorithm. Having shown the promise of a categorical approach to RNA bioinformatics,
we leave the optimization of CoNCoRDe for another, less stressful day, when our
mathematician recharges their batteries.