Operads for RNA
Trees, Categories and Compositionality
Introduction
The goal of our project being the design of specific RNA-RNA, as well as protein-RNA interactions,
we have to deal with RNAs and their secondary structures quite frequently. To design pairs of RNAs and RNA-binding proteins,
we first have to be able to design RNAs of arbitrary secondary structures under application-specific
sequence constraints. It whould therefore be beneficial to have a very good understanding of the properties
of RNA secondary structures themselves which we could use to develop better algorithms for RNA design.
To get a grasp on those properties, we have set up a family of mathematical models of RNA secondary structure
grounded in category theory Cats1 . Briefly, category theory frames mathematics in terms of
mathematically very simple objects – categories – which contain structure describing a set of
operations and laws for their composition Cats1 . This simple structure suffices to describe
most of modern mathematics Cats1 Cats2 and has recently found to provide powerful
models of phenomena in computer science CatsComp1 CatsComp2 as well as
physics CatsUrs CatsQuantum , chemistry CatsJade CatsCRN
and biology CatsTracelets CatsGRN .
What all these application domains have in common, is a notion of (de-)compositionality –
the idea that complex systems may be built up from simpler components, the understanding of which allows us
to reason about the behaviour of the complete system CatsApplied . Here, categories and their
derived mathematical structures provide the perfect reasoning tool by allowing us to encode composition of
tractable subsystems as the composition of a the category Cats1 CatsApplied CatsFong .
For example, chemical reaction networks, which can get notoriously complex in the context of living systems
CatsGRN have been treated compositionally using multiple different approaches in applied
category theory CatsGRN CatsCRN CatsTracelets . This shows the
promise of applied category theory for problems in synthetic biology.
Nevertheless, applied category theory is a rather young field, with its rise to mathematical prominence
dating back to the last few years CatsApplied . To the best of our knowledge, no prior iGEM
team has set out to model components of their project using the language of category theory. However, we
feel, that category theory provides a natural fit for our problem of RNA structure generation and design
and manage to draw concrete benefits from using categorical language.
In the following, we will develop a categorical model of RNA secondary structure and directly derive and
implement algorithms for enumerating and generating fixed-length RNA secondary structures according to a
user-defined probability distribution, as well as apply compositional insight to develop
CoNCoRDe, a powerful compositional
algorithm for RNA sequence design.
RNA secondary structure
Mathematical descriptions of compositionality
Having seen that RNA structure could benefit from a compositional treatment, we will now take a look at the ways we
may represent compositionality in the framework of category theory. As mentioned previously, categories provide a
natural framework for describing compositional structures Cats1 CatsApplied . To start with,
let us take a look at the definition of a category Cats1 .
Category
A category \(C\) consists of:
- a set of objects \(Ob(C)\)
- for each pair of objects \(X, Y: Ob(C)\) a set of morphisms \(Hom(X, Y)\)
- an associative composition operation \(\circ: Hom(X, Y) \times Hom(Y, Z) \to Hom(X, Z)\) taking a pair of morphisms and composing them
- for each object \(X: Ob(C)\) an identity morphism \(id_X\) such that \(f \circ id_X = f = id_Y \circ f\) for each morphism in \(Hom(X, Y)\).
At first glance, this definition may remind the reader of an algebraic theory of sets and functions. Indeed,
we may consider a category \(Set\) with \(X, Y: Ob(Set)\) sets and \(f: Hom(X, Y)\) functions \(f: X \to Y \)
between those sets Cats1 , \(\circ \) ordinary function composition. And indeed, sets with
functions between them do give a category! We leave the proof of this as an exercise to the reader.
However, \(Set\) is not the be-all end-all of categories. We may similarly envision a finite set of objects
\(O\) endowed with a partial order \(\leq: O \times O \to Bool\) where \(Bool\) is the set with two elements
representing truth values. Intuitively, we may then place a morphism between two objects \(X, Y: O\),
if and only if \(X \leq Y\). This too can be proven to be a category, which is far away from the notion of
a category of sets and functions CatsApplied . We hope this illustrates nicely, how categories
can represent quite diverse notions of composition, which could possibly be applied to RNA structure as well.
However, having a single type of sequential composition may not be enough to satisfy our needs. Thankfully,
category theory provides a means of complementing sequential composition with a kind of parallel composition
by introducing a monoidal product CatsApplied . This amounts to giving defining a way of
composing not only morphisms, but objects as well! This takes the form of a map on objects \(\otimes : Ob(C) \times Ob(C) \to Ob(C)\)
and morphisms \(\otimes: Hom(X, Y) \times Hom(A, B) \to Hom(X \otimes A, Y \otimes B)\) satisfying a bunch
of conditions which make it behave in the natural way CatsApplied . As the definition of a
monoidal category – a category equipped with such a monoidal product – is quite involved, and
monoidal categories are very general objects, we will postpone their definition to the last section of this
page, together with even more general constructs and instead work with the more restricted notion of an operad Operads1 .
This notion should give us all the necessary expressive power to describe RNA secondary structure, while
remaining definable in a fairly self-contained manner.
Plain Operad
An operad consists of a set of operations \(F(n)\) for each natural number \(n: \mathbb{N}\),
together with a composition operation
\(\circ: F(k) \otimes F(n_1) \otimes ... \otimes F(n_k) \to F(n_1 + ... + n_k)\)
and unit \(e : I \to F(1)\) picking out an identity operation from \(F(1)\).
- composition \(\circ\) is associative.
- composition \(\circ\) is unital, i.e. \(e \otimes O = O\) for \(O : F(1)\).
This encodes all the things we could wish for, by having a sequential composition in \(\circ\), as well as
a parallel composition \(\otimes\) on the inputs of our sequential composition \(\circ\) Operads1 .
For example, this suffices to describe the compositional properties of wiring diagrams (Figure 3 (a))
as well as labelled trees (Figure 3 (b)).
However, operads themselves just give us a set of rules how to put together more complex objects from a set of
known simple objects – they carry no information about the behaviour of the things we put together, other
than their laws of composition Operads1 CatsApplied . That is, they give a syntax
for our systems of interest, but leave out their semantics CatsApplied . To actually apply operads to
a problem of interest, we have to define another kind of mathematical structure, which takes as input an operad and
interprets its abstract operations as concrete operations on some object in a category Operads1
CatsApplied . Let us now graft the semantic flesh of algebras and coalgebras onto our
syntactic bones of operads.
(Co)Algebras
An algebra \(A\) of an operad \(F\) (an \(F\)-algebra) is an object \(A\) together with maps
\(v: F(n) \otimes A^{\otimes n} \to A\), compatible with the operad structure.
A coalgebra \(V\) of an operad \(F\) (an \(F\)-coalgebra) is an object \(V\) together with maps
\(v: F(n) \otimes A \to A^{\otimes n}\), compatible with the operad structure.
In this definition, the sentence "compatible with the operad structure" pulls a lot of weight. Essentially, it means
that the (co)algebra maps should respect associativity and unitality, and that in some sense their evaluation on a
composition \(v(f \circ g_1 \otimes ... \otimes g_k)\) in the operad should equal the composition of its parts in
the algebra \(v(f) \circ v(g_1) \otimes ... \otimes v(g_k)\). Intuitively, the maps \(v\) and their compositions
encode the semantics of whatever composite systems we want to model.
Compositionality of RNA Structure
Now that we have defined categories, operads, and their algebras, we are finally ready to see how we can describe
RNA structure in a compositional manner. To start with, we invite the reader to keep in mind the representation
of RNA structure as composed of simple structural motifs shown in Figure 2. We will now use this representation to
hierarchically decompose an RNA structure into those basic structural parts, eventually arriving at an operad of
RNA secondary structures.
Poset of Substructures
Starting with a given RNA structure, we may decompose it into a poset (partially ordered set) –
a category with at most one morphism between any two objects – of substructures (Figure 4). Intuitively a morphism
between two objects indicates that the target object is contained as a substructure in the source object. This idea of
decomposing an RNA structure into a partially ordered set alone is already very useful. Based on the poset associated
to an RNA structure, we develop an RNA design algorithm –
CoNCoRDe (Compositional Nested Conditional RNA Design)
– which finds RNA sequences folding a given structure by traversing the poset of its substructures, finding
substructure solutions and attempting to represent solutions for a given structure as a composition of substructure solutions.
RNAs as Trees
Having seen that we can describe RNA structures as a poset of substructures, we noticed, that the poset structure
we get resembles a tree structure. Having a look at the literature, we found that the connection between RNA secondary
structure and trees has been known since early on in the development of bioinformatics for RNA structure prediction
RNATrees1 RNATrees2 . RNA structure is defined as a tree, where the nodes are labelled
by stems or multiloops and the leaves are labelled by hairpin loops and miscellaneous unpaired regions
RNATrees1 RNATrees2 . However, the literature does not make the connection between the tree
structure of RNA and applications to hierarchical RNA design algorithms (
CoNCoRDe
) or the use of categorical methods to reason about RNA structures. In the following, we will use operads and category
theory to expand on the understanding of RNA structure as a hierarchical, tree-like object with applications to RNA
structure generation and design.
Operad of RNA Structures
Without further ado, we are now ready to define an operad of RNA secondary structure.
Operad of RNA structure
Consider an operad \(\mathbf{RNA}\) freely generated by the following set of fundamental operations:
- Stems of length \(n\): \(S_n : \mathbb{N} \to \mathbf{RNA}(1)\)
- Multiloops of valence \(n\): \(M_n : \mathbb{N} \to \mathbf{RNA}(n)\)
- Unpaired, loop regions of length \(n\): \(L_n: \mathbb{N} \to \mathbf{RNA}(0)\)
- The identity and the zero-length stem are the same: \(S_0 = id\)
- Stems are additive: \(S_n \circ S_m = S_{m + n}\)
- Multiloops are additive: \(M_2 \circ id \otimes M_2 = M_3 = M_2 \circ M_2 \otimes id \). Moreover, we can show additivity for multiloops of higher valence from this.
- Loops are additive in multiloops: \(M_2 \circ L_n \otimes L_m = L_{m + n}\). Similarly, we can show additivity for loops inside hivher valence multiloops from this together with multiloop additivity.
This operad captures the properties of RNA secondary structure and bestows us with a syntax for
composing arbitrary RNA structures from stems, loops and multiloops. While seemingly trivial, having
access to an operad of RNA structures can greatly simplify the user interface for specifying RNA
secondary structures. The predominant method of specifying RNA secondary structure –
dot-bracket notation RNAInteractions (Figure 1 (c)) – has proven to be error-prone during the
development of our RNA design tools, with missing closing braces and off-by-one errors in unpaired regions
being a common source of frustration. Therefore, both our compositional RNA design software
CoNCoRDe and our
Hammer-Head Ribozyme and general RNA structure design scripts use our operad \(\mathbf{RNA}\) internally.
To translate back into the dot-bracket structure needed by ViennaRNA, we define an
\(\mathbf{RNA}\)-algebra of dot-bracket structures.
\(\mathbf{RNA}\)-algebra of dot-bracket structures
Consider the set of dot-bracket strings \(D: Set\) we define evaluation operations \(v\)
for an \(\mathbf{RNA}\)-algebra structure on \(D\) by pattern matching as follows:
- Loop evaluation: \(L_n \mapsto \texttt{"." * n}\)
- Stem evaluation: \(S_n, x \mapsto \texttt{"(" * n + x + ")" * n}\)
- Multiloops evaluation: \(M_n, x_1, ..., x_n \mapsto \texttt{("{}" * n).format(x_1, ..., x_n)}\)
- Composite evaluations follow from applying the algebra laws on a composite operation, i.e. recursively applying the above three evaluations on a given structure: \(X \circ (Y_1 \otimes ... \otimes Y_k) \mapsto v(X, v(Y_1), ..., v(Y_k))\)
RNA Structure Generation
We can think of another simple algebra on \(\mathbf{RNA}\), which we call the resource algebra
on RNA structures. Given a partial RNA secondary structure represented by an operation in \(\mathbf{RNA}\),
we may ask about the length of the total RNA structure, should we complete it using structures of a given length.
The resource algebra answers this question. However, we might think about doing the converse –
that is, given a desired target structure length, find out which operations we could compose to arrive at an
RNA structure of that length, which motivates our next step.
Resource algebra
Consider the set of natural numbers \(\mathbb{N}\). We can turn it into an algebra of \(\mathbf{RNA}\) counting the
number of bases used in a structure by defining the following evaluation \(v\) by pattern matching:
- Loop evaluation: \(L_n \mapsto n\)
- Stem evaluation: \(S_n, x \mapsto x + 2n\)
- Multiloops evaluation: \(M_n, x_1, ..., x_n \mapsto x_1 + ... + x_n\)
- Composite evaluations follow from applying the algebra laws on a composite operation, i.e. recursively applying the above three evaluations on a given structure: \(X \circ (Y_1 \otimes ... \otimes Y_k) \mapsto v(X, v(Y_1), ..., v(Y_k))\)
Our final goal being a software capable of generating arbitrary RNA structures and designing corresponding
sequences, we move on from the admittedly simple resource and dot-bracket algebras and try to define coalgebras
which would allow us to generate RNA structures adhering to user-defined constraints. As a first step, we
define a coalgebra which tells us, given a partial RNA structure, the minimum and maximum sizes of RNA structure
elements we could compose it with to arrive at an RNA structure of a desired length. That is, we want a
coalgebra, which, given a total number of resources (i.e. the leftover desired length of our RNA structure)
provides us with both minimum loop, multiloop and stem lengths to achieve the desired size and
maximum lengths to stay within the desired size. We can do this, by defining the following coalgebra:
Resource coalgebra
Consider the set \(G = \{ L, S, M \}\) of generator kinds in \(\mathbf{RNA}\). Define the type of size bounds
\(B := \mathbb{N} \times \mathbb{N}\) such that \(B_0 \lt B_1\). Then we may define the resource coalgebra
on the product \(\mathbb{N} \times (G \to B) \) by specifying the evaluation maps \(v\) by pattern matching:
- Loop bounds: \(b_L(r) = (\min b_L, r)\)
- Stem bounds: \(b_S(r) = (\min b_S, \frac{r - \min b_L}{2})\)
- Multiloop bounds: \(b_M(r) = (\min b_M, \frac{r}{\min b})\)
- Generator bounds \(b: G \to \mathbb{N} \to B\) are defined by pattern matching on the generators.
- Stem evaluation: \(S_n, (r, f) \mapsto (r - 2n, b(r - 2n))\)
- Multiloop evaluation: \(M_n, (r, f) \mapsto (r, b(r))\)
- Composite evaluation: \( X \circ (Y_1 \otimes ... \otimes Y_k), x_1, ..., x_n \mapsto (r_0 = r - r(Y_1), ..., r_k = r_{k-1} - r(Y_k), f_k = v(x_n).f(r_k)) \)
- Composite evaluations follow from applying the coalgebra laws on a composite operation, i.e. recursively applying the above three evaluations on a given structure: \(X \circ (Y_1 \otimes ... \otimes Y_k) \mapsto v(X, v(Y_1), ..., v(Y_k))\)
While we have implemented random uniform sampling of RNA secondary structures of a given length using
this coalgebra in Python
to great success, we may want to go a step further and define a coalgebra of probability distributions on RNA substructures.
This would allow us to flexibly sample RNA secondary structures given almost arbitrary constraints. We implement this
\(\mathbf{RNA}\)-coalgebra on the \(\Delta^{G'}\) – the set of probability distributions on the generators of
\(\mathbf{RNA}\) by defining a map of probability distributions \(\pi: \Delta^{G'} \to \Delta^{G'}\) for each generator,
allowing us to progressively modify our probability distribution, as we go deeper into the nested tree of our RNA structure:
Probability coalgebra
Consider the set \(G' = L^\mathbb{N} \cup S^\mathbb{N} \cup M^\mathbb{N}\) of all generators of \(\mathbf{RNA}\).
We may take the set of probability distributions \(\Delta ^ {G'}\) on generators and endow \(\Delta^{G'} \times (G \to \Delta^{G'})\)
with a coalgebra structure. Intuitively, for each generator kind, we specify a probability distribution for consecutive
generators to compose with. Given a map of probability distributions \(\pi: G' \times \Delta^{G'} \to \Delta^{G'}\),
we define the resulting probability coalgebra of \(\mathbf{RNA}\) by specifying evaluation maps \(v\) by pattern matching.
- Loop bounds: \(b_L(r) = (\min b_L, r)\)
- Stem bounds: \(b_S(r) = (\min b_S, \frac{r - \min b_L}{2})\)
- Multiloop bounds: \(b_M(r) = (\min b_M, \frac{r}{\min b})\)
- Generator bounds \(b: G \to \mathbb{N} \to B\) are defined by pattern matching on the generators.
- Stem evaluation: \(S_n, (q, p) \mapsto (\pi(S_n, p)(S), \pi(S_n, p))\)
- Multiloops evaluation: \( M_n, (q, p) \mapsto (\pi(M_n, p)(M), \pi(M_n, p))^{\times n} \)
- Composite evaluations follow from applying the coalgebra laws on a composite operation, i.e. recursively applying the above three evaluations on a given structure: \(X \circ (Y_1 \otimes ... \otimes Y_k) \mapsto v(X, v(Y_1), ..., v(Y_k))\)
Python Implementation
We implement the resource coalgebra for a version of our operad \(\mathbf{RNA}\) and use it to implement
random sampling of fixed-length RNA structures, with user-defined constraints on sequence length, stem length,
multiloop valency range and loop size range. Our resource coalgebra makes this implementation almost trivial
by handling which stem, loop and multiloop sizes are possible to sample from at any given time, to exactly
match a given sequence length. This can be used for generating training data for machine learning methods
for RNA design, as used in the reinforcement learning environment used by
CoNCoRDe, as well as a component of
statistical RNA structure prediction algorithms RNAPrediction1 . Our algorithm functions by
at each step sampling a random operation from the generating set of \(\mathbf{RNA}\) with a size between the
minimum and maximum size returned by the resource cooperad, composing it with the current structure and repeating
the process until the RNA structure is completed, and all resources (RNA sequence length) are exhausted.
To use our algorithm, install CoNCoRDe
by following the setup instructions.
Then, you can run our script
random_structures.py
following the usage instructions:
usage: random_structures.py [-h] [--count [COUNT]] [--min-E [MIN_E]] [--max-E [MAX_E]] [--min-H [MIN_H]] [--max-H [MAX_H]] [--min-S [MIN_S]] [--max-S [MAX_S]] [--min-Z [MIN_Z]] [--max-Z [MAX_Z]] [--min-M [MIN_M]] [--max-M [MAX_M]] [--min-B [MIN_B]] [--max-B [MAX_B]] LENGTH Generate random RNA secondary structures of fixed length. Powered by the operad of RNA structures and its resource algebra. positional arguments: LENGTH Length of desired RNA structures. optional arguments: -h, --help show this help message and exit --count [COUNT] Number of RNA structures to generate. --min-E [MIN_E] Minimum length of non-hairpin unpaired bases. --max-E [MAX_E] Maximum length of non-hairpin unpaired bases. --min-H [MIN_H] Minimum length of hairpin unpaired bases. --max-H [MAX_H] Maximum length of hairpin unpaired bases. --min-S [MIN_S] Minimum length of stems. --max-S [MAX_S] Maximum length of stems. --min-Z [MIN_Z] Minimum length of continued stems. --max-Z [MAX_Z] Maximum length of continued stems. --min-M [MIN_M] Minimum size of multiloops. --max-M [MAX_M] Maximum size of multiloops. --min-B [MIN_B] Minimum size of bulges. --max-B [MAX_B] Maximum size of bulges.
Pseudoknots
At this point, we've firmly established that methods from category can provide real benefits in the
mathematical treatment of RNA secondary structure but up to here, all we have talked about are
pseudo-knot free structures RNAInteractions . While we will not provide a detailed
conceptual treatment of pseudo-knotted RNA structures in the framework of category here, let us
give you a bit of an intuition, how a categorical treatment of pseudo-knots could look like.
Overlay Composition of RNA Structures
Let us begin by recalling that a pseudo-knot is defined as a base-pairing interaction between regions
of RNA which are not part of a stem RNAInteractions . Therefore, we may not represent
pseudoknots in our current operad of RNA structures. As pseudo-knots are interactions between otherwise
unpaired segments in an RNA structure, we have to keep track of both paired and unpaired segments of
RNA in our mathematical description. We can then treat composition of possibly pseudo-knotted structures
by introducing three types of composition of RNA structures:
- Sequential composition, which is the usual composition operation used in our operad of RNA structures. (Figure 6(a))
- Parallel composition, which corresponds to setting multiple RNA structures side-by-side as in operad composition. (Figure 6(b))
- Overlay composition, which glues two compatible RNA structures by identifying stems in one structure with edges or loops in another. (Figure 6(c))
((((...[[[[...))))...]]]]
where brackets []
indicate pseudo-knot base-pairs.
We can see that in the case of pseudoknotted RNA structures with overlay composition, we may still
construct a poset of substructures by introducing morphisms between composite structures and their
composed parts, now with respect to all three introduced composition operations.
Outlook: Double Categorical Structure
While we leave the final definition open, we may point the reader in the direction of the right
mathematical structures to work with pseudoknotted RNA structures. First of all, we will no longer
be able to dodge having to work with monoidal categories by relying on operads instead.
Furthermore, as our mathematical structure would have to deal with a monoidal product together
with two compatible modes of composition, we have to move on to higher categories CatsApplied .
These can be seen as an extension to ordinary categories, where, in addition to objects and morphisms
we have further \(2\)-morphisms, which are intuitively morphisms between morphisms CatsApplied .
A particular structure we have in mind are symmetric monoidal double categories, which feature
\(2\)-morphisms with two compatible modes of composition together with a monoidal product –
exactly what we need!
Double categories or not, we have definitely shown that there is merit to applying category theory
to problems in biology. Stay tuned for our presentation and poster, to see if our mathematician can
work up enough mathematical mojo to work out the technical difficulties towards defining a symmetric
monoidal double category by then!