Operads for RNA

Trees, Categories and Compositionality

Introduction

The goal of our project being the design of specific RNA-RNA, as well as protein-RNA interactions, we have to deal with RNAs and their secondary structures quite frequently. To design pairs of RNAs and RNA-binding proteins, we first have to be able to design RNAs of arbitrary secondary structures under application-specific sequence constraints. It whould therefore be beneficial to have a very good understanding of the properties of RNA secondary structures themselves which we could use to develop better algorithms for RNA design. To get a grasp on those properties, we have set up a family of mathematical models of RNA secondary structure grounded in category theory Cats1. Briefly, category theory frames mathematics in terms of mathematically very simple objects – categories – which contain structure describing a set of operations and laws for their composition Cats1. This simple structure suffices to describe most of modern mathematics Cats1Cats2 and has recently found to provide powerful models of phenomena in computer science CatsComp1CatsComp2 as well as physics CatsUrsCatsQuantum, chemistry CatsJadeCatsCRN and biology CatsTraceletsCatsGRN.

What all these application domains have in common, is a notion of (de-)compositionality – the idea that complex systems may be built up from simpler components, the understanding of which allows us to reason about the behaviour of the complete system CatsApplied. Here, categories and their derived mathematical structures provide the perfect reasoning tool by allowing us to encode composition of tractable subsystems as the composition of a the category Cats1

CatsApplied

CatsFong. For example, chemical reaction networks, which can get notoriously complex in the context of living systems CatsGRN have been treated compositionally using multiple different approaches in applied category theory CatsGRN

CatsCRN

CatsTracelets. This shows the promise of applied category theory for problems in synthetic biology.

Nevertheless, applied category theory is a rather young field, with its rise to mathematical prominence dating back to the last few years CatsApplied. To the best of our knowledge, no prior iGEM team has set out to model components of their project using the language of category theory. However, we feel, that category theory provides a natural fit for our problem of RNA structure generation and design and manage to draw concrete benefits from using categorical language. In the following, we will develop a categorical model of RNA secondary structure and directly derive and implement algorithms for enumerating and generating fixed-length RNA secondary structures according to a user-defined probability distribution, as well as apply compositional insight to develop CoNCoRDe, a powerful compositional algorithm for RNA sequence design.

RNA secondary structure

Figure 1: Representations of RNA secondary structure

RNA secondary structure denotes the pattern of base-pairing of a given RNA sequence. Multiple equivalent representations of this type of structure exist. (a) Most commonly RNA secondary structure is represented as a graph with two types of edges. One type connecting consecutive bases in the RNA sequence and the other connecting base pairs of (G-C, A-U and G-U). (b) The same graph shown in (a) laid out in a manner emphasizing the sequential nature of the underlying RNA sequence. (c) Dot-bracket representation of the same structure. Here a pair of "(" and ")" indicates base-pairing. (d) Contact matrix representation. A filled square indicates that the corresponding pair of sequence positions participate in base pairing.

Let us first direct our view to RNA sequences and secondary structures. To revisit, an RNA sequence is a sequence of bases taken from the alphabet of ribonucleotide bases \(\{G, U, A, C\}\). An RNA is fully defined by its sequence, the sequence of any given RNA giving rise to a secondary structure. While the interactions within and among RNA molecules are quite diverse RNAInteractions, it suffices in many cases to restrict ourselves to three particular interactions between bases. Bases form base-pairs of three main types – A-U, G-C and G-U. The particular pattern of base-pairing of a given sequence is known as its secondary structure RNAInteractions. For the sake of visual reasoning, secondary structures may be represented in a variety of ways, starting from graphs of the RNA sequence with paired bases connected by an edge and put in spatial proximity (Figure 1 (a, b)) to ticking boxes in a matrix of base pairs (Figure 1 (d)) RNAInteractions.

Figure 2: Elements of RNA secondary structure

By virtue of its simplicity, RNA structure features a set of named conserved structural motifs shared across natural RNA structures. (a) These include stems of consecutive paired bases, hairpin loops of unpaired bases capping stems and bulges, featuring unpaired regions within a stem and multiloops joining multiple stems. These structural elements may be arbitrarily nested to form a complete secondary structure. (b) In addition, RNA structures may include pseudo knot interactions, that is non-stem interactions between otherwise unpaired regions.

RNA secondary structures consist of a set of common structural motifs, which compose to give a complete secondary structure RNAInteractionsRNATrees1. Stems, which consist of consecutive paired bases, hairpin loops of unpaired bases capping a stem, bulges of unpaired bases inside a stem and multiloops joining multiple stems (Figure 2 (a)). These structures may be nested, with stems containing multiloops containing further stems resulting in a complex structure made up of simple components (Figure 2 (a)). In addition, otherwise unpaired regions of an RNA sequence may form base-pairing interactions to form pseudo-knots – non-stem stretches of consecutive paired bases RNAInteractions. From this observation that RNA structure consists of a composition of multiple simple parts (stems, loops, multiloops) we may gather that compositional – and thus categorical – methods would do well at modelling RNA structures.

Mathematical descriptions of compositionality

Having seen that RNA structure could benefit from a compositional treatment, we will now take a look at the ways we may represent compositionality in the framework of category theory. As mentioned previously, categories provide a natural framework for describing compositional structures Cats1CatsApplied. To start with, let us take a look at the definition of a category Cats1.

Compositionality of RNA Structure

Now that we have defined categories, operads, and their algebras, we are finally ready to see how we can describe RNA structure in a compositional manner. To start with, we invite the reader to keep in mind the representation of RNA structure as composed of simple structural motifs shown in Figure 2. We will now use this representation to hierarchically decompose an RNA structure into those basic structural parts, eventually arriving at an operad of RNA secondary structures.

Poset of Substructures

Figure 4: Poset of substructures

Starting with a large RNA structure, we may progressively remove stems and split multiloops to arrive at a partially ordered set (poset) of RNA substructures, some of which may be shared across structures or even within a single structure. This indicates we do the converse and compose RNA structures from substructures.

Starting with a given RNA structure, we may decompose it into a poset (partially ordered set) – a category with at most one morphism between any two objects – of substructures (Figure 4). Intuitively a morphism between two objects indicates that the target object is contained as a substructure in the source object. This idea of decomposing an RNA structure into a partially ordered set alone is already very useful. Based on the poset associated to an RNA structure, we develop an RNA design algorithm – CoNCoRDe (Compositional Nested Conditional RNA Design) – which finds RNA sequences folding a given structure by traversing the poset of its substructures, finding substructure solutions and attempting to represent solutions for a given structure as a composition of substructure solutions.

RNAs as Trees

Having seen that we can describe RNA structures as a poset of substructures, we noticed, that the poset structure we get resembles a tree structure. Having a look at the literature, we found that the connection between RNA secondary structure and trees has been known since early on in the development of bioinformatics for RNA structure prediction RNATrees1RNATrees2. RNA structure is defined as a tree, where the nodes are labelled by stems or multiloops and the leaves are labelled by hairpin loops and miscellaneous unpaired regions RNATrees1RNATrees2. However, the literature does not make the connection between the tree structure of RNA and applications to hierarchical RNA design algorithms ( CoNCoRDe ) or the use of categorical methods to reason about RNA structures. In the following, we will use operads and category theory to expand on the understanding of RNA structure as a hierarchical, tree-like object with applications to RNA structure generation and design.

Operad of RNA Structures

Without further ado, we are now ready to define an operad of RNA secondary structure.

Operad of RNA structure

Consider an operad \(\mathbf{RNA}\) freely generated by the following set of fundamental operations:

Stems of length \(n\): \(S_n : \mathbb{N} \to \mathbf{RNA}(1)\)
Multiloops of valence \(n\): \(M_n : \mathbb{N} \to \mathbf{RNA}(n)\)
Unpaired, loop regions of length \(n\): \(L_n: \mathbb{N} \to \mathbf{RNA}(0)\)

together with a set of relations denoting equivalent structures:

The identity and the zero-length stem are the same: \(S_0 = id\)
Stems are additive: \(S_n \circ S_m = S_{m + n}\)
Multiloops are additive: \(M_2 \circ id \otimes M_2 = M_3 = M_2 \circ M_2 \otimes id \). Moreover, we can show additivity for multiloops of higher valence from this.
Loops are additive in multiloops: \(M_2 \circ L_n \otimes L_m = L_{m + n}\). Similarly, we can show additivity for loops inside hivher valence multiloops from this together with multiloop additivity.

Figure 5: Visual representation of the operad \(\mathbf{RNA}\)

For better understanding, we visualize the generators and relations of the operad \(\mathbf{RNA}\).

This operad captures the properties of RNA secondary structure and bestows us with a syntax for composing arbitrary RNA structures from stems, loops and multiloops. While seemingly trivial, having access to an operad of RNA structures can greatly simplify the user interface for specifying RNA secondary structures. The predominant method of specifying RNA secondary structure – dot-bracket notation RNAInteractions (Figure 1 (c)) – has proven to be error-prone during the development of our RNA design tools, with missing closing braces and off-by-one errors in unpaired regions being a common source of frustration. Therefore, both our compositional RNA design software CoNCoRDe and our Hammer-Head Ribozyme and general RNA structure design scripts use our operad \(\mathbf{RNA}\) internally. To translate back into the dot-bracket structure needed by ViennaRNA, we define an \(\mathbf{RNA}\)-algebra of dot-bracket structures.

\(\mathbf{RNA}\)-algebra of dot-bracket structures

Consider the set of dot-bracket strings \(D: Set\) we define evaluation operations \(v\) for an \(\mathbf{RNA}\)-algebra structure on \(D\) by pattern matching as follows:

Loop evaluation: \(L_n \mapsto \texttt{"." * n}\)
Stem evaluation: \(S_n, x \mapsto \texttt{"(" * n + x + ")" * n}\)
Multiloops evaluation: \(M_n, x_1, ..., x_n \mapsto \texttt{("{}" * n).format(x_1, ..., x_n)}\)
Composite evaluations follow from applying the algebra laws on a composite operation, i.e. recursively applying the above three evaluations on a given structure: \(X \circ (Y_1 \otimes ... \otimes Y_k) \mapsto v(X, v(Y_1), ..., v(Y_k))\)

RNA Structure Generation

We can think of another simple algebra on \(\mathbf{RNA}\), which we call the resource algebra on RNA structures. Given a partial RNA secondary structure represented by an operation in \(\mathbf{RNA}\), we may ask about the length of the total RNA structure, should we complete it using structures of a given length. The resource algebra answers this question. However, we might think about doing the converse – that is, given a desired target structure length, find out which operations we could compose to arrive at an RNA structure of that length, which motivates our next step.

Resource algebra

Consider the set of natural numbers \(\mathbb{N}\). We can turn it into an algebra of \(\mathbf{RNA}\) counting the number of bases used in a structure by defining the following evaluation \(v\) by pattern matching:

Loop evaluation: \(L_n \mapsto n\)
Stem evaluation: \(S_n, x \mapsto x + 2n\)
Multiloops evaluation: \(M_n, x_1, ..., x_n \mapsto x_1 + ... + x_n\)
Composite evaluations follow from applying the algebra laws on a composite operation, i.e. recursively applying the above three evaluations on a given structure: \(X \circ (Y_1 \otimes ... \otimes Y_k) \mapsto v(X, v(Y_1), ..., v(Y_k))\)

Our final goal being a software capable of generating arbitrary RNA structures and designing corresponding sequences, we move on from the admittedly simple resource and dot-bracket algebras and try to define coalgebras which would allow us to generate RNA structures adhering to user-defined constraints. As a first step, we define a coalgebra which tells us, given a partial RNA structure, the minimum and maximum sizes of RNA structure elements we could compose it with to arrive at an RNA structure of a desired length. That is, we want a coalgebra, which, given a total number of resources (i.e. the leftover desired length of our RNA structure) provides us with both minimum loop, multiloop and stem lengths to achieve the desired size and maximum lengths to stay within the desired size. We can do this, by defining the following coalgebra:

Resource coalgebra

Consider the set \(G = \{ L, S, M \}\) of generator kinds in \(\mathbf{RNA}\). Define the type of size bounds \(B := \mathbb{N} \times \mathbb{N}\) such that \(B_0 \lt B_1\). Then we may define the resource coalgebra on the product \(\mathbb{N} \times (G \to B) \) by specifying the evaluation maps \(v\) by pattern matching:

Loop bounds: \(b_L(r) = (\min b_L, r)\)
Stem bounds: \(b_S(r) = (\min b_S, \frac{r - \min b_L}{2})\)
Multiloop bounds: \(b_M(r) = (\min b_M, \frac{r}{\min b})\)
Generator bounds \(b: G \to \mathbb{N} \to B\) are defined by pattern matching on the generators.
Stem evaluation: \(S_n, (r, f) \mapsto (r - 2n, b(r - 2n))\)
Multiloop evaluation: \(M_n, (r, f) \mapsto (r, b(r))\)
Composite evaluation: \( X \circ (Y_1 \otimes ... \otimes Y_k), x_1, ..., x_n \mapsto (r_0 = r - r(Y_1), ..., r_k = r_{k-1} - r(Y_k), f_k = v(x_n).f(r_k)) \)
Composite evaluations follow from applying the coalgebra laws on a composite operation, i.e. recursively applying the above three evaluations on a given structure: \(X \circ (Y_1 \otimes ... \otimes Y_k) \mapsto v(X, v(Y_1), ..., v(Y_k))\)

While we have implemented random uniform sampling of RNA secondary structures of a given length using this coalgebra in Python to great success, we may want to go a step further and define a coalgebra of probability distributions on RNA substructures. This would allow us to flexibly sample RNA secondary structures given almost arbitrary constraints. We implement this \(\mathbf{RNA}\)-coalgebra on the \(\Delta^{G'}\) – the set of probability distributions on the generators of \(\mathbf{RNA}\) by defining a map of probability distributions \(\pi: \Delta^{G'} \to \Delta^{G'}\) for each generator, allowing us to progressively modify our probability distribution, as we go deeper into the nested tree of our RNA structure:

Probability coalgebra

Consider the set \(G' = L^\mathbb{N} \cup S^\mathbb{N} \cup M^\mathbb{N}\) of all generators of \(\mathbf{RNA}\). We may take the set of probability distributions \(\Delta ^ {G'}\) on generators and endow \(\Delta^{G'} \times (G \to \Delta^{G'})\) with a coalgebra structure. Intuitively, for each generator kind, we specify a probability distribution for consecutive generators to compose with. Given a map of probability distributions \(\pi: G' \times \Delta^{G'} \to \Delta^{G'}\), we define the resulting probability coalgebra of \(\mathbf{RNA}\) by specifying evaluation maps \(v\) by pattern matching.

Loop bounds: \(b_L(r) = (\min b_L, r)\)
Stem bounds: \(b_S(r) = (\min b_S, \frac{r - \min b_L}{2})\)
Multiloop bounds: \(b_M(r) = (\min b_M, \frac{r}{\min b})\)
Generator bounds \(b: G \to \mathbb{N} \to B\) are defined by pattern matching on the generators.
Stem evaluation: \(S_n, (q, p) \mapsto (\pi(S_n, p)(S), \pi(S_n, p))\)
Multiloops evaluation: \( M_n, (q, p) \mapsto (\pi(M_n, p)(M), \pi(M_n, p))^{\times n} \)
Composite evaluations follow from applying the coalgebra laws on a composite operation, i.e. recursively applying the above three evaluations on a given structure: \(X \circ (Y_1 \otimes ... \otimes Y_k) \mapsto v(X, v(Y_1), ..., v(Y_k))\)

Python Implementation

We implement the resource coalgebra for a version of our operad \(\mathbf{RNA}\) and use it to implement random sampling of fixed-length RNA structures, with user-defined constraints on sequence length, stem length, multiloop valency range and loop size range. Our resource coalgebra makes this implementation almost trivial by handling which stem, loop and multiloop sizes are possible to sample from at any given time, to exactly match a given sequence length. This can be used for generating training data for machine learning methods for RNA design, as used in the reinforcement learning environment used by CoNCoRDe, as well as a component of statistical RNA structure prediction algorithms RNAPrediction1. Our algorithm functions by at each step sampling a random operation from the generating set of \(\mathbf{RNA}\) with a size between the minimum and maximum size returned by the resource cooperad, composing it with the current structure and repeating the process until the RNA structure is completed, and all resources (RNA sequence length) are exhausted.

To use our algorithm, install CoNCoRDe by following the setup instructions. Then, you can run our script random_structures.py following the usage instructions:

    usage: random_structures.py [-h] [--count [COUNT]] [--min-E [MIN_E]]
                                [--max-E [MAX_E]] [--min-H [MIN_H]]
                                [--max-H [MAX_H]] [--min-S [MIN_S]]
                                [--max-S [MAX_S]] [--min-Z [MIN_Z]]
                                [--max-Z [MAX_Z]] [--min-M [MIN_M]]
                                [--max-M [MAX_M]] [--min-B [MIN_B]]
                                [--max-B [MAX_B]]
                                LENGTH

    Generate random RNA secondary structures of fixed length. Powered by the
    operad of RNA structures and its resource algebra.

    positional arguments:
      LENGTH           Length of desired RNA structures.

    optional arguments:
      -h, --help       show this help message and exit
      --count [COUNT]  Number of RNA structures to generate.
      --min-E [MIN_E]  Minimum length of non-hairpin unpaired bases.
      --max-E [MAX_E]  Maximum length of non-hairpin unpaired bases.
      --min-H [MIN_H]  Minimum length of hairpin unpaired bases.
      --max-H [MAX_H]  Maximum length of hairpin unpaired bases.
      --min-S [MIN_S]  Minimum length of stems.
      --max-S [MAX_S]  Maximum length of stems.
      --min-Z [MIN_Z]  Minimum length of continued stems.
      --max-Z [MAX_Z]  Maximum length of continued stems.
      --min-M [MIN_M]  Minimum size of multiloops.
      --max-M [MAX_M]  Maximum size of multiloops.
      --min-B [MIN_B]  Minimum size of bulges.
      --max-B [MAX_B]  Maximum size of bulges.

Pseudoknots

At this point, we've firmly established that methods from category can provide real benefits in the mathematical treatment of RNA secondary structure but up to here, all we have talked about are pseudo-knot free structures RNAInteractions. While we will not provide a detailed conceptual treatment of pseudo-knotted RNA structures in the framework of category here, let us give you a bit of an intuition, how a categorical treatment of pseudo-knots could look like.

Overlay Composition of RNA Structures

Let us begin by recalling that a pseudo-knot is defined as a base-pairing interaction between regions of RNA which are not part of a stem RNAInteractions. Therefore, we may not represent pseudoknots in our current operad of RNA structures. As pseudo-knots are interactions between otherwise unpaired segments in an RNA structure, we have to keep track of both paired and unpaired segments of RNA in our mathematical description. We can then treat composition of possibly pseudo-knotted structures by introducing three types of composition of RNA structures:

Sequential composition, which is the usual composition operation used in our operad of RNA structures. (Figure 6(a))
Parallel composition, which corresponds to setting multiple RNA structures side-by-side as in operad composition. (Figure 6(b))
Overlay composition, which glues two compatible RNA structures by identifying stems in one structure with edges or loops in another. (Figure 6(c))

Here, what we have termed overlay composition provides a mode of composition which we can use to represent pseudoknot structures. For example the overlay composition in Figure 6(c) can be interpreted as a pseudoknotted RNA secondary structure corresponding to the following dot-bracket string: ((((...[[[[...))))...]]]] where brackets [] indicate pseudo-knot base-pairs.

Figure 6: Overlay composition

A categorical description of RNA secondary structure featuring pseudoknots should admit three types of structure composition. (a) ordinary sequential composition: this is the very same composition used in our operad of RNA structures, where two RNAs are glued together along a common open boundary. (b) parallel composition as given by a monoidal product. This roughly corresponds to setting operations side-by-side in the composition of our operad of RNA structures. (c) overlay composition, which we introduce to allow the introduction of pseudoknots. RNA structures are composed by gluing them along corresponding loops and paired regions. Here, we have to keep track of both paired and unpaired regions of RNA in stems and loops, respectively. Two structures are compatible, if they either match paired with unpaired regions, or unpaired with unpaired regions. Should two paired regions match, the corresponding pair of structures cannot be overlay-composed.

We can see that in the case of pseudoknotted RNA structures with overlay composition, we may still construct a poset of substructures by introducing morphisms between composite structures and their composed parts, now with respect to all three introduced composition operations.

Outlook: Double Categorical Structure

While we leave the final definition open, we may point the reader in the direction of the right mathematical structures to work with pseudoknotted RNA structures. First of all, we will no longer be able to dodge having to work with monoidal categories by relying on operads instead. Furthermore, as our mathematical structure would have to deal with a monoidal product together with two compatible modes of composition, we have to move on to higher categories CatsApplied. These can be seen as an extension to ordinary categories, where, in addition to objects and morphisms we have further \(2\)-morphisms, which are intuitively morphisms between morphisms CatsApplied. A particular structure we have in mind are symmetric monoidal double categories, which feature \(2\)-morphisms with two compatible modes of composition together with a monoidal product – exactly what we need!

Double categories or not, we have definitely shown that there is merit to applying category theory to problems in biology. Stay tuned for our presentation and poster, to see if our mathematician can work up enough mathematical mojo to work out the technical difficulties towards defining a symmetric monoidal double category by then!

Team:Heidelberg/Model/RNA Combinatorics

Introduction

RNA secondary structure

Mathematical descriptions of compositionality

Compositionality of RNA Structure

Poset of Substructures

RNAs as Trees

Operad of RNA Structures

RNA Structure Generation

Python Implementation

Pseudoknots

Overlay Composition of RNA Structures

Outlook: Double Categorical Structure

References

Quote

Useful Links

Follow us on

Contact us