Team:Heidelberg/Software/PRISM

PRISM
Protein RNA Interaction Sequence Model

Introduction

Abstract

We present the neural network PRISM (Protein RNA Interaction Sequence Model), a language model based on the Conditional Transformer Language Model for Controllable Generation (CTRL) by Salesforce, which generates sequences of novel RNA-binding proteins (RBPs) Keskar2019CTRLAC. Given an RNA motif and protein annotations, such as the total charge and taxonomy, PRISM generates corresponding RBP sequences.

Generative modeling of protein sequences is an unsupervised learning problem and a key to solving fundamental problems in synthetic biology, medicine, and materials science. The transformer architecture proposed by Vaswani et al. is the state of the art in language modelling Vaswani2017AttentionIA. The layer normalisation of the CTRL model was substituted with the residual-with-zero (ReZero) initialisation technique introduced by Bachlechner et al. Bachlechner2020ReZeroIA. We trained the neural network on about 500,000 sequences and more than 15,000,000 annotations from UniProtKB Reviewed (Swissprot), BindingDB Gilson2016BindingDBI2 and the Gene Ontology Ashburner2000GeneOT Consortium2019TheGO. Training started from a random initialization. The neural network was then fine-tuned on RBPs using data from the ATtRACT database Giudice2016ATtRACTaDO.

Figure 1: Principle of amino acid sequence generation with PRISM
PRISM takes an RNA motif and annotations as input. Based on the initial training on the Swissprot database and the fine-tuning on the ATtRACT database, PRISM generates a specific amino acid sequence.
Based on the annotations, such as GO-Terms, total charge, taxonomy, the distribution of amino acid types and the amino acid distribution, together with the corresponding RNA motif, PRISM is able to generate novel RBP sequences. These are automatically evaluated via BLAST analysis and trRosetta Yang2020ImprovedPS.

Fork us on GitHub!

Preprocessing

To train the neural network we implemented a comprehensive preprocessing pipeline. It prepares the sequences and annotations stored in several databases for use in the neural network.

Databases

Several databases are integrated into the preprocessing pipeline for extracting the sequences and annotations:
  • UniProtKB Reviewed (Swissprot) [Accessed on 13.09.2020, version from 12.08.2020] - We implemented the automatic download and decompression of “uniprot_sprot.dat.gz” (a minimal download sketch follows this list). The UniProt ID, the sequence, the GO-Terms, the taxonomy terms and the Pfam family were extracted and used for the initial training. The sequences of the proteins encoded by the genes from the ATtRACT database were later extracted from Swissprot again.
  • BindingDB [Accessed on 13.09.2020, version from 12.08.2019] – Database of experimentally determined binding affinities between proteins and small, drug-like molecules Gilson2016BindingDBI. - We implemented the automatic download of the file “BindingDB_UniProt.txt”, which maps BindingDB polymer (single protein) IDs to UniProt IDs. The UniProt IDs were extracted from the text file and used for the annotations in the initial training.
  • go-basic.obo [Accessed on 20.09.2020, version: releases/2020-09-10] – Basic version of the Gene Ontology, filtered such that the graph is guaranteed to be acyclic and annotations can be propagated up the graph Consortium2019TheGO Ashburner2000GeneOT. - We implemented the automatic download of go-basic.obo. Further processing was handled with the tool GOATOOLS Klopfenstein2018GOATOOLSAP.
  • ATtRACT database [Accessed on 13.09.2020, Ver 0.99ß] – Database of experimentally validated RBPs and their associated RNA motifs Giudice2016ATtRACTaDO. - We implemented the automatic download and decompression of the ATtRACT database and the position weight matrix. The IDs of the genes coding for an RBP, the RNA motifs, and the quality score, which describes the binding affinity between the RBP domain and the RNA motif, were extracted from the database and are used for fine-tuning.
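
The following is a minimal sketch of how such an automated download and decompression step could look, shown here for the Swissprot flat file. The URL and file names are assumptions for illustration; the actual implementation lives in preprocessing_pipeline.py and may differ.

import gzip
import shutil
import urllib.request

# Public UniProt download location (an assumption for illustration; the
# pipeline may use a different mirror or release).
SPROT_URL = ("https://ftp.uniprot.org/pub/databases/uniprot/current_release/"
             "knowledgebase/complete/uniprot_sprot.dat.gz")

def download_swissprot(url=SPROT_URL,
                       archive="uniprot_sprot.dat.gz",
                       target="uniprot_sprot.dat"):
    """Download the gzipped Swissprot flat file and decompress it to `target`."""
    urllib.request.urlretrieve(url, archive)            # fetch the archive
    with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
        shutil.copyfileobj(fin, fout)                    # stream-decompress to disk
    return target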

Annotation vector creation

The preprocessing pipeline prepares the annotations and sequences for the pre-training and fine-tuning. The annotations are saved in a one-hot encoded annotations vector, which is described further below. The user can choose which annotations should be included in the annotations_vector. The preprocessing pipeline also prepares lists of the GO-Terms and taxonomies as well as the further annotations used in the annotations vector. The following gives an overview of the annotations implemented in the annotations vector:
  • GO-Terms: The GO-Terms included in the annotations vector are listed in this .txt file. A GO-Term is included if it fulfills the following conditions: it does not describe catalytic activity (all sequences with a GO-Term equal to [“GO:0140096”, “GO:0140097”, “GO:0140098”] or one of their descendants were excluded); it appears in more than 1000 proteins in the Swissprot database; if the GO-Term is RBP-related, it has to appear in more than (1000 / #GO-Terms in Swissprot × #GO-Terms for RBPs in Swissprot) proteins in the Swissprot database; its namespace is “biological_process” or “molecular_function”, so all GO-Terms of the “cellular_component” namespace are excluded; and, if the GO-Term depth is below level 6, it does not have a semantic similarity greater than 0.8 to another term in the GO-Terms list. GO-Terms with a semantic similarity greater than 0.8 are represented in the annotations_vector via the semantically similar GO-Term.
  • Taxonomy: The taxonomies included in the annotations_vector are listed in this .txt file. A taxonomy is included if it appears in more than 7000 proteins in the Swissprot database.
  • Total charge: The total charge is calculated for each sequence and normalised via a radial basis function (see the sketch after this list).
  • The prevalence of amino acids ['#A', '#R', '#N', '#D', '#C', '#Q', '#E', '#G', '#H', '#I', '#L', '#K', '#M', '#F', '#P', '#O', '#S', '#T', '#U', '#W', '#Y', '#V', '#B', '#Z', '#X', '#J']. - Normalised via a radial basis function.
  • The prevalence of amino acid types ['#positive', '#negative', '#polar', '#nonpolar', '#hydroxy', '#sulf', '#aromatic', '#amide']. - Normalised via a radial basis function.
  • The RNA-motif, which binds RBP (One-hot encoded.): ['ResPos1-A', 'ResPos1-C', 'ResPos1-G', 'ResPos1-U', 'ResPos2-A', 'ResPos2-C', 'ResPos2-G', 'ResPos2-U', 'ResPos3-A', 'ResPos3-C', 'ResPos3-G', 'ResPos3-U', 'ResPos4-A', 'ResPos4-C', 'ResPos4-G', 'ResPos4-U', 'ResPos5-A', 'ResPos5-C', 'ResPos5-G', 'ResPos5-U', 'ResPos6-A', 'ResPos6-C', 'ResPos6-G', 'ResPos6-U', 'ResPos7-A', 'ResPos7-C', 'ResPos7-G', 'ResPos7-U', 'ResPos8-A', 'ResPos8-C', 'ResPos8-G', 'ResPos8-U', 'ResPos9-A', 'ResPos9-C', 'ResPos9-G', 'ResPos9-U', 'ResPos10-A', 'ResPos10-C', 'ResPos10-G', 'ResPos10-U', 'ResPos11-A', 'ResPos11-C', 'ResPos11-G', 'ResPos11-U', 'ResPos12-A', 'ResPos12-C', 'ResPos12-G', 'ResPos12-U'] – For the initial training the RNA-motif is not carried along. It only is added for the fine-tuning step.
The exact structure and the positions of the annotations in the annotations_vector are described in the file annotations_vector_description.csv (LINK), which is generated when using the pipeline.
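
As an illustration of how the numeric annotations above could be computed, the following is a minimal sketch of the total-charge calculation with radial-basis-function normalisation and of the one-hot encoding of the RNA motif. The per-residue charges, the RBF centre and the RBF width are assumptions for illustration and may differ from the values used in the actual pipeline.

import numpy as np

# Approximate per-residue charges at neutral pH (an assumption for illustration).
RESIDUE_CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0, "H": 0.1}

def total_charge(sequence):
    """Sum the per-residue charges over the amino acid sequence."""
    return sum(RESIDUE_CHARGE.get(aa, 0.0) for aa in sequence)

def rbf_normalise(value, centre=0.0, width=10.0):
    """Map an unbounded value into (0, 1] with a Gaussian radial basis function."""
    return float(np.exp(-((value - centre) ** 2) / (2 * width ** 2)))

def one_hot_rna_motif(motif, max_len=12):
    """One-hot encode an RNA motif over 4 bases x 12 positions (48 entries)."""
    bases = "ACGU"
    vector = np.zeros(4 * max_len)
    for position, base in enumerate(motif[:max_len]):
        vector[4 * position + bases.index(base)] = 1.0
    return vector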

Sequence and Mask creation

The sequences are padded to a length of 512 amino acids. For sequences shorter than 512 amino acids a mask is prepared by the preprocessing pipeline.
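
A minimal sketch of this padding and mask step, assuming an integer-encoded sequence and a padding id of 0 (the actual token ids used by the pipeline may differ):

import numpy as np

MAX_LEN = 512  # sequences are padded or truncated to 512 amino acids

def pad_and_mask(token_ids, pad_id=0):
    """Pad an integer-encoded sequence to MAX_LEN and build a 0/1 padding mask."""
    ids = np.full(MAX_LEN, pad_id, dtype=np.int64)
    mask = np.zeros(MAX_LEN, dtype=np.float32)
    length = min(len(token_ids), MAX_LEN)
    ids[:length] = token_ids[:length]
    mask[:length] = 1.0  # 1 marks real residues, 0 marks padding
    return ids, mask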

Data availability

The file preprocessing_pipeline.py gives the user the option to redo the data preparation with a newer dataset and/or to modify it. The preprocessing pipeline also offers the option to load the pre-calculated GO-Terms and taxonomies lists used for the generation of the annotations vectors. Due to its size, the original data for pretraining prepared for the neural network is made available via Zenodo:

Download data for pretraining from zenodo!

Model architecture

A language model learns the probability distribution $p(x)$ of a given sequence $x = (x_{1}, ..., x_{n})$. The model only recognizes tokens from a vocabulary that is fixed at the beginning. The probability distribution can be described as a product of conditional probability distributions: $$p(x) = \prod_{i = 1}^{n} p(x_{i} \vert x_{\lt i})$$ For controllable protein generation we prepended the protein sequence $x = (x_{1}, ..., x_{n})$ with annotations $a = (a_{1}, ..., a_{m})$ to yield $y = (x, a)$. The probability distribution over each sequence and its annotations was learned by our model.

The model was based on CTRL, which was introduced by Keskar et al. The transformer was modified by substituting the LayerNorm with the ReZero initialization method. ReZero was implemented to increase the convergence speed of our transformer and to improve the model performance during training Bachlechner2020ReZeroIA. We also added a final prediction layer of size (number of amino acid tokens) × (model size) to calculate the loss using teacher forcing. The protein sequence was first embedded and a positional encoding was added to the embedding. The embedded sequence was then concatenated with its annotations and forwarded to the encoder. Each encoder layer consists of a multi-head attention block and a fully connected feed-forward neural network, connected through residual connections. Our model has the model dimension $d_{model} = 512$, inner dimension $d_{ff} = 512$, number of encoder layers $num\_layers = 9$, number of heads $num\_heads = 8$ and dropout rate $rate = 0.1$. We used a masked cross-entropy loss for the computation of the loss. We applied the Adam optimizer with betas = (0.9, 0.98) and scheduled our learning rate as proposed by Madani et al., starting with 4000 warmup steps. The model was implemented in PyTorch.
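
To make the ReZero modification concrete, the following is a minimal PyTorch sketch of a single encoder layer in which LayerNorm is replaced by a learned residual weight initialised to zero, using the hyperparameters stated above. It is an illustration only; the actual implementation in our repository may structure the layer differently (e.g. the attention masking and the placement of the gate).

import torch
import torch.nn as nn

class ReZeroEncoderLayer(nn.Module):
    """Encoder layer with ReZero residual connections instead of LayerNorm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=512, rate=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=rate, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(rate)
        self.alpha = nn.Parameter(torch.zeros(1))  # ReZero gate, initialised to zero

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attention(x, x, x, attn_mask=attn_mask)
        x = x + self.alpha * self.dropout(attn_out)               # gated residual
        x = x + self.alpha * self.dropout(self.feed_forward(x))   # gated residual
        return x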

Generation of new sequences

After calling pretraining.py and finetuning.py for the initial training and fine-tuning, or after loading the final model, the user may generate RBP sequences via the file user_generator.py or the file random_generator.py. The file user_generator.py parses the RNA motif, which should be bound by the generated RBP sequence, as well as the annotations for the annotations_vector. An example call of user_generator.py is given here:
python user_generator.py --rna_motif "ACGUGU" --annotations "#A-004, #TOTALCHARGE-122, #POLAR-145, GO:0000001, TAX:Homo, LIGAND-Y"
For text generation the annotations vector is created from the user input. At least one annotation of each category (RNA motif, GO-Terms, taxonomy, amino acid distribution, amino acid types and ligand binding) must be entered. When choosing random_generator.py, a random annotations vector is generated. random_generator.py can be called as follows:
python random_generator.py --number_annotation_vector
The user may input the number of annotation vectors, which are then generated and run through PRISM. The random generator takes into account the frequency of the annotation types in the annotation vectors of the training dataset as well as the prevalence of single GO-Terms in the training dataset. Ligand binding is set to true by default. The RNA motif is generated completely at random except for the motif length. The prevalences of the annotations are already pre-calculated. If the user applies a different dataset, the pre-calculated files imported into random_generator.py need to be removed; the prevalences are then recalculated automatically on first initialization. Please note that this calculation may need to be adapted to the specific dataset. The trained model receives the generated annotation vector together with a masked sequence, so that each amino acid in the sequence is generated by the learned weights and parameters under consideration of the amino acids already generated. The sequence is saved by user_generator.py or random_generator.py, respectively, as a FASTA file. The sampling of the amino acids from the output is implemented via a temperature-controlled greedy algorithm: the logits are scaled by a temperature before applying a softmax. $$p_{\text{greedy}} = \mathrm{softmax}\left(\frac{\text{logits}}{\text{temperature}}\right)$$ The resulting probability distribution is then used to sample the tokens with the highest probability.
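
A minimal sketch of the temperature-controlled greedy sampling described above (the function name is illustrative, not the one used in user_generator.py):

import torch

def sample_next_token(logits, temperature=1.0):
    """Scale the logits by the temperature, apply softmax and sample the next token id."""
    if temperature <= 0:  # degenerate case: plain greedy argmax
        return int(torch.argmax(logits))
    probabilities = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probabilities, num_samples=1))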

Discussion

First, the training of the model based on CTRL and that of the model based on ReZero need to be compared. Both were trained with a batch size of 128 and 4000 warmup steps for 16 epochs. As can be seen in Figure 2 and Figure 3, for both models the training loss did not decline as much as we had hoped. As the loss function we used the cross-entropy loss. The loss declines more strongly for the ReZero model, although its loss curve is relatively unstable. A certain instability also applies to the CTRL model.

Figure 2: Training of the proposed model based on the CTRL model.
Compared to the ReZero model, the CTRL model is considerably more unstable.


Compared to the CTRL model, the ReZero model performs better, as its loss declines.

Figure 3: Training of the proposed model based on the ReZero model.
Overall, the ReZero model performs better than the CTRL model.


Machine learning on protein sequences is a daunting task and normally requires a lot of training. Both models would have benefited from longer training. Compared to language models published in scientific papers, our training time is significantly shorter Madani2020ProGenLM. We therefore expect a significant improvement in the performance of the ReZero model when it is trained longer. After pretraining we applied fine-tuning for 20 epochs with a batch size of 64 and 100 warmup steps, as can be seen in Figure 4:

Figure 4: Fine-tuning of the proposed model based on the ReZero model.
Fine-tuning may lead to overfitting due to the limited optimization achieved in the pretraining step.


Fine-tuning improves the performance of the model. Due to the limited optimization achieved in the pretraining step, the fine-tuning step may have led to overfitting, which would need to be investigated further.

In total, we generated 100 sequences with randomly initialized RNA motifs and annotations using the proposed model after pretraining and fine-tuning. The generated sequences reflect the relatively low performance of the model: the prevalence of certain amino acids in a generated sequence can be high (up to 80 %). When comparing models from early and later stages of pretraining, the prevalence of the amino acids in the sequences begins to balance out. Longer training would be needed to reach better generation results. Interestingly, in the majority of the generated sequences the model predicted methionine as the first amino acid. This demonstrates that the model has learnt to generalize the high prevalence of methionine at the beginning of the amino acid sequences in the Swissprot database.

The generated sequences could be built individually with 3DOC and trRosetta. Furthermore, the sequences could potentially be checked with SafetyNet, which was introduced by the iGEM team Heidelberg 2017. SafetyNet checks whether protein sequences are potentially harmful and is based on the model DeeProtein, which was also introduced by the iGEM team Heidelberg 2017. Since our protein model did not always generate a protein sequence fulfilling all criteria of a valid protein sequence, this analysis was not undertaken.

References