Team:Heidelberg/Software/3DOC

3DOC
Modular Protein Design in 3D

Abstract

We introduce 3DOC (3D-domain concatenation), a pipeline that creates Protein Data Bank (PDB) files from concatenated protein sequences, which can be joined by amino acid linkers. The pipeline supports the design of fusion protein-binding domains and fusion proteins. For instance, it can be used to create Pumby and PPR protein sequences and PDB files, which can be linked modularly to bind specific RNAs Adamala2016ProgrammableRP Coquille2014AnAP. The PDB file creation pipeline uses BLASTp (https://blast.ncbi.nlm.nih.gov/Blast.cgi) together with PyRosetta or trRosetta to generate a PDB file for each given protein sequence Chaudhury2010PyRosettaAS. If several sequences are provided, the resulting structures are fused or connected automatically with the supplied amino acid linkers via PyRosetta. We also offer two scripts, "rna_denovo_preparation.py" and "rnp_structure_prediction_preparation.py", which prepare the output of 3DOC for the RNA de novo (FARFAR) and RNP structure prediction protocols in Rosetta Das2010AtomicAI Cheng2015ModelingCR Kappel2019SamplingNS.

Figure 1: Structure of 3DOC
3DOC is a pipeline that generates 3D structures for the input protein sequences, using either trRosetta or manual generation via PyRosetta. The 3D structures are automatically concatenated and prepared for protein-RNA complex modeling.

Preprocessing Pipeline

We present the Geneious Prime workflow “3DOC” as well as instructions for a manual pipeline, both of which prepare DNA sequence files for the PDB file creation pipeline. We also offer a library of Pumby and PPR modules and of common amino acid linkers for optional use in the Geneious Prime workflow or the manual pipeline. The files can be imported into Geneious Prime via “File (in the menu) → Import → From File”.

Preparation for PDB file creation pipeline

The Geneious Prime workflow is provided in the file "workflow_3DOC_preprocessing.geneiousWorkflow" and needs to be imported into Geneious Prime via “Tools (in the menu) → Workflows → Manage Workflows → Import”. The DNA sequence files need to be prepared in Geneious Prime before the workflow is run. The same preparation steps, plus a few additional ones, apply to the manual preparation of sequence files:
  1. Use the folder “3DOC” provided on GitHub, either within Geneious Prime or as a ZIP file for manual preparation. Copy all DNA sequence files that you want to process with this pipeline into the folder "3DOC". You may also copy Pumby (Pumilio) and PPR protein modules from their corresponding folders.
  2. Rename the DNA sequence documents so that they all follow the format "X sequence_name", where X is the position number described in step 3 and the words of the name are joined by underscores.
  3. Number the sequences in ascending order from the first to the last sequence of the RBP-binding fusion domain or fusion protein. Mark the first protein by appending "- begin" to the name of its DNA sequence document. See the example:
    "1 protein_2 - begin"
    "2 protein_1"
    "3 protein_3"
    
  4. Protein linkers can be copied from the folder "Linker" or added to Geneious Prime yourself. They are numbered according to the convention below to show which sequences they connect. The word "linker" must appear in the document name. The smaller protein sequence number is written first, followed by a dash and the larger protein sequence number. Only linkers between proteins that are direct neighbours are allowed. See the example:
    "1-2 linker_1"
    
  5. Removal of the methionine start codon at the beginning of every sequence except the first.
    • Geneious Prime: The folder "Trimming" is required to trim the methionine codon from the beginning of all DNA sequences except the first one. We cannot guarantee correct functioning if the folder is modified.
    • Manual preparation: Delete the bases “ATG” (which are translated to methionine) from the 5’ end of every sequence except the first one, if present. Delete all stop codons at the end of every sequence except the last one. They may occur in sequences from the iGEM Registry and SynBioHub.
  6. Output
    • Geneious Prime: The workflow outputs the translation of the sequences directly to a folder, which can be chosen by the user. It also outputs the concatenated DNA sequence.
    • Manual preparation: You may translate the sequences separately with another tool (e.g. EMBOSS by EMBL) and concatenate the DNA-sequences manually.
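Steps 2 to 5 of the manual preparation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of the Geneious Prime workflow: the function names parse_name, trim and concatenate are hypothetical, and the input is assumed to be a dictionary that maps document names to DNA sequences.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def parse_name(name):
    """Return a sort key for names such as '1 protein_2 - begin' or '1-2 linker_1'.

    Proteins sort as (N, 0) and the linker 'N-M' as (N, 1), so a linker
    lands directly after the protein with the smaller number.
    """
    linker = re.match(r"^(\d+)-(\d+) \S*linker\S*$", name)
    if linker:
        left, right = int(linker.group(1)), int(linker.group(2))
        if right != left + 1:
            raise ValueError(f"linker must join direct neighbours: {name!r}")
        return (left, 1)
    protein = re.match(r"^(\d+) \S+( - begin)?$", name)
    if protein:
        return (int(protein.group(1)), 0)
    raise ValueError(f"unexpected document name: {name!r}")


def trim(dna, is_first, is_last):
    """Drop a leading ATG (except first) and trailing stop codons (except last)."""
    dna = dna.upper()
    if not is_first and dna.startswith("ATG"):
        dna = dna[3:]
    while not is_last and len(dna) >= 3 and dna[-3:] in STOP_CODONS:
        dna = dna[:-3]
    return dna


def concatenate(documents):
    """Order the documents by their numbering, trim them and join the DNA."""
    ordered = sorted(documents, key=parse_name)
    last = len(ordered) - 1
    return "".join(
        trim(documents[name], index == 0, index == last)
        for index, name in enumerate(ordered)
    )


documents = {
    "2 protein_1": "ATGAAATAA",
    "1 protein_2 - begin": "ATGCCCTGA",
    "1-2 linker_1": "GGTGGT",
    "3 protein_3": "ATGTTTTAG",
}
fused = concatenate(documents)
print(fused)
```

The sketch keeps the start codon of the first sequence and the stop codon of the last one, so the concatenated DNA still encodes one continuous open reading frame.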

Library of Pumby (Pumilio) and PPR Modules and Amino Acid Linkers

Both the Pumby (Pumilio) and the PPR modules in the folders "Pumby (Pumilio) Modules" and "PPR Modules" can be used with the Geneious Prime workflow as well as the manual pipeline. Both types of protein modules can be combined in a modular way, and each module binds specifically to a certain nucleotide. The Pumby (Pumilio) modules are based on a consensus sequence of the Pumilio family. When using the Pumby modules, please refer to the article “Programmable RNA-binding protein composed of repeats of a single modular unit”, published in 2016 in the journal PNAS Adamala2016ProgrammableRP. Please note that, according to the authors, a sequence of Pumby modules is supposed to start with the unit “X pumby_module_start_NUCLEOTIDE” and to end with the unit “X pumby_module_end_NUCLEOTIDE”. In accordance with the experiments presented in the paper, the pipeline informs the user if the “start” and “end” modules are in the wrong position or if the number of modules is wrong (sequence length: 6, 10, 12, 18 or 24). The PPR modules are split into two groups:
  • “X cPPR-polyNUCLEOTIDE” group, which is based on a consensus PPR protein family motif. When using this group of PPR modules, please refer to the article “An artificial PPR scaffold for programmable RNA recognition”, published in 2014 in the journal Nature Communications Coquille2014AnAP.
  • “X ppr10_repeat6/7_NUCLEOTIDE” group, which is based on single amino acid substitutions in repeat 6 or 7 of the PPR10 protein. When using this group of PPR modules, please refer to the article “A combinatorial amino acid code for RNA recognition by pentatricopeptide repeat proteins”, published in 2012 in the journal PLoS Genetics Barkan2012ACA.
We also provide amino acid linkers described in the literature in the folder "Linker" to connect protein domains or fusion protein parts with each other Chen2013FusionPL.
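The Pumby check described above can be illustrated with a short Python sketch. It is not taken from the 3DOC source code: the function name check_pumby_chain is hypothetical, the names of the inner modules ("X pumby_module_NUCLEOTIDE") are an assumption, and the allowed sequence lengths are read here as the number of modules in the chain.

```python
ALLOWED_MODULE_COUNTS = {6, 10, 12, 18, 24}


def check_pumby_chain(names):
    """Return a list of problems found in an ordered list of Pumby module names."""
    problems = []
    if len(names) not in ALLOWED_MODULE_COUNTS:
        problems.append(
            f"{len(names)} modules; expected one of {sorted(ALLOWED_MODULE_COUNTS)}"
        )
    # Positions at which start and end modules occur in the chain.
    starts = [i for i, name in enumerate(names) if "pumby_module_start_" in name]
    ends = [i for i, name in enumerate(names) if "pumby_module_end_" in name]
    if starts != [0]:
        problems.append("exactly one start module is expected, in first position")
    if ends != [len(names) - 1]:
        problems.append("exactly one end module is expected, in last position")
    return problems


# Hypothetical six-module chain; the inner module names are assumed.
chain = (
    ["1 pumby_module_start_A"]
    + [f"{i} pumby_module_C" for i in range(2, 6)]
    + ["6 pumby_module_end_G"]
)
print(check_pumby_chain(chain))
print(check_pumby_chain(chain[:5]))
```

Returning a list of messages instead of raising an error mirrors the behaviour described above, where the pipeline informs the user rather than aborting.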

Workflow of 3DOC

3DOC can be called via output_processing.py:
python output_processing.py path program trRosetta_recurrences trRosetta_mode

:param path: "directory, which contains FASTA files for processing", type = str
:param program: '3DOC+trRosetta' to process the output of the manual preparation or of the Geneious Prime workflow "3DOC" (concatenation of RBP-binding domains or fusion proteins), or 'trRosetta' to generate the structures with the neural network only, type = str
:param trRosetta_recurrences: number of trRosetta runs, recommended: '3', max: '5', type = int
:param trRosetta_mode: 'best_energy_model' or 'user_choice', type = str
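One way to read the two trRosetta parameters is sketched below in Python. The function select_models and the dictionary of model energies are hypothetical and only illustrate the idea that several trRosetta runs yield several models, of which either the lowest-energy one is returned or all are offered to the user. How 3DOC stores and compares the energies internally is not documented here.

```python
def select_models(energies, mode):
    """Pick models from repeated trRosetta runs.

    energies: dict mapping a model file name to its energy (lower is better).
    mode: 'best_energy_model' returns only the lowest-energy model,
          'user_choice' returns all models ranked by energy.
    """
    if not 1 <= len(energies) <= 5:
        raise ValueError("between 1 and 5 trRosetta runs are expected")
    ranked = sorted(energies, key=energies.get)
    if mode == "best_energy_model":
        return ranked[:1]
    if mode == "user_choice":
        return ranked
    raise ValueError(f"unknown trRosetta_mode: {mode!r}")


# Hypothetical energies from three runs (trRosetta_recurrences = 3).
energies = {"model_1.pdb": -210.5, "model_2.pdb": -245.1, "model_3.pdb": -230.0}
print(select_models(energies, "best_energy_model"))
print(select_models(energies, "user_choice"))
```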
When 3DOC+trRosetta is chosen as program, the algorithm loads the protein sequences and automatically performs a BLAST search against the RCSB database for each protein sequence. If the search returns one or more RCSB entries that contain a high-scoring pair fragment (HSPfragment) in which not more than 10% of all residues of the sequence are mutated, with an e-value below 0.05, the PDB file of the best-fitting entry is downloaded and automatically mutated, repacked and relaxed via PyRosetta. If the BLAST search is not successful, the program automatically switches to trRosetta.

When trRosetta is chosen as program, the PDB file for the sequence is generated with the tool trRosetta. For installation and documentation of trRosetta and the PyRosetta scripts used by trRosetta, please refer to their GitHub repository and website (https://github.com/gjoni/trRosetta and https://yanglab.nankai.edu.cn/trRosetta/download/). It is recommended to run the neural network behind trRosetta several times to be able to choose an energy-optimized PDB file. The parameter trRosetta_recurrences sets the number of iterations, and trRosetta_mode determines whether the energy-optimal model is returned or whether the user chooses the fitting model themselves.

The multiple sequence alignment that trRosetta requires is generated with hhblits from HH-suite3 Steinegger2019HHsuite3FF. To work with 3DOC, trRosetta must be fully installed, including the model and the PyRosetta scripts, as a subdirectory of 3DOC. hhblits must be installed as well.
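The template criterion described above (at most 10% mutated residues, e-value below 0.05, otherwise fall back to trRosetta) can be sketched as follows. The BlastHit class and both functions are hypothetical and not part of 3DOC or of any BLAST library. Two assumptions are made: every query residue that is not identical in the alignment counts as mutated, and "best-fitting" is interpreted as the highest bit score.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlastHit:
    pdb_id: str
    identities: int      # identical residues in the high-scoring pair
    query_length: int    # length of the full query sequence
    evalue: float
    bit_score: float


def is_acceptable(hit, max_mutated_fraction=0.10, max_evalue=0.05):
    """Apply the two thresholds named in the text to a single hit."""
    mutated = hit.query_length - hit.identities
    return (
        hit.evalue < max_evalue
        and mutated <= max_mutated_fraction * hit.query_length
    )


def choose_template(hits: List[BlastHit]) -> Optional[BlastHit]:
    """Return the best acceptable hit, or None to signal the trRosetta fallback."""
    candidates = [hit for hit in hits if is_acceptable(hit)]
    if not candidates:
        return None
    return max(candidates, key=lambda hit: hit.bit_score)


good = BlastHit("1ABC", identities=96, query_length=100, evalue=1e-40, bit_score=180.0)
weak = BlastHit("2XYZ", identities=85, query_length=100, evalue=1e-20, bit_score=120.0)
late = BlastHit("3DEF", identities=99, query_length=100, evalue=0.2, bit_score=50.0)

template = choose_template([weak, good, late])
print(template.pdb_id if template else "no template, use trRosetta")
```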

RNP-Structure Modelling

Preparation scripts for FARFAR and RNP-Structure prediction

The scripts "rna_denovo_preparation.py" and "rnp_structure_prediction_preparation.py" prepare the output files of 3DOC and the bound RNA motif for the FARFAR and RNP-structure prediction protocols. The preparation for FARFAR can be called via the script rna_denovo_preparation.py:
python rna_denovo_preparation.py rna_motif

:param rna_motif: "RNA-motif bound by protein", type = str
The script outputs the secondary structure of the RNA motif, which is calculated via ViennaRNA and can be passed to the FARFAR protocol Lorenz2011ViennaRNAP2. The FARFAR protocol outputs a silent file or a PDB file of the RNA. The preparation for RNP-structure prediction can be called via the script rnp_structure_prediction_preparation.py:
python rnp_structure_prediction_preparation.py secstruct_rna pdb_protein pdb_rna

:param secstruct_rna: "Secondary structure of RNA modelled in RNP-Complex.", type = str, optional
:param pdb_protein: "Path to the PDB of Protein, which is modelled in RNP-Complex.", type = str
:param pdb_rna: "Path to the PDB of RNA, which is modelled in RNP-Complex.", type = str
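ViennaRNA reports secondary structures in dot-bracket notation, where matching parentheses mark base pairs and dots mark unpaired nucleotides. Assuming that secstruct_rna is passed in this notation, a small sanity check before handing a structure to the script could look like the sketch below. The function check_dot_bracket is hypothetical and not part of 3DOC or ViennaRNA, and it ignores pseudoknot symbols.

```python
def check_dot_bracket(sequence, structure):
    """Validate a dot-bracket string and return its base pairs as index tuples."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    open_positions = []
    pairs = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(index)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {index}")
            pairs.append((open_positions.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {index}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


pairs = check_dot_bracket("GGGAAACCC", "(((...)))")
print(pairs)
```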
Providing the secondary structure of the RNA is optional. If it is not provided, it is calculated via ViennaRNA. The script outputs a FASTA file with the combined protein and RNA sequence as well as a text file with the combined secondary structures of the protein and the RNA. It also produces a combined PDB file of the unbound protein and RNA. The FARFAR protocol and the RNP-structure prediction protocol can be called with the following two commands, respectively:
./rosetta_bin_linux_2020.08.61146_bundle/main/source/bin/rna_denovo.static.linuxgccrelease -sequence *RNA_SEQUENCE* -secstruct *SECSTRUCT_RNA* -nstruct 10 -out:pdb -minimize_rna

./rosetta_bin_linux_2020.08.61146_bundle/main/source/bin/rna_denovo.static.linuxgccrelease -in:file:fasta ./3DOC/rnp_prediction/protein_sequence_concatenated_with_rna.txt -secstruct_file ./3DOC/rnp_prediction/protein_rna_secstruct.txt -s ./3DOC/rnp_prediction/unbound_protein_and_RNA.pdb -minimize_rna false -nstruct 10 -out::file::silent_struct_type pdb
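When the two calls above are launched from Python, it can help to assemble them as argument lists. The sketch below only rebuilds the commands shown above with the same flags and paths. The function names farfar_command and rnp_command are hypothetical, and the Rosetta bundle path is assumed to match the local installation.

```python
import shlex

ROSETTA_RNA_DENOVO = (
    "./rosetta_bin_linux_2020.08.61146_bundle/main/source/bin/"
    "rna_denovo.static.linuxgccrelease"
)


def farfar_command(rna_sequence, rna_secstruct, nstruct=10):
    """Argument list for the FARFAR call on an RNA sequence and its structure."""
    return [
        ROSETTA_RNA_DENOVO,
        "-sequence", rna_sequence,
        "-secstruct", rna_secstruct,
        "-nstruct", str(nstruct),
        "-out:pdb",
        "-minimize_rna",
    ]


def rnp_command(workdir="./3DOC/rnp_prediction", nstruct=10):
    """Argument list for the RNP-structure prediction call on the prepared files."""
    return [
        ROSETTA_RNA_DENOVO,
        "-in:file:fasta", f"{workdir}/protein_sequence_concatenated_with_rna.txt",
        "-secstruct_file", f"{workdir}/protein_rna_secstruct.txt",
        "-s", f"{workdir}/unbound_protein_and_RNA.pdb",
        "-minimize_rna", "false",
        "-nstruct", str(nstruct),
        "-out::file::silent_struct_type", "pdb",
    ]


farfar = farfar_command("gggaaaccc", "(((...)))")
rnp = rnp_command()
# shlex.join quotes the parentheses so the line can be pasted into a shell.
print(shlex.join(farfar))
print(shlex.join(rnp))
```

Passing such a list to subprocess.run without shell=True avoids the shell interpreting the parentheses of the dot-bracket string. On the command line, the secondary structure has to be quoted for the same reason.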

References