Team:UESTC-Software/Poster - 2020.igem.org

UESTC-Software:CPD3DS

Classification of Protein Domains in 3D Shape to design a standard set of protein bricks

Presented by Team UESTC-Software 2020

Bin Wang¹, Nan Mo¹, Zhaochang Yang¹, Mingkang Liu¹, Yao Mei¹, Jingqi Han¹, Huiyu Xiong¹, Tao Zhou¹, Dan Qiao¹, Fan Gao¹, HaoRan Dong¹, Zhuochao Min¹, Beibei Wang², Department of Life Science and Technology^§

[¹iGEM Student Team Member, ²iGEM Team Primary PI, ^§Faculty Sponsor, Department of University of Electronic Science and Technology of China, Chengdu, Sichuan, China]

Abstract

Proteins are responsible for most physiological functions in cells, and many synthetic biologists focus on designing customized proteins on demand. Today, protein design usually starts from a new amino acid sequence, which creates a very large workload. At the same time, the structure of a protein is often closely related to its function. Our project, CPD3DS, uses structural domains directly as the basic unit for analyzing protein structures. To obtain a set of protein bricks, we classified domains by their shape features, using 3D Zernike descriptors to cluster all domains in existing databases. We developed a user-friendly website, also named CPD3DS, for retrieval, analysis and visualization of the classified protein domains. In addition, we printed a set of our protein bricks for teaching and popularizing synthetic biology.

Introduction (Project Goals)

Based on many research findings, we speculate that domains can be used in protein design under certain conditions. We hope to build a domain parts library similar to BioBrick: whereas BioBrick works at the level of proteins, CPD3DS works at the level of domains.

1. Domain-based design

At present, most protein design studies design and synthesize proteins from scratch, which involves a large workload. As a region conserved during protein evolution, the domain is closely related to the structure and function of the protein. We hope to construct a domain parts library that provides the necessary elements, functional relationships and other information for subsequent domain-based protein design and synthesis.

2. Shape clustering

We classified the domains based on their shape, report the clustering results, and further processed those results.

3. Block model building

We also designed a series of solid block models. Through these building-block models, we can further promote science popularization and expand the project's influence.

4. Website Tools

All the domains in our database have been clustered by shape. The domain database CPD3DS can be queried by domain ID, domain category, domain-related function and domain combination probability.

Inspiration

Our design ideas derive from the assembly of physical bricks and from the component design of BioBrick. At present, studies of shape across a large number of domains are still rare; most work compares the shapes of only a few sequences or domains.
However, this traditional approach makes it hard to screen for and obtain large numbers of domains with similar shapes. In domain recombination, domains with similar shapes should be easier to recombine to obtain proteins with specific functions.

Engineering

Extract & Cluster

Feature extraction

1. Voxelization of domain.

2. 3D Zernike transformation: the 3DZD program takes the cubic grid as input and generates the 3D Zernike descriptors (3DZDs), a vector of 121 invariants.
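
The two steps above can be illustrated with a short Python sketch. It is not the 3DZD program itself: the function names `voxelize` and `count_3dzd_invariants`, the grid size of 32, and the simplification of marking each input point (for example an atom coordinate) as one occupied voxel are all assumptions made for illustration. The count uses the usual 3DZD indexing, in which there is one invariant per pair (n, l) with l ≤ n and n − l even, so a maximum order of 20 gives the 121 invariants mentioned above.

```python
def voxelize(coords, grid_size=32):
    """Map 3D points onto a cubic occupancy grid (1 = occupied, 0 = empty).

    Hypothetical helper: real pipelines usually voxelize a molecular surface,
    not bare points.
    """
    if not coords:
        raise ValueError("coords must not be empty")
    mins = [min(p[i] for p in coords) for i in range(3)]
    maxs = [max(p[i] for p in coords) for i in range(3)]
    # A single scale for all three axes keeps the shape undistorted in the cube.
    extent = max(maxs[i] - mins[i] for i in range(3)) or 1.0
    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for p in coords:
        # Clamp so that points on the upper boundary fall into the last voxel.
        idx = [
            min(int((p[i] - mins[i]) / extent * grid_size), grid_size - 1)
            for i in range(3)
        ]
        grid[idx[0]][idx[1]][idx[2]] = 1
    return grid


def count_3dzd_invariants(max_order=20):
    """Count (n, l) pairs with 0 <= l <= n <= max_order and n - l even."""
    return sum(
        1
        for n in range(max_order + 1)
        for l in range(n + 1)
        if (n - l) % 2 == 0
    )
```

Calling `count_3dzd_invariants(20)` reproduces the length of the descriptor vector, which is a quick sanity check on any descriptor file before clustering.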

K-Means

K-Means is an unsupervised clustering algorithm. We use it to cluster domain shapes after feature extraction.
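
For readers unfamiliar with K-Means, here is a minimal pure-Python sketch of the standard algorithm (Lloyd's iteration) applied to feature vectors. It is an illustration, not the team's code: the function names, the random initialization from the data points, and the fixed seed are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (a list of numeric tuples) into k groups.

    Returns (labels, centroids). Initial centroids are k distinct data
    points drawn with a seeded generator, so runs are reproducible.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In practice each point would be one domain's descriptor vector, and a library implementation would normally be preferred for a large data set.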

Elbow rule

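The elbow rule is used to choose the number of clusters k for K-Means: the within-cluster sum of squared errors is computed for a range of k values, and k is taken at the point where the curve stops dropping sharply.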
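
The elbow rule named in this heading can be sketched as follows. The within-cluster sum of squared errors (SSE) is computed for several values of k, and the elbow is taken where the decrease flattens most abruptly. The function names and the use of the largest second difference as the elbow criterion are assumptions for illustration; other criteria are also common, and this one presumes evenly spaced k values.

```python
def within_cluster_sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )


def elbow_k(ks, sse_values):
    """Return the k at which the SSE curve bends most sharply.

    Uses the discrete second difference sse[i-1] - 2*sse[i] + sse[i+1],
    which is largest where a steep drop is followed by a flat stretch.
    """
    if len(ks) != len(sse_values) or len(ks) < 3:
        raise ValueError("need at least three (k, SSE) pairs of equal length")
    best_index = max(
        range(1, len(ks) - 1),
        key=lambda i: sse_values[i - 1] - 2 * sse_values[i] + sse_values[i + 1],
    )
    return ks[best_index]
```

The SSE values would come from running the clustering once per candidate k. The numbers used to try the function out should be treated as made up, not as results of this project.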

Data acquisition

Database integration

CPD3DS integrates five databases (PDB, CATH, SCOP, SCOP2 and UniProt) as the basis for domain extraction and to provide users with as much information as possible, including sequence information and functions.

De-redundancy

We selected the domain data recorded in CATH, SCOP and SCOP2, but the domain classification criteria of these databases differ slightly. The integrated domain data therefore has a high degree of redundancy. We used the CD-HIT tool to remove redundant domains, so that the sequence similarity between all domains in the database is below 80%.
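
A de-redundancy step of this kind could be driven from Python roughly as shown below. This is a sketch under assumptions, not the team's actual pipeline: the file names are hypothetical, and the word length of 5 is the value the CD-HIT user guide recommends for identity thresholds between 0.7 and 1.0. The options `-i`, `-o`, `-c` and `-n` are CD-HIT's input file, output file, sequence identity threshold and word length.

```python
import shutil
import subprocess


def build_cd_hit_command(input_fasta, output_fasta, identity=0.8, word_length=5):
    """Build the argument list for clustering at the given identity threshold."""
    return [
        "cd-hit",
        "-i", input_fasta,      # input FASTA with all integrated domain sequences
        "-o", output_fasta,     # output FASTA with one representative per cluster
        "-c", str(identity),    # sequence identity threshold (0.8 = 80%)
        "-n", str(word_length), # word length suited to thresholds of 0.7 to 1.0
    ]


def run_cd_hit(input_fasta, output_fasta, identity=0.8):
    """Run CD-HIT if it is installed; raise a clear error otherwise."""
    if shutil.which("cd-hit") is None:
        raise RuntimeError("cd-hit executable not found on PATH")
    subprocess.run(build_cd_hit_command(input_fasta, output_fasta, identity), check=True)
```

For example, `run_cd_hit("all_domains.fasta", "domains_nr80.fasta")` would write the representative sequences to the second file; both names are placeholders.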

Results (Our website)

Home Page

This is the front page of our website. The first thing you see is the building blocks that we designed from protein domain prototypes. From the two columns of links, you can visit our Wiki, Download page, Model page, Education page and Style Migration page.

Tutorial

On our Tutorial page, you can see an overview of our project. The long paragraph gives the detailed background: it covers the project overview, project functions, project design, and how to use the search function. Our technology stack uses Vue, and a brief introduction to Vue is included here. Finally, the page lists the references for our project.

Domain List

The Domain List page is very simple: click the SHOW ALL button in the center and you will see the data for all domains.

Standard

On our Standard Class page, you can search by either domain name or domain class, for example the domain name 2p4vA02 or the domain class 11.

Search

On our Domain Search page, we provide three search methods. The first is by domain class: for example, if we type in 12, all the domains in class 12 are listed in the table below. The second is by domain name: after entering a name such as 101mA00, we can see its class and source link in the table below. The third is by function: for instance, to find out which proteins have an RNA-binding function, we search for RNA binding. After a click, the page lists the domains that belong to proteins with the RNA-binding function.

Download

Finally, on the Download page, we provide our database for download so that you can explore it further.

Expanding contents

Model making

Using models can not only benefit synthetic biology education, but also provide inspiration for new protein designs.

Functional analysis

Functional tags for proteins in the GO database come from experiments and the literature, which means that these tags are highly reliable. We counted how often each functional tag occurs in each class and visualized the functional features of the class as a word cloud, taking class 1 as an example.

In this way, domains related to the required functions can be found when we want to design a protein or change certain of its properties and functions.
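
The per-class tag statistics described above can be sketched in a few lines of Python. The input layout (pairs of a class ID and a list of GO tags) and the function name are assumptions for illustration; the resulting relative frequencies are the kind of weights a word cloud would use.

```python
from collections import Counter, defaultdict


def tag_frequencies_by_class(domains):
    """Compute the relative frequency of each functional tag within each class.

    `domains` is an iterable of (class_id, tags) pairs, where `tags` is the
    list of functional tags attached to one domain's parent protein.
    Returns {class_id: {tag: frequency}}, with frequencies summing to 1 per class.
    """
    counts = defaultdict(Counter)
    for class_id, tags in domains:
        counts[class_id].update(tags)
    frequencies = {}
    for class_id, counter in counts.items():
        total = sum(counter.values())
        frequencies[class_id] = {tag: n / total for tag, n in counter.items()}
    return frequencies
```

Sorting one class's dictionary by value then gives the most characteristic functions of that shape class.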

Combining probability

In fact, the combination between domains is not arbitrary. Therefore, a statistical analysis was conducted on the combination probability of different domain classes.

As shown in the figure above, the darker the color, the more likely the two domain classes are to combine.
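
One way such a combination statistic could be computed is sketched below. The poster does not define the exact measure, so this is an assumption: every unordered pair of domain classes that occurs together in the same protein is counted, and the counts are normalized by the total number of pairs. The function name and input layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def class_pair_probabilities(proteins):
    """Estimate how often two domain classes occur in the same protein.

    `proteins` is an iterable of lists, each holding the class IDs of the
    domains of one protein. Returns {(class_a, class_b): probability} with
    class_a <= class_b; single-domain proteins contribute no pairs.
    """
    pair_counts = Counter()
    for classes in proteins:
        # Sorting makes (1, 2) and (2, 1) count as the same unordered pair.
        for a, b in combinations(sorted(classes), 2):
            pair_counts[(a, b)] += 1
    total = sum(pair_counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in pair_counts.items()}
```

The resulting values could be drawn as a class-by-class heat map in which darker cells mark the more frequent pairs.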

Proof of concept

Literature proof

In a published study, the CAP domain of AFEST and the propeller domain of apAPH were exchanged to construct a chimeric enzyme.

According to the 3DZD (3D Zernike descriptor) algorithm, the average inter-class Euclidean distance across the 100 classes we defined is 5.8715, while the 3DZD-based Euclidean distance between 1jjiA00 and 1ve6A02 is 3.785. Based on the 3DZDs and the database we built, synthetic biologists can therefore find domains that might be exchanged on the basis of their shape.
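
The distance comparison above can be expressed as a short Python sketch. The function names are hypothetical, and the descriptor vectors themselves are not given in the poster, so only the two reported distances (3.785 and 5.8715) are taken from the text.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two descriptor vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_closer_than_average(distance, average_inter_class_distance):
    """True if a domain pair is closer in shape than classes are on average."""
    return distance < average_inter_class_distance


# Values reported in the poster: the pair 1jjiA00 / 1ve6A02 versus the
# average inter-class distance over the 100 classes.
pair_is_similar = is_closer_than_average(3.785, 5.8715)
```

Here `pair_is_similar` evaluates to `True`, which matches the argument that the two exchanged domains are relatively close in shape.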

Comparison of classification effects

We compared our classification with existing database classifications to check the quality of our result.

Compared with a random distribution, the similarity between our classification and those of CATH/SCOP is still much higher, even though we use 3D structure as the classification standard, whereas CATH/SCOP evaluate sequence similarity first and then correct manually. On the one hand, this reflects the correlation between protein sequence and spatial structure; on the other hand, it suggests that the classification produced by this project is consistent with the traditional classifications and may even offer a more reasonable explanation.

References and Acknowledgements

database

[1] Sillitoe I, Dawson N, Lewis TE, et al. CATH: expanding the horizons of structure-based functional annotations for genome sequences. Nucleic Acids Res. 2019;47(D1):D280-D284. doi:10.1093/nar/gky1097
[2] Antonina Andreeva, Eugene Kulesha, Julian Gough, Alexey Murzin, The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. (2020) Nucl. Acid Res, 48(D1): D376-D382
[3] Antonina Andreeva, Dave Howorth, Cyrus Chothia, Eugene Kulesha, Alexey Murzin, SCOP2 prototype: a new approach to protein structure mining. (2014) Nucl. Acid Res, 42(D1): D310-D314
[4] Goodsell DS. The Protein Data Bank[M]// Atomic Evidence. Springer International Publishing, 2016.
[5] UniProt Consortium T. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2018;46:2699. doi: 10.1093/nar/gky092.

CD-HIT references

[6] Weizhong Li, Adam Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, Volume 22, Issue 13, 1 July 2006, Pages 1658–1659, https://doi.org/10.1093/bioinformatics/btl158
[7] Ying Huang, Beifang Niu, Ying Gao, Limin Fu, Weizhong Li, CD-HIT Suite: a web server for clustering and comparing biological sequences, Bioinformatics, Volume 26, Issue 5, 1 March 2010, Pages 680–682, https://doi.org/10.1093/bioinformatics/btq003
[8] Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, 2010(26): 680-682.
[9] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 2006(22): 1658-1659.

Other references

[10] Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 2002(18): 77-82.
[11] Weizhong Li, Lukasz Jaroszewski and Adam Godzik. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 2001(17): 282-283.
[12] MacQueen, J. Some Methods for Classification and Analysis of MultiVariate Observations[C]// Proc of Berkeley Symposium on Mathematical Statistics & Probability. 1965.
[13] David La, Juan Esquivel-Rodriguez, Vishwesh Venkatraman, Bin Li, Lee Sael, Steven Ueng, Steven Ahrendt, and Daisuke Kihara. 3D-SURFER: software for high throughput protein surface comparison and analysis. Bioinformatics 25: 2843-2844 (2009)

model

[14] LSculpt. Download from https://github.com/RomkoSI/lsculpt.
[15] Novotni M, Klein R. 3D Zernike Descriptors for Content Based Shape Retrieval[J]. 2003:216.
[16] La D, Juan Esquivel-Rodríguez, Venkatraman V, et al. 3D-SURFER: software for high-throughput protein surface comparison and analysis[J]. Bioinformatics, 2009.
[17] Wang Jianren, Ma Xin, Duan Ganglong. Improved K-means clustering k value selection algorithm [J]. Computer Engineering and Applications, 2019, 55(08): 27-33.
[18] Han J, Kamber M. Data Mining: Concepts and Techniques[M]. San Francisco: Morgan Kaufmann, 2006: 483-486.
[19] Andrew Ng, Clustering with the K-Means Algorithm, Machine Learning, 2012
[20] MacQueen, J. Some Methods for Classification and Analysis of MultiVariate Observations[C]// Proc of Berkeley Symposium on Mathematical Statistics & Probability. 1965.
[21] Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis[J]. Journal of Computational & Applied Mathematics, 1987, 20: 53-65.
[22] Zhou Xiaoli. Recombination of protein domains to construct new functional enzymes [D]. Jilin University, 2012.

Acknowledgement

Dr. Beibei Wang
Dr. Lin Quan
Dr. Lifeng Zhang
Dr. Hao Lin
Dr. Fengbiao Guo
Dr. Zhiyang Zhang
Dr. Nan Liu
Academic Affairs Office of UESTC
School of Life Science and Technology of UESTC