Team:UESTC-Software/Poster
UESTC-Software:CPD3DS
Classification of Protein Domains in 3D Shape to design a standard set of protein bricks
CPD3DS
Bin Wang¹, Nan Mo¹, Zhaochang Yang¹, Mingkang Liu¹, Yao Mei¹, Jingqi Han¹, Huiyu Xiong¹, Tao Zhou¹, Dan Qiao¹, Fan Gao¹, HaoRan Dong¹, Zhuochao Min¹, Beibei Wang², Department of Life Science and Technology§
[¹iGEM Student Team Member, ²iGEM Team Primary PI, §Faculty Sponsor, University of Electronic Science and Technology of China, Chengdu, Sichuan, China]
Abstract
Proteins carry out most physiological functions in cells, and many synthetic biologists focus on designing customized proteins on demand. Today, protein design usually starts from a new amino acid sequence, which entails a heavy workload. At the same time, the structure of a protein is often closely related to its function. Our project, CPD3DS, uses structural domains directly as the basic unit for analyzing protein structures. To obtain a set of protein bricks, we classified domains by their shape features, using 3D Zernike descriptors to cluster all domains in existing databases. We developed a user-friendly website, also named CPD3DS, for retrieval, analysis and visualization of the classified protein domains. In addition, we printed a set of our protein bricks for teaching and popularizing synthetic biology.
1. Domain-based design
At present, most proteins are designed and synthesized from scratch, which entails a large workload. As a region conserved during protein evolution, the domain is closely related to the structure and function of a protein. We hope to construct a library of domain parts that provides the necessary elements, functional relationships and other information for subsequent domain-based protein design and synthesis.
2. Shape clustering
We classified the domains by their shape, provide the domain clustering results, and further processed those results.
3. Block model building
We further designed a series of solid block models. Through these building-block models we can promote science popularization and expand our influence.
4. Website tools
All the domains in our database have been clustered by shape. The CPD3DS domain database can be queried by domain ID, domain category, domain-related functions and domain combination probability. By contrast, traditional methods are poorly suited to screening and retrieving large numbers of domains with similar shapes. In structural domain recombination, domains with similar shapes are easier to recombine to obtain proteins with specific functions.
Feature extraction
1. Voxelization of domain.
2. 3D Zernike transformation: The 3DZD program takes the cubic grid as input and generates 3DZDs (the 121 invariants).
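The two steps above can be illustrated with a minimal sketch. It is not the 3DZD program itself: the function names, the grid size of 32 and the use of bare atom coordinates (rather than a molecular surface) are assumptions made for illustration. The second function only checks the count of invariants: 121 is what a maximum expansion order of 20 yields, although the order is not stated here.

```python
def voxelize(coords, grid_size=32):
    """Map 3D points onto a cubic occupancy grid (hypothetical helper).

    Returns the set of occupied (i, j, k) voxel indices.
    """
    # Bounding box of the point set.
    mins = [min(c[i] for c in coords) for i in range(3)]
    maxs = [max(c[i] for c in coords) for i in range(3)]
    # A single scale factor for all axes keeps the shape undistorted.
    extent = max(maxs[i] - mins[i] for i in range(3)) or 1.0
    scale = (grid_size - 1) / extent
    occupied = set()
    for x, y, z in coords:
        occupied.add((
            int(round((x - mins[0]) * scale)),
            int(round((y - mins[1]) * scale)),
            int(round((z - mins[2]) * scale)),
        ))
    return occupied


def count_invariants(max_order):
    """Number of rotation-invariant 3D Zernike terms up to max_order.

    For each order n, the index l takes the values n, n-2, ..., down to
    0 or 1, which gives n // 2 + 1 invariants per order.
    """
    return sum(n // 2 + 1 for n in range(max_order + 1))


# Toy coordinates, invented for this example.
grid = voxelize([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)], grid_size=5)
n_invariants = count_invariants(20)
```

Each domain then becomes one fixed-length vector of invariants, which is the form a clustering algorithm needs.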
K-Means
K-Means is an unsupervised clustering algorithm. We use it for domain shape clustering after feature extraction.
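As a rough illustration of how K-Means groups feature vectors, the sketch below implements the standard Lloyd iteration in plain Python. It is an assumption-laden toy, not the project's code: the function names, the random initialization and the one-dimensional sample points are all invented here, and real descriptors would have 121 components.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids, sse)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    # Within-cluster sum of squared errors, useful for choosing k.
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


# Toy one-dimensional feature vectors, invented for this example.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels, centroids, sse = kmeans(points, k=2)
```

The returned sum of squared errors is the quantity usually compared across different values of k.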
Elbow rule
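The elbow rule is a common way to choose the number of clusters k for K-Means: the within-cluster sum of squared errors is plotted against k, and k is taken at the "elbow" of the curve, where further increases in k bring little additional improvement.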
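One simple way to automate the elbow rule, given the K-Means sum of squared errors (SSE) for a range of k values, is to pick the k where the SSE curve bends most sharply. The sketch below uses the largest second difference as that measure. This is only one possible heuristic, chosen here for illustration. The function name and the SSE values are made up, and the k values are assumed to be evenly spaced.

```python
def elbow_k(ks, sse):
    """Return the k at which the SSE curve bends most sharply.

    ks  : evenly spaced candidate cluster counts, in increasing order
    sse : within-cluster sum of squared errors for each k
    """
    if len(ks) != len(sse) or len(ks) < 3:
        raise ValueError("need at least three (k, SSE) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Drop in SSE before this k minus the drop after it.
        bend = (sse[i - 1] - sse[i]) - (sse[i] - sse[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Illustrative SSE values only; not results from the CPD3DS data.
chosen = elbow_k([1, 2, 3, 4, 5], [100.0, 60.0, 12.0, 10.0, 9.0])
```

In practice the curve is usually also inspected by eye, since noisy SSE values can produce several candidate bends.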
Database integration
De-redundancy
Home Page
This is the front page of our website. The first thing you see is the building blocks that we designed from protein domain prototypes. From the two columns of links, you can visit our Wiki, Download Page, Model Page, Education Page and Style Migration page.
Tutorial
Our tutorial page gives an overview of the project. The long paragraph is the detailed background: it covers the project overview, functions and design, and how to use the search function. The project's technology stack uses Vue, and a related introduction to Vue is given here. Finally, the page lists the project's references.
Domain List
The Domain List page is very simple: click the SHOW ALL button in the center and you will see the data for all domains.
Standard
On our Standard Class page, you can search by either domain name or domain class, for example the domain name 2p4vA02 or the domain class 11.
Search
On our Domain Search page, we provide three search methods. The first is by domain class: for example, if we type in 12, all the domains in class 12 are listed in the table below. The second is by domain name: after entering a name such as 101mA00, we can see the domain's class and its source link in the table below. The third is by function: for instance, to find out which proteins have the function of RNA binding, we search for "RNA binding". After a click, the table lists the domains that make up proteins with the function of RNA binding.
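The three search methods amount to three filters over the same table of domain records. The sketch below shows that idea only. It does not reflect the website's actual implementation or schema: the `Domain` fields, the function names and the records themselves are placeholders invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str          # domain identifier
    shape_class: int   # shape cluster the domain was assigned to
    functions: tuple   # functional tags of the parent protein


# Placeholder records, invented for this sketch; not real database entries.
DOMAINS = [
    Domain("demoA00", 12, ("RNA binding",)),
    Domain("demoB01", 12, ("ATP binding", "RNA binding")),
    Domain("demoC02", 11, ("DNA binding",)),
]


def search_by_class(domains, shape_class):
    """All domains assigned to the given shape class."""
    return [d for d in domains if d.shape_class == shape_class]


def search_by_name(domains, name):
    """Domains whose identifier matches exactly."""
    return [d for d in domains if d.name == name]


def search_by_function(domains, function):
    """Domains whose parent protein carries the given functional tag."""
    wanted = function.lower()
    return [d for d in domains if any(wanted == f.lower() for f in d.functions)]


class_12 = search_by_class(DOMAINS, 12)
rna_binders = search_by_function(DOMAINS, "RNA binding")
```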
Download
Finally, on the Download page, we provide our database for download so that you can explore it further.
Model making
Using models can not only benefit synthetic biology education, but also provide inspiration for new protein designs.
Functional analysis
Functional tags for proteins in the GO database come from experiments and literature, which means that these tags are highly reliable. We count the occurrence probability of each functional tag in each class and visualize it as a word cloud. Take class 1 as an example: its functional features are visualized through the word cloud.
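The counting step can be sketched as follows. Here "occurrence probability" is interpreted as a tag's share of all tag occurrences within a class, which is an assumption, and the function name and annotations are invented for illustration. The resulting weights are the kind of input a word cloud renderer takes.

```python
from collections import Counter, defaultdict


def tag_probabilities(annotations):
    """Relative frequency of each functional tag within each class.

    annotations: iterable of (class_id, list_of_tags) pairs, one per domain.
    Returns {class_id: {tag: probability}}.
    """
    counts = defaultdict(Counter)
    for class_id, tags in annotations:
        counts[class_id].update(tags)
    probabilities = {}
    for class_id, counter in counts.items():
        total = sum(counter.values())
        probabilities[class_id] = {tag: n / total for tag, n in counter.items()}
    return probabilities


# Made-up annotations for illustration; not taken from the CPD3DS database.
annotations = [
    (1, ["RNA binding", "ATP binding"]),
    (1, ["RNA binding"]),
    (2, ["DNA binding"]),
]
probs = tag_probabilities(annotations)
```

Tags with higher probability in a class would be drawn larger in that class's word cloud.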
Combining probability
Literature evidence
Comparison of classification effects
Acknowledgements
- Dr. Beibei Wang
- Dr. Lin Quan
- Dr. Lifeng Zhang
- Dr. Hao Lin
- Dr. Fengbiao Guo
- Dr. Zhiyang Zhang
- Dr. Nan Liu
- Academic Affairs Office of UESTC
- School of Life Science and Technology of UESTC