Team:SJTU-software/Engineering

Overview

To analyze rice pan-genome data, we tried several kinds of algorithms, including statistical approaches such as logistic regression and Markov chain Monte Carlo (MCMC), and machine learning methods such as SVM. Most of them did not perform well: they either generated false-positive results or took too much time.

In addition to analyses that borrowed tools such as Random Forest and clustering, we adopted and modified a global search algorithm that, to our knowledge, had not previously been used in genome-wide association studies. Rather than detecting relations one by one, our algorithm learns a whole directed acyclic graph (DAG) at once with a low false-positive rate. We adopted a two-stage Bayesian network method that performs a global search over DAGs to identify genome-wide interactions with multiple outcomes. This method combines the advantages of score-based and constraint-based methods to learn the structure of the phenotype-related Bayesian network.

We used the Bayesian information criterion (BIC) as the score because maximizing it leads to consistent model selection in the classical large-sample limit. The algorithm makes maximizing the BIC computationally feasible for much larger graphs. It maximizes the BIC greedily but still guarantees consistency in the large-sample limit. Its worst-case time complexity is still exponential, but its average-case complexity is only polynomial when the size of the largest clique in the graph grows only logarithmically with the number of nodes.
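The score-based part of this idea can be illustrated with a small sketch. It assumes discrete data (for example gene presence/absence) and uses invented variable names (gene_a, gene_b, trait) and invented toy rows. It only scores candidate DAGs with the BIC and takes one greedy step. It is not the two-stage method itself, which also relies on constraint-based tests and a full greedy search.

```python
import math
from collections import Counter


def local_bic(data, child, parents):
    """BIC contribution of one node given a candidate parent set (discrete data)."""
    n = len(data)
    # Count each (parent configuration, child value) cell and each parent configuration.
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    per_config = Counter(tuple(row[p] for p in parents) for row in data)
    # Maximised log-likelihood: sum of N_jk * log(N_jk / N_j) over observed cells.
    loglik = sum(c * math.log(c / per_config[cfg]) for (cfg, _), c in joint.items())
    # Free parameters: (child states - 1) for every combination of parent states.
    n_params = len({row[child] for row in data}) - 1
    for p in parents:
        n_params *= len({row[p] for row in data})
    return loglik - 0.5 * n_params * math.log(n)


def bic(data, dag):
    """Score a whole DAG, given as {node: [parents]}; the score decomposes per node."""
    return sum(local_bic(data, node, parents) for node, parents in dag.items())


# Toy data (invented): trait mostly follows gene_a and is unrelated to gene_b.
data = []
for i in range(40):
    a = i % 2
    b = (i // 2) % 2
    trait = 1 if i % 10 == 0 else a
    data.append({"gene_a": a, "gene_b": b, "trait": trait})

empty = {"gene_a": [], "gene_b": [], "trait": []}
with_a = {"gene_a": [], "gene_b": [], "trait": ["gene_a"]}
with_b = {"gene_a": [], "gene_b": [], "trait": ["gene_b"]}

# One greedy step: keep the candidate graph with the highest BIC.
candidates = {"empty": empty, "gene_a -> trait": with_a, "gene_b -> trait": with_b}
best = max(candidates, key=lambda name: bic(data, candidates[name]))
print(best, round(bic(data, candidates[best]), 2))
```

Because the BIC decomposes into one term per node, a greedy search only needs to rescore the node whose parent set changes. The penalty term is what keeps the uninformative edge from gene_b out of the graph in this toy case.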

Overall, our method improved accuracy and efficiency compared with several common methods, and it does not require the outcome variables to be specified in advance. We added the results to our database so that users can do more with it. When a user enters a gene name, the database shows information such as gene classifications and annotations, sample information and categories, protein sequences, and related genes. This gives experimenters new options for investigating gene function and broadens the scope for genetic reprogramming. If users are willing to add more phenotype data to our database, we can run a similar analysis for them.

Gene-gene Interaction Analysis

Database

Our database is divided into two main parts: original data and result data.

I. Original data

Whole_Gene_Annotion: This table contains gene annotations together with details of all DNA sequences (including CDS, 5' UTR and 3' UTR). It follows the standard GFF file format.

Part_Of_Gene_Annotion: This table is similar to Whole_Gene_Annotion. The only difference is that it contains only gene-level information for the DNA sequences (gene features only, without CDS, 5' UTR or 3' UTR).
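The relationship between the two annotation tables can be sketched as follows, assuming the nine tab-separated columns of GFF3. The sample rows, the sequence name chr01, the source name demo and the ID gene0001 are invented for illustration, and the helper names are hypothetical rather than part of our database code.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: dict


def parse_gff_line(line):
    """Parse one tab-separated GFF3 feature line; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    # Column 9 holds semicolon-separated key=value pairs.
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attrs)


def gene_only(features):
    """Keep gene-level rows only, as in a gene-only annotation table."""
    return [f for f in features if f.type == "gene"]


# Invented sample rows; real files would come from the annotation pipeline.
SAMPLE = "\n".join([
    "##gff-version 3",
    "\t".join(["chr01", "demo", "gene", "1000", "2500", ".", "+", ".", "ID=gene0001"]),
    "\t".join(["chr01", "demo", "CDS", "1200", "2300", ".", "+", "0", "ID=cds0001;Parent=gene0001"]),
    "\t".join(["chr01", "demo", "five_prime_UTR", "1000", "1199", ".", "+", ".", "Parent=gene0001"]),
])

features = [f for f in map(parse_gff_line, SAMPLE.splitlines()) if f is not None]
genes = gene_only(features)
print(len(features), "features,", len(genes), "gene row(s)")
```

Here `features` corresponds to the full table and `genes` to the gene-only subset. The feature type names five_prime_UTR and CDS follow common GFF3 usage and may differ from the labels stored in the actual tables.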

Protein_Seq: This table contains the protein sequence encoded by each gene, together with the frequency of the gene across all samples and its importance category (Core, Soft core, Distributed or Rare).
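One way such a category could be derived from the frequency is sketched below. The cut-offs are not stated here, so they are passed in as explicit arguments, and the values in the usage lines are placeholders only. Treating "Core" as present in every sample is also an assumption, as is the function name.

```python
def classify_gene(frequency, soft_core_min, rare_max):
    """Map a gene's sample frequency (0.0 to 1.0) to a pan-genome category.

    soft_core_min and rare_max are hypothetical cut-offs supplied by the caller.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if not rare_max < soft_core_min <= 1.0:
        raise ValueError("cut-offs must satisfy rare_max < soft_core_min <= 1")
    if frequency == 1.0:
        return "Core"          # assumed: present in every sample
    if frequency >= soft_core_min:
        return "Soft core"
    if frequency <= rare_max:
        return "Rare"
    return "Distributed"


# Placeholder cut-offs, chosen only to exercise each branch.
for freq in (1.0, 0.97, 0.40, 0.02):
    print(freq, classify_gene(freq, soft_core_min=0.95, rare_max=0.05))
```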

Gene_Phe: This table mainly contains some phenotypic characteristics of samples (flag_leaf_angle, leaf_width, grain_length, height, leaf_angle, leaf_length).

II. Result data

Gene_Relation: This table contains related genes. We used the GES (greedy equivalence search) algorithm to analyze the evolutionary relationships between different genes, and imported the analysis results into the database.

Sample_Relation: This table contains related samples. We used the GES algorithm to analyze the evolutionary relationships between different genes, and imported the analysis results into the database.

Phe_By_Gene: This table contains the genes that determine the phenotypic characteristics. We used the GES and Random Forest (RF) algorithms to identify the genes that affect each phenotypic characteristic.
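A gene-name lookup across these tables might look like the sketch below. It uses an in-memory SQLite database purely for illustration. The table names come from the descriptions above, but every column name, the helper names and the rows (geneA, geneB, geneC) are hypothetical and do not reflect the real schema or data.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical columns and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Protein_Seq (
            gene TEXT PRIMARY KEY, sequence TEXT, frequency REAL, category TEXT
        );
        CREATE TABLE Gene_Relation (gene TEXT, related_gene TEXT);
    """)
    conn.executemany(
        "INSERT INTO Protein_Seq VALUES (?, ?, ?, ?)",
        [("geneA", "MKT", 1.0, "Core"), ("geneB", "MAL", 0.4, "Distributed")],
    )
    conn.executemany(
        "INSERT INTO Gene_Relation VALUES (?, ?)",
        [("geneA", "geneC"), ("geneA", "geneB")],
    )
    return conn


def lookup_gene(conn, gene_name):
    """Return sequence details and related genes for one gene, or None if unknown."""
    row = conn.execute(
        "SELECT sequence, frequency, category FROM Protein_Seq WHERE gene = ?",
        (gene_name,),
    ).fetchone()
    if row is None:
        return None
    related = [r[0] for r in conn.execute(
        "SELECT related_gene FROM Gene_Relation WHERE gene = ? ORDER BY related_gene",
        (gene_name,),
    )]
    return {"gene": gene_name, "sequence": row[0], "frequency": row[1],
            "category": row[2], "related_genes": related}


conn = build_demo_db()
print(lookup_gene(conn, "geneA"))
```

Parameterised queries (the `?` placeholders) keep a user-entered gene name from being interpreted as SQL, which matters when the lookup is exposed through a web form.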
