Team:Linkoping/Software



Introduction

We set out to create a web tool for identifying novel pathological biomarkers. A large bottleneck in bioinformatics is the user's ability to integrate the proper methods into their analysis. Additionally, the learning curve for novice bioinformaticians today is very steep. By collecting a broad set of customizable tools in one web tool, we hope to relieve this bottleneck and help users generate significant results in an elegant fashion.

Complex diseases cannot be explained by specific loci but instead arise from abnormalities in numerous genes. Disease modules are a way of visualizing the interactions between these genes, and they structure pathological genes in a comprehensible way. By doing enrichment analysis on the generated disease modules, one can start to pinpoint suitable biomarkers for disease. With our tool, the generated figures and data are readily at hand throughout, which makes it useful both for quick rendering and for exploratory work.

Building a pipeline by yourself can be tedious and requires certain prerequisites. There are currently no efficient platforms out there for creating disease modules and doing enrichments. We thought that customizability and user-friendliness were lacking, and we deemed the approachability of current systems too low. We drew inspiration from the Swiss army knife, which has several unique features whilst being sleek and easy to use: elegantly simple yet complex. Some users might just use the big knife, some might use the scissors, and some even the toothpick that always gets overlooked. Could we take inspiration from this?

The sophisticated existing methodology can be overwhelming if one is unfamiliar with the theory behind the decision making. This was our greatest hurdle to overcome: how can you develop a sophisticated web tool that can be used by novice and expert alike? And can it be stimulating for both types of users?



Overview of ClusteRsy

This is an overview of ClusteRsy. We have provided support to make sure that ClusteRsy can be operated as easily as possible. This support consists of a front page with information about the tool, as well as buttons leading to our tutorial and to the user guide for more detailed information.

Then we have the actual usage of ClusteRsy.

Figure 1. Illustration to show what is included in ClusteRsy. The support, shown in navy blue, consists of an informative front page with links to the user guide, a tutorial, and the GitHub. We also provided a user guide for a more detailed explanation. The Usage, shown in orange, is the workflow of ClusteRsy.

First, the user uploads their RNA-seq data in the form of a count matrix. This is illustrated in figure 2; the matrix contains the genes studied as rows and the experiments, i.e. patient and control samples, as columns. The user can either upload an expression matrix or directly upload a previously created input object. From here, they can select groups and choose whether to use a quantile, a p-value, or additional values. ClusteRsy will then automatically calculate whether or not the genes from the count matrix are differentially expressed. This step can be summarized as choosing or creating an input object.

The next step is the MODifieR section, where an inference method for disease module identification is selected to act upon the input object. Here you can also select a protein-protein interaction (PPI) network, either by uploading your own or by using one of the existing ones. The PPI network acts as a spine: it gives a mathematical representation of the physical contacts between proteins in the cell and is used by MODifieR for mapping the genes. During this second step, you can choose to go into the advanced settings or to keep the preset parameters as default, with the latter being recommended for new users. This second step creates a module object.

In the third step, you select the type of enrichment analysis that you would like to perform. This step is highly customizable, and we recommend selecting thoughtfully. There are several different enrichment methods, including disease analysis, gene ontology analysis, and pathway analysis. This third step generates an enrichment object.

You have now successfully created an enriched module object. In the visualization tab, you can now choose a visualization method to generate a reactive plot. All objects are stored under the database tab, where you can inspect your generated objects but also download or delete them.

Figure 2. Overview of ClusteRsy workflow from start to finish.


More detailed information

As previously described, you start by uploading RNA-sequencing data to ClusteRsy. You then use any of the MODifieR methods we provide and calculate whether or not the disease module is enriched for a certain disease, function, or pathway.

In this section we will go into more detail on how this is done. Figure 3 is an illustration of how the software looks and how it’s designed to guide the user in the right direction.

Figure 3. Illustration of the workflow on ClusteRsy.
Figure 4. Example of a dataset prepared and uploaded to the database in ClusteRsy. Diff genes refers to the ENTREZIDs, logFC is the log2 fold change, logCPM is the log counts per million, F is the F-statistic, the P-value is calculated with genewise negative binomial generalized linear models with quasi-likelihood tests (glmQLFit), and FDR is the false discovery rate.

Input data for ClusteRsy

We offer the ability to upload a count matrix from high-throughput RNA-sequencing experiments to ClusteRsy. One big problem is the lack of a standardized format for the datasets produced by these experiments. To tackle this, we included extensive information about the format needed for ClusteRsy in our user guide, along with a video tutorial on this topic.

Once the input data has been transformed into a suitable format, it is easily uploaded to ClusteRsy. The user then chooses the groups, in this case which samples belong to patients and which belong to controls, noted as groups 1 and 2 within ClusteRsy. The option to use an adjusted P-value as well as a quantile is also included, and this will be used when preparing the input data for MODifieR.
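To make the expected layout concrete, here is a minimal Python sketch of reading a count matrix with genes as rows and samples as columns, and of assigning samples to groups 1 and 2. The tab-separated layout, the sample names, and the helper names `read_count_matrix` and `split_groups` are assumptions made for illustration only; the format ClusteRsy actually requires is described in the user guide.

```python
import csv
import io

# Hypothetical tab-separated count matrix: genes (ENTREZIDs) as rows,
# samples as columns. The values are made up for illustration.
EXAMPLE_TSV = (
    "gene\tpatient_1\tpatient_2\tcontrol_1\tcontrol_2\n"
    "7157\t120\t98\t30\t41\n"
    "672\t15\t22\t18\t20\n"
)


def read_count_matrix(text):
    """Return (sample_names, {gene_id: [count per sample]})."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column holds the gene identifiers
    counts = {row[0]: [int(v) for v in row[1:]] for row in reader if row}
    return samples, counts


def split_groups(samples, group1_names):
    """Return column indices for group 1 and for group 2 (all other samples)."""
    group1 = [i for i, s in enumerate(samples) if s in group1_names]
    group2 = [i for i, s in enumerate(samples) if s not in group1_names]
    return group1, group2


samples, counts = read_count_matrix(EXAMPLE_TSV)
group1, group2 = split_groups(samples, {"patient_1", "patient_2"})
```

Keeping the group assignment as column indices means the same matrix can be re-analyzed with a different grouping without re-reading the file.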

From the selected parameters, ClusteRsy then automatically calculates an edgeR [1] table. edgeR is a statistical package included in MODifieR; it is used for identifying differentially expressed genes from (in our case) a count matrix by fitting generalized linear models. Figure 4 shows an example of input data after it has been uploaded to the ClusteRsy database. The diff genes column lists the genes found in the input data, named with ENTREZIDs. logFC is the log2 fold change, calculated by comparing the selected groups 1 and 2. logCPM is the log counts per million, F is the F-statistic, the P-value is calculated with genewise negative binomial generalized linear models with quasi-likelihood tests (glmQLFit) [2], and FDR is the false discovery rate.
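The columns of this table can be illustrated with a simplified Python sketch. It is not the edgeR procedure: edgeR fits negative binomial generalized linear models and applies its own normalization, whereas the code below only shows what counts per million, a log2 fold change, and a false discovery rate express. The function names, the pseudo-count of 1, the direction of the fold change (group 1 relative to group 2), and the use of the Benjamini-Hochberg procedure for the FDR are assumptions of this sketch.

```python
import math


def counts_per_million(counts):
    """Scale each sample (column) so that its counts sum to one million."""
    n_samples = len(next(iter(counts.values())))
    library_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / library_sizes[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }


def log2_fold_change(values, group1, group2, pseudo=1.0):
    """log2 ratio of the mean of group 1 to the mean of group 2.

    The pseudo-count avoids taking the logarithm of zero (assumed value).
    """
    mean1 = sum(values[i] for i in group1) / len(group1)
    mean2 = sum(values[i] for i in group2) / len(group2)
    return math.log2((mean1 + pseudo) / (mean2 + pseudo))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up counts: two patient samples (columns 0, 1), two controls (2, 3).
counts = {"7157": [120, 98, 30, 41], "672": [15, 22, 18, 20]}
cpm = counts_per_million(counts)
lfc = {gene: log2_fold_change(v, [0, 1], [2, 3]) for gene, v in cpm.items()}
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A positive log fold change here means higher expression in group 1 than in group 2.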

Once the input data has been prepared and uploaded to ClusteRsy, it can be used for MODifieR and enrichment analysis.



MODifieR

Figure 5. The different MODifieR methods included in ClusteRsy and what category they belong to.

MODifieR [3] is an acronym for module identification. It is an R package containing 8 different algorithms for the identification of disease modules given RNA-sequencing data. The algorithms are divided into three categories: seed-based, clique-based, and co-expression based methods.

We have included all 8 methods in ClusteRsy with an easy-to-use layout, as well as an advanced options button to access all of the parameters. Figure 5 displays the included methods, and to the left you can see which category each belongs to. How these categories operate is described further down.



Figure 6. The seed-based methods include the DIAMOnD algorithm. This sketch illustrates how the algorithm finds the disease module.

Seed based methods

The seed-based method starts from a set of seed genes, defined as the genes most differentially expressed between controls and patients. Next, the method looks at all neighbors of the seed genes and calculates all the interactions to those neighbors; it also includes already known interactions from the PPI network. It then calculates, for each neighbor, the probability of its connections to the seed genes, and the neighbor with the lowest p-value gets added to the module. This is done iteratively until the maximum number of genes set by the user is reached, or until there are no more genes to add below the p-value threshold. In the MODifieR package, DIAMOnD [4] is a seed-based method.
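The iterative procedure described above can be sketched in Python as follows. This is a simplified illustration in the spirit of DIAMOnD, not the MODifieR implementation: the toy network, the function names, and the tie-breaking by node name are assumptions, and the connectivity p-value is taken as the hypergeometric probability of a node having at least its observed number of links to the current module.

```python
from math import comb


def connectivity_pvalue(n_nodes, n_seeds, degree, links_to_seeds):
    """P(X >= links_to_seeds) when `degree` links are drawn at random
    from a network of `n_nodes` nodes containing `n_seeds` seed nodes."""
    upper = min(degree, n_seeds)
    tail = sum(
        comb(n_seeds, i) * comb(n_nodes - n_seeds, degree - i)
        for i in range(links_to_seeds, upper + 1)
    )
    return tail / comb(n_nodes, degree)


def grow_module(network, seeds, max_added, p_threshold=1.0):
    """Iteratively add the neighbor most significantly connected to the module.

    `network` maps each node to the set of its neighbors (undirected).
    Returns the added nodes in the order they were picked.
    """
    module = set(seeds)
    added = []
    while len(added) < max_added:
        best = None
        for node, neighbors in network.items():
            if node in module:
                continue
            links = len(neighbors & module)
            if links == 0:
                continue  # only neighbors of the current module are candidates
            p = connectivity_pvalue(len(network), len(module), len(neighbors), links)
            if best is None or (p, node) < best:  # ties broken by node name
                best = (p, node)
        if best is None or best[0] > p_threshold:
            break  # no candidate left below the p-value threshold
        module.add(best[1])
        added.append(best[1])
    return added


# Toy PPI network (hypothetical): edges A-B, A-C, A-D, B-C, B-D, C-E, E-F.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "E"},
    "D": {"A", "B"},
    "E": {"C", "F"},
    "F": {"E"},
}
added = grow_module(network, seeds={"A", "B"}, max_added=2)
```

In this toy network, node D is picked before node C even though both have two links to the seeds, because D has fewer links overall, so its connections to the seeds are less likely to occur by chance.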



Figure 7. The clique-based methods include Correlation clique, Clique sum, MCODE and Module discoverer. This sketch illustrates how the algorithm finds the disease module.

Clique based methods

Clique-based methods work by splitting the large network into all possible cliques. These cliques are then searched for over-represented genes and overlaid to create the final module. In the MODifieR package, Correlation clique [5], Clique sum [6], MCODE [7] and Module discoverer [8] are clique-based methods.
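As a rough illustration of the clique idea, the Python sketch below enumerates the maximal cliques of a small network with the Bron-Kerbosch algorithm, keeps the cliques in which differentially expressed genes make up a large share, and overlays them into one module. It does not reproduce any of the four MODifieR methods: the fraction cutoff is a simple stand-in for a statistical over-representation test, and the toy network, function names, and default parameters are assumptions.

```python
def maximal_cliques(network):
    """Enumerate maximal cliques with the basic Bron-Kerbosch algorithm.

    `network` maps each node to the set of its neighbors (undirected).
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)  # cannot be extended any further
            return
        for node in list(candidates):
            expand(
                current | {node},
                candidates & network[node],
                excluded & network[node],
            )
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(network), set())
    return cliques


def clique_module(network, deg_genes, min_size=3, min_fraction=0.5):
    """Overlay all cliques in which differentially expressed genes dominate."""
    module = set()
    for clique in maximal_cliques(network):
        fraction = len(clique & deg_genes) / len(clique)
        if len(clique) >= min_size and fraction >= min_fraction:
            module |= clique
    return module


# Toy PPI network (hypothetical): edges A-B, A-C, A-D, B-C, B-D, C-E, E-F.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "E"},
    "D": {"A", "B"},
    "E": {"C", "F"},
    "F": {"E"},
}
module = clique_module(network, deg_genes={"A", "D", "E"})
```

Here only the clique A-B-D passes both cutoffs, so the module also contains gene B, which is not differentially expressed itself but is tightly connected to genes that are.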



Figure 8. The Co-expression based methods include DiffCoEx, WGCNA and MODA. This sketch illustrates how the algorithm finds the disease module.

Co-expression based methods

Co-expression based methods first define the network from the data itself, as pairwise gene-gene correlations. The algorithm then searches this network for either sets of differentially expressed genes or patterns of correlated gene expression that differ between the control group and the diseased group. Any differentially expressed genes found are considered disease genes and are added to the network. In the MODifieR package, DiffCoEx [9], WGCNA [10] and MODA [11] are co-expression based methods.
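The idea of building a network from pairwise gene-gene correlations and comparing it between groups can be sketched in Python as below. This is only an illustration of the principle, not DiffCoEx, WGCNA, or MODA: the expression values, the function names, and the cutoff `min_change` are assumptions, and real methods add steps such as soft thresholding and hierarchical clustering.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def differential_coexpression(expr_control, expr_patient, min_change=0.5):
    """Return gene pairs whose correlation differs between the two groups.

    Both arguments map gene -> expression values across that group's samples.
    The result maps (gene1, gene2) -> (r_control, r_patient).
    """
    changed = {}
    for g1, g2 in combinations(sorted(expr_control), 2):
        r_control = pearson(expr_control[g1], expr_control[g2])
        r_patient = pearson(expr_patient[g1], expr_patient[g2])
        if abs(r_control - r_patient) >= min_change:
            changed[(g1, g2)] = (r_control, r_patient)
    return changed


# Made-up expression values for three genes in four samples per group.
expr_control = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
expr_patient = {"g1": [1, 2, 3, 4], "g2": [8, 6, 4, 2], "g3": [4, 3, 2, 1]}
changed = differential_coexpression(expr_control, expr_patient)
```

In this example gene g2 flips its relationship to both other genes between controls and patients, so both of its pairs are reported, while the g1-g3 pair is unchanged and is left out.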



Enrichment analysis

Enrichment analysis, also known as over-representation analysis, is a common approach to determine whether a biological process or function is enriched. This is done by calculating p-values from a hypergeometric distribution given a list of differentially expressed genes (DEGs).

P-values from a hypergeometric distribution are calculated by the following equation:

$$p = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$
Equation 1: Equation for calculations of p-values.

Here N is the total number of genes in the background distribution, and M is the number of genes within that distribution that are annotated to the gene set of interest. n is the size of the list of genes of interest, and k is the number of genes within that list that are annotated to the gene set. By default, the background distribution consists of all genes that have an annotation.
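With N, M, n and k defined as above, the p-value is the probability of seeing at least k annotated genes in a list of n genes drawn from the background. A minimal Python sketch, using only the standard library and a hypothetical function name, could look like this:

```python
from math import comb


def enrichment_pvalue(N, M, n, k):
    """Upper-tail hypergeometric probability P(X >= k).

    N: number of genes in the background
    M: background genes annotated to the gene set
    n: number of genes in the list of interest
    k: genes in the list that are annotated to the gene set
    """
    upper = min(n, M)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Made-up example: 20 background genes, 5 annotated, a list of 4 genes,
# 2 of which are annotated to the gene set.
p = enrichment_pvalue(N=20, M=5, n=4, k=2)
```

Summing the upper tail directly is equivalent to taking one minus the cumulative probability of observing fewer than k annotated genes, and it avoids subtracting two nearly equal numbers when the p-value is very small.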

Enrichment can be calculated for different things, and we have included three different types of enrichment analysis as well as gene set enrichment analysis in ClusteRsy.

Disease analysis

This method uses three different databases. DOSE [11] and DisGeNET (DGN) [12] are used to find gene-disease associations within the module and the Network of Cancer Genes (NCG) [13] database is used for cancer-associated genes.

Gene ontology analysis

The gene ontology analysis method is used to classify functions among the genes using the Bioconductor database. This method does so from three aspects: molecular function, cellular component, or biological process.

KEGG pathway analysis

This method uses the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [14] to determine which pathways the genes of interest are involved in.



Engineering principles

When creating our software the team worked through a cycle of feedback and development. We had two sessions of beta-testing where we asked different people to try out ClusteRsy and give us feedback on what to improve. Between the rounds of testing, we improved the software based on the feedback we received.

When we started working on ClusteRsy, we began by asking clinicians what they would want from software that analyzes transcriptome data, since they were the intended users. They wanted it to be user friendly and easy to learn, which was what they felt was missing from their current options. This then became our primary focus, together with making a quality product with real value for the user.

For the first beta-testing the testers were our PIs, a doctoral student in Bioinformatics, and the part of the team not working on developing the software. The main goal of this beta-testing was to find bugs and to find out how user friendly it was. Because of this, we gave the testers as little information as possible about ClusteRsy beforehand, in order to truly test how user friendly it was on its own.

Figure 9. Illustration of the first iteration of the beta testing.

The improvements we made after this beta-testing both made the software more user friendly and gave the user more options on how to use it. The changes we made to make it more user friendly were that we added an advanced options button to the parameters to make them easier to navigate, added a tutorial on how to use ClusteRsy, and added clickable buttons for more information. Other, more general, improvements we added to ClusteRsy were one single tab for all the different databases and the option to inspect the objects in them. After the testing, we also fixed the bugs that were found.

For the second beta-testing we invited the clinicians we had been in contact with at the beginning of the project to get their feedback on the finished product. We also invited bioinformaticians working at our university, our PIs, two other iGEM teams (Rochester and Imperial College London), and an employee from AstraZeneca with whom we had also been in contact earlier. The focus of this beta-testing was to find the last bugs and get some final feedback on ClusteRsy. This feedback will then be relevant during the continued upkeep of the software after iGEM.

Figure 10. Illustration of the second iteration of the beta testing.

The feedback we received from the last beta-testing was primarily that the testers wanted easily accessible YouTube videos introducing how to use ClusteRsy, which we made and uploaded to our YouTube page. From the people who had been a part of the first beta-testing, the feedback was positive, and they felt that their previous concerns had been addressed.



Results

The final version of ClusteRsy is a sophisticated, intuitive, and easy-to-use software for big RNA data analysis. One of the ambitions of the software was to make it easy to use for people who are not familiar with bioinformatics, so the user is guided through ClusteRsy by clickable explanations and the user guide. This ambition for user-friendliness was set during the first meeting with the clinicians. Throughout the two rounds of beta testing, we witnessed a great increase in the experienced user-friendliness. Another wish expressed at the first beta testing that we took into account was to hide all settings that are not vital for the user to change for each analysis and to set those to appropriate default values. Another aspiration we set ourselves was to make ClusteRsy look like a modern tool that is pleasing to the eye. This was accomplished by using contrasting yet matching colors combined with rounded shapes wherever possible. We also divided the information into separate tabs, so that all relevant information is easy to find without the tool becoming too cluttered.

It's now time to actually show ClusteRsy!

Video: An introduction of the ClusteRsy interface.

Conclusion

ClusteRsy is user-friendly software for transcriptome analysis. To showcase that the software performs as designed, we have used it to predict biomarkers for asthma diagnosis. More details on how this was done can be found in the modeling tab and in proof of concept.

We are truly proud of our finished product and we believe it will help not only clinicians to analyze big data in an easy manner, but also allow future iGEM teams to easily access biomarker discovery and give insights into complex diseases.



References

[1]. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139-140. doi:10.1093/bioinformatics/btp616

[2]. Chen Y, Lun ATL and Smyth GK. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline [version 2; peer review: 5 approved]. F1000Research 2016, 5:1438 (https://doi.org/10.12688/f1000research.8987.2)

[3]. de Weerd HA, Badam TVS, Martínez-Enguita D, Åkesson J, Muthas D, Gustafsson M, et al. MODifieR: an Ensemble R Package for Inference of Disease Modules from Transcriptomics Networks. Bioinformatics. 2020;36(12):3918–9.

[4]. Ghiassian SD, Menche J, Barabási AL. A DIseAse MOdule Detection (DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns of Disease Proteins in the Human Interactome. PLOS Computational Biology. 2015;11(4): e1004120.

[5]. Gawel DR. et al. A validated single-cell-based strategy to identify diagnostic and therapeutic targets in complex diseases. Genome Med. 2019;11(47).

[6]. Gustafsson M, et al. Integrated genomic and prospective clinical studies show the importance of modular pleiotropy for disease susceptibility, diagnosis, and treatment. Genome Med. 2014;6(17).

[7]. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003;4:2. doi: 10.1186/1471-2105-4-2.

[8]. Vlaic, S., Conrad, T., Tokarski-Schnelle, C. et al. ModuleDiscoverer: Identification of regulatory modules in protein-protein interaction networks. Sci Rep 8, 433 (2018). https://doi.org/10.1038/s41598-017-18370-2

[9]. Tesson BM, Breitling R, Jansen RC. DiffCoEx: a simple and sensitive method to find differentially coexpressed gene modules. BMC Bioinformatics. 2010;11:497. doi: 10.1186/1471-2105-11-497.

[10]. Langfelder P, Horvath S. Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of statistical software. 2012;46(11):i11.

[11]. Dong Li, James B. Brown, Luisa Orsini, Zhisong Pan, Guyu Hu, Shan He. MODA: MOdule Differential Analysis for weighted gene co-expression network. bioRxiv 053496; doi: https://doi.org/10.1101/053496

[12]. Piñero J, Ramírez-Anguita JM, Saüch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI. The DisGeNET knowledge platform for disease genomics. Nucleic Acids Research. 2020;48(D1):D845-55. https://doi.org/10.1093/nar/gkz1021

[13]. Repana D, Nulsen J, Dressler L, et al. The Network of Cancer Genes (NCG): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens. Genome Biol. 2019;20(1). https://doi.org/10.1186/s13059-018-1612-0

[14]. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27-30. doi:10.1093/nar/28.1.27