
Why we created ClusteRsy

Today there is a gap between clinicians generating data and bioinformaticians evaluating it. This problem is becoming more apparent because RNA-sequencing is getting cheaper and more accessible, making it a widely used tool for clinicians as well as researchers. More data than ever is being generated and all of it has to be analyzed.

Our solution to this is ClusteRsy, user-friendly software for transcriptome analysis. The main goal was to offer software that had state-of-the-art methods but without requiring any coding skills from the user.

In this section, we demonstrate how we not only predicted 13 biomarkers for asthma but also discovered a connection between asthma and respiratory syncytial virus (RSV), which causes bronchiolitis in young children. All of this was done using our own software, which gave us a good opportunity to test that it worked as designed. First, we describe how ClusteRsy was built and how we ensured that it was user-friendly; then we go into more detail about how it was used to find biomarkers for asthma.

How it was done

Very early in the design phase, it became clear to us that if we were to build software mainly for clinicians, we had to reach out to this community and find out what was currently lacking. We set up a meeting with clinicians as well as bioinformaticians and broke down what was needed, the pros and cons of the tools they used today, and the aesthetics and design of such tools. After a very informative discussion, we arrived at the following criteria for ClusteRsy.

It should contain the computational tools used by bioinformaticians, presented through an easy-to-use interface that requires no coding.
A database should be provided to store results, since most current tools lack one.
There should be examples of what the input data looks like and how it should be formatted for use in our software.
The design should be modern, with an interface that is easy to understand overall.

With these criteria in mind, we now began to build our software.

Most of the tools used by bioinformaticians are written for the R programming language, which meant that we had to figure out how to integrate advanced R scripts into a webpage. We decided that the best solution was the R Shiny framework. It offers all the power of R together with the ability to integrate HTML, CSS, and JavaScript, which makes it possible to put a graphical user interface on top of advanced R scripts. The computational tools we decided to implement were MODifieR, used for identifying disease modules from RNA-sequencing data, and enrichment analysis from the clusterProfiler package, which is needed to get biologically relevant information out of the modules created by MODifieR [1,2]. We also wanted to provide an easy way to visualize the results. This was done by implementing three plots from the enrichment analysis package as well as a heatmap. A data table for inspecting the final results in more detail was also included. To make sure that the results could be used in publications, we made them easy to download in a suitable format. In addition, we implemented an easy way to download the results as a Cytoscape object. Cytoscape is software designed for network visualization and is widely used in network biology. Before a network can be loaded into Cytoscape, the data has to be formatted, and our Cytoscape download option does just that. This saves the user a lot of time otherwise spent on cleaning and formatting the data.
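To give an idea of the kind of formatting that a Cytoscape export involves, the sketch below writes a list of gene-gene edges in Cytoscape's Simple Interaction Format (SIF). It is written in Python for illustration only and is not the R code used inside ClusteRsy; the function name, the placeholder gene identifiers, and the choice of `pp` as the interaction label are all assumptions.

```python
def edges_to_sif(edges, interaction="pp"):
    """Return SIF text with one 'source<TAB>interaction<TAB>target' line per edge.

    Self-loops and duplicate undirected edges are dropped, which is the kind
    of cleaning a user would otherwise have to do by hand.
    """
    seen = set()
    lines = []
    for source, target in edges:
        key = tuple(sorted((source, target)))
        if source == target or key in seen:
            continue
        seen.add(key)
        lines.append(f"{source}\t{interaction}\t{target}")
    return "".join(line + "\n" for line in lines)


# Placeholder identifiers, not genes from our modules.
example_edges = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_A"),  # duplicate of the first edge
    ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_C"),  # self-loop
]
sif_text = edges_to_sif(example_edges)
```

The resulting text can be saved with a `.sif` extension and imported into Cytoscape as a network.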

We also provide a database to store all of the files uploaded to and generated on ClusteRsy. For this we got help from one of our collaborators, Hendrik de Weerd. The database is provided as an SQL database that we could integrate into our software, which allows all of the files created in ClusteRsy to be stored properly.

Beta-testing

As mentioned earlier, the main purpose when creating our software was to bridge the gap between clinicians and bioinformaticians. During the development, we therefore always had the criteria set up during the design phase in mind. In order for us to quantify how well we had achieved our goal to make the software user-friendly, we made sure to beta-test it. This was done during two beta-testing sessions.

For the first beta-testing session, we invited our PIs, a doctoral student in bioinformatics, and team members not involved in the development of ClusteRsy. During this session, we mainly focused on overall user-friendliness and on finding bugs. Based on the feedback, we made the software even more user-friendly by gathering all of the parameters for the computational tools behind an advanced settings button, adding more information in clickable buttons, and implementing a guided tutorial.

When the software was almost finished, we set up the second beta-testing session. This time we invited the clinicians we had been in contact with from the beginning. We also invited two other iGEM teams (Rochester and Imperial College London), an employee from AstraZeneca whom we had been in contact with earlier, and some of the beta-testers from the first session. The main feedback from this session was that the testers wanted easily accessible YouTube videos introducing how to use ClusteRsy, which we made and uploaded to our YouTube channel. The feedback from those who had taken part in the first session was positive, and they felt that their previous concerns had been addressed.

Now that we have described how ClusteRsy was built, it is time to show the results obtained with our software!

Modeling and ClusteRsy

How we used ClusteRsy to model and predict asthma biomarkers

ClusteRsy offers an easy way to perform advanced transcriptome analysis. It was developed to bridge the gap between bioinformaticians and clinicians, both by giving non-coders access to advanced algorithms through an easy-to-use interface and by allowing bioinformaticians and enthusiasts to speed up their workflow.

To show that ClusteRsy works in practice and not only in theory, we used it to model and predict biomarkers for our biosensor for asthma diagnosis. To understand how we did this, we first need some background on asthma.

Asthma and modeling

Globally, 339 million people are suffering from asthma and it’s the most common chronic disease amongst children [3]. Today, only physiological diagnostic methods, such as measuring breathing capability using a spirometer, are used for asthma diagnosis. However, several studies show that these methods are unreliable and often lead to misdiagnosis [4-6].

We strongly believe that biomarkers and biosensor technology could be used to address this problem. This will allow not only for more accurate diagnosis but also an overall better understanding of the genetic components of the disease. Asthma is a polygenic disease, meaning that using only one biomarker for asthma diagnosis simply wouldn’t be enough. Using an assay of biomarkers specific for asthma would therefore be needed to create a biosensor that could reliably be used for asthma diagnosis.

**Figure 1.** Sketch of how the modeling was done on ClusteRsy and the results as a proof of concept.

This leads us back to modeling. In order to get an assay of biomarkers specific for asthma, we used state-of-the-art transcriptome analysis. This was done on ClusteRsy and it gave us a great opportunity not only to test our software but also to showcase that ClusteRsy actually worked in practice.

The workflow seen in Figure 1 illustrates the connection between ClusteRsy and modeling. All of the input data used for modeling was prepared and uploaded to the database using ClusteRsy. The disease module identification with MODifieR as well as enrichment analysis was calculated using ClusteRsy. For visualization, we used our software and were able to get a deep insight into asthma and good ideas on how to interpret the results. We also provide a download option to Cytoscape that allows for network visualization.

To orchestrate all of this, the cooperation group was formed. Within it, the modeling group focused on predicting biomarkers and evaluating the results produced by the model and ClusteRsy, while the experimental part of the group focused on finding a good way to validate all of the results. More on this can be found in the Results tab.

The input data used

As mentioned previously, asthma is a complex and polygenic disease. An asthma attack activates the inflammatory process. During this process, allergens activate dendritic cells, which leads to the recruitment of T-helper cells. These in turn release cytokines, which promote the production of a variety of immune cells. A good starting point when searching for biomarkers for asthma is therefore to look at these cells. For the modeling, we used publicly available RNA-sequencing data from Th2 cells sampled from the blood (GSE75011). This dataset includes 40 allergic asthma patients and 15 controls.

This data was then uploaded to ClusteRsy, where it was processed to find differentially expressed genes and prepared for MODifieR. As seen in Table 1, the dataset contained 12,923 genes in total, of which 2,306 were considered differentially expressed.

Validating the results from ClusteRsy

To determine which of the modules produced by our model were valid, we performed enrichment analysis of disease-associated SNPs using a tool called PASCAL. We scored the modules created on ClusteRsy using independent genomics data as a control and were therefore able to validate the results.

In Figure 2, the modules above the threshold marked by the dashed line (P-value < 0.05) were considered significant, and these were selected for further analysis. A more detailed explanation can be found in our Model tab. To further strengthen our findings, we also searched the literature for all of the predicted biomarkers to see whether they had been reported in asthma before; this is discussed further down.
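The selection step described above can be sketched as follows. This is an illustration in Python, not the code behind the figure. It assumes that the figure plots -log10(P), so that a module lying above the dashed line has P < 0.05; the module names and P-values in the example are placeholders, not our results.

```python
import math

ALPHA = 0.05  # significance threshold, drawn as the dashed line


def significant_modules(p_values, alpha=ALPHA):
    """Return {module: -log10(P)} for every module with P below alpha."""
    return {
        name: -math.log10(p)
        for name, p in p_values.items()
        if p < alpha
    }


# Placeholder values, for illustration only.
example_p_values = {"module_1": 0.01, "module_2": 0.20, "module_3": 0.05}
selected = significant_modules(example_p_values)
dashed_line_height = -math.log10(ALPHA)
```

Note that a module with a P-value of exactly 0.05 is not selected, since the criterion is a strict inequality.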

Results that show that our model is better

Before we go into the details of the results we obtained with ClusteRsy, we want to show that our model performs better than already established methods. To do this, we first filtered the input data to contain only the differentially expressed genes. We then ran disease analysis on this filtered data using the same settings as for the modules created with our model.

As seen in Table 2, using the input data only is simply not enough and returns non-significant results for all of the diseases (P-value > 0.05 and P-adjusted = 0.01). With this, we can show that a modular approach using our pipeline greatly increases the chance to get significant results.

**Table 2.** Enrichment analysis using the disease analysis algorithm and the DGN database. The filtered input data is shown in yellow, together with the P-value and P-adjusted value for the investigated diseases, and is compared to DIAMOnD in green, CSP in light red, Module Discoverer in blue and Correlation Clique in pink.
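For readers unfamiliar with where a P-value and a P-adjusted value come from in this kind of analysis, here is a minimal sketch. It assumes a standard hypergeometric over-representation test followed by Benjamini-Hochberg correction, which is a common setup for enrichment tools; whether these exact settings match our run is an assumption, and the sketch is Python rather than the R code used in ClusteRsy. All numbers in the example are made up.

```python
from math import comb


def hypergeom_pvalue(universe, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn from `universe` genes,
    of which `annotated` belong to the disease gene set."""
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(universe, selected)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up example: 10 genes in total, 5 annotated to a disease,
# a module of 5 genes that contains all 5 annotated genes.
p_example = hypergeom_pvalue(universe=10, annotated=5, selected=5, overlap=5)
adjusted_example = benjamini_hochberg([0.01, 0.04, 0.03])
```

The intuition for the modular approach is visible in the formula: a small, focused module with a high share of disease genes yields a much smaller tail probability than a large, unfocused gene list with the same share.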

Results from the modeling

In this section, we show that using ClusteRsy for biomarker discovery is not only highly efficient thanks to the easy-to-use interface, but that the software is also powerful enough to reveal heterogeneity within a disease, which led to the discovery of a connection between asthma and RSV Bronchiolitis.

Once we had performed all the calculations in ClusteRsy using the asthma input data, we began to visualize the results within the software. This allowed us to quickly get an overview and to find interesting patterns in the resulting data. Having chosen the results we wanted to continue with, we visualized the networks in Cytoscape. The built-in Cytoscape download option in ClusteRsy automatically formatted and cleaned the data so that it could be loaded into Cytoscape.

**Figure 3.** Network representation of the different modules: Module Discoverer upper left, DIAMOnD upper right, Correlation Clique bottom left and CSP bottom right. Red marks overexpressed genes with logFC > 0.5. Yellow marks genes with -0.5 < logFC < 0.5, which are not considered significant for our purpose. Blue marks underexpressed genes with logFC < -0.5. The green area contains the genes involved in viral Bronchiolitis, the orange area the genes involved in asthma and the purple area the genes involved in RSV. The intersections between the areas contain genes present in two or all three of the diseases.

Figure 3 shows the results as investigated in Cytoscape. Using Venn diagrams, we could easily detect the overlapping genes, which led to the discovery of the connection between asthma and RSV Bronchiolitis. The details can be found in our Model tab. Figure 3 shows the power of ClusteRsy: the model was powerful enough not only to predict 13 biomarkers for asthma diagnosis but also to detect a correlation between asthma and RSV Bronchiolitis. This discovery led us to propose that the overlapping genes could be used to screen infants who are at high risk of developing severe RSV Bronchiolitis. Such screening would make it possible to treat the high-risk group at an early stage and thereby reduce the severity of the illness.
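The two operations behind Figure 3, colouring genes by logFC and finding the genes shared between diseases, can be sketched in a few lines. This is a Python illustration rather than the workflow we ran in ClusteRsy and Cytoscape; the function names are our own for this example, and the gene identifiers are placeholders, not the genes in the figure.

```python
def classify_log_fc(log_fc, cutoff=0.5):
    """Colour class used in the network figure.

    Values at exactly +/- cutoff are treated as not significant; this is an
    assumption, since the figure caption leaves the boundary open.
    """
    if log_fc > cutoff:
        return "overexpressed"
    if log_fc < -cutoff:
        return "underexpressed"
    return "not significant"


def venn_regions(gene_sets):
    """Group genes by the exact combination of diseases they appear in."""
    membership = {}
    for disease, genes in gene_sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(disease)
    regions = {}
    for gene, diseases in membership.items():
        regions.setdefault(frozenset(diseases), set()).add(gene)
    return regions


# Placeholder gene identifiers, for illustration only.
example_sets = {
    "asthma": {"G1", "G2", "G3", "G4"},
    "RSV": {"G3", "G4", "G5"},
    "viral bronchiolitis": {"G4", "G6"},
}
regions = venn_regions(example_sets)
shared_by_all = regions[frozenset(example_sets)]
```

Each key of `regions` corresponds to one area of the Venn diagram, so the genes proposed for infant screening would be those in the regions that combine asthma with RSV.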

As mentioned, we also predicted 13 biomarkers for asthma diagnosis. Starting from the ClusteRsy and Cytoscape results described above, we set up criteria to narrow down the list of potential biomarkers. The criteria were that antibodies should be available, that the log2 fold change should be above a set threshold, and that the phase II team should be able to express and purify the proteins. Table 3 shows the final 13 biomarkers selected using these criteria. Once we had selected our final biomarkers, we searched the literature to see whether they had been reported in asthma before. All of the predicted biomarkers had been reported in asthma, which increases the overall reliability of our findings [7-15].

**Table 3.** The final biomarkers that were selected using the criteria listed above. The biomarkers are divided into asthma, the intersection of asthma and RSV, and the intersection of asthma, RSV and viral Bronchiolitis. The logFC value is shown as a colour gradient, where 0.5 is light red and 1.2 and above is red. For negative values, -0.5 is light blue and the minimum logFC is blue. The P-value is the P-value for the differential expression of that gene between the control group and the patient group.
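The selection criteria for the final biomarkers amount to a simple filter, sketched below in Python. The record fields, the function name, and the candidate entries are hypothetical. The cutoff of 0.5 on the absolute log2 fold change is an assumption borrowed from the colour scale of Table 3, since the exact threshold is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    gene: str
    log2_fc: float
    antibody_available: bool
    expressible: bool  # can be expressed and purified in the lab


def select_biomarkers(candidates, min_abs_log2_fc=0.5):
    """Keep candidates that meet all three criteria.

    The absolute value is used so that both over- and underexpressed
    genes can qualify.
    """
    return [
        c.gene
        for c in candidates
        if c.antibody_available
        and c.expressible
        and abs(c.log2_fc) > min_abs_log2_fc
    ]


# Hypothetical candidates, for illustration only.
example_candidates = [
    Candidate("G1", 1.3, True, True),
    Candidate("G2", -0.9, True, True),
    Candidate("G3", 0.2, True, True),    # fold change too small
    Candidate("G4", 1.8, False, True),   # no antibody available
    Candidate("G5", -1.1, True, False),  # cannot be expressed and purified
]
selected_genes = select_biomarkers(example_candidates)
```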

All of this gave us a deep insight into the complexity of asthma. We were able not only to predict biomarkers for asthma diagnosis but also to find the connection between respiratory syncytial virus (RSV) Bronchiolitis and asthma. This connection shows the power of our computational pipeline, which can be run entirely on ClusteRsy. It gave us not only 13 predicted biomarkers for asthma diagnosis but also an idea of how to identify infants at high risk of developing severe RSV Bronchiolitis.

Conclusion

We have now shown that ClusteRsy can be used effectively as a tool for biomarker discovery. Through two beta-testing sessions, we assessed the user-friendliness and made several improvements, which led to easy-to-use software for advanced transcriptome analysis.

Although we have used ClusteRsy to predict biomarkers it also has many more potential uses such as optimizing gene expression and drug target discovery to name a few.

We are happy to announce that ClusteRsy will be hosted online through Mika Gustafsson's group at Linköping University, which has shown great interest in our software. This will allow for fast computations, plenty of storage for all the created results, and a reliable server. We think it is safe to say that our software will indeed be used in the real world, and we hope that it will be a useful resource as big data gets bigger.

We are truly proud of our finished product and we believe it will help not only clinicians to analyze big data in an easy manner, but also allow future iGEM teams to easily access biomarker discovery and give insights into complex diseases.

References

[1]. de Weerd HA, Badam TVS, Martínez-Enguita D, Åkesson J, Muthas D, Gustafsson M, et al. MODifieR: an Ensemble R Package for Inference of Disease Modules from Transcriptomics Networks. Bioinformatics. 2020;36(12):3918–9.

[2]. Yu G, Wang LG, Han Y, He QY. clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–7.

[3]. Global, regional, and national incidence, prevalence, and years lived with disability for 328 diseases and injuries for 195 countries, 1990–2016: a systematic analysis for the Global Burden of Disease Study 2016. Lancet. 2017;390: 1211–59.

[4]. Kavanagh J, Jackson DJ, Kent BD. Over- and under-diagnosis in asthma. Breathe. 2019;15: e20–e27.

[5]. Pratter MR, Hingston DM, Irwin RS. Diagnosis of bronchial asthma by clinical evaluation. An unreliable method. Chest. 1983;84(1):42–7.

[6]. Gherasim A, Dao A, Bernstein JA. Confounders of severe asthma: Diagnoses to consider when asthma symptoms persist despite optimal therapy. World Allergy Organ J. 2018;11(1):1–11.

[7]. Seumois G, Zapardiel-Gonzalo J, White B, Singh D, Schulten V, Dillon M, et al. Transcriptional Profiling of Th2 Cells Identifies Pathogenic Features Associated with Asthma. J Immunol. 2016;197(2):655–64.

[8]. Matsuoka T, Hirata M, Tanaka H, Takahashi Y, Murata T, Kabashima K, et al. Prostaglandin D2 as a mediator of allergic asthma. Science. 2000;287(5460):2013–7.

[9]. Fajt ML, Gelhaus SL, Freeman B, Uvalle CE, Trudeau JB, Holguin F, et al. Prostaglandin D2 pathway upregulation: Relation to asthma severity, control, and TH2 inflammation. J Allergy Clin Immunol. 2013;131(6):1504–12.

[10]. Balzar S, Fajt ML, Comhair SAA, Erzurum SC, Bleecker E, Busse WW, et al. Mast cell phenotype, location, and activation in severe asthma: Data from the Severe Asthma Research Program. Am J Respir Crit Care Med. 2011;183(3):299–309.

[11]. Oguma T, Asano K, Ishizaka A. Role of prostaglandin D2 and its receptors in the pathophysiology of asthma. Allergol Int. 2008;57(4):307–12.

[12]. Culley FJ, Pennycook AMJ, Tregoning JS, Hussell T, Openshaw PJM. Differential Chemokine Expression following Respiratory Virus Infection Reflects Th1- or Th2-Biased Immunopathology. J Virol. 2006;80(9):4521–7.

[13]. Sananez I, Raiden S, Erra-Díaz F, De Lillo L, Holgado MP, Geffner J, et al. Dampening of IL-2 Function in Infants with Severe Respiratory Syncytial Virus Disease. J Infect Dis. 2018;218(1):75–83.

[14]. Sánchez-Zauco N, Del Rio-Navarro B, Gallardo-Casas C, Del Rio-Chivardi J, Muriel-Vizcaino R, Rivera-Pazos C, et al. High expression of Toll-like receptors 2 and 9 and Th1/Th2 cytokines profile in obese asthmatic children. Allergy Asthma Proc. 2014;35(3).

[15]. Hamzaoui A, Maalmi H, Berraïes A, Abid H, Ammar J, Hamzaoui K. Transcriptional characteristics of CD4+ T cells in young asthmatic children: RORC and FOXP3 axis. J Inflamm Res. 2011;4(1):139–46.
