ClusteRsy: A user-friendly software for transcriptomic analysis and biomarker discovery.
Asthma is a chronic inflammatory disease that affects the airways of 339 million people around the world, yet its specific causes and triggers are still unknown. There is an increasing demand for refined diagnostic methods, and in the age of big data, powerful algorithms offer an alternative to traditional methods. With the web-based software ClusteRsy, we empower clinicians and biologists to analyze any RNA-seq data without the need for a bioinformatician. Using ClusteRsy and our designed workflow, differentially expressed genes and pathways can be determined, which is pivotal to understanding disease mechanisms and finding potential biomarkers. With the processed information, we have designed a theoretical biosensor to detect asthma and distinguish it from similar conditions, thereby striving both to decipher the etiology of asthma and to improve its diagnosis.
Globally, 339 million people suffer from asthma, and it is the most common chronic disease among children [1]. An estimated 1000 people die from asthma each day, and low-income areas are affected the worst [2]. Asthma is a complex, polygenic disease, and today only physiological diagnostic methods are available, such as measuring breathing capacity with a spirometer. However, several studies show that these methods are unreliable and often lead to misdiagnosis [3]. With the advancement of bioinformatics and the vast accumulation of data, we strongly believe that we can not only address this problem but also make bioinformatics accessible to more people. We therefore set out to create software for transcriptome analysis and to use it for biomarker discovery and biosensor optimization. This will allow not only for more accurate diagnosis but also for an overall better understanding of the genetic components of the disease. The software will assist researchers and clinicians without bioinformaticians as intermediaries. The project was divided into two phases: phase I covered the creation of the software ClusteRsy, literature studies of asthma, and the first design of a biosensor based on results generated by the software. Phase II will take place next year.
As bioinformatics grows through advanced techniques and the adoption of big data, we saw an opportunity to create software to assist clinicians in their work. With patient data and our software, a clinician would be able to predict which genes are connected to a certain disease and tell whether each is under- or overexpressed. The result could then be used to fuse bioinformatics with synthetic biology by creating a biosensor optimised for the resulting genes. We also noticed that many people suffer from asthma and that the disease affects their lives in many ways, especially in low-income countries. Diagnosis today relies on measuring the air exhaled from the lungs with a spirometer. This method is ineffective and can in many cases lead to misdiagnosis. Our vision was to create a device that would more easily detect and differentiate asthma, preferably from a body fluid.
The aim of the programming group was to create software so intuitive that a user with no previous knowledge of programming could utilize state-of-the-art algorithms to detect differentially expressed genes. Advanced algorithms for detecting differentially expressed genes in RNA-sequencing samples have been developed at Linköping University [4]. Using these algorithms requires knowledge of data pre-processing and parameter optimization. This narrows the pool of potential users to those with a background in bioinformatics and programming, so the typical holders of RNA-sequencing data, such as clinicians and biologists, have trouble using the algorithms in their research. Today, RNA-sequencing data is sent to a bioinformatician to be analyzed. We in iGEM Linköping want to remove this step and empower clinicians and biologists to analyze RNA-sequencing data by themselves. We have therefore created the web-based software ClusteRsy.
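To make the term "differentially expressed gene" concrete for readers without a bioinformatics background, the sketch below flags genes whose expression differs between patient and control samples using a log2 fold change and a Welch t-statistic. It is a toy illustration only, not the method used by ClusteRsy or MODifieR [4]: the function names, the thresholds, the pseudocount, and the gene names and expression values are all hypothetical, and real analyses add normalization and multiple-testing correction.

```python
import math
from statistics import mean, variance

def log2_fold_change(case, control, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount (an assumption) avoids log(0)."""
    return math.log2((mean(case) + pseudocount) / (mean(control) + pseudocount))

def welch_t(case, control):
    """Welch's t-statistic for two groups with possibly unequal variances."""
    diff = mean(case) - mean(control)
    se = math.sqrt(variance(case) / len(case) + variance(control) / len(control))
    if se == 0:
        # No spread in either group: any difference is infinitely "significant".
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def rank_genes(expression, min_abs_lfc=1.0, min_abs_t=3.0):
    """Return (gene, direction, lfc, t) for genes passing both illustrative thresholds,
    sorted by decreasing absolute t-statistic."""
    hits = []
    for gene, (case, control) in expression.items():
        lfc = log2_fold_change(case, control)
        t = welch_t(case, control)
        if abs(lfc) >= min_abs_lfc and abs(t) >= min_abs_t:
            direction = "over" if lfc > 0 else "under"
            hits.append((gene, direction, lfc, t))
    return sorted(hits, key=lambda h: abs(h[3]), reverse=True)

# Hypothetical expression values: gene -> (patient samples, control samples).
toy_expression = {
    "GENE_A": ([100, 120, 110, 130], [20, 25, 22, 30]),
    "GENE_B": ([10, 12, 11, 9], [50, 55, 48, 60]),
    "GENE_C": ([40, 42, 39, 41], [41, 40, 43, 38]),
}

if __name__ == "__main__":
    for gene, direction, lfc, t in rank_genes(toy_expression):
        print(f"{gene}: {direction}expressed (log2FC={lfc:.2f}, t={t:.1f})")
```

In this toy data, GENE_A and GENE_B pass both thresholds in opposite directions, while GENE_C, whose group means are identical, is not reported. Choosing thresholds and handling thousands of genes at once is the kind of parameter work that a tool like ClusteRsy is meant to hide behind sensible defaults.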
We have created a computational pipeline that utilizes the tools integrated in ClusteRsy; our goal with this model is to predict new biomarkers for asthma diagnosis. The pipeline could be used to predict biomarkers for any disease for which RNA-sequencing data is available. We hope that our results can inspire more teams to harness the potential that network biology and statistical modeling have to offer.
Since the project was planned in two phases, the experimental subgroup focused mainly on laying the groundwork for next year's phase II. Our work was divided into three major steps. The goal of step one was to gain basic knowledge about asthma as a disease, its phenotypes, endotypes, and disease mechanisms, and to find known asthma biomarkers. The biomarkers of highest interest were then chosen, with the criteria that they should be proteins and preferably enzymes, since that could aid in the production of a biosensor. Three proteins were chosen: eosinophil-derived neurotoxin (EDN), eosinophil cationic protein (ECP), and histamine N-methyltransferase (HNMT). Constructs for these proteins were then designed in Benchling and ordered from IDT. In step two we aimed to express and purify the chosen protein biomarkers, which will be used as controls for validating the biomarkers predicted by ClusteRsy during phase II. In the wet lab, the designed constructs were transformed into Escherichia coli (E. coli) and expression levels were measured. ECP was expressed successfully, and nanoDSF was performed to check the folding of the protein. The results showed that ECP was at least partially correctly folded. The third and final step of phase I was to create a theoretical biosensor, which will be developed by next year's team.
Our objective was to bridge the gap between clinicians and data analysis, so we developed the software ClusteRsy for transcriptome analysis and biomarker discovery. To make sure we met this ambition, we decided to hold several beta-testing sessions in which people with varying backgrounds would test ClusteRsy and give us feedback. During the course of the project, we had time for two sessions. In the first session we mainly wanted feedback on user-friendliness, so that we would learn early what could be improved to make ClusteRsy easy to use. We invited our PIs, bioinformaticians, and the experimental part of our team, and they gave us varied feedback. The main message was that ClusteRsy was quite difficult to understand. In response, we added a click-through tutorial to the homepage that shows the different elements of the software and explains how to use them. We also added more explanations in the form of clickable buttons. To make the parameters easier to navigate, we moved the more complicated ones under an advanced-settings option and let the user rely on default settings. By the second session we were almost finished with ClusteRsy and wanted to investigate how well we had met the criteria we started with. We received feedback both on the improvements made since the first session and on how well the tool fulfilled the initial criteria. The people who had used ClusteRsy during the first session said that the improvements satisfactorily addressed their initial concerns, which gave us confidence in our design process. The clinicians asked for more detailed instructions, preferably in the form of video tutorials, which would help when instructing new users and make it much easier to learn how to use ClusteRsy.
Because of this feedback we decided to make video tutorials and upload them to YouTube. These will guide future users and aid anyone interested in using ClusteRsy.
The final version of ClusteRsy is a sophisticated, intuitive, and easy-to-use software for big RNA data analysis. One ambition was to make the software easy to use for people who are not familiar with bioinformatics, so clickable explanations and a user guide lead the user through ClusteRsy. This ambition for user-friendliness was set during the first meeting with the clinicians. Over the two rounds of beta testing, we saw a great increase in perceived user-friendliness. Another wish expressed at the first beta testing that we took into account was to hide all settings that the user does not need to change for each analysis and to set them to appropriate default values. A further aspiration was to make ClusteRsy look like a modern tool that is pleasing to the eye. We accomplished this by using contrasting yet matching colors combined with rounded shapes wherever possible. We also divided the information into separate tabs, where all relevant information is easy to find without the tool becoming cluttered. Throughout this project, we also researched asthma and its biomarkers through the literature study and later came up with the idea of a biosensor that will help in asthma diagnosis. Our work was divided into three major steps. In the first step we gained knowledge about asthma as a disease, its phenotypes, endotypes, and disease mechanisms, and found three good asthma biomarkers. The second step was to express and purify the chosen protein biomarkers, which will be used as controls for validating the biomarkers predicted by ClusteRsy during phase II. The third and final step of phase I was to create a theoretical biosensor, which will be developed by next year's team.
To conclude, ClusteRsy is software developed for transcriptome analysis that will enable other teams as well as clinicians to easily access biomarker discovery and gain new insight into complex diseases. We can show not only that a modular approach using our computational pipeline improves the results significantly, but also that the results are detailed enough to distinguish heterogeneity within a disease. Using this computational pipeline, we successfully predicted 13 biomarkers for asthma diagnosis. We also propose that, with our findings, a screening program could identify infants at high risk of severe RSV bronchiolitis.
It is crucial to validate the potential biomarkers found both in the literature study and with ClusteRsy, to confirm that they are indeed involved in asthma and can therefore be reliably used as biomarkers for the biosensor. The best option for this would be an antibody-based assay. One of the criteria when selecting the final biomarkers was that known antibodies existed. The idea is to create antibodies specifically for the predicted biomarkers and then compare blood samples from patients and controls using this assay. One idea for phase II of the project is to create these antibodies ourselves, using a protocol for antibody production created by the Rochester iGEM team this year and optimizing it for our purpose. The biomarkers shown to be significant in this validation will finally make it to the biosensor assay.
[1] Global, regional, and national incidence, prevalence, and years lived with disability for 328 diseases and injuries for 195 countries, 1990–2016: a systematic analysis for the Global Burden of Disease Study 2016. Lancet. 2017;390:1211–59.
[2] Enilari O, Sinha S. The Global Impact of Asthma in Adult Populations. Annals of Global Health. 2019;85(1):2.
[3] Kavanagh J, Jackson DJ, Kent BD. Over- and under-diagnosis in asthma. Breathe (Sheff). 2019;15(1):e20-e27. doi:10.1183/2073473
[4] de Weerd HA, Badam TVS, Enguita DM, Åkesson J, Muthas D, Gustafsson M, Lubovac-Pilav Z. MODifieR: an Ensemble R Package for Inference of Disease Modules from Transcriptomics Networks. Bioinformatics. 2020;36(12):3918-19.
We want to dedicate a special thanks to the following people and teams: Lars-Göran Mårtensson, Per Hammarström, Mika Gustafsson, Hendrik de Weerd, Sofia Nyström, Jan Ernerudh, Maria Jenmalm, Rochester iGEM Team 2020, and Imperial College iGEM Team 2020.
Adam Lång, Alexander Johansson, Christina Hedner, Erika Mattson, Thalia Gárate Rodríguez, Frida Haugskott, Ida Berghtén, Jake Pham, Astrid Welin, Lucas Porcile, Natalia Savinkova, Ronja Höglund, Osas Iyere, Oliver Hild-Walett