We have created a computational pipeline that utilizes the tools integrated to ClusteRsy and our goal with this model is to predict new biomarkers for asthma diagnosis. This pipeline could be used to predict biomarkers for any disease as long as there is RNA-sequencing data available, we hope that our results can inspire more teams to harness the potential that network biology and statistical modeling has to offer.
This is an overview of the modeling pipeline. Through this pipeline, starting with the RNA-sequencing data as input, followed by MODifieR for disease module identification we then split modules that are too big and finally validate our findings using PASCAL. Before any further analysis can be done we pick out the modules that made it through the significance threshold from PASCAL.
Once this is done, enrichment analysis through the disease analysis algorithm is performed on all the modules that made it passed PASCAL. Now when we have extracted biologically relevant information about the modules it’s time to visualize our findings. This is done using Cytoscape and here we search through the networks looking for interesting patterns, in our case, genes that are highly correlated with asthma and asthma-related diseases. The genes should also make it above a fold change threshold to indicate that they are indeed over or under-expressed in Asthma, this is crucial for the biosensor.
Finally we arrive at the predicted biomarkers. To refine our findings even more we now gather information about these predicted biomarkers, questions like are they easy to express and purify? Are there any antibodies that we can use to experimentally validate the findings and have these predicted biomarkers been mentioned in literature before? Needs to be answered in order to make it to the final biosensor assay for asthma diagnosis.
When selecting our input data we set up a few criteria to help us find a good dataset. Firstly the organism should be Homo sapiens. It should be Th2 cells and the experiment should be expression profiling by high throughput sequencing. From these criteria, we found the dataset GSE75011.
The GSE75011 count matrix was obtained from the NCBI GEO datasets. The data was collected during the transcriptional profiling of TH2 cells identifying pathogenic features associated with asthma. TH2 cells that are circulating in the body were isolated from a cohort of patients with allergic rhinitis (data from those patients was not used for our purpose), asthma and control group, which were healthy non-allergic subjects. The aim of the study was to determine if there is a qualitative difference in TH2 cells between the asthmatic and healthy group of patients.
RNA-seq was then prepared by extracting total RNA, selection mRNA, and sequencing of amplified cDNA.
Once the input data had been collected we prepared it for MODifieR by transforming it into a usable format for MODifieR to handle. Firstly the data was loaded into R. With the AnnotationDbi package we then converted the gene symbols into Entrez ID, the org.Hs.eg.db package was used for mapping between the identifiers. Once this was done then groups were selected, i.e, patients and controls were chosen according to the study for this dataset, an adjusted p-value, as well as normalized quantiles, was used. Differentially expressed genes were calculated using a built-in function in the MODifieR package which in turn uses generalized linear models from the edgeR package.
The resulting input data contained 12923 genes where 2306 was differentially expressed (P-value < 0.05).
MODifieR also needs a PPI network for mapping the genes present in the input data. We selected the STRING v.11 PPI network and filtered it to only contain human genes with a high confidence score > 700. Once the MODifieR input and PPI network had been prepared we ran all of the 8 methods included in MODifieR using the default parameter setting. The calculations were performed on a supercomputer in order to speed up the calculations.
The resulting data we received from MODifieR were networks of genes that the algorithms predicted to be significant given the input data. These modules don't contain any biologically interesting results themselves and further analysis using enrichment analysis is therefore needed. To validate how well MODifieR performed we chose to validate the resulting modules using PASCAL.
To measure the performance of our MODifieR model we tested it against only using the input data without MODifieR applied to it. To do this we firstly filtered the input data to only contain differentially expressed genes with a P-value < 0.05. This step is not needed when using MODifieR since it’s filtering the input data automatically with different settings, but in order to get significant results from only the input data, the data itself must contain significance. Once the filtering was done we ran disease analysis on this filtered input data using the same settings as for our MODifieR modules.
The DiffCoEx and WGCNA modules were too big, i.e there were more than 10.000 genes in the module and they could not be analyzed properly in PASCAL. To reduce the size we split these modules into smaller submodules. The splitted DiffCoEx resulted in 30 different submodules and the splitted WGCNA in 9 different submodules. The splitting was easily done within ClusteRsy and the resulting submodules could then be analyzed with PASCAL.
PASCAL stands for Pathway Scoring Algorithm and is open-source software designed to gain biological insight into genome-wide association studies. These studies typically generate results in the form of genes associated with certain traits or diseases. In order to draw any biologically relevant conclusions from these results, they have to be further analyzed. PASCAL generates gene and pathway scores for this purpose. This is done by mapping genomic regions associated with SNPs detected in the GWAS to already established pathways.
During this project PASCAL was used to validate the MODifieR modules. This was done by setting up a threshold for the calculated SNP P-values from PASCAl. The threshold was set to a P-value = 0.05, all of the modules that were considered significant had a P-value > threshold.
Enrichment analysis also known as overrepresentation analysis is a common approach to determine if a biological process or function is enriched. This is done by calculating the P-values from a hypergeometric distribution given a list of differentially expressed genes (DEGs).
Once the modules had been validated with PASCAL, the significant modules were collected to perform enrichment analysis. Our main focus was to investigate the genes involved in allergic asthma and other diseases related to asthma, therefore to run disease analysis using the DisGeNET (DGN) database was the best option.
Disease analysis was calculated within ClusteRsy on all the significant modules as well as the filtered input data and default parameters were selected. The results were stored within the ClusteRsy database.
Cytoscape is open-source software for analyzing and integrating large networks with corresponding attributes. The possibility in ClusteRsy to download the modules as Cytoscape objects makes them more accessible for users and open up the opportunity for even further visualizations and analysis of the specific modules.
The significant modules according to PASCAL are enriched for detailed information of over- and underexpression of genes. The Cytoscape objects from ClusteRsy consist of two files, a PPI-network and a table of attributes for each gene in the inferred network. These two files are added to the Cytoscape environment for further analysis.
Once the data was loaded to cytoscape the network was filtered to only contain significantly differentially expressed genes (P-value < 0.05).
Fold change is defined as the ratio between two quantities. In our case, we would like to study how the expression level of a given gene differs between patients and a control group and it can be described with the following equation.
In order to analyze the fold change the log2 was used. This gives an easy to interpret fold change since log2 fold change of 1 means a doubling in the expression level of gene A in patients compared to the control.
When filtering for genes that were considered overexpressed log2 fold change value of 0.5 was used, this gives roughly 40% increased overexpression in patients compared to control. For underexpressed genes, a log2 fold change value of -0.5 was used, which gives roughly 30% decreased expression, i.e underexpressed by 30% in patients compared to control.
log2 fold change will from now on be referred to as logFC
From the filtered network we now find the disease genes. From the enrichment analysis asthma, respiratory syncytial virus (RSV), and bronchiolitis genes were collected. In Cytoscape, we then selected starting with the asthma genes and group them this was then followed by the same procedure for RSV and Bronchiolitis. Using the Venndiagram app provided in Cytoscape we could then find the intersection between the diseases.
Once the results had been investigated with Cytoscape we had retrieved a lot of biomarkers that were predicted to be significantly differentially expressed. From this data, we want to find the best biomarkers for our biosensor. In order to systematically find these proteins, we set up a few different criteria based on the experimental setup. Firstly we needed to make sure that polyclonal antibodies were purchasable in order to be available to validate our findings using an antibody-based assay such as ELISA. We also picked the ones with the highest and lowest logFC value and they should be easy or moderately easy to express and purify in order to ease the work of phase II. Another important aspect we investigated was the disease if the biomarker was present in asthma, asthma/RSV, or asthma/RSV/bronchiolitis. While selecting the relevant biomarkers according to the below-mentioned criteria, a literature study was performed to ensure that the protein is related to the diseases of interest.
From the GSE75011 count matrix an edgeR table was calculated using a built in function in MODifieR which in turn uses generalized linear models to calculate which genes were differentially expressed. The resulting input data contained 12923 genes and 2306 of them was considered to be differentially expressed (P-value < 0.05).
Using the prepared GSE75011 dataset we ran all 8 methods from MODifieR to produce in total 8 different disease modules. These modules contain genes that are highly linked together and therefore form a cluster in the network. The next step was to validate how well these modules performed and this was done using PASCAL to compare the SNP P-values from the genome-wide association study but before that could be done the modules DiffCoEx and WGCNA was too big i.e. contained more than 10.000 genes and therefore had to be split into submodules. This resulted in 30 different submodules for DiffCoEX and 9 different submodules for WGCNA.
In order to validate the modules we ran PASCAL on all the MODifieR modules. The threshold was set to a P-value < 0.05. As seen in Figure 1, the negative logarithm of the P-values are displayed, hence significant modules are above this threshold.
From validation with PASCAL we concluded that the DIAMOnD, DiffCoEX_skyblue1 (splitted module), WGCNA, WGCNA_splitted_module_by_color_3, Correlation clique, Clique sum permutation (CSP) and module discoverer modules were significant.
The final modules that were chosen to continue with were DIAMOnD, Correlation clique, CSP, and Module discoverer. DiffCoEx_skyblue1, WGCNA, and WGCNA_splitted_module_by_color_3 were discarded, this was due to the small size of some modules and none of them gave any significant results further on. As seen in Table 1, the final modules that were chosen to continue with had varying sizes. The original input data contained 12923 genes.
The final modules, DIAMOnD, Correlation clique, CSP, and Module discoverer contain information about the disease module, i.e. the resulting modules are networks of genes that are considered significant for the modules and are highly linked together. To find out any information about the disease genes from these networks enrichment analysis is needed. We performed disease analysis using the DGN database on DIAMOnD, Correlation clique, CSP, and Modules discoverer.
As seen in Table 3, the disease analysis is significant with a p-value < 0.05 for all diseases of interest as well as all the modules. The gene ratio is the number of genes involved in the disease divided by the total number of genes in that module after the disease analysis had been performed.
In order to find out how well our model using MODifieR actually performed we wanted to test it against the case of not using the model. We, therefore, filtered the input data and didn’t use MODifieR. the input data was filtered to only contain differentially expressed genes that were significant, P-value < 0.05, and performed disease analysis at this filtered input data as well.
We can show that our model using MODifieR outperforms input data only when performing enrichment analysis.
As seen in Table 4 using our model with MODifieR produces significant results with P-values < 0.05 for all diseases investigated. Compared to using input data only which returns non-significant results with P-value > 0.05 and P-adjusted = 1 for all diseases. This supports our findings and that the connections discovered are indeed correlated to the disease as well as showing the power of this computational approach.
With this we can show the importance of analyzing transcriptome data using a modular approach. Our pipeline will therefore increase the chance to get significant results when doing enrichment analysis.
When the modules had been validated and proven to be better than using only input data, we wanted to find any biological relevance out of these modules. The disease analysis provided us with very interesting results, as seen in figure 3 there is a big overlap between Asthma, RSV, and viral bronchiolitis. This truly shows the power of this computational pipeline, the amount of insight into asthma we got was more than we had expected. The connection between Asthma and severe RSV Bronchiolitis was first established at Linköping University [1] and has since then been reported in several studies [2], although the causality has not been proven yet [3], the connection is well established. From Figure 3 the asthma genes are shown in the orange area, viral bronchiolitis in green, and RSV in purple. Red is overexpressed genes with a logFC > 0.5, yellow is genes with -0.5 < logFC < 0.5 and are not considered significant for our purpose. In blue underexpressed genes are shown.
These results made us think about the idea of being able to differentiate between patients with asthma and patients who had severe childhood RSV Bronchiolitis and asthma. But first, we needed to find out if there were any clinical value in being able to differentiate between the groups. We contacted Maria Jennmalm, Professor of Experimental Allergology for a consultation about this idea. After a very informative meeting, we came to the conclusion that the intersection between asthma, RSV, and Bronchiolitis, shown in figure 3, is a perfect indicator for infants at high risk for severe RSV Bronchiolitis. These results could therefore be used for screening of infants with a higher risk for severe RSV Bronchiolitis and vaccination of this high-risk group should therefore be considered.
When selecting the predicted biomarkers of interest we started with collecting all the genes that were considered to be overexpressed (logFC > 0.5) or underexpressed (logFC < -0.5) with a P-value < 0.05 and that were involved in asthma, the intersection of asthma and RSV as well as the intersection of asthma, RSV and Bronchiolitis.
The results can be seen in Table 5. The logFC value was given a gradient ranging from light red (low positive logFC) to red (large positive logFC) and light blue (low negative value) to blue (large negative logFC) this was done to easily spot the most differentially expressed genes.
Once this first selection was done we wanted to refine our finding even more. In order to make it to the final biomarker assay, we needed to know whether it was practically doable given these biomarkers. We, therefore, set up a few criteria to narrow down these biomarkers to a final list.
The following criteria for the final selection were based on the following: There should be purchasable polyclonal antibodies in order for us to validate our results using ELISA. The once with the highest or lowest logFC should be picked and they should be easy to moderately easy to express in order to ease the work during phase II of the project. We also looked into whether the predicted biomarkers had been reported in asthma before by doing a literature search.
In Table 6, the final 13 biomarkers are shown. These are biomarkers that have been reported in asthma, have polyclonal antibodies to purchase, and are easy or moderately easy to express. The biomarkers have been divided into asthma, the intersection of asthma and RSV as well as the intersection of asthma, RSV, and viral Bronchiolitis. All these 13 biomarkers will be validated using a sandwich ELISA during phase 2 of the project before finally making it to the biosensor assay. From these 13 biomarkers, all of them could potentially be used for asthma diagnosis and we also purpose that from the intersection between Asthma, RSV, and Bronchiolitis we can use these biomarkers to screen for infants who are at high risk to develop severe RSV Bronchiolitis.
Not only can we show that a modular approach using our computational pipeline improves the results significantly but the amount of detail in the results is sufficient enough to be able to distinguish between heterogeneity within a disease.
We could, by using this computational pipeline successfully predict 13 biomarkers for asthma diagnosis. We also purpose that with our findings we could with a screening program identify infants with high risk for severe RSV bronchiolitis.
[1]. Sigurs N, Bjarnason R, Sigurbergsson F, Kjellman B. Respiratory syncytial virus bronchiolitis in infancy is an important risk factor for asthma and allergy at age 7. Am J Respir Crit Care Med. 2000;161(5):1501-7. doi: 10.1164/ajrccm.161.5.9906076.
[2]. Wu P, Hartert TV. Evidence for a causal relationship between respiratory syncytial virus infection and asthma. Expert Rev Anti Infect Ther. 2011;9(9):731-745. doi:10.1586/eri.11.92
[3]. Altman MC, Beigelman A, Ciaccio C, Gern JE, Heymann PW, Jackson DJ, et al. Evolving concepts in how viruses impact asthma. J Allergy Clin Immunol. 2020;145(5):1332-1344. doi: 10.1016/j.jaci.2019.12.904.