Programming Model Experimental

Programming

The final version of the ClusteRsy software is a sophisticated, intuitive, and easy to use software for analysis of big-RNA-data. One of the ambitions of the software was to make it easy to use for people who are not familiar with bioinformatics, so through the clickable explanations, user guide, and Youtube tutorials the user is guided through the use of ClusteRsy. During the first meeting with the clinicians, this ambition was set. Throughout the two rounds of beta testing, we witnessed a great increase in the experienced user-friendliness. Another of the wishes opinionated at the first beta testing that we took into account was the hiding of all settings that were not vital for the user to change for each analysis and setting those to an appropriate default value. Another of the aspirations we set ourselves was to make ClusteRsy look like a modern tool that is pleasing for the eye. This was accomplished by using contrasting, yet matching colors combined with rounded shapes whenever it was possible. We also divided the information to separate tabs where it would be easy to find all relevant information without the tool being too cluttered.

To conclude, ClusteRsy is a software developed for transcriptome analysis and it will enable other teams as well as clinicians to easily access biomarker discovery and give new insight into complex diseases.

Video: An introduction of the ClusteRsy interface.

Model

Here we will showcase the power of ClusteRsy and our modeled computational pipeline that anyone can use. We have used RNA-seq data from the NCBI GEO database from the asthma study GSE75011 on Th2 cells. Without prior knowledge, we were able to find a connection between asthma and RSV bronchiolitis using the results from ClusteRsy and our pipeline.

Input data

From the GSE75011 count matrix an edgeR table was calculated using a built-in function in MODifieR which in turn uses generalized linear models to calculate which genes were differentially expressed. The resulting input data contained 12923 genes and 2306 of them were considered to be differentially expressed (P-value < 0.05).

MODifieR

Using the prepared GSE75011 dataset we ran all 8 methods from MODifieR to produce in total 8 different disease modules. These modules contain genes that are highly linked together and therefore form a cluster in the network. The next step was to validate how well these modules performed. This was done using PASCAL to compare the SNP P-values from the genome-wide association study. Before that could be done the modules DiffCoEx and WGCNA were too big which means they contained more than 10.000 genes and therefore had to be split into submodules. This resulted in 30 different submodules for DiffCoEX and 9 different submodules for WGCNA.

Validation of modules

In order to validate the modules we ran PASCAL on all the MODifieR modules. The threshold was set to a P-value < 0.05. As seen in Figure 1, the negative logarithm of the P-values are displayed, hence significant modules are above this threshold.

From validation with PASCAL we concluded that the DIAMOnD, DiffCoEX_skyblue1 (splitted module), WGCNA, WGCNA_splitted_module_by_color_3, Correlation clique, Clique sum permutation (CSP) and module discoverer modules were significant.

The final modules that were chosen to continue with were DIAMOnD, Correlation clique, CSP, and Module discoverer. DiffCoEx_skyblue1, WGCNA, and WGCNA_splitted_module_by_color_3 were discarded, this was due to the small size of some modules and none of them gave any significant results further on. As seen in Table 1, the final modules that were chosen to continue with had varying sizes. The original input data contained 12923 genes.

”Alttext” — **Figure 1.** Results from PASCAL. The negative logarithm of the P-values are being displayed on the y-axis and all the MODifieR modules on the x-axis. Red is seed based methods, blue is co-expression based and yellow is Clique based methods. The dashed line is the threshold, alpha, set to a P-value = 0.05. All the significant modules are above this dashed line. Investigating the resulting modules further DiffCoEX_skyblue1 (splitted module), WGCNA and WGCNA_splitted_module_by_color_3 was discarded.

Enrichment analysis

The final modules, DIAMOnD, Correlation clique, CSP, and Module discoverer contain information about the disease module, i.e. the resulting modules are networks of genes that are considered significant for the modules and are highly linked together. To find out any information about the disease genes from these networks enrichment analysis is needed. We performed disease analysis using the DGN database on DIAMOnD, Correlation clique, CSP, and Modules discoverer.

As seen in Table 2, the disease analysis is significant with a p-value < 0.05 for all diseases of interest as well as all the modules. The gene ratio is the number of genes involved in the disease divided by the total number of genes in that module after the disease analysis had been performed.

How well our model preformed

In order to find out how well our model using MODifieR actually performed we wanted to test it against the case of not using the model. We, therefore, filtered the input data and did not use MODifieR. The input data was filtered to only contain differentially expressed genes that were significant, P-value < 0.05, and performed disease analysis at this filtered input data as well.

We can show that our model using MODifieR outperforms input data only when performing enrichment analysis.

As seen in Table 3 using our model with MODifieR produces significant results with P-values < 0.05 for all diseases investigated, compared to using input data only, which returns non-significant results with P-value > 0.05 and P-adjusted = 1 for all diseases. This supports our findings and that the connections discovered are indeed correlated to the disease as well as showing the power of this computational approach.

With this we can show the importance of analyzing transcriptome data using a modular approach. Our pipeline will therefore increase the chance to get significant results when doing enrichment analysis.

Predicted biomarkers

Figure 2. Network representation of the different modules, Module Discoverer upper left, DIAMOnD upper right, Correlation Clique bottom left and CSP bottom right. Red is overexpressed genes with a logFC > 0.5, yellow is genes with -0.5 < logFC < 0.5 and are not considered significant for our purpose. Blue is underexpressed genes with a logFC < -0.5. The green area is the genes involved in viral Bronchiolitis, orange area is genes involved in Asthma and the purple area is genes involved in RSV. The intersection between the disease are genes present in both or all diseases.

When the modules had been validated and proven to be better than using only input data, we wanted to find any biological relevance out of these modules. The disease analysis provided us with very interesting results, as seen in Figure 2 there is a big overlap between Asthma, RSV, and viral bronchiolitis. This shows the power of this computational pipeline, the amount of insight into asthma we got was more than we had expected. The connection between Asthma and severe RSV Bronchiolitis was first established at Linköping University and has since then been reported in several studies, although the causality has not been proven yet, the connection is well established [1-3]. From Figure 2 the asthma genes are shown in the orange area, viral bronchiolitis in green and RSV in purple. Red is overexpressed genes with a logFC > 0.5, yellow is genes with -0.5 < logFC < 0.5 and are not considered significant for our purpose. In blue underexpressed genes are shown.

These results made us think about the idea of being able to differentiate between patients with asthma and patients who had severe childhood RSV Bronchiolitis and asthma. But first, we needed to find out if there were any clinical value in being able to differentiate between the groups. We contacted Maria Jenmalm, Professor of Experimental Allergology for a consultation about this idea. After a very informative meeting, we came to the conclusion that the intersection between asthma, RSV, and Bronchiolitis, shown in Figure 2, is a perfect indicator for infants at high risk for severe RSV Bronchiolitis. These results could therefore be used for screening of infants with a higher risk for severe RSV Bronchiolitis and vaccination of this high-risk group should therefore be considered.

First selection of biomarkers

When selecting the predicted biomarkers of interest we started with collecting all the genes that were considered to be overexpressed (logFC > 0.5) or underexpressed (logFC < -0.5) with a P-value < 0.05 and that were involved in asthma, the intersection of asthma and RSV as well as the intersection of asthma, RSV and Bronchiolitis.

The results can be seen in table 4. The logFC value was given a gradient ranging from light red (low positive logFC) to red (large positive logFC) and light blue (low negative value) to blue (large negative logFC) this was done to easily spot the most differentially expressed genes.

Once this first selection was done we wanted to refine our finding even more. In order to make it to the final biomarker assay, we needed to know whether it was practically doable given these biomarkers. We, therefore, set up a few criteria to narrow down these biomarkers to a final list.

Criteria for selecting the final biomarkers

The following criteria for the final selection were based on the following: There should be purchasable antibodies in order for us to validate our results using ELISA. The ones with the highest or lowest logFC should be picked and they should be easy to moderately easy to express in order to ease the work during phase II of the project. We also looked into whether the predicted biomarkers had been reported in asthma before by doing a literature search.

The final biomarkers

In Table 5, the final 13 biomarkers are shown. These are biomarkers that have been reported in asthma, have polyclonal antibodies to purchase, and are easy or moderately easy to express. The biomarkers have been divided into asthma, the intersection of asthma and RSV as well as the intersection of asthma, RSV, and viral Bronchiolitis. All these 13 biomarkers will be validated using a sandwich ELSA during phase 2 of the project before finally making it to the biosensor assay. From these 13 biomarkers, all of them could potentially be used for asthma diagnosis and we also purpose that from the intersection between Asthma, RSV, and Bronchiolitis we can use these biomarkers to screen for infants who are at high risk to develop severe RSV Bronchiolitis.

Conclusion

Not only can we show that a modular approach using our computational pipeline improves the results significantly, but the amount of detail in the results is sufficient enough to be able to distinguish between heterogeneity within a disease.

We could use this computational pipeline to successfully predict 13 biomarkers for asthma diagnosis. We also purpose that with our findings we could with a screening program identify infants with high risk for severe RSV bronchiolitis.

Experimental

Due to the ongoing pandemic the time that our team could spend in the lab was limited. Because of this time limitation, our main goals were to express and purify chosen proteins to simplify laboratory work for phase 2 of the project that could be performed next year. To minimize the time spent in the laboratory, constructs were ordered from IDT precloned in vectors (pIDTSmart-Kan) ready to be transformed and expressed directly, removing the time spent on manual cloning.

Plasmid purification

To investigate the presence of our plasmids, and agarose gel analysis of plasmid DNA was done. Positive bands were seen for pIDTSmart-EDN/ECP from purified plasmid DNA which was transformed into Escherichia coli (BL21). These bands indicate a functioning plasmid and a non-lethal construct. However, due to the nature of EDN/ECP, the BL21 cells which carried these constructs grew slow. The most intense bands in Figure 3. indicate the presence of a near-correct-sized plasmid, bands above it indicate different conformations of that same plasmid. Both constructs are ~2700 bp, however, showing up at a slightly lighter size can be explained by supercoiling.

Protein purification

After purification, the fractions were loaded onto an SDS-PAGE gel to investigate the content (for protocol, see Experiments). As seen in Figure 4, purified ECP could be seen at 26-27 kDa. The ECP protein has a molecular weight of 17.4 kDa (with His-tag and TEV-site), the slightly larger size on the gel can be explained by its high positive charge. While EDN (Lane 5-7) showed a similar size as ECP, the amount was less, and the sample was more unpure than ECP.

Protein folding verification

Based on previous results from the SDS-PAGE the folding of ECP was decided to be investigated. That was done by using nano Differential Scanning Fluorimetry (nanoDSF). Using this method a thermal unfolding curve was generated. ECP has tyrosine and tryptophan residues, which has intrinsic fluorescence. The environments of those residues will alter as they become more exposed to the solvent, which will change the fluorescence intensity. The relation between the change in fluorescence intensity and temperature change can be measured using a melting temperature (Tm). Tm indicates that half of the proteins in the sample are unfolded, which corresponds to the midpoint transition between folded and unfolded states.

Due to the 4 disulfide bonds in ECP, dithiothreitol (DTT) was added (10 mM) as a control to find out the state of these bonds and a nanoDSF experiment was conducted (between 20-100 °C). As seen in Figure 5 the first conformational change (tighter conformation as seen in Figure 5, negative derivate) occurred at 49 °C (10 mM DTT, 5 mM borate, pH 7.5) and 51 °C (5 mM borate, pH 7.5), which is quite far from the literature value at 39.4 °C (50 mM HEPES, pH 7.5), possibly due to the different buffers used. The thermal unfolding of the protein can be seen at 65 °C in the control sample with DTT and 69.5 °C in the sample without any DTT. This is in accordance with the literature, which showed 68.6 °C (50 mM HEPES, pH 7.5) [4]. These results indicate that the disulfide bonds and the fold of the protein were at least partially correct. These results also show that the construct design, expression, and purification of potential biomarkers have a good basis for further experiments during phase II. With slight improvements, which are discussed in the Engineering part of the wiki, we believe that human disulfide bridge-rich proteins can be produced at good yields.

References

[1]. Sigurs N, Bjarnason R, Sigurbergsson F, Kjellman B. Respiratory syncytial virus bronchiolitis in infancy is an important risk factor for asthma and allergy at age 7. Am J Respir Crit Care Med. 2000;161(5):1501-7. doi: 10.1164/ajrccm.161.5.9906076.

[2]. Wu P, Hartert TV. Evidence for a causal relationship between respiratory syncytial virus infection and asthma. Expert Rev Anti Infect Ther. 2011;9(9):731-745. doi:10.1586/eri.11.92

[3]. Altman MC, Beigelman A, Ciaccio C, Gern JE, Heymann PW, Jackson DJ, Kennedy JL, Kloepfer K, Lemanske RF Jr, McWilliams LM, Muehling L, Nance C, Peebles RS Jr. Evolving concepts in how viruses impact asthma: A Work Group Report of the Microbes in Allergy Committee of the American Academy of Allergy, Asthma & Immunology. J Allergy Clin Immunol. 2020 May;145(5):1332-1344. doi: 10.1016/j.jaci.2019.12.904.

[4]. Nikolovski Z, Buzón V, Ribó M, Moussaoui M, Vilanova M, Cuchillo, CM, Nogués MV. Thermal unfolding of eosinophil cationic protein/ribonuclease 3: a nonreversible process. Protein science, 2006;15(12):2816-2827

Team:Linkoping/Results