Team:Concordia-Montreal/Software

Astroyeast - Accelerating outer space exploration through synthetic biology

...

Introducing

...

Open-source, differential gene expression analysis software and database for microgravity researchers.


Overview of AstroBio

AstroBio is a well-curated, open-source software and database that compiles literature findings on microgravity-induced gene expression changes in yeast, bacteria, and plants. AstroBio's database contains data from published microarray and RNA-seq experiments conducted either during spaceflight or under simulated microgravity conditions. These experimental datasets were retrieved from the NCBI Gene Expression Omnibus (GEO) database and the NCBI Sequence Read Archive (SRA) database and analyzed using the R software suite. Differential expression analysis was performed on datasets that were curated for quality based on normal density plot distributions, a sufficient number of replicates for robust statistical analysis, and the number of false discovery rates close to 1. AstroBio allows users to search by specific gene, micro-organism, species, microgravity-induced gene regulation (upregulated versus downregulated), open reading frame, microgravity conditions (space-flown experiments versus simulated-microgravity experiments), and/or assay type (RNA-seq versus microarray experiments). It also allows users to compare findings from different studies and to determine whether a change in the expression of a given Saccharomyces cerevisiae gene is specific to microgravity-induced stress when compared to other stressors such as heat shock.

'Having a subset of high quality data that can be validated, is worth something. There are pools and pools of poor subsets out there... but if you can offer something of quality it will be very much appreciated.'

Denis Legault
-Bioinformatics Project Manager at the Center for Structural and Functional Genomics, Concordia University


Human practices interview with Denis

Quick Tour of

...

Why is AstroBio needed?

01. Need for well-curated data

NCBI's GEO2R (differential gene expression analysis platform) accesses and analyzes data regardless of its quality. There are no quality controls for normal distribution and cross-comparability of samples. GEO2R also analyzes datasets regardless of whether there is a sufficient number of sample replicates to ensure a robust statistical analysis. To fill this gap, we built AstroBio as a well-curated database that includes only datasets checked for quality based on normal density plot distributions, a sufficient number of replicates for robust statistical analysis (at least 3 sample replicates), and the number of false discovery rates close to 1.
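The curation criteria above can be read as a simple gate applied to each candidate dataset. The following is only an illustrative sketch: the function name, the 0.99 cutoff for "near 1", and the `max_fraction_near_one` limit are assumptions and not values used by AstroBio; only the minimum of 3 replicates comes from the text. We also assume that a dataset dominated by false discovery rates near 1 is rejected. The density-plot check is a visual inspection and is not modelled here.

```python
def passes_curation(replicates_per_group, adjusted_p_values,
                    min_replicates=3, near_one=0.99,
                    max_fraction_near_one=0.5):
    """Return True if a dataset passes the (illustrative) quality gate.

    replicates_per_group: dict mapping group name -> number of replicates.
    adjusted_p_values: list of FDR-adjusted p-values, one per gene.
    near_one and max_fraction_near_one are hypothetical thresholds.
    """
    # Every experimental group needs enough replicates for robust statistics.
    if any(n < min_replicates for n in replicates_per_group.values()):
        return False
    if not adjusted_p_values:
        return False
    # Count genes whose false discovery rate is essentially 1.
    near = sum(1 for p in adjusted_p_values if p >= near_one)
    return near / len(adjusted_p_values) <= max_fraction_near_one
```

For example, a dataset with 3 flight and 3 ground replicates passes the replicate check, whereas one with only 2 flight replicates is rejected before any p-values are examined.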

02. AstroBio has no data-type restrictions

To our knowledge, all available web applications for differential gene expression analysis are specialized for either microarray analysis or RNA-seq analysis. For example, NCBI's GEO2R and Bioinformatics Array R Tool (BART) are both restricted to microarray data, whereas the integrated Differential Expression and Pathway analysis (iDEP) is restricted to RNA-seq data. We designed AstroBio to include both microarray and RNA-seq data types, allowing our users to directly and easily compare experimental findings across different assay types.

03. Need for a user-friendly interface

Although GEO2R is easy to use, it does not display search results beyond the top 250 genes for any given pairwise comparison within a dataset. Other web applications such as BART are difficult to navigate and require knowledge of the R statistical language, GitHub, and command-line interfaces. AstroBio requires no prior knowledge to navigate. Unlike GEO2R, AstroBio allows users to display an unlimited number of search results, which can be narrowed down using a number of filtering options. Users can also set the number of search results displayed per page, and they can quickly identify upregulated versus downregulated genes thanks to our unique colour-coding system.

04. Ability to directly and easily compare findings across studies

To our knowledge, there are no existing databases or web applications that allow users to directly compare experimental findings for a given gene across studies. AstroBio solves this by providing users who search for a specific gene with a list of experimental findings from different research studies. Users can quickly spot differences in gene regulation, and they can easily access more information on each study in order to determine where these differences might come from.
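As a rough illustration of this kind of cross-study comparison, the sketch below groups the findings for one gene by study and flags whether the direction of regulation disagrees between studies. The record layout (`gene`, `study`, `logFC` keys) and the function name are hypothetical and chosen for the example; they are not AstroBio's internal representation.

```python
from collections import defaultdict

def compare_across_studies(results, gene):
    """Group regulation directions for one gene by study.

    results: iterable of dicts with hypothetical keys "gene", "study", "logFC".
    Returns (directions_by_study, disagreement_flag).
    """
    by_study = defaultdict(list)
    for row in results:
        if row["gene"] != gene:
            continue
        if row["logFC"] > 0:
            direction = "up"
        elif row["logFC"] < 0:
            direction = "down"
        else:
            direction = "unchanged"
        by_study[row["study"]].append(direction)
    # The gene is flagged when studies report more than one direction.
    seen = {d for directions in by_study.values() for d in directions}
    return dict(by_study), len(seen) > 1
```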

05. Need for a specialized microgravity database

A recurring concern among the microgravity researchers that we interviewed is the lack of standardization in their research field. Differences in factors such as the type of microgravity conditions (space-flown versus simulated microgravity experiments), as well as experimental design and protocols can have an impact on experimental findings. By providing researchers with an interface that allows them to easily spot differences, AstroBio is also contributing to the standardization of microgravity research.

Who is it for?

AstroBio is an invaluable tool for researchers in genetics and space-related fields, including biomanufacturing applications in space, who investigate how living organisms respond to microgravity. AstroBio provides these researchers with a well-curated and user-friendly interface that allows them to explore the effect of microgravity-induced stress on the genomes of living organisms.

What can users do with it?

Search any desired combination of 8 widely-used criteria

One of the robust features of AstroBio is its search capabilities. Users can easily select any combination of the following 8 search criteria:

1. Gene name
2. Organism
3. Species
4. Microgravity conditions (space-flown versus simulated microgravity)
5. Gene expression change in microgravity (upregulated versus downregulated expression)
6. Assay type (microarray versus RNA-seq)
7. Study (GEO accession number)
8. Platform Open Reading Frame
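One way to picture how any combination of these 8 criteria could become a single database query is sketched below. It builds a MongoDB-style filter document using field names that appear in the geneResults schemas documented further down this page. The mapping of each criterion to a field (for example, `StudyType` for the microgravity condition and `EGEOD` for the study accession) is our assumption, as is the premise that `logFC` is stored as a number; AstroBio's actual query code is not shown here.

```python
def build_query(gene=None, organism=None, species=None, condition=None,
                regulation=None, assay_type=None, study=None, orf=None):
    """Build a MongoDB-style filter from whichever criteria are supplied.

    The criterion-to-field mapping below is an assumption for illustration.
    """
    query = {}
    exact_match_fields = {
        "Gene.symbol": gene,
        "Organism": organism,
        "Species": species,
        "StudyType": condition,      # space-flown vs simulated microgravity
        "AssayType": assay_type,     # microarray vs RNA-seq
        "EGEOD": study,              # study accession
        "Platform_ORF": orf,
    }
    # Only criteria the user actually set end up in the filter.
    for field, value in exact_match_fields.items():
        if value is not None:
            query[field] = value
    # Direction of regulation is expressed as the sign of the log2 fold change.
    if regulation == "upregulated":
        query["logFC"] = {"$gt": 0}
    elif regulation == "downregulated":
        query["logFC"] = {"$lt": 0}
    elif regulation is not None:
        raise ValueError("regulation must be 'upregulated' or 'downregulated'")
    return query
```

Leaving every argument unset yields an empty filter, which in MongoDB matches all documents in a collection.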

Directly and easily compare findings across studies and important experimental factors

Unlike existing databases, AstroBio allows researchers to directly and easily compare experimental findings for a specific gene across different studies, microgravity conditions (space-flown versus simulated microgravity), assay types, and generations. Such comparisons are important for any researcher who is attempting to determine whether results are consistent between studies for a given gene or set of genes. They are also important in determining whether there are any generation effects*.

*Generation effects were included wherever applicable.

Furthermore, AstroBio is instrumental in determining whether similar results are obtained from space-flown and simulated microgravity experiments. During our human practices consultation with Macauley Green, a PhD candidate investigating gene expression changes in bacteria under microgravity conditions, we learned that there is an ongoing debate on whether simulated microgravity is similar to the microgravity that is experienced by living organisms during spaceflight. Therefore, we included “microgravity condition” in our differential expression analysis and we have also added it as a search criterion in AstroBio.


"One of the major findings during my literature review...so far in simulated microgravity using a Clinostat, like the HARV or RCCS, there's 900 genes that are differentially regulated. But when you go into true microgravity, there's 1600 genes that are differentially regulated. A lot of the issues that researchers tend to find this is, a lot of research doesn't stipulate what type of microgravity..."

Macauley Green
- Affiliation: University of Nottingham's Astropharmacy & Astromedicine Group


Interview with Macauley

Determine whether a given gene or set of genes is specifically regulated by microgravity-induced stress rather than by other stressors

The AstroBio MultiStress Explorer offers yeast researchers the unique ability to compare expression changes in Saccharomyces cerevisiae genes across different stressor conditions: microgravity-induced stress versus oxidative stress, heat shock-induced stress, or high-osmolarity-induced stress. Researchers can search for a specific gene or multiple genes and choose the pairwise comparisons desired (heat shock vs. microgravity, high osmolarity vs. microgravity, oxidative stress vs. microgravity, or all of these options). Results can be visualized in heatmaps, volcano plots, gene forest plots, and PCA-biplots. Users can also set threshold values for PCA-biplots: expression change, number of top-ranking genes, and gene expression regulation (upregulated or downregulated). This allows a researcher to determine whether a given gene’s expression is specifically and significantly affected by microgravity-induced stress compared to stress induced by other environmental stressors.
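The idea of a gene being "specifically" affected by microgravity can be made concrete with a small decision rule: the gene is significant under microgravity and not significant under any other stressor. The sketch below is our own simplification, not the MultiStress Explorer's method; the `adj_p` and `logFC` keys, the 0.05 significance level, and the minimum absolute log2 fold change of 1 are all hypothetical choices.

```python
def is_microgravity_specific(gene_stats, alpha=0.05, min_abs_logfc=1.0):
    """Return True if a gene responds to microgravity but to no other stressor.

    gene_stats: dict mapping condition name -> {"adj_p": float, "logFC": float}.
    Must contain a "microgravity" entry; alpha and min_abs_logfc are
    illustrative thresholds, not values taken from AstroBio.
    """
    def significant(stat):
        return stat["adj_p"] <= alpha and abs(stat["logFC"]) >= min_abs_logfc

    microgravity = gene_stats["microgravity"]
    others = [s for name, s in gene_stats.items() if name != "microgravity"]
    return significant(microgravity) and not any(significant(s) for s in others)
```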

Determine if a given gene is essential for survival during spaceflight

AstroBio allows yeast researchers to determine whether a given gene of interest is essential for survival during spaceflight**.

**Based on Nislow et al. (2015)

Sort, filter, export, and print search results

We designed AstroBio with a user-first approach in mind. The database is a user-friendly web application that allows researchers to sort results in ascending or descending order of expression change, p-value, or adjusted p-value. The number of search results displayed per page can be set to 10, 50, 100, or 500. Users can print search results and export them in PDF format.
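The sorting and paging behaviour described above can be sketched in a few lines. The page sizes (10, 50, 100, 500) come from the text; the function name, the row layout, and the default sort key are assumptions made for the example rather than AstroBio's implementation.

```python
ALLOWED_PAGE_SIZES = (10, 50, 100, 500)

def sort_and_paginate(rows, sort_key="logFC", descending=False,
                      page=1, page_size=10):
    """Sort result rows by one column and return a single page of them.

    rows: list of dicts; sort_key is a hypothetical column name such as
    "logFC", "p_value" or "adj_p". Pages are numbered from 1.
    """
    if page_size not in ALLOWED_PAGE_SIZES:
        raise ValueError(f"page_size must be one of {ALLOWED_PAGE_SIZES}")
    if page < 1:
        raise ValueError("page must be 1 or greater")
    ordered = sorted(rows, key=lambda row: row[sort_key], reverse=descending)
    start = (page - 1) * page_size
    return ordered[start:start + page_size]
```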




To learn how to use AstroBio, download the AstroBio User Guide.

User Guide

AstroBio Architecture

AstroBio contains transcriptomics datasets that were retrieved from the NCBI GEO and SRA databases and analyzed in the R software suite using R scripts written by our team. Included datasets were curated for quality based on normal density plot distributions, a sufficient number of replicates for robust statistical analysis, and the number of false discovery rates close to 1. Using our R scripts, we retrieved and performed differential gene expression analysis on two types of datasets: microarray and RNA-seq.

  • Microarray experiments were pulled from the NCBI Gene Expression Omnibus (GEO) database using the R package GEOquery (Davis and Meltzer, 2007). Samples were removed if they significantly differed in density distribution and/or involved treatments not related to microgravity, such as gene knockout experiments. A log2 transformation and/or cyclic loess normalization was applied if needed. The arrays were then fitted to a linear mixed model defined and provided by the R package limma (linear models for microarray data) and, after the appropriate contrasts were performed, a moderated t-statistic and a log2 fold change value were generated for each gene (Smyth et al., 2020).

  • All gene expression datasets derived from high-throughput sequencing were processed from raw sequence data provided by the NCBI Sequence Read Archive (SRA) (Leinonen et al., 2011). The processing was performed on the cloud computing environment Galaxy (Afgan et al., 2018). After preprocessing and trimming, reads were aligned to their respective reference genome using HISAT (hierarchical indexing for spliced alignment of transcripts) (Kim et al., 2015). The tool featureCounts was then used to quantify exons using the respective organism’s annotated genome (Liao et al., 2014). Count files were then filtered to remove low-expressed exons, normalized using the R package edgeR, and voom-transformed for differential expression analysis using limma (Chen et al., 2020; Smyth et al., 2020). All metadata was fetched from NCBI GEO.
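To give a feel for the low-expression filtering step in the RNA-seq bullet above, here is a simplified stand-in written in Python. It is not edgeR and does not reproduce the team's R scripts: it converts raw counts to counts per million (CPM) and keeps a feature only if it reaches a minimum CPM in a minimum number of samples. Both thresholds (`min_cpm`, `min_samples`) and the function names are assumptions, since the original cutoffs are not stated.

```python
def counts_per_million(counts_by_sample):
    """Convert {sample: {feature: raw_count}} to CPM with the same shape."""
    cpm = {}
    for sample, counts in counts_by_sample.items():
        library_size = sum(counts.values())
        if library_size == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        cpm[sample] = {feature: count * 1_000_000 / library_size
                       for feature, count in counts.items()}
    return cpm

def keep_expressed(counts_by_sample, min_cpm=1.0, min_samples=3):
    """Return features with CPM >= min_cpm in at least min_samples samples.

    min_cpm and min_samples are illustrative defaults, not AstroBio's values.
    """
    cpm = counts_per_million(counts_by_sample)
    features = sorted({f for counts in counts_by_sample.values() for f in counts})
    kept = []
    for feature in features:
        expressed_in = sum(1 for sample in cpm
                           if cpm[sample].get(feature, 0.0) >= min_cpm)
        if expressed_in >= min_samples:
            kept.append(feature)
    return kept
```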

  • AstroBio contains a dataset of Saccharomyces cerevisiae genes that were determined to be essential or non-essential for spaceflight using data derived from Nislow et al. (2015). The experiment consisted of pooling strains from both a heterozygous and a homozygous knockout library, exposing the pooled cultures to microgravity conditions at the International Space Station (ISS) in an Opticell Processing Module (OPM), then amplifying and sequencing the barcoded regions. Barcodes were then mapped to a specific knockout strain and quantified at specific time points. A linear fit was then computed for counts across time points (7-21 generations) and an F-test was performed against a null model. Significant fitness defects were defined as a robust Z score for each time comparison as follows:

    Where MAD is the median absolute deviation and R is the ratio of abundance for the i-th strain across time points. It is defined as:

    Where g is the generation of the sampled strain. A p-value for each strain can then be computed using the Z score. Strains with significant fitness defects at a specific time point are defined as having a log2R ≥ 1 and a p-value ≤ 0.001, and/or having the count drop below the background threshold. The genes associated with each knockout strain exhibiting significant fitness defects at any later time point are listed as essential for spaceflight within the database. This includes both heterozygous and homozygous knockout samples.
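For readers unfamiliar with robust Z scores, the sketch below shows one common form: each log2 abundance ratio is centred on the median and scaled by the median absolute deviation (MAD). This is an assumption about the general shape of the calculation, not a reproduction of the exact equations used for the dataset; in particular, no scaling constant is applied to the MAD, and the one-sided normal p-value is our choice. Only the final thresholds (log2R ≥ 1 and p-value ≤ 0.001) are taken from the text.

```python
import math
from statistics import median

def robust_z_scores(log2_ratios):
    """Centre on the median and scale by the MAD (assumed, unscaled form)."""
    med = median(log2_ratios)
    mad = median(abs(x - med) for x in log2_ratios)
    if mad == 0:
        raise ValueError("MAD is zero; robust Z scores are undefined")
    return [(x - med) / mad for x in log2_ratios]

def upper_tail_p(z):
    """One-sided p-value of a Z score under a standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def has_fitness_defect(log2_ratio, p_value):
    """Apply the thresholds quoted in the text: log2R >= 1 and p <= 0.001."""
    return log2_ratio >= 1 and p_value <= 0.001
```

The background-count criterion mentioned in the text is not modelled here.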

The differential gene expression analysis that we performed generated results as Excel files. The program converts these data, parses them, and stores them in the database. We selected a NoSQL database for its scalability, dynamic schema, and flexibility. MongoDB is lightweight, which is ideal for applications expecting to store a huge amount of data. The web application has a unique domain hosted in the Cloud (AWS, EC2) and it is accessible to users from any IP address.
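The convert-parse-store step can be pictured with the sketch below, which turns a flat result table into nested documents ready for insertion into a document database. It rests on two assumptions: that the result tables use dotted column names such as `adj.P.Val` and `P.Value` (the nested adj → P → Val layout in the schemas on this page is consistent with that), and that CSV text stands in for the Excel files so the example needs only the standard library. The function names are hypothetical and the team's actual converter is not shown.

```python
import csv
import io

def nest_dotted_keys(flat_row):
    """Turn {"adj.P.Val": x} into {"adj": {"P": {"Val": x}}}."""
    document = {}
    for column, value in flat_row.items():
        parts = column.split(".")
        node = document
        # Walk or create one nested level per dotted prefix.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return document

def rows_to_documents(csv_text, extra_fields=None):
    """Parse a result table and attach study-level fields to every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        document = nest_dotted_keys(row)
        document.update(extra_fields or {})
        documents.append(document)
    return documents
```

The resulting list of dictionaries could then be passed to a database driver's bulk insert call; values are left as strings here, so numeric conversion would be a separate step.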

AstroBio is linked to our MultiStress Explorer, which is also hosted in the Cloud (AWS, EC2). The AstroBio MultiStress Explorer is a web application intended for exploring the results of a series of pairwise comparisons of Saccharomyces cerevisiae transcriptomics analyses in simulated microgravity (HARV bioreactor) with various other environmental stressors. Prior to the meta-analysis, differential expression analysis was performed for each condition using linear mixed models provided by the R package limma. The adjusted p-values for each experiment were combined using a random effects model, and the fold change was summarized across conditions. The environmental stresses include:

  • Oxidative Stress from 0.19 mM cumene hydroperoxide. RNA was extracted after 6 minutes of treatment (GSE26169, Sha et al. 2013).

  • Heat Shock at 37 °C for 7 minutes (GSE132186, Mühlhofer et al. 2019).

  • Hyperosmotic Shock from 0.4 M NaCl. RNA was extracted after 6 minutes of treatment (GSE13097, Romero-Santacreu et al. 2009).

All of the above datasets were then combined both pairwise and pooled with data derived from the condition:

  • Simulated low-shear modeled microgravity (LSMMG) from growth in a High Aspect Ratio Vessel (HARV). RNA was extracted after 5 generations of growth (GSE4136, Sheehan et al. 2007).

Quick Tour of MultiStress Explorer

Database Modelling


MongoDB is part of the backend software of AstroBio. MongoDB is a document-oriented NoSQL database used for high-volume data storage. This means that instead of using tables and rows, as is the case in relational databases, AstroBio makes use of collections and documents. Collections contain sets of documents and function as the equivalent of traditional database tables (MongoDB Inc., 2020).

iGEM Concordia 2020 Database

The iGEM Concordia 2020 database contains collections which, in turn, contain documents. Each document has a varying number of fields. This structure is aligned with how classes and objects are typically constructed. The data model available within MongoDB allows us to represent hierarchical relationships and to store arrays easily. Moreover, the environment is highly scalable.

Collections

metaData

The metaData collection contains experiment information, which includes the following fields: Accessions, Treatment, Description, Experimenter, PMID, Institution, Assay Type, Design, Summary, and finally the URL of the study.

Metadata Schema

{ "_id" : "", "field" : "", "accession" : "", "treatment" : "", "description" : "", "Link" : "", "Experimenter" : "", "Contact" : "", "Title" : "", "URL" : "", "PMIDs" : "", "Institute" : "", "Design" : "", "PlatformID" : "", "Type" : "", "Summary" : "" }

geneResults

The geneResults collection contains basic gene information, as well as gene ontology information, gene annotations, and differential expression analysis results.

Available Schemas

GSE4136 Schema

{ "_id" : " ", "ID" : " ", "adj" : { "P" : { "Val" : " " } }, "P" : { "Value" : " " }, "t" : " ", "B" : " ", "logFC" : " ", "Gene" : { "symbol" : " ", "title" : " " }, "Platform_ORF" : " ", "GO" : { "Function" : " ", "Process" : " ", "Component" : " " }, "Chromosome" : { "annotation" : " " }, "EGEOD" : " ", "Organism" : " ", "Species" : " ", "Strain" : " ", "StudyType" : " ", "AssayType" : " ", "Gen" : " ", "EssentialforFlight" : " ", "meta_data" : " " }

GSE40648, GSE95388 and GSE105058 Schemas

{ "_id" : " ", "adj" : { "P" : { "Val" : " " } }, "P" : { "Value" : " " }, "t" : " ", "B" : " ", "logFC" : " ", "Gene" : { "symbol" : " ", "title" : " " }, "Platform_ORF" : " ", "GO" : { "Function" : " ", "Process" : " ", "Component" : " " }, "Chromosome" : { "annotation" : " " }, "EGEOD" : " ", "Organism" : " ", "Species" : " ", "Strain" : " ", "StudyType" : " ", "AssayType" : " ", "Gen" : " ", "meta_data" : " " }

GSE50881 and GSE90166 Schemas

{ "_id" : " ", "adj" : { "P" : { "Val" : " " } }, "P" : { "Value" : " " }, "AveExpr" : " ", "F" : " ", "logFC" : " ", "Gene" : { "symbol" : " ", "title" : " " }, "Platform_ORF" : " ", "GO" : { "Function" : " ", "Process" : " ", "Component" : " " }, "Chromosome" : { "annotation" : " " }, "EGEOD" : " ", "Organism" : " ", "Species" : " ", "Strain" : " ", "StudyType" : " ", "AssayType" : " ", "Gen" : " ", "meta_data" : " " }



reducedGenes

The reducedGenes collection is the output of a map-reduce operation that splits genes with multiple names and maps each name to the same statistical analysis results.

Map-Reduce Schema

{ "_id" : " ", "value" : " " }
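The map-reduce idea behind this collection can be imitated in plain Python, as sketched below: the map step emits one (name, result id) pair per gene name, and the reduce step gathers the ids under each name, yielding documents of the same `_id`/`value` shape as the schema above. The `///` separator between alternative names and the function names are assumptions for illustration; the actual map and reduce functions run inside MongoDB and are not shown on this page.

```python
from collections import defaultdict

def map_gene_names(document, separator="///"):
    """Emit (gene_name, result_id) for every name listed in Gene.symbol.

    The separator between alternative names is an assumed convention.
    """
    for name in document["Gene"]["symbol"].split(separator):
        name = name.strip()
        if name:
            yield name, document["_id"]

def reduce_gene_names(documents, separator="///"):
    """Group result ids by gene name into {"_id": name, "value": [ids]}."""
    grouped = defaultdict(list)
    for document in documents:
        for name, result_id in map_gene_names(document, separator):
            grouped[name].append(result_id)
    return [{"_id": name, "value": ids} for name, ids in grouped.items()]
```

With this shape, a search for any one of a gene's names resolves to the same underlying analysis results.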

References

Afgan, E., D. Baker, B. Batut, M. van den Beek, D. Bouvier, M. Čech, J. Chilton, D. Clements, N. Coraor, B.A. Grüning, A. Guerler, J. Hillman-Jackson, S. Hiltemann, V. Jalili, H. Rasche, N. Soranzo, J. Goecks, J. Taylor, A. Nekrutenko, and D. Blankenberg. 2018. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 46:W537–W544. doi:10.1093/nar/gky379.

Chen, Y., A.T. Lun, D.J. McCarthy, M.E. Ritchie, B. Phipson, Y. Hu, X. Zhou, M.D. Robinson, and G.K. Smyth. 2020. edgeR: Empirical Analysis of Digital Gene Expression Data in R. Bioconductor version: Release (3.11).

Davis, S., and P.S. Meltzer. 2007. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 23:1846–1847. doi:10.1093/bioinformatics/btm254.

Kim, D., B. Langmead, and S.L. Salzberg. 2015. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods. 12:357–360. doi:10.1038/nmeth.3317.

Leinonen, R., H. Sugawara, and M. Shumway. 2011. The Sequence Read Archive. Nucleic Acids Res. 39:D19–D21. doi:10.1093/nar/gkq1019.

Liao, Y., G.K. Smyth, and W. Shi. 2014. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 30:923–930. doi:10.1093/bioinformatics/btt656.

Mühlhofer, M., E. Berchtold, C.G. Stratil, G. Csaba, E. Kunold, N.C. Bach, S.A. Sieber, M. Haslbeck, R. Zimmer, and J. Buchner. 2019. The Heat Shock Response in Yeast Maintains Protein Homeostasis by Chaperoning and Replenishing Proteins. Cell Reports. 29:4593-4607.e8. doi:10.1016/j.celrep.2019.11.109.

Nislow, C., A.Y. Lee, P.L. Allen, G. Giaever, A. Smith, M. Gebbia, L.S. Stodieck, J.S. Hammond, H.H. Birdsall, and T.G. Hammond. 2015. Genes Required for Survival in Microgravity Revealed by Genome-Wide Yeast Deletion Collections Cultured during Spaceflight. BioMed Research International. 2015:e976458.

Romero-Santacreu, L., J. Moreno, J.E. Pérez-Ortín, and P. Alepuz. 2009. Specific and global regulation of mRNA stability during osmotic stress in Saccharomyces cerevisiae. RNA. 15:1110–1120. doi:10.1261/rna.1435709.

Sha, W., A.M. Martins, R. Laubenbacher, P. Mendes, and V. Shulaev. 2013. The Genome-Wide Early Temporal Response of Saccharomyces cerevisiae to Oxidative Stress Induced by Cumene Hydroperoxide. PLOS ONE. 8:e74939. doi:10.1371/journal.pone.0074939.

Sheehan, K.B., K. McInnerney, B. Purevdorj-Gage, S.D. Altenburg, and L.E. Hyman. 2007. Yeast genomic expression patterns in response to low-shear modeled microgravity. BMC Genomics. 8:3. doi:10.1186/1471-2164-8-3.

Smyth, G., Y. Hu, M. Ritchie, J. Silver, J. Wettenhall, D. McCarthy, D. Wu, W. Shi, B. Phipson, A. Lun, N. Thorne, A. Oshlack, C. de Graaf, Y. Chen, M. Langaas, E. Ferkingstad, M. Davy, F. Pepin, and D. Choi. 2020. limma: Linear Models for Microarray Data. Bioconductor version: Release (3.11).
