iGEM Chalmers Gothenburg 2020

Reproducibility

Lately, a lot of attention has been drawn to the reproducibility crisis in science [1], [2], [3]. A survey conducted by Nature in 2016 revealed that more than 70% of the responding researchers had tried and failed to reproduce another scientist’s experiments, and more than half had failed to reproduce their own [3].

Figure 1. According to Ioannidis et al. 2009, most researchers fail when trying to reproduce experiments, and a considerable number do not have established procedures for reproducibility.
What can we do about it?
This crisis is especially relevant to bioinformatic analysis. For many published papers, the raw data is not readily available, and neither are the details of the exact pipeline used to process the data or the statistical methods used to normalize it [1].

Figure 2. Summary of the reproducibility problem in life sciences research, highlighting the issues specific to computational biology. Figure adapted from Ioannidis et al. [1] and taken from the workshop Tools for Reproducible Research, given by SciLife and NBIS in Gothenburg, 2019.
We faced this issue ourselves when searching for complete datasets of shotgun sequencing experiments: a considerable number of researchers provided neither their raw data nor the procedure they followed to process and analyse it.

Fortunately, the trend is starting to change. We believe that reproducibility is one of the keys to good science. Therefore, we tried to follow good practices in computational biology [4], [5] when conducting our research. Among other things, this means that all the code for our pipelines is available in a version-controlled repository on GitHub, together with explanations of how to run it.

In addition, we used the package management system Conda [6]. Among other things, Conda makes sure that package versions are compatible with each other, and it lets other users trace the versions used and install exactly the same packages. In a Conda environment, you can install a specific collection of software tools (like FastQC). This is important for bioinformatics projects because new versions of software tools become available over time, and a new version may behave slightly differently and affect your results. To achieve full reproducibility, you need to be able to recreate the exact system you used for your research. This is why environments are so handy! Since our project was neither very large nor complex, we felt that these measures were enough to ensure reproducibility.
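To make the idea of a pinned environment concrete, here is a minimal sketch in Python that writes out the text of a Conda environment file with exact version pins. The environment name, the channel list, and the version numbers are hypothetical values chosen for illustration; they are not the ones used in our pipelines.

```python
def render_environment(name, channels, pinned):
    """Return the text of a Conda environment file with exact version pins.

    `pinned` maps a package name to the exact version that should be installed.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    # Sort the packages so the generated file is identical on every run.
    lines += [f"  - {pkg}={version}" for pkg, version in sorted(pinned.items())]
    return "\n".join(lines) + "\n"


# Hypothetical example values (assumptions), not the team's actual environment.
EXAMPLE_ENV = render_environment(
    name="shotgun-analysis",
    channels=["conda-forge", "bioconda"],
    pinned={"fastqc": "0.11.9", "python": "3.8.5"},
)

if __name__ == "__main__":
    print(EXAMPLE_ENV)
```

If the printed text is saved as environment.yml, another user should be able to recreate the environment with `conda env create -f environment.yml`. Pinning every tool to one exact version is what keeps a later software release from silently changing the results.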

  [1] J. P. A. Ioannidis et al., “Repeatability of published microarray gene expression analyses,” Nat. Genet., vol. 41, no. 2, pp. 149–155, Feb. 2009, doi: 10.1038/ng.295.
  [2] “Six factors affecting reproducibility in life science research and how to handle them.” https://www.nature.com/articles/d42473-019-00004-y (accessed Oct. 18, 2020).
  [3] M. Baker and D. Penny, “Is there a reproducibility crisis?,” Nature, vol. 533, no. 7604, pp. 452–454, May 2016, doi: 10.1038/533452a.
  [4] R. D. Peng, “Reproducible research in computational science,” Science, vol. 334, no. 6060, pp. 1226–1227, Dec. 2011, doi: 10.1126/science.1213847.
  [5] G. Wilson, J. Bryan, K. Cranston, J. Kitzes, L. Nederbragt, and T. K. Teal, “Good enough practices in scientific computing,” PLOS Comput. Biol., vol. 13, no. 6, p. e1005510, Jun. 2017, doi: 10.1371/journal.pcbi.1005510.
  [6] “Conda — Conda documentation.” https://docs.conda.io/en/latest/ (accessed Oct. 20, 2020).