26 Reproducibility

26.1 Reproducibility

In quantitative research, an analysis or project is considered to be reproducible if:

Reproducibility is becoming more and more important in science:

  • as programming and scripting are becoming integral in most disciplines
  • as the amount of data increases

26.2 Why?

In scientific research:

  • verifiability of claims through replication
  • incremental work, avoiding duplication

For your working practice:

  • better working practices
    • coding
    • project structure
    • versioning
  • better teamwork
  • higher impact (not just results, but code, data, etc.)

26.3 Reproducibility and software engineering

Core aspects of software engineering are:

  • project design
  • software readability
  • testing
  • versioning

As programming becomes integral to research, similar necessities arise among scientists and data analysts.

26.4 Reproducibility and “big data”

There has been a lot of discussion about “big data”:

  • volume, velocity, variety, …

Beyond the hype of the moment, as the amount and complexity of data increase:

  • the time required to replicate an analysis using point-and-click software becomes unsustainable
  • room for error increases

Workflow management software (e.g., ArcGIS ModelBuilder) is one answer; reproducible data analysis based on scripting languages like R is another.

26.5 Reproducibility in GIScience

Singleton et al. have discussed the issue of reproducibility in GIScience, identifying the following best practices:

  1. Data should be accessible within the public domain and available to researchers.
  2. Software used should have open code and be scrutable.
  3. Workflows should be public and link data, software, methods of analysis and presentation with discursive narrative.
  4. The peer review process and academic publishing should require submission of a workflow model and ideally open archiving of those materials necessary for replication.
  5. Where full reproducibility is not possible (commercial software or sensitive data), aim to adopt the aspects attainable within the circumstances.

26.6 Document everything

To be reproducible, every step of your project should be documented in detail:

  • data gathering
  • data analysis
  • results presentation

Well-documented R scripts are an excellent way to document your project.

26.7 Document well

Write code that can be easily understood by someone outside your project, including yourself in six months' time!

  • use a style guide (e.g., the tidyverse style guide) consistently
  • add a comment at the beginning of a file, including
    • date
    • contributors
    • other files the current file depends on
    • materials, sources and other references
  • add a comment before each code block, describing what the code does
  • also add a comment before any line that could be ambiguous or particularly difficult or important
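
As an illustration of the points above, a script following these conventions might look like the minimal sketch below. The placeholder date and contributor, and the function name celsius_to_fahrenheit, are hypothetical and chosen only for this example.

```r
# Temperature conversion helpers (hypothetical example script)
#
# Date: YYYY-MM-DD (placeholder)
# Contributors: A. Researcher (placeholder)
# Depends on: no other files (base R only)
# References: conversion formula F = C * 9 / 5 + 32

# Convert a numeric vector of temperatures from Celsius to Fahrenheit
celsius_to_fahrenheit <- function(temp_celsius) {
  # Stop early on non-numeric input, which would otherwise fail
  # later with a less informative error message
  stopifnot(is.numeric(temp_celsius))
  temp_celsius * 9 / 5 + 32
}

# Example usage: convert a few temperatures
temps_fahrenheit <- celsius_to_fahrenheit(c(0, 37, 100))
```

The header comment records the date, contributors, dependencies and references; each block and each potentially unclear line then gets its own short comment.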

26.8 Workflow

Relationships between files in a project are not simple:

  • in which order are files executed?
  • when to copy files from one folder to another, and where?

A common solution is to use make files, that is, scripts that run each step of the project in the correct order:

  • commonly written in bash on Linux systems
  • they can also be written in R, using functions like
    • source to execute R scripts
    • system to interact with the operating system
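
A make-style script written in R could be sketched as follows. This is only an illustration under stated assumptions: the file names (01_clean_data.R, 02_analyse_data.R, results.txt) and the folder layout are hypothetical, the two step scripts are created on the fly so that the example is self-contained, and the system call assumes a Unix-like shell where cp is available.

```r
# make.R: run every step of a (hypothetical) project in order

# To keep this example self-contained, create two small step scripts
# in a temporary folder; in a real project they would already exist
project_dir <- tempfile("project_")
dir.create(project_dir)
writeLines(
  "cleaned_values <- c(1, 2, 3)",
  file.path(project_dir, "01_clean_data.R")
)
writeLines(
  "values_mean <- mean(cleaned_values)",
  file.path(project_dir, "02_analyse_data.R")
)

# Steps 1 and 2: execute the scripts in the required order;
# the second script relies on an object created by the first
source(file.path(project_dir, "01_clean_data.R"))
source(file.path(project_dir, "02_analyse_data.R"))

# Save the result to a text file
results_path <- file.path(project_dir, "results.txt")
writeLines(as.character(values_mean), results_path)

# Step 3: copy the result into an output folder through the
# operating system (assumes a Unix-like shell providing `cp`;
# file.copy() would be the portable base R alternative)
output_dir <- file.path(project_dir, "output")
dir.create(output_dir)
exit_status <- system(paste("cp", shQuote(results_path), shQuote(output_dir)))
```

Running this single script reproduces the whole sequence, so the execution order and the file copying are recorded in code rather than in someone's memory. By default system returns the exit status of the command, where 0 indicates success.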

26.9 Future-proof formats

Complex formats (e.g., .docx, .xlsx, .shp, ArcGIS .mxd)

  • can become obsolete
  • are not always portable
  • usually require proprietary software

Use the simplest format to future-proof your analysis.
Text files are the most versatile:

  • data: .txt, .csv, .tsv
  • analysis: R scripts, Python scripts
  • write-up: LaTeX, Markdown, HTML
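
As a minimal sketch of storing data in a plain-text format, the base R functions write.csv and read.csv can be used to save a table as .csv and load it again. The data frame, its values and the file name measurements.csv are made up purely for illustration.

```r
# A small, made-up data frame used only for illustration
measurements <- data.frame(
  site = c("A", "B", "C"),
  value = c(1.5, 2, 3.25),
  stringsAsFactors = FALSE
)

# Write the data as comma-separated text, without row names,
# so that any text editor or analysis tool can open the file
csv_path <- file.path(tempdir(), "measurements.csv")
write.csv(measurements, csv_path, row.names = FALSE)

# Read the same file back into R
measurements_reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
```

The resulting file is human-readable text with a header line followed by one line per row, which does not depend on any particular software to be opened.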

26.10 Store and share

Reproducible data analysis is particularly important when working in teams, to share and communicate your work.

  • Dropbox
    • good option to work in teams, initially free
    • only limited file version history, no branches
  • Git
    • free and open-source version control system
    • great to work in teams and share your work publicly
    • can be more difficult at first
    • GitHub offers free public repositories and, since 2019, free private repositories as well
    • GitLab offers free private repositories

26.11 This repository


github.com/sdesabbata/granolarr