26 Reproducibility
26.1 Reproducibility
In quantitative research, an analysis or project is considered reproducible if:
- “the data and code used to make a finding are available and they are sufficient for an independent researcher to recreate the finding.” (Christopher Gandrud, Reproducible Research with R and RStudio)
Reproducibility is becoming increasingly important in science:
- as programming and scripting are becoming integral in most disciplines
- as the amount of data increases
26.2 Why?
In scientific research:
- verifiability of claims through replication
- incremental work, avoiding duplication of effort
For your working practice:
- better working practices
  - coding
  - project structure
  - versioning
- better teamwork
- higher impact (not just results, but code, data, etc.)
26.3 Reproducibility and software engineering
Core aspects of software engineering are:
- project design
- software readability
- testing
- versioning
As programming becomes integral to research, similar necessities arise among scientists and data analysts.
26.4 Reproducibility and “big data”
There has been a lot of discussion about “big data”…
- volume, velocity, variety, …
Beyond the hype of the moment, as the amount and complexity of data increase:
- the time required to replicate an analysis using point-and-click software becomes unsustainable
- room for error increases
Workflow management software (e.g., ArcGIS ModelBuilder) is one answer; reproducible data analysis based on scripting languages like R is another.
26.5 Reproducibility in GIScience
Singleton et al. have discussed the issue of reproducibility in GIScience, identifying the following best practices:
- Data should be accessible within the public domain and available to researchers.
- Software used should have open code and be scrutable.
- Workflows should be public and link data, software, methods of analysis and presentation with discursive narrative.
- The peer review process and academic publishing should require submission of a workflow model and ideally open archiving of those materials necessary for replication.
- Where full reproducibility is not possible (commercial software or sensitive data) aim to adopt aspects attainable within circumstances.
26.6 Document everything
In order to be reproducible, every step of your project should be documented in detail:
- data gathering
- data analysis
- results presentation
Well-documented R scripts are an excellent way to document your project.
26.7 Document well
Write code that can be easily understood by someone outside your project, including yourself in six months' time!
- use a style guide (e.g., the tidyverse style guide) consistently
- add a comment at the beginning of a file, including
  - date
  - contributors
  - other files the current file depends on
  - materials, sources and other references
- add a comment before each code block, describing what the code does
- also add a comment before any line that could be ambiguous or particularly difficult or important
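The points above can be illustrated with a short script. This is only a sketch: the file name, date, contributor and data values are all placeholders invented for the example, and the header layout is one possible convention rather than a required one.

```r
# ------------------------------------------------------------------
# File:         summarise_population.R  (hypothetical file name)
# Date:         2024-01-15              (placeholder date)
# Contributors: A. Researcher           (placeholder name)
# Depends on:   no other files (a real script would list them here)
# Sources:      the values below are made up for illustration
# ------------------------------------------------------------------

# Create a small example table of areas and their population
population <- data.frame(
  area = c("A", "B", "C"),
  residents = c(1200, 850, 1430),
  area_km2 = c(2.0, 1.7, 2.2),
  stringsAsFactors = FALSE
)

# Calculate population density (residents per square kilometre)
population$density <- population$residents / population$area_km2

# Important: order() sorts in ascending order by default, so
# decreasing = TRUE is needed to list the densest area first
population_sorted <- population[
  order(population$density, decreasing = TRUE),
]

# Show the result
print(population_sorted)
```

The header answers who, when and what the file depends on; the block comments describe intent, and the longer comment flags the one line whose behaviour could easily be misread.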
26.8 Workflow
Relationships between files in a project are not simple:
- in which order are files executed?
- when to copy files from one folder to another, and where?
A common solution is to use make files:
- commonly written in bash on Linux systems
- they can also be written in R, using commands like
  - source to execute R scripts
  - system to interact with the operating system
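A minimal sketch of such an R make file follows. To keep the example self-contained, it first creates two tiny scripts in a temporary folder; the file names (01_prepare_data.R, 02_analyse_data.R) and folder names are hypothetical, and the system call assumes a Unix-like system where the cp command is available.

```r
# Set up a throwaway project in a temporary folder (illustration only);
# in a real project these folders and scripts would already exist
project_dir <- file.path(tempdir(), "example_project")
dir.create(file.path(project_dir, "output"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "report"), showWarnings = FALSE)

result_file <- file.path(project_dir, "output", "mean_speed.txt")
report_file <- file.path(project_dir, "report", "mean_speed.txt")

# Hypothetical step 1: prepare the data (here, a built-in dataset)
writeLines(
  "cars_data <- datasets::cars",
  file.path(project_dir, "01_prepare_data.R")
)

# Hypothetical step 2: analyse the data and save the result;
# deparse() writes the output path into the script as a quoted string
writeLines(
  c(
    "mean_speed <- mean(cars_data$speed)",
    sprintf("writeLines(as.character(mean_speed), %s)",
            deparse(result_file))
  ),
  file.path(project_dir, "02_analyse_data.R")
)

# --- The make file proper: run each step in a fixed, explicit order ---
source(file.path(project_dir, "01_prepare_data.R"))
source(file.path(project_dir, "02_analyse_data.R"))

# Copy the result to the folder used for the write-up, through the
# operating system (assumes a Unix-like system with the cp command)
system(paste("cp", shQuote(result_file), shQuote(report_file)))
```

Running this one file answers both questions above: the order of execution is written down, and so is every file copy. Base R's file.copy() would be a more portable alternative to calling cp through system.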
26.9 Future-proof formats
Complex formats (e.g., .docx, .xlsx, .shp, ArcGIS .mxd)
- can become obsolete
- are not always portable
- usually require proprietary software
Use the simplest format to future-proof your analysis.
Text files are the most versatile
- data: .txt, .csv, .tsv
- analysis: R scripts, Python scripts
- write-up: LaTeX, Markdown, HTML
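As a small illustration of storing data in a plain text format, the sketch below saves a table as a .csv file and reads it back using base R. The table contents and the file name stations.csv are made up for the example.

```r
# A small made-up table, for illustration only
stations <- data.frame(
  station_id = c(1L, 2L, 3L),
  name = c("North", "Centre", "South"),
  rainfall_mm = c(12.5, 8.0, 15.2),
  stringsAsFactors = FALSE
)

# Hypothetical output file, placed in a temporary folder
csv_file <- file.path(tempdir(), "stations.csv")

# Save as plain text; row.names = FALSE avoids writing an extra,
# unnamed column of row numbers
write.csv(stations, csv_file, row.names = FALSE)

# The file can be inspected with any text editor, or printed here
cat(readLines(csv_file), sep = "\n")

# Read the data back in
stations_reloaded <- read.csv(csv_file, stringsAsFactors = FALSE)

# Check that the round trip preserved the values
print(all.equal(stations, stations_reloaded))
```

Because the file is plain text, it can be opened without any specific software, which is the property that makes it likely to remain readable in the future.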