2020-01-15
Moving from programming to data science
Reproducibility in (geographic) data science
See also: Christopher Gandrud, Reproducible Research with R and R Studio, also available on GitHub
In quantitative research, an analysis or project is considered to be reproducible if:
Reproducibility is becoming more and more important in science:
In scientific research:
For your working practice:
Core aspects of software engineering are:
As programming becomes integral to research, similar necessities arise among scientists and data analysts.
There has been a lot of discussion about “big data”…
Beyond the hype of the moment, as the amount and complexity of data increase
Workflow management software (e.g., ArcGIS ModelBuilder) is one answer; reproducible data analysis based on scripting languages like R is another.
Singleton et al. have discussed the issue of reproducibility in GIScience, identifying the following best practices:
In order to be reproducible, every step of your project should be documented in detail
Well-documented R scripts are an excellent way to document your project.
Create code that can be easily understood by someone outside your project, including yourself in six months' time!
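As a minimal sketch of what a documented script might look like, the example below states its purpose, input and output in a header comment, and explains each step in the code. The function name `add_density`, the column names and the example values are all hypothetical, made up for illustration.

```r
# Purpose: compute population density for a set of areas.
# Input:   a data frame with columns `population` (persons) and `area_km2`.
# Output:  the same data frame with an added `density` column (persons per km2).

add_density <- function(areas) {
  # Fail early if an expected column is missing
  stopifnot(all(c("population", "area_km2") %in% names(areas)))
  areas$density <- areas$population / areas$area_km2
  areas
}

# Hypothetical example data, made up for this sketch
example_areas <- data.frame(
  name = c("Area A", "Area B"),
  population = c(1000, 3000),
  area_km2 = c(2, 4),
  stringsAsFactors = FALSE
)

example_with_density <- add_density(example_areas)
example_with_density
```

Someone reading this script in six months' time can tell what it expects and what it produces without running it.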
Relationships between files in a project are not simple:
A common solution is to use makefiles
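The core idea behind make can be illustrated in a few lines of R: a target file is rebuilt only if it is missing or older than the files it depends on. This is a rough sketch of that idea, not a replacement for make itself. The helper `needs_rebuild` and the file names are hypothetical.

```r
# Returns TRUE when `target` has to be (re)built from `sources`:
# either it does not exist yet, or at least one source is newer.
needs_rebuild <- function(target, sources) {
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(sources) > file.mtime(target))
}

# Hypothetical project files, created in a temporary directory
project_dir <- tempfile("project_")
dir.create(project_dir)
raw_file <- file.path(project_dir, "raw_data.csv")
clean_file <- file.path(project_dir, "clean_data.csv")

write.csv(data.frame(id = 1:3, value = c(10, NA, 30)), raw_file, row.names = FALSE)

# The "rule": clean_data.csv depends on raw_data.csv
if (needs_rebuild(clean_file, raw_file)) {
  raw <- read.csv(raw_file)
  # Cleaning step: drop rows with a missing value
  write.csv(raw[!is.na(raw$value), ], clean_file, row.names = FALSE)
}
```

Running the last block a second time does nothing, because the target is now at least as recent as its source.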
Complex formats (e.g., .docx, .xlsx, .shp, ArcGIS .mxd)
Use the simplest format to future-proof your analysis.
Text files are the most versatile
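For instance, a table can be stored as a comma-separated text file, which can be opened in any text editor and read back by virtually any software. The sketch below assumes a small example table; the file name is hypothetical and the coordinates are approximate, for illustration only.

```r
# Small example table (approximate coordinates, for illustration only)
cities <- data.frame(
  city = c("Leicester", "Nottingham"),
  lon = c(-1.13, -1.15),
  lat = c(52.63, 52.95),
  stringsAsFactors = FALSE
)

# Save as plain text (CSV) in a temporary directory
csv_path <- file.path(tempdir(), "cities.csv")
write.csv(cities, csv_path, row.names = FALSE)

# The file is human-readable: one line per row, values separated by commas
csv_lines <- readLines(csv_path)
csv_lines

# Reading it back gives the same table
cities_reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
```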
Reproducible data analysis is particularly important when working in teams, to share and communicate your work.
Markdown is a simple markup language
Markdown can be used in conjunction with numerous tools
These slides are written in Markdown
### This is a third level heading

Text can be specified as *italic* or **bold**

- and list can be created
- very simply

1. also numbered lists
1. [add a link like this](http://le.ac.uk)

|Tables |Can         |Be       |
|-------|------------|---------|
|a bit  |complicated |at first |
|but    |it gets     |easier   |
Text can be specified as italic or bold
| Tables | Can | Be |
|---|---|---|
| a bit | complicated | at first |
| but | it gets | easier |
Let's write an example of **R** code including

- a variable `a_variable`
- an assignment operation (i.e., `<-`)
- a mathematical operation (i.e., `+`)

```{r, echo=TRUE}
a_variable <- 0
a_variable <- a_variable + 1
a_variable <- a_variable + 1
a_variable <- a_variable + 1
a_variable
```
RMarkdown documents contain both Markdown and R code. These files can be created in RStudio and compiled into an HTML page (like this document), a PDF, or a Microsoft Word document.
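Compiling can also be done from the R console rather than through the RStudio interface. The sketch below writes a small RMarkdown file and, if the `rmarkdown` package and pandoc are available, compiles it to HTML with `rmarkdown::render`. The file name and document content are hypothetical.

```r
# Three backticks, built here so that the chunk delimiters can be written as strings
fence <- strrep("`", 3)

# Content of a small, hypothetical RMarkdown document
rmd_lines <- c(
  "---",
  "title: \"A minimal example\"",
  "output: html_document",
  "---",
  "",
  "Some *Markdown* text, followed by an R chunk.",
  "",
  paste0(fence, "{r, echo=TRUE}"),
  "a_variable <- 0",
  "a_variable <- a_variable + 1",
  "a_variable",
  fence
)

rmd_path <- file.path(tempdir(), "minimal_example.Rmd")
writeLines(rmd_lines, rmd_path)

# Compile only if the required tools are installed on this machine
html_path <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
}
```

When the compilation runs, `html_path` holds the location of the resulting HTML file; otherwise it stays `NULL`.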
Git is a free and open-source version control system
A series of snapshots
When working with a git repository
git clone
git fetch
git pull
git add
git commit
git push
RStudio includes a git plug-in
Reproducibility in (geographic) data science
In the practical session, we will see
Exploratory data analysis