Today's topics

What is reproducible research?

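In quantitative research, an analysis or project is considered to be reproducible if the data and code used to produce a finding are available and sufficient for an independent researcher to recreate that finding.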

Reproducibility is becoming more and more important in science:

  • as programming and scripting are becoming integral in most disciplines
  • as the amount of data increases

Why?

In scientific research:

  • verifiability of claims through replication
  • incremental work, avoid duplication

For your working practice:

  • better working practices
    • coding
    • project structure
    • versioning
  • better teamwork
  • higher impact (not just results, but code, data, etc.)

Reproducibility and software engineering

Core aspects of software engineering are:

  • project design
  • software readability
  • testing
  • versioning

As programming becomes integral to research, similar necessities arise among scientists and data analysts.

Reproducibility and "big data"

There has been a lot of discussion about "big data"

  • volume, velocity, variety, …

Beyond the hype of the moment, as the amount and complexity of data increase

  • the time required to replicate an analysis using point-and-click software becomes unsustainable
  • room for error increases

Workflow management software (e.g., ArcGIS ModelBuilder) is one answer; reproducible research based on scripting languages like R is another.

Reproducibility in GIScience

Singleton et al. have discussed the issue of reproducibility in GIScience, identifying the following best practices:

  1. Data should be accessible within the public domain and available to researchers.
  2. Software used should have open code and be scrutable.
  3. Workflows should be public and link data, software, methods of analysis and presentation with discursive narrative
  4. The peer review process and academic publishing should require submission of a workflow model and ideally open archiving of those materials necessary for replication.
  5. Where full reproducibility is not possible (commercial software or sensitive data), aim to adopt the aspects attainable within the circumstances.

Five practical tips for reproducible research

(1) Document everything!

In order to be reproducible, every step of your project should be documented in detail

  • data gathering
  • data analysis
  • results presentation

Well-documented R scripts are an excellent way to document your project.

The sessionInfo function can be used to print a record of all loaded packages and versions.

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2    forcats_0.2.0   stringr_1.2.0   dplyr_0.7.4    
##  [5] purrr_0.2.4     readr_1.1.1     tidyr_0.7.2     tibble_1.3.4   
##  [9] ggplot2_2.2.1   tidyverse_1.2.1 rmarkdown_1.8  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13     cellranger_1.1.0 compiler_3.4.4   plyr_1.8.4      
##  [5] bindr_0.1        tools_3.4.4      digest_0.6.12    lubridate_1.7.1 
##  [9] jsonlite_1.5     evaluate_0.10.1  nlme_3.1-137     gtable_0.2.0    
## [13] lattice_0.20-35  pkgconfig_2.0.1  rlang_0.1.4      psych_1.7.8     
## [17] cli_1.0.0        rstudioapi_0.7   curl_3.0         yaml_2.1.16     
## [21] parallel_3.4.4   haven_1.1.0      xml2_1.1.1       httr_1.3.1      
## [25] knitr_1.17       hms_0.3          rprojroot_1.2    grid_3.4.4      
## [29] glue_1.2.0       R6_2.2.2         readxl_1.0.0     foreign_0.8-70  
## [33] modelr_0.1.1     reshape2_1.4.2   magrittr_1.5     backports_1.1.1 
## [37] scales_0.5.0     htmltools_0.3.6  rvest_0.3.2      assertthat_0.2.0
## [41] mnormt_1.5-5     colorspace_1.3-2 stringi_1.1.6    lazyeval_0.2.1  
## [45] munsell_0.4.3    broom_0.4.2      crayon_1.3.4
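
As a minimal sketch, the same record could be saved to a text file and kept with the project, so that the package versions behind a result can be checked later. The file name session_info.txt is an assumption made for illustration, and the file is written to a temporary folder here.

```r
# Save the session record as a plain-text file.
# "session_info.txt" is a hypothetical file name; a temporary folder is used
# so that the example does not touch any real project.
session_file <- file.path(tempdir(), "session_info.txt")

# capture.output() collects what sessionInfo() would print as a character vector
session_record <- capture.output(sessionInfo())
writeLines(session_record, session_file)

# Read the record back to check that it was saved
saved_record <- readLines(session_file)
head(saved_record, n = 3)
```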

(2) Everything is a (text) file

Complex formats (e.g., .docx, .xlsx, .shp, ArcGIS .mxd)

  • can become obsolete
  • are not always portable
  • usually require proprietary software

Use the simplest format to future-proof your analysis.
Text files are the most versatile

  • data: .txt, .csv, .tsv
  • analysis: R scripts, Python scripts
  • write-up: LaTeX, Markdown, HTML
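
A minimal sketch of storing data as plain text: a small data frame is written to a .csv file and read back unchanged. The values, column names and file name are made up for illustration only.

```r
# A small, made-up data frame used only for illustration
measurements <- data.frame(
  area = c("A", "B", "C"),
  value = c(91.5, 78.2, 85.0),
  stringsAsFactors = FALSE
)

# Write it to a plain-text .csv file (hypothetical name, temporary folder)
csv_file <- file.path(tempdir(), "measurements_example.csv")
write.csv(measurements, csv_file, row.names = FALSE)

# The file can be opened in any text editor, and R can read it back
measurements_from_file <- read.csv(csv_file, stringsAsFactors = FALSE)
```

Setting row.names = FALSE avoids writing an extra, unnamed column of row numbers into the file.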

(3) All files should be human readable

Create code that can be easily understood by someone outside your project, including yourself in six months' time!

  • use a style guide (e.g. tidyverse) consistently
  • add a comment at the beginning of a file, including
    • date
    • contributors
    • other files the current file depends on
    • materials, sources and other references
  • add a comment before each code block, describing what the code does
  • also add a comment before any line that could be ambiguous or particularly difficult or important
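
The points above can be sketched in a short script. The header fields in angle brackets are placeholders to be filled in, and the values are made up for illustration.

```r
##########
# Summarise example measurements (illustrative script)
# Date: <date of last edit>
# Contributors: <names>
# Depends on: <other scripts or data files this script needs>
# Sources: <materials and references>
##########

# Create a small set of made-up values to summarise
values <- c(2.5, 3.0, 4.5, 6.0)

# Compute the mean and the range of the values
values_mean <- mean(values)
values_range <- range(values)

# Note: range() returns c(min, max), not the difference between them,
# so the spread is computed explicitly
values_spread <- diff(values_range)
```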

(4) Explicitly tie your files together

Relationships between files in a project are not simple:

  • in which order are files executed?
  • when should files be copied from one folder to another, and where?

A common solution is to use make files

  • commonly written in bash on Linux systems
  • they can be written in R, using commands like
    • source to execute R scripts
    • system to interact with the operating system

Example: Make.R

##########
# Example make file in R
# Author: Stefano De Sabbata
# Date: Oct 22, 2018
##########

# Uncomment the install.packages command below
# if Tidyverse is not installed
#install.packages("tidyverse")

# Load necessary libraries
library(rmarkdown)

# Step 0: Make clean
# Uncomment the line below to delete data files and files compiled from R Markdown scripts
#source("Make_Clean.R")

# Step 1: execute the scripts that gather data
# from Ofcom
source("Data/Gather_Ofcom_data_2012.R")
# and from the Department for Transport
source("Data/Gather_DfT_data_2015.R")

# Compile the lecture file
rmarkdown::render("Materials/Lecture/ReproducibleResearchWithR.Rmd")

# Compile the practical session file
rmarkdown::render("Materials/Practical/Practical_session_instructions.Rmd", output_format = c("html_document", "pdf_document"))

# Compile the analysis document for the practical session
rmarkdown::render("Analysis/Reproducible_analysis_in_R.Rmd", output_format = c("html_document", "pdf_document"))
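
The Make_Clean.R script mentioned in Step 0 is not shown here. As a rough sketch only, a clean-up step could be written as a function that deletes generated files matching a pattern. The function name clean_outputs, the folder and the file patterns are all assumptions made for illustration, not the contents of the actual script.

```r
# Hypothetical helper: delete generated files in a folder.
# Returns the paths of the files that were removed.
clean_outputs <- function(folder, pattern = "\\.(html|pdf)$") {
  generated <- list.files(folder, pattern = pattern, full.names = TRUE)
  removed <- file.remove(generated)
  generated[removed]
}

# Demonstration in a temporary folder, so that no real files are touched
demo_folder <- file.path(tempdir(), "make_clean_demo")
dir.create(demo_folder, showWarnings = FALSE)
file.create(file.path(demo_folder, c("report.html", "report.pdf", "report.Rmd")))

# Compiled outputs are deleted, the R Markdown source is kept
removed_files <- clean_outputs(demo_folder)
remaining_files <- list.files(demo_folder)
```

Matching on file extensions keeps the source files safe, as long as no hand-written .html or .pdf files live in the same folder.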

(5) Organize, store, share

Reproducible research is particularly important when working in teams and when sharing and communicating your work.

  • Dropbox
    • good option to work in teams, initially free
    • no full version control (e.g., no branches)
  • Git
    • free and open-source version control system
    • great to work in teams and share your work publicly
    • can be more difficult at first
    • GitHub offers free public repositories (and, since 2019, free private ones)
    • Bitbucket offers free private repositories

R and Markdown

Markdown

Markdown is a simple markup language

  • allows you to mark up plain text
  • to specify more complex features (such as italic text)
  • using a very simple syntax

Markdown can be used in conjunction with numerous tools

  • to produce HTML pages
  • or even more complex formats (such as PDF)

These slides are written in Markdown

Markdown example code

### This is a third level heading

Text can be specified as *italic* or **bold**

- and lists can be created
    - very simply

1. also numbered lists
    1. [add a link like this](http://le.ac.uk)

|Tables |Can         |Be       |
|-------|------------|---------|
|a bit  |complicated |at first |
|but    |it gets     |easier   |

Markdown example output

This is a third level heading

Text can be specified as italic or bold

  • and lists can be created
    • very simply
  1. also numbered lists
    1. add a link like this

|Tables |Can         |Be       |
|-------|------------|---------|
|a bit  |complicated |at first |
|but    |it gets     |easier   |

RMarkdown

Markdown can be used in combination with R to dynamically create documents that incorporate code and its output.

Two R libraries:

  • knitr
    • Markdown (or LaTeX) with R snippets as input
    • compiles it into HTML (or PDF)
  • rmarkdown
    • uses knitr and pandoc
    • to output files in different formats
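
A minimal sketch of the first step, assuming the knitr library is installed: a small R Markdown file with made-up content is written to a temporary folder and compiled to plain Markdown with knitr::knit. Producing HTML or PDF from there would additionally involve rmarkdown and pandoc, which this sketch does not cover.

```r
# Three backticks, built programmatically so that the chunk delimiters
# do not clash with the code fence around this example
fence <- strrep("`", 3)

# Write a tiny R Markdown file (made-up content) to a temporary folder
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "### A minimal report",
  "",
  paste0(fence, "{r, echo=TRUE}"),
  "mean(c(1, 2, 3))",
  fence
), rmd_file)

# Compile it to Markdown, if knitr is available on this computer
if (requireNamespace("knitr", quietly = TRUE)) {
  md_file <- knitr::knit(rmd_file, output = tempfile(fileext = ".md"), quiet = TRUE)
  md_lines <- readLines(md_file)
}
```

The resulting Markdown file contains both the R code and the result of running it, which is what makes the document reproducible from its source.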

R libraries

As mentioned in earlier lectures, libraries are collections of functions

Libraries can be installed in R using the function install.packages and loaded using the function library, as shown below (note the use of quotes in the first and the lack thereof in the second).

install.packages("knitr")
library(knitr)

Once a library is installed on a computer you don't need to install it again, but every script needs to load all the libraries that it uses.

Once a library is loaded all its functions can be used.
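
Because install.packages only needs to run once per computer, a script shared with others can first check which libraries are missing. A minimal sketch, where the helper name missing_packages is an assumption made for illustration:

```r
# Hypothetical helper: return the names of libraries that are not installed
missing_packages <- function(packages) {
  installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  packages[!installed]
}

# Libraries this script relies on
required <- c("knitr", "rmarkdown")

# Report what is missing; install.packages(to_install) could then be run once
to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
}
```

Reporting the missing libraries, rather than installing them silently, leaves the decision to change the computer's setup with the person running the script.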

RMarkdown example code

Let's write an example of **RMarkdown** including 

- an *if-else* conditional statement
- a *for* loop

```{r, echo=TRUE}
for (i in 1:4) {
    if (i %% 2 == 0){
        cat("even \n")
    } else {
        cat("odd \n")
    }
}
```

RMarkdown example output

Let's write an example of RMarkdown including

  • an if-else conditional statement
  • a for loop
for (i in 1:4) {
    if (i %% 2 == 0){
        cat("even \n")
    } else {
        cat("odd \n")
    }
}
## odd 
## even 
## odd 
## even

RMarkdown example

The knitr library also includes very useful functions such as kable, which formats a data.frame object to be displayed using Markdown.

library(knitr)
coverage_data <- read.csv("../../Data/ofcom_mobile_coverage_2012.csv")
kable(head(coverage_data[, 1:4], n=3))

|LocalAuthority     |M2G_NoS |M2G_1op |M2G_2op |
|-------------------|--------|--------|--------|
|Aberdeen City      |0.1     |3.0     |9.4     |
|Aberdeenshire      |15.9    |16.6    |20.8    |
|Abertawe - Swansea |1.0     |7.7     |15.1    |

Practical session

In the practical session we will see:

  • R and Markdown
    • how to create R Markdown files
    • that can be compiled to
      • HTML
      • PDF
      • Microsoft Word
  • how to work with internet data
    • downloading files
    • loading .csv files