class: center, middle, inverse, title-slide .title[ # Lecture 102 ] .author[ ### Dr Stefano De Sabbata
School of Geography, Geology, and the Env., University of Leicester
github.com/sdesabbata/r-for-geographic-data-science
s.desabbata@le.ac.uk
|
@maps4thought
text licensed under
CC BY-SA 4.0
, code licensed under
GNU GPL v3.0
] --- class: inverse, center, middle # Reproducible data science --- ## Recap <br/> .pull-left[ **Previously**: An introduction to R - The R programming language - Core concepts - Tidyverse **Today**: Reproducible data science - Data science - Reproducibility - Data input and output ] .pull-right[ ![](data:image/png;base64,#https://raw.githubusercontent.com/tidyverse/dplyr/main/man/figures/logo.png) .right[ .referencenote[ by dplyr authors<br/> via [dplyr GitHub repository](https://github.com/tidyverse/dplyr/), MIT License ] ] ] --- class: inverse, center, middle # Reproducible data science --- ## Data science .pull-left[ The term *"data science"* has gained wide-spread usage in the past decade - akin **data mining** - including or bordering - machine learning - artificial intelligence - statistical analysis - data visualisation - commonly associated with - **big data** - the use of **programming** - python and R in particular - but not necessarily - **interdisciplinarity** - generally used for those who *use* those tools within an *application field* (e.g., geography or geology) ] .pull-right[ <iframe src="https://sdesabbata.github.io/2021-census-analysis/100-exploratory-analysis/101-population-households.html" width="100%" height="500px" data-external="1"></iframe> .referencenote[ A simple example: [UK population and household change between 2011 and 2021 censuses](https://sdesabbata.github.io/2021-census-analysis/100-exploratory-analysis/101-population-households.html#Visualisation:_plots) ] ] --- ## Reproduciblity .pull-left[ In quantitative research, an analysis or project are considered to be **reproducible** if: - *"the data and code used to make a finding are available and they are sufficient for an independent researcher to recreate the finding."* [Christopher Gandrud, *Reproducible Research with R and R Studio*](https://www.crcpress.com/Reproducible-Research-with-R-and-R-Studio/Gandrud/p/book/9781498715379) That is becoming more and more important in science: - as programming and scripting are becoming integral in most disciplines - as the amount of data increases ] .pull-right[ In **scientific research**: - verificability of claims through replication - incremental work, avoid duplication For your **working practice**: - better working practices - coding - project structure - versioning - better teamwork - higher impact (not just results, but code, data, etc.) ] --- ## Reproducibility in GIScience [Singleton *et al.*](https://www.tandfonline.com/doi/abs/10.1080/13658816.2015.1137579) have discussed the issue of reproducibility in GIScience, identifying the following best practices: 1. Data should - be accessible within the public domain - available to researchers 2. Software used should - have open code - be scrutable 3. Workflows should - be public - link data, software, methods of analysis and presentation with discursive narrative 4. The peer review process and academic publishing should - require submission of a workflow model - ideally open archiving of those materials necessary for replication 5. Where full reproducibility is not possible (commercial software or sensitive data) - aim to adopt aspects attainable within circumstances --- ## Reproducibility and software engineering .pull-left[ Emergence of **"big data"** - volume, velocity, variety, ... Beyond the hype of the moment, as the **amount** and **complexity** of data increases - the time required to replicate an analysis using point-and-click software becomes unsustainable - room for error increases Two options: - Workflow management software (e.g., ArcGIS ModelBuilder) - Reproducible data analysis (based on script languages like R) ] .pull-right[ Core aspects of **software engineering** are: - project design - software **readibility** - testing - **versioning** As programming becomes integral to research, similar necessities arise among scientists and data analysts. 5 key aspects identified by [Gandrud](https://www.crcpress.com/Reproducible-Research-with-R-and-R-Studio/Gandrud/p/book/9781498715379) - Document everything - Document well - Workflow - Future-proof formats - Store and share ] --- .pull-left[ ## Document everything In order to be reproducible, every step of your project should be documented in detail - data gathering - data analysis - results presentation Well documented R scripts are an excellent way to document your project. ] .pull-right[ ## Document well Create code that can be **easily understood** by someone outside your project, including yourself in six-month time! - use a style guide (e.g. [tidyverse](http://style.tidyverse.org/)) consistently - also add a **comment** before any line that could be ambiguous or particularly difficult or important - add a **comment** before each code block, describing what the code does - add a **comment** at the beginning of a file, including - date - contributors - other files the current file depends on - materials, sources and other references ] --- ## Workflow Relationships between files in a project are not simple: - in which order are file executed? - when to copy files from one folder to another, and where? A common solution is using **make files** - commonly written in *bash* on Linux systems - they can be written in R, using commands like - *source* to execute R scripts - *system* to interact with the operative system Example: section of the [*R for Geographic Data Science*](https://github.com/sdesabbata/r-for-geographic-data-science) project's make file [Make.R](https://github.com/sdesabbata/r-for-geographic-data-science/blob/main/src/Make.R) ```{} cat("\n\n>>> Rendering 102-slides-reproducible-data-science.Rmd <<<\n\n") rmarkdown::render( paste0(Sys.getenv("RGDS_HOME"), "/src/slides/102-slides-reproducible-data-science.Rmd"), quiet = TRUE, output_dir = paste0(Sys.getenv("RGDS_HOME"), "/docs/slides") ) ``` --- .pull-left[ ## Future-proof formats Complex formats (e.g., .docx, .xlsx, .shp, ArcGIS .mxd) - can become obsolete - are not always portable - usually require proprietary software Use the simplest format to **future-proof** your analysis.<br/>**Text files** are the most versatile - data: .txt, .csv, .tsv - analysis: R scripts, python scripts - write-up: LaTeX, Markdown, HTML ] .pull-right[ ## Store and share Reproducible data analysis is particularly important when working in teams, to share and communicate your work. - [Dropbox](https://www.dropbox.com) - good option to work in teams, initially free - no versioning, branches - [Git](https://git-scm.com) - free and opensource control system - great to work in teams and share your work publicly - can be more difficult at first - [GitHub](https://github.com) offers free public and private repositories - [GitLab](https://about.gitlab.com/) offers free public and private repositories ] --- class: inverse, center, middle # Input / Ouptut --- ## Text file formats <br/> .pull-left[ [Tidyverse](https://www.tidyverse.org/) includes or [imports](https://www.tidyverse.org/packages/#import) many libraries for reading data - Tabular formats - [`readr`](https://readr.tidyverse.org/) for plain text - [`readxl`](https://readxl.tidyverse.org/) for Excel (`.xls` and `.xlsx`) - [`haven`](https://haven.tidyverse.org/) for SPSS, Stata, and SAS data. - Databases - [`DBI`](https://cran.r-project.org/web/packages/DBI/) for relational databases - NoSQL - [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/) for JSON - [`xml2`](https://cran.r-project.org/web/packages/xml2/) for XML - Web - [`httr`](https://cran.r-project.org/web/packages/httr/) for web APIs ] .pull-right[ A wide range of formats based on plain-text files exist to represent tabular data For instance - tabular formats - comma-separated values files `.csv` - semi-colon-separated values files `.csv` - tab-separated values files `.tsv` - other formats using custom delimiters - fix-width files `.fwf` ] --- ## Output Area Classification .pull-left[ The file `2011_OAC_supgrp_Leicester.csv` contains - one row for each [Output Area (OA)](https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeography) in Leicester - [Lower-Super Output Area (LSOA)](https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeography) containing the OA - code and name of the supergroup assigned to the OA by the [2011 Output Area Classification](http://geogale.github.io/2011OAC/) - total population of the OA Extract showing only the first few rows ```{} OA11CD,LSOA11CD,supgrpcode,supgrpname,Total_Population E00069517,E01013785,6,Suburbanites,313 E00069514,E01013784,2,Cosmopolitans,323 E00169516,E01013713,4,Multicultural Metropolitans,341 E00169048,E01032862,4,Multicultural Metropolitans,345 ``` ] .pull-right[ <iframe src="https://mapmaker.cdrc.ac.uk/#/output-area-classification?h=0&lon=-1.1117&lat=52.6331&zoom=11" width="100%" height="500px" data-external="1"></iframe> ] --- ## readr .pull-left[ The [`readr`](https://readr.tidyverse.org/) (pronounced *read-er*) provides functions to read and write text files - `readr::read_csv`: comma-separated files `.csv` - `readr::read_csv2`: semi-colon-separated files `.csv` - `readr::read_tsv`: tab-separated files `.tsv` - `readr::read_fwf`: fix-width files `.fwf` - `readr::read_delim`: files using a custom delimiter and their *write* counterpart, such as - `readr::write_csv`: comma-separated files `.csv` ] .pull-right[ Read functions provide options about how to interpret a file contents - For instance, `readr::read_csv` - `col_names`: - `TRUE` or `FALSE` whether top row is column names - or a vector of column names - `col_types`: - a `cols()` specification or a string - `skip`: lines to skip before reading data - `n_max`: max number of record to read ] --- ## readr::read_csv The `readr::read_csv` function of the [`readr`](https://readr.tidyverse.org/index.html) library reads a *csv* file from the path provided as the first argument ```r library(tidyverse) leicester_2011OAC <- read_csv("2011_OAC_supgrp_Leicester.csv") leicester_2011OAC ``` ``` ## # A tibble: 969 × 5 ## OA11CD LSOA11CD supgrpcode supgrpname Total_Population ## <chr> <chr> <dbl> <chr> <dbl> ## 1 E00069517 E01013785 6 Suburbanites 313 ## 2 E00069514 E01013784 2 Cosmopolitans 323 ## 3 E00169516 E01013713 4 Multicultural Metropolitans 341 ## 4 E00169048 E01032862 4 Multicultural Metropolitans 345 ## 5 E00169044 E01032862 4 Multicultural Metropolitans 322 ## 6 E00069041 E01013679 4 Multicultural Metropolitans 334 ## 7 E00169049 E01032862 4 Multicultural Metropolitans 336 ## 8 E00068806 E01013628 4 Multicultural Metropolitans 312 ## 9 E00068886 E01013647 3 Ethnicity Central 505 ## 10 E00068807 E01013624 4 Multicultural Metropolitans 362 ## # ℹ 959 more rows ``` --- ## readr::read_csv .pull-left[ Using `readr::read_csv` as in the previous example with no further options ```r leicester_2011OAC <- read_csv("2011_OAC_supgrp_Leicester.csv") leicester_2011OAC ``` will generate the following warning ```{} Parsed with column specification: cols( OA11CD = col_character(), LSOA11CD = col_character(), supgrpcode = col_double(), supgrpname = col_character(), Total_Population = col_double() ) ``` ] .pull-right[ See [`readr` column specification](https://readr.tidyverse.org/reference/cols.html) - `col_logical()` or `l` as logic values - `col_integer()` or `i` as integer - `col_double()` or `d` as numeric (double) - `col_character()` or `c` as character - `col_factor(levels, ordered)` or `f` as factor - `col_date(format = "")` or `D` as data type - `col_time(format = "")` or `t` as time type - `col_datetime(format = "")` or `T` as datetime - `col_number()` or `n` as numeric (dropping marks) - `col_skip()` or `_` or `-` don’t import - `col_guess()` or `?` use best type based on the input ] --- ## readr::read_csv Alternatively, column specifications can be provided manually - using the argument `col_types` - and the function `cols` ```r leicester_2011OAC <- read_csv( "2011_OAC_supgrp_Leicester.csv", col_types = cols( OA11CD = col_character(), LSOA11CD = col_character(), supgrpcode = col_character(), supgrpname = col_character(), Total_Population = col_integer() ) ) ``` ``` ## # A tibble: 969 × 5 ## OA11CD LSOA11CD supgrpcode supgrpname Total_Population ## <chr> <chr> <chr> <chr> <int> ## 1 E00069517 E01013785 6 Suburbanites 313 ## 2 E00069514 E01013784 2 Cosmopolitans 323 ## 3 E00169516 E01013713 4 Multicultural Metropolitans 341 ## 4 E00169048 E01032862 4 Multicultural Metropolitans 345 ## 5 E00169044 E01032862 4 Multicultural Metropolitans 322 ## # ℹ 964 more rows ``` --- ## readr::read_csv Alternatively, column specifications can be provided manually - using the argument `col_types` - and the corresponding one-letter codes for each column ```r leicester_2011OAC <- read_csv( "2011_OAC_supgrp_Leicester.csv", col_types = "cccci" ) ``` ``` ## # A tibble: 969 × 5 ## OA11CD LSOA11CD supgrpcode supgrpname Total_Population ## <chr> <chr> <chr> <chr> <int> ## 1 E00069517 E01013785 6 Suburbanites 313 ## 2 E00069514 E01013784 2 Cosmopolitans 323 ## 3 E00169516 E01013713 4 Multicultural Metropolitans 341 ## 4 E00169048 E01032862 4 Multicultural Metropolitans 345 ## 5 E00169044 E01032862 4 Multicultural Metropolitans 322 ## 6 E00069041 E01013679 4 Multicultural Metropolitans 334 ## 7 E00169049 E01032862 4 Multicultural Metropolitans 336 ## 8 E00068806 E01013628 4 Multicultural Metropolitans 312 ## 9 E00068886 E01013647 3 Ethnicity Central 505 ## 10 E00068807 E01013624 4 Multicultural Metropolitans 362 ## # ℹ 959 more rows ``` --- ## readr::write_csv .pull-left[ The function `write_csv` can be used to save a dataset to `csv` Example: 1. `read` the 2011 OAC dataset 2. `select` a few columns 3. `filter` only those OA in the supergroup *Suburbanites* (code `6`) 4. `write` the results to a file named *2011_OAC_supgrp_Leicester_supgrp6.csv* ```r read_csv("2011_OAC_supgrp_Leicester.csv") %>% select( OA11CD, supgrpcode, Total_Population ) %>% filter(supgrpcode == "6") %>% write_csv( "2011_OAC_supgrp_Leicester_supgrp6.csv" ) ``` .referencenote[ (... more on `select` and `filter` next week) ] ] .pull-right[ Alternatively, `write_tsv` can be used ```r read_csv("2011_OAC_supgrp_Leicester.csv") %>% select( OA11CD, supgrpcode, Total_Population ) %>% filter(supgrpcode == "6") %>% write_tsv( "2011_OAC_supgrp_Leicester_supgrp6.tsv" ) ``` ```{} OA11CD supgrpcode Total_Population E00069517 6 313 E00069468 6 251 E00069528 6 270 E00069538 6 307 E00069174 6 321 E00069170 6 353 E00069171 6 351 E00068713 6 265 E00069005 6 391 E00069014 6 316 E00068989 6 354 ``` ] --- ## Summary <br/> .pull-left[ **Today**: Reproducible data science - Data science - Reproducibility - Data input and output **Next week**: Data manipulation - Complex data types - Into the Tidyverse - `dplyr` <br/> .referencenote[ Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com). ] ] .pull-right[ ![](data:image/png;base64,#https://raw.githubusercontent.com/tidyverse/dplyr/main/man/figures/logo.png) .right[ .referencenote[ by dplyr authors<br/> via [dplyr GitHub repository](https://github.com/tidyverse/dplyr/), MIT License ] ] ]