class: center, middle, inverse, title-slide .title[ # Lecture 101 ] .author[ ### Dr Stefano De Sabbata
School of Geography, Geology, and the Env., University of Leicester
github.com/sdesabbata/r-for-geographic-data-science
s.desabbata@le.ac.uk
|
@maps4thought
text licensed under
CC BY-SA 4.0
, code licensed under
GNU GPL v3.0
] --- class: inverse, center, middle # Introduction to R --- ## About this module .pull-left[ `R` is one of the most widely used languages for *(geospatial)* data science - data wrangling - statistical analysis - machine learning - data visualisation and maps - processing spatial data - geographic information analysis ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/101-slides-introduction_files/figure-html/unnamed-chunk-1-1.png)<!-- --> ] .pull-right[ This module will provide you with the fundamental skills in - basic programming in R - data wrangling - data analysis - reproducibility basis for - *GY7707 Geospatial Data Analysis* - *GY7708 Geospatial Artificial intelligence* ] --- ## Schedule The lectures and practical sessions have been designed to follow the schedule below - **Introduction to Reproducible Data Science** - 101 Introduction to R - 102 Reproducible data science - 103 Data manipulation - 104 Table operations - **R Scripting** - 201 Control structures - 202 Functions - **Data analysis** - 301 Exploratory visualisation - 302 Exploratory statistics - 303 Comparing data - 304 Regression models --- ## Reference books Suggested reading - *Programming Skills for Data Science: Start Writing Code to Wrangle, Analyze, and Visualize Data with R* by Michael Freeman and Joel Ross, Addison-Wesley, 2019. See book [webpage](https://www.pearson.com/us/higher-education/program/Freeman-Programming-Skills-for-Data-Science-Start-Writing-Code-to-Wrangle-Analyze-and-Visualize-Data-with-R/PGM2047488.html) and [repository](https://programming-for-data-science.github.io/). - *R for Data Science* by Garrett Grolemund and Hadley Wickham, O'Reilly Media, 2016. See [online book](https://r4ds.had.co.nz/). - *Discovering Statistics Using R* by Andy Field, Jeremy Miles and Zoë Field, SAGE Publications Ltd, 2012. See book [webpage](https://www.discoveringstatistics.com/books/discovering-statistics-using-r/). - *Machine Learning with R: Expert techniques for predictive modeling* by Brett Lantz, Packt Publishing, 2019. See book [webpage](https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788295864). Further reading - *The Art of R Programming: A Tour of Statistical Software Design* by Norman Matloff, No Starch Press, 2011. See book [webpage](https://nostarch.com/artofr.htm) - *An Introduction to R for Spatial Analysis and Mapping* by Chris Brunsdon and Lex Comber, Sage, 2015. See book [webpage](https://uk.sagepub.com/en-gb/eur/an-introduction-to-r-for-spatial-analysis-and-mapping/book241031) - *Geocomputation with R* by Robin Lovelace, Jakub Nowosad, Jannes Muenchow, CRC Press, 2019. See [online book](https://bookdown.org/robinlovelace/geocompr/). --- ## The R programming language .pull-left[ <br/> Created in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand - Free, open-source implementation of *S* - statistical programming language - Bell Labs <br/> - Functional programming language - Supports (and commonly used as) procedural (i.e., imperative) programming - Object-oriented - Interpreted (not compiled) ] .pull-right[ <br/> <br/> ![](data:image/png;base64,#https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/1200px-R_logo.svg.png) .right[ .referencenote[ by Hadley Wickham and others at RStudio<br/> via Wikimedia Commons, CC BY-SA 4.0 ] ] ] <br/> --- ## Interpreting values .pull-left[ <br/> When values and operations are inputted in the *Console*, the interpreter returns the results of its interpretation of the expression ```r 2 ``` ``` ## [1] 2 ``` ```r "String value" ``` ``` ## [1] "String value" ``` ```r # comments are ignored ``` ] .pull-right[ <br/> R provides three core data types - numeric - both integer and real numbers - character - i.e., text, also called *strings* - logical - `TRUE` or `FALSE` ] --- ## Numeric operators R provides a series of basic numeric operators |Operator|Meaning |Example |Output | |--------|----------------|---------|-----------| |+ |Plus |`5 + 2` |7 | |- |Minus |`5 - 2` |3 | |`*` |Product |`5 * 2` |10 | |/ |Division |`5 / 2` |2.5 | |%/% |Integer division|`5 %/% 2`|2| |%% |Module |`5 %% 2` |1 | |^ |Power |`5^2` |25 | <br/> ```r 5 + 2 ``` ``` ## [1] 7 ``` --- ## Logical operators R provides a series of basic logical operators to test |Operator|Meaning |Example |Output | |--------|------------------|----------------------|--------------------| |== |Equal |`5 == 2` |FALSE | |!= |Not equal |`5 != 2` |TRUE | |> (>=) |Greater (or equal)|`5 > 2` |TRUE | |< (<=) |Less (or equal) |`5 <= 2` |FALSE | |! |Not |`!TRUE` |FALSE | |& |And |`TRUE & FALSE` |FALSE | | | |Or |`TRUE` | `FALSE` |TRUE | <br/> ```r 5 >= 2 ``` ``` ## [1] TRUE ``` --- ## Variables <br/> Variables **store data** and can be defined - using an *identifier* (e.g., `a_variable`) - on the left of an *assignment operator* `<-` - followed by the object to be linked to the identifier - such as a *value* (e.g., `1`) ```r a_variable <- 1 ``` The value of the variable can be invoked by simply specifying the **identifier**. ```r a_variable ``` ``` ## [1] 1 ``` --- ## Data frames **Data frames** are complex data types, which encode the concept of a table Example: the first five rows of the [`iris`](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/iris) dataset ```r iris ``` | Sepal.Length| Sepal.Width| Petal.Length| Petal.Width|Species | |------------:|-----------:|------------:|-----------:|:-------| | 5.1| 3.5| 1.4| 0.2|setosa | | 4.9| 3.0| 1.4| 0.2|setosa | | 4.7| 3.2| 1.3| 0.2|setosa | | 4.6| 3.1| 1.5| 0.2|setosa | | 5.0| 3.6| 1.4| 0.2|setosa | <br/> ... more on what data frames actually are in the coming weeks. --- ## Algorithms and functions <br/> *An* **algorithm** *or effective procedure is a mechanical rule, or automatic method, or programme for performing some mathematical operation* (Cutland, 1980). A **program** is a specific set of instructions that implement an abstract algorithm. The definition of an algorithm (and thus a program) can consist of one or more **function**s - set of instructions that perform a task - possibly using an input, possibly returning an output value Programming languages usually provide pre-defined functions that implement common algorithms (e.g., to find the square root of a number or to calculate a linear regression) --- ## Functions <br/> Functions execute complex operations and can be invoked - specifying the *function name* - the *arguments* (input values) between simple brackets - each *argument* corresponds to a *parameter* - sometimes the *parameter* name must be specified ```r sqrt(2) ``` ``` ## [1] 1.414214 ``` ```r round(1.414214, digits = 2) ``` ``` ## [1] 1.41 ``` --- ## Example: data visualisation <br/> `R` provides a wide range of functions - allowing to execute the steps necessary to conduct a data analysis - e.g., using `hist` to plot the histogram of petal lengths of flowers in `iris` ```r hist(iris$Petal.Length, main = "Petal lengths") ``` ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/101-slides-introduction_files/figure-html/unnamed-chunk-10-1.png)<!-- --> --- ## Functions and variables <br/> - functions can be used on the right side of `<-` - variables and functions can be used as *arguments* ```r sqrt_of_two <- sqrt(2) sqrt_of_two ``` ``` ## [1] 1.414214 ``` ```r round(sqrt_of_two, digits = 2) ``` ``` ## [1] 1.41 ``` ```r round(sqrt(2), digits = 2) ``` ``` ## [1] 1.41 ``` --- ## Coding style .pull-left[ **Naming**: when creating an identifier for a variable or function - R is a **case sensitive** language - UPPER and lower case are not the same - `a_variable` is different from `a_VARIABLE` - names can include - alphanumeric symbols - `.` and `_` - names must start with - a letter ] .pull-right[ A *coding style* is a way of writing the code, including - how variables and functions are named - lower case and `_` - how spaces are used in the code - which libraries are used ```r # Bad X<-round(sqrt(2),2) #Good sqrt_of_two <- sqrt(2) %>% round(digits = 2) ``` Study the [Tidyverse Style Guide](http://style.tidyverse.org/) and use it consistently! ] --- ## Libraries <br/> Once a number of related, reusable functions are created - they can be collected and stored in **libraries** (a.k.a. *packages*) - `install.packages` is a function that can be used to install libraries (i.e., downloads it on your computer) - `library` is a function that *loads* a library (i.e., makes it available to a script) Libraries can be of any size and complexity, e.g.: - `base`: base R functions, including the `sqrt` function above - `rgdal`: implementation of the [GDAL (Geospatial Data Abstraction Library)](https://gdal.org/) functionalities --- ## Tidyverse <br/> .pull-left[ The [Tidyverse](https://www.tidyverse.org/) was introduced by statistician [Hadley Wickham](https://t.co/DWqWlxbOKK?amp=1), Chief Scientist at [RStudio](https://rstudio.com/) (worth following him on [twitter](https://twitter.com/hadleywickham)). - *"The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures."* ([Tidyverse homepage](https://www.tidyverse.org/)). ] .pull-right[ **Core libraries** - [`tibble`](https://tibble.tidyverse.org/) - [`tidyr`](https://tidyr.tidyverse.org/) - [`stringr`](https://stringr.tidyverse.org/) - [`dplyr`](https://dplyr.tidyverse.org/) - [`readr`](https://readr.tidyverse.org/) - [`ggplot2`](https://ggplot2.tidyverse.org/) - [`purrr`](https://purrr.tidyverse.org/) - [`forcats`](https://forcats.tidyverse.org/) ] Also, imports [`magrittr`](https://magrittr.tidyverse.org/), which plays an important role. <!-- ## Tidyverse core libraries The meta-library [Tidyverse](https://www.tidyverse.org/) includes: - **[`tibble`](https://tibble.tidyverse.org/)** is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code. - **[`tidyr`](https://tidyr.tidyverse.org/)** provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable. - **[`stringr`](https://stringr.tidyverse.org/)** provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. ## Tidyverse core libraries The meta-library [Tidyverse](https://www.tidyverse.org/) includes: - **[`dplyr`](https://dplyr.tidyverse.org/)** provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. - **[`readr`](https://readr.tidyverse.org/)** provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. - **[`ggplot2`](https://ggplot2.tidyverse.org/)** is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. ## Tidyverse core libraries The meta-library [Tidyverse](https://www.tidyverse.org/) contains the following libraries: - **[`purrr`](https://purrr.tidyverse.org/)** enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive. - **[`forcats`](https://forcats.tidyverse.org/)** provides a suite of useful tools that solve common problems with factors. R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. --> --- ## The pipe operator The [Tidyverse](https://www.tidyverse.org/) (via [`magrittr`](https://magrittr.tidyverse.org/)) also provide a clean and effective way of combining multiple manipulation steps .pull-left-small[ The pipe operator `%>%` - takes the result from one function - and passes it to the next function - as the **first argument** - that doesn't need to be included in the code anymore ] .pull-right-large[ ![Pipe operator example and diagram](data:image/png;base64,#images/PipeOperator.png) ] --- ## Pipe example <br/> The two codes below are equivalent - the first simply invokes the functions - the second uses the pipe operator `%>%` ```r round(sqrt(2), digits = 2) ``` ``` ## [1] 1.41 ``` ```r library(tidyverse) sqrt(2) %>% round(digits = 2) ``` ``` ## [1] 1.41 ``` --- ## Pipe example: data visualisation <br/> The previous histogram example ```r hist(iris$Petal.Length, main = "Petal lengths") ``` can be rewritten as ```r iris %>% pull(Petal.Length) %>% hist(main = "Petal lengths") ``` ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/101-slides-introduction_files/figure-html/unnamed-chunk-16-1.png)<!-- --> --- ## Summary <br/> .pull-left[ An introduction to R - The R programming language - Core concepts - Tidyverse **Next week**: Reproducible data science - Data science - Complex data types - Reproducibility - `dplyr` <br/> .referencenote[ Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com). ] ] .pull-right[ ![](data:image/png;base64,#https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/1200px-R_logo.svg.png) .right[ .referencenote[ by Hadley Wickham and others at RStudio<br/> via Wikimedia Commons, CC BY-SA 4.0 ] ] ]