Lecture 103

class: center, middle, inverse, title-slide

.title[
# Lecture 103
]
.author[
### Dr Stefano De Sabbata<br /><small>School of Geography, Geology, and the Env., University of Leicester<br /><a href="https://github.com/sdesabbata/r-for-geographic-data-science" style="color: white">github.com/sdesabbata/r-for-geographic-data-science</a><br /><a href="mailto:s.desabbata@le.ac.uk" style="color: white">s.desabbata@le.ac.uk</a> | <a href="https://twitter.com/maps4thought" style="color: white">@maps4thought</a><br />text licensed under <a href="https://creativecommons.org/licenses/by-sa/4.0/" style="color: white">CC BY-SA 4.0</a>, code licensed under <a href="https://www.gnu.org/licenses/gpl-3.0.html" style="color: white">GNU GPL v3.0</a></small>
]

---

class: inverse, center, middle

# Data manipulation

---

## Recap

<br/>

.pull-left[

**Previously**: Reproducible data science

- Data science
- Reproducibility
- Data input and output

**Today**: Data manipulation

- Complex data types
- Into the Tidyverse
  - `dplyr`

]
.pull-right[

![](data:image/png;base64,#https://raw.githubusercontent.com/tidyverse/dplyr/main/man/figures/logo.png)

.right[
.referencenote[
by dplyr authors<br/>
via [dplyr GitHub repository](https://github.com/tidyverse/dplyr/), MIT License
]
]

]

---
class: inverse, center, middle

# Complex data types

---

## Vectors

.pull-left[

**Vectors** are ordered list of values.

- Vectors can be of any data type
    - numeric
    - character
    - logic
- All items in a vector have to be of the same type
- Vectors can be of any length
- Functions can be used on a vector variable directly

]
.pull-right[

A vector variable can be defined using

- an **identifier** (e.g., `a_vector`) 
- on the left of an **assignment operator** `<-`
- followed by the object to be linked to the identifier
- in this case, the result returned by the function `c`
- which creates a vector containing the provided elements

]

<br/>

```r
a_vector <- c("Birmingham", "Derby", "Leicester",
  "Lincoln", "Nottingham", "Wolverhampton")
a_vector
```

```
## [1] "Birmingham"    "Derby"         "Leicester"     "Lincoln"      
## [5] "Nottingham"    "Wolverhampton"
```

---

## Selection and filtering

Each element of a vector can be retrieved specifying the related **index** between square brackets, after the identifier of the vector. The first element of the vector has index 1.

```r
a_vector[3]
```

```
## [1] "Leicester"
```

A vector of indexes can be used to retrieve more than one element.

```r
a_vector[c(5, 3)]
```

```
## [1] "Nottingham" "Leicester"
```

A condition can be provided for filtering only the elements for which the condition is true

```r
a_numeric_vector <- 1:5
a_numeric_vector[a_numeric_vector >= 3]
```

```
## [1] 3 4 5
```

---

## Factors

A **factor** is a data type similar to a vector. However, the values contained in a factor can only be selected from a set of **levels**.

A specific set of levels can be specified when creating a factor by providing a **levels** argument.

```r
houses_factor_spec <- factor(
  c("People Carrier", "Flat", "Flat", "Hatchback",
      "Flat", "Terrace", "Terrace"),
  levels = c("Bungalow", "Flat", "Detached",
       "Semi", "Terrace"))

table(houses_factor_spec)
```

```
## houses_factor_spec
## Bungalow     Flat Detached     Semi  Terrace 
##        0        3        0        0        2
```

Factors are unordered by default, but ordered factors can be created

---

## Lists and data frames

Variables of the type **list** can contain elements of different types (including vectors and matrices), whereas elements of vectors are all of the same type.

```r
employee <- list("Stef", 2015)
employee
```

```
## [[1]]
## [1] "Stef"
## 
## [[2]]
## [1] 2015
```

```r
employee[[1]] # Note the double square brackets for selection
```

```
## [1] "Stef"
```

In **named lists** each element has a name, and elements can be selected using the operator `$`.

```r
employee <- list(employee_name = "Stef", start_year = 2015)
employee$employee_name
```

```
## [1] "Stef"
```

---

## Data frames

A **data frame** is equivalent to a *named list* where all elements are *vectors of the same length*.

```r
employees <- data.frame(
  employee_name = c("Maria", "Pete", "Sarah"),
  age = c(47, 34, 32),
  role = c("Professor", "Researcher", "Researcher"))

employees
```

```
##   employee_name age       role
## 1         Maria  47  Professor
## 2          Pete  34 Researcher
## 3         Sarah  32 Researcher
```

Data frames are the most common way to represent tabular data in R.

|employee_name | age|role       |
|:-------------|---:|:----------|
|Maria         |  47|Professor  |
|Pete          |  34|Researcher |
|Sarah         |  32|Researcher |

---

## Selecting and filtering

.pull-left[

**Columns** of data frames can be **selected**

- specifying the column index

```r
employees[, c(3, 2)]
```

- specifying the column name

```r
employees[, c("role", "employee_name")]
```

```
##         role employee_name
## 1  Professor         Maria
## 2 Researcher          Pete
## 3 Researcher         Sarah
```

The same applies to **rows** (**slicing**)

```r
employees[3, ]
```

```
##   employee_name age       role
## 3         Sarah  32 Researcher
```

]
.pull-right[

As with vectors, a condition can be provided for **filtering** only the **rows** for which the condition is true

```r
employees$role
```

```
## [1] "Professor"  "Researcher" "Researcher"
```

```r
employees$role == "Researcher"
```

```
## [1] FALSE  TRUE  TRUE
```

```r
employees[employees$role == "Researcher", ]
```

```
##   employee_name age       role
## 2          Pete  34 Researcher
## 3         Sarah  32 Researcher
```

]

---
class: inverse, center, middle

# Into the Tidyverse

---

## readr

The `readr::read_csv` function of the [`readr`](https://readr.tidyverse.org/index.html) library reads a *csv* file from the path provided as the first argument

```r
library(tidyverse)

leicester_2011OAC <- 
  read_csv("2011_OAC_supgrp_Leicester.csv")

leicester_2011OAC
```

```
## # A tibble: 969 × 5
##    OA11CD    LSOA11CD  supgrpcode supgrpname                  Total_Population
##    <chr>     <chr>          <dbl> <chr>                                  <dbl>
##  1 E00069517 E01013785          6 Suburbanites                             313
##  2 E00069514 E01013784          2 Cosmopolitans                            323
##  3 E00169516 E01013713          4 Multicultural Metropolitans              341
##  4 E00169048 E01032862          4 Multicultural Metropolitans              345
##  5 E00169044 E01032862          4 Multicultural Metropolitans              322
##  6 E00069041 E01013679          4 Multicultural Metropolitans              334
##  7 E00169049 E01032862          4 Multicultural Metropolitans              336
##  8 E00068806 E01013628          4 Multicultural Metropolitans              312
##  9 E00068886 E01013647          3 Ethnicity Central                        505
## 10 E00068807 E01013624          4 Multicultural Metropolitans              362
## # ℹ 959 more rows
```

---

.pull-left[

## dplyr

The `dplyr` (pronounced *dee-ply-er*) library is part of `tidyverse` and it offers a grammar for data manipulation

- `select`: select specific columns
- `filter`: select specific rows
- `arrange`: arrange rows in a particular order
- `summarise`: calculate aggregated values (e.g., mean, max, etc)
- `group_by`: group data based on common column values
- `mutate`: add columns
- `join`: merge tables (`data.frames` or `tibbles`)

```r
library(tidyverse)
```

]
.pull-right[

## tibble

A [tibble](https://tibble.tidyverse.org/) is a modern reimagining of the data.frame within `tidyverse`

- they do less 
    - don’t change column names or types
    - don’t do partial matching
- complain more
    - e.g. when referring to a column that does not exist
    
That forces you to confront problems earlier, typically leading to cleaner, more expressive code.

]

---

## dplyr::select

.pull-left[

`select` can be used to specify which columns to retain

```r
leicester_2011OAC %>%
  select( 
    OA11CD, LSOA11CD
    supgrpname,
    Total_Population
  )
```

]
.pull-right[

... or which ones to drop, using - in front of the column name

```r
leicester_2011OAC %>%
  select(
    OA11CD, LSOA11CD
    supgrpname,
    Total_Population
  ) %>% 
  select(-LSOA11CD)
```

]

.pull-left[

```
## # A tibble: 969 × 4
##   OA11CD    LSOA11CD  supgrpname                  Total_Population
##   <chr>     <chr>     <chr>                                  <dbl>
## 1 E00069517 E01013785 Suburbanites                             313
## 2 E00069514 E01013784 Cosmopolitans                            323
## 3 E00169516 E01013713 Multicultural Metropolitans              341
## 4 E00169048 E01032862 Multicultural Metropolitans              345
## 5 E00169044 E01032862 Multicultural Metropolitans              322
## 6 E00069041 E01013679 Multicultural Metropolitans              334
## 7 E00169049 E01032862 Multicultural Metropolitans              336
## # ℹ 962 more rows
```

]
.pull-right[

```
## # A tibble: 969 × 3
##   OA11CD    supgrpname                  Total_Population
##   <chr>     <chr>                                  <dbl>
## 1 E00069517 Suburbanites                             313
## 2 E00069514 Cosmopolitans                            323
## 3 E00169516 Multicultural Metropolitans              341
## 4 E00169048 Multicultural Metropolitans              345
## 5 E00169044 Multicultural Metropolitans              322
## 6 E00069041 Multicultural Metropolitans              334
## 7 E00169049 Multicultural Metropolitans              336
## # ℹ 962 more rows
```
]

---

## dplyr::filter

.pull-left[

`filter` can be used to specify a condition for **filtering** only the **rows** for which the condition is true

```r
leicester_2011OAC %>%
  # Cosmopolitans areas
  filter(supgrpname == "Cosmopolitans")
```

```
## # A tibble: 83 × 5
##   OA11CD    LSOA11CD  supgrpcode supgrpname   
##   <chr>     <chr>          <dbl> <chr>        
## 1 E00069514 E01013784          2 Cosmopolitans
## 2 E00068882 E01013643          2 Cosmopolitans
## 3 E00169461 E01032873          2 Cosmopolitans
## # ℹ 80 more rows
## # ℹ 1 more variable: Total_Population <dbl>
```

]
.pull-right[

... and can be concatenated with `select` or any other function using `%>%` to create a more complex process

```r
leicester_2011OAC %>%
  # Select the columns you need
  select(
    OA11CD, LSOA11CD,
    supgrpname,
    Total_Population
  ) %>% 
  # Let's assume we are not interested in the LSOA code
  select(-LSOA11CD) %>% 
  # Cosmopolitans areas
  filter(supgrpname == "Cosmopolitans")
```

```
## # A tibble: 83 × 3
##   OA11CD    supgrpname    Total_Population
##   <chr>     <chr>                    <dbl>
## 1 E00069514 Cosmopolitans              323
## 2 E00068882 Cosmopolitans              285
## 3 E00169461 Cosmopolitans              229
## # ℹ 80 more rows
```

]

---

---
class: inverse, center, middle

# Data manipulation

---

## dplyr::slice

.pull-left[

Slices the table, returning only a given subset of rows

- `slice_head` returns the `n` top rows
- `slice_tail` returns the `n` last rows
- `slice_min` returns the `n` rows with the lowest value for the selected column
- `slice_max` returns the `n` rows with the highest value for the selected column
- `slice_sample` returns a random sample of `n` rows

]
.pull-right[

```r
library(knitr)
library(kableExtra)

leicester_2011OAC %>%
  # 10 highest delays
  select(
    OA11CD,	supgrpname,	
    Total_Population
  ) %>% 
  slice_max(Total_Population, n = 8) %>% 
  kable("html") %>%
  kable_styling(font_size = 16)
```

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> OA11CD </th>
   <th style="text-align:left;"> supgrpname </th>
   <th style="text-align:right;"> Total_Population </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> E00169553 </td>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:right;"> 714 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00168096 </td>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:right;"> 708 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069303 </td>
   <td style="text-align:left;"> Multicultural Metropolitans </td>
   <td style="text-align:right;"> 623 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00169458 </td>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:right;"> 622 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069304 </td>
   <td style="text-align:left;"> Multicultural Metropolitans </td>
   <td style="text-align:right;"> 608 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00169575 </td>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:right;"> 597 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069393 </td>
   <td style="text-align:left;"> Ethnicity Central </td>
   <td style="text-align:right;"> 591 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00169564 </td>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:right;"> 591 </td>
  </tr>
</tbody>
</table>

]

`prop` can be specified instead of `n` to indicate a percentage

- e.g., `prop = 0.01` returns 1% of rows

How is this different from `filter`?

---

## dplyr::arrange

Arranges rows in a particular order

- sorted in acending order of the specified columns
- descending orders specified by using `-` (minus symbol)
- the `NA` values are always at the end

.pull-left[

```r
leicester_2011OAC %>%
  select(
    OA11CD,	supgrpname,	
    Total_Population
  ) %>% 
  arrange(
    # Ascending supergroup name
    supgrpname,
    # Descending population
    -Total_Population
  ) %>% 
  slice_head(n = 8) %>% 
  kable("html") %>%
  kable_styling(font_size = 16)
```

]
.pull-right[

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> OA11CD </th>
   <th style="text-align:left;"> supgrpname </th>
   <th style="text-align:right;"> Total_Population </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> E00068829 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 394 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069287 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 357 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069284 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 332 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069099 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 325 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00068850 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 321 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069040 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 320 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069106 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 316 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069029 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 307 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069047 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 303 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> E00069108 </td>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 300 </td>
  </tr>
</tbody>
</table>
]

---

## dplyr::summarise

<br/>

.pull-left[

Calculates aggregated values

- using functions such
    - `mean`, `median`, `sd`
    - `min`, `max`, `quantile`
    - `first`, `last`, `nth`
    - `n` (count)
    - `any`, `all`

]
.pull-right[

```r
leicester_2011OAC %>%
  select(
    OA11CD,	supgrpname,	
    Total_Population
  ) %>% 
  summarise(
    # Average population
    avg_pop = mean(Total_Population), 
    # Is there any OA with over 200 people?
    pop_over_600 = any(Total_Population > 600),
    # Total population in Leicester
    tot_pop = sum(Total_Population)
  ) %>% 
  kable("html") %>%
  kable_styling(font_size = 16)
```

]

- both *British English* and *American English* spellings work, there is no difference
    - `summarise`
    - `summarize`

---

## dplyr::group_by

Groups rows based on common values for specified column(s)

- all subsequent operations are applied **to each group separately**
- frequently combined with `summarise` to calculate aggregated values per group

.pull-left[

```r
leicester_2011OAC %>%
  select(
    OA11CD,	supgrpname,	
    Total_Population
  ) %>% 
  group_by(supgrpname) %>% 
  summarise(
    # Average population
    avg_pop = mean(Total_Population), 
    # Is there any OA with over 200 people?
    pop_over_200 = any(Total_Population > 600),
    # Total population in Leicester
    tot_pop = sum(Total_Population)
  ) %>% 
  kable("html") %>%
  kable_styling(font_size = 16)
```

]
.pull-right[

.referencenote[
<br/>
]

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> supgrpname </th>
   <th style="text-align:right;"> avg_pop </th>
   <th style="text-align:left;"> pop_over_200 </th>
   <th style="text-align:right;"> tot_pop </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 257.3056 </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:right;"> 9263 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:right;"> 323.0482 </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:right;"> 26813 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ethnicity Central </td>
   <td style="text-align:right;"> 335.7368 </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:right;"> 19137 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Hard-Pressed Living </td>
   <td style="text-align:right;"> 317.5149 </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:right;"> 32069 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Multicultural Metropolitans </td>
   <td style="text-align:right;"> 357.7784 </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:right;"> 205007 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Suburbanites </td>
   <td style="text-align:right;"> 320.8519 </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:right;"> 17326 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Urbanites </td>
   <td style="text-align:right;"> 311.1385 </td>
   <td style="text-align:left;"> FALSE </td>
   <td style="text-align:right;"> 20224 </td>
  </tr>
</tbody>
</table>

]

---

## dplyr::group_by

.pull-left[

```r
leicester_2011OAC %>%
  select(
    OA11CD,	LSOA11CD, supgrpname,	
    Total_Population
  ) %>% 
  group_by(supgrpname, LSOA11CD) %>% 
  summarise(
    # Average population
    avg_pop = mean(Total_Population)
  ) %>% 
  # then slice the highest 3
  slice_max(avg_pop, n = 3) %>% 
  kable("html") %>%
  kable_styling(font_size = 16)
```

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> supgrpname </th>
   <th style="text-align:left;"> LSOA11CD </th>
   <th style="text-align:right;"> avg_pop </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:left;"> E01013632 </td>
   <td style="text-align:right;"> 394.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:left;"> E01013723 </td>
   <td style="text-align:right;"> 332.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:left;"> E01013638 </td>
   <td style="text-align:right;"> 321.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:left;"> E01013648 </td>
   <td style="text-align:right;"> 549.5000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:left;"> E01013689 </td>
   <td style="text-align:right;"> 471.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:left;"> E01013647 </td>
   <td style="text-align:right;"> 465.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ethnicity Central </td>
   <td style="text-align:left;"> E01013754 </td>
   <td style="text-align:right;"> 514.7500 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ethnicity Central </td>
   <td style="text-align:left;"> E01013760 </td>
   <td style="text-align:right;"> 494.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ethnicity Central </td>
   <td style="text-align:left;"> E01013755 </td>
   <td style="text-align:right;"> 443.2000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Hard-Pressed Living </td>
   <td style="text-align:left;"> E01013640 </td>
   <td style="text-align:right;"> 392.5000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Hard-Pressed Living </td>
   <td style="text-align:left;"> E01013631 </td>
   <td style="text-align:right;"> 382.6667 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Hard-Pressed Living </td>
   <td style="text-align:left;"> E01013637 </td>
   <td style="text-align:right;"> 382.3333 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Multicultural Metropolitans </td>
   <td style="text-align:left;"> E01013686 </td>
   <td style="text-align:right;"> 571.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Multicultural Metropolitans </td>
   <td style="text-align:left;"> E01013748 </td>
   <td style="text-align:right;"> 533.5000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Multicultural Metropolitans </td>
   <td style="text-align:left;"> E01013733 </td>
   <td style="text-align:right;"> 498.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Suburbanites </td>
   <td style="text-align:left;"> E01013699 </td>
   <td style="text-align:right;"> 397.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Suburbanites </td>
   <td style="text-align:left;"> E01013696 </td>
   <td style="text-align:right;"> 395.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Suburbanites </td>
   <td style="text-align:left;"> E01013678 </td>
   <td style="text-align:right;"> 383.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Urbanites </td>
   <td style="text-align:left;"> E01013785 </td>
   <td style="text-align:right;"> 401.0000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Urbanites </td>
   <td style="text-align:left;"> E01013736 </td>
   <td style="text-align:right;"> 385.5000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Urbanites </td>
   <td style="text-align:left;"> E01013733 </td>
   <td style="text-align:right;"> 375.6667 </td>
  </tr>
</tbody>
</table>

]
.pull-right[

]

```r
leicester_2011OAC %>%
  select(
    OA11CD,	LSOA11CD, supgrpname,	
    Total_Population
  ) %>% 
  group_by(supgrpname, LSOA11CD) %>% 
  summarise(
    # Average population
    avg_pop = mean(Total_Population)
  ) %>% 
  ungroup() %>% 
  # then slice the highest 3
  slice_max(avg_pop, n = 3) %>% 
  kable("html") %>%
  kable_styling(font_size = 16)
```

---

## dplyr::mutate

Calculates new columns based on values in current columns

.pull-left[

```r
leicester_2011OAC %>%
  select(supgrpname,	Total_Population) %>% 
  group_by(supgrpname) %>% 
  summarise(tot_pop = sum(Total_Population)) %>% 
  kable("html") %>%
  kable_styling(font_size = 16)
```

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> supgrpname </th>
   <th style="text-align:right;"> tot_pop </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Constrained City Dwellers </td>
   <td style="text-align:right;"> 9263 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cosmopolitans </td>
   <td style="text-align:right;"> 26813 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Ethnicity Central </td>
   <td style="text-align:right;"> 19137 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Hard-Pressed Living </td>
   <td style="text-align:right;"> 32069 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Multicultural Metropolitans </td>
   <td style="text-align:right;"> 205007 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Suburbanites </td>
   <td style="text-align:right;"> 17326 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Urbanites </td>
   <td style="text-align:right;"> 20224 </td>
  </tr>
</tbody>
</table>

]
.pull-right[

```r
leicester_2011OAC %>%
  select(supgrpname,	Total_Population) %>% 
  group_by(supgrpname) %>% 
  summarise(tot_pop = sum(Total_Population)) %>% 
  mutate(
    perc_pop = (tot_pop / 329839) * 100
  ) %>% 
  kable("html") %>%
  kable_styling(font_size = 16)
```

]

---

## Saving the output

.pull-left[

<br/>

It is **important** to note that

- operations in the previous slides **do not** modify the input data!

<br/>

To save the newly crated information, use

- the assignment operator `<-`
    - to save to new (or same) variable
- the assignment pipe `%<>%` (`magrittr`)
    - to update the variable

```r
library(magrittr)
```

]
.pull-right[

]

```r
pop_per_supgr <-
  leicester_2011OAC %>%
  select(supgrpname,	Total_Population) %>% 
  group_by(supgrpname) %>% 
  summarise(tot_pop = sum(Total_Population))

pop_per_supgr %<>%
  mutate(perc_pop = (tot_pop / 329839) * 100)
  
pop_per_supgr %>% 
  # and print as usual
  kable("html") %>%
  kable_styling(font_size = 16)
```

---

## Full pipe example

.pull-left-large[

```r
# Let's start from the filename
"2011_OAC_supgrp_Leicester.csv" %>% 
  # as input to the read_csv function
  read_csv(col_types = "cccci") %>% 
  # Select only the necessary columns
  select(supgrpname,	Total_Population) %>% 
  # Let's say we are not interested in
  # the Suburbanites supergroup
  filter(supgrpname != "Suburbanites") %>% 
  # Group by supergroup
  group_by(supgrpname) %>% 
  # Aggregate population
  summarise(tot_pop = sum(Total_Population)) %>% 
  # Ungroup
  ungroup() %>% 
  # Arrange by descending total population
  arrange(-tot_pop) %>% 
  # Calculate percentage
  mutate(perc_pop = (tot_pop / 329839) * 100) %T>%
  # Then use the Tee pipe %T>% in the line above
  # to write the calculated values to file
  # and the pass the same input values to the kable function
  write_csv(
    "2011_Leicester_pop_per_OAC_supgrp_excl_suburb.csv"
  ) %>% 
  # Print to screen nicely
  kable("html") %>%
  kable_styling(font_size = 16)
```

]
.pull-right-small[

<br/>

This is a full example of a short analysis using only one series of pipes (including the [Tee pipe](https://magrittr.tidyverse.org/reference/tee.html)) to read, process and write data using `R` and almost all the `tidyverse` *verbs* seen so far.

<br/>

]

---

## Summary

<br/>

.pull-left[

**Today**: Data manipulation

- Complex data types
- Into the Tidyverse
  - `dplyr`

**Next week**: Table operations

- Long and wide table formats
- Pivot operations *(not as in Excel)*
- Join operations

<br/>

.referencenote[
Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com).
]

]
.pull-right[

![](data:image/png;base64,#https://raw.githubusercontent.com/tidyverse/tidyr/main/man/figures/logo.png)

.right[
.referencenote[
by tidyr authors<br/>
via [tidyr GitHub repository](https://github.com/tidyverse/tidyr/), MIT License
]
]

]