class: center, middle, inverse, title-slide .title[ # Lecture 103 ] .author[ ### Dr Stefano De Sabbata
School of Geography, Geology, and the Env., University of Leicester
github.com/sdesabbata/r-for-geographic-data-science
s.desabbata@le.ac.uk
|
@maps4thought
text licensed under
CC BY-SA 4.0
, code licensed under
GNU GPL v3.0
] --- class: inverse, center, middle # Data manipulation --- ## Recap <br/> .pull-left[ **Previously**: Reproducible data science - Data science - Reproducibility - Data input and output **Today**: Data manipulation - Complex data types - Into the Tidyverse - `dplyr` ] .pull-right[ ![](data:image/png;base64,#https://raw.githubusercontent.com/tidyverse/dplyr/main/man/figures/logo.png) .right[ .referencenote[ by dplyr authors<br/> via [dplyr GitHub repository](https://github.com/tidyverse/dplyr/), MIT License ] ] ] --- class: inverse, center, middle # Complex data types --- ## Vectors .pull-left[ **Vectors** are ordered list of values. - Vectors can be of any data type - numeric - character - logic - All items in a vector have to be of the same type - Vectors can be of any length - Functions can be used on a vector variable directly ] .pull-right[ A vector variable can be defined using - an **identifier** (e.g., `a_vector`) - on the left of an **assignment operator** `<-` - followed by the object to be linked to the identifier - in this case, the result returned by the function `c` - which creates a vector containing the provided elements ] <br/> ```r a_vector <- c("Birmingham", "Derby", "Leicester", "Lincoln", "Nottingham", "Wolverhampton") a_vector ``` ``` ## [1] "Birmingham" "Derby" "Leicester" "Lincoln" ## [5] "Nottingham" "Wolverhampton" ``` --- ## Selection and filtering Each element of a vector can be retrieved specifying the related **index** between square brackets, after the identifier of the vector. The first element of the vector has index 1. ```r a_vector[3] ``` ``` ## [1] "Leicester" ``` A vector of indexes can be used to retrieve more than one element. ```r a_vector[c(5, 3)] ``` ``` ## [1] "Nottingham" "Leicester" ``` A condition can be provided for filtering only the elements for which the condition is true ```r a_numeric_vector <- 1:5 a_numeric_vector[a_numeric_vector >= 3] ``` ``` ## [1] 3 4 5 ``` --- ## Factors A **factor** is a data type similar to a vector. However, the values contained in a factor can only be selected from a set of **levels**. A specific set of levels can be specified when creating a factor by providing a **levels** argument. ```r houses_factor_spec <- factor( c("People Carrier", "Flat", "Flat", "Hatchback", "Flat", "Terrace", "Terrace"), levels = c("Bungalow", "Flat", "Detached", "Semi", "Terrace")) table(houses_factor_spec) ``` ``` ## houses_factor_spec ## Bungalow Flat Detached Semi Terrace ## 0 3 0 0 2 ``` Factors are unordered by default, but ordered factors can be created --- ## Lists and data frames Variables of the type **list** can contain elements of different types (including vectors and matrices), whereas elements of vectors are all of the same type. ```r employee <- list("Stef", 2015) employee ``` ``` ## [[1]] ## [1] "Stef" ## ## [[2]] ## [1] 2015 ``` ```r employee[[1]] # Note the double square brackets for selection ``` ``` ## [1] "Stef" ``` In **named lists** each element has a name, and elements can be selected using the operator `$`. ```r employee <- list(employee_name = "Stef", start_year = 2015) employee$employee_name ``` ``` ## [1] "Stef" ``` --- ## Data frames A **data frame** is equivalent to a *named list* where all elements are *vectors of the same length*. ```r employees <- data.frame( employee_name = c("Maria", "Pete", "Sarah"), age = c(47, 34, 32), role = c("Professor", "Researcher", "Researcher")) employees ``` ``` ## employee_name age role ## 1 Maria 47 Professor ## 2 Pete 34 Researcher ## 3 Sarah 32 Researcher ``` Data frames are the most common way to represent tabular data in R. |employee_name | age|role | |:-------------|---:|:----------| |Maria | 47|Professor | |Pete | 34|Researcher | |Sarah | 32|Researcher | --- ## Selecting and filtering .pull-left[ **Columns** of data frames can be **selected** - specifying the column index ```r employees[, c(3, 2)] ``` - specifying the column name ```r employees[, c("role", "employee_name")] ``` ``` ## role employee_name ## 1 Professor Maria ## 2 Researcher Pete ## 3 Researcher Sarah ``` The same applies to **rows** (**slicing**) ```r employees[3, ] ``` ``` ## employee_name age role ## 3 Sarah 32 Researcher ``` ] .pull-right[ As with vectors, a condition can be provided for **filtering** only the **rows** for which the condition is true ```r employees$role ``` ``` ## [1] "Professor" "Researcher" "Researcher" ``` ```r employees$role == "Researcher" ``` ``` ## [1] FALSE TRUE TRUE ``` ```r employees[employees$role == "Researcher", ] ``` ``` ## employee_name age role ## 2 Pete 34 Researcher ## 3 Sarah 32 Researcher ``` ] --- class: inverse, center, middle # Into the Tidyverse --- ## readr The `readr::read_csv` function of the [`readr`](https://readr.tidyverse.org/index.html) library reads a *csv* file from the path provided as the first argument ```r library(tidyverse) leicester_2011OAC <- read_csv("2011_OAC_supgrp_Leicester.csv") leicester_2011OAC ``` ``` ## # A tibble: 969 × 5 ## OA11CD LSOA11CD supgrpcode supgrpname Total_Population ## <chr> <chr> <dbl> <chr> <dbl> ## 1 E00069517 E01013785 6 Suburbanites 313 ## 2 E00069514 E01013784 2 Cosmopolitans 323 ## 3 E00169516 E01013713 4 Multicultural Metropolitans 341 ## 4 E00169048 E01032862 4 Multicultural Metropolitans 345 ## 5 E00169044 E01032862 4 Multicultural Metropolitans 322 ## 6 E00069041 E01013679 4 Multicultural Metropolitans 334 ## 7 E00169049 E01032862 4 Multicultural Metropolitans 336 ## 8 E00068806 E01013628 4 Multicultural Metropolitans 312 ## 9 E00068886 E01013647 3 Ethnicity Central 505 ## 10 E00068807 E01013624 4 Multicultural Metropolitans 362 ## # ℹ 959 more rows ``` --- .pull-left[ ## dplyr The `dplyr` (pronounced *dee-ply-er*) library is part of `tidyverse` and it offers a grammar for data manipulation - `select`: select specific columns - `filter`: select specific rows - `arrange`: arrange rows in a particular order - `summarise`: calculate aggregated values (e.g., mean, max, etc) - `group_by`: group data based on common column values - `mutate`: add columns - `join`: merge tables (`data.frames` or `tibbles`) ```r library(tidyverse) ``` ] .pull-right[ ## tibble A [tibble](https://tibble.tidyverse.org/) is a modern reimagining of the data.frame within `tidyverse` - they do less - don’t change column names or types - don’t do partial matching - complain more - e.g. when referring to a column that does not exist That forces you to confront problems earlier, typically leading to cleaner, more expressive code. ] --- ## dplyr::select .pull-left[ `select` can be used to specify which columns to retain ```r leicester_2011OAC %>% select( OA11CD, LSOA11CD supgrpname, Total_Population ) ``` ] .pull-right[ ... or which ones to drop, using - in front of the column name ```r leicester_2011OAC %>% select( OA11CD, LSOA11CD supgrpname, Total_Population ) %>% select(-LSOA11CD) ``` ] .pull-left[ ``` ## # A tibble: 969 × 4 ## OA11CD LSOA11CD supgrpname Total_Population ## <chr> <chr> <chr> <dbl> ## 1 E00069517 E01013785 Suburbanites 313 ## 2 E00069514 E01013784 Cosmopolitans 323 ## 3 E00169516 E01013713 Multicultural Metropolitans 341 ## 4 E00169048 E01032862 Multicultural Metropolitans 345 ## 5 E00169044 E01032862 Multicultural Metropolitans 322 ## 6 E00069041 E01013679 Multicultural Metropolitans 334 ## 7 E00169049 E01032862 Multicultural Metropolitans 336 ## # ℹ 962 more rows ``` ] .pull-right[ ``` ## # A tibble: 969 × 3 ## OA11CD supgrpname Total_Population ## <chr> <chr> <dbl> ## 1 E00069517 Suburbanites 313 ## 2 E00069514 Cosmopolitans 323 ## 3 E00169516 Multicultural Metropolitans 341 ## 4 E00169048 Multicultural Metropolitans 345 ## 5 E00169044 Multicultural Metropolitans 322 ## 6 E00069041 Multicultural Metropolitans 334 ## 7 E00169049 Multicultural Metropolitans 336 ## # ℹ 962 more rows ``` ] --- ## dplyr::filter .pull-left[ `filter` can be used to specify a condition for **filtering** only the **rows** for which the condition is true ```r leicester_2011OAC %>% # Cosmopolitans areas filter(supgrpname == "Cosmopolitans") ``` ``` ## # A tibble: 83 × 5 ## OA11CD LSOA11CD supgrpcode supgrpname ## <chr> <chr> <dbl> <chr> ## 1 E00069514 E01013784 2 Cosmopolitans ## 2 E00068882 E01013643 2 Cosmopolitans ## 3 E00169461 E01032873 2 Cosmopolitans ## # ℹ 80 more rows ## # ℹ 1 more variable: Total_Population <dbl> ``` ] .pull-right[ ... and can be concatenated with `select` or any other function using `%>%` to create a more complex process ```r leicester_2011OAC %>% # Select the columns you need select( OA11CD, LSOA11CD, supgrpname, Total_Population ) %>% # Let's assume we are not interested in the LSOA code select(-LSOA11CD) %>% # Cosmopolitans areas filter(supgrpname == "Cosmopolitans") ``` ``` ## # A tibble: 83 × 3 ## OA11CD supgrpname Total_Population ## <chr> <chr> <dbl> ## 1 E00069514 Cosmopolitans 323 ## 2 E00068882 Cosmopolitans 285 ## 3 E00169461 Cosmopolitans 229 ## # ℹ 80 more rows ``` ] --- --- class: inverse, center, middle # Data manipulation --- ## dplyr::slice .pull-left[ Slices the table, returning only a given subset of rows - `slice_head` returns the `n` top rows - `slice_tail` returns the `n` last rows - `slice_min` returns the `n` rows with the lowest value for the selected column - `slice_max` returns the `n` rows with the highest value for the selected column - `slice_sample` returns a random sample of `n` rows {{content}} ] .pull-right[ ```r library(knitr) library(kableExtra) leicester_2011OAC %>% # 10 highest delays select( OA11CD, supgrpname, Total_Population ) %>% slice_max(Total_Population, n = 8) %>% kable("html") %>% kable_styling(font_size = 16) ``` <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> OA11CD </th> <th style="text-align:left;"> supgrpname </th> <th style="text-align:right;"> Total_Population </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> E00169553 </td> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 714 </td> </tr> <tr> <td style="text-align:left;"> E00168096 </td> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 708 </td> </tr> <tr> <td style="text-align:left;"> E00069303 </td> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:right;"> 623 </td> </tr> <tr> <td style="text-align:left;"> E00169458 </td> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 622 </td> </tr> <tr> <td style="text-align:left;"> E00069304 </td> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:right;"> 608 </td> </tr> <tr> <td style="text-align:left;"> E00169575 </td> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 597 </td> </tr> <tr> <td style="text-align:left;"> E00069393 </td> <td style="text-align:left;"> Ethnicity Central </td> <td style="text-align:right;"> 591 </td> </tr> <tr> <td style="text-align:left;"> E00169564 </td> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 591 </td> </tr> </tbody> </table> ] -- `prop` can be specified instead of `n` to indicate a percentage - e.g., `prop = 0.01` returns 1% of rows -- How is this different from `filter`? --- ## dplyr::arrange Arranges rows in a particular order - sorted in acending order of the specified columns - descending orders specified by using `-` (minus symbol) - the `NA` values are always at the end .pull-left[ ```r leicester_2011OAC %>% select( OA11CD, supgrpname, Total_Population ) %>% arrange( # Ascending supergroup name supgrpname, # Descending population -Total_Population ) %>% slice_head(n = 8) %>% kable("html") %>% kable_styling(font_size = 16) ``` ] .pull-right[ <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> OA11CD </th> <th style="text-align:left;"> supgrpname </th> <th style="text-align:right;"> Total_Population </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> E00068829 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 394 </td> </tr> <tr> <td style="text-align:left;"> E00069287 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 357 </td> </tr> <tr> <td style="text-align:left;"> E00069284 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 332 </td> </tr> <tr> <td style="text-align:left;"> E00069099 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 325 </td> </tr> <tr> <td style="text-align:left;"> E00068850 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 321 </td> </tr> <tr> <td style="text-align:left;"> E00069040 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 320 </td> </tr> <tr> <td style="text-align:left;"> E00069106 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 316 </td> </tr> <tr> <td style="text-align:left;"> E00069029 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 307 </td> </tr> <tr> <td style="text-align:left;"> E00069047 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 303 </td> </tr> <tr> <td style="text-align:left;"> E00069108 </td> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 300 </td> </tr> </tbody> </table> ] --- ## dplyr::summarise <br/> .pull-left[ Calculates aggregated values - using functions such - `mean`, `median`, `sd` - `min`, `max`, `quantile` - `first`, `last`, `nth` - `n` (count) - `any`, `all` {{content}} ] .pull-right[ ```r leicester_2011OAC %>% select( OA11CD, supgrpname, Total_Population ) %>% summarise( # Average population avg_pop = mean(Total_Population), # Is there any OA with over 200 people? pop_over_600 = any(Total_Population > 600), # Total population in Leicester tot_pop = sum(Total_Population) ) %>% kable("html") %>% kable_styling(font_size = 16) ``` <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> avg_pop </th> <th style="text-align:left;"> pop_over_600 </th> <th style="text-align:right;"> tot_pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 340.3911 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 329839 </td> </tr> </tbody> </table> ] -- - both *British English* and *American English* spellings work, there is no difference - `summarise` - `summarize` --- ## dplyr::group_by Groups rows based on common values for specified column(s) - all subsequent operations are applied **to each group separately** - frequently combined with `summarise` to calculate aggregated values per group .pull-left[ ```r leicester_2011OAC %>% select( OA11CD, supgrpname, Total_Population ) %>% group_by(supgrpname) %>% summarise( # Average population avg_pop = mean(Total_Population), # Is there any OA with over 200 people? pop_over_200 = any(Total_Population > 600), # Total population in Leicester tot_pop = sum(Total_Population) ) %>% kable("html") %>% kable_styling(font_size = 16) ``` ] .pull-right[ .referencenote[ <br/> ] <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> supgrpname </th> <th style="text-align:right;"> avg_pop </th> <th style="text-align:left;"> pop_over_200 </th> <th style="text-align:right;"> tot_pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 257.3056 </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:right;"> 9263 </td> </tr> <tr> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 323.0482 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 26813 </td> </tr> <tr> <td style="text-align:left;"> Ethnicity Central </td> <td style="text-align:right;"> 335.7368 </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:right;"> 19137 </td> </tr> <tr> <td style="text-align:left;"> Hard-Pressed Living </td> <td style="text-align:right;"> 317.5149 </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:right;"> 32069 </td> </tr> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:right;"> 357.7784 </td> <td style="text-align:left;"> TRUE </td> <td style="text-align:right;"> 205007 </td> </tr> <tr> <td style="text-align:left;"> Suburbanites </td> <td style="text-align:right;"> 320.8519 </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:right;"> 17326 </td> </tr> <tr> <td style="text-align:left;"> Urbanites </td> <td style="text-align:right;"> 311.1385 </td> <td style="text-align:left;"> FALSE </td> <td style="text-align:right;"> 20224 </td> </tr> </tbody> </table> ] --- ## dplyr::group_by .pull-left[ ```r leicester_2011OAC %>% select( OA11CD, LSOA11CD, supgrpname, Total_Population ) %>% group_by(supgrpname, LSOA11CD) %>% summarise( # Average population avg_pop = mean(Total_Population) ) %>% # then slice the highest 3 slice_max(avg_pop, n = 3) %>% kable("html") %>% kable_styling(font_size = 16) ``` <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> supgrpname </th> <th style="text-align:left;"> LSOA11CD </th> <th style="text-align:right;"> avg_pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:left;"> E01013632 </td> <td style="text-align:right;"> 394.0000 </td> </tr> <tr> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:left;"> E01013723 </td> <td style="text-align:right;"> 332.0000 </td> </tr> <tr> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:left;"> E01013638 </td> <td style="text-align:right;"> 321.0000 </td> </tr> <tr> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:left;"> E01013648 </td> <td style="text-align:right;"> 549.5000 </td> </tr> <tr> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:left;"> E01013689 </td> <td style="text-align:right;"> 471.0000 </td> </tr> <tr> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:left;"> E01013647 </td> <td style="text-align:right;"> 465.0000 </td> </tr> <tr> <td style="text-align:left;"> Ethnicity Central </td> <td style="text-align:left;"> E01013754 </td> <td style="text-align:right;"> 514.7500 </td> </tr> <tr> <td style="text-align:left;"> Ethnicity Central </td> <td style="text-align:left;"> E01013760 </td> <td style="text-align:right;"> 494.0000 </td> </tr> <tr> <td style="text-align:left;"> Ethnicity Central </td> <td style="text-align:left;"> E01013755 </td> <td style="text-align:right;"> 443.2000 </td> </tr> <tr> <td style="text-align:left;"> Hard-Pressed Living </td> <td style="text-align:left;"> E01013640 </td> <td style="text-align:right;"> 392.5000 </td> </tr> <tr> <td style="text-align:left;"> Hard-Pressed Living </td> <td style="text-align:left;"> E01013631 </td> <td style="text-align:right;"> 382.6667 </td> </tr> <tr> <td style="text-align:left;"> Hard-Pressed Living </td> <td style="text-align:left;"> E01013637 </td> <td style="text-align:right;"> 382.3333 </td> </tr> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:left;"> E01013686 </td> <td style="text-align:right;"> 571.0000 </td> </tr> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:left;"> E01013748 </td> <td style="text-align:right;"> 533.5000 </td> </tr> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:left;"> E01013733 </td> <td style="text-align:right;"> 498.0000 </td> </tr> <tr> <td style="text-align:left;"> Suburbanites </td> <td style="text-align:left;"> E01013699 </td> <td style="text-align:right;"> 397.0000 </td> </tr> <tr> <td style="text-align:left;"> Suburbanites </td> <td style="text-align:left;"> E01013696 </td> <td style="text-align:right;"> 395.0000 </td> </tr> <tr> <td style="text-align:left;"> Suburbanites </td> <td style="text-align:left;"> E01013678 </td> <td style="text-align:right;"> 383.0000 </td> </tr> <tr> <td style="text-align:left;"> Urbanites </td> <td style="text-align:left;"> E01013785 </td> <td style="text-align:right;"> 401.0000 </td> </tr> <tr> <td style="text-align:left;"> Urbanites </td> <td style="text-align:left;"> E01013736 </td> <td style="text-align:right;"> 385.5000 </td> </tr> <tr> <td style="text-align:left;"> Urbanites </td> <td style="text-align:left;"> E01013733 </td> <td style="text-align:right;"> 375.6667 </td> </tr> </tbody> </table> ] .pull-right[ {{content}} ] -- ```r leicester_2011OAC %>% select( OA11CD, LSOA11CD, supgrpname, Total_Population ) %>% group_by(supgrpname, LSOA11CD) %>% summarise( # Average population avg_pop = mean(Total_Population) ) %>% ungroup() %>% # then slice the highest 3 slice_max(avg_pop, n = 3) %>% kable("html") %>% kable_styling(font_size = 16) ``` <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> supgrpname </th> <th style="text-align:left;"> LSOA11CD </th> <th style="text-align:right;"> avg_pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:left;"> E01013686 </td> <td style="text-align:right;"> 571.0 </td> </tr> <tr> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:left;"> E01013648 </td> <td style="text-align:right;"> 549.5 </td> </tr> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:left;"> E01013748 </td> <td style="text-align:right;"> 533.5 </td> </tr> </tbody> </table> --- ## dplyr::mutate Calculates new columns based on values in current columns .pull-left[ ```r leicester_2011OAC %>% select(supgrpname, Total_Population) %>% group_by(supgrpname) %>% summarise(tot_pop = sum(Total_Population)) %>% kable("html") %>% kable_styling(font_size = 16) ``` <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> supgrpname </th> <th style="text-align:right;"> tot_pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 9263 </td> </tr> <tr> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 26813 </td> </tr> <tr> <td style="text-align:left;"> Ethnicity Central </td> <td style="text-align:right;"> 19137 </td> </tr> <tr> <td style="text-align:left;"> Hard-Pressed Living </td> <td style="text-align:right;"> 32069 </td> </tr> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:right;"> 205007 </td> </tr> <tr> <td style="text-align:left;"> Suburbanites </td> <td style="text-align:right;"> 17326 </td> </tr> <tr> <td style="text-align:left;"> Urbanites </td> <td style="text-align:right;"> 20224 </td> </tr> </tbody> </table> ] .pull-right[ ```r leicester_2011OAC %>% select(supgrpname, Total_Population) %>% group_by(supgrpname) %>% summarise(tot_pop = sum(Total_Population)) %>% mutate( perc_pop = (tot_pop / 329839) * 100 ) %>% kable("html") %>% kable_styling(font_size = 16) ``` <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> supgrpname </th> <th style="text-align:right;"> tot_pop </th> <th style="text-align:right;"> perc_pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 9263 </td> <td style="text-align:right;"> 2.808340 </td> </tr> <tr> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 26813 </td> <td style="text-align:right;"> 8.129117 </td> </tr> <tr> <td style="text-align:left;"> Ethnicity Central </td> <td style="text-align:right;"> 19137 </td> <td style="text-align:right;"> 5.801921 </td> </tr> <tr> <td style="text-align:left;"> Hard-Pressed Living </td> <td style="text-align:right;"> 32069 </td> <td style="text-align:right;"> 9.722622 </td> </tr> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:right;"> 205007 </td> <td style="text-align:right;"> 62.153657 </td> </tr> <tr> <td style="text-align:left;"> Suburbanites </td> <td style="text-align:right;"> 17326 </td> <td style="text-align:right;"> 5.252866 </td> </tr> <tr> <td style="text-align:left;"> Urbanites </td> <td style="text-align:right;"> 20224 </td> <td style="text-align:right;"> 6.131476 </td> </tr> </tbody> </table> ] --- ## Saving the output .pull-left[ <br/> It is **important** to note that - operations in the previous slides **do not** modify the input data! <br/> To save the newly crated information, use - the assignment operator `<-` - to save to new (or same) variable - the assignment pipe `%<>%` (`magrittr`) - to update the variable ```r library(magrittr) ``` ] .pull-right[ {{content}} ] -- ```r pop_per_supgr <- leicester_2011OAC %>% select(supgrpname, Total_Population) %>% group_by(supgrpname) %>% summarise(tot_pop = sum(Total_Population)) pop_per_supgr %<>% mutate(perc_pop = (tot_pop / 329839) * 100) pop_per_supgr %>% # and print as usual kable("html") %>% kable_styling(font_size = 16) ``` <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> supgrpname </th> <th style="text-align:right;"> tot_pop </th> <th style="text-align:right;"> perc_pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 9263 </td> <td style="text-align:right;"> 2.808340 </td> </tr> <tr> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 26813 </td> <td style="text-align:right;"> 8.129117 </td> </tr> <tr> <td style="text-align:left;"> Ethnicity Central </td> <td style="text-align:right;"> 19137 </td> <td style="text-align:right;"> 5.801921 </td> </tr> <tr> <td style="text-align:left;"> Hard-Pressed Living </td> <td style="text-align:right;"> 32069 </td> <td style="text-align:right;"> 9.722622 </td> </tr> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:right;"> 205007 </td> <td style="text-align:right;"> 62.153657 </td> </tr> <tr> <td style="text-align:left;"> Suburbanites </td> <td style="text-align:right;"> 17326 </td> <td style="text-align:right;"> 5.252866 </td> </tr> <tr> <td style="text-align:left;"> Urbanites </td> <td style="text-align:right;"> 20224 </td> <td style="text-align:right;"> 6.131476 </td> </tr> </tbody> </table> --- ## Full pipe example .pull-left-large[ ```r # Let's start from the filename "2011_OAC_supgrp_Leicester.csv" %>% # as input to the read_csv function read_csv(col_types = "cccci") %>% # Select only the necessary columns select(supgrpname, Total_Population) %>% # Let's say we are not interested in # the Suburbanites supergroup filter(supgrpname != "Suburbanites") %>% # Group by supergroup group_by(supgrpname) %>% # Aggregate population summarise(tot_pop = sum(Total_Population)) %>% # Ungroup ungroup() %>% # Arrange by descending total population arrange(-tot_pop) %>% # Calculate percentage mutate(perc_pop = (tot_pop / 329839) * 100) %T>% # Then use the Tee pipe %T>% in the line above # to write the calculated values to file # and the pass the same input values to the kable function write_csv( "2011_Leicester_pop_per_OAC_supgrp_excl_suburb.csv" ) %>% # Print to screen nicely kable("html") %>% kable_styling(font_size = 16) ``` ] .pull-right-small[ <br/> This is a full example of a short analysis using only one series of pipes (including the [Tee pipe](https://magrittr.tidyverse.org/reference/tee.html)) to read, process and write data using `R` and almost all the `tidyverse` *verbs* seen so far. <br/> <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> supgrpname </th> <th style="text-align:right;"> tot_pop </th> <th style="text-align:right;"> perc_pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Multicultural Metropolitans </td> <td style="text-align:right;"> 205007 </td> <td style="text-align:right;"> 62.153657 </td> </tr> <tr> <td style="text-align:left;"> Hard-Pressed Living </td> <td style="text-align:right;"> 32069 </td> <td style="text-align:right;"> 9.722622 </td> </tr> <tr> <td style="text-align:left;"> Cosmopolitans </td> <td style="text-align:right;"> 26813 </td> <td style="text-align:right;"> 8.129117 </td> </tr> <tr> <td style="text-align:left;"> Urbanites </td> <td style="text-align:right;"> 20224 </td> <td style="text-align:right;"> 6.131476 </td> </tr> <tr> <td style="text-align:left;"> Ethnicity Central </td> <td style="text-align:right;"> 19137 </td> <td style="text-align:right;"> 5.801921 </td> </tr> <tr> <td style="text-align:left;"> Constrained City Dwellers </td> <td style="text-align:right;"> 9263 </td> <td style="text-align:right;"> 2.808340 </td> </tr> </tbody> </table> ] --- ## Summary <br/> .pull-left[ **Today**: Data manipulation - Complex data types - Into the Tidyverse - `dplyr` **Next week**: Table operations - Long and wide table formats - Pivot operations *(not as in Excel)* - Join operations <br/> .referencenote[ Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com). ] ] .pull-right[ ![](data:image/png;base64,#https://raw.githubusercontent.com/tidyverse/tidyr/main/man/figures/logo.png) .right[ .referencenote[ by tidyr authors<br/> via [tidyr GitHub repository](https://github.com/tidyverse/tidyr/), MIT License ] ] ]