+ - 0:00:00
Notes for current slide
Notes for next slide

Lecture 104

Dr Stefano De Sabbata
School of Geography, Geology, and the Env., University of Leicester
github.com/sdesabbata/r-for-geographic-data-science
s.desabbata@le.ac.uk | @maps4thought
text licensed under CC BY-SA 4.0, code licensed under GNU GPL v3.0

1 / 29

Table operations

2 / 29

Recap


Previously: Data manipulation

  • Data input and output
  • Mutate and summarise data
  • Working with R projects (practical)

Today: Table operations

  • Long and wide table formats
  • Pivot operations (not as in Excel)
  • Join operations

by tidyr authors
via tidyr GitHub repository, MIT License

3 / 29

Wide and long tables

CONTENT WARNING:
Some of the examples used in these slides discuss issues that some people might find distressing: disease.

4 / 29

Wide data

Each real-world entity is represented by one single row

  • its attributes are represented through different columns
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5 / 29

Long data



Each real-world entity is represented by multiple rows

  • each one reporting only one of its attributes
  • one column indicates which attribute each row represent
  • another column is used to report the value

Common approach for temporal series

Example: Covid-19 7-day case rates per 100,000 people

city week_ending cases
Derby 2021-10-25 NA
Leicester 2021-10-25 425.9
Nottingham 2021-10-25 229.0
Derby 2021-11-01 359.0
Leicester 2021-11-01 337.0
Nottingham 2021-11-01 NA

Derived from data available from Coronavirus (COVID-19) in the UK under the Open Government Licence v3.0.

6 / 29

Wide and long data


One can be transforme into the other

city cases_2021_10_25 cases_2021_11_01
Derby NA 359
Leicester 425.9 337
Nottingham 229.0 NA


  • Long data can be more flexible
    • new attributes add new rows where necessary
  • Wide data require more structure
    • new attributes need new column for all entities
7 / 29

tidyr

The tidyr (pronounced tidy-er) library is part of tidyverse

Provides a series of functions to "tidy-up" your data, including

city_info_long <-
data.frame(
city = c(
"Derby", "Leicester" , "Nottingham",
"Derby", "Leicester", "Nottingham"
),
week_ending = c(
"2021-10-25", "2021-10-25", "2021-10-25",
"2021-11-01", "2021-11-01", "2021-11-01"
),
cases = c(
NA, 425.9, 229,
359, 337, NA
)
) %>%
tibble::as_tibble()
city week_ending cases
Derby 2021-10-25 NA
Leicester 2021-10-25 425.9
Nottingham 2021-10-25 229.0
Derby 2021-11-01 359.0
Leicester 2021-11-01 337.0
Nottingham 2021-11-01 NA
8 / 29

tidyr::pivot_wider


Re-shape from long to wide format

city_info_wide <-
city_info_long %>%
pivot_wider(
# Column from which to extract
# new column names
names_from = week_ending,
# Column from which to extract
# values
values_from = cases
)

From:

city week_ending cases
Derby 2021-10-25 NA
Leicester 2021-10-25 425.9
Nottingham 2021-10-25 229.0
Derby 2021-11-01 359.0
Leicester 2021-11-01 337.0
Nottingham 2021-11-01 NA

To:

city 2021-10-25 2021-11-01
Derby NA 359
Leicester 425.9 337
Nottingham 229.0 NA
9 / 29

tidyr::pivot_wider

It might be useful (or indeed necessary) to format the values that will become the names of the new columns

city_info_wide <-
city_info_long %>%
mutate(
# Change "-" to "_"
week_ending = str_replace_all(
week_ending, "-", "_"
)
) %>%
pivot_wider(
names_from = week_ending,
values_from = cases, # As before
names_prefix = "cases_" # Add a prefix
)
city cases_2021_10_25 cases_2021_11_01
Derby NA 359
Leicester 425.9 337
Nottingham 229.0 NA
10 / 29

tidyr::pivot_longer


Re-shape from wide to long format

city_info_back_to_long <-
city_info_wide %>%
pivot_longer(
# Pivot all columns, excluding city
cols = -city,
# Name column for column names
names_to = "week_ending",
# Name column for values
values_to = "cases"
)

From:

city cases_2021_10_25 cases_2021_11_01
Derby NA 359
Leicester 425.9 337
Nottingham 229.0 NA

To:

city week_ending cases
Derby cases_2021_10_25 NA
Derby cases_2021_11_01 359.0
Leicester cases_2021_10_25 425.9
Leicester cases_2021_11_01 337.0
Nottingham cases_2021_10_25 229.0
Nottingham cases_2021_11_01 NA
11 / 29

tidyr::pivot_longer

It might be useful (or indeed necessary) to format the values extracted from the column names

city_info_back_to_long <-
city_info_wide %>%
pivot_longer(
# As before
cols = -city,
names_to = "week_ending",
values_to = "cases",
# Remove name prefix
names_prefix = "cases_",
# Transform the values that will
# become column names
names_transform = list(
# Provide a function name or define one
week_ending = function (x) {
str_replace_all(x, "_", "-")
}
)
)

... which brings us back exactly where we started.

city week_ending cases
Derby 2021-10-25 NA
Derby 2021-11-01 359.0
Leicester 2021-10-25 425.9
Leicester 2021-11-01 337.0
Nottingham 2021-10-25 229.0
Nottingham 2021-11-01 NA
12 / 29

Working with incomplete tables

13 / 29

tidyr


The tidyr (pronounced tidy-er) library is part of tidyverse

Provides a series of functions to "tidy-up" your data, including

14 / 29

tidyr::replace_na

If the data allow for a baseline value, missing values can be replaced

city_info_long %>%
replace_na(
# List of columns <-> values to replace NA
list(cases = 0)
)
city week_ending cases
Derby 2021-10-25 0.0
Leicester 2021-10-25 425.9
Nottingham 2021-10-25 229.0
Derby 2021-11-01 359.0
Leicester 2021-11-01 337.0
Nottingham 2021-11-01 0.0
15 / 29

tidyr::fill

Sometimes it can make sense to fill missing values using "nearby" values, but caution, order and grouping matter!

city_info_long %>%
group_by(city) %>%
arrange(week_ending) %>%
# Columns to fill
fill(cases)
city week_ending cases
Derby 2021-10-25 NA
Leicester 2021-10-25 425.9
Nottingham 2021-10-25 229.0
Derby 2021-11-01 359.0
Leicester 2021-11-01 337.0
Nottingham 2021-11-01 229.0
16 / 29

tidyr::drop_na


In other cases, it might be simpler or safer to just remove all the rows with missing data

city_info_long_noNAs <-
city_info_long %>%
# Columns to drop where NA
drop_na(cases)
city week_ending cases
Leicester 2021-10-25 425.9
Nottingham 2021-10-25 229.0
Derby 2021-11-01 359.0
Leicester 2021-11-01 337.0
17 / 29

tidyr::complete


Some analysis or visualisation procedures might require a complete table

  • where missing values are represented as NAs
  • for instance, when creating a map
    • you might want to use a specific colour for missing values
    • rather than a missing polygon

Complete table by turning implicit missing values into explicit missing values

city_info_long_noNAs %>%
# Complete table with all week_ending
# and city combinations making missing
# values for remaining columns explicit
complete(week_ending, city)
city week_ending cases
Derby 2021-10-25 NA
Derby 2021-11-01 359.0
Leicester 2021-10-25 425.9
Leicester 2021-11-01 337.0
Nottingham 2021-10-25 229.0
Nottingham 2021-11-01 NA
18 / 29

Joining tables

19 / 29

Example


cities <-
data.frame(
city_name = c(
"Barcelona", "London", "Rome", "Los Angeles"
),
country_name = c(
"Spain", "UK", "Italy", "US"
),
city_pop_M = c(
1.62, 8.98, 4.34, 3.99
)
)
cities_area <-
data.frame(
city_name = c(
"Barcelona", "London", "Rome", "Munich"
),
city_area_km2 = c(
101, 1572, 496, 310
)
)
city_name country_name city_pop_M
Barcelona Spain 1.62
London UK 8.98
Rome Italy 4.34
Los Angeles US 3.99


city_name city_area_km2
Barcelona 101
London 1572
Rome 496
Munich 310
20 / 29

Joining data

Tables can be joined (or 'merged')

  • information from two tables can be combined
  • specifying column(s) from two tables with common values
    • usually one with a unique identifier of an entity
  • rows having the same value are joined
  • depending on parameters
    • a row from one table can be merged with multiple rows from the other table
    • rows with no matching values in the other table can be retained
  • merge base function or join functions in dplyr


21 / 29

dplyr joins

dplyr provides a series of join verbs

  • Mutating joins
    • inner_join: inner join
    • left_join: left join
    • right_join: right join
    • full_join: full join
  • Nesting joins
    • nest_join: all rows columns from left table, plus a column of tibbles with matching from right
  • Filtering joins (keep only columns from left)
    • semi_join: , rows from left where match with right
    • anti_join: rows from left where no match with right


22 / 29

full_join

Input tables

city_name country_name city_pop_M
Barcelona Spain 1.62
London UK 8.98
Rome Italy 4.34
Los Angeles US 3.99


city_name city_area_km2
Barcelona 101
London 1572
Rome 496
Munich 310

full_join combines all the available data

full_join(
# first argument, left table
# second argument, right table
cities, cities_area,
# specify which column to be matched
by = c("city_name" = "city_name")
)
city_name country_name city_pop_M city_area_km2
Barcelona Spain 1.62 101
London UK 8.98 1572
Rome Italy 4.34 496
Los Angeles US 3.99 NA
Munich NA NA 310
23 / 29

Pipes and shorthands

Input tables

city_name country_name city_pop_M
Barcelona Spain 1.62
London UK 8.98
Rome Italy 4.34
Los Angeles US 3.99


city_name city_area_km2
Barcelona 101
London 1572
Rome 496
Munich 310

When using (any) join verbs in dplyr

# using pipe, left table is "coming down the pipe"
cities %>%
full_join(
cities_area,
by = c("city_name" = "city_name")
)
# if no columns specified, columns with the same name are matched
cities %>%
full_join(cities_area)
city_name country_name city_pop_M city_area_km2
Barcelona Spain 1.62 101
London UK 8.98 1572
Rome Italy 4.34 496
Los Angeles US 3.99 NA
Munich NA NA 310
24 / 29

left_join

Input tables

city_name country_name city_pop_M
Barcelona Spain 1.62
London UK 8.98
Rome Italy 4.34
Los Angeles US 3.99


city_name city_area_km2
Barcelona 101
London 1572
Rome 496
Munich 310

left_join

  • keeps all the data from the left table
    • first argument or "coming down the pipe"
  • rows from the right table without a match are dropped
    • second argument (or first when using pipes)
cities %>%
left_join(cities_area)
city_name country_name city_pop_M city_area_km2
Barcelona Spain 1.62 101
London UK 8.98 1572
Rome Italy 4.34 496
Los Angeles US 3.99 NA
25 / 29

right_join

Input tables

city_name country_name city_pop_M
Barcelona Spain 1.62
London UK 8.98
Rome Italy 4.34
Los Angeles US 3.99


city_name city_area_km2
Barcelona 101
London 1572
Rome 496
Munich 310

right_join

  • keeps all the data from the right table
    • second argument (or first when using pipes)
  • rows from the left table without a match are dropped
    • first argument or "coming down the pipe"
cities %>%
right_join(cities_area)
city_name country_name city_pop_M city_area_km2
Barcelona Spain 1.62 101
London UK 8.98 1572
Rome Italy 4.34 496
Munich NA NA 310
26 / 29

inner_join

Input tables

city_name country_name city_pop_M
Barcelona Spain 1.62
London UK 8.98
Rome Italy 4.34
Los Angeles US 3.99


city_name city_area_km2
Barcelona 101
London 1572
Rome 496
Munich 310

inner_join

  • keeps only rows that have a match in both tables
  • rows without a match either way are dropped
cities %>%
inner_join(cities_area)
city_name country_name city_pop_M city_area_km2
Barcelona Spain 1.62 101
London UK 8.98 1572
Rome Italy 4.34 496
27 / 29

semi_join and anti_join

Input tables

city_name country_name city_pop_M
Barcelona Spain 1.62
London UK 8.98
Rome Italy 4.34
Los Angeles US 3.99


city_name city_area_km2
Barcelona 101
London 1572
Rome 496
Munich 310

semi_join as inner join but keeps only left columns

cities %>% semi_join(cities_area)
city_name country_name city_pop_M
Barcelona Spain 1.62
London UK 8.98
Rome Italy 4.34


anti_join keeps un-matched rows from left table

cities %>% anti_join(cities_area)
city_name country_name city_pop_M
Los Angeles US 3.99
28 / 29

Summary


Today: Table operations

  • Long and wide table formats
  • Pivot operations (not as in Excel)
  • Join operations

Next week: Exploratory visualisation

  • Grammar of graphics
  • Visualising amounts and proportions
  • Visualising variable distributions and relationships


Slides created via the R package xaringan. The chakra comes from remark.js, knitr, and R Markdown.

by ggplot2 authors
via ggplot2 GitHub repository, MIT License

29 / 29

Table operations

2 / 29
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow