Previously: Data manipulation
Today: Table operations
by tidyr authors
via tidyr GitHub repository, MIT License
CONTENT WARNING:
Some of the examples used in these slides discuss issues that some people might find distressing: disease.
Each real-world entity is represented by one single row
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
4.6 | 3.4 | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | setosa |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.9 | 3.1 | 1.5 | 0.1 | setosa |
Each real-world entity is represented by multiple rows
Common approach for temporal series
Example: Covid-19 7-day case rates per 100,000 people
city | week_ending | cases |
---|---|---|
Derby | 2021-10-25 | NA |
Leicester | 2021-10-25 | 425.9 |
Nottingham | 2021-10-25 | 229.0 |
Derby | 2021-11-01 | 359.0 |
Leicester | 2021-11-01 | 337.0 |
Nottingham | 2021-11-01 | NA |
Derived from data available from Coronavirus (COVID-19) in the UK under the Open Government Licence v3.0.
One can be transforme into the other
city | cases_2021_10_25 | cases_2021_11_01 |
---|---|---|
Derby | NA | 359 |
Leicester | 425.9 | 337 |
Nottingham | 229.0 | NA |
The tidyr
(pronounced tidy-er) library is part of tidyverse
Provides a series of functions to "tidy-up" your data, including
tidyr::pivot_wider
: pivot from long to widetidyr::pivot_longer
: pivot from wide to longtidyr::drop_na
: remove rows with missing datatidyr::replace_na
: replace missing datatidyr::fill
: fill missing datatidyr::complete
: add missing value combinationscity_info_long <- data.frame( city = c( "Derby", "Leicester" , "Nottingham", "Derby", "Leicester", "Nottingham" ), week_ending = c( "2021-10-25", "2021-10-25", "2021-10-25", "2021-11-01", "2021-11-01", "2021-11-01" ), cases = c( NA, 425.9, 229, 359, 337, NA ) ) %>% tibble::as_tibble()
city | week_ending | cases |
---|---|---|
Derby | 2021-10-25 | NA |
Leicester | 2021-10-25 | 425.9 |
Nottingham | 2021-10-25 | 229.0 |
Derby | 2021-11-01 | 359.0 |
Leicester | 2021-11-01 | 337.0 |
Nottingham | 2021-11-01 | NA |
Re-shape from long to wide format
city_info_wide <- city_info_long %>% pivot_wider( # Column from which to extract # new column names names_from = week_ending, # Column from which to extract # values values_from = cases )
From:
city | week_ending | cases |
---|---|---|
Derby | 2021-10-25 | NA |
Leicester | 2021-10-25 | 425.9 |
Nottingham | 2021-10-25 | 229.0 |
Derby | 2021-11-01 | 359.0 |
Leicester | 2021-11-01 | 337.0 |
Nottingham | 2021-11-01 | NA |
To:
city | 2021-10-25 | 2021-11-01 |
---|---|---|
Derby | NA | 359 |
Leicester | 425.9 | 337 |
Nottingham | 229.0 | NA |
It might be useful (or indeed necessary) to format the values that will become the names of the new columns
city_info_wide <- city_info_long %>% mutate( # Change "-" to "_" week_ending = str_replace_all( week_ending, "-", "_" ) ) %>% pivot_wider( names_from = week_ending, values_from = cases, # As before names_prefix = "cases_" # Add a prefix )
city | cases_2021_10_25 | cases_2021_11_01 |
---|---|---|
Derby | NA | 359 |
Leicester | 425.9 | 337 |
Nottingham | 229.0 | NA |
Re-shape from wide to long format
city_info_back_to_long <- city_info_wide %>% pivot_longer( # Pivot all columns, excluding city cols = -city, # Name column for column names names_to = "week_ending", # Name column for values values_to = "cases" )
From:
city | cases_2021_10_25 | cases_2021_11_01 |
---|---|---|
Derby | NA | 359 |
Leicester | 425.9 | 337 |
Nottingham | 229.0 | NA |
To:
city | week_ending | cases |
---|---|---|
Derby | cases_2021_10_25 | NA |
Derby | cases_2021_11_01 | 359.0 |
Leicester | cases_2021_10_25 | 425.9 |
Leicester | cases_2021_11_01 | 337.0 |
Nottingham | cases_2021_10_25 | 229.0 |
Nottingham | cases_2021_11_01 | NA |
It might be useful (or indeed necessary) to format the values extracted from the column names
city_info_back_to_long <- city_info_wide %>% pivot_longer( # As before cols = -city, names_to = "week_ending", values_to = "cases", # Remove name prefix names_prefix = "cases_", # Transform the values that will # become column names names_transform = list( # Provide a function name or define one week_ending = function (x) { str_replace_all(x, "_", "-") } ) )
... which brings us back exactly where we started.
city | week_ending | cases |
---|---|---|
Derby | 2021-10-25 | NA |
Derby | 2021-11-01 | 359.0 |
Leicester | 2021-10-25 | 425.9 |
Leicester | 2021-11-01 | 337.0 |
Nottingham | 2021-10-25 | 229.0 |
Nottingham | 2021-11-01 | NA |
The tidyr
(pronounced tidy-er) library is part of tidyverse
Provides a series of functions to "tidy-up" your data, including
tidyr::pivot_wider
: pivot from long to widetidyr::pivot_longer
: pivot from wide to longtidyr::drop_na
: remove rows with missing datatidyr::replace_na
: replace missing datatidyr::fill
: fill missing datatidyr::complete
: add missing value combinationsIf the data allow for a baseline value, missing values can be replaced
city_info_long %>% replace_na( # List of columns <-> values to replace NA list(cases = 0) )
city | week_ending | cases |
---|---|---|
Derby | 2021-10-25 | 0.0 |
Leicester | 2021-10-25 | 425.9 |
Nottingham | 2021-10-25 | 229.0 |
Derby | 2021-11-01 | 359.0 |
Leicester | 2021-11-01 | 337.0 |
Nottingham | 2021-11-01 | 0.0 |
Sometimes it can make sense to fill missing values using "nearby" values, but caution, order and grouping matter!
city_info_long %>% group_by(city) %>% arrange(week_ending) %>% # Columns to fill fill(cases)
city | week_ending | cases |
---|---|---|
Derby | 2021-10-25 | NA |
Leicester | 2021-10-25 | 425.9 |
Nottingham | 2021-10-25 | 229.0 |
Derby | 2021-11-01 | 359.0 |
Leicester | 2021-11-01 | 337.0 |
Nottingham | 2021-11-01 | 229.0 |
In other cases, it might be simpler or safer to just remove all the rows with missing data
city_info_long_noNAs <- city_info_long %>% # Columns to drop where NA drop_na(cases)
city | week_ending | cases |
---|---|---|
Leicester | 2021-10-25 | 425.9 |
Nottingham | 2021-10-25 | 229.0 |
Derby | 2021-11-01 | 359.0 |
Leicester | 2021-11-01 | 337.0 |
Some analysis or visualisation procedures might require a complete table
Complete table by turning implicit missing values into explicit missing values
city_info_long_noNAs %>% # Complete table with all week_ending # and city combinations making missing # values for remaining columns explicit complete(week_ending, city)
city | week_ending | cases |
---|---|---|
Derby | 2021-10-25 | NA |
Derby | 2021-11-01 | 359.0 |
Leicester | 2021-10-25 | 425.9 |
Leicester | 2021-11-01 | 337.0 |
Nottingham | 2021-10-25 | 229.0 |
Nottingham | 2021-11-01 | NA |
cities <- data.frame( city_name = c( "Barcelona", "London", "Rome", "Los Angeles" ), country_name = c( "Spain", "UK", "Italy", "US" ), city_pop_M = c( 1.62, 8.98, 4.34, 3.99 ) )cities_area <- data.frame( city_name = c( "Barcelona", "London", "Rome", "Munich" ), city_area_km2 = c( 101, 1572, 496, 310 ))
city_name | country_name | city_pop_M |
---|---|---|
Barcelona | Spain | 1.62 |
London | UK | 8.98 |
Rome | Italy | 4.34 |
Los Angeles | US | 3.99 |
city_name | city_area_km2 |
---|---|
Barcelona | 101 |
London | 1572 |
Rome | 496 |
Munich | 310 |
Tables can be joined (or 'merged')
merge
base function or join functions in dplyr
dplyr
provides a series of join verbs
inner_join
: inner joinleft_join
: left joinright_join
: right joinfull_join
: full joinnest_join
: all rows columns from left table, plus a column of tibbles with matching from rightsemi_join
: , rows from left where match with rightanti_join
: rows from left where no match with rightInput tables
city_name | country_name | city_pop_M |
---|---|---|
Barcelona | Spain | 1.62 |
London | UK | 8.98 |
Rome | Italy | 4.34 |
Los Angeles | US | 3.99 |
city_name | city_area_km2 |
---|---|
Barcelona | 101 |
London | 1572 |
Rome | 496 |
Munich | 310 |
full_join
combines all the available data
full_join( # first argument, left table # second argument, right table cities, cities_area, # specify which column to be matched by = c("city_name" = "city_name") )
city_name | country_name | city_pop_M | city_area_km2 |
---|---|---|---|
Barcelona | Spain | 1.62 | 101 |
London | UK | 8.98 | 1572 |
Rome | Italy | 4.34 | 496 |
Los Angeles | US | 3.99 | NA |
Munich | NA | NA | 310 |
Input tables
city_name | country_name | city_pop_M |
---|---|---|
Barcelona | Spain | 1.62 |
London | UK | 8.98 |
Rome | Italy | 4.34 |
Los Angeles | US | 3.99 |
city_name | city_area_km2 |
---|---|
Barcelona | 101 |
London | 1572 |
Rome | 496 |
Munich | 310 |
When using (any) join verbs in dplyr
# using pipe, left table is "coming down the pipe"cities %>% full_join( cities_area, by = c("city_name" = "city_name") )# if no columns specified, columns with the same name are matchedcities %>% full_join(cities_area)
city_name | country_name | city_pop_M | city_area_km2 |
---|---|---|---|
Barcelona | Spain | 1.62 | 101 |
London | UK | 8.98 | 1572 |
Rome | Italy | 4.34 | 496 |
Los Angeles | US | 3.99 | NA |
Munich | NA | NA | 310 |
Input tables
city_name | country_name | city_pop_M |
---|---|---|
Barcelona | Spain | 1.62 |
London | UK | 8.98 |
Rome | Italy | 4.34 |
Los Angeles | US | 3.99 |
city_name | city_area_km2 |
---|---|
Barcelona | 101 |
London | 1572 |
Rome | 496 |
Munich | 310 |
left_join
cities %>% left_join(cities_area)
city_name | country_name | city_pop_M | city_area_km2 |
---|---|---|---|
Barcelona | Spain | 1.62 | 101 |
London | UK | 8.98 | 1572 |
Rome | Italy | 4.34 | 496 |
Los Angeles | US | 3.99 | NA |
Input tables
city_name | country_name | city_pop_M |
---|---|---|
Barcelona | Spain | 1.62 |
London | UK | 8.98 |
Rome | Italy | 4.34 |
Los Angeles | US | 3.99 |
city_name | city_area_km2 |
---|---|
Barcelona | 101 |
London | 1572 |
Rome | 496 |
Munich | 310 |
right_join
cities %>% right_join(cities_area)
city_name | country_name | city_pop_M | city_area_km2 |
---|---|---|---|
Barcelona | Spain | 1.62 | 101 |
London | UK | 8.98 | 1572 |
Rome | Italy | 4.34 | 496 |
Munich | NA | NA | 310 |
Input tables
city_name | country_name | city_pop_M |
---|---|---|
Barcelona | Spain | 1.62 |
London | UK | 8.98 |
Rome | Italy | 4.34 |
Los Angeles | US | 3.99 |
city_name | city_area_km2 |
---|---|
Barcelona | 101 |
London | 1572 |
Rome | 496 |
Munich | 310 |
inner_join
cities %>% inner_join(cities_area)
city_name | country_name | city_pop_M | city_area_km2 |
---|---|---|---|
Barcelona | Spain | 1.62 | 101 |
London | UK | 8.98 | 1572 |
Rome | Italy | 4.34 | 496 |
Input tables
city_name | country_name | city_pop_M |
---|---|---|
Barcelona | Spain | 1.62 |
London | UK | 8.98 |
Rome | Italy | 4.34 |
Los Angeles | US | 3.99 |
city_name | city_area_km2 |
---|---|
Barcelona | 101 |
London | 1572 |
Rome | 496 |
Munich | 310 |
semi_join
as inner join but keeps only left columns
cities %>% semi_join(cities_area)
city_name | country_name | city_pop_M |
---|---|---|
Barcelona | Spain | 1.62 |
London | UK | 8.98 |
Rome | Italy | 4.34 |
anti_join
keeps un-matched rows from left table
cities %>% anti_join(cities_area)
city_name | country_name | city_pop_M |
---|---|---|
Los Angeles | US | 3.99 |
Today: Table operations
Next week: Exploratory visualisation
Slides created via the R package xaringan. The chakra comes from remark.js, knitr, and R Markdown.
by ggplot2 authors
via ggplot2 GitHub repository, MIT License
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |