4 Table operations

The first two sections illustrate the pivot and join functionalities of the Tidyverse libraries using simple examples. The last section instead presents a more complex example, loading and wrangling with data related to the 2011 Output Area Classification and the Indexes of Multiple Deprivation 2015.

library(tidyverse)
library(knitr)

4.1 Long and wide table formats

Tabular data are usually presented in two different formats.

• Wide: this is the most common approach, where each real-world entity (e.g. a city) is represented by one single row and its attributes are represented through different columns (e.g., a column representing the total population in the area, another column representing the size of the area, etc.).
City Population Area Density
Leicester 329,839 73.3 4,500
Nottingham 321,500 74.6 4,412
• Long: this is probably a less common approach, but still necessary in many cases, where each real-world entity (e.g. a city) is represented by multiple rows, each one reporting only one of its attributes. In this case, one column is used to indicate which attribute each row represent, and another column is used to report the value.
City Attribute Value
Leicester Population 329,839
Leicester Area 73.3
Leicester Density 4,500
Nottingham Population 321,500
Nottingham Area 74.6
Nottingham Density 4,412

The tidyr library provides two functions that allow transforming wide-formatted data to a long format, and vice-versa. Please take your time to understand the example below and check out the tidyr help pages before continuing.

city_info_wide <- data.frame(
city = c("Leicester", "Nottingham"),
population = c(329839, 321500),
area = c(73.3, 74.6),
density = c(4500, 4412)
) %>%
as_tibble()

city_info_wide %>%
kable()
city population area density
Leicester 329839 73.3 4500
Nottingham 321500 74.6 4412
city_info_long <- city_info_wide %>%
pivot_longer(
# exclude IDs (city names) from the pivoted columns
cols = -city,
# name for the new column containing
# the names of the old columns
names_to = "attribute",
# name for the new column containing
# the values included under the old columns
values_to = "value"
)

city_info_long %>%
kable()
city attribute value
Leicester population 329839.0
Leicester area 73.3
Leicester density 4500.0
Nottingham population 321500.0
Nottingham area 74.6
Nottingham density 4412.0
city_info_back_to_wide <- city_info_long %>%
pivot_wider(
# column containing the attribute names
names_from = attribute,
# column containing the values
values_from = value
)

city_info_back_to_wide %>%
kable()
city population area density
Leicester 329839 73.3 4500
Nottingham 321500 74.6 4412

4.2 Joining tables

A join operation combines two tables into one by matching rows that have the same values in the specified column. This operation is usually executed on columns containing identifiers, which are matched through different tables containing different data about the same real-world entities. For instance, the table below presents the telephone prefixes for two cities. That information can be combined with the data present in the wide-formatted table above through a join operation on the columns containing the city names. As the two tables do not contain all the same cities, if a full join operation is executed, some cells have no values assigned.

city telephone_prefix
Leicester 0116
Birmingham 0121
city population area density telephone_prefix
Leicester 329,839 73.3 4,500 0116
Nottingham 321,500 74.6 4,412
Birmingham 0121

As discussed in the lecture, the dplyr library offers different types of join operations, which correspond to the different SQL joins illustrated in the image below. The use and implications of these different types of joins will be discussed in more detail in the GY7708 module next semester.

Please take your time to understand the example below and check out the related dplyr help pages before continuing. The first four examples execute the exact same full join operation using three different syntaxes: with or without using the pipe operator, and specifying the by argument or not. Note that all those approaches to writing the join are valid and produce the same result. The choice about which approach to use will depend on the code you are writing. In particular, you might find it useful to use the syntax that uses the pipe operator when the join operation is itself only one stem in a series of data manipulation steps. Using the by argument is usually advisable unless you are certain that you aim to join two tables with all and exactly the column that have the same names in the two table.

Note how the result of the join operations is not saved to a variable. The function kable is added after each join operation through a pipe %>% to display the resulting table in a nice format.

city_telephone_prexix <- data.frame(
city = c("Leicester", "Birmingham"),
telephon_prefix = c("0116", "0121")
) %>%
as_tibble()

city_telephone_prexix %>%
kable()
city telephon_prefix
Leicester 0116
Birmingham 0121
# Option 1: without using the pipe operator

# full join verb
full_join(
# left table
city_info_wide,
# right table
city_telephone_prexix,
# columns to match
by = c("city" = "city")
) %>%
kable()
city population area density telephon_prefix
Leicester 329839 73.3 4500 0116
Nottingham 321500 74.6 4412 NA
Birmingham NA NA NA 0121
# Option 2: without using the pipe operator
#   and without using the argument "by"
#   as columns have the same name
#   in the two tables.
# Same result as Option 1

# full join verb
full_join(
# left table
city_info_wide,
# right table
city_telephone_prexix
) %>%
kable()
city population area density telephon_prefix
Leicester 329839 73.3 4500 0116
Nottingham 321500 74.6 4412 NA
Birmingham NA NA NA 0121
# Option 3: using the pipe operator
#   and without using the argument "by"
#   as columns have the same name
#   in the two tables.
# Same result as Option 1 and 2

# left table
city_info_wide %>%
# full join verb
full_join(
# right table
city_telephone_prexix
) %>%
kable()
city population area density telephon_prefix
Leicester 329839 73.3 4500 0116
Nottingham 321500 74.6 4412 NA
Birmingham NA NA NA 0121
# Option 4: using the pipe operator
#   and using the argument "by".
# Same result as Option 1, 2 and 3

# left table
city_info_wide %>%
# full join verb
full_join(
# right table
city_telephone_prexix,
# columns to match
by = c("city" = "city")
) %>%
kable()
city population area density telephon_prefix
Leicester 329839 73.3 4500 0116
Nottingham 321500 74.6 4412 NA
Birmingham NA NA NA 0121
# Left join
# Using syntax similar to Option 1 above

# left join
left_join(
# left table
city_info_wide,
# right table
city_telephone_prexix,
# columns to match
by = c("city" = "city")
) %>%
kable()
city population area density telephon_prefix
Leicester 329839 73.3 4500 0116
Nottingham 321500 74.6 4412 NA
# Right join
# Using syntax similar to Option 2 above

# right join verb
right_join(
# left table
city_info_wide,
# right table
city_telephone_prexix
) %>%
kable()
city population area density telephon_prefix
Leicester 329839 73.3 4500 0116
Birmingham NA NA NA 0121
# Inner join
# Using syntax similar to Option 3 above

# left table
city_info_wide %>%
# inner join
inner_join(
# right table
city_telephone_prexix
) %>%
kable()
city population area density telephon_prefix
Leicester 329839 73.3 4500 0116

4.3 Data: Indices of Multiple Deprivation

Open the Leicester_population project created for the previous chapter. Create a new RMarkdown document using “Exploring deprivation indices in Leicester” as the title and PDF as the output file type. Delete the example code after the setup chunk. Add a new markdown second-heading section named Libraries and a chunk loading the tidyverse and knitr libraries (see below). Save the file with the name exploring_Leicester_deprivation.Rmd in the Leicester_population project.

Download from Blackboard (or the data folder of the repository) the file IndexesMultipleDeprivation2015_Leicester.csv and upload the file to the Leicester_population folder by clicking on the Upload button and selecting the files from your computer.

The Indices of Multiple Deprivation 2015 (see map at cdrc.ac.uk) are based on a series of variables across seven distinct domains of deprivation which are combined to calculate the Index of Multiple Deprivation 2015 (IMD 2015). That is an overall measure of multiple deprivations experienced by people living in an area. These indexes are calculated for every Lower layer Super Output Area (LSOA), which are larger geographic units than the OAs used for the 2011 OAC. The dataset in the file IndexesMultipleDeprivation2015_Leicester.csv contains the main Index of Multiple Deprivation, as well as the values for the seven distinct domains of deprivation and two additional indexes regarding deprivation affecting children and older people. The dataset includes scores, ranks (where 1 indicates the most deprived area), and decile (i.e., the first decile includes the 10% most deprived areas in England).

The code below loads the IMD 2015 dataset.

# Load Indexes of Multiple deprivation data
leicester_IMD2015 <-
read_csv("IndexesMultipleDeprivation2015_Leicester.csv")
leicester_2011OAC <-
read_csv("2011_OAC_Raw_uVariables_Leicester.csv")

4.4 Working with multiple tables

4.4.1 From long to wide table

The IMD 2015 data are in a long format, which means that every area is represented by more than one row: the column Value presents the value; the column IndicesOfDeprivation indicates which index the value refers to; the column Measurement indicates whether the value is a score, rank, or decile. The code below illustrates the data format used for the IndicesOfDeprivation table, and showing the rows for the LSOA including the University of Leicester (feature code E01013649).

leicester_IMD2015 %>%
filter(FeatureCode == "E01013649") %>%
select(FeatureCode, IndicesOfDeprivation, Measurement, Value) %>%
kable()
FeatureCode IndicesOfDeprivation Measurement Value
E01013649 Income Deprivation Domain Score 0.070
E01013649 Employment Deprivation Domain Score 0.075
E01013649 Income Deprivation Affecting Children Index (IDACI) Score 0.087
E01013649 Income Deprivation Affecting Older People Index (IDAOPI) Score 0.153
E01013649 Health Deprivation and Disability Domain Score 0.272
E01013649 Index of Multiple Deprivation (IMD) Score 19.665
E01013649 Education, Skills and Training Domain Score 2.195
E01013649 Barriers to Housing and Services Domain Score 14.324
E01013649 Living Environment Deprivation Domain Score 57.197
E01013649 Crime Domain Score 1.159
E01013649 Income Deprivation Domain Rank 23511.000
E01013649 Index of Multiple Deprivation (IMD) Rank 14539.000
E01013649 Employment Deprivation Domain Rank 21227.000
E01013649 Education, Skills and Training Domain Rank 30744.000
E01013649 Barriers to Housing and Services Domain Rank 23885.000
E01013649 Health Deprivation and Disability Domain Rank 12269.000
E01013649 Living Environment Deprivation Domain Rank 1197.000
E01013649 Crime Domain Rank 2214.000
E01013649 Income Deprivation Affecting Children Index (IDACI) Rank 22984.000
E01013649 Income Deprivation Affecting Older People Index (IDAOPI) Rank 16055.000
E01013649 Employment Deprivation Domain Decile 7.000
E01013649 Income Deprivation Affecting Older People Index (IDAOPI) Decile 5.000
E01013649 Barriers to Housing and Services Domain Decile 8.000
E01013649 Income Deprivation Affecting Children Index (IDACI) Decile 7.000
E01013649 Crime Domain Decile 1.000
E01013649 Income Deprivation Domain Decile 8.000
E01013649 Health Deprivation and Disability Domain Decile 4.000
E01013649 Living Environment Deprivation Domain Decile 1.000
E01013649 Education, Skills and Training Domain Decile 10.000
E01013649 Index of Multiple Deprivation (IMD) Decile 5.000

In the following section, the analysis aims to explore how certain census variables vary in areas with different deprivation levels. Thus, we need to extract the Decile rows from the IMD 2015 dataset and transform the data in a wide format, where each index is represented as a separate column.

To that purpose, we also need to change the name of the indexes slightly, to exclude spaces and punctuation, so that the new column names are simpler than the original text, and can be used as column names. That part of the manipulation is performed using mutate and functions from the stringr library.

leicester_IMD2015_decile_wide <- leicester_IMD2015 %>%
# Select only Socres
filter(Measurement == "Decile") %>%
# Trim names of IndicesOfDeprivation
mutate(
IndicesOfDeprivation = str_replace_all(IndicesOfDeprivation, "\\s", "")
) %>%
mutate(
IndicesOfDeprivation = str_replace_all(IndicesOfDeprivation, "[:punct:]", "")
) %>%
mutate(
IndicesOfDeprivation = str_replace_all(IndicesOfDeprivation, "\$$", "") ) %>% mutate( IndicesOfDeprivation = str_replace_all(IndicesOfDeprivation, "\$$", "")
) %>%
pivot_wider(
names_from = IndicesOfDeprivation,
values_from = Value
) %>%
# Drop columns
select(-DateCode, -Measurement, -Units)

Let’s compare the columns of the original long IMD 2015 dataset with the wide dataset created above, using the function colnames.

leicester_IMD2015 %>%
colnames() %>%
# limit width of printing area
print(width = 70)
## [1] "FeatureCode"          "DateCode"
## [3] "Measurement"          "Units"
## [5] "Value"                "IndicesOfDeprivation"
leicester_IMD2015_decile_wide %>%
colnames() %>%
# limit width of printing area
print(width = 70)
##  [1] "FeatureCode"
##  [2] "HealthDeprivationandDisabilityDomain"
##  [3] "IncomeDeprivationAffectingOlderPeopleIndexIDAOPI"
##  [4] "BarrierstoHousingandServicesDomain"
##  [5] "EmploymentDeprivationDomain"
##  [6] "EducationSkillsandTrainingDomain"
##  [7] "LivingEnvironmentDeprivationDomain"
##  [8] "IncomeDeprivationAffectingChildrenIndexIDACI"
##  [9] "CrimeDomain"
## [10] "IndexofMultipleDeprivationIMD"
## [11] "IncomeDeprivationDomain"

In leicester_IMD2015_decile_wide, we now have only one row representing the LSOA including the University of Leicester (feature code E01013649) and the main Index of Multiple Deprivations is now represented by the column IndexofMultipleDeprivationIMD. The value reported is the same – that is 5, which means that the selected LSOA is estimated to be in the range 40-50% most deprived areas in England – but we changed the data format.

# Original long IMD 2015 dataset
leicester_IMD2015 %>%
filter(
FeatureCode == "E01013649",
IndicesOfDeprivation == "Index of Multiple Deprivation (IMD)",
Measurement == "Decile"
) %>%
select(FeatureCode, IndicesOfDeprivation, Measurement, Value) %>%
kable()
FeatureCode IndicesOfDeprivation Measurement Value
E01013649 Index of Multiple Deprivation (IMD) Decile 5
# New wide IMD 2015 dataset
leicester_IMD2015_decile_wide %>%
filter(FeatureCode == "E01013649") %>%
select(FeatureCode, IndexofMultipleDeprivationIMD) %>%
kable()
FeatureCode IndexofMultipleDeprivationIMD
E01013649 5

4.4.2 Joining tables

As discussed above, two tables can be joined using a common column of identifiers. We can thus join the 2011 OAC and the IMD 2015 datasets into a single table. The LSOA code included in the 2011 OAC table is used to match that information with the corresponding row in the IMD 2015. The resulting table provides all the information from the 2011 OAC for each OA, plus the Index of Multiple Deprivations decile for the LSOA containing each OA.

That operation can be carried out using the function inner_join, and specifying the common column (or columns, if more than one is to be used as identifier) as argument of by. Note that using inner_join would result in dropping any row which doesn’t have a match in the other table, either way. In this case, that should not happen, as all OAs are part of an LSOA, and any LSOA contains at least one OA.

leicester_2011OAC_IMD2015 <-
leicester_2011OAC %>%
inner_join(
leicester_IMD2015_decile_wide,
by = c("LSOA11CD" = "FeatureCode")
)

As each LSOA contains multiple OAs, each row from the leicester_IMD2015_decile_wide table is matched to multiple rows from the leicester_2011OAC table. For instance, as shown in the table below, the information from the IMD 2015 dataset about the LSOA encompassing the University of Leicester (feature code E01013649) is joined to multiple rows from the 2011 OAC dataset, including the OA encompassing the University of Leicester (feature code E00068890) as well as other neighbouring OAs.

leicester_2011OAC_IMD2015 %>%
# Note that the LSOA11CD column needs to be used
# as the previous join as combined
# LSOA11CD and FeatureCode
# into one, name LSOA11CD
filter(LSOA11CD == "E01013649") %>%
select(OA11CD, LSOA11CD, supgrpname, IndexofMultipleDeprivationIMD) %>%
kable()
OA11CD LSOA11CD supgrpname IndexofMultipleDeprivationIMD
E00169447 E01013649 Cosmopolitans 5
E00168083 E01013649 Cosmopolitans 5
E00068893 E01013649 Cosmopolitans 5
E00068892 E01013649 Cosmopolitans 5
E00068890 E01013649 Cosmopolitans 5

Once the result is stored into the variable leicester_2011OAC_IMD2015, further analysis can be carried out. For instance, count can be used to count how many OAs fall into each 2011 OAC supergroup and decile of the Index of Multiple Deprivations.

leicester_2011OAC_IMD2015 %>%
count(supgrpname, IndexofMultipleDeprivationIMD) %>%
kable()
supgrpname IndexofMultipleDeprivationIMD n
Constrained City Dwellers 1 30
Constrained City Dwellers 2 3
Constrained City Dwellers 3 2
Constrained City Dwellers 6 1
Cosmopolitans 2 25
Cosmopolitans 3 15
Cosmopolitans 4 15
Cosmopolitans 5 8
Cosmopolitans 6 10
Cosmopolitans 8 10
Ethnicity Central 1 28
Ethnicity Central 2 18
Ethnicity Central 3 5
Ethnicity Central 4 5
Ethnicity Central 5 1
Hard-Pressed Living 1 68
Hard-Pressed Living 2 24
Hard-Pressed Living 3 5
Hard-Pressed Living 5 3
Hard-Pressed Living 6 1
Multicultural Metropolitans 1 107
Multicultural Metropolitans 2 119
Multicultural Metropolitans 3 132
Multicultural Metropolitans 4 100
Multicultural Metropolitans 5 56
Multicultural Metropolitans 6 34
Multicultural Metropolitans 7 6
Multicultural Metropolitans 8 17
Multicultural Metropolitans 9 2
Suburbanites 2 2
Suburbanites 3 2
Suburbanites 4 2
Suburbanites 5 10
Suburbanites 6 11
Suburbanites 7 16
Suburbanites 8 3
Suburbanites 9 8
Urbanites 1 1
Urbanites 2 3
Urbanites 3 5
Urbanites 4 15
Urbanites 5 14
Urbanites 6 4
Urbanites 7 12
Urbanites 8 7
Urbanites 9 4

As another example, the code below can be used to group OAs based on the decile and then calculate the percentage of adults not in employment using the u074 (No adults in employment in household: With dependent children) and u075 (No adults in employment in household: No dependent children) variables from the 2011 OAC dataset.

leicester_2011OAC_IMD2015 %>%
group_by(IndexofMultipleDeprivationIMD) %>%
summarise(
adults_not_empl_perc = (sum(u074 + u075) / sum(Total_Population)) * 100
) %>%
kable()
1 17.071876
2 14.191205
3 10.405029
4 9.966309
5 11.337036
6 10.710509
7 10.641026
8 9.686658
9 9.898140

4.5 Exercise 104.1

Extend the “Exploring deprivation indices in Leicester” document to include the code necessary to solve the questions below. Use the full list of variable names from the 2011 UK Census used to generate the 2011 OAC that can be found in the file 2011_OAC_Raw_uVariables_Lookup.csv to indetify which columns to use to complete the tasks.

Question 104.1.1: Write a piece of code using the pipe operator and the dplyr library to generate a table showing the percentage of EU citizens over total population, calculated grouping OAs by the related decile of the Index of Multiple Deprivations, but only accounting for areas classified as Cosmopolitans or Ethnicity Central or Multicultural Metropolitans.

Question 104.1.2: Write a piece of code using the pipe operator and the dplyr library to generate a table showing the percentage of EU citizens over total population, calculated grouping OAs by the related supergroup in the 2011 OAC, but only accounting for areas in the top 5 deciles of the Index of Multiple Deprivations.

Question 104.1.3: Write a piece of code using the pipe operator and the dplyr library to generate a table showing the percentage of people aged 65 and above, calculated grouping OAs by the related supergroup in the 2011 OAC and decile of the Index of Multiple Deprivations, and ordering the table by the calculated value in a descending order.

Question 104.1.4: Write a piece of code using the pipe operator and the dplyr and tidyr libraries to generate a long format of the leicester_2011OAC_IMD2015 table only including the values (census variables) used in Question 104.1.3.

Question 104.1.5: Write a piece of code using the pipe operator and the dplyr and tidyr libraries to generate a table similar to the one generated for Question 104.1.4, but showing the values as percentages over total population.

by Stefano De Sabbata – text licensed under the CC BY-SA 4.0, contains public sector information licensed under the Open Government Licence v3.0, code licensed under the GNU GPL v3.0.