4 Table operations

Print this chapter

The first two sections illustrate the pivot and join functionalities of the Tidyverse libraries using simple examples. The last section instead presents a more complex example, loading and wrangling with data related to the 2011 Output Area Classification and the Indexes of Multiple Deprivation 2015.

library(tidyverse)
library(knitr)

4.1 Long and wide table formats

Tabular data are usually presented in two different formats.

Wide: this is the most common approach, where each real-world entity (e.g. a city) is represented by one single row and its attributes are represented through different columns (e.g., a column representing the total population in the area, another column representing the size of the area, etc.).

City	Population	Area	Density
Leicester	329,839	73.3	4,500
Nottingham	321,500	74.6	4,412

Long: this is probably a less common approach, but still necessary in many cases, where each real-world entity (e.g. a city) is represented by multiple rows, each one reporting only one of its attributes. In this case, one column is used to indicate which attribute each row represent, and another column is used to report the value.

City	Attribute	Value
Leicester	Population	329,839
Leicester	Area	73.3
Leicester	Density	4,500
Nottingham	Population	321,500
Nottingham	Area	74.6
Nottingham	Density	4,412

The tidyr library provides two functions that allow transforming wide-formatted data to a long format, and vice-versa. Please take your time to understand the example below and check out the tidyr help pages before continuing.

city_info_wide <- data.frame(
    city = c("Leicester", "Nottingham"),
    population = c(329839, 321500),
    area = c(73.3, 74.6),
    density = c(4500, 4412)
  ) %>%
  as_tibble()

city_info_wide %>%
 kable()

city	population	area	density
Leicester	329839	73.3	4500
Nottingham	321500	74.6	4412

city_info_long <- city_info_wide %>%
  pivot_longer(
    # exclude IDs (city names) from the pivoted columns
    cols = -city,
    # name for the new column containing
    # the names of the old columns
    names_to = "attribute",
    # name for the new column containing
    # the values included under the old columns
    values_to = "value"
  )

city_info_long %>%
  kable()

city	attribute	value
Leicester	population	329839.0
Leicester	area	73.3
Leicester	density	4500.0
Nottingham	population	321500.0
Nottingham	area	74.6
Nottingham	density	4412.0

city_info_back_to_wide <- city_info_long %>%
  pivot_wider(
    # column containing the attribute names
    names_from = attribute,
    # column containing the values
    values_from = value
  )

city_info_back_to_wide %>%
  kable()

city	population	area	density
Leicester	329839	73.3	4500
Nottingham	321500	74.6	4412

4.2 Joining tables

A join operation combines two tables into one by matching rows that have the same values in the specified column. This operation is usually executed on columns containing identifiers, which are matched through different tables containing different data about the same real-world entities. For instance, the table below presents the telephone prefixes for two cities. That information can be combined with the data present in the wide-formatted table above through a join operation on the columns containing the city names. As the two tables do not contain all the same cities, if a full join operation is executed, some cells have no values assigned.

city	telephone_prefix
Leicester	0116
Birmingham	0121

city	population	area	density	telephone_prefix
Leicester	329,839	73.3	4,500	0116
Nottingham	321,500	74.6	4,412
Birmingham				0121

As discussed in the lecture, the dplyr library offers different types of join operations, which correspond to the different SQL joins illustrated in the image below. The use and implications of these different types of joins will be discussed in more detail in the GY7708 module next semester.

Please take your time to understand the example below and check out the related dplyr help pages before continuing. The first four examples execute the exact same full join operation using three different syntaxes: with or without using the pipe operator, and specifying the by argument or not. Note that all those approaches to writing the join are valid and produce the same result. The choice about which approach to use will depend on the code you are writing. In particular, you might find it useful to use the syntax that uses the pipe operator when the join operation is itself only one stem in a series of data manipulation steps. Using the by argument is usually advisable unless you are certain that you aim to join two tables with all and exactly the column that have the same names in the two table.

Note how the result of the join operations is not saved to a variable. The function kable is added after each join operation through a pipe %>% to display the resulting table in a nice format.

city_telephone_prexix <- data.frame(
    city = c("Leicester", "Birmingham"),
    telephon_prefix = c("0116", "0121")
  ) %>%
  as_tibble()

city_telephone_prexix %>%
  kable()

city	telephon_prefix
Leicester	0116
Birmingham	0121

# Option 1: without using the pipe operator

# full join verb
full_join(
    # left table
    city_info_wide,
    # right table
    city_telephone_prexix,
    # columns to match
    by = c("city" = "city")
  ) %>%
  kable()

city	population	area	density	telephon_prefix
Leicester	329839	73.3	4500	0116
Nottingham	321500	74.6	4412	NA
Birmingham	NA	NA	NA	0121

# Option 2: without using the pipe operator
#   and without using the argument "by"
#   as columns have the same name
#   in the two tables.
# Same result as Option 1

# full join verb
full_join(
    # left table
    city_info_wide,
    # right table
    city_telephone_prexix
  ) %>%
  kable()

city	population	area	density	telephon_prefix
Leicester	329839	73.3	4500	0116
Nottingham	321500	74.6	4412	NA
Birmingham	NA	NA	NA	0121

# Option 3: using the pipe operator
#   and without using the argument "by"
#   as columns have the same name
#   in the two tables.
# Same result as Option 1 and 2

# left table
city_info_wide %>%
  # full join verb
  full_join(
    # right table
    city_telephone_prexix
  ) %>%
  kable()

city	population	area	density	telephon_prefix
Leicester	329839	73.3	4500	0116
Nottingham	321500	74.6	4412	NA
Birmingham	NA	NA	NA	0121

# Option 4: using the pipe operator
#   and using the argument "by".
# Same result as Option 1, 2 and 3

# left table
city_info_wide %>%
  # full join verb
  full_join(
    # right table
    city_telephone_prexix,
    # columns to match
    by = c("city" = "city")
  ) %>%
  kable()

city	population	area	density	telephon_prefix
Leicester	329839	73.3	4500	0116
Nottingham	321500	74.6	4412	NA
Birmingham	NA	NA	NA	0121

# Left join
# Using syntax similar to Option 1 above

# left join
left_join(
    # left table
    city_info_wide, 
    # right table
    city_telephone_prexix,
    # columns to match
    by = c("city" = "city")
  ) %>%
  kable()

city	population	area	density	telephon_prefix
Leicester	329839	73.3	4500	0116
Nottingham	321500	74.6	4412	NA

# Right join
# Using syntax similar to Option 2 above

# right join verb
right_join(
    # left table
    city_info_wide, 
    # right table
    city_telephone_prexix
  ) %>%
  kable()

city	population	area	density	telephon_prefix
Leicester	329839	73.3	4500	0116
Birmingham	NA	NA	NA	0121

# Inner join
# Using syntax similar to Option 3 above

# left table
city_info_wide %>%
  # inner join
  inner_join(
    # right table
    city_telephone_prexix
  ) %>%
  kable()

city	population	area	density	telephon_prefix
Leicester	329839	73.3	4500	0116

4.3 Data: Indices of Multiple Deprivation

Open the Leicester_population project created for the previous chapter. Create a new RMarkdown document using “Exploring deprivation indices in Leicester” as the title and PDF as the output file type. Delete the example code after the setup chunk. Add a new markdown second-heading section named Libraries and a chunk loading the tidyverse and knitr libraries (see below). Save the file with the name exploring_Leicester_deprivation.Rmd in the Leicester_population project.

Download from Blackboard (or the data folder of the repository) the file IndexesMultipleDeprivation2015_Leicester.csv and upload the file to the Leicester_population folder by clicking on the Upload button and selecting the files from your computer.

The Indices of Multiple Deprivation 2015 (see map at cdrc.ac.uk) are based on a series of variables across seven distinct domains of deprivation which are combined to calculate the Index of Multiple Deprivation 2015 (IMD 2015). That is an overall measure of multiple deprivations experienced by people living in an area. These indexes are calculated for every Lower layer Super Output Area (LSOA), which are larger geographic units than the OAs used for the 2011 OAC. The dataset in the file IndexesMultipleDeprivation2015_Leicester.csv contains the main Index of Multiple Deprivation, as well as the values for the seven distinct domains of deprivation and two additional indexes regarding deprivation affecting children and older people. The dataset includes scores, ranks (where 1 indicates the most deprived area), and decile (i.e., the first decile includes the 10% most deprived areas in England).

The code below loads the IMD 2015 dataset.

# Load Indexes of Multiple deprivation data
leicester_IMD2015 <- 
  read_csv("IndexesMultipleDeprivation2015_Leicester.csv")

leicester_2011OAC <- 
  read_csv("2011_OAC_Raw_uVariables_Leicester.csv")

4.4 Working with multiple tables

4.4.1 From long to wide table

The IMD 2015 data are in a long format, which means that every area is represented by more than one row: the column Value presents the value; the column IndicesOfDeprivation indicates which index the value refers to; the column Measurement indicates whether the value is a score, rank, or decile. The code below illustrates the data format used for the IndicesOfDeprivation table, and showing the rows for the LSOA including the University of Leicester (feature code E01013649).

leicester_IMD2015 %>%
  filter(FeatureCode == "E01013649") %>%
  select(FeatureCode, IndicesOfDeprivation, Measurement, Value) %>%
  kable()

FeatureCode	IndicesOfDeprivation	Measurement	Value
E01013649	Income Deprivation Domain	Score	0.070
E01013649	Employment Deprivation Domain	Score	0.075
E01013649	Income Deprivation Affecting Children Index (IDACI)	Score	0.087
E01013649	Income Deprivation Affecting Older People Index (IDAOPI)	Score	0.153
E01013649	Health Deprivation and Disability Domain	Score	0.272
E01013649	Index of Multiple Deprivation (IMD)	Score	19.665
E01013649	Education, Skills and Training Domain	Score	2.195
E01013649	Barriers to Housing and Services Domain	Score	14.324
E01013649	Living Environment Deprivation Domain	Score	57.197
E01013649	Crime Domain	Score	1.159
E01013649	Income Deprivation Domain	Rank	23511.000
E01013649	Index of Multiple Deprivation (IMD)	Rank	14539.000
E01013649	Employment Deprivation Domain	Rank	21227.000
E01013649	Education, Skills and Training Domain	Rank	30744.000
E01013649	Barriers to Housing and Services Domain	Rank	23885.000
E01013649	Health Deprivation and Disability Domain	Rank	12269.000
E01013649	Living Environment Deprivation Domain	Rank	1197.000
E01013649	Crime Domain	Rank	2214.000
E01013649	Income Deprivation Affecting Children Index (IDACI)	Rank	22984.000
E01013649	Income Deprivation Affecting Older People Index (IDAOPI)	Rank	16055.000
E01013649	Employment Deprivation Domain	Decile	7.000
E01013649	Income Deprivation Affecting Older People Index (IDAOPI)	Decile	5.000
E01013649	Barriers to Housing and Services Domain	Decile	8.000
E01013649	Income Deprivation Affecting Children Index (IDACI)	Decile	7.000
E01013649	Crime Domain	Decile	1.000
E01013649	Income Deprivation Domain	Decile	8.000
E01013649	Health Deprivation and Disability Domain	Decile	4.000
E01013649	Living Environment Deprivation Domain	Decile	1.000
E01013649	Education, Skills and Training Domain	Decile	10.000
E01013649	Index of Multiple Deprivation (IMD)	Decile	5.000

In the following section, the analysis aims to explore how certain census variables vary in areas with different deprivation levels. Thus, we need to extract the Decile rows from the IMD 2015 dataset and transform the data in a wide format, where each index is represented as a separate column.

To that purpose, we also need to change the name of the indexes slightly, to exclude spaces and punctuation, so that the new column names are simpler than the original text, and can be used as column names. That part of the manipulation is performed using mutate and functions from the stringr library.

leicester_IMD2015_decile_wide <- leicester_IMD2015 %>%
  # Select only Socres
  filter(Measurement == "Decile") %>%
  # Trim names of IndicesOfDeprivation
  mutate(
    IndicesOfDeprivation = str_replace_all(IndicesOfDeprivation, "\\s", "")
  ) %>%
  mutate(
    IndicesOfDeprivation = str_replace_all(IndicesOfDeprivation, "[:punct:]", "")
  ) %>%
  mutate(
    IndicesOfDeprivation = str_replace_all(IndicesOfDeprivation, "\\(", "")
  ) %>%
  mutate(
    IndicesOfDeprivation = str_replace_all(IndicesOfDeprivation, "\\)", "")
  ) %>%
  # Spread
  pivot_wider(
    names_from = IndicesOfDeprivation,
    values_from = Value
  ) %>%
  # Drop columns
  select(-DateCode, -Measurement, -Units)

Let’s compare the columns of the original long IMD 2015 dataset with the wide dataset created above, using the function colnames.

leicester_IMD2015 %>% 
  colnames() %>%
  # limit width of printing area
  print(width = 70)

## [1] "FeatureCode"          "DateCode"            
## [3] "Measurement"          "Units"               
## [5] "Value"                "IndicesOfDeprivation"

leicester_IMD2015_decile_wide %>% 
  colnames() %>%
  # limit width of printing area
  print(width = 70)

##  [1] "FeatureCode"                                     
##  [2] "HealthDeprivationandDisabilityDomain"            
##  [3] "IncomeDeprivationAffectingOlderPeopleIndexIDAOPI"
##  [4] "BarrierstoHousingandServicesDomain"              
##  [5] "EmploymentDeprivationDomain"                     
##  [6] "EducationSkillsandTrainingDomain"                
##  [7] "LivingEnvironmentDeprivationDomain"              
##  [8] "IncomeDeprivationAffectingChildrenIndexIDACI"    
##  [9] "CrimeDomain"                                     
## [10] "IndexofMultipleDeprivationIMD"                   
## [11] "IncomeDeprivationDomain"

In leicester_IMD2015_decile_wide, we now have only one row representing the LSOA including the University of Leicester (feature code E01013649) and the main Index of Multiple Deprivations is now represented by the column IndexofMultipleDeprivationIMD. The value reported is the same – that is 5, which means that the selected LSOA is estimated to be in the range 40-50% most deprived areas in England – but we changed the data format.

# Original long IMD 2015 dataset
leicester_IMD2015 %>%
  filter(
    FeatureCode == "E01013649",
    IndicesOfDeprivation == "Index of Multiple Deprivation (IMD)",
    Measurement == "Decile"
  ) %>%
  select(FeatureCode, IndicesOfDeprivation, Measurement, Value) %>%
  kable()

FeatureCode	IndicesOfDeprivation	Measurement	Value
E01013649	Index of Multiple Deprivation (IMD)	Decile	5

# New wide IMD 2015 dataset
leicester_IMD2015_decile_wide %>%
  filter(FeatureCode == "E01013649") %>%
  select(FeatureCode, IndexofMultipleDeprivationIMD) %>%
  kable()

FeatureCode	IndexofMultipleDeprivationIMD
E01013649	5

4.4.2 Joining tables

As discussed above, two tables can be joined using a common column of identifiers. We can thus join the 2011 OAC and the IMD 2015 datasets into a single table. The LSOA code included in the 2011 OAC table is used to match that information with the corresponding row in the IMD 2015. The resulting table provides all the information from the 2011 OAC for each OA, plus the Index of Multiple Deprivations decile for the LSOA containing each OA.

That operation can be carried out using the function inner_join, and specifying the common column (or columns, if more than one is to be used as identifier) as argument of by. Note that using inner_join would result in dropping any row which doesn’t have a match in the other table, either way. In this case, that should not happen, as all OAs are part of an LSOA, and any LSOA contains at least one OA.

leicester_2011OAC_IMD2015 <- 
  leicester_2011OAC %>%
  inner_join(
    leicester_IMD2015_decile_wide, 
    by = c("LSOA11CD" = "FeatureCode")
  )

As each LSOA contains multiple OAs, each row from the leicester_IMD2015_decile_wide table is matched to multiple rows from the leicester_2011OAC table. For instance, as shown in the table below, the information from the IMD 2015 dataset about the LSOA encompassing the University of Leicester (feature code E01013649) is joined to multiple rows from the 2011 OAC dataset, including the OA encompassing the University of Leicester (feature code E00068890) as well as other neighbouring OAs.

leicester_2011OAC_IMD2015 %>%
  # Note that the LSOA11CD column needs to be used
  # as the previous join as combined 
  # LSOA11CD and FeatureCode
  # into one, name LSOA11CD
  filter(LSOA11CD == "E01013649") %>%
  select(OA11CD, LSOA11CD, supgrpname, IndexofMultipleDeprivationIMD) %>%
  kable()

OA11CD	LSOA11CD	supgrpname	IndexofMultipleDeprivationIMD
E00169447	E01013649	Cosmopolitans	5
E00168083	E01013649	Cosmopolitans	5
E00068893	E01013649	Cosmopolitans	5
E00068892	E01013649	Cosmopolitans	5
E00068890	E01013649	Cosmopolitans	5

Once the result is stored into the variable leicester_2011OAC_IMD2015, further analysis can be carried out. For instance, count can be used to count how many OAs fall into each 2011 OAC supergroup and decile of the Index of Multiple Deprivations.

leicester_2011OAC_IMD2015 %>%
  count(supgrpname, IndexofMultipleDeprivationIMD) %>%
  kable()

supgrpname	IndexofMultipleDeprivationIMD	n
Constrained City Dwellers	1	30
Constrained City Dwellers	2	3
Constrained City Dwellers	3	2
Constrained City Dwellers	6	1
Cosmopolitans	2	25
Cosmopolitans	3	15
Cosmopolitans	4	15
Cosmopolitans	5	8
Cosmopolitans	6	10
Cosmopolitans	8	10
Ethnicity Central	1	28
Ethnicity Central	2	18
Ethnicity Central	3	5
Ethnicity Central	4	5
Ethnicity Central	5	1
Hard-Pressed Living	1	68
Hard-Pressed Living	2	24
Hard-Pressed Living	3	5
Hard-Pressed Living	5	3
Hard-Pressed Living	6	1
Multicultural Metropolitans	1	107
Multicultural Metropolitans	2	119
Multicultural Metropolitans	3	132
Multicultural Metropolitans	4	100
Multicultural Metropolitans	5	56
Multicultural Metropolitans	6	34
Multicultural Metropolitans	7	6
Multicultural Metropolitans	8	17
Multicultural Metropolitans	9	2
Suburbanites	2	2
Suburbanites	3	2
Suburbanites	4	2
Suburbanites	5	10
Suburbanites	6	11
Suburbanites	7	16
Suburbanites	8	3
Suburbanites	9	8
Urbanites	1	1
Urbanites	2	3
Urbanites	3	5
Urbanites	4	15
Urbanites	5	14
Urbanites	6	4
Urbanites	7	12
Urbanites	8	7
Urbanites	9	4

As another example, the code below can be used to group OAs based on the decile and then calculate the percentage of adults not in employment using the u074 (No adults in employment in household: With dependent children) and u075 (No adults in employment in household: No dependent children) variables from the 2011 OAC dataset.

leicester_2011OAC_IMD2015 %>%
  group_by(IndexofMultipleDeprivationIMD) %>%
  summarise(
    adults_not_empl_perc = (sum(u074 + u075) / sum(Total_Population)) * 100
  ) %>%
  kable()

IndexofMultipleDeprivationIMD	adults_not_empl_perc
1	17.071876
2	14.191205
3	10.405029
4	9.966309
5	11.337036
6	10.710509
7	10.641026
8	9.686658
9	9.898140

4.5 Exercise 104.1

Extend the “Exploring deprivation indices in Leicester” document to include the code necessary to solve the questions below. Use the full list of variable names from the 2011 UK Census used to generate the 2011 OAC that can be found in the file 2011_OAC_Raw_uVariables_Lookup.csv to indetify which columns to use to complete the tasks.

Question 104.1.1: Write a piece of code using the pipe operator and the dplyr library to generate a table showing the percentage of EU citizens over total population, calculated grouping OAs by the related decile of the Index of Multiple Deprivations, but only accounting for areas classified as Cosmopolitans or Ethnicity Central or Multicultural Metropolitans.

Question 104.1.2: Write a piece of code using the pipe operator and the dplyr library to generate a table showing the percentage of EU citizens over total population, calculated grouping OAs by the related supergroup in the 2011 OAC, but only accounting for areas in the top 5 deciles of the Index of Multiple Deprivations.

Question 104.1.3: Write a piece of code using the pipe operator and the dplyr library to generate a table showing the percentage of people aged 65 and above, calculated grouping OAs by the related supergroup in the 2011 OAC and decile of the Index of Multiple Deprivations, and ordering the table by the calculated value in a descending order.

Question 104.1.4: Write a piece of code using the pipe operator and the dplyr and tidyr libraries to generate a long format of the leicester_2011OAC_IMD2015 table only including the values (census variables) used in Question 104.1.3.

Question 104.1.5: Write a piece of code using the pipe operator and the dplyr and tidyr libraries to generate a table similar to the one generated for Question 104.1.4, but showing the values as percentages over total population.

by Stefano De Sabbata – text licensed under the CC BY-SA 4.0, contains public sector information licensed under the Open Government Licence v3.0, code licensed under the GNU GPL v3.0.

3 Data manipulation

5 Exploratory visualisation