3 Data manipulation

Print this chapter

Programming languages commonly provide both simple data types, such as those seen in the first chapter, and more complex objects capable of storing and organising multiple values, such as the tables that we have briefly seen in action in the second chapter. In this chapter we bridge that gap by discussing a series of complex data types and how to manipulate the information they store. In doing so, we discuss the dplyr library (also part of the Tidyverse), which it offers a grammar for data manipulation.

Before continuing, create a new R project named GY7702-practical-103 and create a new R script named complex-data-types.R. Follow along with the examples below by copy-pasting the code in the script and then running one line at the time.

3.1 Complex data types

3.1.1 Vectors

The simplest of the complex objects usually allow storing multiple values of the same type in an ordered list. Such objects take different names in different languages. In R, they are referred to as vectors¹⁶.

Vectors can be defined in R by using the function c, which takes as parameters the items to be stored in the vector. The items are stored in the order in which they are provided.

east_midlands_cities <- c("Derby", "Leicester", "Lincoln", "Nottingham")
length(east_midlands_cities)

## [1] 4

# or using the pipe operator
library(tidyverse)
east_midlands_cities %>% length()

## [1] 4

Once the vector has been created and assigned to an identifier, the elements within the vector can be retrieved by specifying the identifier, followed by square brackets and the index (or indices as we will see further below) of the elements to be retrieved. Indices start from 1, so the index of the first element is 1, the index of the second element is 2, and so on and so forth¹⁷.

# Retrieve the third city
east_midlands_cities[3]

## [1] "Lincoln"

Alternatively, the first, last, or nth functions of the dplyr library within Tidyverse can be used to extract single values from vectors.

# Retrieve the first city
east_midlands_cities %>% first()

## [1] "Derby"

# Retrieve the first city
east_midlands_cities %>% nth(1)

## [1] "Derby"

# Retrieve the third city
east_midlands_cities %>% nth(3)

## [1] "Lincoln"

To retrieve any subset of a vector (i.e., more than one element), you can specify an integer vector containing the indices (rather than a single integer value) of the items of interest between square brackets.

# Retrieve first and third city
east_midlands_cities[c(1, 3)]

## [1] "Derby"   "Lincoln"

Functions and operators can be applied to vectors in the same way as they would be applied to simple values. For instance, all built-in numerical functions in R can be used on a vector variable directly. That is, if a vector is specified as input, the selected function is applied to each element of the vector.

one_to_ten <- 1:10
one_to_ten

##  [1]  1  2  3  4  5  6  7  8  9 10

one_to_ten + 1

##  [1]  2  3  4  5  6  7  8  9 10 11

sqrt(one_to_ten)

##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
##  [9] 3.000000 3.162278

Similarly, string functions can be applied to vectors containing character values. For instance, the code below uses str_length to obtain a vector of numeric values representing the lengths of the city names included in the vector of character values east_midlands_cities.

str_length(east_midlands_cities)

## [1]  5  9  7 10

As seen in the previous chapter, a condition entered in the Console is evaluated for the provided input, and a logical value (TRUE or FALSE) is provided as output. Similarly, if the provided input is a vector, the condition is evaluated for each element of the vector, and a vector of logical values is returned – which contains the respective results of the conditions for each element.

-3 > 0

## [1] FALSE

minus_two_to_two <- c(-2, -1, 0, 1, 2) 
minus_two_to_two > 0

## [1] FALSE FALSE FALSE  TRUE  TRUE

A subset of the elements of a vector can also be selected by providing a vector of logical values between brackets after the identifier. A new vector is returned, containing only the values for which a TRUE value has been specified correspondingly.

minus_two_to_two[c(TRUE, TRUE, FALSE, FALSE, TRUE)]

## [1] -2 -1  2

As the result of evaluating the condition on a vector is a vector of logical values, this can be used to filter vectors based on conditions. If a condition is provided between square brackets (after the vector identifier instead of an index), a new vector is returned, which contains only the elements for which the condition is true.

minus_two_to_two > 0

## [1] FALSE FALSE FALSE  TRUE  TRUE

minus_two_to_two[minus_two_to_two > 0]

## [1] 1 2

A factor is a data type similar to a vector. However, the values contained in a factor can only be selected from a set of levels. The code below illustrates the difference between vectors and factors.

houses_vector <- c("Bungalow", "Flat", "Flat",
  "Detached", "Flat", "Terrace", "Terrace")
houses_vector

## [1] "Bungalow" "Flat"     "Flat"     "Detached" "Flat"     "Terrace"  "Terrace"

houses_factor <- factor(c("Bungalow", "Flat", "Flat",
  "Detached", "Flat", "Terrace", "Terrace"))
houses_factor

## [1] Bungalow Flat     Flat     Detached Flat     Terrace  Terrace 
## Levels: Bungalow Detached Flat Terrace

R also provides a data type named ordered, which is similar to the data type factor but the levels are considered as ordered categories and respond to operators like > (greater then) when filtered. Appending 1 provides some more details on vectors for those eager to lear – or if you get stuck with your code and feel like knowing more about how to work with vectors might help. 😊

3.1.2 Lists

Variables of the type list can contain elements of different types (including vectors and matrices), whereas elements of vectors are all of the same type.

employee <- list("Stef", 2015)
employee

## [[1]]
## [1] "Stef"
## 
## [[2]]
## [1] 2015

employee[[1]] # Note the double square brackets for selection

## [1] "Stef"

In named lists each element has a name, and elements can be selected using their name after the symbol $.

employee <- list(employee_name = "Stef", start_year = 2015)
employee

## $employee_name
## [1] "Stef"
## 
## $start_year
## [1] 2015

employee$employee_name

## [1] "Stef"

3.2 Data frames and tibbles

As mentioned in the first chapter, data frames are complex data types which encode the concept of a table in R by combining and arranging together a series of simple objects. Data frames are similar to named lists, where each element of the list is a vector representing a column and all vectors have the same length, thus representing the same number of rows.

In the Tidyverse, a tibble is defined as: “a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code”.

Before continuing, create a data subfolder and copy the 2011_OAC_Raw_uVariables_Leicester.csv file in the data subfolder. Then, create a new RMarkdown documemnt to be compiled as a PDF file, using Data manipulation as a title and data-manipulation.Rmd as file name. Follow along with the examples below by copy-pasting the code in new R chunks, running them one at the time and then compiling the whole document.

3.2.1 Selecting and filtering tables

The approaches seen above for selecting and filtering data from vectors can be applied to data frames and tibbles. The only difference is that tables are bi-dimensional (rather than one-dimensional), and thus, two pieces of information are necessary. The first index specifies which rows to select or filter, and the second index specifies which columns to select or filter. If no information is provided for either the first or second index, all rows or columns are provided. However, as you can see from the examples below, the more complex the selection and filtering query become, the longer and less readable the code becomes.

library(tidyverse)
library(knitr)

# Read the Leicester 2011 OAC dataset from the csv file
leicester_2011OAC <- 
  read_csv("2011_OAC_Raw_uVariables_Leicester.csv")

# Select the 5th row
leicester_2011OAC[5, ]

# Select the 9th column (OAC supergroup name)
leicester_2011OAC[, 9]

# Select the 5th row and 9th column
leicester_2011OAC[5, 9]

# Select the OAC supergroup name column (9th column) of the 5th row
leicester_2011OAC[5, "supgrpname"]

# Select the OA code, OAC supergroup code name columns of the 5th row
leicester_2011OAC[5, c("OA11CD", "supgrpcode", "supgrpname")]

# Select the OA code, OAC supergroup code name columns
# of the 5th to 10th rows 
leicester_2011OAC[5:10, c("OA11CD", "supgrpcode", "supgrpname")]

# Select the OA code, OAC supergroup code name columns
# for all OAs with more than 600 inhabitants
leicester_2011OAC[leicester_2011OAC$Total_Population > 600, c("OA11CD", "supgrpcode", "supgrpname")]

Fortunately, rather than working with base R instructions, we can use the dplyr library, which is part of the Tidyverse and offers a grammar for data manipulation. The function select can be used to select some columns to output. For instance, in the code below, the function select is used to select the columns OA11CD, supgrpcode, and supgrpname, in combination with the function slice_head, which can be used to include only the first n rows (5 in the example below) to output.

leicester_2011OAC %>%
  select(OA11CD, supgrpcode, supgrpname) %>%
  slice_head(n = 5) %>%
  kable()

OA11CD	supgrpcode	supgrpname
E00069517	6	Suburbanites
E00069514	2	Cosmopolitans
E00169516	4	Multicultural Metropolitans
E00169048	4	Multicultural Metropolitans
E00169044	4	Multicultural Metropolitans

The function filter can instead be used to filter rows based on a specified condition. In the example below, the output of the filter step only includes the rows where the value of grpname is "Students Around Campus" (i.e., OAs classified as part of the Students Around Campus group). Note that the grpname needs to be included in the select step in order to be able to used it in the subsequent filter operation. All functions in the dplyr library can be combined in any other order that makes logical sense. However, if the select step didn’t include grpname, that same column couldn’t have been used in the filter step.

leicester_2011OAC %>%
  select(OA11CD, supgrpname, grpname, Total_Population) %>%
  filter(grpname == "Students Around Campus") %>%
  slice_head(n = 5) %>%
  kable()

OA11CD	supgrpname	grpname	Total_Population
E00069514	Cosmopolitans	Students Around Campus	323
E00068882	Cosmopolitans	Students Around Campus	285
E00169553	Cosmopolitans	Students Around Campus	714
E00068869	Cosmopolitans	Students Around Campus	319
E00068876	Cosmopolitans	Students Around Campus	255

3.2.2 Summarise

the function count of the dplyr library can be used to count the number of rows in a data frame. The code below provides the leicester_2011OAC dataframe (as read in input in the section above) as input to the function count through the pipe operator, thus creating a new tibble with only one row and one column, containing the number of rows in that dataframe – that is, the number of OAs in Leicester.

leicester_2011OAC %>%
  count() %>%
  kable()

n
969

As discussed in the previous lecture, a tibble is a data type similar to data frames, used by all the Tidyverse libraries. All Tidyverse functions output tibble rather than data.frame objects when representing a table. However, data.frame object can be provided as input, as they are automatically converted by Tidyverse functions before proceeding with the processing steps.

The example above already shows how the pipe operator can be used effectively in a multi-step operation. In the tibble outputted by the count function above, the column n provides the count. The function kable of the library knitr is used to produce a well-formatted table.

Note how the code above goes to a new line after every %>%, and space is added at the beginning of new lines. That is very common in R programming (especially when functions have many parameters) as it makes the code more readable.

The verb count is a special case (a shorthand) of the more general verb summarise (summarize using the American English spelling is also available with the same functionality), which allows generating tables presenting aggregate values of input data.

For instance, summarise can be used to create a table containing:

a column presenting the total population of Leicester;
a column presenting the total number of OAs in Leicester;
a column presenting the average population per OAs in Leicester;
a column presenting a logical value stating whether any OA in Leicester contains more than 600 people.

That can be achieved applying different aggregate functions to the same column (Total_Population) of the leicester_2011OAC as illustrated below.

sum to calculate the total population of Leicester;
n to count the number of rows (as mentioned above count is a shorthand for the combination of summarise with n);
mean to calculate the average population per OAs in Leicester;
any with a conditional statement comparing Total_Population and the value 600 using the numeric operator > (greater than) to assess whether any OA in Leicester contains more than 600 people.

leicester_2011OAC %>%
  summarise(
    # Total population in Leicester
    tot_pop = sum(Total_Population),
    # Number of OAs, no input column needed
    num_of_OAs = n(),
    # Average population
    avg_pop = mean(Total_Population), 
    # Is there any OA with over 200 people?
    pop_over_600 = any(Total_Population > 600)
  ) %>% 
  kable()

tot_pop	num_of_OAs	avg_pop	pop_over_600
329839	969	340.3911	TRUE

A set of aggregate functions which are frequently used with summarise is available in the dplyr page on summarise.

To carry out more complex aggregations, the function summarise can be used in combination with the function group_by to summarise the values in the data frame based on groups. Rows having the same value for the column specified for the verb group_by (in the example below, the 2011 OAC supergroup, supgrpname) are grouped together, then values are aggregated based on the functions specified for the verb summarise (using one or more columns in the calculation).

leicester_2011OAC %>%
  group_by(supgrpname) %>% 
  summarise(
    # Total population in Leicester
    tot_pop = sum(Total_Population),
    # Number of OAs, no input column needed
    num_of_OAs = n(),
    # Average population
    avg_pop = mean(Total_Population), 
    # Is there any OA with over 200 people?
    pop_over_600 = any(Total_Population > 600)
  ) %>% 
  kable()

supgrpname	tot_pop	num_of_OAs	avg_pop	pop_over_600
Constrained City Dwellers	9263	36	257.3056	FALSE
Cosmopolitans	26813	83	323.0482	TRUE
Ethnicity Central	19137	57	335.7368	FALSE
Hard-Pressed Living	32069	101	317.5149	FALSE
Multicultural Metropolitans	205007	573	357.7784	TRUE
Suburbanites	17326	54	320.8519	FALSE
Urbanites	20224	65	311.1385	FALSE

The shorthand function count also has the option to specify a column that will be used for an internal grouping. In the example below, the column name supgrpname is provided as an argument to the function count. As a result, the output shows the number of rows grouped by 2011 OAC supergroup.

leicester_2011OAC %>%
  count(supgrpname) %>%
  kable()

supgrpname	n
Constrained City Dwellers	36
Cosmopolitans	83
Ethnicity Central	57
Hard-Pressed Living	101
Multicultural Metropolitans	573
Suburbanites	54
Urbanites	65

3.2.3 Mutate

The function mutate can be used to create a new column by conducting operations on current columns. For instance, in the example below, summarise is first used to calculate the total number of people and the number of OAs per 2011 OAC supergroup. The verb mutate is then used to calculate the average population per OA per 2011 OAC supergroup, recreating the same avg_pop column as above but through a different process.

leicester_2011OAC %>%
  group_by(supgrpname) %>% 
  summarise(
    # Total population in Leicester
    tot_pop = sum(Total_Population),
    # Number of OAs, no input column needed
    num_of_OAs = n(),
    # Is there any OA with over 200 people?
    pop_over_600 = any(Total_Population > 600)
  ) %>% 
  mutate(
    # Average population
    avg_pop = tot_pop / num_of_OAs, 
  ) %>% 
  kable(digits = c(0, 0, 0, 0, 2))

supgrpname	tot_pop	num_of_OAs	pop_over_600	avg_pop
Constrained City Dwellers	9263	36	FALSE	257.31
Cosmopolitans	26813	83	TRUE	323.05
Ethnicity Central	19137	57	FALSE	335.74
Hard-Pressed Living	32069	101	FALSE	317.51
Multicultural Metropolitans	205007	573	TRUE	357.78
Suburbanites	17326	54	FALSE	320.85
Urbanites	20224	65	FALSE	311.14

In this second example, the u005 column (which represents the area of the OA in hectares, see 2011_OAC_Raw_uVariables_Lookup.csv) is used to calculate the population density for each OA.

leicester_2011OAC %>%
  mutate(
    # Population density
    pop_density = Total_Population / u005, 
  ) %>% 
  select(OA11CD, pop_density) %>% 
  slice_head(n = 10) %>% 
  kable(digits = c(0, 2))

OA11CD	pop_density
E00069517	33.69
E00069514	131.30
E00169516	240.14
E00169048	104.23
E00169044	99.08
E00069041	51.31
E00169049	42.53
E00068806	57.78
E00068886	122.87
E00068807	76.21

3.2.4 Arrange

The function arrange can be used to sort a tibble by ascending order of the values in the specified column. If the operator - is specified before the column name, the descending order is used. The code below would produce a table showing the ten OAs with the largest population in Leicester.

leicester_2011OAC %>%
  select(
    OA11CD, supgrpname, 
    Total_Population
  ) %>% 
  arrange(
    # Descending delay
    -Total_Population
  ) %>% 
  slice_head(n = 10) %>% 
  kable()

In the example above, we have used slice_head to present only the first n (in the example 10) rows in a table based on the existing order. The dplyr library also provides the functions slice_max and slice_min which incorporate the sorting functionality (see slice reference page).

As such, the following code uses slice_max to produce a table including only the 10 OAs with the highest population.

leicester_2011OAC %>%
  select(
    OA11CD, supgrpname, 
    Total_Population
  ) %>% 
  slice_max(
    Total_Population, 
    n = 10
  ) %>%
  kable()

OA11CD	supgrpname	Total_Population
E00169553	Cosmopolitans	714
E00168096	Cosmopolitans	708
E00069303	Multicultural Metropolitans	623
E00169458	Cosmopolitans	622
E00069304	Multicultural Metropolitans	608
E00169575	Cosmopolitans	597
E00069393	Ethnicity Central	591
E00169564	Cosmopolitans	591
E00069482	Multicultural Metropolitans	590
E00069216	Multicultural Metropolitans	586
E00169446	Cosmopolitans	586
E00069439	Multicultural Metropolitans	586

The following code, instead, uses slice_min, thus producing a table including only the 10 OAs with the lowest population..

leicester_2011OAC %>%
  select(
    OA11CD, supgrpname, 
    Total_Population
  ) %>% 
  slice_min(
    Total_Population, 
    n = 10
  ) %>%
  kable()

OA11CD	supgrpname	Total_Population
E00169530	Ethnicity Central	103
E00169528	Ethnicity Central	119
E00069382	Ethnicity Central	126
E00069149	Cosmopolitans	131
E00069332	Multicultural Metropolitans	132
E00168107	Multicultural Metropolitans	133
E00168094	Cosmopolitans	141
E00169486	Cosmopolitans	150
E00069184	Urbanites	158
E00169472	Constrained City Dwellers	160

In both cases, if the table contains ties, all rows containing a value that is present among the maximum or minimum selected values are presented, as is the case with the rows containing the value 21 in the example above.

3.2.5 Data manipulation workflow

Finally, the code below illustrates a more complex, multi-step operation using all the functions discussed above. This is a full example of a short analysis using only one series of pipes to read, process and write data using R and almost all the tidyverse verbs seen so far.

The input data are read from the csv file and part of the data are selected and filtered. The data are grouped, aggregated and arranged in order. The percentage of people living in OAs assigned to each 2011 OAC supergroup is calculated. The Tee pipe (from the magrittr library) is used to write the resulting table to a file, while also passing the same input to the subsequent kable function to display the data.

library(magrittr)

# Let's start from the filename
"2011_OAC_supgrp_Leicester.csv" %>% 
  # as input to the read_csv function
  read_csv(col_types = "cccci") %>% 
  # Select only the necessary columns
  select(supgrpname,    Total_Population) %>% 
  # Let's say we are not interested in
  # the Suburbanites supergroup
  filter(supgrpname != "Suburbanites") %>% 
  # Group by supergroup
  group_by(supgrpname) %>% 
  # Aggregate population
  summarise(tot_pop = sum(Total_Population)) %>% 
  # Ungroup
  ungroup() %>% 
  # Arrange by descending total population
  arrange(-tot_pop) %>% 
  # Calculate percentage
  mutate(perc_pop = (tot_pop / 329839) * 100) %T>%
  # Then use the Tee pipe %T>% in the line above
  # to write the calculated values to file
  # and pass the same input values to the kable function
  write_csv(
    "2011_Leicester_pop_per_OAC_supgrp_excl_suburb.csv"
  ) %>% 
  # Print to screen nicely
  kable()

supgrpname	tot_pop	perc_pop
Multicultural Metropolitans	205007	62.15
Hard-Pressed Living	32069	9.72
Cosmopolitans	26813	8.13
Urbanites	20224	6.13
Ethnicity Central	19137	5.80
Constrained City Dwellers	9263	2.81

3.3 Componentization

It is important to note that in the example above, no information is stored in the local environment. The input is read directly from the computer storage, the whole process is conducted in the R internal memory, and the output is saved back to the computer storage. That is the whole purpose of the pipe operator: to avoid creating unnecessary and temporary “mid-products” during the computation (i.e., variables whose only purpose is to store the data before the next step).

However, it is frequently useful to split the processing into different steps, which allow the data to be inspected mid-way through the analysis or for the data to be used twice in two subsequent steps. Finding a good balance between too long and too short pipe sequences and too many or too few “mid-products” is the programmer’s task – and something that can only be learnt by experience and observing good examples of code.

The following example showcases how the process above can be subdivided into five meaningful steps.

Read the data. This is frequently done as a separate step, as some datasets might require significant time to be read from computer storage, and it is frequently best to read them only once.
Calculate the total population per 2011 OAC supergroup, excluding the suburban population. This is a significant transformation of the original data, which might be worth saving and inspecting.
Update the new table using mutate and the assignment pipe operator %<>% (from the magrittr library).
Print the result.
Save the result to the computer storage.

leicester_2011OAC <-
  # Let's start from the filename
  "2011_OAC_supgrp_Leicester.csv" %>% 
  # as input to the read_csv function
  read_csv(col_types = "cccci")

leicester_nonsuburb_pop <-
  leicester_2011OAC %>% 
  # Select only the necessary columns
  select(supgrpname,    Total_Population) %>% 
  # Let's say we are not interested in
  # the Suburbanites supergroup
  filter(supgrpname != "Suburbanites") %>% 
  # Group by supergroup
  group_by(supgrpname) %>% 
  # Aggregate population
  summarise(tot_pop = sum(Total_Population)) %>% 
  # Ungroup
  ungroup() %>% 
  # Arrange by descending total population
  arrange(-tot_pop)

leicester_nonsuburb_pop %<>%
  # Calculate percentage
  mutate(perc_pop = (tot_pop / 329839) * 100)

leicester_nonsuburb_pop %>% 
  # Print to screen nicely
  kable(digits = c(0, 0, 2))

supgrpname	tot_pop	perc_pop
Multicultural Metropolitans	205007	62.15
Hard-Pressed Living	32069	9.72
Cosmopolitans	26813	8.13
Urbanites	20224	6.13
Ethnicity Central	19137	5.80
Constrained City Dwellers	9263	2.81

leicester_nonsuburb_pop %>% 
  # Write the calculated values to file
  write_csv(
    "2011_Leicester_pop_per_OAC_supgrp_excl_suburb.csv"
  )

3.4 Exercises 103.1

Create an RMarkdown document in RStudio, using Exercise 103 as title and PDF as output. Delete all the contents except the first five lines which compose the heading. Save the document as practical-103_exercises.Rmd. Add the libraries and code necessary to read the data from the 2011_OAC_Raw_uVariables_Leicester.csv file. Create a first section of the document (e.g., adding a second heading Exercise 103.1) and add your answers to the questions below.

In order to answer the questions below, inspect the look-up table 2011_OAC_Raw_uVariables_Lookup.csv (e.g., using Microsoft Excel) to identify the columns necessary to complete the task.

Question 103.1.1: Identify the five variables which are part of the variable subdomain Housing Type and write the code necessary to compute the total number of household spaces in Leicester for each housing type.

Question 103.1.2: Write the code necessary to compute the total number of household spaces in Leicester for each housing type grouped by 2011 OAC supergroup.

Question 103.1.3: Write the code necessary to compute the percentage of household spaces (i.e., over to the total number of household spaces) in Leicester for each housing type grouped by 2011 OAC supergroup.

Question 103.1.4: Modify the code written for Question 103.1.3, using the verb rename to change the column names of the columns containing the percentages to names that resemble the related housing type (e.g., perc_of_detached).

3.5 Exercises 103.2

Question 103.2.1: Explore the look-up table 2011_OAC_Raw_uVariables_Lookup.csv and identify another set of variables that you think might relate to the type of housing and the 2011 OAC supergroups. Create a new section of the document (e.g., adding a second heading) and include a short text (up to 200 words) in the RMarkdown document describing and justifying the set of variables you choose.

Question 103.2.2: Write the code necessary to conduct the same analysis conducted for Question 103.1 but using the variables identified in Question 103.2.1.

Question 103.2.3: Inspect the table obtained from your answer to Question 103.1.3 and the table obtained from your answer to Question 103.2.2, and compare the results. Write a short text (up to 300 words) in the RMarkdown document reporting and discussing what the results tell you about the socio-demographic structure of Leicester.

by Stefano De Sabbata – text licensed under the CC BY-SA 4.0, contains public sector information licensed under the Open Government Licence v3.0, code licensed under the GNU GPL v3.0.

2 Reproducible data science

4 Table operations