2020-01-15
Moving on towards data science
Moving on towards data science
Data frames can be joined (or ‘merged’)
merge
base function or join functions in dplyr
employees <- data.frame( Name = c("Maria", "Pete", "Sarah", "Jo"), Age = c(47, 34, 32, 25), Role = c("Professor", "Researcher", "Researcher", "Postgrad") ) city_of_birth <-data.frame( Name = c("Maria", "Pete", "Sarah", "Mel"), City = c("Barcelona", "London", "Boston", "Los Angeles") )
Name | Age | Role |
---|---|---|
Maria | 47 | Professor |
Pete | 34 | Researcher |
Sarah | 32 | Researcher |
Jo | 25 | Postgrad |
Name | City |
---|---|
Maria | Barcelona |
Pete | London |
Sarah | Boston |
Mel | Los Angeles |
dplyr
provides a series of join functions
full_join
combines all the available dataemployees %>% full_join( city_of_birth, by = c("Name" = "Name") # join columns ) %>% kable()
Name | Age | Role | City |
---|---|---|---|
Maria | 47 | Professor | Barcelona |
Pete | 34 | Researcher | London |
Sarah | 32 | Researcher | Boston |
Jo | 25 | Postgrad | NA |
Mel | NA | NA | Los Angeles |
left_join
keeps all the data from the left table
%>%
, that’s the data “coming down the pipe”employees %>% left_join( city_of_birth, by = c("Name" = "Name") # join columns ) %>% kable()
Name | Age | Role | City |
---|---|---|---|
Maria | 47 | Professor | Barcelona |
Pete | 34 | Researcher | London |
Sarah | 32 | Researcher | Boston |
Jo | 25 | Postgrad | NA |
right_join
keeps all the data from the right table
%>%
, that’s the data provided as an argumentemployees %>% right_join( city_of_birth, by = c("Name" = "Name") # join columns ) %>% kable()
Name | Age | Role | City |
---|---|---|---|
Maria | 47 | Professor | Barcelona |
Pete | 34 | Researcher | London |
Sarah | 32 | Researcher | Boston |
Mel | NA | NA | Los Angeles |
inner_join
keeps only rows that have a match in both tablesemployees %>% inner_join( city_of_birth, by = c("Name" = "Name") # join columns ) %>% kable()
Name | Age | Role | City |
---|---|---|---|
Maria | 47 | Professor | Barcelona |
Pete | 34 | Researcher | London |
Sarah | 32 | Researcher | Boston |
This is the most common approach
City | Population | Area | Density |
---|---|---|---|
Leicester | 329,839 | 73.3 | 4,500 |
Nottingham | 321,500 | 74.6 | 4,412 |
This is probably a less common approach, but still necessary in many cases
City | Attribute | Value |
---|---|---|
Leicester | Population | 329,839 |
Leicester | Area | 73.3 |
Leicester | Density | 4,500 |
Nottingham | Population | 321,500 |
Nottingham | Area | 74.6 |
Nottingham | Density | 4,412 |
The tidyr
(pronounced tidy-er) library is part of tidyverse
and it provides functions to re-shape your data
city_info_wide <- data.frame( City = c("Leicester", "Nottingham"), Population = c(329839, 321500), Area = c(73.3, 74.6), Density = c(4500, 4412) ) kable(city_info_wide)
City | Population | Area | Density |
---|---|---|---|
Leicester | 329839 | 73.3 | 4500 |
Nottingham | 321500 | 74.6 | 4412 |
Re-shape from wide to long format
city_info_long <- city_info_wide %>% gather( -City, # exclude city names from gathering key = "Attribute", # name for the new key column value = "Value" # name for the new value column )
City | Attribute | Value |
---|---|---|
Leicester | Population | 329839.0 |
Nottingham | Population | 321500.0 |
Leicester | Area | 73.3 |
Nottingham | Area | 74.6 |
Leicester | Density | 4500.0 |
Nottingham | Density | 4412.0 |
Rre-shape from long to wide format
city_info_back_to_wide <- city_info_long %>% spread( key = "Attribute", # specify key column value = "Value" # specify value column )
City | Area | Density | Population |
---|---|---|---|
Leicester | 73.3 | 4500 | 329839 |
Nottingham | 74.6 | 4412 | 321500 |
The file 2011_OAC_Raw_uVariables_Leicester.csv
- contains data used for the 2011 Output Area Classificagtion - 167 variables, as well as the resulting groups - for the city of Leicester
Extract showing only some columns
OA11CD,LSOA11CD, ... supgrpcode,supgrpname,Total_Population, ... E00069517,E01013785, ... 6,Suburbanites,313, ... E00169516,E01013713, ... 4,Multicultural Metropolitans,341, ... E00169048,E01032862, ... 4,Multicultural Metropolitans,345, ...
The full variable names can be found in the file - 2011_OAC_Raw_uVariables_Lookup.csv
.
The read_csv
function of the readr
library reads a csv file from the path provided as the first argument
leicester_2011OAC <- read_csv("2011_OAC_Raw_uVariables_Leicester.csv") leicester_2011OAC %>% select(OA11CD,LSOA11CD, supgrpcode,supgrpname,Total_Population) %>% top_n(3) %>% kable()
OA11CD | LSOA11CD | supgrpcode | supgrpname | Total_Population |
---|---|---|---|---|
E00169553 | E01013648 | 2 | Cosmopolitans | 714 |
E00069303 | E01013739 | 4 | Multicultural Metropolitans | 623 |
E00168096 | E01013689 | 2 | Cosmopolitans | 708 |
The function write_csv
can be used to save a dataset to csv
Example:
6
)read_csv("2011_OAC_Raw_uVariables_Leicester.csv") %>% select(OA11CD, supgrpcode, supgrpname) %>% filter(supgrpcode == 6) %>% write_csv("~/Leicester_Suburbanites.csv")
Moving on towards data science
In the practical session, we will see
Reproducibility in (geographic) data science
knitr
and rmarkdown