2020-01-15
Moving on towards data science
Moving on towards data science
Data frames can be joined (or ‘merged’)
merge base function or join functions in dplyr
employees <- data.frame(
Name = c("Maria", "Pete", "Sarah", "Jo"),
Age = c(47, 34, 32, 25),
Role = c("Professor", "Researcher", "Researcher", "Postgrad")
)
city_of_birth <-data.frame(
Name = c("Maria", "Pete", "Sarah", "Mel"),
City = c("Barcelona", "London", "Boston", "Los Angeles")
)
| Name | Age | Role |
|---|---|---|
| Maria | 47 | Professor |
| Pete | 34 | Researcher |
| Sarah | 32 | Researcher |
| Jo | 25 | Postgrad |
| Name | City |
|---|---|
| Maria | Barcelona |
| Pete | London |
| Sarah | Boston |
| Mel | Los Angeles |
dplyr provides a series of join functions
full_join combines all the available dataemployees %>% full_join(
city_of_birth,
by = c("Name" = "Name") # join columns
) %>%
kable()
| Name | Age | Role | City |
|---|---|---|---|
| Maria | 47 | Professor | Barcelona |
| Pete | 34 | Researcher | London |
| Sarah | 32 | Researcher | Boston |
| Jo | 25 | Postgrad | NA |
| Mel | NA | NA | Los Angeles |
left_join keeps all the data from the left table
%>%, that’s the data “coming down the pipe”employees %>% left_join(
city_of_birth,
by = c("Name" = "Name") # join columns
) %>%
kable()
| Name | Age | Role | City |
|---|---|---|---|
| Maria | 47 | Professor | Barcelona |
| Pete | 34 | Researcher | London |
| Sarah | 32 | Researcher | Boston |
| Jo | 25 | Postgrad | NA |
right_join keeps all the data from the right table
%>%, that’s the data provided as an argumentemployees %>% right_join(
city_of_birth,
by = c("Name" = "Name") # join columns
) %>%
kable()
| Name | Age | Role | City |
|---|---|---|---|
| Maria | 47 | Professor | Barcelona |
| Pete | 34 | Researcher | London |
| Sarah | 32 | Researcher | Boston |
| Mel | NA | NA | Los Angeles |
inner_join keeps only rows that have a match in both tablesemployees %>% inner_join(
city_of_birth,
by = c("Name" = "Name") # join columns
) %>%
kable()
| Name | Age | Role | City |
|---|---|---|---|
| Maria | 47 | Professor | Barcelona |
| Pete | 34 | Researcher | London |
| Sarah | 32 | Researcher | Boston |
This is the most common approach
| City | Population | Area | Density |
|---|---|---|---|
| Leicester | 329,839 | 73.3 | 4,500 |
| Nottingham | 321,500 | 74.6 | 4,412 |
This is probably a less common approach, but still necessary in many cases
| City | Attribute | Value |
|---|---|---|
| Leicester | Population | 329,839 |
| Leicester | Area | 73.3 |
| Leicester | Density | 4,500 |
| Nottingham | Population | 321,500 |
| Nottingham | Area | 74.6 |
| Nottingham | Density | 4,412 |
The tidyr (pronounced tidy-er) library is part of tidyverse and it provides functions to re-shape your data
city_info_wide <- data.frame(
City = c("Leicester", "Nottingham"),
Population = c(329839, 321500),
Area = c(73.3, 74.6),
Density = c(4500, 4412)
)
kable(city_info_wide)
| City | Population | Area | Density |
|---|---|---|---|
| Leicester | 329839 | 73.3 | 4500 |
| Nottingham | 321500 | 74.6 | 4412 |
Re-shape from wide to long format
city_info_long <- city_info_wide %>%
gather(
-City, # exclude city names from gathering
key = "Attribute", # name for the new key column
value = "Value" # name for the new value column
)
| City | Attribute | Value |
|---|---|---|
| Leicester | Population | 329839.0 |
| Nottingham | Population | 321500.0 |
| Leicester | Area | 73.3 |
| Nottingham | Area | 74.6 |
| Leicester | Density | 4500.0 |
| Nottingham | Density | 4412.0 |
Rre-shape from long to wide format
city_info_back_to_wide <- city_info_long %>%
spread(
key = "Attribute", # specify key column
value = "Value" # specify value column
)
| City | Area | Density | Population |
|---|---|---|---|
| Leicester | 73.3 | 4500 | 329839 |
| Nottingham | 74.6 | 4412 | 321500 |
The file 2011_OAC_Raw_uVariables_Leicester.csv - contains data used for the 2011 Output Area Classificagtion - 167 variables, as well as the resulting groups - for the city of Leicester
Extract showing only some columns
OA11CD,LSOA11CD, ... supgrpcode,supgrpname,Total_Population, ... E00069517,E01013785, ... 6,Suburbanites,313, ... E00169516,E01013713, ... 4,Multicultural Metropolitans,341, ... E00169048,E01032862, ... 4,Multicultural Metropolitans,345, ...
The full variable names can be found in the file - 2011_OAC_Raw_uVariables_Lookup.csv.
The read_csv function of the readr library reads a csv file from the path provided as the first argument
leicester_2011OAC <- read_csv("2011_OAC_Raw_uVariables_Leicester.csv")
leicester_2011OAC %>%
select(OA11CD,LSOA11CD, supgrpcode,supgrpname,Total_Population) %>%
top_n(3) %>%
kable()
| OA11CD | LSOA11CD | supgrpcode | supgrpname | Total_Population |
|---|---|---|---|---|
| E00169553 | E01013648 | 2 | Cosmopolitans | 714 |
| E00069303 | E01013739 | 4 | Multicultural Metropolitans | 623 |
| E00168096 | E01013689 | 2 | Cosmopolitans | 708 |
The function write_csv can be used to save a dataset to csv
Example:
6)read_csv("2011_OAC_Raw_uVariables_Leicester.csv") %>%
select(OA11CD, supgrpcode, supgrpname) %>%
filter(supgrpcode == 6) %>%
write_csv("~/Leicester_Suburbanites.csv")
Moving on towards data science
In the practical session, we will see
Reproducibility in (geographic) data science
knitr and rmarkdown