2020-01-15
Basis of R programming
Moving on towards data science
The dplyr
(pronounced dee-ply-er) library is part of tidyverse
and it offers a grammar for data manipulation
select
: select specific columnsfilter
: select specific rowsarrange
: arrange rows in a particular ordersummarise
: calculate aggregated values (e.g., mean, max, etc)group_by
: group data based on common column valuesmutate
: add columnsjoin
: merge data frameslibrary(tidyverse)
The library nycflights13
contains a dataset storing data about all the flights departed from New York City in 2013
install.packages("nycflights13")
library(nycflights13) flights_from_nyc <- nycflights13::flights colnames(flights_from_nyc)
## [1] "year" "month" "day" "dep_time" ## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time" ## [9] "arr_delay" "carrier" "flight" "tailnum" ## [13] "origin" "dest" "air_time" "distance" ## [17] "hour" "minute" "time_hour"
select
can be used to specify which columns to retain
delays <- select(flights_from_nyc, origin, dest, dep_delay, arr_delay, year:day ) # Drop column arr_delay using - in front of the column name dep_delays <- select(delays, -arr_delay) delays[1:3, ]
## # A tibble: 3 x 7 ## origin dest dep_delay arr_delay year month day ## <chr> <chr> <dbl> <dbl> <int> <int> <int> ## 1 EWR IAH 2 11 2013 1 1 ## 2 LGA IAH 4 20 2013 1 1 ## 3 JFK MIA 2 33 2013 1 1
… using the pipe operator
dep_delays <- flights_from_nyc %>% select(origin, dest, dep_delay, arr_delay, year:day) %>% select(-arr_delay) delays[1:3, ]
## # A tibble: 3 x 7 ## origin dest dep_delay arr_delay year month day ## <chr> <chr> <dbl> <dbl> <int> <int> <int> ## 1 EWR IAH 2 11 2013 1 1 ## 2 LGA IAH 4 20 2013 1 1 ## 3 JFK MIA 2 33 2013 1 1
Conditional statements can be used to filter a vector, i.e. to retain only certain values
a_numeric_vector <- -3:3 a_numeric_vector
## [1] -3 -2 -1 0 1 2 3
a_numeric_vector[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)]
## [1] 0 1 2 3
As a condition expression results in a logic vector, that condition can be used for filtering
a_numeric_vector > 0
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE
a_numeric_vector[a_numeric_vector > 0]
## [1] 1 2 3
The same can be applied to data frames
nov_dep_delays <- dep_delays[dep_delays$month == 11, ] nov_dep_delays[1:3, ]
## # A tibble: 3 x 6 ## origin dest dep_delay year month day ## <chr> <chr> <dbl> <int> <int> <int> ## 1 JFK PSE 6 2013 11 1 ## 2 JFK SYR 105 2013 11 1 ## 3 EWR CLT -5 2013 11 1
nov_dep_delays <- dep_delays %>% filter(month == 11) # Flights in November nov_dep_delays[1:3, ]
## # A tibble: 3 x 6 ## origin dest dep_delay year month day ## <chr> <chr> <dbl> <int> <int> <int> ## 1 JFK PSE 6 2013 11 1 ## 2 JFK SYR 105 2013 11 1 ## 3 EWR CLT -5 2013 11 1
nov_dep_delays <- nov_dep_delays %>% arrange( dest, # Ascending destination name -dep_delay # Descending delay ) nov_dep_delays[1:3, ]
## # A tibble: 3 x 6 ## origin dest dep_delay year month day ## <chr> <chr> <dbl> <int> <int> <int> ## 1 JFK ABQ 25 2013 11 29 ## 2 JFK ABQ 21 2013 11 22 ## 3 JFK ABQ 17 2013 11 21
summarise
: calculate aggregated values (e.g., mean, max, etc)
aggr_dep_delays_nov <- nov_dep_delays %>% # Need to filter out rows where delay is NA filter(!is.na(dep_delay)) %>% # Create two aggregated columns summarize( avg_dep_delay = mean(dep_delay), tot_dep_delay = sum(dep_delay) ) aggr_dep_delays_nov
## # A tibble: 1 x 2 ## avg_dep_delay tot_dep_delay ## <dbl> <dbl> ## 1 5.44 146945
dest_dep_delays_nov <- nov_dep_delays %>% # Need to filter out rows where delay is NA filter(!is.na(dep_delay)) %>% # Frist group by same destination group_by(dest) %>% # Then caluculate aggregated value summarize( tot_dep_delay = sum(dep_delay) ) dest_dep_delays_nov[1:3, ]
## # A tibble: 3 x 2 ## dest tot_dep_delay ## <chr> <dbl> ## 1 ABQ -66 ## 2 ALB 636 ## 3 ATL 8184
dest_dep_delays_nov <- dest_dep_delays_nov %>% mutate( tot_dep_delay_days = ((tot_dep_delay / 60) /24) ) dest_dep_delays_nov[1:3, ]
## # A tibble: 3 x 3 ## dest tot_dep_delay tot_dep_delay_days ## <chr> <dbl> <dbl> ## 1 ABQ -66 -0.0458 ## 2 ALB 636 0.442 ## 3 ATL 8184 5.68
dest_dep_delays_nov <- nycflights13::flights %>% select(origin, dest, dep_delay, arr_delay, year:day) %>% select(-arr_delay) %>% filter(month == 11) %>% filter(!is.na(dep_delay)) %>% arrange(dest, -dep_delay) %>% group_by(dest) %>% summarize(tot_dep_delay = sum(dep_delay)) %>% mutate(tot_dep_delay_days = ((tot_dep_delay / 60) /24)) dest_dep_delays_nov[1:3, ]
## # A tibble: 3 x 3 ## dest tot_dep_delay tot_dep_delay_days ## <chr> <dbl> <dbl> ## 1 ABQ -66 -0.0458 ## 2 ALB 636 0.442 ## 3 ATL 8184 5.68
Moving on towards data science
In the practical session we will see
Moving on towards data science