17 Selection and filtering
17.1 dplyr
The dplyr library (pronounced dee-ply-er) is part of the tidyverse and offers a grammar for data manipulation:
select: select specific columns
filter: select specific rows
arrange: arrange rows in a particular order
summarise: calculate aggregated values (e.g., mean, max, etc.)
group_by: group data based on common column values
mutate: add columns
join: merge data frames
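As a rough illustration of how these verbs combine, the sketch below chains several of them on a small, made-up data frame. The data frame `toy` and the column `dep_delay_hours` are hypothetical names introduced only for this example.

```r
library(dplyr)

# Hypothetical toy data, not taken from the text
toy <- data.frame(
  origin = c("JFK", "JFK", "EWR", "LGA"),
  dep_delay = c(6, 105, -5, 4)
)

toy_summary <- toy %>%
  # filter: keep only rows with a positive delay
  filter(dep_delay > 0) %>%
  # mutate: add a column expressing the delay in hours
  mutate(dep_delay_hours = dep_delay / 60) %>%
  # group_by: group rows sharing the same origin
  group_by(origin) %>%
  # summarise: one aggregated row per group
  summarise(mean_delay = mean(dep_delay)) %>%
  # arrange: sort rows, largest mean delay first
  arrange(desc(mean_delay))

toy_summary
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes this kind of chaining possible.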
17.2 Example dataset
The nycflights13 library contains a dataset with data about all flights that departed from New York City in 2013.
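The column names listed next can be obtained along the following lines. This is a minimal sketch: the assignment to `flights_from_nyc` is an assumption, based on the variable name used in the later sections.

```r
library(nycflights13)

# Assumed assignment: later sections refer to this data frame as flights_from_nyc
flights_from_nyc <- nycflights13::flights

# List the column names of the dataset
colnames(flights_from_nyc)
```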
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
17.3 dplyr::select
select can be used to specify which columns to retain.
delays <- select(flights_from_nyc,
  origin, dest, dep_delay, arr_delay,
  year:day
)
# Drop column arr_delay using - in front of the column name
dep_delays <- select(delays, -arr_delay)
delays[1:3, ]
## # A tibble: 3 x 7
## origin dest dep_delay arr_delay year month day
## <chr> <chr> <dbl> <dbl> <int> <int> <int>
## 1 EWR IAH 2 11 2013 1 1
## 2 LGA IAH 4 20 2013 1 1
## 3 JFK MIA 2 33 2013 1 1
17.4 dplyr::select
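The same steps can be chained using the pipe operator %>%, which passes the result on its left as the first argument of the function on its right.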
dep_delays <- flights_from_nyc %>%
  select(origin, dest, dep_delay, arr_delay, year:day) %>%
  select(-arr_delay)
delays[1:3, ]
## # A tibble: 3 x 7
## origin dest dep_delay arr_delay year month day
## <chr> <chr> <dbl> <dbl> <int> <int> <int>
## 1 EWR IAH 2 11 2013 1 1
## 2 LGA IAH 4 20 2013 1 1
## 3 JFK MIA 2 33 2013 1 1
17.5 Logical filtering
Logical values can be used to filter a vector, i.e. to retain only the values at positions marked TRUE.
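A minimal sketch of the kind of code that would produce the two lines of output that follow. The variable name `a_numeric_vector` is hypothetical; the values are chosen to match that output.

```r
# Hypothetical variable name; the values match the printed output
a_numeric_vector <- -3:3
a_numeric_vector

# Index with a logical vector: only positions marked TRUE are retained
a_numeric_vector[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)]
```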
## [1] -3 -2 -1 0 1 2 3
## [1] 0 1 2 3
17.6 Conditional filtering
Because a conditional expression evaluates to a logical vector, the condition itself can be used for filtering.
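A minimal sketch, assuming the same hypothetical vector of integers from -3 to 3, of code that would produce the two lines of output that follow.

```r
# Hypothetical variable name; the values match the printed output
a_numeric_vector <- -3:3

# The comparison is applied element by element and returns a logical vector
a_numeric_vector > 0

# Using the condition as an index keeps only the elements for which it is TRUE
a_numeric_vector[a_numeric_vector > 0]
```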
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [1] 1 2 3
17.7 Filtering data frames
The same approach can be applied to data frames, where the condition selects which rows to retain.
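The output that follows shows only flights from November, so a plausible sketch is the one below. The condition `month == 11` and the name `delays_nov` are assumptions inferred from that output, not taken from the text.

```r
library(dplyr)
library(nycflights13)

# Rebuild the table of departure delays so that the example is self-contained
dep_delays <- nycflights13::flights %>%
  select(origin, dest, dep_delay, year:day)

# Rows for which the condition is TRUE are kept;
# leaving the slot after the comma empty keeps all columns
delays_nov <- dep_delays[dep_delays$month == 11, ]
delays_nov[1:3, ]
```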
## # A tibble: 3 x 6
## origin dest dep_delay year month day
## <chr> <chr> <dbl> <int> <int> <int>
## 1 JFK PSE 6 2013 11 1
## 2 JFK SYR 105 2013 11 1
## 3 EWR CLT -5 2013 11 1
17.8 dplyr::filter
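The same row selection can be expressed with filter, which takes a condition on the columns and retains the rows for which it is TRUE. The sketch below assumes the condition `month == 11` and the name `delays_nov`, both inferred from the output that follows.

```r
library(dplyr)
library(nycflights13)

delays_nov <- nycflights13::flights %>%
  # Retain only the columns of interest
  select(origin, dest, dep_delay, year:day) %>%
  # Retain only the rows for flights in November
  filter(month == 11)

delays_nov[1:3, ]
```

Unlike the square-bracket form, filter refers to columns by name directly, without repeating the name of the data frame.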
## # A tibble: 3 x 6
## origin dest dep_delay year month day
## <chr> <chr> <dbl> <int> <int> <int>
## 1 JFK PSE 6 2013 11 1
## 2 JFK SYR 105 2013 11 1
## 3 EWR CLT -5 2013 11 1