17 Selection and filtering

17.1 dplyr

The dplyr (pronounced dee-ply-er) library is part of the tidyverse and offers a grammar for data manipulation, built around a small set of verbs

  • select: select specific columns
  • filter: select specific rows
  • arrange: arrange rows in a particular order
  • summarise: calculate aggregated values (e.g., mean, max, etc.)
  • group_by: group data based on common column values
  • mutate: add columns
  • join: merge data frames (a family of functions, e.g., left_join, inner_join)
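
As a rough illustration of how the verbs listed above combine, the sketch below chains them on two small tables. The tables trips and hubs and all their values are hypothetical and invented for this example only. It assumes the dplyr library is installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
trips <- tibble(
  origin = c("JFK", "JFK", "LGA", "EWR"),
  delay  = c(10, 40, 5, 20)
)
hubs <- tibble(
  origin = c("JFK", "LGA", "EWR"),
  state  = c("NY", "NY", "NJ")
)

delay_summary <- trips %>%
  filter(delay >= 10) %>%                          # keep rows with a delay of at least 10
  group_by(origin) %>%                             # one group per origin
  summarise(mean_delay = mean(delay)) %>%          # one aggregated value per group
  mutate(mean_delay_hours = mean_delay / 60) %>%   # add a derived column
  arrange(desc(mean_delay)) %>%                    # largest mean delay first
  left_join(hubs, by = "origin") %>%               # merge in a second data frame
  select(origin, state, mean_delay)                # retain only these columns

delay_summary
```

Each verb takes a data frame as its first argument and returns a data frame, which is what allows the steps to be chained.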

17.2 Example dataset

The nycflights13 library contains the flights dataset, which stores data about all flights that departed from New York City in 2013. Its columns are listed below

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"
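
The code that produced the listing above is not shown. One way to obtain it, assuming the nycflights13 library is installed, is sketched here. The variable name flight_columns is introduced only for this example.

```r
library(nycflights13)

# Names of the columns of the flights dataset
flight_columns <- colnames(flights)
flight_columns
```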

17.3 dplyr::select

select can be used to specify which columns to retain, in the order in which they are listed

## # A tibble: 3 x 7
##   origin dest  dep_delay arr_delay  year month   day
##   <chr>  <chr>     <dbl>     <dbl> <int> <int> <int>
## 1 EWR    IAH           2        11  2013     1     1
## 2 LGA    IAH           4        20  2013     1     1
## 3 JFK    MIA           2        33  2013     1     1
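
A minimal sketch of a call that would give a table like the one above, assuming dplyr and nycflights13 are installed. The original code is not shown, so the use of slice_head to keep the first three rows and the name flights_selected are assumptions.

```r
library(dplyr)
library(nycflights13)

# Retain only the listed columns, in the listed order
flights_selected <- select(
  flights,
  origin, dest, dep_delay, arr_delay, year, month, day
)

# Show the first three rows only
slice_head(flights_selected, n = 3)
```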

17.4 dplyr::select

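The same selection can be written using the pipe operator (%>%), which passes the value on its left as the first argument of the function on its right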

## # A tibble: 3 x 7
##   origin dest  dep_delay arr_delay  year month   day
##   <chr>  <chr>     <dbl>     <dbl> <int> <int> <int>
## 1 EWR    IAH           2        11  2013     1     1
## 2 LGA    IAH           4        20  2013     1     1
## 3 JFK    MIA           2        33  2013     1     1
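
The piped form of the selection might look as follows. This is a sketch assuming dplyr and nycflights13 are installed. The name first_three is introduced only for this example.

```r
library(dplyr)
library(nycflights13)

first_three <- flights %>%
  select(origin, dest, dep_delay, arr_delay, year, month, day) %>%  # columns to retain
  slice_head(n = 3)                                                 # first three rows

first_three
```

Reading top to bottom, each line is one step applied to the result of the previous one, which avoids nesting function calls.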

17.5 Logical filtering

Logical values can be used to filter a vector, i.e. to retain only the values at the positions marked TRUE

## [1] -3 -2 -1  0  1  2  3
## [1] 0 1 2 3
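
The two lines of output above can be reproduced with base R along these lines. The original code is not shown, so the variable names a_numeric_vector and kept_values are assumptions.

```r
# A vector of the integers from -3 to 3
a_numeric_vector <- -3:3
a_numeric_vector

# Index with a logical vector of the same length:
# only the positions marked TRUE are retained
kept_values <- a_numeric_vector[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)]
kept_values
```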

17.6 Conditional filtering

As a conditional expression evaluates to a logical vector, that condition can be used directly for filtering

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [1] 1 2 3
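
A short base R sketch consistent with the output above. The condition (values greater than zero) is inferred from that output, and the variable names are assumptions.

```r
a_numeric_vector <- -3:3

# The comparison is applied element by element and yields a logical vector
is_positive <- a_numeric_vector > 0
is_positive

# The condition can be placed directly inside the square brackets
positive_values <- a_numeric_vector[a_numeric_vector > 0]
positive_values
```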

17.7 Filtering data frames

The same approach can be applied to data frames, using a condition on a column to select the rows to retain

## # A tibble: 3 x 6
##   origin dest  dep_delay  year month   day
##   <chr>  <chr>     <dbl> <int> <int> <int>
## 1 JFK    PSE           6  2013    11     1
## 2 JFK    SYR         105  2013    11     1
## 3 EWR    CLT          -5  2013    11     1
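
One way to obtain a table like the one above with square-bracket indexing, assuming nycflights13 is installed. The condition month == 11 is inferred from the output (every row shown is from November), and the names flights_subset and november_flights are introduced only for this example.

```r
library(nycflights13)

# Keep a few columns to make the result easier to read
flights_subset <- flights[, c("origin", "dest", "dep_delay", "year", "month", "day")]

# The condition yields one logical value per row;
# rows where it is TRUE are retained, all columns are kept
november_flights <- flights_subset[flights_subset$month == 11, ]

head(november_flights, 3)
```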

17.8 dplyr::filter

filter can be used to specify which rows to retain, based on a condition

## # A tibble: 3 x 6
##   origin dest  dep_delay  year month   day
##   <chr>  <chr>     <dbl> <int> <int> <int>
## 1 JFK    PSE           6  2013    11     1
## 2 JFK    SYR         105  2013    11     1
## 3 EWR    CLT          -5  2013    11     1
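
The same result might be obtained with dplyr as sketched below, assuming dplyr and nycflights13 are installed. As before, the condition month == 11 is inferred from the output, and the name november_first_three is an assumption.

```r
library(dplyr)
library(nycflights13)

november_first_three <- flights %>%
  select(origin, dest, dep_delay, year, month, day) %>%  # columns to retain
  filter(month == 11) %>%                                # rows to retain
  slice_head(n = 3)                                      # first three rows

november_first_three
```

Inside filter, column names can be used directly, without the flights$ prefix required by square-bracket indexing.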