2021-10-03

Hierarchical and density-based clustering

Recap

Prev: Centroid-based clustering

  • K-means
  • Fuzzy c-means
  • Geodemographic classification

Now: Hierarchical and density-based clustering

  • Hierarchical
  • Mixed
  • Density-based

Example

Can we automatically identify the two groups visible in the scatterplot, without any previous knowledge of the groups?

penguins_norm <-
  palmerpenguins::penguins %>%
  dplyr::filter(
    species %in%
      c("Adelie", "Gentoo")
  ) %>%
  dplyr::filter(
    !is.na(body_mass_g) | 
      !is.na(bill_depth_mm)
  ) %>%
  dplyr::mutate(
    body_mass_norm = 
      scale(body_mass_g),
    bill_depth_norm = 
      scale(bill_depth_mm)
  )

Hierarchical clustering

Bottom-up approach

  • rather than splitting objects into clusters
  • aggregate from single objects upwards

Algorithm:

  • each object is initialised as it’s own cluster
  • then repeat
    • join the two most similar clusters
      • based on a distance-based metric
      • e.g., Ward’s (1963) approach is based on variance
    • until only one single cluster is achieved

Limitation: computationally expensive

stats::hclust

penguins_hclust_result <- 
  penguins_norm %>%
  dplyr::select(
    body_mass_norm, 
    bill_depth_norm
  ) %>%
  # Calculate distance matrix
  stats::dist(method="euclidean") %>%
  # Cluster data
  stats::hclust(method="ward.D2")

penguins_bm_bd_hclust <- penguins_norm %>%
  tibble::add_column(
    bm_bd_hclust = stats::cutree(
      penguins_hclust_result, 
      k = 2
    )
  )

clustering tree

Generates a clustering tree (dendrogram), which can then be “cut” at the desired height

## integer(0)

Hierarchical clustering result

Bagged clustering

Bootstrap aggregating (b-agg-ed) clustering approach

  • first, k-means
    • randomly select a sample
    • calculate K-means
    • repeat on many samples
  • then, hierarchical
    • execute hierarchical clustering on the centroids of the clusters generated in the previous step
  • finally
    • select required number of clusters
    • assign object to closest centroid

Leisch, F., 1999. Bagged clustering.

e1071::bclust

penguins_bclust_result <- 
  penguins_norm %>%
  dplyr::select(body_mass_norm, bill_depth_norm) %>%
  e1071::bclust(
    hclust.method = "ward.D2", 
    resample = TRUE
  )

penguins_bm_bd_bclust <- 
  penguins_norm %>%
  tibble::add_column(
    bm_bd_bclust = e1071::clusters.bclust(
      penguins_bclust_result, 
      2
    )
  )

Bagged clustering result

Density based clustering

Density-based spatial clustering of applications with noise (DBSCAN)

  • start from a random unclustered point
    • proceed by aggregating its neighbours to the same cluster
      • as long as they are within a certain distance eps
  • once no more objects can be added
    • select another random point
    • and start aggregating again to a new cluster

Limitation: selection of eps

Ester, M., Kriegel, H.P., Sander, J. and Xu, X., 1996, August. Density-based spatial clustering of applications with noise. In Int. Conf. Knowledge Discovery and Data Mining (Vol. 240, p. 6).

dbscan::dbscan

penguins_dbscan_result <- 
  penguins_norm %>%
  dplyr::select(body_mass_norm, bill_depth_norm) %>%
  dbscan::dbscan(
    eps = 1, 
    minPts = 5
  )

penguins_bm_bd_dbscan <- 
  penguins_norm %>%
  tibble::add_column(
    bm_bd_dbscan = 
      penguins_dbscan_result %$% 
      cluster
  )

DBSCAN result

Using: dbscan(eps = 1, minPts = 5)

DBSCAN result

Using: dbscan(eps = 0.5, minPts = 5)