2021-10-03

## Recap

Prev: Centroid-based clustering

• K-means
• Fuzzy c-means
• Geodemographic classification

Now: Hierarchical and density-based clustering

• Hierarchical
• Mixed
• Density-based

## Example

Can we automatically identify the two groups visible in the scatterplot, without any previous knowledge of the groups?

```penguins_norm <-
palmerpenguins::penguins %>%
dplyr::filter(
species %in%
) %>%
dplyr::filter(
!is.na(body_mass_g) |
!is.na(bill_depth_mm)
) %>%
dplyr::mutate(
body_mass_norm =
scale(body_mass_g),
bill_depth_norm =
scale(bill_depth_mm)
)```

## Hierarchical clustering

Bottom-up approach

• rather than splitting objects into clusters
• aggregate from single objects upwards

Algorithm:

• each object is initialised as itâ€™s own cluster
• then repeat
• join the two most similar clusters
• based on a distance-based metric
• e.g., Wardâ€™s (1963) approach is based on variance
• until only one single cluster is achieved

Limitation: computationally expensive

## stats::hclust

```penguins_hclust_result <-
penguins_norm %>%
dplyr::select(
body_mass_norm,
bill_depth_norm
) %>%
# Calculate distance matrix
stats::dist(method="euclidean") %>%
# Cluster data
stats::hclust(method="ward.D2")

penguins_bm_bd_hclust <- penguins_norm %>%
bm_bd_hclust = stats::cutree(
penguins_hclust_result,
k = 2
)
)```

## clustering tree

Generates a clustering tree (dendrogram), which can then be “cut” at the desired height

`## integer(0)`

## Bagged clustering

Bootstrap aggregating (b-agg-ed) clustering approach

• first, k-means
• randomly select a sample
• calculate K-means
• repeat on many samples
• then, hierarchical
• execute hierarchical clustering on the centroids of the clusters generated in the previous step
• finally
• select required number of clusters
• assign object to closest centroid

Leisch, F., 1999. Bagged clustering.

## e1071::bclust

```penguins_bclust_result <-
penguins_norm %>%
dplyr::select(body_mass_norm, bill_depth_norm) %>%
e1071::bclust(
hclust.method = "ward.D2",
resample = TRUE
)

penguins_bm_bd_bclust <-
penguins_norm %>%
bm_bd_bclust = e1071::clusters.bclust(
penguins_bclust_result,
2
)
)```

## Density based clustering

Density-based spatial clustering of applications with noise (DBSCAN)

• start from a random unclustered point
• proceed by aggregating its neighbours to the same cluster
• as long as they are within a certain distance `eps`
• once no more objects can be added
• select another random point
• and start aggregating again to a new cluster

Limitation: selection of `eps`

Ester, M., Kriegel, H.P., Sander, J. and Xu, X., 1996, August. Density-based spatial clustering of applications with noise. In Int. Conf. Knowledge Discovery and Data Mining (Vol. 240, p.Â 6).

## dbscan::dbscan

```penguins_dbscan_result <-
penguins_norm %>%
dplyr::select(body_mass_norm, bill_depth_norm) %>%
dbscan::dbscan(
eps = 1,
minPts = 5
)

penguins_bm_bd_dbscan <-
penguins_norm %>%
Using: `dbscan(eps = 1, minPts = 5)`
Using: `dbscan(eps = 0.5, minPts = 5)`