2020-01-15
“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”
Mitchell, T. (1997). Machine Learning. McGraw Hill.
Mitchell, T. M. (2006). The discipline of machine learning (Vol. 9). Pittsburgh, PA: Carnegie Mellon University, School of Computer Science, Machine Learning Department.
Machine learning approaches are divided into two main types: supervised and unsupervised
Neural networks: supervised learning approach simulating simplistic neurons
Deep neural networks: neural networks with multiple hidden layers
The fundamental idea is that “deeper” neurons allow for the encoding of more complex characteristics
Example: De Sabbata, S. and Liu, P. (2019). Deep learning geodemographics with autoencoders and geographic convolution. In proceedings of the 22nd AGILE Conference on Geographic Information Science, Limassol, Cyprus.
Convolutional neural networks: deep neural networks with convolutional hidden layers
Example: Liu, P. and De Sabbata, S. (2019). Learning Digital Geographies through a Graph-Based Semi-supervised Approach. In proceedings of the 15th International Conference on GeoComputation, Queenstown, New Zealand.
"Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items" (Lantz, 2019).
Methods: k-means, fuzzy c-means, hierarchical clustering, bagged clustering, DBSCAN
# Simulated example data: three groups of points drawn from normal distributions
data_to_cluster <- data.frame(
  x_values = c(rnorm(40, 5, 1), rnorm(60, 10, 1), rnorm(20, 12, 3)),
  y_values = c(rnorm(40, 5, 1), rnorm(60, 5, 3), rnorm(20, 15, 1)),
  original_group = c(rep("A", 40), rep("B", 60), rep("C", 20))
)
k-means clusters n observations into k clusters, minimising the within-cluster sum of squares (WCSS)
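As a small illustration of what the WCSS measures, the sketch below computes it by hand for a k-means result and compares it with the value reported by base R's `kmeans`. The toy data and the names `toy_points`, `toy_kmeans`, `assigned_centroids` and `wcss_by_hand` are made up for this example.

```r
# Toy data: two well-separated groups of 20 points each (made-up values)
set.seed(42)
toy_points <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

toy_kmeans <- kmeans(toy_points, centers = 2, nstart = 10)

# WCSS: squared Euclidean distance of each observation from the
# centroid of the cluster it is assigned to, summed over all observations
assigned_centroids <- toy_kmeans$centers[toy_kmeans$cluster, , drop = FALSE]
wcss_by_hand <- sum((toy_points - assigned_centroids)^2)

# The hand-computed value should match the one stored by kmeans
all.equal(wcss_by_hand, toy_kmeans$tot.withinss)
```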
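Algorithm: k observations are randomly selected as initial centroids, then repeat
assignment step: each observation is assigned to its closest centroid
update step: the mean of each cluster is calculated as its new centroid
until the centroids no longer change, at which point the algorithm has converged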
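The loop can be sketched in a few lines of base R. This is a minimal illustration only, not the implementation behind `kmeans` (which uses a more refined algorithm by default); the function `simple_kmeans` and the toy data are hypothetical names introduced here.

```r
simple_kmeans <- function(points, k, max_iterations = 50) {
  points <- as.matrix(points)
  # k observations are randomly selected as initial centroids
  centroids <- points[sample(nrow(points), k), , drop = FALSE]
  for (iteration in seq_len(max_iterations)) {
    # Assignment step: squared distance of every point to every centroid,
    # then pick the closest centroid for each point
    squared_distances <- sapply(
      seq_len(k),
      function(j) rowSums(sweep(points, 2, centroids[j, ])^2)
    )
    cluster <- max.col(-squared_distances, ties.method = "first")
    # Update step: each new centroid is the mean of its assigned points
    new_centroids <- centroids
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        new_centroids[j, ] <- colMeans(points[cluster == j, , drop = FALSE])
      }
    }
    # Converged once the centroids no longer change
    if (isTRUE(all.equal(new_centroids, centroids))) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# Toy data: two well-separated groups of 20 points each (made-up values)
set.seed(1)
toy_points <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)
toy_result <- simple_kmeans(toy_points, k = 2)
table(toy_result$cluster)
```

With groups this far apart, the two clusters found should correspond to the two groups, although which one is labelled 1 or 2 depends on the random initial centroids.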
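# The pipe, select and add_column are provided by the tidyverse
library(tidyverse)

# Run k-means on the two coordinate columns, looking for three clusters
kmeans_found_clusters <- data_to_cluster %>%
  select(x_values, y_values) %>%
  kmeans(centers = 3, iter.max = 50)

# Store the cluster assigned to each observation
data_to_cluster <- data_to_cluster %>%
  add_column(kmeans_cluster = kmeans_found_clusters$cluster)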
Fuzzy c-means is similar to k-means but allows for "fuzzy" membership to clusters
Each observation is assigned a membership value for each cluster, ranging from 0 to 1
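To give a sense of where such values come from, the sketch below applies the standard fuzzy c-means membership formula to a fixed set of centroids: memberships are inversely related to the distance from each centroid and sum to 1 for every observation. It is an illustration under assumptions rather than the internals of `cmeans`; the function `fuzzy_memberships`, the `fuzzifier` argument (set to 2 here) and the toy values are hypothetical.

```r
fuzzy_memberships <- function(points, centroids, fuzzifier = 2) {
  points <- as.matrix(points)
  centroids <- as.matrix(centroids)
  # Euclidean distance of every point (rows) from every centroid (columns)
  distances <- sapply(
    seq_len(nrow(centroids)),
    function(j) sqrt(rowSums(sweep(points, 2, centroids[j, ])^2))
  )
  # Closer centroids receive larger weights
  # (assumes no point coincides exactly with a centroid)
  weights <- distances^(-2 / (fuzzifier - 1))
  # Normalise so that the memberships of each point sum to 1
  weights / rowSums(weights)
}

# Toy example: two centroids and three points along a line (made-up values)
toy_centroids <- rbind(c(0, 0), c(10, 0))
toy_points <- rbind(c(1, 0), c(5, 0), c(9, 0))
toy_memberships <- fuzzy_memberships(toy_points, toy_centroids)
round(toy_memberships, 3)
```

The point halfway between the two centroids receives equal membership in both clusters, while the points close to a centroid belong almost entirely to that cluster.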
library(e1071)

cmeans_result <- data_to_cluster %>%
  select(x_values, y_values) %>%
  cmeans(centers = 3, iter.max = 50)

data_to_cluster <- data_to_cluster %>%
  add_column(c_means_assigned_cluster = cmeans_result$cluster)
A “crisp” classification can be created by picking the highest membership value.
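A membership threshold (e.g., 0.75) can also be set, leaving observations below the threshold without a cluster (coded as 0).
data_to_cluster <- data_to_cluster %>%
  add_column(
    # Highest membership value of each observation
    c_means_membership = apply(cmeans_result$membership, 1, max)
  ) %>%
  mutate(
    # Keep the assigned cluster only above the threshold, 0 otherwise
    c_means_cluster = ifelse(
      c_means_membership > 0.75,
      c_means_assigned_cluster,
      0
    )
  )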
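Hierarchical clustering algorithm: each object is initialised as its own cluster, then repeat
join the two most similar clusters, based on a distance-based metric (e.g., Ward's criterion)
until only a single cluster is left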
hclust_result <- data_to_cluster %>%
  select(x_values, y_values) %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D2")

data_to_cluster <- data_to_cluster %>%
  add_column(hclust_cluster = cutree(hclust_result, k = 3))
This approach generates a clustering tree (dendrogram), which can then be “cut” at the desired height
# Base R plots are not combined with +; abline draws on the current plot
plot(hclust_result)
abline(h = 30, col = "red")
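Cutting the tree can be done either by number of clusters or by height, both through `cutree`. The sketch below uses a made-up toy dataset (`toy_points`) with three tight, well-separated groups, so that a wide range of heights yields the same three clusters; the height of 1 is chosen for this toy data only.

```r
# Toy data: three tight, well-separated groups on a line (made-up values)
toy_points <- data.frame(
  x_values = c(0, 0.1, 0.2, 5, 5.1, 5.2, 20, 20.1, 20.2),
  y_values = 0
)
toy_hclust <- hclust(dist(toy_points, method = "euclidean"), method = "ward.D2")

# Heights at which clusters were merged, from the first to the last merge
toy_hclust$height

# Cutting by number of clusters...
clusters_by_k <- cutree(toy_hclust, k = 3)
# ...or by height: a cut above the within-group merges but below
# the between-group merges returns the same three groups
clusters_by_height <- cutree(toy_hclust, h = 1)
all(clusters_by_k == clusters_by_height)
```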
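Bagged clustering is a bootstrap aggregating (b-agg-ed) clustering approach (Leisch, 1999): a partitioning algorithm such as k-means is run repeatedly on bootstrap samples, and the resulting centroids are then combined through hierarchical clustering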
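The idea can be sketched in base R in three steps: k-means on bootstrap samples, hierarchical clustering of the collected centroids, and assignment of each observation to the cluster of its closest centroid. This is a simplified illustration, not the `bclust` implementation; the function `simple_bagged_clustering`, its arguments `base_centers` and `n_samples`, and the toy data are hypothetical.

```r
simple_bagged_clustering <- function(points, k, base_centers = 5, n_samples = 10) {
  points <- as.matrix(points)
  # 1. Run k-means on bootstrap samples and collect all the centroids
  all_centroids <- do.call(rbind, lapply(seq_len(n_samples), function(i) {
    resampled_rows <- sample(nrow(points), replace = TRUE)
    kmeans(points[resampled_rows, , drop = FALSE], centers = base_centers)$centers
  }))
  # 2. Hierarchical clustering of the collected centroids, cut into k groups
  centroid_cluster <- cutree(
    hclust(dist(all_centroids), method = "ward.D2"),
    k = k
  )
  # 3. Each observation takes the cluster of its closest centroid
  closest_centroid <- apply(points, 1, function(p) {
    which.min(colSums((t(all_centroids) - p)^2))
  })
  unname(centroid_cluster[closest_centroid])
}

# Toy data: three groups of 20 points each (made-up values)
set.seed(7)
toy_points <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 20, sd = 0.5), ncol = 2)
)
toy_bagged_cluster <- simple_bagged_clustering(toy_points, k = 3)
table(toy_bagged_cluster)
```

Because both the bootstrap samples and the k-means runs are random, the result can vary from run to run.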
library(e1071)

bclust_result <- data_to_cluster %>%
  select(x_values, y_values) %>%
  bclust(hclust.method = "ward.D2", resample = TRUE)

data_to_cluster <- data_to_cluster %>%
  add_column(bclust_cluster = clusters.bclust(bclust_result, 3))
DBSCAN (“density-based spatial clustering of applications with noise”) starts from an unclustered point and proceeds by aggregating its neighbours to the same cluster, as long as they are within a certain distance (Ester et al., 1996).
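A naive base R sketch of this expansion process is given below. It assumes the usual DBSCAN parameters: a neighbourhood radius (`eps`) and a minimum number of points (`min_pts`, counting the point itself) for a point to be dense enough to start or extend a cluster. The function `simple_dbscan` and the toy data are hypothetical, and the code is meant as an illustration rather than a replacement for the `dbscan` package.

```r
simple_dbscan <- function(points, eps, min_pts) {
  distances <- as.matrix(dist(points))
  n_points <- nrow(distances)
  # Neighbourhood of each point: all points within eps (itself included)
  neighbours <- lapply(
    seq_len(n_points),
    function(i) which(distances[i, ] <= eps)
  )
  is_core <- lengths(neighbours) >= min_pts
  cluster <- integer(n_points) # 0 means noise or not yet clustered
  current_cluster <- 0L
  for (i in seq_len(n_points)) {
    # Start a new cluster only from an unclustered, dense (core) point
    if (cluster[i] != 0L || !is_core[i]) next
    current_cluster <- current_cluster + 1L
    cluster[i] <- current_cluster
    to_visit <- neighbours[[i]]
    while (length(to_visit) > 0) {
      j <- to_visit[1]
      to_visit <- to_visit[-1]
      if (cluster[j] != 0L) next
      cluster[j] <- current_cluster
      # Only core points keep expanding the cluster
      if (is_core[j]) to_visit <- c(to_visit, neighbours[[j]])
    }
  }
  cluster
}

# Toy data: two dense groups of five points and one isolated point
toy_points <- rbind(
  cbind(seq(0, 0.8, by = 0.2), 0),
  cbind(seq(10, 10.8, by = 0.2), 0),
  c(50, 50)
)
toy_clusters <- simple_dbscan(toy_points, eps = 0.5, min_pts = 3)
toy_clusters
```

The two dense groups end up in clusters 1 and 2, while the isolated point keeps the value 0, that is, it is treated as noise.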
library(dbscan)

dbscan_result <- data_to_cluster %>%
  select(x_values, y_values) %>%
  dbscan(eps = 1, minPts = 5)

data_to_cluster <- data_to_cluster %>%
  add_column(dbscan_cluster = dbscan_result$cluster)
In GIScience, clustering is commonly used to create geodemographic classifications, such as the 2011 Output Area Classification (Gale et al., 2016)
In the practical session we will see:
Geodemographic classification in R