2021-10-03

Principal Component Analysis

Recap

Prev: Comparing data

  • 401 Lecture Introduction to Machine Learning
  • 402 Lecture Artificial Neural Networks
  • 403 Lecture Support vector machines
  • 404 Practical session

Now: Principal Component Analysis

  • Principal components
  • stats::prcomp
  • Dimensionality reduction

Principal components

Principal components are

  • a set of directions orthogonal to each other
  • that best fit a set of data

Can be interpreted as a “re-projection” of the data
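
The two properties above can be checked numerically. The following is a minimal sketch, not the method used later in these slides: it assumes the built-in iris data as a stand-in and obtains the directions from an eigen-decomposition of the covariance matrix of the standardised variables. The object names are illustrative only.

```r
# Standardise four numeric variables from the built-in iris data (a stand-in dataset)
x <- scale(iris[, 1:4], center = TRUE, scale = TRUE)

# Eigen-decomposition of the covariance matrix of the standardised data:
# the eigenvectors are the principal directions
eig <- eigen(cov(x))
directions <- eig$vectors

# The directions are mutually orthogonal and of unit length,
# so their cross-product is approximately the identity matrix
orthogonality_check <- crossprod(directions)

# "Re-projecting" the data means expressing each observation
# in the coordinate system defined by those directions
scores <- x %*% directions

# stats::prcomp yields the same scores, up to the sign of each component
pca <- stats::prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
max_abs_diff <- max(abs(abs(scores) - abs(pca$x)))
```

The sign of each component is arbitrary, which is why the comparison with stats::prcomp is made on absolute values.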

Dimensionality reduction

Alternatively, principal components can be interpreted as

  • a lower-dimensional representation of the data

Especially useful when working with numerous variables

  • a limited number of principal components can be retained
    • most variance maintained
    • distance in data space approximated
    • high-dimensional data can be more easily plotted
  • commonly used as dimensionality reduction step
    • supervised learning models
      • linear regression
    • clustering
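
As a rough illustration of retaining a limited number of components, the sketch below again assumes the built-in iris data as a stand-in. The 85% variance threshold is an arbitrary, hypothetical choice rather than a recommendation.

```r
# PCA on four standardised variables from the built-in iris data (a stand-in dataset)
pca <- stats::prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component, and the running total
variance_share <- pca$sdev^2 / sum(pca$sdev^2)
cumulative_share <- cumsum(variance_share)

# Keep the smallest number of components reaching the (hypothetical) 85% threshold
n_keep <- which(cumulative_share >= 0.85)[1]
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

# Pairwise distances in the full standardised space and in the reduced space
full_dist <- dist(scale(iris[, 1:4]))
reduced_dist <- dist(reduced)

# How closely the reduced distances track the full ones
distance_correlation <- cor(as.vector(full_dist), as.vector(reduced_dist))
```

The reduced matrix can then be passed to a later step, such as a regression or clustering model, in place of the original variables.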

stats::prcomp

Principal component analysis on body mass, flipper length, and bill length and depth

library(magrittr)  # provides %>% and %$%

penguins_pca <-
  palmerpenguins::penguins %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  # remove rows with a missing value in any of the four variables
  dplyr::filter(
    !is.na(bill_length_mm) & !is.na(bill_depth_mm) &
    !is.na(flipper_length_mm) & !is.na(body_mass_g)
  ) %>%
  # centre and scale each variable before extracting the components
  stats::prcomp(center = TRUE, scale. = TRUE)

summary(penguins_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.6594 0.8789 0.60435 0.32938
## Proportion of Variance 0.6884 0.1931 0.09131 0.02712
## Cumulative Proportion  0.6884 0.8816 0.97288 1.00000

The first component alone explains 68.84% of the variance, and the first two together explain 88.16%
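
The proportions in the summary follow directly from the standard deviations: each component's variance is its squared standard deviation, divided by the total. A small worked sketch, using the rounded values printed above (the variable names are illustrative):

```r
# Standard deviations copied from the summary(penguins_pca) output above
sdev <- c(PC1 = 1.6594, PC2 = 0.8789, PC3 = 0.60435, PC4 = 0.32938)

# Each component's variance is its squared standard deviation
variance <- sdev^2

# Share of the total variance per component, and the running total
proportion <- variance / sum(variance)
cumulative <- cumsum(proportion)

print(round(rbind(proportion, cumulative), 4))
```

Because the four variables were standardised, the total variance is 4 (up to rounding of the printed values), so each proportion is roughly the squared standard deviation divided by 4.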

PCA results

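penguins_with_pca <- palmerpenguins::penguins %>%
  # keep the same rows that were passed to prcomp, so the scores line up
  dplyr::filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm) &
                !is.na(flipper_length_mm) & !is.na(body_mass_g)) %>%
  # append the component scores (PC1 to PC4) as new columns
  dplyr::bind_cols(
    penguins_pca %$% x %>% as.data.frame()
  )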

Plotting PCA

library(factoextra)

penguins_pca %>% fviz_pca_biplot(label = "var")
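
A biplot overlays the observation scores with the variable directions. The numbers behind the variable directions can also be inspected directly through the rotation element returned by stats::prcomp. The sketch below assumes the built-in iris data as a stand-in, so the values differ from the penguin example.

```r
# PCA on four standardised variables from the built-in iris data (a stand-in dataset)
pca <- stats::prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Loadings: one column per component, one row per original variable
loadings <- pca$rotation

# Each loading vector has unit length
loading_lengths <- colSums(loadings^2)

# For standardised data, a loading multiplied by its component's standard
# deviation equals the correlation between the variable and the component scores
variable_component_cor <- cor(scale(iris[, 1:4]), pca$x)
scaled_loadings <- sweep(loadings, 2, pca$sdev, `*`)

# Loadings of the first two components, as would be read off a biplot
print(round(loadings[, 1:2], 2))
```

Variables with large absolute loadings on a component contribute most to it, and loadings with the same sign indicate variables that vary together along that component.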

Summary

Principal Component Analysis

  • Principal components
  • stats::prcomp
  • Interpretation

Next: Centroid-based clustering

  • K-means
  • Fuzzy c-means
  • Geodemographic classification