2021-10-03

Exploring assumptions

Recap

Prev: Descriptive statistics

  • stat.desc
  • dplyr::across

Next: Exploring assumptions

  • Normality
  • Skewness and kurtosis
  • Homogeneity of variance

Normal distribution

  • characterized by the bell-shaped curve
  • majority of values lie around the centre of the distribution
  • the further the values are from the centre, the lower their frequency
  • about 95% of values lie within 2 standard deviations of the mean
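
The 95% figure can be checked numerically. The following is a minimal sketch using base R's pnorm (the cumulative distribution function of the standard normal); the helper name within_k_sd is illustrative and not part of the original slides.

```r
# Share of a normal distribution lying within k standard deviations of the mean.
# pnorm() is the cumulative distribution function of the standard normal.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

within_1_sd <- within_k_sd(1)
within_2_sd <- within_k_sd(2)
within_3_sd <- within_k_sd(3)

# Rounded shares for 1, 2 and 3 standard deviations
print(round(c(within_1_sd, within_2_sd, within_3_sd), 4))
```

The three shares are roughly 68%, 95% and 99.7%, which is why 2 standard deviations is the usual rule of thumb.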

Density histogram

palmerpenguins::penguins %>%
  ggplot2::ggplot(
    ggplot2::aes(x = flipper_length_mm)
  ) +
  ggplot2::geom_histogram(
    ggplot2::aes(
      y = ..density..
    )
  ) +
  ggplot2::stat_function(
    fun = dnorm,
    args = list(
      # mean and stddev
      # calculations
      # omitted here
      mean = ...,
      sd = ...
    ),
    colour = "black", size = 1)
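
The slide leaves the mean and standard deviation calculations out. One plausible way to fill them in is sketched below; it is an assumption, not necessarily the calculation the slides omitted. It computes both values with na.rm = TRUE (the flipper length column contains missing values) and uses after_stat(density), the newer spelling of ..density.. in ggplot2 3.3.0 and later. The names flipper_mean, flipper_sd and density_hist are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Assumed calculations: sample mean and standard deviation, ignoring NAs
flipper_mean <- mean(penguins$flipper_length_mm, na.rm = TRUE)
flipper_sd <- sd(penguins$flipper_length_mm, na.rm = TRUE)

density_hist <- ggplot(penguins, aes(x = flipper_length_mm)) +
  # histogram rescaled so that the bar areas sum to 1
  geom_histogram(aes(y = after_stat(density)), bins = 30, na.rm = TRUE) +
  # normal density with the same mean and standard deviation as the data
  stat_function(
    fun = dnorm,
    args = list(mean = flipper_mean, sd = flipper_sd),
    colour = "black"
  )

# print(density_hist) draws the plot
```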

Q-Q plot

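Plots the quantiles of the observed values against the corresponding quantiles of a particular theoretical distribution (in this case, the normal distribution); points lying close to the straight line suggest the data follow that distribution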

palmerpenguins::penguins %>% 
  ggplot2::ggplot(
    ggplot2::aes(
      sample = 
        flipper_length_mm
    )
  ) +
  ggplot2::stat_qq() +
  ggplot2::stat_qq_line()
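
To make the construction of a Q-Q plot concrete, the sketch below builds the coordinates by hand in base R. It uses simulated data as a stand-in for the penguin measurements (an assumption made to keep the example self-contained), and the variable names are illustrative.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
x <- rnorm(100, mean = 200, sd = 14)  # simulated stand-in data

n <- length(x)
sample_q <- sort(x)                   # observed (sample) quantiles
theoretical_q <- qnorm(ppoints(n))    # matching standard normal quantiles

plot(theoretical_q, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
# Reference line through the first and third quartiles
qqline(x)
```

The theoretical quantiles computed here are the same ones stats::qqnorm uses for its x axis.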

Normality

The Shapiro–Wilk test compares the distribution of a variable with a normal distribution having the same mean and standard deviation

  • If significant, the distribution is not normal
  • shapiro.test function in stats
  • or normtest.W and normtest.p values in pastecs::stat.desc

palmerpenguins::penguins %>% 
  dplyr::pull(flipper_length_mm) %>%
  stats::shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.95155, p-value = 3.54e-09

Significance

Most statistical tests are based on the idea of hypothesis testing

  • a null hypothesis is set
  • the data are fit into a statistical model
  • the model is assessed with a test statistic
  • the significance (p-value) is the probability of obtaining a test statistic at least that extreme by chance if the null hypothesis were true

The threshold to accept or reject a hypothesis is arbitrary and based on conventions (e.g., p < .01 or p < .05)

Example: The null hypothesis of the Shapiro–Wilk test is that the sample is normally distributed, and p < .01 indicates that data like these would be very unlikely if that were true. So, we reject the null hypothesis and conclude that the flipper length of penguins in the Palmer Station dataset is not normally distributed.
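
What a threshold such as p < .05 means in practice can be illustrated with a small simulation. The sketch below repeatedly draws samples that really are normal and counts how often the Shapiro–Wilk test is nonetheless significant. The sample size, number of repetitions and seed are arbitrary choices for illustration.

```r
set.seed(1)    # arbitrary seed, for reproducibility only
n_sim <- 2000  # number of simulated samples

# p-values of the Shapiro-Wilk test on samples drawn from a normal distribution,
# i.e. cases where the null hypothesis is true by construction
p_values <- replicate(n_sim, shapiro.test(rnorm(50))$p.value)

# Share of samples wrongly flagged as non-normal at the .05 threshold
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

The share should be close to 5%: the threshold fixes how often a true null hypothesis is rejected by chance.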

Example

The flipper length of Adelie penguins shows no significant departure from normality (p = 0.72)

palmerpenguins::penguins %>% 
  dplyr::filter(
    species == "Adelie"
  ) %>%
  dplyr::pull(
    flipper_length_mm
  ) %>%
  stats::shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.99339, p-value = 0.72

Skewness and kurtosis

In a normal distribution, skewness and (excess) kurtosis should be zero

  • skewness: skewness value indicates
    • positive: values pile up on the left, with a longer tail to the right
    • negative: values pile up on the right, with a longer tail to the left
  • kurtosis: kurtosis value indicates
    • positive: heavy-tailed distribution
    • negative: flat, light-tailed distribution
  • skew.2SE and kurt.2SE: skewness and kurtosis divided by 2 standard errors. Therefore
    • if > 1 (or < -1) then the statistic is significant (p < .05)
    • if > 1.29 (or < -1.29) then the statistic is significant (p < .01)
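
The quantities above can be computed by hand, as in the sketch below. The helper skew_kurt_2se is hypothetical, and the standard-error formulas are the usual small-sample ones; they are assumed, not confirmed, to match what pastecs::stat.desc uses internally. The data are simulated from an exponential distribution, which has a long right tail.

```r
# Hypothetical helper: skewness, excess kurtosis and their ratios to 2 standard errors
skew_kurt_2se <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  centred <- x - mean(x)

  skewness <- sum(centred^3) / (n * sd(x)^3)
  kurtosis <- sum(centred^4) / (n * var(x)^2) - 3  # excess kurtosis

  # Small-sample standard errors (assumed formulas, see lead-in)
  se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
  se_kurt <- sqrt(24 * n * (n - 1)^2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

  c(skewness = skewness, skew.2SE = skewness / (2 * se_skew),
    kurtosis = kurtosis, kurt.2SE = kurtosis / (2 * se_kurt))
}

set.seed(7)                 # arbitrary seed, for reproducibility only
right_skewed <- rexp(200)   # simulated data with a long right tail
result <- skew_kurt_2se(right_skewed)
print(round(result, 3))
```

For data like these, the skewness is positive and skew.2SE is well above 1.29, matching the interpretation in the list above.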

Example

Flipper length is not normally distributed

  • values pile up on the left, longer right tail (skewness positive, skew.2SE > 1.29)
  • flat distribution (kurtosis negative, kurt.2SE < -1.29)

palmerpenguins::penguins %>% 
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  pastecs::stat.desc(basic = FALSE, desc = FALSE, norm = TRUE)
##            bill_length_mm bill_depth_mm flipper_length_mm
## skewness        0.0526530    -0.1422086         0.3426554
## skew.2SE        0.1996290    -0.5391705         1.2991456
## kurtosis       -0.8931397    -0.9233523        -0.9991866
## kurt.2SE       -1.6979696    -1.7554076        -1.8995781
## normtest.W      0.9748548     0.9725838         0.9515451
## normtest.p      0.0000112     0.0000044         0.0000000

Example

Values are instead not significant for Adelie penguins

  • both skew.2SE and kurt.2SE are between -1 and 1

palmerpenguins::penguins %>%
  dplyr::filter(species == "Adelie") %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  pastecs::stat.desc(basic = FALSE, desc = FALSE, norm = TRUE)
##            bill_length_mm bill_depth_mm flipper_length_mm
## skewness        0.1584764     0.3148847         0.0856093
## skew.2SE        0.4014211     0.7976035         0.2168485
## kurtosis       -0.2285951    -0.1361153         0.2382734
## kurt.2SE       -0.2913388    -0.1734755         0.3036734
## normtest.W      0.9933618     0.9846683         0.9933916
## normtest.p      0.7166005     0.0924897         0.7200466

Homogeneity of variance

Levene’s test checks the equality of variance across the levels of a grouping variable

  • If significant, the variance differs between levels

library(car)
palmerpenguins::penguins %>% 
  car::leveneTest(
    body_mass_g ~ species, data = .
  )

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   2  5.1203 0.006445 **
##       339                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
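
With center = median, as reported in the output above, Levene's test amounts to a one-way ANOVA on the absolute deviations of each value from its group median. The sketch below shows that idea in base R. It uses the built-in iris dataset rather than the penguins data (an assumption made to keep the example self-contained), and the variable names are illustrative.

```r
y <- iris$Sepal.Length  # numeric outcome
g <- iris$Species       # grouping factor with three levels

# Absolute deviation of each value from the median of its own group
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the deviations: a large F suggests unequal variances
levene_table <- anova(lm(abs_dev ~ g))
print(levene_table)
```

The degrees of freedom follow the same pattern as in the output above: number of groups minus one, and number of observations minus number of groups.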

Summary

Exploring assumptions

  • Normality
  • Skewness and kurtosis
  • Homogeneity of variance

Next: Practical session

  • Data visualisation
  • Descriptive statistics
  • Exploring assumptions