Lecture 202

class: center, middle, inverse, title-slide

.title[
# Lecture 202
]
.author[
### Dr Stefano De Sabbata<br /><small>School of Geography, Geology, and the Env., University of Leicester<br /><a href="https://github.com/sdesabbata/r-for-geographic-data-science" style="color: white">github.com/sdesabbata/r-for-geographic-data-science</a><br /><a href="mailto:s.desabbata@le.ac.uk" style="color: white">s.desabbata@le.ac.uk</a> | <a href="https://twitter.com/maps4thought" style="color: white">@maps4thought</a><br />text licensed under <a href="https://creativecommons.org/licenses/by-sa/4.0/" style="color: white">CC BY-SA 4.0</a>, code licensed under <a href="https://www.gnu.org/licenses/gpl-3.0.html" style="color: white">GNU GPL v3.0</a></small>
]

---

class: inverse, center, middle

# Descriptive statistics

---
## Recap

.pull-left[

**Previously**: Exploratory visualisation

- Grammar of graphics
- Visualising amounts and proportions
- Visualising variable distributions and relationships

**Today**: Exploratory statistics

- Descriptive statistics
- Exploring assumptions
    - Normality
    - Skewness and kurtosis
    - Homogeneity of variance

<br/>

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-2-1.png)

]

---
## Meet the Palmer penguins

.pull-left[

<br>

Original data collected and released by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the Palmer Station, Antarctica LTER, a member of the [Long Term Ecological Research Network](https://lternet.edu/).

Horst AM, Hill AP, Gorman KB (2020). [palmerpenguins: Palmer Archipelago (Antarctica) penguin data](https://allisonhorst.github.io/palmerpenguins/). R package version 0.1.0. doi:10.5281/zenodo.3960218.

<br>

```r
library(palmerpenguins)
```

]
.pull-right[

![:scale 70%](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/lter_penguins.png)

![:scale 70%](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/culmen_depth.png)

<br/>
.referencenote[
*Artwork by @allison_horst*
]

]

---
## Descriptive statistics

<br/>

.pull-left[

Quantitatively describe or summarize variables

- `stat.desc` from the `pastecs` library
    - `basic` (default is `TRUE`) includes counts, range and sum
    - `desc` (default is `TRUE`) includes descriptive stats
    - `norm` (default is `FALSE`) includes distribution stats

```r
# tidyverse provides %>% and select(), knitr provides kable()
library(tidyverse)
library(knitr)
library(pastecs)

penguins %>%
  select(bill_length_mm, bill_depth_mm) %>%
  stat.desc() %>%
  kable(digits = c(2, 2))
```

]
.pull-right[

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> bill_length_mm </th>
   <th style="text-align:right;"> bill_depth_mm </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> nbr.val </td>
   <td style="text-align:right;"> 342.00 </td>
   <td style="text-align:right;"> 342.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> nbr.null </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:right;"> 0.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> nbr.na </td>
   <td style="text-align:right;"> 2.00 </td>
   <td style="text-align:right;"> 2.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> min </td>
   <td style="text-align:right;"> 32.10 </td>
   <td style="text-align:right;"> 13.10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> max </td>
   <td style="text-align:right;"> 59.60 </td>
   <td style="text-align:right;"> 21.50 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> range </td>
   <td style="text-align:right;"> 27.50 </td>
   <td style="text-align:right;"> 8.40 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sum </td>
   <td style="text-align:right;"> 15021.30 </td>
   <td style="text-align:right;"> 5865.70 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> median </td>
   <td style="text-align:right;"> 44.45 </td>
   <td style="text-align:right;"> 17.30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> mean </td>
   <td style="text-align:right;"> 43.92 </td>
   <td style="text-align:right;"> 17.15 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> SE.mean </td>
   <td style="text-align:right;"> 0.30 </td>
   <td style="text-align:right;"> 0.11 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> CI.mean.0.95 </td>
   <td style="text-align:right;"> 0.58 </td>
   <td style="text-align:right;"> 0.21 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> var </td>
   <td style="text-align:right;"> 29.81 </td>
   <td style="text-align:right;"> 3.90 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> std.dev </td>
   <td style="text-align:right;"> 5.46 </td>
   <td style="text-align:right;"> 1.97 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> coef.var </td>
   <td style="text-align:right;"> 0.12 </td>
   <td style="text-align:right;"> 0.12 </td>
  </tr>
</tbody>
</table>

]

---
## stat.desc: basic

.pull-left[

<br/>

- `nbr.val`: number of non-missing values in the dataset
- `nbr.null`: number of null values, that is, values equal to zero (not to be confused with R's `NULL`)
- `nbr.na`: number of `NA`s -- missing value indicator
- `min` (also `min()`): **minimum** value in the dataset
- `max` (also `max()`): **maximum** value in the dataset
- `range`: difference between `min` and `max` (unlike `range()`, which returns both values)
- `sum` (also `sum()`): sum of the values in the dataset

]
.pull-right[

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> bill_length_mm </th>
   <th style="text-align:right;"> bill_depth_mm </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> nbr.val </td>
   <td style="text-align:right;"> 342.0 </td>
   <td style="text-align:right;"> 342.0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> nbr.null </td>
   <td style="text-align:right;"> 0.0 </td>
   <td style="text-align:right;"> 0.0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> nbr.na </td>
   <td style="text-align:right;"> 2.0 </td>
   <td style="text-align:right;"> 2.0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> min </td>
   <td style="text-align:right;"> 32.1 </td>
   <td style="text-align:right;"> 13.1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> max </td>
   <td style="text-align:right;"> 59.6 </td>
   <td style="text-align:right;"> 21.5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> range </td>
   <td style="text-align:right;"> 27.5 </td>
   <td style="text-align:right;"> 8.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sum </td>
   <td style="text-align:right;"> 15021.3 </td>
   <td style="text-align:right;"> 5865.7 </td>
  </tr>
</tbody>
</table>

]
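
As a rough cross-check, the basic statistics can be reproduced with base R. This is a minimal sketch on a hypothetical vector `x` (not the penguins data), assuming that `nbr.null` counts the values equal to zero.

```r
# Hypothetical measurements, including zeros and a missing value
x <- c(3.2, 0, NA, 5.1, 4.4, 0)

nbr_val  <- sum(!is.na(x))             # non-missing values
nbr_null <- sum(x == 0, na.rm = TRUE)  # values equal to zero
nbr_na   <- sum(is.na(x))              # missing values

# stat.desc reports the range as a single number (max - min),
# whereas range() returns both extremes
x_range <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
x_sum   <- sum(x, na.rm = TRUE)
```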

---
## stat.desc: desc

.pull-left[

- `mean` (`mean()`): **arithmetic mean**, that is `sum` divided by the number of values that are not `NA`
- `median` (`median()`): **median**, that is the value separating the higher half from the lower half of the values
- **mode**: the value that appears most often -- not reported by `stat.desc`; note that the base R function `mode()` returns the storage mode of an object, not the statistical mode

Assuming the data are a sample drawn from a population

- `SE.mean`: **standard error of the mean** -- estimate of how much the mean would vary if calculated on different samples of the data (see also *central limit theorem*)
- `CI.mean.0.95`: half-width of the **95% confidence interval of the mean** -- intervals built as the sample mean plus or minus this value are expected to contain the population mean for 95% of samples

]
.pull-right[

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> bill_length_mm </th>
   <th style="text-align:right;"> bill_depth_mm </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> median </td>
   <td style="text-align:right;"> 44.45 </td>
   <td style="text-align:right;"> 17.30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> mean </td>
   <td style="text-align:right;"> 43.92 </td>
   <td style="text-align:right;"> 17.15 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> SE.mean </td>
   <td style="text-align:right;"> 0.30 </td>
   <td style="text-align:right;"> 0.11 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> CI.mean.0.95 </td>
   <td style="text-align:right;"> 0.58 </td>
   <td style="text-align:right;"> 0.21 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> var </td>
   <td style="text-align:right;"> 29.81 </td>
   <td style="text-align:right;"> 3.90 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> std.dev </td>
   <td style="text-align:right;"> 5.46 </td>
   <td style="text-align:right;"> 1.97 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> coef.var </td>
   <td style="text-align:right;"> 0.12 </td>
   <td style="text-align:right;"> 0.12 </td>
  </tr>
</tbody>
</table>

]
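
Base R has no built-in function for the statistical mode, so one possible approach is a small helper. The function name `stat_mode` is hypothetical and the sketch returns all tied values rather than picking one.

```r
# Hypothetical helper: most frequent value(s), ignoring NAs
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  unique_values <- unique(x)
  # count how many times each unique value occurs
  counts <- tabulate(match(x, unique_values))
  unique_values[counts == max(counts)]
}

stat_mode(c(1, 2, 2, 3, 3, 3, NA))
stat_mode(c("a", "b", "a", "b"))
```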

???

The standard error of the mean (SE) is the standard deviation of sample means

- calculation
  - take the difference between each sample mean and the overall mean
  - square the differences
  - sum them up
  - divide by the number of samples
  - take the square root
- in practice, it is estimated from a single sample as the standard deviation divided by the square root of the number of values
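
The single-sample estimate can be sketched from the values in the table for `bill_length_mm`. The sketch assumes a confidence interval based on Student's t distribution; under that assumption the results round to the `SE.mean` and `CI.mean.0.95` values shown on the slide.

```r
# Values read from the table for bill_length_mm
n       <- 342   # nbr.val
std_dev <- 5.46  # std.dev (rounded)

# Standard error of the mean estimated from a single sample
se_mean <- std_dev / sqrt(n)

# Half-width of the 95% confidence interval (assumption: t-based)
ci_half_width <- qt(0.975, df = n - 1) * se_mean

round(se_mean, 2)
round(ci_half_width, 2)
```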

---
## Estimating variation

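- `var`: **variance** ($\sigma^2$) quantifies the amount of variation as the average of squared distances from the mean; note that `var()` and `stat.desc` report the sample estimate, which divides by $n - 1$ rather than $n$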

`$$\sigma^2 = \frac{1}{n} \sum_{i=1}^n (\mu-x_i)^2$$`

- `std.dev`: **standard deviation** ($\sigma$), it quantifies the amount of variation as the square root of the variance

`$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (\mu-x_i)^2}$$`

- `coef.var`: **coefficient of variation**, which quantifies the amount of variation as the standard deviation divided by the mean
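
The three measures can be computed step by step as a minimal sketch on a hypothetical vector `x`. It contrasts the population formula above (dividing by $n$) with the sample estimate returned by `var()` and `sd()` (dividing by $n - 1$).

```r
# Hypothetical values, chosen so that the mean is exactly 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

squared_distances <- (mean(x) - x)^2

# Population variance and standard deviation, as in the formulas
pop_var <- sum(squared_distances) / n
pop_sd  <- sqrt(pop_var)

# Sample estimates, equivalent to var(x) and sd(x)
sample_var <- sum(squared_distances) / (n - 1)
sample_sd  <- sqrt(sample_var)

# Coefficient of variation
coef_var <- sample_sd / mean(x)
```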


---
## dplyr::across

The `dplyr` function `across` makes it possible to apply the same function to multiple columns within verbs such as `summarise`.

.pull-left[

Instead of specifying `mean` for each one of four columns

```r
penguins %>%
  # filter out rows with missing data
  filter(!is.na(bill_length_mm)) %>%
  # summarise
  summarise(
    avg_bill_len_mm = mean(bill_length_mm), 
    avg_bill_dpt_mm = mean(bill_depth_mm),
    avg_flip_len_mm = mean(flipper_length_mm),
    avg_body_mass_g = mean(body_mass_g)
  ) %>%
  kable("html", digits = c(2, 2, 2, 2)) %>%
  kable_styling(font_size = 14)
```

]
.pull-right[

One can specify the same function `mean` across a range of columns

```r
penguins %>%
  # filter out rows with missing data
  filter(!is.na(bill_length_mm)) %>%
  # summarise
  summarise(
    across(
      # range of adjacent columns to summarise
      bill_length_mm:body_mass_g, 
      # function to be applied
      mean                        
    )
  ) %>%
  kable("html", digits = c(2, 2, 2, 2)) %>%
  kable_styling(font_size = 14)
```

]
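
`across` can also take a named list of functions. The sketch below uses a hypothetical data frame `toy` rather than the penguins data, handles missing values through `na.rm = TRUE` instead of filtering, and uses the `.names` argument to label each output column.

```r
library(dplyr)

# Hypothetical measurements with one incomplete row
toy <- data.frame(
  bill_length_mm = c(39.1, 39.5, NA, 40.3),
  bill_depth_mm  = c(18.7, 17.4, NA, 18.0)
)

toy_summary <- toy %>%
  summarise(
    across(
      everything(),
      # named list of functions to be applied to each column
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        sd   = ~ sd(.x, na.rm = TRUE)
      ),
      # output names such as bill_length_mm_mean
      .names = "{.col}_{.fn}"
    )
  )

toy_summary
```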

???

Particularly useful when working with many variables to ensure correctness and readability

---
## dplyr::across

.pull-left[

The verb `across` can also be used with `mutate`, to apply the same function to a number of columns

```r
penguins %>%
  # mutate across columns
  mutate(
    across(
      c(
        bill_length_mm, 
        bill_depth_mm, 
        flipper_length_mm
      ),
      # convert millimetres to inches (divide by 25.4)
      function(x){ x / 25.4 }
    )
  ) %>%
  rename(
    bill_length_in = bill_length_mm,
    bill_depth_in = bill_depth_mm,
    flipper_length_in = flipper_length_mm
  )
```

]
.pull-right[

Old columns:

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> bill_length_mm </th>
   <th style="text-align:right;"> bill_depth_mm </th>
   <th style="text-align:right;"> flipper_length_mm </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 39.1 </td>
   <td style="text-align:right;"> 18.7 </td>
   <td style="text-align:right;"> 181 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 39.5 </td>
   <td style="text-align:right;"> 17.4 </td>
   <td style="text-align:right;"> 186 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 40.3 </td>
   <td style="text-align:right;"> 18.0 </td>
   <td style="text-align:right;"> 195 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 36.7 </td>
   <td style="text-align:right;"> 19.3 </td>
   <td style="text-align:right;"> 193 </td>
  </tr>
</tbody>
</table>

New columns:

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:right;"> bill_length_in </th>
   <th style="text-align:right;"> bill_depth_in </th>
   <th style="text-align:right;"> flipper_length_in </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 1.54 </td>
   <td style="text-align:right;"> 0.74 </td>
   <td style="text-align:right;"> 7.13 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1.56 </td>
   <td style="text-align:right;"> 0.69 </td>
   <td style="text-align:right;"> 7.32 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1.59 </td>
   <td style="text-align:right;"> 0.71 </td>
   <td style="text-align:right;"> 7.68 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1.44 </td>
   <td style="text-align:right;"> 0.76 </td>
   <td style="text-align:right;"> 7.60 </td>
  </tr>
</tbody>
</table>

]
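
A more compact variant is sketched below, assuming that every column to convert ends in `_mm`. It selects the columns with `ends_with()` and renames them with `rename_with()` instead of listing each name; the data frame `toy_penguins` is hypothetical.

```r
library(dplyr)

# Hypothetical measurements in millimetres and grams
toy_penguins <- data.frame(
  bill_length_mm    = c(25.4, 50.8),
  flipper_length_mm = c(127, 254),
  body_mass_g       = c(3500, 4000)
)

toy_inches <- toy_penguins %>%
  # convert every column ending in _mm to inches
  mutate(across(ends_with("_mm"), ~ .x / 25.4)) %>%
  # replace the _mm suffix with _in in the converted columns
  rename_with(~ sub("_mm$", "_in", .x), ends_with("_mm"))

toy_inches
```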

---
class: inverse, center, middle

# Exploring assumptions

---
## Normal distribution

.pull-left[

*Distribution* can refer to:

- how many times each value occurs in a set (frequency)
- the probability of each value occurring in a set

A set of values is said to *be normally distributed* if its distribution follows (within a certain margin of error) the **normal distribution**

- characterized by the bell-shaped curve 
- majority of values lie around the centre of the distribution
- the further the values are from the centre, the lower their frequency
- about 95% of values within 2 standard deviations from the mean

]
.pull-right[

<br/>

]
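
The "about 95%" figure can be checked with the cumulative distribution function of the standard normal distribution, `pnorm`. This is only a quick numerical illustration, not part of the analysis of the penguins data.

```r
# Probability of a value falling within 2 standard deviations of the mean
within_2sd <- pnorm(2) - pnorm(-2)

# Number of standard deviations that covers exactly 95% of values
z_95 <- qnorm(0.975)

round(within_2sd, 4)
round(z_95, 2)
```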

???

- For instance, you might imagine that all penguins have a rather similar flipper length. 
- So, in most cases, if we have a set of values representing measurements of penguin flipper lengths, most values will be close to the mean. 
- Very long or very short flipper lengths are rather rare. 
- Thus, a good question could be: are penguins' flipper lengths normally distributed?

---
## Density histogram

.pull-left[

```r
penguins %>% 
  ggplot(
    aes(
      x = flipper_length_mm
    )
  ) +
  geom_histogram(
    aes(
      # after_stat() replaces the deprecated ..density.. notation
      y = after_stat(density)
    )
  ) + 
  stat_function(
    fun = dnorm, 
    args = list(
      mean = 
        penguins %>% 
        filter(!is.na(flipper_length_mm)) %>% 
        pull(flipper_length_mm) %>% 
        mean(),
      sd = 
        penguins %>% 
        filter(!is.na(flipper_length_mm)) %>% 
        pull(flipper_length_mm) %>% 
        sd()
    ),
    colour = "red", 
    # linewidth replaces size for lines in ggplot2 3.4.0 and later
    linewidth = 1
  )
```

]
.pull-right[

]

---
## Q-Q plot

<br/>

.pull-left[

A Q-Q plot illustrates

- the quantiles of the observed values against 
- the corresponding theoretical quantiles of a particular distribution 
  - (in this case, *normal* distribution)

```r
penguins %>% 
  ggplot(
    aes(
      sample = 
        flipper_length_mm
    )
  ) +
  stat_qq() +
  stat_qq_line()
```

]
.pull-right[

]
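
The coordinates behind a normal Q-Q plot can be sketched in base R. The vector `x` is hypothetical; `ppoints()` generates the probability points that `qqnorm()` also uses, and `qnorm()` turns them into theoretical quantiles.

```r
# Hypothetical flipper lengths in millimetres
x <- c(181, 186, 195, 193, 190, 181, 195, 182, 191, 198)
n <- length(x)

# Observed quantiles: the sorted values
observed <- sort(x)

# Theoretical quantiles of the standard normal distribution
theoretical <- qnorm(ppoints(n))

# Each pair (theoretical, observed) is one point of the Q-Q plot
qq_points <- data.frame(theoretical, observed)
qq_points
```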

---
## Shapiro–Wilk test

- Compares
  - the distribution of a variable 
  - with a normal distribution having same mean and standard deviation
- if significant
  - the null hypothesis of normality is rejected, that is, the distribution is unlikely to be normal
- R functions
  - `shapiro.test` function in `stats`
  - `normtest.W` and `normtest.p` values in `stat.desc` (more on this in the coming slides)

```r
penguins %>% 
  pull(flipper_length_mm) %>%
  shapiro.test()
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  .
## W = 0.95155, p-value = 3.54e-09
```

**Conclusion**: The *flipper length* of penguins in the Palmer Station dataset **is not** normally distributed.

---
## Significance

<br/>

Most statistical tests are based on the idea of hypothesis testing

- a **null hypothesis** is set
- the data are fit into a statistical model
- the model is assessed with a **test statistic**
- the **significance** (*p*-value) is the probability of obtaining a test statistic at least as extreme as the one observed, if the null hypothesis were true

The threshold to reject a hypothesis is arbitrary and based on conventions (e.g., *p < .01* or *p < .05*)

**Example:** The null hypothesis of the Shapiro–Wilk test is that the sample is normally distributed, and *p < .01* indicates that data like these would be very unlikely if that were true. So, the null hypothesis is rejected: the *flipper length* of penguins in the Palmer Station dataset **is not** normally distributed.
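
The decision rule can be illustrated on simulated data. This is a minimal sketch with hypothetical samples and an assumed threshold of .01: a sample drawn from a normal distribution and one drawn from a strongly skewed (exponential) distribution, for which a very small p-value would be expected.

```r
set.seed(42)

# Hypothetical samples: one normal, one strongly skewed
normal_sample <- rnorm(200, mean = 200, sd = 14)
skewed_sample <- rexp(200, rate = 1 / 200)

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

# Reject the null hypothesis of normality below the chosen threshold
alpha <- 0.01
reject_normal <- p_normal < alpha
reject_skewed <- p_skewed < alpha
```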

---
## Example

The *flipper length* of **Adelie** penguins **is normally distributed** (the test is not significant, *p* = .72)

.pull-left[

```r
penguins %>% 
  filter(
    species == "Adelie"
  ) %>%
  pull(
    flipper_length_mm
  ) %>%
  shapiro.test()
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  .
## W = 0.99339, p-value = 0.72
```

]
.pull-right[

]

---
## Example: Leicester OAs

.pull-left[

Is the population of Leicester's Output Areas (OAs) normally distributed?

... almost, but actually, no!

```r
leicester_2011OAC <- 
  read_csv(
    "2011_OAC_Raw_uVariables_Leicester.csv"
  )
```

```r
leicester_2011OAC %>% 
  pull(
    Total_Population
  ) %>%
  shapiro.test()
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  .
## W = 0.97505, p-value = 7.626e-12
```

]
.pull-right[

]

---
## Skewness and kurtosis

<br/>

In a normal distribution, *skewness* and (excess) *kurtosis* should be **zero**

- `skewness`: **skewness** value indicates
  - positive: values pile up on the left, with a longer tail on the right (right-skewed)
  - negative: values pile up on the right, with a longer tail on the left (left-skewed)

- `kurtosis`: **kurtosis** value indicates
  - positive: heavy-tailed distribution
  - negative: light-tailed, flat distribution

- `skew.2SE` and `kurt.2SE`: skewness and kurtosis divided by 2 standard errors. Therefore
  - if `> 1` (or `< -1`) then the statistic is significant *(p < .05)*
  - if `> 1.29` (or `< -1.29`) then the statistic is significant *(p < .01)*

---
## Example

*Flipper length* is not normally distributed

- right-skewed, with values piling up on the left (skewness positive, `skew.2SE > 1.29`)
- flat distribution (kurtosis negative, `kurt.2SE < -1.29`)

```r
penguins %>% 
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  stat.desc(basic = FALSE, desc = FALSE, norm = TRUE)
```
<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> bill_length_mm </th>
   <th style="text-align:right;"> bill_depth_mm </th>
   <th style="text-align:right;"> flipper_length_mm </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> skewness </td>
   <td style="text-align:right;"> 0.0526530 </td>
   <td style="text-align:right;"> -0.1422086 </td>
   <td style="text-align:right;"> 0.3426554 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> skew.2SE </td>
   <td style="text-align:right;"> 0.1996290 </td>
   <td style="text-align:right;"> -0.5391705 </td>
   <td style="text-align:right;"> 1.2991456 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> kurtosis </td>
   <td style="text-align:right;"> -0.8931397 </td>
   <td style="text-align:right;"> -0.9233523 </td>
   <td style="text-align:right;"> -0.9991866 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> kurt.2SE </td>
   <td style="text-align:right;"> -1.6979696 </td>
   <td style="text-align:right;"> -1.7554076 </td>
   <td style="text-align:right;"> -1.8995781 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> normtest.W </td>
   <td style="text-align:right;"> 0.9748548 </td>
   <td style="text-align:right;"> 0.9725838 </td>
   <td style="text-align:right;"> 0.9515451 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> normtest.p </td>
   <td style="text-align:right;"> 0.0000112 </td>
   <td style="text-align:right;"> 0.0000044 </td>
   <td style="text-align:right;"> 0.0000000 </td>
  </tr>
</tbody>
</table>
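
The `skew.2SE` and `kurt.2SE` values for `flipper_length_mm` can be approximately reproduced from the skewness and kurtosis in the table. The sketch assumes the commonly used formulas for the standard errors of skewness and kurtosis given the sample size `n`; the variable names are hypothetical.

```r
# Values read from the table for flipper_length_mm
n        <- 342         # number of non-missing values
skewness <- 0.3426554
kurtosis <- -0.9991866

# Standard errors (assumed formulas, depending only on n)
se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt <- sqrt(24 * n * (n - 1)^2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

# Statistic divided by two standard errors
skew_2se <- skewness / (2 * se_skew)
kurt_2se <- kurtosis / (2 * se_kurt)

# Significant at p < .01 if the absolute value exceeds 1.29
abs(skew_2se) > 1.29
abs(kurt_2se) > 1.29
```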

---
## Example

In contrast, the values are not significant for **Adelie** penguins

- both `skew.2SE` and `kurt.2SE` between `-1` and `1`

```r
penguins %>% 
  filter(species == "Adelie") %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  stat.desc(basic = FALSE, desc = FALSE, norm = TRUE)
```
<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> bill_length_mm </th>
   <th style="text-align:right;"> bill_depth_mm </th>
   <th style="text-align:right;"> flipper_length_mm </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> skewness </td>
   <td style="text-align:right;"> 0.1584764 </td>
   <td style="text-align:right;"> 0.3148847 </td>
   <td style="text-align:right;"> 0.0856093 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> skew.2SE </td>
   <td style="text-align:right;"> 0.4014211 </td>
   <td style="text-align:right;"> 0.7976035 </td>
   <td style="text-align:right;"> 0.2168485 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> kurtosis </td>
   <td style="text-align:right;"> -0.2285951 </td>
   <td style="text-align:right;"> -0.1361153 </td>
   <td style="text-align:right;"> 0.2382734 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> kurt.2SE </td>
   <td style="text-align:right;"> -0.2913388 </td>
   <td style="text-align:right;"> -0.1734755 </td>
   <td style="text-align:right;"> 0.3036734 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> normtest.W </td>
   <td style="text-align:right;"> 0.9933618 </td>
   <td style="text-align:right;"> 0.9846683 </td>
   <td style="text-align:right;"> 0.9933916 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> normtest.p </td>
   <td style="text-align:right;"> 0.7166005 </td>
   <td style="text-align:right;"> 0.0924897 </td>
   <td style="text-align:right;"> 0.7200466 </td>
  </tr>
</tbody>
</table>

---
## Homogeneity of variance

<br/>

.pull-left[

**Levene’s test** for equality of variance across the levels of a grouping variable

- If significant, the variance differs between levels

```r
library(car)

penguins %>% 
  leveneTest(
    body_mass_g ~ species, 
    data = .
  )
```

```
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   2  5.1203 0.006445 **
##       339                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

]
.pull-right[

]
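
The idea behind the test with `center = median` can be sketched in base R as a one-way ANOVA on the absolute deviations from each group's median. The data frame `toy` is hypothetical and the sketch is meant as an illustration of the logic rather than a replacement for `leveneTest`.

```r
# Hypothetical body masses for three groups of five penguins
toy <- data.frame(
  species = factor(rep(c("A", "B", "C"), each = 5)),
  body_mass_g = c(
    3400, 3500, 3550, 3600, 3700,
    3300, 3600, 3750, 3900, 4200,
    4500, 4900, 5100, 5400, 5900
  )
)

# Absolute deviation of each value from its group median
group_median <- ave(toy$body_mass_g, toy$species, FUN = median)
toy$abs_dev <- abs(toy$body_mass_g - group_median)

# One-way ANOVA on the absolute deviations
levene_table <- anova(lm(abs_dev ~ species, data = toy))
levene_table
```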

---
## Summary

.pull-left[

**Today**: Exploratory statistics

- Descriptive statistics
- Exploring assumptions
    - Normality
    - Skewness and kurtosis
    - Homogeneity of variance

**Next time**: Comparing variables

- Comparing distributions through mean
- Correlation analysis
- Variable transformation

<br/>

.referencenote[
Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com).
]

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-36-1.png)

]