class: center, middle, inverse, title-slide

.title[
# Lecture 202
]
.author[
### Dr Stefano De Sabbata
School of Geography, Geology, and the Env., University of Leicester
github.com/sdesabbata/r-for-geographic-data-science
s.desabbata@le.ac.uk
|
@maps4thought
text licensed under
CC BY-SA 4.0
, code licensed under
GNU GPL v3.0
]

---
class: inverse, center, middle

# Descriptive statistics

---
## Recap

.pull-left[

**Previously**: Exploratory visualisation

- Grammar of graphics
- Visualising amounts and proportions
- Visualising variable distributions and relationships

**Today**: Exploratory statistics

- Descriptive statistics
- Exploring assumptions
    - Normality
    - Skewness and kurtosis
    - Homogeneity of variance

<br/>

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

]

---
## Meet the Palmer penguins

.pull-left[

<br>

Original data collected and released by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the Palmer Station, Antarctica LTER, a member of the [Long Term Ecological Research Network](https://lternet.edu/).

Horst AM, Hill AP, Gorman KB (2020). [palmerpenguins: Palmer Archipelago (Antarctica) penguin data](https://allisonhorst.github.io/palmerpenguins/). R package version 0.1.0. doi:10.5281/zenodo.3960218.
<br>

```r
library(palmerpenguins)
```

]
.pull-right[

![:scale 70%](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/lter_penguins.png)

![:scale 70%](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/culmen_depth.png)

<br/>

.referencenote[
*Artwork by @allison_horst*
]

]

---
## Descriptive statistics

<br/>

.pull-left[

Quantitatively describe or summarise variables

- `stat.desc` from the `pastecs` library
    - `basic` includes counts
    - `desc` includes descriptive stats
    - `norm` (default is `FALSE`) includes distribution stats

```r
library(pastecs)

penguins %>%
  select(bill_length_mm, bill_depth_mm) %>%
  stat.desc() %>%
  kable(digits = c(2, 2))
```

]
.pull-right[

<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> bill_length_mm </th> <th style="text-align:right;"> bill_depth_mm </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> nbr.val </td> <td style="text-align:right;"> 342.00 </td> <td style="text-align:right;"> 342.00 </td> </tr> <tr> <td style="text-align:left;"> nbr.null </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 0.00 </td> </tr> <tr> <td style="text-align:left;"> nbr.na </td> <td style="text-align:right;"> 2.00 </td> <td style="text-align:right;"> 2.00 </td> </tr> <tr> <td style="text-align:left;"> min </td> <td style="text-align:right;"> 32.10 </td> <td style="text-align:right;"> 13.10 </td> </tr> <tr> <td style="text-align:left;"> max </td> <td style="text-align:right;"> 59.60 </td> <td style="text-align:right;"> 21.50 </td> </tr> <tr> <td style="text-align:left;"> range </td> <td style="text-align:right;"> 27.50 </td> <td style="text-align:right;"> 8.40 </td> </tr> <tr> <td style="text-align:left;"> sum </td> <td style="text-align:right;"> 15021.30 </td> <td style="text-align:right;"> 5865.70 </td>
</tr> <tr> <td style="text-align:left;"> median </td> <td style="text-align:right;"> 44.45 </td> <td style="text-align:right;"> 17.30 </td> </tr> <tr> <td style="text-align:left;"> mean </td> <td style="text-align:right;"> 43.92 </td> <td style="text-align:right;"> 17.15 </td> </tr> <tr> <td style="text-align:left;"> SE.mean </td> <td style="text-align:right;"> 0.30 </td> <td style="text-align:right;"> 0.11 </td> </tr> <tr> <td style="text-align:left;"> CI.mean.0.95 </td> <td style="text-align:right;"> 0.58 </td> <td style="text-align:right;"> 0.21 </td> </tr> <tr> <td style="text-align:left;"> var </td> <td style="text-align:right;"> 29.81 </td> <td style="text-align:right;"> 3.90 </td> </tr> <tr> <td style="text-align:left;"> std.dev </td> <td style="text-align:right;"> 5.46 </td> <td style="text-align:right;"> 1.97 </td> </tr> <tr> <td style="text-align:left;"> coef.var </td> <td style="text-align:right;"> 0.12 </td> <td style="text-align:right;"> 0.12 </td> </tr> </tbody> </table>

]

---
## stat.desc: basic

.pull-left[

<br/>

- `nbr.val`: number of values in the dataset that are not `NA`
- `nbr.null`: number of null values, that is, values equal to zero (not R's `NULL`)
- `nbr.na`: number of `NA`s, the missing value indicator
- `min` (also `min()`): **minimum** value in the dataset
- `max` (also `max()`): **maximum** value in the dataset
- `range`: difference between `min` and `max` (different from `range()`)
- `sum` (also `sum()`): sum of the values in the dataset

]
.pull-right[

<br/><br/><br/>

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> bill_length_mm </th> <th style="text-align:right;"> bill_depth_mm </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> nbr.val </td> <td style="text-align:right;"> 342.0 </td> <td style="text-align:right;"> 342.0 </td> </tr> <tr> <td style="text-align:left;"> nbr.null </td> <td style="text-align:right;"> 0.0 </td> <td style="text-align:right;"> 0.0
</td> </tr> <tr> <td style="text-align:left;"> nbr.na </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 2.0 </td> </tr> <tr> <td style="text-align:left;"> min </td> <td style="text-align:right;"> 32.1 </td> <td style="text-align:right;"> 13.1 </td> </tr> <tr> <td style="text-align:left;"> max </td> <td style="text-align:right;"> 59.6 </td> <td style="text-align:right;"> 21.5 </td> </tr> <tr> <td style="text-align:left;"> range </td> <td style="text-align:right;"> 27.5 </td> <td style="text-align:right;"> 8.4 </td> </tr> <tr> <td style="text-align:left;"> sum </td> <td style="text-align:right;"> 15021.3 </td> <td style="text-align:right;"> 5865.7 </td> </tr> </tbody> </table>

]

---
## stat.desc: desc

.pull-left[

- `mean` (`mean()`): **arithmetic mean**, that is, `sum` divided by the number of values that are not `NA`
- `median` (`median()`): **median**, that is, the value separating the higher half from the lower half of the values
- **mode**: the value that appears most often; it is not reported by `stat.desc`, and base R's `mode()` returns an object's storage mode rather than the statistical mode

Assuming the data are a sample from a population

- `SE.mean`: **standard error of the mean**, an estimate of the variability of the mean calculated on different samples of the data (see also *central limit theorem*)
- `CI.mean.0.95`: half-width of the **95% confidence interval of the mean**; the interval is the sample mean plus or minus this value, and intervals built this way from repeated samples would contain the actual mean about 95% of the time

]
.pull-right[

<br/><br/><br/>

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> bill_length_mm </th> <th style="text-align:right;"> bill_depth_mm </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> median </td> <td style="text-align:right;"> 44.45 </td> <td style="text-align:right;"> 17.30 </td> </tr> <tr> <td style="text-align:left;"> mean </td> <td style="text-align:right;"> 43.92 </td> <td style="text-align:right;"> 17.15 </td> </tr> <tr> <td style="text-align:left;"> SE.mean </td> <td style="text-align:right;"> 0.30 </td> <td 
style="text-align:right;"> 0.11 </td> </tr> <tr> <td style="text-align:left;"> CI.mean.0.95 </td> <td style="text-align:right;"> 0.58 </td> <td style="text-align:right;"> 0.21 </td> </tr> <tr> <td style="text-align:left;"> var </td> <td style="text-align:right;"> 29.81 </td> <td style="text-align:right;"> 3.90 </td> </tr> <tr> <td style="text-align:left;"> std.dev </td> <td style="text-align:right;"> 5.46 </td> <td style="text-align:right;"> 1.97 </td> </tr> <tr> <td style="text-align:left;"> coef.var </td> <td style="text-align:right;"> 0.12 </td> <td style="text-align:right;"> 0.12 </td> </tr> </tbody> </table>

]

???

The standard deviation of sample means is known as the standard error of the mean (SE)

- standard deviation of sample means
- calculation
    - difference between each sample mean and overall mean
    - square the differences
    - sum them up
    - divide by number of samples
    - square root

---
## Estimating variation

- `var`: **variance** ($\sigma^2$), it quantifies the amount of variation as the average of squared distances from the mean

`$$\sigma^2 = \frac{1}{n} \sum_{i=1}^n (\mu-x_i)^2$$`

- `std.dev`: **standard deviation** ($\sigma$), it quantifies the amount of variation as the square root of the variance

`$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (\mu-x_i)^2}$$`

- `coef.var`: **coefficient of variation**, it quantifies the amount of variation as the standard deviation divided by the mean

The formulas above describe a whole population; `var()`, `sd()` and `stat.desc` estimate these values from a sample, using the sample mean in place of $\mu$ and dividing by $n - 1$ instead of $n$.

<!-- ## Broom Part `tidymodels` (under development), converts statistical analysis objects into tidy format ```r library(broom) nycflights13::flights %>% filter(month == 11, carrier == "US") %>% select(dep_delay, arr_delay, distance) %>% stat.desc() %>% tidy() ``` ``` ## # A tibble: 3 × 13 ## column n mean sd median trimmed mad min max range skew ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 dep_delay 14 245. 483. 30.0 148. 31.8 -17 1668 1685 2.22 ## 2 arr_delay 14 -134. 1320. 11.1 75.2 22.4 -4450 1667 6117 -2.58 ## 3 distance 14 95996. 270025. 
550. 31032. 550. 0 971558 971558 2.76 ## # ℹ 2 more variables: kurtosis <dbl>, se <dbl> ``` -->

---
## dplyr::across

The `dplyr` function `across` makes it possible to apply the same function to multiple columns within `summarise`.

.pull-left[

Instead of specifying `mean` for each one of four columns

```r
penguins %>%
  # filter out rows with missing data
  filter(!is.na(bill_length_mm)) %>%
  # summarise
  summarise(
    avg_bill_len_mm = mean(bill_length_mm),
    avg_bill_dpt_mm = mean(bill_depth_mm),
    avg_flip_len_mm = mean(flipper_length_mm),
    avg_body_mass_g = mean(body_mass_g)
  ) %>%
  kable("html", digits = c(2, 2, 2, 2)) %>%
  kable_styling(font_size = 14)
```

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> avg_bill_len_mm </th> <th style="text-align:right;"> avg_bill_dpt_mm </th> <th style="text-align:right;"> avg_flip_len_mm </th> <th style="text-align:right;"> avg_body_mass_g </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 43.92 </td> <td style="text-align:right;"> 17.15 </td> <td style="text-align:right;"> 200.92 </td> <td style="text-align:right;"> 4201.75 </td> </tr> </tbody> </table>

]
.pull-right[

One can specify the same function `mean` across a range of columns

```r
penguins %>%
  # filter out rows with missing data
  filter(!is.na(bill_length_mm)) %>%
  # summarise
  summarise(
    across(
      # range of columns to be summarised
      bill_length_mm:body_mass_g,
      # function to be applied
      mean
    )
  ) %>%
  kable("html", digits = c(2, 2, 2, 2)) %>%
  kable_styling(font_size = 14)
```

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> bill_length_mm </th> <th style="text-align:right;"> bill_depth_mm </th> <th style="text-align:right;"> flipper_length_mm </th> <th style="text-align:right;"> body_mass_g </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 43.92 </td> <td style="text-align:right;"> 17.15 </td> <td style="text-align:right;"> 200.92 </td> <td 
style="text-align:right;"> 4201.75 </td> </tr> </tbody> </table>

]

???

Particularly useful when working with many variables to ensure correctness and readability

---
## dplyr::across

.pull-left[

The function `across` can also be used with `mutate`, to apply the same function to a number of columns

```r
penguins %>%
  # mutate across columns
  mutate(
    across(
      c(
        bill_length_mm,
        bill_depth_mm,
        flipper_length_mm
      ),
      # convert millimetres to inches (1 in = 25.4 mm)
      function(x){ x / 25.4 }
    )
  ) %>%
  rename(
    bill_length_in = bill_length_mm,
    bill_depth_in = bill_depth_mm,
    flipper_length_in = flipper_length_mm
  )
```

]
.pull-right[

Old columns:

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> bill_length_mm </th> <th style="text-align:right;"> bill_depth_mm </th> <th style="text-align:right;"> flipper_length_mm </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 39.1 </td> <td style="text-align:right;"> 18.7 </td> <td style="text-align:right;"> 181 </td> </tr> <tr> <td style="text-align:right;"> 39.5 </td> <td style="text-align:right;"> 17.4 </td> <td style="text-align:right;"> 186 </td> </tr> <tr> <td style="text-align:right;"> 40.3 </td> <td style="text-align:right;"> 18.0 </td> <td style="text-align:right;"> 195 </td> </tr> <tr> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;"> 36.7 </td> <td style="text-align:right;"> 19.3 </td> <td style="text-align:right;"> 193 </td> </tr> </tbody> </table>

New columns:

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> bill_length_in </th> <th style="text-align:right;"> bill_depth_in </th> <th style="text-align:right;"> flipper_length_in </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1.54 </td> <td style="text-align:right;"> 0.74 </td> <td style="text-align:right;"> 7.13
</td> </tr> <tr> <td style="text-align:right;"> 1.56 </td> <td style="text-align:right;"> 0.69 </td> <td style="text-align:right;"> 7.32 </td> </tr> <tr> <td style="text-align:right;"> 1.59 </td> <td style="text-align:right;"> 0.71 </td> <td style="text-align:right;"> 7.68 </td> </tr> <tr> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:right;"> 1.44 </td> <td style="text-align:right;"> 0.76 </td> <td style="text-align:right;"> 7.60 </td> </tr> </tbody> </table>

]

---
class: inverse, center, middle

# Exploring assumptions

---
## Normal distribution

.pull-left[

*Distribution* can refer to:

- how many cases of each value are present in a set
- the probability of observing that number of cases in a set

A set of values is said to *be normally distributed* if their distribution follows (within a certain margin of error) the **normal distribution**

- characterised by the bell-shaped curve
- the majority of values lie around the centre of the distribution
- the further the values are from the centre, the lower their frequency
- about 95% of values lie within 2 standard deviations of the mean

]
.pull-right[

<br/>

<img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-16-1.png" width="100%" />

]

???

- For instance, you might imagine that all penguins have a rather similar flipper length.
- So, in most cases, if we have a set of values representing measurements of penguin flipper lengths, most values will be close to the mean.
- Very long or very short flipper lengths are rather rare.
- Thus, a good question could be: are penguins' flipper lengths normally distributed?

---
## Density histogram

.pull-left[

```r
penguins %>%
  ggplot(
    aes(
      x = flipper_length_mm
    )
  ) +
  geom_histogram(
    aes(
      y = after_stat(density)
    )
  ) +
  stat_function(
    fun = dnorm,
    args = list(
      mean = penguins %>%
        filter(!is.na(flipper_length_mm)) %>%
        pull(flipper_length_mm) %>%
        mean(),
      sd = penguins %>%
        filter(!is.na(flipper_length_mm)) %>%
        pull(flipper_length_mm) %>%
        sd()
    ),
    colour = "red",
    size = 1
  )
```

]
.pull-right[

<img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-18-1.png" width="100%" />

]

---
## Q-Q plot

<br/>

.pull-left[

A Q-Q plot illustrates

- the quantiles of the observed values against
- the corresponding quantiles of a particular theoretical distribution
    - (in this case, *normal* distribution)
- points lying close to the line suggest that the values follow that distribution

```r
penguins %>%
  ggplot(
    aes(
      sample = flipper_length_mm
    )
  ) +
  stat_qq() +
  stat_qq_line()
```

]
.pull-right[

<img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-20-1.png" width="80%" />

]

---
## Shapiro–Wilk test

- Compares
    - the distribution of a variable
    - with a normal distribution having the same mean and standard deviation
- if significant
    - the distribution is not normal
- R functions
    - `shapiro.test` function in `stats`
    - `normtest` values in `stat.desc` (more on this in the coming slides)

```r
penguins %>%
  pull(flipper_length_mm) %>%
  shapiro.test()
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.95155, p-value = 3.54e-09
```

**Conclusion**: The *flipper length* of penguins in the Palmer Station dataset **is not** normally distributed.
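
To get a feel for how the test behaves, here is a minimal sketch that applies `shapiro.test` to two simulated samples, one drawn from a normal distribution and one from a strongly skewed exponential distribution. The data are hypothetical (not part of the penguins dataset), and the seed, sample size and distribution parameters are arbitrary assumptions.

```r
# Simulated, hypothetical samples (not from the penguins dataset)
set.seed(202)
normal_sample <- rnorm(200, mean = 200, sd = 14)
skewed_sample <- rexp(200, rate = 1 / 200)

# shapiro.test() returns a list that includes
# the W statistic and the p-value
normal_test <- shapiro.test(normal_sample)
skewed_test <- shapiro.test(skewed_sample)

# W closer to 1 indicates closer agreement with a normal distribution
c(
  normal = unname(normal_test$statistic),
  skewed = unname(skewed_test$statistic)
)

# a small p-value indicates a significant departure from normality
c(
  normal = normal_test$p.value,
  skewed = skewed_test$p.value
)
```

The skewed sample should return a very small p-value. For the normal sample the test will usually be non-significant, although a sample that really is normal is still flagged about 5% of the time at *p < .05*.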
---
## Significance

<br/>

Most statistical tests are based on the idea of hypothesis testing

- a **null hypothesis** is set
- the data are fit into a statistical model
- the model is assessed with a **test statistic**
- the **significance** (*p*-value) is the probability of obtaining a test statistic at least as extreme as the one observed, if the null hypothesis were true

The threshold to accept or reject a hypothesis is arbitrary and based on conventions (e.g., *p < .01* or *p < .05*)

**Example:** The null hypothesis of the Shapiro–Wilk test is that the sample is normally distributed, and *p < .01* indicates that a result like the one observed would be very unlikely if that were true. So, the null hypothesis is rejected: the *flipper length* of penguins in the Palmer Station dataset **is not** normally distributed.

---
## Example

The *flipper length* of **Adelie** penguins **is normally distributed**

.pull-left[

```r
penguins %>%
  filter(
    species == "Adelie"
  ) %>%
  pull(
    flipper_length_mm
  ) %>%
  shapiro.test()
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.99339, p-value = 0.72
```

]
.pull-right[

<img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-23-1.png" width="100%" />

]

---
## Example

The *flipper length* of **Adelie** penguins **is normally distributed**

.pull-left[

```r
penguins %>%
  filter(
    species == "Adelie"
  ) %>%
  pull(
    flipper_length_mm
  ) %>%
  shapiro.test()
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.99339, p-value = 0.72
```

]
.pull-right[

<img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-25-1.png" width="100%" />

]

---
## Example: Leicester OAs

.pull-left[

Is the population in Leicester's Output Areas (OAs) normally distributed?

... almost, but actually, no!

```r
leicester_2011OAC <- read_csv(
  "2011_OAC_Raw_uVariables_Leicester.csv"
)
```

```r
leicester_2011OAC %>%
  pull(
    Total_Population
  ) %>%
  shapiro.test()
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.97505, p-value = 7.626e-12
```

]
.pull-right[

<img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-28-1.png" width="55%" />

<img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-29-1.png" width="55%" />

]

---
## Skewness and kurtosis

<br/>

In a normal distribution, *skewness* and (excess) *kurtosis* should be **zero**

- `skewness`: **skewness** value indicates
    - positive: longer tail on the right, with values concentrated towards the left
    - negative: longer tail on the left, with values concentrated towards the right
- `kurtosis`: **kurtosis** value indicates
    - positive: heavy-tailed distribution
    - negative: flat, light-tailed distribution
- `skew.2SE` and `kurt.2SE`: skewness and kurtosis divided by 2 standard errors.
Therefore
    - if `> 1` (or `< -1`) then the statistic is significant *(p < .05)*
    - if `> 1.29` (or `< -1.29`) then the statistic is significant *(p < .01)*

---
## Example

*Flipper length* is not normally distributed

- positively skewed, with values concentrated towards the left (skewness positive, `skew.2SE > 1.29`)
- flat distribution (kurtosis negative, `kurt.2SE < -1.29`)

```r
penguins %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  stat.desc(basic = FALSE, desc = FALSE, norm = TRUE)
```

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> bill_length_mm </th> <th style="text-align:right;"> bill_depth_mm </th> <th style="text-align:right;"> flipper_length_mm </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> skewness </td> <td style="text-align:right;"> 0.0526530 </td> <td style="text-align:right;"> -0.1422086 </td> <td style="text-align:right;"> 0.3426554 </td> </tr> <tr> <td style="text-align:left;"> skew.2SE </td> <td style="text-align:right;"> 0.1996290 </td> <td style="text-align:right;"> -0.5391705 </td> <td style="text-align:right;"> 1.2991456 </td> </tr> <tr> <td style="text-align:left;"> kurtosis </td> <td style="text-align:right;"> -0.8931397 </td> <td style="text-align:right;"> -0.9233523 </td> <td style="text-align:right;"> -0.9991866 </td> </tr> <tr> <td style="text-align:left;"> kurt.2SE </td> <td style="text-align:right;"> -1.6979696 </td> <td style="text-align:right;"> -1.7554076 </td> <td style="text-align:right;"> -1.8995781 </td> </tr> <tr> <td style="text-align:left;"> normtest.W </td> <td style="text-align:right;"> 0.9748548 </td> <td style="text-align:right;"> 0.9725838 </td> <td style="text-align:right;"> 0.9515451 </td> </tr> <tr> <td style="text-align:left;"> normtest.p </td> <td style="text-align:right;"> 0.0000112 </td> <td style="text-align:right;"> 0.0000044 </td> <td style="text-align:right;"> 0.0000000 </td> </tr> </tbody> </table>

---
## Example

Values are instead not significant for **Adelie** penguins

- both `skew.2SE` and `kurt.2SE`
between `-1` and `1`

```r
penguins %>%
  filter(species == "Adelie") %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm) %>%
  stat.desc(basic = FALSE, desc = FALSE, norm = TRUE)
```

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> bill_length_mm </th> <th style="text-align:right;"> bill_depth_mm </th> <th style="text-align:right;"> flipper_length_mm </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> skewness </td> <td style="text-align:right;"> 0.1584764 </td> <td style="text-align:right;"> 0.3148847 </td> <td style="text-align:right;"> 0.0856093 </td> </tr> <tr> <td style="text-align:left;"> skew.2SE </td> <td style="text-align:right;"> 0.4014211 </td> <td style="text-align:right;"> 0.7976035 </td> <td style="text-align:right;"> 0.2168485 </td> </tr> <tr> <td style="text-align:left;"> kurtosis </td> <td style="text-align:right;"> -0.2285951 </td> <td style="text-align:right;"> -0.1361153 </td> <td style="text-align:right;"> 0.2382734 </td> </tr> <tr> <td style="text-align:left;"> kurt.2SE </td> <td style="text-align:right;"> -0.2913388 </td> <td style="text-align:right;"> -0.1734755 </td> <td style="text-align:right;"> 0.3036734 </td> </tr> <tr> <td style="text-align:left;"> normtest.W </td> <td style="text-align:right;"> 0.9933618 </td> <td style="text-align:right;"> 0.9846683 </td> <td style="text-align:right;"> 0.9933916 </td> </tr> <tr> <td style="text-align:left;"> normtest.p </td> <td style="text-align:right;"> 0.7166005 </td> <td style="text-align:right;"> 0.0924897 </td> <td style="text-align:right;"> 0.7200466 </td> </tr> </tbody> </table>

---
## Homogeneity of variance

<br/>

.pull-left[

**Levene’s test** for equality of variance across the levels (groups) of a categorical variable

- If significant, the variance differs between levels

```r
library(car)

penguins %>%
  leveneTest(
    body_mass_g ~ species,
    data = .
  )
```

```
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   2  5.1203 0.006445 **
##       339                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

]
.pull-right[

<img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-35-1.png" width="80%" />

]

---
## Summary

.pull-left[

**Today**: Exploratory statistics

- Descriptive statistics
- Exploring assumptions
    - Normality
    - Skewness and kurtosis
    - Homogeneity of variance

**Next time**: Comparing variables

- Comparing distributions through mean
- Correlation analysis
- Variable transformation

<br/>

.referencenote[
Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com).
]

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/202-slides-exploratory-statistics_files/figure-html/unnamed-chunk-36-1.png)<!-- -->

]