class: center, middle, inverse, title-slide .title[ # Lecture 203 ] .author[ ### Dr Stefano De Sabbata
School of Geography, Geology, and the Env., University of Leicester
github.com/sdesabbata/r-for-geographic-data-science
s.desabbata@le.ac.uk
|
@maps4thought
text licensed under
CC BY-SA 4.0
, code licensed under
GNU GPL v3.0
] --- class: inverse, center, middle # Comparing data --- ## Recap .pull-left[ <br/> **Prev**: Exploratory statistics - Descriptive statistics - Exploring assumptions - Normality - Skewness and kurtosis - Homogeneity of variance **Today**: Comparing data - Comparing distributions through mean - Correlation analysis - Variable transformation <br/> ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-2-1.png)<!-- --> ] --- ## Comparing means .pull-left[ When are (two) group means different? Independent T-test: - null hypothesis - there is **no difference** between the groups - if *p-value* (significance) below threshold (e.g., `0.05` or `0.01`) - **group means are different** - assumptions - normally distributed values in groups - homogeneity of variance of values in groups - if groups have different sizes - independence of groups - e.g. different conditions of an experiment ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-3-1.png)<!-- --> ] --- ## General linear model <br> **General linear model** - observation `\(i\)` can be predicted by a `\(model\)` (predictors) - accounting for some amount of error $$outcome_i = (model) + error_i $$ <br> **Independent T-test** as a general linear model - groups is the predictor (categorical variable) - single observation value as group mean plus error $$outcome_i = (group\ mean) + error_i $$ --- ## Example: bill depth .pull-left[ <br/> Is the bill depth of *Adelie* and *Gentoo* penguins different? 1. Check assumptions 1. Indipendent groups: ok 2. normal distribution: check using Shapiro-Wilk test 3. homogeneity of variance: not necessary 2. Run T-test 1. `stats::t.test` ] .pull-right[ Values are normally distributed for both groups ```r penguins %>% filter(species == "Adelie") %>% pull(bill_depth_mm) %>% shapiro.test() ``` ``` ## ## Shapiro-Wilk normality test ## ## data: . ## W = 0.98467, p-value = 0.09249 ``` ```r penguins %>% filter(species == "Gentoo") %>% pull(bill_depth_mm) %>% shapiro.test() ``` ``` ## ## Shapiro-Wilk normality test ## ## data: . ## W = 0.97609, p-value = 0.0277 ``` ] --- ## stats::t.test .pull-left[ The test is significant, the group means are different ```r penguins %>% filter(species %in% c("Adelie", "Gentoo")) %$% t.test(bill_depth_mm ~ species) ``` ``` ## ## Welch Two Sample t-test ## ## data: bill_depth_mm by species ## t = 25.337, df = 271.98, p-value < 2.2e-16 ## alternative hypothesis: true difference in means between group Adelie and group Gentoo is not equal to 0 ## 95 percent confidence interval: ## 3.102837 3.625651 ## sample estimates: ## mean in group Adelie mean in group Gentoo ## 18.34636 14.98211 ``` How to report: - *t*(271.98) = 25.34, *p* < .01 ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-7-1.png)<!-- --> ] --- ## Example: bill depth (2) .pull-left[ Is the bill depth of *Adelie* and *Chinstrap* penguins different? We already checked normality for *Adelie* and *Gentoo*. Are values for *Chinstrap* normally distributed? ```r penguins %>% filter(species == "Chinstrap") %>% pull(bill_depth_mm) %>% shapiro.test() ``` ``` ## ## Shapiro-Wilk normality test ## ## data: . ## W = 0.97274, p-value = 0.1418 ``` Values are normally distributed for all three groups ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-9-1.png)<!-- --> ] --- ## stats::t.test .pull-left[ The test is not significant, the group means are not different ```r penguins %>% filter(species %in% c("Adelie", "Chinstrap")) %$% t.test(bill_depth_mm ~ species) ``` ``` ## ## Welch Two Sample t-test ## ## data: bill_depth_mm by species ## t = -0.43771, df = 137.75, p-value = 0.6623 ## alternative hypothesis: true difference in means between group Adelie and group Chinstrap is not equal to 0 ## 95 percent confidence interval: ## -0.4095657 0.2611044 ## sample estimates: ## mean in group Adelie mean in group Chinstrap ## 18.34636 18.42059 ``` How to report: - *t*(137.75) = -0.44, *p* = 0.66 ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-12-1.png)<!-- --> ] --- ## ANalysis Of VAriance (ANOVA) .pull-left[ ANOVA is similar to the T-tests, but more than two groups - null hypothesis - **no difference** between the groups - if *p-value* (significance) below threshold (e.g., `0.05` or `0.01`) - **group means are different** - assumptions - normally distributed values in groups - especially if groups have different sizes - homogeneity of variance of values in groups - if groups have different sizes - independence of groups - e.g. different conditions of an experiment ] .pull-right[ **General linear model** - observation `\(i\)` can be predicted by a `\(model\)` (predictors) - accounting for some amount of error $$outcome_i = (model) + error_i $$ <br/> **ANOVA** as a general linear model - groups is the predictor (categorical variable) - single observation value as group mean plus error $$outcome_i = (group\ mean) + error_i $$ ] --- ## Example: bill depth (3) .pull-left[ <br/> Is the bill depth different between all three species? 1. Check assumptions 1. Indipendent groups: ok 2. normal distribution: check using Shapiro-Wilk test 3. homogeneity of variance: not necessary 2. Run ANOVA 1. `stats::aov` ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-13-1.png)<!-- --> ] --- ## stats::aov <br/> .pull-left[ The test is significant, the group means are different ```r penguins %$% aov(bill_depth_mm ~ species) %>% summary() ``` ``` ## Df Sum Sq Mean Sq F value Pr(>F) ## species 2 904.0 452.0 359.8 <2e-16 *** ## Residuals 339 425.9 1.3 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## 2 observations deleted due to missingness ``` How to report: - *F*(2, 339) = 359.79, *p* < .01 ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-16-1.png)<!-- --> ] --- class: inverse, center, middle # Correlation --- ## Correlation <br/> **Correlation** is a standardised measure of covariance .pull-left[ Two continuous variables can be - not related at all - related - positively: - entities with *high values* in one - tend to have *high values* in the other - negatively: - entities with *high values* in one - tend to have *low values* in the other ] .pull-right[ Three different approaches - **Pearson's r** - if two variables are **normally distributed** - **Spearman’s rho** - if two variables are **not normally distributed** - **Kendall’s tau** - if **not normally distributed** - and there are a **large number of ties** ] --- ## Example Are flipper length and body mass related in Chinstrap penguins? .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-17-1.png)<!-- --> ] --- ## Pearson’s r .pull-left[ <br/> If two variables are **normally distributed**, use **Pearson's r** - null hypothesis - there is no relationship between the variables - assumptions - variables are normally distributed The square of the correlation value indicates the percentage of shared variance ] .pull-right[ Flipper length and body mass are normally distributed in Chinstrap penguins ```r penguins %>% filter(species == "Chinstrap") %>% pull(flipper_length_mm) %>% shapiro.test() ``` ``` ## ## Shapiro-Wilk normality test ## ## data: . ## W = 0.98891, p-value = 0.8106 ``` ```r penguins %>% filter(species == "Chinstrap") %>% pull(body_mass_g) %>% shapiro.test() ``` ``` ## ## Shapiro-Wilk normality test ## ## data: . ## W = 0.98449, p-value = 0.5605 ``` ] --- ## stats::cor.test .pull-left[ <br/> If two variables are **normally distributed**, use **Pearson's r** - null hypothesis - there is no relationship between the variables - assumptions - variables are normally distributed The square of the correlation value indicates the percentage of shared variance ] .pull-right[ ```r penguins %>% filter(species == "Chinstrap") %$% cor.test(flipper_length_mm, body_mass_g) ``` ``` ## ## Pearson's product-moment correlation ## ## data: flipper_length_mm and body_mass_g ## t = 6.7947, df = 66, p-value = 3.748e-09 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.4759352 0.7632368 ## sample estimates: ## cor ## 0.6415594 ``` - Flipper length and body mass are related - `p-value < 0.01` - sharing 41.2% of variance - 0.642 `^ 2 =` 0.412 ] --- ## Example: flipper length and body mass .pull-left[ But, are flipper length and body mass related in penguins (without considering species as separated groups)? .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-21-1.png)<!-- --> ] ] .pull-right[ Flipper length and body mass are not normally distributed when all penguins are taken into account as a single group ```r penguins %>% pull(flipper_length_mm) %>% shapiro.test() ``` ``` ## ## Shapiro-Wilk normality test ## ## data: . ## W = 0.95155, p-value = 3.54e-09 ``` ```r penguins %>% pull(body_mass_g) %>% shapiro.test() ``` ``` ## ## Shapiro-Wilk normality test ## ## data: . ## W = 0.95921, p-value = 3.679e-08 ``` ] --- ## Spearman’s rho .pull-left[ If two variables are **not normally distrib.**, use **Spearman’s rho** - null hypothesis - there is no relationship between the variables - non-parametric - uses full dataset to calculate the statistics - rather than estimate key parameters of distributions from data - based on rank difference - assumptions - ties are uncommon The square of the correlation value indicates the percentage of shared variance ] .pull-right[ ```r penguins %$% cor.test( flipper_length_mm, body_mass_g, method = "spearman" ) ``` ``` ## ## Spearman's rank correlation rho ## ## data: flipper_length_mm and body_mass_g ## S = 1066875, p-value < 2.2e-16 ## alternative hypothesis: true rho is not equal to 0 ## sample estimates: ## rho ## 0.8399741 ``` - Flipper length and body mass are related - `p-value < 0.01` - sharing 70.6% of variance - 0.84 `^ 2 =` 0.706 ] --- ## Correlation with ties Spearman's rho - `Cannot compute exact p-value with ties` ```r penguins %$% cor.test(flipper_length_mm, body_mass_g, method = "spearman") ``` ```r ## Warning in cor.test.default(flipper_length_mm, body_mass_g, method = "spearman"): ## Cannot compute exact p-value with ties ``` ``` ## ## Spearman's rank correlation rho ## ## data: flipper_length_mm and body_mass_g ## S = 1066875, p-value < 2.2e-16 ## alternative hypothesis: true rho is not equal to 0 ## sample estimates: ## rho ## 0.8399741 ``` --- ## Kendall’s tau .pull-left[ If two variables are **not normally distributed** and there are **many ties**, use **Kendall’s tau** - null hypothesis - there is no relationship between the variables - non-parametric - based on rank difference - no assumptions - less *powerful* - even if there is a relationship, significance might be high The square of the correlation value indicates the percentage of shared variance ] .pull-right[ ```r penguins %$% cor.test( flipper_length_mm, body_mass_g, method = "kendall" ) ``` ``` ## ## Kendall's rank correlation tau ## ## data: flipper_length_mm and body_mass_g ## z = 17.898, p-value < 2.2e-16 ## alternative hypothesis: true tau is not equal to 0 ## sample estimates: ## tau ## 0.6604675 ``` - Flipper length and body mass are related - `p-value < 0.01` - sharing 43.6% of variance - 0.66 `^ 2 =` 0.436 ] --- ## psych::pairs.panels .pull-left[ Combining: - histograms - scatter plots - correlations ```r library(psych) penguins %>% select( bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g ) %>% pairs.panels( method = "kendall", stars = TRUE ) ## Signif.: 0 '***' 0.001 '**' 0.01 ## 0.01 '*' 0.05 0.05 '.' 0.1 ' ' 1 ``` ] .pull-right[ .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-31-1.png)<!-- --> ] ] --- ## GGally::ggpairs More flexible and sophisticated pairs plots .pull-left[ ```r library(GGally) penguins %>% select( bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, species ) %>% ggpairs( mapping = aes(color = species), upper = list(continuous = wrap(ggally_cor, method = "kendall")), lower = list(continuous = wrap("points", alpha = 0.3, size=0.1)) ) + scale_color_manual(values = c("darkorange","darkorchid","cyan4")) + scale_fill_manual(values = c("darkorange","darkorchid","cyan4")) + theme_bw() ## Signif.: 0 '***' 0.001 '**' 0.01 ## 0.01 '*' 0.05 0.05 '.' 0.1 ' ' 1 ``` ] .pull-right[ .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-33-1.png)<!-- --> ] ] --- ## Chi-squre .pull-left[ How to test a relationship between two **categorical** variables? Chi-square test: - null hypothesis - no relationship between variables - non-parametric - based on cross-tabulated expected counts - no assumptions ```r library(gmodels) penguins %$% CrossTable( island, species, chisq = TRUE, expected = TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE, sresid = TRUE, format = "SPSS" ) ``` ] .pull-right[ .small[ ``` ## ## Cell Contents ## |-------------------------| ## | Count | ## | Expected Values | ## | Row Percent | ## | Std Residual | ## |-------------------------| ## ## Total Observations in Table: 344 ## ## | species ## island | Adelie | Chinstrap | Gentoo | Row Total | ## -------------|-----------|-----------|-----------|-----------| ## Biscoe | 44 | 0 | 124 | 168 | ## | 74.233 | 33.209 | 60.558 | | ## | 26.190% | 0.000% | 73.810% | 48.837% | ## | -3.509 | -5.763 | 8.152 | | ## -------------|-----------|-----------|-----------|-----------| ## Dream | 56 | 68 | 0 | 124 | ## | 54.791 | 24.512 | 44.698 | | ## | 45.161% | 54.839% | 0.000% | 36.047% | ## | 0.163 | 8.784 | -6.686 | | ## -------------|-----------|-----------|-----------|-----------| ## Torgersen | 52 | 0 | 0 | 52 | ## | 22.977 | 10.279 | 18.744 | | ## | 100.000% | 0.000% | 0.000% | 15.116% | ## | 6.055 | -3.206 | -4.329 | | ## -------------|-----------|-----------|-----------|-----------| ## Column Total | 152 | 68 | 124 | 344 | ## -------------|-----------|-----------|-----------|-----------| ## ## ## Statistics for All Table Factors ## ## ## Pearson's Chi-squared test ## ------------------------------------------------------------ ## Chi^2 = 299.5503 d.f. = 4 p = 1.354574e-63 ## ## ## ## Minimum expected frequency: 10.27907 ``` ] ] ??? Different islands do have different amounts of penguins from different species --- class: inverse, center, middle # Data transformation --- ## Z-scores .pull-left[ Transform the values as relative to - distribution's mean `\(\mu\)` - standard deviation `\(\sigma\)` $$ z_i = \frac{x_i - \mu}{\sigma} $$ ] .pull-right[ ```r penguins %>% mutate( flipper_length_zscore = scale(flipper_length_mm) ) %>% ggplot(aes(x = flipper_length_zscore)) + geom_histogram() + theme_bw() ``` ] .pull-left[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-37-1.png)<!-- --> ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-38-1.png)<!-- --> ] ??? Same distribution (bar some aggregation in the histogram) Commonly used to render two variables easier to compare --- ## Common transformations .pull-left[ **Logarithmic** transformations are useful to *"un-skew"* variables Common approaches include: - natural logarithm (`log`) - binary logarithm (`log2`) - logarithm base 10 (`log10`) Only possible on values `> 0` <img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" /> ] .pull-right[ **Inverse hyperbolic sine** transformations are useful to *"un-skew"* variables - similar to logarithmic transformations - defined on all values - in R: `asinh` <img src="data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" /> ] .center[ Not always sufficient to *"un-skew"* variables ] --- ## Example: residents aged 20 to 24 .pull-left[ The number of residents aged 20 to 24 (`u011`) in the areas of Leicester described as *"Cosmopolitans"* by the [2011 Output Area Classification](https://www.gov.uk/government/statistics/2011-area-classification-for-super-output-areas) is skewed ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-41-1.png)<!-- --> ] .pull-right[ ```r leicester_2011OAC %>% filter(supgrpname == "Cosmopolitans") %>% select(u011) %>% stat.desc( basic = FALSE, desc = FALSE, norm = TRUE ) %>% kable("html", digits = 3) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> u011 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> skewness </td> <td style="text-align:right;"> 1.521 </td> </tr> <tr> <td style="text-align:left;"> skew.2SE </td> <td style="text-align:right;"> 2.879 </td> </tr> <tr> <td style="text-align:left;"> kurtosis </td> <td style="text-align:right;"> 2.089 </td> </tr> <tr> <td style="text-align:left;"> kurt.2SE </td> <td style="text-align:right;"> 1.999 </td> </tr> <tr> <td style="text-align:left;"> normtest.W </td> <td style="text-align:right;"> 0.847 </td> </tr> <tr> <td style="text-align:left;"> normtest.p </td> <td style="text-align:right;"> 0.000 </td> </tr> </tbody> </table> ] --- ## Example: log10 .pull-left[ However, it's logarithm base 10 is normally distributed, thus it can be used with tests requiring normally distributed values ```r mutate(log10_u011 = log10(u011)) ``` ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-44-1.png)<!-- --> ] .pull-right[ ```r leicester_2011OAC %>% filter(supgrpname == "Cosmopolitans") %>% mutate(log10_u011 = log10(u011)) %>% select(log10_u011) %>% stat.desc( basic = FALSE, desc = FALSE, norm = TRUE ) %>% kable("html", digits = 3) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> log10_u011 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> skewness </td> <td style="text-align:right;"> -0.504 </td> </tr> <tr> <td style="text-align:left;"> skew.2SE </td> <td style="text-align:right;"> -0.953 </td> </tr> <tr> <td style="text-align:left;"> kurtosis </td> <td style="text-align:right;"> 0.872 </td> </tr> <tr> <td style="text-align:left;"> kurt.2SE </td> <td style="text-align:right;"> 0.834 </td> </tr> <tr> <td style="text-align:left;"> normtest.W </td> <td style="text-align:right;"> 0.976 </td> </tr> <tr> <td style="text-align:left;"> normtest.p </td> <td style="text-align:right;"> 0.118 </td> </tr> </tbody> </table> ] --- ## Example: asinh .pull-left[ The Inverse hyperbolic sine is also normally distributed ```r mutate(ihs_u011 = asinh(u011)) ``` ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-47-1.png)<!-- --> ] .pull-right[ ```r leicester_2011OAC %>% filter(supgrpname == "Cosmopolitans") %>% mutate(ihs_u011 = asinh(u011)) %>% select(ihs_u011) %>% stat.desc( basic = FALSE, desc = FALSE, norm = TRUE ) %>% kable("html", digits = 3) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> ihs_u011 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> skewness </td> <td style="text-align:right;"> -0.500 </td> </tr> <tr> <td style="text-align:left;"> skew.2SE </td> <td style="text-align:right;"> -0.947 </td> </tr> <tr> <td style="text-align:left;"> kurtosis </td> <td style="text-align:right;"> 0.860 </td> </tr> <tr> <td style="text-align:left;"> kurt.2SE </td> <td style="text-align:right;"> 0.823 </td> </tr> <tr> <td style="text-align:left;"> normtest.W </td> <td style="text-align:right;"> 0.976 </td> </tr> <tr> <td style="text-align:left;"> normtest.p </td> <td style="text-align:right;"> 0.122 </td> </tr> </tbody> </table> ] --- ## Summary .pull-left[ <br/> **Today**: Comparing data - Comparing distributions through mean - Correlation analysis - Variable transformation **Next time**: Regression analysis - Simple regression - Multiple regression - Comparing models <br/> .referencenote[ Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com). ] ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/203-slides-comparing-data_files/figure-html/unnamed-chunk-49-1.png)<!-- --> ]