class: center, middle, inverse, title-slide .title[ # Lecture 204 ] .author[ ### Dr Stefano De Sabbata
School of Geography, Geology, and the Env., University of Leicester
github.com/sdesabbata/r-for-geographic-data-science
s.desabbata@le.ac.uk
|
@maps4thought
text licensed under
CC BY-SA 4.0
, code licensed under
GNU GPL v3.0
] --- class: inverse, center, middle # Simple Regression --- ## Recap .pull-left[ <br/> **Prev**: Comparing data - Comparing distributions through mean - Correlation analysis - Variable transformation **Today**: Regression analysis - Simple regression - Least square via gradient descent - Logistic regression <br/> ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-2-1.png)<!-- --> ] --- ## Regression analysis <br/> **Regression analysis** is a supervised machine learning approach Special case of the general linear model $$ outcome_i = (model) + error_i $$ <br/> Predict (estimate) value of one outcome (dependent) variable as - one predictor (independent) variable: **simple / univariate** `$$Y_i=(b_0+b_1*X_{i1})+\epsilon_i$$` - more predictor (independent) variables: **multiple / multivar.** `$$Y_i=(b_0+b_1*X_{i1}+b_2*X_{i2}+\dots+b_M*X_{iM})+\epsilon_i$$` --- ## Example Can we predict a penguin's body mass from flipper length? `$$body\ mass_i=(b_0+b_1*flipper\ length_{i})+\epsilon_i$$` .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-3-1.png)<!-- --> ] --- ## Example Can we predict a penguin's body mass from flipper length? `$$body\ mass_i=(b_0+b_1*flipper\ length_{i})+\epsilon_i$$` .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-4-1.png)<!-- --> ] --- ## Least squares .pull-left[ **Least squares** is the most commonly used approach to generate a regression model The model fits a line - to **minimise** the squared values of the **residuals** (errors) - that is squared difference between - **observed values** - **model** <br/> `$$residual_i=observed_i-model_i$$` `$$deviation=\sum_i(observed_i-model_i)^2$$` ] .pull-right[ .center[ ![:scale 90%](data:image/png;base64,#images/489px-Linear_least_squares_example2.svg.png) .referencenote[ by Krishnavedala<br/> via Wikimedia Commons,<br/>CC-BY-SA-3.0 ] ] ] --- ## Least squares via gradient descent .center[ ![:scale 90%](data:image/png;base64,#images/linear-regression-gd_shiny-app.png) [sdesabbata.shinyapps.io/linear-regression-gd](https://sdesabbata.shinyapps.io/linear-regression-gd/) ] --- ## Least squares: Assumptions .pull-left[ <br/> - **Linearity** - the relationship is actually linear - **Normality** of residuals - standard residuals are normally distributed with mean `0` - **Homoscedasticity** of residuals - at each level of the predictor variable(s) the variance of the standard residuals should be the same (*homo-scedasticity*) rather than different (*hetero-scedasticity*) - **Independence** of residuals - adjacent standard residuals are not correlated ] .pull-right[ .center[ ![:scale 90%](data:image/png;base64,#images/489px-Linear_least_squares_example2.svg.png) .referencenote[ by Krishnavedala<br/> via Wikimedia Commons,<br/>CC-BY-SA-3.0 ] ] ] --- ## stats::lm .pull-left[ .small[ ```r bm_fl_model <- penguins %>% filter( !is.na(body_mass_g) & !is.na(flipper_length_mm) ) %$% lm(body_mass_g ~ flipper_length_mm) bm_fl_model %>% summary() ``` ``` ## ## Call: ## lm(formula = body_mass_g ~ flipper_length_mm) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1058.80 -259.27 -26.88 247.33 1288.69 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -5780.831 305.815 -18.90 <2e-16 *** ## flipper_length_mm 49.686 1.518 32.72 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 394.3 on 340 degrees of freedom ## Multiple R-squared: 0.759, Adjusted R-squared: 0.7583 ## F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16 ``` ] ] .pull-right[ The output indicates - **p-value: < 2.2e-16**: `\(p<.01\)` the model is significant - derived by comparing **F-statistic** (1070.74) to F distribution having specified degrees of freedom (1, 340) - Report as: F(1, 340) = 1070.74 - **Adjusted R-squared: 0.7583**: - flipper length can account for 75.83% variation in body mass - **Coefficients** - Intercept estimate -5780.8314 is significant - `flipper_length_mm` (slope) estimate 49.6856 is significant ] --- ## Outliers and influential cases ```r penguins_output <- penguins %>% filter(!is.na(body_mass_g) | !is.na(flipper_length_mm)) %>% mutate( model_stdres = bm_fl_model %>% rstandard(), model_cook_dist = bm_fl_model %>% cooks.distance() ) penguins_output %>% select(body_mass_g, model_stdres, model_cook_dist) %>% filter(abs(model_stdres) > 2.58 | model_cook_dist > 1) ``` ``` ## # A tibble: 4 × 3 ## body_mass_g model_stdres model_cook_dist ## <int> <dbl> <dbl> ## 1 4650 3.28 0.0388 ## 2 5850 2.66 0.0182 ## 3 6300 2.80 0.0353 ## 4 2700 -2.69 0.0149 ``` <br/> - No influential cases (Cook's distance `> 1`) - There are a handful of outliers (4 abs std res `> 2.58`) --- ## Checking assumptions: normality .pull-left[ Shapiro-Wilk test for normality of standard residuals, - robust models: should be **not** significant ```r penguins_output %$% shapiro.test( model_stdres ) ``` ``` ## ## Shapiro-Wilk normality test ## ## data: model_stdres ## W = 0.99295, p-value = 0.1085 ``` {{content}} ] .pull-right[ .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-9-1.png)<!-- --> ] ] -- **Standard residuals are normally distributed!** --- ## Checking assumptions: homoscedasticity .pull-left[ Breusch-Pagan test for homoscedasticity of standard residuals - robust models: should be **not** significant ```r bm_fl_model %>% bptest() ``` ``` ## ## studentized Breusch-Pagan test ## ## data: . ## BP = 2.1579, df = 1, p-value = 0.1418 ``` {{content}} ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-11-1.png)<!-- --> ] -- **Standard residuals are homoscedastic!** --- ## Checking assumptions: independence <br/> Durbin-Watson test for the independence of residuals - robust models: statistic should be close to 2 (advised between 1 and 3) and **not** significant ```r bm_fl_model %>% dwtest() ``` ``` ## ## Durbin-Watson test ## ## data: . ## DW = 2.1896, p-value = 0.9572 ## alternative hypothesis: true autocorrelation is greater than 0 ``` -- **Standard residuals are independent!** Note: the result depends on the order of the data. --- ## Example Yes, we can predict a penguin's body mass from flipper length! `$$body\ mass_i=(-5780.83+49.69*flipper\ length_{i})+\epsilon_i$$` .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-13-1.png)<!-- --> ] --- ## A different problem: classification - *regression*: predicting a continuous output variable - *classification*: predicting a categorical (nominal, including binary) output variable For instance: can we automatically identify the two species based on the penguins' body mass? .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-15-1.png)<!-- --> ] --- ## Logistic regression .pull-left[ There are different approaches on how to define it... this is probably the most common <br/> `$$h(x) = \frac{1}{1 + e^{-(\color{purple}{w \cdot \phi(x)})}}$$` where `\(0.5\)` is the (soft) decision bundary - if `\(h(x) < 0.5\)` probably Adelie - if `\(h(x) > 0.5\)` probably Gentoo ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-16-1.png)<!-- --> ] <br/> **Interestingly**, that's equivalent to <center> `\(h_w(x) = \color{orchid}{\sigma} (w \cdot \phi(x))\)` with `\(\color{orchid}{\sigma}(z) = \frac{1}{1 + e^{-z}}\)` </center> --- ## Example <br/> ```r penguins_to_learn <- palmerpenguins::penguins %>% filter(species %in% c("Adelie", "Gentoo")) %>% mutate(species = forcats::fct_drop(species)) %>% filter(!is.na(body_mass_g) | !is.na(bill_depth_mm)) %>% mutate(across(bill_length_mm:body_mass_g, scale)) %>% mutate( species_01 = dplyr::recode(species, Adelie = 0, Gentoo = 1) ) ``` .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-18-1.png)<!-- --> ] --- ## stats::glm ```r sp_bm_model <- penguins_to_learn %$% glm(species_01 ~ body_mass_g, family = binomial()) sp_bm_model %>% summary() %>% print() ``` ``` ## ## Call: ## glm(formula = species_01 ~ body_mass_g, family = binomial()) ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.5839 0.2666 -2.190 0.0285 * ## body_mass_g 5.2072 0.7271 7.161 7.98e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 376.98 on 273 degrees of freedom ## Residual deviance: 102.07 on 272 degrees of freedom ## AIC: 106.07 ## ## Number of Fisher Scoring iterations: 7 ``` --- ## Logistic regression Assumptions - **Linearity** of the logit - predictors have linear relationship with log of outcome - When more than one predictor: **no multicollinearity** - if two or more predictor variables are used in the model, each pair of variables not correlated Pseudo-R2 - Approaches to calculating model quality (power) Adding complexity - Multiple logistic regression: multiple predictors - Multinomial logistic regression: several categories as outcome --- ## Summary .pull-left[ <br/> **Today**: Regression analysis - Simple regression - Multiple regression - Comparing models **Next time**: Control structures - Conditional statements - Conditional and deterministic loops - *Group work?* <br/> .referencenote[ Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com). ] ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-20-1.png)<!-- --> ]