Lecture 204

class: center, middle, inverse, title-slide

.title[
# Lecture 204
]
.author[
### Dr Stefano De Sabbata School of Geography, Geology, and the Env., University of Leicester <a href="https://github.com/sdesabbata/r-for-geographic-data-science" style="color: white">github.com/sdesabbata/r-for-geographic-data-science</a> <a href="mailto:s.desabbata@le.ac.uk" style="color: white">s.desabbata@le.ac.uk</a> | <a href="https://twitter.com/maps4thought" style="color: white">@maps4thought</a> text licensed under <a href="https://creativecommons.org/licenses/by-sa/4.0/" style="color: white">CC BY-SA 4.0</a>, code licensed under <a href="https://www.gnu.org/licenses/gpl-3.0.html" style="color: white">GNU GPL v3.0</a>
]

---

class: inverse, center, middle

# Simple Regression

---
## Recap

.pull-left[

**Prev**: Comparing data

- Comparing distributions through mean
- Correlation analysis
- Variable transformation

**Today**: Regression analysis

- Simple regression
- Least square via gradient descent
- Logistic regression

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-2-1.png)

]

---
## Regression analysis

**Regression analysis** is a supervised machine learning approach

Special case of the general linear model

$$ outcome_i = (model) + error_i $$

Predict (estimate) value of one outcome (dependent) variable as

- one predictor (independent) variable: **simple / univariate**

`$$Y_i=(b_0+b_1*X_{i1})+\epsilon_i$$`
    
- more predictor (independent) variables: **multiple / multivar.**

`$$Y_i=(b_0+b_1*X_{i1}+b_2*X_{i2}+\dots+b_M*X_{iM})+\epsilon_i$$`

---
## Example

Can we predict a penguin's body mass from flipper length?

`$$body\ mass_i=(b_0+b_1*flipper\ length_{i})+\epsilon_i$$`

.center[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-3-1.png)

]

---
## Example

Can we predict a penguin's body mass from flipper length?

`$$body\ mass_i=(b_0+b_1*flipper\ length_{i})+\epsilon_i$$`

.center[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-4-1.png)

]

---
## Least squares

.pull-left[

**Least squares** is the most commonly used approach to generate a regression model

The model fits a line
 
- to **minimise** the squared values of the **residuals** (errors)
- that is squared difference between
 - **observed values**
 - **model**

`$$residual_i=observed_i-model_i$$`

`$$deviation=\sum_i(observed_i-model_i)^2$$`

]
.pull-right[

.center[
![:scale 90%](data:image/png;base64,#images/489px-Linear_least_squares_example2.svg.png)

.referencenote[
by Krishnavedala 
via Wikimedia Commons, CC-BY-SA-3.0
]
]

]

---
## Least squares via gradient descent

.center[

![:scale 90%](data:image/png;base64,#images/linear-regression-gd_shiny-app.png)

[sdesabbata.shinyapps.io/linear-regression-gd](https://sdesabbata.shinyapps.io/linear-regression-gd/)

]

---
## Least squares: Assumptions

.pull-left[

- **Linearity**
    - the relationship is actually linear
- **Normality** of residuals
    - standard residuals are normally distributed with mean `0`
- **Homoscedasticity** of residuals
    - at each level of the predictor variable(s) the variance of the standard residuals should be the same (*homo-scedasticity*) rather than different (*hetero-scedasticity*) 
- **Independence** of residuals
    - adjacent standard residuals are not correlated

]
.pull-right[

.center[
![:scale 90%](data:image/png;base64,#images/489px-Linear_least_squares_example2.svg.png)

.referencenote[
by Krishnavedala 
via Wikimedia Commons, CC-BY-SA-3.0
]
]

]

---
## stats::lm

.pull-left[

.small[

```r
bm_fl_model <-
 penguins %>%
 filter(
 !is.na(body_mass_g) &
 !is.na(flipper_length_mm)
 ) %$%
 lm(body_mass_g ~ flipper_length_mm)

bm_fl_model %>%  
  summary()
```

```
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -1058.80 -259.27 -26.88 247.33 1288.69 
## 
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) -5780.831 305.815 -18.90 <2e-16 ***
## flipper_length_mm 49.686 1.518 32.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 394.3 on 340 degrees of freedom
## Multiple R-squared: 0.759,	Adjusted R-squared: 0.7583 
## F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
```

]

]
.pull-right[

The output indicates

- **p-value: < 2.2e-16**: `$p<.01$` the model is significant
 - derived by comparing **F-statistic** (1070.74) to F distribution having specified degrees of freedom (1, 340)
 - Report as: F(1, 340) = 1070.74
- **Adjusted R-squared: 0.7583**: 
 - flipper length can account for 75.83% variation in body mass
- **Coefficients**
 - Intercept estimate -5780.8314 is significant
 - `flipper_length_mm` (slope) estimate 49.6856 is significant

]

---
## Outliers and influential cases

```r
penguins_output <-
 penguins %>%
 filter(!is.na(body_mass_g) | !is.na(flipper_length_mm)) %>%
 mutate(
 model_stdres = bm_fl_model %>% rstandard(),
 model_cook_dist = bm_fl_model %>% cooks.distance()
 )

penguins_output %>%
  select(body_mass_g, model_stdres, model_cook_dist) %>%
  filter(abs(model_stdres) > 2.58 | model_cook_dist > 1)
```

```
## # A tibble: 4 × 3
## body_mass_g model_stdres model_cook_dist
## <int> <dbl> <dbl>
## 1 4650 3.28 0.0388
## 2 5850 2.66 0.0182
## 3 6300 2.80 0.0353
## 4 2700 -2.69 0.0149
```

- No influential cases (Cook's distance `> 1`) 
- There are a handful of outliers (4 abs std res `> 2.58`)

---
## Checking assumptions: normality

.pull-left[

Shapiro-Wilk test for normality of standard residuals,

- robust models: should be **not** significant

```r
penguins_output %$% 
  shapiro.test(
    model_stdres
  )
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  model_stdres
## W = 0.99295, p-value = 0.1085
```

]
.pull-right[

.center[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-9-1.png)
]

]

**Standard residuals are normally distributed!**

---
## Checking assumptions: homoscedasticity

.pull-left[

Breusch-Pagan test for homoscedasticity of standard residuals

- robust models: should be **not** significant

```r
bm_fl_model %>% 
  bptest()
```

```
## 
## 	studentized Breusch-Pagan test
## 
## data:  .
## BP = 2.1579, df = 1, p-value = 0.1418
```

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-11-1.png)

]

**Standard residuals are homoscedastic!**

---
## Checking assumptions: independence

Durbin-Watson test for the independence of residuals

- robust models: statistic should be close to 2 (advised between 1 and 3) and **not** significant

```r
bm_fl_model %>%
  dwtest()
```

```
## 
## 	Durbin-Watson test
## 
## data:  .
## DW = 2.1896, p-value = 0.9572
## alternative hypothesis: true autocorrelation is greater than 0
```

**Standard residuals are independent!**

Note: the result depends on the order of the data.

---
## Example

Yes, we can predict a penguin's body mass from flipper length!

`$$body\ mass_i=(-5780.83+49.69*flipper\ length_{i})+\epsilon_i$$`

.center[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-13-1.png)

]

---
## A different problem: classification

- *regression*: predicting a continuous output variable
- *classification*: predicting a categorical (nominal, including binary) output variable

For instance: can we automatically identify the two species based on the penguins' body mass?

.center[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-15-1.png)

]

---
## Logistic regression

.pull-left[

There are different approaches on how to define it... this is probably the most common

`$$h(x) = \frac{1}{1 + e^{-(\color{purple}{w \cdot \phi(x)})}}$$`

where `$0.5$` is the (soft) decision bundary

- if `$h(x) < 0.5$` probably Adelie
- if `$h(x) > 0.5$` probably Gentoo

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-16-1.png)

]

**Interestingly**, that's equivalent to

`$h_w(x) = \color{orchid}{\sigma} (w \cdot \phi(x))$` &nbsp;&nbsp;&nbsp; with &nbsp;&nbsp;&nbsp; `$\color{orchid}{\sigma}(z) = \frac{1}{1 + e^{-z}}$`

</center>

---
## Example

```r
penguins_to_learn <-
 palmerpenguins::penguins %>%
 filter(species %in% c("Adelie", "Gentoo")) %>%
 mutate(species = forcats::fct_drop(species)) %>%
 filter(!is.na(body_mass_g) | !is.na(bill_depth_mm)) %>%
 mutate(across(bill_length_mm:body_mass_g, scale)) %>%
 mutate(
 species_01 = dplyr::recode(species, Adelie = 0, Gentoo = 1)
 )
```

.center[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-18-1.png)

]

---
## stats::glm

```r
sp_bm_model <- penguins_to_learn %$%
 glm(species_01 ~ body_mass_g, family = binomial())

sp_bm_model %>%
  summary() %>%
  print()
```

```
## 
## Call:
## glm(formula = species_01 ~ body_mass_g, family = binomial())
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.5839     0.2666  -2.190   0.0285 *  
## body_mass_g   5.2072     0.7271   7.161 7.98e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 376.98  on 273  degrees of freedom
## Residual deviance: 102.07  on 272  degrees of freedom
## AIC: 106.07
## 
## Number of Fisher Scoring iterations: 7
```

---
## Logistic regression

Assumptions

- **Linearity** of the logit
    - predictors have linear relationship with log of outcome
- When more than one predictor: **no multicollinearity**
    - if two or more predictor variables are used in the model, each pair of variables not correlated

Pseudo-R2

- Approaches to calculating model quality (power)

Adding complexity

- Multiple logistic regression: multiple predictors
- Multinomial logistic regression: several categories as outcome

---
## Summary

.pull-left[

**Today**: Regression analysis

- Simple regression
- Multiple regression
- Comparing models

**Next time**: Control structures

- Conditional statements
- Conditional and deterministic loops
- *Group work?*

.referencenote[
Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com).
]

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/204-slides-regression_files/figure-html/unnamed-chunk-20-1.png)

]