Lecture 205

class: center, middle, inverse, title-slide

.title[
# Lecture 205
]
.author[
### Dr Stefano De Sabbata School of Geography, Geology, and the Env., University of Leicester <a href="https://github.com/sdesabbata/r-for-geographic-data-science" style="color: white">github.com/sdesabbata/r-for-geographic-data-science</a> <a href="mailto:s.desabbata@le.ac.uk" style="color: white">s.desabbata@le.ac.uk</a> | <a href="https://twitter.com/maps4thought" style="color: white">@maps4thought</a> text licensed under <a href="https://creativecommons.org/licenses/by-sa/4.0/" style="color: white">CC BY-SA 4.0</a>, code licensed under <a href="https://www.gnu.org/licenses/gpl-3.0.html" style="color: white">GNU GPL v3.0</a>
]

---

class: inverse, center, middle

# Multiple regression

---
## Recap

.pull-left[

**Prev**: Comparing data

- Comparing distributions through mean
- Correlation analysis
- Variable transformation

**Today**: Regression analysis

- Simple regression
- Multiple regression
- Comparing models

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-2-1.png)

]

---
## Multiple regression

**Regression analysis** is a supervised machine learning approach

Special case of the general linear model

`$$outcome_i=(model)+error_i$$`

Predict (estimate) value of one outcome (dependent) variable as

- one predictor (independent) variable: **simple / univariate**

`$$Y_i=(b_0+b_1*X_{i1})+\epsilon_i$$`
    
- more predictor (independent) variables: **multiple / multivar.**

`$$Y_i=(b_0+b_1*X_{i1}+b_2*X_{i2}+\dots+b_M*X_{iM})+\epsilon_i$$`

---
## Assumptions

- **Linearity**
    - the relationship is actually linear
- **Normality** of residuals
    - standard residuals are normally distributed with mean `0`
- **Homoscedasticity** of residuals
    - at each level of the predictor variable(s) the variance of the standard residuals should be the same (*homo-scedasticity*) rather than different (*hetero-scedasticity*) 
- **Independence** of residuals
    - adjacent standard residuals are not correlated

- When more than one predictor: **no multicollinearity**
    - if two or more predictor variables are used in the model, each pair of variables not correlated

---
## boston_data housing

.pull-left[

A classic R dataset

- price of houses in boston_data
- in relation to: 
  - house characteristics
  - neighborhood 
  - air quality

```r
boston_data <-
 MASS::Boston
```

.referencenote[
Harrison, D., and D. L. Rubinfeld. 1978. [Hedonic Housing Prices and the Demand for Clean Air](https://doi.org/10.1016/0095-0696(78)90006-2). Journal of Environmental Economics and Management 5 (1): 81–102.
]

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-4-1.png)

]

---
## Example: boston_data housing

Can we predict price based on number of rooms and air quality?

`$$house\ value_i=(b_0+b_1*rooms_{i}+b_2*NO\ conc_{i})+\epsilon_i$$`

.center[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-5-1.png)

]

---
## stats::lm

.pull-left[

.small[

```r
medv_model <-
 boston_data %$% 
 lm(medv ~ rm + nox)

medv_model %>%  
  summary()
```

```
## 
## Call:
## lm(formula = medv ~ rm + nox)
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -17.889 -3.287 -0.636 2.518 39.638 
## 
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) -18.2059 3.3393 -5.452 7.82e-08 ***
## rm 8.1567 0.4173 19.546 < 2e-16 ***
## nox -18.9706 2.5304 -7.497 2.97e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.281 on 503 degrees of freedom
## Multiple R-squared: 0.5354,	Adjusted R-squared: 0.5336 
## F-statistic: 289.9 on 2 and 503 DF, p-value: < 2.2e-16
```

]

]
.pull-right[

The output indicates

- **p-value: < 2.2e-16**: `$p<.01$` the model is significant
 - derived by comparing **F-statistic** to F distribution 289.87 having specified degrees of freedom (2, 503)
 - Report as: F(2, 503) = 289.87
- **Adjusted R-squared: 0.5336**: 
 - number of rooms and air quality can account for 53.36% variation in house prices
- **Coefficients**
 - Intercept est. -18.2 is significant
 - `rm` (slope) est. 8.2 is significant
 - `nox` (slope) est. -19 is significant

]

---
## Coefficients

.pull-left[

### Confindence intervals

- Coefficients' 95% confidence intervals
  - can be interpreted as interval containing true coefficient values
  - good models should result in small intervals

```r
medv_model %>%
  confint()
```

```
##                 2.5 %     97.5 %
## (Intercept) -24.76666 -11.645106
## rm            7.33676   8.976551
## nox         -23.94200 -13.999233
```

]
.pull-right[

### Standardised coefficients

- Indicate amount of change
  - in the outcome variable
  - per one standard deviation change in the predictor variable
- Can also be interpreted as importance of predictor

```r
library(lm.beta)

medv_model %>%
  lm.beta()
```

```
## 
## Call:
## lm(formula = medv ~ rm + nox)
## 
## Standardized Coefficients::
## (Intercept)          rm         nox 
##          NA   0.6231316  -0.2390178
```

]

---
## Outliers and influential cases

```r
boston_output <-
 boston_data %>%
 mutate(
 model_stdres = medv_model %>% rstandard(),
 model_cook_dist = medv_model %>% cooks.distance()
 )

boston_output %>%
  select(medv, model_stdres, model_cook_dist) %>%
  filter(abs(model_stdres) > 2.58 | model_cook_dist > 1)
```

```
##    medv model_stdres model_cook_dist
## 1  50.0     2.975511      0.02911770
## 2  21.9    -2.907328      0.11856922
## 3  27.5     4.899644      0.26329818
## 4  23.1     3.511137      0.10891047
## 5  50.0     6.339016      0.12065579
## 6  50.0     4.094574      0.02308311
## 7  50.0     3.665060      0.02786674
## 8  50.0     4.699311      0.02109333
## 9  50.0     5.257825      0.03746931
## 10  7.5    -2.672594      0.01576071
```

- No influential cases (Cook's distance `> 1`)
- There are many outliers (7 abs std res `> 3.29`, 2% `> 2.58`)

---
## Checking assumptions: normality

.pull-left[

Shapiro-Wilk test for normality of standard residuals,

- robust models: should be **not** significant

```r
boston_output %$% 
  shapiro.test(
    model_stdres
  )
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data: model_stdres
## W = 0.8979, p-value < 2.2e-16
```

]
.pull-right[

.center[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-12-1.png)

]

**Standard residuals are NOT normally distributed**

---
## Checking assumptions: homoscedasticity

.pull-left[

Breusch-Pagan test for homoscedasticity of standard residuals

- robust models: should be **not** significant

```r
medv_model %>% 
  bptest()
```

```
## 
## 	studentized Breusch-Pagan test
## 
## data:  .
## BP = 22.342, df = 2, p-value = 1.407e-05
```

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-14-1.png)

]

**Standard residuals are NOT homoscedastic**

---
## Checking assumptions: independence

Durbin-Watson test for the independence of residuals

- robust models: statistic should be close to 2 (advised between 1 and 3) and **not** significant

```r
medv_model %>%
  dwtest()
```

```
## 
## 	Durbin-Watson test
## 
## data: .
## DW = 0.68451, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
```

**Standard residuals are NOT independent**

Note: the result depends on the order of the data.

---
## Checking assumptions: multicollinearity

Checking the variance inflation factor (VIF)

- robust models should have no multicollinearity: 
  - largest VIF should be lower than 10 
  - or the average VIF should not be greater than 1

```r
library(car)

medv_model %>%
  vif()
```

```
##       rm      nox 
## 1.100495 1.100495
```

**There is no multicollinearity**

---
## Conclusions: boston_data housing

.pull-left[

No, we can't predict house prices based only on number of rooms and air quality.

- predictors are statistically significant
- but model is not robust, as it doesn't satisfy most assumptions
  - Standard residuals are NOT normally distributed
  - Standard residuals are NOT homoscedastic
  - Standard residuals are NOT independent
  - (although there is no multicollinearity)

We seem to be on the right path, but something is missing...

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-17-1.png)

]

---
## Model: boston_data housing (2)

.pull-left[

Can we predict house prices based on number of rooms, air quality, **student/teacher ratio** and **crime level**?

.small[

```r
medv_model2 <-
 boston_data %>% 
 filter(medv < 50) %$% 
 lm(
 medv ~ rm + nox + ptratio + log(crim)
 )

medv_model2 %>%  
  summary()
```

```
## 
## Call:
## lm(formula = medv ~ rm + nox + ptratio + log(crim))
## 
## Residuals:
## Min 1Q Median 3Q Max 
## -12.1957 -2.7435 -0.1094 2.2879 26.9646 
## 
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) 10.4676 4.0846 2.563 0.010687 * 
## rm 5.9404 0.3421 17.366 < 2e-16 ***
## nox -13.0203 2.9408 -4.427 1.18e-05 ***
## ptratio -1.0344 0.1107 -9.345 < 2e-16 ***
## log(crim) -0.5558 0.1676 -3.316 0.000983 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.526 on 485 degrees of freedom
## Multiple R-squared: 0.6715,	Adjusted R-squared: 0.6688 
## F-statistic: 247.9 on 4 and 485 DF, p-value: < 2.2e-16
```

]

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-19-1.png)

]

---
## Logarithmic transformations

- 10% change in criminality score leads to
  - `$log(110/100) * b_{crim}= 0.0953 * b_{crim}$` change
  - `0.0953 * -0.5558 = 0.0529`

.center[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-20-1.png)

]

---
## Checking assumptions

.pull-left[

```r
medv_model2 %>% rstandard() %>% 
  shapiro.test()
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  .
## W = 0.95777, p-value = 1.231e-10
```

```r
medv_model2 %>% bptest()
```

```
## 
## 	studentized Breusch-Pagan test
## 
## data:  .
## BP = 14.78, df = 4, p-value = 0.005179
```

]
.pull-right[

```r
medv_model2 %>% dwtest()
```

```
## 
## 	Durbin-Watson test
## 
## data: .
## DW = 0.99915, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
```

```r
medv_model2 %>% vif()
```

```
##        rm       nox   ptratio log(crim) 
##  1.191368  2.810505  1.302618  3.125288
```

]

- Standard residuals are NOT normally distributed
- Standard residuals are NOT homoscedastic (borderline)
- Standard residuals are NOT independent
- There is some sign of multicollinearity

---
## Conclusions: boston_data housing (2)

.pull-left[

No, we still can't robustly predict house prices based on number of rooms, air quality, student/teacher ratio and crime level.

- predictors are statistically significant
- but model is not robust
  - Standard residuals are NOT normally distributed
  - Standard residuals are NOT homoscedastic (borderline)
  - Standard residuals are NOT independent
  - There is some sign of multicollinearity

Still possibly on the right path, not quite there yet...

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-23-1.png)

]

---
## Comparing models

Is there a difference between Model1's `$R^2 =$` 0.5336 and Model2's `$R^2 =$` 0.6688?

.pull-left[

- `$R^2$`
  - measure of correlation between
    - values predicted by the model (fitted values)
    - observed values for outcome variable
- Adjusted `$R^2$`, depending on
    - number of cases
    - number of predictor (independent) variables
  - *"unnecessary"* variables lower the value

The model with the highest adjusted `$R^2$` has the best fit

]
.pull-right[

Can be used to test whether adjusted `$R^2$` are signif. different

- if models are hierarchical
  - one uses all variables of the other
  - plus some additional variables

.small[

```r
medv_model1 <-
 boston_data %>% 
 filter(medv < 50) %$% 
 stats::lm(medv ~ rm + nox)

anova(medv_model1, medv_model2)
```

```
## Analysis of Variance Table
## 
## Model 1: medv ~ rm + nox
## Model 2: medv ~ rm + nox + ptratio + log(crim)
## Res.Df RSS Df Sum of Sq F Pr(>F) 
## 1 487 12890.0 
## 2 485 9936.7 2 2953.3 72.073 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

]

Still, neither model is robust!

---
## Information criteria

.pull-left[

- Akaike Information Criterion (**AIC**)
  - measure of model fit 
    - penalising model with more variables
  - not interpretable per-se, used to compare similar models
    - lower value, better fit

]
.pull-right[

```r
AIC(medv_model1)
```

```
## [1] 3000.763
```

```r
AIC(medv_model2)
```

```
## [1] 2877.258
```

]

- Bayesian Information Criterion (**BIC**)
  - similar to AIC

---
## Summary

.pull-left[

**Today**: Regression analysis

- Simple regression
- Multiple regression
- Comparing models

**Next time**: Control structures

- Conditional statements
- Conditional and deterministic loops
- *Group work?*

.referencenote[
Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com).
]

]
.pull-right[

![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-26-1.png)

]