class: center, middle, inverse, title-slide .title[ # Lecture 205 ] .author[ ### Dr Stefano De Sabbata
School of Geography, Geology, and the Env., University of Leicester
github.com/sdesabbata/r-for-geographic-data-science
s.desabbata@le.ac.uk
|
@maps4thought
text licensed under
CC BY-SA 4.0
, code licensed under
GNU GPL v3.0
] --- class: inverse, center, middle # Multiple regression --- ## Recap .pull-left[ <br/> **Prev**: Comparing data - Comparing distributions through mean - Correlation analysis - Variable transformation **Today**: Regression analysis - Simple regression - Multiple regression - Comparing models <br/> ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-2-1.png)<!-- --> ] --- ## Multiple regression <br/> **Regression analysis** is a supervised machine learning approach Special case of the general linear model `$$outcome_i=(model)+error_i$$` <br/> Predict (estimate) value of one outcome (dependent) variable as - one predictor (independent) variable: **simple / univariate** `$$Y_i=(b_0+b_1*X_{i1})+\epsilon_i$$` - more predictor (independent) variables: **multiple / multivar.** `$$Y_i=(b_0+b_1*X_{i1}+b_2*X_{i2}+\dots+b_M*X_{iM})+\epsilon_i$$` --- ## Assumptions <br/> - **Linearity** - the relationship is actually linear - **Normality** of residuals - standard residuals are normally distributed with mean `0` - **Homoscedasticity** of residuals - at each level of the predictor variable(s) the variance of the standard residuals should be the same (*homo-scedasticity*) rather than different (*hetero-scedasticity*) - **Independence** of residuals - adjacent standard residuals are not correlated <br/> - When more than one predictor: **no multicollinearity** - if two or more predictor variables are used in the model, each pair of variables not correlated --- ## boston_data housing .pull-left[ A classic R dataset - price of houses in boston_data - in relation to: - house characteristics - neighborhood - air quality ```r boston_data <- MASS::Boston ``` <br/> .referencenote[ Harrison, D., and D. L. Rubinfeld. 1978. [Hedonic Housing Prices and the Demand for Clean Air](https://doi.org/10.1016/0095-0696(78)90006-2). Journal of Environmental Economics and Management 5 (1): 81–102. ] ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-4-1.png)<!-- --> ] --- ## Example: boston_data housing Can we predict price based on number of rooms and air quality? `$$house\ value_i=(b_0+b_1*rooms_{i}+b_2*NO\ conc_{i})+\epsilon_i$$` .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-5-1.png)<!-- --> ] --- ## stats::lm .pull-left[ <br/> .small[ ```r medv_model <- boston_data %$% lm(medv ~ rm + nox) medv_model %>% summary() ``` ``` ## ## Call: ## lm(formula = medv ~ rm + nox) ## ## Residuals: ## Min 1Q Median 3Q Max ## -17.889 -3.287 -0.636 2.518 39.638 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -18.2059 3.3393 -5.452 7.82e-08 *** ## rm 8.1567 0.4173 19.546 < 2e-16 *** ## nox -18.9706 2.5304 -7.497 2.97e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 6.281 on 503 degrees of freedom ## Multiple R-squared: 0.5354, Adjusted R-squared: 0.5336 ## F-statistic: 289.9 on 2 and 503 DF, p-value: < 2.2e-16 ``` ] ] .pull-right[ The output indicates - **p-value: < 2.2e-16**: `\(p<.01\)` the model is significant - derived by comparing **F-statistic** to F distribution 289.87 having specified degrees of freedom (2, 503) - Report as: F(2, 503) = 289.87 - **Adjusted R-squared: 0.5336**: - number of rooms and air quality can account for 53.36% variation in house prices - **Coefficients** - Intercept est. -18.2 is significant - `rm` (slope) est. 8.2 is significant - `nox` (slope) est. -19 is significant ] --- ## Coefficients .pull-left[ ### Confindence intervals - Coefficients' 95% confidence intervals - can be interpreted as interval containing true coefficient values - good models should result in small intervals ```r medv_model %>% confint() ``` ``` ## 2.5 % 97.5 % ## (Intercept) -24.76666 -11.645106 ## rm 7.33676 8.976551 ## nox -23.94200 -13.999233 ``` ] .pull-right[ ### Standardised coefficients - Indicate amount of change - in the outcome variable - per one standard deviation change in the predictor variable - Can also be interpreted as importance of predictor ```r library(lm.beta) medv_model %>% lm.beta() ``` ``` ## ## Call: ## lm(formula = medv ~ rm + nox) ## ## Standardized Coefficients:: ## (Intercept) rm nox ## NA 0.6231316 -0.2390178 ``` ] --- ## Outliers and influential cases ```r boston_output <- boston_data %>% mutate( model_stdres = medv_model %>% rstandard(), model_cook_dist = medv_model %>% cooks.distance() ) boston_output %>% select(medv, model_stdres, model_cook_dist) %>% filter(abs(model_stdres) > 2.58 | model_cook_dist > 1) ``` ``` ## medv model_stdres model_cook_dist ## 1 50.0 2.975511 0.02911770 ## 2 21.9 -2.907328 0.11856922 ## 3 27.5 4.899644 0.26329818 ## 4 23.1 3.511137 0.10891047 ## 5 50.0 6.339016 0.12065579 ## 6 50.0 4.094574 0.02308311 ## 7 50.0 3.665060 0.02786674 ## 8 50.0 4.699311 0.02109333 ## 9 50.0 5.257825 0.03746931 ## 10 7.5 -2.672594 0.01576071 ``` - No influential cases (Cook's distance `> 1`) - There are many outliers (7 abs std res `> 3.29`, 2% `> 2.58`) --- ## Checking assumptions: normality .pull-left[ Shapiro-Wilk test for normality of standard residuals, - robust models: should be **not** significant ```r boston_output %$% shapiro.test( model_stdres ) ``` ``` ## ## Shapiro-Wilk normality test ## ## data: model_stdres ## W = 0.8979, p-value < 2.2e-16 ``` {{content}} ] .pull-right[ .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-12-1.png)<!-- --> ] ] -- **Standard residuals are NOT normally distributed** --- ## Checking assumptions: homoscedasticity .pull-left[ Breusch-Pagan test for homoscedasticity of standard residuals - robust models: should be **not** significant ```r medv_model %>% bptest() ``` ``` ## ## studentized Breusch-Pagan test ## ## data: . ## BP = 22.342, df = 2, p-value = 1.407e-05 ``` {{content}} ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-14-1.png)<!-- --> ] -- **Standard residuals are NOT homoscedastic** --- ## Checking assumptions: independence <br/> Durbin-Watson test for the independence of residuals - robust models: statistic should be close to 2 (advised between 1 and 3) and **not** significant ```r medv_model %>% dwtest() ``` ``` ## ## Durbin-Watson test ## ## data: . ## DW = 0.68451, p-value < 2.2e-16 ## alternative hypothesis: true autocorrelation is greater than 0 ``` -- **Standard residuals are NOT independent** Note: the result depends on the order of the data. --- ## Checking assumptions: multicollinearity <br/> Checking the variance inflation factor (VIF) - robust models should have no multicollinearity: - largest VIF should be lower than 10 - or the average VIF should not be greater than 1 ```r library(car) medv_model %>% vif() ``` ``` ## rm nox ## 1.100495 1.100495 ``` -- **There is no multicollinearity** --- ## Conclusions: boston_data housing .pull-left[ No, we can't predict house prices based only on number of rooms and air quality. - predictors are statistically significant - but model is not robust, as it doesn't satisfy most assumptions - Standard residuals are NOT normally distributed - Standard residuals are NOT homoscedastic - Standard residuals are NOT independent - (although there is no multicollinearity) We seem to be on the right path, but something is missing... ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-17-1.png)<!-- --> ] --- ## Model: boston_data housing (2) .pull-left[ Can we predict house prices based on number of rooms, air quality, **student/teacher ratio** and **crime level**? .small[ ```r medv_model2 <- boston_data %>% filter(medv < 50) %$% lm( medv ~ rm + nox + ptratio + log(crim) ) medv_model2 %>% summary() ``` ``` ## ## Call: ## lm(formula = medv ~ rm + nox + ptratio + log(crim)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -12.1957 -2.7435 -0.1094 2.2879 26.9646 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.4676 4.0846 2.563 0.010687 * ## rm 5.9404 0.3421 17.366 < 2e-16 *** ## nox -13.0203 2.9408 -4.427 1.18e-05 *** ## ptratio -1.0344 0.1107 -9.345 < 2e-16 *** ## log(crim) -0.5558 0.1676 -3.316 0.000983 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 4.526 on 485 degrees of freedom ## Multiple R-squared: 0.6715, Adjusted R-squared: 0.6688 ## F-statistic: 247.9 on 4 and 485 DF, p-value: < 2.2e-16 ``` ] ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-19-1.png)<!-- --> ] --- ## Logarithmic transformations - 10% change in criminality score leads to - `\(log(110/100) * b_{crim}= 0.0953 * b_{crim}\)` change - `0.0953 * -0.5558 = 0.0529` .center[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-20-1.png)<!-- --> ] --- ## Checking assumptions .pull-left[ ```r medv_model2 %>% rstandard() %>% shapiro.test() ``` ``` ## ## Shapiro-Wilk normality test ## ## data: . ## W = 0.95777, p-value = 1.231e-10 ``` ```r medv_model2 %>% bptest() ``` ``` ## ## studentized Breusch-Pagan test ## ## data: . ## BP = 14.78, df = 4, p-value = 0.005179 ``` ] .pull-right[ ```r medv_model2 %>% dwtest() ``` ``` ## ## Durbin-Watson test ## ## data: . ## DW = 0.99915, p-value < 2.2e-16 ## alternative hypothesis: true autocorrelation is greater than 0 ``` ```r medv_model2 %>% vif() ``` ``` ## rm nox ptratio log(crim) ## 1.191368 2.810505 1.302618 3.125288 ``` ] -- - Standard residuals are NOT normally distributed - Standard residuals are NOT homoscedastic (borderline) - Standard residuals are NOT independent - There is some sign of multicollinearity --- ## Conclusions: boston_data housing (2) .pull-left[ No, we still can't robustly predict house prices based on number of rooms, air quality, student/teacher ratio and crime level. - predictors are statistically significant - but model is not robust - Standard residuals are NOT normally distributed - Standard residuals are NOT homoscedastic (borderline) - Standard residuals are NOT independent - There is some sign of multicollinearity Still possibly on the right path, not quite there yet... ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-23-1.png)<!-- --> ] --- ## Comparing models Is there a difference between Model1's `\(R^2 =\)` 0.5336 and Model2's `\(R^2 =\)` 0.6688? .pull-left[ - `\(R^2\)` - measure of correlation between - values predicted by the model (fitted values) - observed values for outcome variable - Adjusted `\(R^2\)`, depending on - number of cases - number of predictor (independent) variables - *"unnecessary"* variables lower the value The model with the highest adjusted `\(R^2\)` has the best fit ] .pull-right[ Can be used to test whether adjusted `\(R^2\)` are signif. different - if models are hierarchical - one uses all variables of the other - plus some additional variables .small[ ```r medv_model1 <- boston_data %>% filter(medv < 50) %$% stats::lm(medv ~ rm + nox) anova(medv_model1, medv_model2) ``` ``` ## Analysis of Variance Table ## ## Model 1: medv ~ rm + nox ## Model 2: medv ~ rm + nox + ptratio + log(crim) ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 487 12890.0 ## 2 485 9936.7 2 2953.3 72.073 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ] ] Still, neither model is robust! --- ## Information criteria <br/> .pull-left[ - Akaike Information Criterion (**AIC**) - measure of model fit - penalising model with more variables - not interpretable per-se, used to compare similar models - lower value, better fit ] .pull-right[ ```r AIC(medv_model1) ``` ``` ## [1] 3000.763 ``` ```r AIC(medv_model2) ``` ``` ## [1] 2877.258 ``` ] <br/> - Bayesian Information Criterion (**BIC**) - similar to AIC --- ## Summary .pull-left[ <br/> **Today**: Regression analysis - Simple regression - Multiple regression - Comparing models **Next time**: Control structures - Conditional statements - Conditional and deterministic loops - *Group work?* <br/> .referencenote[ Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr), and [R Markdown](https://rmarkdown.rstudio.com). ] ] .pull-right[ ![](data:image/png;base64,#/home/rstudio/r-for-geographic-data-science/docs/slides/205-slides-regression-multiple_files/figure-html/unnamed-chunk-26-1.png)<!-- --> ]