2020-01-15
Today’s libraries
nycflights13
%$% (the exposition pipe) from the magrittr library
library(tidyverse)
library(magrittr)
library(nycflights13)
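As a quick reminder, %>% passes the whole left-hand object as the first argument of the right-hand call, whereas %$% exposes the columns of the left-hand data frame by name; a minimal sketch using the iris data:

# %$% makes the columns available directly by name
iris %$%
  mean(Petal.Length) # equivalent to mean(iris$Petal.Length)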
But let’s start with a simple example from the datasets package (iris)
iris %>% filter(Species == "setosa") %>% pull(Petal.Length) %>% shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.95498, p-value = 0.05481
iris %>% filter(Species == "versicolor") %>% pull(Petal.Length) %>% shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.966, p-value = 0.1585
iris %>% filter(Species == "virginica") %>% pull(Petal.Length) %>% shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.96219, p-value = 0.1098
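The three tests above can also be written as a single grouped pipeline; a minimal sketch that keeps only the p-value of each test:

iris %>%
  group_by(Species) %>%
  summarise(
    # shapiro.test returns an htest object; extract the p-value
    shapiro_p = shapiro.test(Petal.Length)$p.value
  )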
The independent t-test tests whether two group means are different
\[outcome_i = (group\ mean) + error_i \]
Assumptions: the values are normally distributed, the groups have homogeneous variance (check using leveneTest, especially when group sizes differ), and the observations are independent (different flowers)
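leveneTest comes from the car package, which is not among the libraries loaded above; a minimal sketch of the homogeneity-of-variance check for the two species compared below:

library(car)

iris %>%
  filter(Species %in% c("versicolor", "virginica")) %>%
  droplevels() %>% # drop the unused setosa level before testing
  leveneTest(Petal.Length ~ Species, data = .)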
iris %>%
  filter(Species %in% c("versicolor", "virginica")) %$% # Note %$%
  t.test(Petal.Length ~ Species)
## 
##  Welch Two Sample t-test
## 
## data:  Petal.Length by Species
## t = -12.604, df = 95.57, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.49549 -1.08851
## sample estimates:
## mean in group versicolor  mean in group virginica 
##                     4.260                     5.552
The difference is significant t(95.57) = -12.6, p < .01
ANOVA (analysis of variance) tests whether more than two group means are different
\[outcome_i = (group\ mean) + error_i \]
Assumptions: the values are normally distributed, the groups have homogeneous variance (check using leveneTest, as sketched above), and the observations are independent (different flowers)
iris %$% aov(Petal.Length ~ Species) %>% summary()
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  437.1  218.55    1180 <2e-16 ***
## Residuals   147   27.2    0.19                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference is significant: F(2, 147) = 1180.16, p < .01
Two variables can be related in three different ways: positively related, negatively related, or not related at all
Correlation is a standardised measure of covariance
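As a quick numerical check of that statement, dividing the covariance of two variables by the product of their standard deviations reproduces Pearson's r; a minimal sketch using the iris data already loaded:

# covariance rescaled by the two standard deviations...
iris %$%
  (cov(Petal.Length, Petal.Width) / (sd(Petal.Length) * sd(Petal.Width)))

# ...gives the same value as the correlation coefficient
iris %$%
  cor(Petal.Length, Petal.Width)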
flights_nov_20 <- nycflights13::flights %>%
  filter(
    !is.na(dep_delay), !is.na(arr_delay),
    month == 11, day == 20
  )
flights_nov_20 %>% pull(dep_delay) %>% shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.39881, p-value < 2.2e-16
flights_nov_20 %>% pull(arr_delay) %>% shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.67201, p-value < 2.2e-16
If two variables are normally distributed, use Pearson’s r
The square of the correlation value indicates the percentage of shared variance
Pearson’s r would be appropriate if the two variables were normally distributed but, as the Shapiro-Wilk tests above show, they are not
# note the use of %$%
# instead of %>%
flights_nov_20 %$%
  cor.test(dep_delay, arr_delay)
## 
##  Pearson's product-moment correlation
## 
## data:  dep_delay and arr_delay
## t = 58.282, df = 972, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8669702 0.8950078
## sample estimates:
##       cor 
## 0.8817655
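Following the shared-variance rule above, squaring the estimated correlation gives the proportion of variance the two delays would share under Pearson’s r; a small sketch (the value, about 0.78, also matches the Multiple R-squared of the regression model fitted later in this section):

# square of Pearson's r: proportion of shared variance (~0.78)
flights_nov_20 %$%
  cor(dep_delay, arr_delay)^2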
If two variables are not normally distributed, use Spearman’s rho
The square of the correlation value indicates the percentage of shared variance
Spearman’s rho is appropriate if there are few ties but, as the warning below shows, there are many ties between dep_delay and arr_delay
flights_nov_20 %$%
  cor.test(dep_delay, arr_delay, method = "spearman")
## Warning in cor.test.default(dep_delay, arr_delay, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  dep_delay and arr_delay
## S = 71437522, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5361247
If not normally distributed and there is a large number of ties, use Kendall’s tau
The square of the correlation value indicates the percentage of shared variance
Departure and arrival delay actually seem to share far less variance than Pearson’s r suggested: with tau at about 0.396 (below), roughly 16% (0.396² ≈ 0.16)
flights_nov_20 %$%
  cor.test(dep_delay, arr_delay, method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  dep_delay and arr_delay
## z = 17.859, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.3956265
pairs.panels (from the psych package) combines in one visualisation: histograms, scatter plots, and correlation values for a set of variables
library(psych)

flights_nov_20 %>%
  select(dep_delay, arr_delay, air_time) %>%
  pairs.panels(method = "kendall")
Regression analysis is a supervised machine learning approach
It predicts the value of one outcome variable as
\[outcome_i = (model) + error_i \]
Simple regression (one predictor):

\[Y_i = (b_0 + b_1 * X_{i1}) + \epsilon_i \]

Multiple regression (M predictors):

\[Y_i = (b_0 + b_1 * X_{i1} + b_2 * X_{i2} + \dots + b_M * X_{iM}) + \epsilon_i \]
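The section below only fits a simple regression, but as an illustration of the multiple-regression form, a hypothetical second predictor (air_time, one of the columns selected for pairs.panels above) could be added; a minimal sketch using the flights_nov_20 data prepared earlier:

# hypothetical multiple regression with two predictors
lm(arr_delay ~ dep_delay + air_time, data = flights_nov_20) %>%
  summary()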
Least squares is the most commonly used approach to generate a regression model
The model fits the line that minimises the total squared deviation between the observed values and the values predicted by the model
\[deviation = \sum(observed - model)^2\]
\[arr\_delay_i = (b_0 + b_1 * dep\_delay_{i1}) + \epsilon_i \]
delay_model <- flights_nov_20 %$% # Note %$%
  lm(arr_delay ~ dep_delay)

delay_model %>%
  summary()
## 
## Call:
## lm(formula = arr_delay ~ dep_delay)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.906  -9.022  -1.758   8.678  57.052 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.96717    0.43748  -11.35   <2e-16 ***
## dep_delay    1.04229    0.01788   58.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.62 on 972 degrees of freedom
## Multiple R-squared:  0.7775, Adjusted R-squared:  0.7773 
## F-statistic:  3397 on 1 and 972 DF,  p-value: < 2.2e-16
The output indicates that the dep_delay coefficient (the slope) estimate of 1.0423 is significant, so the estimated model is

\[arr\_delay_i = (Intercept + Coefficient_{dep\_delay} * dep\_delay_{i1}) + \epsilon_i \]
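The residual standard error reported in the summary can be recovered directly from the squared-deviation formula above; a quick check, assuming delay_model as fitted here:

# sum of squared deviations, scaled by the residual degrees of freedom
delay_model %>%
  residuals() %>%
  {sqrt(sum(.^2) / df.residual(delay_model))} # ~13.62, as in the summary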
flights_nov_20 %>%
  ggplot(aes(x = dep_delay, y = arr_delay)) +
  geom_point() +
  coord_fixed(ratio = 1) +
  geom_abline(intercept = -4.96717, slope = 1.04229, color = "red")
Shapiro-Wilk test for normality of the standard residuals:
delay_model %>% rstandard() %>% shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.98231, p-value = 1.73e-09
Standard residuals are NOT normally distributed
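A quick visual check to accompany the test: a base-R QQ plot of the standardised residuals against the theoretical normal quantiles.

delay_model %>%
  rstandard() %>%
  {
    qqnorm(.) # sample vs theoretical quantiles
    qqline(.) # reference line for a perfectly normal sample
  }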
Breusch-Pagan test for homoscedasticity of standard residuals
library(lmtest)

delay_model %>%
  bptest()
## 
##  studentized Breusch-Pagan test
## 
## data:  .
## BP = 0.017316, df = 1, p-value = 0.8953
Standard residuals are homoscedastic
Durbin-Watson test for the independence of residuals
# also part of the lmtest library
delay_model %>%
  dwtest()
## 
##  Durbin-Watson test
## 
## data:  .
## DW = 1.8731, p-value = 0.02358
## alternative hypothesis: true autocorrelation is greater than 0
Standard residuals might not be completely independent
Note: the result depends on the order of the data.
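A small sketch to illustrate that point: re-fitting the same model on a randomly shuffled copy of the rows generally changes the Durbin-Watson statistic (set.seed and sample_frac are used here purely for the illustration).

# shuffle the rows, refit the same model, re-run the test
set.seed(1)
flights_nov_20 %>%
  sample_frac(1) %$% # same rows, random order
  lm(arr_delay ~ dep_delay) %>%
  dwtest()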
In the practical session, we will see: