37 Lecture 502
Correlation

37.1 Correlation

Two variables can be related in three different ways

  • related
    • positively: entities with high values in one tend to have high values in the other
    • negatively: entities with high values in one tend to have low values in the other
  • not related at all

Correlation is a standardised measure of covariance
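
The statement above can be checked numerically: dividing the covariance by the product of the two standard deviations gives Pearson's correlation, r = cov(x, y) / (s_x * s_y). A minimal sketch in R, assuming synthetic data (the variables below are made up for illustration and are not the lecture's dataset):

```r
# Synthetic data: hypothetical departure and arrival delays in minutes
set.seed(502)
dep_delay <- rnorm(100, mean = 10, sd = 20)
arr_delay <- dep_delay + rnorm(100, mean = 0, sd = 10)

# Covariance depends on the units of both variables
covariance <- cov(dep_delay, arr_delay)

# Dividing by both standard deviations removes the units
r_manual <- covariance / (sd(dep_delay) * sd(arr_delay))

# Built-in Pearson correlation, for comparison
r_builtin <- cor(dep_delay, arr_delay)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree, and the result always lies between -1 and 1.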

37.3 Example

## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.39881, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.67201, p-value < 2.2e-16
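
In the output above both p-values are below 2.2e-16, so normality is rejected for both variables. Output of this kind can be produced with `shapiro.test()`. A minimal sketch, assuming synthetic right-skewed data in place of the lecture's delay variables (the data and the 0.01 threshold are illustrative assumptions):

```r
# Synthetic, right-skewed delays (hypothetical stand-ins for the lecture data)
set.seed(502)
dep_delay <- rexp(500, rate = 1 / 15)
arr_delay <- dep_delay + rnorm(500, sd = 5)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
dep_test <- shapiro.test(dep_delay)
arr_test <- shapiro.test(arr_delay)

print(dep_test)
print(arr_test)

# A small p-value is evidence against normality
dep_not_normal <- dep_test$p.value < 0.01
arr_not_normal <- arr_test$p.value < 0.01
```

Note that `shapiro.test()` accepts between 3 and 5000 observations.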

37.4 Pearson’s r

If two variables are normally distributed, use Pearson’s r

The square of the correlation value indicates the percentage of shared variance

If the two variables were normally distributed (they are not, according to the Shapiro-Wilk tests)

  • 0.882 ^ 2 = 0.778
  • departure and arrival delay would share 77.8% of variance

## 
##  Pearson's product-moment correlation
## 
## data:  dep_delay and arr_delay
## t = 58.282, df = 972, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8669702 0.8950078
## sample estimates:
##       cor 
## 0.8817655
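
Output like the above comes from `cor.test()` with its default method. A minimal sketch, assuming synthetic data rather than the lecture's dataset; it also illustrates the shared-variance reading by comparing r squared with the R squared of a simple linear regression:

```r
# Synthetic data (hypothetical stand-ins for the lecture's variables)
set.seed(502)
dep_delay <- rnorm(200, mean = 10, sd = 20)
arr_delay <- dep_delay + rnorm(200, mean = 0, sd = 10)

# Pearson's product-moment correlation with test and confidence interval
pearson <- cor.test(dep_delay, arr_delay, method = "pearson")
print(pearson)

# Square of the estimate: proportion of shared variance
r <- unname(pearson$estimate)
r_squared <- r^2

# For a simple linear regression, R squared equals the squared correlation
lm_r_squared <- summary(lm(arr_delay ~ dep_delay))$r.squared

cat("r =", round(r, 3), "| shared variance =", round(100 * r_squared, 1), "%\n")
```

The degrees of freedom reported by the test are the number of complete pairs minus 2.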

37.5 Spearman’s rho

If two variables are not normally distributed, use Spearman’s rho

  • non-parametric
  • based on rank difference

The square of the correlation value indicates the percentage of variance shared by the ranks of the two variables

If there were few ties (there are in fact many)

  • 0.536 ^ 2 = 0.287
  • departure and arrival delay would share 28.7% of variance

## Warning in cor.test.default(dep_delay, arr_delay, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  dep_delay and arr_delay
## S = 71437522, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5361247
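
Spearman's rho is Pearson's r computed on the ranks of the values. A minimal sketch, assuming synthetic data rounded to whole minutes so that ties occur; passing `exact = FALSE` requests the approximate p-value, which is intended here to avoid the ties warning seen above:

```r
# Synthetic skewed delays, rounded so that ties occur (hypothetical data)
set.seed(502)
dep_delay <- round(rexp(200, rate = 1 / 15))
arr_delay <- round(dep_delay + rnorm(200, sd = 10))

# Spearman's rank correlation with an approximate (non-exact) p-value
spearman <- cor.test(dep_delay, arr_delay, method = "spearman", exact = FALSE)
print(spearman)

rho <- unname(spearman$estimate)

# Same value obtained as Pearson's r on the ranks (ties receive average ranks)
rho_from_ranks <- cor(rank(dep_delay), rank(arr_delay))

print(c(rho = rho, from_ranks = rho_from_ranks))
```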

37.6 Kendall’s tau

If two variables are not normally distributed and there is a large number of ties, use Kendall’s tau

  • non-parametric
  • based on concordant and discordant pairs of ranks

The square of the correlation value indicates the percentage of shared variance

Using Kendall’s tau

  • 0.396 ^ 2 = 0.157
  • departure and arrival delay actually seem to share 15.7% of variance

##
##  Kendall's rank correlation tau
## 
## data:  dep_delay and arr_delay
## z = 17.859, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.3956265
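
Kendall's tau compares every pair of observations: a pair is concordant if both variables order it the same way and discordant otherwise. A minimal sketch, assuming synthetic continuous data without ties, in which case tau reduces to (concordant - discordant) / (number of pairs); with ties, R applies a correction that this hand calculation does not include:

```r
# Synthetic continuous data, so ties are not expected (hypothetical data)
set.seed(502)
n <- 50
dep_delay <- rexp(n, rate = 1 / 15)
arr_delay <- dep_delay + rnorm(n, sd = 10)

# Count concordant and discordant pairs over all i < j
concordant <- 0
discordant <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    s <- sign(dep_delay[i] - dep_delay[j]) * sign(arr_delay[i] - arr_delay[j])
    if (s > 0) concordant <- concordant + 1
    if (s < 0) discordant <- discordant + 1
  }
}

# Tau without tie correction
tau_manual <- (concordant - discordant) / (n * (n - 1) / 2)

# Built-in test and estimate
kendall <- cor.test(dep_delay, arr_delay, method = "kendall")
tau_builtin <- unname(kendall$estimate)

print(c(manual = tau_manual, builtin = tau_builtin))
```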

37.7 Pairs plot

A pairs plot combines histograms, scatter plots, and correlation values for a set of variables in one visualisation
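
One way to build such a plot is base R's `pairs()` with custom panel functions; the lecture may well use a different package. A minimal sketch, assuming synthetic data: the variables, the helper names `panel_hist` and `panel_cor`, and the choice of Kendall's tau for the displayed values are all illustrative assumptions.

```r
# Synthetic data (hypothetical variables, not the lecture's dataset)
set.seed(502)
n <- 200
dep_delay <- rexp(n, rate = 1 / 15)
arr_delay <- dep_delay + rnorm(n, sd = 10)
distance <- runif(n, min = 100, max = 2500)
delays <- data.frame(dep_delay, arr_delay, distance)

# Diagonal panels: histogram of each variable, rescaled to fit the panel
panel_hist <- function(x, ...) {
  usr <- par("usr")
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  n_breaks <- length(h$breaks)
  heights <- h$counts / max(h$counts)
  rect(h$breaks[-n_breaks], 0, h$breaks[-1], heights, col = "grey80")
}

# Upper panels: correlation value printed in the centre of the panel
panel_cor <- function(x, y, ...) {
  par(usr = c(0, 1, 0, 1))
  tau <- cor(x, y, method = "kendall")
  text(0.5, 0.5, format(tau, digits = 2), cex = 1.5)
}

# Lower panels keep the default scatter plots
pairs(delays, diag.panel = panel_hist, upper.panel = panel_cor)
```

Run non-interactively, the plot is written to the default graphics device.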