2020-01-15
A visual variable is an aspect of a mark that can be controlled to change its appearance.
Visual variables include:
Grammars provide rules for languages
“The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements)” (Wilkinson, 2005)
Statistical graphic specifications are expressed in six statements:
The `ggplot2` library offers a series of functions for creating graphics declaratively, based on the Grammar of Graphics.
To create a graph in `ggplot2`:
- specify the aesthetic mappings (`aes`)
- specify the geometry (e.g., `geom_point`)

```r
library(tidyverse)
library(nycflights13)
library(knitr)
```
- `x`: variable to plot
- `geom_histogram`

```r
nycflights13::flights %>%
  filter(month == 11) %>%
  ggplot(aes(x = dep_delay)) +
  geom_histogram(binwidth = 10)
```
- `x`: categorical variable
- `y`: variable to plot
- `geom_boxplot`

```r
nycflights13::flights %>%
  filter(month == 11) %>%
  ggplot(aes(x = carrier, y = arr_delay)) +
  geom_boxplot()
```
- `x`: categorical variable
- `y`: variable to plot
- `geom_jitter`

```r
nycflights13::flights %>%
  filter(month == 11) %>%
  ggplot(aes(x = carrier, y = arr_delay)) +
  geom_jitter()
```
- `x`: categorical variable
- `y`: variable to plot
- `geom_violin`

```r
nycflights13::flights %>%
  filter(month == 11) %>%
  ggplot(aes(x = carrier, y = arr_delay)) +
  geom_violin()
```
- `x`: e.g., a temporal variable
- `y`: variable to plot
- `geom_line`

```r
nycflights13::flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(flight_date = ISOdate(year, month, day)) %>%
  group_by(flight_date) %>%
  summarize(avg_dep_delay = mean(dep_delay)) %>%
  ggplot(aes(x = flight_date, y = avg_dep_delay)) +
  geom_line()
```
- `x` and `y`: variables to plot
- `geom_point`

```r
nycflights13::flights %>%
  filter(
    month == 11,
    carrier == "US",
    !is.na(dep_delay),
    !is.na(arr_delay)
  ) %>%
  ggplot(aes(x = dep_delay, y = arr_delay)) +
  geom_point()
```
- `x` and `y`: variables to plot
- `geom_count`: counts overlapping points and maps the count to size

```r
nycflights13::flights %>%
  filter(
    month == 11,
    carrier == "US",
    !is.na(dep_delay),
    !is.na(arr_delay)
  ) %>%
  ggplot(aes(x = dep_delay, y = arr_delay)) +
  geom_count()
```
- `x` and `y`: variables to plot
- `geom_bin2d`

```r
nycflights13::flights %>%
  filter(
    month == 11,
    carrier == "US",
    !is.na(dep_delay),
    !is.na(arr_delay)
  ) %>%
  ggplot(aes(x = dep_delay, y = arr_delay)) +
  geom_bin2d()
```
Quantitatively describe or summarize variables
`stat.desc` from the `pastecs` library:
- `basic` includes counts
- `desc` includes descriptive stats
- `norm` (default is `FALSE`) includes distribution stats

```r
library(pastecs)

nycflights13::flights %>%
  filter(month == 11, carrier == "US") %>%
  select(dep_delay, arr_delay, distance) %>%
  stat.desc() %>%
  kable()
```
statistic | dep_delay | arr_delay | distance |
---|---|---|---|
nbr.val | 1668.0000000 | 1667.000000 | 1.699000e+03 |
nbr.null | 58.0000000 | 35.000000 | 0.000000e+00 |
nbr.na | 31.0000000 | 32.000000 | 0.000000e+00 |
min | -17.0000000 | -63.000000 | 9.600000e+01 |
max | 193.0000000 | 191.000000 | 2.153000e+03 |
range | 210.0000000 | 254.000000 | 2.057000e+03 |
sum | 961.0000000 | -4450.000000 | 9.715580e+05 |
median | -4.0000000 | -7.000000 | 5.290000e+02 |
mean | 0.5761391 | -2.669466 | 5.718411e+02 |
SE.mean | 0.4084206 | 0.518816 | 1.464965e+01 |
CI.mean.0.95 | 0.8010713 | 1.017600 | 2.873327e+01 |
var | 278.2347513 | 448.706408 | 3.646264e+05 |
std.dev | 16.6803702 | 21.182691 | 6.038430e+02 |
coef.var | 28.9519850 | -7.935179 | 1.055963e+00 |
- `nbr.val`: number of values in the dataset that are not `NA`
- `nbr.null`: number of null values, that is, values equal to zero (not R's `NULL` object)
- `nbr.na`: number of `NA`s (missing value indicator)
- `min` (also `min()`): minimum value in the dataset
- `max` (also `max()`): maximum value in the dataset
- `range`: difference between `min` and `max` (different from `range()`, which returns both values)
- `sum` (also `sum()`): sum of the values in the dataset
- `mean` (also `mean()`): arithmetic mean, that is `sum` over the number of values not `NA`
- `median` (also `median()`): median, that is the value separating the higher half from the lower half of the values
- mode: the value that appears most often in the values; it is not reported by `stat.desc`, and base R's `mode()` returns the storage mode of an object rather than this statistic

Assuming that the data in the dataset are a sample of a population:
- `SE.mean`: standard error of the mean, an estimate of the variability of the mean calculated on different samples of the data (see also the central limit theorem)
- `CI.mean.0.95`: half-width of the 95% confidence interval of the mean; intervals of this width built around the sample mean are expected to contain the population mean for about 95% of samples
- `var`: variance (\(\sigma^2\)), it quantifies the amount of variation as the average of squared distances from the mean
\[\sigma^2 = \frac{1}{n} \sum_{i=1}^n (\mu-x_i)^2\]
- `std.dev`: standard deviation (\(\sigma\)), it quantifies the amount of variation as the square root of the variance
\[\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (\mu-x_i)^2}\]

These formulas are the population definitions; R's `var()` and `sd()` treat the data as a sample and divide by \(n-1\) instead of \(n\).
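- `coef.var`: variation coefficient, it quantifies the amount of variation as the standard deviation divided by the mean

```r
# Mean and standard deviation of dep_delay; assumed here to be computed
# over all flights, ignoring missing values
dep_delay_mean <- mean(nycflights13::flights$dep_delay, na.rm = TRUE)
dep_delay_sd <- sd(nycflights13::flights$dep_delay, na.rm = TRUE)

nycflights13::flights %>%
  ggplot(aes(x = dep_delay)) +
  geom_histogram(aes(y = ..density..), binwidth = 10) +
  stat_function(
    fun = dnorm,
    args = list(mean = dep_delay_mean, sd = dep_delay_sd),
    colour = "black",
    size = 1
  )
```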
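The descriptive statistics defined above can be recomputed by hand in base R, which makes the definitions concrete. The following is a minimal sketch on a small made-up vector; the data and the helper name `stat_mode` are illustrative assumptions, and the use of a Student's t quantile for the confidence interval is an assumption about how such an interval is typically built.

```r
# Illustrative data (made up for this sketch), including one missing value
x <- c(2, 4, 4, 5, 7, 9, NA)

# Keep only the values that are not NA
v <- x[!is.na(x)]
n <- length(v)

x_mean <- sum(v) / n                 # arithmetic mean
x_range <- max(v) - min(v)           # a single number, unlike range()

# Population variance, as in the formula with denominator n
var_pop <- sum((x_mean - v)^2) / n
# Sample variance, as returned by var(), with denominator n - 1
var_sample <- sum((x_mean - v)^2) / (n - 1)
std_dev <- sqrt(var_sample)

# Standard error of the mean
se_mean <- std_dev / sqrt(n)
# Half-width of the 95% confidence interval (assumed: Student's t quantile)
ci_mean_95 <- qt(0.975, df = n - 1) * se_mean
# Variation coefficient
coef_var <- std_dev / x_mean

# Hypothetical helper: statistical mode (all values tied for most frequent)
stat_mode <- function(values) {
  counts <- table(values)
  as.numeric(names(counts)[as.vector(counts) == max(counts)])
}
x_mode <- stat_mode(v)
```

Comparing `var_pop` with `var_sample` shows the effect of the denominator: the two differ by a factor of `n / (n - 1)`, which matters mostly for small samples.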
Cumulative values against the cumulative probability of a particular distribution
```r
nycflights13::flights %>%
  filter(month == 11, carrier == "US") %>%
  qplot(
    sample = dep_delay,
    data = .,
    stat = "qq",
    xlab = "Theoretical",
    ylab = "Sample"
  )
```
```r
nycflights13::flights %>%
  filter(month == 11, carrier == "US") %>%
  select(dep_delay, arr_delay, distance) %>%
  stat.desc(basic = FALSE, desc = FALSE, norm = TRUE) %>%
  kable()
```
statistic | dep_delay | arr_delay | distance |
---|---|---|---|
skewness | 4.4187763 | 2.0716291 | 2.0030249 |
skew.2SE | 36.8709612 | 17.2808242 | 16.8678747 |
kurtosis | 28.8513206 | 9.5741004 | 2.6000743 |
kurt.2SE | 120.4418092 | 39.9557893 | 10.9542887 |
normtest.W | 0.5545326 | 0.8657894 | 0.6012442 |
normtest.p | 0.0000000 | 0.0000000 | 0.0000000 |
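To make the skewness and kurtosis rows more tangible, the sketch below computes moment-based versions of both statistics in base R. The function names, the sample data, and the large-sample standard errors `sqrt(6 / n)` and `sqrt(24 / n)` are assumptions made for illustration; `pastecs` may apply a small-sample correction, so its values can differ slightly.

```r
# Moment-based skewness (zero for a symmetric distribution)
skewness <- function(x) {
  x <- x[!is.na(x)]
  sum((x - mean(x))^3) / (length(x) * sd(x)^3)
}

# Moment-based excess kurtosis (zero for a normal distribution)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  sum((x - mean(x))^4) / (length(x) * sd(x)^4) - 3
}

# Statistic divided by twice its approximate standard error
# (assumed large-sample approximations)
skew_2se <- function(x) {
  n <- sum(!is.na(x))
  skewness(x) / (2 * sqrt(6 / n))
}
kurt_2se <- function(x) {
  n <- sum(!is.na(x))
  excess_kurtosis(x) / (2 * sqrt(24 / n))
}

# Made-up examples: a symmetric vector and a right-skewed one
symmetric <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 1, 2, 10)

skew_symmetric <- skewness(symmetric)
skew_right <- skewness(right_skewed)
kurt_symmetric <- excess_kurtosis(symmetric)
```

The symmetric vector has no skew, while the single large value in `right_skewed` pulls its skewness above zero.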
The Shapiro–Wilk test compares the distribution of a variable with a normal distribution having the same mean and standard deviation
- `normtest.W` (test statistic) and `normtest.p` (significance)
- the `shapiro.test` function is also available

```r
nycflights13::flights %>%
  filter(month == 11, carrier == "US") %>%
  pull(dep_delay) %>%
  shapiro.test()
```
```
##
##  Shapiro-Wilk normality test
##
## data:  .
## W = 0.55453, p-value < 2.2e-16
```
Most statistical tests are based on the idea of hypothesis testing
The threshold to accept or reject a hypothesis is arbitrary and based on conventions (e.g., p < .01 or p < .05)
Example: The null hypothesis of the Shapiro–Wilk test is that the sample is normally distributed; p < .01 indicates that data departing this far from normality would be very unlikely if that hypothesis were true, so the null hypothesis is rejected.
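The decision rule described above can be sketched with `shapiro.test` on simulated data. The seed, the sample sizes, and the variable names are arbitrary choices made for this illustration, not part of the original material.

```r
set.seed(42)                     # arbitrary seed, for reproducibility only

normal_sample <- rnorm(200)      # drawn from a normal distribution
skewed_sample <- rexp(200)       # drawn from a right-skewed distribution

alpha <- 0.01                    # conventional significance threshold

test_normal <- shapiro.test(normal_sample)
test_skewed <- shapiro.test(skewed_sample)

# Reject the null hypothesis of normality when p < alpha
reject_normal <- test_normal$p.value < alpha
reject_skewed <- test_skewed$p.value < alpha
```

A non-significant result does not prove that the sample is normal; it only means that the test found no strong evidence against normality.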
In a normal distribution, the values of skewness and kurtosis should be zero
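- `skewness`: skewness value indicates the direction of asymmetry; positive values indicate a longer right tail, negative values a longer left tail
- `kurtosis`: kurtosis value indicates the weight of the tails; positive values indicate heavier tails and a sharper peak than the normal distribution, negative values lighter tails and a flatter shape
- `skew.2SE` and `kurt.2SE`: skewness and kurtosis divided by 2 standard errors. If greater than 1 in absolute value, the respective statistic is significant (p < .05).

Levene’s test for equality of variance in different levels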
```r
dep_delay_carrier <- nycflights13::flights %>%
  filter(month == 11) %>%
  select(dep_delay, carrier)

library(car)
leveneTest(dep_delay_carrier$dep_delay, dep_delay_carrier$carrier)
```
```
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)
## group    15  20.203 < 2.2e-16 ***
##       27019
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
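The idea behind the median-centred Levene's test shown above can be sketched in base R as a one-way ANOVA on the absolute deviations of each value from its group median. The data and variable names below are made up for illustration; this is a sketch of the technique and not a replacement for `leveneTest`.

```r
# Made-up data: two groups with a similar centre but very different spread
values <- c(9, 9.5, 10, 10.5, 11, 10.2,
            0, 5, 10, 15, 20, 12)
groups <- factor(rep(c("a", "b"), each = 6))

# Absolute deviation of each value from the median of its group
abs_dev <- abs(values - ave(values, groups, FUN = median))

# One-way ANOVA on the absolute deviations
levene_table <- anova(lm(abs_dev ~ groups))

f_value <- levene_table[["F value"]][1]
p_value <- levene_table[["Pr(>F)"]][1]
```

A small `p_value` suggests that the spread differs between groups, which is the situation that violates the equal-variance assumption of many parametric tests.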
In the practical session, we will see: