2020-01-15
A visual variable is an aspect of a mark that can be controlled to change its appearance.
Visual variables include:
Grammars provide rules for languages
“The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements)” (Wilkinson, 2005)
Statistical graphic specifications are expressed in six statements:
The ggplot2 library offers a series of functions for creating graphics declaratively, based on the Grammar of Graphics.
To create a graph in ggplot2:
aes)geom_point)library(tidyverse) library(nycflights13) library(knitr)
x variable to plotgeom_histogramnycflights13::flights %>%
  filter(month == 11) %>%
  ggplot(
    aes(
      x = dep_delay
    )
  ) +
  geom_histogram(
    binwidth = 10
  )
x categorical variabley variable to plotgeom_boxplotnycflights13::flights %>%
  filter(month == 11) %>%
  ggplot(
    aes(
      x = carrier, 
      y = arr_delay
    )
  ) +
  geom_boxplot()
x categorical variabley variable to plotgeom_jitternycflights13::flights %>%
  filter(month == 11) %>%
  ggplot(
    aes(
      x = carrier, 
      y = arr_delay
    )
  ) +
  geom_jitter()
x categorical variabley variable to plotgeom_violinnycflights13::flights %>%
  filter(month == 11) %>%
  ggplot(
    aes(
      x = carrier, 
      y = arr_delay
    )
  ) +
  geom_violin()
x e.g., a temporal variabley variable to plotgeom_linenycflights13::flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(flight_date = ISOdate(year, month, day)) %>%
  group_by(flight_date) %>%
  summarize(avg_dep_delay = mean(dep_delay)) %>%
  ggplot(aes(
    x = flight_date,
    y = avg_dep_delay
  )) +
  geom_line()
x and y variable to plotgeom_pointnycflights13::flights %>%
  filter(
    month == 11, 
    carrier == "US",
    !is.na(dep_delay),
    !is.na(arr_delay)
  ) %>%
  ggplot(aes(
    x = dep_delay,
    y = arr_delay
  )) +
  geom_point()
x and y variable to plotgeom_count counts overlapping points and maps the count to sizenycflights13::flights %>%
  filter(
    month == 11, carrier == "US",
    !is.na(dep_delay), !is.na(arr_delay)
  ) %>%
  ggplot(aes(
    x = dep_delay,
    y = arr_delay
  )) +
  geom_count()
x and y variable to plotgeom_bin2dnycflights13::flights %>%
  filter(
    month == 11, 
    carrier == "US",
    !is.na(dep_delay),
    !is.na(arr_delay)
  ) %>%
  ggplot(aes(
    x = dep_delay,
    y = arr_delay
  )) +
  geom_bin2d()
Quantitatively describe or summarize variables
stat.desc from pastecs library
base includes countsdesc includes descriptive statsnorm (default is FALSE) includes distribution statslibrary(pastecs) nycflights13::flights %>% filter(month == 11, carrier == "US") %>% select(dep_delay, arr_delay, distance) %>% stat.desc() %>% kable()
| dep_delay | arr_delay | distance | |
|---|---|---|---|
| nbr.val | 1668.0000000 | 1667.000000 | 1.699000e+03 | 
| nbr.null | 58.0000000 | 35.000000 | 0.000000e+00 | 
| nbr.na | 31.0000000 | 32.000000 | 0.000000e+00 | 
| min | -17.0000000 | -63.000000 | 9.600000e+01 | 
| max | 193.0000000 | 191.000000 | 2.153000e+03 | 
| range | 210.0000000 | 254.000000 | 2.057000e+03 | 
| sum | 961.0000000 | -4450.000000 | 9.715580e+05 | 
| median | -4.0000000 | -7.000000 | 5.290000e+02 | 
| mean | 0.5761391 | -2.669466 | 5.718411e+02 | 
| SE.mean | 0.4084206 | 0.518816 | 1.464965e+01 | 
| CI.mean.0.95 | 0.8010713 | 1.017600 | 2.873327e+01 | 
| var | 278.2347513 | 448.706408 | 3.646264e+05 | 
| std.dev | 16.6803702 | 21.182691 | 6.038430e+02 | 
| coef.var | 28.9519850 | -7.935179 | 1.055963e+00 | 
nbr.val: overall number of values in the datasetnbr.null: number of NULL values – NULL is often returned by expressions and functions whose values are undefinednbr.na: number of NAs – missing value indicatormin (also min()): minimum value in the datasetmax (also max()): minimum value in the datasetrange: difference between min and max (different from range())sum (also sum()): sum of the values in the datasetmean (also mean()): arithmetic mean, that is sum over the number of values not NAmedian (also median()): median, that is the value separating the higher half from the lower half the valuesmode()functio is available: mode, the value that appears most often in the valuesAssuming that the data in the dataset are a sample of a population
SE.mean: standard error of the mean – estimation of the variability of the mean calculated on different samples of the data (see also central limit theorem)
CI.mean.0.95: 95% confidence interval of the mean – indicates that there is a 95% probability that the actual mean is within that distance from the sample mean
var: variance (\(\sigma^2\)), it quantifies the amount of variation as the average of squared distances from the mean\[\sigma^2 = \frac{1}{n} \sum_{i=1}^n (\mu-x_i)^2\]
std.dev: standard deviation (\(\sigma\)), it quantifies the amount of variation as the square root of the variance\[\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (\mu-x_i)^2}\]
coef.var: variation coefficient it quantifies the amount of variation as the standard deviation divided by the meannycflights13::flights %>% 
  ggplot(
    aes(
      x = dep_delay
    )
  ) +
  geom_histogram(
    aes(
      y =..density..
    ),
    binwidth = 10
  ) + 
  stat_function(
    fun = dnorm, 
    args = list(
      mean = dep_delay_mean,
      sd = dep_delay_sd),
    colour = "black", size = 1)Cumulative values against the cumulative probability of a particular distribution
nycflights13::flights %>%
  filter(
    month == 11, 
    carrier == "US"
  ) %>%
  qplot(
    sample = dep_delay, 
    data = .,
    stat = "qq", 
    xlab = "Theoretical", 
    ylab = "Sample"
  )nycflights13::flights %>% filter(month == 11, carrier == "US") %>% select(dep_delay, arr_delay, distance) %>% stat.desc(basic = FALSE, desc = FALSE, norm = TRUE) %>% kable()
| dep_delay | arr_delay | distance | |
|---|---|---|---|
| skewness | 4.4187763 | 2.0716291 | 2.0030249 | 
| skew.2SE | 36.8709612 | 17.2808242 | 16.8678747 | 
| kurtosis | 28.8513206 | 9.5741004 | 2.6000743 | 
| kurt.2SE | 120.4418092 | 39.9557893 | 10.9542887 | 
| normtest.W | 0.5545326 | 0.8657894 | 0.6012442 | 
| normtest.p | 0.0000000 | 0.0000000 | 0.0000000 | 
Shapiro–Wilk test compares the distribution of a variable with a normal distribution having same mean and standard deviation
normtest.W (test statistics) and normtest.p (significance)shapiro.test function is availablenycflights13::flights %>% filter(month == 11, carrier == "US") %>% pull(dep_delay) %>% shapiro.test()
## ## Shapiro-Wilk normality test ## ## data: . ## W = 0.55453, p-value < 2.2e-16
Most statistical tests are based on the idea of hypothesis testing
The threshold to accept or reject an hypotheis is arbitrary and based on conventions (e.g., p < .01 or p < .05)
Example: The null hypotheis of the Shapiro–Wilk test is that the sample is normally distributed and p < .01 indicates that the probability of that being true is very low.
In a normal distribution, the values of skewness and kurtosis should be zero
skewness: skewness value indicates
kurtosis: kurtosis value indicates
skew.2SE and kurt.2SE: skewness and kurtosis divided by 2 standard errors. If greater than 1, the respective statistics is significant (p < .05).Levene’s test for equality of variance in different levels
dep_delay_carrier <- nycflights13::flights %>% filter(month == 11) %>% select(dep_delay, carrier) library(car) leveneTest(dep_delay_carrier$dep_delay, dep_delay_carrier$carrier)
## Levene's Test for Homogeneity of Variance (center = median) ## Df F value Pr(>F) ## group 15 20.203 < 2.2e-16 *** ## 27019 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the practical session, we will see: