32 Descriptive statistics

32.1 Descriptive statistics

Quantitatively describe or summarize variables

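  • stat.desc from the pastecs library
    • basic (default is TRUE) includes counts, along with min, max, range and sum
    • desc (default is TRUE) includes descriptive stats such as median, mean and measures of variation
    • norm (default is FALSE) includes distribution stats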
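As a minimal sketch of the call described above, the following runs stat.desc on a small, hypothetical data frame. The values in `delays` are invented for illustration and are not the flight data summarised in these notes. The call is guarded so the script still runs if pastecs is not installed.

```r
# Hypothetical toy data standing in for the flight delay columns
delays <- data.frame(
  dep_delay = c(-4, 0, 12, NA, -7, 35, 0, 3),
  arr_delay = c(-10, 5, 20, NA, -15, 40, -2, NA)
)

if (requireNamespace("pastecs", quietly = TRUE)) {
  # basic and desc are on by default; norm is off by default
  res <- pastecs::stat.desc(delays, basic = TRUE, desc = TRUE, norm = FALSE)
  print(round(res, 3))
} else {
  message("pastecs is not installed; run install.packages(\"pastecs\") first")
}
```

Each column of the result corresponds to one variable and each row to one statistic, as in the output shown in the next section.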

32.2 stat.desc output

                dep_delay    arr_delay     distance
nbr.val      1668.0000000  1667.000000 1.699000e+03
nbr.null       58.0000000    35.000000 0.000000e+00
nbr.na         31.0000000    32.000000 0.000000e+00
min           -17.0000000   -63.000000 9.600000e+01
max           193.0000000   191.000000 2.153000e+03
range         210.0000000   254.000000 2.057000e+03
sum           961.0000000 -4450.000000 9.715580e+05
median         -4.0000000    -7.000000 5.290000e+02
mean            0.5761391    -2.669466 5.718411e+02
SE.mean         0.4084206     0.518816 1.464965e+01
CI.mean.0.95    0.8010713     1.017600 2.873327e+01
var           278.2347513   448.706408 3.646264e+05
std.dev        16.6803702    21.182691 6.038430e+02
coef.var       28.9519850    -7.935179 1.055963e+00

32.3 stat.desc: basic

  • nbr.val: number of non-missing values (NAs are excluded from the count)
  • nbr.null: number of values equal to zero (this is not R's NULL object; in the output above, 58 departures had a delay of exactly 0)
  • nbr.na: number of NAs, the missing value indicator
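The three counts can be reproduced in base R. This is a sketch on a hypothetical vector `x`, assuming (as the pastecs documentation describes) that nbr.val counts non-missing values and nbr.null counts values equal to zero.

```r
# Hypothetical toy vector with two zeros and one missing value
x <- c(-4, 0, 12, NA, -7, 35, 0, 3)

nbr_val  <- sum(!is.na(x))            # non-missing values
nbr_null <- sum(x == 0, na.rm = TRUE) # values equal to zero
nbr_na   <- sum(is.na(x))             # missing values

print(c(nbr.val = nbr_val, nbr.null = nbr_null, nbr.na = nbr_na))
```

Note that nbr.val plus nbr.na gives the length of the vector, which matches the output above (for example, 1668 + 31 = 1699 for dep_delay).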

32.4 stat.desc: desc

  • min (also min()): minimum value in the dataset
  • max (also max()): maximum value in the dataset
  • range: difference between max and min (different from range(), which returns the min and max values themselves)
  • sum (also sum()): sum of the values in the dataset
  • mean (also mean()): arithmetic mean, that is the sum of the values divided by the number of values that are not NA
  • median (also median()): median, that is the value separating the higher half from the lower half of the values
  • mode (not reported by stat.desc, and base R's mode() returns an object's storage mode rather than the statistical mode): the value that appears most often in the values
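The base R counterparts of these statistics can be sketched as follows, on a hypothetical vector `x`. Because base R has no function for the statistical mode, `stat_mode` is an illustrative helper defined here, not part of any package; on ties it returns the smallest of the most frequent values.

```r
# Hypothetical toy vector with one missing value
x <- c(-4, 0, 12, NA, -7, 35, 0, 3)

x_min    <- min(x, na.rm = TRUE)
x_max    <- max(x, na.rm = TRUE)
x_range  <- diff(range(x, na.rm = TRUE)) # range() returns c(min, max); diff() gives max - min
x_sum    <- sum(x, na.rm = TRUE)
x_mean   <- mean(x, na.rm = TRUE)        # sum divided by the number of non-NA values
x_median <- median(x, na.rm = TRUE)

# Illustrative helper: most frequent non-missing value
stat_mode <- function(v) {
  v <- v[!is.na(v)]
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

print(c(min = x_min, max = x_max, range = x_range, sum = x_sum,
        mean = x_mean, median = x_median, mode = stat_mode(x)))
```

Without `na.rm = TRUE`, each of these functions returns NA as soon as the vector contains a missing value.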

32.5 Sample statistics

Assuming that the data in the dataset are a sample of a population

  • SE.mean: standard error of the mean – estimation of the variability of the mean calculated on different samples of the data (see also central limit theorem)

  • CI.mean.0.95: half-width of the 95% confidence interval of the mean; the interval is the sample mean plus or minus this value, and intervals built this way from repeated samples would contain the population mean about 95% of the time
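Both sample statistics can be sketched in base R. The vector `x` is hypothetical. The sketch assumes that CI.mean.0.95 is a Student t quantile multiplied by SE.mean, which is consistent with the dep_delay figures in the output above; the second half of the block checks this against those reported values.

```r
# Hypothetical toy sample of delays in minutes (no missing values)
x <- c(-4, 0, 12, -7, 35, 0, 3)
n <- length(x)

se_mean <- sd(x) / sqrt(n)                  # standard error of the mean
ci_half <- qt(0.975, df = n - 1) * se_mean  # half-width of the 95% interval
ci <- c(lower = mean(x) - ci_half, upper = mean(x) + ci_half)
print(ci)

# Cross-check with the dep_delay column of the stat.desc output:
# std.dev = 16.6803702 and nbr.val = 1668
se_dep <- 16.6803702 / sqrt(1668)
ci_dep <- qt(0.975, df = 1668 - 1) * se_dep
print(c(SE.mean = se_dep, CI.mean.0.95 = ci_dep))
```

With 1668 values the t quantile is very close to the normal value of 1.96, so the half-width is roughly twice the standard error.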

32.6 Estimating variation

  • var: variance (\(s^2\)), it quantifies the amount of variation as the average of squared distances from the mean; for a sample with mean \(\bar{x}\), var() and stat.desc divide by \(n-1\) rather than \(n\)

\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2\]

  • std.dev: standard deviation (\(s\)), it quantifies the amount of variation as the square root of the variance

\[s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2}\]

  • coef.var: variation coefficient, it quantifies the amount of variation as the standard deviation divided by the mean
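The three measures of variation can be computed by hand and compared with R's built-in functions. This is a sketch on a hypothetical vector `x`. R's var() and sd() treat the data as a sample and divide by n - 1, not n. The last lines re-derive std.dev and coef.var from the dep_delay figures in the output above.

```r
# Hypothetical toy sample of delays in minutes (no missing values)
x <- c(-4, 0, 12, -7, 35, 0, 3)
n <- length(x)

var_manual <- sum((x - mean(x))^2) / (n - 1) # sample variance
sd_manual  <- sqrt(var_manual)               # standard deviation
coef_var   <- sd_manual / mean(x)            # variation coefficient

print(c(var = var_manual, std.dev = sd_manual, coef.var = coef_var))
print(all.equal(var_manual, var(x)))
print(all.equal(sd_manual, sd(x)))

# Cross-check with the dep_delay column of the stat.desc output:
# var = 278.2347513, mean = 0.5761391
sd_dep <- sqrt(278.2347513)
cv_dep <- sd_dep / 0.5761391
print(c(std.dev = sd_dep, coef.var = cv_dep))
```

The variation coefficient becomes very large when the mean is close to zero, as for dep_delay, and changes sign with the mean, as for arr_delay.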