2020-01-15
An introduction to R
More complex data types
Vectors are ordered lists of values.
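A vector variable can be defined using an identifier (e.g., a_vector), the assignment operator <-, and the function c, which combines its arguments into a vector.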
a_vector <- c("Birmingham", "Derby", "Leicester", "Lincoln", "Nottingham", "Wolverhampton")
a_vector
## [1] "Birmingham" "Derby" "Leicester" "Lincoln"
## [5] "Nottingham" "Wolverhampton"
Vectors can also be created using the operator : and the functions seq and rep.
4:7
## [1] 4 5 6 7
seq(1, 7, by = 0.5)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
seq(1, 10, length.out = 7)
## [1] 1.0 2.5 4.0 5.5 7.0 8.5 10.0
rep("Ciao", 4)
## [1] "Ciao" "Ciao" "Ciao" "Ciao"
Each element of a vector can be retrieved by specifying its index between square brackets, after the identifier of the vector. The first element of the vector has index 1.
a_vector[3]
## [1] "Leicester"
A vector of indexes can be used to retrieve more than one element.
a_vector[c(5, 3)]
## [1] "Nottingham" "Leicester"
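Besides positive indexes, base R also accepts negative indexes (which exclude elements) and logical vectors (which keep the elements matching TRUE). A minimal sketch follows; the vector cities and the other variable names are hypothetical and not part of the slides.

```r
# Hypothetical example vector (not from the slides)
cities <- c("Birmingham", "Derby", "Leicester", "Lincoln")

# A negative index drops the element at that position
without_second <- cities[-2]

# A logical vector keeps only the elements where the condition is TRUE;
# here, names longer than seven characters
long_names <- cities[nchar(cities) > 7]

print(without_second)
print(long_names)
```

Negative and positive indexes cannot be mixed in the same selection.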
Functions and operators can be applied to a vector variable directly; they are applied to each element in turn.
a_numeric_vector <- 1:5
a_numeric_vector + 10
## [1] 11 12 13 14 15
sqrt(a_numeric_vector)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
a_numeric_vector >= 3
## [1] FALSE FALSE TRUE TRUE TRUE
Whether a condition holds for at least one element, or for all elements, can be tested using the functions any and all:
any(a_numeric_vector >= 3)
## [1] TRUE
all(a_numeric_vector >= 3)
## [1] FALSE
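The logical vector produced by a comparison can also be used to filter the vector, count the matches, or find their positions. The sketch below assumes a hypothetical vector named values; none of the variable names come from the slides.

```r
# Hypothetical numeric vector (not from the slides)
values <- 1:5

# Element-wise comparison returns a logical vector
is_large <- values >= 3

# Use the logical vector to keep only the matching elements
large_values <- values[is_large]

# TRUE counts as 1 when summed, so this counts the matches
n_large <- sum(is_large)

# which() returns the positions of the TRUE values
positions <- which(is_large)

print(large_values)
print(n_large)
print(positions)
```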
A factor is a data type similar to a vector. However, the values contained in a factor can only be selected from a set of levels.
houses_vector <- c("Bungalow", "Flat", "Flat", "Detached", "Flat", "Terrace", "Terrace")
houses_vector
## [1] "Bungalow" "Flat" "Flat" "Detached" "Flat" "Terrace"
## [7] "Terrace"
houses_factor <- factor(c("Bungalow", "Flat", "Flat", "Detached", "Flat", "Terrace", "Terrace"))
houses_factor
## [1] Bungalow Flat Flat Detached Flat Terrace Terrace
## Levels: Bungalow Detached Flat Terrace
The function table can be used to obtain a tabulated count for each level.
table(houses_factor)
## houses_factor
## Bungalow Detached     Flat  Terrace
##        1        1        3        2
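A specific set of levels can be set when creating a factor by providing a levels argument. Values that are not among the levels (such as "People Carrier" below) are stored as NA and are not counted by table.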
houses_factor_spec <- factor(
  c("People Carrier", "Flat", "Flat", "Hatchback", "Flat", "Terrace", "Terrace"),
  levels = c("Bungalow", "Flat", "Detached", "Semi", "Terrace"))
table(houses_factor_spec)
## houses_factor_spec
## Bungalow     Flat Detached     Semi  Terrace
##        0        3        0        0        2
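To make explicit what happens to values outside the declared levels, the following sketch checks for missing values after creating a factor. The variable names and the shortened input are hypothetical; only base R functions (factor, is.na, levels, table) are used.

```r
# Hypothetical input: one value ("People Carrier") is not among the levels
dwellings <- factor(
  c("People Carrier", "Flat", "Terrace"),
  levels = c("Flat", "Terrace")
)

# Values outside the levels become NA
missing_flags <- is.na(dwellings)

# The levels are exactly those declared, in the declared order
declared_levels <- levels(dwellings)

# useNA = "ifany" adds a count for the NA values to the table
counts_with_na <- table(dwellings, useNA = "ifany")

print(missing_flags)
print(declared_levels)
print(counts_with_na)
```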
In statistics terminology, (unordered) factors are categorical (i.e., binary or nominal) variables. Levels are not ordered.
income_nominal <- factor(
  c("High", "High", "Low", "Low", "Low", "Medium", "Low", "Medium"),
  levels = c("Low", "Medium", "High"))
income_nominal > "Low"
## Warning in Ops.factor(income_nominal, "Low"): '>' not meaningful for factors
## [1] NA NA NA NA NA NA NA NA
In statistics terminology, ordered factors are ordinal variables. Levels are ordered.
income_ordered <- ordered(
  c("High", "High", "Low", "Low", "Low", "Medium", "Low", "Medium"),
  levels = c("Low", "Medium", "High"))
income_ordered > "Low"
## [1] TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
sort(income_ordered)
## [1] Low Low Low Low Medium Medium High High
## Levels: Low < Medium < High
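Because the levels of an ordered factor have a defined order, comparisons and functions such as max are meaningful. A minimal sketch, assuming a hypothetical factor named sizes (not from the slides) and using factor with ordered = TRUE, which is equivalent to calling ordered:

```r
# Hypothetical ordinal variable (not from the slides)
sizes <- factor(
  c("Large", "Small", "Medium", "Small"),
  levels = c("Small", "Medium", "Large"),
  ordered = TRUE
)

# Confirms that the factor is ordered
ordered_flag <- is.ordered(sizes)

# The largest value according to the order of the levels
largest <- max(sizes)

# The underlying integer codes follow the order of the levels
codes <- as.integer(sizes)

# Comparisons against a level are element-wise
at_least_medium <- sizes >= "Medium"

print(largest)
print(codes)
print(at_least_medium)
```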
Matrices are collections of values of a single type (typically numerics) arranged in a two-dimensional rectangular layout.
a_matrix <- matrix(c(3, 5, 7, 4, 3, 1), c(3, 2))
a_matrix
##      [,1] [,2]
## [1,]    3    4
## [2,]    5    3
## [3,]    7    1
Variables of the type array are higher-dimensional matrices.
a3dim_array <- array(1:24, dim=c(4, 3, 2))
a3dim_array
## , , 1
##
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
##
## , , 2
##
##      [,1] [,2] [,3]
## [1,]   13   17   21
## [2,]   14   18   22
## [3,]   15   19   23
## [4,]   16   20   24
Subsets of matrices (and arrays) can be selected as seen for vectors.
a_matrix[2, c(1, 2)]
## [1] 5 3
a3dim_array[c(1, 2), 2, 2]
## [1] 17 18
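Leaving an index empty selects every element along that dimension, and by default a selection that spans a single row or column is simplified to a vector. The sketch below, using a hypothetical matrix named grid, shows both behaviours along with the drop = FALSE argument that keeps the matrix structure.

```r
# Hypothetical matrix (not from the slides); values fill column by column
grid <- matrix(1:6, nrow = 3)

# An empty row index selects all rows of the second column
second_col <- grid[, 2]

# By default a single row is simplified to a vector;
# drop = FALSE keeps it as a one-row matrix
first_row <- grid[1, , drop = FALSE]
first_row_dims <- dim(first_row)

print(second_col)
print(first_row)
print(first_row_dims)
```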
The function apply applies another function to each level of a given dimension of an array.
apply(a3dim_array, 3, min) # apply on third dimension
## [1] 1 13
apply(a3dim_array, 1, min) # apply on first dimension
## [1] 1 2 3 4
apply(a3dim_array, 2, min) # apply on second dimension
## [1] 1 5 9
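The same idea applies to matrices, where dimension 1 corresponds to rows and dimension 2 to columns. A minimal sketch with a hypothetical matrix named scores (not from the slides):

```r
# Hypothetical 2 x 3 matrix; values fill column by column:
# row 1 holds 1 3 5 and row 2 holds 2 4 6
scores <- matrix(1:6, nrow = 2)

# Apply sum over the first dimension: one result per row
row_totals <- apply(scores, 1, sum)

# Apply sum over the second dimension: one result per column
col_totals <- apply(scores, 2, sum)

print(row_totals)
print(col_totals)
```

For sums and means specifically, base R also provides rowSums, colSums, rowMeans and colMeans, which give the same results for this case.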
Variables of the type list can contain elements of different types (including vectors and matrices), whereas elements of vectors are all of the same type.
employee <- list("Stefano", 2015)
employee
## [[1]]
## [1] "Stefano"
##
## [[2]]
## [1] 2015
employee[[1]] # Note the double square brackets for selection
## [1] "Stefano"
In named lists, each element has a name, and elements can be selected using their name after the symbol $.
employee <- list(name = "Stefano", start_year = 2015)
employee
## $name
## [1] "Stefano"
##
## $start_year
## [1] 2015
employee$name
## [1] "Stefano"
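Elements of a named list can also be selected with double square brackets and the name as a string, and the same notations can be used to change or add elements. A short sketch, where the list person and its contents are hypothetical:

```r
# Hypothetical named list (not from the slides)
person <- list(name = "Ada", start_year = 2018)

# Double square brackets with a name return the element itself
person_name <- person[["name"]]

# Single square brackets return a list containing the element
name_as_list <- person["name"]

# Assigning through $ changes an existing element or adds a new one
person$start_year <- 2019
person$role <- "Researcher"

print(person_name)
print(names(person))
```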
When using lapply to apply a function to each element of a list, take care that the function makes sense for every element in the list.
various <- list(
  "Some text",
  matrix(c(6, 3, 1, 2), c(2, 2))
)
lapply(various, is.numeric)
## [[1]]
## [1] FALSE
##
## [[2]]
## [1] TRUE
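One way to make sure a function makes sense for every element is to check the elements first and apply the function only to those that qualify. The sketch below is one possible approach rather than a prescribed method; the list items is hypothetical, and sapply is used because it simplifies the result to a vector where lapply would return a list.

```r
# Hypothetical list with mixed element types (not from the slides)
items <- list("Some text", 1:3, c(2.5, 4))

# sapply returns a logical vector instead of a list
numeric_flags <- sapply(items, is.numeric)

# Keep only the numeric elements
numeric_items <- items[numeric_flags]

# mean is now meaningful for every remaining element
item_means <- lapply(numeric_items, mean)

print(numeric_flags)
print(item_means)
```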
A data frame is equivalent to a named list where all elements are vectors of the same length.
employees <- data.frame(
  Name = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  Role = c("Professor", "Researcher", "Researcher"))
employees
##    Name Age       Role
## 1 Maria  47  Professor
## 2  Pete  34 Researcher
## 3 Sarah  32 Researcher
Data frames are the most common way to represent tabular data in R. Matrices and lists can be converted to data frames.
Selection is similar to vectors and lists.
employees[1, ] # row selection
##    Name Age      Role
## 1 Maria  47 Professor
employees[, 1] # column selection, as for matrices
## [1] Maria Pete Sarah
## Levels: Maria Pete Sarah
employees$Name # column selection, as for named lists
## [1] Maria Pete Sarah
## Levels: Maria Pete Sarah
employees$Name[1]
## [1] Maria
## Levels: Maria Pete Sarah
Values can be assigned to cells by selecting them and using the assignment operator <-.
employees$Age[3] <- 33
employees
##    Name Age       Role
## 1 Maria  47  Professor
## 2  Pete  34 Researcher
## 3 Sarah  33 Researcher
Operations can be performed on columns, and new columns created.
current_year <- as.integer(format(Sys.Date(), "%Y"))
employees$Year_of_birth <- current_year - employees$Age
employees
##    Name Age       Role Year_of_birth
## 1 Maria  47  Professor          1973
## 2  Pete  34 Researcher          1986
## 3 Sarah  33 Researcher          1987
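Rows can also be selected with a logical condition on a column, combining the matrix-style and list-style selection shown earlier. A minimal sketch with a hypothetical data frame named staff; stringsAsFactors = FALSE is set explicitly here (an assumption of this example) so that the text column stays a character vector regardless of the R version.

```r
# Hypothetical data frame (not the one from the slides)
staff <- data.frame(
  Name = c("Maria", "Pete", "Sarah"),
  Age = c(47, 34, 32),
  stringsAsFactors = FALSE
)

# Keep the rows where the condition is TRUE, and all columns
over_33 <- staff[staff$Age > 33, ]

# Apply the same condition to a single column
names_over_33 <- staff$Name[staff$Age > 33]

# A new column can hold the result of a comparison
staff$Senior <- staff$Age >= 40

print(over_33)
print(names_over_33)
print(staff)
```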
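A tibble is a modern reimagining of the data.frame within the tidyverse. Tibbles do less (e.g., they do not change variable names or types, and do not do partial matching) and complain more (e.g., when a variable does not exist).
This forces you to confront problems earlier, typically leading to cleaner, more expressive code.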
More complex data types
In the practical session, we will see (surprise, surprise)
Moving on towards data science