1 Introduction to R
We start this chapter with a brief introduction to R
, the programming language that will be the focus of the module, and the tool that we will use to do data science.
R
is one of the most widely used programming languages nowadays, along with Python, especially in geographic and satellite data science. I don’t personally have a strong preference for either, and I use both fairly regularly and in combination. Most of the time, using one or the other is a matter of habit or the availability of a particular functionality that makes it easier to complete the task you are set to do. For instance, Python has great libraries for programming deep neural networks. However, I find R
more effective and powerful in data manipulation, statistical analysis, visualisation and mapping. That is the key reason why this book focuses on R
. At the same time, beyond the mere details of syntax, the languages are not too different and are becoming easier to integrate. Most principles and approaches covered in this book can be applied when using Python, just using a different syntax.
1.1 The R programming language
R
4 was created in 1992 by R
oss Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R
is a free, open-source implementation of the S
statistical programming language initially created at Bell Labs. At its core, R
is a functional programming language (its main functionalities revolve around defining and executing functions). However, it now supports and is commonly used as an imperative (focused on instructions on variables and programming control structures) and object-oriented (involving complex object structures) programming language.
In simple terms, programming in R
mainly focuses on devising a series of instructions to execute a task – most commonly, loading and analysing a dataset.
As such, R can be used to program by creating sequences of instructions involving variables – which are named entities that can store values, more on that below. That will be the main topic of this practical session. Instructions can include control flow structures, such as decision points (if/else) and loops, which will be the topic of the next practical session. Instructions can also be grouped into functions, which we will see in more detail in the next chapter.
R
is interpreted, not compiled. That means that an R
interpreter receives an instruction you write and interprets and executes them. Other programming languages require their code to be compiled in an executable file to be executed on a computer.
1.1.1 RStudio
RStudio is probably the most popular Integrated Development Environment (IDE) for R
. When using RStudio, the R
interpreter is hidden in the backend, and RStudio is the frontend application that allows you to interact with the interpreter. As you open RStudio or RStudio Server, the interface is divided into two main sections. On the left side, you find the Console – and the R
script editor, when a script is being edited. The Console in an input/output window into the R
interpreter, where you can type instructions and see the resulting output.
For instance, if you type in the Console
1 + 1
the R
interpreter understands that as an instruction to sum 1
to 1
and returns the following result as output.
## [1] 2
As these materials are created in RMarkdown, the output of the computation is always preceded by ##
. Note how the output value 2
is preceded by [1]
, which indicates that the output is constituted by only one element. If the output is constituted by more than one element, as the list of numbers below, each row of the output is preceded by the index of the first element of the output.
## [1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361
## [20] 400
On the right side, you find two groups of panels. On the top-right, the main element is the Environment panel, which represents of the current state of the interpreter’s memory, showing all the information available for computation. For instance, that is where you will be able to see datasets loaded for analysis. On the bottom-right, you find the Files panel, which shows the file system (file and folders on your computer or the server), as well as the Help panel, which shows you the help pages when required. We will discuss the other panels later on in the practical sessions.
1.1.2 Coding style
A coding style is a set of rules and guidelines to write programming code designed to ensure that the code is easy to read, understand, and consistent over time. Following a good coding style is essential when writing code that others will read – for instance, if you work in a team, publish your code or submit your code as a piece of coursework – and it ensures you will understand your code in a few months. Following a good coding style is also an essential step towards reproducibility, as we will see in a later chapter.
In this book, I will follow the Tidyverse Style Guide (style.tidyverse.org). Study the Tidyverse Style Guide and use it consistently.
1.2 Core concepts
1.2.1 Values
When a value is typed in the Console, the interpreter returns the same value. In the examples below, 2
is a simple numeric value, while "String value"
is a textual value, which in R
is referred to as a character value and in programming is also commonly referred to as a string (short for a string of characters).
Numeric example:
2
## [1] 2
Character example:
"String value"
## [1] "String value"
Note how character values need to start and end with a single or double quote ('
or "
), which are not part of the information themselves. The Tidyverse Style Guide suggests always using the double quote ("
), so we will use those in this module.
Anything that follows a #
symbol is considered a comment, and the interpreter ignores it.
# hi, I am a comment, please ignore me
As mentioned above, the interpreter understands simple operations on numeric and logic values, as well as text objects (including single characters and strings of character, that is text objects longer than one character, commonly referred to simply as strings in computer science), as discussed in more detail in Appendix 1.
# Sum 1 and 2
1 + 2
## [1] 3
# Logical AND operation between
# the value TRUE and FALSE
TRUE & FALSE
## [1] FALSE
# Check whether the character a
# is equal to the character b
"a" == "b"
## [1] FALSE
1.2.2 Variables
In computer programming, a variable can be thought about as a storage location (a bit of memory) with an associated name (also referred to as identifier) and a value that can vary – hence the name variable. When programming, you can define a variable by naming it with an identifier and providing a value to be stored in the variable. After that, you can retrieve the value stored in the variable by specifying the chosen identifier. Variables are an essential tool in programming, as they allow saving the result of a piece of computation and to retrieve it later on for further analysis.
A variable can be defined in R
using an identifier (e.g., a_variable
) on the left of an assignment operator <-
, followed by the object to be linked to the identifier, such as a value (e.g., 1
) to be assigned on the right. The value of the variable can be invoked by simply specifying the identifier.
a_variable <- 1
a_variable
## [1] 1
If you type a_variable <- 1
in the Console in RStudio, a new element appears in the Environment panel, representing the new variable in the memory. The left part of the entry contains the identifier a_variable
, and the right part contains the value assigned to the variable a_variable
, that is 1
. In the example below, another variable named another_variable
is created and summed to a_variable
, saving the result in sum_of_two_variables
.
another_variable <- 4
another_variable
## [1] 4
sum_of_two_variables <- a_variable + another_variable
sum_of_two_variables
## [1] 5
1.2.3 Data frames
The examples above illustrate how R
allows working with simple data types. However, in data science, we rarely engage directly with such simple pieces of information. Rather, we commonly work with datasets composed of tables. Data frames are complex data types which encode the concept of a table in R
by combining and arranging together a series of simple objects. In the following chapters, we will explore the different complex types and how to handle tables in more detail.
R
includes many simple example data frames, such as the iris
dataset. When loading R
, a pre-defined variable named iris
will be available in the environment. By specifying the identifier iris
in the console (as shown below), you can invoke the variable and see its contents. Each one of the 150 rows of the table (the first five rows are shown below) reports information about an iris flower. Four numeric columns (Sepal.Length
, Sepal.Width
, Petal.Length
and Petal.Width
) are used to describe the size of sepal and petal, while a string column (Species
) is used to catalogue the species of iris.
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
1.2.4 Algorithms
Any operation that can be executed using a computer is called an algorithm. To be more precise, Nigel Cutland5 defined an algorithm as “a mechanical rule, or automatic method, or program for performing some mathematical operation” (Cutland, 1980, p. 76).
The instructions you get to mount your Ikea furniture can be thought of as an algorithm, an effective procedure to perform the operation of mounting your furniture. You are playing the part of the computer executing the algorithm.
A program is a set of instructions implementing an abstract algorithm into a specific language – let that be R, Python, or any other language. In their definition, algorithms (and thus the programs that implement them) can use variables and functions. As is the case for R
, programs that are interpreted rather than compiled are also referred to as scripts.
1.2.5 Functions
You can think of a function as a processing unit that, having received some values as input, performs a specific task and can return a value as output. Some simple algorithms can be coded as programs made of only one function that performs the whole task. More complex algorithms might require multiple functions, each designed to complete a sub-task, which combined perform the entire task.
You can invoke a function by specifying the function name along with the arguments (input values) between simple brackets. Each argument corresponds to a parameter (i.e., an internal variable used within the function to run the operation, as we will see in more detail later in this book). Programming languages provide pre-defined functions that implement common algorithms (e.g., finding the square root of a number or calculating a linear regression).
For instance, sqrt
is the pre-defined function in R
that computes the square root of a number. The instruction sqrt(2)
tells the R
interpreter to run the function that calculates the square root using 2
as the input value. The function will return 1.414214
, the square root of 2
, as the output.
sqrt(2)
## [1] 1.414214
Another example is the function round
, which returns a value rounded to a specified number of digits after the dot. For instance, round(1.414214, digits = 2)
returns 1.41
. In this case, we specify that the second argument refers to the number of digits to be kept, and this is because this function also has other arguments that can be specified.
round(1.414214, digits = 2)
## [1] 1.41
Functions can also be used on the right side of the assignment operator <-
in which case the output value of the function will be stored in the memory slot with that identifier. Variables can be used as arguments. For instance, after saving the result of the square root of two in the variable sqrt_of_two
, we can use the same variable as the first argument for the function round
.
sqrt_of_two <- sqrt(2)
sqrt_of_two
## [1] 1.414214
round(sqrt_of_two, digits = 2)
## [1] 1.41
As the rest of the book will illustrate, R
provides a wide range of functions that will allow conducting any step of a a data analysis: from data input/ouput to data manipulation, from statistics and machine learning to visualisation, and even to creating a web-based book like the one you are currently reading (which has indeed been created using R
). For instance, you can use the function hist
to plot the histogram of petal lengths of flowers in iris
. Note how, in the code below, the first parameter is not iris
but iris$Petal.Length
as the $
is used to extract the Petal.Length
from iris
(more on this in the coming chapters).
hist(iris$Petal.Length, main = "Petal lengths")
Functions can also be used as arguments of functions. Instead of first calculating the square root of two, saving the value in a variable, and then using the variable as the first argument of round
, we can directly add the function sqrt
and its argument as the first argument of the function round
.
## [1] 1.41
In a subsequent chapter, we will see how you can create functions yourself.
As we introduce variables and functions, and functions using variables and functions, the complexity of our code increases quite rapidly. In fact, using a function as the argument for another function is usually discouraged because it makes the code more difficult to read. Instead, it would be best to always aim for a code that is as easy to read and understand as possible. An essential step in ensuring that is to follow coding style guidelines closely.
1.2.6 Libraries
Functions can be collected and stored in libraries (sometimes referred to as packages), containing related functions and sometimes datasets. For instance, the base
library in R
includes the sqrt
function above, and the rgdal
library, which contains implementations of the GDAL (Geospatial Data Abstraction Library) functionalities for R
.
Libraries can be installed in R
using the function install.packages
or using Tool > Install Packages...
in RStudio.
1.3 Tidyverse
The meta-library Tidyverse7 contains the following libraries:
-
ggplot2
8 is a system for declaratively creating graphics based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. -
dplyr
provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. -
tidyr
provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable. -
readr
provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. -
purrr
enhancesR
’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive. -
tibble
is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code. -
stringr
provides a cohesive set of functions designed to make working with strings as easy as possible. It is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. -
forcats
provides a suite of useful tools that solve common problems with factors.R
uses factors to handle categorical variables, variables that have a fixed and known set of possible values.
A library can be loaded using the function library
, as shown below (note the name of the library is not quoted). Once a library is installed on a computer, you don’t need to install it again, but every script needs to load all the libraries that it uses. Once a library is loaded, all its functions can be used.
Important: it is always necessary to load the tidyverse
meta-library if you want to use the stringr
functions or the pipe operator %>%
.
1.3.1 The pipe operator
The pipe operator is useful to outline more complex operations step by step (see also R for Data Science, Chapter 18). The pipe operator %>%
- takes the result from one function
- and passes it to the next function
- as the first argument
- that doesn’t need to be included in the code anymore
The code below shows a simple example. The number 2
is taken as input for the first pipe that passes it on as the first argument to the function sqrt
. The output value 1.41
is then taken as input for the second pipe, which passes it on as the first argument to the function trunc
. The final output 1
is finally returned.
## [1] 1
The image below graphically illustrates how the pipe operator works, compared to the same procedure executed using two temporary variables that are used to store temporary values.
Similarly, you can use pipe to recreate the histogram of petal lengths from the iris
dataset shown above, using the function pull
to extract the column Petal.Length
from the table, instead of the operator $
. However, the Tidyverse includes the ggplot2
library, which provides support for more sophisticated plotting, as we will see in subsequent chapters.
The first step of a sequence of pipes can be a value, a variable, or a function, including arguments. The code below shows a series of examples of different ways of achieving the same result. The examples use the function round
, which also allows for a second argument: digits = 2
. Note that, when using the pipe operator, only the nominally second argument is provided to the function round
– that is round(digits = 2)
# No pipe, using variables
tmp_variable_A <- 2
tmp_variable_B <- sqrt(tmp_variable_A)
round(tmp_variable_B, digits = 2)
# No pipe, using functions only
round(sqrt(2), digits = 2)
# Pipe starting from a value
2 %>%
sqrt() %>%
round(digits = 2)
# Pipe starting from a variable
the_value_two <- 2
the_value_two %>%
sqrt() %>%
round(digits = 2)
# Pipe starting from a function
sqrt(2) %>%
round(digits = 2)
A complex operation created through the use of %>%
can be used on the right side of <-
, to assign the outcome of the operation to a variable.
1.4 Exercise 101.1
Question 101.1.1: Write a piece of code using the pipe operator that takes as input the number 1632
, calculates the logarithm to the base 10, takes the highest integer number lower than the calculated value (lower round), and verifies whether it is an integer.
Question 101.1.2: Write a piece of code using the pipe operator that takes as input the number 1632
, calculates the square root, takes the lowest integer number higher than the calculated value (higher round), and verifies whether it is an integer.
Question 101.1.3: Write a piece of code using the pipe operator that takes as input the string "1632"
, transforms it into a number, and checks whether the result is Not a Number.
Question 101.1.4: Write a piece of code using the pipe operator that takes as input the string "-16.32"
, transforms it into a number, takes the absolute value and truncates it, and finally checks whether the result is Not Available.
Question 101.1.5: Rewrite a piece of code below by substituting the last line with the function mean()
. What kind of result do you obtain? What does it represent?
Question 101.1.6: Further edit the code created for Question 101.1.6 by substituting Petal.Length
with Petal.Width
first and Species
then? What kind of results do you obtain? What do they mean?
by Stefano De Sabbata – text licensed under the CC BY-SA 4.0, contains public sector information licensed under the Open Government Licence v3.0, code licensed under the GNU GPL v3.0.