This document contains all the materials covered in the Introduction to RStudio workshop of the Maths Skills Centre. This workshop introduces the RStudio environment, writing code in a script, importing and viewing data, data wrangling and data visualisation.
You can follow the instructions here to install or update R and RStudio. This workshop was assembled using R version 4.1.2 and RStudio version 2021.9.0.351.
Questions:
1. What are the four panels in RStudio?
2. How is a project set up in RStudio?
3. Where do we save files for RStudio to use?
Objectives:
1. Navigate the RStudio environment.
2. Set up an RStudio project for the workshop.
3. Save the workshop files from the VLE into the project working directory.
When working with RStudio, for example for an analysis or even this workshop, it is good practice to have all relevant files in one folder. This makes it easier to locate files and to share our work with colleagues. This folder is called our working directory.
A convenient way to set and save a working directory is through the RStudio projects functionality. Here we create a .Rproj file, which marks our RStudio project. The folder in which the .Rproj file is saved will be our working directory. To set up an RStudio project, create a new project named intro_R_workshop and save it in your Documents folder.
RStudio should now have refreshed, showing us our project environment. Open a new script, as we did in the previous section, such that the four-panel view is restored. Then save the script by pressing the Save icon (the floppy disk) to the right of the New File button. Give the file a convenient name, such as intro_R_code. It will then appear in the bottom right of RStudio as intro_R_code.R.
As mentioned above, one of the features of the RStudio project is that it allows us to have all relevant files in one folder. We will download the data files for this workshop and save them inside our project directory:
1. Create a folder named data in your working directory by pressing ‘New Folder’ in the Files panel (bottom right panel).
2. Create a folder named results in your working directory.
3. Save the first data file (NHANES.csv) from here into the data folder.
4. Save the second data file (iris.csv) from here into the data folder.
Now the data is accessible from the data folder in our working directory in RStudio.
Questions:
1. What are the common terms used to describe R code?
2. What is the structure of commands in R?
3. How can data be entered into RStudio manually?
4. How are notes written alongside code?
5. How and why do we install packages?
Objectives:
1. Define the following R terminology: object, assign, call, function, arguments and options.
2. Use a built-in function and control its working with an argument and an option.
3. Assign values to a named vector using the c() function.
4. Write comments to make a script easier to interpret.
5. Install and load the readr package.
In R, everything that we create or import is saved as an object. An object has a name, with which we can refer to it. When we save something as an object, this is referred to as assigning something to that object. Assigning is done using the <- syntax, as shown below. If we wanted to save the number 4 under the name four, we would type in our script:
four <- 4
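There are two ways to run this code:
1. Place the cursor on the line and press the Run button at the top right of the script panel.
2. Place the cursor on the line and press Ctrl + Enter (Cmd + Return on Mac).
After running this code, we see four has appeared in the environment window. This confirms that we have successfully assigned a value to an object.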
Most commands in R involve functions. A function is an in-built piece of code that performs a specific task. When a function is used in R, this is referred to as calling the function. Usually, a function is called with one or more inputs - these are called arguments.
Let us take the round() function as an example. This function rounds a supplied value.
For example, we call the round() function on the value 3.14. The value 3.14 is an argument in our call:
round(3.14)
R then returns in the console:
## [1] 3
An argument can also be an object, as you will see in the challenge at the end of this section.
Often the behaviour of a function can be manipulated using arguments which take options. These options have defaults, which are assumed if we do not specify an alternative.
For example, the round() function takes the digits argument, for which the default is digits = 0. That is why so far, round() has returned values with 0 digits behind the decimal.
We can specify an alternative option for the digits argument, for example for a single digit behind the decimal:
round(3.14, digits = 1)
## [1] 3.1
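As a small sketch of how arguments can be supplied, the lines below show that digits can be given either by name or by position. The value 3.14159 and the object names by_name and by_position are only illustrative.

```r
# digits supplied by name
by_name <- round(3.14159, digits = 2)
# digits supplied by position: the second argument of round() is digits
by_position <- round(3.14159, 2)
# both calls give the same result
identical(by_name, by_position)
```

Naming the argument tends to make a script easier to read, which is why the examples in this workshop write digits = explicitly.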
Challenge: assign the value 6.667 to an object named number. Then, call the round() function on the object number. Ensure that round() returns two digits behind the decimal. Find the solution in the drop-down.
number <- 6.667
round(number, digits = 2)
## [1] 6.67
The c() function
So far, we have worked with objects that contained a single value. Often we are working with multiple data points. In R, multiple data points can be saved inside one object as a vector. A vector is a collection of data points of the same type (e.g. numbers or words). Vectors are formed by calling the c() function, with data provided between the brackets, separated by commas. For example:
numbers <- c(1,2,3,4,5,6)
Vectors can also contain words, which we call strings in R. Ensure that strings are always wrapped in quotes:
colours <- c("red", "blue", "green")
Later in this workshop we will learn how to import data without needing to manually type data into c().
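To illustrate the point that a vector holds data of a single type, here is a minimal sketch using base R functions. The object name mixed is made up for this example.

```r
# a numeric vector, as created above
numbers <- c(1, 2, 3, 4, 5, 6)
length(numbers)  # number of data points in the vector
mean(numbers)    # functions can be called on the whole vector
class(numbers)   # "numeric"

# mixing numbers and strings: R converts everything to strings
mixed <- c(1, "two", 3)
class(mixed)     # "character"
```

Because R silently converts mixed input to one type, checking class() is a quick way to confirm that a vector contains what we expect.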
As we progress through this workshop, our script will fill up with many lines of code. To enhance readability for ourselves and for colleagues who may use our script at a later date, we include comments.
Any text in R that is preceded by a # is ignored by R and treated as a comment. For example, if we were to comment when creating our vector numbers:
# create a vector of the numbers 1 to 6
numbers <- c(1,2,3,4,5,6)
Challenge: assign the city names London, Manchester and Newcastle to an object named cities. Include a comment to increase the clarity of your code. Find the solution in the drop-down.
# a vector of city names
cities <- c("London", "Manchester", "Newcastle")
So far, we have used the built-in functions round() and c(). R has many useful functions, some of which do not come built-in. Rather than releasing individual functions, collections of functions are released together in packages. Therefore, to use an external function, we need to install and load the appropriate package.
Working with packages is analogous to using a new light bulb:
1. First, we install the package using the install.packages() function. This is analogous to screwing a new light bulb into the socket - we only need to do this once.
2. Then, we load the package using the library() function. This is analogous to using a light bulb - every time we enter the room, we need to turn on the light again.
Let us take the readr package as an example. This package contains a set of functions for importing data into R. In the next section, we will use a function from this package. First, we install the package. You can run the following line from the console rather than the script, as we do not want the package to install again every time we run the script:
# install the readr package
install.packages("readr")
Then we load the package. It is good practice to load all packages at the top of our script. This way, it is easy for another user to identify whether they need to install any packages before running your script. Therefore, paste the following code at or near the top of your script:
# loading a package
library(readr)
As we progress through the workshop, we will encounter other packages. You will get a chance to further practice installing and loading packages then.
Questions:
1. How is data from a .csv file imported into R?
2. What is the difference between a tibble and a data frame?
3. How is a summary view of a tibble viewed in R?
Objectives:
1. Use read_csv() to import data from a .csv file as a tibble.
2. Be aware of differences between a tibble and a data frame.
3. Use View(), head(), and summary() to inspect a tibble.
Importing .csv data into RStudio
In the previous section we learned how to type data into RStudio using the c() function. More commonly, we directly import data from a .csv spreadsheet. We will use the read_csv() function from the readr package, which we loaded in the previous section.
We will use a subset from the NHANES data, which is a public health survey in the US. This data should be located inside the data folder of your working directory. We import the data as follows:
health_data <- read_csv("data/NHANES.csv")
We can see that health_data has appeared in the environment panel. This object is a tibble, which you can think of as the RStudio equivalent of a spreadsheet. The main difference is that in a tibble, each column is a vector. Recall that within a vector, data must be of the same type. Therefore, individual columns in a tibble are always of one data type.
In the previous section we created the tibble health_data. As you come to using RStudio independently, you are likely to come across the data frame. The tibble and the data frame are very similar - in fact, a tibble is a type of data frame. In this workshop we limit ourselves to tibbles, as they are an updated version of the data frame. However, many online tutorials use data frames, as these tutorials are a few years old. To find out more about the difference between tibble and data frame, see this blog.
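The idea that each column is a vector of one data type applies to data frames as well as tibbles. The sketch below uses a base R data frame so that it runs without extra packages. The object toy_data and its values are made up for illustration.

```r
# a small, made-up data frame with two columns
toy_data <- data.frame(
  Sex = c("male", "female", "male"),
  Age = c(30, 47, 9),
  stringsAsFactors = FALSE  # keep Sex as character in all R versions
)

# each column can be pulled out with $ and is a vector of a single type
class(toy_data$Sex)  # "character"
class(toy_data$Age)  # "numeric"
is.data.frame(toy_data)
```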
We will cover four ways to quickly inspect the tibble that we have just created. Firstly, we can view health_data as a whole, analogous to the view that Excel would provide us with. We can do this using the View() function (note the capital V). The code below will open a new tab in RStudio, allowing us to scroll through the data:
View(health_data)
Alternatively, we can ask RStudio to display the first six rows of health_data using the head() function:
head(health_data)
## # A tibble: 6 x 5
## ID Sex Age Height Weight
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 71892 male 0 NA 7.7
## 2 71460 female 47 148. 62.3
## 3 58929 male 78 178. 63.4
## 4 64041 male 9 151. 62.9
## 5 69722 male 7 116. 21.4
## 6 59289 male 30 165. 50.1
This output is useful for a few checks, such as the data type of each column. The types shown here are <dbl> and <chr>, which stand for double and character, respectively. A double column that has been imported as a character column, for example, could give us trouble in downstream analyses.
We may also want to check that the tibble has the same number of rows and columns as we expect. We can do so using the functions nrow() and ncol(), respectively:
nrow(health_data)
## [1] 10000
ncol(health_data)
## [1] 5
Finally, we may want to have a look at the spread of our continuous data. We can obtain a view of this using the summary() function:
summary(health_data)
## ID Sex Age Height
## Min. :51624 Length:10000 Min. : 0.00 Min. : 79.1
## 1st Qu.:56439 Class :character 1st Qu.:11.00 1st Qu.:152.6
## Median :61233 Mode :character Median :32.00 Median :163.9
## Mean :61443 Mean :34.68 Mean :157.9
## 3rd Qu.:66490 3rd Qu.:56.00 3rd Qu.:173.1
## Max. :71916 Max. :80.00 Max. :201.7
## NA's :1022
## Weight
## Min. : 2.70
## 1st Qu.: 43.90
## Median : 67.50
## Mean : 64.56
## 3rd Qu.: 85.10
## Max. :239.40
## NA's :431
This output shows us, for each column with continuous data, the smallest and largest values (Min. and Max.), the 1st and 3rd quartiles, the median and the mean.
In addition, the number of NA's is displayed. An empty cell in RStudio is denoted by NA. For example, our Weight column has 431 NA's, i.e. 431 missing values. If we were not expecting missing values in our data, we would need to investigate where these NA's came from.
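If we only want the number of missing values in a single column, one possible approach is to combine the base R functions is.na() and sum(). The vector weights below is made up for illustration and is not taken from the NHANES data.

```r
# a made-up vector with two missing values
weights <- c(7.7, NA, 63.4, NA, 21.4)

# is.na() returns TRUE for each missing value
is.na(weights)

# sum() counts the TRUEs, giving the number of missing values
n_missing <- sum(is.na(weights))
n_missing
```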
Challenge: Import the csv file iris.csv as a tibble named flower_data. This file contains data on the length and width of sepals and petals of three species of iris. Then, find out the following about the tibble:
What are the column names?
How many rows and columns does the data contain?
What is the mean of sepal length?
How many NAs does sepal width contain?
Find the solution in the drop-down.
flower_data <- read_csv("data/iris.csv") # import the data
head(flower_data) # find the column names in head() output
## # A tibble: 6 x 5
## sepal.length sepal.width petal.length petal.width variety
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 Setosa
## 2 4.9 3 1.4 0.2 Setosa
## 3 4.7 3.2 1.3 0.2 Setosa
## 4 4.6 3.1 1.5 0.2 Setosa
## 5 5 3.6 1.4 0.2 Setosa
## 6 5.4 3.9 1.7 0.4 Setosa
nrow(flower_data) # number of rows
## [1] 150
ncol(flower_data) # number of columns
## [1] 5
summary(flower_data) # mean of sepal length is 5.843 and sepal width has no NAs
## sepal.length sepal.width petal.length petal.width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## variety
## Length:150
## Class :character
## Mode :character
##
##
##
Questions:
1. How can rows from a tibble be selected?
2. How can columns from a tibble be selected?
3. How can multiple data wrangling steps be combined into one command?
4. How can new columns be created based on existing columns?
5. How can group-specific summary statistics be obtained?
6. How can a tibble be saved as a .csv file?
Objectives:
1. Use filter() to select rows from a tibble.
2. Use select() to select columns from a tibble.
3. Use the pipe operator, %>%, to link commands together.
4. Use mutate() to create new columns based on existing columns.
5. Use drop_na(), group_by(), summarise(), n(), mean() and sd() to obtain group-specific summary statistics.
6. Use write_csv() to save a tibble as a .csv file.
Now that our data has been loaded in RStudio as a tibble, we proceed to the stage of “data wrangling”: manipulating the data such that it is ready for downstream analyses.
We may be interested in particular rows and/or columns of our data in downstream analyses. We can select rows using the filter() function and columns using the select() function. Both of these functions are part of the dplyr package. We will first ensure that these functions are loaded in RStudio. Then we will learn how to use them to filter rows and columns.
Challenge: Install and load the dplyr package, such that you can use filter() and select() in this workshop.
Find the solution in the drop-down.
Recall that in order to use read_csv() from the readr package, we needed to install readr using install.packages(), followed by loading the package using library().
install.packages("dplyr") # install dplyr, only needs to be done once
library(dplyr) # load dplyr, needs to be done every time RStudio is started up
To select particular rows, we use the filter() function. This function takes our tibble of interest (health_data) and a criterion for filtering. For example, to select rows with participants of the female sex from health_data, the criterion is Sex == "female" (note the use of the double =):
filter(health_data, Sex == "female")
## # A tibble: 4,963 x 5
## ID Sex Age Height Weight
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 71460 female 47 148. 62.3
## 2 65251 female 10 154. 42.7
## 3 59520 female 14 158. 79.1
## 4 51782 female 23 157. 85.6
## 5 71104 female 78 151. 84.5
## 6 68471 female 15 167. 131.
## 7 64653 female 1 NA 11.4
## 8 64055 female 59 155. 93.7
## 9 57092 female 6 NA NA
## 10 66541 female 70 170. 66.5
## # ... with 4,953 more rows
We could also filter based on a column with continuous data. For example, if we wanted to retain the data on participants below 170 cm:
filter(health_data, Height < 170)
## # A tibble: 6,023 x 5
## ID Sex Age Height Weight
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 71460 female 47 148. 62.3
## 2 64041 male 9 151. 62.9
## 3 69722 male 7 116. 21.4
## 4 59289 male 30 165. 50.1
## 5 71034 male 11 144. 42.6
## 6 65251 female 10 154. 42.7
## 7 59520 female 14 158. 79.1
## 8 56541 male 8 137. 36.6
## 9 63316 male 10 145. 56.2
## 10 51782 female 23 157. 85.6
## # ... with 6,013 more rows
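Several criteria can also be given to filter() at once, separated by commas, in which case only rows meeting all of them are kept. The sketch below assumes dplyr is installed and uses a small made-up data frame named toy_health rather than health_data.

```r
library(dplyr)

# a small, made-up data set
toy_health <- data.frame(
  ID = c(1, 2, 3, 4),
  Sex = c("female", "male", "female", "female"),
  Height = c(148, 178, 172, NA)
)

# keep rows where Sex is "female" AND Height is below 170
short_females <- filter(toy_health, Sex == "female", Height < 170)
short_females
```

Note that the row with a missing Height is not kept: filter() only retains rows for which the criterion is TRUE, and a comparison with NA is neither TRUE nor FALSE.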
The function for selecting columns is select(). The names of the columns that we want to keep are included inside select() as a vector. For example, to retain the ID and Height columns:
select(health_data, c(ID, Height))
## # A tibble: 10,000 x 2
## ID Height
## <dbl> <dbl>
## 1 71892 NA
## 2 71460 148.
## 3 58929 178.
## 4 64041 151.
## 5 69722 116.
## 6 59289 165.
## 7 66142 174.
## 8 71034 144.
## 9 65251 154.
## 10 59520 158.
## # ... with 9,990 more rows
We could also specify columns to exclude using -c(). For example, to exclude the Weight column:
select(health_data, -c(Weight))
## # A tibble: 10,000 x 4
## ID Sex Age Height
## <dbl> <chr> <dbl> <dbl>
## 1 71892 male 0 NA
## 2 71460 female 47 148.
## 3 58929 male 78 178.
## 4 64041 male 9 151.
## 5 69722 male 7 116.
## 6 59289 male 30 165.
## 7 66142 male 21 174.
## 8 71034 male 11 144.
## 9 65251 female 10 154.
## 10 59520 female 14 158.
## # ... with 9,990 more rows
Note that the output from each of these commands can be saved as an object. For example, to save health_data with the Weight column excluded:
health_data_no_weight <- select(health_data, -c(Weight))
%>%
Often when data is wrangled, multiple steps need to be combined. For example, we might want to exclude the Weight column from health_data and only retain data on participants of the female Sex. The long way to do this would be to create a new object at each step of the data wrangling:
health_data_no_weight <- select(health_data, -c(Weight))
health_data_no_weight_only_female <- filter(health_data_no_weight, Sex == "female")
Note that in the filter() step, we specify health_data_no_weight as the data, rather than health_data. So here we select() for columns and then filter() for rows on the reduced data from select().
This operation would be easier if we could link the select() and filter() steps together. This can be done using the pipe operator, %>%. When reading the pipe operator in code, think of it as saying “then”. In the example below, we select all but the Weight column, then we filter for participants of the female Sex:
health_data_no_weight_only_female <- select(health_data, -c(Weight)) %>%
filter(Sex == "female")
Notice that in the filter() step, we no longer specify the data. This is because the pipe operator passes on the data from the select() step.
It is common to provide the tibble by itself in the first command, such that the chain becomes:
health_data_no_weight_only_female <- health_data %>%
select(-c(Weight)) %>%
filter(Sex == "female")
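To make the behaviour of the pipe concrete, the sketch below compares a call written without the pipe to the same call written with it. It assumes dplyr is installed and uses a made-up data frame named toy_health.

```r
library(dplyr)

# a small, made-up data set
toy_health <- data.frame(
  Sex = c("female", "male", "female"),
  Weight = c(62.3, 63.4, 42.7)
)

# without the pipe: the data is the first argument of filter()
without_pipe <- filter(toy_health, Sex == "female")

# with the pipe: the data on the left is passed on as the first argument
with_pipe <- toy_health %>% filter(Sex == "female")

# both approaches return the same rows
identical(without_pipe, with_pipe)
```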
Challenge: Create an object named flower_data_setosa_sepal, a subset of flower_data, with:
Only flowers of the Setosa variety.
Only the sepal length and sepal width columns.
Find the solution in the drop-down.
flower_data_setosa_sepal <- flower_data %>%
filter(variety == "Setosa") %>%
select(c(sepal.length, sepal.width))
mutate()
Often we want to add new columns to our tibble, based on existing columns. For example, we may want to add a BMI column to our health_data object, based on the Height and Weight columns. The BMI is calculated by dividing Weight (in kg) by Height (in m) squared. Since Height in health_data is in cm, we first divide Height by 100 to convert it to metres. We create the BMI column using mutate():
health_data_BMI <- health_data %>%
mutate(BMI = Weight/(Height/100)^2)
Using head() we can see our new column:
head(health_data_BMI)
## # A tibble: 6 x 6
## ID Sex Age Height Weight BMI
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 71892 male 0 NA 7.7 NA
## 2 71460 female 47 148. 62.3 28.3
## 3 58929 male 78 178. 63.4 20.1
## 4 64041 male 9 151. 62.9 27.7
## 5 69722 male 7 116. 21.4 15.8
## 6 59289 male 30 165. 50.1 18.4
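The BMI arithmetic can be checked by hand for a single person. The values below (70 kg and 175 cm) are made up for illustration and do not come from health_data.

```r
# made-up values for one person
weight_kg <- 70
height_cm <- 175

# convert height from cm to m, then divide weight by height squared
height_m <- height_cm / 100
bmi <- weight_kg / height_m^2

round(bmi, digits = 1)  # 22.9
```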
Earlier in this workshop we used summary() to obtain summary statistics for each column in our tibble. Alternatively, we may want to obtain group-specific summary statistics. For example, we may want summary statistics for Height, grouped by Sex, from our health_data.
In this example we will ask RStudio to return the mean, standard deviation and number of observations for Height, grouped by Sex. First, we group the data by Sex using group_by(). We then calculate summary statistics using summarise(). Inside summarise() we specify the values that we want. We request the number of observations per group using n(), the mean height using mean() and the standard deviation of height using sd(). Each of these values is given a name, which we specify ahead of the = signs.
health_data %>%
group_by(Sex) %>%
summarise(n = n(),
mean = mean(Height),
sd = sd(Height))
## # A tibble: 2 x 4
## Sex n mean sd
## <chr> <int> <dbl> <dbl>
## 1 female 4963 NA NA
## 2 male 5037 NA NA
While n() resulted in an output, mean() and sd() resulted in NAs. These functions return NA when there is at least one NA in the column of interest (i.e. an empty cell). We can circumvent this by dropping the NAs from Height using drop_na() from the tidyr package. Make sure that you install the tidyr package before running the code below.
library(tidyr) #load package for drop_na()
health_data %>%
drop_na(Height) %>% #remove NAs in Height
group_by(Sex) %>%
summarise(n = n(),
mean = mean(Height),
sd = sd(Height))
## # A tibble: 2 x 4
## Sex n mean sd
## <chr> <int> <dbl> <dbl>
## 1 female 4453 153. 19.7
## 2 male 4525 163. 25.1
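An alternative to drop_na(), not used in this workshop, is the na.rm = TRUE option that mean() and sd() accept. The sketch below assumes dplyr is installed and uses a made-up data frame named toy_health. The column name n_height is also made up.

```r
library(dplyr)

# a small, made-up data set with one missing Height
toy_health <- data.frame(
  Sex = c("female", "female", "male", "male"),
  Height = c(150, NA, 170, 180)
)

toy_summary <- toy_health %>%
  group_by(Sex) %>%
  summarise(n = n(),                         # counts all rows in the group
            n_height = sum(!is.na(Height)),  # counts rows with a Height value
            mean = mean(Height, na.rm = TRUE))
toy_summary
```

With this approach n() still counts the rows with a missing Height, so a separate count of non-missing values is needed if we want the number of observations behind the mean. Dropping the rows first, as in the workshop code, avoids that extra step.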
Note that we now have fewer observations (lower number under n), since we have dropped the rows with missing Height values.
Saving a tibble as a .csv file
Once data wrangling is completed, we may want to export our tibble as a .csv file. This allows us to easily share the data with others. For example, we may want to export our summary table from the last subsection.
To do this, we create an object for the summary table. Then we export the tibble using the write_csv() function from the readr package. We provide a name for the file inside write_csv() within quotes. Because the name starts with results/, the file is saved in the results directory inside our working directory.
height_summary <- health_data %>% #give the summary table a name
drop_na(Height) %>%
group_by(Sex) %>%
summarise(n = n(),
mean = mean(Height),
sd = sd(Height))
write_csv(height_summary, "results/height_summary.csv")
After running this code, check in your results directory that the file has indeed been created.
Challenge: Create a tibble with the number of observations, mean and standard deviation of sepal width, grouped by variety. Name this object sepal_width_summary and save it as a .csv file in your results directory. Find the solution in the drop-down.
sepal_width_summary <- flower_data %>%
group_by(variety) %>%
summarise(n = n(),
mean = mean(sepal.width),
sd = sd(sepal.width))
write_csv(sepal_width_summary, "results/sepal_width_summary.csv")
Notice that drop_na() is not required, since sepal.width is free of NAs.
Questions:
1. What is the general format of a ggplot() command?
2. How can this format be adapted for scatterplots and boxplots?
3. How is Google used to find additional ggplot2 commands?
4. How are ggplot objects exported from R?
Objectives:
1. Describe the core components of a ggplot() command.
2. Create scatterplots and boxplots using the ggplot2 package.
3. Adjust ggplot2 objects, for example by adding a title, with the help of Google.
4. Use ggsave() to export a ggplot plot.
Once our data is in an appropriate format, we can proceed to visualisation. The ggplot2 package provides functions with which data can be plotted. Here we will introduce this package.
The ggplot() command
Every visualisation using ggplot2 includes the ggplot() function. The general format of the command is:
ggplot(<DATA>, aes(<MAPPINGS>)) +
<GEOM_FUNCTION>()
The above command has four components:
1. the ggplot() function;
2. the data (<DATA>);
3. the mappings (aes(<MAPPINGS>));
4. the geom function (<GEOM_FUNCTION>()).
Notice that ggplot() and <GEOM_FUNCTION>() are connected through a +. In general, a + is used to connect ggplot2 functions that together build one plot.
ggplot() plots
Here we will adapt the general format presented above to create scatterplots and boxplots. Make sure you have the ggplot2 package installed and loaded before proceeding.
To create a scatterplot of Height vs Weight in our health_data, we:
1. call the ggplot() function;
2. specify health_data as our <DATA>;
3. map y = Height and x = Weight inside aes();
4. add our <GEOM_FUNCTION>, which is geom_point().
The code and output therefore become:
ggplot(health_data, aes(y = Height, x = Weight)) +
geom_point()
Note: you may receive a warning message about missing values being removed. You can ignore that message for the purpose of this workshop.
In the challenge below you can try to create a boxplot. Rather than geom_point(), which is used for scatterplots, you will use geom_boxplot().
Challenge: Create a boxplot of sepal.length across variety from the flower_data. Think closely about which variables should be denoted as y and x as the mappings inside aes(). Find the solution in the drop-down.
ggplot(flower_data, aes(y = sepal.length, x = variety)) +
geom_boxplot()
ggplot2 offers a lot more beyond selecting variables in the <MAPPINGS> and selecting a plot type through the <GEOM_FUNCTION>(). Often the best way to learn how to perform a specific operation with ggplot2 is to consult Google. In this section we will show this through the addition of a title to our plot.
Let’s try to add a title to our plot of Height vs. Weight. Since we have not learned how to do this, we will consult Google. Searching efficiently is a skill which you will develop as you use RStudio. We may for example search “ggplot r add title”. The results include several pages that look like tutorials.
Following the tutorial on “sthda.com”, which is a common source for R tutorials, brings us to an index which includes a link to information on titles in ggplot2.
Following this link shows that the tutorial advises using + labs(title = "…") to add a title.
Trying this with our plot gives:
ggplot(health_data, aes(y = Height, x = Weight)) +
geom_point() +
labs(title = "Height vs. Weight in the NHANES data")
ggsave()
If you are writing a report or manuscript outside of RStudio, you will likely need to export figures. This can be done using ggsave(). Three things to note about this function are:
1. By default, ggsave() saves the last plot created.
2. ggsave() requires a file name, such as "height_weight_plot.png". The plot is saved under this filename in the specified directory.
3. ggsave() uses the file extension given in the file name. For example, "height_weight_plot.png" will produce a .png file, while "height_weight_plot.pdf" will produce a .pdf file.
For example, to save our Height vs Weight plot as a .png file, we run:
ggplot(health_data, aes(y = Height, x = Weight)) +
geom_point() +
labs(title = "Height vs. Weight in the NHANES data")
ggsave("results/height_weight_plot.png")
We should then have the file "height_weight_plot.png" in our results directory.
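Rather than relying on the last plot created, ggsave() can also be given a specific plot through its plot argument, along with a width and height (in inches by default). The sketch below assumes ggplot2 is installed. The data frame toy_health, the object toy_plot and the file name are made up, and the file is written to a temporary directory so that the example does not touch the project folders.

```r
library(ggplot2)

# a small, made-up data set
toy_health <- data.frame(
  Weight = c(62.3, 63.4, 50.1),
  Height = c(148, 178, 165)
)

# save the plot as an object instead of only displaying it
toy_plot <- ggplot(toy_health, aes(y = Height, x = Weight)) +
  geom_point()

# write the named plot to a temporary file, 6 by 4 inches
out_file <- file.path(tempdir(), "toy_plot.png")
ggsave(out_file, plot = toy_plot, width = 6, height = 4)

file.exists(out_file)
```

Naming the plot explicitly can help when a script creates several plots, as it removes any doubt about which one is being saved.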
Challenge: Boxplots are often being replaced by violin plots, as these show the distribution of the data more clearly. Use Google to find the command for a violin plot. Then, adapt your code for the boxplot of sepal.length across variety from the flower_data to create a violin plot. Finally, save this plot as a .pdf file. Find the solution in the drop-down.
ggplot(flower_data, aes(y = sepal.length, x = variety)) +
geom_violin()
# To export to pdf, we use the .pdf file extension
ggsave("results/sepal_length_by_variety_violin_plot.pdf")
Questions:
1. How are string variables converted to factor variables?
2. How are groups along the x-axis reordered in a ggplot graph?
3. How are multiple ggplot graphs grouped into one plotting region?
Objectives:
1. Use as_factor() to convert a character variable to a factor variable.
2. Use fct_relevel() to reorder groups along the x-axis of a ggplot graph.
3. Use the patchwork package to group multiple ggplot() graphs together.
as_factor() for categorical variables
When we import categorical variables as part of a tibble, they appear as a character column. See for example the Sex column in health_data:
head(health_data)
## # A tibble: 6 x 5
## ID Sex Age Height Weight
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 71892 male 0 NA 7.7
## 2 71460 female 47 148. 62.3
## 3 58929 male 78 178. 63.4
## 4 64041 male 9 151. 62.9
## 5 69722 male 7 116. 21.4
## 6 59289 male 30 165. 50.1
A character column in RStudio is a column with words (also known as “strings”). For downstream analyses, it is often useful to convert a categorical character column into a factor column. As shown later in this section, this makes it possible to check the categories in a variable and to control the order of groups in graphs.
We can convert a character column into a factor using the as_factor() function from the forcats package. We do this inside mutate(). Ensure that you have the forcats package installed before you run the code below.
library(forcats) # load the required package
health_data_factor <- health_data %>%
mutate(Sex = as_factor(Sex))
Using head(), we can see that Sex has become a factor. See the <fct> under Sex:
head(health_data_factor)
## # A tibble: 6 x 5
## ID Sex Age Height Weight
## <dbl> <fct> <dbl> <dbl> <dbl>
## 1 71892 male 0 NA 7.7
## 2 71460 female 47 148. 62.3
## 3 58929 male 78 178. 63.4
## 4 64041 male 9 151. 62.9
## 5 69722 male 7 116. 21.4
## 6 59289 male 30 165. 50.1
We can check that our variable is free of typos by checking the levels (i.e. categories) inside our variable using the pull() and levels() functions:
health_data_factor %>%
pull(Sex) %>%
levels()
## [1] "male" "female"
This shows us that our factor variable is free of typos, since we have only two levels with spelling as expected.
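A possible explanation for why "male" is listed first: as far as the forcats documentation describes, as_factor() orders levels by their first appearance in the data, whereas base R's factor() sorts them alphabetically. The sketch below uses only base R and a made-up vector named sex to show both orderings.

```r
# a made-up character vector in which "male" appears first
sex <- c("male", "female", "male")

# base R factor(): levels are sorted alphabetically
alphabetical <- levels(factor(sex))
alphabetical   # "female" "male"

# levels in order of first appearance, similar to what as_factor() is described to do
by_appearance <- levels(factor(sex, levels = unique(sex)))
by_appearance  # "male" "female"
```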
fct_relevel() to reorder groups in a factor variable
We can change the order of the levels using fct_relevel() inside mutate():
health_data_factor_relevel <- health_data_factor %>%
mutate(Sex = fct_relevel(Sex, "female", "male"))
We then see that the levels have been reordered:
health_data_factor_relevel %>%
pull(Sex) %>%
levels()
## [1] "female" "male"
Reordering factor levels is handy when you need to rearrange your graphs of data. For example, a boxplot of Height by Sex using health_data_factor has “male” as the left-most boxplot:
ggplot(health_data_factor, aes(y = Height, x = Sex)) +
geom_boxplot()
In contrast, the same code on health_data_factor_relevel returns a graph with “female” as the left-most boxplot:
ggplot(health_data_factor_relevel, aes(y = Height, x = Sex)) +
geom_boxplot()
Challenge: Create an object named flower_data_factor, with the variety column as a factor. Ensure that the levels are ordered as “Versicolor”, “Virginica”, “Setosa”. Find the solution in the drop-down.
flower_data_factor <- flower_data %>%
mutate(variety = as_factor(variety)) %>%
mutate(variety = fct_relevel(variety,
"Versicolor", "Virginica", "Setosa"))
#check that reordering the factor levels worked
flower_data_factor %>%
pull(variety) %>%
levels()
## [1] "Versicolor" "Virginica" "Setosa"
The patchwork package
Often we want to visualise data in more than one way. In this case, it is useful to be able to arrange multiple plots into one plotting region, such that RStudio returns one image which contains all our plots. We can do so using the patchwork package. Before you proceed, ensure that this package is installed and loaded.
To use patchwork, we need to save each of our plots as an object. For example, to save our Height vs Weight plot under the name p1, we run:
p1 <- ggplot(health_data, aes(y = Height, x = Weight)) +
geom_point()
Let’s save a second plot under the name p2, this time a scatterplot of Height across Age:
p2 <- ggplot(health_data, aes(y = Height, x = Age)) +
geom_point()
Both plots can be shown in one plotting region by running the object names, separated by a +:
p1 + p2
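Besides +, patchwork also provides the | operator to place plots side by side and the / operator to stack them. The sketch below assumes ggplot2 and patchwork are installed. It uses a made-up data frame named toy_health and made-up plot names so that it does not overwrite p1 and p2.

```r
library(ggplot2)
library(patchwork)

# a small, made-up data set
toy_health <- data.frame(
  Weight = c(62.3, 63.4, 50.1),
  Age = c(47, 78, 30),
  Height = c(148, 178, 165)
)

toy_p1 <- ggplot(toy_health, aes(y = Height, x = Weight)) +
  geom_point()
toy_p2 <- ggplot(toy_health, aes(y = Height, x = Age)) +
  geom_point()

# stack the two plots on top of each other
stacked <- toy_p1 / toy_p2
stacked
```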
Challenge: Save the boxplot and the violin plot of sepal.length as objects, named p1 and p2. Plot these graphs in one plotting region using patchwork. Then, save this as a .pdf file in your results folder. Find the solution in the drop-down.
p1 <- ggplot(flower_data, aes(y = sepal.length, x = variety)) +
geom_boxplot()
p2 <- ggplot(flower_data, aes(y = sepal.length, x = variety)) +
geom_violin()
p1 + p2
ggsave("results/sepal_length_plots.pdf")