download.file("https://github.com/profrichharris/profrichharris.github.io/raw/main/MandM/data/covid_extract.csv", "covid.csv", mode = "wb", quiet = TRUE)
What Is the Tidyverse?
If base R is R Classic then tidyverse is a new flavour of R, designed for data science. It consists of a collection of R packages that “share an underlying design philosophy, grammar, and data structures.”
Tidyverse is easier to demonstrate than to pin-down to some basics so let’s work through an example using both base R and tidyverse to illustrate some differences.
If, as suggested in ‘Getting Started’, you have created an R Project to contain all the files you create and download for this course then open it now by using File –> Open Project… from the dropdown menus in RStudio. If you have not created one then now might be a good time!
We will begin by downloading a data file to use. It will be downloaded to your working directory, which is the folder associated with your R Project if you are using one. You can check the working directory by using getwd() and change it using Session –> Set Working Directory or with the function setwd(dir) where dir is the chosen directory. If you have created a Project then the working directory is that of the Project.
The data are an extract of the Covid Data Dashboard for England in December 2021. Some prior manipulation and adjustments to those data have been undertaken for another project so treat them as indicative only. The actual reported numbers may have been changed slightly from their originals although only marginally so.
We also need to require(tidyverse) ready for use.
If you get a warning message saying there is no package called tidyverse then you need to install it: install.packages("tidyverse", dependencies = TRUE). You will find that some people prefer to use library() instead of require(). The difference between them is subtle but you can find an argument in favour of using library() here even though I usually don’t.
Let’s read-in and take a look at the data. First in base R.
MSOA11CD regionName X2021.12.04 X2021.12.11 X2021.12.18 X2021.12.25 All.Ages
1 E02000002 London 25 48 148 176 7726
2 E02000003 London 46 58 165 215 11246
3 E02000004 London 24 44 100 141 6646
4 E02000005 London 58 97 185 231 10540
5 E02000007 London 38 94 153 205 10076
6 E02000008 London 54 101 232 245 12777
Now using tidyverse,
# A tibble: 6 × 7
MSOA11CD regionName `2021-12-04` `2021-12-11` `2021-12-18` `2021-12-25`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 E02000002 London 25 48 148 176
2 E02000003 London 46 58 165 215
3 E02000004 London 24 44 100 141
4 E02000005 London 58 97 185 231
5 E02000007 London 38 94 153 205
6 E02000008 London 54 101 232 245
# ℹ 1 more variable: `All Ages` <dbl>
There are some similarities – for example the function read.csv reads-in a file of comma separated variables, as does read_csv. However, the output from these functions differs. First, tidyverse has, in this case, handled the names of the variables better. It has also created what is described as a tibble which is “a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not.” You can find out more about them and how they differ from traditional data frames here. Basically, they are a form of data frame that fits into tidyverse’s philosophy to try and keep ‘things’ tidy through a shared underlying design philosophy, grammar and data structures.
We will now:
- select the regionName, 2021-12-04 and All Ages variables;
- rename the second of these as cases and the third as population;
- and look at the data again to check that it has worked.
In base R,
regionName cases population
1 London 25 7726
2 London 46 11246
3 London 24 6646
4 London 58 10540
5 London 38 10076
6 London 54 12777
In tidyverse,
# A tibble: 6 × 3
regionName cases population
<chr> <dbl> <dbl>
1 London 25 7726
2 London 46 11246
3 London 24 6646
4 London 58 10540
5 London 38 10076
6 London 54 12777
Comparing the two, the tidyverse code may be more intuitive to understand because of its use of verbs as functions: select(), rename() and so forth.
Now we shall bring the two previous stages together, using what is referred to as a pipe. Without worrying about the detail, which we will return to presently, here is an example of a pipe, |>, being used in base R:
regionName cases population
1 London 25 7726
2 London 46 11246
3 London 24 6646
4 London 58 10540
5 London 38 10076
6 London 54 12777
The above will only work if you are using R version 4.1.0 or above. You can check which version you are running by using R.Version()$version.string.
Here is the same process using tidyverse and a different pipe, %>%,
# A tibble: 6 × 3
regionName cases population
<chr> <dbl> <dbl>
1 London 25 7726
2 London 46 11246
3 London 24 6646
4 London 58 10540
5 London 38 10076
6 London 54 12777
The obvious difference here is that the tidyverse code is more elegant. But what is the pipe and what is the difference between |> in the base R code and %>% in the tidyverse example?
A pipe is really just a way of sending (‘piping’) something from one line of code to the next, to create a chain of commands (forgive the mixed metaphors). For example,
Could be calculated as
As
or, if you want to save on a few characters of code,
However, this won’t work:
This is confusing but it arises from a difference between the two pipes: the native pipe, |>, which is the more recent development, requires the right-hand side to be a function call written with its parentheses, whereas %>% also accepts a bare function name.
A more complicated example of piping is below. It employs the function sapply(), a variant of the function lapply(X, FUN) that takes a list X and applies the function FUN to each part of it. In the example, it is the function mean.
Here it is without any pipes:
[1] 20
With pipes, the above could instead be written as
or as
All three arrive at the same answer, which is 20.
So far, so good but what is the difference between |> and %>%? The answer is that %>% was developed first, in the magrittr package, whereas |> is R’s new native pipe. They are often interchangeable but not always.
At the moment, the |> pipe is less flexible to use than %>%. Consider the following example. The final two lines of code work fine using %>% to pipe the data frame into the regression model, which is a line of best fit between the x and y values (the function lm() fits a linear model which can be used to predict a y value from a value of x).
Call:
lm(formula = y ~ x, data = .)
Coefficients:
(Intercept) x
-0.2172 2.0057
(note: the output you get will likely differ from mine because the function rnorm() adds some random variation to the data)
However, it does not work with the native pipe, |>, because it does not recognise the placeholder . that we had previously used to represent what was flowing through the pipe.
To solve the problem, the above code can be modified by wrapping the regression part in another function but the end result is rather ‘clunky’.
Call:
lm(formula = y ~ x, data = z)
Coefficients:
(Intercept) x
-0.0239 2.0051
Over time, expect |> to be developed and to supersede %>%. For now, you are unlikely to encounter errors using %>% as a substitute for |> but you might using |> instead of %>%. In other words, %>% is the safer choice if you are unsure, although |> is, if anything, marginally faster:
Unit: microseconds
Expression 1: 1:100 %>% data.frame(x = ., y = 2 * . + rnorm(100)) %>% lm(y ~ x, data = .)
Expression 2: (function(z) lm(y ~ x, data = z))((function(z) data.frame(x = z, y = 2 * z + rnorm(100)))(1:100))
expr min lq mean median uq max neval cld
1 243.294 254.897 296.7551 272.4245 281.0960 2877.872 100 a
2 241.818 257.644 276.1104 272.8755 283.4535 451.574 100 a
After that digression into piping, let’s return to our example, which compares base R and tidyverse approaches to reading in a table of data, selecting variables and renaming them. In what follows, we also calculate the number of COVID-19 cases per English region as a percentage of the region’s estimated population in the week ending 2021-12-04.
First, in base R:
East Midlands East of England London
25472 35785 43060
North East North West South East
10796 31185 62807
South West West Midlands Yorkshire and The Humber
33846 26554 21079
East Midlands East of England London
0.524 0.571 0.479
North East North West South East
0.403 0.423 0.681
South West West Midlands Yorkshire and The Humber
0.598 0.445 0.381
Now using tidyverse,
# A tibble: 9 × 4
regionName cases population rate
<chr> <dbl> <dbl> <dbl>
1 East Midlands 25472 4865583 0.524
2 East of England 35785 6269161 0.571
3 London 43060 8991550 0.479
4 North East 10796 2680763 0.403
5 North West 31185 7367456 0.423
6 South East 62807 9217265 0.681
7 South West 33846 5656917 0.598
8 West Midlands 26554 5961929 0.445
9 Yorkshire and The Humber 21079 5526350 0.381
Either way produces the same answers but, again, there is an elegance and consistency to the tidyverse way of doing it (which works just fine with the |> pipe) that is, perhaps, missing from base R.
As a final step for the comparison, we will extend the code to visualise the regional COVID-19 rates in a histogram, with a rug plot included. A rug plot is a way of preserving the individual data values that would otherwise be ‘lost’ within the bins of a histogram.
As previously, we begin with base R,
df1 <- read.csv("covid.csv")
df1 <- df1[, c("regionName", "X2021.12.04", "All.Ages")]
names(df1)[c(2,3)] <- c("cases", "population")
cases <- tapply(df1$cases, df1$regionName, sum)
population <- tapply(df1$population, df1$regionName, sum)
rate <- round(cases / population * 100, 3)
hist(rate, xlab = "rate (cases as % of population)",
main = "Regional COVID-19 rates: week ending 2021-12-04")
rug(rate, lwd = 2)
…and continue with tidyverse, creating the output in such a way that it mimics the previous plot.
require(ggplot2)
read_csv("covid.csv") |>
select(regionName, `2021-12-04`, `All Ages`) |>
rename(cases = `2021-12-04`, population = `All Ages`) |>
group_by(regionName) |>
summarise(across(where(is.numeric), sum)) |>
mutate(rate = round(cases / population * 100, 3)) -> df2
df2 |>
ggplot(aes(x = rate)) +
geom_histogram(colour = "black", fill = "grey", binwidth = 0.05,
center = -0.025) +
geom_rug(linewidth = 2) +
labs(x = "rate (cases as % of population)", y = "Frequency",
title = "Regional COVID-19 rates: week ending 2021-12-04") +
theme_minimal() +
theme(panel.grid.major.y = element_blank())
In this instance, it is the tidyverse code that is the more elaborate. This is partly because there is more customisation of it to mimic the base R plot. However, it is also because it is using the package ggplot2 to produce the histogram. We return to ggplot2 more in later sessions. For now it is sufficient to scan the code and observe how it is ‘layering up’ the various components of the graphic, with those components separated by the + in the lines of code.
The use of the + notation in ggplot2 operates a little like a pipe in that the outcome of one operation is handed on to the next to modify the graphic being produced. It doesn’t use the pipe because the package’s origins are somewhat older but just think of the + as layering-up – adding to – the graphic.
I prefer the ggplot2 plot to the one produced by hist() but that may be a matter of personal taste. However, ggplot2 can do ‘clever things’ with the visualisation, a hint of which is shown below.
df2 |>
ggplot(aes(x = rate)) +
geom_histogram(colour = "black", fill = "grey", binwidth = 0.05,
center = -0.025) +
geom_rug(aes(colour = regionName), linewidth = 2) +
labs(x = "rate (cases as % of population)", y = "Frequency",
title = "Regional COVID-19 rates: week ending 2021-12-04") +
scale_colour_discrete(name = "Region") +
theme_minimal() +
theme(panel.grid.major.y = element_blank())
Please don’t form the impression that ggplot2 is hard-wired to tidyverse and base R to the base graphics. In practice, they are interchangeable.
Here is an example of using ggplot2 after a sequence of base R commands.
df1 <- read.csv("covid.csv")
df1 <- df1[, c("regionName", "X2021.12.04", "All.Ages")]
names(df1)[c(2,3)] <- c("cases", "population")
df1$rate <- round(df1$cases / df1$population * 100, 3)
ggplot(df1, aes(x = rate, y = regionName)) +
geom_boxplot() +
labs(x = "rate (cases as % of population)",
y = "region",
title = "Regional COVID-19 rates: week ending 2021-12-04") +
theme_minimal()
And here is an example of using the base R graphic boxplot() after a chain of tidyverse commands.
read_csv("covid.csv") |>
select(regionName, `2021-12-04`, `All Ages`) |>
rename(cases = `2021-12-04`, population = `All Ages`) |>
mutate(rate = round(cases / population * 100, 3)) -> df2
par(mai=c(0.8,2,0.5,0.5), bty = "n", pch = 20) # See text below
boxplot(df2$rate ~ df2$regionName, horizontal = TRUE,
whisklty = "solid", staplelty = 0,
col = "white", las = 1, cex = 0.9, cex.axis = 0.75,
xlab = "rate (cases as % of population)", ylab="",
main = "Regional COVID-19 rates: week ending 2021-12-04")
title(ylab = "region", line = 6)
I would argue that, in this instance, the base R graphic is as nice as the ggplot2 one but it took more customisation to get it that way and I had to go digging around in the help files, ?boxplot, ?bxp and ?par, to find what I needed, which included changing the graphic’s margins (par(mai=...)), moving and changing the size of the text on the vertical axis (the argument cex.axis and the use of the title function), changing the appearance of the ‘whiskers’ (whisklty = "solid" and staplelty = 0), and so forth. Still, it does demonstrate that you can have a lot of control over what is produced, if you have the patience and tenacity to do so.
Having provided a very small taste of tidyverse and how it differs from base R, we might ask, “which is better?” However, the question is misguided: it is a little like deciding to go to South America and asking whether Spanish or Portuguese is the better language to use. It depends, of course, on what you intend to do and where you intend to travel.
I use both base R and tidyverse packages in my work, sometimes drifting between the two in rather haphazard ways. If I can get what I want to work then I am happy. Outcomes worry me more than means so, although I use tidyverse a lot, I am not always as tidy as it would want me to be!
There is much more to tidyverse than has been covered here. See here for further information about it and its core packages.
A full introduction to using tidyverse for Data Science is provided by the book R for Data Science (2nd edition) by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund. There is a free online version.
---
title: "Tidyverse"
subtitle: "What Is the Tidyverse?"
execute:
warning: false
message: false
---
```{r}
#| include: false
installed <- installed.packages()[,1]
pkgs <- c("XML", "tidyverse", "ggplot2", "microbenchmark")
install <- pkgs[!(pkgs %in% installed)]
if(length(install)) install.packages(install, dependencies = TRUE, repos = "https://cloud.r-project.org")
```
![](tidyverse.png){width=300}
## Introduction
If base R is R Classic then tidyverse is a new flavour of R, designed for data science. It consists of [a collection of R packages](https://www.tidyverse.org/){target="_blank"} that "share an underlying design philosophy, grammar, and data structures."
Tidyverse is easier to demonstrate than to pin-down to some basics so let's work through an example using both base R and tidyverse to illustrate some differences.
## To Start
If, as suggested in 'Getting Started', you have created an R Project to contain all the files you create and download for this course then open it now by using File --> Open Project... from the dropdown menus in RStudio. If you have not created one then now might be a good time!
We will begin by downloading a data file to use. It will be downloaded to your working directory, which is the folder associated with your R Project if you are using one. You can check the working directory by using `getwd()` and change it using Session --> Set Working Directory or with the function `setwd(dir)` where `dir` is the chosen directory. If you have created a Project then the working directory is that of the Project.
The data are an extract of the [Covid Data Dashboard](https://coronavirus.data.gov.uk/){target="_blank"} for England in December 2021. Some prior manipulation and adjustments to those data have been undertaken for another project so treat them as indicative only. The actual reported numbers may have been changed slightly from their originals although only marginally so.
```{r}
download.file("https://github.com/profrichharris/profrichharris.github.io/raw/main/MandM/data/covid_extract.csv", "covid.csv", mode = "wb", quiet = TRUE)
```
We also need to `require(tidyverse)` ready for use.
```{r}
require(tidyverse)
```
![](hazard.gif){width=75}
<font size = 3>If you get a warning message saying there is no package called tidyverse then you need to install it: `install.packages("tidyverse", dependencies = TRUE)`. You will find that some people prefer to use `library()` instead of `require()`. The difference between them is subtle but you can find an argument in favour of using `library()` [here](https://www.r-bloggers.com/2016/12/difference-between-library-and-require-in-r/){target="_blank"} even though I usually don't.</font>
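To make the difference between the two a little more concrete, here is a minimal sketch. It assumes a deliberately made-up package name, `notARealPackage123`, chosen only so that loading fails: `require()` then returns `FALSE` (with a warning), whereas `library()` stops with an error.

```{r}
# "notARealPackage123" is a hypothetical name that should not be installed
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) if the package cannot be loaded
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
loaded

# library() stops with an error instead, which tryCatch() captures here
outcome <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")
outcome
```

In a script, the error from `library()` halts things at the point where the package is missing, whereas with `require()` the failure may only surface later, when a function from the package is first called.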
## Reading-in the data
Let's read-in and take a look at the data. First in base R.
```{r}
df1 <- read.csv("covid.csv")
head(df1)
```
Now using tidyverse,
```{r}
df2 <- read_csv("covid.csv")
slice_head(df2, n = 6)
```
There are some similarities -- for example the function `read.csv` reads-in a file of comma separated variables, as does `read_csv`. However, the output from these functions differs. First, tidyverse has, in this case, handled the names of the variables better. It has also created what is described as a `tibble` which is "a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not." You can find out more about them and how they differ from traditional data frames [here](https://tibble.tidyverse.org/){target="_blank"}. Basically, they are a form of data frame that fits into tidyverse's philosophy to try and keep 'things' tidy through a shared underlying design philosophy, grammar and data structures.
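The difference in the variable names comes from a default setting in `read.csv()` rather than from anything tidyverse is doing differently with the file. As a small sketch, using a two-line stand-in for the file (the text string below is an assumption for illustration, with values copied from the first row shown above), the argument `check.names = FALSE` keeps the names as they appear in the file:

```{r}
# A two-line stand-in for the file, not the file itself
csv_text <- "MSOA11CD,2021-12-04,All Ages\nE02000002,25,7726"

# By default, read.csv() converts column names into syntactically valid R names
default_names <- names(read.csv(text = csv_text))
default_names

# check.names = FALSE keeps the names exactly as they appear in the file
kept_names <- names(read.csv(text = csv_text, check.names = FALSE))
kept_names
```

Names such as `2021-12-04` then need wrapping in backticks whenever they are referred to in code, which is what the tidyverse examples in this session do.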
## Selecting and renaming variables
We will now:
- select the `regionName`, `2021-12-04` and `All Ages` variables;
- rename the second of these as `cases` and the third as `population`;
- and look at the data again to check that it has worked.
In base R,
```{r}
df1 <- df1[, c("regionName", "X2021.12.04", "All.Ages")]
names(df1)[2:3] <- c("cases", "population")
head(df1)
```
In tidyverse,
```{r}
df2 <- select(df2, regionName, `2021-12-04`, `All Ages`)
df2 <- rename(df2, cases = `2021-12-04`, population = `All Ages`)
slice_head(df2, n = 6)
```
Comparing the two, the tidyverse code may be more intuitive to understand because of its use of verbs as functions: `select()`, `rename()` and so forth.
## Piping
Now we shall bring the two previous stages together, using what is referred to as a pipe. Without worrying about the detail, which we will return to presently, here is an example of a pipe, `|>` being used in base R:
```{r}
read.csv("covid.csv") |>
(\(x) x[, c("regionName", "X2021.12.04", "All.Ages")])() -> df1
names(df1)[2:3] <- c("cases", "population")
df1 |>
head()
```
![](hazard.gif){width=75}
<font size = 3>The above will only work if you are using R version 4.1.0 or above. You can check which version you are running by using `R.Version()$version.string`.</font>
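If you would rather have R make the comparison for you, one possible check is sketched below. It uses the base R function `getRversion()`; the object name `has_native_pipe` is just an illustrative choice.

```{r}
# getRversion() returns the running version in a form that can be compared
getRversion()

# TRUE if the version is recent enough to include the native pipe
has_native_pipe <- getRversion() >= "4.1.0"
has_native_pipe
```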
Here is the same process using tidyverse and a different pipe, `%>%`,
```{r, message=FALSE, warning=FALSE}
read_csv("covid.csv") %>%
select(regionName, `2021-12-04`, `All Ages`) %>%
rename(cases = `2021-12-04`, population = `All Ages`) %>%
slice_head(n = 6)
```
The obvious difference here is that the tidyverse code is more elegant. But what is the pipe and what is the difference between `|>` in the base R code and `%>%` in the tidyverse example?
A pipe is really just a way of sending ('piping') something from one line of code to the next, to create a chain of commands (forgive the mixed metaphors). For example,
```{r}
x <- 0:10
mean(x)
```
could be calculated as
```{r}
0:10 |>
mean()
```
or as
```{r}
0:10 %>%
mean()
```
or, if you want to save on a few characters of code,
```{r}
0:10 %>%
mean
```
However, this won't work:
```{r}
#| eval: false
0:10 |>
mean
```
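This is confusing but it arises from a difference between the two pipes: the native pipe, `|>`, which is the more recent development, requires the right-hand side to be a function call written with its parentheses, whereas `%>%` also accepts a bare function name.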
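One way to see what the native pipe is doing is to look at how R reads the code before running it. The sketch below, which assumes R 4.1.0 or above, uses `quote()` to capture the expression without evaluating it. It suggests that `|>` is simply rewritten into an ordinary function call, which is why it needs a call, with parentheses, on its right-hand side.

```{r}
# Capture the piped expression without running it
piped <- quote(0:10 |> mean())
piped

# It is the same expression as the 'unpiped' version
identical(piped, quote(mean(0:10)))
```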
A more complicated example of piping is below. It employs the function `sapply()`, a variant of the function `lapply(X, FUN)` that takes a list `X` and applies the function `FUN` to each part of it. In the example, it is the function `mean`.
Here it is without any pipes:
```{r}
x <- list(0:10, 10:20)
# Creates a list with two parts: the numbers 0 to 10, and 10 to 20
y <- sapply(x, mean)
# Calculates the mean for each part of the list, which are 5 and 15
sum(y)
# Sums together the two means, giving 20
```
With pipes, the above could instead be written as
```{r}
list(0:10, 10:20) |>
sapply(mean) |>
sum()
```
or as
```{r}
list(0:10, 10:20) %>%
sapply(mean) %>%
sum()
```
All three arrive at the same answer, which is 20.
So far, so good but what is the difference between `|>` and `%>%`? The answer is that `%>%` was developed first, in the [magrittr package](https://cran.r-project.org/web/packages/magrittr/){target="_blank"}, whereas `|>` is [R's new native pipe](https://www.r-bloggers.com/2021/05/the-new-r-pipe/){target="_blank"}. They are often interchangeable **but not always**.
At the moment, the `|>` pipe is less flexible to use than `%>%`. Consider the following example. The final two lines of code work fine using `%>%` to pipe the data frame into the regression model, which is a line of best fit between the `x` and `y` values (the function `lm()` fits a linear model which can be used to predict a `y` value from a value of `x`).
```{r}
1:100 %>%
data.frame(x = ., y = 2*. + rnorm(100)) %>%
lm(y ~ x, data = .)
```
(note: the output you get will likely differ from mine because the function `rnorm()` adds some random variations to the data)
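If you do want the same 'random' numbers each time, one option is to set the seed of the random number generator first. This is a small sketch rather than something the example above relies on, and the seed value, 123, is an arbitrary choice.

```{r}
set.seed(123)   # 123 is an arbitrary choice of seed
a <- rnorm(3)

set.seed(123)   # resetting the same seed...
b <- rnorm(3)   # ...reproduces the same draws

identical(a, b)
```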
However, it does *not* work with the native pipe, `|>`, because it does not recognise the placeholder `.` that we had previously used to represent what was flowing through the pipe.
```{r}
#| eval: false
# The following code does not work
1:100 |>
data.frame(x = ., y = 2*. + rnorm(100)) |>
lm(y ~ x, data = .)
```
To solve the problem, the above code can be modified by wrapping the regression part in another function but the end result is rather 'clunky'.
```{r}
1:100 |>
(\(z) data.frame(x = z, y = 2*z + rnorm(100)))() |>
(\(z) lm(y ~ x, data = z))()
```
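A partial alternative is available in more recent versions of R. From R 4.2.0, the native pipe has its own placeholder, `_`, although it has to be given to a named argument and can only be used once in a call. The sketch below assumes R 4.2.0 or above. Because `_` cannot be used twice, it builds the data frame with `transform()` instead of the two-placeholder `data.frame()` call used earlier; that substitution is my own choice for illustration.

```{r}
# Assumes R >= 4.2.0, where the native pipe gained the _ placeholder
set.seed(1)   # arbitrary seed, so that the result is repeatable

fit <- data.frame(x = 1:100) |>
  transform(y = 2 * x + rnorm(100)) |>   # add y without needing a placeholder
  lm(y ~ x, data = _)                    # _ stands for what flows through the pipe

coef(fit)
```

The slope should again be close to 2, as in the earlier versions of the model.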
Over time, expect `|>` to be developed and to supersede `%>%`. For now, you are unlikely to encounter errors using `%>%` as a substitute for `|>` but you might using `|>` instead of `%>%`. In other words, `%>%` is the safer choice if you are unsure, although `|>` is, if anything, marginally faster:
```{r}
#| eval: false
install.packages("microbenchmark", dependencies = TRUE)
require("microbenchmark")
microbenchmark(
1:100 %>%
data.frame(x = ., y = 2*. + rnorm(100)) %>%
lm(y ~ x, data = .),
1:100 |>
(\(z) data.frame(x = z, y = 2*z + rnorm(100)))() |>
(\(z) lm(y ~ x, data = z))(),
times = 100
)
```
```{r}
#| echo: false
require("microbenchmark")
microbenchmark(
1:100 %>%
data.frame(x = ., y = 2*. + rnorm(100)) %>%
lm(y ~ x, data = .),
1:100 |>
(\(z) data.frame(x = z, y = 2*z + rnorm(100)))() |>
(\(z) lm(y ~ x, data = z))(),
times = 100
)
```
## Back to the example
After that digression into piping, let's return to our example, which compares base R and tidyverse approaches to reading in a table of data, selecting variables and renaming them. In what follows, we also calculate the number of COVID-19 cases per English region as a percentage of the region's estimated population in the week ending 2021-12-04.
First, in base R:
```{r}
df1 <- read.csv("covid.csv")
df1 <- df1[, c("regionName", "X2021.12.04", "All.Ages")]
names(df1)[c(2,3)] <- c("cases", "population")
cases <- tapply(df1$cases, df1$regionName, sum) # Total cases per region
cases
# This step isn't necessary but is included
# to show the result of the line above
population <- tapply(df1$population, df1$regionName, sum)
# Total population per region
rate <- round(cases / population * 100, 3)
rate
```
Now using tidyverse,
```{r}
read_csv("covid.csv") |>
select(regionName, `2021-12-04`, `All Ages`) |>
rename(cases = `2021-12-04`, population = `All Ages`) |>
group_by(regionName) |>
summarise(across(where(is.numeric), sum)) |>
mutate(rate = round(cases / population * 100, 3)) |>
print(n = Inf)
```
Either way produces the same answers but, again, there is an elegance and consistency to the tidyverse way of doing it (which works just fine with the `|>` pipe) that is, perhaps, missing from base R.
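That said, base R can get closer to the tidyverse result than the `tapply()` approach might suggest. As a sketch, the function `aggregate()` returns the regional totals as a data frame in a single step. The `toy` data below are an assumption for illustration only: the two London rows are copied from the data shown earlier but the North East rows are made up.

```{r}
# A toy data set: the London rows are from the data, the others are invented
toy <- data.frame(
  regionName = c("London", "London", "North East", "North East"),
  cases      = c(25, 46, 10, 20),
  population = c(7726, 11246, 5000, 10000)
)

# Sum cases and population within each region, returning a data frame
totals <- aggregate(cbind(cases, population) ~ regionName, data = toy, FUN = sum)

# Add the rate as a new column
totals$rate <- round(totals$cases / totals$population * 100, 3)
totals
```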
## Plotting
As a final step for the comparison, we will extend the code to visualise the regional COVID-19 rates in a histogram, with a [rug plot](https://en.wikipedia.org/wiki/Rug_plot#:~:text=A%20rug%20plot%20is%20a,a%20one%2Ddimensional%20scatter%20plot){target="_blank"} included. A rug plot is a way of preserving the individual data values that would otherwise be 'lost' within the bins of a histogram.
As previously, we begin with base R,
```{r}
df1 <- read.csv("covid.csv")
df1 <- df1[, c("regionName", "X2021.12.04", "All.Ages")]
names(df1)[c(2,3)] <- c("cases", "population")
cases <- tapply(df1$cases, df1$regionName, sum)
population <- tapply(df1$population, df1$regionName, sum)
rate <- round(cases / population * 100, 3)
hist(rate, xlab = "rate (cases as % of population)",
main = "Regional COVID-19 rates: week ending 2021-12-04")
rug(rate, lwd = 2)
```
...and continue with tidyverse, creating the output in such a way that it mimics the previous plot.
```{r}
require(ggplot2)
read_csv("covid.csv") |>
select(regionName, `2021-12-04`, `All Ages`) |>
rename(cases = `2021-12-04`, population = `All Ages`) |>
group_by(regionName) |>
summarise(across(where(is.numeric), sum)) |>
mutate(rate = round(cases / population * 100, 3)) -> df2
df2 |>
ggplot(aes(x = rate)) +
geom_histogram(colour = "black", fill = "grey", binwidth = 0.05,
center = -0.025) +
geom_rug(linewidth = 2) +
labs(x = "rate (cases as % of population)", y = "Frequency",
title = "Regional COVID-19 rates: week ending 2021-12-04") +
theme_minimal() +
theme(panel.grid.major.y = element_blank())
```
In this instance, it is the tidyverse code that is the more elaborate. This is partly because there is more customisation of it to mimic the base R plot. However, it is also because it is using the package [ggplot2](https://ggplot2.tidyverse.org/){target="_blank"} to produce the histogram. We return to `ggplot2` more in later sessions. For now it is sufficient to scan the code and observe how it is 'layering up' the various components of the graphic, with those components separated by the `+` in the lines of code.
![](hazard.gif){width=75}
<font size = 3>The use of the `+` notation in `ggplot2` operates a little like a pipe in that the outcome of one operation is handed on to the next to modify the graphic being produced. It doesn't use the pipe because the package's origins are somewhat older but just think of the `+` as layering-up -- adding to -- the graphic.</font>
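The layering can be made more explicit by building the graphic one step at a time and saving it as an object along the way. The following is a minimal sketch: the object names `toy` and `p` are illustrative, and the rates are the nine regional values calculated earlier, typed in by hand.

```{r}
library(ggplot2)

# The nine regional rates from the earlier table, entered manually
toy <- data.frame(rate = c(0.381, 0.403, 0.423, 0.445, 0.479,
                           0.524, 0.571, 0.598, 0.681))

p <- ggplot(toy, aes(x = rate))            # start with the data and the x variable
p <- p + geom_histogram(binwidth = 0.05)   # add the histogram layer
p <- p + geom_rug()                        # add the rug layer
p                                          # printing the object draws the plot
```

Nothing is drawn until the object is printed, so further layers or labels can be added to `p` at any point before then.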
I prefer the `ggplot2` plot to the one produced by `hist()` but that may be a matter of personal taste. However, `ggplot2` can do 'clever things' with the visualisation, a hint of which is shown below.
```{r}
#| fig-height: 4
df2 |>
ggplot(aes(x = rate)) +
geom_histogram(colour = "black", fill = "grey", binwidth = 0.05,
center = -0.025) +
geom_rug(aes(colour = regionName), linewidth = 2) +
labs(x = "rate (cases as % of population)", y = "Frequency",
title = "Regional COVID-19 rates: week ending 2021-12-04") +
scale_colour_discrete(name = "Region") +
theme_minimal() +
theme(panel.grid.major.y = element_blank())
```
<br>
Please don't form the impression that `ggplot2` is hard-wired to tidyverse and base R to the base `graphics`. In practice, they are interchangeable.
Here is an example of using `ggplot2` after a sequence of base R commands.
```{r}
df1 <- read.csv("covid.csv")
df1 <- df1[, c("regionName", "X2021.12.04", "All.Ages")]
names(df1)[c(2,3)] <- c("cases", "population")
df1$rate <- round(df1$cases / df1$population * 100, 3)
ggplot(df1, aes(x = rate, y = regionName)) +
geom_boxplot() +
labs(x = "rate (cases as % of population)",
y = "region",
title = "Regional COVID-19 rates: week ending 2021-12-04") +
theme_minimal()
```
And here is an example of using the base R graphic `boxplot()` after a chain of tidyverse commands.
```{r}
#| fig-height: 6
read_csv("covid.csv") |>
select(regionName, `2021-12-04`, `All Ages`) |>
rename(cases = `2021-12-04`, population = `All Ages`) |>
mutate(rate = round(cases / population * 100, 3)) -> df2
par(mai=c(0.8,2,0.5,0.5), bty = "n", pch = 20) # See text below
boxplot(df2$rate ~ df2$regionName, horizontal = TRUE,
whisklty = "solid", staplelty = 0,
col = "white", las = 1, cex = 0.9, cex.axis = 0.75,
xlab = "rate (cases as % of population)", ylab="",
main = "Regional COVID-19 rates: week ending 2021-12-04")
title(ylab = "region", line = 6)
```
I would argue that, in this instance, the base R graphic is as nice as the ggplot2 one but it took more customisation to get it that way and I had to go digging around in the help files, `?boxplot`, `?bxp` and `?par`, to find what I needed, which included changing the graphic's margins (`par(mai=...)`), moving and changing the size of the text on the vertical axis (the argument `cex.axis` and the use of the `title` function), changing the appearance of the 'whiskers' (`whisklty = "solid"` and `staplelty = 0`), and so forth. Still, it does demonstrate that you can have a lot of control over what is produced, if you have the patience and tenacity to do so.
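One thing to be aware of is that settings changed with `par()` stay in place for later plots on the same graphics device. A common pattern, sketched below, is to save the previous settings when changing them and to restore them afterwards. To keep the example self-contained it uses `InsectSprays`, a data set that comes with R, instead of the COVID-19 data.

```{r}
# par() returns the previous values of the settings it changes
op <- par(mai = c(0.8, 2, 0.5, 0.5), bty = "n", pch = 20)

# InsectSprays is a built-in data set, used here only as a stand-in
boxplot(count ~ spray, data = InsectSprays, horizontal = TRUE, las = 1)

# Restore the previous settings so that later plots are unaffected
par(op)
```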
## Which is better?
Having provided a **very** small taste of tidyverse and how it differs from base R, we might ask, "which is better?" However, the question is misguided: it is a little like deciding to go to South America and asking whether Spanish or Portuguese is the better language to use. It depends, of course, on what you intend to do and where you intend to travel.
I use both base R and tidyverse packages in my work, sometimes drifting between the two in rather haphazard ways. If I can get what I want to work then I am happy. Outcomes worry me more than means so, although I use tidyverse a lot, I am not always as tidy as it would want me to be!
## Further reading
![](data_science.jpg){width=100}
There is **much** more to tidyverse than has been covered here. See [here](https://www.tidyverse.org/packages/){target="_blank"} for further information about it and its core packages.
A full introduction to using tidyverse for Data Science is provided by the book [R for Data Science](https://www.oreilly.com/library/view/r-for-data/9781492097396/){target="_blank"} (2nd edition) by Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund. There is a free [online version](http://r4ds.hadley.nz/){target="_blank"}.