Summary
Part 1 (Sgro (2023)) : was a general overview of the dplyr
package for data exploration and manipulation to summarize or extract information from tabular data.
Part 2: (This post) provides worked examples while learning Rmarkdown.
The 10 `dplyr` examples cover the topics in the list below. An 11th example was added to show that plots are automatically embedded within the output of an Rmarkdown document.
1. Summarize multiple columns
2. Select files with maximum values in each group
3. Apply a function to each tab file
4. Frequencies / relative proportions
5. Remove duplicate files
6. Change the number of variables
7. Select complete cases
8. Add multiple threads/columns
9. Count the number of cases or sons
10. Count the number of cases or single files (with NAs)
11. embed a plot
1 Motivation
This article had a triple motivation:
- This is “Part 2” of an overview of the
dplyr
Tidyverse R package which presents short worked out examples. - A documentation reflecting learning markdown and Rmarkdown in class (see below.)
- Updating the
%>%
pipe to the newer R-base pipe|>
. However, this provided new challenges.
The Biochemistry R Club is following the new, 2nd edition of R for Data Science, currently on Chapter 3 in particular with the dplyr
package.
This document was created in class during the R Club session to learn how to create markdown and/or Rmarkdown documents while at the same time pursuing learning about dplyr
which, for this session, is inspired by this document in Spanish which has 10 examples of various dplyr
commands (Máxima Formación S.L. (2022)).
Note: to translate the Spanish document to English automatically simply paste the link within a Google Translate page. Then click on the link on the English side:
Note: to learn more about markdown and Rmarkdown it is useful to read both the .Rmd
source file as well as one or more of its rendered version. The R Club version is available on the R Club repository at 2023-12-19-session-06.
2 Preparations:
First we need to activate Tidyverse!
For this we add an “R code chunk” to which we add the request to suppress the information messages typically echoed when activating Tidyverse
library(tidyverse)
These exercises uses the built-in, pre-installed dataset iris
.
The following line is an example of “in-line” computation by R, to provide exact values, here depicting the tabular form of the iris
dataset. This avoids the need for Copy/Paste, and is more reliable as the numbers would be updated should the data change. (See .Rmd
file to see how it’s done!)
The dataset contains 150 rows and 5 columns.
3 Ten ways to use dplyr
Below are the proposed examples from the blog article cited above. The main change is the substitution to the new pipe symbol |>
now included in base R installation starting with version 4.1
.
Minor additions or alterations are sometimes added to the original code. For example the added line relocate(Species)
moves the species column in first position in the final table(s).
Translated paragraph titles (sections) are preserved as well as original text shown in italics.
3.1 Summarize multiple columns
What is the average length and width of sepals and petals for each species?
The code computes the average (mean
) for each column. The group_by()
function provides that the mean is computed specifically for each species. The summarise_all()
function affects every variable, meaning that each column is summarized with the command provided in parenthesis i.e. mean
.
iris |>
group_by(Species) |>
summarise_all(mean)
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246
2 versicolor 5.94 2.77 4.26 1.33
3 virginica 6.59 2.97 5.55 2.03
If you want to summarize some columns alone, use the functions summarise_at()
specifying which column(s), or summarise_if()
specifying column properties.
For example:
iris |> summarize_at(c("Sepal.Length","Petal.Width"),mean, na.rm = TRUE)
Sepal.Length Petal.Width
1 5.843333 1.199333
iris |> summarize_if(is.numeric,mean, na.rm = TRUE)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.843333 3.057333 3.758 1.199333
3.2 Select rows with the maximum values in each group
Imagine that you only want to obtain that case with the highest value of sepal length, for each species.
Here we filter each row to select a maximum value with max
applied to the column Sepal.Length
.
In the following code the group_by()
function ensures that the following commands will be applied to each species within the chosen column. The result will be a “grouped data frame”. The next command filters out the row with maximum value from the chosen column. It is done for all species thanks to the prior use of the group_by()
function. The arrange()
function does not seem to make any difference in this specific output but arrange(desc(Species))
could be used to list by descending order. The command relocate(Species)
will move that column in the first place on the left.
knitr::kable(
iris |>
group_by(Species) |>
filter(Sepal.Length == max(Sepal.Length)) |>
arrange(desc(Species)) |>
relocate(Species)
)
Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|---|
virginica | 7.9 | 3.8 | 6.4 | 2.0 |
versicolor | 7.0 | 3.2 | 4.7 | 1.4 |
setosa | 5.8 | 4.0 | 1.2 | 0.2 |
The code could be written in a different way for the same result:
knitr::kable(
iris |>
group_by(Species) |>
slice(which.max(Sepal.Length)) |>
relocate(Species)
)
Species | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|---|
setosa | 5.8 | 4.0 | 1.2 | 0.2 |
versicolor | 7.0 | 3.2 | 4.7 | 1.4 |
virginica | 7.9 | 3.8 | 6.4 | 2.0 |
3.3 Apply a function to each row of a table
And if you want to calculate a new variable (column) with the maximum value of the length of sepals or petals, we do the following:
rowwise()
allows the computation a row-at-a-time.
mutate()
will create a new column with the requested computation which is to select the max
value from the supplied values from 2 columns. Since we have rowwise()
on the previous command this will be done one row at a time. We have to add head()
to limit the outpout, otherwise kable
will print the complete table.
knitr::kable(
iris |>
rowwise() |>
mutate(Max.Len= max(Sepal.Length,Petal.Length)) |> head()
)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Max.Len |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 5.1 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 4.9 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 4.7 |
4.6 | 3.1 | 1.5 | 0.2 | setosa | 4.6 |
5.0 | 3.6 | 1.4 | 0.2 | setosa | 5.0 |
5.4 | 3.9 | 1.7 | 0.4 | setosa | 5.4 |
3.4 Obtain the relative frequencies or proportions
What % of cases do we have by species?
I changed the original n
variable used in the original formula to N
to clarify the difference between the variable and the counting function n()
.
knitr::kable(
iris |>
group_by(Species) |>
summarise(N = n()) |>
mutate(freq = N / sum(N))
)
Species | N | freq |
---|---|---|
setosa | 50 | 0.3333333 |
versicolor | 50 | 0.3333333 |
virginica | 50 | 0.3333333 |
In this case we have 50 flowers (cases or rows) for each species, therefore there is 33% of the sample that corresponds to the setosa lily species, 33% to veriscolor and 33% to virginica.
3.5 Remove duplicate rows
This is a common task in data manipulation, but it is easy to solve with the distinct()
function:
(The kable
table is not used on this very simple output.)
iris |>
distinct(Species)
Species
1 setosa
2 versicolor
3 virginica
Result shows that the “Species” variable only has 3 different labels: setosa, virginica and vericolor.
3.6 Change the name of the variables
We often need to shorten variable (column) names, to write code more easily and faster. For example, here we will call the variable “Sepal.Length” “SL” to avoid typing errors with capital letters, periods and English spelling. To rename the columns we can write the following:
The code is modified to change the name of 4 columns:
knitr::kable(
iris |>
rename(SL = Sepal.Length,
SW = Sepal.Width,
PL = Petal.Length,
PW = Petal.Width) |>
head()
)
SL | SW | PL | PW | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
We see that only the name of the selected columns has changed.
3.7 Select complete cases
Note: The iris
dataset is complete without missing values.
How to keep the complete observations? We construct a dummy data set containing missing (or missing, NA) values for the example:
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 a
2 2 <NA>
3 NA b
Note the syntax difference for a number: NA
and for a character <NA>
. This difference would not be visible in a kable
formatted table.
We can use the drop_na()
function to remove all cases that contain any form of NA
:
df |> drop_na()
# A tibble: 1 × 2
x y
<dbl> <chr>
1 1 a
We can also specifically remove NA
only associated with a specific variable, while preserving others:
df |> drop_na(x)
# A tibble: 2 × 2
x y
<dbl> <chr>
1 1 a
2 2 <NA>
You can also use the functions na.omit()
, filter(complete.cases(.))
or filter(!is.na(x1))
.
Notes:
na.omit()
is not adplyr
function.- the other 2 suggestions only work with the
%>%
pipe from themagrittr
package. For a technical, detailed explanation see Velásquez (2022) “Understanding the native R pipe |>”, or a saved archive.
3.8 Sum across multiple rows/columns
Let’s create an example data set, imagine 1
is presence and 0
absence (or Yes
/NO
).
(Note the use of =
to “assign” values. While this is correct “R purists” prefer the assignment code <-
.)
df=data.frame(
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1))
df
x1 x2 x3
1 1 1 0
2 0 1 1
3 0 NA 0
4 NA 1 1
5 0 1 1
6 1 0 0
7 1 NA NA
8 NA NA NA
9 0 0 0
10 1 1 1
To perform the sum, imagine that we can replace the NA
s with zero (absence of information and absence of the event). To summarize all columns we use summarise_all()
:
NOTE: as above, the provided code only works with the magrittr
pipe %>%
. However, based on the above blog link, we can modify the code using a “lambda function” shortcut:
df |>
(\(x) replace(x, is.na(x), 0))() |>
summarise(across(everything(), sum))
x1 x2 x3
1 4 5 4
# summarise_all(funs(sum)) # funs() has been deprecated
Note: while this exercise was useful to learn how to replace NA
with 0
, the same result could be obtained in this case with:
# funs() has been deprecated: replace with list()
df |> summarise_all(list(sum), na.rm = TRUE)
x1 x2 x3
1 4 5 4
To summarize each row:
The original code was:
df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(.[1:3]))
To make it work with the new R-base pipe, it is necessary to create a function on the fly (a lambda function with no specific name.) This works but clearly the use of the %>%
is much more elegant!
df |>
(\(x) replace(x, is.na(x), 0))() |>
(function(x) { mutate(x, sum = rowSums(x[1:3]))})()
x1 x2 x3 sum
1 1 1 0 2
2 0 1 1 2
3 0 0 0 0
4 0 1 1 2
5 0 1 1 2
6 1 0 0 1
7 1 0 0 1
8 0 0 0 0
9 0 0 0 0
10 1 1 1 3
3.9 Count the number of cases or rows
This is very important, because when we summarize the data in descriptive statistics we should take into account the number of observations that have been used for this:
I change n
for N
in the original code to clearly distinguish the variable from the counting function n()
. I also change the code to compute the mean for all the relevant columns.
knitr::kable(
iris |>
group_by(Species) |>
summarise(N = n(),
meanSL = mean(Sepal.Length),
meanSW = mean(Sepal.Width),
meanPL = mean(Petal.Length),
meanPW= mean(Petal.Width))
)
Species | N | meanSL | meanSW | meanPL | meanPW |
---|---|---|---|---|---|
setosa | 50 | 5.006 | 3.428 | 1.462 | 0.246 |
versicolor | 50 | 5.936 | 2.770 | 4.260 | 1.326 |
virginica | 50 | 6.588 | 2.974 | 5.552 | 2.026 |
You can also use the count function:
iris |>
group_by(Species) |>
count()
# A tibble: 3 × 2
# Groups: Species [3]
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
3.10 Count the number of cases or unique rows (with NAs)
When we have a data set with NAs we can count the number of cases or unique rows with the n_distinct()
function.
We create a dummy data set for the example.
data = data.frame(aa=c(1,2,3,4,NA), bb=c('a', 'b', 'a', 'c', 'c'))
data
aa bb
1 1 a
2 2 b
3 3 a
4 4 c
5 NA c
data |>
filter(!is.na(aa)) |>
group_by(bb) |>
summarise(Unique_Elements = n_distinct(aa))
# A tibble: 3 × 2
bb Unique_Elements
<chr> <int>
1 a 2
2 b 1
3 c 1
For example, here we have selected the cases that do not have NA in the variable “aa” and counted the different cases for the variable “bb”.
Note that the value c
in column bb
associated with NA
in column aa
is not counted and not reported.
3.11 The 11th exercise: add a histogram
To show that plots are automatically embedded within the final export of an Rmarkdown document, we improvised a histogram with ggplot2
:
ggplot(iris) +
aes(x = Petal.Length, y = Sepal.Width, colour = Species) +
geom_point(shape = "circle", size = 1.5) +
scale_color_hue(direction = 1) +
theme_minimal()
4 Image Credits
Iris depiction from Rosebrock (2019).