Data exploration in R with Tidyverse dplyr – part 2 – 10 examples

Iris flower measurements of sepals and petals
The iris dataset has measures of width and length for petal and sepal of 3 iris species.

Summary

Part 1 (Sgro (2023)) : was a general overview of the dplyr package for data exploration and manipulation to summarize or extract information from tabular data.
Part 2: (This post) provides worked examples while learning Rmarkdown.

The 10 `dplyr` examples cover the topics in the list below. An 11th example was added to show that plots are automatically embedded within the output of an Rmarkdown document.

1. Summarize multiple columns
2. Select files with maximum values in each group
3. Apply a function to each tab file
4. Frequencies / relative proportions
5. Remove duplicate files
6. Change the number of variables
7. Select complete cases
8. Add multiple threads/columns
9. Count the number of cases or sons
10. Count the number of cases or single files (with NAs)
11. embed a plot

1 Motivation

This article had a triple motivation:

  1. This is “Part 2” of an overview of the dplyr Tidyverse R package which presents short worked out examples.
  2. A documentation reflecting learning markdown and Rmarkdown in class (see below.)
  3. Updating the %>% pipe to the newer R-base pipe |>. However, this provided new challenges.

The Biochemistry R Club is following the new, 2nd edition of R for Data Science, currently on Chapter 3 in particular with the dplyr package.

This document was created in class during the R Club session to learn how to create markdown and/or Rmarkdown documents while at the same time pursuing learning about dplyr which, for this session, is inspired by this document in Spanish which has 10 examples of various dplyr commands (Máxima Formación S.L. (2022)).

Note: to translate the Spanish document to English automatically simply paste the link within a Google Translate page. Then click on the link on the English side:

""
Translate web page from Spanish to English with Google Translate.

Note: to learn more about markdown and Rmarkdown it is useful to read both the .Rmd source file as well as one or more of its rendered version. The R Club version is available on the R Club repository at 2023-12-19-session-06.

2 Preparations:

First we need to activate Tidyverse!

For this we add an “R code chunk” to which we add the request to suppress the information messages typically echoed when activating Tidyverse

library(tidyverse)

These exercises uses the built-in, pre-installed dataset iris.

The following line is an example of “in-line” computation by R, to provide exact values, here depicting the tabular form of the iris dataset. This avoids the need for Copy/Paste, and is more reliable as the numbers would be updated should the data change. (See .Rmd file to see how it’s done!)

The dataset contains 150 rows and 5 columns.

3 Ten ways to use dplyr

Below are the proposed examples from the blog article cited above. The main change is the substitution to the new pipe symbol |> now included in base R installation starting with version 4.1.

Minor additions or alterations are sometimes added to the original code. For example the added line relocate(Species) moves the species column in first position in the final table(s).

Translated paragraph titles (sections) are preserved as well as original text shown in italics.

3.1 Summarize multiple columns

What is the average length and width of sepals and petals for each species?

The code computes the average (mean) for each column. The group_by() function provides that the mean is computed specifically for each species. The summarise_all() function affects every variable, meaning that each column is summarized with the command provided in parenthesis i.e. mean.

iris |>
group_by(Species) |>
summarise_all(mean)
# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03 

If you want to summarize some columns alone, use the functions summarise_at() specifying which column(s), or summarise_if() specifying column properties.

For example:

iris |> summarize_at(c("Sepal.Length","Petal.Width"),mean, na.rm = TRUE)
  Sepal.Length Petal.Width
1     5.843333    1.199333
iris |> summarize_if(is.numeric,mean, na.rm = TRUE)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.843333    3.057333        3.758    1.199333

3.2 Select rows with the maximum values in each group

Imagine that you only want to obtain that case with the highest value of sepal length, for each species.

Here we filter each row to select a maximum value with max applied to the column Sepal.Length.

In the following code the group_by() function ensures that the following commands will be applied to each species within the chosen column. The result will be a “grouped data frame”. The next command filters out the row with maximum value from the chosen column. It is done for all species thanks to the prior use of the group_by() function. The arrange() function does not seem to make any difference in this specific output but arrange(desc(Species)) could be used to list by descending order. The command relocate(Species) will move that column in the first place on the left.

knitr::kable(
iris |>
  group_by(Species) |>
  filter(Sepal.Length == max(Sepal.Length)) |>
  arrange(desc(Species)) |>
  relocate(Species)
)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
virginica 7.9 3.8 6.4 2.0
versicolor 7.0 3.2 4.7 1.4
setosa 5.8 4.0 1.2 0.2

The code could be written in a different way for the same result:

knitr::kable(
  iris |> 
    group_by(Species) |>
      slice(which.max(Sepal.Length)) |>
    relocate(Species)
)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.8 4.0 1.2 0.2
versicolor 7.0 3.2 4.7 1.4
virginica 7.9 3.8 6.4 2.0

3.3 Apply a function to each row of a table

And if you want to calculate a new variable (column) with the maximum value of the length of sepals or petals, we do the following:

rowwise() allows the computation a row-at-a-time.
mutate() will create a new column with the requested computation which is to select the max value from the supplied values from 2 columns. Since we have rowwise() on the previous command this will be done one row at a time. We have to add head() to limit the outpout, otherwise kable will print the complete table.

knitr::kable(
 iris |>
 rowwise() |>
 mutate(Max.Len= max(Sepal.Length,Petal.Length)) |> head()
)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
5.1 3.5 1.4 0.2 setosa 5.1
4.9 3.0 1.4 0.2 setosa 4.9
4.7 3.2 1.3 0.2 setosa 4.7
4.6 3.1 1.5 0.2 setosa 4.6
5.0 3.6 1.4 0.2 setosa 5.0
5.4 3.9 1.7 0.4 setosa 5.4

3.4 Obtain the relative frequencies or proportions

What % of cases do we have by species?

I changed the original n variable used in the original formula to N to clarify the difference between the variable and the counting function n().

knitr::kable(
  iris |>
  group_by(Species) |>
  summarise(N = n()) |>
  mutate(freq = N / sum(N))
)
Species N freq
setosa 50 0.3333333
versicolor 50 0.3333333
virginica 50 0.3333333

In this case we have 50 flowers (cases or rows) for each species, therefore there is 33% of the sample that corresponds to the setosa lily species, 33% to veriscolor and 33% to virginica.

3.5 Remove duplicate rows

This is a common task in data manipulation, but it is easy to solve with the distinct() function:

(The kable table is not used on this very simple output.)

iris |>
distinct(Species)
     Species
1     setosa
2 versicolor
3  virginica

Result shows that the “Species” variable only has 3 different labels: setosa, virginica and vericolor.

3.6 Change the name of the variables

We often need to shorten variable (column) names, to write code more easily and faster. For example, here we will call the variable “Sepal.Length” “SL” to avoid typing errors with capital letters, periods and English spelling. To rename the columns we can write the following:

The code is modified to change the name of 4 columns:

knitr::kable(
iris |>
rename(SL = Sepal.Length,
       SW = Sepal.Width,
       PL = Petal.Length,
       PW = Petal.Width) |>
head()
)
SL SW PL PW Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

We see that only the name of the selected columns has changed.

3.7 Select complete cases

Note: The iris dataset is complete without missing values.

How to keep the complete observations? We construct a dummy data set containing missing (or missing, NA) values for the example:

df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 <NA> 
3    NA b    

Note the syntax difference for a number: NA and for a character <NA>. This difference would not be visible in a kable formatted table.

We can use the drop_na() function to remove all cases that contain any form of NA:

df |> drop_na()
# A tibble: 1 × 2
      x y    
  <dbl> <chr>
1     1 a    

We can also specifically remove NA only associated with a specific variable, while preserving others:

df |> drop_na(x)
# A tibble: 2 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 <NA> 

You can also use the functions na.omit()filter(complete.cases(.)) or filter(!is.na(x1)).

Notes:

3.8 Sum across multiple rows/columns

Let’s create an example data set, imagine 1 is presence and 0 absence (or Yes/NO).

(Note the use of = to “assign” values. While this is correct “R purists” prefer the assignment code <-.)

df=data.frame(
   x1=c(1,0,0,NA,0,1,1,NA,0,1),
   x2=c(1,1,NA,1,1,0,NA,NA,0,1),
   x3=c(0,1,0,1,1,0,NA,NA,0,1))
df
   x1 x2 x3
1   1  1  0
2   0  1  1
3   0 NA  0
4  NA  1  1
5   0  1  1
6   1  0  0
7   1 NA NA
8  NA NA NA
9   0  0  0
10  1  1  1

To perform the sum, imagine that we can replace the NAs with zero (absence of information and absence of the event). To summarize all columns we use summarise_all():

NOTE: as above, the provided code only works with the magrittr pipe %>%. However, based on the above blog link, we can modify the code using a “lambda function” shortcut:

df |>
    (\(x) replace(x, is.na(x), 0))() |>
  summarise(across(everything(), sum))
  x1 x2 x3
1  4  5  4
#    summarise_all(funs(sum)) # funs() has been deprecated

Note: while this exercise was useful to learn how to replace NA with 0, the same result could be obtained in this case with:

# funs() has been deprecated: replace with list()
df |> summarise_all(list(sum), na.rm = TRUE)
  x1 x2 x3
1  4  5  4

To summarize each row:

The original code was:

df %>%
    replace(is.na(.), 0) %>%
    mutate(sum = rowSums(.[1:3]))

To make it work with the new R-base pipe, it is necessary to create a function on the fly (a lambda function with no specific name.) This works but clearly the use of the %>% is much more elegant!

df |> 
   (\(x) replace(x, is.na(x), 0))() |>
  (function(x) { mutate(x, sum = rowSums(x[1:3]))})()
   x1 x2 x3 sum
1   1  1  0   2
2   0  1  1   2
3   0  0  0   0
4   0  1  1   2
5   0  1  1   2
6   1  0  0   1
7   1  0  0   1
8   0  0  0   0
9   0  0  0   0
10  1  1  1   3

3.9 Count the number of cases or rows

This is very important, because when we summarize the data in descriptive statistics we should take into account the number of observations that have been used for this:

I change n for N in the original code to clearly distinguish the variable from the counting function n(). I also change the code to compute the mean for all the relevant columns.

knitr::kable(
iris |>
group_by(Species) |>
summarise(N = n(), 
          meanSL = mean(Sepal.Length),
          meanSW = mean(Sepal.Width), 
          meanPL = mean(Petal.Length),
          meanPW= mean(Petal.Width))
)
Species N meanSL meanSW meanPL meanPW
setosa 50 5.006 3.428 1.462 0.246
versicolor 50 5.936 2.770 4.260 1.326
virginica 50 6.588 2.974 5.552 2.026

You can also use the count function:

iris |>
group_by(Species) |>
count()
# A tibble: 3 × 2
# Groups:   Species [3]
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

3.10 Count the number of cases or unique rows (with NAs)

When we have a data set with NAs we can count the number of cases or unique rows with the n_distinct() function.

We create a dummy data set for the example.

data = data.frame(aa=c(1,2,3,4,NA), bb=c('a', 'b', 'a', 'c', 'c'))
data
  aa bb
1  1  a
2  2  b
3  3  a
4  4  c
5 NA  c
data |>                    
  filter(!is.na(aa)) |>    
  group_by(bb) |>          
  summarise(Unique_Elements = n_distinct(aa)) 
# A tibble: 3 × 2
  bb    Unique_Elements
  <chr>           <int>
1 a                   2
2 b                   1
3 c                   1

For example, here we have selected the cases that do not have NA in the variable “aa” and counted the different cases for the variable “bb”.

Note that the value c in column bb associated with NA in column aa is not counted and not reported.

3.11 The 11th exercise: add a histogram

To show that plots are automatically embedded within the final export of an Rmarkdown document, we improvised a histogram with ggplot2:

ggplot(iris) +
  aes(x = Petal.Length, y = Sepal.Width, colour = Species) +
  geom_point(shape = "circle", size = 1.5) +
  scale_color_hue(direction = 1) +
  theme_minimal() Example plot pf Sepal vs Petal for 3 iris species.

4 Image Credits

Iris depiction from Rosebrock (2019).

References

Máxima Formación S.L. 2022. “Top 10: Manipulación de Datos Con Dplyr.” https://www.maximaformacion.es/blog-ciencia-datos/top-10-manipulacion-de-datos-con-dplyr/.
Rosebrock, Adrian. 2019. “Machine Learning in Python – PyImageSearch.” https://pyimagesearch.com/2019/01/14/machine-learning-in-python/.
Sgro, Jean-Yves. 2023. “Data Exploration in R with Tidyverse Dplyr – Part 1 – Overview.” https://bcrf.biochem.wisc.edu/2023/12/26/data-exploration-in-r-with-tidyverse-dplyr-part-1-overview/.
Velásquez, Isabella. 2022. “Understanding the Native R Pipe |>.” https://towardsdatascience.com/understanding-the-native-r-pipe-98dea6d8b61b.