Data exploration in R with Tidyverse dplyr – part 1 – overview

Summary

Part 1 gives a general overview of the dplyr package for data exploration and manipulation, used to summarize or extract information. Part 2, the next post, will provide worked examples.

Important verbs of dplyr, a Tidyverse R package.

 

1 Motivation

The Biochemistry R Club is following the new 2nd edition of R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund (2023)). The Club sessions are recorded in a local UW GitLab repository, R studygroup, which is accessible to all. At the time of writing we are studying in depth the analysis of tabular data with the dplyr R package (Wickham et al. (2023)), which is part of the Tidyverse series of packages (Wickham (2023)) that can augment any base R (R Core Team (2023)) installation.

This post and the one following are inspired by two Spanish-language posts about dplyr discovered on LinkedIn. While translating the first post (Máxima Formación S.L. (2023)), I kept the spirit of defining dplyr verbs and grammar, adapted and clarified some of the text, and added the bibliography.

For the second post (Máxima Formación S.L. (2022b)), I replaced the %>% pipe used in the Spanish post with the newer base R pipe |>.

2 Quick and efficient data manipulation with dplyr in R

A casual survey published by Lohr (2014) in the New York Times suggested that:

“Data scientists […] spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” – NYTimes (2014).

Learning how to clean, organize (i.e., manipulate), and summarize data with the R package dplyr saves time and effort during the analysis phase and helps produce meaningful results.

To use dplyr, install the tidyverse package and load it. See also dplyr.tidyverse.org.

# Install from CRAN
install.packages("tidyverse")
library(tidyverse)

The first command installs the Tidyverse packages (ggplot2, dplyr, etc.), and the second command loads the core ones.

2.1 dplyr in action: using verbs

The dplyr package is part of a collection of R packages for data science, the “Tidyverse”, in which all packages share an underlying design philosophy, grammar, and data structures. The dplyr package uses a data manipulation grammar, which provides a consistent set of verbs that help solve the most common data manipulation challenges:

  • mutate(): adds new variables (columns) or changes existing ones, usually by calculation.
  • select(): chooses variables (columns) based on their names.
  • filter(): chooses observations (rows) based on their values.
  • summarise(): reduces multiple values to a single summary value.
  • arrange(): changes the order of rows based on their values.
  • group_by(): defines groups by one or more variables so that operations are performed per group.
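
As a minimal sketch, each verb can be tried on its own with the built-in mtcars dataset. The object names and the new column kpl (kilometers per liter) are illustrative choices, not taken from the original posts.

```r
library(dplyr)

# mutate(): add a column (kpl is a hypothetical name; 1 mpg is about 0.425 km/l)
cars_kpl <- mutate(mtcars, kpl = mpg * 0.425)

# select(): keep only some columns, chosen by name
cars_small <- select(mtcars, mpg, cyl, hp)

# filter(): keep only the rows that satisfy a condition
cars_4cyl <- filter(mtcars, cyl == 4)

# arrange(): reorder rows, here from highest to lowest mpg
cars_sorted <- arrange(mtcars, desc(mpg))

# summarise(): collapse many values into a single summary row
cars_mean <- summarise(mtcars, mean_mpg = mean(mpg))

# group_by() + summarise(): one summary row per group
cars_by_cyl <- summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg))
```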

Most data are assumed to be contained in a tabular format with rows and columns, called a data frame. All verbs function similarly (Wickham and Grolemund (2017)):

  1. The first argument is a data frame.
  2. Subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
  3. The result is a new data frame.

Together, these properties make it easy to chain together multiple simple steps to achieve a complex result.
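
The three properties can be seen in a short sketch that uses intermediate objects. The names four_cyl and four_cyl_mpg are illustrative only.

```r
library(dplyr)

# 1. The first argument is a data frame (mtcars).
# 2. Later arguments use bare column names (cyl, mpg), without quotes.
four_cyl <- filter(mtcars, cyl == 4)
four_cyl_mpg <- select(four_cyl, mpg)

# 3. Each result is a new data frame; mtcars itself is left unchanged.
is.data.frame(four_cyl_mpg)
dim(mtcars)
```

Because each verb takes a data frame and returns one, the output of one step can serve directly as the input of the next.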

If multiple manipulations are required to achieve the desired result, a “piping” method passes the result of one command to the next command. The pipe symbol |> is now part of base R (version 4.1 and later). The pipe makes R code read more like a spoken language if one reads the pipe symbol as “then” or “and then”.

“Spoken languages consist of simple words that you combine into sentences to create sophisticated thoughts.” (Wickham and Grolemund (2017)).

The “piping” method consists of “passing along” the results of one command to the next command, creating a fluid flow of data that does not need intermediate files. In that sense the pipe helps make R expressive, like a spoken language.

The example R code below can then be read as:

  • I start with the mtcars built-in dataset,
  • “and then” I keep only the cars with 4 cylinders (cyl variable/column),
  • “and then” I show only the first few lines with the head() function.

library(tidyverse)
mtcars |>
  filter(cyl == 4) |>
  head()
                mpg cyl  disp hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1
Merc 240D      24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1

If we also want to see only the fuel economy, stored in the mpg (miles per gallon) column, we can add a select() step that keeps only that column:

mtcars |>
  filter(cyl == 4) |>
  select(mpg)
                mpg
Datsun 710     22.8
Merc 240D      24.4
Merc 230       22.8
Fiat 128       32.4
Honda Civic    30.4
Toyota Corolla 33.9
Toyota Corona  21.5
Fiat X1-9      27.3
Porsche 914-2  26.0
Lotus Europa   30.4
Volvo 142E     21.4

Thus we have “piped” together a stream of data: it starts with the mtcars dataset, which is passed to the filter() function to keep only the rows with 4 cylinders, and that result is finally passed to the select() function, which keeps only the mpg column.

Operations are thus added one after another with the |> pipe.
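
A slightly longer chain, sketched here with mtcars, shows how further verbs can be appended in the same way. The column names mean_mpg and n_cars are illustrative. The nested version is included only to show what the pipe saves us from writing.

```r
library(dplyr)

# Piped version: read each |> as "and then"
mpg_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg), n_cars = n()) |>
  arrange(desc(mean_mpg))

# The same steps as nested calls, which must be read from the inside out
mpg_by_cyl_nested <- arrange(
  summarise(group_by(mtcars, cyl), mean_mpg = mean(mpg), n_cars = n()),
  desc(mean_mpg)
)

# Check that both approaches give the same result
identical(mpg_by_cyl, mpg_by_cyl_nested)
```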

2.2 What’s so special about dplyr?

  1. dplyr offers many functions that perform common data manipulation operations: filtering rows based on criteria, selecting columns, sorting data, adding or deleting columns, and summarizing data in concise tables.
  2. dplyr is very easy to learn and use. Function names are easy to remember and clearly describe what they do. For example, filter() is used to filter rows, as the name suggests.
  3. dplyr functions are written to be computationally efficient and thus often run faster than the base R functions that could be used for the same tasks. They also have a clearer syntax and are designed around data frames.
  4. dplyr uses the same unifying language as other tidyverse packages, such as ggplot2, which will make it much easier for us to use.

2.3 Tips for data manipulation

  1. Missing values. Most aggregation functions, such as mean() and sum(), accept the na.rm argument to remove missing values before calculation.
  2. Counts. Whenever performing an aggregation, it is a good idea to include a count (n()) or a count of non-missing values (sum(!is.na(x))) to check that conclusions are not based on very small amounts of data.
  3. Ungrouping. The ungroup() function removes the grouping previously set by group_by(), so that subsequent operations use ungrouped data.
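
The three tips can be sketched together. This example assumes the built-in airquality dataset, chosen here because its Ozone column contains missing values; the new column names are illustrative.

```r
library(dplyr)

# Tips 1 and 2: summarise per month, removing NAs and reporting counts
ozone_by_month <- airquality |>
  group_by(Month) |>
  summarise(
    mean_ozone    = mean(Ozone, na.rm = TRUE),  # drop NAs before averaging
    n_rows        = n(),                        # rows in each group
    n_non_missing = sum(!is.na(Ozone))          # values actually used
  )

# Tip 3: ungroup() so that later steps work on the whole table again
ozone_means <- airquality |>
  group_by(Month) |>
  mutate(month_mean = mean(Ozone, na.rm = TRUE)) |>    # computed per month
  ungroup() |>
  mutate(overall_mean = mean(Ozone, na.rm = TRUE))     # computed over all rows
```

Comparing n_rows with n_non_missing shows at a glance how many observations each monthly mean is really based on.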

2.4 Star Wars dataset demonstrations

“Cool” demonstrations of dplyr are available on many web pages. They use the Star Wars characters from the starwars dataset, which is installed with dplyr as tabular data (a “tibble”/“data frame”) with 87 rows and 14 variables.
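
For a quick local look at the dataset, here is a minimal sketch that counts characters per species. The name species_summary and the threshold of more than one character are arbitrary choices for illustration.

```r
library(dplyr)

# starwars ships with dplyr; dim() reports its rows and columns
dim(starwars)

# Count characters per species and average their height,
# keeping only species with more than one character
species_summary <- starwars |>
  group_by(species) |>
  summarise(n = n(), mean_height = mean(height, na.rm = TRUE)) |>
  filter(n > 1) |>
  arrange(desc(n))

species_summary
```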

Here are 2 of many that can be found online:

3 Cheat Sheets

Many cheat sheets are available from Posit (formerly RStudio) at https://posit.co/resources/cheatsheets/

4 Image Credits

The earliest source I could trace for the image was published on September 5, 2019 by Griesemer (2019), but it might itself have been derived from elsewhere. The hexagonal logo was replaced with its current version.

5 References

Griesemer, Jeff. 2019. “Data Manipulation in R with Dplyr.” https://towardsdatascience.com/data-manipulation-in-r-with-dplyr-3095e0867f75.
Lohr, Steve. 2014. “For Big-Data Scientists, ’Janitor Work’ Is Key Hurdle to Insights.” The New York Times, August. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html.
Máxima Formación S.L. 2022a. “Manipulación de Datos Rápida y Efectiva Con Dplyr.” https://www.maximaformacion.es/blog-ciencia-datos/manipulacion-de-datos-rapida-y-efectiva-con-dplyr/.
———. 2022b. “Top 10: Manipulación de Datos Con Dplyr.” https://www.maximaformacion.es/blog-ciencia-datos/top-10-manipulacion-de-datos-con-dplyr/.
———. 2023. “Manipulación de Datos Rápida y Efectiva Con Dplyr.” https://www.linkedin.com/pulse/manipulaci%C3%B3n-de-datos-r%C3%A1pida-y-efectiva-con-dplyr-maximaformacion.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science, 2nd Edition. 2nd ed. O’Reilly Media. https://r4ds.hadley.nz/.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2022. “Introduction to Dplyr.” https://dplyr.tidyverse.org/articles/dplyr.html.
———. 2023. Dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. Paperback; O’Reilly Media. http://r4ds.had.co.nz/.