Summary
Part 1: A general overview of the dplyr package for data exploration and manipulation to summarize or extract information. Part 2: The next post will provide worked examples.
1 Motivation
The Biochemistry R Club is following the new 2nd edition of R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund (2023)). The Club sessions are recorded in a local UW GitLab repository, R studygroup, and are accessible to all. At the time of writing we are studying in depth the analysis of tabular data with the dplyr R package (Wickham et al. (2023)), which is part of the Tidyverse series of packages (Wickham (2023)) that can augment any base R (R Core Team (2023)) installation.
This post and the one following are inspired by two Spanish-language posts about dplyr discovered on LinkedIn. While translating the first post (Máxima Formación S.L. (2023)) I have kept the spirit of defining dplyr verbs and grammar, adapted some of the text, and clarified and added the bibliography.
For the second post (Máxima Formación S.L. (2022b)) I updated the pipe to the new base R pipe |>, replacing the %>% pipe used in the Spanish post.
2 Quick and efficient data manipulation with dplyr in R
A casual survey published by Lohr (2014) in the New York Times suggested that:
“Data scientists […] spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” – NYTimes (2014).
Learning how to clean, organize (i.e., manipulate), and summarize data with the R package dplyr will save time and effort during the analysis phase and help produce meaningful results.
To use dplyr, install tidyverse and activate it. See also dplyr.tidyverse.org.
# Install from CRAN
install.packages("tidyverse")
library(tidyverse)
The first command installs the Tidyverse packages (ggplot2, dplyr, etc.); the second command activates the core packages, dplyr among them.
2.1 dplyr in action: using verbs
The dplyr package is part of a collection of R packages for data science, the “Tidyverse”, in which all packages share an underlying design philosophy, grammar, and data structures. The dplyr package uses a data manipulation grammar, which provides a consistent set of verbs that help solve the most common data manipulation challenges:
- mutate(): adds, changes, or calculates variables (columns).
- select(): selects variables (columns) based on their names.
- filter(): chooses observations (rows) based on their values.
- summarise(): summarizes, reducing multiple values to a single summary value.
- arrange(): changes the order of rows based on their values.
- group_by(): groups rows so that data operations are performed on groups defined by variables.
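As a minimal sketch of the verbs listed above, each one can be tried in isolation on the built-in mtcars dataset. The object names and the kpl column (a kilometres-per-litre conversion of mpg) are illustrative assumptions, not part of the original post, and only dplyr is loaded here rather than the whole tidyverse.

```r
library(dplyr)

# mutate(): add a column; kpl is a hypothetical name for km per litre
# (1 mile per US gallon is about 0.4251 km per litre)
cars_kpl <- mutate(mtcars, kpl = mpg * 0.4251)

# select(): keep columns by name
cars_small <- select(mtcars, mpg, cyl, wt)

# filter(): keep rows that satisfy a condition
four_cyl <- filter(mtcars, cyl == 4)

# arrange(): reorder rows, here from highest to lowest mpg
by_mpg <- arrange(mtcars, desc(mpg))

# group_by() + summarise(): one summary row per number of cylinders
mpg_by_cyl <- summarise(group_by(mtcars, cyl), n = n(), mean_mpg = mean(mpg))
```

Each call takes a data frame as its first argument and returns a new data frame, leaving mtcars itself unchanged.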
Most data are assumed to be contained in a tabular format with rows and columns, called a data frame. All verbs function similarly (Wickham and Grolemund (2017)):
- The first argument is a data frame.
- Subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
- The result is a new data frame.
Together, these properties make it easy to chain together multiple simple steps to achieve a complex result.
If multiple manipulations are required to achieve the desired result, a “piping” method makes it possible to pass the result of one command to the next command. The pipe symbol |> is now part of a base R installation (version 4.1 and up). The pipe gives R code a similarity to spoken language if one reads the pipe symbol as “then” or “and then”.
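To make the idea concrete, here is a small sketch comparing a nested call with its piped equivalent. It assumes R 4.1 or later (for |>) and loads only dplyr; the names nested and piped are illustrative.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- head(filter(mtcars, cyl == 4))

# Piped form: read top to bottom as "mtcars, then filter, then head"
piped <- mtcars |>
  filter(cyl == 4) |>
  head()

# Both forms run the same calls, so the results can be compared directly
identical(nested, piped)
```

The pipe places the left-hand result in the first argument of the next call, which is why it fits verbs whose first argument is a data frame.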
“Spoken languages consist of simple words that you combine into sentences to create sophisticated thoughts.” (Wickham and Grolemund (2017)).
The “piping” method consists of “passing along” the results of one command to the next command, creating a fluid flow of data that does not need intermediate variables or files. In that sense the pipe helps make R expressive, like a spoken language.
The example R code below can then be read as:
- I start with the mtcars built-in dataset,
- “and then” I keep only the cars with 4 cylinders (cyl variable/column),
- “and then” I show only the first few lines with the head() function.
library(tidyverse)
mtcars |>
  filter(cyl == 4) |>
  head()
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
If we also want to see only their fuel economy, contained in the mpg (miles per gallon) column, we can add a further operation to select only that column:
mtcars |>
  filter(cyl == 4) |>
  select(mpg)
mpg
Datsun 710 22.8
Merc 240D 24.4
Merc 230 22.8
Fiat 128 32.4
Honda Civic 30.4
Toyota Corolla 33.9
Toyota Corona 21.5
Fiat X1-9 27.3
Porsche 914-2 26.0
Lotus Europa 30.4
Volvo 142E 21.4
Thus we have “piped” together a stream of data starting with the mtcars dataset, which was passed to the filter() function to keep only the rows with 4 cylinders; that result was finally passed to the select() function, which kept only the mpg column.
Operations are thus added one after another with the |> pipe.
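As a sketch of how further steps can be appended, the chain below adds a sort and a row limit to the filter and select steps. The name top_four_cyl and the choice of the wt column are illustrative assumptions.

```r
library(dplyr)

# Four-cylinder cars, two columns, sorted from highest to lowest mpg,
# keeping only the first three rows
top_four_cyl <- mtcars |>
  filter(cyl == 4) |>
  select(mpg, wt) |>
  arrange(desc(mpg)) |>
  head(3)

top_four_cyl
```

Each added line changes only one thing, so a step can be commented out or reordered while exploring the data.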
2.2 What’s so special about dplyr?
- dplyr offers many functions that perform common data manipulation operations: filtering rows based on criteria, selecting columns, sorting data, adding or deleting columns, and summarizing data in concise tables.
- dplyr is easy to learn and use. Function names are easy to remember and clearly describe what they do. For example, filter() is used to filter rows, as the name suggests.
- dplyr functions were written to be computationally efficient and so often run faster than the base R functions that could be used for the same tasks.
- dplyr functions have a clearer syntax and offer better support for data sets.
- dplyr uses the same unifying language as other tidyverse packages, such as ggplot2, which makes it much easier to use alongside them.
2.3 Tips for data manipulation
- Missing values. Fortunately, the common aggregation functions accept the na.rm argument to remove missing values before calculation.
- Counts. Whenever performing an aggregation it is a good idea to include a count (n()) or a count of non-missing values (sum(!is.na(x))) to help check that conclusions are not based on very small amounts of data.
- Ungrouping. The ungroup() function removes the grouping previously set by group_by() so that later operations use ungrouped data.
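The three tips above can be sketched together on the starwars dataset that ships with dplyr. The column names n_height, mean_height, and species_mean_height are illustrative assumptions.

```r
library(dplyr)

# Missing values and counts: per-species mean height, together with the
# group size and the number of non-missing heights behind each mean
height_by_species <- starwars |>
  group_by(species) |>
  summarise(
    n = n(),
    n_height = sum(!is.na(height)),
    mean_height = mean(height, na.rm = TRUE)
  ) |>
  arrange(desc(n))

# Ungrouping: add a per-species mean to every row, then drop the grouping
# so that later steps operate on the whole table again
with_species_mean <- starwars |>
  group_by(species) |>
  mutate(species_mean_height = mean(height, na.rm = TRUE)) |>
  ungroup()
```

Comparing n with n_height shows at a glance which means rest on only a few measurements.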
2.4 Star Wars dataset demonstrations
“Cool” demonstrations of dplyr are available on many web pages. They use the Star Wars characters from the starwars dataset, which is installed with dplyr as tabular data (a “tibble”/“data frame”) with 87 rows and 14 variables.
Here are 2 of the many that can be found online:
- Introduction to dplyr “vignette” (Wickham et al. (2022)): from the authors of the dplyr package.
- Rapid data manipulation (in Spanish, Máxima Formación S.L. (2022a)): from the authors of the Spanish posts. Includes photographs, Lego, and computer-generated character illustrations.
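Before following either demonstration, the dataset can be inspected directly. This is a small sketch that assumes a recent version of dplyr, in which starwars has the 14 variables mentioned above.

```r
library(dplyr)

# Number of rows and columns of the built-in tibble
dim(starwars)

# Compact overview: one line per variable, with its type and first values
glimpse(starwars)
```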
3 Cheat Sheets
Many cheat sheets are available from Posit (formerly RStudio) at https://posit.co/resources/cheatsheets/
4 Image Credits
The earliest source to which the image could be traced was published on September 5, 2019 by Griesemer (2019), although it may itself have been derived from elsewhere. The hexagonal logo was replaced with its current version.