8 dplyr
In the introductory vignette we learned that creating tidy eval functions boils down to a single pattern: quote and unquote. In this vignette we’ll apply this pattern in a series of recipes for dplyr.
This vignette is organised so that you can quickly find your way to a copy-paste solution when you face an immediate problem.
8.1 Patterns for single arguments
8.1.1 enquo() and !! - Quote and unquote arguments
We start with a quick recap of the introductory vignette. Creating a function around dplyr pipelines involves three steps: abstraction, quoting, and unquoting.
Abstraction step
First identify the varying parts:
df1 %>% group_by(x1) %>% summarise(mean = mean(y1))
df2 %>% group_by(x2) %>% summarise(mean = mean(y2))
df3 %>% group_by(x3) %>% summarise(mean = mean(y3))
df4 %>% group_by(x4) %>% summarise(mean = mean(y4))
And abstract those away with informative argument names:
data %>% group_by(group_var) %>% summarise(mean = mean(summary_var))
And wrap in a function:
grouped_mean <- function(data, group_var, summary_var) {
  data %>%
    group_by(group_var) %>%
    summarise(mean = mean(summary_var))
}
Quoting step
Identify all the arguments where the user is allowed to refer to data frame columns directly. The function can’t evaluate these arguments right away. Instead they should be automatically quoted. Apply enquo() to these arguments:
group_var <- enquo(group_var)
summary_var <- enquo(summary_var)
Unquoting step
Identify where these variables are passed to other quoting functions and unquote with !!. In this case we pass group_var to group_by() and summary_var to summarise():
data %>%
  group_by(!!group_var) %>%
  summarise(mean = mean(!!summary_var))
We end up with a function that automatically quotes its arguments group_var and summary_var and unquotes them when they are passed to other quoting functions:
grouped_mean <- function(data, group_var, summary_var) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)
  data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
}
grouped_mean(mtcars, cyl, mpg)
#> # A tibble: 3 x 2
#> cyl mean
#> <dbl> <dbl>
#> 1 4 26.7
#> 2 6 19.7
#> 3 8 15.1
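Because the arguments are quoted rather than evaluated, callers are not limited to bare column names. The sketch below repeats the function from above and passes an expression as the summary variable. The choice of `gear` and `hp / wt` is only illustrative and assumes the built-in `mtcars` data.

```r
library(dplyr)

grouped_mean <- function(data, group_var, summary_var) {
  # Quote the arguments supplied by the caller
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  # Unquote them inside the dplyr quoting functions
  data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
}

# The summary argument is an expression, not just a column name
by_gear <- grouped_mean(mtcars, gear, hp / wt)
```

The expression `hp / wt` is only evaluated once `summarise()` runs it in the context of each group.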
8.1.2 quo_name() - Create default column names
Use quo_name() to transform a quoted expression to a column name:
simple_var <- quote(height)
quo_name(simple_var)
#> [1] "height"
These names are only a default stopgap. For more complex uses, you’ll probably want to let the user override the default. Here is a case where the default name is clearly suboptimal:
complex_var <- quote(mean(height, na.rm = TRUE))
quo_name(complex_var)
#> [1] "mean(height, na.rm = TRUE)"
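Inside a function, the same conversion is typically applied to an argument captured with enquo(). The following is a minimal sketch; `default_name()` is a hypothetical helper introduced only for illustration.

```r
library(rlang)

# Hypothetical helper: quote the argument, then turn it into a string
default_name <- function(var) {
  quo_name(enquo(var))
}

simple_nm <- default_name(height)
complex_nm <- default_name(mean(height, na.rm = TRUE))
```

Here `simple_nm` is "height" while `complex_nm` is the full deparsed call, which is why a user-facing override is usually worth offering.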
8.1.3 := and !! - Unquote column names
In expressions like c(name = NA), the argument name is quoted. Because of the quoting it’s not possible to make an indirect reference to a variable that contains a name:
name <- "the real name"
c(name = NA)
#> name
#> NA
In tidy eval functions it is possible to unquote argument names with !!. However you need the special := operator:
rlang::qq_show(c(!!name := NA))
#> c("the real name" := NA)
This unusual operator is needed because using ! on the left-hand side of = is not valid R code:
rlang::qq_show(c(!!name = NA))
#> Error: <text>:1:25: unexpected '='
#> 1: rlang::qq_show(c(!!name =
#> ^
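To see the effect outside of dplyr, here is a small sketch using rlang::list2(), a variant of list() that supports unquoting names with :=. The variable names `direct` and `indirect` are illustrative only.

```r
library(rlang)

name <- "the real name"

# Base R quotes the argument name, so the variable is ignored
direct <- list(name = NA)

# With `:=` the name is taken from the contents of `name`
indirect <- list2(!!name := NA)

names(direct)
names(indirect)
```

The first list is named "name" and the second "the real name".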
Let’s use this !! technique to pass custom column names to group_by() and summarise():
grouped_mean <- function(data, group_var, summary_var) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  # Create default column names
  group_nm <- quo_name(group_var)
  summary_nm <- quo_name(summary_var)

  # Prepend with an informative prefix
  group_nm <- paste0("group_", group_nm)
  summary_nm <- paste0("mean_", summary_nm)

  data %>%
    group_by(!!group_nm := !!group_var) %>%
    summarise(!!summary_nm := mean(!!summary_var))
}
grouped_mean(mtcars, cyl, mpg)
#> # A tibble: 3 x 2
#> group_cyl mean_mpg
#> <dbl> <dbl>
#> 1 4 26.7
#> 2 6 19.7
#> 3 8 15.1
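As noted earlier, default names are a stopgap and it is often worth letting the user override them. One possible way to do that is sketched below. The function name `grouped_mean_named()` and its `summary_nm` argument are assumptions made for this example, not part of dplyr.

```r
library(dplyr)
library(rlang)

# `summary_nm` is a hypothetical optional argument: NULL means "use the default"
grouped_mean_named <- function(data, group_var, summary_var, summary_nm = NULL) {
  group_var <- enquo(group_var)
  summary_var <- enquo(summary_var)

  # Fall back to a prefixed default name when none is supplied
  if (is.null(summary_nm)) {
    summary_nm <- paste0("mean_", quo_name(summary_var))
  }

  data %>%
    group_by(!!group_var) %>%
    summarise(!!summary_nm := mean(!!summary_var))
}

default_nm <- grouped_mean_named(mtcars, cyl, mpg)
custom_nm <- grouped_mean_named(mtcars, cyl, mpg, summary_nm = "avg_mpg")
```

The first call produces a `mean_mpg` column and the second an `avg_mpg` column.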
8.2 Patterns for multiple arguments
8.2.1 ... - Forward multiple arguments
We have created a function that takes one grouping variable and one summary variable. It would make sense to take multiple grouping variables instead of just one. Let’s adjust our function with a ... argument.
- Replace group_var by ...:
function(data, ..., summary_var)
- Swap ... and summary_var because arguments on the right-hand side of ... are harder to pass. They can only be passed with their full name explicitly specified, while arguments on the left-hand side can be passed without name:
function(data, summary_var, ...)
- It’s good practice to prefix named arguments with a . to reduce the risk of conflicts between your arguments and the arguments passed to ...:
function(.data, .summary_var, ...)
Because of the magic of dots forwarding we don’t have to use the quote-and-unquote pattern. We can just pass ... to other quoting functions like group_by():
grouped_mean <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)
  .data %>%
    group_by(...) %>%  # Forward `...`
    summarise(mean = mean(!!summary_var))
}
grouped_mean(mtcars, disp, cyl, am)
#> # A tibble: 6 x 3
#> # Groups: cyl [3]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> # … with 1 more row
Forwarding ... is straightforward but has the downside that you can’t modify the arguments or their names.
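Forwarded arguments reach group_by() exactly as the caller wrote them, names included. The sketch below repeats the function and passes a named grouping expression. The name `cylinders` is chosen only for illustration.

```r
library(dplyr)

grouped_mean <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)
  .data %>%
    group_by(...) %>%  # Names given by the caller are forwarded untouched
    summarise(mean = mean(!!summary_var))
}

# The grouping column takes the caller's name, not a name chosen by the function
renamed <- grouped_mean(mtcars, disp, cylinders = cyl)
```

The function itself has no opportunity to add a prefix or otherwise adjust that name, which is the limitation described above.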
8.2.2 enquos() and !!! - Quote and unquote multiple arguments
Quoting and unquoting multiple variables with ... is pretty much the same process as for single arguments:
- Quoting multiple arguments can be done in two ways: internal quoting with the plural variant enquos() and external quoting with vars(). Use internal quoting when your function takes expressions with ... and external quoting when your function takes a list of expressions.
- Unquoting multiple arguments requires a variant of !!, the unquote-splice operator !!!, which unquotes each element of a list as an independent argument in the surrounding function call.
Quote the dots with enquos() and unquote-splice them with !!!:
grouped_mean2 <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)
  group_vars <- enquos(...)  # Get a list of quoted dots
  .data %>%
    group_by(!!!group_vars) %>%  # Unquote-splice the list
    summarise(mean = mean(!!summary_var))
}
grouped_mean2(mtcars, disp, cyl, am)
#> # A tibble: 6 x 3
#> # Groups: cyl [3]
#> cyl am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> # … with 1 more row
The quote-and-unquote pattern does more work than simple forwarding of ... and is functionally identical. Don’t do this extra work unless you need to modify the arguments or their names.
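What the extra work buys is access to the quoted dots as an ordinary list that can be inspected before it is spliced. A small sketch follows; `capture_dots()` is a hypothetical helper used only to show what enquos() returns.

```r
library(rlang)

# Hypothetical helper: quote the dots and describe each element as a string
capture_dots <- function(...) {
  quos <- enquos(...)
  # One quosure per argument; quo_name() turns each into text
  unname(purrr::map_chr(quos, quo_name))
}

labels <- capture_dots(cyl, am == 1)
```

Each argument arrives as a separate list element, here labelled "cyl" and "am == 1", so it can be renamed or transformed individually.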
8.2.3 expr() - Modify quoted arguments
Modifying quoted expressions is often necessary when dealing with multiple arguments. Say we’d like a grouped_mean() variant that takes multiple summary variables rather than multiple grouping variables. We need to somehow take the mean() of each summary variable.
One easy way is to use the quote-and-unquote pattern with expr(). This function is just like quote() from base R. It plainly returns your argument, quoted:
quote(height)
#> height
expr(height)
#> height
quote(mean(height))
#> mean(height)
expr(mean(height))
#> mean(height)
But expr() has a twist: it has full unquoting support:
vars <- list(quote(height), quote(mass))
expr(mean(!!vars[[1]]))
#> mean(height)
expr(group_by(!!!vars))
#> group_by(height, mass)
You can loop over a list of arguments and modify each of them:
purrr::map(vars, function(var) expr(mean(!!var, na.rm = TRUE)))
#> [[1]]
#> mean(height, na.rm = TRUE)
#>
#> [[2]]
#> mean(mass, na.rm = TRUE)
This makes it easy to take multiple summary variables and wrap them in a call to mean() before unquote-splicing within summarise():
grouped_mean3 <- function(.data, .group_var, ...) {
  group_var <- enquo(.group_var)
  summary_vars <- enquos(...)  # Get a list of quoted summary variables

  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })

  .data %>%
    group_by(!!group_var) %>%
    summarise(!!!summary_vars)  # Unquote-splice the list
}
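As a usage sketch, the function can be called with several summary variables, optionally named. The example assumes the built-in `mtcars` data, and the names `d` and `h` are illustrative.

```r
library(dplyr)
library(rlang)

grouped_mean3 <- function(.data, .group_var, ...) {
  group_var <- enquo(.group_var)
  summary_vars <- enquos(...)

  # Wrap each quoted summary variable in a call to mean()
  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })

  .data %>%
    group_by(!!group_var) %>%
    summarise(!!!summary_vars)
}

# Unnamed summary variables get default names derived from the expressions
by_cyl <- grouped_mean3(mtcars, cyl, disp, hp)

# Named summary variables keep the names supplied by the caller
by_cyl_named <- grouped_mean3(mtcars, cyl, d = disp, h = hp)
```

Names supplied in the dots survive the purrr::map() step, so `by_cyl_named` has columns `cyl`, `d` and `h`.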
8.2.4 vars() - Quote multiple arguments externally
How could we take multiple summary variables in addition to multiple grouping variables? Internal quoting with ... has a major disadvantage: the arguments in ... can only have one purpose. If you need to quote multiple sets of variables you have to delegate the quoting to another function. That’s the purpose of vars() which quotes its arguments and returns a list:
vars(species, gender)
#> <list_of<quosure>>
#>
#> [[1]]
#> <quosure>
#> expr: ^species
#> env: global
#>
#> [[2]]
#> <quosure>
#> expr: ^gender
#> env: global
The arguments can be complex expressions and have names:
vars(h = height, m = mass / 100)
#> <list_of<quosure>>
#>
#> $h
#> <quosure>
#> expr: ^height
#> env: global
#>
#> $m
#> <quosure>
#> expr: ^mass / 100
#> env: global
When the quoting is external you don’t use enquos(). Simply take lists of expressions in your function and forward the lists to other quoting functions with !!!:
grouped_mean3 <- function(data, group_vars, summary_vars) {
  stopifnot(
    is.list(group_vars),
    is.list(summary_vars)
  )

  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })

  data %>%
    group_by(!!!group_vars) %>%
    summarise(n = n(), !!!summary_vars)
}
grouped_mean3(starwars, vars(species, gender), vars(height))
#> # A tibble: 43 x 4
#> # Groups: species [38]
#> species gender n `mean(height, na.rm = TRUE)`
#> <chr> <chr> <int> <dbl>
#> 1 <NA> female 3 137
#> 2 <NA> male 2 183
#> 3 Aleena male 1 79
#> 4 Besalisk male 1 198
#> 5 Cerean male 1 198
#> # … with 38 more rows
grouped_mean3(starwars, vars(gender), vars(height, mass))
#> # A tibble: 5 x 4
#> gender n `mean(height, na.rm = TRUE… `mean(mass, na.rm = TRUE…
#> <chr> <int> <dbl> <dbl>
#> 1 <NA> 3 120 46.3
#> 2 female 19 165. 54.0
#> 3 hermaphrodite 1 175 1358
#> 4 male 62 179. 81.0
#> 5 none 2 200 140
One advantage of vars() is that it lets users specify their own names:
grouped_mean3(starwars, vars(gender), vars(h = height, m = mass))
#> # A tibble: 5 x 4
#> gender n h m
#> <chr> <int> <dbl> <dbl>
#> 1 <NA> 3 120 46.3
#> 2 female 19 165. 54.0
#> 3 hermaphrodite 1 175 1358
#> 4 male 62 179. 81.0
#> 5 none 2 200 140
8.2.5 enquos(.named = TRUE) - Automatically add default names
If you pass .named = TRUE to enquos() the unnamed expressions are automatically given default names:
f <- function(...) names(enquos(..., .named = TRUE))
f(height, mean(mass))
#> [1] "height" "mean(mass)"
User-supplied names are never overridden:
f(height, m = mean(mass))
#> [1] "height" "m"
This is handy when you need to modify the names of quoted expressions. In this example we’ll ensure the list is named before adding a prefix:
grouped_mean2 <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)
  group_vars <- enquos(..., .named = TRUE)  # Ensure quoted dots are named

  # Prefix the names of the list of quoted dots
  names(group_vars) <- paste0("group_", names(group_vars))

  .data %>%
    group_by(!!!group_vars) %>%  # Unquote-splice the list
    summarise(mean = mean(!!summary_var))
}
grouped_mean2(mtcars, disp, cyl, am)
#> # A tibble: 6 x 3
#> # Groups: group_cyl [3]
#> group_cyl group_am mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> # … with 1 more row
One big downside of this technique is that all arguments get a prefix, including the arguments that were given specific names by the user:
grouped_mean2(mtcars, disp, c = cyl, a = am)
#> # A tibble: 6 x 3
#> # Groups: group_c [3]
#> group_c group_a mean
#> <dbl> <dbl> <dbl>
#> 1 4 0 136.
#> 2 4 1 93.6
#> 3 6 0 205.
#> 4 6 1 155
#> 5 8 0 358.
#> # … with 1 more row
In general it’s better to preserve the names explicitly passed by the user. To do that we can’t automatically add default names with enquos() because once the list is fully named we don’t have any way of detecting which arguments were passed with explicit names. We’ll have to add default names manually with quos_auto_name().
8.2.6 quos_auto_name() - Manually add default names
It can be helpful to add default names to the list of quoted dots manually:
- We can detect which arguments were explicitly named by the user.
- The default names can be applied to lists returned by vars().
Let’s add default names manually with quos_auto_name() to lists of externally quoted variables. We’ll detect unnamed arguments and only add a prefix to this subset of arguments. This way we preserve user-supplied names:
grouped_mean3 <- function(data, group_vars, summary_vars) {
  stopifnot(
    is.list(group_vars),
    is.list(summary_vars)
  )

  # Detect and prefix unnamed arguments:
  unnamed <- names(summary_vars) == ""

  # Add the default names:
  summary_vars <- rlang::quos_auto_name(summary_vars)

  prefixed_nms <- paste0("mean_", names(summary_vars)[unnamed])
  names(summary_vars)[unnamed] <- prefixed_nms

  # Expand the argument _after_ giving the list its default names
  summary_vars <- purrr::map(summary_vars, function(var) {
    expr(mean(!!var, na.rm = TRUE))
  })

  data %>%
    group_by(!!!group_vars) %>%
    summarise(n = n(), !!!summary_vars)  # Unquote-splice the renamed list
}
Note how we add the default names before wrapping the arguments in a mean() call. This way we avoid including mean() in the name:
quo_name(quote(mass))
#> [1] "mass"
quo_name(quote(mean(mass, na.rm = TRUE)))
#> [1] "mean(mass, na.rm = TRUE)"
We get nicely prefixed default names:
grouped_mean3(starwars, vars(gender), vars(height, mass))
#> # A tibble: 5 x 4
#> gender n mean_height mean_mass
#> <chr> <int> <dbl> <dbl>
#> 1 <NA> 3 120 46.3
#> 2 female 19 165. 54.0
#> 3 hermaphrodite 1 175 1358
#> 4 male 62 179. 81.0
#> 5 none 2 200 140
And the user is able to fully override the names:
grouped_mean3(starwars, vars(gender), vars(h = height, m = mass))
#> # A tibble: 5 x 4
#> gender n h m
#> <chr> <int> <dbl> <dbl>
#> 1 <NA> 3 120 46.3
#> 2 female 19 165. 54.0
#> 3 hermaphrodite 1 175 1358
#> 4 male 62 179. 81.0
#> 5 none 2 200 140
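The same approach should also work with internal quoting. As a sketch, here is a variant of the earlier `grouped_mean2()` that only prefixes the grouping variables the user left unnamed. This is one possible adaptation under the same assumptions as the function above, not a canonical implementation.

```r
library(dplyr)

grouped_mean2 <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)
  group_vars <- enquos(...)

  # Record which dots were unnamed _before_ adding default names
  unnamed <- names(group_vars) == ""
  group_vars <- rlang::quos_auto_name(group_vars)

  # Prefix only the default names, leaving user-supplied names alone
  names(group_vars)[unnamed] <- paste0("group_", names(group_vars)[unnamed])

  .data %>%
    group_by(!!!group_vars) %>%
    summarise(mean = mean(!!summary_var))
}

res <- grouped_mean2(mtcars, disp, cyl, a = am)
```

With this version the unnamed `cyl` becomes `group_cyl`, while the explicitly named `a = am` keeps the name `a`.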
8.3 select()
TODO
8.4 filter()
TODO
8.5 case_when()
TODO
8.6 Gotchas
8.6.1 Nested quoting functions
https://stackoverflow.com/questions/51902438/rlangsym-in-anonymous-functions