3 Do you need tidy eval?
In computer science, frameworks like tidy evaluation are known as metaprogramming. Modifying the blueprints of computations amounts to programming the program, i.e. metaprogramming. In other languages, this type of approach is often seen as a last resort because it requires new skills and might make your code harder to read. Things are different in R because of the importance of data masking functions, but it is still good advice to consider other options before turning to tidy evaluation. In this section, we review several strategies for solving programming problems with tidyverse packages.
Before diving into tidy eval, make sure you know the fundamentals of programming with the tidyverse. These are likely to give a better return on the time you invest and will also help you solve problems outside the tidyverse.
- Fixed column names. A solid function taking data frames with fixed column names is better than a brittle function that uses tidy eval.
- Automating loops. dplyr excels at automating loops. Acquiring a good command of rowwise vectorisation and columnwise mapping may prove very useful.
Tidy evaluation is not all-or-nothing: it encompasses a wide range of features and techniques. Here are a few that are easy to pick up in your workflow:
- Passing expressions through {{ and ...
- Passing column names to .data[[ and one_of()
All these techniques make it possible to reuse existing components of tidyverse grammars and compose them into new functions.
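As a rough sketch of the first technique in each bullet above, the two functions below forward a column expression with {{ and look up a column by name with .data[[. This assumes dplyr together with rlang 0.4.0 or later, where {{ is available. The names mean_by() and mean_of() are hypothetical helpers made up for this illustration.

```r
library(dplyr)

# Hypothetical helper: `by` and `var` are passed through to dplyr with {{
mean_by <- function(data, by, var) {
  data %>%
    group_by({{ by }}) %>%
    summarise(avg = mean({{ var }}, na.rm = TRUE))
}

# Hypothetical helper: `var_name` is a string naming the column
mean_of <- function(data, var_name) {
  data %>%
    summarise(avg = mean(.data[[var_name]], na.rm = TRUE))
}

# Bare column names work with the {{ version
by_cyl <- mean_by(mtcars, cyl, mpg)

# A character string works with the .data[[ version
overall <- mean_of(mtcars, "mpg")
```

Both helpers reuse group_by() and summarise() as they are. Only the way the column is specified changes.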
3.1 Fixed column names
A simple solution is to write functions that expect data frames containing specific column names. If the computation always operates on the same columns and nothing varies, you don’t need any tidy eval. On the other hand, your users must ensure the existence of these columns as part of their data cleaning process. This is why this technique primarily makes sense when you’re writing functions tailored to your own data analysis uses, or perhaps in functions that interface with a specific web API for retrieving data. In general, fixed column names are task specific.
Say we have a simple pipeline that computes the body mass index for each observation in a tibble:
starwars %>% transmute(bmi = mass / (height / 100)^2)
#> # A tibble: 87 x 1
#> bmi
#> <dbl>
#> 1 26.0
#> 2 26.9
#> 3 34.7
#> 4 33.3
#> 5 21.8
#> # … with 82 more rows
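We could extract this code into a function that takes data frames with columns mass and height. Note that this version divides by height^2 directly, without the conversion from centimeters, so it expects height to be expressed in meters: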
compute_bmi <- function(data) {
data %>% transmute(bmi = mass / height^2)
}
It’s always a good idea to check the inputs of your functions and fail early with an informative error message when their assumptions are not met. In this case, we should validate the data frame and throw an error when it does not contain the expected columns:
compute_bmi <- function(data) {
if (!all(c("mass", "height") %in% names(data))) {
stop("`data` must contain `mass` and `height` columns")
}
data %>% transmute(bmi = mass / height^2)
}
iris %>% compute_bmi()
#> Error in compute_bmi(.): `data` must contain `mass` and `height` columns
In fact, we could go even further and validate the contents of the columns in addition to their names:
compute_bmi <- function(data) {
if (!all(c("mass", "height") %in% names(data))) {
stop("`data` must contain `mass` and `height` columns")
}
mean_height <- round(mean(data$height, na.rm = TRUE), 1)
if (mean_height > 3) {
warning(glue::glue(
"Average height is { mean_height }, is it scaled in meters?"
))
}
data %>% transmute(bmi = mass / height^2)
}
starwars %>% compute_bmi()
#> Warning in compute_bmi(.): Average height is 174.4, is it scaled in meters?
#> # A tibble: 87 x 1
#> bmi
#> <dbl>
#> 1 0.00260
#> 2 0.00269
#> 3 0.00347
#> 4 0.00333
#> 5 0.00218
#> # … with 82 more rows
starwars %>% mutate(height = height / 100) %>% compute_bmi()
#> # A tibble: 87 x 1
#> bmi
#> <dbl>
#> 1 26.0
#> 2 26.9
#> 3 34.7
#> 4 33.3
#> 5 21.8
#> # … with 82 more rows
Spending your programming time on the domain logic of your function, such as input and scale validation, may have a greater payoff than learning tidy eval just to improve its syntax. It makes your function more robust to faulty data and reduces the risks of erroneous analyses.
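The column check used above could also be factored into a small reusable helper that names the columns that are missing. The sketch below uses base R only. check_columns() and compute_bmi_vec() are hypothetical names introduced for this illustration, and the latter returns a plain vector rather than a tibble to keep the example free of dependencies.

```r
# Hypothetical helper: fail early and report which columns are absent
check_columns <- function(data, required) {
  absent <- setdiff(required, names(data))
  if (length(absent) > 0) {
    stop(
      "`data` is missing required columns: ",
      paste(absent, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(data)
}

# Hypothetical variant of compute_bmi() that assumes height is in meters
compute_bmi_vec <- function(data) {
  check_columns(data, c("mass", "height"))
  data$mass / data$height^2
}

bmi <- compute_bmi_vec(data.frame(mass = 77, height = 1.72))

# Capture the error message raised for a data frame without these columns
msg <- tryCatch(
  compute_bmi_vec(iris),
  error = function(e) conditionMessage(e)
)
```

Listing the missing columns in the message saves the caller from comparing names by hand.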
3.2 Automating loops
Most programming problems involve iteration because data transformations are typically achieved element by element, by applying the same recipe over and over again. There are two main ways of automating iteration in R: vectorisation and mapping. Learning how to juggle the different ways of expressing loops is not only an important step towards acquiring a good command of R and the tidyverse, it will also make you more proficient at solving programming problems.
3.2.1 Vectorisation in dplyr
dplyr is designed to optimise iteration by taking advantage of the vectorisation of many R functions. Rowwise vectorisation is achieved through normal R rules, which dplyr augments with groupwise vectorisation.
3.2.1.1 Rowwise vectorisation
Rowwise vectorisation in dplyr is a consequence of normal R rules for vectorisation. A vectorised function is a function that works the same way with vectors of 1 element as with vectors of n elements. The operation is applied elementwise (often at the machine code level, which makes it very efficient). We have already mentioned the vectorisation of toupper(), and many other functions in R are vectorised. One important class of vectorised functions is the arithmetic operators:
# Dividing 1 element
1 / 10
#> [1] 0.1
# Dividing 5 elements
1:5 / 10
#> [1] 0.1 0.2 0.3 0.4 0.5
Technically, a function is vectorised when:
- It returns a vector as long as the input.
- Applying the function on a single element yields the same result as applying it on the whole vector and then subsetting the element.
In other words, a vectorised function fn fulfills the following identity:
fn(x[[i]]) == fn(x)[[i]]
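The two conditions can be checked by hand on a small vector. The snippet below is only an illustration with arbitrarily chosen values: sqrt() satisfies the identity, while mean() already fails the length condition.

```r
x <- c(4, 9, 16)
i <- 2

# sqrt() is vectorised: both sides of the identity agree
sqrt(x[[i]]) == sqrt(x)[[i]]
#> [1] TRUE

# mean() is a summary function: its result has length 1, not length(x)
length(mean(x)) == length(x)
#> [1] FALSE
```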
When you mix vectorised and non-vectorised operations, the combined operation is itself vectorised when the last operation to run is vectorised. Here we’ll combine the vectorised / function with the summary function mean(). The result of this operation is a vector that has the same length as the left-hand side of /:
x <- 1:5
x / mean(x)
#> [1] 0.33 0.67 1.00 1.33 1.67
Note that the other combination of operations is not vectorised because in that case the summary operation has the last word:
mean(x / 10)
#> [1] 0.3
The dplyr verb mutate() expects vector semantics. The operations defining new columns typically return vectors as long as their inputs:
data <- tibble(x = rnorm(5, sd = 10))
data %>%
mutate(rescaled = x / sd(x))
#> # A tibble: 5 x 2
#> x rescaled
#> <dbl> <dbl>
#> 1 -14.0 -1.09
#> 2 2.55 0.199
#> 3 -24.4 -1.90
#> 4 -0.0557 -0.00434
#> 5 6.22 0.484
In fact, mutate() enforces vectorisation. Returning a vector whose size differs from the tibble or group size is an error unless it has size 1. If the result of a mutate expression has size 1, it is automatically recycled to the tibble or group size. This ensures that all columns have the same length and fit within the tibble constraints of rectangular data:
data %>%
mutate(sigma = sd(x))
#> # A tibble: 5 x 2
#> x sigma
#> <dbl> <dbl>
#> 1 -14.0 12.8
#> 2 2.55 12.8
#> 3 -24.4 12.8
#> 4 -0.0557 12.8
#> 5 6.22 12.8
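The error case can be observed without aborting a script by wrapping the call in tryCatch(). This is a minimal sketch on a made-up five-row tibble named small, which is not part of the original example.

```r
library(dplyr)

small <- tibble(x = c(1, 2, 3, 4, 5))

# A size-2 result neither matches the 5 rows nor can be recycled to them
outcome <- tryCatch(
  {
    small %>% mutate(bad = x[1:2])
    "no error"
  },
  error = function(e) "error"
)
outcome
#> [1] "error"
```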
In contrast to mutate(), the dplyr verb summarise() expects summary operations that return a single value:
data %>%
summarise(sd(x))
#> # A tibble: 1 x 1
#> `sd(x)`
#> <dbl>
#> 1 12.8
3.2.1.2 Groupwise vectorisation
Things get interesting with grouped tibbles. dplyr augments the vectorisation of normal R functions with groupwise vectorisation. If your tibble has ngroup groups, the operations are repeated ngroup times.
my_division <- function(x, y) {
message("I was just called")
x / y
}
# Called 1 time
data %>%
mutate(new = my_division(x, 10))
#> I was just called
#> # A tibble: 5 x 2
#> x new
#> <dbl> <dbl>
#> 1 -14.0 -1.40
#> 2 2.55 0.255
#> 3 -24.4 -2.44
#> 4 -0.0557 -0.00557
#> 5 6.22 0.622
gdata <- data %>% group_by(g = c("a", "a", "b", "b", "c"))
# Called 3 times
gdata %>%
mutate(new = my_division(x, 10))
#> I was just called
#> I was just called
#> I was just called
#> # A tibble: 5 x 3
#> # Groups: g [3]
#> x g new
#> <dbl> <chr> <dbl>
#> 1 -14.0 a -1.40
#> 2 2.55 a 0.255
#> 3 -24.4 b -2.44
#> 4 -0.0557 b -0.00557
#> 5 6.22 c 0.622
If the operation is entirely vectorised, the result will be the same whether the tibble is grouped or not, since elementwise computations are not affected by the values of other elements. But as soon as summary operations are involved, the result depends on the grouping structure because the summaries are computed from group sections instead of whole columns.
# Marginal rescaling
data %>%
mutate(new = x / sd(x))
#> # A tibble: 5 x 2
#> x new
#> <dbl> <dbl>
#> 1 -14.0 -1.09
#> 2 2.55 0.199
#> 3 -24.4 -1.90
#> 4 -0.0557 -0.00434
#> 5 6.22 0.484
# Conditional rescaling
gdata %>%
mutate(new = x / sd(x))
#> # A tibble: 5 x 3
#> # Groups: g [3]
#> x g new
#> <dbl> <chr> <dbl>
#> 1 -14.0 a -1.20
#> 2 2.55 a 0.218
#> 3 -24.4 b -1.42
#> 4 -0.0557 b -0.00324
#> 5 6.22 c NA
Whereas rowwise vectorisation automates loops over the elements of a column, groupwise vectorisation automates loops over the levels of a grouping specification. The combination of these is very powerful.
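To make the automated loop visible, the sketch below compares a grouped mutate() with an explicit for loop over the group levels. The tibble df and its values are made up for this illustration.

```r
library(dplyr)

df <- tibble(
  x = c(10, 20, 30, 50, 7),
  g = c("a", "a", "b", "b", "c")
)

# Groupwise vectorisation: dplyr repeats the operation once per group
grouped <- df %>%
  group_by(g) %>%
  mutate(new = x / mean(x)) %>%
  ungroup()

# The same computation with the loop over groups written by hand
manual <- numeric(nrow(df))
for (level in unique(df$g)) {
  rows <- df$g == level
  manual[rows] <- df$x[rows] / mean(df$x[rows])
}

all.equal(grouped$new, manual)
#> [1] TRUE
```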
3.2.2 Looping over columns
Rowwise and groupwise vectorisations are means of looping in the direction of rows, applying the same operation to each group and each element. What if you’d like to apply an operation in the direction of columns? This is possible in dplyr by mapping functions over columns.
Mapping functions is part of the functional programming approach. If you’re going to spend some time learning new programming concepts, acquiring functional programming skills is likely to have a higher payoff than learning about the metaprogramming concepts of tidy evaluation. Functional programming is inherent to R as it underlies the apply() family of functions in base R and the map() family from the purrr package. It is a powerful tool to add to your quiver.
3.2.2.1 Mapping functions
Everything that exists in R is an object, including functions. If you type the name of a function without parentheses, you get the function object instead of the result of calling the function:
toupper
#> function (x)
#> {
#> if (!is.character(x))
#> x <- as.character(x)
#> .Internal(toupper(x))
#> }
#> <bytecode: 0x7fa9ffdbec30>
#> <environment: namespace:base>
In its simplest form, functional programming is about passing a function object as an argument to another function, called a mapper function, that iterates over a vector to apply the function on each element and returns all results in a new vector. In other words, a mapper function writes loops so you don’t have to. Here is a manual loop that applies toupper() over all elements of a character vector and returns a new vector:
new <- character(length(letters))
for (i in seq_along(letters)) {
new[[i]] <- toupper(letters[[i]])
}
new
#> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
#> [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
Using a mapper function results in much leaner code. Here we apply toupper() over all elements of letters and return the results as a character vector, as indicated by the suffix _chr:
new <- purrr::map_chr(letters, toupper)
new
#> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
#> [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
In practice, functional programming is all about hiding for loops, which are abstracted away by the mapper functions that automate the iteration.
Mapping is an elegant way of transforming data element by element, but it’s not the only one. For instance, toupper() is actually a vectorised function that already operates on whole vectors element by element. The fastest and leanest code is just:
toupper(letters)
#> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
#> [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
Mapping functions are more useful with functions that are not vectorised or for computations over lists and data frame columns where the vectorisation occurs within the elements or columns themselves. In the following example, we apply a summarising function over all columns of a data frame:
purrr::map_int(mtcars, n_distinct)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 25 3 27 22 22 29 30 2 2 3 6
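For comparison, the base R apply family mentioned earlier offers a typed mapper as well. The sketch below computes column means with purrr::map_dbl() and with vapply(), assuming purrr is installed. The variable names are chosen only for this illustration.

```r
# purrr: the _dbl suffix requires each result to be a single double
means_purrr <- purrr::map_dbl(mtcars, mean)

# base R: the third argument of vapply() declares the expected result type
means_base <- vapply(mtcars, mean, numeric(1))

# Both approaches return a named numeric vector with one entry per column
same <- isTRUE(all.equal(means_purrr, means_base))
```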
3.2.2.2 Scoped dplyr variants
dplyr provides variants of the main data manipulation verbs that map functions over a selection of columns. These verbs are known as the scoped variants and are recognizable from their _at, _if and _all suffixes.
Scoped verbs support three sorts of selection:
- _all verbs operate on all columns of the data frame. You can summarise all columns of a data frame within groups with summarise_all():
iris %>% group_by(Species) %>% summarise_all(mean)
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 3.43 1.46 0.246
#> 2 versicolor 5.94 2.77 4.26 1.33
#> 3 virginica 6.59 2.97 5.55 2.03
- _if verbs operate conditionally, on all columns for which a predicate returns TRUE. If you are familiar with purrr, the idea is similar to the conditional mapper purrr::map_if(). Promoting all character columns of a data frame as grouping variables is as simple as:
starwars %>% group_by_if(is.character)
#> # A tibble: 87 x 13
#> # Groups: name, hair_color, skin_color, eye_color, gender, homeworld,
#> # species [87]
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Dart… 202 136 none white yellow 41.9 male
#> 5 Leia… 150 49 brown light brown 19 female
#> # … with 82 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
- _at verbs operate on a selection of columns. You can supply integer vectors of column positions or character vectors of column names:
mtcars %>% summarise_at(1:2, mean)
#> mpg cyl
#> 1 20 6.2
mtcars %>% summarise_at(c("disp", "drat"), median)
#> disp drat
#> 1 196 3.7
More interestingly, you can use vars() (see the note at the end of this section) to supply the same sort of expressions you would pass to select()! The selection helpers make it very convenient to craft a selection of columns to map over:
starwars %>% summarise_at(vars(height:mass), mean)
#> # A tibble: 1 x 2
#> height mass
#> <dbl> <dbl>
#> 1 NA NA
starwars %>% summarise_at(vars(ends_with("_color")), n_distinct)
#> # A tibble: 1 x 3
#> hair_color skin_color eye_color
#> <int> <int> <int>
#> 1 13 31 15
The scoped variants of mutate() and summarise() are the closest analogue to base::lapply() and purrr::map(). Unlike pure list mappers, the scoped verbs fully implement the dplyr semantics, such as groupwise vectorisation or the summary constraints:
# map() returns a simple list with the results
mtcars[1:5] %>% purrr::map(mean)
#> $mpg
#> [1] 20
#>
#> $cyl
#> [1] 6.2
#>
#> $disp
#> [1] 231
#>
#> $hp
#> [1] 147
#>
#> $drat
#> [1] 3.6
# `mutate_` variants recycle to group size
mtcars[1:5] %>% mutate_all(mean)
#> # A tibble: 32 x 5
#> mpg cyl disp hp drat
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 20.1 6.19 231. 147. 3.60
#> 2 20.1 6.19 231. 147. 3.60
#> 3 20.1 6.19 231. 147. 3.60
#> 4 20.1 6.19 231. 147. 3.60
#> 5 20.1 6.19 231. 147. 3.60
#> # … with 27 more rows
# `summarise_` variants enforce a size 1 constraint
mtcars[1:5] %>% summarise_all(mean)
#> # A tibble: 1 x 5
#> mpg cyl disp hp drat
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 20.1 6.19 231. 147. 3.60
# All scoped verbs know about groups
mtcars[1:5] %>% group_by(cyl) %>% summarise_all(mean)
#> # A tibble: 3 x 5
#> cyl mpg disp hp drat
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 26.7 105. 82.6 4.07
#> 2 6 19.7 183. 122. 3.59
#> 3 8 15.1 353. 209. 3.23
The other scoped variants also accept optional functions to map over the selection of columns. For instance, you could group by a selection of variables and transform them on the fly:
iris %>% group_by_if(is.factor, as.character)
#> # A tibble: 150 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> # … with 145 more rows
or transform the column names of selected variables:
storms %>% select_at(vars(name:hour), toupper)
#> # A tibble: 10,010 x 5
#> NAME YEAR MONTH DAY HOUR
#> <chr> <dbl> <dbl> <int> <dbl>
#> 1 Amy 1975 6 27 0
#> 2 Amy 1975 6 27 6
#> 3 Amy 1975 6 27 12
#> 4 Amy 1975 6 27 18
#> 5 Amy 1975 6 28 0
#> # … with 10,005 more rows
The scoped variants lie at the intersection of purrr and dplyr and combine the rowwise looping mechanisms of dplyr with the columnwise mapping of purrr. This is a powerful combination.
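Readers on dplyr 1.0.0 or later may prefer across(), which supersedes the scoped variants in those versions. The sketch below assumes such a version is installed and compares the two spellings of the same grouped summary. The variable names are made up for this illustration.

```r
library(dplyr)

# Scoped variant, as used in this section
scoped <- iris %>%
  group_by(Species) %>%
  summarise_all(mean)

# Equivalent formulation with across(), which skips grouping columns
with_across <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

same_result <- isTRUE(all.equal(
  as.data.frame(scoped),
  as.data.frame(with_across)
))
```

The scoped verbs continue to work in these versions, so existing code does not need to be rewritten.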
Note: vars() is the function that does the quoting of your expressions, and returns blueprints to its caller. This pattern of letting an external helper quote the arguments is called external quoting.