tidyverse / tidyeval
A guide to tidy evaluation
Home Page: https://tidyeval.tidyverse.org
@jimhester suggested:
tidy_capture() and tidy_capture_dots() for enquo() and enquos()
tidy_build() for expr()
tidy_eval() for eval_tidy()
But we should consider all functions used in the book before providing any new aliases.
I did an initial sketch of this idea in this PR: #9
The premise is based purely on my own experience: I found quoting hard to understand until I encountered it explained as a 'code/data transformation'.
I think this idea could be used to add clarity to quoting in this new tidyeval documentation.
Take this first para introducing quoting functions:
On the other hand, a quoting function is not passed the value of an expression, it is passed the expression itself. We say the argument has been automatically quoted. The quoted expression might be evaluated a bit later or might not be evaluated at all. The simplest quoting function is quote(). It automatically quotes its argument and returns the quoted expression without any evaluation. Because only the expression passed as argument matters, none of these statements are equivalent:
It does not define quoting. It also says about quote():
It automatically quotes its argument and returns the quoted expression without any evaluation.
I found this very confusing until I learned of the 'expert' definition of 'evaluate' it employs, which requires some appreciation of the parser + evaluator model.
Using the code/data transform idea:
On the other hand, a quoting function is not passed the value of an expression, it is passed the expression itself as data. We call the process of converting an expression to data quoting. The quoted expression might be evaluated a bit later or might not be evaluated at all. The simplest quoting function is quote(). It automatically quotes its argument and returns the quoted expression as data. Because it is not the value of the expression, but its representation as data that matters, none of these statements are equivalent:
It defines quoting explicitly as 'the process of converting an expression to data'. It avoids saying 'unevaluated', which in turn avoids the need to explain interpreter internals.
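The 'expression as data' framing can be made concrete with base R alone. The following is only a minimal sketch (the variable names are illustrative): quote() hands back an object that can be inspected like any other data structure and evaluated later if desired.

```r
# quote() returns the expression itself, as a data structure
e <- quote(x + y)

class(e)      # the class of the quoted expression (a call object)
as.list(e)    # its parts: the function `+` and the symbols x and y

# The data can be turned back into a value later, given values for x and y
result <- eval(e, list(x = 1, y = 2))
result
```

Nothing is computed until eval() is called, which is the sense in which the quoted expression is 'just data'.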
There are a couple of further ideas in the PR. Please accept this as a genuine suggestion based on a frame of explanation I found useful as a conceptual newcomer. A blog post I wrote exploring this idea was received well with many independent positive expressions of feedback. This gives me some confidence in its broader applicability and appeal.
Would it be possible to add some very brief examples for using rlang with dplyr::select and dplyr::filter? I appreciate they are still "to do". From an educational perspective it would be really useful since these are such basic building blocks of dplyr.
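As a rough idea of what such examples might look like, here is a minimal sketch using enquo() and !!. The wrapper names pick_column() and keep_rows() are hypothetical and exist only for illustration; they are not part of dplyr or rlang.

```r
library(dplyr)

# Hypothetical wrapper: select one column given as a bare name
pick_column <- function(df, col) {
  col <- rlang::enquo(col)   # capture the user's expression
  select(df, !!col)          # unquote it inside select()
}

# Hypothetical wrapper: keep the rows matching a condition
keep_rows <- function(df, condition) {
  condition <- rlang::enquo(condition)
  filter(df, !!condition)
}

picked <- pick_column(mtcars, mpg)
kept <- keep_rows(mtcars, cyl == 4 & mpg > 30)
```

The same capture-then-unquote pattern is used for both verbs, even though select() works with column names and filter() works with column values.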
Many thanks in advance for building these fantastic resources! And apologies if this isn't the best forum in which to make this request.
Common denominator of semantics is bare symbols, so use ensyms().
Capture the essence of this twitter discussion somewhere:
https://twitter.com/JennyBryan/status/1088859123658018816
Summary: enquo() is a better all-purpose default quotation mechanism to recommend than ensym(). That is, if you're only going to learn one of these, make it enquo().
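One way to illustrate the difference is the sketch below. The helpers capture_sym() and capture_quo() are hypothetical names introduced for this example: ensym() accepts only a symbol (or a string), while enquo() accepts any expression and records its environment.

```r
library(rlang)

capture_sym <- function(x) ensym(x)   # accepts only a symbol or a string
capture_quo <- function(x) enquo(x)   # accepts any expression

s  <- capture_sym(mpg)        # fine: a bare column name
q1 <- capture_quo(mpg)        # fine as well
q2 <- capture_quo(mpg / wt)   # fine: a full expression, with its environment

# A call is rejected by ensym(), so the same input fails here
res <- try(capture_sym(mpg / wt), silent = TRUE)
inherits(res, "try-error")
```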
ensyms() is not appropriate, as the symbols will still be looked up inside the execution env of the user's wrapper, all the way up to the search path.
Do we need envars()? If the user supplies symbols foo and bar, they get quoted as .data$foo and .data$bar. If a call is supplied, a helpful error is issued.
Other approach: enstrings() as a shortcut for map_chr(ensyms(...), as_string), together with a tidyselect function to check the existence of names in a data frame (r-lib/tidyselect#85). The advantage of this approach is that the user can continue working with names as strings, which is sometimes needed (programmatically creating new columns etc).
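A possible shape for the proposed helper is sketched below. enstrings() is hypothetical (it is the name suggested in this issue, not an existing rlang function), and vapply() is used here in place of map_chr() only to avoid a purrr dependency.

```r
library(rlang)

# Hypothetical helper, not part of rlang: capture bare names as strings
enstrings <- function(...) {
  vapply(ensyms(...), as_string, character(1), USE.NAMES = FALSE)
}

# Usage inside a user's wrapper (my_cols() is also just an illustration)
my_cols <- function(...) enstrings(...)
cols <- my_cols(mpg, gear)
cols
```

Because ensyms() does the capturing, passing a call such as mpg / wt would produce an error rather than a string.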
From @jennybc:
library(tidyverse)
pick_me_strings <- c("mpg", "gear")
mtcars %>%
select(one_of(pick_me_strings)) %>%
glimpse()
#> Observations: 32
#> Variables: 2
#> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
mtcars %>%
select_at(pick_me_strings) %>%
glimpse()
#> Observations: 32
#> Variables: 2
#> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
pick_me_exprs <- rlang::exprs(mpg, gear)
mtcars %>%
select(!!!pick_me_exprs) %>%
glimpse()
#> Observations: 32
#> Variables: 2
#> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
Things can get confusing when trying to apply select() knowledge to other verbs:
library(tidyverse)
df <- tibble(
  name = c("abby", "bea", "curt", "doug"),
  happy = c(TRUE, FALSE, FALSE, TRUE),
  awake = c(FALSE, FALSE, TRUE, TRUE),
)
filter_me_strings <- c("happy", "awake")
df %>%
filter(one_of(filter_me_strings))
#> Error in filter_impl(.data, quo): Evaluation error: No tidyselect variables were registered.
df %>%
filter_at(filter_me_strings)
#> Error in apply_filter_syms(.vars_predicate, syms, .tbl): argument ".vars_predicate" is missing, with no default
df %>%
filter(!!!filter_me_strings)
#> Error in filter_impl(.data, quo): Evaluation error: operations are possible only for numeric, logical or complex types.
filter_me_exprs <- rlang::exprs(happy, awake)
df %>%
filter(!!!filter_me_exprs)
#> # A tibble: 1 x 3
#> name happy awake
#> <chr> <lgl> <lgl>
#> 1 doug TRUE TRUE
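For completeness, here is a sketch of two ways the string version could be made to work with filter(). It rebuilds the same df as above so that it can run on its own, and the result names by_syms and by_pronoun are illustrative.

```r
library(dplyr)

df <- tibble(
  name = c("abby", "bea", "curt", "doug"),
  happy = c(TRUE, FALSE, FALSE, TRUE),
  awake = c(FALSE, FALSE, TRUE, TRUE)
)
filter_me_strings <- c("happy", "awake")

# Convert the strings to symbols, then splice them in as separate conditions
by_syms <- df %>%
  filter(!!!rlang::syms(filter_me_strings))

# Alternative: subset the .data pronoun with a string
by_pronoun <- df %>%
  filter(.data[["happy"]], .data[["awake"]])
```

Splicing the strings directly fails because filter() then receives character constants rather than references to columns; converting them to symbols first restores the column lookup.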
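mutate_front <- function(.data, ...) {
  # Capture the dots once, auto-naming any unnamed expressions
  exprs <- enquos(..., .named = TRUE)
  # Splice the named quosures so the new column names match names(exprs)
  .data <- mutate(.data, !!!exprs)
  new_vars <- syms(names(exprs))
  select(.data, !!!new_vars, everything())
}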
starwars %>%
mutate_front(
height / 100,
birth_year_fct = cut_number(birth_year, 4)
)
#> # A tibble: 87 x 15
#> `height/100` birth_year_fct name height mass hair_color skin_color
#> <dbl> <fct> <chr> <int> <dbl> <chr> <chr>
#> 1 1.72 [8,35] Luke… 172 77 blond fair
#> 2 1.67 (72,896] C-3PO 167 75 NA gold
#> 3 0.96 [8,35] R2-D2 96 32 NA white, bl…
#> 4 2.02 (35,52] Dart… 202 136 none white
#> 5 1.5 [8,35] Leia… 150 49 brown light
#> 6 1.78 (35,52] Owen… 178 120 brown, gr… light
#> 7 1.65 (35,52] Beru… 165 75 brown light
#> 8 0.97 NA R5-D4 97 32 NA white, red
#> 9 1.83 [8,35] Bigg… 183 84 black light
#> 10 1.82 (52,72] Obi-… 182 77 auburn, w… fair
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> # birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
As discussed in meeting, anything you can do to help people do these lookups and reverse lookups would be awesome. Some information is already sprinkled in rlang help files, but it's still hard to find and it's not always there.
For the most part, tidy eval terms are named with more discipline re: doing/being what they say. But because base R's names are often idiosyncratic/confusing, this complicates mapping from one to the other (think about substitute(), "expressions" or "frames"), so an explicit mapping by an expert is helpful. Basically, demarcate those hazardous bits with bright yellow caution tape.
This is related to issue #1 re glossary and maybe they can somehow be addressed at the same time.
I think the content is growing enough that it would be easier to work with this if each chapter got its own Rmd file.
And introduce notion of expression earlier?
rlang::is_expression(expr(mycolumn))
#> [1] TRUE
rlang::is_expression("mycolumn")
#> [1] FALSE
rlang::is_expression(sym("mycolumn"))
#> [1] TRUE
Do we need syms() for an early for loop example?
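One possible shape for such an example is sketched below. It uses sym() on each string in turn, which may be enough for a loop; syms() would convert the whole vector at once. The variable names are illustrative.

```r
library(dplyr)

cols <- c("mpg", "hp")
means <- numeric(0)

for (col in cols) {
  col_sym <- rlang::sym(col)   # string -> symbol
  # Unquote the symbol so that mean() sees the column, not the string
  means[[col]] <- summarise(mtcars, m = mean(!!col_sym))$m
}

means
```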
I wrote this for Advanced R, but it doesn't feel quite right there. I think it might be better in the programming with dplyr vignette (whatever that ends up being)
In our grouped_mean() example above, we allow the user to select one grouping variable, and one summary variable. What if we wanted to allow the user to select more than one? One option would be to use the dots (...). There are three possible ways we could use them:
1. Pass ... on to the mean() function. That would make it easy to set na.rm = TRUE. This is easiest to implement.
2. Allow the user to select multiple groups.
3. Allow the user to select multiple variables to summarise.
Implementing each one of these is relatively straightforward, but what if we want to be able to group by multiple variables, summarise multiple variables, and pass extra args on to mean()? Generally, I think it is better to avoid this sort of API (instead relying on multiple functions that each do one thing), but sometimes it is the lesser of two evils, so it is useful to have a technique in your back pocket to handle it.
library(tidyverse)
library(rlang)
grouped_mean <- function(df, groups, vars, args) {
  # Build one mean() call per variable, splicing in the extra arguments
  var_means <- map(vars, function(var) expr(mean(!!var, !!!args)))
  names(var_means) <- map_chr(vars, expr_name)
  df %>%
    dplyr::group_by(!!!groups) %>%
    dplyr::summarise(!!!var_means)
}
grouped_mean(mtcars, exprs(vs, am), exprs(hp, drat, wt), list(na.rm = TRUE))
If you use this design a lot, you may also want to provide an alias to exprs() with a better name. For example, dplyr provides the vars() wrapper to support the scoped verbs (e.g. summarise_if(), mutate_at()). aes() in ggplot2 is similar, although it does a little more: it requires all arguments to be named, names the first two arguments (x and y) by default, and automatically renames so you can use the base names for aesthetics (e.g. pch vs shape).
grouped_mean(mtcars, vars(vs, am), vars(hp, drat, wt), list(na.rm = TRUE))
Implement the three variants of grouped_mean()
described above:
# ... passed on to mean
grouped_mean <- function(df, group_by, summarise, ...) {}
# ... selects variables to summarise
grouped_mean <- function(df, group_by, ...) {}
# ... selects variables to group by
grouped_mean <- function(df, ..., summarise) {}
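As a hint for the last variant, here is one possible sketch, not the only solution. It assumes the output column is simply called mean, and it ungroups the result, which is a design choice of this sketch rather than a requirement of the exercise.

```r
library(dplyr)

# ... selects variables to group by
grouped_mean <- function(df, ..., summarise) {
  summarise <- rlang::enquo(summarise)   # the single summary variable
  groups <- rlang::enquos(...)           # any number of grouping variables
  df %>%
    dplyr::group_by(!!!groups) %>%
    dplyr::summarise(mean = mean(!!summarise)) %>%
    dplyr::ungroup()
}

res <- grouped_mean(mtcars, vs, am, summarise = hp)
res
```

Because summarise comes after ... in the signature, it has to be supplied by name.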
Interesting because you embed quosures in an expression
library(rlang)
arrange2 <- function(.df, ..., .na.last = TRUE) {
  # Capture all dots
  args <- enquos(...)
  # Use `!!!` to splice in the individual arguments
  # Use `!!` to inline `.na.last` to avoid it being matched in the data mask
  order_call <- expr(order(!!!args, na.last = !!.na.last))
  # Evaluate the call to order using data mask
  ord <- eval_tidy(order_call, .df)
  stopifnot(length(ord) == nrow(.df))
  .df[ord, , drop = FALSE]
}
df <- data.frame(x = c(2, 3, 1), y = runif(3))
arrange2(df, x)
arrange2(df, -y)
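The point about embedding quosures in an expression can be isolated in a smaller sketch. The helper capture() and the variable a are illustrative; the quosure is created in a local environment and then evaluated as part of a larger expression built elsewhere.

```r
library(rlang)

capture <- function(x) enquo(x)

# The quosure remembers the environment in which `a` is defined
q <- local({
  a <- 10
  capture(a + 1)
})

# Embed the quosure in a larger, ordinary expression
e <- expr((!!q) * 2)

# `a` is not visible here, yet evaluation works because the
# quosure carries its own environment
value <- eval_tidy(e)
value
```

This is the same mechanism that lets arrange2() splice user-supplied quosures into a call to order() and evaluate the whole call with eval_tidy().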
In this line:
Line 275 in 7844627
you say:
We have already mentioned the vectorisation of toupper(), and many other functions in R are vectorised.
But I have not found this mention. At least, not in this book :)
Data is often the most relevant scope for a data analyst. That's why users are sometimes tempted to use attach(). But attach() is bad for the same reason that global options are bad: it hinders reproducibility. Data-masking is a way to promote a data scope in a limited, controlled way.
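A small sketch of the contrast, using eval_tidy() with a made-up data frame (scores and exam_score are illustrative names): the column is in scope only while the expression is evaluated, and nothing is left behind on the search path.

```r
library(rlang)

scores <- data.frame(exam_score = c(1, 2, 3))

# The column is in scope only for the duration of this call
avg <- eval_tidy(quote(mean(exam_score)), data = scores)
avg

# Nothing was attached: the column name is not visible afterwards
still_visible <- exists("exam_score")
still_visible
```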
If you have deliberately left out both of these, then just close this.
But if not, it would be nice to add one or both.
As a bookdown reader I often make heavy use of:
As suggested in today's meeting, I think a glossary of terms would be very helpful.
I recommend that you gather all the "visibility", "printing", and "debugging" strategies in one place and make it very discoverable. One of my tidy eval learning challenges is that my usual tricks for play and inspection don't work and I can't easily remember the official strategies, like qq_show().
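As a starting point for such a section, here is a brief sketch of two inspection tools from rlang. The variable names are illustrative, and this is not meant as a complete list of strategies.

```r
library(rlang)

var <- sym("height")

# qq_show() prints the expression after unquoting, without evaluating it
qq_show(mean(!!var, na.rm = TRUE))

# expr() builds the same expression as an object that can be
# printed, stored, or inspected further
e <- expr(mean(!!var, na.rm = TRUE))
e
```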