tidyverse / tidyeval

A guide to tidy evaluation

Home Page: https://tidyeval.tidyverse.org


tidyeval's Introduction

This guide is now superseded by more recent efforts at documenting tidy evaluation in a user-friendly way. We now recommend reading:

The new Programming with dplyr vignette.
The Using ggplot2 in packages vignette.

We are keeping this bookdown guide online for posterity, but please know that it is missing a lot of advances that make tidy eval more palatable, such as the embracing operator {{ arg }} and glue support for custom names.
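For readers who have not met the two features named above, here is a minimal sketch of embracing and glue-style names. It assumes dplyr 1.0 or later; the function name mean_by() and its arguments are hypothetical and do not come from this guide.

```r
library(dplyr)

# Hypothetical wrapper: `by` and `var` are bare column names from the caller
mean_by <- function(df, by, var) {
  df %>%
    # {{ by }} forwards the caller's column to group_by()
    group_by({{ by }}) %>%
    # "mean_{{ var }}" builds the result name from the caller's column name
    summarise("mean_{{ var }}" := mean({{ var }}, na.rm = TRUE))
}

res_embrace <- mean_by(mtcars, cyl, hp)
names(res_embrace)
```

The same function written with enquo() and !! needs an explicit capture step and a separate naming step, which is the extra ceremony the newer vignettes avoid.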

tidyeval's Issues

Why data masking

Data is often the most relevant scope for a data analyst. That's why users are sometimes tempted to use attach(). But attach() is bad for the same reason that global options are bad: it hinders reproducibility. Data masking is a way to promote a data scope in a limited, controlled way.
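A minimal sketch of the contrast, using base R's with() and rlang's eval_tidy() on the built-in mtcars data. The variable names are only for illustration.

```r
library(rlang)

# attach(mtcars) would put the columns on the search path for the rest of
# the session. Data masking scopes them to a single expression instead.
masked_base <- with(mtcars, mean(mpg / cyl))

# eval_tidy() does the same for a quoted expression: `mpg` and `cyl` are
# looked up in the data first, then in the surrounding environment.
masked_tidy <- eval_tidy(expr(mean(mpg / cyl)), data = mtcars)

all.equal(masked_base, masked_tidy)
```

In both calls the data scope ends when the call returns, so nothing leaks into later code the way it does with attach().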

Consider evocative aliases

@jimhester suggested:

tidy_capture() and tidy_capture_dots() for enquo() and enquos()
tidy_build() for expr()
tidy_eval() for eval_tidy()

But we should consider all functions used in the book before providing any new aliases.

tidy eval terms and functions <--> base R terms and functions

As discussed in meeting, anything you can do to help people do these lookups and reverse lookups would be awesome. Some information is already sprinkled in rlang help files, but it's still hard to find and it's not always there.

For the most part, tidy eval terms are named with more discipline re: doing/being what they say. But because base R's names are often idiosyncratic/confusing, this complicates mapping from one to the other (think about substitute(), "expressions" or "frames"), so an explicit mapping by an expert is helpful. Basically, demarcate those hazardous bits with bright yellow caution tape.

This is related to issue #1 re glossary and maybe they can somehow be addressed at the same time.
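As a starting point for such a lookup table, here is a rough sketch of a few pairs side by side. The correspondences are approximate (the rlang versions add quasiquotation and quosure support), and the helper names f_base() and f_tidy() are made up for the example.

```r
library(rlang)

# Quote an expression typed by the developer
quote(x + y)   # base R
expr(x + y)    # rlang

# Quote an expression supplied by the user of a function
f_base <- function(arg) substitute(arg)   # base R
f_tidy <- function(arg) enexpr(arg)       # rlang

# Insert a value into a quoted expression
value <- 10
bquote(x + .(value))   # base R
expr(x + !!value)      # rlang

# Evaluate a quoted expression with a data frame as scope
eval(quote(mean(mpg)), mtcars)       # base R
eval_tidy(expr(mean(mpg)), mtcars)   # rlang
```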

Explore arrange implementation

This is interesting because it embeds quosures in an expression.

library(rlang)

arrange2 <- function(.df, ..., .na.last = TRUE) {
  # Capture all dots as quosures
  args <- enquos(...)

  # Use `!!!` to splice in the individual arguments
  # Use `!!` to inline `.na.last` so it is not looked up in the data mask
  order_call <- expr(order(!!!args, na.last = !!.na.last))
  
  # Evaluate the call to order using data mask
  ord <- eval_tidy(order_call, .df)
  stopifnot(length(ord) == nrow(.df))
  
  .df[ord, , drop = FALSE]
}

df <- data.frame(x = c(2, 3, 1), y = runif(3))

arrange2(df, x)
arrange2(df, -y)

Tangling with dots

I wrote this for Advanced R, but it doesn't feel quite right there. I think it might fit better in the programming with dplyr vignette (whatever that ends up being).

Tangling with dots

In our grouped_mean() example above, we allow the user to select one grouping variable and one summary variable. What if we wanted to allow the user to select more than one? One option would be to use the dots (...). There are three possible ways we could use them:

Pass ... on to the mean() function. That would make it easy to set na.rm = TRUE. This is the easiest to implement.
Allow the user to select multiple groups.
Allow the user to select multiple variables to summarise.
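To make the second option concrete, here is one possible sketch in which the dots select the grouping variables. It assumes dplyr 1.0 or later (for {{ }} and the .groups argument); the name grouped_mean_by() and its argument order are hypothetical.

```r
library(dplyr)

# Hypothetical variant: `...` selects the grouping variables
grouped_mean_by <- function(df, summary_var, ...) {
  df %>%
    group_by(...) %>%
    summarise(mean = mean({{ summary_var }}, na.rm = TRUE), .groups = "drop")
}

res_by <- grouped_mean_by(mtcars, hp, vs, am)
```

Because group_by() already quotes its dots, they can be forwarded as they are, with no explicit capture step.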

Implementing each one of these is relatively straightforward, but what if we want to be able to group by multiple variables, summarise multiple variables, and pass extra arguments on to mean()? Generally, I think it is better to avoid this sort of API (relying instead on multiple functions that each do one thing), but sometimes it is the lesser of two evils, so it is useful to have a technique in your back pocket to handle it.

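library(rlang)
library(purrr)
library(dplyr)

grouped_mean <- function(df, groups, vars, args) {
  # Build one mean() call per variable, splicing in the extra arguments
  var_means <- map(vars, function(var) expr(mean(!!var, !!!args)))
  # Name each call after its variable so the summary columns get those names
  names(var_means) <- map_chr(vars, expr_name)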
  
  df %>%
    dplyr::group_by(!!!groups) %>%
    dplyr::summarise(!!!var_means)
}

grouped_mean(mtcars, exprs(vs, am), exprs(hp, drat, wt), list(na.rm = TRUE))

If you use this design a lot, you may also want to provide an alias to exprs() with a better name. For example, dplyr provides the vars() wrapper to support the scoped verbs (e.g. summarise_if(), mutate_at()). aes() in ggplot2 is similar, although it does a little more: it requires all arguments to be named, names the first two arguments (x and y) by default, and automatically renames aesthetics so you can use the base names (e.g. pch vs shape).

grouped_mean(mtcars, vars(vs, am), vars(hp, drat, wt), list(na.rm = TRUE))

Exercises

Implement the three variants of grouped_mean() described above:

# ... passed on to mean
grouped_mean <- function(df, group_by, summarise, ...) {}
# ... selects variables to summarise
grouped_mean <- function(df, group_by, ...) {}
# ... selects variables to group by
grouped_mean <- function(df, ..., summarise) {}

dplyr: working with character vectors of column names

From @jennybc:

library(tidyverse)

pick_me_strings <- c("mpg", "gear")

mtcars %>% 
  select(one_of(pick_me_strings)) %>% 
  glimpse()
#> Observations: 32
#> Variables: 2
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...

mtcars %>% 
  select_at(pick_me_strings) %>% 
  glimpse()
#> Observations: 32
#> Variables: 2
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...

pick_me_exprs <- rlang::exprs(mpg, gear)

mtcars %>% 
  select(!!!pick_me_exprs) %>% 
  glimpse()
#> Observations: 32
#> Variables: 2
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
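In more recent tidyselect releases, one_of() has been superseded by all_of() and any_of(). The sketch below assumes a dplyr version that re-exports both helpers.

```r
library(dplyr)

pick_me_strings <- c("mpg", "gear")

# all_of() errors if any name is missing from the data
picked <- mtcars %>%
  select(all_of(pick_me_strings))

# any_of() silently ignores names that are not present
picked_lenient <- mtcars %>%
  select(any_of(c(pick_me_strings, "not_a_column")))
```

Choosing between the two makes the intent explicit: strict selection when a missing column is a bug, lenient selection when it is expected.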

Things can get confusing when trying to apply select() knowledge to other verbs:

library(tidyverse)

df <- tibble(
  name = c("abby", "bea", "curt", "doug"),
  happy = c(TRUE, FALSE, FALSE, TRUE),
  awake = c(FALSE, FALSE, TRUE, TRUE),
)

filter_me_strings <- c("happy", "awake")

df %>% 
  filter(one_of(filter_me_strings))
#> Error in filter_impl(.data, quo): Evaluation error: No tidyselect variables were registered.

df %>% 
  filter_at(filter_me_strings)
#> Error in apply_filter_syms(.vars_predicate, syms, .tbl): argument ".vars_predicate" is missing, with no default

df %>% 
  filter(!!!filter_me_strings)
#> Error in filter_impl(.data, quo): Evaluation error: operations are possible only for numeric, logical or complex types.

filter_me_exprs <- rlang::exprs(happy, awake)

df %>% 
  filter(!!!filter_me_exprs)
#> # A tibble: 1 x 3
#>   name  happy awake
#>   <chr> <lgl> <lgl>
#> 1 doug  TRUE  TRUE
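One way to get from the strings to the working exprs() version is to convert the character vector to symbols with rlang::syms() before splicing. This is a sketch of that single step and reuses the same toy data.

```r
library(dplyr)
library(rlang)

df <- tibble(
  name = c("abby", "bea", "curt", "doug"),
  happy = c(TRUE, FALSE, FALSE, TRUE),
  awake = c(FALSE, FALSE, TRUE, TRUE)
)

filter_me_strings <- c("happy", "awake")

# syms() turns each string into a symbol, so filter() sees column
# references rather than string constants
filter_me_syms <- syms(filter_me_strings)

both <- df %>%
  filter(!!!filter_me_syms)
```

Splicing the strings directly fails because filter() then receives the constants "happy" and "awake" rather than the logical columns they name.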

Polite request to add examples for select and filter

Would it be possible to add some very brief examples of using rlang with dplyr::select and dplyr::filter? I appreciate they are still "to do". From an educational perspective it would be really useful, since these are such basic building blocks of dplyr.

Many thanks in advance for building these fantastic resources! And apologies if this isn't the best forum in which to make this request.

Functions forwarding inputs to select() and a mutate- or filter-like function

The common denominator of their semantics is bare symbols, so use ensyms().
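A minimal sketch of such a wrapper, assuming the inputs are forwarded both to select() and to a filter() step. The function select_complete() and the toy data are hypothetical.

```r
library(dplyr)
library(rlang)
library(purrr)

# Hypothetical wrapper: keep the named columns, then drop rows where any
# of them is missing
select_complete <- function(df, ...) {
  # ensyms() accepts bare column names only, which both verbs understand
  vars <- ensyms(...)
  # Build one `!is.na(<column>)` condition per symbol
  not_missing <- map(vars, function(var) expr(!is.na(!!var)))

  df %>%
    select(!!!vars) %>%
    filter(!!!not_missing)
}

people <- tibble(
  name = c("a", "b", "c"),
  height = c(1, NA, 3),
  mass = c(10, 20, NA)
)

res_complete <- select_complete(people, name, height, mass)
```

Restricting the inputs to symbols means an expression such as height / 100, which select() could not interpret, is rejected at capture time.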

"How to see what's going on"

I recommend that you gather all the "visibility", "printing", and "debugging" strategies in one place and make it very discoverable. One of my tidy eval learning challenges is that my usual tricks for play and inspection don't work and I can't easily remember the official strategies, like qq_show().
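As an illustration of the kind of strategies that could be gathered in one place, here is a short sketch using rlang functions. Printed output is omitted because it varies across rlang versions.

```r
library(rlang)

var <- sym("height")

# qq_show() prints the expression after unquoting, without evaluating it
qq_show(mean(!!var, na.rm = TRUE))

# expr() returns the same expression as an object that can be inspected
built <- expr(mean(!!var, na.rm = TRUE))
deparse(built)

# A quosure can be taken apart into its expression and its environment
q <- quo(mean(!!var))
quo_get_expr(q)
quo_get_env(q)
```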

Replace quote() by expr()

And introduce the notion of an expression earlier?

rlang::is_expression(expr(mycolumn))
#> [1] TRUE

rlang::is_expression("mycolumn")
#> [1] FALSE

rlang::is_expression(sym("mycolumn"))
#> [1] TRUE

Do we need syms() for an early for loop example?

How can we take inputs that should strictly be column names?

ensyms() is not appropriate because the symbols will still be looked up inside the execution environment of the user's wrapper, all the way up to the search path.
Do we need envars()? If user supplies symbols foo and bar, they get quoted as .data$foo and .data$bar. If a call is supplied, a helpful error is issued.
Other approach: enstrings() as a shortcut for map_chr(ensyms(...), as_string) and providing a tidyselect function to check existence of names in data frame. r-lib/tidyselect#85. The advantage of this approach is that the user can continue working with names as strings, which is sometimes needed (programmatically creating new columns etc).
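A rough sketch of the last approach, written as a single helper that captures names as strings and checks them against the data frame. The helper enstrings_checked() is hypothetical and is not part of rlang or tidyselect.

```r
library(rlang)
library(purrr)

# Hypothetical helper: capture the dots as strings and check that each one
# names a column of `df`
enstrings_checked <- function(df, ...) {
  # ensyms() rejects calls such as `mpg + 1`, so only names get through
  cols <- unname(map_chr(ensyms(...), as_string))

  unknown <- setdiff(cols, names(df))
  if (length(unknown) > 0) {
    abort(paste0("Unknown column(s): ", paste(unknown, collapse = ", ")))
  }

  cols
}

enstrings_checked(mtcars, mpg, gear)
```

Because the result is a plain character vector, it can be used for subsetting with `[[` or for building new column names, without any lookup in the calling environment.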

(en)quo(s) vs (en)sym(s)

Capture the essence of this twitter discussion somewhere:

https://twitter.com/JennyBryan/status/1088859123658018816

Summary: enquo() is a better all-purpose default quotation mechanism to recommend than ensym(). That is, if you're only going to learn 1 of these, make it enquo().
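A small sketch of the practical difference, with two hypothetical wrappers that differ only in how they capture their argument:

```r
library(dplyr)
library(rlang)

# Captures any expression, together with its environment
mean_quo <- function(df, var) {
  var <- enquo(var)
  summarise(df, mean = mean(!!var))
}

# Captures a bare column name only
mean_sym <- function(df, var) {
  var <- ensym(var)
  summarise(df, mean = mean(!!var))
}

# Both accept a bare column name
mean_quo(mtcars, mpg)
mean_sym(mtcars, mpg)

# Only the enquo() version accepts an expression; ensym() signals an error
res_quo <- mean_quo(mtcars, mpg / cyl)
res_sym <- try(mean_sym(mtcars, mpg / cyl), silent = TRUE)
```

The stricter ensym() is still the right choice when a function genuinely needs a name rather than a computation.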

Add GitHub link and/or edit link

If you have deliberately left out both of these, then just close this.

But if not, it would be nice to add one or both.

As a bookdown reader I often make heavy use of:

The little GitHub icon that takes people to the underlying repo. Add it with this line in _output.yml.
The edit button to facilitate pull requests, especially small typo corrections. Add with this line in _output.yml.

Glossary

As suggested in today's meeting, I think a glossary of terms would be very helpful.

Use one Rmd file per chapter

I think the content is growing enough that it would be easier to work with this if each chapter got its own Rmd file.

mutate_front()

mutate_front <- function(.data, ...) {
  exprs <- enquos(..., .named = TRUE)
  .data <- mutate(.data, ...)

  new_vars <- syms(names(exprs))
  select(.data, !!!new_vars, everything())
}

starwars %>%
  mutate_front(
    height / 100,
    birth_year_fct = cut_number(birth_year, 4)
  )
#> # A tibble: 87 x 15
#>    `height/100` birth_year_fct name  height  mass hair_color skin_color
#>           <dbl> <fct>          <chr>  <int> <dbl> <chr>      <chr>
#>  1         1.72 [8,35]         Luke…    172    77 blond      fair
#>  2         1.67 (72,896]       C-3PO    167    75 NA         gold
#>  3         0.96 [8,35]         R2-D2     96    32 NA         white, bl…
#>  4         2.02 (35,52]        Dart…    202   136 none       white
#>  5         1.5  [8,35]         Leia…    150    49 brown      light
#>  6         1.78 (35,52]        Owen…    178   120 brown, gr… light
#>  7         1.65 (35,52]        Beru…    165    75 brown      light
#>  8         0.97 NA             R5-D4     97    32 NA         white, red
#>  9         1.83 [8,35]         Bigg…    183    84 black      light
#> 10         1.82 (52,72]        Obi-…    182    77 auburn, w… fair
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> #   birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Rephrasing of sentence

In this line:

tidyeval/introduction.Rmd

Line 275 in 7844627

 Rowwise vectorisation in dplyr is a consequence of normal R rules for vectorisation. A vectorised function is a function that works the same way with vectors of 1 element as with vectors of _n_ elements. The operation is applied elementwise (often at the machine code level, which makes them very efficient). We have already mentioned the vectorisation of `toupper()`, and many other functions in R are vectorised. One important class of vectorised functions is the arithmetic operators: 

you say:

We have already mentioned the vectorisation of toupper(), and many other functions in R are vectorised.

But I have not found this mention. At least, not in this book :)

An idea: Introduce quoting as an expression/data transformation from the beginning

I did an initial sketch of this idea in this PR: #9

The premise is based purely on my own experience: I found quoting hard to understand until I encountered it explained as a 'code/data transformation'.

I think this idea could be used to add clarity to quoting in this new tidyeval documentation.

Take this first para introducing quoting functions:

On the other hand, a quoting function is not passed the value of an expression, it is passed the expression itself. We say the argument has been automatically quoted. The quoted expression might be evaluated a bit later or might not be evaluated at all. The simplest quoting function is quote(). It automatically quotes its argument and returns the quoted expression without any evaluation. Because only the expression passed as argument matters, none of these statements are equivalent:

It does not define quoting. It also says about quote:

It automatically quotes its argument and returns the quoted expression without any evaluation.

I found this very confusing until I learned of the 'expert' definition of 'evaluate' it employs, which requires some appreciation of the parser + evaluator model.

Using the code/data transform idea:

On the other hand, a quoting function is not passed the value of an expression, it is passed the expression itself as data. We call the process of converting an expression to data quoting. The quoted expression might be evaluated a bit later or might not be evaluated at all. The simplest quoting function is quote(). It automatically quotes its argument and returns the quoted expression as data. Because it is not the value of the expression, but its representation as data that matters, none of these statements are equivalent:

It defines quoting explicitly as 'the process of converting an expression to data'. It avoids saying 'unevaluated', which in turn avoids the need to explain interpreter internals.
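The 'expression as data' framing can be shown with base R alone. The sketch below treats a quoted call like any other data structure: it inspects the call, modifies it, and only then runs it.

```r
e <- quote(mean(x + 1))

class(e)     # a call object, not the value of the expression
as.list(e)   # its parts: the function name and the argument
e[[1]]       # the symbol for the function being called

# Because the expression is data, it can be modified before it is run
e[[1]] <- as.name("sum")

x <- c(1, 2, 3)
eval(e)      # evaluates sum(x + 1)
```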

There are a couple of further ideas in the PR. Please accept this as a genuine suggestion based on a frame of explanation I found useful as a conceptual newcomer. A blog post I wrote exploring this idea was received well with many independent positive expressions of feedback. This gives me some confidence in its broader applicability and appeal.
