tidyeval's Introduction

Lifecycle Status

This guide is now superseded by more recent efforts at documenting tidy evaluation in a user-friendly way. We now recommend reading:

We are keeping this bookdown guide online for posterity, but please know that it is missing a lot of advances that make tidy eval more palatable, such as the embracing operator {{ arg }} and glue support for custom names.
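As a rough illustration of the two newer features mentioned above, here is a minimal sketch. It assumes a recent dplyr (1.0.0 or later) and rlang; `mean_by()` is a hypothetical helper written for this example, not something taken from the guide.

```r
library(dplyr)

# Hypothetical helper: `{{ }}` forwards the user's column expressions,
# and the glue-style string on the left of `:=` builds the output name.
mean_by <- function(df, group, var) {
  df %>%
    group_by({{ group }}) %>%
    summarise("mean_{{ var }}" := mean({{ var }}, na.rm = TRUE))
}

result <- mean_by(mtcars, cyl, hp)
print(result)
```

The embraced arguments replace the older `enquo()` plus `!!` pattern that most of this guide relies on.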

tidyeval's People

Contributors

batpigandme, cgrilson7, eribul, geanders, jennybc, jmbuhr, lindbrook, lionel-, markdly, shntnu

tidyeval's Issues

Why data masking

Data is often the most relevant scope for a data analyst. That's why users are sometimes tempted to use attach(). But attach() is bad for the same reason that global options are bad: it hinders reproducibility. Data masking is a way to promote a data scope in a limited, controlled way.
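A small sketch of the contrast, assuming only that rlang is installed. The data frame and variable names are made up for the example.

```r
library(rlang)

df <- data.frame(x = 1:3, y = c(10, 20, 30))

# A variable with the same name as a column, defined outside the data
x <- 100

# The data frame is promoted to a scope for this one evaluation only:
# the column `x` masks the outer `x`.
masked <- eval_tidy(quote(x * y), data = df)

# Nothing has been attached, so the outer `x` is untouched afterwards
outer_x <- x

print(masked)
```

Unlike attach(), the data scope ends as soon as the call returns, so later code cannot accidentally depend on it.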

Consider evocative aliases

@jimhester suggested:

  • tidy_capture() and tidy_capture_dots() for enquo() and enquos()
  • tidy_build() for expr()
  • tidy_eval() for eval_tidy()

But we should consider all functions used in the book before providing any new aliases.
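To get a feel for how the suggested names would read, here is a sketch that defines them locally as plain aliases. These names are only the proposal above, not functions exported by rlang, and `col_mean()` is a hypothetical helper.

```r
library(rlang)

# Proposed aliases (hypothetical; defined here only for illustration)
tidy_capture      <- enquo
tidy_capture_dots <- enquos
tidy_build        <- expr
tidy_eval         <- eval_tidy

# Hypothetical helper using the aliases
col_mean <- function(df, var) {
  var <- tidy_capture(var)
  call <- tidy_build(mean(!!var))
  tidy_eval(call, df)
}

hp_mean <- col_mean(mtcars, hp)
print(hp_mean)
```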

tidy eval terms and functions <--> base R terms and functions

As discussed in meeting, anything you can do to help people do these lookups and reverse lookups would be awesome. Some information is already sprinkled in rlang help files, but it's still hard to find and it's not always there.

For the most part, tidy eval terms are named with more discipline re: doing/being what they say. But because base R's names are often idiosyncratic/confusing, this complicates mapping from one to the other (think about substitute(), "expressions" or "frames"), so an explicit mapping by an expert is helpful. Basically, demarcate those hazardous bits with bright yellow caution tape.

This is related to issue #1 re glossary and maybe they can somehow be addressed at the same time.
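As a starting point for such a mapping, a few pairs can be checked directly. This is a partial sketch, not an authoritative table; the helper names `capture_base()` and `capture_tidy()` are invented for the example.

```r
library(rlang)

# quote()  <-->  expr(): both return the expression unevaluated
same_quote <- identical(quote(x + y), expr(x + y))

# bquote(.(v))  <-->  expr(!!v): inline a value into an expression
v <- 10
same_inline <- identical(bquote(x + .(v)), expr(x + !!v))

# substitute() inside a function  <-->  enexpr(): capture an argument
capture_base <- function(arg) substitute(arg)
capture_tidy <- function(arg) enexpr(arg)
same_capture <- identical(capture_base(a * b), capture_tidy(a * b))

# eval(expr, data)  <-->  eval_tidy(expr, data): evaluate with data in scope
same_eval <- identical(
  eval(quote(cyl * 2), mtcars),
  eval_tidy(expr(cyl * 2), mtcars)
)

print(c(same_quote, same_inline, same_capture, same_eval))
```

The pairs agree on these simple inputs; the tidy eval versions differ mainly in supporting `!!`, quosures, and the `.data` pronoun.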

Explore arrange implementation

Interesting because you embed quosures in an expression

library(rlang)

arrange2 <- function(.df, ..., .na.last = TRUE) {
  # Capture all dots
  args <- enquos(...)
  
  # Use `!!!` to splice in the individual arguments
  # Use `!!` to inline `.na.last` to avoid it being matched in the data mask
  order_call <- expr(order(!!!args, na.last = !!.na.last))
  
  # Evaluate the call to order using data mask
  ord <- eval_tidy(order_call, .df)
  stopifnot(length(ord) == nrow(.df))
  
  .df[ord, , drop = FALSE]
}

df <- data.frame(x = c(2, 3, 1), y = runif(3))

arrange2(df, x)
arrange2(df, -y)

Tangling with dots

I wrote this for Advanced R, but it doesn't feel quite right there. I think it might be better in the programming with dplyr vignette (whatever that ends up being)

Tangling with dots

In our grouped_mean() example above, we allow the user to select one grouping variable and one summary variable. What if we wanted to allow the user to select more than one? One option would be to use .... There are three possible ways we could use it:

  • Pass ... onto the mean() function. That would make it easy to set
    na.rm = TRUE. This is easiest to implement.

  • Allow the user to select multiple groups

  • Allow the user to select multiple variables to summarise.

Implementing each one of these is relatively straightforward, but what if we want to be able to group by multiple variables, summarise multiple variables, and pass extra args on to mean()? Generally, I think it is better to avoid this sort of API (instead relying on multiple functions that each do one thing), but sometimes it is the lesser of two evils, so it is useful to have a technique in your back pocket to handle it.

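library(dplyr)
library(purrr)
library(rlang)

grouped_mean <- function(df, groups, vars, args) {
  # Build one mean() call per variable, splicing in the extra arguments
  var_means <- map(vars, function(var) expr(mean(!!var, !!!args)))
  # Name each summary after the variable it summarises
  names(var_means) <- map_chr(vars, expr_name)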
  
  df %>%
    dplyr::group_by(!!!groups) %>%
    dplyr::summarise(!!!var_means)
}

grouped_mean(mtcars, exprs(vs, am), exprs(hp, drat, wt), list(na.rm = TRUE))

If you use this design a lot, you may also want to provide an alias to exprs() with a better name. For example, dplyr provides the vars() wrapper to support the scoped verbs (e.g. summarise_if(), mutate_at()). aes() in ggplot2 is similar, although it does a little more: it requires all arguments to be named, names the first two arguments (x and y) by default, and automatically renames arguments so you can use the base names for aesthetics (e.g. pch vs shape).

grouped_mean(mtcars, vars(vs, am), vars(hp, drat, wt), list(na.rm = TRUE))

Exercises

  1. Implement the three variants of grouped_mean() described above:

    # ... passed on to mean
    grouped_mean <- function(df, group_by, summarise, ...) {}
    # ... selects variables to summarise
    grouped_mean <- function(df, group_by, ...) {}
    # ... selects variables to group by
    grouped_mean <- function(df, ..., summarise) {}
    

dplyr: working with character vectors of column names

From @jennybc:

library(tidyverse)

pick_me_strings <- c("mpg", "gear")

mtcars %>% 
  select(one_of(pick_me_strings)) %>% 
  glimpse()
#> Observations: 32
#> Variables: 2
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...

mtcars %>% 
  select_at(pick_me_strings) %>% 
  glimpse()
#> Observations: 32
#> Variables: 2
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...

pick_me_exprs <- rlang::exprs(mpg, gear)

mtcars %>% 
  select(!!!pick_me_exprs) %>% 
  glimpse()
#> Observations: 32
#> Variables: 2
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...

Things can get confusing when trying to apply select() knowledge to other verbs:

library(tidyverse)

df <- tibble(
  name = c("abby", "bea", "curt", "doug"),
  happy = c(TRUE, FALSE, FALSE, TRUE),
  awake = c(FALSE, FALSE, TRUE, TRUE),
)

filter_me_strings <- c("happy", "awake")

df %>% 
  filter(one_of(filter_me_strings))
#> Error in filter_impl(.data, quo): Evaluation error: No tidyselect variables were registered.

df %>% 
  filter_at(filter_me_strings)
#> Error in apply_filter_syms(.vars_predicate, syms, .tbl): argument ".vars_predicate" is missing, with no default

df %>% 
  filter(!!!filter_me_strings)
#> Error in filter_impl(.data, quo): Evaluation error: operations are possible only for numeric, logical or complex types.

filter_me_exprs <- rlang::exprs(happy, awake)

df %>% 
  filter(!!!filter_me_exprs)
#> # A tibble: 1 x 3
#>   name  happy awake
#>   <chr> <lgl> <lgl>
#> 1 doug  TRUE  TRUE
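One way to make the string case work for both verbs is sketched below. It assumes dplyr 1.0.0 or later (for `all_of()`) and converts the strings to symbols before splicing them into filter(). This is only one possible approach, not necessarily the one the guide will recommend.

```r
library(dplyr)
library(rlang)

df <- tibble(
  name = c("abby", "bea", "curt", "doug"),
  happy = c(TRUE, FALSE, FALSE, TRUE),
  awake = c(FALSE, FALSE, TRUE, TRUE)
)

filter_me_strings <- c("happy", "awake")

# select(): all_of() takes a character vector of column names
picked <- df %>% select(all_of(filter_me_strings))

# filter(): strings are not expressions, so turn them into symbols first,
# then splice them in as separate filtering conditions
filter_me_syms <- syms(filter_me_strings)
both <- df %>% filter(!!!filter_me_syms)

print(both)
```

The key difference is that select() works with column names, while filter() evaluates expressions, so a string has to become a symbol before it refers to a column.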

Polite request to add examples for select and filter

Would it be possible to add some very brief examples of using rlang with dplyr::select and dplyr::filter? I appreciate they are still "to do". From an educational perspective it would be really useful, since these are such basic building blocks of dplyr.

Many thanks in advance for building these fantastic resources! And apologies if this isn't the best forum in which to make this request.

"How to see what's going on"

I recommend that you gather all the "visibility", "printing", and "debugging" strategies in one place and make it very discoverable. One of my tidy eval learning challenges is that my usual tricks for play and inspection don't work and I can't easily remember the official strategies, like qq_show().
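A few inspection helpers that could go in such a section are sketched here, assuming a reasonably recent rlang. `inspect_arg()` is a hypothetical helper made up for the example.

```r
library(rlang)

# qq_show() prints an expression after `!!` and `!!!` have been processed,
# without evaluating it
qq_show(mean(!!sym("height"), !!!list(na.rm = TRUE)))

# Hypothetical helper: look inside a captured argument
inspect_arg <- function(var) {
  q <- enquo(var)
  print(q)                      # shows the expression and its environment
  list(
    expr  = quo_get_expr(q),    # the bare expression
    label = as_label(q)         # a short text label for the expression
  )
}

captured <- inspect_arg(height / 100)
```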

Replace quote() by expr()

And introduce notion of expression earlier?

rlang::is_expression(expr(mycolumn))
#> [1] TRUE

rlang::is_expression("mycolumn")
#> [1] FALSE

rlang::is_expression(sym("mycolumn"))
#> [1] TRUE

Do we need syms() for an early for loop example?
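For reference, an early for-loop example could get by with sym() alone, as in this sketch (rlang assumed; the column choice is arbitrary):

```r
library(rlang)

cols <- c("mpg", "hp", "wt")
means <- numeric(0)

for (nm in cols) {
  col <- sym(nm)                 # string -> symbol
  call <- expr(mean(!!col))      # build mean(<column>)
  means[[nm]] <- eval_tidy(call, mtcars)
}

print(means)
```

syms() would only be needed once the loop is replaced by splicing all the columns into a single call.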

How can we take inputs that should strictly be column names?

  • ensyms() is not appropriate, as the symbols will still be looked up inside the execution env of the user's wrapper, all the way up to the search path.

  • Do we need envars()? If user supplies symbols foo and bar, they get quoted as .data$foo and .data$bar. If a call is supplied, a helpful error is issued.

  • Other approach: enstrings() as a shortcut for map_chr(ensyms(...), as_string) and providing a tidyselect function to check existence of names in data frame. r-lib/tidyselect#85. The advantage of this approach is that the user can continue working with names as strings, which is sometimes needed (programmatically creating new columns etc).
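The last approach can be sketched as follows. `enstrings()` does not exist in rlang; the version here is a hypothetical stand-in built from `ensyms()` and `as_string()`, with a plain base R check in place of the proposed tidyselect helper. `pick_columns()` is likewise invented for the example.

```r
library(rlang)

# Hypothetical enstrings(): capture bare names (or strings) as strings.
# ensyms() raises an error if a call such as `mpg + 1` is supplied.
enstrings <- function(...) {
  unname(vapply(ensyms(...), as_string, character(1)))
}

# Hypothetical wrapper that only accepts existing column names
pick_columns <- function(df, ...) {
  nms <- enstrings(...)
  unknown <- setdiff(nms, names(df))
  if (length(unknown) > 0) {
    stop("Unknown columns: ", paste(unknown, collapse = ", "))
  }
  df[, nms, drop = FALSE]
}

picked <- pick_columns(mtcars, mpg, gear)
print(head(picked, 3))
```

Because the names stay as strings, they cannot be resolved against the wrapper's environment or the search path, and they remain available for building new column names.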

Add GitHub link and/or edit link

If you have deliberately left out both of these, then just close this.

But if not, it would be nice to add one or both.

As a bookdown reader I often make heavy use of:

  • The little GitHub icon that takes people to the underlying repo. Add with this line in _output.yml.
  • The edit button to facilitate pull requests, especially small typo corrections. Add with this line in _output.yml.

Glossary

As suggested in today's meeting, I think a glossary of terms would be very helpful.

Use one Rmd file per chapter

I think the content is growing enough that it would be easier to work with this if each chapter got its own Rmd file.

mutate_front()

mutate_front <- function(.data, ...) {
  exprs <- enquos(..., .named = TRUE)
  .data <- mutate(.data, ...)

  new_vars <- syms(names(exprs))
  select(.data, !!!new_vars, everything())
}

starwars %>%
  mutate_front(
    height / 100,
    birth_year_fct = cut_number(birth_year, 4)
  )
#> # A tibble: 87 x 15
#>    `height/100` birth_year_fct name  height  mass hair_color skin_color
#>           <dbl> <fct>          <chr>  <int> <dbl> <chr>      <chr>
#>  1         1.72 [8,35]         Luke…    172    77 blond      fair
#>  2         1.67 (72,896]       C-3PO    167    75 NA         gold
#>  3         0.96 [8,35]         R2-D2     96    32 NA         white, bl…
#>  4         2.02 (35,52]        Dart…    202   136 none       white
#>  5         1.5  [8,35]         Leia…    150    49 brown      light
#>  6         1.78 (35,52]        Owen…    178   120 brown, gr… light
#>  7         1.65 (35,52]        Beru…    165    75 brown      light
#>  8         0.97 NA             R5-D4     97    32 NA         white, red
#>  9         1.83 [8,35]         Bigg…    183    84 black      light
#> 10         1.82 (52,72]        Obi-…    182    77 auburn, w… fair
#> # ... with 77 more rows, and 8 more variables: eye_color <chr>,
#> #   birth_year <dbl>, gender <chr>, homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Rephrasing of sentence

In this line:

Rowwise vectorisation in dplyr is a consequence of normal R rules for vectorisation. A vectorised function is a function that works the same way with vectors of 1 element as with vectors of _n_ elements. The operation is applied elementwise (often at the machine code level, which makes them very efficient). We have already mentioned the vectorisation of `toupper()`, and many other functions in R are vectorised. One important class of vectorised functions is the arithmetic operators:

you say:

We have already mentioned the vectorisation of toupper(), and many other functions in R are vectorised.

But I have not found this mention. At least, not in this book :)
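For readers following along, the behaviour the quoted paragraph describes can be seen in a couple of lines of base R (the vectors are arbitrary examples):

```r
# toupper() works the same way on one element as on several
one  <- toupper("abby")
many <- toupper(c("abby", "bea", "curt"))

# Arithmetic operators are vectorised too: applied elementwise
doubled <- c(1, 2, 3) * 2

print(many)
print(doubled)
```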

An idea: Introduce quoting as an expression/data transformation from the beginning

I did an initial sketch of this idea in this PR: #9

The premise is based purely on my own experience: I found quoting hard to understand until I encountered it explained as a 'code/data transformation'.

I think this idea could be used to add clarity to quoting in this new tidyeval documentation.

Take this first para introducing quoting functions:

On the other hand, a quoting function is not passed the value of an expression, it is passed the expression itself. We say the argument has been automatically quoted. The quoted expression might be evaluated a bit later or might not be evaluated at all. The simplest quoting function is quote(). It automatically quotes its argument and returns the quoted expression without any evaluation. Because only the expression passed as argument matters, none of these statements are equivalent:

It does not define quoting. It also says about quote:

It automatically quotes its argument and returns the quoted expression without any evaluation.

Which I found very confusing until I learned of the 'expert' definition of 'evaluate' it employs - which requires some appreciation of parser + evaluator model.

Using the code/data transform idea:

On the other hand, a quoting function is not passed the value of an expression, it is passed the expression itself as data. We call the process of converting an expression to data quoting. The quoted expression might be evaluated a bit later or might not be evaluated at all. The simplest quoting function is quote(). It automatically quotes its argument and returns the quoted expression as data. Because it is not the value of the expression, but its representation as data that matters, none of these statements are equivalent:

It defines quoting explicitly as 'the process of converting an expression to data'. It avoids saying 'unevaluated', which in turn avoids the need to explain interpreter internals.

There are a couple of further ideas in the PR. Please accept this as a genuine suggestion based on a frame of explanation I found useful as a conceptual newcomer. A blog post I wrote exploring this idea was received well with many independent positive expressions of feedback. This gives me some confidence in its broader applicability and appeal.
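The 'expression as data' framing can be made concrete with base R alone. The sketch below treats a quoted expression like any other data structure: it is inspected, modified, and only then evaluated.

```r
# Quoting turns code into a data structure (a call object)
e <- quote(mean(x) + 1)

is_call <- is.call(e)        # the expression is an object we can test
parts   <- as.list(e)        # its pieces: the function `+`, then its arguments
fn_name <- as.character(e[[1]])

# Because it is data, it can be modified like data...
e[[1]] <- as.name("-")       # now represents mean(x) - 1

# ...and turned back into a value only when we choose to evaluate it
value <- eval(e, list(x = c(1, 2, 3)))

print(e)
print(value)
```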
