Comments (29)

moutikabdessabour avatar moutikabdessabour commented on July 1, 2024 1

I think I'm onto something. Apparently data.table doesn't prepare .SD unless it sees it in the j expression, which is why these work but a plain dot_sd() call doesn't. I guess this is done to reduce the overhead of materializing the full .SD data.table. We can sneak the names into the data_mask when rowwise is used; this might require rewriting mutate.R to implement a low-level function that would be called by both mutate_rowwise. and mutate.:

copy(test_df_t)[, `:=`(c("df_names"),{.SD;.(dot_sd())})][]
#>    a b df_names
#> 1: 1 1        a
#> 2: 2 2        b

copy(test_df_t)[, df_names := (function(){eval_tidy(expr(names(.SD)), env = caller_env())})()][]
#>    a b df_names
#> 1: 1 1        a
#> 2: 2 2        b


markfairbanks avatar markfairbanks commented on July 1, 2024 1

using !!! inside expr which pushes it to evaluate before quoting

Hmm yep I think you're right on that. I might have a workaround. I should be able to give it a try later today.

Side note - you can use select_vec_sym(), it's shorthand for syms(select_vec_chr())

  • select_vec_idx() returns the column index/positions as a (named) integer vector
  • select_vec_chr() returns the column names as a character vector
  • select_vec_sym() returns the column names as a list of symbols
pacman::p_load(tidytable)

test_df <- tidytable(x = 1:3, y = 1:3, z = 1:3)

# _idx returns column positions as an integer vector
tidytable:::select_vec_idx(test_df, c(x, z))
#> x z 
#> 1 3

# _chr returns column names as a character vector
tidytable:::select_vec_chr(test_df, c(x, z))
#> [1] "x" "z"

# _sym returns column names as a list of symbols
tidytable:::select_vec_sym(test_df, c(x, z))
#> [[1]]
#> x
#> 
#> [[2]]
#> z

Created on 2020-12-16 by the reprex package (v0.3.0)


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024 1

There is still hope that we could achieve this, and it's totally safe to use unlist since we have only one row per group. This will require modifying the quo_text output to add the unlist(!!) call:

select_vec <- tidytable:::select_vec_chr
dot_sd_ac <- function(cols=everything()){
  expr(mget(select_vec(.SD, {{cols}})))
} 
test_df %>%
  mutate_rowwise.(dot_sd_nrow = mean(unlist(!!dot_sd_ac())),
                  row_mean = mean(c(x, y, z)))
#> # tidytable [3 × 5]
#>       x     y     z dot_sd_nrow row_mean
#>   <int> <int> <int>       <dbl>    <dbl>
#> 1     1     1     1           1        1
#> 2     2     2     2           2        2
#> 3     3     3     3           3        3


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024 1

Actually my old code accounts for all the cases mentioned above. I'll refactor it, make it more readable, and add any string manipulation functions to utils-stringr.R. To show how it handles all the cases, let's say we have mean(c_across()-c_across(V1:V4)) - sum(c_across(V1:V4)):

  • \\((?>[^()]|(?R))*\\): first I extract the balanced parentheses from the quosure's text, i.e. I get two matches, (c_across()-c_across(V1:V4)) and (c_across(V1:V4)), because a c_across call will always sit inside another call. This is a special recursive regex supported by PCRE (which can be enabled in base R) but not by ICU, the regex engine used by stringr.
  • c_across(?:.(?!c_across))+: then I take the part of each match that contains c_across; (?:.(?!c_across))+ matches everything up to the next c_across call. We now have two substrings for the first match, c_across()- and c_across(V1:V4), and one for the second, c_across(V1:V4).
  • then I re-extract the balanced parentheses: () and (V1:V4) for the first match, and (V1:V4) for the second.
    This is the safest approach (I think) to handle such an issue (see the sketch below).
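
To make the three steps concrete, here is a minimal sketch in plain base R (no tidytable internals, just regmatches()/gregexpr() with perl = TRUE) run on that example expression:

x <- "mean(c_across()-c_across(V1:V4)) - sum(c_across(V1:V4))"

patterns <- c(
  "\\((?>[^()]|(?R))*\\)",      # 1. balanced parentheses (recursive PCRE)
  "c_across(?:.(?!c_across))+", # 2. the c_across part of each match
  "\\((?>[^()]|(?R))*\\)"       # 3. the parentheses following each c_across
)

for (pattern in patterns) {
  x <- unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

unique(x)
#> [1] "()"      "(V1:V4)"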


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024 1
utils-stringr.R

These are non-breaking changes:

  • Exposed the fixed and perl arguments in str_detect., str_extract., and str_replace_all.
  • Changed the behavior of str_replace_all.: it now iterates over all the patterns and replaces each one with the corresponding replacement using gsub. This is a backwards-compatible change (a quick usage example follows the code below).
str_detect. <- function(string, pattern, fixed=F, perl=F) {
  grepl(pattern, string, fixed=fixed, perl=perl)
}

# Allows for multiple pattern replacements, but unlike stringr it applies them all to the same string
# stringr's behavior is to return a list where each element is the x vector modified by one (pattern, replacement) pair
str_replace_all. <- function(string, pattern, replacement, fixed=F, perl=F) {
  if(missing(replacement)){
    replacement <- unname(pattern)
    pattern <- names(pattern)
  }else if(length(pattern)!=length(replacement)){
    if(length(pattern) > 1 & length(replacement) > 1) stop("replacement has to have the same length as pattern")
    else {
      if(length(replacement)==1) replacement <- rep(replacement, length(pattern))
      else pattern <- rep(pattern, length(replacement))
    }
  }
  for(i in 1:length(pattern))
    string <- gsub(pattern[[i]], replacement[[i]], string,  perl = perl, fixed = fixed)
  string
}

str_extract. <- function(x, pattern, fixed=F, perl=F) {
  regmatches(x, regexpr(pattern, x, fixed=fixed, perl=perl))
}

str_extract_all. <- function(x, pattern, fixed=F, perl=F) {
  regmatches(x, gregexpr(pattern, x, fixed=fixed, perl=perl))
}
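
A quick usage check of the named-vector form of str_replace_all.() described above, assuming the helpers are loaded as defined:

str_replace_all.("c_across(V1:V4)", c("c_across" = "c", "V" = "col"))
#> [1] "c(col1:col4)"
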
mutate_rowwise. function

This implementation is the fastest one I could make yet. The trick is that I evaluate select_vec_chr only once at the top, sparing us the overhead of recalculating it for every row, i.e. nrow(.df) times. It's way faster than dplyr (about 5x) and needs roughly a third of the memory dplyr allocates (if bench::mark is to be trusted), which may be due to the fact that they use lazy evaluation. The question remains: is lazy evaluation worth it? I think I have an idea to implement it the same way I did for #166 but optimized for c_across. Do you think that would be interesting or even useful?
One note: it's possible that for some reason a user already has a .rowwise_id column (improbable, but there is a chance), in which case that data will be lost (a possible guard is sketched below).
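
One possible guard (purely hypothetical, not part of the implementation below) would be to generate a temporary id name that is guaranteed not to collide:

# hypothetical helper: pick an id column name not already present in the data
temp_id_name <- function(.df, base = ".rowwise_id") {
  nm <- base
  while (nm %in% names(.df)) nm <- paste0(nm, "_")
  nm
}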

pacman::p_load(tidytable, glue, vctrs)
shallow <- tidytable:::shallow
eval_quo <- tidytable:::eval_quo
select_vec_chr <- tidytable:::select_vec_chr

mutate_rowwise. <- function(.df, ...) {

  .df <- as_tidytable(.df)
  
  dots <- enquos(...)
  
  if (length(dots) == 0) return(.df)
  
  dots_names <- names(dots)
  all_names <- names(.df)
  if (any(dots_names %in% all_names )) .df <- copy(.df)
  else .df <- shallow(.df)
  
  .df[, .rowwise_id := .I] # 1:.N
  
  data_env <- env(quo_get_env(dots[[1]]), .df = .df)
  
  dots_text <- map_chr.(dots, quo_text)
  cs <- str_detect.(dots_text, "c_across.(", fixed=T)
  if(any(cs)){
    vars <- grab_vars(dots_text[cs])
    all_names <- set_names(all_names)
    selected <- map_chr.(vars, function(.x) toString(select_vec_chr(all_names, !!parse_expr(if(.x=="()") 'everything()' else .x))))

    dots_text[cs] <- str_replace_all.(dots_text[cs], glue("c_across.{vars}"), glue("vec_c(!!!c({selected}))"), fixed=T)

    dots[cs] <- map.(which(cs), function(.x) quo_set_expr(dots[[.x]], parse_expr(dots_text[[.x]])))
  }


  eval_quo(
    .df[, ':='(!!!dots), by = .rowwise_id],
    new_data_mask(data_env), env = caller_env()
  )
  
  .df[, .rowwise_id := NULL][]
}

# gets the insides of the c_across.es
grab_vars <- function(x){
    for(pattern in c("\\((?>[^()]|(?R))*\\)", "c_across\\.(?:.(?!c_across\\.))+", "\\((?>[^()]|(?R))*\\)")) 
        x <- unlist(str_extract_all.(x, pattern, perl=T))
    unique(x)
}

test_df <- as_tidytable(matrix(sample(1:1000, 200000, replace=T), ncol= 20))
x <- paste0("V", 15:23)
y <- paste0("V", 15:20)

bench::mark(
    tidytable = mutate_rowwise.(test_df, row_mean = mean(c_across.()), neg_sum=sum(c_across.(!V9:V13)), xy_sums=sum(c_across.(any_of(x))), y_sum=sum(c_across.(all_of(y)))),
    tidyverse= mutate(rowwise(test_df), row_mean = mean(c_across()), neg_sum=sum(c_across(!V9:V13)), xy_sums=sum(c_across(any_of(x))), y_sum=sum(c_across(all_of(y)))),
  iterations=100, check=F
)
#> # A tibble: 2 x 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 tidytable  403.94ms  441.9ms     2.22     4.53MB     1.53   100    69        45s
#> 2 dyplr         1.85s    2.01s     0.490   14.66MB     1.44   100   294       3.4m

microbenchmark::microbenchmark(
    tidytable = mutate_rowwise.(test_df, row_mean = mean(c_across.()), neg_sum=sum(c_across.(!V9:V13)), xy_sums=sum(c_across.(any_of(x))), y_sum=sum(c_across.(all_of(y)))),
    dyplr = mutate(rowwise(test_df), row_mean = mean(c_across()), neg_sum=sum(c_across(!V9:V13)), xy_sums=sum(c_across(any_of(x))), y_sum=sum(c_across(all_of(y)))), times=100)
#> Unit: milliseconds
#>       expr       min        lq      mean    median        uq      max neval cld
#>  tidytable  392.3277  432.9002  509.7025  465.0776  516.9729 1237.068   100  a 
#>      dyplr 1896.7422 2092.7374 2328.7882 2173.4218 2339.8404 4693.214   100   b


markfairbanks avatar markfairbanks commented on July 1, 2024 1

Awesome - the string implementation looks good at first glance. I'll take a more in depth look at everything in the next week or so.

I'm pretty sure that your implementation will be the one we go with, but I also want to take another run at the "normal workflow" implementation before we officially get this added.

I was able to get c_across.() to work more normally, but it's extremely slow at the moment:

mutate_rowwise. <- function(.df, ...) {

  .df <- as_tidytable(.df)
  
  dots <- enquos(...)
  
  if (length(dots) == 0) return(.df)
  
  dots_names <- names(dots)
  
  if (any(dots_names %in% names(.df))) .df <- copy(.df)
  else .df <- shallow(.df)
  
  .df[, .rowwise_id := .I]
  
  data_env <- env(quo_get_env(dots[[1]]), .df = .df, dots_names = dots_names)
  
  eval_quo(
    .df[, (dots_names) := {.SD; list(!!!dots)}, by = .rowwise_id],
    new_data_mask(data_env), env = caller_env()
  )
  
  .df[, .rowwise_id := NULL][]
}

c_across. <- function(cols = everything()) {
  
  .sd <- get('.SD', envir = caller_env())
  
  select_syms <- select_vec_sym(.sd, {{ cols }})
  
  eval_quo(vctrs::vec_c(!!!select_syms), .sd)
}


markfairbanks avatar markfairbanks commented on July 1, 2024 1

@moutikabdessabour Thanks again for all of the work on this! I don't think I would have been able to get this as nicely implemented as what you put together. Let's just go with the string implementation.

FYI - I already updated utils-stringr.R with the new versions so we can use them in mutate_rowwise.().

Can you make a pull request with the following changes:

  • A mutate_rowwise.R file (see implementation below)
  • A c_across.R file (see implementation below)
  • Run devtools::document() inside the package Rproject so documentation is added for both functions
  • Make sure that in one of your commit messages you note that it closes this issue. That way it is closed automatically when I merge the PR.
  • Update the NEWS.md file and put mutate_rowwise.() under a "New Functions" header (you'll be able to see in the NEWS.md file how I did this in older versions of the package). Also if you want you can put your Github username there. (See v0.5.7 for how I noted your contribution to crossing.())
  • (Optional) Add unit tests. If you're unsure how let me know and I can walk you through it. If you do decide to add them make sure they are pretty simple (data frames should have 3 rows or so and 2 or 3 columns). This one is definitely optional - I can add them if you want.

A few notes:

  • I renamed grab_vars() to extract_cols() (to be slightly more descriptive of what it's doing)
  • I switched to using vec_unique() in extract_cols(). It's slightly faster than base::unique().
  • When creating the selected variable:
    • I passed the function the data frame instead of all_names, because c_across.(where(is.integer)) was failing otherwise.
    • I also created a helper function called get_selected() since it was a pretty long function call inside of map_chr.()
  • I switched to using ~ instead of function(.x) in map.() functions
  • I adjusted some things to fit the tidyverse style guide. Some of these aren't followed as commonly in other programming languages, but they are general rules of thumb when using R. Here are the ones I did (a small before/after sketch follows this list):
    • Some general changes around spacing (for example putting a space before and after an = sign)
    • Made adjustments to keep "line length" from being too long
    • Made variable names descriptive - renamed cs to use_across
    • Use TRUE/FALSE instead of T/F
    • With if/else statements - the style guide recommends always using curly brackets { }. This is one rule that I do slightly differently. My rule is this - either both statements use curly brackets, or neither statement does. Edit: Whoops, looks like I actually did this in str_replace_all.(). I'll leave this here though since it's worth mentioning.
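
A small before/after sketch of the spacing and TRUE/FALSE rules (a generic illustration, not code from the PR):

# Before
str_detect.<-function(string,pattern,fixed=F) grepl(pattern,string,fixed=fixed)

# After
str_detect. <- function(string, pattern, fixed = FALSE) {
  grepl(pattern, string, fixed = fixed)
}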

mutate_rowwise.R

#' Add/modify columns by row
#' 
#' @description 
#' Allows you to mutate "by row". This is most useful when a vectorized function doesn't exist.
#'
#' @param .df A data.table or data.frame
#' @param ... Columns to add/modify
#'
#' @export
#'
#' @examples
#' test_df <- data.table(x = runif(6), y = runif(6), z = runif(6))
#' 
#' # Compute the mean of x, y, z in each row
#' test_df %>%
#'   mutate_rowwise.(row_mean = mean(c(x, y, z)))
#'   
#' # Use c_across.() to more easily select many variables
#' test_df %>%
#'   mutate_rowwise.(row_mean = mean(c_across.(x:z)))
mutate_rowwise. <- function(.df, ...) {
  UseMethod("mutate_rowwise.")
}

mutate_rowwise..data.frame <- function(.df, ...) {
  
  .df <- as_tidytable(.df)
  
  dots <- enquos(...)
  
  if (length(dots) == 0) return(.df)
  
  if (any(names(dots) %in% names(.df))) .df <- copy(.df)
  else .df <- shallow(.df)
  
  .df[, .rowwise_id := .I]
  
  data_env <- env(quo_get_env(dots[[1]]), .df = .df)
  
  dots_text <- map_chr.(dots, quo_text)
  use_across <- str_detect.(dots_text, "c_across.(", fixed = TRUE)
  
  if (any(use_across)) {
    
    cols <- extract_cols(dots_text[use_across])
    
    selected <- map_chr.(cols, ~ get_selected(.df, .x))
    
    dots_text[use_across] <- str_replace_all.(
      dots_text[use_across],
      glue("c_across.{cols}"),
      glue("vec_c(!!!c({selected}))"),
      fixed = TRUE
    )
    
    dots[use_across] <- map.(which(use_across), ~ quo_set_expr(dots[[.x]], parse_expr(dots_text[[.x]])))
  }
  
  eval_quo(
    .df[, ':='(!!!dots), by = .rowwise_id],
    new_data_mask(data_env), env = caller_env()
  )
  
  .df[, .rowwise_id := NULL][]
}

# extract the inside of each c_across call
extract_cols <- function(x) {
  
  patterns <- c(
    "\\((?>[^()]|(?R))*\\)",
    "c_across\\.(?:.(?!c_across\\.))+",
    "\\((?>[^()]|(?R))*\\)"
  )
  
  for (pattern in patterns) x <- unlist(str_extract_all.(x, pattern, perl = TRUE))
  
  vec_unique(x)
}

get_selected <- function(.df, cols) {
  cols <- if (cols == "()") "everything()" else cols
  
  toString(select_vec_chr(.df, !!parse_expr(cols)))
}

c_across.R

#' Combine values from multiple columns
#' 
#' @description
#' `c_across.()` works inside of `mutate_rowwise.()`. It uses tidyselect so
#' you can easily select multiple variables.
#'
#' @param cols Columns to transform.
#'
#' @export
#'
#' @examples
#' test_df <- data.table(x = runif(6), y = runif(6), z = runif(6))
#'
#' test_df %>%
#'   mutate_rowwise.(row_mean = mean(c_across.(x:z)))
c_across. <- function(cols = everything()) {
  abort("c_across.() can only be used inside of mutate_rowwise.()")
}


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024 1

@moutikabdessabour Thanks again for all of the work on this!

My pleasure, I actually enjoy coding in R, which is why I keep checking the issues section. Also thank you for the notes; I always read the tidyverse style guide but almost always fail to implement it 😞.

Actually I just noticed that our code will break when the data.table has columns with spaces in their names, so I changed get_selected() (I just glue backticks around the column names). Also, .rowwise_id was slipping into get_selected(), so I moved its creation under the if statement:

get_selected <- function(.df, cols) {
  cols <- if (cols == "()") "everything()" else cols
  
  toString(glue("`{select_vec_chr(.df, !!parse_expr(cols))}`"))
}
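
A quick illustration of why the backticks matter (not package code): a space-named column only parses as a symbol when it is backtick-quoted.

rlang::parse_expr("mean(c(`x y`, `x z`))")
#> mean(c(`x y`, `x z`))

# Without the backticks, "mean(c(x y, x z))" would be a parse error.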

TODO

  • A mutate_rowwise.R file (with the minor changes stated above)
  • A c_across.R file.
  • Run devtools::document().
  • Mention in commits that it solves this issue
  • Update NEWS.md
  • Unit tests: I'm pasting them below. If you could expand on this point I'd be happy to learn. Should I add them to the mutate-variants test file?
test_that("mutate_rowwise.() adds column", {
  test_df <- data.table(x = 1:3, y = 4:6, z = c("a", "a", "b"))

  results_df <- test_df %>%
    mutate_rowwise.(sum = x + y)

  expect_equal(results_df$sum, c(5, 7, 9))
})

test_that("mutate_rowwise.() doesn't modify by reference", {
  test_df <- data.table(x = 1:3, y = 4:6, z = c("a", "a", "b"))

  results_df <- test_df %>%
    mutate_rowwise.(x = x + y)

  expect_equal(test_df$x, c(1, 2, 3))
  expect_equal(results_df$x, c(5, 7, 9))
})

test_that("c_across.() provides all columns", {
  test_df <- data.table(x = 1:3, y = 4:6, z = c("a", "a", "b"))

  results_df <- test_df %>%
    mutate_rowwise.(pasted = paste0(c_across.(), collapse=""))
  results_df_every <- test_df %>%
    mutate_rowwise.(pasted = paste0(c_across.(cols = everything()) , collapse=""))


  expect_equal(results_df$pasted, c("14a", "25a", "36b"))
  expect_equal(results_df$pasted, results_df_every$pasted)
})


test_that("c_across.(cols = where(is.numeric)) provides numeric columns", {
  test_df <- data.table(x = 1:3, y = 4:6, z = c("a", "a", "b"))

  results_df <- test_df %>%
    mutate_rowwise.(pasted = mean(c_across.(cols = where(is.numeric))))


  expect_equal(results_df$pasted, c(2.5, 3.5, 4.5))
})

test_that("c_across.(!) negates column", {
  test_df <- data.table(x = 1:3, y = 4:6, z = c("a", "a", "b"))

  results_df <- test_df %>%
    mutate_rowwise.(pasted = mean(c_across.(!z)))


  expect_equal(results_df$pasted,  c(2.5, 3.5, 4.5))
})

test_that("c_across.() can only be used inside mutate_rowwise.()", {
  test_df <- data.table(x = 1:3, y = 4:6, z = c("a", "a", "b"))


  expect_error(test_df %>%
    mutate.(pasted = paste0(c_across.(!z))))
})

test_that("c_across.() works with space named columns", {
  test_df <- data.table(`x y`=1:3, `x z`=1, y=3)
  resultdf <- test_df %>%
    mutate_rowwise.(sum = sum(c_across.(contains(" "))))

  expect_equal(resultdf$sum, c(2, 3, 4))
})


markfairbanks avatar markfairbanks commented on July 1, 2024 1

That all looks great. The tests look good to me.

As far as location, let's put them in a new file. If you run usethis::use_test("mutate_rowwise") in the Rproject for the package it will create/open a new file for mutate_rowwise testing.


markfairbanks avatar markfairbanks commented on July 1, 2024 1

Got this all merged! Thanks again.

I should have mentioned this in my comment above - in your commit message to close an issue you have to use "Closes". For example - "Did this and that. Closes #xx"


markfairbanks avatar markfairbanks commented on July 1, 2024

The above implementation fails if there are any list columns. This implementation will work instead. Seems to be the safest option:

mutate_rowwise. <- function(.df, ...) {

  .df[, .rowwise_id := 1:.N]
  
  .df <- mutate.(.df, ..., .by = .rowwise_id)

  .df[, .rowwise_id := NULL]

  .df
}


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024

Here is a nasty implementation of c_across. I basically get the function call, run through it to find c_across calls, replace them with the actual columns (evaluated with tidytable:::select_vec_chr and wrapped in a c() call), then call mutate. with the modified args:

pacman::p_load(rlang, tidytable, tidyverse)
mutate.rowwise <- function(.df, ...){
  # get the call without the actual function name
  l <- as.list(sys.call())[-1]
  # remove .df arg and unnamed args (the latter is to account for the use of pipes)
  l <- l[!grepl("^(.df|)$", names(l))]
  # convert the arguments to character
  l_chr <- lapply(l, as.character)
  # str2lang reverses the transformation from character back to an expression;
  # as.call turns each list back into a call
  lapply(l_chr, function(x) as.call(lapply(grep_vec_chr(x, .df=.df), str2lang))) -> l_chr
  
  .df <- tidytable:::shallow(.df)
  .df[, .rowwise_id := .I]
   # call mutate with the new expressions
  .df <- do.call(mutate., c(list(.df=.df, .by=".rowwise_id"), l_chr))
  
  .df[, .rowwise_id := NULL][]
}
# replaces c_across calls with the actual columns
grep_vec_chr <- function(.df, select_vars){
  # checks for c_across
  grep("c_across", select_vars) -> across
  if(length(across)>0){
    # if found replace with the actual column
    select_vars[across] <- sapply(select_vars[across], function(x) {
      vars <-  sub("^c\\((.+)\\)$", "\\1", sub("^c_across\\((.+)\\)$", "\\1", x))
      sprintf("c(%s)", paste0(unlist(lapply(strsplit(vars, ", ?")[[1]], function(var) tidytable:::select_vec_chr(.df, !!parse_expr(var))), use.names=F) , collapse=","))
    })
  }
  select_vars
}
set.seed(1)
test_df <- as_tidytable(matrix(sample(1:10, 200, replace=T), ncol= 20))

grep_vec_chr(test_df, "c_across(V5:V3)")
[1] "c(V5,V4,V3)"
 grep_vec_chr(test_df, "c_across(!V5:V3)")
[1] "c(V1,V2,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20)"

test_df %>% mutate.rowwise(z=sum(c_across(V1:V5)), l=mean(c_across(c(V5, V20:V10)))) %>% select.(z:l) -> tdtb
test_df %>% rowwise() %>% mutate(z=sum(c_across(V1:V5)), l=mean(c_across(c(V5, V20:V10)))) %>% select(z:l) -> dplr
identical(tidytable(dplr), tdtb)

If we could use stringr the code would be improved and would work for cases where we have, for example, l=mean(c_across(c(V5, V20:V10))) - mean(c_across(!V5:20)). However, I don't think this is the right approach, just a step towards finding the right solution. I'd be happy to hear your feedback.

After some toying around I made a nastier implementation that accounts for multiple c_across calls:

grep_vec_chr <- function(.df, select_vars){
  grep("c_across", select_vars) -> across
  if(length(across)>0){
    select_vars[across] <- sapply(select_vars[across], function(x) {
      cs <- regmatches(x, gregexpr("c_across(?:.(?!c_across|(?!\\bc\\()[a-z]+\\())+", x, perl=T))[[1]]
      regmatches(cs, l<-gregexpr("\\((?>[^()]|(?R))*\\)", cs, perl=T)) -> vars

      rep <- lapply(vars, function(var) paste0(tidytable:::select_vec_chr(.df, !!parse_expr(var)), collapse=","))
      for(i in 1:length(vars)){
        x <- gsub(sprintf("c_across%s", vars[[i]]), sprintf("c(%s)",rep[[i]]), x, fixed=T)
      }
      x
    })
  }
  select_vars
}

mutate.rowwise <- function(.df, ...){
  l <- as.list(sys.call())[-1]
  l <- l[!grepl("^(.df|)$", names(l))]
  lapply(l, function(x) {
    grep_vec_chr(as.character(x), .df=.df)->x
    parse_expr(sprintf("`%s`(%s)", x[[1]], paste0(x[-1], collapse=",")))
  }) -> l_chr
  
  .df <- tidytable:::shallow(.df)
  .df[, .rowwise_id := .I]

  .df <- do.call(mutate., c(list(.df=.df, .by=".rowwise_id"), l_chr))
  
  .df[, .rowwise_id := NULL][]
}

> (test_df %>% mutate.rowwise(z=sum(c_across(V1:V5)), l=sum(c_across(c(V20:V10)))- sum(c_across(c(!V1:V9)))) %>% select.(z:l) -> tdtb)
#> # tidytable [10 × 2]
#>        z     l
#>    <int> <int>
#>  1    38     0
#>  2    32     0
#>  3    27     0
#>  4    31     0
#>  5    38     0
#>  6    33     0
#>  7    21     0
#>  8    27     0
#>  9    31     0
#> 10    40     0
> (test_df %>% rowwise() %>% mutate(z=sum(c_across(V1:V5)), l=sum(c_across(c( V20:V10)))- sum(c_across(c(!V1:V9)))) %>% select(z:l) -> dplr)
#> # A tibble: 10 x 2
#> # Rowwise: 
#>        z     l
#>    <int> <int>
#>  1    38   -38
#>  2    32   -32
#>  3    27   -27
#>  4    31   -31
#>  5    38   -38
#>  6    33   -33
#>  7    21   -21
#>  8    27   -27
#>  9    31   -31
#> 10    40   -40

I think there is an issue with the dplyr implementation, as l should be 0. Although the second implementation looks promising, it only accounts for c() and not the other select helpers, which pushed me towards a more robust implementation:

grep_c_across <- function(x) {
  # get balanced parentheses
  regmatches(x, gregexpr("\\((?>[^()]|(?R))*\\)", x, perl=T))[[1]] -> x
  # get the part of the string that follows the c_across
  x <- unlist(regmatches(x, gregexpr("c_across(?:.(?!c_across))+", x, perl=T)))
  # grab the parentheses that follows the c_across call
  regmatches(x, gregexpr("\\((?>[^()]|(?R))*\\)", x, perl=T))
}

grep_vec_chr <- function(.df, select_vars){
  grep("c_across", select_vars) -> across
  if(length(across)>0){
    select_vars[across] <- sapply(select_vars[across], function(x) {
      rep <- lapply(vars<-grep_c_across(sprintf("(%s)", x)), function(var) paste0(tidytable:::select_vec_chr(.df, !!parse_expr(var)), collapse=","))
      for(i in 1:length(vars)){
        x <- gsub(sprintf("c_across%s", vars[[i]]), sprintf("c(%s)",rep[[i]]), x, fixed=T)
      }
      x
    })
  }
  select_vars
}

This works well with any helper:

z <- c("V1", "V4")
grep_vec_chr(test_df, "mean(c_across(any_of(z)))- mean(c_across(c(!V20:V10)))")
#> [1] "mean(c(V1,V4))- mean(c(V1,V2,V3,V4,V5,V6,V7,V8,V9))"
z <- c("V1", "V01")
grep_vec_chr(test_df, "mean(c_across(any_of(z)))- mean(c_across(c(!V20:V10)))")
#> [1] "mean(c(V1))- mean(c(V1,V2,V3,V4,V5,V6,V7,V8,V9))"
#> # Using all_of errors as expected
grep_vec_chr(test_df, "mean(c_across(all_of(z)))- mean(c_across(c(!V20:V10)))")
#> Error: Can't subset columns that don't exist.
#> ✖ Column `V01` doesn't exist.
#> Run `rlang::last_error()` to see where the error occurred.
grep_vec_chr(test_df, "mean(c_across(last_col()))- mean(c_across(ends_with('2')))")
#> [1] "mean(c(V20))- mean(c(V2,V12))"

The code can be optimized; I only wrote it to find a working solution to the problem at hand.

After opening an issue in the dplyr repo, it turns out the result is due to lazy evaluation: z is included in !V9:V1. Also, the dplyr implementation keeps track of the data used inside a DataMask, as well as the calls, which is why they have a pretty straightforward implementation of across and c_across; they basically keep peeking at the mask to get the call and then select the vars from the DataMask.


markfairbanks avatar markfairbanks commented on July 1, 2024

@moutikabdessabour Thanks for taking a shot at this, it's definitely a tough problem.

In general I'm trying to avoid using a "string manipulation to expression" workflow. I've had to bend that rule once so far (to get n.() to work in summarize.()), but I'd like to avoid it if possible. (There might be another example I can't think of at the moment...) Hopefully we can find something that has more of a straightforward workflow.

As far as peeking in the mask - unfortunately we can't do that because our data mask isn't the data frame itself, it's an environment that we evaluate the data.table expression in. That trick is core to making tidytable work in other areas, so rewriting it to use the data frame as a mask would cause a bunch of other use cases to fail. We also don't need to peek .df, we would need to peek .SD, which is only available inside a data.table call. Pretty tough problem overall.

This one is definitely at the top of the list for v0.5.8 (I think I'll be releasing v0.5.7 in the next couple weeks). As you can tell by the June creation of this issue, this function has been on the backburner for a while 😂

If you have any other attempts at this feel free to log them here. I think I'll be able to visit this in the next few weeks and we'll see what we can come up with


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024

I don't think there is another way to achieve this; however, I did rewrite the code to use rlang functions: basically, check whether quo_text contains a c_across, and if so use quo_set_expr to replace the quoted expression. This new version works like the dplyr implementation and allows for lazy eval once #166 is solved. The helper functions could be added to utils.R. As for the regular expressions used: the first one grabs only balanced parentheses, the next grabs the c_across calls, and the last grabs the balanced parentheses found inside the c_across call. The fact that it's tidyselect-helper agnostic makes it flexible, i.e. across(myfunc(1:2)) will work just like any tidyselect helper.

mutate.rowwise <- function(.df, ...){
  dots <- enquos(...)
  .df <- tidytable:::shallow(.df)

  .df[, .rowwise_id := .I]
  str_extract_helper_quo(dots, "c_across") -> cs
  !sapply(cs, is.null) -> any_cs
  
  if(any(any_cs)){
    dots[any_cs] <- lapply(which(any_cs), function(i) quo_set_expr(dots[[i]], parse_expr(str_gsub(sprintf("c_across%s", cs[[i]]), sprintf("!!!syms(tidytable:::select_vec_chr(.df[, -'.rowwise_id'], %s))", cs[[i]]), quo_text(dots[[i]])))))
    .df <- do.call(mutate., c(list(.df=.df, .by = ".rowwise_id"), dots))
  }else{
    .df <- mutate.(.df, ..., .by = .rowwise_id)
  }

  .df[, .rowwise_id := NULL]

  .df
}

########################################################################################################################
########################################                                         #######################################
########################################       String manipulation helpers       #######################################
########################################                                         #######################################
########################################################################################################################
# multiple pattern substitution using base R
str_gsub <- function(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = TRUE, useBytes = FALSE){
  if(length(pattern)!=length(replacement)){
    if(length(pattern) > 1 & length(replacement) > 1) stop("replacement has to have the same length as pattern")
    else {
      if(length(replacement)==1) replacement <- rep(replacement, length(pattern))
      else pattern <- rep(pattern, length(replacement))
    }
  }
  for(i in 1:length(pattern)){
    x <- gsub(pattern[[i]], replacement[[i]], x, ignore.case = ignore.case, perl = perl, fixed = fixed, useBytes = useBytes)
  }
  x
}
### Pattern extractors 
# extracts a pattern from a string 
str_extract <- function(text, pattern, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE, invert=FALSE){
  regmatches(text, gregexpr(pattern, text, ignore.case = ignore.case, perl = perl, fixed = fixed, useBytes = useBytes), invert=invert)
}
# extracts a series of patterns in the given order
str_extract_ordered <- function(text, patterns, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE, invert=FALSE){
  text <- str_extract(text, patterns[[1]], ignore.case = ignore.case, perl = perl, fixed = fixed, useBytes = useBytes, invert=invert)
  for (pattern in patterns[-1]) text <- lapply(text, function(x) unlist(str_extract(x, pattern, ignore.case = ignore.case, perl = perl, fixed = fixed, useBytes = useBytes, invert=invert)))
  text
}
# extracts helper functions from text
str_extract_helper <- function(text, func_name){
  str_extract_ordered(text, c("\\((?>[^()]|(?R))*\\)", sprintf("%s(?:.(?!%1$s))+", func_name), "\\((?>[^()]|(?R))*\\)"), perl=T )
}
str_extract_helper_quo <- function(quos, func_name){
  str_extract_helper(sapply(quos, quo_text), func_name)
}

@markfairbanks maybe we could pull it off without string manipulation if there is a way to turn this list of symbols into a vector of symbols (I tried !!!; it failed because it can't be used at top level, and wrapping it in a c() call doesn't work). I'll give it a shot tomorrow as it's 2AM right now. Also, somehow I can't access .SD in c_across (see below); is there any reason for this behavior?

# isn't runnable
c_across <- function(cols=everything()){
  eval_tidy(expr(syms(tidytable:::select_vec_chr(.SD, {{cols}}))), env = caller_env())
}
# names(.SD) is NULL
c_across_mget <- function(cols=everything()){
  eval_tidy(expr( mget(select_vec(names(.SD), {{cols}}))), env = caller_env())
}
select_vec <- function(x, cols){
  tidytable:::select_vec_chr(setNames(x, nm=x), {{cols}})
}


markfairbanks avatar markfairbanks commented on July 1, 2024

maybe we could pull it off without string manipulation if there is a way to turn this list of symbols to a vector of symbols

In general symbols or unevaluated expressions must be contained in a list. They can't be stored in a vector, because vectors want to evaluate them right away. It's an interesting R rule.

One thing worth noting - if you're trying to unpack a list into a vector you can use vec_c(). All vctrs functions are rlang compatible.

pacman::p_load(vctrs, rlang)

test_list <- list(1,2,3)

# You can unpack a list into a vector with vec_c()
vec_c(!!!test_list)
#> [1] 1 2 3

# However unevaluated symbols must be in a list
test_exprs <- exprs(x, y, z)

vec_c(!!!test_exprs)
#> Error: `..1` must be a vector, not a symbol.

Also somehow I can't access .SD in c_across is there any reason for this behavior

I'm not sure what the cause of this is to be honest. The other data.table symbols seem to be more easily accessible. If we want a "straightforward" c_across.() implementation, this is the biggest thing we'll have to solve (I think).


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024

I think this is solved in Rdatatable/data.table#4163, but somehow it still isn't merged into the data.table repo. We could use the previous string manipulation method and then transition to this once data.table merges it, or we could just wait for them to implement it.


markfairbanks avatar markfairbanks commented on July 1, 2024

This is for names(.SD) to work on the left hand side of :=. It currently works on the right hand side in data.table, but not if you try to create a function that calls it:

pacman::p_load(data.table, rlang)

test_df <- data.table(a = 1:2, b = 1:2)

### dot_n
dot_n <- function() eval_tidy(expr(.N), env = caller_env())

copy(test_df)[, count := .N][]
#>    a b count
#> 1: 1 1     2
#> 2: 2 2     2
copy(test_df)[, count := dot_n()][]
#>    a b count
#> 1: 1 1     2
#> 2: 2 2     2

### dot_sd
dot_sd <- function() eval_tidy(expr(names(.SD)), env = caller_env())

copy(test_df)[, df_names := names(.SD)][]
#>    a b df_names
#> 1: 1 1        a
#> 2: 2 2        b
copy(test_df)[, df_names := dot_sd()][] # Fails for some reason
#>    a b df_names
#> 1: 1 1     <NA>
#> 2: 2 2     <NA>

We just need to figure out the trick to get it to work lol


markfairbanks avatar markfairbanks commented on July 1, 2024

copy(test_df_t)[, ':='(c("df_names"),{.SD;.(dot_sd())})][]

That's a great find. We should be able to figure everything out from there. Here's the shell of a mutate_rowwise.() that works using your trick:

pacman::p_load(tidytable, rlang)

test_df <- data.table(x = 1:3, y = 1:3, z = 1:3)

shallow <- tidytable:::shallow
eval_quo <- tidytable:::eval_quo

mutate_rowwise. <- function(.df, ...) {

  .df <- as_tidytable(.df)
  
  dots <- enquos(...)
  
  if (length(dots) == 0) return(.df)
  
  dots_names <- names(dots)
  
  if (any(dots_names %in% names(.df))) .df <- copy(.df)
  else .df <- shallow(.df)
  
  .df <- mutate.(.df, .rowwise_id = 1:.N)
  
  data_env <- env(quo_get_env(dots[[1]]), .df = .df, dots_names = dots_names)
  
  eval_quo(
    .df[, (dots_names) := {.SD; list(!!!dots)}, by = .rowwise_id],
    new_data_mask(data_env), env = caller_env()
  )
  
  .df[, .rowwise_id := NULL][]
}

dot_sd_nrow <- function() eval_tidy(expr(nrow(.SD)), env = caller_env())

test_df %>%
  mutate_rowwise.(dot_sd_nrow = dot_sd_nrow(),
                  row_mean = mean(c(x, y, z)))
#> # tidytable [3 × 5]
#>       x     y     z dot_sd_nrow row_mean
#>   <int> <int> <int>       <int>    <dbl>
#> 1     1     1     1           1        1
#> 2     2     2     2           1        2
#> 3     3     3     3           1        3

I didn't get c_across.() to work at first pass, but we're getting there.


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024

I can get it to retrieve the names as symbols, but once I add vctrs::vec_c it fails because it can't find .SD in the call stack. Correct me if I'm wrong, but if an object is in a parent environment then it's accessible by all the child environments, right?

I guess I know what causes the code to fail: we're basically using !!! inside expr, which pushes it to evaluate before quoting, right? I'm 97% certain that this is the root of the problem, because it forces syms(select_vec_chr.....) to be evaluated inside the current context, where .SD is NULL.
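
A minimal sketch of that behavior (plain rlang, nothing tidytable-specific): the argument of !!! is resolved as soon as the quoting function runs.

library(rlang)

cols <- list(sym("x"), sym("y"))
expr(vctrs::vec_c(!!!cols))
#> vctrs::vec_c(x, y)

# So expr(vec_c(!!!syms(select_vec_chr(.SD, ...)))) evaluates the
# select_vec_chr() call right away, in a context where .SD isn't available.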


markfairbanks avatar markfairbanks commented on July 1, 2024

No luck on the workaround (at least I couldn't get it to work). The newest version of rlang has a function called enquo0() that doesn't immediately evaluate !!!. I created a quo0(), but it doesn't seem to help with our situation. Here's how it works though:

pacman::p_load(rlang)

test_list <- list(1, 2, 3)

quo0 <- function(expr) {
  enquo0(expr)
}

test_quo0 <- quo0(!!!test_list)
test_quo0
#> <quosure>
#> expr: ^!!!test_list
#> env:  global


# The expression includes the `!!!`
test_expr <- quo_get_expr(test_quo0)
test_expr
#> !!!test_list


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024

Same here. It prints the names(.SD) vector, but I can't get it to work; maybe .SD gets lost down the call stack (which shouldn't be happening):

dot_sd_ac <- function(cols=everything()){
  quo0(vec_c(!!!print( tidyselect::eval_select({{cols}}, print(set_names(names(.SD)))))))
} 
test_df %>%
  mutate_rowwise.(dot_sd_nrow = mean(!!dot_sd_ac()),
                  row_mean = mean(c(x, y, z)))
#>   x   y   z 
#> "x" "y" "z" 
#> Error: Can't select within an unnamed vector.
#> Run `rlang::last_error()` to see where the error occurred.

In order to debug the issue I overloaded tidyselect::eval_select to print the x that gets passed into it. However, I think we're on the right track:

dot_sd_ac <- function(cols=everything()){
  quo0(vec_c(!!!names(eval_select({{cols}}, set_names(names(.SD))))))
} 
test_df %>%
  mutate_rowwise.(dot_sd_nrow = mean(!!dot_sd_ac()),
                  row_mean = mean(c(x, y, z)))
#>   x   y   z 
#> "x" "y" "z" 
#> Error: Unknown shortcut: x
#> Run `rlang::last_error()` to see where the error occurred.


markfairbanks avatar markfairbanks commented on July 1, 2024

You might be right that .SD is getting lost in the call stack. I created an alias called .sd, and it fails when using select_vec_sym():

pacman::p_load(tidytable, data.table)

test_df <- data.table(x = 1:3, y = 1:3, z = 1:3)

eval_quo <- tidytable:::eval_quo
select_vec_sym <- tidytable:::select_vec_sym

eval_quo(
  copy(test_df)[, new_col := {.sd = .SD; list(.sd$x)}][]
)
#>    x y z new_col
#> 1: 1 1 1       1
#> 2: 2 2 2       2
#> 3: 3 3 3       3

eval_quo(
  copy(test_df)[, new_col := {.sd = .SD; list(vec_c(!!!select_vec_sym(.sd, everything())))}][]
)
#> Error in eval_select_impl(data, names(data), as_quosure(expr, env), include = include, : object '.sd' not found


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024

@markfairbanks shouldn't it be accessible? Anyway, I'm digging deeper, and for some weird reason tidyselect::vars_select_eval is calling readr::cols, maybe to check the column types; I can't make sense of it. OK, just forget the part about readr::cols: it's due to using enquo0, and I'm passing it cols, which got interpreted as readr::cols. My bad.


markfairbanks avatar markfairbanks commented on July 1, 2024

shouldn't it be accessible?

It should be. This is where mixing rlang with data.table can do some weird things. They're both approaching non-standard evaluation in different ways. There's a chance that this is an unsolvable problem.


markfairbanks avatar markfairbanks commented on July 1, 2024

Another random side note - the utils-stringr.R file found here contains some stringr translations for internal use. So feel free to develop string solutions with stringr. Then when it gets added to tidytable we can use the internal versions. I'd probably recommend using them actually, I'm pretty bad with the base string functions 😂

If there are any translations it doesn't have that you need, this site has all of the base to stringr translations, so more can easily be added (if necessary).

Edit: You can also use the glue package; it's one of the dependencies of tidytable and a bit easier to use than sprintf (in my opinion).
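
For example, both produce the same string; glue interpolates anything inside {}:

library(glue)

col <- "V1"
sprintf("c_across(%s)", col)
#> [1] "c_across(V1)"
glue("c_across({col})")
#> c_across(V1)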


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024

I use sprintf because it allows for multiple references; however, I don't see any issue with changing the code to use glue, I just never had the chance to use it 😁. Also, I didn't use stringr simply to avoid another dependency.


markfairbanks avatar markfairbanks commented on July 1, 2024

So I'm beginning to think we're going to have to use string manipulation. The way I did it in summarize.() was to iterate over the dots using map.(dots, string_function) (in that case to replace n.() with .N).

There are three cases this needs to work for:

  1. User supplies empty c_across.() call
  2. User selects columns using c_across.(cols = c(x, y)) or c_across.(c(x, y))
  3. User uses c_across.() multiple times, sometimes mixing cases 1 & 2

I made a quick example that only covers use case 1, but here's how I see the workflow working:

pacman::p_load(tidytable, rlang, stringr)

shallow <- tidytable:::shallow
eval_quo <- tidytable:::eval_quo
select_vec_chr <- tidytable:::select_vec_chr

mutate_rowwise. <- function(.df, ...) {

  .df <- as_tidytable(.df)
  
  dots <- enquos(...)
  
  if (length(dots) == 0) return(.df)
  
  dots_names <- names(dots)
  
  if (any(dots_names %in% names(.df))) .df <- copy(.df)
  else .df <- shallow(.df)
  
  .df[, .rowwise_id := 1:.N]
  
  data_env <- env(quo_get_env(dots[[1]]), .df = .df)
  
  dots <- map.(dots, replace_c_across)
  
  eval_quo(
    .df[, ':='(!!!dots), by = .rowwise_id],
    new_data_mask(data_env), env = caller_env()
  )
  
  .df[, .rowwise_id := NULL][]
}

replace_c_across <- function(quosure) {
  quo_string <- quo_text(quosure)

  if (str_detect(quo_string, "c_across.[(]")) {
    # Note we don't need to return a quosure, just an expression
    parse_expr(
      str_replace_all(
        quo_string,
        "c_across.[(][)]",
        "unlist(mget(select_vec_chr(.SD, everything())))"
      )
    )
  } else {
    quosure
  }
}

test_df <- data.table(x = 1:3, y = 1:3, z = 1:3)

test_df %>%
  mutate_rowwise.(row_mean = mean(c_across.()))
#> # tidytable [3 × 4]
#>       x     y     z row_mean
#>   <int> <int> <int>    <dbl>
#> 1     1     1     1        1
#> 2     2     2     2        2
#> 3     3     3     3        3

And if that does sound good to you, would you be able to take a shot at the replace_c_across() helper? I'm probably 5% as good as you are at regex 😂. If this workflow seems like it will cause problems let me know.

We can still attempt to do the "normal workflow" way later, but this "string manipulation workflow" should be a good fallback.

Thoughts?


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024

Glad to finally have it work; however, it's too slow because select_vec_sym runs for each row, so complexity-wise it's O(nrows * select_vec_sym). I think eval_quo has some overhead, but select_vec_sym takes the cake. Notice that in the former implementation:

the trick is that I evaluate the select_vec_chr only one time at the top sparing us the overhead of them recalculating each time i.e nrow(.df) time

Just to be certain, I made a new mutate_rowwise_new. that evaluates the cols for each row, and it is about 3 seconds faster than the newer implementation:

bench::mark(
     `pre-evaluates the names` = mutate_rowwise.(test_df, s=sum(c_across.())),
     `evaluates the names` = mutate_rowwise_new.(test_df, s=sum(c_across.())),
     `with actual call` = mutate_rowwise_last.(test_df, s=sum(c_across.())),
      check=F, iterations=5)
#> # A tibble: 3 x 13
#>   expression                   min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 pre-evaluates the names  143.8ms 157.11ms    6.45      4.21MB     1.29     5     1   775.14ms
#> 2 evaluates the names           7s    7.36s    0.136    88.73MB     1.72     5    63     36.66s
#> 3 with actual call           10.8s   11.78s    0.0789  108.69MB     1.29     5    82      1.06m

The other 3 seconds are probably caused by eval_quo; I remember that we ran into some speed problems with it in another issue. The only way, I think, to get this closer to the string method is to use some caching mechanism that keeps the selected columns saved.


moutikabdessabour avatar moutikabdessabour commented on July 1, 2024

I managed to implement the caching mechanism, which is just an environment that maps each call to its selected columns:

cache <- env()
mutate_rowwisef. <- function(.df, ...) {

  .df <- as_tidytable(.df)
  
  dots <- enquos(...)
  
  if (length(dots) == 0) return(.df)
  
  dots_names <- names(dots)
  
  if (any(dots_names %in% names(.df))) .df <- copy(.df)
  else .df <- shallow(.df)
  
  cache <<- env()
  .df[, .rowwise_id := .I]
  
  data_env <- env(quo_get_env(dots[[1]]), .df = .df, dots_names = dots_names)
  
  eval_quo(
    .df[, (dots_names) := {.SD; list(!!!dots)}, by = .rowwise_id],
    new_data_mask(data_env), env = caller_env()
  )
  
  .df[, .rowwise_id := NULL][]
}

c_across_vec_c. <- function(cols = everything()) {

  call <- toString(match.call())
  .sd <- get('.SD', envir = caller_env())
  
  if(!exists(call, cache) || is.null(cache[[call]])){
    cache[[call]] <<- select_vec_sym(.sd, {{ cols }})
  }
  
  eval_quo(vctrs::vec_c(!!!cache[[call]]), .sd)
}

#> # A tibble: 4 x 13
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 pre-evaluates the names    141.47ms 147.06ms     6.52     4.21MB     1.30     5     1   766.89ms
#> 2 with actual call              6.59s    6.79s     0.145  354.38MB     1.33     5    46      34.6s
#> 3 with actual call vec_c        1.65s    1.94s     0.530   19.36MB     1.80     5    17      9.43s
#> 4 with actual call vec_c quo    1.93s    2.07s     0.482   21.26MB     1.54     5    16     10.38s

I tried some different variants but I settled on the one here:

  • with actual call vec_c is this implementation.
  • with actual call vec_c quo same but with call <- quo_text(quo(cols)) instead of match.call
  • with actual call has .sd[, unlist(.SD, recursive=F, use.names=F), .SDcols=cache[[call]]] as a return value
bench::mark(
    `string manipulation` = mutate_rowwise.(test_df, row_mean = mean(c_across.()), neg_sum=sum(c_across.(!V9:V13)), xy_sums=sum(c_across.(any_of(x))), y_sum=sum(c_across.(all_of(y)))),
    `with call` = mutate_rowwisef.(test_df, row_mean = mean(c_across_vec_c.()), neg_sum=sum(c_across_vec_c.(!V9:V13)), xy_sums=sum(c_across_vec_c.(any_of(x))), y_sum=sum(c_across_vec_c.(all_of(y)))),
    tidyverse= mutate(rowwise(test_df), row_mean = mean(c_across()), neg_sum=sum(c_across(!V9:V13)), xy_sums=sum(c_across(any_of(x))), y_sum=sum(c_across(all_of(y)))),
  iterations=100, check=F
)
#> # A tibble: 3 x 13
#>   expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 string manipulation 370.62ms 397.94ms     2.50     4.47MB     1.50   100    60     40.05s
#> 2 with call              4.71s    4.91s     0.203    70.8MB     1.68   100   830      8.21m
#> 3 tidyverse               1.8s    1.92s     0.516   14.12MB     1.12   100   218      3.23m

I think there is still room for improvement. Also, in edge cases where the user has the same c_across. call multiple times, it will only be evaluated once, giving it some 'edge' compared to the tidyverse implementation.
@markfairbanks thoughts?

