summarize_across.() is slower than dtplyr about tidytable HOT 11 CLOSED

markfairbanks commented on August 20, 2024

summarize_across.() is slower than dtplyr

from tidytable.

Comments (11)

markfairbanks commented on August 20, 2024

I can't seem to reproduce your times, but it definitely takes longer than data.table & dtplyr.

Part of the issue was that summarize_across.() was converting diamonds to a data.table in the background, but that only costs a couple ms.

Another thing worth noting is that summarize() in dplyr 1.0.0 is now pretty darn fast. It might even be faster than data.table in some cases (as surprising as that sounds). You can see the new timings in the speed comparisons here. Pretty awesome job by them in this new release.

This dplyr issue highlights that across() has speed issues compared to the _if/_at/_all functions.

But back to the main issue - tidytable should (hypothetically) be comparable to dtplyr, so I'll have to do some more exploring on this one. I had a mutate_across.(.df, where(is.numeric), as.double) inside of summarize_across.() that I removed, but that only saved another few ms. It was an extra safety feature on data passed to data.table, but that was probably overkill.

Below are the times that I had. As mentioned above I should be able to knock a little bit off this time pretty easily, but we'll see where else I can find time. Nothing really jumps out at me - for a single summary function the translation is pretty light, but there's probably something I'm missing.

library(bench)
library(data.table)
library(dtplyr)
library(dplyr)
library(tidytable)
library(ggplot2)

diamonds2 <- lazy_dt(diamonds)
dt <- as.data.table(diamonds)

tidytable_func <- function() {
  dt %>%
    summarize_across.(c(depth, table, price, carat), 
                      mean,
                      by = c(cut, color, clarity))
}

dplyr_func <- function() {
  diamonds %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean)
}

dplyr_acr_func <- function() {
  diamonds %>% 
    group_by(cut, color, clarity) %>% 
    summarise(across(c(depth, table, price, carat), mean))
}

dtplyr_func <- function() {
  diamonds2 %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean) %>% 
    as_tibble()
}

cols <- c("depth", "table", "price", "carat")

dt_func <- function() {
  dt[ 
    , lapply(.SD, mean)
    , by = .(cut, color, clarity)
    , .SDcols = cols]
}

bench::mark(
  dplyr_func(),
  dplyr_acr_func(),
  dtplyr_func(),
  tidytable_func(),
  dt_func(),
  check = FALSE, iterations = 30, time_unit = 'ms'
)

#> # A tibble: 5 x 6
#>   expression          min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_func()      17.0   19.2      51.0     9.49MB    18.7 
#> 2 dplyr_acr_func() 187.   207.        4.59    2.92MB    17.3 
#> 3 dtplyr_func()      5.98   7.54    131.      3.52MB     8.73
#> 4 tidytable_func()  17.5   20.9      46.0     7.94MB    19.9 
#> 5 dt_func()          2.57   3.03    308.      1.54MB    20.5

markfairbanks commented on August 20, 2024

Well it turns out the issue is the use of the map.() function, which uses rlang::as_function() in the background. rlang::as_function() is what allows you to use ~ when using purrr.

For example:

map.(c(1,2,3), ~ .x + 1)
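As a minimal sketch of what rlang::as_function() does with such a formula (the name add_one is made up for illustration): it turns a one-sided formula into a regular function, and it hands an existing function back unchanged. So the wrapped mean is not a different function. What changes is the expression data.table gets to look at.

```r
library(rlang)

# A one-sided formula becomes a function whose first argument is `.x`.
# `add_one` is a hypothetical name used only for this illustration.
add_one <- as_function(~ .x + 1)
add_one(c(1, 2, 3))

# Something that is already a function is returned as-is.
identical(as_function(mean), mean)
```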

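data.table internally swaps mean() for its own optimized version of the function, data.table:::gmean(), which is one of their "GForce optimized functions". They have a few of them. As far as I can tell, that swap only happens when data.table can see mean itself in the call, so wrapping it in another function hides it from the optimization.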

So here's what this does to the times:

library(data.table)
library(tidytable)
library(ggplot2)
library(rlang)

dt <- as.data.table(diamonds)

cols <- c("depth", "table", "price", "carat")
by <- c("cut", "color", "clarity")

bench::mark(lapply = dt[, lapply(.SD, mean), .SDcols = cols, by = by],
            map = dt[, map.(.SD, mean), .SDcols = cols, by = by],
            as_fn = dt[, lapply(.SD, as_function(mean)), .SDcols = cols, by = by],
            lapply_base = dt[, lapply(.SD, base::mean), .SDcols = cols, by = by],
            check = FALSE, iterations = 30)
#> # A tibble: 4 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lapply         2.6ms   3.31ms     291.     3.11MB    10.0 
#> 2 map          10.52ms   12.1ms      82.0    4.87MB    20.5 
#> 3 as_fn         6.55ms   7.19ms     138.   319.36KB     9.88
#> 4 lapply_base  10.91ms  11.47ms      85.7  319.36KB    21.4

So I'm not positive what the workaround is at the moment. Using ~ is pretty key to tidyverse users, but it's also slowing things down. I'm guessing I'll have to work on this one for a bit.

markfairbanks commented on August 20, 2024

Here's the first improvement. It's still slightly slower than dtplyr, but I'll keep working on that.

pacman::p_load(bench, data.table, tidytable, dtplyr, dplyr, ggplot2)

diamonds_tbl <- as_tibble(ggplot2::diamonds)
diamonds_dt <- as_tidytable(diamonds_tbl)
diamonds_lazy <- lazy_dt(diamonds_dt)

tidytable_func <- function() {
  diamonds_dt %>%
    summarize_across.(c(depth, table, price, carat), 
                      mean,
                      by = c(cut, color, clarity))
}

dplyr_func <- function() {
  diamonds_tbl %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean)
}

dtplyr_func <- function() {
  diamonds_lazy %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean) %>% 
    as_tibble()
}

cols <- c("depth", "table", "price", "carat")

dt_func <- function() {
  diamonds_dt[ 
    , lapply(.SD, mean)
    , by = .(cut, color, clarity)
    , .SDcols = cols]
}

bench::mark(
  dplyr_func(),
  dtplyr_func(),
  tidytable_func(),
  dt_func(),
  check = FALSE, iterations = 30, time_unit = 'ms'
)
#> # A tibble: 4 x 6
#>   expression         min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_func()     16.9   17.7       55.1     9.3MB     20.0
#> 2 dtplyr_func()     7.49   7.78     123.     3.52MB     13.6
#> 3 tidytable_func()  9.97  10.7       91.9  418.75KB     23.0
#> 4 dt_func()         2.41   2.86     336.     1.54MB     11.6

markfairbanks commented on August 20, 2024

Well it turns out this is a data.table issue. If you alias a function, data.table doesn't run it using its GForce optimized version.

pacman::p_load(bench, data.table, tidytable, ggplot2)

diamonds_dt <- ggplot2::diamonds %>%
  mutate_across.(where(is.factor), as.character) %>%
  replicate(10, ., simplify = FALSE) %>%
  bind_rows.()

cols <- c("depth", "table", "price", "carat")

normal_fn <- function() {
  diamonds_dt[ 
    , lapply(.SD, mean)
    , by = .(cut, color, clarity)
    , .SDcols = cols]
}

# Create a simple alias for mean
fn <- mean

alias_fn <- function() {
  diamonds_dt[ 
    , lapply(.SD, fn)
    , by = .(cut, color, clarity)
    , .SDcols = cols]
}

bench::mark(
  normal_fn(),
  alias_fn(),
  check = FALSE, iterations = 30, time_unit = 'ms'
)
#> # A tibble: 2 x 6
#>   expression    min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 normal_fn()  27.0   30.9      31.1   14.85MB    18.0 
#> 2 alias_fn()   35.6   37.9      25.8    2.54MB     1.84

Unfortunately there's no way around that on my end. I'll open an issue on their page and see if they can fix that.
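One way to check which path data.table takes is to pass verbose = TRUE to the call, which makes it report how it rewrote j. The sketch below uses a small made-up table rather than diamonds, and the exact wording of the messages depends on the data.table version, so no output is shown. The thing to look for is whether a GForce message appears for the direct call but not for the aliased one.

```r
library(data.table)

# Small made-up table; column names are illustrative only.
dt <- data.table(
  g = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 4, 5, 6),
  y = c(10, 20, 30, 40, 50, 60)
)

# Hypothetical alias, mirroring `fn <- mean` in the benchmark above.
fn <- mean

# verbose = TRUE prints data.table's notes about how it handled `j`.
direct <- dt[, lapply(.SD, mean), by = g, .SDcols = c("x", "y"), verbose = TRUE]
aliased <- dt[, lapply(.SD, fn), by = g, .SDcols = c("x", "y"), verbose = TRUE]

# Both calls return the same values; only the execution path differs.
all.equal(direct, aliased)
```

Setting options(datatable.verbose = TRUE) should give the same reporting for every call in the session, which can be handier when the data.table call is buried inside another package's function.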

markfairbanks commented on August 20, 2024

@tungmilan Thanks for catching this. I'll take a look and let you know what I find.

tungttnguyen commented on August 20, 2024

Thanks for looking into this @markfairbanks! Just curious, have you tested something similar to summarize_at() or summarize_if() for tidytable? Here is another issue that is similar to the one that you posted earlier.

markfairbanks commented on August 20, 2024

Just curious, have you tested something similar to summarize_at() or summarize_if() for tidytable?

By test do you mean looked into adding the functions? If so, I'm currently leaning towards not adding them since dplyr is moving away from them and summarize_across.() can cover them all.

I had mutate_if/_at/_all in tidytable before I figured out how to build mutate_across.(), but they now show deprecation warnings when used. They all actually use mutate_across.() in the background. For example, here's how mutate_if.() & mutate_all.() work:

mutate_if. <- function(.data, .predicate, .funs, ..., by = NULL) {

  mutate_across.(.data, where(.predicate), .funs, ..., by = {{by}})

}

mutate_all. <- function(.data, .funs, ..., by = NULL) {

  mutate_across.(.data, everything(), .funs, ..., by = {{by}})

}

So it'd be relatively easy to add summarize_if/_at/_all, but since they're all being deprecated in dplyr 1.0.0 I'm not sure if they should be added to tidytable.

tungttnguyen commented on August 20, 2024

No, I meant whether you had tested it yourself locally to see if the speed is faster. If the implementation of summarize_if/_at/_all is as easy as you said, then I think it's worth it because of the big potential gain in performance.

I don't think summarize_if/_at/_all were deprecated. According to the dplyr 1.0.0 release notes, those functions were superseded, which means they won't be removed in the future but won't receive further development (except critical fixes).

markfairbanks commented on August 20, 2024

Ah gotcha, I see what you mean. So the reason summarize_if/_at/_all are faster than across() in dplyr is because the functions are built out differently internally.

In tidytable, however, those functions will call the same data.table code either way, so there wouldn't be any speed improvements between them.

tungttnguyen commented on August 20, 2024

Great! Thanks @markfairbanks!

tungttnguyen commented on August 20, 2024

Thank you very much again!
