
Comments (11)

markfairbanks commented on August 20, 2024

I can't seem to reproduce your times, but it definitely takes longer than data.table & dtplyr.

Part of the issue was that summarize_across.() was converting diamonds to a data.table in the background, but that only costs a couple ms.

Another thing worth noting is that summarize() in dplyr 1.0.0 is now pretty darn fast. It might even be faster than data.table in some cases (as surprising as that sounds). The speed comparisons here show the new timings. Pretty awesome job by them in this new release.

This dplyr issue highlights that across() has speed issues compared to the _if/_at/_all functions.

But back to the main issue - tidytable should (hypothetically) be comparable to dtplyr, so I'll have to do some more exploring on this one. I had a mutate_across.(.df, where(is.numeric), as.double) inside of summarize_across.() that I removed, but that only saved another few ms. It was an extra safety feature on data passed to data.table, but that was probably overkill.

Below are the times that I had. As mentioned above I should be able to knock a little bit off this time pretty easily, but we'll see where else I can find time. Nothing really jumps out at me - for a single summary function the translation is pretty light, but there's probably something I'm missing.

library(bench)
library(data.table)
library(dtplyr)
library(dplyr)
library(tidytable)
library(ggplot2)

# dtplyr works on a lazy data.table; tidytable and data.table use a regular one
diamonds2 <- lazy_dt(diamonds)
dt <- as.data.table(diamonds)

tidytable_func <- function() {
  dt %>%
    summarize_across.(c(depth, table, price, carat), 
                      mean,
                      by = c(cut, color, clarity))
}

dplyr_func <- function() {
  diamonds %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean)
  
}

dplyr_acr_func <- function() {
  diamonds %>% 
    group_by(cut, color, clarity) %>% 
    summarise(across(c(depth, table, price, carat), mean))
  
}

dtplyr_func <- function() {
  diamonds2 %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean) %>% 
    as_tibble()
}

cols <- c("depth", "table", "price", "carat")

dt_func <- function() {
  dt[ 
    , lapply(.SD, mean)
    , by = .(cut, color, clarity)
    , .SDcols = cols]
}

bench::mark(
  dplyr_func(),
  dplyr_acr_func(),
  dtplyr_func(),
  tidytable_func(),
  dt_func(),
  check = FALSE, iterations = 30, time_unit = 'ms'
)

#> # A tibble: 5 x 6
#>   expression          min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>        <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_func()      17.0   19.2      51.0     9.49MB    18.7 
#> 2 dplyr_acr_func() 187.   207.        4.59    2.92MB    17.3 
#> 3 dtplyr_func()      5.98   7.54    131.      3.52MB     8.73
#> 4 tidytable_func()  17.5   20.9      46.0     7.94MB    19.9 
#> 5 dt_func()          2.57   3.03    308.      1.54MB    20.5

from tidytable.

markfairbanks commented on August 20, 2024

Well it turns out the issue is the use of the map.() function, which uses rlang::as_function() in the background. rlang::as_function() is what allows you to use ~ when using purrr.

For example:

map.(c(1,2,3), ~ .x + 1)
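
As a small illustration of what rlang::as_function() does with its input (a sketch only, not taken from tidytable's source; the name add_one is made up for the example):

```r
library(rlang)

# A formula lambda is converted into a regular function of .x
add_one <- as_function(~ .x + 1)
add_one(c(1, 2, 3))
#> [1] 2 3 4

# A plain function is passed through unchanged
identical(as_function(mean), mean)
#> [1] TRUE
```

So when a plain function such as mean is supplied, as_function() hands back the very same function object rather than a wrapper around it.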

data.table internally replaces mean() with its own optimized version of the function. It's called data.table:::gmean(), and it is one of their "GForce optimized functions". They have a few of them.
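
A minimal sketch of how to switch these internal optimizations off and on through the datatable.optimize option, to confirm that only the speed changes and not the result. The toy table and the names res_optimized and res_plain are assumptions made for the example:

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b", "b"), x = c(1, 2, 3, 5))

# Full optimization: data.table may rewrite mean() to its grouped version
old <- options(datatable.optimize = Inf)
res_optimized <- dt[, .(x = mean(x)), by = g]

# Optimization level 0 switches the internal rewrites off
options(datatable.optimize = 0L)
res_plain <- dt[, .(x = mean(x)), by = g]

options(old)  # restore the previous setting

# Both runs should return the same group means
all.equal(res_optimized, res_plain)
```

Wrapping each of the two queries in bench::mark() on a larger table is one way to see how much of the timing comes from the optimization.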

So here's what this does to the times:

library(data.table)
library(tidytable)
library(ggplot2)
library(rlang)

dt <- as.data.table(diamonds)

cols <- c("depth", "table", "price", "carat")
by <- c("cut", "color", "clarity")

bench::mark(lapply = dt[, lapply(.SD, mean), .SDcols = cols, by = by],
            map = dt[, map.(.SD, mean), .SDcols = cols, by = by],
            as_fn = dt[, lapply(.SD, as_function(mean)), .SDcols = cols, by = by],
            lapply_base = dt[, lapply(.SD, base::mean), .SDcols = cols, by = by],
            check = FALSE, iterations = 30)
#> # A tibble: 4 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lapply         2.6ms   3.31ms     291.     3.11MB    10.0 
#> 2 map          10.52ms   12.1ms      82.0    4.87MB    20.5 
#> 3 as_fn         6.55ms   7.19ms     138.   319.36KB     9.88
#> 4 lapply_base  10.91ms  11.47ms      85.7  319.36KB    21.4

So I'm not positive what the workaround is at the moment. Using ~ is pretty key to tidyverse users, but it's also slowing things down. I'm guessing I'll have to work on this one for a bit.


markfairbanks commented on August 20, 2024

Here's the first improvement. It's still slightly slower than dtplyr, but I'll keep working on that.

pacman::p_load(bench, data.table, tidytable, dtplyr, dplyr, ggplot2)

diamonds_tbl <- as_tibble(ggplot2::diamonds)
diamonds_dt <- as_tidytable(diamonds_tbl)
diamonds_lazy <- lazy_dt(diamonds_dt)

tidytable_func <- function() {
  diamonds_dt %>%
    summarize_across.(c(depth, table, price, carat), 
                      mean,
                      by = c(cut, color, clarity))
}

dplyr_func <- function() {
  diamonds_tbl %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean)
  
}

dtplyr_func <- function() {
  diamonds_lazy %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean) %>% 
    as_tibble()
}

cols <- c("depth", "table", "price", "carat")

dt_func <- function() {
  diamonds_dt[ 
    , lapply(.SD, mean)
    , by = .(cut, color, clarity)
    , .SDcols = cols]
}

bench::mark(
  dplyr_func(),
  dtplyr_func(),
  tidytable_func(),
  dt_func(),
  check = FALSE, iterations = 30, time_unit = 'ms'
)
#> # A tibble: 4 x 6
#>   expression         min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr_func()     16.9   17.7       55.1     9.3MB     20.0
#> 2 dtplyr_func()     7.49   7.78     123.     3.52MB     13.6
#> 3 tidytable_func()  9.97  10.7       91.9  418.75KB     23.0
#> 4 dt_func()         2.41   2.86     336.     1.54MB     11.6


markfairbanks commented on August 20, 2024

Well it turns out this is a data.table issue. If you alias a function it doesn't run using data.table's GForce optimized version.

pacman::p_load(bench, data.table, tidytable, ggplot2)

diamonds_dt <- ggplot2::diamonds %>%
  mutate_across.(where(is.factor), as.character) %>%
  replicate(10, ., simplify = FALSE) %>%
  bind_rows.()

cols <- c("depth", "table", "price", "carat")

normal_fn <- function() {
  diamonds_dt[ 
    , lapply(.SD, mean)
    , by = .(cut, color, clarity)
    , .SDcols = cols]
}

# Create a simple alias for mean
fn <- mean

alias_fn <- function() {
  diamonds_dt[ 
    , lapply(.SD, fn)
    , by = .(cut, color, clarity)
    , .SDcols = cols]
}

bench::mark(
  normal_fn(),
  alias_fn(),
  check = FALSE, iterations = 30, time_unit = 'ms'
)
#> # A tibble: 2 x 6
#>   expression    min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 normal_fn()  27.0   30.9      31.1   14.85MB    18.0 
#> 2 alias_fn()   35.6   37.9      25.8    2.54MB     1.84

Unfortunately there's no way around that on my end. I'll open an issue on their page and see if they can fix that.
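
One way to check whether a given call was GForce optimized, rather than inferring it from timings, is to run it with verbose = TRUE and read the diagnostic text. This is a sketch on a toy table; the exact wording of the verbose messages is an assumption here and may differ between data.table versions:

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b", "b"), x = c(1, 2, 3, 5))
fn <- mean  # alias, as in the benchmark above

# Capture the diagnostic text printed while each query runs
log_direct <- capture.output(
  res_direct <- dt[, lapply(.SD, mean), by = g, .SDcols = "x", verbose = TRUE]
)
log_alias <- capture.output(
  res_alias <- dt[, lapply(.SD, fn), by = g, .SDcols = "x", verbose = TRUE]
)

# Assumed message text: look for the line reporting a GForce rewrite of j
any(grepl("GForce optimized", log_direct))
any(grepl("GForce optimized", log_alias))

# The results themselves should agree either way
all.equal(res_direct, res_alias)
```

Printing log_direct and log_alias in full shows what data.table decided to do with j in each case.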


markfairbanks commented on August 20, 2024

@tungmilan Thanks for catching this. I'll take a look and let you know what I find.


tungttnguyen commented on August 20, 2024

Thanks for looking into this @markfairbanks! Just curious, have you tested something similar to summarize_at() or summarize_if() for tidytable? Here is another issue that is similar to the one that you posted earlier.


markfairbanks commented on August 20, 2024

> Just curious, have you tested something similar to summarize_at() or summarize_if() for tidytable?

By test do you mean looked into adding the functions? If so I'm currently leaning towards not adding them since dplyr is moving away from them and summarize_across.() can cover them all.

I had mutate_if/_at/_all in tidytable before I figured out how to build mutate_across.(), but they now have deprecation warnings pop up when used. They all actually use mutate_across.() in the background. For example here's how mutate_if.() & mutate_all.() work:

mutate_if. <- function(.data, .predicate, .funs, ..., by = NULL) {

  mutate_across.(.data, where(.predicate), .funs, ..., by = {{by}})

}

mutate_all. <- function(.data, .funs, ..., by = NULL) {

  mutate_across.(.data, everything(), .funs, ..., by = {{by}})

}

So it'd be relatively easy to add summarize_if/_at/_all, but since they're all being deprecated in dplyr 1.0.0 I'm not sure if they should be added to tidytable.


tungttnguyen commented on August 20, 2024

No, I meant whether you had tested it yourself locally to see if it is faster. If the implementation of summarize_if/_at/_all is as easy as you said, then I think it's worth it because of the big potential gain in performance.

I don't think summarize_if/_at/_all were deprecated. According to the dplyr 1.0.0 release notes, those functions were superseded, which means they won't be removed in the future but won't receive further development (except critical fixes).


markfairbanks commented on August 20, 2024

Ah gotcha, I see what you mean. So the reason summarize_if/_at/_all are faster than across() in dplyr is because the functions are built out differently internally.

In tidytable, however, those functions will call the same data.table code either way, so there wouldn't be any speed improvements between them.
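
For reference, a minimal sketch of the dplyr side of this comparison: the superseded summarise_if() and the across() form produce the same result, so any difference between them is purely in how they get there. The toy data and the names by_if and by_across are assumptions made for the example:

```r
library(dplyr)

df <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 2, 3),
  y = c(10, 20, 30)
)

# Superseded scoped verb
by_if <- df %>%
  group_by(g) %>%
  summarise_if(is.numeric, mean)

# dplyr 1.0.0 style with across()
by_across <- df %>%
  group_by(g) %>%
  summarise(across(where(is.numeric), mean))

# Same group means from both spellings
all.equal(by_if, by_across)
```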


tungttnguyen commented on August 20, 2024

Great! Thanks @markfairbanks!


tungttnguyen commented on August 20, 2024

Thank you very much again!
