Comments (7)

marianschmidt commented on August 20, 2024

With the current dev version, I can reproduce the much-improved results. I added some more scenarios to the performance tests. Do you consider releasing this improvement soon?

#performance test with various scenarios

library(tidytable, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(2048)

results_new <- bench::press(
  rows = c(100000, 1000000, 1e7),
  ids = c(1000, 10000, 100000),
  {
    # using a character variable as ID
    df <- tibble(id = as.character(sample(1:ids, size = rows, replace = TRUE)),
                 bike = sample(c("mountain", "allround", "road", "bmx"), size = rows, replace = TRUE),
                 year = sample(1980:2020, size = rows, replace = TRUE))
    dt <- as.data.table(df)
    bench::mark(
      # first run with tidytable
      tidytable = dt %>%
        # sort by case id, time and item
        tidytable::arrange.(id, year, bike) %>%
        # calculate new item number variable, grouped by case id
        tidytable::mutate.(bike_number = as.integer(tidytable::row_number.()), by = id),
      # second run with dplyr
      dplyr = df %>%
        # sort by case id, time and item
        dplyr::arrange(id, year, bike) %>%
        # calculate new item number variable, grouped by case id
        dplyr::group_by(id) %>%
        dplyr::mutate(bike_number = as.integer(dplyr::row_number())) %>%
        dplyr::ungroup(),
      # third run with data.table
      data.table = data.table::copy(dt) %>%
        # sort by case id, time and item
        .[base::order(nchar(.[, id]), .[, id], .[, year], .[, bike], method = "radix")] %>%
        # calculate new item number variable, grouped by case id
        .[, bike_number := as.integer(seq_len(.N)), by = .[, id]] %>%
        .[],
      iterations = 3, filter_gc = FALSE, check = FALSE
    )
  }
)
#> Running with:
#>       rows    ids
#> 1   100000   1000
#> 2  1000000   1000
#> 3 10000000   1000
#> 4   100000  10000
#> 5  1000000  10000
#> 6 10000000  10000
#> 7   100000 100000
#> 8  1000000 100000
#> 9 10000000 100000

  ggplot2::autoplot(results_new)

Created on 2020-06-09 by the reprex package (v0.3.0)
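
A side note on the data.table branch above: it orders by nchar(id) before id, so the character IDs come out in numeric order, whereas a plain sort of a character column is lexicographic. Here is a minimal base R sketch of that difference, assuming non-negative integer IDs without leading zeros (the vector ids_demo is made up for illustration and is not part of the benchmark):

```r
# Hypothetical IDs, stored as character like the `id` column in the benchmark
ids_demo <- c("10", "9", "100", "2")

# Plain character sort; method = "radix" compares strings in the C locale
lexicographic <- sort(ids_demo, method = "radix")
print(lexicographic)
# -> "10", "100", "2", "9"

# Order by string length first, then by value, as the data.table branch does
by_length <- ids_demo[order(nchar(ids_demo), ids_demo, method = "radix")]
print(by_length)
# -> "2", "9", "10", "100"

# The length-first order matches a numeric sort of the same IDs
print(identical(by_length, as.character(sort(as.integer(ids_demo)))))
# -> TRUE
```

Only the order of the ids relative to each other differs between the two approaches; the numbering within each id should be unaffected, since id is the grouping key.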

Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.3 (2020-02-29)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2020-06-09                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source                                  
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.1)                          
#>  backports     1.1.7   2020-05-13 [1] CRAN (R 3.6.3)                          
#>  beeswarm      0.2.3   2016-04-25 [1] CRAN (R 3.6.0)                          
#>  bench         1.1.1   2020-01-13 [1] CRAN (R 3.6.2)                          
#>  blob          1.2.1   2020-01-20 [1] CRAN (R 3.6.3)                          
#>  broom         0.5.6   2020-04-20 [1] CRAN (R 3.6.3)                          
#>  callr         3.4.3   2020-03-28 [1] CRAN (R 3.6.3)                          
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.6.1)                          
#>  cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.3)                          
#>  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.1)                          
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.1)                          
#>  curl          4.3     2019-12-02 [1] CRAN (R 3.6.1)                          
#>  data.table  * 1.12.9  2020-03-04 [1] Github (Rdatatable/data.table@b1b1832)  
#>  DBI           1.1.0   2019-12-15 [1] CRAN (R 3.6.1)                          
#>  dbplyr        1.4.4   2020-05-27 [1] CRAN (R 3.6.3)                          
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.1)                          
#>  devtools      2.3.0   2020-04-10 [1] CRAN (R 3.6.3)                          
#>  digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.2)                          
#>  dplyr       * 1.0.0   2020-05-29 [1] CRAN (R 3.6.3)                          
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 3.6.3)                          
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.1)                          
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.2)                          
#>  farver        2.0.3   2020-01-16 [1] CRAN (R 3.6.2)                          
#>  forcats     * 0.5.0   2020-03-01 [1] CRAN (R 3.6.3)                          
#>  fs            1.4.1   2020-04-04 [1] CRAN (R 3.6.3)                          
#>  generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.1)                          
#>  ggbeeswarm    0.6.0   2017-08-07 [1] CRAN (R 3.6.3)                          
#>  ggplot2     * 3.3.1   2020-05-28 [1] CRAN (R 3.6.3)                          
#>  glue          1.4.1   2020-05-13 [1] CRAN (R 3.6.3)                          
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.1)                          
#>  haven         2.3.1   2020-06-01 [1] CRAN (R 3.6.3)                          
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.1)                          
#>  hms           0.5.3   2020-01-08 [1] CRAN (R 3.6.2)                          
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.1)                          
#>  httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.1)                          
#>  jsonlite      1.6.1   2020-02-02 [1] CRAN (R 3.6.2)                          
#>  knitr         1.28    2020-02-06 [1] CRAN (R 3.6.2)                          
#>  lattice       0.20-38 2018-11-04 [2] CRAN (R 3.6.3)                          
#>  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 3.6.3)                          
#>  lubridate     1.7.8   2020-04-06 [1] CRAN (R 3.6.3)                          
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.1)                          
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.1)                          
#>  mime          0.9     2020-02-04 [1] CRAN (R 3.6.2)                          
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 3.6.3)                          
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.1)                          
#>  nlme          3.1-144 2020-02-06 [2] CRAN (R 3.6.3)                          
#>  pillar        1.4.4   2020-05-05 [1] CRAN (R 3.6.3)                          
#>  pkgbuild      1.0.8   2020-05-07 [1] CRAN (R 3.6.3)                          
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.1)                          
#>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 3.6.3)                          
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.2)                          
#>  processx      3.4.2   2020-02-09 [1] CRAN (R 3.6.2)                          
#>  profmem       0.5.0   2018-01-30 [1] CRAN (R 3.6.2)                          
#>  ps            1.3.3   2020-05-08 [1] CRAN (R 3.6.3)                          
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 3.6.3)                          
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)                          
#>  Rcpp          1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)                          
#>  readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.6.1)                          
#>  readxl        1.3.1   2019-03-13 [1] CRAN (R 3.6.1)                          
#>  remotes       2.1.1   2020-02-15 [1] CRAN (R 3.6.2)                          
#>  reprex        0.3.0   2019-05-16 [1] CRAN (R 3.6.1)                          
#>  rlang         0.4.6   2020-05-02 [1] CRAN (R 3.6.3)                          
#>  rmarkdown     2.2     2020-05-31 [1] CRAN (R 3.6.3)                          
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.1)                          
#>  rvest         0.3.5   2019-11-08 [1] CRAN (R 3.6.1)                          
#>  scales        1.1.1   2020-05-11 [1] CRAN (R 3.6.3)                          
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.1)                          
#>  stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.2)                          
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.6.1)                          
#>  testthat      2.3.2   2020-03-02 [1] CRAN (R 3.6.3)                          
#>  tibble      * 3.0.1   2020-04-20 [1] CRAN (R 3.6.3)                          
#>  tidyr       * 1.1.0   2020-05-20 [1] CRAN (R 3.6.3)                          
#>  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 3.6.3)                          
#>  tidytable   * 0.5.1.9 2020-06-09 [1] Github (markfairbanks/tidytable@c133581)
#>  tidyverse   * 1.3.0   2019-11-21 [1] CRAN (R 3.6.1)                          
#>  usethis       1.6.1   2020-04-29 [1] CRAN (R 3.6.3)                          
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.6.1)                          
#>  vctrs         0.3.1   2020-06-05 [1] CRAN (R 3.6.3)                          
#>  vipor         0.4.5   2017-03-22 [1] CRAN (R 3.6.3)                          
#>  withr         2.2.0   2020-04-20 [1] CRAN (R 3.6.3)                          
#>  xfun          0.14    2020-05-20 [1] CRAN (R 3.6.3)                          
#>  xml2          1.3.2   2020-04-23 [1] CRAN (R 3.6.3)                          
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.2)                          
#> 
#> [1] C:/Users/usr/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.3/library

from tidytable.

markfairbanks commented on August 20, 2024

Awesome, good to hear that the performance issue is fixed.

Do you consider releasing this improvement soon?

Yep - my goal is to submit to CRAN this weekend. I'll keep you updated and let you know when it's accepted.

marianschmidt commented on August 20, 2024

@markfairbanks Thanks a lot for your efforts. Feel free to close this issue whenever convenient for you; I consider it closed.
By the way, I did a lot more testing and reimplemented my tidyverse functions in tidytable. Performance is really incredible, and since I had trouble translating my own functions to data.table correctly, your package is a great help.
During the process, it was quite hard to find adequate replacements for the .data pronoun from rlang, because simply replacing dplyr::mutate() with tidytable::mutate.() didn't do it. I now find myself using rlang's tidy evaluation tools quite regularly, and I noticed that tidytable strictly requires symbols to be passed in order to retrieve data frame columns. Now everything I used to do in the tidyverse also works in tidytable. Thanks.

markfairbanks commented on August 20, 2024

Can you install the dev version and let me know if you still have these issues? I actually found an issue that was slowing down pretty much every function in tidytable since v0.5.0. It somehow snuck past my normal speed tests. I just finished fixing it a few days ago.

devtools::install_github("markfairbanks/tidytable")

Here were the times I got when I ran your example.

One note - I made the dataset a data.table for the tidytable and data.table timings:

library(tidytable, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)

rows <- 1000000
ids  <- 50000

#simple data set with many different IDs and 1M rows, 3 cols
df <- tibble(id = as.character(sample(1:ids, size = rows, replace = TRUE)), #using character variable as ID
             bike = sample(c("mountain", "allround", "road", "bmx"), size = rows, replace = TRUE),
             year = sample(1980:2020, size = rows, replace = TRUE))

dt <- as.data.table(df)

results <- bench::mark(
  #first run with tidytable
  tidytable = dt %>%
    #sort by case id, time and item
    tidytable::arrange.(id, year, bike)%>%
    #calculate new item number variable #group by case id
    tidytable::mutate.(bike_number = as.integer(tidytable::row_number.()), by = id),
  #second run with dplyr
  dplyr = df %>%
    #sort by case id, time and item
    dplyr::arrange(id, year, bike)%>%
    #calculate new item number variable #group by case id
    dplyr::group_by(id) %>%
    dplyr::mutate(bike_number = as.integer(dplyr::row_number())) %>%
    dplyr::ungroup(),
  #third run with data.table
  data.table = data.table::copy(dt) %>%
    # data.table::as.data.table(.) %>%
    #sort by case id, time and item
    .[base::order(nchar(.[, id]), .[, id], .[, year], .[, bike], method = "radix")] %>%
    #calculate new item number variable #group by case id
    .[, bike_number := as.integer(seq_len(.N)), by=.[, id]] %>%
    .[],
  iterations = 3, filter_gc = FALSE, check = FALSE
)

ggplot2::autoplot(results)

markfairbanks commented on August 20, 2024

@marianschmidt FYI, the CRAN submission of v0.5.2 has been put on hold because CRAN changed their documentation requirements sometime in the past week or two. My initial submission a couple of days ago was rejected because of this change.

Once r-lib/roxygen2#1108 is fixed, I'll submit to CRAN again. Quite a few packages are having the same problem, but as far as I can tell it should be fixed in the next day or two.

markfairbanks commented on August 20, 2024

@marianschmidt Glad the package is working out well!

As for the .data pronoun, you can use data.table's equivalent, .SD.

And if you want to specify that you are using a variable from the global environment, you can just unquote it using !!.

library(tidytable, warn.conflicts = FALSE)
library(tidyverse, warn.conflicts = FALSE)

test_df <- data.table(x = c(1, 1, 1))

x <- 5

# Using tidytable
test_df %>%
  mutate.(data_x_plus_global_x = .SD$x + !!x)
#>    x data_x_plus_global_x
#> 1: 1                    6
#> 2: 1                    6
#> 3: 1                    6

# Using the tidyverse (version 1)
test_df %>%
  mutate(data_x_plus_global_x = .data$x + !!x)
#>    x data_x_plus_global_x
#> 1: 1                    6
#> 2: 1                    6
#> 3: 1                    6

# Using the tidyverse (version 2)
test_df %>%
  mutate(data_x_plus_global_x = .data$x + .env$x)
#>    x data_x_plus_global_x
#> 1: 1                    6
#> 2: 1                    6
#> 3: 1                    6

markfairbanks commented on August 20, 2024

@marianschmidt tidytable v0.5.2 is now up on CRAN!

FYI, there is a small API change: the by argument has been renamed to .by. Using by causes a warning, but it will still work for a couple of months or so.

library(tidytable, warn.conflicts = FALSE)

test_df <- data.table(x = 1:3, y = c("a", "a", "b"))

# Using `by` causes a warning
test_df %>%
  summarize.(avg_x = mean(x), by = y)
#> Warning: The `by` argument of `summarize.()` is deprecated as of tidytable 0.5.2.
#> Please use the `.by` argument instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_warnings()` to see where this warning was generated.
#>    y avg_x
#> 1: a   1.5
#> 2: b   3.0

# Using `.by` works normally
test_df %>%
  summarize.(avg_x = mean(x), .by = y)
#>    y avg_x
#> 1: a   1.5
#> 2: b   3.0
