markfairbanks / tidytable

Tidy interface to 'data.table'

Home Page: https://markfairbanks.github.io/tidytable/

License: Other

R 99.74% C++ 0.26%

tidytable's Issues

Distinct.() not retaining unique rows

The results below show that distinct.() does not return the unique letters from col_1. Initially, I thought the problem was related to the number of rows, but that does not seem to be the case. I'm using tidytable version 0.5.0.9.

pacman::p_load(tidytable, 
               tidyverse)

n <- 500000

test_df <- data.table(col_1 = sample(LETTERS, n, rep = TRUE), 
                      col_2 = sample(1:5, n, rep = TRUE))

#####################
# Tidyverse
#####################

tidyverse_test <- test_df %>%
  select(col_1) %>%
  distinct() %>%
  count() %>%
  pull()


#####################
# Tidytable
#####################

tidytable_test <-
  test_df %>%
  select.(col_1) %>%
  distinct.() %>%
  count.() %>%
  pull.()



data.table(tidyverse_cnt = tidyverse_test, 
           tidytable_cnt = tidytable_test, 
           match_test = (tidyverse_test == tidytable_test))
#>    tidyverse_cnt tidytable_cnt match_test
#> 1:            26        500000      FALSE

Created on 2020-05-29 by the reprex package (v0.3.0)
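As a cross-check that does not depend on either package's distinct verb, the expected count can be computed with data.table directly. This is a minimal sketch, assuming the same shape of test_df as in the reprex; the set.seed() call and the names unique_letters and expected_cnt are additions for illustration only.

```r
library(data.table)

set.seed(42)  # added so the sketch is reproducible; not part of the original reprex
n <- 500000

test_df <- data.table(col_1 = sample(LETTERS, n, replace = TRUE),
                      col_2 = sample(1:5, n, replace = TRUE))

# unique() on a one-column data.table keeps one row per distinct value
unique_letters <- unique(test_df[, .(col_1)])
expected_cnt <- nrow(unique_letters)

# uniqueN() counts distinct values without building the intermediate table
alt_cnt <- uniqueN(test_df$col_1)
```

Both counts should agree with the tidyverse result, which gives a package-independent baseline to compare distinct.() against.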

Implement summarize_across.()

@markfairbanks: not sure if I'm missing something even after checking the man; dplyr::summarise_at/all/if() are not yet implemented, ya?

Something like:

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris %>% 
    summarise_if(is.numeric, mean, na.rm = TRUE)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.843333    3.057333        3.758    1.199333

iris %>% 
    summarise_at(vars(contains("Sepal")), sum, na.rm = TRUE)
#>   Sepal.Length Sepal.Width
#> 1        876.5       458.6

Originally posted by @leungi in #62 (comment)
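Until such a helper exists, the two dplyr calls above can be approximated in plain data.table with .SD and .SDcols. This is only a sketch of the requested behaviour, not the eventual summarize_across.() API; the variable names numeric_cols and sepal_cols are made up for the example.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# summarise_if(is.numeric, mean): pick the numeric columns by name first
numeric_cols <- names(iris_dt)[vapply(iris_dt, is.numeric, logical(1))]
means <- iris_dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = numeric_cols]

# summarise_at(vars(contains("Sepal")), sum): select the columns by pattern
sepal_cols <- grep("Sepal", names(iris_dt), value = TRUE, fixed = TRUE)
sepal_sums <- iris_dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = sepal_cols]
```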

Can't access the .y value from map2. when using mutate. inside map2.

Complicated title, but fairly simple explanation.
When I try to access the .y value I get the following error.

library(dplyr, warn.conflicts = FALSE)
library(tidytable, warn.conflicts = FALSE)

data <- tibble(
  id = LETTERS[seq(1, 3)],
  val_1 = seq(1, 3, 1),
  val_2 = seq(4, 6, 1)
)

data %>%
  nest_by.(id) %>%
  mutate.(example_1 = map2.(data, id, ~ .x %>%
                              mutate.(id = .y))) %>%
  unnest.(example_1)
#> Error in eval(jsub, SDenv, parent.frame()): object '.y' not found

If I instead use the regular mutate function from dplyr, everything works fine.

library(dplyr, warn.conflicts = FALSE)
library(tidytable, warn.conflicts = FALSE)

data <- tibble(
  id = LETTERS[seq(1, 3)],
  val_1 = seq(1, 3, 1),
  val_2 = seq(4, 6, 1)
)

data %>%
  nest_by.(id) %>%
  mutate.(example_1 = map2.(data, id, ~ .x %>%
                              mutate(id = .y))) %>%
  unnest.(example_1)
#>    id val_1 val_2 id1
#> 1:  A     1     4   A
#> 2:  B     2     5   B
#> 3:  C     3     6   C
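A possible workaround, sketched here in plain data.table rather than tidytable, is to pass the second value as an ordinary function argument so that nothing depends on `.y` being visible inside the inner call. The helper names pieces and with_id are hypothetical, and the sketch does not claim to explain why mutate.() cannot see `.y`.

```r
library(data.table)

dt <- data.table(id = LETTERS[1:3], val_1 = 1:3, val_2 = 4:6)

# One data.table per id; the list names carry the id values
pieces <- split(dt, by = "id", keep.by = FALSE)

# Map() hands each piece and its id to a regular two-argument function
with_id <- Map(function(piece, id_value) {
  piece <- copy(piece)      # avoid modifying the split result by reference
  piece[, id := id_value]   # id_value is an ordinary argument in scope
  piece
}, pieces, names(pieces))

result <- rbindlist(with_id)
```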

Add ~ functionality to dt_map functions

dt_unnest_legacy() API change

@leungi

This is an FYI of an API change to dt_unnest_legacy(). (I think you use this feature more than anyone else). The "keep" columns are no longer passed in dots, but in a vector of bare column names. See below:

nested_df <- data.table(a = 1:10,
                        b = 11:20,
                        c = c(rep("a", 6), rep("b", 4)),
                        d = c(rep("a", 4), rep("b", 6))) %>%
  dt_group_nest(c, d)

nested_df %>%
  dt_unnest_legacy(data, keep = c(c, d))

nested_df %>%
  dt_unnest_legacy(data, keep = is.character)

This is part of the v0.3.0 release, and is in preparation for the future API, in which multiple columns can be unnested simultaneously.

Add ability to unnest.() without dropping unused list columns

dt_bind_rows(use.names=TRUE): Error in rbindlist

reprex:

x = iris %>% as.data.table %>% dt_select(Sepal.Length, Sepal.Width) 
y = iris %>% as.data.table %>% dt_select(Sepal.Width, Sepal.Length) 
dt_bind_rows(x, y, use.names=TRUE)

Full error:

Error in rbindlist(dots, idcol = .id): Item 3 has 1 columns, inconsistent with item 1 which has 2 columns. To fill missing columns use fill=TRUE.
Traceback:

1. dt_bind_rows(x, y, use.names = TRUE, fill = TRUE)
2. dt_bind_rows.default(x, y, use.names = TRUE, fill = TRUE)
3. rbindlist(dots, idcol = .id)

Alternative that works:

x = iris %>% select(Sepal.Length, Sepal.Width) 
y = iris %>% select(Sepal.Width, Sepal.Length) 
rbind(x,y)
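The same binding can also be done with data.table's own rbindlist(), which matches columns by name when use.names = TRUE. This is a sketch for comparison only. The error above mentions an "Item 3" although only two tables were passed, which may mean the extra argument ended up in the dots, but that is an assumption and is not confirmed here.

```r
library(data.table)

x <- as.data.table(iris)[, .(Sepal.Length, Sepal.Width)]
y <- as.data.table(iris)[, .(Sepal.Width, Sepal.Length)]

# use.names = TRUE aligns y's columns to x's by name, not by position
bound <- rbindlist(list(x, y), use.names = TRUE)
```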

sessionInfo:

R version 3.6.2 (2019-12-12)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Georg_animal_feces/envs/tidyverse/lib/libopenblasp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.8 LeyLabRMisc_0.1.3 tidytable_0.3.2   ggplot2_3.2.1    
[5] tidyr_1.0.0       dplyr_0.8.3      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3       pillar_1.4.3     compiler_3.6.2   base64enc_0.1-3 
 [5] tools_3.6.2      bit_1.1-15.2     zeallot_0.1.0    digest_0.6.23   
 [9] uuid_0.1-2       jsonlite_1.6     evaluate_0.14    tibble_2.1.3    
[13] lifecycle_0.1.0  gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.2     
[17] IRdisplay_0.7.0  IRkernel_1.1     repr_1.0.2       withr_2.1.2     
[21] vctrs_0.2.1      bit64_0.9-7      grid_3.6.2       tidyselect_0.2.5
[25] glue_1.3.1       R6_2.4.1         pbdZMQ_0.3-3     purrr_0.3.3     
[29] magrittr_1.5     backports_1.1.5  scales_1.1.0     htmltools_0.4.0 
[33] assertthat_0.2.1 colorspace_1.4-1 lazyeval_0.2.2   munsell_0.5.0   
[37] crayon_1.3.4

case. default not handling NA as expected

The default argument of case.() is not applied when a condition evaluates to NA. I expected case.() to give the same results as case_when().

pacman::p_load(data.table, tidytable, dplyr)

gender_func_1 <- function(col) {
  col_value <-  case.(col == 1, "M",
                      col == 2, "F",
                      default = "U")
  return(col_value)
}



gender_func_2 <- function(col) {
  col_value <-  case_when(col == 1 ~ "M",
                          col == 2 ~ "F",
                          TRUE ~ "U")
  return(col_value)
}


data.frame(gender = c(0:3, NA)) %>% 
  mutate.(gender_grp_1 = gender_func_1(gender), 
          gender_grp_2 = gender_func_2(gender))

   gender gender_grp_1 gender_grp_2
    <int>        <chr>        <chr>
1:      0            U            U
2:      1            M            M
3:      2            F            F
4:      3            U            U
5:     NA         <NA>            U
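The behaviour expected here, where NA falls through to the default just like any other unmatched value, can be sketched in base R. The function name gender_label is hypothetical, and this is not how case.() is implemented.

```r
gender_label <- function(col) {
  # Start from the default so that unmatched values and NA both get "U"
  out <- rep("U", length(col))

  # which() drops NA comparisons, so NA rows are never overwritten
  out[which(col == 1)] <- "M"
  out[which(col == 2)] <- "F"
  out
}

labels <- gender_label(c(0:3, NA))
```

With this approach the NA row receives "U", matching the case_when() column in the output above.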

Feature request: retain dplyr::distinct() ability to get unique by columns in dt_distinct()

Reprex below; dt_distinct() currently doesn't allow column specifications.

test_df <- data.table(a = 1:10,
                      b = 11:20,
                      c = c(rep("a", 6), rep("b", 4)),
                      d = c(rep("a", 4), rep("b", 6)))

dplyr::distinct(test_df, c, d)

# expected output
   c d
1: a a
2: a b
3: b b
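For reference, data.table already offers both flavours of this operation through unique(). The sketch below assumes the same test_df as in the reprex; distinct_cd and first_rows are illustrative names.

```r
library(data.table)

test_df <- data.table(a = 1:10,
                      b = 11:20,
                      c = c(rep("a", 6), rep("b", 4)),
                      d = c(rep("a", 4), rep("b", 6)))

# Keep only the chosen columns, then drop duplicate rows
distinct_cd <- unique(test_df[, .(c, d)])

# Alternative: keep every column, but only the first row per (c, d) pair
first_rows <- unique(test_df, by = c("c", "d"))
```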

Implement dt_ifelse()

Convert fifelse() to automatically use the proper NA

`eval_tidy` failures: induced by differences in `dplyr::mutate` and `tidytable::dt_mutate`

I think dplyr and tidytable treat quoted code differently.

This is a blocking issue for integrating tidytable with disk.frame, see DiskFrame/disk.frame#271

fn = function(a,b) {
  a + b
}

# dplyr seems to be able to handle this correctly
mwe_mutate = function(...) {
  quo_dotdotdot = rlang::enquos(...)
  data = data.frame(num = 1:100)
  code = rlang::quo(mutate(data, !!!quo_dotdotdot))  
  rlang::eval_tidy(code)
}

mwe_mutate(b = fn(num, num))

# tidytable seems to fail 
mwe_dt_mutate = function(...) {
  quo_dotdotdot = rlang::enquos(...)
  data = data.frame(num = 1:100)
  code = rlang::quo(dt_mutate(data, !!!quo_dotdotdot))  
  rlang::eval_tidy(code)
}

mwe_dt_mutate(b = fn(num, num)) # this gives an error
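To illustrate what the dplyr version relies on, here is a small sketch of a mutate-like function that evaluates quosures with rlang::eval_tidy() and a data mask. Each quosure remembers the environment it was created in, so fn is still found even though the call is built elsewhere. The names mutate_sketch and wrapper are hypothetical, and this is not tidytable's implementation.

```r
library(rlang)

fn <- function(a, b) {
  a + b
}

# A toy mutate: evaluate each quosure with the data frame as a data mask
mutate_sketch <- function(.data, ...) {
  dots <- enquos(...)
  for (nm in names(dots)) {
    # Columns are looked up in .data first, everything else in the quosure's env
    .data[[nm]] <- eval_tidy(dots[[nm]], data = .data)
  }
  .data
}

wrapper <- function(...) {
  data <- data.frame(num = 1:100)
  mutate_sketch(data, ...)
}

out <- wrapper(b = fn(num, num))
```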

Date-type column causing select.() to fail

Reprex below.

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df1 <-
  structure(list(
    id = "cal", sequence_id = "0357C554_DA8B_4D2A_945B_8381F2CE0C48",
    date = structure(1548028800, class = c("POSIXct", "POSIXt"), tzone = "UTC")
  ), row.names = c(NA, -1L), class = "data.frame")

df1
#>            id                          sequence_id       date
#> 1 cal 0357C554_DA8B_4D2A_945B_8381F2CE0C48 2019-01-21

df1 %>% 
  tidytable::select.(id)
#> Error in vapply(.x, .f, character(1), ...): values must be length 1,
#>  but FUN(X[[3]]) result is length 2

df1 %>% 
  select(-date) %>% 
  tidytable::select.(id)
#>             id
#>          <chr>
#> 1: cal

df1 %>% 
  mutate(date = as.Date(date)) %>%  
  tidytable::select.(id)
#>             id
#>          <chr>
#> 1: cal

Created on 2020-05-09 by the reprex package (v0.3.0)

Suspected Cause

I believe the issue lies with get_data_vars(), given that it doesn't account for Date or DateTime classes (e.g., POSIXct, POSIXt).

Below is a suggested change.

get_data_vars <- function(.data) {
  data_names <- names(.data)
  data_index <- seq_along(data_names)
  data_vars <- setNames(as.list(data_index), data_names)

  # current
  # data_class <- tidytable::map_chr.(.data, class)

  # proposed
  # A date-time column has two classes (POSIXct and POSIXt), so keep only the first
  data_class <- unlist(tidytable::map.(sapply(.data, class), ~ .[[1]]))

  integer_cols <- list(is.integer = data_index[data_class == "integer"])
  double_cols <- list(is.double = data_index[data_class == "numeric"])
  numeric_cols <- list(is.numeric = data_index[data_class %in% c("integer", "numeric")])
  character_cols <- list(is.character = data_index[data_class == "character"])
  factor_cols <- list(is.factor = data_index[data_class == "factor"])
  dttm_cols <- list(is.dttm = data_index[data_class %in% c("POSIXct", "POSIXt", "Date")])
  logical_cols <- list(is.logical = data_index[data_class == "logical"])
  list_cols <- list(is.list = data_index[data_class == "list"])

  data_vars <- data_vars %>%
    append(integer_cols) %>% append(double_cols) %>%
    append(numeric_cols) %>% append(character_cols) %>% append(factor_cols) %>%
    append(dttm_cols) %>% append(logical_cols) %>% append(list_cols)

  data_vars
}
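The root of the vapply() error can be shown in base R alone: class() returns two strings for a POSIXct column, so any helper that expects exactly one string per column fails. The sketch below uses a cut-down version of df1; class_lengths and first_class are illustrative names.

```r
df1 <- data.frame(
  id = "cal",
  date = as.POSIXct("2019-01-21", tz = "UTC"),
  stringsAsFactors = FALSE
)

# How many class strings each column reports
class_lengths <- lengths(lapply(df1, class))

# Taking only the first class gives exactly one string per column
first_class <- vapply(df1, function(col) class(col)[[1]], character(1))
```

Here class_lengths is 1 for id and 2 for date, which is why a character(1) template is violated at the date column.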

Proposed enabling column rename in select.()

Reprex and proposal below.

library(dplyr)
library(tidytable)

my_data <- iris

# |- current ----
head(my_data) %>% 
  select.(blah = Species,
          blah2 = Sepal.Length)
#>    Species Sepal.Length
#>      <fct>        <dbl>
#> 1:  setosa          5.1
#> 2:  setosa          4.9
#> 3:  setosa          4.7
#> 4:  setosa          4.6
#> 5:  setosa          5.0
#> 6:  setosa          5.4

# |- proposed ----
dots_selector_i_2 <- function (.data, ...) 
{
  data_vars <- tidytable:::get_data_vars(.data)
  select_vars <- rlang::enexprs(...)
  # print(select_vars)
  select_index <- unlist(eval(rlang::expr(c(!!!select_vars)), data_vars))
  keep_index <- unique(select_index[select_index > 0])
  if (length(keep_index) == 0) 
    keep_index <- seq_along(.data)
  drop_index <- unique(abs(select_index[select_index < 0]))
  select_index <- keep_index[!keep_index %in% drop_index]
  
  return(list(name = names(select_vars),
              idx = select_index))
}

dots_selector_2 <- function (.data, ...) 
{
  match_names <- dots_selector_i_2(.data,...)
  select_index <- match_names$idx
  data_names <- names(.data)
  select_vars <- rlang::syms(data_names[select_index])
  
  return(list(select_vars = select_vars,
              var_name = match_names$name))
}

select..tidytable2 <- function(.data, ...) {
  
  match_cols <- dots_selector_2(.data, ...)
  select_cols <- as.character(match_cols$select_vars)

  # Using a character vector is faster for select
  tidytable:::eval_expr(
    .data[, !!select_cols]
  )
  
  if (any(match_cols$var_name != "")) {
    tidytable:::eval_expr(
      data.table::setnames(.data, select_cols, match_cols$var_name)
    )
  } 
  .data
}

head(my_data) %>% 
  select..tidytable2(blah = Species, 
                     blah2 = Sepal.Length)
#>   blah2 Sepal.Width Petal.Length Petal.Width   blah
#> 1   5.1         3.5          1.4         0.2 setosa
#> 2   4.9         3.0          1.4         0.2 setosa
#> 3   4.7         3.2          1.3         0.2 setosa
#> 4   4.6         3.1          1.5         0.2 setosa
#> 5   5.0         3.6          1.4         0.2 setosa
#> 6   5.4         3.9          1.7         0.4 setosa

Created on 2020-04-05 by the reprex package (v0.3.0)
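For comparison, the dplyr-style result (only the selected columns, under their new names) can be sketched in plain data.table in two ways. This is an illustration of the target behaviour, not a replacement for the proposal above; renamed and two_step are made-up names.

```r
library(data.table)

my_data <- as.data.table(head(iris))

# Select and rename in one step: .() builds the output columns by name
renamed <- my_data[, .(blah = Species, blah2 = Sepal.Length)]

# Same result in two steps: subset by character vector, then rename by reference
two_step <- my_data[, c("Species", "Sepal.Length"), with = FALSE]
setnames(two_step,
         old = c("Species", "Sepal.Length"),
         new = c("blah", "blah2"))
```

Because the subset in the second variant creates a new table, setnames() does not touch the column names of my_data.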

Docs enhancement: example of providing a quoted arg

The tidytable docs include some examples of providing a non-quoted function arg, such as:

library(rlang)

df <- data.table(x = c(1,1,1), y = c(1,1,1), z = c("a","a","b"))

add_one <- function(.data, add_col) {
  add_col <- enexpr(add_col)

  .data %>%
    mutate.(new_col = !!add_col + 1)
}

df %>%
  add_one(x)

... but the docs have no example for the case where the user provides a quoted arg, such as:

make_subsets = function(to_filter, dt){
    dt %>% dt_filter(Species == to_filter)
}
iris$Species %>% unique %>% as.list %>% lapply(make_subsets, dt=as.data.table(iris))
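In plain data.table terms, a string argument needs no unquoting because it is already a value: it can be compared against a column directly. The sketch below makes that explicit with a hypothetical make_subset() helper and an optional col argument that is not in the original. Whether dt_filter() behaves the same way is exactly what the docs would need to confirm.

```r
library(data.table)

make_subset <- function(value, dt, col = "Species") {
  # Build the row filter outside of `[` so no column lookup is ambiguous
  keep <- dt[[col]] == value
  dt[keep]
}

iris_dt <- as.data.table(iris)
species <- as.character(unique(iris_dt$Species))

subsets <- lapply(species, make_subset, dt = iris_dt)
names(subsets) <- species
```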

Add na option to ifelse.()

Joins with differently named columns

Hi Mark,

It looks like joins on columns that are named differently in the two data.tables are not supported. Although join_mold has all the required functionality, functions like dt_left_join do not seem to use it.

Here is a snippet comparing the functionality with dplyr:

library("tidytable")
#> 
#> Attaching package: 'tidytable'
#> The following object is masked from 'package:stats':
#> 
#>     dt
library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris_dt = as_dt(iris) %>% 
  dt_select(Species, Sepal.Length)

iris_2_dt = as_dt(iris) %>% 
  dt_select(Species, Petal.Length) %>% 
  dt_rename(Species_2 = Species)

# tidytable
dt_left_join(iris_dt, iris_2_dt, by = c("Species" = "Species_2"))
#> Error in `[.data.table`(y[x, on = on_vec, allow.cartesian = TRUE], , ..all_names): column(s) not found: Species

# dplyr
left_join(iris_dt, iris_2_dt, by = c("Species" = "Species_2")) %>% 
  tibble::as_tibble()
#> # A tibble: 7,500 x 3
#>    Species Sepal.Length Petal.Length
#>    <fct>          <dbl>        <dbl>
#>  1 setosa           5.1          1.4
#>  2 setosa           5.1          1.4
#>  3 setosa           5.1          1.3
#>  4 setosa           5.1          1.5
#>  5 setosa           5.1          1.4
#>  6 setosa           5.1          1.7
#>  7 setosa           5.1          1.4
#>  8 setosa           5.1          1.5
#>  9 setosa           5.1          1.4
#> 10 setosa           5.1          1.5
#> # … with 7,490 more rows

Created on 2020-04-20 by the reprex package (v0.3.0)
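As a stopgap, data.table's merge() method accepts by.x and by.y for differently named key columns. The sketch below reproduces the dplyr result above under that approach; it is a workaround, not a description of how dt_left_join() should be fixed.

```r
library(data.table)

iris_dt <- as.data.table(iris)[, .(Species, Sepal.Length)]
iris_2_dt <- as.data.table(iris)[, .(Species_2 = Species, Petal.Length)]

# all.x = TRUE makes it a left join; the many-to-many match needs
# allow.cartesian = TRUE because the result is larger than both inputs combined
joined <- merge(iris_dt, iris_2_dt,
                by.x = "Species", by.y = "Species_2",
                all.x = TRUE, allow.cartesian = TRUE)
```

The key column keeps the name from the left table, so the result has the columns Species, Sepal.Length and Petal.Length.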

Session info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.2 (2019-12-12)
#>  os       macOS Mojave 10.14.5        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Asia/Kolkata                
#>  date     2020-04-20                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
#>  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.0)
#>  callr         3.4.2   2020-02-12 [1] CRAN (R 3.6.0)
#>  cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
#>  data.table    1.12.8  2019-12-09 [1] CRAN (R 3.6.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
#>  devtools      2.2.2   2020-02-17 [1] CRAN (R 3.6.0)
#>  digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.0)
#>  dplyr       * 0.8.4   2020-01-31 [1] CRAN (R 3.6.0)
#>  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.0)
#>  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
#>  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.0)
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.0)
#>  jsonlite      1.6.1   2020-02-02 [1] CRAN (R 3.6.0)
#>  knitr         1.28    2020-02-06 [1] CRAN (R 3.6.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
#>  pillar        1.4.3   2019-12-20 [1] CRAN (R 3.6.0)
#>  pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.0)
#>  processx      3.4.2   2020-02-09 [1] CRAN (R 3.6.0)
#>  ps            1.3.2   2020-02-13 [1] CRAN (R 3.6.0)
#>  purrr         0.3.3   2019-10-18 [1] CRAN (R 3.6.0)
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.0)
#>  Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.0)
#>  remotes       2.1.1   2020-02-15 [1] CRAN (R 3.6.0)
#>  reticulate    1.14    2019-12-17 [1] CRAN (R 3.6.2)
#>  rlang         0.4.5   2020-03-01 [1] CRAN (R 3.6.2)
#>  rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
#>  stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
#>  testthat      2.3.2   2020-03-02 [1] CRAN (R 3.6.0)
#>  tibble        2.1.3   2019-06-06 [1] CRAN (R 3.6.0)
#>  tidyselect    1.0.0   2020-01-27 [1] CRAN (R 3.6.0)
#>  tidytable   * 0.3.1   2020-02-19 [1] CRAN (R 3.6.0)
#>  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.0)
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.6.0)
#>  vctrs         0.2.3   2020-02-20 [1] CRAN (R 3.6.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
#>  xfun          0.12    2020-01-13 [1] CRAN (R 3.6.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

dt_unnest() not keeping index column following dt_group_nest()

Reprex below.

I made reference to tidyfast since it has something similar.

# original - expected output
iris %>% 
  dplyr::group_nest(Species) %>% 
  dplyr::mutate(test = purrr::map(data, ~dplyr::filter(.x, Sepal.Width > 3.5))) %>% 
  tidyr::unnest(test)

# A tibble: 19 x 6
   Species   data              Sepal.Length Sepal.Width Petal.Length Petal.Width
   <fct>     <list>                   <dbl>       <dbl>        <dbl>       <dbl>
 1 setosa    <tibble [50 x 4]>          5           3.6          1.4         0.2
 2 setosa    <tibble [50 x 4]>          5.4         3.9          1.7         0.4
 3 setosa    <tibble [50 x 4]>          5.4         3.7          1.5         0.2
 4 setosa    <tibble [50 x 4]>          5.8         4            1.2         0.2
 5 setosa    <tibble [50 x 4]>          5.7         4.4          1.5         0.4
 6 setosa    <tibble [50 x 4]>          5.4         3.9          1.3         0.4
 7 setosa    <tibble [50 x 4]>          5.7         3.8          1.7         0.3
 8 setosa    <tibble [50 x 4]>          5.1         3.8          1.5         0.3
 9 setosa    <tibble [50 x 4]>          5.1         3.7          1.5         0.4
10 setosa    <tibble [50 x 4]>          4.6         3.6          1           0.2
11 setosa    <tibble [50 x 4]>          5.2         4.1          1.5         0.1
12 setosa    <tibble [50 x 4]>          5.5         4.2          1.4         0.2
13 setosa    <tibble [50 x 4]>          4.9         3.6          1.4         0.1
14 setosa    <tibble [50 x 4]>          5.1         3.8          1.9         0.4
15 setosa    <tibble [50 x 4]>          5.1         3.8          1.6         0.2
16 setosa    <tibble [50 x 4]>          5.3         3.7          1.5         0.2
17 virginica <tibble [50 x 4]>          7.2         3.6          6.1         2.5
18 virginica <tibble [50 x 4]>          7.7         3.8          6.7         2.2
19 virginica <tibble [50 x 4]>          7.9         3.8          6.4         2  

iris %>% 
  dt_group_nest(Species) %>% 
  dt_mutate(test = dt_map(data, function(x) dplyr::filter(x, Sepal.Width > 3.5))) %>% 
  dt_unnest(test)

    Sepal.Length Sepal.Width Petal.Length Petal.Width
 1:          5.0         3.6          1.4         0.2
 2:          5.4         3.9          1.7         0.4
 3:          5.4         3.7          1.5         0.2
 4:          5.8         4.0          1.2         0.2
 5:          5.7         4.4          1.5         0.4
 6:          5.4         3.9          1.3         0.4
 7:          5.7         3.8          1.7         0.3
 8:          5.1         3.8          1.5         0.3
 9:          5.1         3.7          1.5         0.4
10:          4.6         3.6          1.0         0.2
11:          5.2         4.1          1.5         0.1
12:          5.5         4.2          1.4         0.2
13:          4.9         3.6          1.4         0.1
14:          5.1         3.8          1.9         0.4
15:          5.1         3.8          1.6         0.2
16:          5.3         3.7          1.5         0.2
17:          7.2         3.6          6.1         2.5
18:          7.7         3.8          6.7         2.2
19:          7.9         3.8          6.4         2.0

# adding extra arg to `...` still doesn't work; same output as above
iris %>% 
  dt_group_nest(Species) %>% 
  dt_mutate(test = dt_map(data, function(x) dplyr::filter(x, Sepal.Width > 3.5))) %>% 
  dt_unnest(test, Species)

# tidyfast gives expected output
iris %>% 
  dt_group_nest(Species) %>% 
  dt_mutate(test = dt_map(data, function(x) dplyr::filter(x, Sepal.Width > 3.5))) %>% 
  tidyfast::dt_unnest(test)

      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
 1:    setosa          5.0         3.6          1.4         0.2
 2:    setosa          5.4         3.9          1.7         0.4
 3:    setosa          5.4         3.7          1.5         0.2
 4:    setosa          5.8         4.0          1.2         0.2
 5:    setosa          5.7         4.4          1.5         0.4
 6:    setosa          5.4         3.9          1.3         0.4
 7:    setosa          5.7         3.8          1.7         0.3
 8:    setosa          5.1         3.8          1.5         0.3
 9:    setosa          5.1         3.7          1.5         0.4
10:    setosa          4.6         3.6          1.0         0.2
11:    setosa          5.2         4.1          1.5         0.1
12:    setosa          5.5         4.2          1.4         0.2
13:    setosa          4.9         3.6          1.4         0.1
14:    setosa          5.1         3.8          1.9         0.4
15:    setosa          5.1         3.8          1.6         0.2
16:    setosa          5.3         3.7          1.5         0.2
17: virginica          7.2         3.6          6.1         2.5
18: virginica          7.7         3.8          6.7         2.2
19: virginica          7.9         3.8          6.4         2.0
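The expected shape, with the grouping column kept after unnesting, can be sketched in plain data.table by carrying the group value in the list names and restoring it with the idcol argument of rbindlist(). The names nested, filtered and unnested are illustrative, and this is not how dt_unnest() is implemented in either package.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# Nest: one data.table per species, named by the group value
nested <- split(iris_dt, by = "Species", keep.by = FALSE)

# Work on each nested table
filtered <- lapply(nested, function(piece) piece[Sepal.Width > 3.5])

# Unnest: idcol turns the list names back into a leading group column
unnested <- rbindlist(filtered, idcol = "Species")
```

One difference from the outputs above: the restored Species column is a character vector rather than a factor.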

Allow paged printing in Rmarkdown

Currently tidytable functions convert data.frame/data.table/tibble objects to a "tidytable" in the background. tidytables print more like tibbles in console, but otherwise have no features separate from a data.table.

In R Markdown, paged printing via knitr::opts_chunk$set(paged.print = TRUE) (the default) doesn't work for tidytables.

@TysonStanley I've taken a couple shots at this and couldn't figure out a solution. Do you have any ideas?

unnest.() fails if list col of data.tables has different column order

@leungi

library(tidytable)
library(data.table)

df1 <- data.table(a = "a", b = 1)
df2 <- data.table(b = 1, a = "a")

nested_df <- data.table(id = 1:2,
                        list_col = list(df1, df2))

nested_df %>%
  unnest.(list_col)
#> Error in `[.data.table`(.data, , unlist(list_col, recursive = FALSE), : Column 1 of result for group 2 is type 'double' but expecting type 'character'. Column types must be consistent for each group.
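One way around the positional matching, sketched in plain data.table, is to bind the list column with rbindlist(use.names = TRUE) and then re-attach the id by repeating it once per nested row. The name rows_per_id is hypothetical, and this is a workaround sketch rather than the fix that unnest.() needs.

```r
library(data.table)

df1 <- data.table(a = "a", b = 1)
df2 <- data.table(b = 1, a = "a")

nested_df <- data.table(id = 1:2,
                        list_col = list(df1, df2))

# Match columns by name rather than by position
unnested <- rbindlist(nested_df$list_col, use.names = TRUE)

# Repeat each id once per row of its nested table, then move it to the front
rows_per_id <- vapply(nested_df$list_col, nrow, integer(1))
unnested[, id := rep(nested_df$id, times = rows_per_id)]
setcolorder(unnested, "id")
```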

Does it support .data in packages?

If I use this tool in package development, how can I avoid variable names being reported as not defined?

Support passing `.` in dt_ functions

@markfairbanks Firstly, thanks for this project!

This reproducible example makes it clear:

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris_model = rpart::rpart(Species ~ ., data = iris)

iris %>% 
    tibble::as_tibble() %>% 
    mutate(., pred = predict(iris_model, ., type = "class"))
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species pred  
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <fct> 
#>  1          5.1         3.5          1.4         0.2 setosa  setosa
#>  2          4.9         3            1.4         0.2 setosa  setosa
#>  3          4.7         3.2          1.3         0.2 setosa  setosa
#>  4          4.6         3.1          1.5         0.2 setosa  setosa
#>  5          5           3.6          1.4         0.2 setosa  setosa
#>  6          5.4         3.9          1.7         0.4 setosa  setosa
#>  7          4.6         3.4          1.4         0.3 setosa  setosa
#>  8          5           3.4          1.5         0.2 setosa  setosa
#>  9          4.4         2.9          1.4         0.2 setosa  setosa
#> 10          4.9         3.1          1.5         0.1 setosa  setosa
#> # … with 140 more rows


library("tidytable")
#> 
#> Attaching package: 'tidytable'
#> The following object is masked from 'package:stats':
#> 
#>     dt
iris_2 = as_dt(iris)

iris_2_model = rpart::rpart(Species ~ ., data = iris_2)
iris_2 %>% 
    dt_mutate(., pred = predict(iris_2_model, ., type = "class"))
#> Error in predict.rpart(iris_2_model, ., type = "class"): object '.' not found

Created on 2020-03-20 by the reprex package (v0.3.0)
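Until `.` is supported, one workaround in plain data.table is to refer to the table by its own name inside `j` (data.table also exposes the current subset as .SD). The sketch below swaps rpart for stats::lm() only to stay free of extra dependencies, so the model and the pred column are illustrative rather than a copy of the reprex.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# lm() stands in for rpart() here; any model with a predict() method would do
model <- lm(Sepal.Length ~ Sepal.Width, data = iris_dt)

# Name the table explicitly instead of relying on the pipe placeholder `.`
iris_dt[, pred := predict(model, newdata = iris_dt)]
```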

Session info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.2 (2019-12-12)
#>  os       macOS Mojave 10.14.5        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Asia/Kolkata                
#>  date     2020-03-20                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
#>  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.0)
#>  callr         3.4.2   2020-02-12 [1] CRAN (R 3.6.0)
#>  cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
#>  data.table    1.12.8  2019-12-09 [1] CRAN (R 3.6.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
#>  devtools      2.2.2   2020-02-17 [1] CRAN (R 3.6.0)
#>  digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.0)
#>  dplyr       * 0.8.4   2020-01-31 [1] CRAN (R 3.6.0)
#>  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.0)
#>  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
#>  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.0)
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.0)
#>  knitr         1.28    2020-02-06 [1] CRAN (R 3.6.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
#>  pillar        1.4.3   2019-12-20 [1] CRAN (R 3.6.0)
#>  pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.0)
#>  processx      3.4.2   2020-02-09 [1] CRAN (R 3.6.0)
#>  ps            1.3.2   2020-02-13 [1] CRAN (R 3.6.0)
#>  purrr         0.3.3   2019-10-18 [1] CRAN (R 3.6.0)
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.0)
#>  Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.0)
#>  remotes       2.1.1   2020-02-15 [1] CRAN (R 3.6.0)
#>  rlang         0.4.5   2020-03-01 [1] CRAN (R 3.6.2)
#>  rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.0)
#>  rpart         4.1-15  2019-04-12 [1] CRAN (R 3.6.2)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
#>  stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
#>  testthat      2.3.2   2020-03-02 [1] CRAN (R 3.6.0)
#>  tibble        2.1.3   2019-06-06 [1] CRAN (R 3.6.0)
#>  tidyselect    1.0.0   2020-01-27 [1] CRAN (R 3.6.0)
#>  tidytable   * 0.3.1   2020-02-19 [1] CRAN (R 3.6.0)
#>  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.0)
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.6.0)
#>  vctrs         0.2.3   2020-02-20 [1] CRAN (R 3.6.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
#>  xfun          0.12    2020-01-13 [1] CRAN (R 3.6.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Add .names arg to summarize_across.()

Deprecate `dt_verb()` functions

Add enhanced selection support for group by calls

Rename functions?

@leungi @RossKen

I wanted to get some opinions on something. I think I want to rename all of the functions. The dt_ prefix starts to get clunky after a while. Example:

test_df %>%
  dt_select(dt_starts_with("x"), dt_ends_with("y")) %>%
  dt_arrange(x)

That's a bit tough to read.

So this is what I'm thinking about doing:

test_df %>%
  select.(starts_with.("x"), ends_with.("y")) %>%
  arrange.(x)

Edit: As I do more research, functions that start with "." are supposed to be for internal use only. So it would have to be select.().

# Edit: ignore, this isn't allowed
test_df %>%
  .select(.starts_with("x"), .ends_with("y")) %>%
  .arrange(x)

I would keep the old ones in the package for a while, but all documentation would use the "dot" version.

If there is a time to do this, it is now, while the user base is still fairly small.

Good idea? Bad idea? Is there another way to rename them that I’m not thinking of?
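
The plan to keep the old names around for a while could be sketched with base R's `.Deprecated()`. This is only an illustration: the `arrange.` body here is a hypothetical stand-in written for this example, not tidytable's actual implementation.

```r
# Hypothetical stand-in for the new "dot" verb (not tidytable's code):
# sort a data frame by one column given as a string.
arrange. <- function(.df, col) {
  .df[order(.df[[col]]), , drop = FALSE]
}

# The old name stays as a thin wrapper that warns and forwards its arguments.
dt_arrange <- function(...) {
  .Deprecated("arrange.")
  arrange.(...)
}

test_df <- data.frame(x = c(3, 1, 2), y = c("a", "b", "c"))

# Same result as arrange.(), plus a deprecation warning (suppressed here).
res <- suppressWarnings(dt_arrange(test_df, "x"))
```

Existing user code keeps running under this scheme, and the warning points users at the new name.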

Implement group_split()

Edit to dt_map_dfc() to output data.table type

Reprex and suggested edit below.

packageVersion("tidydt")
#> [1] '0.2.0'

library(tidydt)
#> 
#> Attaching package: 'tidydt'
#> The following object is masked from 'package:stats':
#> 
#>     dt

not_nested <- list(col1 = c("Apple", "Orange"),
                   col2 = c("Baseball", "Football"))

purrr::map_dfc(not_nested, ~ .) %>% class()
#> [1] "tbl_df"     "tbl"        "data.frame"

# expect a data.table but got matrix
dt_map_dfc(not_nested, function(x) x) %>% class()
#> [1] "matrix"

dt_map_dfc_edit <- function (.x, .f, ...)
{
  result_list <- dt_map(.x, .f, ...)
  as_dt(as.data.frame(do.call(cbind, result_list)))
}

dt_map_dfc_edit(not_nested, function(x) x) %>% class()
#> [1] "data.table" "data.frame"

nested <- list(col1 = list(c("Apple", "Banana"),
                           c("Orange")),
               col2 = list(c("Baseball", "Soccer"),
                           c("Football")))

dt_map_dfc_edit(nested, function(x) x)
#>            col1            col2
#> 1: Apple,Banana Baseball,Soccer
#> 2:       Orange        Football

^{Created on 2020-01-19 by the reprex package (v0.3.0)}
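
One caveat with the `cbind()` route is that atomic vectors are bound into a matrix first, so columns of mixed types are coerced to a single type. A possible alternative, assuming each element of `.x` maps to exactly one column, is to build the table directly from the mapped list. `map_dfc_sketch` is a hypothetical name used only for this example.

```r
library(data.table)

# Hypothetical helper: map over the list, then turn the named list of
# columns into a data.table without going through a matrix.
map_dfc_sketch <- function(.x, .f, ...) {
  as.data.table(lapply(.x, .f, ...))
}

mixed <- list(fruit = c("Apple", "Orange"), count = c(1, 2))
res <- map_dfc_sketch(mixed, function(x) x)

class(res)      # data.table first, then data.frame
sapply(res, class)  # column types are kept per column
```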

enexpr: object not found

I'm trying to create a couple of convenience functions for processing data.table objects with tidytable (see below), and the first works, but the second doesn't, and I don't understand why. Any clarification would be greatly appreciated!

This function works with data.table objects (e.g., unique_n(example_dt, sel_col=z))

#' Pretty print number of unique elements in a vector
#'
#' The result will be cat'ed to the screen.
#' tidytable compatible. Maje
#'
#' @param x a vector or data.table. If data.table, sel_col must not be NULL
#' @param label what to call the items in the vector (eg., "samples")
#' @param sel_col If x=data.table, which column to assess?
#' @returns NULL
unique_n = function(x, label='items', sel_col=NULL){
  if(any((class(x)) == 'data.table')){
    if(is.null(sel_col)){
      stop('sel_col cannot be NULL for data.table objects')
    }
    sel_col = ggplot2::enexpr(sel_col)
    x = tidytable::dt_distinct(x, !!sel_col)
    x = tidytable::dt_pull(x, !!sel_col)
  }
  cat(sprintf('No. of unique %s:', label),
      length(unique(x)), '\n')
}

This function doesn't work with data.table objects (e.g., overlap(example_dt, example_dt, sel_col_x=z, sel_col_y=z)). I get an "object not found" error.

#' Determine counts of setdiff, intersect, & union of 2 vectors (or data.tables)
#'
#' The output is printed text of intersect, each-way setdiff, and union.
#' Data.table compatible! Just make sure to provide sel_col_x and/or sel_col_y
#'
#' @param x vector1 or data.table. If data.table, sel_col_x must not be NULL
#' @param y vector2 or data.table. If data.table, sel_col_y must not be NULL
#' @param sel_col_x If x = data.table, which column to assess?
#' @param sel_col_y If y = data.table, which column to assess?
#' @return NULL
#'
overlap = function(x, y, sel_col_x=NULL, sel_col_y=NULL){
  if(any((class(x)) == 'data.table')){
    if(is.null(sel_col_x)){
      stop('sel_col_x cannot be NULL for data.table objects')
    }
    sel_col_x = ggplot2::enexpr(sel_col_x)
    x = tidytable::dt_distinct(x, !!sel_col_x)
    x = tidytable::dt_pull(x, !!sel_col_x)
  }
  if(any((class(y)) == 'data.table')){
    if(is.null(sel_col_y)){
      stop('sel_col_y cannot be NULL for data.table objects')
    }
    sel_col_y = ggplot2::enexpr(sel_col_y)
    y = tidytable::dt_distinct(y, !!sel_col_y)
    y = tidytable::dt_pull(y, !!sel_col_y)
  }
  cat('intersect(x,y):', length(intersect(x,y)), '\n')
  cat('setdiff(x,y):', length(setdiff(x,y)), '\n')
  cat('setdiff(y,x):', length(setdiff(y,x)), '\n')
  cat('union(x,y):', length(union(x,y)), '\n')
}
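
One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that `is.null(sel_col)` forces the argument before it is captured, so R looks for an object called `z` in the calling environment. A minimal sketch that captures the argument first and only then tests for `NULL` could look like this. `pull_unique` is a hypothetical helper, and it uses rlang directly instead of the tidytable verbs.

```r
library(rlang)

# Hypothetical helper: capture the column expression before anything
# evaluates it, then check whether the caller left it as NULL.
pull_unique <- function(x, sel_col = NULL) {
  sel_col <- enquo(sel_col)
  if (quo_is_null(sel_col)) {
    stop("sel_col cannot be NULL for data.frame inputs")
  }
  # Evaluate the captured expression with the columns of x in scope.
  unique(eval_tidy(sel_col, data = x))
}

example_df <- data.frame(z = c("a", "a", "b"))
pull_unique(example_df, sel_col = z)
```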

Implement new slice variants

See https://github.com/tidyverse/dplyr/blob/master/R/slice.R

slice_head()
slice_tail()
slice_min()
slice_max()
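
For orientation, the head and max variants can be approximated in plain data.table as below. This is a rough sketch of the behaviour under the assumption of one grouping column, not the eventual tidytable implementation.

```r
library(data.table)

dt <- data.table(g = c("a", "a", "a", "b", "b"),
                 x = c(3, 1, 2, 5, 4))

# slice_head-like: first 2 rows of each group, in original row order
head_by_g <- dt[, head(.SD, 2), by = g]

# slice_max-like: sort descending on x, then keep the top row per group
max_by_g <- dt[order(-x), head(.SD, 1), by = g]
```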

Implement mutate_rowwise.()

This one seems pretty straightforward, but should probably be implemented with c_across.() as well, which is trickier.

Edit: This fails if the data.table contains list columns. See new implementation below.

mutate_rowwise. <- function(.df, ...) {
  mutate.(.df, ..., .by = everything())
}
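
Assuming the list-column failure comes from grouping on every column, one alternative sketch in plain data.table is to group on a row index instead, which never touches the list column. This is an illustration only and not the package's final implementation.

```r
library(data.table)

dt <- data.table(x = 1:3, y = 4:6, z = list(1, "a", TRUE))

# Group by row position so each row is its own group;
# the list column z is carried along untouched.
dt[, total := sum(x, y), by = seq_len(nrow(dt))]
```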

Make all methods S3 methods

Add .names arg to mutate_across.()

mutate_across cannot handle spaces in column names

reprex:

example_dt <- data.table::data.table(
  "x col" = c(1,1,1),
  "y col" = c(2,2,2),
  "z col" = c("a", "a", "b"))

example_dt %>%
  mutate_across.(is.numeric, as.character)
# Error in `[.data.table`(.data, , `:=`((.cols), map.(.SD, .fns, ...)), : Some items of .SDcols are not column names: [`x col`, `y col`]
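
Until this is fixed, a workaround sketch in plain data.table is to pass the column names as a character vector, which is not affected by the backticks shown in the error message. The `num_cols` variable is introduced here only for illustration.

```r
library(data.table)

example_dt <- data.table(
  "x col" = c(1, 1, 1),
  "y col" = c(2, 2, 2),
  "z col" = c("a", "a", "b"))

# Find the numeric columns by name, then convert them by reference.
num_cols <- names(example_dt)[vapply(example_dt, is.numeric, logical(1))]
example_dt[, (num_cols) := lapply(.SD, as.character), .SDcols = num_cols]
```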

sessionInfo:

R version 3.6.2 (2019-12-12)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Georg_animal_feces/envs/phyloseq-phy/lib/libopenblasp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] LeyLabRMisc_0.1.5 tidytable_0.4.1   data.table_1.12.8 ape_5.3          
[5] phyloseq_1.30.0   ggplot2_3.2.1     tidyr_1.0.0       dplyr_0.8.3      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3          lattice_0.20-38     Biostrings_2.54.0  
 [4] assertthat_0.2.1    digest_0.6.23       foreach_1.4.7      
 [7] IRdisplay_0.7.0     R6_2.4.1            plyr_1.8.5         
[10] repr_1.0.2          stats4_3.6.2        evaluate_0.14      
[13] pillar_1.4.3        zlibbioc_1.32.0     rlang_0.4.6        
[16] lazyeval_0.2.2      uuid_0.1-2          vegan_2.5-6        
[19] S4Vectors_0.24.0    Matrix_1.2-18       splines_3.6.2      
[22] stringr_1.4.0       igraph_1.2.4.2      munsell_0.5.0      
[25] xfun_0.12           compiler_3.6.2      pkgconfig_2.0.3    
[28] BiocGenerics_0.32.0 base64enc_0.1-3     multtest_2.42.0    
[31] mgcv_1.8-31         htmltools_0.4.0     biomformat_1.14.0  
[34] tidyselect_1.1.0    tibble_2.1.3        IRanges_2.20.0     
[37] codetools_0.2-16    fansi_0.4.1         permute_0.9-5      
[40] crayon_1.3.4        withr_2.1.2         MASS_7.3-51.5      
[43] grid_3.6.2          nlme_3.1-143        jsonlite_1.6       
[46] gtable_0.3.0        lifecycle_0.1.0     magrittr_1.5       
[49] scales_1.1.0        cli_2.0.1           stringi_1.4.5      
[52] XVector_0.26.0      reshape2_1.4.3      ellipsis_0.3.0     
[55] vctrs_0.3.0         IRkernel_1.1        Rhdf5lib_1.8.0     
[58] iterators_1.0.12    tools_3.6.2         ade4_1.7-13        
[61] Biobase_2.46.0      glue_1.3.1          purrr_0.3.3        
[64] parallel_3.6.2      survival_3.1-8      colorspace_1.4-1   
[67] rhdf5_2.30.0        cluster_2.1.0       pbdZMQ_0.3-3       
[70] knitr_1.26

Implement tidyselect?

tidytable currently implements its own version of tidyselect features. It covers the core features a user needs, but is still missing some handy functionality.

Pros:

Allows the user to use the same tidyselect functions/features as they use in dplyr.

Cons:

Would add tidyselect dependencies on ellipsis/glue/purrr/vctrs
Adds 1-2ms to each function that uses it
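
For reference, this is roughly what delegating selection to tidyselect would look like. It is a minimal sketch using `tidyselect::eval_select()`, which resolves a selection expression to named column positions. How tidytable would wire this into its verbs is not shown and is left open here.

```r
df <- data.frame(x1 = 1, x2 = 2, y = 3)

# Resolve a tidyselect expression against the data;
# the result is a named integer vector of column positions.
pos <- tidyselect::eval_select(
  rlang::expr(tidyselect::starts_with("x")),
  df
)

selected <- df[, pos, drop = FALSE]
```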

Edit to dt_bind_rows() to mimic dplyr::bind_rows()

Reprex and suggested edit below.

packageVersion("tidydt")
#> [1] '0.2.0'

library(tidydt)
#> 
#> Attaching package: 'tidydt'
#> The following object is masked from 'package:stats':
#> 
#>     dt

iris_ls <- iris %>%
  dplyr::group_split(Species)

dplyr::bind_rows(iris_ls)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

dplyr::bind_rows(dt_map(iris_ls, as_dt))
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>   1:          5.1         3.5          1.4         0.2    setosa
#>   2:          4.9         3.0          1.4         0.2    setosa
#>   3:          4.7         3.2          1.3         0.2    setosa
#>   4:          4.6         3.1          1.5         0.2    setosa
#>   5:          5.0         3.6          1.4         0.2    setosa
#>  ---                                                            
#> 146:          6.7         3.0          5.2         2.3 virginica
#> 147:          6.3         2.5          5.0         1.9 virginica
#> 148:          6.5         3.0          5.2         2.0 virginica
#> 149:          6.2         3.4          5.4         2.3 virginica
#> 150:          5.9         3.0          5.1         1.8 virginica

# unsuccessful test
dt_bind_rows(iris_ls)
#> Error in dt_bind_rows(iris_ls): All inputs must be a data.frame or data.table

iris_ls_dt <- dt_map(iris_ls, as_dt)

dt_bind_rows(iris_ls_dt)
#> Error in dt_bind_rows(iris_ls_dt): All inputs must be a data.frame or data.table

dt_bind_rows(iris_ls, iris)
#> Error in dt_bind_rows(iris_ls, iris): All inputs must be a data.frame or data.table

# revised fxn
dt_bind_rows_edit <- function(.data, ...) {
  # wrap a single data frame in a list; a list of data frames is left as-is
  # (is.data.frame() avoids comparing a multi-element class vector to "list")
  if (is.data.frame(.data)) {
    .data <- list(.data)
  }

  dots <- enexprs(...)
  dots <- dt_map(dots, eval)
  # remove list()
  dots <- append(.data, dots)
  if (!all(dt_map_lgl(dots, is.data.frame)))
    stop("All inputs must be a data.frame or data.table")
  if (!all(dt_map_lgl(dots, is.data.table)))
    dots <- dt_map(dots, as.data.table)
  rbindlist(dots)
}

# successful test
dt_bind_rows_edit(iris_ls)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>   1:          5.1         3.5          1.4         0.2    setosa
#>   2:          4.9         3.0          1.4         0.2    setosa
#>   3:          4.7         3.2          1.3         0.2    setosa
#>   4:          4.6         3.1          1.5         0.2    setosa
#>   5:          5.0         3.6          1.4         0.2    setosa
#>  ---                                                            
#> 146:          6.7         3.0          5.2         2.3 virginica
#> 147:          6.3         2.5          5.0         1.9 virginica
#> 148:          6.5         3.0          5.2         2.0 virginica
#> 149:          6.2         3.4          5.4         2.3 virginica
#> 150:          5.9         3.0          5.1         1.8 virginica

dt_bind_rows_edit(iris_ls_dt)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>   1:          5.1         3.5          1.4         0.2    setosa
#>   2:          4.9         3.0          1.4         0.2    setosa
#>   3:          4.7         3.2          1.3         0.2    setosa
#>   4:          4.6         3.1          1.5         0.2    setosa
#>   5:          5.0         3.6          1.4         0.2    setosa
#>  ---                                                            
#> 146:          6.7         3.0          5.2         2.3 virginica
#> 147:          6.3         2.5          5.0         1.9 virginica
#> 148:          6.5         3.0          5.2         2.0 virginica
#> 149:          6.2         3.4          5.4         2.3 virginica
#> 150:          5.9         3.0          5.1         1.8 virginica

dt_bind_rows_edit(iris_ls, iris)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>   1:          5.1         3.5          1.4         0.2    setosa
#>   2:          4.9         3.0          1.4         0.2    setosa
#>   3:          4.7         3.2          1.3         0.2    setosa
#>   4:          4.6         3.1          1.5         0.2    setosa
#>   5:          5.0         3.6          1.4         0.2    setosa
#>  ---                                                            
#> 296:          6.7         3.0          5.2         2.3 virginica
#> 297:          6.3         2.5          5.0         1.9 virginica
#> 298:          6.5         3.0          5.2         2.0 virginica
#> 299:          6.2         3.4          5.4         2.3 virginica
#> 300:          5.9         3.0          5.1         1.8 virginica

dt_bind_rows_edit(iris, iris)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>   1:          5.1         3.5          1.4         0.2    setosa
#>   2:          4.9         3.0          1.4         0.2    setosa
#>   3:          4.7         3.2          1.3         0.2    setosa
#>   4:          4.6         3.1          1.5         0.2    setosa
#>   5:          5.0         3.6          1.4         0.2    setosa
#>  ---                                                            
#> 296:          6.7         3.0          5.2         2.3 virginica
#> 297:          6.3         2.5          5.0         1.9 virginica
#> 298:          6.5         3.0          5.2         2.0 virginica
#> 299:          6.2         3.4          5.4         2.3 virginica
#> 300:          5.9         3.0          5.1         1.8 virginica

^{Created on 2020-01-19 by the reprex package (v0.3.0)}

.N doesn't work in dt_slice() or in slice variants

library(data.table)
library(tidytable)

test_df <- data.table::data.table(
  x = c(1,2,3,4),
  y = c(4,5,6,7),
  z = c("a","a","a","b"))

test_df %>%
  dt_slice(1:.N)
#> Error in 1:.N: argument of length 0

^{Created on 2020-02-28 by the reprex package (v0.3.0)}
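
`.N` is only defined inside `[.data.table`, so it has to reach that call unevaluated. One way this might be handled, sketched here with base R's `substitute()` and `bquote()` and a hypothetical function name, is to capture the row expression and splice it into the data.table call.

```r
library(data.table)

# Hypothetical helper: capture `rows` without evaluating it, then build
# and evaluate the call .df[<rows>] so data.table can resolve .N itself.
slice_sketch <- function(.df, rows) {
  rows <- substitute(rows)
  eval(bquote(.df[.(rows)]))
}

test_df <- data.table(
  x = c(1, 2, 3, 4),
  y = c(4, 5, 6, 7),
  z = c("a", "a", "a", "b"))

all_rows <- slice_sketch(test_df, 1:.N)
last_row <- slice_sketch(test_df, .N)
```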

summarize_across.() is slower than dtplyr

Thanks a lot @markfairbanks ! summarize_across.() is much slower than dtplyr and data.table solutions in the following example. What would be a faster way using tidytable?

R version 4.0.0 (2020-04-24)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
CPU: Core i7-8750H @ 2.20GHz, 2208 Mhz, 6 Core(s), 12 Logical Processor(s)
RAM: 16 GB

package        * version     date       lib source                                  
data.table     * 1.12.9      2020-04-20 [1] local                                   
dplyr          * 0.8.99.9003 2020-05-29 [1] Github (tidyverse/dplyr@2855355)        
dtplyr         * 1.0.1.9000  2020-05-29 [1] Github (tidyverse/dtplyr@dfb22f6)
tidytable      * 0.5.0.9     2020-05-29 [1] Github (markfairbanks/tidytable@bc4b1d0)

#> Unit: milliseconds
#>              expr      min       lq     mean   median       uq      max neval
#>      dplyr_func()  16.3866  17.2444  18.8779  17.8115  19.5853  34.8279   100
#>  dplyr_acr_func() 190.3610 199.9047 207.6058 204.3143 211.3635 278.1041   100
#>     dtplyr_func()   8.1858   8.7738   9.7016   9.3106  10.1511  19.2393   100
#>  tidytable_func()  71.8009  74.0848  78.2104  76.1958  80.2190 127.3493   100
#>         dt_func()   2.7862   2.9893   3.6715   3.1622   4.4132   6.1665   100

library(microbenchmark)
library(data.table)
library(dtplyr)
library(dplyr)
library(tidytable)
library(ggplot2)

diamonds2 <- lazy_dt(diamonds)
dt <- setDT(copy(diamonds))

tidytable_func <- function() {
  diamonds %>% 
    summarize_across.(c(depth, table, price, carat), 
                      mean,
                      by = c(cut, color, clarity))
}

dplyr_func <- function() {
  diamonds %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean)
  
}

dplyr_acr_func <- function() {
  diamonds %>% 
    group_by(cut, color, clarity) %>% 
    summarise(across(c(depth, table, price, carat), mean))
  
}

dtplyr_func <- function() {
  diamonds2 %>% 
    group_by(cut, color, clarity) %>% 
    summarise_at(vars(depth, table, price, carat), mean) %>% 
    as_tibble()
}

cols <- c("depth", "table", "price", "carat")

dt_func <- function() {
  dt[ 
    , lapply(.SD, mean)
    , by = .(cut, color, clarity)
    , .SDcols = cols]
}

microbenchmark(
  dplyr_func(),
  dplyr_acr_func(),
  dtplyr_func(),
  tidytable_func(),
  dt_func(),
  unit = 'ms',
  times = 100L
)

Originally posted by @tungmilan in #64 (comment)

Implement dt_relocate()

See https://github.com/tidyverse/dplyr/blob/master/R/relocate.R

Speed and Efficiency Comparisons

@markfairbanks it may be beneficial to show how these functions compare with the Tidyverse functions in terms of speed and efficiency (can use the updated bench package to do so). This is probably the main draw to this package--the idea that someone has a more efficient and a quicker experience than if they were using the Tidyverse.

Implement "names" argument in dt_mutate_across

separate_rows

Hi Mark,

Consider adding separate_rows. as an equivalent for tidyr::separate_rows. I would be happy to submit a PR.

pacman::p_load("tidytable", "data.table")

separate_rows. = function(dt, column, sep = ",", trim = FALSE){
  
  tidytable::is_tidytable(dt)
  stopifnot(is.character(sep) && length(sep) == 1L)
  stopifnot(is.logical(trim) && length(trim) == 1L)
  
  dt = tidytable:::shallow(dt)
  dt[ , id_ := .I]
  
  column             = rlang::enexpr(column)
  column_name        = rlang::expr_deparse(column)
  other_column_names = setdiff(colnames(dt), column_name)
  
  
  res = tidytable:::eval_expr(
    dt[ 
      , strsplit(as.character(!!column), split = sep)
      , by = other_column_names
      ]
  )
  data.table::setnames(res, "V1", column_name)
  res[ , id_ := NULL]
  
  if (trim) {
    tidytable:::eval_expr(res[ , !!column := trimws(!!column)])
  }
  
  return(res[])          
}

x = sample(c("a,b", "c,d, e", "f ,"), 1e4, replace = TRUE)
y = sample(c(1,2,3), 1e4, replace = TRUE)
temp = data.table(x, y)
dplyr::glimpse(temp)
#> Rows: 10,000
#> Columns: 2
#> $ x <chr> "f ,", "a,b", "f ,", "c,d, e", "c,d, e", "a,b", "a,b", "f ,", "f ,"…
#> $ y <dbl> 2, 3, 1, 2, 3, 3, 1, 2, 3, 2, 1, 1, 1, 2, 3, 1, 3, 2, 1, 2, 1, 1, 3…

system.time({ res_tt = separate_rows.(temp, x, trim = TRUE) })
#>    user  system elapsed 
#>   0.055   0.003   0.057
system.time({ res_tidyr = tidyr::separate_rows(temp, x) })
#>    user  system elapsed 
#>   0.882   0.682   1.571

Add "by" functionality to dt_mutate_across()

Check validity of verb.() function names

@moodymudskipper

I sent you this on Reddit as well, since I wasn't sure how often you check your account.

I have an advanced R question and hoped you might be able to provide some guidance.

I created this package (tidytable) to replicate tidyverse syntax with a data.table/rlang backend.

Originally the functions followed the syntax of dt_verb(). I was about to submit a big update to CRAN that would replace this syntax with verb.() (while adding function deprecation warnings for the dt_ syntax).

My question is this - is it a bad idea to use verb.()? Is this against best practices since the "." is used for S3 methods? The package passes tests, passes devtools::check(), and all S3 methods work (ex: mutate..data.frame()). But is this a bad idea?

Thanks for any help - I'll be honest, I'm not sure who to reach out to with this issue.
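
To make the question concrete, here is a minimal base R sketch of how S3 dispatch treats a generic whose name ends in a dot. The function bodies are placeholders written for this example and are not tidytable's code.

```r
# A generic named "mutate." dispatches to methods named "mutate.." plus the class.
mutate. <- function(.df, ...) UseMethod("mutate.")

mutate..data.frame <- function(.df, ...) "data.frame method"
mutate..default    <- function(.df, ...) "default method"

from_df    <- mutate.(data.frame(x = 1))
from_other <- mutate.(1)
```

Dispatch works here because `UseMethod()` simply pastes the generic name, a dot, and the class together. The double dot looks unusual but is a valid function name.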

Add ~ functionality to dt_mutate_() variants & dt_rename_() variants

unnest.() produces long outcome when wide outcome is expected

Reprex below.

library(tidytable)
#> 
#> Attaching package: 'tidytable'
#> The following object is masked from 'package:stats':
#> 
#>     dt
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# |- data ----
dat <- structure(
  list(
    id = c("11", "22"),
    phase = c("a", "b"),
    values = list(
      structure(
        list(
          a = 0.0584563566053344,
          b = 192,
          c = "50%",
          d = 1,
          e = 0,
          f = 0,
          g = 0
        ),
        row.names = c(NA, -1L),
        class = c("tbl_df",
                  "tbl", "data.frame")
      ),
      structure(
        list(
          c = "50%",
          d = 465L,
          e = 0,
          g = 290514.430137519,
          b = 10961.9288476965,
          a = 0.359973896295374,
          h = 1.46588348984196,
          f = 119.108387941727
        ),
        row.names = c(NA,
                      -1L),
        class = c("tbl_df", "tbl", "data.frame")
      )
    )
  ),
  row.names = c(NA,
                -2L),
  class = c("tbl_df", "tbl", "data.frame")
)

# |- tidyr ----
# wide results
dat %>% 
  tidyr::unnest(values)
#> # A tibble: 2 x 10
#>   id    phase      a      b c         d     e     f       g     h
#>   <chr> <chr>  <dbl>  <dbl> <chr> <dbl> <dbl> <dbl>   <dbl> <dbl>
#> 1 11    a     0.0585   192  50%       1     0    0       0  NA   
#> 2 22    b     0.360  10962. 50%     465     0  119. 290514.  1.47

# |- tidytable ----
# long output
dat %>% 
  unnest.(values)
#>        id phase     values
#>     <chr> <chr>     <list>
#>  1:    11     a 0.05845636
#>  2:    11     a        192
#>  3:    11     a        50%
#>  4:    11     a          1
#>  5:    11     a          0
#>  6:    11     a          0
#>  7:    11     a          0
#>  8:    22     b        50%
#>  9:    22     b        465
#> 10:    22     b          0
#> 11:    22     b   290514.4
#> 12:    22     b   10961.93
#> 13:    22     b  0.3599739
#> 14:    22     b   1.465883
#> 15:    22     b   119.1084

^{Created on 2020-04-05 by the reprex package (v0.3.0)}

unite?

How should one implement tidyr::unite() with tidytable? I don't see any equivalent function in the docs.
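
In the meantime, a plain data.table sketch of the same idea is to paste the chosen columns together. The column names and separator below are assumptions made up for the example.

```r
library(data.table)

dt <- data.table(a = c("x", "y"), b = c(1, 2))
cols <- c("a", "b")

# Paste the selected columns row by row into a new column.
dt[, ab := do.call(paste, c(.SD, sep = "_")), .SDcols = cols]
```

Unlike tidyr::unite(), this keeps the input columns. They could be dropped afterwards with `dt[, (cols) := NULL]`.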

Replace `by` with `.by`

Using by as an argument prevents the user from having columns named "by" when using mutate.() or summarize.().

See ?lifecycle::deprecated for redirection example
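
The redirection could be sketched as follows. This version uses a plain base R `warning()` rather than the lifecycle helpers mentioned above, and `mean_by` is a hypothetical function that takes column names as strings to keep the example short.

```r
# Hypothetical function: `.by` is the new argument, `by` the deprecated one.
mean_by <- function(.df, col, .by = NULL, by = NULL) {
  if (!is.null(by)) {
    warning("`by` is deprecated, use `.by` instead", call. = FALSE)
    .by <- by  # redirect the old argument to the new one
  }
  if (is.null(.by)) {
    return(mean(.df[[col]]))
  }
  tapply(.df[[col]], .df[[.by]], mean)
}

df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 6))

new_style <- mean_by(df, "x", .by = "g")
old_style <- suppressWarnings(mean_by(df, "x", by = "g"))
```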

Add dtplyr to speed comparisons

It would be helpful to add these for the main dplyr verbs that already currently have tests - arrange(), mutate(), summarize(), & filter().

In case someone wants to help with this one, all tests are found in the source code of the README, and here's how filter() would look:

filter_marks <- bench::mark(
  tidyverse = filter(test_tbl, a <= 7, c == "a"),
  # Only new line, all other parts are already written
  dtplyr = as.data.table(filter(lazy_dt(test_dt), a <= 7, c == "a")),
  tidytable = filter.(test_dt, a <= 7, c == "a"),
  data.table = test_dt[a <= 7 & c == "a"],
  check = FALSE, iterations = iters, memory = FALSE, filter_gc = FALSE, time_unit = 'ms') %>%
  mutate(expression = as.character(expression),
         function_tested = "filter")

For anyone new to pull requests that wants to take a shot at this - feel free to comment on this issue and I can walk you through the process.

slow performance of mutate.() with by grouping for many groups

As somebody who likes the tidyverse syntax and requires the data.table performance while struggling with its modify-by-reference, I was very happy finding tidytable. Thanks for this great package!
I am working with large datasets (1-10M rows, 50-500 cols) that often require mutating of grouped data.
In this scenario however, I found tidytable::mutate.() to be much slower than the data.table equivalent, and still considerably slower than the dplyr alternative.

library(magrittr)
library(data.table)

rows <- 1000000
ids  <- 50000

#simple data set with many different IDs and 1M rows, 3 cols
df <- data.frame(id = as.character(sample(1:ids, size = rows, replace = TRUE)), #using character variable as ID
                 bike = sample(c("mountain", "allround", "road", "bmx"), size = rows, replace = TRUE),
                 year = sample(1980:2020, size = rows, replace = TRUE),
                 stringsAsFactors = FALSE)

results <- bench::mark(
  #first run with tidytable
  tidytable = df %>%
    #sort by case id, time and item
    tidytable::arrange.(id, year, bike)%>%
    #calculate new item number variable #group by case id
    tidytable::mutate.(bike_number = as.integer(tidytable::row_number.()), by = id),
  #second run with dplyr
  dplyr = df %>%
    #sort by case id, time and item
    dplyr::arrange(id, year, bike)%>%
    #calculate new item number variable #group by case id
    dplyr::group_by(id) %>%
    dplyr::mutate(bike_number = as.integer(dplyr::row_number())) %>%
    dplyr::ungroup(),
  #third run with data.table
  data.table = data.table::copy(df) %>%
    data.table::as.data.table(.) %>%
    #sort by case id, time and item
    .[base::order(nchar(.[, id]), .[, id], .[, year], .[, bike], method = "radix")] %>%
    #calculate new item number variable #group by case id
    .[, bike_number := as.integer(seq_len(.N)), by=.[, id]] %>%
    .[],
  iterations = 3, filter_gc = FALSE, check = FALSE
)

ggplot2::autoplot(results)
#> Loading required namespace: tidyr

^{Created on 2020-06-08 by the reprex package (v0.3.0)}
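
For this particular task (a running row number within each id), data.table's `rowid()` may be worth trying, since it avoids evaluating an expression once per group. The sketch below uses a tiny made-up table and has not been benchmarked against the code above.

```r
library(data.table)

dt <- data.table(id   = c("b", "a", "b", "a"),
                 year = c(2001, 1999, 2000, 2005))

# Sort by reference, then number the rows within each id in one pass.
setorder(dt, id, year)
dt[, bike_number := rowid(id)]
```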

Session info

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.3 (2020-02-29)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2020-06-08                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source                                
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.1)                        
#>  backports     1.1.7   2020-05-13 [1] CRAN (R 3.6.3)                        
#>  beeswarm      0.2.3   2016-04-25 [1] CRAN (R 3.6.0)                        
#>  bench         1.1.1   2020-01-13 [1] CRAN (R 3.6.2)                        
#>  callr         3.4.3   2020-03-28 [1] CRAN (R 3.6.3)                        
#>  cli           2.0.2   2020-02-28 [1] CRAN (R 3.6.3)                        
#>  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.6.1)                        
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.1)                        
#>  curl          4.3     2019-12-02 [1] CRAN (R 3.6.1)                        
#>  data.table  * 1.12.9  2020-03-04 [1] Github (Rdatatable/data.table@b1b1832)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.1)                        
#>  devtools      2.3.0   2020-04-10 [1] CRAN (R 3.6.3)                        
#>  digest        0.6.25  2020-02-23 [1] CRAN (R 3.6.2)                        
#>  dplyr         1.0.0   2020-05-29 [1] CRAN (R 3.6.3)                        
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 3.6.3)                        
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.1)                        
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.2)                        
#>  farver        2.0.3   2020-01-16 [1] CRAN (R 3.6.2)                        
#>  fs            1.4.1   2020-04-04 [1] CRAN (R 3.6.3)                        
#>  generics      0.0.2   2018-11-29 [1] CRAN (R 3.6.1)                        
#>  ggbeeswarm    0.6.0   2017-08-07 [1] CRAN (R 3.6.3)                        
#>  ggplot2       3.3.1   2020-05-28 [1] CRAN (R 3.6.3)                        
#>  glue          1.4.1   2020-05-13 [1] CRAN (R 3.6.3)                        
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.1)                        
#>  highr         0.8     2019-03-20 [1] CRAN (R 3.6.1)                        
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.1)                        
#>  httr          1.4.1   2019-08-05 [1] CRAN (R 3.6.1)                        
#>  knitr         1.28    2020-02-06 [1] CRAN (R 3.6.2)                        
#>  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 3.6.3)                        
#>  magrittr    * 1.5     2014-11-22 [1] CRAN (R 3.6.1)                        
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.1)                        
#>  mime          0.9     2020-02-04 [1] CRAN (R 3.6.2)                        
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.1)                        
#>  pillar        1.4.4   2020-05-05 [1] CRAN (R 3.6.3)                        
#>  pkgbuild      1.0.8   2020-05-07 [1] CRAN (R 3.6.3)                        
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.1)                        
#>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 3.6.3)                        
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.2)                        
#>  processx      3.4.2   2020-02-09 [1] CRAN (R 3.6.2)                        
#>  profmem       0.5.0   2018-01-30 [1] CRAN (R 3.6.2)                        
#>  ps            1.3.3   2020-05-08 [1] CRAN (R 3.6.3)                        
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 3.6.3)                        
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.1)                        
#>  Rcpp          1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)                        
#>  remotes       2.1.1   2020-02-15 [1] CRAN (R 3.6.2)                        
#>  rlang         0.4.6   2020-05-02 [1] CRAN (R 3.6.3)                        
#>  rmarkdown     2.2     2020-05-31 [1] CRAN (R 3.6.3)                        
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.1)                        
#>  scales        1.1.1   2020-05-11 [1] CRAN (R 3.6.3)                        
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.1)                        
#>  stringi       1.4.6   2020-02-17 [1] CRAN (R 3.6.2)                        
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.1)                        
#>  testthat      2.3.2   2020-03-02 [1] CRAN (R 3.6.3)                        
#>  tibble        3.0.1   2020-04-20 [1] CRAN (R 3.6.3)                        
#>  tidyr         1.1.0   2020-05-20 [1] CRAN (R 3.6.3)                        
#>  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 3.6.3)                        
#>  tidytable     0.5.1   2020-05-29 [1] CRAN (R 3.6.3)                        
#>  usethis       1.6.1   2020-04-29 [1] CRAN (R 3.6.3)                        
#>  vctrs         0.3.0   2020-05-11 [1] CRAN (R 3.6.3)                        
#>  vipor         0.4.5   2017-03-22 [1] CRAN (R 3.6.3)                        
#>  withr         2.2.0   2020-04-20 [1] CRAN (R 3.6.3)                        
#>  xfun          0.14    2020-05-20 [1] CRAN (R 3.6.3)                        
#>  xml2          1.3.2   2020-04-23 [1] CRAN (R 3.6.3)                        
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.2)                        
#> 
#> [1] C:/Users/usr/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.3/library

unnest.() fails if nested data.tables have different ncol()

@leungi

library(tidytable)

df1 <- tidytable(a = "x", b = 1)
df2 <- tidytable(a = "y", b = 2, c = 3)

nested_df <- tidytable(id = 1:2,
                       list_col = list(df1, df2))

nested_df %>%
  unnest.(list_col)
#> Error in `[.data.table`(.data, , unlist(list_col, recursive = FALSE), : j doesn't evaluate to the same number of columns for each group
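
One direction a fix might take, sketched here in plain data.table and not taken from the package, is to bind the nested tables with `rbindlist(fill = TRUE)` so that missing columns are padded with NA, then reattach the outer columns. The `row_id` column is a temporary helper introduced for this example.

```r
library(data.table)

df1 <- data.table(a = "x", b = 1)
df2 <- data.table(a = "y", b = 2, c = 3)

nested_df <- data.table(id = 1:2,
                        list_col = list(df1, df2))

# Bind the nested tables, padding missing columns with NA and
# recording which outer row each piece came from.
unnested <- rbindlist(nested_df$list_col, fill = TRUE, idcol = "row_id")

# Reattach the outer id and drop the temporary index.
unnested[, id := nested_df$id[row_id]][, row_id := NULL]
setcolorder(unnested, "id")
```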
