markfairbanks / tidytable
Tidy interface to 'data.table'
Home Page: https://markfairbanks.github.io/tidytable/
License: Other
The results below show that distinct.() is not returning the unique letters from col_1. Initially, I thought it was related to the number of rows, but that does not seem to be the case. I'm using tidytable version 0.5.0.9.
pacman::p_load(tidytable,
tidyverse)
n <- 500000
test_df <- data.table(col_1 = sample(LETTERS, n, rep = TRUE),
col_2 = sample(1:5, n, rep = TRUE))
#####################
# Tidyverse
#####################
tidyverse_test <- test_df %>%
select(col_1) %>%
distinct() %>%
count() %>%
pull()
#####################
# Tidytable
#####################
tidytable_test <-
test_df %>%
select.(col_1) %>%
distinct.() %>%
count.() %>%
pull.()
data.table(tidyverse_cnt = tidyverse_test,
tidytable_cnt = tidytable_test,
match_test = (tidyverse_test == tidytable_test))
#> tidyverse_cnt tidytable_cnt match_test
#> 1: 26 500000 FALSE
Created on 2020-05-29 by the reprex package (v0.3.0)
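For reference, the expected count can be cross-checked against plain data.table. This is only a minimal sketch, not tidytable's implementation; the seed and the names distinct_cnt and distinct_cnt_direct are additions for illustration.

```r
library(data.table)

set.seed(42)  # illustrative seed, not part of the original report
n <- 500000
test_df <- data.table(col_1 = sample(LETTERS, n, replace = TRUE),
                      col_2 = sample(1:5, n, replace = TRUE))

# Keep only col_1, drop duplicate rows, then count what is left
distinct_cnt <- nrow(unique(test_df[, .(col_1)]))

# uniqueN() counts the distinct values directly
distinct_cnt_direct <- uniqueN(test_df$col_1)
```

Both counts should agree with the tidyverse result above.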
@markfairbanks: not sure if I'm missing something even after checking the manual; dplyr::summarise_at/all/if() are not yet implemented, right?
Something like:
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 5.843333 3.057333 3.758 1.199333
iris %>%
summarise_at(vars(contains("Sepal")), sum, na.rm = TRUE)
#> Sepal.Length Sepal.Width
#> 1 876.5 458.6
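In the meantime, the same summaries can be written directly in data.table. A minimal sketch, assuming plain data.table is acceptable as a stopgap; the intermediate names (num_cols, sepal_cols, means, sepal_sums) are made up for illustration.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# summarise_if(is.numeric, mean): pick the numeric columns with a predicate
num_cols <- names(iris_dt)[vapply(iris_dt, is.numeric, logical(1))]
means <- iris_dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = num_cols]

# summarise_at(vars(contains("Sepal")), sum): pick columns by name pattern
sepal_cols <- grep("Sepal", names(iris_dt), value = TRUE, fixed = TRUE)
sepal_sums <- iris_dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = sepal_cols]
```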
Originally posted by @leungi in #62 (comment)
Complicated title, but fairly simple explanation.
When I try to access the .y value I get the following error.
library(dplyr, warn.conflicts = FALSE)
library(tidytable, warn.conflicts = FALSE)
data <- tibble(
id = LETTERS[seq(1, 3)],
val_1 = seq(1, 3, 1),
val_2 = seq(4, 6, 1)
)
data %>%
nest_by.(id) %>%
mutate.(example_1 = map2.(data, id, ~.x %>%
mutate.(id = .y))) %>%
unnest.(example_1)
#> Error in eval(jsub, SDenv, parent.frame()): object '.y' not found
If I instead use the regular mutate function from dplyr, everything works fine.
library(dplyr, warn.conflicts = FALSE)
library(tidytable, warn.conflicts = FALSE)
data <- tibble(
id = LETTERS[seq(1, 3)],
val_1 = seq(1, 3, 1),
val_2 = seq(4, 6, 1)
)
data %>%
nest_by.(id) %>%
mutate.(example_1 = map2.(data, id, ~.x %>%
mutate(id = .y))) %>%
unnest.(example_1)
#> id val_1 val_2 id1
#> 1: A 1 4 A
#> 2: B 2 5 B
#> 3: C 3 6 C
This is an FYI of an API change to dt_unnest_legacy(). (I think you use this feature more than anyone else.) The "keep" columns are no longer passed in dots, but in a vector of bare column names. See below:
nested_df <- data.table(a = 1:10,
b = 11:20,
c = c(rep("a", 6), rep("b", 4)),
d = c(rep("a", 4), rep("b", 6))) %>%
dt_group_nest(c, d)
nested_df %>%
dt_unnest_legacy(data, keep = c(c, d))
nested_df %>%
dt_unnest_legacy(data, keep = is.character)
This is part of the v0.3.0 release, and is in preparation for the future API, when multiple columns can be unnested simultaneously.
reprex:
x = iris %>% as.data.table %>% dt_select(Sepal.Length, Sepal.Width)
y = iris %>% as.data.table %>% dt_select(Sepal.Width, Sepal.Length)
dt_bind_rows(x, y, use.names=TRUE)
Full error:
Error in rbindlist(dots, idcol = .id): Item 3 has 1 columns, inconsistent with item 1 which has 2 columns. To fill missing columns use fill=TRUE.
Traceback:
1. dt_bind_rows(x, y, use.names = TRUE, fill = TRUE)
2. dt_bind_rows.default(x, y, use.names = TRUE, fill = TRUE)
3. rbindlist(dots, idcol = .id)
Alternative that works:
x = iris %>% select(Sepal.Length, Sepal.Width)
y = iris %>% select(Sepal.Width, Sepal.Length)
rbind(x,y)
sessionInfo:
R version 3.6.2 (2019-12-12)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Georg_animal_feces/envs/tidyverse/lib/libopenblasp-r0.3.7.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.8 LeyLabRMisc_0.1.3 tidytable_0.3.2 ggplot2_3.2.1
[5] tidyr_1.0.0 dplyr_0.8.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 pillar_1.4.3 compiler_3.6.2 base64enc_0.1-3
[5] tools_3.6.2 bit_1.1-15.2 zeallot_0.1.0 digest_0.6.23
[9] uuid_0.1-2 jsonlite_1.6 evaluate_0.14 tibble_2.1.3
[13] lifecycle_0.1.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.2
[17] IRdisplay_0.7.0 IRkernel_1.1 repr_1.0.2 withr_2.1.2
[21] vctrs_0.2.1 bit64_0.9-7 grid_3.6.2 tidyselect_0.2.5
[25] glue_1.3.1 R6_2.4.1 pbdZMQ_0.3-3 purrr_0.3.3
[29] magrittr_1.5 backports_1.1.5 scales_1.1.0 htmltools_0.4.0
[33] assertthat_0.2.1 colorspace_1.4-1 lazyeval_0.2.2 munsell_0.5.0
[37] crayon_1.3.4
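For comparison, calling data.table's rbindlist() directly with use.names = TRUE matches columns by name. A minimal sketch of the expected behaviour, not a fix for dt_bind_rows(); the name combined is made up.

```r
library(data.table)

x <- as.data.table(iris)[, .(Sepal.Length, Sepal.Width)]
y <- as.data.table(iris)[, .(Sepal.Width, Sepal.Length)]

# use.names = TRUE matches columns by name rather than by position,
# so y's columns are reordered to line up with x's
combined <- rbindlist(list(x, y), use.names = TRUE)
```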
case.() does not use the defined default value if the result is NA. I expected case.() and case_when() to give similar results.
pacman::p_load(data.table, tidytable)
gender_func_1 <- function(col) {
col_value <- case.(col == 1, "M",
col == 2, "F",
default = "U")
return(col_value)
}
gender_func_2 <- function(col) {
col_value <- case_when(col == 1 ~ "M",
col == 2 ~ "F",
TRUE ~ "U")
return(col_value)
}
data.frame(gender = c(0:3, NA)) %>%
mutate.(gender_grp_1 = gender_func_1(gender),
gender_grp_2 = gender_func_2(gender))
gender gender_grp_1 gender_grp_2
<int> <chr> <chr>
1: 0 U U
2: 1 M M
3: 2 F F
4: 3 U U
5: NA <NA> U
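One way to get the case_when() behaviour is to keep NA out of the conditions altogether. A base R sketch (the helper name gender_label is made up for illustration): %in% returns FALSE rather than NA for missing values, so NA rows keep the default.

```r
gender_label <- function(col) {
  # Start with the default everywhere, then overwrite the matches
  out <- rep("U", length(col))
  # %in% never returns NA, so missing values fall through to the default
  out[col %in% 1] <- "M"
  out[col %in% 2] <- "F"
  out
}

labels <- gender_label(c(0:3, NA))
```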
Reprex below; dt_distinct() currently doesn't allow column specifications.
test_df <- data.table(a = 1:10,
b = 11:20,
c = c(rep("a", 6), rep("b", 4)),
d = c(rep("a", 4), rep("b", 6)))
dplyr::distinct(test_df, c, d)
# expected output
c d
1: a a
2: a b
3: b b
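The expected output can be reproduced in plain data.table by selecting the columns first and then dropping duplicates. A minimal sketch of the target behaviour, not the dt_distinct() implementation; the name distinct_cd is made up.

```r
library(data.table)

test_df <- data.table(a = 1:10,
                      b = 11:20,
                      c = c(rep("a", 6), rep("b", 4)),
                      d = c(rep("a", 4), rep("b", 6)))

# Select the columns of interest, then keep one row per unique combination
distinct_cd <- unique(test_df[, .(c, d)])
```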
Convert fifelse() to automatically use the proper NA
I think there is a difference between how dplyr treats quoted code and how tidytable treats it.
This is a blocking issue for integrating tidytable with disk.frame, see DiskFrame/disk.frame#271
fn = function(a,b) {
a + b
}
# dplyr seems to be able to handle this correctly
mwe_mutate = function(...) {
quo_dotdotdot = rlang::enquos(...)
data = data.frame(num = 1:100)
code = rlang::quo(mutate(data, !!!quo_dotdotdot))
rlang::eval_tidy(code)
}
mwe_mutate(b = fn(num, num))
# tidytable seems to fail
mwe_dt_mutate = function(...) {
quo_dotdotdot = rlang::enquos(...)
data = data.frame(num = 1:100)
code = rlang::quo(dt_mutate(data, !!!quo_dotdotdot))
rlang::eval_tidy(code)
}
mwe_dt_mutate(b = fn(num, num)) # this gives an error
Reprex below.
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df1 <-
structure(list(
id = "cal", sequence_id = "0357C554_DA8B_4D2A_945B_8381F2CE0C48",
date = structure(1548028800, class = c("POSIXct", "POSIXt"), tzone = "UTC")
), row.names = c(NA, -1L), class = "data.frame")
df1
#> id sequence_id date
#> 1 cal 0357C554_DA8B_4D2A_945B_8381F2CE0C48 2019-01-21
df1 %>%
tidytable::select.(id)
#> Error in vapply(.x, .f, character(1), ...): values must be length 1,
#> but FUN(X[[3]]) result is length 2
df1 %>%
select(-date) %>%
tidytable::select.(id)
#> id
#> <chr>
#> 1: cal
df1 %>%
mutate(date = as.Date(date)) %>%
tidytable::select.(id)
#> id
#> <chr>
#> 1: cal
Created on 2020-05-09 by the reprex package (v0.3.0)
I believe the issue lies with get_data_vars(), given that it doesn't account for Date or DateTime classes (e.g., POSIXct, POSIXt).
Below is a suggested change.
get_data_vars <- function(.data) {
data_names <- names(.data)
data_index <- seq_along(data_names)
data_vars <- setNames(as.list(data_index), data_names)
# current
# data_class <- tidytable::map_chr.(.data, class)
# proposed
# DateTime class may contain both POSIXct and POSIXt type, so pick one
data_class <- unlist(tidytable::map.(sapply(.data, class), ~.[[1]]))
integer_cols <- list(is.integer = data_index[data_class ==
"integer"])
double_cols <- list(is.double = data_index[data_class ==
"numeric"])
numeric_cols <- list(is.numeric = data_index[data_class %in%
c("integer", "numeric")])
character_cols <- list(is.character = data_index[data_class ==
"character"])
factor_cols <- list(is.factor = data_index[data_class ==
"factor"])
dttm_cols <- list(is.ddtm = data_index[data_class %in%
c("POSIXct", "POSIXt", "Date")])
logical_cols <- list(is.logical = data_index[data_class ==
"logical"])
list_cols <- list(is.list = data_index[data_class == "list"])
data_vars <- data_vars %>% append(integer_cols) %>% append(double_cols) %>%
append(numeric_cols) %>% append(character_cols) %>% append(factor_cols) %>%
append(dttm_cols) %>% append(logical_cols) %>% append(list_cols)
data_vars
}
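The root of the error can be seen in base R alone: a POSIXct column reports two classes, so any vapply() that expects character(1) per column fails. A minimal sketch with a cut-down version of df1; the names date_class and first_class are made up.

```r
df1 <- data.frame(
  id = "cal",
  date = as.POSIXct("2019-01-21", tz = "UTC"),
  stringsAsFactors = FALSE
)

# A POSIXct column carries two classes, which breaks a character(1) vapply
date_class <- class(df1$date)

# Taking only the first class of each column always yields one string
first_class <- vapply(df1, function(col) class(col)[1], character(1))
```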
Reprex and proposal below.
library(dplyr)
library(tidytable)
my_data <- iris
# |- current ----
head(my_data) %>%
select.(blah = Species,
blah2 = Sepal.Length)
#> Species Sepal.Length
#> <fct> <dbl>
#> 1: setosa 5.1
#> 2: setosa 4.9
#> 3: setosa 4.7
#> 4: setosa 4.6
#> 5: setosa 5.0
#> 6: setosa 5.4
# |- proposed ----
dots_selector_i_2 <- function (.data, ...)
{
data_vars <- tidytable:::get_data_vars(.data)
select_vars <- rlang::enexprs(...)
# print(select_vars)
select_index <- unlist(eval(rlang::expr(c(!!!select_vars)), data_vars))
keep_index <- unique(select_index[select_index > 0])
if (length(keep_index) == 0)
keep_index <- seq_along(.data)
drop_index <- unique(abs(select_index[select_index < 0]))
select_index <- keep_index[!keep_index %in% drop_index]
return(list(name = names(select_vars),
idx = select_index))
}
dots_selector_2 <- function (.data, ...)
{
match_names <- dots_selector_i_2(.data,...)
select_index <- match_names$idx
data_names <- names(.data)
select_vars <- rlang::syms(data_names[select_index])
return(list(select_vars = select_vars,
var_name = match_names$name))
}
select..tidytable2 <- function(.data, ...) {
match_cols <- dots_selector_2(.data, ...)
select_cols <- as.character(match_cols$select_vars)
# Using a character vector is faster for select
tidytable:::eval_expr(
.data[, !!select_cols]
)
if (any(match_cols$var_name != "")) {
tidytable:::eval_expr(
data.table::setnames(.data, select_cols, match_cols$var_name)
)
}
.data
}
head(my_data) %>%
select..tidytable2(blah = Species,
blah2 = Sepal.Length)
#> blah2 Sepal.Width Petal.Length Petal.Width blah
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
Created on 2020-04-05 by the reprex package (v0.3.0)
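For reference, the select-and-rename result being aimed for can be sketched in plain data.table as a subset followed by setnames(). This is an illustration of the target output only; old_names, new_names and renamed are made-up names.

```r
library(data.table)

my_data <- as.data.table(head(iris))

old_names <- c("Species", "Sepal.Length")
new_names <- c("blah", "blah2")

# Subsetting with a character vector returns a new data.table
renamed <- my_data[, old_names, with = FALSE]

# setnames() renames by reference; only the new object is touched
setnames(renamed, old_names, new_names)
```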
The tidytable docs include some examples of providing a non-quoted function arg, such as:
library(rlang)
df <- data.table(x = c(1,1,1), y = c(1,1,1), z = c("a","a","b"))
add_one <- function(.data, add_col) {
add_col <- enexpr(add_col)
.data %>%
mutate.(new_col = !!add_col + 1)
}
df %>%
add_one(x)
... but the docs don't have an example for when the user provides a quoted arg, such as:
make_subsets = function(to_filter, dt){
dt %>% dt_filter(Species == to_filter)
}
iris$Species %>% unique %>% as.list %>% lapply(make_subsets, dt=as.data.table(iris))
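As far as I can tell, when the argument already holds a value (a string here) rather than a bare column name, no quoting or unquoting is needed. A plain data.table sketch of the same pattern; make_subset, species and subsets are made-up names, and this does not show how dt_filter() itself behaves.

```r
library(data.table)

make_subset <- function(to_filter, dt) {
  # to_filter is an ordinary value, so it can be compared directly
  dt[Species == to_filter]
}

iris_dt <- as.data.table(iris)
species <- as.character(unique(iris_dt$Species))

# One subset per species
subsets <- lapply(species, make_subset, dt = iris_dt)
```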
Hi Mark,
It looks like joins using different column names across data.tables are not supported. Although join_mold has all the required functionality, functions like dt_left_join do not seem to use it.
Here is a snippet comparing the functionality with dplyr:
library("tidytable")
#>
#> Attaching package: 'tidytable'
#> The following object is masked from 'package:stats':
#>
#> dt
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris_dt = as_dt(iris) %>%
dt_select(Species, Sepal.Length)
iris_2_dt = as_dt(iris) %>%
dt_select(Species, Petal.Length) %>%
dt_rename(Species_2 = Species)
# tidytable
dt_left_join(iris_dt, iris_2_dt, by = c("Species" = "Species_2"))
#> Error in `[.data.table`(y[x, on = on_vec, allow.cartesian = TRUE], , ..all_names): column(s) not found: Species
# dplyr
left_join(iris_dt, iris_2_dt, by = c("Species" = "Species_2")) %>%
tibble::as_tibble()
#> # A tibble: 7,500 x 3
#> Species Sepal.Length Petal.Length
#> <fct> <dbl> <dbl>
#> 1 setosa 5.1 1.4
#> 2 setosa 5.1 1.4
#> 3 setosa 5.1 1.3
#> 4 setosa 5.1 1.5
#> 5 setosa 5.1 1.4
#> 6 setosa 5.1 1.7
#> 7 setosa 5.1 1.4
#> 8 setosa 5.1 1.5
#> 9 setosa 5.1 1.4
#> 10 setosa 5.1 1.5
#> # … with 7,490 more rows
Created on 2020-04-20 by the reprex package (v0.3.0)
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 3.6.2 (2019-12-12)
#> os macOS Mojave 10.14.5
#> system x86_64, darwin15.6.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Asia/Kolkata
#> date 2020-04-20
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.0)
#> callr 3.4.2 2020-02-12 [1] CRAN (R 3.6.0)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
#> data.table 1.12.8 2019-12-09 [1] CRAN (R 3.6.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
#> devtools 2.2.2 2020-02-17 [1] CRAN (R 3.6.0)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
#> dplyr * 0.8.4 2020-01-31 [1] CRAN (R 3.6.0)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
#> glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
#> jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.0)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
#> pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
#> pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.0)
#> processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.0)
#> ps 1.3.2 2020-02-13 [1] CRAN (R 3.6.0)
#> purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.0)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
#> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.0)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.0)
#> reticulate 1.14 2019-12-17 [1] CRAN (R 3.6.2)
#> rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.2)
#> rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.0)
#> tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.0)
#> tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.0)
#> tidytable * 0.3.1 2020-02-19 [1] CRAN (R 3.6.0)
#> usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.0)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.0)
#> vctrs 0.2.3 2020-02-20 [1] CRAN (R 3.6.0)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
#> xfun 0.12 2020-01-13 [1] CRAN (R 3.6.0)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
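Until this is supported, the same left join can be done with data.table's merge() method, which accepts by.x and by.y. A minimal sketch; the name joined is made up, and allow.cartesian = TRUE is needed because each row matches 50 rows on the other side.

```r
library(data.table)

iris_dt <- as.data.table(iris)[, .(Species, Sepal.Length)]
iris_2_dt <- as.data.table(iris)[, .(Species_2 = Species, Petal.Length)]

# by.x / by.y name the key column on each side; all.x = TRUE makes it a left join
joined <- merge(iris_dt, iris_2_dt,
                by.x = "Species", by.y = "Species_2",
                all.x = TRUE, allow.cartesian = TRUE)
```

The row count should match the 7,500 rows dplyr returns above.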
Reprex below.
I made reference to tidyfast since it has something similar.
# original - expected output
iris %>%
dplyr::group_nest(Species) %>%
dplyr::mutate(test = purrr::map(data, ~dplyr::filter(.x, Sepal.Width > 3.5))) %>%
tidyr::unnest(test)
# A tibble: 19 x 6
Species data Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <list> <dbl> <dbl> <dbl> <dbl>
1 setosa <tibble [50 x 4]> 5 3.6 1.4 0.2
2 setosa <tibble [50 x 4]> 5.4 3.9 1.7 0.4
3 setosa <tibble [50 x 4]> 5.4 3.7 1.5 0.2
4 setosa <tibble [50 x 4]> 5.8 4 1.2 0.2
5 setosa <tibble [50 x 4]> 5.7 4.4 1.5 0.4
6 setosa <tibble [50 x 4]> 5.4 3.9 1.3 0.4
7 setosa <tibble [50 x 4]> 5.7 3.8 1.7 0.3
8 setosa <tibble [50 x 4]> 5.1 3.8 1.5 0.3
9 setosa <tibble [50 x 4]> 5.1 3.7 1.5 0.4
10 setosa <tibble [50 x 4]> 4.6 3.6 1 0.2
11 setosa <tibble [50 x 4]> 5.2 4.1 1.5 0.1
12 setosa <tibble [50 x 4]> 5.5 4.2 1.4 0.2
13 setosa <tibble [50 x 4]> 4.9 3.6 1.4 0.1
14 setosa <tibble [50 x 4]> 5.1 3.8 1.9 0.4
15 setosa <tibble [50 x 4]> 5.1 3.8 1.6 0.2
16 setosa <tibble [50 x 4]> 5.3 3.7 1.5 0.2
17 virginica <tibble [50 x 4]> 7.2 3.6 6.1 2.5
18 virginica <tibble [50 x 4]> 7.7 3.8 6.7 2.2
19 virginica <tibble [50 x 4]> 7.9 3.8 6.4 2
iris %>%
dt_group_nest(Species) %>%
dt_mutate(test = dt_map(data, function(x) dplyr::filter(x, Sepal.Width > 3.5))) %>%
dt_unnest(test)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1: 5.0 3.6 1.4 0.2
2: 5.4 3.9 1.7 0.4
3: 5.4 3.7 1.5 0.2
4: 5.8 4.0 1.2 0.2
5: 5.7 4.4 1.5 0.4
6: 5.4 3.9 1.3 0.4
7: 5.7 3.8 1.7 0.3
8: 5.1 3.8 1.5 0.3
9: 5.1 3.7 1.5 0.4
10: 4.6 3.6 1.0 0.2
11: 5.2 4.1 1.5 0.1
12: 5.5 4.2 1.4 0.2
13: 4.9 3.6 1.4 0.1
14: 5.1 3.8 1.9 0.4
15: 5.1 3.8 1.6 0.2
16: 5.3 3.7 1.5 0.2
17: 7.2 3.6 6.1 2.5
18: 7.7 3.8 6.7 2.2
19: 7.9 3.8 6.4 2.0
# adding extra arg to `...` still doesn't work; same output as above
iris %>%
dt_group_nest(Species) %>%
dt_mutate(test = dt_map(data, function(x) dplyr::filter(x, Sepal.Width > 3.5))) %>%
dt_unnest(test, Species)
# tidyfast gives expected output
iris %>%
dt_group_nest(Species) %>%
dt_mutate(test = dt_map(data, function(x) dplyr::filter(x, Sepal.Width > 3.5))) %>%
tidyfast::dt_unnest(test)
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1: setosa 5.0 3.6 1.4 0.2
2: setosa 5.4 3.9 1.7 0.4
3: setosa 5.4 3.7 1.5 0.2
4: setosa 5.8 4.0 1.2 0.2
5: setosa 5.7 4.4 1.5 0.4
6: setosa 5.4 3.9 1.3 0.4
7: setosa 5.7 3.8 1.7 0.3
8: setosa 5.1 3.8 1.5 0.3
9: setosa 5.1 3.7 1.5 0.4
10: setosa 4.6 3.6 1.0 0.2
11: setosa 5.2 4.1 1.5 0.1
12: setosa 5.5 4.2 1.4 0.2
13: setosa 4.9 3.6 1.4 0.1
14: setosa 5.1 3.8 1.9 0.4
15: setosa 5.1 3.8 1.6 0.2
16: setosa 5.3 3.7 1.5 0.2
17: virginica 7.2 3.6 6.1 2.5
18: virginica 7.7 3.8 6.7 2.2
19: virginica 7.9 3.8 6.4 2.0
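One way to keep the grouping column while unnesting, sketched in plain data.table, is to carry the group value in the list names and let rbindlist() restore it through idcol. This is not how tidyfast or tidytable implement it; pieces, filtered and unnested are made-up names.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# One data.table per species; the list names carry the group value
pieces <- split(iris_dt, by = "Species", keep.by = FALSE)

# Apply the per-group filter
filtered <- lapply(pieces, function(d) d[Sepal.Width > 3.5])

# idcol turns the list names back into a column while row-binding
unnested <- rbindlist(filtered, idcol = "Species")
```

Note that Species comes back as a character column here rather than a factor.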
Currently tidytable functions convert data.frame/data.table/tibble objects to a "tidytable" in the background. tidytables print more like tibbles in the console, but otherwise have no features separate from a data.table.
When using R Markdown, knitr::opts_chunk$set(paged.print=TRUE) (the default) doesn't work.
@TysonStanley I've taken a couple shots at this and couldn't figure out a solution. Do you have any ideas?
library(tidytable)
library(data.table)
df1 <- data.table(a = "a", b = 1)
df2 <- data.table(b = 1, a = "a")
nested_df <- data.table(id = 1:2,
list_col = list(df1, df2))
nested_df %>%
unnest.(list_col)
#> Error in `[.data.table`(.data, , unlist(list_col, recursive = FALSE), : Column 1 of result for group 2 is type 'double' but expecting type 'character'. Column types must be consistent for each group.
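One possible direction (a sketch only, not a tested fix for unnest.()) is to bind the list column with rbindlist(use.names = TRUE), which lines the columns up by name instead of by position. The name unnested is made up.

```r
library(data.table)

df1 <- data.table(a = "a", b = 1)
df2 <- data.table(b = 1, a = "a")

nested_df <- data.table(id = 1:2,
                        list_col = list(df1, df2))

# use.names = TRUE aligns columns by name, so the differing order is harmless;
# idcol records each element's position, which happens to equal id here
unnested <- rbindlist(nested_df$list_col, use.names = TRUE, idcol = "id")
```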
If I use this tool in package development, how can I avoid variable names being reported as not defined?
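Assuming the question is about the R CMD check note "no visible binding for global variable", a common workaround for data.table-style code is to bind the column names to NULL inside the function. A minimal sketch; the function count_large and the column value are made up for illustration.

```r
library(data.table)

count_large <- function(dt, threshold = 1) {
  # Bind the column name locally so static code checks see a definition;
  # data.table still resolves `value` to the column when evaluating i
  value <- NULL
  dt[value > threshold, .N]
}

example_dt <- data.table(value = c(1, 2, 3))
n_large <- count_large(example_dt)
```

In a package, utils::globalVariables() is the other usual option.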
@markfairbanks Firstly, thanks for this project!
This reproducible example makes it clear:
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris_model = rpart::rpart(Species ~ ., data = iris)
iris %>%
tibble::as_tibble() %>%
mutate(., pred = predict(iris_model, ., type = "class"))
#> # A tibble: 150 x 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species pred
#> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa setosa
#> 2 4.9 3 1.4 0.2 setosa setosa
#> 3 4.7 3.2 1.3 0.2 setosa setosa
#> 4 4.6 3.1 1.5 0.2 setosa setosa
#> 5 5 3.6 1.4 0.2 setosa setosa
#> 6 5.4 3.9 1.7 0.4 setosa setosa
#> 7 4.6 3.4 1.4 0.3 setosa setosa
#> 8 5 3.4 1.5 0.2 setosa setosa
#> 9 4.4 2.9 1.4 0.2 setosa setosa
#> 10 4.9 3.1 1.5 0.1 setosa setosa
#> # … with 140 more rows
library("tidytable")
#>
#> Attaching package: 'tidytable'
#> The following object is masked from 'package:stats':
#>
#> dt
iris_2 = as_dt(iris)
iris_2_model = rpart::rpart(Species ~ ., data = iris_2)
iris_2 %>%
dt_mutate(., pred = predict(iris_2_model, ., type = "class"))
#> Error in predict.rpart(iris_2_model, ., type = "class"): object '.' not found
Created on 2020-03-20 by the reprex package (v0.3.0)
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 3.6.2 (2019-12-12)
#> os macOS Mojave 10.14.5
#> system x86_64, darwin15.6.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Asia/Kolkata
#> date 2020-03-20
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.0)
#> callr 3.4.2 2020-02-12 [1] CRAN (R 3.6.0)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
#> data.table 1.12.8 2019-12-09 [1] CRAN (R 3.6.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
#> devtools 2.2.2 2020-02-17 [1] CRAN (R 3.6.0)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.0)
#> dplyr * 0.8.4 2020-01-31 [1] CRAN (R 3.6.0)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
#> glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
#> pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
#> pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.0)
#> processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.0)
#> ps 1.3.2 2020-02-13 [1] CRAN (R 3.6.0)
#> purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.0)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
#> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.0)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.0)
#> rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.2)
#> rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0)
#> rpart 4.1-15 2019-04-12 [1] CRAN (R 3.6.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.0)
#> tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.0)
#> tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.0)
#> tidytable * 0.3.1 2020-02-19 [1] CRAN (R 3.6.0)
#> usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.0)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.0)
#> vctrs 0.2.3 2020-02-20 [1] CRAN (R 3.6.0)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
#> xfun 0.12 2020-01-13 [1] CRAN (R 3.6.0)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
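A possible workaround while `.` is not available inside dt_mutate() is to name the table explicitly. The sketch below uses plain data.table and swaps in lm() for rpart (my substitution, to avoid the extra dependency); the names fit and pred are illustrative.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# Stand-in model: lm() instead of rpart, purely to keep the sketch dependency-free
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris_dt)

# Name the table explicitly instead of relying on the pipe's `.` placeholder
iris_dt[, pred := as.numeric(predict(fit, newdata = iris_dt))]
```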
I wanted to get some opinions on something. I think I want to rename all of the functions. The dt_ prefix starts to get clunky after a while. Example:
test_df %>%
dt_select(dt_starts_with("x"), dt_ends_with("y")) %>%
dt_arrange(x)
That's a bit tough to read.
So this is what I'm thinking about doing:
test_df %>%
select.(starts_with.("x"), ends_with.("y")) %>%
arrange.(x)
or
Edit: As I do more research, functions that start with "." are supposed to be for internal use only. So it would have to be select.().
# Edit: ignore, this isn't allowed
test_df %>%
.select(.starts_with("x"), .ends_with("y")) %>%
.arrange(x)
I would keep the old ones in the package for a while, but all documentation would use the "dot" version.
If there was a time to do this, it would be now when the user base was still a bit smaller.
Good idea? Bad idea? Is there another way to rename them that I’m not thinking of?
Reprex and suggested edit below.
packageVersion("tidydt")
#> [1] '0.2.0'
library(tidydt)
#>
#> Attaching package: 'tidydt'
#> The following object is masked from 'package:stats':
#>
#> dt
not_nested <- list(col1 = c("Apple", "Orange"),
col2 = c("Baseball", "Football"))
purrr::map_dfc(not_nested, ~ .) %>% class()
#> [1] "tbl_df" "tbl" "data.frame"
# expect a data.table but got matrix
dt_map_dfc(not_nested, function(x) x) %>% class()
#> [1] "matrix"
dt_map_dfc_edit <- function (.x, .f, ...)
{
result_list <- dt_map(.x, .f, ...)
as_dt(as.data.frame(do.call(cbind, result_list)))
}
dt_map_dfc_edit(not_nested, function(x) x) %>% class()
#> [1] "data.table" "data.frame"
nested <- list(col1 = list(c("Apple", "Banana"),
c("Orange")),
col2 = list(c("Baseball", "Soccer"),
c("Football")))
dt_map_dfc_edit(nested, function(x) x)
#> col1 col2
#> 1: Apple,Banana Baseball,Soccer
#> 2: Orange Football
Created on 2020-01-19 by the reprex package (v0.3.0)
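One caveat with the cbind() route is that it goes through a matrix, so mixed column types get coerced to a common type. An alternative sketch that skips the matrix step by handing the list straight to as.data.table(); map_dfc_sketch is a made-up name and lapply() stands in for dt_map().

```r
library(data.table)

# as.data.table() on a named list keeps each element as its own column,
# so column types are preserved
map_dfc_sketch <- function(.x, .f, ...) {
  as.data.table(lapply(.x, .f, ...))
}

not_nested <- list(col1 = c("Apple", "Orange"),
                   col2 = c("Baseball", "Football"))
res <- map_dfc_sketch(not_nested, function(x) x)

# Mixed types survive: n stays integer instead of becoming character
mixed <- map_dfc_sketch(list(n = 1:2, s = c("a", "b")), function(x) x)
```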
I'm trying to create a couple of convenience functions for processing data.table objects with tidytable (see below). The first works, but the second doesn't, and I don't understand why. Any clarification would be greatly appreciated!
This function works with data.table objects (e.g., unique_n(example_dt, sel_col=z)).
#' Pretty print number of unique elements in a vector
#'
#' The result will be cat'ed to the screen.
#' tidytable compatible. Maje
#'
#' @param x a vector or data.table. If data.table, sel_col must not be NULL
#' @param label what to call the items in the vector (eg., "samples")
#' @param sel_col If x=data.table, which column to assess?
#' @returns NULL
unique_n = function(x, label='items', sel_col=NULL){
if(any((class(x)) == 'data.table')){
if(is.null(sel_col)){
stop('sel_col cannot be NULL for data.table objects')
}
sel_col = ggplot2::enexpr(sel_col)
x = tidytable::dt_distinct(x, !!sel_col)
x = tidytable::dt_pull(x, !!sel_col)
}
cat(sprintf('No. of unique %s:', label),
length(unique(x)), '\n')
}
This function doesn't work with data.table objects (e.g., overlap(example_dt, example_dt, sel_col_x=z, sel_col_y=z)). I get an "object not found" error.
#' Determine counts of setdiff, intersect, & union of 2 vectors (or data.tables)
#'
#' The output is printed text of intersect, each-way setdiff, and union.
#' Data.table compatible! Just make sure to provide sel_col_x and/or sel_col_y
#'
#' @param x vector1 or data.table. If data.table, sel_col_x must not be NULL
#' @param y vector2 or data.table. If data.table, sel_col_y must not be NULL
#' @param sel_col_x If x = data.table, which column to assess?
#' @param sel_col_y If y = data.table, which column to assess?
#' @return NULL
#'
overlap = function(x, y, sel_col_x=NULL, sel_col_y=NULL){
if(any((class(x)) == 'data.table')){
if(is.null(sel_col_x)){
stop('sel_col_x cannot be NULL for data.table objects')
}
sel_col_x = ggplot2::enexpr(sel_col_x)
x = tidytable::dt_distinct(x, !!sel_col_x)
x = tidytable::dt_pull(x, !!sel_col_x)
}
if(any((class(y)) == 'data.table')){
if(is.null(sel_col_y)){
stop('sel_col_y cannot be NULL for data.table objects')
}
sel_col_y = ggplot2::enexpr(sel_col_y)
y = tidytable::dt_distinct(y, !!sel_col_y)
y = tidytable::dt_pull(y, !!sel_col_y)
}
cat('intersect(x,y):', length(intersect(x,y)), '\n')
cat('setdiff(x,y):', length(setdiff(x,y)), '\n')
cat('setdiff(y,x):', length(setdiff(y,x)), '\n')
cat('union(x,y):', length(union(x,y)), '\n')
}
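One possible cause (an assumption; nothing in the report confirms it) is that is.null(sel_col_x) evaluates the bare column name before enexpr() gets to capture it. A sketch that captures the expression first, using base substitute() in place of enexpr(); the helper name pull_distinct is made up.

```r
library(data.table)

pull_distinct <- function(x, sel_col = NULL) {
  # Capture the unevaluated argument before anything forces it
  sel_expr <- substitute(sel_col)
  if (is.data.table(x)) {
    if (is.null(sel_expr)) {
      stop("sel_col cannot be NULL for data.table objects")
    }
    # Works for a bare name (z) as well as a string ("z")
    x <- x[[as.character(sel_expr)]]
  }
  unique(x)
}

example_dt <- data.table(x = c(1, 1, 1), y = c(2, 2, 2), z = c("a", "a", "b"))

# Both arguments are data.tables with a bare column name, as in overlap()
shared <- intersect(pull_distinct(example_dt, z),
                    pull_distinct(example_dt, sel_col = z))
```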
See https://github.com/tidyverse/dplyr/blob/master/R/slice.R
slice_head()
slice_tail()
slice_min()
slice_max()
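A rough idea of what these could look like on top of data.table, as a sketch only: the *_dt names are made up, columns are passed as strings for simplicity, and ties are not kept (dplyr's slice_min()/slice_max() keep ties by default).

```r
library(data.table)

slice_head_dt <- function(dt, n = 5) head(dt, n)
slice_tail_dt <- function(dt, n = 5) tail(dt, n)

slice_min_dt <- function(dt, col, n = 1) {
  # Row order by ascending value, then keep the first n rows
  idx <- order(dt[[col]])
  dt[head(idx, n)]
}

slice_max_dt <- function(dt, col, n = 1) {
  # Row order by descending value, then keep the first n rows
  idx <- order(dt[[col]], decreasing = TRUE)
  dt[head(idx, n)]
}

example_dt <- data.table(x = c(3, 1, 2), y = c("c", "a", "b"))
smallest <- slice_min_dt(example_dt, "x")
largest_two <- slice_max_dt(example_dt, "x", n = 2)
```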
This one seems pretty straightforward, but should probably be implemented with c_across.() as well, which is trickier.
Edit: This fails if the data.table contains list columns. See new implementation below.
mutate_rowwise. <- function(.df, ...) {
mutate.(.df, ..., .by = everything())
}
reprex:
example_dt <- data.table::data.table(
"x col" = c(1,1,1),
"y col" = c(2,2,2),
"z col" = c("a", "a", "b"))
example_dt %>%
mutate_across.(is.numeric, as.character)
# Error in `[.data.table`(.data, , `:=`((.cols), map.(.SD, .fns, ...)), : Some items of .SDcols are not column names: [`x col`, `y col`]
sessionInfo:
R version 3.6.2 (2019-12-12)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS
Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Georg_animal_feces/envs/phyloseq-phy/lib/libopenblasp-r0.3.7.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] LeyLabRMisc_0.1.5 tidytable_0.4.1 data.table_1.12.8 ape_5.3
[5] phyloseq_1.30.0 ggplot2_3.2.1 tidyr_1.0.0 dplyr_0.8.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 lattice_0.20-38 Biostrings_2.54.0
[4] assertthat_0.2.1 digest_0.6.23 foreach_1.4.7
[7] IRdisplay_0.7.0 R6_2.4.1 plyr_1.8.5
[10] repr_1.0.2 stats4_3.6.2 evaluate_0.14
[13] pillar_1.4.3 zlibbioc_1.32.0 rlang_0.4.6
[16] lazyeval_0.2.2 uuid_0.1-2 vegan_2.5-6
[19] S4Vectors_0.24.0 Matrix_1.2-18 splines_3.6.2
[22] stringr_1.4.0 igraph_1.2.4.2 munsell_0.5.0
[25] xfun_0.12 compiler_3.6.2 pkgconfig_2.0.3
[28] BiocGenerics_0.32.0 base64enc_0.1-3 multtest_2.42.0
[31] mgcv_1.8-31 htmltools_0.4.0 biomformat_1.14.0
[34] tidyselect_1.1.0 tibble_2.1.3 IRanges_2.20.0
[37] codetools_0.2-16 fansi_0.4.1 permute_0.9-5
[40] crayon_1.3.4 withr_2.1.2 MASS_7.3-51.5
[43] grid_3.6.2 nlme_3.1-143 jsonlite_1.6
[46] gtable_0.3.0 lifecycle_0.1.0 magrittr_1.5
[49] scales_1.1.0 cli_2.0.1 stringi_1.4.5
[52] XVector_0.26.0 reshape2_1.4.3 ellipsis_0.3.0
[55] vctrs_0.3.0 IRkernel_1.1 Rhdf5lib_1.8.0
[58] iterators_1.0.12 tools_3.6.2 ade4_1.7-13
[61] Biobase_2.46.0 glue_1.3.1 purrr_0.3.3
[64] parallel_3.6.2 survival_3.1-8 colorspace_1.4-1
[67] rhdf5_2.30.0 cluster_2.1.0 pbdZMQ_0.3-3
[70] knitr_1.26
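The error message shows the names wrapped in backticks, so presumably the selected names are being deparsed rather than passed as plain strings (my reading, not confirmed). In plain data.table the same operation works when .SDcols gets the raw character names. A minimal sketch; num_cols is a made-up name.

```r
library(data.table)

example_dt <- data.table(
  "x col" = c(1, 1, 1),
  "y col" = c(2, 2, 2),
  "z col" = c("a", "a", "b"))

# Plain character names, spaces included, with no backticks added
num_cols <- names(example_dt)[vapply(example_dt, is.numeric, logical(1))]

# Convert the selected columns in place
example_dt[, (num_cols) := lapply(.SD, as.character), .SDcols = num_cols]
```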
tidytable currently implements its own version of tidyselect features. It covers the core features a user needs, but is still missing some handy functionality.
Pros:
tidyselect functions/features as they use in dplyr.
Cons:
tidyselect dependencies on ellipsis/glue/purrr/vctrs
Reprex and suggested edit below.
packageVersion("tidydt")
#> [1] '0.2.0'
library(tidydt)
#>
#> Attaching package: 'tidydt'
#> The following object is masked from 'package:stats':
#>
#> dt
iris_ls <- iris %>%
dplyr::group_split(Species)
dplyr::bind_rows(iris_ls)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
dplyr::bind_rows(dt_map(iris_ls, as_dt))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> ---
#> 146: 6.7 3.0 5.2 2.3 virginica
#> 147: 6.3 2.5 5.0 1.9 virginica
#> 148: 6.5 3.0 5.2 2.0 virginica
#> 149: 6.2 3.4 5.4 2.3 virginica
#> 150: 5.9 3.0 5.1 1.8 virginica
# unsuccessful test
dt_bind_rows(iris_ls)
#> Error in dt_bind_rows(iris_ls): All inputs must be a data.frame or data.table
iris_ls_dt <- dt_map(iris_ls, as_dt)
dt_bind_rows(iris_ls_dt)
#> Error in dt_bind_rows(iris_ls_dt): All inputs must be a data.frame or data.table
dt_bind_rows(iris_ls, iris)
#> Error in dt_bind_rows(iris_ls, iris): All inputs must be a data.frame or data.table
# revised fxn
dt_bind_rows_edit <- function(.data, ...) {
# check if input .data is already a list; if not, transform to list
if (class(.data) != "list") {
.data <- list(.data)
}
dots <- enexprs(...)
dots <- dt_map(dots, eval)
# remove list()
dots <- append(.data, dots)
if (!all(dt_map_lgl(dots, is.data.frame)))
stop("All inputs must be a data.frame or data.table")
if (!all(dt_map_lgl(dots, is.data.table)))
dots <- dt_map(dots, as.data.table)
rbindlist(dots)
}
# successful test
dt_bind_rows_edit(iris_ls)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> ---
#> 146: 6.7 3.0 5.2 2.3 virginica
#> 147: 6.3 2.5 5.0 1.9 virginica
#> 148: 6.5 3.0 5.2 2.0 virginica
#> 149: 6.2 3.4 5.4 2.3 virginica
#> 150: 5.9 3.0 5.1 1.8 virginica
dt_bind_rows_edit(iris_ls_dt)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> ---
#> 146: 6.7 3.0 5.2 2.3 virginica
#> 147: 6.3 2.5 5.0 1.9 virginica
#> 148: 6.5 3.0 5.2 2.0 virginica
#> 149: 6.2 3.4 5.4 2.3 virginica
#> 150: 5.9 3.0 5.1 1.8 virginica
dt_bind_rows_edit(iris_ls, iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> ---
#> 296: 6.7 3.0 5.2 2.3 virginica
#> 297: 6.3 2.5 5.0 1.9 virginica
#> 298: 6.5 3.0 5.2 2.0 virginica
#> 299: 6.2 3.4 5.4 2.3 virginica
#> 300: 5.9 3.0 5.1 1.8 virginica
dt_bind_rows_edit(iris, iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> ---
#> 296: 6.7 3.0 5.2 2.3 virginica
#> 297: 6.3 2.5 5.0 1.9 virginica
#> 298: 6.5 3.0 5.2 2.0 virginica
#> 299: 6.2 3.4 5.4 2.3 virginica
#> 300: 5.9 3.0 5.1 1.8 virginica
Created on 2020-01-19 by the reprex package (v0.3.0)
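The revised function above decides whether to wrap its input by testing `class(.data) != "list"`. Since `class()` can return more than one string (a tibble has three classes), that comparison can yield a vector inside `if`. A minimal sketch of the same idea using `is.data.frame()` instead is shown here; `bind_rows_sketch` is a hypothetical name, it uses only base R and data.table, and it is not the package's implementation.

```r
library(data.table)

# Hypothetical helper for illustration; not part of tidydt/tidytable
bind_rows_sketch <- function(.data, ...) {
  # A single data frame is wrapped so that every input is a list of frames
  if (is.data.frame(.data)) .data <- list(.data)
  frames <- c(.data, list(...))
  if (!all(vapply(frames, is.data.frame, logical(1)))) {
    stop("All inputs must be a data.frame or data.table")
  }
  # rbindlist() accepts a list of data.frames or data.tables
  rbindlist(frames)
}

iris_ls <- split(iris, iris$Species)
nrow(bind_rows_sketch(iris_ls))        # 150
nrow(bind_rows_sketch(iris_ls, iris))  # 300
nrow(bind_rows_sketch(iris, iris))     # 300
```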
library(data.table)
library(tidytable)
test_df <- data.table::data.table(
x = c(1,2,3,4),
y = c(4,5,6,7),
z = c("a","a","a","b"))
test_df %>%
dt_slice(1:.N)
#> Error in 1:.N: argument of length 0
Created on 2020-02-28 by the reprex package (v0.3.0)
Thanks a lot @markfairbanks! summarize_across.() is much slower than the dtplyr and data.table solutions in the following example. What would be a faster way using tidytable?
R version 4.0.0 (2020-04-24)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
CPU: Core i7-8750H @ 2.20GHz, 2208 Mhz, 6 Core(s), 12 Logical Processor(s)
RAM: 16 GB
package * version date lib source
data.table * 1.12.9 2020-04-20 [1] local
dplyr * 0.8.99.9003 2020-05-29 [1] Github (tidyverse/dplyr@2855355)
dtplyr * 1.0.1.9000 2020-05-29 [1] Github (tidyverse/dtplyr@dfb22f6)
tidytable * 0.5.0.9 2020-05-29 [1] Github (markfairbanks/tidytable@bc4b1d0)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> dplyr_func() 16.3866 17.2444 18.8779 17.8115 19.5853 34.8279 100
#> dplyr_acr_func() 190.3610 199.9047 207.6058 204.3143 211.3635 278.1041 100
#> dtplyr_func() 8.1858 8.7738 9.7016 9.3106 10.1511 19.2393 100
#> tidytable_func() 71.8009 74.0848 78.2104 76.1958 80.2190 127.3493 100
#> dt_func() 2.7862 2.9893 3.6715 3.1622 4.4132 6.1665 100
library(microbenchmark)
library(data.table)
library(dtplyr)
library(dplyr)
library(tidytable)
library(ggplot2)
diamonds2 <- lazy_dt(diamonds)
dt <- setDT(copy(diamonds))
tidytable_func <- function() {
diamonds %>%
summarize_across.(c(depth, table, price, carat),
mean,
by = c(cut, color, clarity))
}
dplyr_func <- function() {
diamonds %>%
group_by(cut, color, clarity) %>%
summarise_at(vars(depth, table, price, carat), mean)
}
dplyr_acr_func <- function() {
diamonds %>%
group_by(cut, color, clarity) %>%
summarise(across(c(depth, table, price, carat), mean))
}
dtplyr_func <- function() {
diamonds2 %>%
group_by(cut, color, clarity) %>%
summarise_at(vars(depth, table, price, carat), mean) %>%
as_tibble()
}
cols <- c("depth", "table", "price", "carat")
dt_func <- function() {
dt[
, lapply(.SD, mean)
, by = .(cut, color, clarity)
, .SDcols = cols]
}
microbenchmark(
dplyr_func(),
dplyr_acr_func(),
dtplyr_func(),
tidytable_func(),
dt_func(),
unit = 'ms',
times = 100L
)
Originally posted by @tungmilan in #64 (comment)
@markfairbanks it may be beneficial to show how these functions compare with the Tidyverse functions in terms of speed and efficiency (the updated bench package can be used to do so). This is probably the main draw of this package: the idea that someone gets a quicker, more efficient experience than if they were using the Tidyverse.
Hi Mark,
Consider adding separate_rows.() as an equivalent for tidyr::separate_rows(). I would be happy to submit a PR.
pacman::p_load("tidytable", "data.table")
separate_rows. = function(dt, column, sep = ",", trim = FALSE){
tidytable::is_tidytable(dt)
stopifnot(is.character(sep) && length(sep) == 1L)
stopifnot(is.logical(trim) && length(trim) == 1L)
dt = tidytable:::shallow(dt)
dt[ , id_ := .I]
column = rlang::enexpr(column)
column_name = rlang::expr_deparse(column)
other_column_names = setdiff(colnames(dt), column_name)
res = tidytable:::eval_expr(
dt[
, strsplit(as.character(!!column), split = sep)
, by = other_column_names
]
)
data.table::setnames(res, "V1", column_name)
res[ , id_ := NULL]
if (trim) {
tidytable:::eval_expr(res[ , !!column := trimws(!!column)])
}
return(res[])
}
x = sample(c("a,b", "c,d, e", "f ,"), 1e4, replace = TRUE)
y = sample(c(1,2,3), 1e4, replace = TRUE)
temp = data.table(x, y)
dplyr::glimpse(temp)
#> Rows: 10,000
#> Columns: 2
#> $ x <chr> "f ,", "a,b", "f ,", "c,d, e", "c,d, e", "a,b", "a,b", "f ,", "f ,"…
#> $ y <dbl> 2, 3, 1, 2, 3, 3, 1, 2, 3, 2, 1, 1, 1, 2, 3, 1, 3, 2, 1, 2, 1, 1, 3…
system.time({ res_tt = separate_rows.(temp, x, trim = TRUE) })
#> user system elapsed
#> 0.055 0.003 0.057
system.time({ res_tidyr = tidyr::separate_rows(temp, x) })
#> user system elapsed
#> 0.882 0.682 1.571
I sent you this on Reddit as well; I wasn't sure how often you check your account.
I have an advanced R question and hoped you might be able to provide some guidance.
I created this package (tidytable) to replicate tidyverse syntax with a data.table/rlang backend.
Originally the functions followed the syntax of dt_verb(). I was about to submit a big update to CRAN that would replace this syntax with verb.() (while adding function deprecation warnings for the dt_ syntax).
My question is this - is it a bad idea to use verb.()? Is this against best practices since the "." is used for S3 methods? The package passes tests, passes devtools::check(), and all S3 methods work (ex: mutate..data.frame()). But is this a bad idea?
Thanks for any help - I'll be honest, I'm not sure who to reach out to with this issue.
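To make the question concrete: S3 dispatch builds the method name by pasting the generic's name, a ".", and the class, so a generic whose name ends in "." gets methods with a double dot. The sketch below uses a hypothetical generic `verb.` (not a tidytable function) and only shows that dispatch works; it does not settle whether the naming is good practice.

```r
# Hypothetical generic named in the verb.() style
verb. <- function(.df, ...) {
  UseMethod("verb.")
}

# Method for data frames: generic name "verb." + "." + class "data.frame"
verb..data.frame <- function(.df, ...) {
  "data.frame method"
}

# Fallback used when no class-specific method exists
verb..default <- function(.df, ...) {
  "default method"
}

verb.(iris)  # "data.frame method"
verb.(1:3)   # "default method"
```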
Reprex below.
library(tidytable)
#>
#> Attaching package: 'tidytable'
#> The following object is masked from 'package:stats':
#>
#> dt
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# |- data ----
dat <- structure(
list(
id = c("11", "22"),
phase = c("a", "b"),
values = list(
structure(
list(
a = 0.0584563566053344,
b = 192,
c = "50%",
d = 1,
e = 0,
f = 0,
g = 0
),
row.names = c(NA, -1L),
class = c("tbl_df",
"tbl", "data.frame")
),
structure(
list(
c = "50%",
d = 465L,
e = 0,
g = 290514.430137519,
b = 10961.9288476965,
a = 0.359973896295374,
h = 1.46588348984196,
f = 119.108387941727
),
row.names = c(NA,
-1L),
class = c("tbl_df", "tbl", "data.frame")
)
)
),
row.names = c(NA,
-2L),
class = c("tbl_df", "tbl", "data.frame")
)
# |- tidyr ----
# wide results
dat %>%
tidyr::unnest(values)
#> # A tibble: 2 x 10
#> id phase a b c d e f g h
#> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 11 a 0.0585 192 50% 1 0 0 0 NA
#> 2 22 b 0.360 10962. 50% 465 0 119. 290514. 1.47
# |- tidytable ----
# long output
dat %>%
unnest.(values)
#> id phase values
#> <chr> <chr> <list>
#> 1: 11 a 0.05845636
#> 2: 11 a 192
#> 3: 11 a 50%
#> 4: 11 a 1
#> 5: 11 a 0
#> 6: 11 a 0
#> 7: 11 a 0
#> 8: 22 b 50%
#> 9: 22 b 465
#> 10: 22 b 0
#> 11: 22 b 290514.4
#> 12: 22 b 10961.93
#> 13: 22 b 0.3599739
#> 14: 22 b 1.465883
#> 15: 22 b 119.1084
Created on 2020-04-05 by the reprex package (v0.3.0)
How should one implement tidyr::unite() with tidytable? I don't see any equivalent function in the docs.
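Until an equivalent exists, one possible workaround is to paste the columns together with data.table directly. This is a minimal sketch under assumptions: `unite_sketch` and its arguments are hypothetical names, it is not a tidytable function, and unlike tidyr::unite() it appends the new column at the end rather than at the position of the first united column.

```r
library(data.table)

# Hypothetical helper for illustration; not part of tidytable
unite_sketch <- function(dt, new_col, cols, sep = "_", remove = TRUE) {
  # Copy first so the caller's table is not modified by reference
  dt <- copy(as.data.table(dt))
  # Paste the selected columns row-wise into the new column
  dt[, (new_col) := do.call(paste, c(.SD, sep = sep)), .SDcols = cols]
  # Optionally drop the source columns, as unite() does by default
  if (remove) dt[, (setdiff(cols, new_col)) := NULL]
  dt[]
}

df <- data.table(a = c("x", "y"), b = c(1, 2), c = c("p", "q"))
united <- unite_sketch(df, "ab", c("a", "b"))
united$ab  # "x_1" "y_2"
```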
Using by as an argument prevents the user from having columns named "by" when using mutate.() or summarize.().
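The mechanism can be shown in base R with a stand-in function. `mutate_sketch` is a hypothetical name with an assumed signature of `(.df, ..., by = NULL)`; it only reports which arguments ended up where. Because R matches `by = ...` to the formal argument, it never reaches `...` as a new column definition.

```r
# Hypothetical stand-in for a verb with a `by` argument
mutate_sketch <- function(.df, ..., by = NULL) {
  new_cols <- list(...)
  # Report which names were treated as new columns and what `by` received
  list(new_columns = names(new_cols), by = by)
}

res <- mutate_sketch(iris, by = "Species", double = 2)
res$new_columns  # "double"
res$by           # "Species"
```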
See ?lifecycle::deprecated for a redirection example.
It would be helpful to add these for the main dplyr verbs that already have tests - arrange(), mutate(), summarize(), & filter().
In case someone wants to help with this one, all tests are found in the source code of the README, and here's how filter() would look:
filter_marks <- bench::mark(
tidyverse = filter(test_tbl, a <= 7, c == "a"),
# Only new line, all other parts are already written
dtplyr = as.data.table(filter(lazy_dt(test_dt), a <= 7, c == "a")),
tidytable = filter.(test_dt, a <= 7, c == "a"),
data.table = test_dt[a <= 7 & c == "a"],
check = FALSE, iterations = iters, memory = FALSE, filter_gc = FALSE, time_unit = 'ms') %>%
mutate(expression = as.character(expression),
function_tested = "filter")
For anyone new to pull requests that wants to take a shot at this - feel free to comment on this issue and I can walk you through the process.
As somebody who likes the tidyverse syntax and requires data.table performance while struggling with its modify-by-reference semantics, I was very happy to find tidytable. Thanks for this great package!
I am working with large datasets (1-10M rows, 50-500 cols) that often require mutating grouped data.
In this scenario, however, I found tidytable::mutate.() to be much slower than the data.table equivalent, and still considerably slower than the dplyr alternative.
library(magrittr)
library(data.table)
rows <- 1000000
ids <- 50000
#simple data set with many different IDs and 1M rows, 3 cols
df <- data.frame(id = as.character(sample(1:ids, size = rows, replace = TRUE)), #using character variable as ID
bike = sample(c("mountain", "allround", "road", "bmx"), size = rows, replace = TRUE),
year = sample(1980:2020, size = rows, replace = TRUE),
stringsAsFactors = FALSE)
results <- bench::mark(
#first run with tidytable
tidytable = df %>%
#sort by case id, time and item
tidytable::arrange.(id, year, bike)%>%
#calculate new item number variable #group by case id
tidytable::mutate.(bike_number = as.integer(tidytable::row_number.()), by = id),
#second run with dplyr
dplyr = df %>%
#sort by case id, time and item
dplyr::arrange(id, year, bike)%>%
#calculate new item number variable #group by case id
dplyr::group_by(id) %>%
dplyr::mutate(bike_number = as.integer(dplyr::row_number())) %>%
dplyr::ungroup(),
#third run with data.table
data.table = data.table::copy(df) %>%
data.table::as.data.table(.) %>%
#sort by case id, time and item
.[base::order(nchar(.[, id]), .[, id], .[, year], .[, bike], method = "radix")] %>%
#calculate new item number variable #group by case id
.[, bike_number := as.integer(seq_len(.N)), by=.[, id]] %>%
.[],
iterations = 3, filter_gc = FALSE, check = FALSE
)
ggplot2::autoplot(results)
#> Loading required namespace: tidyr
Created on 2020-06-08 by the reprex package (v0.3.0)
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.3 (2020-02-29)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate German_Germany.1252
#> ctype German_Germany.1252
#> tz Europe/Berlin
#> date 2020-06-08
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.1)
#> backports 1.1.7 2020-05-13 [1] CRAN (R 3.6.3)
#> beeswarm 0.2.3 2016-04-25 [1] CRAN (R 3.6.0)
#> bench 1.1.1 2020-01-13 [1] CRAN (R 3.6.2)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
#> colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.1)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.1)
#> curl 4.3 2019-12-02 [1] CRAN (R 3.6.1)
#> data.table * 1.12.9 2020-03-04 [1] Github (Rdatatable/data.table@b1b1832)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.1)
#> devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.2)
#> dplyr 1.0.0 2020-05-29 [1] CRAN (R 3.6.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.3)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.1)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
#> farver 2.0.3 2020-01-16 [1] CRAN (R 3.6.2)
#> fs 1.4.1 2020-04-04 [1] CRAN (R 3.6.3)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.1)
#> ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 3.6.3)
#> ggplot2 3.3.1 2020-05-28 [1] CRAN (R 3.6.3)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 3.6.3)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.1)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.1)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
#> httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
#> magrittr * 1.5 2014-11-22 [1] CRAN (R 3.6.1)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.1)
#> mime 0.9 2020-02-04 [1] CRAN (R 3.6.2)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.1)
#> pillar 1.4.4 2020-05-05 [1] CRAN (R 3.6.3)
#> pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 3.6.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.1)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 3.6.3)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
#> processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.2)
#> profmem 0.5.0 2018-01-30 [1] CRAN (R 3.6.2)
#> ps 1.3.3 2020-05-08 [1] CRAN (R 3.6.3)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.3)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
#> Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.2)
#> rlang 0.4.6 2020-05-02 [1] CRAN (R 3.6.3)
#> rmarkdown 2.2 2020-05-31 [1] CRAN (R 3.6.3)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.1)
#> scales 1.1.1 2020-05-11 [1] CRAN (R 3.6.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.1)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.1)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.3)
#> tibble 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
#> tidyr 1.1.0 2020-05-20 [1] CRAN (R 3.6.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
#> tidytable 0.5.1 2020-05-29 [1] CRAN (R 3.6.3)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 3.6.3)
#> vctrs 0.3.0 2020-05-11 [1] CRAN (R 3.6.3)
#> vipor 0.4.5 2017-03-22 [1] CRAN (R 3.6.3)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 3.6.3)
#> xfun 0.14 2020-05-20 [1] CRAN (R 3.6.3)
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 3.6.3)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
#>
#> [1] C:/Users/usr/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.3/library
library(tidytable)
df1 <- tidytable(a = "x", b = 1)
df2 <- tidytable(a = "y", b = 2, c = 3)
nested_df <- tidytable(id = 1:2,
list_col = list(df1, df2))
nested_df %>%
unnest.(list_col)
#> Error in `[.data.table`(.data, , unlist(list_col, recursive = FALSE), : j doesn't evaluate to the same number of columns for each group