praktiskt / featuretoolsr

49.0 3.0 8.0 67 KB

An R interface to the Python module Featuretools

License: Other

R 100.00%

rstats featuretools machine-learning feature-engineering r-package

featuretoolsr's People

Contributors

Stargazers

Watchers

Forkers

atharkharal grayskripko minkvsky happyhepingyu statunizaga mstei4176 atusy

featuretoolsr's Issues

featuretoolsR::add_relationship new signature

The new signature of featuretoolsR::add_relationship(entityset, parent_set, child_set, parent_idx, child_idx) breaks my previous code. I suggest changing it to featuretoolsR::add_relationship(entityset, parent_set, child_set, parent_idx, child_idx = parent_idx).
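The suggested default relies on R evaluating default arguments lazily, so `child_idx = parent_idx` can refer to another argument of the same call. A minimal sketch of that behaviour follows; `add_relationship_sketch()` is a hypothetical stand-in that only reports how its arguments resolve, not the real featuretoolsR function.

```r
# Hypothetical stand-in, NOT the real featuretoolsR::add_relationship().
# It only returns the resolved column references so the default can be inspected.
add_relationship_sketch <- function(entityset, parent_set, child_set,
                                    parent_idx, child_idx = parent_idx) {
  list(
    parent = paste(parent_set, parent_idx, sep = "."),
    child  = paste(child_set, child_idx, sep = ".")
  )
}

# Old-style call: one shared key name, child_idx falls back to parent_idx
same_key <- add_relationship_sketch(NULL, "set_1", "set_2", "key")

# New-style call: the child uses a differently named column
diff_key <- add_relationship_sketch(NULL, "set_1", "set_2", "key", "parent_key")
```

With a default like this, code written against a single shared key name would keep working, while callers with differently named columns can still pass both.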

distributed.core annoying messages when n_jobs > 1

I get tens of

distributed.core - INFO - Event loop was unresponsive in Nanny for 1276.34s.  
This is often caused by long-running GIL-holding functions or moving large chunks of data. 
This can cause timeouts and instability.

Have you run into these warnings? Do you know how to deal with them?
I spent many hours trying different approaches: Python logging settings, featuretools settings, capturing the printed messages with reticulate, and capture.output() and sink() in R.

Column True is not a string

as_entityset(data.frame(a = 1:3))

> 2018-12-24 01:26:37,591 featuretools.entityset - WARNING    index True not found in dataframe, creating new integer column
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: All column names must be strings (Column True is not a string)
In addition: Warning message:
In as_entityset(data.frame(a = 1:3)) :
 
 Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: All column names must be strings (Column True is not a string)

Error in py_call_impl(callable, dots$args, dots$keywords)

Hi there,

I've tried this package as instructed in README.md but got the error message after executing

ft_matrix <- es %>%
  dfs(
    target_entity = "set_1", 
    trans_primitives = c("and", 'divide')
  )

error message

' Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: ('Unknown transform primitive divide. ', 'Call ft.primitives.list_primitives() to get', ' a list of available primitives') '

What went wrong?

Here's my sessioninfo:

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950  LC_CTYPE=Chinese (Traditional)_Taiwan.950   
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950 LC_NUMERIC=C                                
[5] LC_TIME=Chinese (Traditional)_Taiwan.950    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2       magrittr_1.5         featuretoolsR_0.1.0  RevoUtils_11.0.1     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] reticulate_1.12    tidyselect_0.2.4   reshape2_1.4.3     purrr_0.2.5        splines_3.5.1     
 [6] lattice_0.20-38    colorspace_1.3-2   generics_0.0.2     stats4_3.5.1       yaml_2.2.0        
[11] survival_2.44-1.1  prodlim_2018.04.18 rlang_0.3.4        ModelMetrics_1.1.0 pillar_1.4.1      
[16] glue_1.3.0         withr_2.1.2        foreach_1.5.0      bindr_0.1.1        plyr_1.8.4        
[21] lava_1.6.5         stringr_1.4.0      timeDate_3043.102  munsell_0.5.0      gtable_0.3.0      
[26] recipes_0.1.5      devtools_1.13.6    codetools_0.2-16   memoise_1.1.0      caret_6.0-80      
[31] class_7.3-14       Rcpp_0.12.18       scales_0.5.0       ipred_0.9-6        jsonlite_1.5      
[36] ggplot2_3.0.0      digest_0.6.19      stringi_1.1.7      dplyr_0.7.6        grid_3.5.1        
[41] tools_3.5.1        lazyeval_0.2.1     tibble_1.4.2       crayon_1.3.4       pkgconfig_2.0.2   
[46] MASS_7.3-50        Matrix_1.2-17      lubridate_1.7.4    gower_0.1.2        assertthat_0.2.1  
[51] rstudioapi_0.10    iterators_1.0.10   R6_2.4.0           rpart_4.1-13       nnet_7.3-12       
[56] nlme_3.1-137       compiler_3.5.1

tidy_feature_matrix error: column `value` must be...

Based on the demo example. I got an error and then did debug(tidy_feature_matrix) to investigate the problem and its location:

to_r <- tibble::as.tibble(reticulate::py_to_r(.data[[1]]))
> Error: Column `value` must be a 1d atomic vector or a list

I think the solution is here

reticulate::py_to_r(.data[[1]]) %>% str

> .frame':	100 obs. of  1 variable:
 $ value:[y, z, z, o, x, ..., n, r, q, z, q]
Length: 100
Categories (25, object): [a, b, c, d, ..., w, x, y, z]
 - attr(*, "pandas.index")=Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  
             12,  13,
             14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
             27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
             40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
             53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
             66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
             79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
             92,  93,  94,  95,  96,  97,  98,  99, 100],
           dtype='int64', name='key')

Seems .data[[1]] isn't a plain pandas DataFrame

-- UPDATE --
Full code to reproduce the error

library(featuretoolsR)
library(magrittr)

set_1 <- data.frame(key = 1:100, value = sample(letters, 100, T))

as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo") %>% 
dfs(
    target_entity = "set_1", 
    trans_primitives = c("and", "divide")) %>% 
    tidy_feature_matrix(remove_nzv = T, nan_is_na = T)

add_relationship signature

I ran into a problem when trying to execute add_relationship(), and I solved it only after reading the Python featuretools documentation: https://docs.featuretools.com/generated/featuretools.Relationship.html#featuretools.Relationship

It's hard to understand where to place the parent and child set arguments when the arguments are called "set1" and "set2". I suggest naming them as in the original version: parent_variable, child_variable, or maybe parent_set, child_set.
Is it possible to have two separate arguments, parent_idx and child_idx, so that the entity column names don't have to be aligned?

Update README

The package is now available on CRAN. Update readme to reflect that.

Readme code does not work

I have just gone through the code as given in the README. The last line,
tidy <- tidy_feature_matrix(ft_matrix, remove_nzv = T, nan_is_na = T)

gives the following:

Removing near zero variance variables
Error in as.vector(x, mode) :
cannot coerce type 'environment' to vector of type 'any'

Any help please?

library("featuretoolsR") error

library("featuretoolsR")
╔═════════════════════════╗
║ featuretoolsR 0.4.4 ║
╚═════════════════════════╝
Error: package or namespace load failed for 'featuretoolsR':
.onAttach failed in attachNamespace() for 'featuretoolsR', details:
call: py_get_attr_impl(x, name, silent)
error: AttributeError: module 'featuretools' has no attribute 'version'

Give more instructions

Hi, I wonder if you could provide more instructions on how to use featuretoolsR? I find this a very interesting package, but I just can't figure out how to use it properly. Thank you!

Ensure pip and virtualenv exists

install_featuretools() lacks checks that reliably inform the user when pip or virtualenv is missing.

If a user is missing virtualenv, the default message is good enough.

If a user has virtualenv but not pip, the created virtualenv gets bricked. This can be fixed (non-intuitively) by setting custom_virtualenv to true after installing pip.

Add checks to zzz.R.

Add support in install_featuretools() for not creating a virtualenv until pip exists.
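One way such a pre-flight check could look, as a rough sketch using only base R: `Sys.which()` returns an empty string for executables it cannot find on the PATH. The helper name `find_missing_tools()` is hypothetical and not part of featuretoolsR, and a PATH lookup is only an assumption about how the tools would be located.

```r
# Hypothetical helper, not part of featuretoolsR.
# Returns the names of the tools that cannot be found on the PATH.
find_missing_tools <- function(tools = c("pip", "virtualenv")) {
  paths <- Sys.which(tools)          # "" for every tool that is not found
  names(paths)[!nzchar(paths)]
}

missing_tools <- find_missing_tools()
if (length(missing_tools) > 0) {
  # Report the problem before any virtualenv is created
  message("Install these before creating a virtualenv: ",
          paste(missing_tools, collapse = ", "))
}
```

Running a check like this before the virtualenv is created would avoid the bricked-environment case described above, since the installer could stop early when pip is absent.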

Can't create an entity set: AttributeError: 'EntitySet' object has no attribute 'entity_from_dataframe'

When following the instructions in the README under the 'Creating an EntitySet' heading, the following code results in an error:

library(featuretoolsR)
library(magrittr)

set_1 <- data.frame(key = 1:100, value = sample(letters, 100, T), a = rep(Sys.Date(), 100))
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, T), b = rep(Sys.time(), 100))

es <- as_entityset(
  set_1, 
  index = "key", 
  entity_id = "set_1", 
  id = "demo", 
  time_index = "a"
)

The error states:

Error in py_get_attr_impl(x, name, silent) : 
  AttributeError: 'EntitySet' object has no attribute 'entity_from_dataframe'

Session Info:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_2.0.2      dplyr_1.0.8         foreign_0.8-81      featuretoolsR_0.4.4

loaded via a namespace (and not attached):
 [1] reticulate_1.24      tidyselect_1.1.2     purrr_0.3.4          reshape2_1.4.4       listenv_0.8.0       
 [6] splines_4.1.2        lattice_0.20-45      colorspace_2.0-3     vctrs_0.3.8          generics_0.1.2      
[11] stats4_4.1.2         utf8_1.2.2           survival_3.2-13      prodlim_2019.11.13   rlang_1.0.1         
[16] ModelMetrics_1.2.2.2 pillar_1.7.0         glue_1.6.2           withr_2.4.3          rappdirs_0.3.3      
[21] foreach_1.5.2        lifecycle_1.0.1      plyr_1.8.6           lava_1.6.10          stringr_1.4.0       
[26] timeDate_3043.102    munsell_0.5.0        gtable_0.3.0         future_1.24.0        recipes_0.2.0       
[31] codetools_0.2-18     caret_6.0-90         parallel_4.1.2       class_7.3-19         fansi_1.0.2         
[36] Rcpp_1.0.8           scales_1.1.1         ipred_0.9-12         jsonlite_1.8.0       parallelly_1.30.0   
[41] png_0.1-7            ggplot2_3.3.5        digest_0.6.29        stringi_1.7.6        rprojroot_2.0.2     
[46] grid_4.1.2           here_1.0.1           hardhat_0.2.0        cli_3.2.0            tools_4.1.2         
[51] tibble_3.1.6         crayon_1.5.0         future.apply_1.8.1   pkgconfig_2.0.3      ellipsis_0.3.2      
[56] MASS_7.3-54          Matrix_1.3-4         data.table_1.14.2    pROC_1.18.0          lubridate_1.8.0     
[61] gower_1.0.0          rstudioapi_0.13      iterators_1.0.14     R6_2.5.1             globals_0.14.0      
[66] rpart_4.1-15         nnet_7.3-16          nlme_3.1-153         compiler_4.1.2

Add tidy support for colnames

When calling tidy_feature_matrix variable names become very non-R-like.

Clean variable names using regexes, something like:

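tidynames <- function(df) {
  # Lower-case the names, then replace every non-alphanumeric character with "_"
  n <- tolower(names(df))
  tn <- gsub("[^a-z0-9]", "_", n)
  # Collapse runs of underscores and trim leading/trailing ones
  tn <- gsub("_+", "_", tn)
  tn <- gsub("^_|_$", "", tn)
  names(df) <- tn
  return(df)
}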

Suggestion: add autonormalize

It could be useful to include the related https://github.com/FeatureLabs/autonormalize in this package.

Reload R-session after featuretools installation

Currently the library can't be used until the R session is restarted after running install_featuretools().

Upon successful Featuretools installation, unload and reload R-session. Perhaps something like:

cat("Reloading featuretoolsR\n")
unloadNamespace("featuretoolsR")
.rs.restartR() -> .; rm(.)
library(featuretoolsR)

(Not sure if this is allowed by CRAN, should be checked first)

Readme demo no longer works? Unable to add relationship because child variable is also its index

Hi,

I'm trying to run the demo and it stops at add_relationship.

library(magrittr)
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, TRUE), stringsAsFactors = TRUE)
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, TRUE), stringsAsFactors = TRUE)

as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo") %>%
  add_entity(entity_id = "set_2", df = set_2, index = "key") %>%
  add_relationship(
    parent_set = "set_1",
    child_set = "set_2",
    parent_idx = "key",
    child_idx = "key"
  )

Error:

 Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: Unable to add relationship because child variable 'key' in 'set_2' is also its index

I think it might be related to this new error message from this Jun 2020 issue, on featuretools.

I've played around the code but have a hard time understanding how to fix this.
Is there a quick fix?

Thanks for your help!

Diagnostic info:

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363) 
reticulate_1.19      featuretoolsR_0.4.4  magrittr_2.0.1

reticulate::py_discover_config() 
python:         C:/Anaconda3/envs/r-reticulate/python.exe
libpython:      C:/Anaconda3/envs/r-reticulate/python36.dll
pythonhome:     C:/Anaconda3/envs/r-reticulate
version:        3.6.13 (default, Feb 19 2021, 05:17:09) [MSC v.1916 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Anaconda3/envs/r-reticulate/Lib/site-packages/numpy
numpy_version:  1.19.5

python versions found: 
 C:/Anaconda3/envs/r-reticulate/python.exe
 C:/Anaconda3/python.exe
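The error message says the child's linking column may not also be the child's index. A possible workaround, sketched below, is to give the child set its own index column and keep `key` purely as the column that points at the parent. The column name `row_id` is hypothetical, and the commented-out featuretoolsR call is untested; it only reuses the argument names shown in the report above.

```r
# Parent set: `key` is the index
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, TRUE))

# Child set: a dedicated index (`row_id`, hypothetical name) plus a
# separate column `key` that refers to rows of set_1
set_2 <- data.frame(
  row_id = 1:100,
  key    = sample(set_1$key, 100, TRUE),
  value  = sample(LETTERS, 100, TRUE)
)

child_index <- "row_id"   # index of the child set
child_link  <- "key"      # column used in the relationship

# Untested assumption: with distinct columns the relationship can be added, e.g.
# as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo") %>%
#   add_entity(entity_id = "set_2", df = set_2, index = child_index) %>%
#   add_relationship(parent_set = "set_1", child_set = "set_2",
#                    parent_idx = "key", child_idx = child_link)
```

The point of the split is that every `set_2$key` value still matches a row in `set_1`, while the child's index is no longer the same column as the one used in the relationship.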

Any plans to support categorical-encoding?

The featuretools package in Python has the sub-package below, which can be installed as part of it:
https://pypi.org/project/categorical-encoding/#:~:text=categorical%2Dencoding%20is%20a%20Python,within%20the%20machine%20learning%20pipeline.
I wanted to check whether there are any plans, or an easier way, to support the same via your package featuretoolsR.

featuretoolsR::list_primitives() in console

> featuretoolsR::list_primitives()
                                name                              type
1  <environment: 0x000000002bfe8ef8> <environment: 0x000000002bcdad38>
2                               <NA>                              <NA>
3                               <NA>                              <NA>
...                             ...                               ...
62                              <NA>                              <NA>

Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
  corrupt data frame: columns will be truncated or padded with NAs

Interestingly, everything is OK when I build this example with reprex::reprex(),
but I can't work with it in my RStudio console.

out of bounds when executing tidy_feature_matrix

Hi there!

I just stumbled upon your package and I am incredibly happy someone made the effort to implement this. Thanks a lot for this!

I started out with your example and unfortunately, I encountered an error when creating a tidy_feature_matrix (the idea of which I absolutely love!)

# pacman::p_install_gh("magnusfurugard/featuretoolsR")

pacman::p_load(tidyverse, featuretoolsR)

# Create some mock data
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, T))
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, T))

# Create entityset
es <- as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo")

es <- es %>%
  add_entity(
    df = set_2, 
    entity_id = "set_2", 
    index = "key"
  )

es <- es %>%
  add_relationship(
    set1 = "set_1", 
    set2 = "set_2", 
    idx = "key"
  )

ft_matrix <- es %>%
  dfs(
    target_entity = "set_1", 
    trans_primitives = c("and", "divide")
  )

tidy <- tidy_feature_matrix(ft_matrix)

tidy

Error:

# Error in py_call_impl(callable, dots$args, dots$keywords) : IndexError: index 100 is out of bounds for axis 0 with size 100

Through a traceback, I was able to narrow down the problem to the py_to_r function, which seems to have a problem with the 0 indexing of Python.

See here:

reticulate::py_to_r(ft_matrix[[1]])

Error:

# Error in py_call_impl(callable, dots$args, dots$keywords) : IndexError: index 100 is out of bounds for axis 0 with size 100

Again, thank you for making this available and I totally understand that a lot of this is probably work in progress. I am just glad someone did this :)

Best,

Fabio

Update

I tried with different data and this seems to work fine. So it seems it has something to do with the example data, maybe?

ft <- reticulate::import("featuretools")

es = ft$demo$load_mock_customer(return_entityset=T)

ft_matrix <- es %>%
  dfs(
    target_entity = "customers", 
    trans_primitives = c("and", "divide")
  )

tidy <- tidy_feature_matrix(ft_matrix)

tidy

Works just fine!

An error in the demo example

library(featuretoolsR)
library(magrittr)

set_1 <- data.frame(key = 1:100, value = sample(letters, 100, T))

as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo") %>% 
dfs(
    target_entity = "set_1", 
    trans_primitives = c("and", "divide")) %>% 
    tidy_feature_matrix(remove_nzv = T, nan_is_na = T)

> Removing near zero variance variables
C:\Users\srskr\ANACON~1\lib\site-packages\pandas\core\arrays\categorical.py:486: 
FutureWarning: Index.itemsize is deprecated and will be removed in a future version
  return self.categories.itemsize
Error in `[.python.builtin.object`(nondupe, , colname) : 
  unused argument (colname)

Is there a way to use es$plot() ?

I'd like to plot an entity set, but even after installing graphviz, the plot function only returns digraph metadata, not an actual plot.

The tidy_feature_matrix throws a conversion error

Running the example in the readme, I receive:

cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame

Fix DESCRIPTION

Existing primitives are not found

A number of primitives listed in list_primitives() throw an error:

Invalid transform primitive(s): mean. Use list_primitives() to find valid primitives.

Problem with dates after the last reticulate update

"Reticulate now always converts R Date objects into Python datetime objects. Note that these conversions can be inefficient -- if you would prefer conversion to NumPy datetime64 objects / arrays, you should convert your date to POSIXct first."
https://github.com/rstudio/reticulate/blob/master/NEWS.md

The next lines take a couple of seconds on my machine with a good CPU

rep(as_date("2019-01-01"), 500) %>% reticulate::r_to_py()

I fixed it in my project with

r_tibble %>% mutate_if(is.Date, as.POSIXct) %>% reticulate::r_to_py()

You should add this kind of mutation for every incoming R data.frame
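The same conversion can be sketched without dplyr or lubridate, which may be easier to apply inside a package. The helper `dates_to_posixct()` below is a hypothetical name, not an existing featuretoolsR function; it converts every Date column of a data.frame to POSIXct and leaves the other columns untouched.

```r
# Hypothetical helper, not part of featuretoolsR.
# Converts all Date columns to POSIXct before the data is handed to reticulate.
dates_to_posixct <- function(df) {
  is_date <- vapply(df, inherits, logical(1), what = "Date")
  for (col in names(df)[is_date]) {
    df[[col]] <- as.POSIXct(df[[col]])
  }
  df
}

df <- data.frame(key = 1:3, day = rep(as.Date("2019-01-01"), 3))
converted <- dates_to_posixct(df)
class(converted$day)
```

Whether this removes the slowdown in a given setup depends on the installed reticulate version, so it is worth timing `reticulate::r_to_py()` before and after the conversion.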

cutoff time in index of dfs()

I use dfs(cutoff_time_in_index=TRUE). I think this reticulate update, https://rstudio.github.io/reticulate/news/index.html#reticulate-1-9-cran ("Always call r_to_py S3 method when converting objects from Python to R"), broke my code, which extracted the multiindex as new columns with dfs_result[[1]]$reset_index(). Is it possible to preserve these columns after featuretoolsR::dfs()?
