praktiskt / featuretoolsr
An R interface to the Python module Featuretools
License: Other
The new signature of featuretoolsR::add_relationship(entityset, parent_set, child_set, parent_idx, child_idx)
breaks my previous code. I suggest changing it to featuretoolsR::add_relationship(entityset, parent_set, child_set, parent_idx, child_idx = parent_idx)
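The suggested default can be sketched in plain R, since default arguments are evaluated lazily and may refer to other arguments. The wrapper name add_relationship_compat is hypothetical; it only collects the resolved arguments and does not call featuretoolsR or Python.

```r
# Hypothetical wrapper illustrating the proposed default: child_idx falls back
# to parent_idx when the caller does not supply it.
add_relationship_compat <- function(entityset, parent_set, child_set,
                                    parent_idx, child_idx = parent_idx) {
  # Return the resolved arguments instead of calling into Python, so the
  # sketch runs without featuretoolsR installed.
  list(
    parent_set = parent_set,
    child_set  = child_set,
    parent_idx = parent_idx,
    child_idx  = child_idx
  )
}

# Old-style call: one shared index name, child_idx is filled in automatically
old_style <- add_relationship_compat(NULL, "set_1", "set_2", parent_idx = "key")

# New-style call: distinct index names for parent and child
new_style <- add_relationship_compat(NULL, "set_1", "set_2", "key", "parent_key")
```

With such a default, code written against the single-index signature would keep working, while callers who need different column names can still pass both.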
I get dozens of messages like this:
distributed.core - INFO - Event loop was unresponsive in Nanny for 1276.34s.
This is often caused by long-running GIL-holding functions or moving large chunks of data.
This can cause timeouts and instability.
Have you run into these warnings? Do you know how to deal with them?
I spent many hours trying different approaches: Python logging settings, featuretools settings, capturing the printed messages through reticulate, and capture.output() and sink() in R.
as_entityset(data.frame(a = 1:3))
> 2018-12-24 01:26:37,591 featuretools.entityset - WARNING index True not found in dataframe, creating new integer column
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: All column names must be strings (Column True is not a string)
In addition: Warning message:
In as_entityset(data.frame(a = 1:3)) :
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: All column names must be strings (Column True is not a string)
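The warning suggests that no index column was given, so a non-string value (True) ended up being used as the column name. A minimal sketch of a workaround, assuming that passing an explicit index column avoids the problem (not confirmed in this thread); the helper name ensure_index and the column name row_id are hypothetical.

```r
# Hypothetical helper: make sure a data.frame has an explicit, unique index
# column before it is handed to as_entityset().
ensure_index <- function(df, index = "row_id") {
  if (!index %in% names(df)) {
    df[[index]] <- seq_len(nrow(df))
  }
  # An index column must not contain duplicates
  stopifnot(!anyDuplicated(df[[index]]))
  df
}

df <- ensure_index(data.frame(a = 1:3))

# Intended follow-up call (not run here; needs featuretoolsR and Python):
# as_entityset(df, index = "row_id", entity_id = "df", id = "demo")
```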
Hi there,
I tried this package as instructed in README.md but got an error after executing
ft_matrix <- es %>%
dfs(
target_entity = "set_1",
trans_primitives = c("and", 'divide')
)
Error message:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: ('Unknown transform primitive divide. ', 'Call ft.primitives.list_primitives() to get', ' a list of available primitives')
What went wrong?
Here's my sessioninfo:
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950 LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950 LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 magrittr_1.5 featuretoolsR_0.1.0 RevoUtils_11.0.1 RevoUtilsMath_11.0.0
loaded via a namespace (and not attached):
[1] reticulate_1.12 tidyselect_0.2.4 reshape2_1.4.3 purrr_0.2.5 splines_3.5.1
[6] lattice_0.20-38 colorspace_1.3-2 generics_0.0.2 stats4_3.5.1 yaml_2.2.0
[11] survival_2.44-1.1 prodlim_2018.04.18 rlang_0.3.4 ModelMetrics_1.1.0 pillar_1.4.1
[16] glue_1.3.0 withr_2.1.2 foreach_1.5.0 bindr_0.1.1 plyr_1.8.4
[21] lava_1.6.5 stringr_1.4.0 timeDate_3043.102 munsell_0.5.0 gtable_0.3.0
[26] recipes_0.1.5 devtools_1.13.6 codetools_0.2-16 memoise_1.1.0 caret_6.0-80
[31] class_7.3-14 Rcpp_0.12.18 scales_0.5.0 ipred_0.9-6 jsonlite_1.5
[36] ggplot2_3.0.0 digest_0.6.19 stringi_1.1.7 dplyr_0.7.6 grid_3.5.1
[41] tools_3.5.1 lazyeval_0.2.1 tibble_1.4.2 crayon_1.3.4 pkgconfig_2.0.2
[46] MASS_7.3-50 Matrix_1.2-17 lubridate_1.7.4 gower_0.1.2 assertthat_0.2.1
[51] rstudioapi_0.10 iterators_1.0.10 R6_2.4.0 rpart_4.1-13 nnet_7.3-12
[56] nlme_3.1-137 compiler_3.5.1
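Regarding the "Unknown transform primitive divide" error above: one way to catch this before calling dfs() is to compare the requested primitive names against the names the installed Featuretools reports. The sketch below assumes those names have already been pulled into a character vector. The helper unknown_primitives is hypothetical, and the vector of available names is illustrative only, not real output.

```r
# Hypothetical helper: report which requested primitives are not available.
unknown_primitives <- function(requested, available) {
  setdiff(tolower(requested), tolower(available))
}

# Illustrative values only; in practice `available` would come from the
# primitive list reported by the installed Featuretools version.
available <- c("and", "or", "not", "add_numeric", "divide_numeric")
requested <- c("and", "divide")

not_found <- unknown_primitives(requested, available)
if (length(not_found) > 0) {
  message("Unknown primitives: ", paste(not_found, collapse = ", "))
}
```

Primitive names can change between Featuretools releases, so a name taken from an older README may no longer exist.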
Based on the demo example, I got an error and then ran debug(tidy_feature_matrix) to investigate the problem and its location:
to_r <- tibble::as.tibble(reticulate::py_to_r(.data[[1]]))
> Error: Column `value` must be a 1d atomic vector or a list
I think the solution is here
reticulate::py_to_r(.data[[1]]) %>% str
> 'data.frame': 100 obs. of 1 variable:
$ value:[y, z, z, o, x, ..., n, r, q, z, q]
Length: 100
Categories (25, object): [a, b, c, d, ..., w, x, y, z]
- attr(*, "pandas.index")=Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65,
66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78,
79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91,
92, 93, 94, 95, 96, 97, 98, 99, 100],
dtype='int64', name='key')
It seems .data[[1]] isn't a plain pandas DataFrame.
-- UPDATE --
Full code to reproduce the error
library(featuretoolsR)
library(magrittr)
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, T))
as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo") %>%
dfs(
target_entity = "set_1",
trans_primitives = c("and", "divide")) %>%
tidy_feature_matrix(remove_nzv = T, nan_is_na = T)
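The str() output above suggests the value column comes back as a pandas Categorical, which is what an R factor turns into on the Python side. A possible workaround, not confirmed in this thread, is to convert factor columns to character before building the entity set. The helper name factors_to_character is hypothetical.

```r
# Hypothetical helper: convert factor columns to character so they are not
# passed to Python as categoricals.
factors_to_character <- function(df) {
  is_fct <- vapply(df, is.factor, logical(1))
  if (any(is_fct)) {
    df[is_fct] <- lapply(df[is_fct], as.character)
  }
  df
}

set_1 <- data.frame(
  key = 1:100,
  value = sample(letters, 100, TRUE),
  stringsAsFactors = TRUE  # mimic the default of R versions before 4.0
)
set_1 <- factors_to_character(set_1)
```

On R 4.0 and later, data.frame() no longer creates factors by default, so the conversion only matters for data that already contains factor columns.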
I ran into a problem calling add_relationship() and solved it only after reading the Python featuretools documentation: https://docs.featuretools.com/generated/featuretools.Relationship.html#featuretools.Relationship
It's hard to tell where to put the parent and child sets when the arguments are called "set1" and "set2". I suggest naming them as in the original library, parent_variable and child_variable, or perhaps parent_set and child_set.
Would it be possible to have two separate arguments, parent_idx and child_idx, so that the index columns of the two entities don't have to share a name?
The package is now available on CRAN. Update the README to reflect that.
I have just run the code from the README. The last line,
tidy <- tidy_feature_matrix(ft_matrix, remove_nzv = T, nan_is_na = T)
gives the following:
Removing near zero variance variables
Error in as.vector(x, mode) :
cannot coerce type 'environment' to vector of type 'any'
Any help, please?
library("featuretoolsR")
╔═════════════════════════╗
║ featuretoolsR 0.4.4 ║
╚═════════════════════════╝
Error: package or namespace load failed for ‘featuretoolsR’:
.onAttach failed in attachNamespace() for 'featuretoolsR', details:
call: py_get_attr_impl(x, name, silent)
error: AttributeError: module 'featuretools' has no attribute 'version'
Hi, I wonder if you could provide more instructions on how to use FeaturetoolsR? I found this as a very interesting package but just can't figure out how to use it properly. Thank you!
install_featuretools() lacks checks to reliably inform the user if pip or virtualenv is missing.
If a user is missing virtualenv, the default message is good enough.
If a user has virtualenv but not pip, the created virtualenv gets bricked. This can (non-intuitively) be fixed by setting custom_virtualenv to true after installing pip.
Add checks to zzz.R.
Add support to install_featuretools() so that it does not create a virtualenv until pip exists.
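A minimal sketch of such a pre-flight check in base R, assuming it is enough to look for the executables on the PATH. The function name check_python_tools is hypothetical and is not part of featuretoolsR.

```r
# Hypothetical pre-flight check: look for pip and virtualenv on the PATH
# before attempting to create a virtualenv.
check_python_tools <- function(tools = c("pip", "virtualenv")) {
  # Sys.which() returns "" for executables it cannot find
  found <- nzchar(Sys.which(tools))
  names(found) <- tools
  found
}

status <- check_python_tools()
if (!all(status)) {
  message("Missing tools: ", paste(names(status)[!status], collapse = ", "))
}
```

A PATH lookup misses installations that are only reachable as a Python module (for example through python -m pip), so a real check would probably need to go through the Python interpreter that reticulate selects.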
When following the instructions in the README, under the 'Creating an EntitySet' heading, the following code results in an error:
library(featuretoolsR)
library(magrittr)
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, T), a = rep(Sys.Date(), 100))
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, T), b = rep(Sys.time(), 100))
es <- as_entityset(
set_1,
index = "key",
entity_id = "set_1",
id = "demo",
time_index = "a"
)
The error states:
Error in py_get_attr_impl(x, name, silent) :
AttributeError: 'EntitySet' object has no attribute 'entity_from_dataframe'
Session Info:
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magrittr_2.0.2 dplyr_1.0.8 foreign_0.8-81 featuretoolsR_0.4.4
loaded via a namespace (and not attached):
[1] reticulate_1.24 tidyselect_1.1.2 purrr_0.3.4 reshape2_1.4.4 listenv_0.8.0
[6] splines_4.1.2 lattice_0.20-45 colorspace_2.0-3 vctrs_0.3.8 generics_0.1.2
[11] stats4_4.1.2 utf8_1.2.2 survival_3.2-13 prodlim_2019.11.13 rlang_1.0.1
[16] ModelMetrics_1.2.2.2 pillar_1.7.0 glue_1.6.2 withr_2.4.3 rappdirs_0.3.3
[21] foreach_1.5.2 lifecycle_1.0.1 plyr_1.8.6 lava_1.6.10 stringr_1.4.0
[26] timeDate_3043.102 munsell_0.5.0 gtable_0.3.0 future_1.24.0 recipes_0.2.0
[31] codetools_0.2-18 caret_6.0-90 parallel_4.1.2 class_7.3-19 fansi_1.0.2
[36] Rcpp_1.0.8 scales_1.1.1 ipred_0.9-12 jsonlite_1.8.0 parallelly_1.30.0
[41] png_0.1-7 ggplot2_3.3.5 digest_0.6.29 stringi_1.7.6 rprojroot_2.0.2
[46] grid_4.1.2 here_1.0.1 hardhat_0.2.0 cli_3.2.0 tools_4.1.2
[51] tibble_3.1.6 crayon_1.5.0 future.apply_1.8.1 pkgconfig_2.0.3 ellipsis_0.3.2
[56] MASS_7.3-54 Matrix_1.3-4 data.table_1.14.2 pROC_1.18.0 lubridate_1.8.0
[61] gower_1.0.0 rstudioapi_0.13 iterators_1.0.14 R6_2.5.1 globals_0.14.0
[66] rpart_4.1-15 nnet_7.3-16 nlme_3.1-153 compiler_4.1.2
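The missing entity_from_dataframe attribute is consistent with a newer Featuretools release: as far as I know, Featuretools 1.0 replaced EntitySet.entity_from_dataframe with add_dataframe, which featuretoolsR 0.4.4 would not know about. A minimal sketch of a version guard, assuming the installed Featuretools version string has already been obtained; the helper name and the example version strings are hypothetical.

```r
# Hypothetical guard: TRUE when the given Featuretools version predates the
# assumed cutoff at which entity_from_dataframe was removed.
uses_legacy_entity_api <- function(version, cutoff = "1.0.0") {
  utils::compareVersion(version, cutoff) < 0
}

legacy <- uses_legacy_entity_api("0.27.0")  # a pre-1.0 version string
modern <- uses_legacy_entity_api("1.5.0")   # a post-1.0 version string
```

If the guard reports a post-1.0 version, pinning Featuretools to a pre-1.0 release in the Python environment may be the simplest way to get the README example running.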
When calling tidy_feature_matrix, variable names become very un-R-like.
Clean the variable names using regexes, something like:
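tidynames <- function(df) {
  n <- tolower(names(df))
  # Replace every character that is not a lowercase letter or digit with "_"
  # (the range A-z would also match punctuation between "Z" and "a")
  tn <- gsub("[^a-z0-9]", "_", n)
  # Collapse runs of underscores, then drop a trailing underscore
  tn <- gsub("_+", "_", tn)
  tn <- gsub("_$", "", tn)
  names(df) <- tn
  df
}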
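To see what this kind of cleaning does to Featuretools-style names, here is a small self-contained variant that works on a character vector. The function name clean_names and the input names are illustrative assumptions. Note that in a regex a range written as A-z also covers the punctuation between Z and a, so the sketch spells out a-z after lowercasing.

```r
# Hypothetical name cleaner operating directly on a character vector.
clean_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]", "_", x)  # anything but a-z and 0-9 becomes "_"
  x <- gsub("_+", "_", x)         # collapse runs of underscores
  gsub("_$", "", x)               # drop a trailing underscore
}

cleaned <- clean_names(c("key", "MODE(set_2.value)", "COUNT(set_2)"))
# cleaned is c("key", "mode_set_2_value", "count_set_2")
```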
It could be useful to include the related https://github.com/FeatureLabs/autonormalize in this package.
Currently the library can't be used until the R session is restarted after running install_featuretools().
Upon successful Featuretools installation, unload the package and restart the R session. Perhaps something like:
cat("Reloading featuretoolsR\n")
unloadNamespace("featuretoolsR")
.rs.restartR() -> .; rm(.)
library(featuretoolsR)
(Not sure if this is allowed by CRAN, should be checked first)
Hi,
I'm trying to run the demo and it stops at add_relationship.
library(magrittr)
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, TRUE), stringsAsFactors = TRUE)
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, TRUE), stringsAsFactors = TRUE)
as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo") %>%
add_entity(entity_id = "set_2", df = set_2, index = "key") %>%
add_relationship(
parent_set = "set_1",
child_set = "set_2",
parent_idx = "key",
child_idx = "key"
)
Error:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: Unable to add relationship because child variable 'key' in 'set_2' is also its index
I think it might be related to a new error message introduced in a June 2020 issue on featuretools.
I've played around with the code but have a hard time understanding how to fix this.
Is there a quick fix?
Thanks for your help!
Diagnostic info:
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
reticulate_1.19 featuretoolsR_0.4.4 magrittr_2.0.1
reticulate::py_discover_config()
python: C:/Anaconda3/envs/r-reticulate/python.exe
libpython: C:/Anaconda3/envs/r-reticulate/python36.dll
pythonhome: C:/Anaconda3/envs/r-reticulate
version: 3.6.13 (default, Feb 19 2021, 05:17:09) [MSC v.1916 64 bit (AMD64)]
Architecture: 64bit
numpy: C:/Anaconda3/envs/r-reticulate/Lib/site-packages/numpy
numpy_version: 1.19.5
python versions found:
C:/Anaconda3/envs/r-reticulate/python.exe
C:/Anaconda3/python.exe
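The ValueError above says the child variable 'key' is also the index of set_2. One possible fix, not confirmed in this thread, is to give the child table its own index column and keep key purely as the foreign key into set_1. The sketch below only prepares the data in base R; the column name id is an assumption, and the featuretoolsR calls are left as comments because they need Python.

```r
set.seed(1)

# Parent table: key is the unique index
set_1 <- data.frame(key = 1:10, value = sample(letters, 10, TRUE))

# Child table: id is its own unique index, key refers back to set_1
set_2 <- data.frame(
  id    = 1:30,
  key   = sample(set_1$key, 30, TRUE),
  value = sample(LETTERS, 30, TRUE)
)

# Intended calls (not run here; need featuretoolsR and Python):
# as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo") %>%
#   add_entity(entity_id = "set_2", df = set_2, index = "id") %>%
#   add_relationship(parent_set = "set_1", child_set = "set_2",
#                    parent_idx = "key", child_idx = "key")
```

With this layout the child's index (id) and its relationship column (key) are different variables, which is what the error message asks for.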
The Featuretools package in Python has the following companion package, which can be installed alongside it:
https://pypi.org/project/categorical-encoding/#:~:text=categorical%2Dencoding%20is%20a%20Python,within%20the%20machine%20learning%20pipeline.
Are there any plans, or is there an easier way, to support it via featuretoolsR?
> featuretoolsR::list_primitives()
name type
1 <environment: 0x000000002bfe8ef8> <environment: 0x000000002bcdad38>
2 <NA> <NA>
3 <NA> <NA>
... ... ...
62 <NA> <NA>
Warning message:
In format.data.frame(x, digits = digits, na.encode = FALSE) :
corrupt data frame: columns will be truncated or padded with NAs
Interestingly, everything works when I build this example with reprex::reprex(), but I can't get it to work in my RStudio console.
Hi there!
I just stumbled upon your package and I am incredibly happy someone made the effort to implement this. Thanks a lot!
I started out with your example and unfortunately, I encountered an error when creating a tidy_feature_matrix (the idea of which I absolutely love!)
# pacman::p_install_gh("magnusfurugard/featuretoolsR")
pacman::p_load(tidyverse, featuretoolsR)
# Create some mock data
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, T))
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, T))
# Create entityset
es <- as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo")
es <- es %>%
add_entity(
df = set_2,
entity_id = "set_2",
index = "key"
)
es <- es %>%
add_relationship(
set1 = "set_1",
set2 = "set_2",
idx = "key"
)
ft_matrix <- es %>%
dfs(
target_entity = "set_1",
trans_primitives = c("and", "divide")
)
tidy <- tidy_feature_matrix(ft_matrix)
tidy
Error:
# Error in py_call_impl(callable, dots$args, dots$keywords) : IndexError: index 100 is out of bounds for axis 0 with size 100
Through a traceback, I was able to narrow the problem down to the py_to_r function, which seems to have a problem with Python's 0-based indexing.
See here:
reticulate::py_to_r(ft_matrix[[1]])
Error:
# Error in py_call_impl(callable, dots$args, dots$keywords) : IndexError: index 100 is out of bounds for axis 0 with size 100
Again, thank you for making this available and I totally understand that a lot of this is probably work in progress. I am just glad someone did this :)
Best,
Fabio
I tried with different data and this seems to work fine. So it seems it has something to do with the example data, maybe?
ft <- reticulate::import("featuretools")
es = ft$demo$load_mock_customer(return_entityset=T)
ft_matrix <- es %>%
dfs(
target_entity = "customers",
trans_primitives = c("and", "divide")
)
tidy <- tidy_feature_matrix(ft_matrix)
tidy
Works just fine!
library(featuretoolsR)
library(magrittr)
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, T))
as_entityset(set_1, index = "key", entity_id = "set_1", id = "demo") %>%
dfs(
target_entity = "set_1",
trans_primitives = c("and", "divide")) %>%
tidy_feature_matrix(remove_nzv = T, nan_is_na = T)
> Removing near zero variance variables
C:\Users\srskr\ANACON~1\lib\site-packages\pandas\core\arrays\categorical.py:486:
FutureWarning: Index.itemsize is deprecated and will be removed in a future version
return self.categories.itemsize
Error in `[.python.builtin.object`(nondupe, , colname) :
unused argument (colname)
I'd like to plot an entity set, but after installing graphviz the plot function only returns digraph metadata, not an actual plot.
Running the example in the readme, I receive:
cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame
A number of primitives listed in list_primitives() throw an error:
Invalid transform primitive(s):
mean. Use list_primitives() to find valid primitives.
"Reticulate now always converts R Date objects into Python datetime objects. Note that these conversions can be inefficient -- if you would prefer conversion to NumPy datetime64 objects / arrays, you should convert your date to POSIXct first."
https://github.com/rstudio/reticulate/blob/master/NEWS.md
The next line takes a couple of seconds on my machine, even with a good CPU:
rep(as_date("2019-01-01"), 500) %>% reticulate::r_to_py()
I fixed it in my project with
r_tibble %>% mutate_if(is.Date, as.POSIXct) %>% reticulate::r_to_py()
You should apply this kind of conversion to every incoming R data.frame.
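The same conversion can be sketched in base R, without dplyr or lubridate, as a helper that could run on each data.frame before it is handed to reticulate. The helper name dates_to_posixct is hypothetical.

```r
# Hypothetical helper: convert Date columns to POSIXct so they are not
# converted element by element into Python datetime objects.
dates_to_posixct <- function(df) {
  is_date <- vapply(df, inherits, logical(1), what = "Date")
  if (any(is_date)) {
    df[is_date] <- lapply(df[is_date], as.POSIXct)
  }
  df
}

df <- data.frame(id = 1:3, day = as.Date("2019-01-01") + 0:2)
out <- dates_to_posixct(df)

# Intended follow-up (not run here; needs reticulate and Python):
# reticulate::r_to_py(out)
```

Only columns of class Date are touched; everything else passes through unchanged.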
I use dfs(cutoff_time_in_index = TRUE). I think this update, https://rstudio.github.io/reticulate/news/index.html#reticulate-1-9-cran ("Always call r_to_py S3 method when converting objects from Python to R"), broke my code that extracts the multi-index as new columns with dfs_result[[1]]$reset_index(). Is it possible to preserve these columns after featuretoolsR::dfs()?