This sounds useful! model.matrix() actually checks for this automatically: it throws a warning and drops the duplicated predictor, but only if the predictor is exactly the same as the outcome, meaning log(Sepal.Width) would not count as a duplicate. I think this is rather aggressive.
I don't think I would put this in mold(), as I wouldn't call this a "required" check, but I want hardhat to have a number of extra optional validate_***() functions that developers can use, and this seems like one of them.
Below is one version of a validate function for this. It uses the original column names and checks for duplicates, so Sepal.Width ~ Sepal.Width and Sepal.Width ~ log(Sepal.Width) will both be flagged as having duplicates. There could also be a version that works more like model.matrix() and checks that the processed training data does not have duplicates (so log(Sepal.Width) would look different than Sepal.Width).
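As a rough illustration of that second, model.matrix()-like variant, here is a minimal sketch that compares the column names of the processed predictors and outcomes instead of the original ones. The helper name `validate_processed_duplication()` is hypothetical and not part of hardhat, and comparing by column name (rather than by value) is an assumption about what "duplicate" should mean here. It uses only base R, so the processed data is built by hand rather than taken from mold().

```r
# Hypothetical helper, not part of hardhat: flags columns whose *processed*
# names appear in both the predictors and the outcomes.
validate_processed_duplication <- function(predictors, outcomes) {
  dups <- intersect(colnames(predictors), colnames(outcomes))

  if (length(dups) > 0) {
    stop(
      "The following processed columns appear as both an outcome ",
      "and a predictor: ",
      paste0("'", dups, "'", collapse = ", "), ".",
      call. = FALSE
    )
  }

  invisible(TRUE)
}

outcomes <- iris["Sepal.Width"]

# Processed predictors for `Sepal.Width ~ Sepal.Width`
predictors_same <- iris["Sepal.Width"]

# Processed predictors for `Sepal.Width ~ log(Sepal.Width)`, built by hand
predictors_log <- data.frame(
  `log(Sepal.Width)` = log(iris$Sepal.Width),
  check.names = FALSE
)

# Passes silently: the processed column name differs from the outcome name
validate_processed_duplication(predictors_log, outcomes)

# Errors: the same processed column is on both sides
res <- tryCatch(
  validate_processed_duplication(predictors_same, outcomes),
  error = function(e) conditionMessage(e)
)
res
```

With this variant, log(Sepal.Width) is treated as a different column from Sepal.Width, which is the opposite of the behavior of the original-names version shown in the reprex.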
library(hardhat)
.formula <- Sepal.Width ~ Sepal.Width
# //////////////////////////////////////////////////////////////////////////////
# mold() allows the same column on both sides of the formula
x <- mold(.formula, iris)
x$predictors
#> # A tibble: 150 x 1
#>    Sepal.Width
#>          <dbl>
#>  1         3.5
#>  2         3
#>  3         3.2
#>  4         3.1
#>  5         3.6
#>  6         3.9
#>  7         3.4
#>  8         3.4
#>  9         2.9
#> 10         3.1
#> # … with 140 more rows
x$outcomes
#> # A tibble: 150 x 1
#>    Sepal.Width
#>          <dbl>
#>  1         3.5
#>  2         3
#>  3         3.2
#>  4         3.1
#>  5         3.6
#>  6         3.9
#>  7         3.4
#>  8         3.4
#>  9         2.9
#> 10         3.1
#> # … with 140 more rows
# //////////////////////////////////////////////////////////////////////////////
# model.matrix() warns and drops the duplicated predictor
mf <- model.frame(.formula, iris)
head(model.matrix(terms(mf), mf))
#> Warning in model.matrix.default(terms(mf), mf): the response appeared on
#> the right-hand side and was dropped
#> Warning in model.matrix.default(terms(mf), mf): problem with term 1 in
#> model.matrix: no columns are assigned
#>   (Intercept)
#> 1           1
#> 2           1
#> 3           1
#> 4           1
#> 5           1
#> 6           1
# //////////////////////////////////////////////////////////////////////////////
# the original column names are stored on the preprocessor
x$preprocessor$predictors$names
#> [1] "Sepal.Width"
x$preprocessor$outcomes$names
#> [1] "Sepal.Width"
# //////////////////////////////////////////////////////////////////////////////
# abort if an original column name appears on both sides of the formula
validate_lhs_rhs_duplication <- function(preprocessor) {

  # nothing to check for other preprocessor types
  if (!inherits(preprocessor, "terms_preprocessor")) {
    return(preprocessor)
  }

  original_predictor_names <- preprocessor$predictors$names
  original_outcome_names <- preprocessor$outcomes$names

  dups <- intersect(original_predictor_names, original_outcome_names)

  if (length(dups) > 0) {
    # format as 'a', 'b', ... for the error message
    dups <- glue::glue_collapse(glue::single_quote(dups), ", ")

    rlang::abort(glue::glue(
      "The supplied `formula` cannot have the same term ",
      "as both an outcome and a predictor. The following terms ",
      "appear on both sides of the formula: {dups}."
    ))
  }

  invisible(preprocessor)
}
validate_lhs_rhs_duplication(x$preprocessor)
#> Error: The supplied `formula` cannot have the same term as both an outcome and a predictor. The following terms appear on both sides of the formula: 'Sepal.Width'.
#> Backtrace:
#> █
#> 1. └─global::validate_lhs_rhs_duplication(x$preprocessor)
Created on 2019-02-16 by the reprex package (v0.2.1.9000)
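To see why the original-names check treats Sepal.Width ~ log(Sepal.Width) as a duplicate, the same idea can be approximated in base R with all.vars(), which returns the variable names used on each side of a formula regardless of any inline functions. This is only a sketch: `find_lhs_rhs_duplicates()` is a hypothetical helper, and hardhat's own bookkeeping of original names may differ (for instance, a `.` on the right-hand side is not expanded here).

```r
# Hypothetical helper, not part of hardhat: returns the variable names that
# appear on both sides of a two-sided formula.
find_lhs_rhs_duplicates <- function(formula) {
  stopifnot(inherits(formula, "formula"), length(formula) == 3L)

  lhs_vars <- all.vars(formula[[2]])
  rhs_vars <- all.vars(formula[[3]])

  intersect(lhs_vars, rhs_vars)
}

find_lhs_rhs_duplicates(Sepal.Width ~ Sepal.Width)
#> [1] "Sepal.Width"

# log() is stripped away, so the underlying variable is still a duplicate
find_lhs_rhs_duplicates(Sepal.Width ~ log(Sepal.Width))
#> [1] "Sepal.Width"

find_lhs_rhs_duplicates(Sepal.Length ~ log(Sepal.Width))
#> character(0)
```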