I don't know whether it would fit within this package, or somewhere else. But would it

checking if variable is being used as an outcome and predictor (hardhat issue, closed)

tidymodels commented on August 25, 2024

checking if variable is being used as an outcome and predictor

Comments (2)

DavisVaughan commented on August 25, 2024 1

This sounds useful! model.matrix() actually checks for this automatically: it throws a warning and drops the duplicated predictor, but only if it is exactly the same as the outcome, meaning log(Sepal.Width) would not count as a duplicate. I think this is rather aggressive.

I don't think I would put this in mold(), as I wouldn't call this a "required" check, but I want hardhat to have a number of extra optional validate_***() functions that developers can use, and this seems like one of them.

Below is one version of a validate function for this. It uses the original column names and checks for duplicates, so Sepal.Width ~ Sepal.Width and Sepal.Width ~ log(Sepal.Width) will both be flagged as having duplicates. There could also be a version that works more like model.matrix() and checks that the processed training data does not have duplicates (so log(Sepal.Width) would be treated as different from Sepal.Width).

library(hardhat)

.formula <- Sepal.Width ~ Sepal.Width

# //////////////////////////////////////////////////////////////////////////////

# mold() allows the same variable on both sides of the formula
x <- mold(.formula, iris)

x$predictors
#> # A tibble: 150 x 1
#>    Sepal.Width
#>          <dbl>
#>  1         3.5
#>  2         3  
#>  3         3.2
#>  4         3.1
#>  5         3.6
#>  6         3.9
#>  7         3.4
#>  8         3.4
#>  9         2.9
#> 10         3.1
#> # … with 140 more rows

x$outcomes
#> # A tibble: 150 x 1
#>    Sepal.Width
#>          <dbl>
#>  1         3.5
#>  2         3  
#>  3         3.2
#>  4         3.1
#>  5         3.6
#>  6         3.9
#>  7         3.4
#>  8         3.4
#>  9         2.9
#> 10         3.1
#> # … with 140 more rows

# //////////////////////////////////////////////////////////////////////////////

# a warning is thrown here
mf <- model.frame(.formula, iris)
head(model.matrix(terms(mf), mf))
#> Warning in model.matrix.default(terms(mf), mf): the response appeared on
#> the right-hand side and was dropped
#> Warning in model.matrix.default(terms(mf), mf): problem with term 1 in
#> model.matrix: no columns are assigned
#>   (Intercept)
#> 1           1
#> 2           1
#> 3           1
#> 4           1
#> 5           1
#> 6           1

# //////////////////////////////////////////////////////////////////////////////

# the original predictor and outcome names are stored in the preprocessor
x$preprocessor$predictors$names
#> [1] "Sepal.Width"
x$preprocessor$outcomes$names
#> [1] "Sepal.Width"

# //////////////////////////////////////////////////////////////////////////////

validate_lhs_rhs_duplication <- function(preprocessor) {
  
  if (!inherits(preprocessor, "terms_preprocessor")) {
    return(preprocessor)
  }
  
  original_predictor_names <- preprocessor$predictors$names
  original_outcome_names <- preprocessor$outcomes$names
  
  dups <- intersect(original_predictor_names, original_outcome_names)
  
  if (length(dups) > 0) {
    
    dups <- glue::glue_collapse(glue::single_quote(dups), ", ")
    
    rlang::abort(glue::glue(
      "The supplied `formula` cannot have the same term ",
      "as both an outcome and a predictor. The following terms ",
      "appear on both sides of the formula: {dups}."
    ))
  }
  
  invisible(preprocessor)
}

validate_lhs_rhs_duplication(x$preprocessor)
#> Error: The supplied `formula` cannot have the same term as both an outcome and a predictor. The following terms appear on both sides of the formula: 'Sepal.Width'.
#> Backtrace:
#>     █
#>  1. └─global::validate_lhs_rhs_duplication(x$preprocessor)

Created on 2019-02-16 by the reprex package (v0.2.1.9000)
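
For comparison, here is a rough sketch of the other variant mentioned above, the one that works more like model.matrix() and compares the processed column names rather than the original ones. The function name validate_no_processed_duplication() is hypothetical and not part of hardhat, and the small data frames are stand-ins for the processed predictors and outcomes (something like x$predictors and x$outcomes from mold()). It only uses base R.

```r
# Hypothetical helper (not part of hardhat): flags columns whose processed
# names appear in both the predictors and the outcomes.
validate_no_processed_duplication <- function(predictors, outcomes) {
  dups <- intersect(colnames(predictors), colnames(outcomes))

  if (length(dups) > 0) {
    stop(
      "The following columns appear in both the processed predictors ",
      "and the processed outcomes: ",
      paste0("'", dups, "'", collapse = ", "),
      ".",
      call. = FALSE
    )
  }

  invisible(TRUE)
}

# Stand-ins for processed data (assumption: column names mirror the formula terms)
outcomes <- data.frame(Sepal.Width = iris$Sepal.Width)
same <- data.frame(Sepal.Width = iris$Sepal.Width)
logged <- data.frame(log(iris$Sepal.Width))
colnames(logged) <- "log(Sepal.Width)"

# A transformed predictor has a different processed name, so this passes
validate_no_processed_duplication(logged, outcomes)

# An identical processed name on both sides is flagged; capture the message
result <- tryCatch(
  validate_no_processed_duplication(same, outcomes),
  error = function(e) conditionMessage(e)
)
```

With this variant Sepal.Width ~ log(Sepal.Width) is not flagged, which is the main behavioral difference from the version based on the original column names.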

github-actions commented on August 25, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
