Comments (7)

sararb commented on June 18, 2024

Thank you 🙏 As soon as I get some time, I will test it and update the Merlin Models code accordingly.

from core.

karlhigley commented on June 18, 2024

Having thought about this a bit, I'm hesitant to remove accurate metadata and I think we generally want to preserve as much information as we can about the columns.

In a case like this, the target really is categorical, and for most (all?) cases I can think of the target should either be categorical or continuous. It seems like we're implicitly relying on the CATEGORICAL and CONTINUOUS tags only being applied to features though, as if the CATEGORICAL/CONTINUOUS and TARGET tags were mutually exclusive. That runs somewhat counter to the philosophy of tags as small independent units of meaning that can be used individually or in combination to query the schema for matching columns.

So two proposals for how to provide the functionality you're looking for on the models side:

  • Perhaps we can update the Schema selection API to make it easier to exclude columns by tag, so that you can do something like this to retrieve only the features:
schema.select_by_tag(Tags.CATEGORICAL).excluding(Tags.TARGET)
  • Alternately or in combination with the above, maybe we could introduce a FEATURE tag that's mutually exclusive with TARGET (in the same way that CATEGORICAL and CONTINUOUS can't be applied simultaneously), which would allow selecting the columns you want in this fashion:
schema.select_by_tag([Tags.CATEGORICAL, Tags.FEATURE])

Would either (or both) of those meet the need driving this feature request?
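
To make the first proposal concrete, here is a minimal sketch of the intended semantics using plain Python rather than Merlin itself. The `Tag` enum, `toy_schema` dict, and both helper functions are hypothetical stand-ins invented for illustration; they are not the `merlin.schema` implementation.

```python
from enum import Enum


class Tag(Enum):
    """Hypothetical stand-in for merlin.schema.Tags."""
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"
    TARGET = "target"


# Toy schema (an assumption for this sketch): column name -> set of tags.
toy_schema = {
    "Author": {Tag.CATEGORICAL},
    "Post": {Tag.CATEGORICAL, Tag.TARGET},  # a categorical target
    "Price": {Tag.CONTINUOUS},
}


def select_by_tag(schema, tag):
    """Keep only the columns that carry the given tag."""
    return {name: tags for name, tags in schema.items() if tag in tags}


def excluding_by_tag(schema, tag):
    """Keep only the columns that do NOT carry the given tag."""
    return {name: tags for name, tags in schema.items() if tag not in tags}


# Categorical columns, minus anything tagged as a target.
categorical_features = excluding_by_tag(
    select_by_tag(toy_schema, Tag.CATEGORICAL), Tag.TARGET
)
print(sorted(categorical_features))  # ['Author']
```

The point of the sketch is that "Post" keeps both of its tags; the target is filtered out at query time instead of having its CATEGORICAL metadata removed.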

karlhigley commented on June 18, 2024

@sararb We've added the excluding() method, so this should now work with the latest version of core:

schema.select_by_tag(Tags.CATEGORICAL).excluding(Tags.TARGET)

sararb commented on June 18, 2024

I tested the proposed solution using the following: self.schema.select_by_tag(Tags.CATEGORICAL).excluding(Tags.TARGET).column_names

But I got the following error:

def excluding(self, selector) -> "Schema":
        """Select non-matching columns from this Schema object using a ColumnSelector
    
        Parameters
        ----------
        selector : ColumnSelector
            Selector that describes which columns match
    
        Returns
        -------
        Schema
            New object containing only the ColumnSchemas of selected columns
    
        """
        schema = self
        if selector is not None:
>           if selector.names:
E           AttributeError: 'Tags' object has no attribute 'names'

It seems we need to check the type of the selector here to decide whether to use its names or tags property.

Here is a minimal script to reproduce the error:

import pandas as pd

import nvtabular as nvt
from merlin.io import Dataset
from merlin.schema import Schema, Tags

df = pd.DataFrame(
    {
        "Author": ["User_A", "User_E", "User_B", "User_C"],
        "Engaging User": ["User_B", "User_B", "User_A", "User_D"],
        "Post": [1, 2, 3, 4],
    }
)

# Categorify "Post" alone and the Author / Engaging User pair with a joint encoding.
cat_names = ["Post", ["Author", "Engaging User"]]
cats = cat_names >> nvt.ops.Categorify(encode_type="joint")

# Tag the encoded "Post" column as the target.
target = cats["Post"] >> nvt.ops.AddMetadata(tags=Tags.TARGET)

workflow = nvt.Workflow(cats + target)
df_transform = workflow.fit_transform(Dataset(df))

# This call raises: AttributeError: 'Tags' object has no attribute 'names'
df_transform.schema.select_by_tag(Tags.CATEGORICAL).excluding(Tags.TARGET)

karlhigley commented on June 18, 2024

Ah, right. I have steered you wrong on how to actually use the API we implemented. You can do either:

self.schema.select_by_tag(Tags.CATEGORICAL).excluding(ColumnSelector(tags=Tags.TARGET)).column_names

(a bit clunky, but useful when you already have a selector)

or:

self.schema.select_by_tag(Tags.CATEGORICAL).excluding_by_tag(Tags.TARGET).column_names

marcromeyn commented on June 18, 2024

Maybe we could add a check in excluding where we check whether a column-selector or tag is being passed in, and then process these accordingly?
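
A rough sketch of what such a check might look like, written against toy classes so it runs on its own. `Tag`, `ToySelector`, and `ToySchema` are hypothetical names made up for this example; the real `Schema.excluding` in core may be structured differently.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tag(Enum):
    """Hypothetical stand-in for merlin.schema.Tags."""
    CATEGORICAL = "categorical"
    TARGET = "target"


@dataclass
class ToySelector:
    """Hypothetical stand-in for a column selector with names and tags."""
    names: list = field(default_factory=list)
    tags: list = field(default_factory=list)


@dataclass
class ToySchema:
    columns: dict  # column name -> set of Tag

    def excluding(self, selector):
        """Drop columns matched by a selector, a single tag, or a list of tags."""
        # Normalize bare tags into a selector so the rest of the method
        # only has to deal with one type.
        if isinstance(selector, Tag):
            selector = ToySelector(tags=[selector])
        elif isinstance(selector, (list, tuple)) and all(
            isinstance(t, Tag) for t in selector
        ):
            selector = ToySelector(tags=list(selector))

        if selector is None:
            return self

        kept = {
            name: tags
            for name, tags in self.columns.items()
            if name not in selector.names and not tags.intersection(selector.tags)
        }
        return ToySchema(kept)


schema = ToySchema(
    {"Author": {Tag.CATEGORICAL}, "Post": {Tag.CATEGORICAL, Tag.TARGET}}
)
print(list(schema.excluding(Tag.TARGET).columns))  # ['Author']
print(list(schema.excluding(ToySelector(tags=[Tag.TARGET])).columns))  # ['Author']
```

Normalizing the argument up front keeps the filtering logic in one place, so passing a bare tag and passing a selector built from that tag give the same result.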

karlhigley commented on June 18, 2024

Sure, go for it!
