Comments (7)
Thank you 🙏 As soon as I have some time, I will test it and update Merlin Models' code accordingly.
from core.
Having thought about this a bit, I'm hesitant to remove accurate metadata and I think we generally want to preserve as much information as we can about the columns.
In a case like this, the target really is categorical, and for most (all?) cases I can think of the target should either be categorical or continuous. It seems like we're implicitly relying on the CATEGORICAL and CONTINUOUS tags only being applied to features though, as if the CATEGORICAL/CONTINUOUS and TARGET tags were mutually exclusive. That runs somewhat counter to the philosophy of tags as small independent units of meaning that can be used individually or in combination to query the schema for matching columns.
So two proposals for how to provide the functionality you're looking for on the models side:
- Perhaps we can update the Schema selection API to make it easier to exclude columns by tag, so that you can do something like this to retrieve only the features:
schema.select_by_tag(Tags.CATEGORICAL).excluding(Tags.TARGET)
- Alternatively, or in combination with the above, maybe we could introduce a FEATURE tag that's mutually exclusive with TARGET (in the same way that CATEGORICAL and CONTINUOUS can't be applied simultaneously), which would allow selecting the columns you want in this fashion:
schema.select_by_tag([Tags.CATEGORICAL, Tags.FEATURE])
Would either (or both) of those meet the need driving this feature request?
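To make the first proposal concrete, here is a minimal, self-contained sketch of "select by tag, then exclude by tag". It does not use the real merlin classes: `Tag`, `ToySchema`, and the toy `excluding` method are hypothetical stand-ins written only for illustration, and the column names are borrowed from the reproduction later in this thread.

```python
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    # Toy stand-in for merlin.schema.Tags (hypothetical, not the real enum).
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"
    TARGET = "target"


@dataclass(frozen=True)
class ToySchema:
    # Maps each column name to the set of tags applied to that column.
    columns: dict

    def select_by_tag(self, tag):
        # Keep only the columns that carry `tag`.
        return ToySchema({n: t for n, t in self.columns.items() if tag in t})

    def excluding(self, tag):
        # Keep only the columns that do NOT carry `tag`.
        return ToySchema({n: t for n, t in self.columns.items() if tag not in t})

    @property
    def column_names(self):
        return list(self.columns)


schema = ToySchema(
    {
        "Author": {Tag.CATEGORICAL},
        "Engaging User": {Tag.CATEGORICAL},
        # The target keeps its accurate CATEGORICAL tag alongside TARGET.
        "Post": {Tag.CATEGORICAL, Tag.TARGET},
    }
)

feature_names = schema.select_by_tag(Tag.CATEGORICAL).excluding(Tag.TARGET).column_names
print(feature_names)
```

The point of the sketch is that the target column never has to lose its CATEGORICAL tag: the caller narrows the selection at query time instead.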
from core.
@sararb We've added the excluding() method, so this should now work with the latest version of core:
schema.select_by_tag(Tags.CATEGORICAL).excluding(Tags.TARGET)
from core.
I tested the proposed solution using the following: self.schema.select_by_tag(Tags.CATEGORICAL).excluding(Tags.TARGET).column_names
But I got the following error:
    def excluding(self, selector) -> "Schema":
        """Select non-matching columns from this Schema object using a ColumnSelector

        Parameters
        ----------
        selector : ColumnSelector
            Selector that describes which columns match

        Returns
        -------
        Schema
            New object containing only the ColumnSchemas of selected columns
        """
        schema = self

        if selector is not None:
>           if selector.names:
E           AttributeError: 'Tags' object has no attribute 'names'
It seems we need to check the type of the selector here to decide whether to use its names or its tags property.
Here is a minimal example that reproduces the error:
import nvtabular as nvt
from merlin.io import Dataset
import pandas as pd
from merlin.schema import Schema, Tags

df = pd.DataFrame(
    {
        "Author": ["User_A", "User_E", "User_B", "User_C"],
        "Engaging User": ["User_B", "User_B", "User_A", "User_D"],
        "Post": [1, 2, 3, 4],
    }
)

# The nested list groups Author and Engaging User for joint encoding
cat_names = ["Post", ["Author", "Engaging User"]]
cats = cat_names >> nvt.ops.Categorify(encode_type='joint')
# Tag the encoded Post column as the target
target = cats['Post'] >> nvt.ops.AddMetadata(tags=Tags.TARGET)

workflow = nvt.Workflow(cats+target)
df_transform = workflow.fit_transform(Dataset(df))

# Raises AttributeError: 'Tags' object has no attribute 'names'
df_transform.schema.select_by_tag(Tags.CATEGORICAL).excluding(Tags.TARGET)
from core.
Ah, right. I steered you wrong on how to actually use the API we implemented. You can do either:
self.schema.select_by_tag(Tags.CATEGORICAL).excluding(ColumnSelector(tags=Tags.TARGET)).column_names
(a bit clunky, but useful when you already have a selector)
or:
self.schema.select_by_tag(Tags.CATEGORICAL).excluding_by_tag(Tags.TARGET).column_names
from core.
Maybe we could add a check in excluding that detects whether a column selector or a tag is being passed in, and then handles each case accordingly?
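One way such a check could look is sketched below. This is not the merlin implementation: `Tag`, `ToySelector`, and `ToySchema` are hypothetical stand-ins, and the sketch only assumes that a selector exposes `names` and `tags` attributes, as the traceback above suggests. A bare tag is wrapped in a selector first, so the rest of the method has a single code path.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tag(Enum):
    # Toy stand-in for merlin.schema.Tags (hypothetical).
    CATEGORICAL = "categorical"
    TARGET = "target"


@dataclass
class ToySelector:
    # Toy stand-in for a column selector with `names` and `tags` attributes.
    names: list = field(default_factory=list)
    tags: list = field(default_factory=list)


@dataclass
class ToySchema:
    columns: dict  # column name -> set of Tag

    def excluding(self, selector):
        if selector is None:
            return self
        # Accept a bare tag by normalising it into a selector.
        if isinstance(selector, Tag):
            selector = ToySelector(tags=[selector])
        elif not isinstance(selector, ToySelector):
            raise TypeError(
                f"expected a selector or a tag, got {type(selector).__name__}"
            )
        dropped_names = set(selector.names)
        dropped_tags = set(selector.tags)
        # Keep columns matched neither by name nor by any excluded tag.
        kept = {
            name: tags
            for name, tags in self.columns.items()
            if name not in dropped_names and not (tags & dropped_tags)
        }
        return ToySchema(kept)


schema = ToySchema(
    {
        "Author": {Tag.CATEGORICAL},
        "Post": {Tag.CATEGORICAL, Tag.TARGET},
    }
)

by_tag = schema.excluding(Tag.TARGET)
by_selector = schema.excluding(ToySelector(tags=[Tag.TARGET]))
print(list(by_tag.columns), list(by_selector.columns))
```

With this shape, both call styles discussed above give the same result, and an unsupported argument fails with a clear TypeError instead of an AttributeError deep inside the method.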
from core.
Sure, go for it!
from core.