Comments (4)
I think the ideal solution would be to drive what is used (and therefore expected in the data) from the data itself, using schema-like structures like the following:
```python
{
    'start': {},
    'target': {'shape': ()},
    'feat_dynamic_real': {'shape': (1,)},
    'feat_static_cat': {'shape': (3,), 'cardinality': [4, 5, 6]},
}
```
This could be used, among other things, to configure the transformation chain: the keys in such a dictionary tell you which fields are expected to be in the data. Using this schema-like dictionary, estimators can do many things:
- They can assume a minimal schema, `{'start': {}, 'target': {'shape': ()}}`, unless a different one is specified. This would pretty much amount to the current behaviour, with the difference that the user would be able to specify everything about the data in one single object (instead of potentially 4 flags and 2 cardinalities).
- Alternatively, we could decide to infer such a schema from the training data as soon as training is triggered.
- Given such a schema, one can use it to validate a `DataEntry` or a whole `Dataset`.
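As a concrete illustration of that last point, validation against the schema could look roughly like the sketch below. Nothing here is an existing GluonTS API: `SCHEMA`, `validate_entry`, and the leading-shape convention (schema shapes constrain the leading dimensions, leaving a trailing time axis free) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical schema, mirroring the dictionary proposed above.
SCHEMA = {
    "start": {},
    "target": {"shape": ()},
    "feat_dynamic_real": {"shape": (1,)},
    "feat_static_cat": {"shape": (3,), "cardinality": [4, 5, 6]},
}

def validate_entry(entry: dict, schema: dict) -> None:
    """Raise a descriptive ValueError if `entry` does not match `schema`."""
    missing = set(schema) - set(entry)
    if missing:
        raise ValueError(f"entry is missing required fields: {sorted(missing)}")
    for field, spec in schema.items():
        if "shape" in spec:
            value = np.asarray(entry[field])
            expected = tuple(spec["shape"])
            # Leading dims must match the schema; the trailing (time)
            # axis, if any, is left unconstrained.
            if value.shape[: len(expected)] != expected:
                raise ValueError(
                    f"field {field!r}: expected leading shape {expected}, "
                    f"got shape {value.shape}"
                )
        if "cardinality" in spec:
            cat = np.asarray(entry[field], dtype=np.int64)
            card = np.asarray(spec["cardinality"])
            if np.any(cat < 0) or np.any(cat >= card):
                raise ValueError(
                    f"field {field!r}: values {cat.tolist()} fall outside "
                    f"the declared cardinalities {spec['cardinality']}"
                )

entry = {
    "start": "2020-01-01",
    "target": np.zeros(24),
    "feat_dynamic_real": np.zeros((1, 24)),
    "feat_static_cat": np.array([0, 1, 2]),
}
validate_entry(entry, SCHEMA)  # passes silently
```

Validating a whole `Dataset` would then just be a loop of `validate_entry` over its entries.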
Constructing such a schema from the training data would require a full pass through the dataset: not only looking at which fields are there, but also finding the maximum of each categorical feature (to get the cardinality of its domain). But this doesn't seem too bad to me.
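The single-pass inference described above could be sketched as follows (again hypothetical: `infer_schema` does not exist in GluonTS; only the field name `feat_static_cat` follows the library's conventions):

```python
import numpy as np

def infer_schema(dataset) -> dict:
    """One full pass over an iterable of DataEntry dicts: record which
    fields occur, and track the elementwise maximum of the categorical
    features to recover the cardinality of each categorical domain."""
    schema = {}
    max_cat = None
    for entry in dataset:
        for field in entry:
            schema.setdefault(field, {})
        if "feat_static_cat" in entry:
            cat = np.asarray(entry["feat_static_cat"], dtype=np.int64)
            max_cat = cat if max_cat is None else np.maximum(max_cat, cat)
    if max_cat is not None:
        # Categories are assumed 0-indexed, so cardinality = observed max + 1.
        schema["feat_static_cat"]["cardinality"] = [int(m) + 1 for m in max_cat]
    return schema
```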
There are some structures in the codebase that aim at something similar, I think (cf. `MetaData`). I'm working on a POC for this; I'll send it around when I'm satisfied with it :-)
from gluonts.
Which case is the priority for us right now: running smoothly even if the data is not properly formatted, or running with correct options and succeeding only when the data is consistent?
I think in the case of GluonTS, we aspire to make a scientific library. Thus, the algorithms should fail if there are issues in the data; that informs the user that something is not right. Otherwise, you are left wondering why your results are not as good as you expected, especially if something is silently not used, discarded, or filtered. I think this behavior should be avoided throughout the code base.
from gluonts.
I have started looking at this issue over the last two days; it is a combination of addressing the input-format question and defining the correct transformation behaviour. I agree with Michael that we should not do things silently: if something is wrong, we should throw an error instead of trying to filter it internally. However, this opens more questions:
- Should we check that all the fields in a dataset are correct (probably while creating windows) and throw an error if not? This adds some complexity, since the check needs to be applied to each created window.
- What should we do with custom fields in a dataset, or with fields that are not used by the model, especially ones that can break the code, e.g. #94 (note that the fix is not global but only for DeepAR; any other estimator can fail with the same issue)?
- Setting aside DeepAR and looking at the bigger picture, what should be the behaviour of all estimators regarding the input data? Should they always take into account (or at least have the option to take into account) a field that appears in the dataset, or should they use only prespecified fields regardless of the input data, as we have been doing up to now?
For the `cardinality` question, I think inferring it from the data in an efficient way would be ideal, but is probably not possible. I think an informative error message would do the job, something like: "You are using categorical features but you have not set the cardinality hyperparameter correctly". For the flags on DeepAR I have exactly the same opinion (the default should be to use the feature, since people usually do not know about these values or do not bother to change them).
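A fail-fast check along the lines suggested here could look roughly like this. The parameter names mirror DeepAR's `use_feat_static_cat`/`cardinality` hyperparameters, but the helper itself is hypothetical, not part of GluonTS:

```python
def check_feat_static_cat_config(use_feat_static_cat: bool, cardinality) -> None:
    """Fail loudly on an inconsistent configuration instead of silently
    discarding the categorical features (hypothetical helper)."""
    if use_feat_static_cat and not cardinality:
        raise ValueError(
            "You are using categorical features but you have not set the "
            "cardinality hyperparameter correctly."
        )
    if not use_feat_static_cat and cardinality:
        raise ValueError(
            "cardinality was given, but use_feat_static_cat is False; "
            "the categorical features would be silently ignored."
        )
```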
from gluonts.
@lostella Can you please confirm whether you were able to complete the POC?
Your solution would make using DeepAR much easier, provided we input the data in the correct format, without having to worry about setting multiple flags. It would also be good if we could log which dynamic and categorical features are being used by the model.
from gluonts.