
Comments (4)

lostella commented on May 22, 2024

I think the ideal solution would be to drive what is being used from the data (and therefore expected in the data) using schema-like structures like the following:

{
    'start': {},
    'target': {'shape': ()},
    'feat_dynamic_real': {'shape': (1,)},
    'feat_static_cat': {'shape': (3,), 'cardinality': [4, 5, 6]}
}

This could be used, among other things, to configure the transformation chain: the keys in such a dictionary tell you which fields are expected to be in the data. Using this schema-like dictionary, estimators can do many things:

  • They can assume a minimal schema {'start': {}, 'target': {'shape': ()}} unless a different one is specified; this would pretty much amount to the current behaviour, with the difference that the user would be able to specify everything about the data in one single object (instead of potentially 4 flags and 2 cardinalities)
  • Or, we could decide to infer such a schema from the training data as soon as training is triggered.
  • Given such a schema, one can use it to validate a DataEntry or a whole Dataset.
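As a rough sketch of the validation idea (not actual GluonTS code; the schema format follows the dict proposed above, and only the cardinality semantics are checked for brevity), validating a single data entry could look like:

```python
import numpy as np

# Hypothetical schema in the format proposed above.
SCHEMA = {
    "start": {},
    "target": {"shape": ()},
    "feat_static_cat": {"shape": (3,), "cardinality": [4, 5, 6]},
}

def validate_entry(entry: dict, schema: dict) -> None:
    """Raise ValueError if `entry` is missing a declared field, or if a
    categorical feature falls outside its declared domain."""
    for field, spec in schema.items():
        if field not in entry:
            raise ValueError(f"missing field: {field!r}")
        cardinality = spec.get("cardinality")
        if cardinality is not None:
            values = np.asarray(entry[field], dtype=int)
            if values.shape != (len(cardinality),):
                raise ValueError(
                    f"{field!r}: expected {len(cardinality)} values, "
                    f"got shape {values.shape}"
                )
            for i, (v, c) in enumerate(zip(values, cardinality)):
                if not 0 <= v < c:
                    raise ValueError(
                        f"{field!r}[{i}] = {v} is outside the domain [0, {c})"
                    )
```

Validating a whole Dataset would then just be a loop over its entries, failing fast on the first inconsistent one.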

Constructing such a schema from the training data would require a full pass through the dataset, not only looking at which fields are there, but also looking for the maximum of all categorical features (to get the cardinality of their domain). But this doesn't seem too bad to me.
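The full-pass inference described above could be sketched like this (hypothetical code, assuming the field names from the dict above; cardinality is taken as max observed value + 1):

```python
import numpy as np

def infer_schema(dataset):
    """One pass over the dataset: record which fields occur, and track
    the element-wise maximum of the static categorical features."""
    fields = set()
    max_cat = None
    for entry in dataset:
        fields.update(entry)
        if "feat_static_cat" in entry:
            cat = np.asarray(entry["feat_static_cat"], dtype=int)
            max_cat = cat if max_cat is None else np.maximum(max_cat, cat)
    schema = {field: {} for field in fields}
    if max_cat is not None:
        schema["feat_static_cat"] = {
            "shape": (len(max_cat),),
            # domain of each feature is [0, max + 1)
            "cardinality": (max_cat + 1).tolist(),
        }
    return schema
```

Note this only gives a lower bound on the true cardinalities: test data may still contain category values never seen during training.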

There are some structures in the codebase that aim at something similar, I think (cf. MetaData). I'm working on a POC for this; I'll send it around when I'm satisfied with it :-)


mbohlkeschneider commented on May 22, 2024

Which case is the priority for us right now: running smoothly even if the data is not properly formatted, or running with correct options that succeeds only when the data is consistent?

In the case of GluonTS, we aspire to make a scientific library. Thus, I think the algorithms should fail if there are issues in the data; that informs the user that something is not right. Otherwise, you are left wondering why your results are not as good as you expected, especially if something is silently not used, discarded, or filtered. I think this behaviour should be avoided throughout the code base.


benidis commented on May 22, 2024

I have started looking at this issue over the last two days, and it is a combination of addressing the input format question and defining the correct transformation behaviour. I agree with Michael that we should not do things silently: if something is wrong we should throw an error instead of trying to filter it internally. However, this opens more questions:

  1. Should we check if all the fields are correct in a dataset (probably while creating windows) and throw an error if not? This adds some complexity since it needs to be applied at each created window.
  2. What should we do with custom fields in a dataset, or with fields that are not used by the model, especially the ones that can break the code, e.g. #94 (note that the fix is not global but only for DeepAR; any other estimator can fail with the same issue)?
  3. Setting aside DeepAR and looking at the bigger picture, what should be the behaviour of all estimators regarding the input data? Should they always take into account (or at least have the option to do so) a field that appears in the dataset, or should they use only prespecified fields regardless of the input data, as we were doing up to now?

For the cardinality question, I think inferring it from the data in an efficient way is ideal but probably not possible. I think an informative error message would do the job, something like: "You are using categorical features but you have not set the cardinality hyperparameter correctly". For the flags part of DeepAR I have exactly the same opinion (the default should be to use the feature, since people usually do not know, or do not bother, to change these values).
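The kind of informative check suggested here could be sketched as follows (hypothetical function and parameter names, not an actual GluonTS API):

```python
def check_cardinality(entry: dict, cardinality) -> None:
    """Fail early, with an actionable message, when categorical features
    and the `cardinality` hyperparameter are inconsistent."""
    n = len(entry.get("feat_static_cat", []))
    if n > 0 and cardinality is None:
        raise ValueError(
            "You are using categorical features but you have not set the "
            f"cardinality hyperparameter; expected a list of {n} cardinalities."
        )
    if cardinality is not None and len(cardinality) != n:
        raise ValueError(
            f"cardinality has length {len(cardinality)} but the data "
            f"has {n} categorical features."
        )
```

The point is that the estimator fails at construction/training time with a message naming the hyperparameter, instead of silently ignoring the features or crashing deep inside the network.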


sujayramaiah commented on May 22, 2024

@lostella Can you please confirm whether you were able to complete the POC?
Your solution would make using DeepAR much easier, provided we input the data in the correct format, without having to worry about setting multiple flags. It would also be good if we could log which dynamic features and categorical features are being used by the model.

