
minerva-ml / steppy

134 stars · 31 forks · 141 KB

Lightweight Python library for fast and reproducible experimentation :microscope:

License: MIT License

Python 100.00%
data-science deep-learning image-processing machine-learning minimal-interface nlp open-source pipeline python python-library python3 reproducibility reproducible-research steppy steppy-library steppy-toolkit steps

steppy's People

Contributors

gitter-badger, jakubczakon, kamil-kaczmarek, kant


steppy's Issues

add 'save_output_sample'

In Step, add an option to:

  • persist a small (1% to 5% of the data) random sample of the output for browsing purposes,
  • persist it to a separate directory (not to the outputs directory).
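A minimal sketch of what this could look like (save_output_sample, its signature, and the dict-of-lists output format are assumptions, not existing API):

        import os
        import pickle
        import random

        def save_output_sample(output, sample_dirpath, name, fraction=0.05):
            """Persist a small random sample of a step's output for browsing.

            Hypothetical helper: assumes `output` is a dict of list-like
            objects of equal length, and samples `fraction` of the rows.
            """
            os.makedirs(sample_dirpath, exist_ok=True)
            n = len(next(iter(output.values())))
            idx = random.sample(range(n), max(1, int(fraction * n)))
            sample = {key: [values[i] for i in idx] for key, values in output.items()}
            with open(os.path.join(sample_dirpath, '{}_sample.pkl'.format(name)), 'wb') as f:
                pickle.dump(sample, f)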

Unintuitive adapter syntax

Current syntax for adapters has some peculiarities.
Consider the following example.

        step = Step(
            name='ensembler',
            transformer=Dummy(),
            input_data=['input_1'],
            adapter={'X': [('input_1', 'features')]},
            cache_dirpath='.cache'
        )

This step simply extracts one element of the input, so it seems redundant to have to write both the brackets and the parentheses. Writing
adapter={'X': ('input_1', 'features')},
should be sufficient.

Moreover, to my surprise,
adapter={'X': [('input_1', 'features'), ('input_2', 'extra_features')]},
is incorrect, and currently leads to
ValueError: too many values to unpack (expected 2)

My suggestions to make the syntax consistent are:

  1. adapter={'X': ('input_1', 'features')} should map X to the extracted features.
  2. adapter={'X': [...]} should map X to a list of extracted objects (specified by the elements of the list). In particular, adapter={'X': [('input_1', 'features')]} should map X to a one-element list containing the extracted features.
  3. adapter={'X': ([...], func)} should extract the appropriate objects into a list, call func on that list, and map X to the result of that call.
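A rough sketch of how these three rules could be resolved (a hypothetical helper, not the current implementation):

        def resolve_adapter_value(spec, inputs):
            """Hypothetical resolution of one adapter entry under the proposed rules.

            `inputs` maps input names to dicts of objects, e.g.
            {'input_1': {'features': ...}, 'input_2': {'extra_features': ...}}.
            """
            if isinstance(spec, tuple) and len(spec) == 2 and callable(spec[1]):
                extracted = [inputs[name][key] for name, key in spec[0]]
                return spec[1](extracted)            # rule 3: func applied to the list
            if isinstance(spec, list):
                return [inputs[name][key] for name, key in spec]   # rule 2: list
            name, key = spec                         # rule 1: single extracted object
            return inputs[name][key]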

tutorial: 2-multi-step.ipynb (adapters)

All notebooks should tell a single story, taking the user from beginner to advanced.

learning goals:

  • show an adapter that connects Steps
  • show how to merge multiple outputs into one using an adapter (see the sketch after this list)
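A candidate snippet for this learning goal. step_a and step_b are hypothetical upstream Steps whose outputs both contain a 'features' array, Dummy is the placeholder transformer from the adapter example above, and whether adapter entries are written with E or as plain tuples depends on the steppy version:

        import numpy as np
        from steppy.base import Step
        from steppy.adapter import Adapter, E

        # Hypothetical merge step: combines two upstream outputs into one input 'X'.
        merge_step = Step(
            name='merger',
            transformer=Dummy(),                    # placeholder, as in the example above
            input_steps=[step_a, step_b],
            adapter=Adapter({
                'X': ([E(step_a.name, 'features'),
                       E(step_b.name, 'features')],
                      np.hstack)                    # merge the extracted list into one array
            }),
            cache_dirpath='.cache')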

Maybe load_saved_input?

Hi,
I have a proposal: let's make it possible to dump the adapted input of a step to disk. This is very handy when you are working on the 5th or 10th step in a pipeline, one with 2, 3, or more input steps. Currently you have to set the flag load_saved_output=True on each of the input steps to be able to work on your beloved step. If you could just set load_saved_input=True (whether the input should be adapted or not is worth discussing) on the step you are currently working on, it would be much easier.
What do you think?
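In code, the difference would look roughly like this (load_saved_output exists today per the issue text; load_saved_input is only the proposed flag, and my_step is a placeholder):

        # current workaround: flag every input step individually
        for input_step in my_step.input_steps:
            input_step.load_saved_output = True

        # proposal (hypothetical flag): one switch on the step being developed
        my_step.load_saved_input = True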

tutorial: notebook 6 (Steps recipes)

All notebooks should tell a single story, taking the user from beginner to advanced.

goals:

  • show PyTorch callbacks implemented in steps/pytorch/callbacks.py
  • show ready-to-use architectures in steps/pytorch/architectures/
  • present 2-3 recipes -> ready-to-use pipelines
    • with PyTorch
    • with Keras

check shapes of inputs/outputs

check shapes of:

  • Transformer's input
  • Transformer's output
  • adapter input
  • adapter output

These inputs / outputs may be images, pandas DataFrames, NumPy arrays, Tensors, lists, tuples, etc.
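A sketch of a generic shape reporter such a check could build on (describe_shape is a hypothetical helper covering the container types listed above):

        import numpy as np
        import pandas as pd

        def describe_shape(obj):
            """Best-effort shape description for heterogeneous step inputs/outputs."""
            if hasattr(obj, 'shape'):              # NumPy arrays, DataFrames, tensors
                return tuple(obj.shape)
            if isinstance(obj, (list, tuple)):
                inner = describe_shape(obj[0]) if obj else None
                return (len(obj),) + (inner or ())
            return None                            # scalars and unknown types

        describe_shape(np.zeros((10, 3)))               # (10, 3)
        describe_shape(pd.DataFrame({'a': [1, 2]}))     # (2, 1)
        describe_shape([np.zeros(5), np.zeros(5)])      # (2, 5)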

transform() method is actually fit_transform() (almost)

Looking at this for loop: https://github.com/neptune-ml/steps/blob/dev/steps/base.py#L151.
If one has a pipeline, then calling transform() on the last Step of the pipeline differs only slightly from calling fit_transform(), because in both cases all input Steps (and their inputs, etc.) are fitted and then transform the data. The only difference is that the last Step itself is not fitted.
I think that calling transform() on the last Step of the pipeline should call the transform() methods of the input Steps, and raise an error informing the User if one of the input Steps is not saved in the ./transformers directory.
What do you think about this idea?
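A sketch of the proposed behaviour, using attribute names from the Step examples on this page (input_steps, cache_dirpath, transformer) and assuming a simple linear data flow; transformer.load() and the ./transformers layout follow the issue text:

        import os

        def transform_only(step, data):
            """Hypothetical: transform a pipeline without fitting any step.

            Assumes a linear flow of `data` through the input steps; a real
            implementation would have to honour adapters and multiple inputs.
            """
            for input_step in step.input_steps:
                data = transform_only(input_step, data)
            path = os.path.join(step.cache_dirpath, 'transformers', step.name)
            if not os.path.exists(path):
                raise RuntimeError('Step {}: no transformer persisted at {}; '
                                   'fit the pipeline first.'.format(step.name, path))
            step.transformer.load(path)
            return step.transformer.transform(data)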

API design

The overall goal of Steps is to be a lightweight library for ML-specific pipelines, so API design is crucial here.

add is_trainable flag

Add an is_trainable flag that would be set to True only for things we want to fit. That way we wouldn't need to create transformer files via touch in inference pipelines.
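A sketch of how the flag could short-circuit fitting (is_trainable as proposed above; the dispatch function is hypothetical):

        def fit_transform(step, data):
            """Hypothetical dispatch: only fit steps explicitly marked trainable."""
            if getattr(step, 'is_trainable', False):
                return step.transformer.fit_transform(data)
            # non-trainable: no fitting, no empty 'touched' transformer file needed
            return step.transformer.transform(data)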

difficult debugging

It is really hard to debug when an error is raised inside a Step instance. It would be great if steppy produced a stack trace that includes at least the name of the step inside which the error was raised.
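One possible approach is to re-raise with the step name attached (a sketch; transform_with_context and its wiring into Step are assumptions):

        def transform_with_context(step, data):
            """Hypothetical wrapper that tags exceptions with the failing step's name."""
            try:
                return step.transformer.transform(data)
            except Exception as e:
                # 'raise ... from e' keeps the original traceback attached
                raise RuntimeError("Error inside step '{}'".format(step.name)) from e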

tutorial: 1-getting-started.ipynb

All notebooks should tell a single story, taking the user from beginner to advanced.

learning goals:

  • understand Step and Transformer
  • how to: save transformer to disk (in the experiment directory)

technical details:

  • 2-3 Steps connected
    • the idea of fit(), transform() and fit_transform()
    • minimal Step interface
  • Transformer
    • save and load
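A candidate minimal example for the notebook, modeled on the Step construction shown earlier on this page (MeanImputer is illustrative and assumes a pandas DataFrame input):

        from steppy.base import Step, BaseTransformer

        class MeanImputer(BaseTransformer):
            """Illustrative transformer: fit() learns column means,
            transform() fills missing values with them."""
            def fit(self, X):
                self.means_ = X.mean()
                return self

            def transform(self, X):
                return {'X': X.fillna(self.means_)}

        step = Step(name='imputer',
                    transformer=MeanImputer(),
                    input_data=['input'],
                    cache_dirpath='.cache')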

cache outputs in memory

It would be great to have the option to cache step outputs in memory.
Right now, caching outputs by dumping them to a hard disk and then loading them back can take longer than simply running the same step all over again.

Implementation:

  • each step should have a flag indicating whether the User wants to cache outputs in memory,
    • if yes, the output from the transformer should be cached in memory, i.e. my_step.cache_output_in_mem = data
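A sketch of the lookup logic, using cache_output_in_mem from the proposal above (cached_transform and the cache_in_memory flag are hypothetical):

        def cached_transform(step, data):
            """Hypothetical: reuse the in-memory copy when present."""
            if getattr(step, 'cache_output_in_mem', None) is not None:
                return step.cache_output_in_mem       # no disk round-trip at all
            output = step.transformer.transform(data)
            if getattr(step, 'cache_in_memory', False):   # hypothetical user flag
                step.cache_output_in_mem = output
            return output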

steppy logger prints wrong data

In the experiment, the log reads:

2018-06-19 01:14:54 steppy >>> Step groupby_aggregations, adapting inputs...
2018-06-19 01:14:54 steppy >>> Step groupby_aggregations, loading transformer from the /mnt/ml-team/minerva/open-solutions/home-credit/kamil/testing_single/transformers/groupby_aggregations
2018-06-19 01:14:54 steppy >>> Step groupby_aggregations, transforming...
2018-06-19 01:22:34 steppy >>> Step groupby_aggregations, caching outputs to the /mnt/ml-team/minerva/open-solutions/home-credit/kamil/testing_single/cache/groupby_aggregations
2018-06-19 01:22:34 steppy >>> Step groupby_aggregations, persisting output to the /mnt/ml-team/minerva/open-solutions/home-credit/kamil/testing_single/cache/groupby_aggregations

Both the 'caching outputs' and 'persisting output' messages display the same directory.

refactor adapter.py

Problem: currently the User must write from steppy.adapter import Adapter, E in order to use adapters.

Refactor so that:

  • the User does not have to import E
  • add an example to the docstrings

The refactor should be comprehensive:

  • correct the code
  • correct tests
  • correct docstrings
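For reference, current usage versus one possible refactored form (the tuple-based form is a proposal, not existing API):

        # current: both Adapter and E must be imported
        from steppy.adapter import Adapter, E
        adapter = Adapter({'X': E('input_1', 'features')})

        # one possible refactor: E stays internal, plain tuples are accepted
        from steppy.adapter import Adapter
        adapter = Adapter({'X': ('input_1', 'features')})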

Do all Steps execute in parallel?

Is it necessary to divide the executions inside my class into separate Threads, or just divide them between Steps? For example, I could fit KNN and PCA in one class method and parallelize them there, or create two separate classes (and Steps) for them...

Concat features

How is it possible to create the following Step in the new version (using pandas_concat_inputs)?

        step = Step(
            transformer=GroupbyAggregationsFeatures(AGGREGATION_RECIPIES),
            input_steps=[df_step],
            input_data=['input'],
            adapter=Adapter({
                'X': ([('input', 'X'),
                       (df_step.name, 'X')],
                      pandas_concat_inputs)
            }),
            cache_dirpath=config.env.cache_dirpath)

tutorial: notebook 7 (Minerva best practices)

All notebooks should tell a single story, taking the user from beginner to advanced.

goals:

  • show Minerva best practices to write clean code and make experimentation reproducible:
    • tooling (@click etc)
    • neptune
    • neptune.yaml

How to evaluate each step only once?

I have the following structure of steps (screenshot below). The problem is that many steps are called more than once, which makes training very slow. Is it possible to simplify this somehow?
More precisely, how can I optimize this part? I would like to compute input_missing just once.
[screenshot: selection_105, showing the steps structure]

Do we really need a transformer to be cached to perform transform()?

The situation is currently as follows:
if I want to run Step.transform(data), the transformer of this step needs to be cached in the experiment directory.
I don't think this is necessary if the transformer is not fittable, as in postprocessing or preprocessing steps.
I would suggest that if a transformer has a non-trivial fit() method, it should have to be cached to perform transform(); otherwise it doesn't need to be cached, because the cached file contains no actual data.
Now, when I have the pipeline:
preprocessing step -> neural network step -> postprocessing step
I need to run postprocessing on all my training data (to create all the necessary transformers), which is redundant and often time consuming, OR I can manually create a trivial transformer file. I think this should be simplified.
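A sketch of the suggested rule (the is_fittable marker and the Binarizer transformer are hypothetical illustrations):

        import numpy as np
        from steppy.base import BaseTransformer

        class Binarizer(BaseTransformer):
            """Illustrative non-fittable transformer: fit() is a no-op."""
            is_fittable = False          # hypothetical marker the library could check

            def transform(self, y_pred):
                return {'y_pred': (np.asarray(y_pred) > 0.5).astype(int)}

        # Hypothetical dispatch inside Step.transform():
        #   fittable and not persisted  -> raise an informative error
        #   not fittable                -> call transform() directly, no cache file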

implement make_transformer()

This function / feature should enable users to wrap any function into our Transformer, with fit() and transform() methods.

(sklearn/func -> transformer conversion)

It should accept sklearn transformers, regressors, and classifiers,
as well as a plain function (a single def).
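A sketch of what make_transformer() could look like for the plain-function case (the implementation below is an assumption; wrapping sklearn estimators would follow the same pattern):

        from steppy.base import BaseTransformer

        def make_transformer(func):
            """Hypothetical: wrap a plain function into a Transformer.

            fit() is a no-op; transform() applies `func` to its inputs.
            """
            class FuncTransformer(BaseTransformer):
                def fit(self, *args, **kwargs):
                    return self

                def transform(self, *args, **kwargs):
                    return func(*args, **kwargs)

            return FuncTransformer()

        # usage: doubler = make_transformer(lambda X: {'X': X * 2})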
