
minerva-ml / steppy

134 stars · 31 forks · 141 KB

Lightweight Python library for fast and reproducible experimentation :microscope:

License: MIT License

Python 100.00%
data-science deep-learning image-processing machine-learning minimal-interface nlp open-source pipeline python python-library python3 reproducibility reproducible-research steppy steppy-library steppy-toolkit steps

steppy's People

Contributors

gitter-badger, jakubczakon, kamil-kaczmarek, kant


steppy's Issues

add 'save_output_sample'

In Step, add an option to:

  • persist a small (1% to 5% of the data) random sample of the output for browsing purposes,
  • persist it to a separate directory (not to the outputs directory).
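A minimal sketch of what this could look like (save_output_sample, its signature, and the dict-of-lists output format are assumptions, not existing API):

        import os
        import pickle
        import random

        def save_output_sample(output, sample_dirpath, name, fraction=0.05):
            """Persist a small random sample of a step's output for browsing.

            Hypothetical helper: assumes `output` is a dict of list-like
            objects of equal length, and samples `fraction` of the rows.
            """
            os.makedirs(sample_dirpath, exist_ok=True)
            n = len(next(iter(output.values())))
            idx = random.sample(range(n), max(1, int(fraction * n)))
            sample = {key: [values[i] for i in idx] for key, values in output.items()}
            with open(os.path.join(sample_dirpath, '{}_sample.pkl'.format(name)), 'wb') as f:
                pickle.dump(sample, f)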

Unintuitive adapter syntax

Current syntax for adapters has some peculiarities.
Consider the following example.

        step = Step(
            name='ensembler',
            transformer=Dummy(),
            input_data=['input_1'],
            adapter={'X': [('input_1', 'features')]},
            cache_dirpath='.cache'
        )

This step simply extracts one element of the input, so it seems redundant to have to write both the brackets and the parentheses. Writing
adapter={'X': ('input_1', 'features')},
should be sufficient.

Moreover, to my surprise,
adapter={'X': [('input_1', 'features'), ('input_2', 'extra_features')]},
is incorrect, and currently leads to
ValueError: too many values to unpack (expected 2)

My suggestions to make the syntax consistent are:

  1. adapter={'X': ('input_1', 'features')} should map X to the extracted features.
  2. adapter={'X': [...]} should map X to a list of extracted objects (specified by the elements of the list). In particular, adapter={'X': [('input_1', 'features')]} should map X to a one-element list containing the extracted features.
  3. adapter={'X': ([...], func)} should extract the appropriate objects into a list, call func on that list, and map X to the result of that call.
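A rough sketch of how these three rules could be resolved (a hypothetical helper, not the current implementation):

        def resolve_adapter_value(spec, inputs):
            """Hypothetical resolution of one adapter entry under the proposed rules.

            `inputs` maps input names to dicts of objects, e.g.
            {'input_1': {'features': ...}, 'input_2': {'extra_features': ...}}.
            """
            if isinstance(spec, tuple) and len(spec) == 2 and callable(spec[1]):
                extracted = [inputs[name][key] for name, key in spec[0]]
                return spec[1](extracted)            # rule 3: func applied to the list
            if isinstance(spec, list):
                return [inputs[name][key] for name, key in spec]   # rule 2: list
            name, key = spec                         # rule 1: single extracted object
            return inputs[name][key]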

tutorial: 2-multi-step.ipynb (adapters)

All notebooks should tell a single story, taking the user from beginner to advanced.

learning goals:

  • show an adapter that connects Steps
  • show how to merge multiple outputs into one using an adapter (see the sketch after this list)
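A candidate snippet for this learning goal. step_a and step_b are hypothetical upstream Steps whose outputs both contain a 'features' array, Dummy is the placeholder transformer from the adapter example above, and whether adapter entries are written with E or as plain tuples depends on the steppy version:

        import numpy as np
        from steppy.base import Step
        from steppy.adapter import Adapter, E

        # Hypothetical merge step: combines two upstream outputs into one input 'X'.
        merge_step = Step(
            name='merger',
            transformer=Dummy(),                    # placeholder, as in the example above
            input_steps=[step_a, step_b],
            adapter=Adapter({
                'X': ([E(step_a.name, 'features'),
                       E(step_b.name, 'features')],
                      np.hstack)                    # merge the extracted list into one array
            }),
            cache_dirpath='.cache')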

Maybe load_saved_input?

Hi,
I have a proposal: let's make it possible to dump the adapted input of a step to disk. This is very handy when you are working on the 5th or 10th step in a pipeline, one with 2, 3, or more input steps. Currently you have to set the flag load_saved_output=True on each of the input steps to be able to work on your beloved step. If you could just set load_saved_input=True (whether the input should be adapted or not is worth discussing) on the step you are currently working on, it would be much easier.
What do you think?
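In code, the difference would look roughly like this (load_saved_output exists today per the issue text; load_saved_input is only the proposed flag, and my_step is a placeholder):

        # current workaround: flag every input step individually
        for input_step in my_step.input_steps:
            input_step.load_saved_output = True

        # proposal (hypothetical flag): one switch on the step being developed
        my_step.load_saved_input = True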

tutorial: notebook 6 (Steps recipes)

All notebooks should tell a single story, taking the user from beginner to advanced.

goals:

  • show PyTorch callbacks implemented in steps/pytorch/callbacks.py
  • show ready-to-use architectures in steps/pytorch/architectures/
  • present 2-3 recipes -> ready-to-use pipelines
    • with PyTorch
    • with Keras

check shapes of inputs/outputs

check shapes of:

  • Transformer's input
  • Transformer's output
  • adapter input
  • adapter output

These inputs / outputs may be images, pandas DataFrames, NumPy arrays, Tensors, lists, tuples, etc.
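A sketch of a generic shape reporter such a check could build on (describe_shape is a hypothetical helper covering the container types listed above):

        import numpy as np
        import pandas as pd

        def describe_shape(obj):
            """Best-effort shape description for heterogeneous step inputs/outputs."""
            if hasattr(obj, 'shape'):              # NumPy arrays, DataFrames, tensors
                return tuple(obj.shape)
            if isinstance(obj, (list, tuple)):
                inner = describe_shape(obj[0]) if obj else None
                return (len(obj),) + (inner or ())
            return None                            # scalars and unknown types

        describe_shape(np.zeros((10, 3)))               # (10, 3)
        describe_shape(pd.DataFrame({'a': [1, 2]}))     # (2, 1)
        describe_shape([np.zeros(5), np.zeros(5)])      # (2, 5)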

transform() method is actually fit_transform() (almost)

Looking at this for loop: https://github.com/neptune-ml/steps/blob/dev/steps/base.py#L151.
If one has a pipeline, then calling transform() on the last Step of the pipeline differs only slightly from calling fit_transform(), because in both cases all input Steps (and their inputs, etc.) are fitted and then transform the data. The only difference is that the last Step itself is not fitted.
I think that calling transform() on the last Step of the pipeline should call the transform() methods of the input Steps, and raise an error informing the User if one of the input Steps is not saved in the ./transformers directory.
What do you think about this idea?
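A sketch of the proposed behaviour, using attribute names from the Step examples on this page (input_steps, cache_dirpath, transformer) and assuming a simple linear data flow; transformer.load() and the ./transformers layout follow the issue text:

        import os

        def transform_only(step, data):
            """Hypothetical: transform a pipeline without fitting any step.

            Assumes a linear flow of `data` through the input steps; a real
            implementation would have to honour adapters and multiple inputs.
            """
            for input_step in step.input_steps:
                data = transform_only(input_step, data)
            path = os.path.join(step.cache_dirpath, 'transformers', step.name)
            if not os.path.exists(path):
                raise RuntimeError('Step {}: no transformer persisted at {}; '
                                   'fit the pipeline first.'.format(step.name, path))
            step.transformer.load(path)
            return step.transformer.transform(data)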

API design

The overall goal of Steps is to be a lightweight library for ML-specific pipelines, so API design is crucial here.

add is_trainable flag

Add an is_trainable flag that would be set to True only for things we want to fit. That way we wouldn't need to create transformer files via touch in inference pipelines.
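A sketch of how the flag could short-circuit fitting (is_trainable as proposed above; the dispatch function is hypothetical):

        def fit_transform(step, data):
            """Hypothetical dispatch: only fit steps explicitly marked trainable."""
            if getattr(step, 'is_trainable', False):
                return step.transformer.fit_transform(data)
            # non-trainable: no fitting, no empty 'touched' transformer file needed
            return step.transformer.transform(data)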

difficult debugging

It is really hard to debug when an error is raised inside a Step instance. It would be great if steppy produced a stack trace that includes at least the name of the step inside which the error was raised.
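One possible approach is to re-raise with the step name attached (a sketch; transform_with_context and its wiring into Step are assumptions):

        def transform_with_context(step, data):
            """Hypothetical wrapper that tags exceptions with the failing step's name."""
            try:
                return step.transformer.transform(data)
            except Exception as e:
                # 'raise ... from e' keeps the original traceback attached
                raise RuntimeError("Error inside step '{}'".format(step.name)) from e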

tutorial: 1-getting-started.ipynb

All notebooks should tell a single story, taking the user from beginner to advanced.

learning goals:

  • understand Step and Transformer
  • how to: save transformer to disk (in the experiment directory)

technical details:

  • 2-3 Steps connected
    • the idea of fit(), transform() and fit_transform()
    • minimal Step interface
  • Transformer
    • save and load
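A candidate minimal example for the notebook, modeled on the Step construction shown earlier on this page (MeanImputer is illustrative and assumes a pandas DataFrame input):

        from steppy.base import Step, BaseTransformer

        class MeanImputer(BaseTransformer):
            """Illustrative transformer: fit() learns column means,
            transform() fills missing values with them."""
            def fit(self, X):
                self.means_ = X.mean()
                return self

            def transform(self, X):
                return {'X': X.fillna(self.means_)}

        step = Step(name='imputer',
                    transformer=MeanImputer(),
                    input_data=['input'],
                    cache_dirpath='.cache')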

cache outputs in memory

It would be great to have the option to cache step outputs in memory.
Right now, caching outputs by dumping them to a hard disk and then loading them back can take longer than simply running the same step all over again.

Implementation:

  • each step should have a flag indicating whether the User wants to cache outputs in memory,
    • if yes, the output from the transformer should be cached in memory, i.e. my_step.cache_output_in_mem = data
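A sketch of the lookup logic, using cache_output_in_mem from the proposal above (cached_transform and the cache_in_memory flag are hypothetical):

        def cached_transform(step, data):
            """Hypothetical: reuse the in-memory copy when present."""
            if getattr(step, 'cache_output_in_mem', None) is not None:
                return step.cache_output_in_mem       # no disk round-trip at all
            output = step.transformer.transform(data)
            if getattr(step, 'cache_in_memory', False):   # hypothetical user flag
                step.cache_output_in_mem = output
            return output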

steppy logger prints wrong data

In the experiment, the log reads:

2018-06-19 01:14:54 steppy >>> Step groupby_aggregations, adapting inputs...
2018-06-19 01:14:54 steppy >>> Step groupby_aggregations, loading transformer from the /mnt/ml-team/minerva/open-solutions/home-credit/kamil/testing_single/transformers/groupby_aggregations
2018-06-19 01:14:54 steppy >>> Step groupby_aggregations, transforming...
2018-06-19 01:22:34 steppy >>> Step groupby_aggregations, caching outputs to the /mnt/ml-team/minerva/open-solutions/home-credit/kamil/testing_single/cache/groupby_aggregations
2018-06-19 01:22:34 steppy >>> Step groupby_aggregations, persisting output to the /mnt/ml-team/minerva/open-solutions/home-credit/kamil/testing_single/cache/groupby_aggregations

Both the 'caching outputs' and 'persisting output' messages display the same directory.

refactor adapter.py

Problem: currently the User must write from steppy.adapter import Adapter, E in order to use adapters.

Refactor so that:

  • the User does not have to import E
  • add an example to the docstrings

The refactor should be comprehensive:

  • correct the code
  • correct tests
  • correct docstrings
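For reference, current usage versus one possible refactored form (the tuple-based form is a proposal, not existing API):

        # current: both Adapter and E must be imported
        from steppy.adapter import Adapter, E
        adapter = Adapter({'X': E('input_1', 'features')})

        # one possible refactor: E stays internal, plain tuples are accepted
        from steppy.adapter import Adapter
        adapter = Adapter({'X': ('input_1', 'features')})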

Do all Steps execute in parallel?

Is it necessary to divide the executions inside my class into separate Threads, or just divide them between Steps? For example, I could fit KNN and PCA in one class method and parallelize them there, or create two separate classes (and Steps) for them...

Concat features

How is it possible to create the following Step in the new version (using pandas_concat_inputs)?

        step = Step(
            transformer=GroupbyAggregationsFeatures(AGGREGATION_RECIPIES),
            input_steps=[df_step],
            input_data=['input'],
            adapter=Adapter({
                'X': ([('input', 'X'),
                       (df_step.name, 'X')],
                      pandas_concat_inputs)
            }),
            cache_dirpath=config.env.cache_dirpath)

tutorial: notebook 7 (Minerva best practices)

All notebooks should tell a single story, taking the user from beginner to advanced.

goals:

  • show Minerva best practices to write clean code and make experimentation reproducible:
    • tooling (@click etc)
    • neptune
    • neptune.yaml

How to evaluate each step only once?

I have the following structure of steps (screenshot below). The problem is that many steps are called more than once, which makes training very slow. Is it possible to simplify this somehow?
More precisely, how can I optimize this part? I would like to compute input_missing just once.
[screenshot: selection_105, showing the steps structure]

Do we really need a transformer to be cached to perform transform()?

The situation is currently as follows:
if I want to run Step.transform(data), the transformer of this step needs to be cached in the experiment directory.
I don't think this is necessary if the transformer is not fittable, as in postprocessing or preprocessing steps.
I would suggest that if a transformer has a non-trivial fit() method, it should have to be cached to perform transform(); otherwise it doesn't need to be cached, because the cached file contains no actual data.
Now, when I have the pipeline:
preprocessing step -> neural network step -> postprocessing step
I need to run postprocessing on all my training data (to create all the necessary transformers), which is redundant and often time consuming, OR I can manually create a trivial transformer file. I think this should be simplified.
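A sketch of the suggested rule (the is_fittable marker and the Binarizer transformer are hypothetical illustrations):

        import numpy as np
        from steppy.base import BaseTransformer

        class Binarizer(BaseTransformer):
            """Illustrative non-fittable transformer: fit() is a no-op."""
            is_fittable = False          # hypothetical marker the library could check

            def transform(self, y_pred):
                return {'y_pred': (np.asarray(y_pred) > 0.5).astype(int)}

        # Hypothetical dispatch inside Step.transform():
        #   fittable and not persisted  -> raise an informative error
        #   not fittable                -> call transform() directly, no cache file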

implement make_transformer()

This function / feature should enable users to wrap any function into our Transformer, with fit() and transform() methods.

(sklearn/func -> transformer conversion)

It should accept sklearn transformers, regressors, and classifiers,
as well as a plain function (a single def).
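A sketch of what make_transformer() could look like for the plain-function case (the implementation below is an assumption; wrapping sklearn estimators would follow the same pattern):

        from steppy.base import BaseTransformer

        def make_transformer(func):
            """Hypothetical: wrap a plain function into a Transformer.

            fit() is a no-op; transform() applies `func` to its inputs.
            """
            class FuncTransformer(BaseTransformer):
                def fit(self, *args, **kwargs):
                    return self

                def transform(self, *args, **kwargs):
                    return func(*args, **kwargs)

            return FuncTransformer()

        # usage: doubler = make_transformer(lambda X: {'X': X * 2})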
