minerva-ml / steppy
Lightweight, Python library for fast and reproducible experimentation :microscope:
License: MIT License
All notebooks should show a single story, from beginner to advanced user.
goals:
@jakubczakon: on a simple example, describe how Steps works.
In Step
add option to:
outputs
The cross_entropy function (https://github.com/neptune-ml/steps/blob/dev/steps/pytorch/validation.py#L24) should probably be defined in a slightly different way:
import torch.nn.functional as F

def cross_entropy(output, target, squeeze=False):
    if squeeze:
        target = target.squeeze(1)
    # pass dim explicitly; log_softmax without dim is deprecated
    return F.nll_loss(F.log_softmax(output, dim=1), target)
Please verify.
Having to call step.clean_cache()
is error-prone. Ideally, we should have automatic cache invalidation.
Current syntax for adapters has some peculiarities.
Consider the following example.
step = Step(
name='ensembler',
transformer=Dummy(),
input_data=['input_1'],
adapter={'X': [('input_1', 'features')]},
cache_dirpath='.cache'
)
This step basically extracts one element of the input. It seems redundant to write brackets and parentheses. Doing
adapter={'X': ('input_1', 'features')},
should be sufficient.
Moreover, to my surprise,
adapter={'X': [('input_1', 'features'), ('input_2', 'extra_features')]},
is incorrect, and currently leads to
ValueError: too many values to unpack (expected 2)
My suggestions to make the syntax consistent are:
- adapter={'X': ('input_1', 'features')} should map X to the extracted features.
- adapter={'X': [...]} should map X to a list of extracted objects (specified by the elements of the list). In particular, adapter={'X': [('input_1', 'features')]} should map X to a one-element list with the extracted features.
- adapter={'X': ([...], func)} should extract the appropriate objects and put them in a list, then func should be called on that list, and X should map to the result of that call.

All notebooks should show a single story, from beginner to advanced user.
goal:
All notebooks should show a single story, from beginner to advanced user.
learning goals:
adapter that connects Steps

Research and benchmark similar tools for building pipelines.
Hi,
I have a proposal: let's make it possible to dump the adapted input of a step to disk. It's very handy when you are working on the 5th or 10th step in a pipeline that has 2, 3 or more input steps. Now you have to set the flag load_saved_output=True
on each of the input steps to be able to work on your beloved step. If you could just set load_saved_input=True
(adapted or not adapted, I think it's worth discussing) on the step you are currently working on, it would be much easier.
What do you think?
All notebooks should show a single story, from beginner to advanced user.
goals:
steps/pytorch/callbacks.py
steps/pytorch/architectures/
I think this line is unnecessary: https://github.com/neptune-ml/steps/blob/dev/steps/pytorch/models.py#L55 (and previous line 54 as well)
check shapes of:
these inputs / outputs may be images, Pandas DataFrames, Numpy arrays, Tensors, lists, tuples, etc.
Looking at this for loop: https://github.com/neptune-ml/steps/blob/dev/steps/base.py#L151.
If one has a pipeline, then calling transform() on the last Step of the pipeline differs only slightly from calling fit_transform(), because in both cases all input Steps (and their inputs, etc.) are fitted and then transform the data. The only difference is that the last Step itself is not fitted.
I think that calling transform() on the last Step of the pipeline should call the transform() methods of the input Steps, and yield an error or inform the User if one of the input Steps is not saved in the ./transformers directory.
What do you think about this idea?
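A rough sketch of the proposed behaviour, using a toy Step stand-in (the attribute names are assumptions for illustration, not steppy's real interface):

```python
import os
import tempfile

class Step:
    """Minimal stand-in for steppy's Step, just enough to illustrate the
    proposal; attribute names are assumptions, not steppy's real API."""
    def __init__(self, name, transform_fn, input_steps=(), transformer_dir='.'):
        self.name = name
        self.transform_fn = transform_fn
        self.input_steps = list(input_steps)
        self.transformer_dir = transformer_dir

    def transform(self, data):
        # Proposed behaviour: recurse with transform() only, never fit.
        for upstream in self.input_steps:
            data = upstream.transform(data)
        # Fail loudly if this step's transformer was never persisted.
        path = os.path.join(self.transformer_dir, self.name)
        if not os.path.exists(path):
            raise RuntimeError(
                'Transformer for step "{}" not found in {}; '
                'fit the pipeline first.'.format(self.name, self.transformer_dir))
        return self.transform_fn(data)
```

This keeps pure-inference runs from silently refitting anything, and the error message tells the User exactly which step is missing its transformer.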
The overall goal of Steps is to be a lightweight library for ML-specific pipelines, hence API design is crucial here.
Add an is_trainable flag that would be set to True only for things we want to fit. By doing so we wouldn't need to create transformers via touch in inference pipelines.
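A toy sketch of how the flag could work (hypothetical API, not steppy's current one):

```python
class Step:
    """Toy sketch of the proposed is_trainable flag; not steppy's real API."""
    def __init__(self, transformer, is_trainable=True):
        self.transformer = transformer
        self.is_trainable = is_trainable

    def fit_transform(self, data):
        # Only trainable steps get fitted; pure-function steps are skipped,
        # so no placeholder transformer file has to be touch-ed for inference.
        if self.is_trainable:
            self.transformer.fit(data)
        return self.transformer.transform(data)
```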
It is really hard to debug when an error is raised inside a Step instance. It would be great if steppy returned a stack trace with at least the name of the step inside which the error was raised.
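One simple way to get this is to wrap each step call and re-raise with the step's name attached; a hedged sketch, where run_step is a hypothetical helper rather than anything steppy provides:

```python
def run_step(step_name, func, *args, **kwargs):
    """Wrap a step's fit/transform call so the traceback names the
    failing step. The original exception is chained via `from`."""
    try:
        return func(*args, **kwargs)
    except Exception as exc:
        raise RuntimeError(
            'Error inside step "{}": {}'.format(step_name, exc)) from exc
```

Because the original exception is chained, the full stack trace is preserved while the step name appears at the top of the error output.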
This loss could be stored somewhere (e.g. in Model or Transformer?) and then loaded when other callbacks ask for it.
blending
stacking
https://github.com/kaz-Anova/StackNet
Currently the UNet model in its forward() method returns a single tensor, so len(tensor) is the batch size. On the other hand, we have an assertion in: https://github.com/neptune-ml/steps/blob/dev/steps/pytorch/models.py#L92, so when batch_size > 1 this assertion throws an error when it should not.
All notebooks should show a single story, from beginner to advanced user.
learning goals:
technical details:
fit(), transform() and fit_transform()
All notebooks should show a single story, from beginner to advanced user.
goals:
It would be great to have the possibility to cache step outputs in memory.
Right now, caching outputs by dumping them onto a hard disk and then loading them back can take longer than simply running the same step all over again.
Implementation:
my_step.cache_output_in_mem = data
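A minimal sketch of what in-memory caching could look like; MemoryCachedStep and the exact flag semantics are assumptions for illustration:

```python
class MemoryCachedStep:
    """Sketch of in-memory output caching for a step (hypothetical API).
    The first computed output is kept in RAM and returned on later calls,
    skipping both recomputation and disk round-trips."""
    def __init__(self, name, transform_fn, cache_output_in_mem=False):
        self.name = name
        self.transform_fn = transform_fn
        self.cache_output_in_mem = cache_output_in_mem
        self._cached = None

    def transform(self, data):
        if self.cache_output_in_mem and self._cached is not None:
            return self._cached  # serve from memory, no disk I/O
        output = self.transform_fn(data)
        if self.cache_output_in_mem:
            self._cached = output
        return output
```

Note this caches the output of the first call regardless of input, which matches the "rerun the same step on the same data" use case; keying the cache by input would be a further refinement.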
in the experiment:
2018-06-19 01:14:54 steppy >>> Step groupby_aggregations, adapting inputs...
2018-06-19 01:14:54 steppy >>> Step groupby_aggregations, loading transformer from the /mnt/ml-team/minerva/open-solutions/home-credit/kamil/testing_single/transformers/groupby_aggregations
2018-06-19 01:14:54 steppy >>> Step groupby_aggregations, transforming...
2018-06-19 01:22:34 steppy >>> Step groupby_aggregations, caching outputs to the /mnt/ml-team/minerva/open-solutions/home-credit/kamil/testing_single/cache/groupby_aggregations
2018-06-19 01:22:34 steppy >>> Step groupby_aggregations, persisting output to the /mnt/ml-team/minerva/open-solutions/home-credit/kamil/testing_single/cache/groupby_aggregations
Both caching outputs and persisting output display the same directory.
Problem: currently the User must do from steppy.adapter import Adapter, E in order to use adapters.
Refactor so that:
E
The refactor should be comprehensive, so that:
Is it necessary to divide the executions inside my class into separate Threads, or just divide them between Steps? For example, I want to fit KNN and PCA: should I parallelize them in one class method, or create two separate classes for them...
How is it possible to do the following Step in the new version (using pandas_concat_inputs)?:
transformer=GroupbyAggregationsFeatures(AGGREGATION_RECIPIES),
input_steps=[df_step],
input_data=['input'],
adapter=Adapter({
'X': ([('input', 'X'),
(df_step.name, 'X')],
pandas_concat_inputs)
}),
cache_dirpath=config.env.cache_dirpath)
All notebooks should show a single story, from beginner to advanced user.
goals:
@click etc.

Right now it's possible to define an adapter which isn't compatible with the lists given as input steps and input data. This means that we often end up with graphs that don't really represent what actually happens.
Consider 3 steps A, B, C, connected like this: A -> B, B -> C, A -> C. Say we fit this pipeline by calling C.fit_transform(...). If A is initialized with the force_fitting option, then its fit_transform method will be called twice, which is undesirable behavior. Even when force_fitting is False, transform is going to be called twice, which might require reconsideration.
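One possible fix is to memoise per-step results during a single traversal, so a shared upstream step like A runs exactly once even in a diamond-shaped graph; a sketch with a toy Node class standing in for Step:

```python
class Node:
    """Tiny stand-in for a Step in a diamond-shaped pipeline sketch."""
    def __init__(self, name, input_steps=()):
        self.name = name
        self.input_steps = list(input_steps)
        self.fit_calls = 0

def fit_transform(node, data, done=None):
    # Memoise results per node within one traversal, so shared upstream
    # steps (A feeding both B and C) are executed exactly once.
    if done is None:
        done = {}
    if id(node) in done:
        return done[id(node)]
    inputs = [fit_transform(s, data, done) for s in node.input_steps]
    node.fit_calls += 1
    out = (node.name, inputs or data)
    done[id(node)] = out
    return out
```

The memo lives only for the duration of one top-level call, so force_fitting would still refit A on the next run, just not twice within the same run.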
Current:
YYYY-MM-DD HH-MM-SS steps >>>
Proposed:
YYYY-MM-DD HH:MM:SS steps >>>
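With Python's standard logging this is just a datefmt change; a sketch assuming the steps logger is configured through a Formatter:

```python
import logging

logger = logging.getLogger('steps')
handler = logging.StreamHandler()
# datefmt with colons gives the proposed 'YYYY-MM-DD HH:MM:SS' timestamps.
handler.setFormatter(logging.Formatter(
    fmt='%(asctime)s steps >>> %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'))
logger.addHandler(handler)
```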
Set up readthedocs, so that:
Look here:
https://github.com/minerva-ml/gradus/blob/dev/steps/pytorch/callbacks.py#L266
If self.epoch_id is equal to 0, then loss_sum is equal to self.best_score and the model is not saved. I think this should be fixed, because sometimes we want to have the model from the first epoch saved.
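A sketch of the proposed fix: treat the first epoch as an improvement, then save only when the loss actually improves (a simplified, hypothetical callback, not the real one in steps/pytorch/callbacks.py):

```python
class ModelCheckpoint:
    """Simplified checkpoint logic: initialise best_score to None instead
    of the first loss, so epoch 0 always triggers a save."""
    def __init__(self):
        self.best_score = None

    def on_epoch_end(self, epoch_id, loss_sum):
        if self.best_score is None or loss_sum < self.best_score:
            self.best_score = loss_sum
            return True   # signal: persist the model now
        return False
```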
Now the situation is as follows:
if I want to run Step.transform(data), the transformer of this step needs to be cached in the experiment directory.
I don't think it is necessary if the transformer is not fittable, like in postprocessing or preprocessing steps.
I would suggest that if a transformer has a non-trivial fit() method, it should be cached to perform transform(); otherwise it doesn't need to be cached, because the cached file actually contains no data.
Now, when I have the pipeline:
preprocessing step -> neural network step -> postprocessing step
I need to perform postprocessing on all my training data (to create all the necessary transformers), which is redundant and often time-consuming, OR I can manually create a trivial transformer file. I think this should be simplified.
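One way to implement the suggested rule is to persist a transformer only when it overrides a trivial base fit(); a sketch assuming a BaseTransformer base class (the class and helper names are assumptions for illustration):

```python
class BaseTransformer:
    """Assumed base class with a trivial, do-nothing fit()."""
    def fit(self, *args, **kwargs):
        return self

def needs_persisting(transformer):
    # A transformer only needs a file in ./transformers if it actually
    # overrides the trivial base fit(); otherwise there is nothing to save.
    return type(transformer).fit is not BaseTransformer.fit
```

With a check like this, transform() could skip the cache lookup for pure pre/postprocessing steps instead of requiring a dummy transformer file.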
use plotly (refactor)
This function / feature should enable users to wrap any function into our Transformer with fit()
and transform()
functions.
(sklearn/func -> transformer conversion)
it accepts: an sklearn transformer, regressor, or classifier, and a single def as well.
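A sketch of what such a wrapper could look like for the plain-function case (make_transformer is a hypothetical name; the sklearn case would additionally need to delegate fit to the wrapped estimator):

```python
def make_transformer(func):
    """Wrap a plain function into a Transformer-like object with a
    no-op fit(). A sketch of the requested feature, not steppy's helper."""
    class FunctionTransformer:
        def fit(self, *args, **kwargs):
            return self  # nothing to learn for a plain function

        def transform(self, *args, **kwargs):
            return func(*args, **kwargs)

        def fit_transform(self, *args, **kwargs):
            self.fit(*args, **kwargs)
            return self.transform(*args, **kwargs)
    return FunctionTransformer()
```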
adapter.py file