
georgian-io-archive / hydra


A cloud-agnostic ML Platform that will enable Data Scientists to run multiple experiments, perform hyper parameter optimization, evaluate results and serve models (batch/realtime) while still maintaining a uniform development UX across cloud environments

License: Apache License 2.0

Python 43.88% Dockerfile 0.51% Shell 3.79% HCL 51.26% Makefile 0.56%
mlops ml-platform python cloud-agnostic experimentation hydra

hydra's People

Contributors

coder46, sayonsivakumaran, tsa87, will-gp


hydra's Issues

Error when running on fast_local or local on hydra 0.3.4

Hi, I tried running on AWS and it worked fine, but I got the following errors when running locally.

When running with --cloud fast_local, got the following error:

Traceback (most recent call last):
  File "/Users/.../miniconda3/envs/.../bin/hydra", line 8, in <module>
    sys.exit(cli())
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/angelineyasodhara/miniconda3/envs/.../lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/.../miniconda3/envs.../lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/hydra/cli.py", line 124, in train
    platform.train()
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/hydra/cloud/fast_local_platform.py", line 9, in train
    os.system(" ".join([self.options, 'python3', self.model_path]))
TypeError: sequence item 0: expected str instance, dict found

Updating with -o "" also gives an error at json.loads(options) in hydra/cli.py.
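The traceback suggests that self.options reaches fast_local_platform.py as a dict while str.join only accepts strings, and that an empty -o value fails in json.loads. A minimal sketch of one possible fix follows, assuming the options are meant to be environment-style key/value pairs placed in front of the command. parse_options and build_local_command are hypothetical helpers, not part of hydra.

```python
import json
import shlex


def parse_options(raw):
    """Parse the -o value; an empty or missing string means no options (assumption)."""
    return json.loads(raw) if raw and raw.strip() else {}


def build_local_command(options, model_path):
    """Render a dict of options as KEY=value prefixes followed by the python3 call."""
    # Assumption: options is a dict of environment-style settings, possibly empty or None.
    prefix = [f"{key}={shlex.quote(str(value))}" for key, value in (options or {}).items()]
    return " ".join(prefix + ["python3", shlex.quote(model_path)])
```

The resulting string can be handed to os.system as before; quoting the values keeps spaces inside an option from splitting the command.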

When running with --cloud local, got the following error:
sh: /Users/..../miniconda3/envs/..../lib/python3.7/site-packages/hydra/cloud/../../docker/local_execution.sh: No such file or directory

Implement faster local version training

hydra train --cloud=local is currently a bit slow because it installs the model requirements on every run. We want a faster local training variant, --cloud=fast_local, which skips Docker and runs python3 $model_path directly.

Pip install failing

After installing hydra using pip:
pip install hydra-ml
running hydra --version gives the following error:

Traceback (most recent call last):
  File "/usr/local/bin/hydra", line 33, in <module>
    sys.exit(load_entry_point('hydra', 'console_scripts', 'hydra')())
  File "/usr/local/bin/hydra", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/importlib/metadata.py", line 504, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/importlib/metadata.py", line 177, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: hydra

Using exception subclasses

The best practice in Python is to use the built-in exceptions as much as possible, and to avoid creating new exception classes when built-ins such as ValueError, KeyError, and TypeError can do the job.
It is also recommended to raise a specific exception type rather than the base Exception class.
While the first practice is followed in hydra's codebase, the second has room for improvement.
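To illustrate the second point, here is a small sketch of raising specific built-in exceptions instead of the base Exception. The function name, the config shape, and the list of platform names are assumptions made for the example, not hydra's actual code.

```python
# Hypothetical list of platform names, used only for illustration.
KNOWN_PLATFORMS = ("local", "fast_local", "aws", "gcp")


def load_platform_name(config):
    """Return the configured platform name, raising a specific built-in on bad input."""
    if not isinstance(config, dict):
        # Wrong type of argument: TypeError, not a bare Exception.
        raise TypeError(f"config must be a dict, got {type(config).__name__}")
    if "cloud" not in config:
        # Missing key: KeyError.
        raise KeyError("cloud")
    if config["cloud"] not in KNOWN_PLATFORMS:
        # Right type, unacceptable value: ValueError.
        raise ValueError(f"unknown platform {config['cloud']!r}")
    return config["cloud"]
```

Callers can then catch exactly the failure they know how to handle, for example except ValueError, without swallowing unrelated errors.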

File not found error when running `local` and `gcp` mode

After installing hydra via pip (pip install hydra-ml==0.3.6),
run a training command in local mode: hydra train -y run.yaml --cloud=local
The following error is raised:

sh: /Users/faisalanees/.conda/envs/hydra/lib/python3.8/site-packages/hydra/cloud/../docker/local_execution.sh: No such file or directory

Dynamically find available training backends

Currently, the training backends are hard coded in hydra's code. There are a few drawbacks with this approach:

  • All the dependencies for all the training platforms must be installed even if some of them are not needed; otherwise the imports at the top of cli.py will fail. (I will elaborate on this in a separate feature request.)
  • if statements like this one: https://github.com/georgianpartners/hydra/blob/7c733fd0eb7f9a9081ecdd2b5e3a5db2bdc282d4/hydra/cli.py#L129-L142 can get messy and error-prone very quickly. This is not good practice in an object-oriented design.
  • The user cannot easily add a new training backend or customize the existing ones (by subclassing them or AbstractPlatform).

I propose adding code that dynamically discovers the AbstractPlatform subclasses at runtime and registers them with hydra. This would get rid of switch-case-like if statements such as the one above and https://github.com/georgianpartners/hydra/blob/7c733fd0eb7f9a9081ecdd2b5e3a5db2bdc282d4/hydra/cli.py#L74-L77
This discovery interface should also give users an appropriate way (such as a method or a decorator) to register and load their custom training platforms manually.
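One way the decorator variant of this proposal could look is sketched below. The AbstractPlatform class here is a stand-in with a single train method, and register_platform, get_platform, and FastLocalPlatform are hypothetical names for illustration; hydra's real classes may differ.

```python
from abc import ABC, abstractmethod

# Registry mapping a --cloud name to the platform class that handles it.
PLATFORMS = {}


class AbstractPlatform(ABC):
    """Stand-in for hydra's base class (assumed interface: a train method)."""

    @abstractmethod
    def train(self):
        ...


def register_platform(name):
    """Class decorator that registers an AbstractPlatform subclass under a name."""
    def decorator(cls):
        if not issubclass(cls, AbstractPlatform):
            raise TypeError(f"{cls.__name__} must subclass AbstractPlatform")
        PLATFORMS[name] = cls
        return cls
    return decorator


@register_platform("fast_local")
class FastLocalPlatform(AbstractPlatform):
    def __init__(self, model_path):
        self.model_path = model_path

    def train(self):
        # Returns the command instead of running it, to keep the sketch side-effect free.
        return f"python3 {self.model_path}"


def get_platform(name, *args, **kwargs):
    """Look up a registered platform by name and instantiate it."""
    try:
        cls = PLATFORMS[name]
    except KeyError:
        raise ValueError(f"unknown platform {name!r}; available: {sorted(PLATFORMS)}") from None
    return cls(*args, **kwargs)
```

With a registry like this, the CLI only needs get_platform(cloud, ...).train(), and a user can add a backend by decorating their own subclass. Automatic discovery could instead walk AbstractPlatform.__subclasses__(), at the cost of less explicit naming.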

--
I got ahead of myself and started working on a PR before opening the issue. I'll submit it once it's finalized.

Allow for training even with uncommitted changes

Currently you can only do training when you've committed and pushed all new changes in your branch. This introduces a blocker when a Data Scientist is trying out lots of different changes in their code.

Allow for training even with uncommitted changes. This can be done by taking a git diff of the current branch, storing it, and then running git apply on the checked-out branch during training.
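The diff-and-apply idea could be sketched roughly as follows, assuming git is on the PATH and that only tracked files matter (git diff HEAD does not include untracked files). Both function names are hypothetical, and where the patch is stored between the two steps is left open.

```python
import subprocess


def capture_uncommitted_changes(repo_dir):
    """Return staged and unstaged changes to tracked files as patch bytes."""
    result = subprocess.run(
        ["git", "diff", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
    )
    return result.stdout


def apply_patch(repo_dir, patch):
    """Apply a previously captured patch to a clean checkout of the same commit."""
    if not patch.strip():
        return  # Nothing to apply.
    subprocess.run(
        ["git", "apply", "--whitespace=nowarn", "-"],  # "-" reads the patch from stdin
        cwd=repo_dir,
        input=patch,
        check=True,
    )
```

The training container would check out the recorded commit first and then call apply_patch with the stored bytes, so the run sees the same code as the local working tree.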

Download datasets/artifacts from a predefined S3 path prior to job run

Right now users have to explicitly write code to download the datasets/artifacts needed in their experiments. Implement functionality that lets users pass a path to the datasets/artifacts as part of the hydra configs. Everything under that path will be pre-downloaded into the container and exposed under data/.
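A rough sketch of the pre-download step, assuming a boto3 S3 client is passed in (for example boto3.client("s3")). The function name download_prefix and its parameters are hypothetical, and how the bucket and prefix would be read from the hydra config is not shown.

```python
import os


def download_prefix(client, bucket, prefix, dest="data"):
    """Download every object under an S3 prefix into dest, keeping relative paths."""
    paginator = client.get_paginator("list_objects_v2")
    downloaded = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # Skip folder placeholder objects.
            # Path of the object relative to the prefix; fall back to the file name.
            relative = key[len(prefix):].lstrip("/") or os.path.basename(key)
            local_path = os.path.join(dest, relative)
            os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
            client.download_file(bucket, key, local_path)
            downloaded.append(local_path)
    return downloaded
```

Passing the client in keeps the function easy to test with a fake and leaves credential handling to the caller.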

Store job metadata onto a database

Currently there is no central way to track job runs apart from the dashboards provided by the respective cloud providers. We need a unified place to track and persist runs for longer periods of time. Implement functionality to implicitly store job metadata in a DB. Also write IaC scripts to set up the DB.
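As an illustration of what implicit metadata storage might involve, here is a sketch that uses SQLite as a stand-in for whichever database is eventually chosen. The table name, columns, and record_job helper are assumptions for the example only.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical schema; the real set of fields is not specified in this issue.
SCHEMA = """
CREATE TABLE IF NOT EXISTS job_runs (
    run_id      TEXT PRIMARY KEY,
    platform    TEXT NOT NULL,
    git_sha     TEXT,
    hyperparams TEXT,
    status      TEXT NOT NULL,
    started_at  TEXT NOT NULL
)
"""


def record_job(conn, platform, git_sha, hyperparams, status="SUBMITTED"):
    """Insert one job run and return its generated run id."""
    run_id = str(uuid.uuid4())
    conn.execute(SCHEMA)
    with conn:  # Commits on success, rolls back on error.
        conn.execute(
            "INSERT INTO job_runs VALUES (?, ?, ?, ?, ?, ?)",
            (
                run_id,
                platform,
                git_sha,
                json.dumps(hyperparams),
                status,
                datetime.now(timezone.utc).isoformat(),
            ),
        )
    return run_id
```

Calling record_job from inside the train command would make the tracking implicit, since the user never has to write the row themselves.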

Implement Hydra train --cloud=local

Command will be
hydra train --cloud=local --model={MODEL_CODE_PATH} --other_hyperparams

This command would train the model locally by running the docker image

Cannot find "HYDRA_PLATFORM" in os.environ with fast_local

The HYDRA_PLATFORM environment variable is empty:

When I ran hydra train -y run_model.yaml --cloud fast_local and hydra train -y run_model.yaml --cloud local, print(os.environ.get('HYDRA_PLATFORM')) outputs None.

I'm in the middle of checking for other modes.
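For reference, a launcher that starts the model as a child process could pass the variable along as in the sketch below. This assumes the local modes are expected to set HYDRA_PLATFORM at all, which the issue implies but does not confirm; run_with_platform_env is a hypothetical helper.

```python
import os
import subprocess
import sys


def run_with_platform_env(platform, command):
    """Run command with HYDRA_PLATFORM added to a copy of the current environment."""
    env = dict(os.environ, HYDRA_PLATFORM=platform)
    return subprocess.run(command, env=env, check=True, capture_output=True, text=True)


if __name__ == "__main__":
    # The child process sees the variable even though the parent never exported it.
    child = [sys.executable, "-c", "import os; print(os.environ.get('HYDRA_PLATFORM'))"]
    print(run_with_platform_env("fast_local", child).stdout.strip())
```

Note that os.system inherits os.environ, so setting os.environ["HYDRA_PLATFORM"] before the call would be another way to reach the same result.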

Set up hydra cli

Add setup.py, tests repo and implement a dummy hydra train command

Track grid runs as a single unit

Multiple variants of an experiment can now be launched via the hydra yaml config file, but there is no way to track these runs while they are running or once they have completed. As a start, store them in the DB with a UUID so they can be tracked as a group. Also find a way to show them as a group in MLflow.
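A small sketch of the grouping idea: expand a hyperparameter grid into individual runs that all share one group UUID. The grid format (a dict of lists) and the expand_grid name are assumptions, since the yaml layout is not shown here.

```python
import itertools
import uuid


def expand_grid(grid):
    """Expand a dict of value lists into one run per combination, sharing a group id."""
    group_id = str(uuid.uuid4())  # One id for the whole grid.
    keys = sorted(grid)
    runs = []
    for values in itertools.product(*(grid[key] for key in keys)):
        runs.append(
            {
                "group_id": group_id,
                "run_id": str(uuid.uuid4()),  # Unique id per variant.
                "params": dict(zip(keys, values)),
            }
        )
    return runs
```

Storing group_id next to each run in the DB makes the group queryable, and the same id could be attached to each run as a tag in the tracking UI.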

IAC for AWS Batch Infra

Create Terraform scripts to launch the AWS Batch infra for training.
Important components

  1. Separate queues and compute environments for CPU vs GPU workloads
  2. Use launch templates to attach additional disk storage to training instances
  3. Alerting for jobs queuing up in Batch states
  4. Slack alerting configuration
  5. Fine-grained IAM permissions

Building hydra docker image

Docker is looking for the Dockerfile inside hydra-ml-projects, but the Dockerfile is in the hydra repo. Find a way to build the image from the CLI.

Create Dockerfile to run models

  • Should be able to run model code from any git repo
  • Install model requirements via requirements.txt
  • Persist models to s3://hydra-ml/artifacts
  • Push file to ECR
