georgian-io-archive / hydra

A cloud-agnostic ML platform that enables Data Scientists to run multiple experiments, perform hyperparameter optimization, evaluate results, and serve models (batch/realtime) while maintaining a uniform development UX across cloud environments.

License: Apache License 2.0

Python 43.88% Dockerfile 0.51% Shell 3.79% HCL 51.26% Makefile 0.56%

mlops ml-platform python cloud-agnostic experimentation hydra

hydra's People

Contributors

iratemonkey maybeee18 mjafarmashhadi albertoa

hydra's Issues

Error when running on fast_local or local on hydra 0.3.4

Hi, I tried running on aws and it worked fine, but got the following error when running locally.

When running with --cloud fast_local, got the following error:

Traceback (most recent call last):
  File "/Users/.../miniconda3/envs/.../bin/hydra", line 8, in <module>
    sys.exit(cli())
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/angelineyasodhara/miniconda3/envs/.../lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/.../miniconda3/envs.../lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/hydra/cli.py", line 124, in train
    platform.train()
  File "/Users/.../miniconda3/envs/.../lib/python3.7/site-packages/hydra/cloud/fast_local_platform.py", line 9, in train
    os.system(" ".join([self.options, 'python3', self.model_path]))
TypeError: sequence item 0: expected str instance, dict found

Updating with -o "" also gives an error at json.loads(options) in hydra/cli.py.
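Both errors above come down to how the options value is handled: `str.join` only accepts strings, and `json.loads("")` raises a `JSONDecodeError`. A minimal sketch of one possible fix follows. It assumes the options are meant to be environment-variable assignments placed in front of the command; that intent, and the names `parse_options` and `build_command`, are hypothetical and not taken from hydra's code.

```python
import json
import shlex


def parse_options(raw):
    """Parse the -o value, treating an empty string as "no options".

    json.loads("") raises JSONDecodeError, so the empty case is handled first.
    """
    return json.loads(raw) if raw else {}


def build_command(options, model_path):
    """Build the shell command string.

    Assumption: options maps environment variable names to values.
    Each pair is rendered as NAME=value so that every joined item is a str.
    """
    env_prefix = [f"{key}={shlex.quote(str(value))}" for key, value in options.items()]
    return " ".join(env_prefix + ["python3", shlex.quote(model_path)])
```

With these helpers, `build_command(parse_options('{"EPOCHS": 3}'), "train.py")` yields `EPOCHS=3 python3 train.py`, and an empty `-o ""` yields plain `python3 train.py`.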

When running with --cloud local, got the following error:
sh: /Users/..../miniconda3/envs/..../lib/python3.7/site-packages/hydra/cloud/../../docker/local_execution.sh: No such file or directory

Implement Hydra train --cloud=gcp

Command will be
`hydra train --cloud=gcp --model={MODEL_CODE_PATH} --other_hyperparams`

This command would train the model on GCP

Implement faster local version training

hydra train --cloud=local is currently a bit slow because it installs the model requirements on every run. We want a faster local training variant, --cloud=fast_local, which skips Docker and directly runs python3 $model_path.

Install Hydra and look for improvements in documentation

Install hydra==0.3.5
Improve doc

Fix bug with GPU Training in AWS

GPU instances not launching when gpu_count specified in aws cloud

Pip install failing

After installing hydra using pip:
pip install hydra-ml
Running hydra --version gives the following error:

Traceback (most recent call last):
  File "/usr/local/bin/hydra", line 33, in <module>
    sys.exit(load_entry_point('hydra', 'console_scripts', 'hydra')())
  File "/usr/local/bin/hydra", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/importlib/metadata.py", line 504, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/local/Cellar/python@3.8/3.8.6/Frameworks/Python.framework/Versions/3.8/lib/python3.8/importlib/metadata.py", line 177, in from_name
    raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: hydra

Add AWS Support

Support local/fast_local modes in Windows OS

Using exception subclasses

The best practice in Python is to use the built-in exceptions as much as possible and to avoid creating new exceptions when built-in ones such as ValueError, KeyError, and TypeError can do the job.
It is also recommended to raise a specific exception type rather than the base Exception class.
While the first practice is followed in hydra's codebase, the second has room for improvement.
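To illustrate the second point, here is a small sketch. `load_platform` is a hypothetical helper, not a function from hydra's codebase; it only shows raising specific built-in exceptions instead of the base Exception class.

```python
def load_platform(name, platforms):
    """Look up a platform by name (hypothetical helper for illustration)."""
    if not isinstance(name, str):
        # Wrong argument type: TypeError is the matching built-in.
        raise TypeError(f"platform name must be str, got {type(name).__name__}")
    try:
        return platforms[name]
    except KeyError:
        # Right type but unsupported value: ValueError instead of Exception.
        raise ValueError(f"unknown platform: {name!r}") from None
```

Callers can then write `except ValueError:` and handle only the case they expect, rather than `except Exception:`, which would also swallow unrelated bugs.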

Create a project using MLFlow and train using hydra

As a follow-up to issue #40, write a model that uses this MLflow tracking feature and runs on Hydra. This model should be added to https://github.com/georgianpartners/hydra-ml-projects

Alchemy classifier training on Hydra

Take out training code from alchemy and train on Hydra

Beam GCP Bug fix

Setting up encryption for training datasets

Write IAC so that the buckets storing datasets are encrypted.

IAC - Automate Container Deployment in ECS Fargate

Docker containers must be defined as task definitions that can run in an ECS service. These tasks must be configured to run behind a load balancer, with CloudWatch logging and metrics.

File not found error when running `local` and `gcp` mode

After installing hydra via pip: `pip install hydra-ml==0.3.6`
Run a training command in local mode: `hydra train -y run.yaml --cloud=local`
This error gets raised:

sh: /Users/faisalanees/.conda/envs/hydra/lib/python3.8/site-packages/hydra/cloud/../docker/local_execution.sh: No such file or directory

Add abstraction layer API to store metrics (Mlflow, RDS, comet.ml, etc)

IAC - Add optional modules to allow VPC and subnet creation within Terraform

The Terraform scripts currently use existing VPCs and subnets. Complete the configurations in the scripts to allow creation of new VPCs and subnets.

Setup MLFlow Tracking infra IAC

MLflow tracking needs a tracking server to track job metadata. Write IAC in Terraform to set up this server. Read more here: https://mlflow.org/docs/latest/tracking.html#id63

Dynamically find available training backends

Currently, the training backends are hard coded in hydra's code. There are a few drawbacks with this approach:

- All the dependencies for all the training platforms must be installed even though one might not need some of them. Otherwise the imports at the top of cli.py will fail. (I will elaborate on this issue in a separate feature request issue.)
- if statements like this: https://github.com/georgianpartners/hydra/blob/7c733fd0eb7f9a9081ecdd2b5e3a5db2bdc282d4/hydra/cli.py#L129-L142 can turn messy and error-prone very fast. In an object-oriented design this is not a good practice.
- The user is not able to easily add a new training backend or customize the existing ones (by subclassing them or AbstractPlatform).

I propose adding code that dynamically discovers the AbstractPlatform subclasses at runtime and registers them with hydra. It will get rid of switch-case-like if statements such as the one above and https://github.com/georgianpartners/hydra/blob/7c733fd0eb7f9a9081ecdd2b5e3a5db2bdc282d4/hydra/cli.py#L74-L77
This discovery interface should also give users an appropriate way (such as a method or a decorator) to register and load their custom training platforms manually.
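One way such a registry could look is sketched below. This is an illustration under assumptions, not the proposed PR: the `AbstractPlatform` here is a stand-in with only a `train` method, and `PLATFORM_REGISTRY`, the `cloud` class keyword, and `get_platform` are hypothetical names.

```python
from abc import ABC, abstractmethod

# Maps a --cloud value to the platform class that handles it.
PLATFORM_REGISTRY = {}


class AbstractPlatform(ABC):
    """Stand-in for hydra's AbstractPlatform (assumed interface)."""

    def __init_subclass__(cls, cloud=None, **kwargs):
        # Runs whenever a subclass is defined, including user-defined ones,
        # so importing a module is enough to register its platforms.
        super().__init_subclass__(**kwargs)
        if cloud is not None:
            PLATFORM_REGISTRY[cloud] = cls

    @abstractmethod
    def train(self):
        ...


class FastLocalPlatform(AbstractPlatform, cloud="fast_local"):
    def train(self):
        return "training with python3 directly"


def get_platform(cloud):
    """Replace the if/elif chain with a single dictionary lookup."""
    try:
        return PLATFORM_REGISTRY[cloud]
    except KeyError:
        known = ", ".join(sorted(PLATFORM_REGISTRY))
        raise ValueError(f"unknown cloud {cloud!r}; available: {known}") from None
```

A user-defined backend would then only need `class MyPlatform(AbstractPlatform, cloud="my_cloud")` in a module that gets imported. Platforms with heavy dependencies could live in separate modules that are imported lazily, so missing optional packages do not break the CLI.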

--
I got ahead of myself and started working on a PR before opening the issue. I'll submit it once it's finalized.

IAC - Include Auto-Scaling policy in MLFlow ECS Cluster

Include an autoscaling policy that allows containers in the ECS cluster to scale up and down dynamically based on CPU and memory utilization.

Create hydra init in Hydra

Add an init command to allow hydra to run IaC scripts to provision infrastructure.

Test

Add tests to fix github merge actions

Allow for training even with uncommitted changes

Currently you can only do training when you've committed and pushed all new changes in your branch. This introduces a blocker when a Data Scientist is trying out lots of different changes in their code.

Allow for training even with uncommitted changes. This can be done by taking a git diff of the current branch, storing it, and then running git apply on the checked-out branch during training.
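The diff-and-apply idea can be sketched as follows. The function names are hypothetical, and where the patch is stored between the two steps (for example, alongside the job definition) is left open.

```python
import os
import subprocess


def diff_command():
    # `git diff HEAD` covers staged and unstaged changes to tracked files.
    # Untracked files are NOT included and would need separate handling.
    return ["git", "diff", "--binary", "HEAD"]


def apply_command(patch_path):
    return ["git", "apply", patch_path]


def save_uncommitted_changes(repo_dir, patch_path):
    """Write the uncommitted changes of repo_dir into patch_path."""
    result = subprocess.run(diff_command(), cwd=repo_dir, check=True, capture_output=True)
    with open(patch_path, "wb") as patch_file:
        patch_file.write(result.stdout)


def restore_uncommitted_changes(repo_dir, patch_path):
    """Apply a saved patch on top of a fresh checkout of the same commit."""
    patch_path = os.path.abspath(patch_path)
    if os.path.getsize(patch_path) == 0:
        return  # nothing to apply; git apply rejects an empty patch
    subprocess.run(apply_command(patch_path), cwd=repo_dir, check=True)
```

The patch only applies cleanly if the training side checks out the same commit the diff was taken against, so the commit hash would need to travel with the patch.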

Create utility to access MLFlow creds securely - AWS & GCP

Currently, MLflow credentials need to be passed via environment variables in the run.yaml file. Add util methods in hydra.utils that automatically set the MLflow tracking username and password environment variables by reading from a cloud secret manager: https://cloud.google.com/secret-manager for GCP and https://aws.amazon.com/secrets-manager/ for AWS. For local runs, just read from the local environment.
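A minimal sketch of such a utility is shown below. `MLFLOW_TRACKING_USERNAME` and `MLFLOW_TRACKING_PASSWORD` are the environment variables MLflow reads for basic auth; everything else is an assumption: the function names, the `fetch_secret` callable, and the choice to store each secret under the same name as its environment variable.

```python
import os

MLFLOW_CREDENTIAL_VARS = ("MLFLOW_TRACKING_USERNAME", "MLFLOW_TRACKING_PASSWORD")


def set_mlflow_credentials(fetch_secret=None, environ=os.environ):
    """Populate the MLflow credential variables.

    fetch_secret: hypothetical callable mapping a secret name to its value,
    e.g. a thin wrapper around a cloud secret manager client.
    None means a local run, where the variables must already be set.
    """
    for name in MLFLOW_CREDENTIAL_VARS:
        if fetch_secret is not None:
            environ[name] = fetch_secret(name)
        elif name not in environ:
            raise KeyError(f"{name} is not set in the local environment")


def aws_fetch_secret(secret_id):
    """Example fetcher for AWS Secrets Manager (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so local runs do not require it

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]
```

On AWS this would be called as `set_mlflow_credentials(aws_fetch_secret)`; a GCP fetcher with the same one-argument shape could be passed in the same way.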

Download datasets/artifacts from a predefined S3 path prior to job run

Right now users have to explicitly write code to download the datasets/artifacts needed in their experiments. Implement functionality that lets users pass a path to the datasets/artifacts as part of the hydra configs. This will pre-download everything under that path into the container and expose it under data/
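A rough sketch of the pre-download step, assuming boto3 is available in the container. The function names are hypothetical, and how the bucket and prefix reach this code from the hydra config is not specified in this issue.

```python
import os


def local_path_for_key(key, prefix, dest_root="data"):
    """Map an S3 object key under prefix to a path under dest_root."""
    # A prefix that names a single object leaves nothing after stripping,
    # so fall back to the object's base name.
    relative = key[len(prefix):].lstrip("/") or os.path.basename(key)
    return os.path.join(dest_root, *relative.split("/"))


def download_prefix(bucket, prefix, dest_root="data"):
    """Download every object under s3://bucket/prefix into dest_root."""
    import boto3  # third-party; imported lazily so the helper above has no dependency

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip "folder" placeholder objects
            target = local_path_for_key(key, prefix, dest_root)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, key, target)
```

Pagination matters here because a single `list_objects_v2` call returns at most 1000 keys.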

Persist job artifacts to S3

Persist job runs, their logs, and their output to a configurable S3 destination after job completion. Create helper functions in the hydra library that users can easily import into their experiment code.

Note: Also check whether MLflow artifacts can be used for this: https://mlflow.org/docs/latest/tracking.html#id63

Store job metadata onto a database

Currently there is no central way to track job runs apart from the dashboards provided by the respective cloud providers. We need a unified place to track and persist runs for longer periods of time. Implement functionality to implicitly store job metadata in a DB. Also write IAC scripts to set up the DB.

Implement Hydra train --cloud=local

Command will be
hydra train --cloud=local --model={MODEL_CODE_PATH} --other_hyperparams

This command would train the model locally by running the docker image

Allow live debugging on training jobs

Start the training container in debug mode with sshd enabled on the running EC2 instance. This will allow the user to debug a running job.

Releasing to PyPI using github actions

We have an action that is supposed to push to PyPI on new releases, but it doesn't seem to be working as expected.

Cannot find "HYDRA_PLATFORM" in os.environ with fast_local

The HYDRA_PLATFORM environment variable is empty:

When I ran hydra train -y run_model.yaml --cloud fast_local and hydra train -y run_model.yaml --cloud local, print(os.environ.get('HYDRA_PLATFORM')) outputs None.

I'm in the middle of checking for other modes.
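For reference, a variable only shows up in the model script's `os.environ` if the launching process puts it into the child's environment. The sketch below shows one way to do that; `run_with_platform` is a hypothetical name, and whether hydra is meant to set HYDRA_PLATFORM this way is an assumption.

```python
import os
import subprocess
import sys


def run_with_platform(script_path, platform, python=sys.executable):
    """Run script_path in a child process with HYDRA_PLATFORM set."""
    # Copy the parent's environment and add the variable for the child only.
    env = dict(os.environ, HYDRA_PLATFORM=platform)
    return subprocess.run(
        [python, script_path],
        env=env,
        check=True,
        capture_output=True,
        text=True,
    )
```

A script launched this way with `platform="fast_local"` would see that value from `os.environ.get('HYDRA_PLATFORM')`.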

Set up hydra cli

Add setup.py, tests repo and implement a dummy hydra train command

IAC - add iac for instrumentation alerting

Need alerting to notify in case of surge in throughput, mem, cpu util, errors, etc
AWS - cloudwatch alerts
GCP - ?

Track grid runs as a single unit

Multiple variants of an experiment can now be launched via the hydra yaml config file, but there is no way to track these runs while they are running or once they are completed. As a start, store them in the DB with a UUID so they can be tracked as a group. Also find a way to show them as a group in MLflow.
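The grouping could be sketched like this. `expand_grid` and the record layout are hypothetical, and the way hydra actually expands its yaml grid may differ.

```python
import itertools
import uuid


def expand_grid(param_grid):
    """Expand a dict of parameter lists into runs sharing one group id."""
    group_id = str(uuid.uuid4())  # one id for the whole grid
    names = sorted(param_grid)
    runs = []
    for values in itertools.product(*(param_grid[name] for name in names)):
        runs.append({"group_id": group_id, "params": dict(zip(names, values))})
    return runs
```

Each record could be stored as one row in the DB. For MLflow, one option is to set the same id as a tag on every run of the grid (for example with `mlflow.set_tag("group_id", group_id)`), so the runs can be filtered together in the UI.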

Add Database to store metadata of each run

Add options to add mlflow tracking url to hydra yaml config

Set mlflow tracking url for a project via the config yaml file

IAC for AWS Batch Infra

Create terraform scripts to launch aws batch infra for training.
Import components

Separate queues and compute environments for CPU vs GPU workloads
Use launch templates to attach additional disk storage to training instances
Alerting for jobs queuing up in Batch states
Slack alerting configuration
Fine-grained IAM permissions

Clone https://github.com/georgianpartners/hydra-ml-projects
Fill Readme.md with instructions on how to train

Modes:

fast_local
local
aws
gcp

- Should be able to run model code from any git repo
- Install model requirements via requirements.txt
- persist models to s3://hydra-ml/artifacts
- push file to ECR