
bodywork-ml / bodywork-core

ML pipeline orchestration and model deployments on Kubernetes.

Home Page: https://bodywork.readthedocs.io/en/latest/

License: GNU Affero General Public License v3.0

Languages: Python 99.47%, Dockerfile 0.31%, Jupyter Notebook 0.22%
Topics: mlops, python, kubernetes, data-science, machine-learning, pipeline, serving, continuous-deployment, batch, devops

bodywork-core's People

Contributors

alexioannides, marios85


bodywork-core's Issues

Change all Optional Parameters to Concise Version

Change all Optional parameter declarations from namespace: Optional[str] = None to namespace: str = None. These are equivalent and recognised by Mypy, with the latter declaration being more concise, thus improving the readability of the code.
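For illustration, a minimal sketch of the two equivalent forms (the function names are hypothetical):

from typing import Optional

def get_namespace_verbose(namespace: Optional[str] = None) -> str:  # current form
    return namespace or "default"

def get_namespace_concise(namespace: str = None) -> str:  # proposed form - relies on Mypy's implicit Optional handling
    return namespace or "default"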

Add support for Git repos hosted on GitLab

Description
As a machine learning engineer, I would like to be able to use public and private repos hosted on GitLab with Bodywork, because I cannot use GitHub at my place of work and this prevents me from adopting Bodywork.

Tasks

  • create a Bodywork GitLab account and create a copy of the bodywork-test-project, or similar.
  • extend the bodywork.git module to work with GitLab repos, either public or private (via SSH).
  • ensure that SSH_GITLAB_KEY_ENV_VAR will be injected as an environment variable in bodywork.k8s.batch_jobs.configure_batch_stage and bodywork.k8s.service_deployments.configure_service_stage_deployment, or consider refactoring SSH_GITLAB_KEY_ENV_VAR into SSH_GIT_KEY_ENV_VAR so that it can be used with any remote Git repository host (see the sketch after this list).
  • add an integration test that uses a GitLab repo.
  • update docs accordingly.
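For reference, a minimal sketch of how an SSH key held in a k8s secret could be injected as an environment variable using the official Kubernetes Python client - the secret name is illustrative, and this is not the existing Bodywork implementation:

from kubernetes import client

def git_ssh_key_env_var(secret_name: str) -> client.V1EnvVar:
    # expose an SSH private key, stored in a k8s secret, as an environment variable
    return client.V1EnvVar(
        name="BODYWORK_GIT_SSH_PRIVATE_KEY",
        value_from=client.V1EnvVarSource(
            secret_key_ref=client.V1SecretKeySelector(
                name=secret_name,  # e.g. "ssh-git-private-key" (illustrative)
                key="BODYWORK_GIT_SSH_PRIVATE_KEY",
            )
        ),
    )

The returned V1EnvVar would then be appended to the container spec's env list in configure_batch_stage and configure_service_stage_deployment.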

Extend internal k8s API to enable Ingress resource creation and deletion

Story
As a Bodywork Developer, I would like to be able to manage Ingress resources, to enable high-level Ingress functionality for ML engineers.

Tasks

  • Create a k8s.service_deployments.create_ingress_to_cluster_service function that will create an Ingress resource for a ClusterIP Service.
  • Create a k8s.service_deployments.delete_ingress_to_cluster_service function that will delete an Ingress resource for a ClusterIP Service (see the sketch after the notes below).

Notes

  • All of the above will be set up assuming that the Kubernetes NGINX ingress controller is being used.
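A minimal sketch of the delete function, using the official Kubernetes Python client's networking/v1 API (the function signature follows the task above; everything else is an assumption):

from kubernetes import client

def delete_ingress_to_cluster_service(name: str, namespace: str) -> None:
    # delete the Ingress resource associated with a ClusterIP Service
    client.NetworkingV1Api().delete_namespaced_ingress(name=name, namespace=namespace)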

Decouple repository cloning and config parsing from workflow execution

Description
The bodywork.workflow_execution.run_workflow function should have repo cloning and config parsing refactored into a separate function, such that run_workflow is responsible solely for managing workflow execution (and is called by the new function). This will facilitate future development - e.g. Bodywork REST API server.

Tasks

  • refactor repo cloning and config parsing into a separate function that then calls bodywork.workflow_execution.run_workflow purely for workflow management (see the sketch after this list).
  • update bodywork.cli.cli.workflow to use the new function.
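A hypothetical sketch of the shape this refactor could take - download_project_code_from_repo is called as it appears in git.py, but the config class and the run_workflow signature shown here are assumptions:

from pathlib import Path

from bodywork.config import BodyworkConfig  # assumed module path and class name
from bodywork.git import download_project_code_from_repo
from bodywork.workflow_execution import run_workflow

def run_deployment(repo_url: str, repo_branch: str = "master") -> None:
    # clone the repo and parse its config, then hand over to run_workflow,
    # which is now purely responsible for workflow management
    cloned_repo_dir = Path("bodywork_project")  # illustrative clone location
    download_project_code_from_repo(repo_url, repo_branch, cloned_repo_dir)
    config = BodyworkConfig(cloned_repo_dir / "bodywork.yaml")  # assumed constructor
    run_workflow(config, repo_url, repo_branch)  # assumed signature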

Implement Bodywork Deployment Delete

"As an ML Engineer I want to be able to delete a deployment"

Implement a CLI command to delete a namespace (effectively kubectl delete ns project-name)

N.B. We need to update the docs to make it clear that if a user has specified a particular namespace to deploy to in the Bodywork YAML, then this is the name they will have to give to the CLI, or they will need to specify it as a parameter (which would make the usage more consistent).

Add Deployment History Command

"As an ML Engineer I want to see the deployment history for a project"

Need to refine what this is exactly before going ahead with this ticket.

Is this really required? What useful information does this give the user as it is a list of undated deployment names for a project?

Update Cronjob Command

"As an ML Engineer I would like to update an existing Cronjob"

Create the CLI command option to update an existing cronjob - e.g. bodywork cronjob update --ns project-name --name cronjob-name --schedule "* * * * 5"

Is it intended to use print statements instead of logging?

First of all great project folks.

Secondly, it's not conventional to use print statements where logging would normally be used, so I'm just curious to know why you folks opted for print.

I can make a PR to introduce logging (using structlog), which will help this project to grow and make it easy for people to use the API instead of the CLI.

Please let me know :)
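For context, a minimal example of the kind of structlog usage being proposed (nothing here is Bodywork-specific):

import structlog

log = structlog.get_logger()

def deploy(project: str, namespace: str) -> None:
    # structured, key-value log events instead of bare print statements
    log.info("deployment_started", project=project, namespace=namespace)

deploy("bodywork-test-project", "bodywork-dev")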

Change working directory used to execute stages

Story
As an ML engineer, I would like to be able to write stages that reference files relative to the working directory containing the executable Python module for the stage, so that working with file paths is as easy as it would be when developing locally.

Task

  1. In bodywork.stage.run_stage, use the optional cwd argument of subprocess.run to change the working directory (see the sketch after this list); and,
  2. Update any deployment templates that use a work-around.
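A minimal sketch of the change to the subprocess call (the helper name and paths are illustrative):

import subprocess
from pathlib import Path

def run_stage_module(module_path: Path) -> None:
    # run the stage's executable Python module with the working directory set to
    # the directory that contains it, so that relative file paths resolve exactly
    # as they would when developing locally
    subprocess.run(
        ["python", module_path.name],
        cwd=module_path.parent,
        check=True,
    )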

Automate the management of k8s namespaces

As a Machine Learning Engineer, I would like the creation and specification of k8s namespaces to be handled automatically by Bodywork, so that I do not have to know about k8s namespaces or what to do with them (I find this all very intimidating).

Tasks

  • The user should be provided with the option to specify a namespace in bodywork.yaml - e.g. in a project.namespace field - but if this isn't provided, it should default to project.name.
  • All remote workflow-jobs should be deployed to the bodywork-deployment-jobs namespace, which will be created automatically when the cluster is configured for Bodywork. This means that the --namespace flag can be dropped for all bodywork deployment commands.
  • We should remove the need to specify a deployment name with the --name flag and instead use one based on the git-repo URL and the current timestamp, so that the resulting command looks like bodywork deployment create MY_REPO MY_BRANCH.
  • Refactor bodywork.workflow_execution.run_workflow, so that it doesn't take namespace as an argument and instead creates the namespace, if it needs to, based on the config parameters. If the workflow does not deploy any services, then the namespace should be deleted when the workflow has successfully completed.
  • Refactor the code in bodywork.k8s.workflow_jobs and bodywork.cli.cli.workflow, to reflect the fact that namespace no longer needs to be thrown around as an argument.

Create a `configure cluster` command

As a Machine Learning Engineer, I would like all k8s cluster setup to be done for me, so that I don't need to understand k8s before deploying my pipelines.

Tasks

  • Create a bodywork.cli.configure_cluster module.
  • Implement a setup_cluster_for_bodywork function that will perform all the necessary steps to configure k8s for use with bodywork - e.g.,
    • create a bodywork-deployment-jobs namespace, into which all deployment (workflow) jobs will be run (see the sketch after this list).
    • create service accounts and roles for bodywork-deployment-jobs, so that workflow-jobs will be able to create resources. This could be done with bodywork.k8s.auth.setup_workflow_service_account, but it may have to be modified with the option to grant permission for namespaces to be created, which will be required by workflow-jobs in bodywork-deployment-jobs (as we want to automate namespace creation).
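A minimal sketch of the namespace-creation step, using the official Kubernetes Python client (the service account and role setup would follow the same pattern via the RBAC APIs):

from kubernetes import client, config

def create_deployment_jobs_namespace() -> None:
    # create the namespace into which all deployment (workflow) jobs will be run
    config.load_kube_config()  # assumes a local kubeconfig, as the CLI would
    client.CoreV1Api().create_namespace(
        client.V1Namespace(metadata=client.V1ObjectMeta(name="bodywork-deployment-jobs"))
    )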

Make Secrets and Their Associated Groups Clearer

Currently, a secret in group dev with name api-password will be given a k8s resource name of dev-api-password, which makes it hard to figure-out what group it belongs to, if you've forgotten what groups are in existence and you have other k8s secrets floating around. Maybe, this would be easier if the secret name would be something like dev--api-password?

Additionally, it would be useful to have a command along the lines of bodywork secret display groups, that lists the secrets groups in existence by looking at the labels of all secrets.

Unhandled exception when deploying from a private GitHub repo via SSH

When trying to deploy a private repo via SSH using,

$ bodywork deployment create \
    --namespace=arc-cpre \
    --name=d1 \
    [email protected]/everlution/arc-cpre.git \
    --git-repo-branch=rest-api-definition \
    -L

I got the following unhandled exception:

testing with local workflow-controller - retries are inactive
namespace=arc-cpre is setup for use by Bodywork
2021-06-22 19:48:29,882 - INFO - workflow_execution.run_workflow - attempting to run workflow for [email protected]/everlution/arc-cpre.git on branch=rest-api-definition in kubernetes namespace=arc-cpre
2021-06-22 19:48:29,939 - ERROR - workflow_execution.run_workflow - failed to execute workflow for rest-api-definition branch of project repository at [email protected]/everlution/arc-cpre.git: Unable to setup SSH for Github and you are trying to connect via SSH: 'failed to setup SSH for github.com - cannot find BODYWORK_GIT_SSH_PRIVATE_KEY environment variable'
Traceback (most recent call last):
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/git.py", line 63, in download_project_code_from_repo
    setup_ssh_for_git_host(hostname)
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/git.py", line 137, in setup_ssh_for_git_host
    raise KeyError(msg)
KeyError: 'failed to setup SSH for github.com - cannot find BODYWORK_GIT_SSH_PRIVATE_KEY environment variable'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/workflow_execution.py", line 81, in run_workflow
    download_project_code_from_repo(repo_url, repo_branch, cloned_repo_dir)
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/git.py", line 74, in download_project_code_from_repo
    raise BodyworkGitError(msg)
bodywork.exceptions.BodyworkGitError: Unable to setup SSH for Github and you are trying to connect via SSH: 'failed to setup SSH for github.com - cannot find BODYWORK_GIT_SSH_PRIVATE_KEY environment variable'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/workflow_execution.py", line 156, in run_workflow
    if config.project.run_on_failure and type(e) not in [
UnboundLocalError: local variable 'config' referenced before assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/workflow_execution.py", line 167, in run_workflow
    f"Error executing failure stage: {config.project.run_on_failure}"
UnboundLocalError: local variable 'config' referenced before assignment

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/bin/bodywork", line 8, in <module>
    sys.exit(cli())
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/cli/cli.py", line 302, in cli
    args.func(args)
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/cli/cli.py", line 324, in wrapper
    func(*args, **kwargs)
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/cli/cli.py", line 393, in deployment
    workflow(pass_through_args)
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/cli/cli.py", line 324, in wrapper
    func(*args, **kwargs)
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/cli/cli.py", line 562, in workflow
    run_workflow(
  File "/Users/alexioannides/Dropbox/bodywork_client_repos/arc-CPRE/.venv/lib/python3.8/site-packages/bodywork/workflow_execution.py", line 176, in run_workflow
    if config is not None and config.project.usage_stats:
UnboundLocalError: local variable 'config' referenced before assignment
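The root cause is that config is only bound after the repo has been cloned and its config parsed inside the try block, so the exception handlers reference a name that was never assigned. A minimal sketch of one possible fix - this is not the actual run_workflow code and the helper is hypothetical:

def run_workflow(repo_url: str, repo_branch: str) -> None:
    config = None  # bind the name before the try block, so the handlers below are safe
    try:
        config = clone_repo_and_parse_config(repo_url, repo_branch)  # hypothetical helper
        ...  # execute the workflow
    except Exception:
        # guard every use of config in the error-handling path
        if config is not None and config.project.run_on_failure:
            ...  # run the failure stage
        if config is not None and config.project.usage_stats:
            ...  # ping the usage-stats server
        raise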

Display Services in all namespaces - CLI command

"As an ML Engineer I would like to see all the deployed services (on the cluster)".

Extend the existing 'Services' CLI command with the option to view all the services on the cluster, which is equivalent to kubectl get services --all-namespaces.

The Bodywork CLI command will be bodywork service display.
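Under the hood this maps onto a single call in the official Kubernetes Python client, e.g.:

from kubernetes import client, config

config.load_kube_config()
for svc in client.CoreV1Api().list_service_for_all_namespaces().items:
    print(f"{svc.metadata.namespace}/{svc.metadata.name}")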

Modify workflow runner to avoid potentially conflicting deployments

"As a ML engineering manager, I want to ensure that projects from independent repos cannot be deployed into the same namespace, so that separate teams cannot interfere with one another's work (e.g accidentally override or delete another team's services)."

Tasks

  • modify the namespace to be derived from the Git URL and branch as specified on the CLI, ensuring that it's a valid k8s namespace using bodywork.k8s.utils.make_valid_k8s_name (see the sketch after this list).
  • remove the ability to override the namespace.
  • ensure that bodywork service display isn't reliant on namespace as an argument.
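A hypothetical sketch of how the namespace could be derived - the real make_valid_k8s_name may behave differently:

import re

def namespace_from_repo_and_branch(repo_url: str, branch: str) -> str:
    # derive a valid k8s namespace name (lowercase alphanumerics and '-', max 63
    # characters) from a Git repo URL and branch
    repo_name = re.sub(r"\.git$", "", repo_url.rstrip("/").split("/")[-1])
    raw_name = f"{repo_name}-{branch}".lower()
    return re.sub(r"[^a-z0-9-]", "-", raw_name).strip("-")[:63]

namespace_from_repo_and_branch("https://github.com/my-org/my-project.git", "main")  # -> "my-project-main"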

Modify workflow runner to delete all redundant services

"As a ML engineer, I would like all redundant services to be automatically deleted at the end of a successful workflow execution."

At the end of bodywork.workflow.run_workflow, execute the following:

  1. get a list of all services in the namespace.
  2. delete all services that are not present in the workflow DAG of the config used to make the current deployment.
  3. remove the bodywork service delete command from the CLI.

This now means that Bodywork is doing 100% GitOps.
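A minimal sketch of steps 1 and 2, using the official Kubernetes Python client (how the DAG's service names are obtained is not shown and is assumed):

from typing import Set

from kubernetes import client, config

def delete_redundant_services(namespace: str, services_in_dag: Set[str]) -> None:
    # delete every service in the namespace that is not part of the current DAG
    config.load_kube_config()
    core_api = client.CoreV1Api()
    for svc in core_api.list_namespaced_service(namespace=namespace).items:
        if svc.metadata.name not in services_in_dag:
            core_api.delete_namespaced_service(name=svc.metadata.name, namespace=namespace)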

Implement a delete method for deployments

There is currently no command for,

$ bodywork deployment delete \
     --ns NAMESPACE \
     --name DEPLOYMENT_NAME

as it was never implemented - it was our intention to rely on the Time To Live (TTL) settings and controller to clean up finished jobs after they have completed (successfully, or otherwise).

Over time, I've noticed that this hasn't been working. After some digging, it transpires that this Kubernetes feature hasn't been enabled on AWS EKS, as it's only in beta (where it has been for a few years). Some clusters may therefore not be able to rely on it, so we should provide a simple method for deleting these deployment jobs...

Tasks

  • implement a delete_workflow_job method in the bodywork.cli.workflow_jobs module; and,
  • implement a bodywork deployment delete command in the bodywork.cli.cli module.
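A minimal sketch of what delete_workflow_job could wrap, using the official Kubernetes Python client (the function name is taken from the task above; the rest is an assumption):

from kubernetes import client

def delete_workflow_job(namespace: str, name: str) -> None:
    # delete a (possibly finished) workflow job, together with its pods
    client.BatchV1Api().delete_namespaced_job(
        name=name,
        namespace=namespace,
        body=client.V1DeleteOptions(propagation_policy="Background"),
    )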

Extend bodywork cronjob create CLI command to accept history limits

The bodywork.k8s.cronjobs.configure_cronjob function accepts successful_jobs_history_limit and failed_jobs_history_limit arguments, for configuring how many historical runs to keep.

The bodywork.cli.create_cronjob_in_namespace function, however, does not support these arguments (and likewise in bodywork.cli.cli.py), which is an oversight that needs to be corrected.

Ensure that the docs are updated to reflect any changes.

Modify workflow runner to handle ingress resources during execution

Story
As an ML engineer, I would like service ingress to be set up (or updated) as required, so that I can expose my services to the world beyond the k8s cluster.

Tasks

  • Modify workflow.py to create Ingress resources alongside Service resources, and to update them when they change.
  • Check that the same can be achieved with Services (updates to things like ports might be lacking).
  • Ensure that a WARNING is logged when ingress is configured, but there is no NGINX ingress controller deployed in the cluster.

Passing data between Bodywork stages locally

Hi guys,

I set up my own project to test Bodywork (started with pipelines). Now I have three simple stages:

  • downloading data (sklearn.datasets.load),
  • preprocessing (this one is a dummy stage, empty),
  • and training (also sklearn).

Step no. 1 downloads data and saves it to a file which then should be used in step no. 3 i.e. training.
Unfortunately, step 3 cannot find the data (and neither can I). Logs below.

Is it because each stage is executed in its own container and everything gets wiped out after stage execution? Is there any recommended way of sharing data between stages so that I don't have to upload data to S3 and then download it back?

Code: https://github.com/mtszkw/turbo_waffle

PS. When executed manually (with python), step 1 leaves a data_files dir, so let's assume that the script works as intended.

Thanks in advance!


---- pod logs for turbo-waffle--3-train-random-forest

2021-07-12 11:58:16,798 - INFO - stage_execution.run_stage - attempting to run stage=3_train_random_forest from main branch of repo at https://github.com/mtszkw/turbo_waffle
2021-07-12 11:58:16,801 - WARNING - git.download_project_code_from_repo - Not configured for use with private GitHub repos
2021-07-12 11:58:28.952 | INFO | __main__:_load_preprocessed_data:9 - Reading training data from /tmp/data_files/breast_cancer_data.npy and /tmp/data_files/breast_cancer_target.npy...
Traceback (most recent call last):
  File "train_random_forest.py", line 32, in <module>
    X_train, y_train = _load_preprocessed_data(X_train_full_path, y_train_full_path)
  File "train_random_forest.py", line 10, in _load_preprocessed_data
    X_train = np.load(X_train_full_path)
  File "/usr/local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/data_files/breast_cancer_data.npy'
2021-07-12 11:58:29,025 - ERROR - stage_execution.run_stage - Stage 3_train_random_forest failed - CalledProcessError(1, ['python', 'train_random_forest.py', '/tmp/data_files/breast_cancer_data.npy', '/tmp/data_files/breast_cancer_target.npy'])

Add missing documentation for retries variable in bodywork cronjob

The --retries flag for the bodywork cronjob command has not been documented - e.g.,

$ bodywork cronjob create \
    --namespace=MY_NAMESPACE \
    --name=MY_PIPELINE \
    --git-repo-url=https://github.com/MY_USERNAME/MY_REPO \
    --git-repo-branch=BRANCH \
    --retries=N

This came up during Discussion #18.

Send basic usage statistics to a centralised tracking server

Description
As a Bodywork Product Manager, I would like to know how Users are using Bodywork, so I can accurately guide Bodywork's roadmap.

Tasks

  • provide a new configuration parameter - e.g., logging.usage_stats - that enables Users to opt-out of usage tracking.
  • hard-code the tracking server URL (and possibly credentials) as a constant in the bodywork.constants module.
  • modify bodywork.workflow_execution.run_workflow to ping the tracking server when it is called, failing gracefully (see the sketch after this list).
  • add a basic test.
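A minimal sketch of the 'ping and fail gracefully' step - the URL is a placeholder, not the real tracking endpoint:

import requests

USAGE_STATS_SERVER_URL = "https://example.com/bodywork/usage"  # placeholder constant

def ping_usage_stats_server() -> None:
    # send an anonymous usage ping, without ever letting a failure break the workflow
    try:
        requests.get(USAGE_STATS_SERVER_URL, timeout=2)
    except requests.exceptions.RequestException:
        pass  # fail gracefully - usage stats must never interrupt a deployment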

Make local workflow controller the default (and only) option for the CLI

It makes no sense for the workflow controller to be run anywhere other than locally - e.g. even if running on a CI/CD box, it needs to run locally. This should become the default option and we should remove the asynchronous option from the CLI, but not from the internal k8s sub-module, as we'll need this for the REST API.

Make Secrets Namespace Agnostic

As a Machine Learning Engineer I would like to use the same set of secrets across multiple workflows/deployments and not deal with namespaces.

Tasks

Currently, Secrets are created for each individual namespace; instead, they should be arranged into groups and be namespace agnostic.

  • For the bodywork secret commands, replace the namespace argument with one that represents the name of the secrets group the secret belongs to - e.g. --group.
  • Create all secrets in the bodywork-deployment-jobs namespace.
  • Prefix the name of the secret with the group name provided (or similar), so that secrets can be retrieved according to the group they belong to - e.g. a 'Prod' group secret would be 'Prod-SSHKey'.
  • Add a secrets_group item to the project section of bodywork.yaml.
  • Ensure the relevant secrets are retrieved and added to the workflow container/workflow if secrets_group is specified in the config.

N.B. Remember to remove the namespace setup and amend the Secret creation in test_workflow_and_service_management_end_to_end_from_cli.

Add support for Bitbucket and Azure DevOps

As a Machine Learning Engineer I would like to use Bitbucket and Azure DevOps to host my code.

Currently, Bodywork does not support Bitbucket or Azure DevOps hosted repositories. Extend git.py to allow connections to these repositories and then test that this works by running Bodywork against repos hosted on these sites.

Enable ad hoc remote workflow execution

Story
As a ML engineer, I would like to be able to trigger workflows on an ad hoc basis, without having to have the workflow-controller run locally, so that my machines (e.g. CICD runners) do not have to dedicate non-cluster resources for running workflows.

Tasks

  • Create a bodywork.k8s.jobs.configure_workflow_job function for defining a workflow execution job, using bodywork.k8s.cronjobs.configure_cronjob as a reference.
  • Create a bodywork.cli.deploy module that contains a function for triggering a workflow job (see the sketch after this list).
  • Create a bodywork deploy command in bodywork.cli.cli for providing this functionality from the CLI.
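A minimal sketch of the kind of Job that configure_workflow_job might assemble - the image name and CLI invocation are assumptions:

from kubernetes import client

def configure_workflow_job(namespace: str, repo_url: str, repo_branch: str) -> client.V1Job:
    # define a one-off k8s Job that runs the Bodywork workflow-controller
    container = client.V1Container(
        name="bodywork",
        image="bodyworkml/bodywork-core:latest",  # assumed image name
        command=["bodywork", "workflow", repo_url, repo_branch],  # assumed CLI invocation
    )
    job_spec = client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,
    )
    return client.V1Job(
        metadata=client.V1ObjectMeta(name="adhoc-workflow-job", namespace=namespace),
        spec=job_spec,
    )

The returned object would then be passed to BatchV1Api().create_namespaced_job by the new bodywork.cli.deploy function.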

K8s Package Methods Should Return a List When Returning Objects

Currently, the Bodywork k8s package 'get' methods return dictionaries. Now that we have started to return objects (e.g. from display_secrets), these should instead be returned as a list of objects, because this is best practice for the retrieval of objects in the data layer.

However, at this moment in time, Secrets is the only object returned from the data layer, with the rest being dictionaries that return a specific value; therefore, to maintain consistency in the data layer, this should not be done now. These will become objects too at some point, as we expand the data that is returned for each of these items. When this occurs, we should refactor all of these methods to return lists.

Update Secret Command

"As an ML Engineer I would like to update an existing Secret"

Extend the Secrets CLI command with a sub command to update a secret i.e. bodywork secret update --ns group-name --name mysecret --data USERNAME=marios PASSWORD=xyz

Shorten Parameter Names for CLI Commands

Some parameter names for CLI commands are unnecessarily long; shorten these to improve the UX. Also make sure they are consistent across commands.

e.g. git-repo-url -> git-url

Remove Workflow Command

The deployment create command is equivalent to the workflow command; therefore, the workflow command should be removed to keep things clean and simple.

This will involve updating the bodywork command used to run the workflows in bodywork.k8s.workflow_jobs.configure_workflow_job.

Delete a Group of Secrets

"As an ML Engineer I would like to delete a whole group of secrets"

Expand the CLI delete secret function so that it is possible to delete a whole group of secrets by just providing the group name. This will involve amending the existing 'delete secret' methods in both k8s.secrets.py & cli.secrets.py.

Extend CLI to enable ingress resource handling

Story
As an ML engineer, I would like to be able to see ingress information and use it to manage ingress, so that I have control over how my services are exposed beyond the k8s cluster.

Tasks

  • Modify cli.service_deployments.display_service_deployments to include ingress information.
  • Modify cli.service_deployments.delete_service_deployment_in_namespace to handle ingress alongside deployments.

CLI to generate yaml files

Hi,
is there the possibility of using the CLI to generate YAML files, instead of interacting directly with the K8s installation through kubeconfig?

Since I usually apply GitOps pipelines to my MLOps clusters, I would like to test this by putting the deployments in a Helm chart or Kustomize YAML.

AFAIK I didn't find any option to achieve this.

Ty.

Enable stages to request GPU resources

Description
As a Machine Learning Engineer, I would like to be able to request GPU resources for stages in a workflow, so that tensor-based machine learning (e.g. PyTorch) can benefit from hardware-based acceleration for training and serving.

Tasks

  • manual PoC based on resources listed below.
  • add a gpu_request config parameter for all stage types.
  • extend bodywork.k8s.batch_jobs and bodywork.k8s.service_deployments to request GPU resources (see the sketch after this list).
  • think of a functional test - e.g. trying to run a deployment that uses PyTorch, explicitly checking for GPU availability and printing to stdout?
  • update docs accordingly.
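A minimal sketch of how a gpu_request could translate into a container resource request, assuming the standard NVIDIA device plugin resource name:

from kubernetes import client

def gpu_resource_requirements(gpu_request: int) -> client.V1ResourceRequirements:
    # GPUs must be requested via limits; nvidia.com/gpu assumes the NVIDIA device plugin
    return client.V1ResourceRequirements(limits={"nvidia.com/gpu": str(gpu_request)})

The returned object would be set on the resources field of the V1Container specs built in bodywork.k8s.batch_jobs and bodywork.k8s.service_deployments.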

Resources

Enable horizontal autoscaling for service deployments

Description
As a Machine Learning Engineer, I would like the number of replicas standing behind my services to scale automatically, based on CPU utilisation, so that I do not have to frequently monitor CPU utilisation and manually change the number of replicas.

Tasks

  • extend the config schema to allow for a new (optional) scale_out_replicas parameter, that will represent the number of replicas above those specified in replicas that Kubernetes can scale the deployment up to (see the sketch at the end of this issue).
  • bodywork.k8s.service_deployments need to be extended to enable CRUD operations for HorizontalPodAutoscaler resources.
  • if scale_out_replicas is present, then the bodywork.config.StageConfig object should flag to bodywork.workflow_execution.run_workflow, that a HorizontalPodAutoscaler resource should be created.
  • think a lot about how to test this - e.g. monitor a service that will deterministically consume CPU resources and require scale-out to kick-in?
  • update docs, where required.

Resources

  • refer to section 15.1.2 and listing 15.2 of 'Kubernetes in Action'.
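A minimal sketch of creating the corresponding HorizontalPodAutoscaler with the official Kubernetes Python client (autoscaling/v1, CPU-based; the target utilisation is illustrative):

from kubernetes import client, config

def create_cpu_autoscaler(namespace: str, deployment_name: str, replicas: int, scale_out_replicas: int) -> None:
    # scale a Deployment between `replicas` and `replicas + scale_out_replicas`
    hpa = client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name=deployment_name, namespace=namespace),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name=deployment_name
            ),
            min_replicas=replicas,
            max_replicas=replicas + scale_out_replicas,
            target_cpu_utilization_percentage=75,
        ),
    )
    config.load_kube_config()
    client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(namespace=namespace, body=hpa)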

Make Kubernetes integration tests cluster agnostic

Story
As a Bodywork Developer, I would like to be able to run Bodywork's integration tests locally, using Minikube, so that I can easily test Bodywork on different versions of Kubernetes to the one used on the official development cluster (AWS EKS).

Background
Currently, Bodywork's integration tests expect the ingress load balancer URL to be in the location where it is installed on our AWS EKS dev cluster, as determined here. This means that Bodywork cannot be tested using local clusters, such as Minikube, which install the ingress controller elsewhere.

Tasks

  • Look into alternative methods that could be used to obtain the ingress URL, that are invariant to how it has been installed.
  • Implement any alternative methods.
  • Look for similar issues, beyond ingress.

Expose Git commit hash as environment variable in Bodywork containers to help with versioning

Description
As a Machine Learning Engineer, I would like access to the Git commit hash of the running pipeline, so that I can use it to tag the artefacts generated in my pipelines.

Tasks

  • create a new function in the bodywork.git module that uses git rev-parse --short HEAD to get the commit hash (see the sketch after this list).
  • modify bodywork.k8s.batch_jobs and bodywork.k8s.service_deployments to take the Git commit hash as an argument that is then injected as an environment variable, in much the same way that GitHub SSH credentials are injected.
  • enable workflow_execution.py to retrieve and then inject the git commit hash into jobs and deployments.
  • add tests and update documentation with a new section along the lines of 'pipeline versioning'.
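A minimal sketch of the new bodywork.git function (the function name is an assumption):

import subprocess
from pathlib import Path

def get_git_commit_hash(repo_dir: Path = Path(".")) -> str:
    # return the short commit hash of the checked-out HEAD
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()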

Update deployment template projects for v0.3.0 release

The config.ini files in each of the deployment templates will require the INGRESS parameter to be set for all service stages, in order to be compatible with Bodywork v0.3.0.

This is also a good opportunity to add a license (MIT for these?), version file and possibly a CHANGELOG.md file.

Some questions regarding the tutorial

I am following your tutorial. It is amazing, but I have a few questions.

In part 1 of the tutorial, when I wanted to manually test the deployed prediction endpoint, I typed this in my terminal:

curl http://CLUSTER_IP/pipelines/time-to-dispatch--serve-model/api/v0.1/time_to_dispatch

The CLUSTER_IP is not defined and it gives an error. I understand that the CLUSTER_IP might not be accessible if I use Minikube, but would this be a problem when I deploy it on AWS?

And also, where does the time-to-dispatch--serve-model in the endpoint come from?

Provide the option to execute a batch job on workflow failure

As an MLOps Engineer, I would like to be able to run custom code (e.g. send notifications) if a workflow has failed to execute, so that I can handle errors appropriately.

Tasks

  • extend the config file schema with a project.run_on_failure parameter, which takes the name of a stage that is only to be run if the workflow raises an exception.
  • if an exception was raised during bodywork.workflow_execution.run_workflow and project.run_on_failure has been set, then start a job to run the Python module specified in project.run_on_failure, using the exception as an argument to the module.

We should be able to use existing test repos that raise errors, in order to build an integration test for this.

Test the Kubernetes Python client to ensure that it can manage Ingress resources

Use it to implement the equivalent of,

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: ingress-scoring-service
  namespace: ml-workflow
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  rules:
  - http:
      paths:
      - path: /scoring-service(/|$)(.*)
        backend:
          serviceName: bodywork-ml-pipeline-project--stage-2-deploy-scoring-service
          servicePort: 5000

When deploying the bodywork-ml-pipeline-project template project and using the Kubernetes NGINX ingress controller as deployed using this.
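For reference, a rough sketch of the same resource built with the official Kubernetes Python client - written against the newer networking.k8s.io/v1 API (which requires a recent client) rather than the v1beta1 manifest above, so the backend is expressed slightly differently:

from kubernetes import client, config

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(
        name="ingress-scoring-service",
        namespace="ml-workflow",
        annotations={
            "kubernetes.io/ingress.class": "nginx",
            "nginx.ingress.kubernetes.io/rewrite-target": "/$2",
        },
    ),
    spec=client.V1IngressSpec(
        rules=[
            client.V1IngressRule(
                http=client.V1HTTPIngressRuleValue(
                    paths=[
                        client.V1HTTPIngressPath(
                            path="/scoring-service(/|$)(.*)",
                            path_type="ImplementationSpecific",
                            backend=client.V1IngressBackend(
                                service=client.V1IngressServiceBackend(
                                    name="bodywork-ml-pipeline-project--stage-2-deploy-scoring-service",
                                    port=client.V1ServiceBackendPort(number=5000),
                                )
                            ),
                        )
                    ]
                )
            )
        ]
    ),
)

config.load_kube_config()
client.NetworkingV1Api().create_namespaced_ingress(namespace="ml-workflow", body=ingress)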

Improve exception handling and error messages for failed git clones

Story
As an ML engineer, I would like to know when a workflow has failed because the repo is private and the SSH credentials are missing, so that I can take the necessary steps to correct the problem.

Task
Improve the exception handling in git.py, so that users are better informed.

Create a CHANGELOG.md

That describes what has changed between 0.2.* and 0.3.0 and which will be kept up-to-date from this point onwards (can we modify the CICD template to check for this?).

Switch to using regular expressions in tests for CLI display functions

The tests for:

  • bodywork.cli.cronjobs.display_cronjobs_in_namespace
  • bodywork.cli.service_deployments.display_service_deployments_in_namespace
  • bodywork.cli.secrets.display_secrets_in_namespace

are rough-and-ready (approximate at best), relying on looking for simple strings. These could be made more precise by using regular expressions to look for exact matches.

Arguments to BodyworkStageConfig exceptions are not passed correctly in stage.py

Leading to log messages that make no sense.

For example, in line 161 of stage.py we have:

time_param_error = BodyworkStageConfigError(
    'MAX_STARTUP_TIME_SECONDS',
    'batch',
    name
)

Which should actually be,

time_param_error = BodyworkStageConfigError(
    name,
    'service',
    'MAX_STARTUP_TIME_SECONDS'
)

Check all uses of BodyworkStageConfigError in stage.py and fix where necessary.

Add ingress details to documentation

Story
As a ML engineer, I would like to know how to setup and manage ingress, so that I can expose my services to the world beyond the k8s cluster.

Task
Document all of the extra functionality introduced in the ingress epic.

Add create ingress route option for services in config.ini

Story
As a ML engineer, I would like to be able to configure service stages for cluster ingress, so that I can expose my services to the world beyond the k8s cluster.

Tasks

  • Modify the bodywork.stage.ServiceStage class to be able to parse, validate and store the following config.ini parameters.
...

[service]
CREATE_INGRESS=True

Enhance the CLI UX

Use the Rich package to improve the rendering of Bodywork deployment information for end users.
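A minimal example of the kind of Rich-based rendering being proposed (the table contents are illustrative):

from rich.console import Console
from rich.table import Table

table = Table(title="Bodywork Deployments")
table.add_column("Name")
table.add_column("Namespace")
table.add_column("Git Branch")
table.add_row("time-to-dispatch", "pipelines", "master")

Console().print(table)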
