kubeflow / examples
A repository to host extended examples and tutorials
License: Apache License 2.0
It would be great to be able to show how we collect/surface RPC metrics for models.
The current thinking is to use ISTIO; see kubeflow/kubeflow#464
@lluunn Once we have ISTIO working
/cc @lluunn
DeepVariant might make an interesting example.
This could be a nice example to illustrate various aspects such as
Example: Estimator and Keras API training to serving + Unit testing
Priority: P1 - It's not a must have, but could be a great contribution to the TF community, where many data scientists train using one of these two APIs.
The k8s-model-server component in Kubeflow currently contains an inception client example that interacts with a custom model graph: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/inception_saved_model.py
ML practitioners often use TF estimator and Keras APIs to train models, as it greatly simplifies the training and validation process. However, converting these to servable models can be trickier and harder to debug. Add some examples of how to build servable models trained using Estimator and Keras APIs, and unit test examples.
See here
We currently set the GitHub token as a parameter. This means it would be checked into source control, but GitHub tokens should be kept secret.
So instead we should modify the app to use a K8s secret to supply it.
/assign @texasmichelle
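As a sketch of the app-side change, assuming the secret is exposed to the pod as an environment variable named GITHUB_TOKEN (the variable name and error text here are illustrative, not the app's actual ones):

```python
import os

def get_github_token():
    """Read the GitHub token from an environment variable.

    In Kubernetes the variable would be populated from a Secret via the
    pod spec's env[].valueFrom.secretKeyRef, so the token never appears
    in the ksonnet app or in source control.
    """
    token = os.environ.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError(
            "GITHUB_TOKEN is not set; mount it from a Kubernetes Secret.")
    return token
```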
Setup a k8s cluster & cloud project for use in creating examples.
Component of Kubeflow #157.
Katacoda recently created a scenario out of the GitHub issue summarization example and ran into a number of frustrating issues. The following items need to be addressed in order to turn this example into a platform-independent self-contained unit for use in a wide number of environments.
Currently most of the examples do not show how to both complete the training of a model within Kubeflow and then take that trained model and serve it with Kubeflow.
We need an example that covers this from start to finish.
I never got the Linear training mode working, so I stuck to CNN for the example.
Line 142 in 1be7ccb
If the Linear portion of the model were fixed, it could be upstreamed into TensorFlow to replace their example and reduce duplication.
Also, the upstream model in its current form supports neither distributed training nor exporting.
Create a model server using TFServing.
Component of #14.
I am using the keras model as defined in this tutorial: https://github.com/hamelsmu/Seq2Seq_Tutorial/blob/master/notebooks/Tutorial.ipynb
I extracted the encoder model using extract_encoder_model
and exported it as a TensorFlow model. When used with TensorFlow Serving, I get the following error:
AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="Expected multiples argument to be a vector of length 3 but got length 2
[[Node: Encoder-Last-GRU_1/Tile = Tile[T=DT_FLOAT, Tmultiples=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Encoder-Last-GRU_1/ExpandDims, Encoder-Last-GRU_1/Tile/multiples)]]")
I think the GitHub data is updated regularly. So we could try setting up a cron job or other solution to periodically retrain and push the model.
This depends on deploying it first (#39 ).
We need to have an example directory structure. E.g., Do we organize top level folder as frameworks i.e., tensorflow, xgboost, scikit etc? Or do we use problem type or something else? This needs to be figured out sooner so that we can add examples in appropriate folders as the current flat structure will soon be hard to navigate.
It looks like Katib has some very nice features for keeping track of your models and then surfacing metrics for those models, e.g. by launching TensorBoard.
It would be great to combine this with our GH issue summarization example. In particular, it would be great if we could load the trained model into a DB and then use the Katib/ModelDB UI to browse models and look at results.
Should we extend our example on GH issue summarization to predict issue labels?
One of the points of the original blog post was to train useful word embeddings on the entire corpus. So we could potentially use this to learn features that would then allow us to train models specific to repositories/orgs which likely have less data and their own taxonomy of labels.
Predicting issue labels would be useful for creating examples that highlight model analysis tooling. A lot of model analysis tools assume you can compute metrics like true positive/true negative etc... If we predict labels we can use actual labels to compute this.
With our existing text summarization example there's no obvious way to compute whether a summarization is accurate or not, which limits our ability to use it as an example of model analysis.
This would also be useful as a proxy for a large class of ecommerce problems where the goal is to dedupe related posts (e.g. different ebay postings for the same product).
@hamelsmu Any idea how difficult this would be and whether it would be valuable?
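Once labels are predicted, the usual classification metrics fall out directly from the actual labels. A minimal sketch for a single label (the label choice and inputs below are illustrative):

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for one label across a set of issues.

    predicted/actual are parallel lists of booleans: whether the label
    was predicted for, or actually applied to, each issue.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. did the model predict the "bug" label on each of four issues?
p, r = precision_recall([True, True, False, False],
                        [True, False, True, False])
```

This is exactly the kind of computation that is impossible for the summarization model, where there is no ground-truth notion of a "correct" summary.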
The link on this page
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/serving_the_model.md
to IssueSummarization is broken. Looks like the correct link should be
/assign @ankushagarwal
Goals:
Steps:
Potential additional or non-steps:
Current PR: #60
Readme: https://github.com/cwbeitel/examples/tree/enhance/enhance
Component of #14.
To support Katacoda #89 we should remove the need for GCP credentials.
For the output we should support using a PD (persistent disk) and make it easy for users to set the PD via parameters.
The input is trickier. I think GCS requires an account even for public buckets. If it's a single file, we could use the HTTP URL to access it.
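For the single-file case, a hedged sketch of fetching the input over plain HTTP instead of GCS (the function name and destination layout are made up for illustration):

```python
import os
import urllib.request

def fetch_if_missing(url, dest):
    """Download url to dest unless the file is already present.

    Lets a scenario pull a public dataset over plain HTTP(S),
    avoiding any GCP credentials for the input data.
    """
    if not os.path.exists(dest):
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest
```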
Replicated from Kubeflow #157.
@hamelsmu published a great blog post about using sequence to sequence models to summarize GitHub issues.
It would be great to turn this into an E2E solution using Kubeflow that highlights the benefit of using Kubeflow and K8s for data science.
There are lots of reasons why I think this blog post would make for a fantastic E2E solution
Here's a stab at what an E2E solution might look like
I think there's quite a bit of work to be done but I think we can split this into tasks
We need to move out of mlkube-testing and into kubeflow-ci
I think it would be very valuable to have a recommendation example. The purpose of this issue is to identify a scenario and dataset around which we could build a solution.
Some possible datasets
GitHub Data
Hacker News or Reddit
The MovieLens data seems less interesting because it isn't updated frequently.
Refactor the example notebook into libraries to be invoked with K8sJob & TFJobs:
Component of #14.
I think it would be interesting to deploy the web app and model on a K8s cluster and make it publicly available.
This would be a good test bed for a variety of things; e.g. periodic training and rollouts; monitoring etc...
Build a docker image using Argo to be used by K8sJob, TFJob, TFServing, etc.
Component of #14.
Currently the model hard codes the paths here of
This means users have to rebuild the Docker image just to try out their own model, which makes rerunning the demo on their own model more difficult.
A better approach would be to allow these files to be overridden, e.g. using environment variables.
We might still want to bake the data into the Docker image so that users can try serving without having to train.
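A minimal sketch of the env-var override (the variable names and default paths below are placeholders, not the actual files in the image):

```python
import os

def model_paths(environ=os.environ):
    """Resolve model artifact paths, preferring env-var overrides.

    The defaults point at files baked into the Docker image, so serving
    still works out of the box; setting the env vars lets users point
    at their own artifacts without rebuilding the image.
    """
    return {
        "model": environ.get("MODEL_PATH", "/model/seq2seq_model.h5"),
        "body_pp": environ.get("BODY_PP_PATH", "/model/body_pp.dpkl"),
        "title_pp": environ.get("TITLE_PP_PATH", "/model/title_pp.dpkl"),
    }
```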
See #89
We should create buckets and other resources (GCR) to store data for our examples.
We currently have the bucket gs://kubeflow-examples, but it's owned by project kubeflow-dev, which isn't the right home for example assets.
I created project: kubeflow-examples
It doesn't look like the vendor directory got checked in.
https://github.com/kubeflow/examples/tree/master/github_issue_summarization/ks-kubeflow
I'm guessing because of our .gitignore.
We should check it in.
Create instructions for running pylint locally and troubleshooting presubmit failures that prevent PRs from being merged. Follow-on from merged PR #61 .
Document which tools to use locally (tf-operator describes yapf here) and how to find and use the versions in our test infrastructure. Since the Prow UI only shows file-level granularity, explain how to access the test cluster, find the right pods, & view the logs containing line-level failure details.
We should add E2E testing for the GH issue example to make sure it doesn't break.
Some things to test
In the original blog post, Hamel filtered down the number of issues from 5M to 2M.
Can we use Apache Beam to run the preprocessing on all 5 million issues and scale out horizontally?
This would be the first step in giving us a very nice scaling out story.
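The cleaning step is per-issue and stateless, which is what makes Beam a good fit: each element is independent, so the pipeline can fan out horizontally. A sketch of such a per-element function (the actual cleaning rules in the blog post differ; this is illustrative) that could be applied via a Beam Map transform:

```python
import re

def clean_issue_text(text):
    """Normalize one issue body for the seq2seq preprocessor.

    Stateless and per-element, so it could run inside a Beam transform
    (e.g. issues | beam.Map(clean_issue_text)) and scale to all
    5 million issues across many workers.
    """
    text = text.lower()
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # drop code blocks
    text = re.sub(r"https?://\S+", " ", text)                # drop URLs
    text = re.sub(r"\s+", " ", text)                         # collapse whitespace
    return text.strip()
```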
Create a ksonnet app for deploying the model & web app.
Component of #14.
We'd like to provide an example of Pachyderm + TFJob to illustrate
The current thought see kubeflow/kubeflow#151 is to create a simple Pachyderm pipeline to launch a TFJob to train the model.
The main challenge is that the data needs to be exported from Pachyderm and the resulting model imported into Pachyderm.
There's lots of discussion in kubeflow/kubeflow#151 about how to do this. The basic idea is
Pachyderm invokes a script that launches a TFJob
As part of the TFJob we export/import data from the Pachyderm data store
To use Pachyderm we would also need to deploy Pachyderm on K8s.
Component of #14.
For the GitHub issue summarization model would it make sense to do a simple grid search for hyperparameter tuning?
I think this would be pretty straightforward to implement
Create a simple Python program(controller) to do a grid search
Controller can store information in a file on a PD for resilience
Run the controller as a K8s Job.
I think my main question is: does the model have hyperparameters worth tuning? Are there suitable metrics for deciding which model is best?
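A sketch of such a controller (the hyperparameter names and the train function are hypothetical); progress is checkpointed to a JSON file, which could live on the PD, so a restarted Job resumes where it left off:

```python
import itertools
import json
import os

# Illustrative grid; the real hyperparameters would come from the model.
GRID = {"learning_rate": [1e-3, 1e-4], "latent_dim": [128, 256]}

def grid_search(train_fn, state_path="results.json"):
    """Evaluate every combination in GRID, checkpointing after each run.

    train_fn(params) -> metric launches one training run (e.g. a TFJob)
    and returns a validation metric; higher is treated as better.
    """
    results = {}
    if os.path.exists(state_path):          # resume after a restart
        with open(state_path) as f:
            results = json.load(f)
    keys = sorted(GRID)
    for values in itertools.product(*(GRID[k] for k in keys)):
        params = dict(zip(keys, values))
        key = json.dumps(params, sort_keys=True)
        if key in results:                  # already evaluated
            continue
        results[key] = train_fn(params)
        with open(state_path, "w") as f:    # checkpoint to disk (the PD)
            json.dump(results, f)
    best = max(results, key=results.get)
    return json.loads(best), results[best]
```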
Running through the scenario:
step 1: no issue.
step 2:
git clone https://github.com/kubeflow/examples.git; cd examples/github_issue_summarization/notebooks/ks-app
The kubectl log command (the very last one) needs -n kubeflow.
step 3:
Got "The environment has expired. Please refresh to get a new environment." when started, so I ran kubectl describe ...
step 4:
ks apply frontendenv -c ui
failed because github_token is not set. Fixed with ks param set ui github_token $GITHUB_TOKEN.
It would be good to provide a rough estimate of how long it takes to preprocess and train the model using datasets of different sizes; e.g.
The original blog post had some measurements of how long things took using how much resources.
Create XGBoost Zillow housing prediction example from Kaggle Zestimate kernel.
👍 Great work guys!!!
Running the mnist example with minio, getting this error:
FileSystemStoragePathSource encountered a file-system access error: Could not find base path s3://mybucket/models/myjob-8b52d/export/mnist/
Any chance you can publish the code for elsonrodriguez/model-server:1.0?
The tfjob example for github issue summarization saves the model locally and then uploads it to GCS. We should try to write the model directly to GCS.
I think we could move the examples in tf-operator to this repo to maintain all examples in one repo.
We should deploy the webserver and model on our dev instance of Kubeflow (dev.kubeflow.org) and provide a public URL for accessing the app.
The GH web app should include links and information about Kubeflow.
It's free advertising.
/assign @ankushagarwal
Some examples are publishing Docker images. We should probably create a GCR registry to host these.
Example:
GitHub issue example referring to an image in gcr.io/agwl-kubeflow
The purpose of this example is to showcase the benefits of the Kubeflow infrastructure in training a reinforcement learning agent.
Core tasks:
Optionally:
/cc @nkashy1 @danijar @aronchick @jlewi
We should be able to train the model using TFJob so that we can take advantage of K8s for distributed training.
Right now the instructions only describe how to train inside the Jupyter notebook
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/training_the_model.md
The Argo workflow does train the model as a batch job, but it isn't run in distributed mode:
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/workflow/github_issues_summarization.yaml
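TFJob coordinates replicas by setting the TF_CONFIG environment variable on each pod; the training code has to parse it to learn its role in the cluster. A minimal, stdlib-only sketch of that parsing (the fallback value "local" is a convention chosen here, not something TFJob defines):

```python
import json
import os

def cluster_role():
    """Return (job_name, task_index, cluster_spec) from TF_CONFIG.

    TFJob sets TF_CONFIG on every replica, e.g.
    {"cluster": {"ps": [...], "worker": [...]},
     "task": {"type": "worker", "index": 0}}
    A missing/empty TF_CONFIG is treated as local, non-distributed
    training.
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return (task.get("type", "local"),
            task.get("index", 0),
            tf_config.get("cluster", {}))
```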
Instructions say
The build/ directory contains all the necessary files to build the seldon-core microservice image
But I don't see that directory or build_image.sh
@ankushagarwal @texasmichelle is there something I'm missing? I assume the instructions are just outdated?
We went through this documentation using minio as S3 storage and a Kubernetes cluster hosted on Azure, and we found some issues.
I am planning to open a PR to share our learnings from that.
To avoid duplicate sections, I was thinking of creating a new md file for minio and referring to it from the README.md in the same folder, since a few tweaks are necessary to make it work.
WDYT?