
examples's Introduction


Kubeflow is the cloud-native platform for machine learning operations: pipelines, training, and deployment.


Documentation

Please refer to the official docs at kubeflow.org.

Working Groups

The Kubeflow community is organized into working groups (WGs) with associated repositories that focus on specific pieces of the ML platform.

Quick Links

Get Involved

Please refer to the Community page.


examples's Issues

[GH Issue Summarization] Hyperparameter tuning (grid search)?

@hamelsmu @ankushagarwal

For the GitHub issue summarization model would it make sense to do a simple grid search for hyperparameter tuning?

I think this would be pretty straightforward to implement

  • Create a simple Python program (the controller) to do a grid search

    • Launch N TFJobs at a time and wait for them to complete.
  • The controller can store information in a file on a PD for resilience

    • SQLite might be a good choice
  • Run the controller as a K8s Job.

I think my main question is: does the model have hyperparameters worth tuning? And are there suitable metrics for deciding which model is best?
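A minimal sketch of what such a controller could look like, assuming the TFJob CRD is installed under the kubeflow.org API group; the group/version, image, namespace, and hyperparameter names here are illustrative placeholders, not the example's actual code:

```python
import itertools
import time

from kubernetes import client, config

# Hypothetical grid of hyperparameters worth sweeping; the real model's
# knobs (if any) would go here.
GRID = {
    "learning_rate": [0.001, 0.01],
    "num_layers": [1, 2],
}

def make_tfjob(name, params):
    """Build a TFJob custom resource that runs one training trial."""
    args = ["--%s=%s" % (k, v) for k, v in params.items()]
    return {
        "apiVersion": "kubeflow.org/v1",  # version depends on the installed operator
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": "kubeflow"},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": 1,
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "tensorflow",
                                "image": "gcr.io/<project>/tf-job-issue-summarization",  # placeholder
                                "args": args,
                            }],
                            "restartPolicy": "Never",
                        }
                    },
                }
            }
        },
    }

def main():
    config.load_incluster_config()  # the controller itself runs as a K8s Job
    api = client.CustomObjectsApi()
    for i, values in enumerate(itertools.product(*GRID.values())):
        params = dict(zip(GRID.keys(), values))
        job = make_tfjob("gh-summ-trial-%d" % i, params)
        api.create_namespaced_custom_object(
            "kubeflow.org", "v1", "kubeflow", "tfjobs", job)
        # A real controller would launch N at a time, poll job status, and
        # checkpoint progress (e.g. to SQLite on a PD) instead of sleeping.
        time.sleep(5)

if __name__ == "__main__":
    main()
```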

Create tf-serving examples for models trained using higher level APIs

Example: Estimator and Keras API training to serving + Unit testing
Priority: P1 - It's not a must have, but could be a great contribution to the TF community, where many data scientists train using one of these two APIs.

The k8s-model-server component in Kubeflow currently contains an inception client example that interacts with a custom model graph: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/inception_saved_model.py

ML practitioners often use the TF Estimator and Keras APIs to train models, as they greatly simplify the training and validation process. However, converting these to servable models can be trickier and harder to debug. We should add examples of how to build servable models trained using the Estimator and Keras APIs, along with unit tests.
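As a starting point, here is a hedged sketch of exporting a tf.keras model in the SavedModel format that TF Serving loads; the toy model architecture and export path are placeholders:

```python
import tensorflow as tf

# Toy stand-in for a real model; any tf.keras model exports the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# TF Serving expects a versioned directory layout: <model_base_path>/<version>/
export_path = "./export/github_issue_model/1"
model.save(export_path)  # with TF 2.x this writes a SavedModel (saved_model.pb + variables/)
```

TF Serving can then be pointed at ./export/github_issue_model as the model base path. An Estimator-based variant would instead call the estimator's export_saved_model method with a serving input receiver function.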

[GH Issue] E2E test

We should add E2E testing for the GH issue example to make sure it doesn't break.

Some things to test

  • Make sure we can deploy all the components using ksonnet
  • Make sure training runs (just run a couple steps)
  • Test that the predict RPC generates responses
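A rough sketch of what the predict-RPC check could look like, assuming the TF Serving gRPC port is reachable from the test; the host, model name, input tensor name, and shape are illustrative assumptions:

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

def smoke_test_predict(host="tf-serving.kubeflow:9000"):
    """Send one Predict RPC and assert that a response comes back."""
    channel = grpc.insecure_channel(host)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = "issue-summarization"  # placeholder model name
    # Placeholder input; the real test would send a tokenized issue body.
    request.inputs["input"].CopyFrom(
        tf.make_tensor_proto([[1, 2, 3]], dtype=tf.int64))

    response = stub.Predict(request, 10.0)  # 10 second timeout
    assert len(response.outputs) > 0, "predict RPC returned no outputs"

if __name__ == "__main__":
    smoke_test_predict()
```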

[GH Issue] Scale out preprocessing using Apache Beam

In the original blog post, Hamel filtered down the number of issues from 5M to 2M.

Can we use Apache Beam to run the preprocessing on all 5 million issues and scale out horizontally?

This would be the first step in giving us a very nice scaling out story.
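A hedged sketch of what the Beam version of the preprocessing might look like; the input/output paths and the cleaning function are placeholders, and on GCP the same pipeline could run on Dataflow for horizontal scale-out:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean_issue(record):
    """Placeholder for the blog post's text cleaning / tokenization step."""
    issue = json.loads(record)
    return {"title": issue["title"].lower(), "body": issue["body"].lower()}

def run(argv=None):
    options = PipelineOptions(argv)  # e.g. --runner=DataflowRunner to scale out
    with beam.Pipeline(options=options) as p:
        (p
         | "ReadIssues" >> beam.io.ReadFromText("gs://<bucket>/github_issues/*.json")
         | "Clean" >> beam.Map(clean_issue)
         | "Serialize" >> beam.Map(json.dumps)
         | "Write" >> beam.io.WriteToText("gs://<bucket>/preprocessed/issues"))

if __name__ == "__main__":
    run()
```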

[GH Issue] Document preprocessing and training times

It would be good to provide a rough estimate of how long it takes to preprocess and train the model using datasets of different sizes; e.g.

  • Using 2M issues sampled from the dataset per the original blog post
  • Using the entire dataset

The original blog post had some measurements of how long things took and how many resources they used.

[GH Issue Summarization] E2E Solution on Kubeflow

Replicated from Kubeflow #157.

@hamelsmu published a great blog post about using sequence to sequence models to summarize GitHub issues.

It would be great to turn this into an E2E solution using Kubeflow that highlights the benefit of using Kubeflow and K8s for data science.

There are lots of reasons why I think this blog post would make for a fantastic E2E solution

  • Text summarization has a lot of applications
  • It uses GitHub data which is a very rich dataset
  • Training and preprocessing take enough time (~30 minutes and ~1 hour respectively) that I think it makes sense to run these as K8s jobs, but not so much time as to be a barrier.

Here's a stab at what an E2E solution might look like

  • The entrypoint would be a notebook (based on the one in the blog post).
  • The notebook would walk through the various steps, but instead of (or in addition to) running code directly in the notebook, it would launch the relevant steps as jobs on the cluster.

I think there's quite a bit of work to be done but I think we can split this into tasks

  • Setup a shared Kubeflow cluster and Cloud project for dev team to use
  • Create a Docker image to be used with Jupyter with all dependencies installed
  • Refactor the notebook into libraries with suitable main functions so that the relevant steps (preprocessing and training) can be invoked in K8s Jobs and TFJobs (sketched below)
  • Build the Docker images (using Argo) to be used by TFJob, TF Serving, etc.
  • Create a model server using TF Serving / Seldon Core
  • Create a web app to serve as the front end
  • Create ksonnet components to deploy the model and web app
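For the refactoring item above, each step could be exposed behind a small command-line entrypoint so the same code runs in the notebook and in a K8s Job or TFJob. A hypothetical sketch (step names, flags, and defaults are illustrative):

```python
import argparse

def preprocess(input_path, output_path):
    """Placeholder for the preprocessing code factored out of the notebook."""
    ...

def train(data_path, model_dir):
    """Placeholder for the training code factored out of the notebook."""
    ...

def main():
    parser = argparse.ArgumentParser(description="GH issue summarization steps")
    parser.add_argument("step", choices=["preprocess", "train"])
    parser.add_argument("--input", default="/data/github_issues.csv")
    parser.add_argument("--output", default="/data/output")
    args = parser.parse_args()

    if args.step == "preprocess":
        preprocess(args.input, args.output)
    else:
        train(args.input, args.output)

if __name__ == "__main__":
    main()
```

The notebook then simply calls these functions, while the K8s Job and TFJob specs invoke the same script with the appropriate step argument.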

tensorflow serving not working with minio

👋 Great work guys!!!

Running the MNIST example with Minio, I'm getting this error:

FileSystemStoragePathSource encountered a file-system access error: Could not find base path s3://mybucket/models/myjob-8b52d/export/mnist/ 

Any chance you can publish code for elsonrodriguez/model-server:1.0?
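For anyone hitting this: TF Serving's S3-compatible filesystem layer is configured through environment variables on the serving container. A hedged sketch of wiring a Minio endpoint in with the Kubernetes Python client (container name, image, bucket path, and credential values are illustrative, and credentials would normally come from a Secret):

```python
from kubernetes import client

# Environment variables read by TF Serving's S3 support; values are placeholders.
minio_env = [
    client.V1EnvVar(name="AWS_ACCESS_KEY_ID", value="minio"),
    client.V1EnvVar(name="AWS_SECRET_ACCESS_KEY", value="minio123"),
    client.V1EnvVar(name="S3_ENDPOINT", value="minio-service.kubeflow:9000"),
    client.V1EnvVar(name="S3_USE_HTTPS", value="0"),
    client.V1EnvVar(name="S3_VERIFY_SSL", value="0"),
]

serving_container = client.V1Container(
    name="mnist-serving",
    image="tensorflow/serving",
    args=["--model_name=mnist",
          "--model_base_path=s3://mybucket/models/myjob-8b52d/export/mnist"],
    env=minio_env,
)
```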

[GH Label Prediction] Extend GH Issue Summarization to Predict Labels

Should we extend our example on GH issue summarization to predict issue labels?

One of the points of the original blog post was to train useful word embeddings on the entire corpus. So we could potentially use this to learn features that would then allow us to train models specific to repositories/orgs which likely have less data and their own taxonomy of labels.

Predicting issue labels would be useful for creating examples that highlight model analysis tooling. A lot of model analysis tools assume you can compute metrics like true positive/true negative etc... If we predict labels we can use actual labels to compute this.

With our existing text summarization example there's no obvious way to compute whether a summarization is accurate or not, which limits our ability to use it as an example of model analysis.

This would also be useful as a proxy for a large class of e-commerce problems where the goal is to dedupe related posts (e.g. different eBay postings for the same product).

@hamelsmu Any idea how difficult this would be and whether it would be valuable?
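On the model-analysis point: once real labels are available, standard classification metrics become straightforward to compute. A small hedged sketch using scikit-learn (the labels and predictions are dummy data):

```python
from sklearn.metrics import classification_report

# Dummy ground-truth and predicted labels for a handful of issues.
y_true = ["bug", "feature", "bug", "question", "feature"]
y_pred = ["bug", "bug", "bug", "question", "feature"]

# Per-label precision/recall/F1 -- the true-positive / false-positive style
# metrics that text summarization does not give us.
print(classification_report(y_true, y_pred))
```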

[GH Issue Summarization] Train model distributed using TFJob

We should be able to train the model using TFJob so that we can take advantage of K8s to train the model distributed.

Right now the instructions only describe how to train inside the Jupyter notebook
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/training_the_model.md

The Argo workflow does train the model as a batch job, but it's not running distributed
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/workflow/github_issues_summarization.yaml
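A hedged sketch of what a distributed TFJob body could look like; the field names follow the v1 TFJob schema (older operators used replicaSpecs with tfReplicaType, so adjust to the installed version), and the image, command, and replica counts are illustrative. It would be submitted with the same CustomObjectsApi call as in the grid-search sketch above:

```python
# Pod template shared by all replica types in this sketch.
worker_template = {
    "spec": {
        "containers": [{
            "name": "tensorflow",
            "image": "gcr.io/<project>/tf-job-issue-summarization",  # placeholder
            "command": ["python", "train.py", "--mode=distributed"],
        }],
        "restartPolicy": "Never",
    }
}

# Illustrative TFJob with chief, worker, and parameter-server replicas.
distributed_tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "gh-summ-distributed", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "Chief": {"replicas": 1, "template": worker_template},
            "Worker": {"replicas": 3, "template": worker_template},
            "PS": {"replicas": 2, "template": worker_template},
        }
    },
}
```

The tf-operator injects a TF_CONFIG environment variable into each replica, which Estimator-style training code can use to discover the cluster spec.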

GCR Registry examples

Some examples are publishing Docker images. We should probably create a GCR registry to host these.

Example: the GitHub issue example refers to an image in gcr.io/agwl-kubeflow.

[GH Example] Use katib/modeldb to store model results

It looks like katib has some very nice features for keeping track of your models and then surfacing metrics for those models, e.g. by launching TensorBoard.

It would be great to combine this with our GH issue summarization example. In particular, it would be great if we could load the trained model into a DB and then use the katib/ModelDB UI to browse models and look at results.

/cc @gaocegege @YujiOshima

Katacoda scenario on github summarization example; friction log

Running through the scenario

step 1: no issue

step 2:

  • [Major issue] Creating GCP token (secret) not working --> error: invalid literal source test, expected key=value
  • [Major issue] As a result, the tfjob failed to start: secrets "gcp-credentials" not found
  • git clone https://github.com/kubeflow/examples.git; cd examples/github_issue_summarization/notebooks/ks-app
    This is not working. The app is at examples/github_issue_summarization/ks-kubeflow
  • It's creating an environment called tfjob. I think the name should be changed.
    Using tfjob as an env name is confusing.
  • Once an environment has been delayed --> typo ? s/delayed/created ?
  • Points to IssueSummarization.py for the code. But training code should be training.py
  • The image is currently gcr.io/agwl-kubeflow/tf-job-issue-summarization. We should use
    gcr.io/kubeflow-images-public
  • The kubectl log command (very last one) needs -nkubeflow

step 3:

  • Got "The environment has expired. Please refresh to get a new environment." when I started, so I
    refreshed, redid step 1, and went straight to step 3.
  • I was able to get the prediction back, but it took ~5 min for the pod to be ready. We should probably
    mention that (and also how to check the status with kubectl describe ...).

step 4:

  • [Major issue] ks apply frontendenv -c ui failed because github_token is not set.
    Needs ks param set ui github_token $GITHUB_TOKEN

cc @jlewi @ankushagarwal

Create buckets and other resources to store example data

We should create buckets and other resources (GCR) to store data for our examples.

We currently have bucket gs://kubeflow-examples but that's owned by project kubeflow-dev which isn't really the best thing.

I created project: kubeflow-examples

[GH Issue] Use Pachyderm to launch TFJobs

See kubeflow/kubeflow#151

We'd like to provide an example of Pachyderm + TFJob to illustrate

  • Combining Pachyderm's orchestration capabilities with TFJob for distributed training
  • Highlighting Pachyderm's data provenance features with TFJob

The current thought (see kubeflow/kubeflow#151) is to create a simple Pachyderm pipeline to launch a TFJob to train the model.

The main challenge is that the data needs to be exported from Pachyderm and the resulting model imported into Pachyderm.

There's lots of discussion in kubeflow/kubeflow#151 about how to do this. The basic idea is

  • Pachyderm invokes a script that launches a TFJob

  • As part of the TFJob we export/import data from the Pachyderm data store

    • A variety of ideas have been suggested; e.g. using an Argo workflow, init containers, sidecars etc...

To use Pachyderm we would also need to deploy Pachyderm on K8s.
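A hedged sketch of the "script that launches a TFJob" from Pachyderm's side: Pachyderm mounts input repos under /pfs/<repo> and collects anything written to /pfs/out, so the pipeline's user code could stage data to storage the TFJob pods can reach, launch the job, then copy the model back. The repo name, mount paths, and launch_tfjob.py helper below are hypothetical:

```python
import shutil
import subprocess
from pathlib import Path

PFS_INPUT = Path("/pfs/github_issues")   # Pachyderm mounts the input repo here
PFS_OUTPUT = Path("/pfs/out")            # anything written here becomes output data
SHARED_DATA = Path("/mnt/shared/data")   # storage the TFJob pods can also reach
SHARED_MODEL = Path("/mnt/shared/model")

def main():
    # 1. Export the versioned input data out of Pachyderm's filesystem.
    SHARED_DATA.mkdir(parents=True, exist_ok=True)
    for f in PFS_INPUT.glob("*.csv"):
        shutil.copy(f, SHARED_DATA / f.name)

    # 2. Launch the TFJob and block until it finishes; here we shell out to a
    #    hypothetical launcher, but it could equally use the Kubernetes API
    #    as in the grid-search sketch earlier in this page.
    subprocess.run(["python", "launch_tfjob.py", "--wait",
                    "--data-dir", str(SHARED_DATA),
                    "--model-dir", str(SHARED_MODEL)], check=True)

    # 3. Import the trained model back into Pachyderm so it gets provenance.
    PFS_OUTPUT.mkdir(parents=True, exist_ok=True)
    for f in SHARED_MODEL.glob("**/*"):
        if f.is_file():
            shutil.copy(f, PFS_OUTPUT / f.name)

if __name__ == "__main__":
    main()
```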

[Agents RL] Demonstrate Kubeflow with an E2E RL example

The purpose of this example is to showcase the benefits of the Kubeflow infrastructure in training a reinforcement learning agent.

Core tasks:

  • Case study described that communicates the business value of the example; who will care about this example and why?
  • Illustrate the config, submit, monitor, render workflow for single-node training
  • Prow test verifies model trains in notebook container
  • Illustration of practice for building and pushing containers efficiently
  • Distributed training with TFJob operator (e.g. using @danijar's idea)
  • Illustration of simple hyperparameter tuning
  • Uses accelerators

Optionally:

  • Build a custom gym environment that captures a business problem of interest, e.g. reinforcement learning in the context of datacenter cooling, scheduling, hyperparameter tuning, etc.
  • Deploy the agent and custom environment, e.g. if this environment concerns kubernetes scheduling then use it to schedule resources on a cluster and measure whether there was a benefit

/cc @nkashy1 @danijar @aronchick @jlewi

[GH Issue] Use persistent volumes for the data

To support Katacoda #89 we should remove the need for GCP credentials.

  1. For the output we should support using a PD and make it easy for users to set the PD via parameters.

  2. The input is trickier. I think GCS requires an account even for public buckets. If it's a single file we could use the HTTP URL to access it.

    • My suggestion would be to create a script that would copy the data using curl to a PD. We could then run that script as a K8s job.
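A minimal sketch of that copy script, using Python's urllib instead of curl for illustration; the dataset URL is a placeholder and the PD is assumed to be mounted at /data inside the Job's pod:

```python
import os
import urllib.request

# Placeholder URL for the public issues dataset; override via env vars.
DATA_URL = os.environ.get(
    "DATA_URL", "https://storage.googleapis.com/<bucket>/github-issues.csv")
DEST = os.environ.get("DEST", "/data/github-issues.csv")

def main():
    os.makedirs(os.path.dirname(DEST), exist_ok=True)
    print("Downloading %s -> %s" % (DATA_URL, DEST))
    urllib.request.urlretrieve(DATA_URL, DEST)
    print("Done: %d bytes" % os.path.getsize(DEST))

if __name__ == "__main__":
    main()
```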

[GH Example] Friction points from Katacoda

Katacoda recently created a scenario out of the GitHub issue summarization example and ran into a number of frustrating issues. The following items need to be addressed in order to turn this example into a platform-independent self-contained unit for use in a wide number of environments.

Kubeflow examples needs to have an appropriate directory structure

We need an example directory structure. E.g., do we organize the top-level folders by framework (i.e., tensorflow, xgboost, scikit, etc.)? Or do we use problem type or something else? This needs to be figured out sooner rather than later so that we can add examples in appropriate folders, as the current flat structure will soon be hard to navigate.

[Enhance] Image enhancement example

Goals:

  • Demonstrate a high-impact biomedical imaging use case
  • Demonstrate distributed training
  • Demonstrate hyperparameter tuning, potentially informing future design of a TFStudy hptuning CRD
  • Demonstrate batch inference
  • Demonstrate the use of tensor2tensor, positioning for increased leverage in developing additional examples

Steps:

  • Launcher interface for running component steps in batch and testing for job success; each step smoke tested to run in batch at least displaying help message
  • Illustrate a tfhub-based development workflow (primarily in regard to how model code and dependencies are shipped to jobs) that sufficiently minimizes friction, has support of community
  • Batch data downloader pulls raw data to NFS
  • Correct definition of t2t Problem's for the image identity mapping and super-resolution problems
  • Batch example generator uses t2t-datagen to generate examples
  • T2TExperiment object abstracts interface for triggering a TFJob running t2t-trainer that is amenable to strategy for hyperparameter tuning; launches a job that makes use of a stock t2t model that trains in distributed form
  • Minimal prototype for creating a new hyperparameter study (e.g. registering in redis)
  • StudyRunner runs in batch, periodically launching newly registered experiments
  • Inference step runs in batch, wrapping t2t-decoder to allow user to enhance images by simply providing input and output paths (on NFS)
  • Model actually performs well on the provided task

Potential additional or non-steps:

  • Generalizing beyond NFS to support a variety of storage types
  • Implementing a production-caliber hyperparameter tuning solution

Current PR: #60
Readme: https://github.com/cwbeitel/examples/tree/enhance/enhance

Sketch out recommendation example

I think it would be very valuable to have a recommendation example. The purpose of this issue is to identify a scenario and dataset around which we could build a solution.

Some possible datasets

GitHub Data

  • We could recommend repositories based on stars
  • Issues/PRs (comments could be used to indicate a user was interested in an issue)
  • Recommend reviewers for PRs

Hacker News or Reddit

  • I think both datasets are available publicly in BigQuery

The MovieLens data seems less interesting because it isn't updated frequently.

Keras model exported as TensorFlow model doesn't work with TensorFlow serving

I am using the keras model as defined in this tutorial: https://github.com/hamelsmu/Seq2Seq_Tutorial/blob/master/notebooks/Tutorial.ipynb

I extracted the encoder model using extract_encoder_model and exported it as a TensorFlow model. When used with TensorFlow Serving, I get the following error:

AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="Expected multiples argument to be a vector of length 3 but got length 2
[[Node: Encoder-Last-GRU_1/Tile = Tile[T=DT_FLOAT, Tmultiples=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Encoder-Last-GRU_1/ExpandDims, Encoder-Last-GRU_1/Tile/multiples)]]")

[GH Issue] Don't require the model to be baked into the Docker image

Currently the code hard-codes (here) the paths of:

  • seq2seq_model_tutorial.h5 - the keras model
  • body_pp.dpkl - the serialized body preprocessor
  • title_pp.dpkl - the serialized title preprocessor

This means users have to rebuild the Docker image just to try out their own model, which makes it harder to rerun the demo with a custom model.

A better approach would be to allow these files to be overridden, e.g. using environment variables.

We might still want to bake the data into the Docker image so that users can try serving without having to train.

See #89
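A hedged sketch of the environment-variable override; the variable names and defaults are illustrative, and it assumes the .dpkl preprocessors were serialized with dill (which may not match the example exactly):

```python
import os

# Let the serving code pick up artifacts from arbitrary locations, falling
# back to the files baked into the image so the demo still works out of the
# box. Variable names here are illustrative.
MODEL_PATH = os.environ.get("SEQ2SEQ_MODEL_PATH", "seq2seq_model_tutorial.h5")
BODY_PP_PATH = os.environ.get("BODY_PP_PATH", "body_pp.dpkl")
TITLE_PP_PATH = os.environ.get("TITLE_PP_PATH", "title_pp.dpkl")

def load_artifacts():
    """Load the Keras model and the two serialized preprocessors."""
    import dill
    from keras.models import load_model  # assumes Keras is in the image already

    model = load_model(MODEL_PATH)
    with open(BODY_PP_PATH, "rb") as f:
        body_pp = dill.load(f)
    with open(TITLE_PP_PATH, "rb") as f:
        title_pp = dill.load(f)
    return model, body_pp, title_pp
```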

Linting instructions

Create instructions for running pylint locally and troubleshooting presubmit failures that prevent PRs from being merged. Follow-on from merged PR #61 .

Document which tools to use locally (tf-operator describes yapf here) and how to find and use the versions in our test infrastructure. Since the Prow UI only shows file-level granularity, explain how to access the test cluster, find the right pods, & view the logs containing line-level failure details.

DeepVariant Example

DeepVariant might make an interesting example.

This could be a nice example to illustrate various aspects such as

  • Preprocessing
  • Pipelines.
  • Large scale training
