kubeflow / examples
A repository to host extended examples and tutorials
License: Apache License 2.0
It would be great to be able to show how we collect/surface RPC metrics for models.
The current thinking is to use ISTIO; see kubeflow/kubeflow#464
@lluunn Once we have ISTIO working
/cc @lluunn
DeepVariant might make an interesting example.
This could be a nice example to illustrate various aspects such as
Example: Estimator and Keras API training to serving + Unit testing
Priority: P1 - It's not a must have, but could be a great contribution to the TF community, where many data scientists train using one of these two APIs.
The k8s-model-server component in Kubeflow currently contains an inception client example that interacts with a custom model graph: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/inception_saved_model.py
ML practitioners often use TF estimator and Keras APIs to train models, as it greatly simplifies the training and validation process. However, converting these to servable models can be trickier and harder to debug. Add some examples of how to build servable models trained using Estimator and Keras APIs, and unit test examples.
See here
We currently set the GitHub token as a parameter. This means it would be checked into source control, but GitHub tokens should be kept secret.
So instead we should modify the app to use a K8s secret to supply it.
/assign @texasmichelle
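As a sketch of the app-side change, assuming the secret is exposed to the pod as an environment variable named GITHUB_TOKEN (the variable name and error text here are illustrative, not the app's actual ones):

```python
import os

def get_github_token():
    """Read the GitHub token from an environment variable.

    In Kubernetes the variable would be populated from a Secret via the
    pod spec's env[].valueFrom.secretKeyRef, so the token never appears
    in the ksonnet app or in source control.
    """
    token = os.environ.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError(
            "GITHUB_TOKEN is not set; mount it from a Kubernetes Secret.")
    return token
```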
Setup a k8s cluster & cloud project for use in creating examples.
Component of Kubeflow #157.
Katacoda recently created a scenario out of the GitHub issue summarization example and ran into a number of frustrating issues. The following items need to be addressed in order to turn this example into a platform-independent self-contained unit for use in a wide number of environments.
Currently most of the examples do not show how to both complete the training of a model within Kubeflow and then take that trained model and serve it with Kubeflow.
We need an example that covers this from start to finish.
I never got the Linear training mode working, so I stuck to CNN for the example.
Line 142 in 1be7ccb
If the Linear portion of the model were fixed, it could be upstreamed into TensorFlow to replace their example and reduce duplication.
Also, the upstream model in its current form supports neither distributed training nor exporting.
Create a model server using TFServing.
Component of #14.
I am using the keras model as defined in this tutorial: https://github.com/hamelsmu/Seq2Seq_Tutorial/blob/master/notebooks/Tutorial.ipynb
I extracted the encoder model using extract_encoder_model
and exported it as a TensorFlow model. When used with TensorFlow Serving, I get the following error:
AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="Expected multiples argument to be a vector of length 3 but got length 2
[[Node: Encoder-Last-GRU_1/Tile = Tile[T=DT_FLOAT, Tmultiples=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Encoder-Last-GRU_1/ExpandDims, Encoder-Last-GRU_1/Tile/multiples)]]")
I think the GitHub data is updated regularly. So we could try setting up a cron job or other solution to periodically retrain and push the model.
This depends on deploying it first (#39 ).
We need to have an example directory structure. E.g., Do we organize top level folder as frameworks i.e., tensorflow, xgboost, scikit etc? Or do we use problem type or something else? This needs to be figured out sooner so that we can add examples in appropriate folders as the current flat structure will soon be hard to navigate.
It looks like Katib has some very nice features for keeping track of your models and then surfacing metrics for those models, e.g. by launching TensorBoard.
It would be great to combine this with our GH issue summarization example. In particular, it would be great if we could load the trained model into a DB and then use the Katib/ModelDB UI to browse models and look at results.
Should we extend our example on GH issue summarization to predict issue labels?
One of the points of the original blog post was to train useful word embeddings on the entire corpus. So we could potentially use this to learn features that would then allow us to train models specific to repositories/orgs which likely have less data and their own taxonomy of labels.
Predicting issue labels would be useful for creating examples that highlight model analysis tooling. A lot of model analysis tools assume you can compute metrics like true positive/true negative etc... If we predict labels we can use actual labels to compute this.
With our existing text summarization example there's no obvious way to compute whether a summarization is accurate or not, which limits our ability to use it as an example of model analysis.
This would also be useful as a proxy for a large class of ecommerce problems where the goal is to dedupe related posts (e.g. different ebay postings for the same product).
@hamelsmu Any idea how difficult this would be and whether it would be valuable?
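Once labels are predicted, the usual classification metrics fall out directly from the actual labels. A minimal sketch for a single label (the label choice and inputs below are illustrative):

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for one label across a set of issues.

    predicted/actual are parallel lists of booleans: whether the label
    was predicted for, or actually applied to, each issue.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. did the model predict the "bug" label on each of four issues?
p, r = precision_recall([True, True, False, False],
                        [True, False, True, False])
```

This is exactly the kind of computation that is impossible for the summarization model, where there is no ground-truth notion of a "correct" summary.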
The link on this page
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/serving_the_model.md
to IssueSummarization is broken. Looks like the correct link should be
/assign @ankushagarwal
Goals:
Steps:
Potential additional or non-steps:
Current PR: #60
Readme: https://github.com/cwbeitel/examples/tree/enhance/enhance
Component of #14.
To support Katacoda #89 we should remove the need for GCP credentials.
For the output we should support using a PD (persistent disk) and make it easy for users to set the PD via parameters.
The input is trickier. I think GCS requires an account even for public buckets. If it's a single file, we could use the HTTP URL to access it.
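For the single-file case, a hedged sketch of fetching the input over plain HTTP instead of GCS (the function name and destination layout are made up for illustration):

```python
import os
import urllib.request

def fetch_if_missing(url, dest):
    """Download url to dest unless the file is already present.

    Lets a scenario pull a public dataset over plain HTTP(S),
    avoiding any GCP credentials for the input data.
    """
    if not os.path.exists(dest):
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest
```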
Replicated from Kubeflow #157.
@hamelsmu published a great blog post about using sequence to sequence models to summarize GitHub issues.
It would be great to turn this into an E2E solution using Kubeflow that highlights the benefit of using Kubeflow and K8s for data science.
There are lots of reasons why I think this blog post would make for a fantastic E2E solution
Here's a stab at what an E2E solution might look like
I think there's quite a bit of work to be done but I think we can split this into tasks
We need to move out of mlkube-testing and into kubeflow-ci
I think it would be very valuable to have a recommendation example. The purpose of this issue is to identify a scenario and dataset around which we could build a solution.
Some possible datasets
GitHub Data
Hacker News or Reddit
The MovieLens data seems less interesting because it isn't updated frequently.
Refactor the example notebook into libraries to be invoked with K8sJob & TFJobs:
Component of #14.
I think it would be interesting to deploy the web app and model on a K8s cluster and make it publicly available.
This would be a good test bed for a variety of things; e.g. periodic training and rollouts; monitoring etc...
Build a docker image using Argo to be used by K8sJob, TFJob, TFServing, etc.
Component of #14.
Currently the model hard codes the paths here of
This means users have to rebuild the Docker image just to try out their own model, which makes rerunning the demo on their own model more difficult.
A better approach would be to allow these files to be overridden, e.g. using environment variables.
We might still want to bake the data into the Docker image so that users can try serving without having to train.
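A minimal sketch of the env-var override (the variable names and default paths below are placeholders, not the actual files in the image):

```python
import os

def model_paths(environ=os.environ):
    """Resolve model artifact paths, preferring env-var overrides.

    The defaults point at files baked into the Docker image, so serving
    still works out of the box; setting the env vars lets users point
    at their own artifacts without rebuilding the image.
    """
    return {
        "model": environ.get("MODEL_PATH", "/model/seq2seq_model.h5"),
        "body_pp": environ.get("BODY_PP_PATH", "/model/body_pp.dpkl"),
        "title_pp": environ.get("TITLE_PP_PATH", "/model/title_pp.dpkl"),
    }
```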
See #89
We should create buckets and other resources (GCR) to store data for our examples.
We currently have the bucket gs://kubeflow-examples, but it's owned by project kubeflow-dev, which isn't the right home for example assets.
I created project: kubeflow-examples
It doesn't look like the vendor directory got checked in.
https://github.com/kubeflow/examples/tree/master/github_issue_summarization/ks-kubeflow
I'm guessing because of our .gitignore.
We should check it in.
Create instructions for running pylint locally and troubleshooting presubmit failures that prevent PRs from being merged. Follow-on from merged PR #61 .
Document which tools to use locally (tf-operator describes yapf here) and how to find and use the versions in our test infrastructure. Since the Prow UI only shows file-level granularity, explain how to access the test cluster, find the right pods, & view the logs containing line-level failure details.
We should add E2E testing for the GH issue example to make sure it doesn't break.
Some things to test
In the original blog post, Hamel filtered down the number of issues from 5M to 2M.
Can we use Apache Beam to run the preprocessing on all 5 million issues and scale out horizontally?
This would be the first step in giving us a very nice scaling out story.
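The cleaning step is per-issue and stateless, which is what makes Beam a good fit: each element is independent, so the pipeline can fan out horizontally. A sketch of such a per-element function (the actual cleaning rules in the blog post differ; this is illustrative) that could be applied via a Beam Map transform:

```python
import re

def clean_issue_text(text):
    """Normalize one issue body for the seq2seq preprocessor.

    Stateless and per-element, so it could run inside a Beam transform
    (e.g. issues | beam.Map(clean_issue_text)) and scale to all
    5 million issues across many workers.
    """
    text = text.lower()
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # drop code blocks
    text = re.sub(r"https?://\S+", " ", text)                # drop URLs
    text = re.sub(r"\s+", " ", text)                         # collapse whitespace
    return text.strip()
```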
Create a ksonnet app for deploying the model & web app.
Component of #14.
We'd like to provide an example of Pachyderm + TFJob to illustrate
The current thought see kubeflow/kubeflow#151 is to create a simple Pachyderm pipeline to launch a TFJob to train the model.
The main challenge is that the data needs to be exported from Pachyderm and the resulting model imported into Pachyderm.
There's lots of discussion in kubeflow/kubeflow#151 about how to do this. The basic idea is
Pachyderm invokes a script that launches a TFJob
As part of the TFJob we export/import data from the Pachyderm data store
To use Pachyderm we would also need to deploy Pachyderm on K8s.
Component of #14.
For the GitHub issue summarization model would it make sense to do a simple grid search for hyperparameter tuning?
I think this would be pretty straightforward to implement
Create a simple Python program(controller) to do a grid search
Controller can store information in a file on a PD for resilience
Run the controller as a K8s Job.
I think my main question is: does the model have hyperparameters worth tuning? Are there suitable metrics for deciding which model is best?
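A sketch of such a controller (the hyperparameter names and the train function are hypothetical); progress is checkpointed to a JSON file, which could live on the PD, so a restarted Job resumes where it left off:

```python
import itertools
import json
import os

# Illustrative grid; the real hyperparameters would come from the model.
GRID = {"learning_rate": [1e-3, 1e-4], "latent_dim": [128, 256]}

def grid_search(train_fn, state_path="results.json"):
    """Evaluate every combination in GRID, checkpointing after each run.

    train_fn(params) -> metric launches one training run (e.g. a TFJob)
    and returns a validation metric; higher is treated as better.
    """
    results = {}
    if os.path.exists(state_path):          # resume after a restart
        with open(state_path) as f:
            results = json.load(f)
    keys = sorted(GRID)
    for values in itertools.product(*(GRID[k] for k in keys)):
        params = dict(zip(keys, values))
        key = json.dumps(params, sort_keys=True)
        if key in results:                  # already evaluated
            continue
        results[key] = train_fn(params)
        with open(state_path, "w") as f:    # checkpoint to disk (the PD)
            json.dump(results, f)
    best = max(results, key=results.get)
    return json.loads(best), results[best]
```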
Running through the scenario:
step 1: no issue.
step 2:
git clone https://github.com/kubeflow/examples.git; cd examples/github_issue_summarization/notebooks/ks-app
The kubectl log command (the very last one) needs -n kubeflow.
step 3:
Got "The environment has expired. Please refresh to get a new environment." when started, so I ran kubectl describe ...
step 4:
ks apply frontendenv -c ui
failed because github_token is not set. Fixed with ks param set ui github_token $GITHUB_TOKEN.
It would be good to provide a rough estimate of how long it takes to preprocess and train the model using datasets of different sizes; e.g.
The original blog post had some measurements of how long things took using how much resources.
Create XGBoost Zillow housing prediction example from Kaggle Zestimate kernel.
👍 Great work guys!!!
Running the mnist example with minio, getting this error:
FileSystemStoragePathSource encountered a file-system access error: Could not find base path s3://mybucket/models/myjob-8b52d/export/mnist/
Any chance you can publish the code for elsonrodriguez/model-server:1.0?
The tfjob example for github issue summarization saves the model locally and then uploads it to GCS. We should try to write the model directly to GCS.
I think we could move the examples in tf-operator to this repo to maintain all examples in one repo.
We should deploy the webserver and model on our dev instance of Kubeflow (dev.kubeflow.org) and provide a public URL for accessing the app.
The GH web app should include links and information about Kubeflow.
It's free advertising.
/assign @ankushagarwal
Some examples are publishing Docker images. We should probably create a GCR registry to host these.
Example:
GitHub issue example referring to an image in gcr.io/agwl-kubeflow
The purpose of this example is to showcase the benefits of the Kubeflow infrastructure in training a reinforcement learning agent.
Core tasks:
Optionally:
/cc @nkashy1 @danijar @aronchick @jlewi
We should be able to train the model using TFJob so that we can take advantage of K8s for distributed training.
Right now the instructions only describe how to train inside the Jupyter notebook
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/training_the_model.md
The Argo workflow does train the model as a batch job, but it isn't run in distributed mode:
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/workflow/github_issues_summarization.yaml
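TFJob coordinates replicas by setting the TF_CONFIG environment variable on each pod; the training code has to parse it to learn its role in the cluster. A minimal, stdlib-only sketch of that parsing (the fallback value "local" is a convention chosen here, not something TFJob defines):

```python
import json
import os

def cluster_role():
    """Return (job_name, task_index, cluster_spec) from TF_CONFIG.

    TFJob sets TF_CONFIG on every replica, e.g.
    {"cluster": {"ps": [...], "worker": [...]},
     "task": {"type": "worker", "index": 0}}
    A missing/empty TF_CONFIG is treated as local, non-distributed
    training.
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return (task.get("type", "local"),
            task.get("index", 0),
            tf_config.get("cluster", {}))
```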
Instructions say
The build/ directory contains all the necessary files to build the seldon-core microservice image
But I don't see that directory or build_image.sh
@ankushagarwal @texasmichelle is there something I'm missing? I assume the instructions are just outdated?
We went through this documentation using minio as S3 storage and a Kubernetes cluster hosted on Azure, and we found some issues.
I am planning to open a PR to share our learnings from that.
To avoid duplicate sections, I was thinking of creating a new md file for minio and referring to it from the README.md in the same folder, since a few tweaks are necessary to make it work.
WDYT?