
kubecon_gh_demo's Introduction

We use Google Cloud Deployment Manager to declaratively define the demo infrastructure, e.g.

  • Create GCP projects for the demo
  • Configure the project (enable APIs, enable billing etc...)
  • Create GKE clusters

One time setup

To use Deployment Manager to create other projects, we need to set up a project that will own the deployments.

I created:

  • In GCP org kubeflow.org I created folder demo-projects to contain all the projects
  • Project: kf-demo-owner to own the deployments
  • Script setup_demo_owner_project.sh, which runs the commands required to set up kf-demo-owner

On the folder containing the projects, we need to give the service account used by Deployment Manager (${PROJECT_NUMBER}@cloudservices.gserviceaccount.com) permission to create projects.

* TODO(jlewi): Add gcloud command for this.
* Project number is for the project that owns the deployments (e.g. kf-demo-owner)
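
The TODO above could be filled in along these lines. This is a sketch, not a tested recipe: the function names are hypothetical, the folder ID passed to `grant_project_creator` is the numeric ID of the demo-projects folder (not given in this document), and `roles/resourcemanager.projectCreator` is assumed to be the role that grants project creation.

```shell
# Assumption: OWNER_PROJECT is the project that owns the deployments.
OWNER_PROJECT="${OWNER_PROJECT:-kf-demo-owner}"

# Build the Deployment Manager service account name from a project number.
dm_service_account() {
  echo "${1}@cloudservices.gserviceaccount.com"
}

# Grant the service account permission to create projects in a folder.
# $1 is the numeric folder ID (an assumption; look it up for your org).
grant_project_creator() {
  local folder_id="$1"
  local project_number
  project_number="$(gcloud projects describe "${OWNER_PROJECT}" \
    --format='value(projectNumber)')"
  gcloud resource-manager folders add-iam-policy-binding "${folder_id}" \
    --member="serviceAccount:$(dm_service_account "${project_number}")" \
    --role="roles/resourcemanager.projectCreator"
}
```

After sourcing the snippet, run `grant_project_creator ${FOLDER_ID}` once for the folder.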

To create a new project

  1. copy project_creation/config-kubecon-gh-demo-1.yaml to project_creation/config-${PROJECT}.yaml

    • Modify config-${PROJECT}.yaml to change the project name
  2. Run

cd project_creation
gcloud deployment-manager --project=kf-demo-owner deployments create ${PROJECT} --config config-${PROJECT}.yaml

Once the deployment exists, you can apply further changes by updating it with

gcloud deployment-manager --project=kf-demo-owner deployments update ${PROJECT} --config=config-${PROJECT}.yaml
  • You might want to update the IAM section in config.yaml to add users who should be owners of the project
  1. Copy env-kubecon-gh-demo-1.sh to env-${PROJECT}.sh
  • Change the name of the project
  • Set FQDN to
FQDN=${PROJECT}.kubeflow.dev
* **.dev** not **.org**
  1. Update Resource Quotas for the Project

    • Currently this has to be done via the UI
    • Suggested quota usages
    • Recommendations
      • In regions us-east1 & us-central1
      • 100 CPUs per region
      • 200 CPUs (All Region)
      • 100 TB of standard PD in each region
      • 5 K80s in each region
      • 10 backend services
      • 50 health checks
      • 24 in-use IP addresses
  2. Create a new bucket for this project

gsutil mb -p ${PROJECT} gs://${PROJECT}-gh-demo
  • TODO(jlewi): We should create this with deployment manager.
  1. Copy env-kubecon-gh-demo-1.sh to env-${PROJECT}.sh
  • Set/change all the values to correspond to this new project.
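
The per-project names used in the steps above all follow from the project name, so they can be derived in one place. The helper names below are hypothetical, and `create_or_update_project` is only a sketch that assumes `deployments describe` fails when the deployment does not exist yet.

```shell
# Derive the per-project names used in the steps above from a project name.
config_file() { echo "config-${1}.yaml"; }
env_file()    { echo "env-${1}.sh"; }
demo_fqdn()   { echo "${1}.kubeflow.dev"; }   # .dev, not .org
demo_bucket() { echo "gs://${1}-gh-demo"; }

# Create the deployment if it does not exist yet, otherwise update it.
# Run this from the project_creation directory.
create_or_update_project() {
  local project="$1"
  local action="create"
  if gcloud deployment-manager --project=kf-demo-owner \
      deployments describe "${project}" >/dev/null 2>&1; then
    action="update"
  fi
  gcloud deployment-manager --project=kf-demo-owner \
    deployments "${action}" "${project}" \
    --config="$(config_file "${project}")"
}
```

For example, `create_or_update_project ${PROJECT}` followed by `gsutil mb -p ${PROJECT} "$(demo_bucket ${PROJECT})"` covers the deployment and bucket steps.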

To set up the cluster

Create the Cluster

  1. Modify create_deployment.sh

    • Set the correct environment variables configuring the project and deployment.

Deploying the GH demo

We use the ksonnet app checked in here in the directory `git_examples/github_issue_summarization/`

Create a GitHub token

kubectl -n kubeflow create secret generic github-token --from-literal=github-oauth=${GITHUB_TOKEN}
ks apply ${ENV} -c seldon
ks apply ${ENV} -c issue-summarization-model-serving
ks apply ${ENV} -c ui

Set a bucket for the job output

ks param set --env=${ENV} tfjob-v1alpha2 output_model_gcs_bucket kubecon-gh-demo
ks param set --env=${ENV} tfjob-v1alpha2 output_model_gcs_path gh-demo/20180712/output
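
The path above hard-codes a date (20180712). A small helper can generate it instead; this is a sketch that assumes the `gh-demo/<YYYYMMDD>/output` layout should be kept, and `output_model_path` is a hypothetical name.

```shell
# Build an output path of the form gh-demo/YYYYMMDD/output.
# An optional first argument overrides today's date (useful for reruns).
output_model_path() {
  local day="${1:-$(date +%Y%m%d)}"
  echo "gh-demo/${day}/output"
}
```

It can then be used as `ks param set --env=${ENV} tfjob-v1alpha2 output_model_gcs_path "$(output_model_path)"`.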

Run the job

ks apply ${ENV} -c tfjob-v1alpha2

Set tensorboard

ks param set --env=${ENV} tensorboard logDir gs://kubecon-gh-demo/gh-t2t-out
  • This needs the output of the T2T job; it doesn't look like the Keras job produces an events file

Access the App

The UI will be available at

https://${FQDN}/issue-summarization/

Tensorboard will be available at

https://${FQDN}/tensorboard/gh/

Start a Jupyter Notebook

  • Last time I tried I used gcr.io/kubeflow-images-public/tensorflow-1.7.0-notebook-cpu:v0.2.1

    • Looks like we need to install some libraries

      RUN pip install --no-cache-dir annoy ktext nltk pydot
      
  • Check out the examples repository and open up the training notebook for the GH issue

Precache images

  1. Launch the image prepuller
ks apply ${ENV} -c prepull-daemon

Prepare the demo

  1. Launch a notebook with PVC.
  • Use the image for tf job; you can get the image as follows
ks param --env=${ENV} list | grep "tfjob.*image.*"
  1. Switch to JupyterLab by changing the suffix of the url from /tree to /lab e.g.
https://kubecon-gh1.kubeflow.org/user/accounts.google.com%[email protected]/tree? ->
https://kubecon-gh1.kubeflow.org/user/accounts.google.com%[email protected]/lab?
  1. Create a terminal in the notebook
  2. Confirm that a PD is mounted in /home/jovyan/work
df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          95G   13G   82G  14% /
tmpfs            15G     0   15G   0% /dev
tmpfs            15G     0   15G   0% /sys/fs/cgroup
/dev/sda1        95G   13G   82G  14% /etc/hosts
shm              64M     0   64M   0% /dev/shm
/dev/sdb        9.8G   37M  9.3G   1% /home/jovyan/work
tmpfs            15G     0   15G   0% /sys/firmware
  • Put any work that you want to be saved between container restarts in /home/jovyan/work
  1. Due to a bug we need to do the following to make /home/jovyan/work writable

    kubectl exec -it ${JUPYTER_POD} /bin/bash
    chown -R jovyan /home/jovyan/work/
    
  2. Clone the examples repository into the container

git clone https://github.com/jlewi/examples.git /home/jovyan/work/git_examples
cd /home/jovyan/work/git_examples
git checkout kubecon_demo
* We are cloning [jlewi@'s fork](https://github.com/jlewi/examples/blob/kubecon_demo/github_issue_summarization/notebooks/Training.ipynb) which has some changes to the notebook
  to support the demo on branch kubecon_demo.

* If you use this branch you shouldn't have to make the changes listed
  below to the notebook
  1. Open /home/jovyan/work/git_examples/github_issue_summarization/notebooks/Training.ipynb

    • Make sure it's a Python3 kernel (look in the upper right corner)
  2. Modify the notebook; change DATA_DIR to the following

    %env DATA_DIR=/home/jovyan/work/github-issues-data
    
    • TODO(jlewi): We should consider checking this into the demo repository.
  3. Download the pretrained model; execute the following in a terminal in the notebook

mkdir -p /home/jovyan/work/model
gsutil cp -r gs://kubeflow-examples-data/gh_issue_summarization/model/v20180426 /home/jovyan/work/model
* This allows us to load the model in the notebook for predictions
  1. Load the trained model

    • Go to the notebook section "See Example Results on Holdout Set"
    • Add and execute the following cell
    from keras.models import load_model
    import dill as dpickle

    # Load the fitted preprocessor for issue bodies.
    body_pp_file = "/home/jovyan/work/model/v20180426/body_pp.dpkl"
    with open(body_pp_file, 'rb') as body_file:
        body_pp = dpickle.load(body_file)

    # Load the fitted preprocessor for issue titles.
    title_pp_file = "/home/jovyan/work/model/v20180426/title_pp.dpkl"
    with open(title_pp_file, 'rb') as title_file:
        title_pp = dpickle.load(title_file)

    # Load the pretrained seq2seq model.
    model_file = "/home/jovyan/work/model/v20180426/seq2seq_model_tutorial.h5"
    seq2seq_Model = load_model(model_file)
    
    * TODO(jlewi): We should probably check this in possibly to the existing notebook
    
  2. Comment out the cell to download the data

    • TODO(jlewi): Maybe add an if statement so we can disable it easily
  3. Run all cells

    • It will fail before Train Model because we are missing pydot
    • Scroll down to Train Model and Run all cells below
  4. Source the environment variables for your environment

source env-${PROJECT}.sh
  1. Submit the training job.
cd git_examples/github_issue_summarization/ks-kubeflow
SUFFIX=$(date +%m%d%H%M)
T2TOUTPUT=gs://${BUCKET}/gh-t2t-out/${SUFFIX}
T2TNAME=gh-t2t-trainer-${SUFFIX}
ks param set --env=${ENV} tensor2tensor name ${T2TNAME}
ks param set --env=${ENV} tensor2tensor outputGCSPath ${T2TOUTPUT}
ks apply ${ENV} -c tensor2tensor
* TODO(jlewi): I don't think this actually sets the job name.
  1. Set up TensorBoard
ks param set --env=${ENV} tensorboard logDir ${T2TOUTPUT}
ks apply ${ENV} -c tensorboard
  • Check you can access tensorboard at
https://${FQDN}/tensorboard/${T2TNAME}/
  • The trailing slash matters
  • If you get an "upstream connect failure" error, try waiting and refreshing.
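
The job name, output path and TensorBoard URL all depend on the same suffix, so deriving them together avoids mismatches between the training and TensorBoard steps. The following is a sketch with hypothetical function names; it assumes `ENV`, `BUCKET` and `FQDN` are already set by the sourced environment file and that it runs from the ks-kubeflow directory.

```shell
# Derive the job name, output path and TensorBoard URL from one suffix.
t2t_name()        { echo "gh-t2t-trainer-${1}"; }
t2t_output()      { echo "gs://${BUCKET}/gh-t2t-out/${1}"; }
tensorboard_url() { echo "https://${FQDN}/tensorboard/$(t2t_name "${1}")/"; }

# Submit the training job and point TensorBoard at its output.
# $1 optionally overrides the timestamp suffix.
submit_t2t() {
  local suffix="${1:-$(date +%m%d%H%M)}"
  ks param set --env="${ENV}" tensor2tensor name "$(t2t_name "${suffix}")"
  ks param set --env="${ENV}" tensor2tensor outputGCSPath "$(t2t_output "${suffix}")"
  ks apply "${ENV}" -c tensor2tensor
  ks param set --env="${ENV}" tensorboard logDir "$(t2t_output "${suffix}")"
  ks apply "${ENV}" -c tensorboard
  # The trailing slash in the URL matters.
  echo "TensorBoard: $(tensorboard_url "${suffix}")"
}
```

Calling `submit_t2t` with no arguments uses the current time as the suffix, matching the manual steps.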

Demo Script

  1. Start at JupyterHub; spawn a notebook using the image reported by
ks param --env=kubecon-gh-demo-1 list | grep "tfjob.*image"
* TODO(jlewi): Need to add daemonset to precache images so loading it is fast.

* Talking points

	* Jupyter on K8s provides reproducible environments via containers
	* HTTPS - Can manage security centrally
  1. Talk about developing/experimenting in a notebook

    • Use sampled data
    • Look at output
  2. Show how the model architecture is defined

  3. Generate some predictions

    • Go to section See Example Results on Holdout
    • Load the model (if you haven't already)
    • Execute cells to generate predictions
  4. Now train at scale.

 ks apply ${ENV} -c tensor2tensor
  • TODO(jlewi): Give the job a unique name?

  • Show the pods

kubectl get pods -l kubeflow.org=""
  • You can show logs to show progress
kubectl logs ${MASTER_POD}
  1. Show tensorboard

    • We provide manifests for running tensorboard
    • We've also integrated it with our Ambassador reverse proxy to make it easy for data scientists to access
  2. Show predictions in the notebook

  3. Now we want a model server

    • Show Seldon code
     kubectl get seldondeployments -o yaml
    
    • Show Seldon

Troubleshooting

Deployment Manager

ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1524679659292-56ab0257d0560-60fae1f4-f22ce361]: errors:
- code: RESOURCE_ERROR
  location: /deployments/kubecon-gh-demo-1/resources/kubecon-gh-demo-1
  message: '{"ResourceType":"cloudresourcemanager.v1.project","ResourceErrorCode":"403","ResourceErrorMessage":{"code":403,"message":"User
    is not authorized.","status":"PERMISSION_DENIED","statusMessage":"Forbidden","requestPath":"https://cloudresourcemanager.googleapis.com/v1/projects","httpMethod":"POST"}}'

* This error indicates the service account used by deployment manager doesn't have permission to create projects
ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1524681274999-56ab085cac0d9-9b39ef39-a5f3583b]: errors:
- code: RESOURCE_ERROR
  location: /deployments/kubecon-gh-demo-1/resources/patch-iam-policy-kubecon-gh-demo-1
  message: '{"ResourceType":"gcp-types/cloudresourcemanager-v1:cloudresourcemanager.projects.setIamPolicy","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Request
    contains an invalid argument.","status":"INVALID_ARGUMENT","details":[{"@type":"type.googleapis.com/google.cloudresourcemanager.v1.ProjectIamPolicyError","type":"ORG_MUST_INVITE_EXTERNAL_OWNERS","member":"user:[email protected]","role":"roles/owner"},{"@type":"type.googleapis.com/google.cloudresourcemanager.v1.ProjectIamPolicyError","type":"ORG_MUST_INVITE_EXTERNAL_OWNERS","member":"user:[email protected]","role":"roles/owner"},{"@type":"type.googleapis.com/google.cloudresourcemanager.v1.ProjectIamPolicyError","member":"group:google-team@
    kubeflow.org"}],"statusMessage":"Bad Request","requestPath":"https://cloudresourcemanager.googleapis.com/v1/projects/kubecon-gh-demo-1:setIamPolicy","httpMethod":"POST"}}'
* You can work around this by creating a group within the org and then adding external members to the group.

Seldon Server

  • If the model server is crash looping, try deleting the pod.
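
Finding and deleting the crash-looping pod can be scripted. The sketch below uses hypothetical function names and assumes the model server runs in the kubeflow namespace. It relies on the default `kubectl get pods` column layout (NAME, READY, STATUS, ...). Note that it deletes every crash-looping pod in the namespace, not only the Seldon server.

```shell
# Read `kubectl get pods` output on stdin and print the names of pods
# whose STATUS column is CrashLoopBackOff (skipping the header line).
crashlooping_pods() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { print $1 }'
}

# Delete every crash-looping pod in a namespace (default: kubeflow).
delete_crashlooping_pods() {
  local ns="${1:-kubeflow}"
  local pod
  for pod in $(kubectl -n "${ns}" get pods | crashlooping_pods); do
    kubectl -n "${ns}" delete pod "${pod}"
  done
}
```

To only list the affected pods without deleting them, run `kubectl -n kubeflow get pods | crashlooping_pods`.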

kubecon_gh_demo's People

Contributors

jlewi
