We will use Google Cloud Deployment Manager to declaratively define the demo infrastructure, e.g.
- Create GCP projects for the demo
- Configure the project (enable APIs, enable billing etc...)
- Create GKE clusters
In order to use Deployment Manager to create other projects, we need to set up a project that will own the deployments. I created:
- Folder demo-projects in the GCP org kubeflow.org to contain all the demo projects
- Project kf-demo-owner to own the deployments
- Script setup_demo_owner_project.sh to run the commands required to set up kf-demo-owner
On the folder containing the projects, we need to give the service account used by Deployment Manager (${PROJECT_NUMBER}@cloudservices.gserviceaccount.com) permission to create projects.
* TODO(jlewi): Add the gcloud command for this.
* The project number is for the project that owns the deployments (e.g. kf-demo-owner)
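As a hedged sketch for the TODO above, the binding could likely be granted with `gcloud resource-manager folders add-iam-policy-binding`; the folder ID and project number below are placeholders, and the snippet only composes and prints the command so it can be reviewed before running:

```shell
# Placeholders -- substitute the real ID of the demo-projects folder and the
# project number of kf-demo-owner before running the printed command.
FOLDER_ID=000000000000
PROJECT_NUMBER=111111111111

# Compose (rather than execute) the command that grants the Deployment Manager
# service account permission to create projects under the folder.
CMD="gcloud resource-manager folders add-iam-policy-binding ${FOLDER_ID} \
  --member=serviceAccount:${PROJECT_NUMBER}@cloudservices.gserviceaccount.com \
  --role=roles/resourcemanager.projectCreator"
echo "${CMD}"
```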
- Copy project_creation/config-kubecon-gh-demo-1.yaml to project_creation/config-${PROJECT}.yaml
- Modify config-${PROJECT}.yaml to change the project name
- Run
  cd project_creation
  gcloud deployment-manager --project=kf-demo-owner deployments create ${PROJECT} --config=config-${PROJECT}.yaml
Once you create the deployment, if you need to make changes you can just update it with
gcloud deployment-manager --project=kf-demo-owner deployments update ${PROJECT} --config=config-${PROJECT}.yaml
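For reference, a minimal sketch of what a project-creation config might look like; the actual checked-in config-kubecon-gh-demo-1.yaml may use templates and set additional properties (billing, APIs), and the project name and folder ID here are placeholders:

```yaml
resources:
- name: my-demo-project                  # placeholder; becomes the project ID
  type: cloudresourcemanager.v1.project
  properties:
    name: my-demo-project
    parent:
      type: folder
      id: "000000000000"                 # ID of the demo-projects folder
```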
- You might want to update the IAM section in config-${PROJECT}.yaml to add users who should be owners of the project
- Copy env-kubecon-gh-demo-1.sh to env-${PROJECT}.sh
- Change the name of the project
- Set FQDN to FQDN=${PROJECT}.kubeflow.dev
  * **.dev** not **.org**
- Update the resource quotas for the project
  - Currently this has to be done via the UI
  - Suggested quotas:
    - In regions us-east1 & us-central1:
      - 100 CPUs per region
      - 200 CPUs (all regions)
      - 100 TB of standard PDs in each region
      - 5 K80s in each region
    - 10 backend services
    - 50 health checks
    - 24 in-use IP addresses
- Create a new bucket for this project
  gsutil mb -p ${PROJECT} gs://${PROJECT}-gh-demo
  - TODO(jlewi): We should create this with Deployment Manager.
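Per the TODO, the bucket could eventually be declared in the Deployment Manager config instead; a rough sketch using the storage.v1.bucket type (the names are placeholders matching the gsutil command above, and whether the bucket can be created in the child project from the kf-demo-owner deployment needs verifying):

```yaml
resources:
- name: my-demo-project-gh-demo
  type: storage.v1.bucket
  properties:
    name: my-demo-project-gh-demo        # i.e. gs://${PROJECT}-gh-demo
    project: my-demo-project
```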
- Copy env-kubecon-gh-demo-1.sh to env-${PROJECT}.sh
- Set/change all the values to correspond to the new project.
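A hypothetical sketch of what env-${PROJECT}.sh might contain after the edits; the authoritative variable set is whatever env-kubecon-gh-demo-1.sh defines, and the variables below are just the ones inferred from commands used elsewhere in this doc:

```shell
# Hypothetical env-${PROJECT}.sh contents; adjust to match env-kubecon-gh-demo-1.sh.
export PROJECT=my-demo-project                 # the new demo project's name
export FQDN=${PROJECT}.kubeflow.dev            # note: .dev, not .org
export BUCKET=${PROJECT}-gh-demo               # bucket created with gsutil mb above
export ENV=${PROJECT}                          # ksonnet environment name (assumption)
echo "Configured environment for ${PROJECT}"
```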
- Modify create_deployment.sh
  - Set the correct environment variables configuring the project and deployment.
We use the ksonnet app checked in here in the directory `git_examples/github_issue_summarization/`.
- Create a GitHub token
  kubectl -n kubeflow create secret generic github-token --from-literal=github-oauth=${GITHUB_TOKEN}
- Deploy the components
  ks apply ${ENV} -c seldon
  ks apply ${ENV} -c issue-summarization-model-serving
  ks apply ${ENV} -c ui
- Set a bucket for the job output
  ks param set --env=${ENV} tfjob-v1alpha2 output_model_gcs_bucket kubecon-gh-demo
  ks param set --env=${ENV} tfjob-v1alpha2 output_model_gcs_path gh-demo/20180712/output
- Run the job
  ks apply ${ENV} -c tfjob-v1alpha2
- Set up TensorBoard
  ks param set --env=${ENV} tensorboard logDir gs://kubecon-gh-demo/gh-t2t-out
  - Need to use the output of the T2T job; the Keras job doesn't appear to produce an events file
- The UI will be available at https://${FQDN}/issue-summarization/
- TensorBoard will be available at https://${FQDN}/tensorboard/gh/
- Last time I tried, I used the image
  gcr.io/kubeflow-images-public/tensorflow-1.7.0-notebook-cpu:v0.2.1
- Looks like we need to install some libraries:
  RUN pip install --no-cache-dir annoy ktext nltk pydot
- Check out the examples repository and open up the training notebook for the GH issue summarization example
- Launch the image prepuller
  ks apply ${ENV} -c prepull-daemon
- Launch a notebook with a PVC.
- Use the image for the TFJob; you can get the image as follows
  ks param --env=${ENV} list | grep "tfjob.*image.*"
- Switch to JupyterLab by changing the suffix of the URL from /tree to /lab, e.g.
  https://kubecon-gh1.kubeflow.org/user/accounts.google.com%[email protected]/tree? ->
  https://kubecon-gh1.kubeflow.org/user/accounts.google.com%[email protected]/lab?
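The switch is just a substitution in the URL path; as a sketch:

```shell
# Rewrite a Jupyter /tree URL to its JupyterLab /lab equivalent.
TREE_URL="https://kubecon-gh1.kubeflow.org/user/accounts.google.com%[email protected]/tree?"
LAB_URL=$(echo "${TREE_URL}" | sed 's|/tree?|/lab?|')
echo "${LAB_URL}"
# -> https://kubecon-gh1.kubeflow.org/user/accounts.google.com%[email protected]/lab?
```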
- Create a terminal in the notebook
- Confirm that a PD is mounted at /home/jovyan/work
df -h
Filesystem Size Used Avail Use% Mounted on
overlay 95G 13G 82G 14% /
tmpfs 15G 0 15G 0% /dev
tmpfs 15G 0 15G 0% /sys/fs/cgroup
/dev/sda1 95G 13G 82G 14% /etc/hosts
shm 64M 0 64M 0% /dev/shm
/dev/sdb 9.8G 37M 9.3G 1% /home/jovyan/work
tmpfs 15G 0 15G 0% /sys/firmware
- Put any work that you want to be saved between container restarts in /home/jovyan/work
- Due to a bug, we need to do the following to make /home/jovyan/work writable
  kubectl exec -it ${JUPYTER_POD} -- /bin/bash
  chown -R jovyan /home/jovyan/work/
- Clone the examples repository into the container
  git clone https://github.com/jlewi/examples.git /home/jovyan/work/git_examples
  cd /home/jovyan/work/git_examples
  git checkout kubecon_demo
* We are cloning [jlewi@'s fork](https://github.com/jlewi/examples/blob/kubecon_demo/github_issue_summarization/notebooks/Training.ipynb) which has some changes to the notebook
to support the demo on branch kubecon_demo.
* If you use this branch you shouldn't have to make the changes listed
below to the notebook
- Open /home/jovyan/work/git_examples/github_issue_summarization/notebooks/Training.ipynb
  - Make sure it's a Python 3 kernel (look in the upper right corner)
- Modify the notebook; change DATA_DIR to the following
  %env DATA_DIR=/home/jovyan/work/github-issues-data
  - TODO(jlewi): We should consider checking this into the demo repository.
- Download the pretrained model; execute the following in a terminal in the notebook
  mkdir -p /home/jovyan/work/model
  gsutil cp -r gs://kubeflow-examples-data/gh_issue_summarization/model/v20180426 /home/jovyan/work/model
  * This allows us to load the model in the notebook for predictions
- Load the trained model
  - Go to the notebook section "See Example Results on Holdout Set"
  - Add and execute the following cell:

    from keras.models import load_model
    import dill as dpickle

    body_pp_file = "/home/jovyan/work/model/v20180426/body_pp.dpkl"
    with open(body_pp_file, 'rb') as body_file:
        body_pp = dpickle.load(body_file)

    title_pp_file = "/home/jovyan/work/model/v20180426/title_pp.dpkl"
    with open(title_pp_file, 'rb') as title_file:
        title_pp = dpickle.load(title_file)

    model_file = "/home/jovyan/work/model/v20180426/seq2seq_model_tutorial.h5"
    seq2seq_Model = load_model(model_file)

  * TODO(jlewi): We should probably check this in, possibly to the existing notebook
- Comment out the cell to download the data
  - TODO(jlewi): Maybe add an if statement so we can disable it easily
- Run all cells
  - It will fail before "Train Model" because we are missing pydot
  - Scroll down to "Train Model" and run all cells below
- Source the environment variables for your environment
  source env-${NAME-OF-ENVIRONMENT}.sh
- Submit the training job.
cd git_examples/github_issue_summarization/ks-kubeflow
SUFFIX=$(date +%m%d%H%M)
T2TOUTPUT=gs://${BUCKET}/gh-t2t-out/${SUFFIX}
T2TNAME=gh-t2t-trainer-${SUFFIX}
ks param set --env=${ENV} tensor2tensor name ${T2TNAME}
ks param set --env=${ENV} tensor2tensor outputGCSPath ${T2TOUTPUT}
ks apply ${ENV} -c tensor2tensor
* TODO(jlewi): I don't think this actually sets the job name.
- Set up TensorBoard
  ks param set --env=${ENV} tensorboard logDir ${T2TOUTPUT}
  ks apply ${ENV} -c tensorboard
- Check that you can access TensorBoard at
  https://${FQDN}/tensorboard/${T2TNAME}/
  - The trailing slash matters
  - If you get an "upstream connect failure" error, try waiting and refreshing.
- Start at JupyterHub; spawn a notebook using the image from
  ks param --env=kubecon-gh-demo-1 list | grep "tfjob.*image"
* TODO(jlewi): Need to add daemonset to precache images so loading it is fast.
* Talking points
* Jupyter on K8s provides reproducible environments via containers
* HTTPS - Can manage security centrally
- Talk about developing/experimenting in a notebook
  - Use sampled data
  - Look at the output
- Show defining the model architecture
- Generate some predictions
  - Go to the section "See Example Results on Holdout"
  - Load the model (if you haven't already)
  - Execute cells to generate predictions
- Now train at scale.
  ks apply ${ENV} -c tensor2tensor
  - TODO(jlewi): Give the job a unique name?
- Show the pods
  kubectl get pods -l kubeflow.org=""
  - You can tail the logs to show progress
    kubectl logs ${MASTER_POD}
- Show TensorBoard
  - We provide manifests for running TensorBoard
  - We've also integrated it with our Ambassador reverse proxy to make it easy for data scientists to access
- Show predictions in the notebook
- Now we want a model server
  - Show the Seldon code
    kubectl get seldondeployments -o yaml
  - Show Seldon
ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1524679659292-56ab0257d0560-60fae1f4-f22ce361]: errors:
- code: RESOURCE_ERROR
location: /deployments/kubecon-gh-demo-1/resources/kubecon-gh-demo-1
message: '{"ResourceType":"cloudresourcemanager.v1.project","ResourceErrorCode":"403","ResourceErrorMessage":{"code":403,"message":"User
is not authorized.","status":"PERMISSION_DENIED","statusMessage":"Forbidden","requestPath":"https://cloudresourcemanager.googleapis.com/v1/projects","httpMethod":"POST"}}'
* This error indicates the service account used by deployment manager doesn't have permission to create projects
ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1524681274999-56ab085cac0d9-9b39ef39-a5f3583b]: errors:
- code: RESOURCE_ERROR
location: /deployments/kubecon-gh-demo-1/resources/patch-iam-policy-kubecon-gh-demo-1
message: '{"ResourceType":"gcp-types/cloudresourcemanager-v1:cloudresourcemanager.projects.setIamPolicy","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Request
contains an invalid argument.","status":"INVALID_ARGUMENT","details":[{"@type":"type.googleapis.com/google.cloudresourcemanager.v1.ProjectIamPolicyError","type":"ORG_MUST_INVITE_EXTERNAL_OWNERS","member":"user:[email protected]","role":"roles/owner"},{"@type":"type.googleapis.com/google.cloudresourcemanager.v1.ProjectIamPolicyError","type":"ORG_MUST_INVITE_EXTERNAL_OWNERS","member":"user:[email protected]","role":"roles/owner"},{"@type":"type.googleapis.com/google.cloudresourcemanager.v1.ProjectIamPolicyError","member":"group:google-team@
kubeflow.org"}],"statusMessage":"Bad Request","requestPath":"https://cloudresourcemanager.googleapis.com/v1/projects/kubecon-gh-demo-1:setIamPolicy","httpMethod":"POST"}}'
* You can work around this by creating a group within the org and then adding external members to the group.
- If the model server is crash-looping, try deleting the pod.