
kubeflow/testing Introduction


Test Infrastructure

Two test infrastructures exist in the Kubeflow community:

If you are interested in oss-test-infra, please find useful resources here.
If you are interested in optional-test-infra, please find useful resources here.

We use Prow, K8s' continuous integration tool.

  • Prow is a set of binaries that run on Kubernetes and respond to GitHub events.

We use Prow to run:

  • Presubmit jobs
  • Postsubmit jobs
  • Periodic tests

Here's a high-level idea of how it works:

  • Prow is used to trigger E2E tests
  • The E2E test will launch an Argo workflow that describes the tests to run
  • Each step in the Argo workflow will be a binary invoked inside a container
  • The Argo workflow will use an NFS volume to attach a shared POSIX compliant filesystem to each step in the workflow.
  • Each step in the pipeline can write outputs and junit.xml files to a test directory in the volume
  • A final step in the Argo pipeline will upload the outputs to GCS so they are available in spyglass

Quick Links

Anatomy of our Tests

  • Our prow jobs are defined here
  • Each prow job defines a K8s PodSpec indicating a command to run
  • Our prow jobs use run_e2e_workflow.py to trigger an Argo workflow that checks out our code and runs our tests.
  • Our tests are structured as Argo workflows so that we can easily perform steps in parallel.
  • The Argo workflow is defined in the repository being tested
    • We always use the workflow at the commit being tested
  • checkout.sh is used to check out the code being tested
    • This also checks out kubeflow/testing so that all repositories can rely on it for shared tools.

Writing An Argo Workflow For An E2E Test

This section provides guidelines for writing Argo workflows to use as E2E tests

This guide is complementary to the E2E testing guide for TFJob operator, which describes how to author tests to be performed as individual steps in the workflow.

Some examples to look at

Adding an E2E test to a repository

Follow these steps to add a new test to a repository.

Python function

  1. Create a Python function in that repository that returns an Argo workflow, if one doesn't already exist

    • We use Python functions defined in each repository to define the Argo workflows corresponding to E2E tests

    • You can look at prow_config.yaml (see below) to see which Python functions are already defined in a repository.

  2. Modify the prow_config.yaml at the root of the repo to trigger your new test.

    • If prow_config.yaml doesn't exist (e.g. the repository is new) copy one from an existing repository (example).

    • prow_config.yaml contains an array of workflows where each workflow defines an E2E test to run; example

      workflows:
       - name: workflow-test
         py_func: my_test_package.my_test_module.my_test_workflow
         kwargs:
             arg1: argument
      
      • py_func: The Python method that creates a Python object representing the Argo Workflow resource
      • kwargs: A dictionary of keyword arguments passed to the Python method
      • name: This is the base name to use for the submitted Argo workflow.
  3. You can use the e2e_tool.py to print out the Argo workflow and potentially submit it

  4. Examples
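
As an illustration, a py_func might look like the following minimal sketch. The function name, arguments, and image below are hypothetical, not an actual kubeflow/testing API; the only requirement stated above is that the function return an object representing an Argo Workflow resource:

```python
# Hypothetical py_func referenced from prow_config.yaml as
# my_test_package.my_test_module.my_test_workflow. All names, the
# namespace, and the image are illustrative.
def my_test_workflow(name, namespace="kubeflow-test-infra", **kwargs):
    """Return a dict representing an Argo Workflow resource."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "entrypoint": "e2e",
            "templates": [{
                "name": "e2e",
                "container": {
                    # An immutable image tag is recommended; this one is made up.
                    "image": "gcr.io/kubeflow-ci/test-worker:v1",
                    "command": ["python", "-m", "pytest"],
                },
            }],
        },
    }
```

The test infrastructure passes the base name from prow_config.yaml (plus a unique suffix) as the name argument, so the function only needs to embed it in the resource.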

ksonnet

**Using ksonnet is deprecated. New pipelines should use Python.**

  1. Create a ksonnet App in that repository and define an Argo workflow if one doesn't already exist

    • We use ksonnet apps defined in each repository to define the Argo workflows corresponding to E2E tests

    • If a ksonnet app already exists you can just define a new component in that app

      1. Create a .jsonnet file (e.g by copying an existing .jsonnet file)

      2. Update the params.libsonnet to add a stanza to define params for the new component

    • You can look at prow_config.yaml (see below) to see which ksonnet apps are already defined in a repository.

  2. Modify the prow_config.yaml at the root of the repo to trigger your new test.

    • If prow_config.yaml doesn't exist (e.g. the repository is new) copy one from an existing repository (example).

    • prow_config.yaml contains an array of workflows where each workflow defines an E2E test to run; example

      workflows:
       - app_dir: kubeflow/testing/workflows
         component: workflows
         name: unittests
         job_types:
           - presubmit
         include_dirs:
           - foo/*
           - bar/*
         params:
           platform: gke
           gkeApiVersion: v1beta1
      
      • app_dir: The path to the ksonnet directory within the repository. This should be of the form ${GITHUB_ORG}/${GITHUB_REPO_NAME}/${PATH_WITHIN_REPO_TO_KS_APP}

      • component: This is the name of the ksonnet component to use for the Argo workflow

      • name: This is the base name to use for the submitted Argo workflow.

        • The test infrastructure appends a suffix of 22 characters (see here)

        • The result is passed to your ksonnet component via the name parameter

        • Your ksonnet component should truncate the name if necessary to satisfy K8s naming constraints.

          • e.g. Argo workflow names should be less than 63 characters because they are used as pod labels
      • job_types: An array specifying the types of prow jobs for which this workflow should be triggered.

        • Currently allowed values are presubmit, postsubmit, and periodic.
      • include_dirs: If specified, the pre and postsubmit jobs will only trigger this test if the PR changed at least one file matching at least one of the listed directories.

        • Python's fnmatch function is used to compare the listed patterns against the full path of modified files (see here)

        • This functionality should be used to ensure that expensive tests are only run when test-impacting changes are made; particularly if it's an expensive or flaky presubmit

        • periodic runs ignore include_dirs; a periodic run will trigger all workflows that include job_type periodic

      • A given ksonnet component can have multiple workflow entries to allow different triggering conditions on pre/postsubmit

        • For example, on presubmit we might run a test on a single platform (GKE) but on postsubmit that same test might run on GKE and minikube
        • this can be accomplished with different entries pointing at the same ksonnet component but with different job_types and params.
      • params: A dictionary of parameters to set on the ksonnet component e.g. by running ks param set ${COMPONENT} ${PARAM_NAME} ${PARAM_VALUE}

Using pytest to write tests

  • pytest is really useful for writing tests

    • Results can be emitted as junit files which is what prow needs to report test results
    • It provides annotations to skip tests or mark flaky tests as expected to fail
  • Use pytest to easily script various checks

    • For example kf_is_ready_test.py uses some simple scripting to test that various K8s objects are deployed and healthy
  • Pytest provides fixtures for setting additional attributes in the junit files (docs)

    • In particular record_xml_attribute allows us to set attributes that control how the results are grouped in testgrid

      • name - This is the name shown in test grid

        • Testgrid supports grouping by splitting the tests into a hierarchy based on the name

        • recommendation: Leverage this feature to name tests to support grouping; e.g. use the pattern

          {WORKFLOW_NAME}/{PY_FUNC_NAME}
          
          • WORKFLOW_NAME: the workflow name as set in prow_config.yaml

          • PY_FUNC_NAME: the name of the Python test function

          • util.py provides the helper method set_pytest_junit to set the required attributes

          • run_e2e_workflow.py will pass the argument test_target_name to your py function to create the Argo workflow

            • Use this argument to set the environment variable TEST_TARGET_NAME on all Argo pods.
      • classname - testgrid uses classname as the test target and allows results to be grouped by name

        • recommendation - Set the classname to the workflow name as defined in prow_config.yaml

          • This allows easy grouping of tests by the entries defined in prow_config.yaml

          • Each entry in prow_config.yaml usually corresponds to a different configuration e.g. "GCP with IAP" vs. "GCP with basic auth"

          • So workflow name is a natural grouping
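
The naming recommendations above can be sketched using pytest's built-in record_xml_attribute fixture. The helper below is illustrative (it is not the actual set_pytest_junit from util.py), and it assumes the workflow sets the TEST_TARGET_NAME environment variable as described:

```python
import os

# Illustrative helper applying the grouping conventions above; a sketch,
# not the real kubeflow.testing.util.set_pytest_junit.
def set_junit_attributes(record_xml_attribute, func_name):
    # Assumes the Argo workflow set TEST_TARGET_NAME to the workflow
    # name from prow_config.yaml (a convention from this guide).
    workflow_name = os.environ.get("TEST_TARGET_NAME", "unknown")
    # name follows the {WORKFLOW_NAME}/{PY_FUNC_NAME} grouping pattern.
    record_xml_attribute("name", "{0}/{1}".format(workflow_name, func_name))
    # classname groups results by the workflow entry in prow_config.yaml.
    record_xml_attribute("classname", workflow_name)

# Usage in a pytest test; record_xml_attribute is a standard pytest fixture.
def test_kf_is_ready(record_xml_attribute):
    set_junit_attributes(record_xml_attribute, "test_kf_is_ready")
    assert True  # real deployment checks would go here
```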

Prow Variables

  • For each test run, Prow defines several variables that pass useful information to your job.

  • The list of variables is defined in the prow docs.

  • These variables are often used to assign unique names to each test run to ensure isolation (e.g. by appending the BUILD_NUMBER)

  • The prow variables are passed via ksonnet parameter prow_env to your workflows

    • You can copy the macros defined in util.libsonnet to parse the ksonnet parameter into a jsonnet map that can be used in your workflow.

    • Important Always define defaults for the prow variables in the dict e.g. like

      local prowDict = {
        BUILD_ID: "notset",
        BUILD_NUMBER: "notset",
        REPO_OWNER: "notset",
        REPO_NAME: "notset",
        JOB_NAME: "notset",
        JOB_TYPE: "notset",
        PULL_NUMBER: "notset",  
       } + util.listOfDictToMap(prowEnv);
      
      • This prevents jsonnet from failing in a hard to debug way in the event that you try to access a key which is not in the map.
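
For Python-based workflow functions, the same defaulting idea can be sketched as follows. The comma-separated "KEY=value" string format mirrors how the prow variables are passed as a single parameter; the function name is illustrative:

```python
# Illustrative sketch: parse a prow_env string such as
# "BUILD_ID=1234,REPO_OWNER=kubeflow" into a dict, with defaults for
# every expected key so lookups never fail for manually triggered runs.
def parse_prow_env(prow_env):
    env = {
        "BUILD_ID": "notset",
        "BUILD_NUMBER": "notset",
        "REPO_OWNER": "notset",
        "REPO_NAME": "notset",
        "JOB_NAME": "notset",
        "JOB_TYPE": "notset",
        "PULL_NUMBER": "notset",
    }
    for pair in filter(None, prow_env.split(",")):
        key, _, value = pair.partition("=")
        env[key.strip()] = value
    return env
```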

Argo Spec

  • Guard against long names by truncating the name and using the BUILD_ID to ensure the name remains unique, e.g.

    local name = std.substr(params.name, 0, std.min(58, std.length(params.name))) + "-" + prowDict["BUILD_ID"];
    
    • Argo workflow names need to be less than 63 characters because they are used as pod labels

    • BUILD_IDs are unique for each run per repo; we suggest reserving 5 characters for the BUILD_ID.

  • Argo workflows should have standard labels corresponding to prow variables; for example

    labels: prowDict + {    
      workflow_template: "code_search",    
    },
    
    • This makes it easy to query for Argo workflows based on prow job info.

    • In addition the convention is to use the following labels

      • workflow_template: The name of the ksonnet component from which the workflow is created.
  • The templates for the individual steps in the argo workflow should also have standard labels

    labels: prowDict + {
      step_name: stepName,
      workflow_template: "code_search",
      workflow: workflowName,
    },
    
    • step_name: Name of the step (e.g. what shows up in the Argo graph)
    • workflow_template: The name of the ksonnet component from which the workflow is created.
    • workflow: The name of the Argo workflow that owns this pod.
  • Following the above conventions makes it very easy to get logs for specific steps

    kubectl logs -l step_name=checkout,REPO_OWNER=kubeflow,REPO_NAME=examples,BUILD_ID=0104-064201 -c main
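
The truncation rule above can also be sketched in Python; the 63-character limit comes from the K8s label constraint, and the function name is illustrative:

```python
# Illustrative sketch of the name-truncation rule: keep the workflow
# name within the 63-character K8s label limit while preserving a
# unique BUILD_ID suffix.
def truncated_workflow_name(name, build_id, max_len=63):
    # Reserve room for the "-" separator plus the BUILD_ID so the
    # suffix is never cut off.
    prefix_len = max_len - len(build_id) - 1
    return name[:prefix_len] + "-" + build_id
```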
    
    

Creating K8s resources in tests.

Tests often need a K8s/Kubeflow deployment on which to create resources and run various tests.

Depending on the change being tested

  • The test might need exclusive access to a Kubeflow/Kubernetes cluster

    • e.g. Testing a change to a custom resource usually requires exclusive access to a K8s cluster because only one CRD and controller can be installed per cluster. So trying to test two different changes to an operator (e.g. tf-operator) on the same cluster is not good.
  • The test might need a Kubeflow/K8s deployment but doesn't need exclusive access

    • e.g. When running tests for Kubeflow examples we can isolate each test using namespaces or other mechanisms.
  • If the test needs exclusive access to the Kubernetes cluster then there should be a step in the workflow that creates a KubeConfig file to talk to the cluster.

    • e.g. E2E tests for most operators should probably spin up a new Kubeflow cluster
  • If the test just needs a known version of Kubeflow (e.g. master or v0.4) then it should use one of the test clusters in project kubeflow-ci for this

To connect to the cluster:

  • The Argo workflow should have a step that configures the KUBE_CONFIG file to talk to the cluster

    • e.g. by running gcloud container clusters get-credentials
  • The Kubeconfig file should be stored in the NFS test directory so it can be used in subsequent steps

  • Set the environment variable KUBE_CONFIG on your steps to use the KubeConfig file

NFS Directory

An NFS volume is used to create a shared filesystem between steps in the workflow.

  • Your Argo workflows should use a PVC claim to mount the NFS filesystem into each step

    • The current PVC name is nfs-external
    • This should be a parameter to allow different PVC names in different environments.
  • Use the following directory structure

    ${MOUNT_POINT}/${WORKFLOW_NAME}
                                   /src
                                       /${REPO_ORG}/${REPO_NAME}
                                   /outputs
                                   /outputs/artifacts
    
    • MOUNT_POINT: Location inside the pod where the NFS volume is mounted
    • WORKFLOW_NAME: The name of the Argo workflow
      • Each Argo workflow job has a unique name (enforced by APIServer)
      • So using WORKFLOW_NAME as root for all results associated with a particular job ensures there are no conflicts
    • /src: Any repositories that are checked out should be checked out here
      • Each repo should be checked out to the sub-directory ${REPO_ORG}/${REPO_NAME}
    • /outputs: Any files that should be sync'd to GCS for Gubernator should be written here

Step Image

  • The Docker image used by the Argo steps should be a ksonnet parameter stepImage

  • The Docker image should use an immutable image tag, e.g. gcr.io/kubeflow-ci/test-worker:v20181017-bfeaaf5-dirty-4adcd0

    • This ensures tests don't break if someone pushes a new test image
  • The ksonnet parameter stepImage should be set in the prow_config.yaml file defining the E2E tests

    • This makes it easy to update all the workflows to use some new image.
  • A common runtime is defined here and published to gcr.io/kubeflow-ci/test-worker

Checking out code

  • The first step in the Argo workflow should check out the source repos to the NFS directory

  • Use checkout.sh to checkout the repos

  • checkout.sh environment variable EXTRA_REPOS allows checking out additional repositories in addition to the repository that triggered the pre/post submit test

    • This allows your test to use source code located in a different repository
    • You can specify whether to checkout the repository at HEAD or pin to a specific commit
  • Most E2E tests will want to checkout kubeflow/testing in order to use various test utilities

Building Docker Images

There are lots of different ways to build Docker images (e.g. GCB, Docker in Docker). The current recommendation is:

  • Define a Makefile to provide a convenient way to invoke Docker builds

  • Using Google Container Builder (GCB) to run builds in Kubeflow's CI system generally works better than alternatives (e.g. Docker in Docker, Kaniko)

    • Your Makefile can have alternative rules to support building locally via Docker for developers
  • Use jsonnet if needed to define GCB workflows

  • Makefile should expose variables for the following

    • Registry where image is pushed
    • TAG used for the images
  • Argo workflow should define the image paths and tag so that subsequent steps can use the newly built images


kubeflow/testing Issues

Get rid of one prow container per repo

Right now our instructions require that every repository build a Docker image and create a custom run.sh script.

Maintaining a separate container per repository is more work than necessary. The sole purpose of this is to specify the command line arguments to run_e2e_workflow.py that control the workflow to run.

We can get rid of that by adopting a convention that the arguments will be defined in the root of the repository e.g. /prow_config.yaml

Then we can have a single container and entrypoint used by our prow jobs that loads the workflows and submits them.

minikube testing

We should run E2E tests on minikube.
kube_test is supposed to help with this.

Add e2e testing on OpenShift

The ultimate goal would be to have Argo able to run multiple testing workflows on different platforms.

A first step would be a setup that can be used to run the e2e tests locally on an OpenShift cluster.

Need to take test status into account when reporting prow job success or failure

We need to look at the junit XML files to decide whether to report the job as failed or success

  • Right now we just look at whether the Argo workflow succeeded or not which is insufficient since tests can fail but Argo workflow can still succeed.

We need to do something like kubeflow/training-operator#319

If we want to be strict about which xml files are expected we could specify the xml files expected in the prow_config.yaml file.
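
A minimal sketch of such a check is below; the file glob and attribute handling are assumptions about how junit files are laid out in the artifacts directory, not an existing kubeflow/testing API:

```python
import glob
import os
import xml.etree.ElementTree as ET

# Illustrative sketch: report failure if any junit*.xml file in the
# artifacts directory recorded test failures or errors, independent of
# whether the Argo workflow itself succeeded.
def junit_all_passed(artifacts_dir):
    for path in glob.glob(os.path.join(artifacts_dir, "junit*.xml")):
        root = ET.parse(path).getroot()
        # Handle both a bare <testsuite> root and a <testsuites> wrapper.
        suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
        for suite in suites:
            if int(suite.get("failures", 0)) or int(suite.get("errors", 0)):
                return False
    return True
```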

Setup appropriate groups to give access to test infrastructure to members of the community

We should setup appropriate access to the test infrastructure so we can share it with appropriate members of the community.

The current thinking is to

  • Setup gsuite for kubeflow.org
  • Use GCP projects within kubeflow.org
  • Create groups in kubeflow.org to administer access to test infra

To start I think we might want two groups

  • test-admin

    • Permission to administer GKE clusters used by our test infra
    • This should be a relatively small group of folks
  • test-viewer

    • Grant permission to view stackdriver logs
    • We can probably share this with most contributors (i.e. org members).

DO NOT SUBMIT presubmit check

It would be good to have a presubmit check that will automatically fail if a string like "DO NOT SUBMIT" appears anywhere in the patch.

I'm not sure if this is something prow already supports.

checkout.sh shouldn't require PROW environment variables

checkout.sh supports the PROW environment variables.

But it also supports the more convenient syntax we introduced to specify for extra repos.

For workflows triggered outside of prow (e.g. manually) it's more convenient just to specify EXTRA_REPOS.

We should update checkout.sh so that it doesn't assume REPO_OWNER and other prow environment variables.

URL for Argo workflows changed

In gubernator we report the URL for the workflow UI as

http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-presubmit-tf-serving-image-387-f608b1a-532-1c41;tab=workflow

This doesn't appear to be valid anymore with the new version of the Argo UI.

I think we just want to use

http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-presubmit-tf-serving-image-387-f608b1a-532-1c41

@jimexist would you mind sending another PR to fix this?

Truncate workflow name in prow_config.yaml

Found that if the workflow name is too long in prow_config.yaml I get this error:

Pod "kubeflow-pytorch-operator-presubmit-pytorchjob-e2e-13-1a13f4d-3-17ac-2950311768" is invalid: metadata.labels: Invalid value: "kubeflow-pytorch-operator-presubmit-pytorchjob-e2e-13-1a13f4d-3-17ac": must be no more than 63 characters

Should probably truncate or provide a warning to the user.

Building the test worker image fails.

I tried rebuilding the Docker image used by our test workers and I got the following error.

In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
    enum34>=1.1.3; python_version < "3.4" from https://pypi.python.org/packages/c5/db/e56e6b4bbac7c4a06de1c50de6fe1ef3810018ae11732a50f15f62c7d050/enum34-1.1.6-py2-none-any.whl#md5=68f6982cc07dde78f4b500db829860bd (from astroid==1.6.1->-r /tmp/requirements.txt (line 1))
    singledispatch; python_version < "3.4" from https://pypi.python.org/packages/c5/10/369f50bcd4621b263927b0a1519987a04383d4a98fb10438042ad410cf88/singledispatch-3.4.0.3-py2.py3-none-any.whl#md5=d633bac187d681455ab065c645be845d (from astroid==1.6.1->-r /tmp/requirements.txt (line 1))
    backports.functools-lru-cache; python_version < "3.4" from https://pypi.python.org/packages/03/8e/2424c0e65c4a066e28f539364deee49b6451f8fcd4f718fefa50cc3dcf48/backports.functools_lru_cache-1.5-py2.py3-none-any.whl#md5=5a93b4dac0cc414ceff28ff678713e31 (from astroid==1.6.1->-r /tmp/requirements.txt (line 1))
    setuptools>=34.0.0 from https://pypi.python.org/packages/43/41/033a273f9a25cb63050a390ee8397acbc7eae2159195d85f06f17e7be45a/setuptools-38.5.1-py2.py3-none-any.whl#md5=908b8b5e50bf429e520b2b5fa1b350e5 (from google-api-core==0.1.4->-r /tmp/requirements.txt (line 6))
    futures>=3.2.0; python_version < "3.2" from https://pypi.python.org/packages/2d/99/b2c4e9d5a30f6471e410a146232b4118e697fa3ffc06d6a65efde84debd0/futures-3.2.0-py2-none-any.whl#md5=cfd62ab6c9852b04bb6048480fefaa75 (from google-api-core==0.1.4->-r /tmp/requirements.txt (line 6))
    funcsigs>=1; python_version < "3.3" from https://pypi.python.org/packages/69/cb/f5be453359271714c01b9bd06126eaf2e368f1fddfff30818754b5ac2328/funcsigs-1.0.2-py2.py3-none-any.whl#md5=701d58358171f34b6d1197de2923a35a (from mock==2.0.0->-r /tmp/requirements.txt (line 23))
    configparser; python_version == "2.7" from https://pypi.python.org/packages/7c/69/c2ce7e91c89dc073eb1aa74c0621c3eefbffe8216b3f9af9d3885265c01c/configparser-3.5.0.tar.gz#md5=cfdd915a5b7a6c09917a64a573140538 (from pylint==1.8.2->-r /tmp/requirements.txt (line 30))
    backports.ssl_match_hostname from https://pypi.python.org/packages/76/21/2dc61178a2038a5cb35d14b61467c6ac632791ed05131dda72c20e7b9e23/backports.ssl_match_hostname-3.5.0.1.tar.gz#md5=c03fc5e2c7b3da46b81acf5cbacfe1e6 (from websocket-client==0.40.0->-r /tmp/requirements.txt (line 41))
The command '/bin/sh -c pip install -U pip wheel &&     pip install -r /tmp/requirements.txt &&     pip3 install -U pip wheel &&     pip3 install -r /tmp/requirements.txt' 

This might be fixed by #36 by switching to pipenv.

Cron job to garbage collect test resources

Our tests create a bunch of resources

  • Argo workflows
  • Namespaces
  • GCE VMs
  • GKE clusters

Most of these resources (exception is Argo) should be GC'd by the teardown steps in our tests. But some failures/bugs prevent resources from being GC'd.

It would be good to have a cron job to periodically garbage collect old resources like namespaces.

Build Docker images on pre/postsubmit

We'd like to setup our infrastructure so that we can build Docker images on pre/postsubmit so that we always have up to date images.

Our current thinking is that we will use Argo to create pipelines that do the builds and run any tests. We can use Prow to trigger this in response to pre/postsubmit

Here's some detail how this could work

  • We modify run_e2e_workflow.py so we can easily have Prow trigger and wait for more than one Argo workflow.
  • Our workflows can take advantage of checkout.sh to checkout the relevant source
    • Alternatively we could submit a single workflow with multiple sub workflows
    • I think this might entail refactoring our ksonnet definitions so that we can pull out the bits
      corresponding to the step definitions.
  • We can build and push the image
  • We can run the tests
  • If tests pass the final step can retag the image with a label like latest.

Related issues:

/cc @lluunn @yupbank

Setup prow

Need to setup prow for kubeflow/testing

Tests failing; Looks like run_e2e_workflow.py is not waiting for workflows to finish

See for example
kubeflow/kubeflow#787

Logs here:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/787/kubeflow-presubmit/1462/?log#log

Relevant bit:

INFO|2018-05-10T18:32:32|/src/kubeflow/testing/py/kubeflow/testing/util.py|58| Updating workflows kubeflow-test-infra.jlewi-kubeflow-gke-deploy-test-4-3a8b
INFO|2018-05-10T18:32:33|/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py|161| URL for workflow: http://testing-argo.kubeflow.org/workflows/kubeflow-test-infra/kubeflow-presubmit-kubeflow-gke-deploy-787-f893ce6-1462-040e?tab=workflow
INFO|2018-05-10T18:32:33|/src/kubeflow/testing/py/kubeflow/testing/argo_client.py|21| Workflow kubeflow-presubmit-kubeflow-e2e-gke-787-f893ce6-1462-b429 in namespace kubeflow-test-infra; phase=Running
INFO|2018-05-10T18:32:33|/src/kubeflow/testing/py/kubeflow/testing/argo_client.py|21| Workflow kubeflow-presubmit-kubeflow-e2e-minikube-787-f893ce6-1462-b59b in namespace kubeflow-test-infra; phase=Running
INFO|2018-05-10T18:32:33|/src/kubeflow/testing/py/kubeflow/testing/argo_client.py|21| Workflow kubeflow-presubmit-tf-serving-image-787-f893ce6-1462-0913 in namespace kubeflow-test-infra; phase=Running
INFO|2018-05-10T18:32:34|/src/kubeflow/testing/py/kubeflow/testing/util.py|512| Writing gs://kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/787/kubeflow-presubmit/1462/finished.json
INFO|2018-05-10T18:32:34|/src/kubeflow/testing/py/kubeflow/testing/util.py|522| Uploading file /tmp/tmpRunE2eWorkflowbjXGWwlog to gs://kubernetes-jenkins/pr-logs/pull/kubeflow_kubeflow/787/kubeflow-presubmit/1462/build-log.txt.

So looks like we ended while workflows were still running.

Add labels to workflows

Related to #72 (truncate workflow name)

Right now we encode a lot of information in the workflow name. The pattern is

{ORG}-{REPOSITORY}-{PROW-JOB-TYPE}-{WORKFLOW-NAME}-{PR-NUMBER}-{RANDOMNESS}

It would be good to start using labels to represent this information. This would make it easier to filter and select workflows.

Write metadata to gubernator at start of run.

In Gubernator, we display metadata that has links to the Argo workflows.

This metadata is stored in finished.json
https://github.com/kubernetes/test-infra/tree/master/gubernator

We currently write finished.json after the test finishes. But this means the handy links to the Argo workflows aren't available until the job finishes.

I wonder if we could write finished.json at the start of the test but with just the metadata field (no timestamp or result) and then just overwrite it at the end. The purpose being to display the Argo links even before the job finishes.

@BenTheElder @krzyzacy Do you know if this will work?

Optionally ignore some E2E test failures to let new tests bake

See kubeflow/kubeflow#558
When adding a new E2E test we might want to turn it on but not mark its failures as blocking, because we want to bake the test before making it blocking.

We could do this as follows

  1. in prow_config.yaml add options to indicate a workflow shouldn't be counted as a failure
  2. If a workflow is marked as ignore, don't take it into account when
    • Marking the prow job as success or failed
    • Setting the exit code of the prow job

The JUnit XML files could still indicate whether the test passed or not so that in test grid we could see whether the test was reliable or not.

Fix prow merge's incompatibility with new synced labels.

During label auto-sync, we had a label rename: "do-not-merge/work-in-progress" -> "status/in progress", which might cause prow to merge incomplete PRs.
One false positive merge here: kubeflow/community#41 (comment)

To fix it:

1. revert all false merges on affected repos (see affected repo list below)

2. update prow merge rule to listen to label "status/in progress", or revert name change of this label.

Affected repos (might have merged incomplete PRs):

reporting
examples
community
pytorch-operator
caffe2-operator
experimental-beagle
hp-tuning

These repos are not affected due to the bot user's lack of permission (includes our most active repos):

kubeflow
tf-operator
testing
example-seldon
experimental-kvc

Setup test infrastructure for kubeflow.org

We want to move our test infrastructure into kubeflow.org and share it with other members of the community willing to take on responsibility for maintaining it.

Here's what we need

  • Create a GCP project kubeflow-testing in kubeflow.org
    • We'll use this for GCR and our E2E tests.
    • Created project kubeflow-ci
  • Create a Google group in kubeflow.org for owners/maintainers of the test infrastructure
    • Created group ci-team@
  • Identify some folks to be responsible for the infrastructure and add them to the group.
  • Follow these instructions to setup our K8s cluster in kubeflow-testing
  • Update our E2E workflows to use the new cluster

Create a standard Argo workflow to run basic tests

  • We should create a standard workflow to run basic tests that can be shared by all our repositories; e.g.

    • lint checks
    • run py and go unittests
  • I think it's an open question whether we would run this as a separate prow job or have a single prow job for each workflow and run this as a sub workflow.

Provide information to debug failed deployments

A lot of times our tests will fail because we timeout waiting for a K8s deployment.
An example is when a deployment fails to start because the image couldn't be pulled.

To debug the test failure we'd like to capture the events for the pod because this provides helpful debug information.

For example here is the output of kubectl describe for a pod that failed to start because the name of the binary to invoke was wrong.

kubectl -n kubeflow-presubmit-kubeflow-e2e-299-439a983-360-cee1 describe pods tf-job-operator-7899645f6c-cgxd2
Name:           tf-job-operator-7899645f6c-cgxd2
Namespace:      kubeflow-presubmit-kubeflow-e2e-299-439a983-360-cee1
Node:           gke-kubeflow-testing-default-pool-3cd81eea-2dp8/10.142.0.2
Start Time:     Mon, 26 Feb 2018 18:02:23 -0800
Labels:         name=tf-job-operator
                pod-template-hash=3455201927
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"kubeflow-presubmit-kubeflow-e2e-299-439a983-360-cee1","name":"tf-job-operator-789...
Status:         Running
IP:             10.36.4.231
Created By:     ReplicaSet/tf-job-operator-7899645f6c
Controlled By:  ReplicaSet/tf-job-operator-7899645f6c
Containers:
  tf-job-operator:
    Container ID:  docker://e1d6dfd6955d2c4c1d89030626e8f143293f14bd55081421ce0d3da26ab64845
    Image:         gcr.io/kubeflow-images-staging/tf_operator:v20180226-403
    Image ID:      docker-pullable://gcr.io/kubeflow-images-staging/tf_operator@sha256:6c80d00498103a87d2acbdad34ce56886bacf29fb819e4424fa909dfc4c5b2e6
    Port:          <none>
    Command:
      /opt/mlkube/tf_operator
      --controller-config-file=/etc/config/controller_config_file.yaml
      --alsologtostderr
      -v=1
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    ContainerCannotRun
      Message:   oci runtime error: container_linux.go:247: starting container process caused "exec: \"/opt/mlkube/tf_operator\": stat /opt/mlkube/tf_operator: no such file or directory"

      Exit Code:    127
      Started:      Mon, 26 Feb 2018 18:18:16 -0800
      Finished:     Mon, 26 Feb 2018 18:18:16 -0800
    Ready:          False
    Restart Count:  8
    Environment:
      MY_POD_NAMESPACE:  kubeflow-presubmit-kubeflow-e2e-299-439a983-360-cee1 (v1:metadata.namespace)
      MY_POD_NAME:       tf-job-operator-7899645f6c-cgxd2 (v1:metadata.name)
    Mounts:
      /etc/config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from tf-job-operator-token-kk7pm (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tf-job-operator-config
    Optional:  false
  tf-job-operator-token-kk7pm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tf-job-operator-token-kk7pm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age                From                                                      Message
  ----     ------                 ----               ----                                                      -------
  Normal   Scheduled              19m                default-scheduler                                         Successfully assigned tf-job-operator-7899645f6c-cgxd2 to gke-kubeflow-testing-default-pool-3cd81eea-2dp8
  Normal   SuccessfulMountVolume  19m                kubelet, gke-kubeflow-testing-default-pool-3cd81eea-2dp8  MountVolume.SetUp succeeded for volume "config-volume"
  Normal   SuccessfulMountVolume  19m                kubelet, gke-kubeflow-testing-default-pool-3cd81eea-2dp8  MountVolume.SetUp succeeded for volume "tf-job-operator-token-kk7pm"
  Normal   Pulled                 18m (x4 over 19m)  kubelet, gke-kubeflow-testing-default-pool-3cd81eea-2dp8  Container image "gcr.io/kubeflow-images-staging/tf_operator:v20180226-403" already present on machine
  Normal   Created                18m (x4 over 19m)  kubelet, gke-kubeflow-testing-default-pool-3cd81eea-2dp8  Created container
  Warning  Failed                 18m (x4 over 19m)  kubelet, gke-kubeflow-testing-default-pool-3cd81eea-2dp8  Error: failed to start container "tf-job-operator": Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"/opt/mlkube/tf_operator\": stat /opt/mlkube/tf_operator: no such file or directory"
  Warning  BackOff                9m (x44 over 19m)  kubelet, gke-kubeflow-testing-default-pool-3cd81eea-2dp8  Back-off restarting failed container
  Warning  FailedSync             4m (x74 over 19m)  kubelet, gke-kubeflow-testing-default-pool-3cd81eea-2dp8  Error syncing pod

Why is Travis running?

On this PR
#126

Travis ran and failed. Did someone enable Travis? To my knowledge we don't use Travis in this repo.

Add instructions for adding GPUs to our test cluster

We should add instructions to our test-infra setup to add GPUs.

Here are the commands I used

gcloud beta --project=kubeflow-ci container node-pools create k80 --accelerator type=nvidia-tesla-k80,count=1 --zone us-east1-d --cluster kubeflow-testing --machine-type=n1-standard-8 --num-nodes 1 --min-nodes 0 --max-nodes 5 --enable-autoscaling
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/nvidia-driver-installer/cos/daemonset-preloaded.yaml

This is straight from the docs
https://cloud.google.com/kubernetes-engine/docs/concepts/gpus

Garbage collect old Argo workflows / Argo UI unresponsive

On our test cluster in project mlkube-testing, I'm seeing Argo workflows going back as far as 22 days.
I'm also noticing that the UI and kubectl are sluggish.
I suspect this might be because there are so many workflows.

What is the garbage collection policy for workflows that have completed? Is there any?
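One possible policy (a sketch of what a cleanup cron job could do, not something the cluster does today) is to select completed workflows older than some cutoff and delete them; the age filter might look like:

```python
import datetime

def workflows_to_delete(workflows, max_age_days=7, now=None):
    """Pick completed workflows older than max_age_days for deletion.

    workflows is a list of (name, finished_at) tuples; the returned names
    could then be passed to `argo delete` or `kubectl delete workflow`.
    The 7-day default is an assumption, not an agreed-upon policy.
    """
    now = now or datetime.datetime.utcnow()
    cutoff = now - datetime.timedelta(days=max_age_days)
    return [name for name, finished_at in workflows if finished_at < cutoff]
```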

Add a link to Argo UI in gubernator

We should see if we can add a link to the Argo UI for the job so that it's readily accessible in gubernator.

I wonder if we can just specify it as metadata?

Add argo status link to GitHub

I would like a link directly from a PR to the corresponding Argo workflow to make debugging easier.

This was originally added in #17 but it broke things (#20) and was rolled back.

The environment variable GIT_TOKEN isn't defined in the prow job. This code runs in the prow container.

We'll need to rethink how we get the GitHub API token; it's possible Prow already attaches a secret for the API token.

Alternatively, it might just be easier to see if we can set it in gubernator; see #10.

/cc @jose5918

Tests are failing because GIT_TOKEN not set

INFO|2018-02-10T02:56:21|/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py|129| URL for workflow: http://testing-argo.kubeflow.io/timeline/kubeflow-test-infra/kubeflow-presubmit-232-a452812-181-cf21;tab=workflow
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 232, in <module>
    success = main()
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 222, in main
    return run(args, file_handler)
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 130, in run
    status = github_status.GithubStatus()
  File "/src/kubeflow/testing/py/kubeflow/testing/github_status.py", line 51, in __init__
    Github.__init__(self)
  File "/src/kubeflow/testing/py/kubeflow/testing/github_status.py", line 24, in __init__
    raise Exception('Missing environment variable GIT_TOKEN')
Exception: Missing environment variable GIT_TOKEN

This is caused by #17. The environment variable GIT_TOKEN isn't defined in the prow job. This code runs in the prow container.

We'll need to rethink how we get the GitHub API token; it's possible Prow already attaches a secret for the API token.

Alternatively, it might just be easier to see if we can set it in gubernator; see #10.

In the meantime we should roll this back since it breaks our tests.

@jose5918
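One possible approach (a sketch; the secret mount path below is an assumption for illustration, not something prow is known to provide) is to fall back from the environment variable to a mounted K8s secret file instead of raising immediately:

```python
import os

def get_git_token(env_var="GIT_TOKEN", secret_path="/etc/github/token"):
    """Look up the GitHub API token.

    Check the environment first; fall back to a mounted secret file
    (hypothetical path); return None rather than raising so callers can
    skip reporting status when no token is available.
    """
    token = os.environ.get(env_var)
    if token:
        return token.strip()
    if os.path.isfile(secret_path):
        with open(secret_path) as f:
            return f.read().strip()
    return None
```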

Setup IAP for our release cluster

For our release cluster we should setup IAP so we can securely connect to the Argo UI.

We will need to wait for all the issues with IAP and Envoy JWT validation to be fixed so we can do this securely.

minikube VM unavailable for e2e kubeflow-presubmit check

/cc @jlewi
/cc @jimexist

PTAL

#36 took out the ssh install from #76 in the test-worker image, but there have been a bunch of commits since then too. 😞 Somehow the updated image only got pushed to kubeflow-ci in the past 3 days.

pmackinn@kubeflow-ci:~$ gcloud docker -- run -it --rm --entrypoint=/bin/sh gcr.io/kubeflow-ci/test-worker:latest
WARNING: `gcloud docker` will not be supported for Docker client versions above 18.03. Please use `gcloud auth configure-docker` to configure `docker` to use `gcloud` as a credential helper, then use `docker` as you would for non-GCR registries, e.g. `docker pull gcr.io/project-id/my-image`. Add `--verbosity=error` to silence this warning, e.g. `gcloud docker --verbosity=error -- pull gcr.io/project-id/my-image`. See: https://cloud.google.com/container-registry/docs/support/deprecation-notices#gcloud-docker
# dpkg -l | grep ssh
# exit
pmackinn@kubeflow-ci:~$ gcloud docker -- run -it --rm --entrypoint=/bin/sh gcr.io/kubeflow-ci/test-worker:v20180320-8c28838-e3b0c4 
WARNING: `gcloud docker` will not be supported for Docker client versions above 18.03. Please use `gcloud auth configure-docker` to configure `docker` to use `gcloud` as a credential helper, then use `docker` as you would for non-GCR registries, e.g. `docker pull gcr.io/project-id/my-image`. Add `--verbosity=error` to silence this warning, e.g. `gcloud docker --verbosity=error -- pull gcr.io/project-id/my-image`. See: https://cloud.google.com/container-registry/docs/support/deprecation-notices#gcloud-docker
# dpkg -l | grep ssh
ii  openssh-client             1:7.2p2-4ubuntu2.4                    amd64        secure shell (SSH) client, for secure access to remote machines
ii  openssh-server             1:7.2p2-4ubuntu2.4                    amd64        secure shell (SSH) server, for secure access from remote machines
ii  openssh-sftp-server        1:7.2p2-4ubuntu2.4                    amd64        secure shell (SSH) sftp server module, for SFTP access from remote machines
ii  ssh                        1:7.2p2-4ubuntu2.4                    all          secure shell client and server (metapackage)
# exit

Prow jobs marked as failure even though Gubernator indicates job succeeded

Example:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/kubeflow_tf-operator/388/kubeflow-tf-operator-presubmit/35/

The job is marked as a success and there are no test failures. There is no error in the log.

The issue is that the pod is exiting with a non-zero exit code. If we look at the pod logs (via the prow jobs UI) we see the following exception

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 294, in <module>
    success = main()
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 284, in main
    return run(args, file_handler)
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 184, in run
    if results["status"]["phase"] != "Succeeded":
TypeError: list indices must be integers, not str

This exception happens after the logs have been uploaded to GCS, which is why it doesn't show up in the gubernator logs.

/cc @lluunn
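A defensive fix (a sketch; I'm assuming from the TypeError that the workflow results are sometimes returned as a list of workflow objects rather than a single dict) would normalize the result before indexing:

```python
def workflow_phases(results):
    """Return the status.phase of each workflow in results.

    results may be a single workflow dict or a list of them (assumption
    based on the TypeError above); normalizing to a list first avoids
    indexing a list with a string key.
    """
    if not isinstance(results, list):
        results = [results]
    return [r["status"]["phase"] for r in results]
```

The caller could then fail the job if any phase is not "Succeeded", instead of crashing before the check runs.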

Test cluster is running out of disk quota

Error

"ERROR:root:Exception occured creating cluster: <HttpError 403 when requesting https://container.googleapis.com/v1/projects/kubeflow-ci/zones/us-east1-d/clusters?alt=json returned "Insufficient regional quota to satisfy request for resource: "DISKS_TOTAL_GB". The request requires '100.0' and is short '4.0'. The regional quota is '4096.0' with '96.0' available.">, status: 403
"  

This happened on kubeflow/training-operator#485

Use our own build cluster with prow

prow supports using your own build cluster.

Since we already use a K8s cluster to run our Argo workflows, we should ask prow to run our prow jobs in that cluster as well; this will give us more access to the logs.

Auto-push Kubeflow deployments corresponding to different Kubeflow versions.

It would be nice if we could auto-push dev.kubeflow.org or perhaps create another environment nightly-dev.kubeflow.org.

There are certain changes, like UI changes, that are difficult to test manually. It would be useful if we had an up-to-date environment.

I think we could easily adapt our E2E tests to do this and then just run it as a cron job.

Move docker images used by release/test infrastructure into a separate project

We use gcr.io/kubeflow-ci/test-worker as the worker image in our CI/CD workflows.
We would like to use the same Docker image in our release workflows.
But we'd like to make sure untrusted code running in PRs can't modify those images.

So it would be good to move these docker images into a separate registry such that code running in kubeflow-ci can read but not write that registry.

checkout.sh doesn't work with branches

Example PR: kubeflow/kubeflow#524

Error we observed was

+ /usr/local/bin/checkout.sh /src
+ SRC_DIR=/src
+ mkdir -p /src/kubeflow
+ git clone https://github.com/kubeflow/kubeflow.git /src/kubeflow/kubeflow
Cloning into '/src/kubeflow/kubeflow'...
+ echo Job Name = kubeflow-presubmit
+ REPO_DIR=/src/kubeflow/kubeflow
+ cd /src/kubeflow/kubeflow
Job Name = kubeflow-presubmit
+ '[' '!' -z 524 ']'
+ git fetch origin pull/524/head:pr
From https://github.com/kubeflow/kubeflow
 * [new ref]         refs/pull/524/head -> pr
+ '[' '!' -z 7d9c2633d7ddf2a1957e685f9e7f287340be0e34 ']'
+ git checkout 7d9c2633d7ddf2a1957e685f9e7f287340be0e34
fatal: reference is not a tree: 7d9c2633d7ddf2a1957e685f9e7f287340be0e34
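A sketch of the fix (a hypothetical helper; checkout.sh itself is shell, but the logic is the same): presubmits fetch the PR head, while jobs against a non-master branch need to fetch that branch explicitly so the commit SHA being tested is reachable before the checkout:

```python
def fetch_refspec(pull_number=None, branch=None):
    """Return the git refspec checkout.sh should fetch (hypothetical helper).

    - presubmit: fetch the PR head into a local 'pr' ref
    - branch job: fetch the branch so its commits are reachable
    - otherwise: fall back to master
    """
    if pull_number:
        return "pull/%s/head:pr" % pull_number
    if branch:
        return "%s:%s" % (branch, branch)
    return "master"
```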

Jsonnet utility function to parse prow_env to a dictionary

In our E2E jsonnet workflows we pass prow environment variables as a string of comma separated key value pairs

e.g

ks param set workflows prow_env "JOB_NAME=${JOB_NAME},JOB_TYPE=presubmit,PULL_NUMBER=${PULL_NUMBER},REPO_NAME=${REPO_NAME},REPO_OWNER=${REPO_OWNER},BUILD_NUMBER=${BUILD_NUMBER}"

It would be nice to parse this into a dictionary of key value pairs so that we could easily reference these parameters in jsonnet by name. Here is a snippet to do that. It would be nice to define this as a utility function in a common location.

  listToDict:: function(v)
    {
      // We enclose v[0] in brackets because we want to treat it as a variable and use its value
      // as the field and not treat it as a string literal corresponding to the field name.
      [v[0]]: v[1],
    },

  // parseEnvToMap takes a string "key1=value1,key2=value2,..."
  // and turns it into an object with keys key1, key2, ... and values value1,...,value2
  parseEnvToMap::function(v)
    local pieces = std.split(v, ",");
    if v != "" && std.length(pieces) > 0 then
      std.foldl(
        function(a, b) a + $.listToDict(std.split(b, "=")),
        pieces, {}
      )
    else {},
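For reference, the same parsing in Python (a sketch mirroring the jsonnet helper above, e.g. for use in our test scripts):

```python
def parse_prow_env(env_str):
    """Parse 'key1=value1,key2=value2,...' into a dict, like parseEnvToMap.

    An empty string yields an empty dict; splitting on the first '='
    only lets values themselves contain '='.
    """
    if not env_str:
        return {}
    return dict(pair.split("=", 1) for pair in env_str.split(","))
```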
