Test Infrastructure

This directory contains the Kubeflow test infrastructure.

We use Prow, K8s' continuous integration tool.

  • Prow is a set of binaries that run on Kubernetes and respond to GitHub events.

We use Prow to run:

  • Presubmit jobs
  • Postsubmit jobs
  • Periodic tests

Here's how it works

  • Prow is used to trigger E2E tests
  • The E2E test will launch an Argo workflow that describes the tests to run
  • Each step in the Argo workflow will be a binary invoked inside a container
  • The Argo workflow will use an NFS volume to attach a shared POSIX compliant filesystem to each step in the workflow.
  • Each step in the pipeline can write outputs and junit.xml files to a test directory in the volume
  • A final step in the Argo pipeline will upload the outputs to GCS so they are available in gubernator
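
As a rough sketch of what those last two steps look like (the paths, bucket, and test binary below are illustrative, not the actual values used by our workflows), a step writes its results into the shared test directory and the final step copies that directory to GCS:

# Hypothetical workflow step: run a test and leave its junit.xml and logs in the
# shared test directory on the NFS volume (TEST_DIR is assumed to point there).
mkdir -p ${TEST_DIR}/output
my_test_binary 2>&1 | tee ${TEST_DIR}/output/test.log
cp junit_mytest.xml ${TEST_DIR}/output/

# Hypothetical final step: copy the outputs to GCS so gubernator can render them.
gsutil -m cp -r ${TEST_DIR}/output gs://my-prow-bucket/logs/${JOB_NAME}/${BUILD_NUMBER}/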

Quick Links

Anatomy of our Tests

  • Our prow jobs are defined in config.yaml
  • Each prow job defines a K8s PodSpec indicating a command to run
  • Our prow jobs use run_e2e_workflow.py to trigger an Argo workflow that checks out our code and runs our tests.
  • Our tests are structured as Argo workflows so that we can easily perform steps in parallel.
  • The Argo workflow is defined in the repository being tested
    • We always use the workflow at the commit being tested
  • checkout.sh is used to checkout the code being tested
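
For illustration, the checkout step behaves roughly like the following (checkout.sh is the real script; REPO_OWNER, REPO_NAME, and the PULL_* variables are injected by Prow, but the destination path is an assumption):

# Clone the repository under test onto the shared NFS volume and check out the
# commit being tested.
git clone https://github.com/${REPO_OWNER}/${REPO_NAME}.git /mnt/test-data-volume/${JOB_NAME}/src
cd /mnt/test-data-volume/${JOB_NAME}/src
# PULL_PULL_SHA is set for presubmits; postsubmits get PULL_BASE_SHA instead.
git checkout ${PULL_PULL_SHA:-${PULL_BASE_SHA}}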

Accessing The Argo UI

The UI is publicly available at http://testing-argo.kubeflow.io/

Working with the test infrastructure

The tests store the results of tests in a shared NFS filesystem. To inspect the results you can mount the NFS volume.

To make this easy, we run a stateful set in our test cluster that mounts the same volumes as our Argo workers. Furthermore, this stateful set uses an environment (GCP credentials, docker image, etc.) that mimics our Argo workers. You can exec into this stateful set to get access to the NFS volume.

kubectl exec -it debug-worker-0 -- /bin/bash

This can be very useful for reproducing test failures.
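
For example, once inside the pod you can locate the shared volume and inspect a workflow's outputs (the mount point and directory layout below are assumptions; use df to confirm where the NFS share is actually mounted):

# Inside debug-worker-0: find the NFS mount and look at a workflow's test directory.
df -h -t nfs
ls /mnt/test-data-volume/<workflow-name>/output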

Logs

Logs from the E2E tests are available in a number of places and can be used to troubleshoot test failures.

Prow

These should be publicly accessible.

The logs from each step are copied to GCS and made available through gubernator. The K8s-ci robot should post a link to the gubernator UI in the PR. You can also find them as follows

  1. Open up the prow jobs dashboard e.g. for kubeflow/kubeflow
  2. Find your job
  3. Click on the link under job; this goes to the Gubernator dashboard
  4. Click on artifacts
  5. Navigate to artifacts/logs

If these logs aren't available it could indicate a problem running the step that uploads the artifacts to GCS for gubernator. In this case you can use one of the alternative methods listed below.

Argo UI

The argo UI is publicly accessible at http://testing-argo.kubeflow.io/timeline.

  1. Find and click on the workflow corresponding to your pre/post/periodic job
  2. Select the workflow tab
  3. From here you can select a specific step and then see the logs for that step

Unfortunately there are some limitations in the Argo UI; for example, logs for steps run as part of the exit handler may not show up.

So if your exit handler fails you may need to look at pod logs or Stackdriver logs directly.
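
In that case you can also pull a step's logs straight from its pod with kubectl while the pod still exists (the namespace matches our test cluster setup; Argo runs each step's command in a container named main):

# List the pods created by the workflow, then dump the logs of the step of interest.
kubectl get pods --namespace=kubeflow-test-infra | grep ${WORKFLOW_ID}
kubectl logs ${POD_ID} --namespace=kubeflow-test-infra --container=main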

Stackdriver logs

Since we run our E2E tests on GKE, all logs are persisted in Stackdriver logging.

Access to Stackdriver logs is restricted. We are working on giving sufficient access to members of the community (kubeflow/testing#5).

If you know the pod id corresponding to the step of interest then you can use the following Stackdriver filter

resource.type="container"
resource.labels.cluster_name="kubeflow-testing"
resource.labels.container_name="main"
resource.labels.pod_id=${POD_ID}

The ${POD_ID} is of the form

${WORKFLOW_ID}-${RANDOM_ID}
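
You can run the equivalent query from the command line with gcloud (a sketch; this assumes you have logs viewer access to the project hosting the test cluster):

gcloud logging read \
  "resource.type=\"container\" AND resource.labels.cluster_name=\"kubeflow-testing\" AND resource.labels.container_name=\"main\" AND resource.labels.pod_id=\"${POD_ID}\"" \
  --project=mlkube-testing --limit=200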

Adding an E2E test for a new repository

We use prow to launch Argo workflows. Here are the steps to create a new E2E test for a repository. This assumes prow is already configured for the repository (see these instructions for info on setting up prow).

  1. Create a ksonnet App in that repository and define an Argo workflow
    • The first step in the workflow should check out the code using checkout.sh
    • Code should be checked out to a shared NFS volume to make it accessible to subsequent steps
  2. Create a container to use with the Prow job (a sketch of this step follows the list)
  3. Create a prow job for that repository
    • The command for the prow job should be set via the entrypoint baked into the Docker image
    • This way we can change the Prow job just by pushing a docker image and we don't need to update the prow config.
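
A sketch of step 2 (the image name, tag, and entrypoint script are illustrative). Because the command lives in the image's entrypoint, the ProwJob's PodSpec only needs to reference the image:

# The Dockerfile for the test worker image would end with something like:
#   ENTRYPOINT ["/usr/local/bin/run_my_e2e_test.sh"]
docker build -t gcr.io/${PROJECT}/my-repo-test-worker:v20180301 .
gcloud docker -- push gcr.io/${PROJECT}/my-repo-test-worker:v20180301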

Testing Changes to the ProwJobs

Changes to our ProwJob configs in config.yaml should be relatively infrequent since most of the code invoked as part of our tests lives in the repository.

However, in the event we need to make changes here are some instructions for testing them.

Follow Prow's getting started guide to create your own prow cluster.

  • Note The only part you really need is the ProwJob CRD and controller.

Checkout kubernetes/test-infra.

git clone https://github.com/kubernetes/test-infra git_k8s-test-infra

Build the mkpj binary

bazel build //prow/cmd/mkpj

Generate the ProwJob Config

./bazel-bin/prow/cmd/mkpj/mkpj --job=$JOB_NAME --config-path=$CONFIG_PATH
  • This binary will prompt for needed information like the commit SHA
  • The output will be a ProwJob spec which can be instantiated using kubectl

Create the ProwJob

kubectl create -f ${PROW_JOB_YAML_FILE}
  • To rerun the job bump metadata.name and status.startTime

To monitor the job open Prow's UI by navigating to the external IP associated with the ingress for your Prow cluster or using kubectl proxy.
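
For example, you can port-forward Prow's deck service (the web frontend) to your workstation; the service name and port below assume a standard Prow deployment:

# Add --namespace=... if deck was deployed outside the default namespace.
kubectl port-forward svc/deck 8080:80
# Then open http://localhost:8080 in your browser.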

Integration with K8s Prow Infrastructure

We rely on the K8s instance of Prow to actually run our jobs.

Here's a dashboard with the results.

Our jobs should be added to the K8s Prow config.

Setting up Kubeflow Test Infrastructure

Our tests require:

  • a K8s cluster
  • Argo installed on the cluster
  • A shared NFS filesystem

Our prow jobs execute Argo workflows in projects/clusters owned by Kubeflow. We don't use the shared Kubernetes test clusters for this.

  • This gives us more control of the resources we want to use e.g. GPUs

This section provides the instructions for setting this up.

Create a GKE cluster

PROJECT=mlkube-testing
ZONE=us-east1-d
CLUSTER=kubeflow-testing
NAMESPACE=kubeflow-test-infra

gcloud --project=${PROJECT} container clusters create \
	--zone=${ZONE} \
	--machine-type=n1-standard-8 \
	--cluster-version=1.8.4-gke.1 \
	${CLUSTER}
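
After the cluster is created, fetch credentials so kubectl talks to the new cluster and create the namespace the rest of these instructions use:

gcloud --project=${PROJECT} container clusters get-credentials ${CLUSTER} --zone=${ZONE}
kubectl create namespace ${NAMESPACE}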

Create a GCP service account

  • The tests need a GCP service account to upload data to GCS for Gubernator
SERVICE_ACCOUNT=kubeflow-testing
gcloud iam service-accounts --project=mlkube-testing create ${SERVICE_ACCOUNT} --display-name "Kubeflow testing account"
gcloud projects add-iam-policy-binding ${PROJECT} \
    	--member serviceAccount:${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com --role roles/container.developer
  • Our tests create K8s resources (e.g. namespaces) which is why we grant it developer permissions.

Create a secret key containing a GCP private key for the service account

gcloud iam service-accounts keys create ~/tmp/key.json \
    	--iam-account ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
kubectl create secret generic kubeflow-testing-credentials \
    --namespace=kubeflow-test-infra --from-file=`echo ~/tmp/key.json`
rm ~/tmp/key.json

Make the service account a cluster admin

kubectl create clusterrolebinding  ${SERVICE_ACCOUNT}-admin --clusterrole=cluster-admin  \
		--user=${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
  • The service account is used to deploy Kubeflow, which entails creating various roles; so it needs sufficient RBAC permission to do so.

The service account also needs the following GCP privileges because various tests use them

  • Project Viewer (because GCB requires this with gcloud)
  • Cloud Container Builder
  • Kubernetes Engine Admin (some tests create GKE clusters)
  • Logs viewer
  • Storage Admin
  • Service Account User of the Compute Engine Default Service account (to avoid this error)
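
A sketch of granting these roles with gcloud (the role IDs below are my best guess at the IAM names for the roles listed above; double check them in the IAM console before relying on them):

for ROLE in roles/viewer \
            roles/cloudbuild.builds.editor \
            roles/container.admin \
            roles/logging.viewer \
            roles/storage.admin \
            roles/iam.serviceAccountUser; do
  gcloud projects add-iam-policy-binding ${PROJECT} \
    --member serviceAccount:${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
    --role ${ROLE}
done
# Note: Service Account User is ideally granted on the Compute Engine default
# service account itself rather than project-wide.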

Create a GitHub Token

You need to use a GitHub token with ksonnet; otherwise the test quickly runs into GitHub API limits.

TODO(jlewi): We should create a GitHub bot account to use with our tests and then create API tokens for that bot.

You can use the GitHub API to create a token

  • The token doesn't need any scopes because it's only accessing public data and is needed only for API metering.

To create the secret run

kubectl create secret generic github-token --namespace=kubeflow-test-infra --from-literal=github_token=${TOKEN}

Deploy NFS

We use GCP Cloud Launcher to create a single-node NFS share. Current settings:

  • 8 VCPU
  • 1 TB disk

Create a PD for NFS

Note: We are in the process of migrating to an NFS share outside the GKE cluster. Once we move kubeflow/kubeflow to that, we can get rid of this section.

Create a PD to act as the backing storage for the NFS filesystem that will be used to store data from the test runs.

  gcloud --project=${PROJECT} compute disks create  \
  	--zone=${ZONE} kubeflow-testing --description="PD to back NFS storage for kubeflow testing." --size=1TB

Create K8s Resources for Testing

The ksonnet app test-infra contains ksonnet configs to deploy the test infrastructure.

First, install the kubeflow package

ks pkg install kubeflow/core

Then change the server ip in test-infra/environments/prow/spec.json to point to your cluster.

You can deploy argo as follows (you don't need to use argo's CLI)

ks apply prow -c argo

Create the PVs corresponding to external NFS

ks apply prow -c nfs-external

Deploy NFS & Jupyter

ks apply prow -c nfs-jupyter
  • This creates the NFS share
  • We use JupyterHub as a convenient way to access the NFS share for manual inspection of the file contents.
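
Once the components are applied you can sanity-check that the pods came up and that the NFS PVs bound:

kubectl get pods --namespace=kubeflow-test-infra
kubectl get pv
kubectl get pvc --namespace=kubeflow-test-infra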

Troubleshooting

The user or service account deploying the test infrastructure needs sufficient permissions to create the roles that are created as part of deploying the test infrastructure. So you may need to run the following command before using ksonnet to deploy the test infrastructure.

kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=<your GCP account email>

Setting up a Kubeflow Repository to Use Prow

  1. Define ProwJobs see pull/4951

  2. For tensorflow/k8s configure webhooks by following these instructions

  3. Add the k8s bot account, k8s-ci-robot, as an admin on the repository

    • Admin privileges are needed to update status (but not comment)
    • Someone with access to the bot will need to accept the request.
  4. Follow instructions for adding a repository to the PR dashboard.
