
This repository hosts code that supports the testing infrastructure for the main PyTorch repo. For example, it hosts the logic to track disabled tests and slow tests, as well as our continuous integration HUD/dashboard.

Home Page: https://hud.pytorch.org/

License: Other


test-infra's Introduction

PyTorch TestInfra

The PyTorch TestInfra project is a collection of infrastructure components that support the PyTorch CI/CD system. It also contains various PyTorch development tools like linters.

Getting started

Clone the repository:

$ git clone --recursive https://github.com/pytorch/test-infra

Directories

├── aws                                  # Infra running in AWS
│   ├── lambda
│   └── websites                         # Several websites supported by TestInfra
│       ├── download.pytorch.org
│       └── metrics.pytorch.org
├── setup-ssh                            # SSH access setup for CI workers
├── stats                                # CI-related stats committed automatically by a bot
├── terraform-aws-github-runner          # Terraform modules and templates used in CI
├── tools                                # Tools and scripts
|   ├── clang-tidy-checks
|   └── scripts
└── torchci                              # Code for hud.pytorch.org and our pytorch bots which run there
    └── pages

Setting up your Dev environment to locally run hud.pytorch.org

  1. Install yarn, e.g. on macOS: brew install yarn

  2. cd torchci and install dependencies with yarn install

  3. Set up your environment variables

    a. Copy torchci/.env.example to torchci/.env.local to create a local copy of your environment variables. This file will NOT be checked into git

    b. For every environment setting defined in there, copy over the corresponding value from Vercel (this requires access to our Vercel deployment)

  4. From torchci run yarn dev to start the dev server. The local endpoint will be printed on the console, it'll most likely be http://localhost:3000. You can find more useful yarn commands in package.json under the scripts section.

Linting

We use actionlint to verify that the GitHub Actions workflows in .github/workflows are correct. To run it locally:

  1. Set up Go

  2. Install actionlint

    go install github.com/rhysd/actionlint/cmd/actionlint@7040327ca40aefd92888871131adc30c7d9c1b6d
  3. Run actionlint

    # The executable will be in ~/go/bin, so make sure that's on your PATH
    # actionlint automatically detects and uses shellcheck, so if it's not in
    # your PATH you will get different results than in CI
    actionlint

Join the PyTorch TestInfra community

See the CONTRIBUTING file for how to help out.

License

PyTorch TestInfra is BSD licensed, as found in the LICENSE file.

test-infra's People

Contributors

1ntegr8, atalman, bdruth, clee2000, dagitses, danilbaibak, dependabot[bot], driazati, dylan-smith, facebook-github-bot, gertjanmaas, henrynguyen5, huydhn, izaitsevfb, janeyx99, jeanschmidt, kit1980, malfet, npalm, osalpekar, palic, pmeier, samestep, seemethere, suo, swang392, weiwangmeta, zainrizvi, zengk95, zhouzhuojie

test-infra's Issues

Migrate pytorch/pytorch master references to main

Much of our infra here points to master; as pytorch/pytorch moves to main, we should support both for a time to ensure a smooth transition.

This includes some lambdas as well as torchci/HUD2/flaky bot code.

`clang-tidy` unable to find `PyConfig` definition

Ubuntu 18.04 ships with Python 3.6, which does not have PyConfig defined. However, PyTorch uses this definition in torch/csrc/deploy/interpreter/interpreter_impl.cpp. Relevant CI failure

The actual build is able to run on Python 3.6 machines because the build step checks out Python 3.8 and runs against those headers. CMake file

Since we don't actually want to build PyTorch when running the clang-tidy job, the workaround would be to bump our Docker base to CUDA 11.2 with Ubuntu 20.04.

[torchci] grouping

It'd be nice to have groupings in the timeline view, similar to what hud.pytorch.org already has.

`with-ssh` for Windows

Tracking issue for implementing with-ssh for Windows.

Retire "master" reference in Grafana

As we're migrating pytorch/pytorch to main, we should retire and rename "master" to main in our Grafana queries.

While we're at it, we should remove any outdated/unused tables.

metrics.pytorch.org umbrella issue

  • investigate if we can reduce deployment complexity by consolidating on Terraform + AWS instead of (terraform -> ansible -> docker swarm -> nginx -> grafana)
  • move logging from json-file to cloudwatch
  • add auth via GitHub logins gated on the PyTorch org

Migrate Rockset queries to support "main" default branch

For each of the following query lambdas, change references in the name, default value, and within the query to main instead of master. Be careful not to remove any versioning; instead, create a new version for each change so that our infra doesn't break.

commons

  • cancelled_jobs
  • master_commits
  • original_pr_hud_query

metrics

  • last_branch_push
  • last_successful_workflow
  • master_commit_red
  • master_commit_red_avg
  • master_jobs_red
  • master_jobs_red_avg
  • reverts
  • top_failures

macOS binaries for clang-tidy

Build macOS binaries for clang-tidy and upload them to S3. This will require changes to the setup script since lld is not supported on macOS. Creating a tracking issue since naively opting in to using ld doesn't seem to work.

[torchci] PR view followups

  • We are currently showing the git commit summary in the summary view. We should probably show the PR body instead, since it is not always the same as the git commit message.
  • We don't currently have a link to the GitHub PR page anywhere. The commit summary has a "GitHub" link, but it goes to the commit itself, which is generally not what people want.
  • We don't currently have a way to get to the PR page from anywhere except a direct link
  • We should follow GitHub's path structure for commits and PRs (e.g. /[repoOwner]/[repoName]/pull/[prNumber]). That way someone can change github.com to torch-ci.com and be taken directly to the correct page.
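As a sketch of the last bullet: if HUD mirrors GitHub's path structure, the mapping reduces to a pure string operation. The helper name and the torch-ci.com host handling below are assumptions for illustration, not existing HUD code:

```python
from urllib.parse import urlparse

def to_hud_url(github_url: str) -> str:
    """Map a github.com commit/PR URL onto the same path on torch-ci.com.

    Assumes HUD adopts GitHub's /[repoOwner]/[repoName]/pull/[prNumber]
    path scheme as proposed above.
    """
    path = urlparse(github_url).path
    return f"https://torch-ci.com{path}"

print(to_hud_url("https://github.com/pytorch/pytorch/pull/71611"))
# -> https://torch-ci.com/pytorch/pytorch/pull/71611
```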

update_dashboards.py doesn't support deletes

Removing a dashboard in Grafana doesn't get reflected in the repo, so the removed dashboard is re-added on the next deploy run.

This dashboard is old and we can remove it, but we need support from the updating script first. To do this, make your changes to https://github.com/pytorch/test-infra/blob/main/aws/websites/metrics.pytorch.org/update_dashboards.py.

  1. Get the credentials from secrets_tool and set them in your terminal

export GRAFANA_USER=abc
export GRAFANA_PASSWORD=abc

  2. Edit the script to recognize when a dashboard is missing in the API response from Grafana but present in the local filesystem, and delete the local file

  3. Verify the correctness by checking git status and seeing that the file was deleted and that the script will notify the GitHub Actions runner by printing ::set-output name=UPDATED_DASHBOARDS::yes

  4. Land the code, delete the dashboard for real and watch https://github.com/pytorch/test-infra/actions/workflows/generated-update-grafana-dashboards.yml to make sure it's working correctly (GitHub_Actions_Status.json should be deleted from https://github.com/pytorch/test-infra/tree/main/aws/websites/metrics.pytorch.org/files/dashboards)
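The deletion logic described above could be sketched roughly like this. The function name and the shape of the Grafana response are assumptions for illustration, not the actual update_dashboards.py API:

```python
import os

def sync_deleted_dashboards(remote_names: set, local_dir: str) -> bool:
    """Remove local dashboard files that Grafana no longer reports.

    remote_names: dashboard filenames (with .json suffix) present in the
    Grafana API response (assumed already normalized to match local names).
    Returns True if any file was deleted, so the caller knows to notify
    the GitHub Actions runner that dashboards were updated.
    """
    deleted = False
    for fname in os.listdir(local_dir):
        if fname.endswith(".json") and fname not in remote_names:
            os.remove(os.path.join(local_dir, fname))
            deleted = True
    return deleted
```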

Grafana automate the certificate generation

Implement automatic certificate refreshing as described in the following document:

https://docs.google.com/document/d/1nq3dx-_8wasii1koCkXJDSo3uz_0Ee8DzIS2-j2TOpA/edit#

Possible solution 1:

  1. run the metrics GitHub Action install periodically (once a month)
  2. regenerate the certificate during install
  3. disable and remove overwriting of certificates from secrets

Possible solution 2:

  1. create a monthly cron job to take down the metrics server, update certificates, and restart it
  2. disable overwriting of certificates from secrets

Clean up grafana dashboards

The dashboards are pretty unorganized right now, and it's not clear which are meant to be used vs. experimental. Some dashboards (e.g. gcc5.4 job times) also gather metrics in outdated ways. We should use folders and rename things to be more accurate.

Consolidate lambdas

We have lambdas sprawled over a bunch of places; it would be nice to consolidate them all in one place and deploy them centrally:

  • ossci-log-analyzer
  • the three lambdas to populate job information from GHA, circleci, jenkins
  • pytorch/probot is deployed as a lambda
  • All the lambdas in the aws/ folder

This is currently blocked on our usage of Vercel, which has a 15s timeout for serverless functions. Some of our functions take longer than that to download and/or analyze big logs.

To fix:

  • We can move to Netlify, which has background functions that adhere to the underlying lambda limit (15m)
  • We can wait; I spoke to a Vercel support person who said they are planning similar functionality to background functions (although no ETA was given)

[torchci] List batched commits on HUD

Currently, some commits are hidden because GitHub and ShipIt batch commits per push. The current HUD takes the head commit of every push and does not mention the tag-along commits that have no CI runs of their own.

This is misleading for diagnosing issues as one of the tag-along commits could be the culprit, e.g., pytorch/pytorch#71611.

We should agree on a way to visualize these batched commits and implement it into the new HUD.

I think the steps of this task are:

  1. Also return 'commits' as a result of the master_commits query on Rockset
  2. Use the new version of the query near
    const commitsBySha = _.keyBy(commitQuery.results, "sha");
  3. Figure out how we want to display these commits
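For reference, the `_.keyBy` call in step 2 has a one-line Python analogue: index a list of records by a field so tag-along commits can be looked up by sha in O(1). The result shape below is illustrative, not the real Rockset output:

```python
def key_by(items, key):
    """Python analogue of lodash's _.keyBy: build a dict mapping
    item[key] -> item for each record in the list."""
    return {item[key]: item for item in items}

# Hypothetical query results; the real shape comes from the
# master_commits query on Rockset.
results = [{"sha": "abc123", "message": "Fix foo"},
           {"sha": "def456", "message": "Add bar"}]
commits_by_sha = key_by(results, "sha")
```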

Migrate auto-commits from slow/disabled tests to a different branch

Currently, the workflows that automatically update slow tests and disabled tests statistics commit the changes to main. To minimize noise, it would be useful to migrate these to push to a different branch, like maybe branch stats-update or something. We can later merge stats-update into main if desired.

Steps:

Automate packer builds for aws/ami/windows

I'd like for us to automate the packer builds for our Windows images through GitHub Actions:

  • Getting packer build policies into our AWS account for use with Managed IAM (Meta employees: see the internal wiki on how to generate these credentials)
  • Write a GitHub Actions workflow with credentials from Managed IAM to facilitate building packer images in CI (will only really work on non-forked PRs) (code for building packer images can be found here, https://github.com/pytorch/test-infra/tree/main/aws/ami/windows)

This is needed to remove a manual step in the CUDA upgrade process, step 6: "Generate new Windows AMI, test and deploy to canary and prod."

In grouped view, all non-existent jobs should not aggregate to green

If every job in a group doesn't exist, it should aggregate to a grey o, not a green o. Since we skip all jobs for stacked diffs, we frequently show a misleading green signal. For example, a stretch where the linux jobs had been failing consistently looked like green interspersed with red because of this bug.
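A minimal sketch of the intended aggregation, assuming a simplified status enum (the names below are illustrative, not HUD's actual types):

```python
def group_status(statuses):
    """Aggregate job conclusions for a group cell on HUD.

    statuses: list of job conclusions, with None for jobs that did not
    run on this commit. A group where every job is missing aggregates
    to "grey" rather than "green", so fully-skipped stacked diffs are
    not shown as passing.
    """
    present = [s for s in statuses if s is not None]
    if not present:
        return "grey"
    if any(s == "failure" for s in present):
        return "red"
    if any(s == "pending" for s in present):
        return "pending"
    return "green"
```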

Updating clang-tidy binary hashes

The installer script in pytorch/pytorch uses a sha256sum hash of the binary for verification. Each time we build a new binary, however, that hash gets invalidated. This causes our CI to fail.

We need to implement versioned binaries, which would give us a window of time to update the hash while keeping CI active.
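The verification the installer performs boils down to a sha256 comparison; with versioned artifact names, each pinned hash stays valid for its version, so publishing a new build doesn't invalidate existing CI. A sketch (the versioned naming scheme is an assumption, not the repo's actual layout):

```python
import hashlib

def verify_binary(path: str, expected_sha256: str) -> bool:
    """Compare a downloaded binary's sha256 against a pinned hash."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == expected_sha256

# With versioned artifacts (e.g. clang-tidy-<version> instead of a
# single clang-tidy object), the pinned hash for each version remains
# stable even after newer binaries are uploaded.
```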

Move stats/disabled-tests.json to S3

The disabled tests JSON is updated through a GHA workflow that runs every 15 minutes. This causes many commits to the test-infra repo, which adds unnecessary noise and bloats our commit history.

We could remedy this by:

  1. Hosting the disabled tests elsewhere, like on S3. This would effectively remove the noise in our commit history, but we would lose the benefits of having it in a GitHub repo. For example, it would be harder to modify and update the file manually, and there would be no review process when that occurs. Also, version control would be harder to visualize in an S3 bucket compared to Git, a system built for versioning.

  2. Looking back on a few commits, such as d18b417, it seems the contents of the disabled tests have not actually changed; only the order of the JSON arrays has. If there's an easy way to sort the JSON before we commit, we could reduce noise by only committing when the actual contents change.

Please comment which alternative you think is better!
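Option 2 amounts to canonicalizing the JSON before serializing it. A sketch, assuming array order carries no meaning in the disabled-tests data (the helper names are illustrative):

```python
import json

def canonicalize(obj):
    """Recursively sort dict keys and list elements so that semantically
    identical payloads serialize to identical text."""
    if isinstance(obj, dict):
        return {k: canonicalize(v) for k, v in sorted(obj.items())}
    if isinstance(obj, list):
        # key=json.dumps gives a total order over mixed JSON values
        return sorted((canonicalize(v) for v in obj), key=json.dumps)
    return obj

def stable_dump(obj) -> str:
    return json.dumps(canonicalize(obj), indent=2)
```

With this, the workflow would only produce a diff (and hence a commit) when the set of disabled tests actually changes, not when their ordering does.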

Add a way to view test history on the HUD

This is a lacking area: we currently have no easy way to access test history without sorting through the data ourselves.

Some things we may want to be able to know from a test view:

  • When was the last time this test case was run and on what platforms?
  • What branch/PR was the test case run on?
  • How often does this test pass?
  • Is this test flaky?
  • How many tests are we running per commit on trunk?

Other requirements for the test view:

  • Publicly accessible
  • Easy to navigate for someone without SQL knowledge.

Improve grafana server logging

As a software developer, I would like to make sure Grafana server logs are forwarded to the host machine. This way, if the Docker container is killed or the image needs to be redeployed, the logs are not lost.

Metrics: add usage statistics about metrics website use

Ultimately we want to display usage statistics per dashboard: signed-in users vs. not signed in, which dashboards are viewed, and the number of page views. Very similar to Google Analytics.
