dr-ci's Introduction

A log analyzer for CircleCI

Intro

An organization would like to determine the most common causes of intermittent build failures and flaky tests in a repository so that effort to fix them can be prioritized.

Outputs

The Dr. CI project entails two distinct user-facing outputs:

The latter has several distinct utilities:

  • Annotation interface for deterministic master failures
  • Flakiness review tool
  • Stats dashboards

Codebase

See docs/CODEBASE-OVERVIEW.md.

Repository assumptions

Dr. CI assumes a linear history of the master branch. This can be enforced on GitHub via the following setting under the "Branches" -> "Branch protection rule" section for master:

GitHub setting

Functionality

This tool obtains a list of CircleCI builds run against a GitHub repository for a master branch, downloads their logs (stripped of ANSI escape codes) from AWS, and scans the logs for a predefined list of labeled patterns (regular expressions).

These patterns are curated by an operator. The frequency of occurrence of each pattern is tracked and presented in a web UI.
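As a rough illustration of the scanning step, the sketch below strips ANSI escape codes from a log and checks each line against a set of labeled regular expressions. It is not Dr. CI's implementation: the pattern labels, the expressions, and the function names `strip_ansi`, `scan_log`, and `pattern_frequencies` are all hypothetical, and the frequency count assumes "frequency" means the number of logs in which a pattern appears at least once.

```python
import re
from collections import Counter

# Matches CSI escape sequences such as "\x1b[31m" (colors, cursor movement).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")

# Hypothetical labeled patterns; the real list is curated by an operator.
PATTERNS = {
    "network-timeout": re.compile(r"Connection timed out"),
    "out-of-memory": re.compile(r"(?i)out of memory"),
}


def strip_ansi(text):
    """Remove ANSI escape sequences from a log."""
    return ANSI_ESCAPE.sub("", text)


def scan_log(log_text, patterns=PATTERNS):
    """Return (label, line_number, line) for every pattern match in a log."""
    matches = []
    for line_number, line in enumerate(strip_ansi(log_text).splitlines(), start=1):
        for label, regex in patterns.items():
            if regex.search(line):
                matches.append((label, line_number, line))
    return matches


def pattern_frequencies(logs):
    """Count the number of logs in which each pattern occurs at least once."""
    counts = Counter()
    for log_text in logs:
        counts.update({label for label, _, _ in scan_log(log_text)})
    return counts
```

Calling `scan_log` on a single log yields the matched lines with their labels, and `pattern_frequencies` aggregates those labels across many logs, which is the kind of figure a dashboard could display.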

The database tracks which builds have been already scanned for a given pattern, so that scanning may be performed incrementally or resumed after abort.

Tool workflow

  • A webhook listens for build status changes on a GitHub PR
  • For each failed build, that build's log will be scanned for any of the patterns in the database tagged as "flaky"
  • If all of the failures were flaky, the indicator will be green. There will be a link in the status box to dive into the details.
  • Failures marked in this tool as "known problems" are treated likewise.
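The decision described in the workflow above can be sketched as a small function. This is only an illustration under stated assumptions: the `FailedBuild` type, its field names, and the "red" value are hypothetical (the text only specifies when the indicator is green), and a commit with no failed builds is assumed to be green.

```python
from dataclasses import dataclass


@dataclass
class FailedBuild:
    job_name: str
    matched_flaky_pattern: bool   # log matched a pattern tagged "flaky"
    known_problem: bool = False   # annotated as a "known problem"


def status_indicator(failed_builds):
    """Return "green" if every failure is flaky or a known problem, else "red"."""
    explained = all(
        build.matched_flaky_pattern or build.known_problem
        for build in failed_builds
    )
    return "green" if explained else "red"
```

A single failure that is neither flaky nor a known problem is enough to turn the indicator "red" in this sketch.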

Known Problem reporting

Requiring that failures in the master branch be annotated will facilitate tracking of the frequency of "brokenness" of master over time, and allow measurement of whether this metric is improving.

It is possible for only specific jobs of a commit to be marked as "known broken", e.g. the Travis CI Lint job.

Log scanning data flow diagram

flow diagram

Deployment

Development Environment Setup

See: docs/development-environment

AWS dependencies and deployment

See: docs/aws

Ingestion overview

  1. A small webservice (named gh-notification-ingest-env in Elastic Beanstalk, and hosted at domain github-notifications-ingest.pytorch.org) receives GitHub webhook notifications and stores them (synchronously) in a database.
  2. A periodic (3-minute interval) AWS Lambda task EnqueSQSBuildScansFunction queries for unprocessed notifications in the database, and enqueues an SQS message for each of them.
  3. Finally, an Elastic Beanstalk Worker-tier server named log-scanning-worker processes the SQS messages as capacity allows.

We want a cool-off period during which multiple builds for a given commit can be aggregated into one task for that commit. This is accomplished via an SQS deduplicating queue, where multiple instances of the same commit are consolidated while in the queue.
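To make the consolidation idea concrete, here is a minimal in-memory stand-in for a deduplicating queue keyed by commit SHA. It is not SQS and does not use the AWS API; the class name `DedupQueue` and its methods are hypothetical, and the notification counter exists only to show that repeated notifications collapse into one task.

```python
from collections import OrderedDict


class DedupQueue:
    """In-memory sketch of a queue that holds at most one task per commit."""

    def __init__(self):
        # commit SHA -> number of notifications received while queued
        self._pending = OrderedDict()

    def enqueue(self, commit_sha):
        """Add a commit; return True only if it was not already queued."""
        is_new = commit_sha not in self._pending
        self._pending[commit_sha] = self._pending.get(commit_sha, 0) + 1
        return is_new

    def dequeue(self):
        """Pop the oldest commit as (commit_sha, notification_count), or None."""
        if not self._pending:
            return None
        return self._pending.popitem(last=False)

    def __len__(self):
        return len(self._pending)
```

Three build notifications for the same commit that arrive before the worker gets to it produce a single task here, so the worker scans that commit once.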

Optimizations

  • We can skip inspecting all of the "previously-visited" builds if the master "scan" record points to the newest pattern ID.
    • Better yet, use a single DB query to get the list of out-of-date "already-visited" builds, instead of a separate query per build to obtain the unscanned pattern list.
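The single-query idea can be sketched with SQLite. The schema is an assumption made for illustration, not Dr. CI's actual schema: a hypothetical `patterns` table with monotonically increasing IDs, and a hypothetical `build_scans` table recording the newest pattern ID each build has been scanned against.

```python
import sqlite3


def demo_connection():
    """Create an in-memory database with a hypothetical schema and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patterns (id INTEGER PRIMARY KEY, expression TEXT);
        CREATE TABLE build_scans (
            build_id INTEGER PRIMARY KEY,
            last_pattern_id INTEGER
        );
        INSERT INTO patterns VALUES
            (1, 'timed out'), (2, 'out of memory'), (3, 'segfault');
        INSERT INTO build_scans VALUES (100, 3), (101, 1), (102, 2);
        """
    )
    return conn


def out_of_date_builds(conn):
    """Return (build_id, last_pattern_id) for builds not yet scanned
    against the newest pattern, using one query for all builds."""
    rows = conn.execute(
        """
        SELECT build_id, last_pattern_id
        FROM build_scans
        WHERE last_pattern_id < (SELECT MAX(id) FROM patterns)
        ORDER BY build_id
        """
    )
    return rows.fetchall()
```

With the sample rows, build 100 is already up to date and is skipped, while builds 101 and 102 are returned along with the pattern ID from which scanning would resume.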

Other Features

  • Periodically fetches builds directly from CircleCI API to catch up on GitHub notifications that may have been dropped

Source attribution

Aho-Corasick implementation is from here: https://github.com/channable/alfred-margaret

Facebook has adopted a Code of Conduct that we expect project participants to adhere to. Please read the full text so that you can understand what actions will and will not be tolerated.

License

Dr. CI is MIT licensed.

dr-ci's Issues

Create XLA notification only if majority of other builds/tests are green

Right now the XLA warning is often a false positive, i.e. it is generated for failures that have nothing to do with XLA.
This can be mitigated by delaying the notification until we have signal from other builds/tests.
I.e. build failure notification should be generated only if more than 50% of other Linux (or just bionic) builds are green.
Similarly, test failure notification should be generated only if other bionic tests are green.
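The proposed rule could be sketched as follows. The function name `should_notify` and its parameters are hypothetical, and returning False when no other results are available is an assumption that matches the idea of delaying until there is signal.

```python
def should_notify(other_build_results, threshold=0.5):
    """Decide whether to emit a failure notification.

    other_build_results: iterable of booleans, True meaning the build is green.
    Returns True only if strictly more than `threshold` of them are green.
    """
    results = list(other_build_results)
    if not results:
        return False  # no signal from other builds yet, so keep waiting
    return sum(results) / len(results) > threshold
```

Setting `threshold` close to 1.0 would approximate the stricter rule suggested for test failures, where the other tests are expected to be green.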
