
Comments (7)

SanjayVas commented on July 2, 2024

The current theory is that the GHA runner has too few resources resulting in starvation.

Options:

  1. Set up our own self-hosted runner with more machine resources.
    • Something we've been considering anyway to give us more headroom to run tests with even more pods (e.g. Reporting server). Self-hosted runners are currently not recommended for public repositories due to security concerns, however.
  2. Switch to a lighter-weight Kubernetes implementation (e.g. K3s).
    • Would probably work and would speed up the CorrectnessTest, as K3s can apparently get a cluster running on GHA in seconds, versus minutes for KinD.[1]
    • Needs more setup, e.g. running a separate container registry. K3d can help make this easier, but may not work with rootless Docker.
  3. Get our code to work under constrained conditions.
    • Best result, but may not be feasible. May take significant investigation and design.

Footnotes

  1. https://github.com/marketplace/actions/setup-k3d-k3s

from cross-media-measurement.

SanjayVas commented on July 2, 2024

Runs for revisions in main branch, which previously passed:


SanjayVas commented on July 2, 2024

New theory: the free GitHub-hosted runners throttle long or resource-intensive runs. It's not certain whether this happens at the workflow level or at the job level.

Evidence: Our Bazel build cache entry in the GHA repo cache was evicted (see #809), so all of our builds were starting with an empty build cache. The cache is saved on merge to the default (main) branch, which hadn't happened in a while because the correctness test was blocking PRs from being merged. As a result, every build was taking a long time and consuming a lot of machine resources. Once the Bazel build cache was repopulated by #807, I attempted to re-enable the correctness test. It passed.

Implications: If we simply re-enable the correctness test, we could run into this problem again whenever we have a long build (e.g. if much of the build cache is invalidated by a low-level dependency change). If the throttling is at the job level, we could split the correctness test into a separate job in the same workflow and use GHA artifacts to share the Bazel build output. If the throttling is at the workflow level, this isn't an option.


SanjayVas commented on July 2, 2024

I asked this question on the GitHub Community forum. Hopefully we can get some answers about throttling behavior rather than guessing/experimenting: https://github.com/orgs/community/discussions/44143


SanjayVas commented on July 2, 2024

Splitting into multiple jobs is a no-go. It turns out that uploading an artifact is extremely slow, orders of magnitude slower than saving to cache: actions/upload-artifact#199


SanjayVas commented on July 2, 2024

Another theory, similar to #805 (comment): the non-incremental build causes the Bazel server to consume too much RAM, leaving too little for the correctness test to run.

Options: Configure Bazel to run with limited RAM, which may work, or simply shut down the Bazel server before running the correctness test.
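
A minimal sketch of combining both options, assuming the build and the correctness test run as separate commands in one CI step. The heap cap of 2g, the function name `build_then_test`, and the example targets are hypothetical and not taken from the repository; `--host_jvm_args` is a Bazel startup option and `bazel shutdown` stops the Bazel server so its memory is released before the test starts.

```python
import subprocess
from typing import Callable, List, Sequence


def _run(cmd: Sequence[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def build_then_test(
    build_targets: Sequence[str],
    test_cmd: Sequence[str],
    host_jvm_heap: str = "2g",  # assumed value; tune for the runner
    run: Callable[[List[str]], None] = _run,
) -> None:
    # Startup options go before the Bazel command name. The same options
    # are reused for every invocation so Bazel does not restart the server
    # just because the startup options changed.
    startup = [f"--host_jvm_args=-Xmx{host_jvm_heap}"]

    # 1. Build with a capped Bazel server heap.
    run(["bazel", *startup, "build", *build_targets])

    # 2. Stop the Bazel server to free its RAM for the test.
    run(["bazel", *startup, "shutdown"])

    # 3. Run the correctness test with the server no longer resident.
    run(list(test_cmd))
```

For example, `build_then_test(["//src/..."], ["./run_correctness_test.sh"])` would build, shut down the server, and then launch a (hypothetical) test script. Passing a custom `run` callable makes it possible to log or dry-run the commands instead of executing them.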


SanjayVas commented on July 2, 2024

A run where low-level dependencies were changed, invalidating most of the build cache, passes: https://github.com/world-federation-of-advertisers/cross-media-measurement/actions/runs/3971731433/jobs/6808944572

I think we're okay.

