Describe the bug The CorrectnessTest is consistently failing on G

CorrectnessTest fails on GitHub Actions about cross-media-measurement HOT 7 CLOSED

SanjayVas commented on July 2, 2024

CorrectnessTest fails on GitHub Actions

from cross-media-measurement.

Comments (7)

SanjayVas commented on July 2, 2024

The current theory is that the GHA runner has too few resources resulting in starvation.

Options:

- Set up our own self-hosted runner with more machine resources.
  - Something we've been considering anyway to give us more headroom to run tests with even more pods (e.g. Reporting server). Self-hosted runners are currently not recommended for public repositories due to security concerns, however.
- Switch to a lighter-weight Kubernetes implementation (e.g. K3s).
  - Would probably work and speed up the CorrectnessTest, as apparently K3s can get a cluster running on GHA in seconds vs. the minutes for KinD¹.
  - Needs more setup, e.g. running a separate container registry. K3d can help make this easier, but may not work with rootless Docker.
- Get our code to work under constrained conditions.
  - Best result, but may not be feasible. May take significant investigation and design.
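
Before committing to any of these options, it may help to collect evidence for the starvation theory. The following is a minimal sketch, assuming a Linux runner that exposes `/proc/meminfo`; the function names are hypothetical and not part of the repository. It reports how much RAM the kernel estimates is still available, and could be run as a workflow step right before the CorrectnessTest.

```python
"""Report memory headroom on a Linux runner (hypothetical diagnostic step)."""
from pathlib import Path
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into {field name: numeric value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(meminfo: Dict[str, int]) -> float:
    """Fraction of total RAM that the kernel estimates is available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def report(path: str = "/proc/meminfo") -> str:
    """Build a one-line summary suitable for a CI log."""
    info = parse_meminfo(Path(path).read_text())
    return (
        f"MemAvailable: {info['MemAvailable'] // 1024} MiB of "
        f"{info['MemTotal'] // 1024} MiB ({available_fraction(info):.0%})"
    )


if __name__ == "__main__":
    print(report())
```

Comparing this number between passing and failing runs would show whether the test is starting with noticeably less free memory when it fails.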

¹ https://github.com/marketplace/actions/setup-k3d-k3s

SanjayVas commented on July 2, 2024

Runs for revisions in main branch, which previously passed:

SanjayVas commented on July 2, 2024

New theory: the free GitHub hosted runners throttle long/intensive runs. Not certain if this is at the workflow level or at the job level.

Evidence: Our Bazel build cache entry in the GHA repo cache was evicted (see #809), so all of our builds were starting with an empty build cache. The cache is saved on merge to the default (main) branch, which hadn't happened in a while because the correctness test was blocking PRs from being merged. As a result, every build took a long time and consumed a lot of machine resources. Once the Bazel build cache was repopulated by #807, I tried re-enabling the correctness test, and it passed.

Implications: If we simply re-enable the correctness test, we could run into this problem again whenever we have a long build (e.g. if much of the build cache is invalidated by a low-level dependency change). If the throttling is at the job level, we could split the correctness test into a separate job in the same workflow and use GHA artifacts to share the Bazel build output. If the throttling is at the workflow level, this isn't an option.

SanjayVas commented on July 2, 2024

I asked this question on the GitHub Community forum. Hopefully we can get some answers about throttling behavior rather than guessing/experimenting: https://github.com/orgs/community/discussions/44143

SanjayVas commented on July 2, 2024

Splitting into multiple jobs is a no-go. It turns out that uploading an artifact is extremely slow, orders of magnitude slower than saving to cache: actions/upload-artifact#199

SanjayVas commented on July 2, 2024

Another theory, similar to #805 (comment): the non-incremental build causes the Bazel server to consume too much RAM, leaving too little for the correctness test to run.

Options: Configure Bazel to run with limited RAM, which may work, or just shut down the Bazel server before running the correctness test.
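
Both options can be sketched as a small step planner. This is only an illustration under assumptions: the build targets, the test command, and the `plan_steps`/`run_steps` helpers are hypothetical placeholders, not what the workflow actually runs. It relies on two real Bazel features: the `bazel shutdown` command, which stops the running Bazel server, and the `--host_jvm_args` startup option, which here passes a JVM `-Xmx` heap cap to the server.

```python
"""Plan CI steps that free Bazel server memory before a heavy test (sketch)."""
import subprocess
from typing import List, Optional, Sequence


def plan_steps(
    build_targets: Sequence[str],
    test_command: Sequence[str],
    bazel: str = "bazel",
    server_max_heap: Optional[str] = None,
) -> List[List[str]]:
    """Return the argv lists to run, in order.

    server_max_heap (e.g. "2g") caps the Bazel server's JVM heap via the
    --host_jvm_args startup option. The same startup options are passed to
    `shutdown` so the client does not restart the server just to stop it.
    """
    startup = [f"--host_jvm_args=-Xmx{server_max_heap}"] if server_max_heap else []
    return [
        [bazel, *startup, "build", *build_targets],
        [bazel, *startup, "shutdown"],  # release the server's RAM
        list(test_command),  # hypothetical command that runs the correctness test
    ]


def run_steps(steps: Sequence[Sequence[str]]) -> None:
    """Run each step, stopping at the first failure."""
    for argv in steps:
        subprocess.run(list(argv), check=True)


if __name__ == "__main__":
    # Print the plan instead of executing it; the targets are placeholders.
    for step in plan_steps(["//src/..."], ["./run_correctness_test.sh"], server_max_heap="2g"):
        print(" ".join(step))
```

Shutting the server down is the simpler of the two: it needs no tuning, at the cost of a server restart on the next Bazel invocation.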

SanjayVas commented on July 2, 2024

A run in which low-level dependencies changed, invalidating most of the build cache, passes: https://github.com/world-federation-of-advertisers/cross-media-measurement/actions/runs/3971731433/jobs/6808944572.

I think we're okay.
