Comments (7)
The current theory is that the GHA runner has too few resources, resulting in starvation.
Options:
- Set up our own self-hosted runner with more machine resources.
  - Something we've been considering anyway to give us more headroom to run tests with even more pods (e.g. Reporting server). Self-hosted runners are currently not recommended for public repositories due to security concerns, however.
- Switch to a lighter-weight Kubernetes implementation (e.g. K3s).
  - Would probably work and speed up the CorrectnessTest, as K3s can apparently get a cluster running on GHA in seconds vs. minutes for KinD.
  - Needs more setup, e.g. running a separate container registry. K3d can help make this easier, but may not work with rootless Docker.
- Get our code to work under constrained conditions.
  - Best result, but may not be feasible. May take significant investigation and design.
from cross-media-measurement.
Runs for revisions in the main branch, which previously passed:
- HEAD (6184fb9): https://github.com/world-federation-of-advertisers/cross-media-measurement/actions/runs/3887437721
- HEAD^ (b69f180): https://github.com/world-federation-of-advertisers/cross-media-measurement/actions/runs/3888096630
- HEAD^^ (810ca80): https://github.com/world-federation-of-advertisers/cross-media-measurement/actions/runs/3888502001
New theory: the free GitHub hosted runners throttle long/intensive runs. Not certain if this is at the workflow level or at the job level.
Evidence: Our Bazel build cache entry in the GHA repo cache was evicted (see #809), so all of our builds were starting with an empty build cache. The cache is saved on merge to the default (main) branch, which hadn't happened in a while because the correctness test was blocking PRs from being merged. As a result, all of our builds were taking a long time and consuming a lot of machine resources. Now that our Bazel build cache had been repopulated from #807, I attempted to re-enable the correctness test. It passed.
Implications: If we simply re-enable the correctness test, we could run into this problem again if we have a long build (e.g. if much of the build cache becomes invalidated due to a low-level dependency change). If the throttling is at the job level, we could split the correctness test to a separate job in the same workflow and use GHA artifacts to share the Bazel build output. If the throttling is at the workflow level, this isn't an option.
I asked this question on the GitHub Community forum. Hopefully we can get some answers about throttling behavior rather than guessing/experimenting: https://github.com/orgs/community/discussions/44143
Splitting into multiple jobs is a no-go. It turns out that uploading an artifact is extremely slow, orders of magnitude slower than saving to cache: actions/upload-artifact#199
Another theory, similar to #805 (comment): the non-incremental build causes the Bazel server to consume too much RAM, leaving too little for the correctness test to run.
Options: Configure Bazel to run with limited RAM, which may work, or just shut down the Bazel server before running the correctness test.
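Both options could be sketched roughly as below. This is only an illustration, not something from the workflow in this repo: the helper names, the 4096 MB cap, the job count, and the `//example:target` label are all assumptions. It relies on the `--local_ram_resources` and `--jobs` build flags and the `bazel shutdown` command; newer Bazel releases may prefer a different flag for the RAM limit.

```python
import shutil
import subprocess


def bazel_build_args(targets, ram_mb=4096, jobs=2):
    """Build a `bazel build` command line that caps local resource use.

    ram_mb and jobs are illustrative values, not tuned for any runner.
    """
    return [
        "bazel",
        "build",
        f"--local_ram_resources={ram_mb}",  # RAM (in MB) Bazel may assume for local actions
        f"--jobs={jobs}",  # limit the number of concurrent actions
        *targets,
    ]


def shutdown_bazel_server():
    """Stop the Bazel server so its memory is freed before the test runs.

    Returns False without doing anything if no `bazel` binary is on PATH.
    """
    if shutil.which("bazel") is None:
        return False
    subprocess.run(["bazel", "shutdown"], check=True)
    return True
```

A CI step might run `subprocess.run(bazel_build_args(["//example:target"]), check=True)` for the build, then call `shutdown_bazel_server()` right before starting the correctness test.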
A run where low-level dependencies were changed, invalidating most of the build cache, passes: https://github.com/world-federation-of-advertisers/cross-media-measurement/actions/runs/3971731433/jobs/6808944572
I think we're okay.