Comments (7)
I'm going to close this bug and open a separate one for the specific cause we're investigating. Although that cause is the root cause of this bug, it is somewhat tangential.
from cross-media-measurement.
Our theory for why we haven't seen this until now is that the QA environment is under much heavier load, i.e. more Measurements are flowing through it.
It turns out the Spanner Java client is already instrumented using OpenCensus. OpenCensus has been replaced by OpenTelemetry, but the Spanner Java client is still using the OpenCensus API. OpenTelemetry provides an OpenCensus Shim for this purpose, but I have not been able to get it working in our configuration where we are using the OpenTelemetry Java agent via the OpenTelemetry K8s Operator. I have an open question with OpenTelemetry about how to make these two work together.
It looks like we didn't start seeing this behavior until Jan 25, and in earnest starting Jan 28.
It does not appear to be related to the Spanner library update. One thing of note is that StreamMeasurements calls are consistently failing. Spanner Query Insights shows that the underlying queries fail almost all of the time, with average latency as high as 18s+ and an average of ~69k rows scanned. The only child table in this query that can grow without bound is DuchyMeasurementLogEntries. A row-count query on that table shows 15 computations with >5k log entries, the highest with >14k. Excluding these outliers, the average number of log entries per computation is 35.
@renjiezh is it possible we have an infinite loop/retry somewhere?
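The row-count analysis above can be sketched roughly as follows. This is only an illustration in Python, not the query that was actually run: the function name, the input shape (one computation ID per log entry row), and the `threshold` parameter are assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def summarize_log_entries(computation_ids, threshold=5000):
    """Count log entries per computation and split off the outliers.

    computation_ids: iterable with one computation ID per log entry row
    (hypothetical input shape for this sketch).
    Returns (outliers, average) where outliers maps computation ID to its
    entry count and average is the mean entry count of the remaining
    computations.
    """
    counts = Counter(computation_ids)
    outliers = {cid: n for cid, n in counts.items() if n > threshold}
    normal = [n for cid, n in counts.items() if cid not in outliers]
    average = mean(normal) if normal else 0.0
    return outliers, average


if __name__ == "__main__":
    # Synthetic data: one runaway computation and two ordinary ones.
    rows = ["runaway"] * 6000 + ["ok-1"] * 30 + ["ok-2"] * 40
    outliers, average = summarize_log_entries(rows)
    print(outliers, average)
```

Computing the average with the outliers excluded is what makes the runaway computations stand out; including them would inflate the average and hide the gap.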
The QA environment is fixed for now. I made 3 changes:
1. Applied the pending PR (world-federation-of-advertisers/common-jvm#179).
2. Deleted DuchyMeasurementLogEntries rows for measurements with >100 entries.
3. Replaced the MeasurementsByState index with one that includes ordered UpdateTime and ExternalComputationId columns.
(2) and (3) were the primary fixes. (2) is temporary, as we still need to address the root cause that resulted in this many log entries. (3) was to address the full table scan of the Measurements table for the StreamActiveMeasurements query.
From Spanner Query Insights, we can see that these queries had a near-100% failure rate, with average latency as high as 18s and as many as 155k rows scanned. After the changes, these fall to 8.44ms average latency and 1.9k rows scanned, respectively.
A PR will be sent for (3).
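To illustrate why change (3) reduces the number of rows scanned, here is a small Python model. It is a sketch under assumptions, not the Spanner schema or the real query: it assumes the streaming query filters by state and pages in (UpdateTime, ExternalComputationId) order, and the row layout, state names, and function names are all hypothetical.

```python
from bisect import bisect_right

ACTIVE = "ACTIVE"  # placeholder state name for this sketch


def make_rows(n):
    """Hypothetical rows: (state, update_time, external_computation_id)."""
    return [(ACTIVE if i % 10 == 0 else "DONE", i, i) for i in range(n)]


def stream_full_scan(rows, after, limit):
    """Without a suitable index, every row is examined, then sorted."""
    scanned = len(rows)
    matches = sorted(
        (r[1], r[2]) for r in rows if r[0] == ACTIVE and (r[1], r[2]) > after
    )
    return matches[:limit], scanned


def build_index(rows):
    """Model of an index ordered by (state, update_time, external id)."""
    return sorted(rows)


def stream_with_index(index, after, limit):
    """Seek to the first entry past the cursor, then read only one page."""
    start = bisect_right(index, (ACTIVE, after[0], after[1]))
    page = []
    scanned = 0
    for i in range(start, len(index)):
        state, update_time, ext_id = index[i]
        if state != ACTIVE or len(page) == limit:
            break
        scanned += 1
        page.append((update_time, ext_id))
    return page, scanned


if __name__ == "__main__":
    rows = make_rows(1000)
    print(stream_full_scan(rows, after=(500, 500), limit=5))
    print(stream_with_index(build_index(rows), after=(500, 500), limit=5))
```

Both functions return the same page, but the indexed version touches only the rows it returns, because the index order already matches the filter and the sort order of the query.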
We're seeing CreateDuchyMeasurementLogEntry calls with "Computation cE-p2RLw8tA at stage COMPLETE, attempt 10813" and FailComputationParticipant calls with "Unexpected stage or role: (COMPLETE, NON_AGGREGATOR)" from worker1 for the same computation. This appears to be related to the unbounded increase in log entries.
From investigation by @renjiezh:
These abnormal Computations are all in the COMPLETE stage. A COMPLETE Computation should have NULL for LockerOwner and LockerExpirationTime, but these do not. This is why they are picked up by the mill time and time again and produce an unbounded number of log entries. The next step is to figure out why they still have LockerOwner and LockerExpirationTime values.
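The failure mode described above can be modeled with a short Python sketch. This is not the Duchy's actual claim logic, which is not shown in this thread: the claim rule (any Computation whose lock has expired is claimable, regardless of stage), the class, the field names, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Computation:
    """Hypothetical in-memory stand-in for a Computations row."""
    global_id: str
    stage: str
    locker_owner: Optional[str] = None
    lock_expiration_time: Optional[float] = None
    attempt: int = 0


def is_claimable(c: Computation, now: float) -> bool:
    # Assumed rule: a row with an expired lock is claimable, whatever its stage.
    return c.lock_expiration_time is not None and c.lock_expiration_time <= now


def run_mill(c: Computation, rounds: int, lock_duration: float = 60.0) -> list:
    """Simulate polling rounds, each one happening after the lock expires."""
    log_entries = []
    now = 0.0
    for _ in range(rounds):
        now += lock_duration + 1.0
        if not is_claimable(c, now):
            continue
        # The mill claims the work item and bumps the attempt counter.
        c.locker_owner = "worker1"
        c.lock_expiration_time = now + lock_duration
        c.attempt += 1
        log_entries.append(
            f"Computation {c.global_id} at stage {c.stage}, attempt {c.attempt}"
        )
        # Nothing can be done for a COMPLETE computation, so the lock is
        # left to expire and the row becomes claimable again.
    return log_entries


if __name__ == "__main__":
    stale = Computation("c1", "COMPLETE", "worker1", lock_expiration_time=0.0)
    clean = Computation("c2", "COMPLETE")  # lock fields are NULL
    print(len(run_mill(stale, rounds=100)), len(run_mill(clean, rounds=100)))
```

Under these assumptions, the Computation with leftover lock values produces one log entry per polling round indefinitely, while the one with NULL lock fields is never claimed.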