Comments (7)

SanjayVas commented on July 20, 2024

I'm going to close this bug and open a separate one for the specific cause we're investigating. Although it is the root cause of this bug, it's somewhat tangential.

from cross-media-measurement.

SanjayVas commented on July 20, 2024

The theory of why we haven't experienced this until now is that the QA environment is now under much heavier load, i.e. there are more Measurements flowing through it.

SanjayVas commented on July 20, 2024

It turns out the Spanner Java client is already instrumented using OpenCensus. OpenCensus has been replaced by OpenTelemetry, but the Spanner Java client is still using the OpenCensus API. OpenTelemetry provides an OpenCensus Shim for this purpose, but I have not been able to get it working in our configuration where we are using the OpenTelemetry Java agent via the OpenTelemetry K8s Operator. I have an open question with OpenTelemetry about how to make these two work together.

SanjayVas commented on July 20, 2024

It looks like we didn't start seeing this behavior until Jan 25, and in earnest starting Jan 28.

SanjayVas commented on July 20, 2024

It does not appear to be related to the Spanner library update. Notably, StreamMeasurements calls are failing consistently. Spanner Query Insights shows that the underlying queries fail almost all of the time, with average latency as high as 18s+ and ~69k rows scanned on average. The only child table in this query that can grow without bound is DuchyMeasurementLogEntries. A row-count query on that table shows 15 computations with >5k log entries each, the highest with >14k. Excluding these outliers, the average number of log entries per computation is 35.
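The row-count analysis can be sketched roughly as follows. This is only an illustration on synthetic data, not the query that was actually run against Spanner; the function name `summarize_log_entries` and the sample IDs are hypothetical, and the 5k threshold is taken from the numbers above.

```python
from collections import Counter
from statistics import mean

OUTLIER_THRESHOLD = 5_000  # ">5k log entries", as quoted above


def summarize_log_entries(computation_ids, threshold=OUTLIER_THRESHOLD):
    """Counts log entries per computation and separates out the outliers.

    `computation_ids` holds one item per log-entry row, naming the
    computation that the row belongs to.
    """
    counts = Counter(computation_ids)
    outliers = {cid: n for cid, n in counts.items() if n > threshold}
    normal = [n for cid, n in counts.items() if cid not in outliers]
    return {
        "outliers": outliers,
        "max_entries": max(counts.values(), default=0),
        "mean_excluding_outliers": mean(normal) if normal else 0.0,
    }


# Synthetic data: two healthy computations and one runaway computation.
rows = ["c1"] * 30 + ["c2"] * 40 + ["c3"] * 6_000
summary = summarize_log_entries(rows)
print(summary)
```

Reporting the mean with outliers excluded matters here because a handful of runaway computations would otherwise dominate the average and hide what a healthy computation looks like.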

@renjiezh is it possible we have an infinite loop/retry somewhere?

SanjayVas commented on July 20, 2024

The QA environment is fixed for now. I applied 3 changes:

  1. Applied the pending PR (world-federation-of-advertisers/common-jvm#179)
  2. Deleted DuchyMeasurementLogEntries rows for measurements with >100 entries.
  3. Replaced the MeasurementsByState index with one which includes ordered UpdateTime and ExternalComputationId columns.

(2) and (3) were the primary fixes. (2) is temporary, as we still need to address the root cause that resulted in this many log entries. (3) was to address the full table scan of the Measurements table for the StreamActiveMeasurements query.
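To illustrate why (3) helps, here is a toy model of a full scan versus a seek on an ordered index. It assumes the new index is keyed on (State, UpdateTime, ExternalComputationId) in that order, which is an inference from the description above rather than the actual DDL; the rows and function names are made up, and Spanner's real execution is considerably more involved.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

# Hypothetical rows: (state, update_time, external_computation_id).
measurements = [
    ("SUCCEEDED", 5, 101),
    ("PENDING", 7, 102),
    ("SUCCEEDED", 9, 103),
    ("PENDING", 3, 104),
    ("FAILED", 4, 105),
    ("PENDING", 8, 106),
]


def full_scan(rows, state, after_time):
    """Examines every row, then filters and sorts the matches."""
    scanned = len(rows)
    hits = sorted(r for r in rows if r[0] == state and r[1] > after_time)
    return hits, scanned


# An index keyed on (state, update_time, id) keeps its entries sorted.
index = sorted(measurements)


def index_seek(idx, state, after_time):
    """Seeks to the first matching entry and reads one contiguous range."""
    lo = bisect_right(idx, (state, after_time, INF))
    hi = bisect_left(idx, (state, INF, INF))
    return idx[lo:hi], hi - lo


scan_hits, scan_cost = full_scan(measurements, "PENDING", 3)
seek_hits, seek_cost = index_seek(index, "PENDING", 3)
print(scan_cost, seek_cost)
```

Both functions return the same rows in the same order, but the seek touches only the matching range, so its cost tracks the result size instead of the table size.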

Spanner Query Insights shows that these queries had a near-100% failure rate, with average latency as high as 18s and as many as 155k rows scanned. After the changes, these fall to 8.44ms and 1.9k rows scanned, respectively.

A PR will be sent for (3).

SanjayVas commented on July 20, 2024

We're seeing CreateDuchyMeasurementLogEntry calls with "Computation cE-p2RLw8tA at stage COMPLETE, attempt 10813" and FailComputationParticipant calls with "Unexpected stage or role: (COMPLETE, NON_AGGREGATOR)" from worker1 for the same computation. This appears to be related to the unbounded increase in log entries.

From investigation by @renjiezh:

These abnormal Computations are all in the COMPLETE stage. A COMPLETE Computation should have NULL for LockerOwner and LockerExpirationTime, but these do not. This is why the mill picks them up time and time again, producing an unbounded number of log entries. The next step is to figure out why they still have LockerOwner and LockerExpirationTime values.
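The failure mode described above can be modeled with a small simulation. This is a hypothetical sketch, not the mill's actual code: it assumes work is claimed purely on the lock columns (a non-NULL, expired lock makes a row claimable) and that nothing clears those columns for a COMPLETE computation. All class and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Computation:
    computation_id: str
    stage: str
    locker_owner: Optional[str] = None
    locker_expiration_time: Optional[float] = None
    attempt: int = 0


def claimable(c: Computation, now: float) -> bool:
    """Assumed claim predicate: looks only at the lock, never at the stage."""
    return c.locker_expiration_time is not None and c.locker_expiration_time <= now


def run_mill(computations: List[Computation], polls: int, lock_ttl: float = 1.0) -> List[str]:
    """Polls for claimable work and records one log entry per claim."""
    log: List[str] = []
    now = 0.0
    for _ in range(polls):
        now += lock_ttl + 1.0  # the previous lock has expired by each poll
        for c in computations:
            if not claimable(c, now):
                continue
            c.locker_owner = "worker1"
            c.locker_expiration_time = now + lock_ttl
            c.attempt += 1
            log.append(
                f"Computation {c.computation_id} at stage {c.stage}, attempt {c.attempt}"
            )
            # No work exists for a COMPLETE computation, and the lock fields
            # are never reset to None, so the next poll claims it again.
    return log


healthy = Computation("healthy", "COMPLETE")
stale = Computation("stale", "COMPLETE", "worker1", 0.0)
entries = run_mill([healthy, stale], polls=5)
print(len(entries), entries[-1])
```

In this model the computation with NULL lock fields is never claimed, while the one with leftover lock values gains one attempt and one log entry per poll, with no upper bound.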
