Comments (7)

udoprog commented on May 27, 2024

I'm thinking the rate limiting of metadata ingestion is kicking in.

Our current mechanism assumes that you will 'keep writing' the same metadata over and over and will drop any writes that go over a certain threshold.

The option you are looking for is writesPerSecond in the metadata configuration. Make sure this value is well above your burst threshold if you want all writes to go through to Elasticsearch immediately. You might run into ES issues at those rates, so something more clever in the consumer might be needed. Please report back on what happens when you try this.
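
For reference, it would sit roughly like this in heroic.yaml; treat this as a sketch, since the surrounding keys depend on your setup and only writesPerSecond is the option being discussed:

```yaml
# Sketch of an Elasticsearch metadata backend; surrounding keys are assumptions.
metadata:
  backends:
    - type: elasticsearch
      # Raise this well above your burst rate so the rate limiter
      # does not silently drop metadata writes.
      writesPerSecond: 100000
```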

jcabmora commented on May 27, 2024

Thanks @udoprog for the response! I set writesPerSecond to a very high number, added a log entry in the RateLimitedCache code to record when messages would be dropped, and confirmed that we were not dropping any messages in Heroic. I also set writesPerSecond to 0 so it would use the DisabledRateLimitedCache, and we saw the same behavior.

I think the problem lies in Elasticsearch (I saw the reference to the Jepsen blog entry). I am thinking about implementing something on our consumer side, as you suggested, to limit the rate at which metrics are sent to ingestion. However, I was wondering whether adding the metrics to a buffer, instead of dropping them when the writesPerSecond policy kicks in, is a feature you would consider valuable and possibly accept. If so, I'd be happy to discuss it with you and contribute it.

udoprog commented on May 27, 2024

A local cache could be useful, but let me also explain a bit about how I think about metadata writes.

If our goal is to make sure that every write matters, durability becomes an issue. Writes that reside in a buffer are not durable; restarts (or node crashes) could cause large batches of writes to disappear. For this reason it might be better to keep the writes on a durable queue (like Kafka or Pub/Sub) and process them as fast as we can: instead of dropping a write, we simply don't acknowledge it until it has been processed. For the write queue I would be biased towards Pub/Sub, because it doesn't have the same ordering guarantees as Kafka and allows acknowledging (or deferring) individual messages rather than offsets. Maybe there is a way to do this well with Kafka, but I'm not aware of one at the moment.
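
To illustrate the acknowledgement idea, here is a rough sketch using the Google Cloud Pub/Sub Java client (this is not something Heroic does today, and the writeMetadata helper is a placeholder): ack only once the write has been committed, otherwise nack so the message is redelivered.

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class MetadataWriteSubscriber {
    public static void main(String[] args) {
        // Project and subscription names are placeholders.
        ProjectSubscriptionName subscription =
            ProjectSubscriptionName.of("my-project", "metadata-writes");

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            try {
                // Stand-in for the actual metadata write to Elasticsearch.
                writeMetadata(message.getData().toStringUtf8());
                // Acknowledge only after the write is committed, so a crash
                // before this point causes redelivery instead of data loss.
                consumer.ack();
            } catch (Exception e) {
                // Defer the message; Pub/Sub will redeliver it later.
                consumer.nack();
            }
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
    }

    private static void writeMetadata(String payload) {
        // Placeholder for the real ingestion call.
    }
}
```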

Also, I'd consider inverting the solution a little bit and having the metric agent (for us it is ffwd) perform more de-duplication, only emitting one 'metadata update message' every N minutes. This way we'd reduce the volume of writes against Elasticsearch quite significantly, making the system more responsive. It might even be possible to only issue one write every index period (if the agent is aware of it), but the agent would have to be made a bit smarter than it is today to accomplish that.
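
As a rough sketch of that agent-side de-duplication (names here are made up for illustration), a Guava cache keyed on the series identity with entries expiring after N minutes would suppress repeated updates:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

// Illustrative only: lets the agent emit at most one metadata update
// per series every N minutes.
public class MetadataDeduplicator {
    private final Cache<String, Boolean> recentlySent;

    public MetadataDeduplicator(long windowMinutes) {
        this.recentlySent = CacheBuilder.newBuilder()
            .expireAfterWrite(windowMinutes, TimeUnit.MINUTES)
            .maximumSize(1_000_000)
            .build();
    }

    /** Returns true if the metadata update for this series should be emitted now. */
    public boolean shouldEmit(String seriesKey) {
        // putIfAbsent is atomic, so concurrent callers agree on a single
        // emitter per window; the entry expires after the configured interval.
        return recentlySent.asMap().putIfAbsent(seriesKey, Boolean.TRUE) == null;
    }
}
```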

I hope this helps you understand how I think about this. We'd gladly accept a feature that makes your life easier, like a local buffer that de-duplicates and tempers the write rate a bit. I think it would be a more sophisticated offering than what we currently have and a valuable addition.

jcabmora commented on May 27, 2024

I completely agree that a buffer will not guarantee protection against data loss in case of a crash.

I am going to work on the buffering option, and I'll keep your vision in mind so that these changes don't become a burden if they are later deemed unnecessary. I'd also be interested in helping if you later decide to move the metadata writes onto the queue mechanism.

jcabmora commented on May 27, 2024

Hello!
I ended up looking at the Elasticsearch BulkProcessor API. I implemented it, and besides helping with throughput, it has options to configure retries. This fixed the problem we had (though we are still adding tests).
Looking at the docs, it seems that at some point there was an intention to enable bulk processing (the concurrentBulkRequests, bulkActions and flushInterval parameters are mentioned in the ES configuration page). I was wondering if you tried this and decided not to implement it for some reason.
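Roughly what I ended up with looks like the sketch below (parameter values are just examples, and the listener is where the error reporting and metrics would go); it uses the BulkProcessor builder with an exponential backoff policy for retries:

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class BulkMetadataWriter {
    // Sketch only; values are examples, not recommendations.
    static BulkProcessor buildProcessor(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                // e.g. increment an "in-flight bulk actions" metric here
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    // report partial failures so they can be instrumented and alerted on
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // the whole bulk request failed; report it
            }
        })
            .setBulkActions(1000)                                // flush after this many actions
            .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))  // ...or this many bytes
            .setFlushInterval(TimeValue.timeValueSeconds(5))     // ...or this much time
            .setConcurrentRequests(1)                            // bulks allowed in flight
            .setBackoffPolicy(BackoffPolicy.exponentialBackoff(  // retry rejected bulks
                TimeValue.timeValueMillis(100), 3))
            .build();
    }
}
```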
If you are interested, I'll be more than happy to submit a PR.
Have a good weekend, hopefully you guys get some good weather! I'm in Northern California, and we are getting hammered with rain this weekend.

udoprog commented on May 27, 2024

Hey, that sounds cool. Yeah, we had a few bouts with the bulk processor.

I think the reasons we are currently not using it are the following:

  1. Error reporting. We simply were not seeing when we were getting write errors, so we couldn't instrument it and therefore couldn't alert on it.
  2. Back-pressure. Even when Elasticsearch became slow, the bulk processor kept accepting writes quickly, masking the underlying problems. This, coupled with not reporting errors properly, meant we were mostly losing writes and could not explain to our users why it sometimes took days for their time series to become visible.

Back-pressure is typically implemented by 'holding off' on resolving the future until we have acknowledgement that the write associated with it has been committed. When the backend is slow, this causes the CoreIngestionManager to run out of leases and writing processes to block.
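
As a very rough sketch of that lease mechanism (not the actual CoreIngestionManager code, just the shape of the idea): acquire a lease before writing and release it only when the write's future completes, so a slow Elasticsearch eventually blocks new writers.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative back-pressure sketch: a fixed number of leases caps the
// writes in flight, and a lease is only returned once the downstream
// write has actually completed.
public class LeasedIngestion {
    private final Semaphore leases;

    public LeasedIngestion(int maxWritesInFlight) {
        this.leases = new Semaphore(maxWritesInFlight);
    }

    public CompletableFuture<Void> write(Supplier<CompletableFuture<Void>> writeOp)
            throws InterruptedException {
        // Blocks the caller when all leases are taken, i.e. when the
        // backend is slow to acknowledge committed writes.
        leases.acquire();
        return writeOp.get().whenComplete((ok, err) -> leases.release());
    }
}
```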

This could have been us messing up somewhere, or maybe a newer version of the BulkProcessor behaves better. Overall we'd be happy to accept the patch; just make sure it is an optional feature for now, activated through configuration.

We're in Sweden, and at this time of the year it's a lot of darkness and snow.
I hope the weather clears up for you soon, best of luck!

hexedpackets commented on May 27, 2024

Closing due to age, feel free to re-open if this is still an issue.
