I wanted to open a conversation about potentially reducing the amount of logs been wri

Emitting a metrics for failing plugins (signalfx-agent issue, closed)

MovieStoreGuy commented on August 13, 2024

Emitting a metrics for failing plugins

from signalfx-agent.

Comments (4)

keitwb commented on August 13, 2024

Sure, having metrics about the agent and its monitors themselves is a good idea; I've been thinking about doing it for a while. Events would provide an exact error message, but metrics are a lot more scalable in our backend, and events aren't really intended to be used for potentially high-frequency error logs anyway in our current product offering.

Therefore, having metrics for monitors that track failures to connect to or read from a service being monitored is the best thing to do. Which monitors/collectd plugins specifically are you dealing with the most? I can try adding some metrics for a few of them and see how it works for you.

I think maintaining the current logging output though is pretty essential for basic debugging of the agent and to know more context about the errors directly from the agent. Not sure how much control you have over log retention but I can't imagine there is much value in keeping agent logs for more than a few days anyway unless you are trying to do some kind of analytics on them. I know you were talking about filtering in #719 and that is probably your best bet for this kind of thing. You can filter out the most common stuff (e.g. a connection refused/timeout error) but still let through rarer error conditions (e.g. a 500 response from a diagnostic endpoint on a service) so that they show up in logs.
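As a rough illustration of the filtering idea above, here is a minimal Go sketch that drops the most common connection errors but lets rarer ones through. The `noisyPatterns` list and the `shouldLog` helper are hypothetical names invented for this example; they are not part of the agent, and real filtering would more likely live in whatever ships the logs.

```go
package main

import (
	"fmt"
	"strings"
)

// noisyPatterns lists substrings of errors treated as routine noise.
// These patterns are illustrative assumptions, not taken from the agent.
var noisyPatterns = []string{
	"connection refused",
	"timeout",
}

// shouldLog reports whether a log message should be kept, i.e. it does
// not contain any of the noisy patterns (case-insensitive match).
func shouldLog(msg string) bool {
	lower := strings.ToLower(msg)
	for _, p := range noisyPatterns {
		if strings.Contains(lower, p) {
			return false
		}
	}
	return true
}

func main() {
	msgs := []string{
		"dial tcp 10.0.0.5:8080: connect: connection refused",
		"unexpected status 500 from diagnostic endpoint",
	}
	for _, m := range msgs {
		fmt.Printf("keep=%t %s\n", shouldLog(m), m)
	}
}
```

Substring matching is crude, so a pattern list like this would need tuning against real log output before being trusted.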

MovieStoreGuy commented on August 13, 2024

Yeah, I want to use this to reduce the amount of data ending up in our logging system.
I was having a discussion elsewhere about having detectors for events, but I gathered that is still some time away.

So the monitors I have mainly used that have been creating a lot of noise are:

expvar
collectd/genericjmx
We also get a decent number of EOF or "timed out awaiting headers" errors. We don't really care about those per se, and we definitely don't want service owners to freak out that there is an issue when it is just a NOOP.

I was thinking about the dimensions that would be useful and the ones I have are:

monitor
monitor_version
agent_version
That way we can see whether bumping the agent version increases or decreases issues, since we run the smart agent across all of our instances.
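To make the proposed dimensions concrete, here is a minimal Go sketch of a failure counter keyed by monitor, monitor_version and agent_version. The `failureKey` and `failureCounter` types are hypothetical and the version strings are placeholders; this is not how the agent emits metrics, only one way the data could be shaped.

```go
package main

import (
	"fmt"
	"sync"
)

// failureKey holds the three dimensions suggested in the thread.
type failureKey struct {
	Monitor        string
	MonitorVersion string
	AgentVersion   string
}

// failureCounter is a hypothetical in-memory cumulative counter with one
// count per unique combination of dimensions.
type failureCounter struct {
	mu     sync.Mutex
	counts map[failureKey]int64
}

func newFailureCounter() *failureCounter {
	return &failureCounter{counts: make(map[failureKey]int64)}
}

// Inc records one failure for the given dimensions.
func (c *failureCounter) Inc(k failureKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[k]++
}

// Get returns the number of failures recorded for the given dimensions.
func (c *failureCounter) Get(k failureKey) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[k]
}

func main() {
	c := newFailureCounter()
	// Placeholder values, chosen only for illustration.
	k := failureKey{Monitor: "expvar", MonitorVersion: "1.0.0", AgentVersion: "4.0.0"}
	c.Inc(k)
	c.Inc(k)
	fmt.Println(k.Monitor, c.Get(k))
}
```

Because the agent version is part of the key, comparing counts before and after an upgrade becomes a matter of grouping by that one dimension.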

We log an insane amount each day, so anything we can do to reduce that is always appreciated by those who maintain our logging system. Also, while we are on the topic of logging. How hard of a change would it be to switch logging format as a config option?

keitwb commented on August 13, 2024

Ok, I'll look at expvar since it would be the easiest to generate metrics for starting out.

Would it make any difference if I changed connection errors to monitored services so that they are logged at WARN (or even INFO) level instead of ERROR level? I don't think it is feasible to totally stop logging those types of things, even with metrics, because of how convenient it is to be able to spot misconfiguration and/or see the exact details of the error (e.g. connection timeout, response read timeout, connection refused) by looking at the log output.

Also, while we are on the topic of logging. How hard of a change would it be to switch logging format as a config option?

It is pretty easy since we use logrus for logging. We already use their structured log mechanism quite a bit so you would get some amount of metadata out of the box (e.g. monitorType is a common field we use). Would you want JSON output?

MovieStoreGuy commented on August 13, 2024

Ok, I'll look at expvar since it would be the easiest to generate metrics for starting out.
Sweet, keep me in the loop regarding this because I would love to know.

Yeah, it would be sick if you could drop the log level for failures to collect metrics. Currently we have the level set to error.

Is there a standard you follow for which things are logged at which level?
Or for how many characters should be part of a log line?
I did have a look at how to configure the logger to emit logs as JSON rather than text, but I got distracted doing other things.
