Comments (8)
Haven't tried consul-alerts at this scale before. How many instances of consul-alerts are running? I think the slowdown might be caused by the sheer volume of checks being processed. There are no metrics at the moment, but I'm keen on finding out how to scale it to such a size.
from consul-alerts.
After messing around with consul-alerts, it seems like https://github.com/AcalephStorage/consul-alerts/blob/master/consul/client.go#L223-L257 would severely limit performance. In a smaller datacenter with a low four-digit number of checks, the loop appears to take a minute. This causes https://github.com/AcalephStorage/consul-alerts/blob/master/check-handler.go#L57-L61 to take longer than expected and generally slows everything down.
from consul-alerts.
@macb, I have a similar situation: I'm running consul-alerts against ~70 servers with ~5 checks each. https://github.com/AcalephStorage/consul-alerts/blob/master/consul/client.go#L223-L257 is taking ~1 minute, so it takes ~7 minutes for https://github.com/AcalephStorage/consul-alerts/blob/master/check-handler.go#L57-L61 to run. Any ideas to improve this? Thanks.
from consul-alerts.
I am using this for about 100 servers in AWS and have definitely noticed some inefficiencies and high CPU demands. I am being pretty ambitious and have servers with about 20 checks running. I have to use compute-optimized c4 instances to be able to run at this scale, and I assume the strain is going to increase. However, I think this is a great project, and I will try to dig into the code and help where I can as well.
from consul-alerts.
There are a few factors affecting performance.
- Consul-Alerts depends on the "watcher" feature of Consul, which watches the health checks for changes. Any status change triggers Consul to send the entire list of health checks from all nodes, and this full list is what consul-alerts processes (e.g. 70 servers * 5 checks = 350 checks to examine every time a change is detected).
  - Still thinking of a way to get just the changed health check instead of all of them.
- The code processes the checks and sends notifications sequentially.
  - Goroutines might speed things up.
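To illustrate the second point, here is a minimal sketch of handling checks concurrently instead of one after another. It is written in Python with a thread pool as a rough analogue of goroutines; it is not code from consul-alerts. `process_check` and the shape of the check records are hypothetical stand-ins for whatever per-check work (KV lookups, notifications) the real handler does.

```python
from concurrent.futures import ThreadPoolExecutor


def process_check(check):
    """Hypothetical stand-in for the per-check work (KV lookups, notifications)."""
    return (check["node"], check["check_id"], check["status"])


def process_all(checks, max_workers=8):
    """Process checks concurrently; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_check, checks))


# 70 servers * 5 checks, matching the example figures in this comment.
checks = [
    {"node": f"node{n}", "check_id": f"check{c}", "status": "passing"}
    for n in range(70)
    for c in range(5)
]
results = process_all(checks)
print(len(results))  # 350
```

A pool only helps if the per-check work is dominated by waiting on I/O such as HTTP calls to Consul; whether that is the case here would need measuring.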
from consul-alerts.
Any updates on this? I have a large deployment that's becoming Consul aware, and I'd love to use consul-alerts for notifications. Expected stats: ~200 servers, ~700 services, ~3000 total health checks.
from consul-alerts.
I feel like the real fix for this needs to come from upstream in Consul. There is a ticket, which I can't find right now, to have Consul return only the changed entries in watches. That would be the ideal fix, perhaps with an occasional full comparison run as a sanity check.
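Until something like that exists upstream, a client-side approximation is to keep the previous snapshot and diff it against each new one, so that only the entries that actually changed get processed further. A minimal sketch in Python, assuming each snapshot is a mapping from a hypothetical `(node, check_id)` key to a status string (this is not the data structure consul-alerts uses):

```python
def changed_checks(previous, current):
    """Compare two snapshots mapping (node, check_id) -> status.

    Returns (changed, removed): entries that are new or whose status
    differs, and keys that disappeared from the current snapshot.
    """
    changed = {k: v for k, v in current.items() if previous.get(k) != v}
    removed = [k for k in previous if k not in current]
    return changed, removed


previous = {
    ("node1", "disk"): "passing",
    ("node1", "mem"): "passing",
    ("node2", "disk"): "passing",
}
current = {
    ("node1", "disk"): "critical",  # status changed
    ("node1", "mem"): "passing",    # unchanged, so skipped
    ("node3", "disk"): "passing",   # newly registered
}
changed, removed = changed_checks(previous, current)
print(changed)  # {('node1', 'disk'): 'critical', ('node3', 'disk'): 'passing'}
print(removed)  # [('node2', 'disk')]
```

The full list still has to be received and compared, so this only saves the downstream work per check, not the transfer itself.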
from consul-alerts.
I noticed that consul-alerts takes considerable resources on our server, and Consul itself gets very busy with writes when we have consul-alerts running, so I traced it for a minute in our test environment and made these observations:
The main issue seems to be all the writes it produces.
It seems to loop over every check and rewrite the content each time, even though the content likely didn't change, e.g.:
count  URL prefix
  680  PUT /v1/kv/consul-alerts/checks
count  URL prefix
  136  PUT /v1/kv/consul-alerts/checks/ecs-1269316829
  120  PUT /v1/kv/consul-alerts/checks/ecs-205916921
  104  PUT /v1/kv/consul-alerts/checks/ecs-2743417484
  104  PUT /v1/kv/consul-alerts/checks/ecs-3237410996
   72  PUT /v1/kv/consul-alerts/checks/node1
   72  PUT /v1/kv/consul-alerts/checks/node2
   72  PUT /v1/kv/consul-alerts/checks/node3
At a minimum, it is already reading these contents on every loop, so it should know whether the content has changed. Doing these writes each time seems to cause the largest overhead on Consul.
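A minimal sketch of the "only write when the content changed" idea, in Python. `CountingKV` is a hypothetical in-memory stand-in for the Consul KV store that just counts writes, and the key names are made up for the example; none of this is taken from the consul-alerts code.

```python
class CountingKV:
    """Hypothetical in-memory stand-in for a KV store; counts writes."""

    def __init__(self):
        self.data = {}
        self.writes = 0

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value
        self.writes += 1


def put_if_changed(kv, key, value):
    """Write only when the stored value differs; return True if written."""
    if kv.get(key) == value:
        return False
    kv.put(key, value)
    return True


kv = CountingKV()
statuses = {"node1": "passing", "node2": "passing", "node3": "passing", "ecs-1": "passing"}

# Two passes with identical content: only the first pass writes.
for _ in range(2):
    for name, status in statuses.items():
        put_if_changed(kv, "consul-alerts/checks/" + name, status)

# Third pass after a single status change: one more write.
statuses["node2"] = "critical"
for name, status in statuses.items():
    put_if_changed(kv, "consul-alerts/checks/" + name, status)

print(kv.writes)  # 5 writes instead of 12
```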
Some other observations from this capture:
688 calls to the reminders prefix:
GET /v1/kv/consul-alerts/reminders
Given that we have nothing under that prefix, a single /v1/kv/consul-alerts/reminders?recurse request would cover it in one shot.
consul-alerts makes 1375 calls to checks:
GET /v1/kv/consul-alerts/checks
This seems to happen in more than one place, and it could make sense to fetch it in one shot.
There are 2674 calls into config:
GET /v1/kv/consul-alerts/config/checks/blacklist
Again, one shot would be better, as we have nothing in the blacklist.
A couple of these changes could greatly reduce the overhead of running consul-alerts.
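For reference, a recursive KV read (`GET /v1/kv/<prefix>?recurse`) returns a JSON array of entries whose `Value` fields are base64-encoded. The Python sketch below decodes such a response body into a dictionary so that later lookups can be served from memory instead of one GET per key. The sample keys and values are invented for the example, and treating an empty body as "no keys" is an assumption about how the caller handles a 404 for an empty prefix.

```python
import base64
import json


def decode_recurse(body):
    """Turn the JSON body of a recursive KV read into {key: bytes}."""
    if not body:
        # Assumption: the caller passes an empty body when the prefix has no keys.
        return {}
    entries = json.loads(body)
    return {
        e["Key"]: base64.b64decode(e["Value"]) if e["Value"] is not None else b""
        for e in entries
    }


# Hand-built sample response; key names and values are hypothetical.
sample_body = json.dumps([
    {"Key": "consul-alerts/checks/node1/example",
     "Value": base64.b64encode(b"passing").decode()},
    {"Key": "consul-alerts/checks/node2/example",
     "Value": base64.b64encode(b"critical").decode()},
    {"Key": "consul-alerts/checks/node3/empty", "Value": None},
])

cache = decode_recurse(sample_body)
print(cache["consul-alerts/checks/node2/example"])  # b'critical'
```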
from consul-alerts.