worker's Introduction

Moira 2.0

Moira is a real-time alerting tool, based on Graphite or Prometheus/VictoriaMetrics metrics.

Installation

Docker Compose is the easiest way to try Moira:

git clone https://github.com/moira-alert/docker-compose.git
cd docker-compose
docker-compose pull
docker-compose up

See more on our documentation page.

Feed data in Graphite format to localhost:2003:

echo "local.random.diceroll 4 `date +%s`" | nc localhost 2003
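The same datapoint can be sent from a script. The sketch below assumes the Graphite plaintext protocol over TCP (one "path value timestamp" line per datapoint), which is what the nc command above relies on; the helper names graphite_line and send_datapoint are made up for this example.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one datapoint as a Graphite plaintext line: path, value, Unix timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_datapoint(path, value, host="localhost", port=2003):
    """Send a single datapoint over TCP, like the nc command does."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))
```

With the Docker Compose setup running, send_datapoint("local.random.diceroll", 4) should have the same effect as the echo command.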

Configure triggers at localhost:8080 using your browser.

Other installation methods are available; see the documentation.

Contribution

See our contribution guidelines.

Getting Started

See our user guide that is based on a number of real-life scenarios, from simple and universal to complicated and specific.

What is in the other repositories

The code in this repository is the backend part of the Moira monitoring application.

Contact us

If you have any questions, you can ask us on Telegram.

Thanks

SKB Kontur

Moira was originally developed and is supported by SKB Kontur, a B2G company based in Ekaterinburg, Russia. We express gratitude to our company for encouraging us to open-source Moira and for giving back to the community that created Graphite and many other useful DevOps tools.

worker's People

Contributors

alexakulov, beevee, gmlexx, le9i0nx, melnikk, pliner, slach

worker's Issues

International characters in trigger names are broken

When I try to create a trigger with Russian characters in its name, I get this error in api.log:

2016-09-01T16:05:04+0300 [moira.logs#error] Error in delayed decorator wrapped function: 'ascii' codec can't encode characters in position 51-54: ordinal not in range(128)
2016-09-01T16:05:04+0300 [twisted.python.log#info] - 0.012 "PUT /api/trigger/25b7beed-99ef-4fa1-b40a-d5a1b5c775a9 HTTP/1.0" 500 470
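The error message is what Python reports when text containing non-ASCII characters is encoded with the ASCII codec. The sketch below reproduces that class of failure and shows the usual remedy of encoding explicitly as UTF-8. It is an illustration in Python 3 only: the report does not show where in Moira the implicit ASCII encoding happens, and encode_trigger_name is a hypothetical helper.

```python
def encode_trigger_name(name):
    """Encode a trigger name explicitly as UTF-8 instead of relying on a default codec."""
    return name.encode("utf-8")


name = "Загрузка CPU"  # sample trigger name with Russian characters

# Encoding with the ASCII codec fails for any character outside range(128).
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit UTF-8 encoding round-trips the name without loss.
encoded = encode_trigger_name(name)
```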

CONFIG_PATH from args is ignored when the worker starts

When moira-checker starts a worker (worker.py), the config file passed in the arguments is ignored. In the example below, moira-checker runs with -c /usr/local/moisha/etc/config.yml, but the worker is started with -c /etc/moira/config.yml.

Example:
# ps -aux | grep check
root 8178 7.0 0.9 127620 28716 1 S+ 10:50AM 0:00.62 moira-checker /usr/local/moisha/lib/python2.7/site-packages/moira/checker/worker.py -n 0 -c /etc/moira/config.yml -l /var/log/moira/work
root 8171 1.0 0.9 120460 27524 1 S+ 10:50AM 0:00.62 /usr/local/moisha/bin/python2.7 /usr/local/moisha/bin/moira-checker -c /usr/local/moisha/etc/config.yml -t test2

A possible patch:
patch_file.txt

Can't uninstall with pip

Running pip uninstall moira-worker returns:

Can't uninstall 'moira-worker'. No files were found to uninstall.

False positive triggers

Hello, I'm seeing strange behaviour: Moira thinks a metric is higher than it actually is.

I have a trigger with this target, which matches 35 different nodes:
movingMedian(aliasByNode(systemstat.*.cpu.percent, 1), '15min')

It alerts when CPU is higher than 80.

For one of the nodes, Moira shows CPU between 80 and 95%, while the actual metric (as seen in Grafana) has been below 20% all day.

Retentions are the same: 60s:1d.

Grafana with that metric also shows less than 20%.

Is this known behaviour? How can I prevent it?

Revert evaluating full requestContextInterval

See e57ea95 and especially e57ea95#diff-ad6e777f8d2acc24d503350da3faefe5R134.

After some investigation I came to believe that adding an extra bucket at the end of the timeseries is a mistake.

Consider two metrics with names cpu1, cpu2 and function movingAverage(cpu*, '3min', 'min'). Let's say that cpu1 sends points several seconds earlier than cpu2. At some point in time the timeseries could look like this:

cpu1: [1, 2, 2, 2]
cpu2: [1, 2, 2, null] - last point hasn't arrived yet

The original Graphite logic always drops the last point, so our movingAverage computes 1 from [1, 2, 2] for both metrics. The current Moira logic always considers the last point, even if it hasn't arrived yet, so our movingAverage computes 2 from [2, 2, 2] and [2, 2, null].

After some time the actual value for cpu2 arrives, and suddenly...

cpu1: [1, 2, 2, 2]
cpu2: [1, 2, 2, 1]

...Moira logic generated a false alarm for cpu2 :(
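The scenario can be replayed with a small sketch. It assumes a three-bucket window aggregated with min, which is what the numbers in this example imply; window_min is a hypothetical helper, not code from Moira or Graphite.

```python
def window_min(points, size=3, drop_last=True):
    """Minimum over the last `size` buckets, ignoring empty (None) buckets.

    drop_last=True mimics the Graphite behaviour described above (the newest
    bucket is never evaluated); drop_last=False mimics the current Moira logic.
    """
    if drop_last:
        points = points[:-1]
    window = [p for p in points[-size:] if p is not None]
    return min(window) if window else None


cpu2_early = [1, 2, 2, None]  # last point hasn't arrived yet
cpu2_late = [1, 2, 2, 1]      # the same series once the point arrives

# Dropping the last bucket gives the same answer before and after arrival.
graphite_early = window_min(cpu2_early, drop_last=True)
graphite_late = window_min(cpu2_late, drop_last=True)

# Keeping the last bucket changes the answer once the late point shows up.
moira_early = window_min(cpu2_early, drop_last=False)
moira_late = window_min(cpu2_late, drop_last=False)
```

With the last bucket dropped, both evaluations agree. With it included, the value for cpu2 changes from 2 to 1 between two checks of the same interval, which is the spurious state change described above.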

A possible workaround is to include the last bucket only if it contains a datapoint. But this is the worst option, because it shifts some metrics forward and corrupts all combining functions like sumSeries.

I think we should either:

  1. Revert the whole commit in question.
  2. Remove + retention from line 134.

@gmlexx it's up to you :)

Stop all checks if moira-cache is not working

When moira-cache fails, all metrics eventually drop into NODATA state, spamming all users with unnecessary notifications.

In fact, only Moira administrators should notice that Moira is not working properly.
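One way to express this is a guard that the checker consults before evaluating triggers. The sketch assumes that the time of the last metric received by moira-cache is available to the checker; should_run_checks, its parameters, and the 300-second limit are all hypothetical.

```python
import time


def should_run_checks(last_metric_received_at, now=None, max_silence=300):
    """Return False when no metrics at all have arrived for `max_silence` seconds.

    In that case the likely cause is a broken moira-cache rather than hundreds
    of simultaneously silent metrics, so trigger checks should be skipped.
    """
    if now is None:
        now = time.time()
    return (now - last_metric_received_at) <= max_silence
```

When the guard returns False, the checker would skip trigger evaluation and notify only the administrators.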

Moira cannot handle averageAbove correctly

[Screenshot: 2016-04-11_23-21-03]

Graphite target on this screenshot returns only metrics that have average value above 1e-9. In total, there are over 200 unique metrics that match this pattern. But if you look for averageAbove for the last hour, there are usually no more than 10 matching metrics.

The problem is that Moira remembers the bad state for all 200+ metrics and shows them in ERROR state, even when we set the no-data state to OK.

Artifacts in metrics

[Screenshot: moira - google chrome 2016-05-11 11 11 17]

The metric with wildcards did not disappear after real data started coming in. One detail: real data started coming in a long time after the trigger had been created. Maybe that's the reason.

Can't delete artifacts in metrics

I have moira-worker installed from https://github.com/moira-alert/worker/releases/download/v1.1.14/moira_worker-1.1.14.tar.gz

When I try to delete the artifacts described in #16, I get a 500 error:

2016-05-12 13:13:51+0000 [_GenericHTTPChannelProtocol,13,127.0.0.1] Unhandled Error
        Traceback (most recent call last):
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
            return _inlineCallbacks(None, gen, Deferred())
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
            result = g.send(result)
          File "/usr/local/lib/python2.7/dist-packages/moira/api/request.py", line 89, in wrapper
            yield f(resource, request)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
            return _inlineCallbacks(None, gen, Deferred())
        --- <exception caught here> ---
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
            result = g.send(result)
          File "/usr/local/lib/python2.7/dist-packages/moira/api/resources/metric.py", line 36, in render_DELETE
            metric = request.args.get('name')[0]
        exceptions.TypeError: 'NoneType' object has no attribute '__getitem__'

2016-05-12 13:13:51+0000 [-] - 0.001 "DELETE /api/trigger/668d5aa4-48ef-4a56-a7e5-2920daa4b4c5/metrics?args HTTP/1.0" 500 -
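The traceback shows request.args.get('name') returning None because the request carries no name parameter, so indexing it with [0] fails. A defensive lookup could look like the sketch below. It assumes only that the arguments are a mapping from parameter name to a list of values, as the [0] in the original code suggests; get_single_arg is a hypothetical helper.

```python
def get_single_arg(args, name):
    """Return the first value of a query argument, or None if it is absent or empty."""
    values = args.get(name)
    if not values:
        return None
    return values[0]
```

The handler could then answer with a 400 error when the result is None instead of failing with a 500.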

Remind users of broken triggers

In a healthy working environment all triggers should be in OK or sometimes WARN state. A trigger in an ERROR or NODATA state is a critical situation and requires an immediate fix.

But in reality, Moira users tend to ignore triggers that switched to ERROR or NODATA state several days ago. These triggers just hang around the top of the list, drawing attention away from real problems. We should force users to fix broken triggers or delete them.

I propose to send a notification every 24h for every trigger in ERROR or NODATA state:

ERROR Trigger name [tag1][tag2]
DevOps.system.cpu.load = 99 (ERROR for 48 hours)
...

ERROR and NODATA are critical conditions that require immediate attention. Some of your metrics have been in this state for days. If this is a non-critical issue, you should make it a WARN or OK.

Support fulltext search by trigger name and patterns

Before the introduction of pages in the trigger list (moira-alert/web#14), users could find their triggers with a simple page search in their browsers (Ctrl+F). This no longer works, and finding triggers across multiple pages has become hard.

We should introduce a replacement for the trigger selection window and combine it with a fulltext search in a single search string that supports #tags.

Companion issue for moira-web is moira-alert/web#48

Fix Advanced -> Simple mode editing

If you edit an Advanced mode trigger and want to switch it to Simple mode, you have to delete the Python expression manually. Otherwise, Moira complains that the expression is invalid, because targets t2, t3, ... are missing.

Add new status NOTCHANGED

This status would be useful in custom Python expressions when the trigger value needs a "dead" zone. For example, if the threshold is 50, the expression can return NOTCHANGED while the value is between 49.5 and 50.5, so the trigger doesn't switch its state.
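The proposed behaviour amounts to hysteresis around the threshold. The sketch below models it as a plain function: inside the dead zone it returns the previous state, which is what returning NOTCHANGED would mean. The function name, the state names, and the assumption that values above the threshold are errors are illustrative only.

```python
def evaluate(value, previous_state, threshold=50.0, dead_zone=0.5):
    """Return the new trigger state, keeping the old one inside the dead zone."""
    if abs(value - threshold) <= dead_zone:
        # Equivalent of NOTCHANGED: too close to the threshold to decide.
        return previous_state
    return "ERROR" if value > threshold else "OK"
```

A value that hovers between 49.5 and 50.5 then never flips the state; only a clear move past the zone does.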

High memory consumption by worker processes

We have run into the following issue: when the checker service is started after being off for several days, the worker processes eat up all the RAM and are eventually killed by the kernel.
During normal operation, memory consumption is much lower.

Add support for different time formats

Unexpected behavior arises when users specify time strings like 10m, 10min, 10minutes. Some of them are supported, some aren't.

We should either support all of them or give the user a clear error message for unsupported ones.
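A minimal sketch of the second option, a parser that accepts several spellings and rejects everything else with an explicit message. The unit table is an assumption made for this example and does not reproduce the grammar that Graphite or Moira actually implement; parse_interval is a hypothetical name.

```python
import re

# Assumed spellings for this sketch; extend as needed.
_UNIT_SECONDS = {
    "s": 1, "sec": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hour": 3600, "hours": 3600,
    "d": 86400, "day": 86400, "days": 86400,
}
_PATTERN = re.compile(r"^\s*(\d+)\s*([a-z]+)\s*$")


def parse_interval(text):
    """Convert a string such as '10m', '10min' or '10minutes' to seconds."""
    match = _PATTERN.match(text.lower())
    if not match or match.group(2) not in _UNIT_SECONDS:
        raise ValueError(
            "unsupported time string %r; expected e.g. '10m', '10min' or '10minutes'" % text
        )
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
```

All three spellings from the example above resolve to 600 seconds, and anything outside the table fails with a message that names the offending input.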

Failed to perform triggers check

I get this error:
2017-01-19T11:26:59+0300 [moira.logs#info] Checking trigger 6b8cf6db-97f7-447c-abef-f70ec8ae766f
2017-01-19T11:26:59+0300 [moira.logs#info] Writing new event: {'timestamp': 1484812019, 'metric': u'Test_Vlan.rx', 'value': nan, 'state': 'OK', 'trigger_id': u'6b8cf6db-97f7-447c-abef-f70ec8ae766f', 'old_state': u'NODATA'}
2017-01-19T11:26:59+0300 [moira.logs#error] Trigger check failed: Invalid Nan value when encoding double
2017-01-19T11:26:59+0300 [moira.logs#info] Writing new event: {'old_state': u'OK', 'timestamp': 1484814419, 'state': 'EXCEPTION', 'metric': None, 'trigger_id': u'6b8cf6db-97f7-447c-abef-f70ec8ae766f'}
2017-01-19T11:26:59+0300 [moira.logs#error] Failed to perform triggers check: Invalid Nan value when encoding double

The metric Test_Vlan.rx does contain data.
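The log shows an event with 'value': nan being handed to an encoder that rejects NaN. One possible mitigation is to replace NaN with a null value before serializing. In the sketch below the standard json module with allow_nan=False stands in for a strict encoder; which encoder Moira uses is not stated in this report, and sanitize_event is a hypothetical helper.

```python
import json
import math


def sanitize_event(event):
    """Return a copy of the event with a NaN value replaced by None."""
    clean = dict(event)
    value = clean.get("value")
    if isinstance(value, float) and math.isnan(value):
        clean["value"] = None
    return clean


event = {"metric": "Test_Vlan.rx", "value": float("nan"), "state": "OK"}

# allow_nan=False makes json.dumps raise ValueError on NaN, like a strict encoder.
encoded = json.dumps(sanitize_event(event), allow_nan=False)
```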

Set maintenance mode for specific metric

It would be great if Moira could mark a specific metric for maintenance (not the whole trigger).
Example: I have a trigger that monitors the availability of my services. When one of them starts flapping, I'd like to temporarily disable alerts for this particular service without disabling alerts for the rest.

Web application breaks if tag containing only numbers is created

Chromium Version 52.0.2743.116, console:

angular.js:12416 TypeError: b.value.toLowerCase is not a function
    at http://127.0.0.1:8080/app-c04c88e6cf11fa8e8fce.js:750:69
    at TagList.sort (native)
    at http://127.0.0.1:8080/app-c04c88e6cf11fa8e8fce.js:749:32
    at processQueue (http://127.0.0.1:8080/common-c04c88e6cf11fa8e8fce.js:14760:29)
    at http://127.0.0.1:8080/common-c04c88e6cf11fa8e8fce.js:14776:28
    at Scope.$eval (http://127.0.0.1:8080/common-c04c88e6cf11fa8e8fce.js:16042:29)
    at Scope.$digest (http://127.0.0.1:8080/common-c04c88e6cf11fa8e8fce.js:15853:32)
    at Scope.$apply (http://127.0.0.1:8080/common-c04c88e6cf11fa8e8fce.js:16150:25)
    at done (http://127.0.0.1:8080/common-c04c88e6cf11fa8e8fce.js:10637:48)
    at completeRequest (http://127.0.0.1:8080/common-c04c88e6cf11fa8e8fce.js:10809:8)

GET request on /api/tag/:

{
    "list": [
        666,
        "qwe",
        "cat",
        "lop",
        "nata",
        "5555dsfg"
    ],
    "tags": {
        "5555dsfg": {},
        "666": {},
        "cat": {},
        "lop": {},
        "nata": {},
        "qwe": {}
    }
}

I use:

  • anyjson==0.3.3
  • txredisapi==1.4.3
  • redis_version:3.2.3
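In the response above, the tag 666 appears in "list" as a number, while the web application calls toLowerCase on every entry. A server-side guard could coerce every tag to a string before building the response, as in the sketch below. Where the numeric conversion actually happens (JSON library, Redis client, or elsewhere) is not established in this report, and normalize_tags is a hypothetical helper.

```python
def normalize_tags(tags):
    """Return the tag list with every entry converted to a string."""
    return [tag if isinstance(tag, str) else str(tag) for tag in tags]


tags = normalize_tags([666, "qwe", "cat", "lop", "nata", "5555dsfg"])

# Once all tags are strings, a case-insensitive sort works for every entry.
sorted_tags = sorted(tags, key=lambda tag: tag.lower())
```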

Moira-checker ignored args

[akulov@vm-edi-graph3 ~]$ moira-checker -t a6b3a011-5203-4598-98fa-e4d6d2a8b797 -l /home/akulov/log/
Traceback (most recent call last):
  File "/usr/bin/moira-checker", line 9, in <module>
    load_entry_point('moira-worker===master', 'console_scripts', 'moira-checker')()
  File "/usr/lib/python2.7/site-packages/moira/checker/server.py", line 50, in run
    logs.checker_master()
  File "/usr/lib/python2.7/site-packages/moira/logs.py", line 33, in checker_master
    log.startLogging(FileLogObserver(daily("checker.log")))
  File "/usr/lib/python2.7/site-packages/moira/logs.py", line 47, in daily
    return ZeroPaddingDailyLogFile(name, path)
  File "/usr/lib/python2.7/site-packages/Twisted-15.2.1-py2.7-linux-x86_64.egg/twisted/python/logfile.py", line 40, in __init__
    self._openFile()
  File "/usr/lib/python2.7/site-packages/Twisted-15.2.1-py2.7-linux-x86_64.egg/twisted/python/logfile.py", line 238, in _openFile
    BaseLogFile._openFile(self)
  File "/usr/lib/python2.7/site-packages/Twisted-15.2.1-py2.7-linux-x86_64.egg/twisted/python/logfile.py", line 64, in _openFile
    self._file = file(self.path, "r+", 1)
IOError: [Errno 13] Permission denied: '/var/log/moira/worker/checker.log'

Missed alert

For the following expression:
alias(summarize(maxSeries(divideSeries(KE.banana.jvm.*.heap.HeapMemoryUsage_used, minSeries(KE.banana.jvm.*.heap.HeapMemoryUsage_max))), '10min', 'max'), 'jvm heap moving max')

Moira missed an alert today. The last event is from May 5th:
[Screenshot: 2016-05-10_18-53-45]

And here is a graph from today, clearly over the limits at 14:40 MSK and 15:00..15:20 MSK:
[Graph: render]

Targets with {} don't work

I can't add a trigger with this target: aliasByNode(KE.Databases.{Mirroring,AG}.*.IsSynchronized,3)

2016-06-07 11:14:23+0300 [BaseRedisProtocol,client] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.7/site-packages/Twisted-15.2.1-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1184, in gotResult
            _inlineCallbacks(r, g, deferred)
          File "/usr/lib/python2.7/site-packages/Twisted-15.2.1-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1128, in _inlineCallbacks
            result = g.send(result)
          File "/usr/lib/python2.7/site-packages/moira_worker-1.2.3.dev1_gb3445cb-py2.7.egg/moira/graphite/evaluator.py", line 103, in evaluateTokens
            defer.returnValue((yield func(requestContext, *args, **kwargs)))
          File "/usr/lib/python2.7/site-packages/Twisted-15.2.1-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1274, in unwindGenerator
            return _inlineCallbacks(None, gen, Deferred())
        --- <exception caught here> ---
          File "/usr/lib/python2.7/site-packages/Twisted-15.2.1-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1128, in _inlineCallbacks
            result = g.send(result)
          File "/usr/lib/python2.7/site-packages/moira_worker-1.2.3.dev1_gb3445cb-py2.7.egg/moira/graphite/functions.py", line 1628, in aliasByNode
            series.name = '.'.join(metric_pieces[n] for n in nodes)
          File "/usr/lib/python2.7/site-packages/moira_worker-1.2.3.dev1_gb3445cb-py2.7.egg/moira/graphite/functions.py", line 1628, in <genexpr>
            series.name = '.'.join(metric_pieces[n] for n in nodes)
        exceptions.IndexError: list index out of range
pip list | grep moi
moira-worker (1.2.3.dev1-gb3445cb)

Update trigger failed

pip show Twisted

Name: Twisted
Version: 15.2.1

pip show moira-worker

Name: moira-worker
Version: 1.2.1

curl -H "Content-type: application/json;charset=UTF-8" --data-binary @/opt/moira/alerts/5.load-average.json -s -X PUT http://127.0.0.1:8081/api/trigger/61e981ae-6144-4aee-8b44-9a2931a12aef

The content of 5.load-average.json is at http://pastebin.com/UkGZSrHW

The request failed with this error in api.log:

2016-05-19 11:03:10+0000 [HiredisProtocol,client] Unhandled Error
        Traceback (most recent call last):
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1184, in gotResult
            _inlineCallbacks(r, g, deferred)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
        --- <exception caught here> ---
          File "/usr/local/lib/python2.7/dist-packages/moira/api/request.py", line 91, in wrapper
            yield f(resource, request)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/moira/api/resources/trigger.py", line 73, in render_PUT
            yield self.save_trigger(request, self.trigger_id, "trigger updated")
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/moira/api/request.py", line 31, in decorator
            yield f(*args, **kwargs)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/moira/api/request.py", line 82, in decorator
            yield f(*args, **kwargs)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/local/lib/python2.7/dist-packages/moira/api/resources/redis.py", line 31, in save_trigger
            yield self.db.accuireTriggerCheckLock(trigger_id, 10)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
            result = g.send(result)
          File "/usr/local/lib/python2.7/dist-packages/moira/db.py", line 951, in accuireTriggerCheckLock
            yield task.deferLater(reactor, 0.5, lambda: None)
        exceptions.NameError: global name 'task' is not defined

2016-05-19 11:03:10+0000 [-] - 0.028 "PUT /api/trigger/61e981ae-6144-4aee-8b44-9a2931a12aef? HTTP/1.0" 500 614
