libddog's People

Contributors

albertteoh, martinm-nm, numerodix

libddog's Issues

spike support for more widgets

We currently support Timeseries, QueryValue, Note and Group. This basically covers all the functionality that is of interest at the moment, but we have made a lot of assumptions in the design based on the knowledge gathered when implementing these widgets. There may be other things lurking in the Datadog language that we haven't seen yet, or things that are part of other widgets that would be disruptive to our current software architecture.

This issue proposes a quick and dirty spike of more widgets, so that we can shake some of this out. We will need to decide which widgets appear to represent the greatest technical risk and go after those.

implement a dsl for formulas

At the moment we treat formulas as text strings even though we have a dsl for queries:

query_all_reqs = Query(
    metric=Metric(name="aws.elb.request_count"),
    agg=Aggregation(func=AggFunc.SUM),
    name="reqs_all",
)

...

Request(
    formulas=[Formula(text="100 * (reqs_5xx / reqs_all)")],
    queries=[query_all_reqs, query],
    ...

This is unfortunate, because it means:

  • We need to parse the formula to check for syntax errors,
  • ...and for unbound identifiers (in the example, the name reqs_all attached to the Query has to match reqs_all in the formula text),
  • ...and for invalid functions, eg. 100 * not_a_func(reqs_all).
  • Formulas are not structured the way queries are; they are not "programmable".

In terms of the language of formulas we have already captured it fully (to the best of my knowledge) via the grammar and the type definitions in libddog.metrics, so there aren't any unknowns here. We will now turn these types into a dsl so that formulas can be written like so:

query_all_reqs = Query(
    metric=Metric(name="aws.elb.request_count"),
    agg=Aggregation(func=AggFunc.SUM),
    name="reqs_all",  # this name no longer matters
)
reqs_all = query_all_reqs.identifier(name='reqs_all')

Formula(text=100 * (reqs_5xx / reqs_all))

Here, reqs_5xx and reqs_all are plain variables in Python instead. I'm unsure if we can support arithmetic expressions in natural syntax as shown, or whether we will need a custom representation for them like so:

Formula(text=Mul(100, Div(reqs_5xx, reqs_all)))

Another approach would be to wrap Python literals instead and have these classes support the binary operators we need to support:

Formula(text=Int(100) * (reqs_5xx / reqs_all))
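
The wrapped-literals approach can be sketched with operator overloading. This is a minimal illustration of the idea, not the libddog implementation; the class names (FormulaNode, BinOp, Int, Identifier) and the codegen method are assumptions for the sake of the example:

```python
# Sketch of the operator-overloading approach: formula nodes implement
# the binary operators, building an expression tree that can later be
# rendered back into Datadog's formula syntax.

class FormulaNode:
    def __mul__(self, other): return BinOp("*", self, _wrap(other))
    def __rmul__(self, other): return BinOp("*", _wrap(other), self)
    def __truediv__(self, other): return BinOp("/", self, _wrap(other))
    def __rtruediv__(self, other): return BinOp("/", _wrap(other), self)

class Int(FormulaNode):
    def __init__(self, value): self.value = value
    def codegen(self): return str(self.value)

class Identifier(FormulaNode):
    def __init__(self, name): self.name = name
    def codegen(self): return self.name

class BinOp(FormulaNode):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def codegen(self):
        return f"({self.left.codegen()} {self.op} {self.right.codegen()})"

def _wrap(value):
    # Lift bare Python ints into the node hierarchy so mixed
    # expressions like Int(100) * (a / b) work.
    return Int(value) if isinstance(value, int) else value

reqs_5xx = Identifier("reqs_5xx")
reqs_all = Identifier("reqs_all")
expr = Int(100) * (reqs_5xx / reqs_all)
```

A nice side effect of defining __rmul__ and __rtruediv__ is that plain literals on the left-hand side (100 * reqs_all) also work, which would answer the "natural syntax" question above for expressions that contain at least one node object.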

As part of this issue we should:

  • Implement the dsl for formulas
  • Add exhaustive unit tests for formulas
  • Make sure validation on formulas is user friendly (not just a naked assert)
  • Officially decouple the query dsl from the formula dsl, such that they have no common subclass. From now on we will treat these as separate languages.
  • Clean up all the types related to formulas in libddog.metrics and remove the ones that aren't actually useful.
  • Update the integration tests using formulas to use the dsl instead (textual formulas won't be supported anymore, anyway).

Configure labels in aggregation

Problem or idea

The following graph definition is handy because it elegantly defines multiple percentiles in the one query:

Request(
    title="ms latency",
    queries=[
        Query(f"my.query.{percentile}")
        .agg("avg")
        .rollup("avg")
        for percentile in ["95percentile", "median"]
    ],
...

However, it results in two timeseries with the same label and with no legend to differentiate the two lines. Fortunately, in this case, we know the top line should be P95 and the other the median just by the definition of these percentiles:

[Screenshots, 2021-10-21: the same graph shown twice; the two lines carry identical labels and cannot be told apart]

The only other way to differentiate them is to edit the graph to see which line refers to which metrics query.

Solution

This is more a question to start with as there may be a solution to this that I'm not aware of; is there a way to label each queried metric?

Polish the api

Problem or idea

  • Flesh out enum alternatives.
  • Try to simplify the query DSL.
  • Address TODOs at the api level.

Add metadata to dashboard descriptions

Context

Dashboards maintained by hand and dashboards maintained using libddog live in the same namespace. It would be useful to have a little standardized metadata that would allow users to identify dashboards maintained using libddog.

Solution

We could append this footer to the dashboard description:

This dashboard is maintained using [libddog](link to libddog github repo) 
and the dashboard definition lives [here](link to definition repo).

spike a self-update mechanism

libddog has relatively frequent releases but users can easily fall behind because they might follow the getting started guide once and then never upgrade.

Putting the burden of upgrading on the user by saying "try to stay current" or "here's how to update to the latest version" is not really a great solution, given that libddog is not a frequently used tool like git.

It would therefore be really useful to have a way to detect that the version being used is out of date and have a way to update libddog on the fly.

Here's an idea:

  1. When ddog is invoked, issue a pip install libddog==0.x.x in the background. Pip will return an error message which lists all the available versions.
  2. Compare the version in use against the latest version returned.
  3. If the version in use is out of date, display a prompt saying "your version is out of date, do you want to upgrade? [y/n]".
  4. If the user selects "y", issue a pip install -U libddog in the background.
  5. On exit, write a file to a temp location recording that this check was done at time X. On the next invocation of ddog, read this file and skip the check for 24h or similar.

Given the use of pip and network communication involved this might be a somewhat fragile mechanism, so it needs to fail gracefully.
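
The throttling and version-listing parts of the idea could look roughly like this. The stamp-file location, the pip invocation, and the parsing of pip's error output are all assumptions for illustration, not existing ddog behaviour:

```python
# Sketch of a 24h-throttled update check. The version lookup relies on
# pip's error message for an unsatisfiable requirement, which lists all
# available versions; if that output ever changes we return None and
# fail gracefully, as the issue requires.
import json
import subprocess
import tempfile
import time
from pathlib import Path

STAMP = Path(tempfile.gettempdir()) / "ddog-update-check.json"

def should_check(interval_s=24 * 3600):
    """Return True if the last recorded check is older than interval_s."""
    try:
        last = json.loads(STAMP.read_text())["checked_at"]
    except (OSError, ValueError, KeyError):
        return True
    return time.time() - last > interval_s

def record_check():
    STAMP.write_text(json.dumps({"checked_at": time.time()}))

def latest_version():
    """Ask pip for an impossible version; the error lists all versions."""
    proc = subprocess.run(
        ["pip", "install", "libddog==none"],
        capture_output=True, text=True,
    )
    # stderr looks like: "... (from versions: 0.1.0, 0.1.1, ...)"
    marker = "from versions:"
    idx = proc.stderr.find(marker)
    if idx == -1:
        return None  # no network, pip output changed, etc.
    versions = proc.stderr[idx + len(marker):].split(")")[0].split(",")
    return versions[-1].strip() or None
```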

Implement query parser

Implement a query parser that uses the existing grammar and builds an AST. Work items:

  • Redesign the Query class to be a monadic API instead. This would solve the problem of storing which of rollup and fill was intended to come first in the query (which we don't have a good way of capturing with the current API). After spiking this, I decided against making such a big change: instead, let's just make rollup and fill items in a list that can be provided in any order. It's much less disruptive.
  • Extend the AST to cover not only queries but also query expressions (functions and formulas).
  • Write a matcher for AST objects that's needed for testing.
  • Write a parser with tests that parses query strings and returns ASTs. Use the matcher to test the parser.
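
One possible shape for the AST matcher work item above: a recursive compare that returns human-readable mismatch paths rather than a bare boolean, which makes test failures self-explanatory. The node types here (AggNode, QueryNode) are toy dataclasses standing in for the real libddog AST classes:

```python
# Recursive structural matcher for AST objects built from dataclasses,
# lists/tuples and plain values. Returns a list of mismatch
# descriptions; an empty list means the trees match.
from dataclasses import dataclass, fields, is_dataclass

def ast_match(expected, actual, path="root"):
    if type(expected) is not type(actual):
        return [f"{path}: {type(expected).__name__} != {type(actual).__name__}"]
    if is_dataclass(expected):
        problems = []
        for f in fields(expected):
            problems += ast_match(
                getattr(expected, f.name), getattr(actual, f.name),
                f"{path}.{f.name}")
        return problems
    if isinstance(expected, (list, tuple)):
        if len(expected) != len(actual):
            return [f"{path}: length {len(expected)} != {len(actual)}"]
        problems = []
        for i, (e, a) in enumerate(zip(expected, actual)):
            problems += ast_match(e, a, f"{path}[{i}]")
        return problems
    if expected != actual:
        return [f"{path}: {expected!r} != {actual!r}"]
    return []

# Toy stand-ins for real AST node classes, for demonstration only.
@dataclass
class AggNode:
    func: str

@dataclass
class QueryNode:
    metric: str
    agg: AggNode
```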

draft a user guide

Let's write a user guide that helps users get started with libddog:

  • How to get libddog
  • How to code against the API
  • The lifecycle of creating, updating, deleting dashboards
  • Feature support matrix

Add unit tests

Solution

Add unit tests for:

  • Typical use of each type of widget - validate codegen against expected json
  • Validation logic that exists in the library to ensure internal consistency

spike migrating away from the `datadog` dependency

The datadog package seems to have been superseded by datadog-api-client, and it has limited coverage of the Datadog API anyway.

One option would be to adopt datadog-api-client, which is the recommended library now. But all we really want is to get/put json, not have to deal with model objects (because we have our own anyway). If this turns out to be painful we might as well write our own client library instead.

Move generic library to a separate repo

Move the generic library to a separate repository (this one). As part of that:

  • Rewrite the update tool to be provided by libddog and detect dashboard definitions in $PWD.
  • Rewrite the list tool to be provided by libddog.

remove ID column from `list-defs` output

Context

Dashboard ids don't need to be included in definitions anymore. It's therefore not very useful to display a None under the ID column when listing definitions. It almost suggests that something is wrong:

$ ddog dash list-defs
ID           GROUPS  WIDGETS  QUERIES  TITLE
None              0        1        1  libddog skel: AWS ELB dashboard

publish-draft fails when the Dashboard has an id

Describe the bug

Passing a Dashboard with an id causes an exception:

    dashboard = Dashboard(
        title="libddog skel: AWS ELB dashboard",
        id="abc",
        desc="Sample dashboard showing metrics from ELB",
        widgets=[cpu_per_az],
        tmpl_var_presets=tmpl_presets_region,
    )

When running publish-draft we see:

$ ddog dash publish-draft -t '*skel*'
Creating dashboard entitled: '[draft] libddog skel: AWS ELB dashboard'... Traceback (most recent call last):
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/datadog/api/api_client.py", line 198, in submit
    response_obj = json.loads(content.decode("utf-8"))
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mmatusiak/envs/monitoring-project/bin/ddog", line 169, in <module>
    cli()
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/mmatusiak/envs/monitoring-project/bin/ddog", line 110, in publish_draft
    exit_code = mgr.publish_draft(title_pat=title)
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/libddog/command_line/dashboards.py", line 195, in publish_draft
    id = self.manager.create_dashboard(dashboard=dash)
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/libddog/crud/dashboards.py", line 166, in create_dashboard
    return self.client.create_dashboard(dashboard=dashboard)
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/libddog/crud/client.py", line 54, in create_dashboard
    resp = datadog.api.Dashboard.create(**client_kwargs)  # type: ignore
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/datadog/api/resources.py", line 50, in create
    return APIClient.submit("POST", path, api_version, body, attach_host_name=attach_host_name, **params)
  File "/home/mmatusiak/envs/monitoring-project/lib/python3.8/site-packages/datadog/api/api_client.py", line 202, in submit
    raise ValueError("Invalid JSON response: {0}".format(content))
ValueError: Invalid JSON response: b'\n\n\n<!DOCTYPE html>

Duplicate query name causes first query to be used for all

Describe the bug

If there are multiple queries passed into a request that have the same name, then the metric from the first query is used for all queries.

Steps to Reproduce

Make a dashboard containing a Timeseries with a single Request using the following 2 queries:

queries=[
    Query(f"my.metric.{metric_part}", name="foo")
    .filter("$cluster")
    .agg("avg")
    .rollup("avg")
    for metric_part in ["first", "second"]
],

Note that the name of the two queries are the same ("foo").

The dashboard produced will contain two queries, but they will both have the metric my.metric.first.

Expected behaviour

Having multiple queries with the same name is most likely a bug on the part of the dashboard writer. It would be nice if the constructor for Request raised an error.
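
A minimal sketch of the check the constructor could perform; Request itself isn't reproduced here, and the exception name is an assumption for illustration:

```python
# Reject duplicate query names up front, instead of letting the first
# query's metric silently win for all of them.
from collections import Counter

class QueryNameError(Exception):
    pass

def validate_unique_names(queries):
    """Raise if two queries in the same Request share a name."""
    counts = Counter(q.name for q in queries)
    dupes = sorted(name for name, n in counts.items() if n > 1)
    if dupes:
        raise QueryNameError(
            f"duplicate query name(s): {', '.join(dupes)}; "
            "each query in a Request must have a unique name"
        )
```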

URLs

N/A

Additional context

N/A

make the query dsl monadic

As it stands the query dsl is quite verbose:

query = Query(
    metric=Metric(name="aws.ec2.cpuutilization"),
    filter=Filter(conds=[TmplVar(tvar="region")]),
    agg=Aggregation(func=AggFunc.AVG, by=By(tags=["availability-zone"])),
    funcs=[Fill(func=FillFunc.LINEAR), Rollup(func=RollupFunc.AVG, period_s=60)],
)

This initially felt like the most natural representation of query strings, especially at a time when I had a limited picture of the query language. Now that we have a complete language grammar I feel confident in simplifying this syntax to make it more succinct and more similar to the language itself. It will look something like this:

query = (Query('aws.ec2.cpuutilization')
               .filter('$region')
               .agg('avg').by('availability-zone').as_count()
               .fill('linear').rollup('avg', 60))

This has a number of advantages:

  • Query expressions are more compact and involve a lot less typing.
  • The dsl is much closer to the query language itself, which should make it easier to learn for users of libddog.
  • It makes it much easier to build queries incrementally. One place in the code can be responsible for setting the common parts of a query and code in a different place can add use case specific components like additional filters.
  • User code only has to import the Query class itself from now on, as opposed to all the types we had before.

There are drawbacks too, however:

  • Python doesn't really love long method chains: when the expression gets too long and needs to be split across lines, it has to be wrapped in parentheses.
  • The dsl doesn't provide a way to undo things, like remove a filter. This already wasn't convenient with the previous syntax, but now becomes slightly less convenient still. It is an obscure use case, though.

We will exploit Python *args and **kwargs to make the dsl compact, so filtering by multiple tags can be done as:

filter('$region', az='us-east-1a', role='cache')

For filters we need to support both equality and inequality, and that can be done as:

filter(...)      # equal, and by far the most common use
filter_ne(...)   # unequal

The dsl will need validation at several levels:

  • Things that are technically keywords or constants in the query language need to be validated against a list, eg 'sum'.
  • Make sure components are not applied multiple times, eg Query().agg('avg').filter(...).agg('sum') is not valid.
  • Validation of input when the input is an identifier. Since we know the language grammar we can flag eg. a metric name that is malformed. This is more of a nicety though.
  • Make sure validation errors are user friendly.
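The validation rules above can be sketched together with the fluent style. This is a minimal illustration under stated assumptions: the class shape, the keyword list, and the error wording are placeholders, not the final libddog API:

```python
# Fluent query builder: each method mutates and returns self, which is
# what enables chaining. agg() demonstrates two of the validation
# levels: keyword checking against a list, and the "applied only once"
# rule, both raising a user-friendly error instead of a naked assert.

class QueryValidationError(Exception):
    pass

AGG_FUNCS = {"avg", "sum", "min", "max"}  # assumed keyword list

class Query:
    def __init__(self, metric):
        self.metric = metric
        self._agg = None
        self._filters = []

    def filter(self, *tvars, **tags):
        # *args for template variables, **kwargs for tag:value pairs,
        # as in: filter('$region', az='us-east-1a', role='cache')
        self._filters.extend(tvars)
        self._filters.extend(f"{k}:{v}" for k, v in tags.items())
        return self

    def agg(self, func):
        if self._agg is not None:
            raise QueryValidationError(
                f"agg() already applied ({self._agg!r}); "
                "each component may be applied only once"
            )
        if func not in AGG_FUNCS:
            raise QueryValidationError(
                f"{func!r} is not a valid aggregation; "
                f"expected one of {sorted(AGG_FUNCS)}"
            )
        self._agg = func
        return self

q = Query("aws.ec2.cpuutilization").filter("$region", az="us-east-1a").agg("avg")
```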

draft a maintainer guide

What to include:

  • Semver policy
  • Development process and QA efforts (what all the tests are, how to run them)
  • The steps to make a new release

prepare packaging and release on PyPI

The goal of this issue is to package libddog to be installable as a proper library and cli from PyPI.

In scope to review:

  • requirements.txt / dev-requirements.txt
  • setup.py

Let's also decide whether to use setup.py or whether there's a better solution around these days.

include the userid in the dashboard footer

Context

At the moment libddog appends a footer in the dashboard description which includes the timestamp and the git branch name (if detected) of the last update to the dashboard. This already helps to narrow down how that last update happened, but it would be even more useful to also include the userid of the user who made the update (or whose credentials were used to make the update).

It looks like the user identity can be inferred from the app key via the Datadog API:

  1. Make a request to https://api.datadoghq.com/api/v2/current_user/application_keys to list all the app keys for the current user. Iterate over the list and pick the key that matches the app key that's being used in the session.
  2. Make a request to https://api.datadoghq.com/api/v2/current_user/application_keys/<app-key-id> to get details about that app key. In the response there is a handle attribute which contains the userid.
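
The two steps could be sketched as below. The endpoint paths come from the issue text, but the exact payload shapes (a last4 attribute for matching the key in use, the handle living under the key's attributes) are assumptions to be verified against the Datadog API docs; error handling and pagination are elided:

```python
# Hedged sketch of resolving the userid from the app key in use.
import json
import urllib.request

BASE = "https://api.datadoghq.com"

def _get(path, api_key, app_key):
    req = urllib.request.Request(
        BASE + path,
        headers={"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def pick_matching_key(items, app_key):
    """Step 1: from the listing payload, pick the key whose trailing
    characters match the app key used in the session (assumed field)."""
    for item in items:
        if item.get("attributes", {}).get("last4") == app_key[-4:]:
            return item.get("id")
    return None

def resolve_userid(api_key, app_key):
    listing = _get("/api/v2/current_user/application_keys", api_key, app_key)
    key_id = pick_matching_key(listing.get("data", []), app_key)
    if key_id is None:
        return None
    # Step 2: fetch that key's details and read the handle attribute.
    detail = _get(f"/api/v2/current_user/application_keys/{key_id}",
                  api_key, app_key)
    return detail.get("data", {}).get("attributes", {}).get("handle")
```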

research support for monitors

Other than dashboards, monitors are the second key area we want to support in libddog. Since we haven't done any work on this yet, now would be a good time to spike support for monitors so that we can gauge what commonalities exist between widgets and monitors, and which parts of the code can be reused and which need to be rethought.

create a skel for users

In order to make it easier for users to get started with libddog, let's create a skel/ directory with some example dashboards, a copy of our QA related scripts (reformat, typecheck etc), config files for QA tools, a .gitignore, and a github actions template.

Spike parsing queries

Context

At this point we have very good unit test coverage of the various classes and the serialization/validation logic in libddog, but there is still a risk that things will slowly evolve in the Datadog API and our code will get out of date.

It would be useful to also do round trip testing where we maintain a dashboard defined using libddog, PUT it on Datadog, GET it back, and compare the two for unexpected discrepancies. To do this we need to be able to parse dashboards and queries.

Solution

Spike parsing query strings. Use existing dashboards as sample input. The parser doesn't have to cover every last part of the query language, just the most commonly used subset.

In dashboard listing show dashboards maintained by libddog

Context

Since libddog inserts a metadata footer into the dashboard description, we can actually tell whether a dashboard is maintained by libddog. Or strictly speaking: we can tell whether a dashboard appears to have been maintained using libddog at some point. Whether it still is we cannot tell for sure, because the description may still say so even if the dashboard is no longer managed with libddog. But it's still a useful approximation that allows us to gauge usage of libddog.

Solution

In dash list-live use the dashboard description field to see if libddog is mentioned. If it is, show the dashboard as being maintained using libddog.
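
The check itself is a one-liner. The marker string below is an assumption based on the metadata-footer issue earlier in this tracker, not a fixed libddog constant:

```python
# Heuristic used by list-live: a dashboard counts as libddog-maintained
# if its description mentions the libddog footer.
from typing import Optional

MARKER = "maintained using [libddog]"

def is_libddog_maintained(description: Optional[str]) -> bool:
    return bool(description) and MARKER in description
```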

Write integration tests

libddog is a library that generates code meant for Datadog. In order to have confidence that our code is valid it would help enormously to have a set of integration tests that can serve as regression tests against future changes in Datadog and consequent incompatibilities in libddog.

Running the integration tests will require api keys, so they cannot be set up to run automatically in CI. They will need to be run manually.

  • A dashboard that exemplifies use of all (or most of?) the widgets that libddog currently supports.
  • A dashboard that uses all (or a sensible selection of?) the metrics query language, including functions.
  • A test that puts these dashboards (make sure it's using dashboard id/titles in such a way that it's clear it's purely used for QA purposes) on Datadog and validates that the update was successful.
  • An extension to the test that gets the dashboard after updating it and compares it to the definitions we have locally.

Deprecate Request `title` attribute

The fastest way to create a widget showing a single metric is basically:

    widget = Timeseries(
        title="resource utilization",
        requests=[
            Request(
                title="avg cpu",
                queries=[Query("aws.ec2.cpuutilization").agg("avg")],
            )
        ],
    )

The Request doesn't have any formulas, so a formula is synthesized for the query and given the label "avg cpu" which is taken from the title attribute.

This gives us a widget with a title and the line in the graph has a label.

[screenshot: res-one-line]

A common next step is to add a second metric into the widget, which adds a second line to the graph:

    widget = Timeseries(
        title="resource utilization",
        requests=[
            Request(
                title="avg cpu",
                queries=[
                    Query("aws.ec2.cpuutilization").agg("avg"),
                    Query("aws.ec2.disk_read_ops").agg("avg"),
                ],
            )
        ],
    )

But here there's a problem. Both lines have the same label, because again we've synthesized a formula for each query and populated the label from the title attribute.

[screenshot: res-two-lines]

In fact, at the JSON level the request object doesn't have a title attribute; it is something only libddog has. It was introduced as a way to label lines before we had the concept of formulas, but at this point it's not needed and causes confusion, as illustrated in #49.

Action items:

  • Deprecate the title attribute and schedule for removal in 6mo or so. The deprecation message should suggest a solution, not just state the error.
  • Bump version to 0.1.0.
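
The deprecation could take roughly this shape. The class body and message wording are illustrative; the actual signature of Request and the removal schedule are up to the maintainers:

```python
# Warn on use of the deprecated title attribute, suggesting the
# replacement as the action items require (not just stating the error).
import warnings

class Request:
    def __init__(self, queries, formulas=(), title=None):
        if title is not None:
            warnings.warn(
                "Request(title=...) is deprecated and will be removed "
                "in a future release; label each line explicitly by "
                "attaching a formula with an alias instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's code
            )
        self.queries = list(queries)
        self.formulas = list(formulas)
        self.title = title
```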

full crud support in ddog

Context

ddog currently only supports updating existing live dashboards.

Problem or idea

New commands to support:

  • snapshot-live to fetch a live dashboard and store it locally.
  • list-live to see existing live dashboards.
  • update-live when hitting a dashboard definition without id should not print an error. Instead it should: 1) list the live dashboards 2) see if there is a live dashboard with the same title 3) prompt the user asking if they wish to overwrite it 4) inform the user that putting the id in the dashboard definition will avoid the prompt next time
  • update-live flow when there is no matching live dashboard: 1) prompt the user if they want to create it 2) inform the user that they should put the id into their definition to avoid prompts in the future
  • delete-live should delete a live dashboard by id. To guard against dashboards deleted by mistake it should first fetch the live dashboard and back it up to a file on the user's machine so that it can be restored if needed.
  • Implement snapshot-live
  • Implement list-live
  • Extend update-live to support creation
  • Implement delete-live
  • Update user guide
  • Update changelog
