
SLO Generator


slo-generator is a tool to compute and export Service Level Objectives (SLOs), Error Budgets and Burn Rates, using configurations written in YAML (or JSON) format.

IMPORTANT NOTE: the following content is the slo-generator v2 documentation. The v1 documentation is available here, and instructions to migrate to v2 are available here.


Description

The slo-generator runs backend queries to compute Service Level Indicators, compares them with the defined Service Level Objectives, and generates a report by computing important metrics:

  • Service Level Indicator (SLI), defined as SLI = N_good_events / N_valid_events
  • Error Budget (EB), defined as EB = 1 - SLI
  • Error Budget Burn Rate (EBBR), defined as EBBR = EB / EB_target
  • ... and more, see the example SLO report.

The Error Budget Burn Rate is often used for alerting on SLOs, as it has proven in practice to be more reliable and stable than alerting directly on metrics or on SLI > SLO thresholds.
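For example (illustrative numbers): with 999,000 good events out of 1,000,000 valid events, SLI = 0.999 and EB = 0.001. With a 99.95% SLO, the error budget target is 1 - 0.9995 = 0.0005, so EBBR = 0.001 / 0.0005 = 2, i.e. the budget is being burned at twice the sustainable rate.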

Local usage

Requirements

  • Python 3.7 and above
  • pip3

Installation

slo-generator is a Python library published on PyPI. To install it, run:

pip3 install slo-generator

Notes:

  • To install providers, use pip3 install slo-generator[<PROVIDER_1>, <PROVIDER_2>, ... <PROVIDER_n>]. For instance:
    • pip3 install slo-generator[cloud_monitoring] installs the Cloud Monitoring backend / exporter.
    • pip3 install slo-generator[prometheus, datadog, dynatrace] installs the Prometheus, Datadog, and Dynatrace backends / exporters.
  • To install the slo-generator API, run pip3 install slo-generator[api].
  • To enable debug logs, set the environment variable DEBUG to 1.
  • To enable colorized output (local usage), set the environment variable COLORED_OUTPUT to 1 (see the combined example below).
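For example, a local setup that installs two providers and enables both debug logs and colorized output might look like this (the provider choice is illustrative; the quotes protect the brackets in some shells):

pip3 install "slo-generator[prometheus, cloud_monitoring]"
export DEBUG=1
export COLORED_OUTPUT=1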

CLI usage

To compute an SLO report using the CLI, run:

slo-generator compute -f <SLO_CONFIG_PATH> -c <SHARED_CONFIG_PATH> --export

where:

  • <SLO_CONFIG_PATH> is the SLO configuration file or folder path.
  • <SHARED_CONFIG_PATH> is the Shared configuration file path.
  • --export | -e enables exporting data using the exporters specified in the SLO configuration file.

Use slo-generator compute --help to list all available arguments.
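For example, a typical invocation against a folder of SLO configurations might look like this (paths are illustrative):

slo-generator compute -f slo_configs/ -c shared_config.yaml --export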

API usage

On top of the CLI, the slo-generator can also run as an API built on the Cloud Functions Framework SDK (Flask), via the api subcommand:

slo-generator api --config <SHARED_CONFIG_PATH>

where <SHARED_CONFIG_PATH> is the Shared configuration file path.

Once the API is up and running, you can make HTTP POST requests with your SLO configurations (YAML or JSON) in the request body:

curl -X POST -H "Content-Type: text/x-yaml" --data-binary @slo.yaml localhost:8080 # yaml SLO config
curl -X POST -H "Content-Type: application/json" -d @slo.json localhost:8080 # json SLO config

To read more about the API and advanced usage, see docs/shared/api.md.

Configuration

The slo-generator requires two configuration files to run: an SLO configuration file, describing your SLO, and the Shared configuration file (common configuration for all SLOs).

SLO configuration

The SLO configuration (JSON or YAML) follows the Kubernetes resource format and is composed of the following fields:

  • apiVersion: sre.google.com/v2

  • kind: ServiceLevelObjective

  • metadata:

    • name: [required] string - Full SLO name (MUST be unique).
    • labels: [optional] map - Metadata labels, for example:
      • slo_name: SLO name (e.g availability, latency128ms, ...).
      • service_name: Monitored service (to group SLOs by service).
      • feature_name: Monitored feature (to group SLOs by feature).
  • spec:

    • description: [required] string - Description of this SLO.
    • goal: [required] float - SLO goal (or target) (MUST be between 0 and 1).
    • backend: [required] string - Backend name (MUST exist in SLO Generator Configuration).
    • method: [required] string - Backend method to use (MUST exist in backend class definition).
    • service_level_indicator: [required] map - SLI configuration. The content of this section is specific to each provider, see docs/providers.
    • error_budget_policy: [optional] string - Error budget policy name (MUST exist in SLO Generator Configuration). If not specified, defaults to default.
    • exporters: [optional] list - List of exporter names (MUST exist in SLO Generator Configuration).

Note: you can use environment variables in your SLO configs by using ${MY_ENV_VAR} syntax to avoid having sensitive data in version control. Environment variables will be replaced automatically at run time.
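Putting these fields together, a minimal SLO configuration might look like the following sketch (the name, backend alias, exporter alias, and SLI filters are illustrative; the ${PROJECT_ID} placeholder shows the environment variable syntax):

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: my-service-availability
  labels:
    slo_name: availability
    service_name: my-service
    feature_name: serving
spec:
  description: 99.9% of my-service requests are successful
  goal: 0.999
  backend: cloud_monitoring/dev
  method: good_bad_ratio
  exporters:
  - bigquery/dev
  service_level_indicator:
    filter_good: >
      project="${PROJECT_ID}"
      metric.type="loadbalancing.googleapis.com/https/request_count"
      metric.label.response_code_class="200"
    filter_valid: >
      project="${PROJECT_ID}"
      metric.type="loadbalancing.googleapis.com/https/request_count"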

→ See example SLO configuration.

Shared configuration

The shared configuration (JSON or YAML) configures the slo-generator and acts as a shared config for all SLO configs. It is composed of the following fields:

  • backends: [required] map - Data backends configurations. Each backend alias is defined as a key <backend_name>/<suffix>, and a configuration map.

    backends:
      cloud_monitoring/dev:
        project_id: proj-cm-dev-a4b7
      datadog/test:
        app_key: ${APP_SECRET_KEY}
        api_key: ${API_SECRET_KEY}

    See the specific provider documentation for detailed configuration.

  • exporters: A map of exporters to export results to. Each exporter is defined as a key formatted as <exporter_name>/<optional_suffix>, and a map value detailing the exporter configuration.

    exporters:
      bigquery/dev:
        project_id: proj-bq-dev-a4b7
        dataset_id: my-test-dataset
        table_id: my-test-table
      prometheus:
        url: ${PROMETHEUS_URL}

    See the specific provider documentation for detailed configuration:

    • bigquery to export SLO reports to BigQuery for historical analysis and DataStudio reporting.
    • cloudevent to stream SLO reports to Cloudevent receivers.
    • pubsub to stream SLO reports to Pubsub.
    • cloud_monitoring to export metrics to Cloud Monitoring.
    • prometheus to export metrics to Prometheus.
    • datadog to export metrics to Datadog.
    • dynatrace to export metrics to Dynatrace.
    • <custom> to export SLO data or metrics to a custom destination.
  • error_budget_policies: [required] A map of various error budget policies.

    • <ebp_name>: Name of the error budget policy.
      • steps: List of error budget policy steps, each containing the following fields:
        • name: Name of the step (e.g. its time window in human-readable form).
        • window: Rolling time window for this error budget, in seconds.
        • burn_rate_threshold: Burn rate threshold over which alerting is needed.
        • alert: Boolean; whether breaching this threshold should trigger a page.
        • message_alert: Message to show when the error budget is overburned.
        • message_ok: Message to show when the error budget is within the target.
    error_budget_policies:
      default:
        steps:
        - name: 1 hour
          burn_rate_threshold: 9
          alert: true
          message_alert: Page to defend the SLO
          message_ok: Last hour on track
          window: 3600
        - name: 12 hours
          burn_rate_threshold: 3
          alert: true
          message_alert: Page to defend the SLO
          message_ok: Last 12 hours on track
          window: 43200

→ See example Shared configuration.

More documentation

To go further with the SLO Generator, see the additional documentation in the docs/ directory.


slo-generator's Issues

Add a YAML/JSON validator

Provide a tool/function to validate the schema of SLO_CONFIG_FILE and ERROR_BUDGET_POLICY_FILE, and use it in CI/CD.
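A minimal sketch of what such a validator could look like for the v2 SLO configuration, assuming PyYAML and a hand-rolled schema check (illustrative only; not the project's actual schema or code):

# validate_slo.py - illustrative schema check for a v2 SLO config.
import sys
import yaml  # pip3 install pyyaml

REQUIRED_SPEC_FIELDS = {
    "description": str,
    "goal": float,
    "backend": str,
    "method": str,
    "service_level_indicator": dict,
}

def validate_slo_config(path):
    """Return a list of schema errors for the SLO config at `path`."""
    with open(path) as f:
        config = yaml.safe_load(f)
    errors = []
    if config.get("kind") != "ServiceLevelObjective":
        errors.append("kind must be 'ServiceLevelObjective'")
    if not config.get("metadata", {}).get("name"):
        errors.append("metadata.name is required")
    spec = config.get("spec", {})
    for field, ftype in REQUIRED_SPEC_FIELDS.items():
        if field not in spec:
            errors.append(f"spec.{field} is required")
        elif not isinstance(spec[field], ftype):
            errors.append(f"spec.{field} must be a {ftype.__name__}")
    goal = spec.get("goal")
    if isinstance(goal, float) and not 0 < goal < 1:
        errors.append("spec.goal must be between 0 and 1")
    return errors

if __name__ == "__main__":
    problems = validate_slo_config(sys.argv[1])
    for problem in problems:
        print(f"ERROR: {problem}")
    sys.exit(1 if problems else 0)  # non-zero exit fails the CI/CD step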

Support OpenSLO specification

At the last SLOConf, Nobl9 announced OpenSLO, which aims to be an industry standard for defining reliability and performance targets using a simple YAML specification.

The OpenSLO highlights are:

  • Vendor agnostic
  • Defines a standard
  • Provides a CLI to validate the schema

To promote this proposal as a de facto standard, it would be awesome if slo-generator could produce output compliant with the OpenSLO schema.

What do you think about that?

Prometheus-exporter: metrics with the same labels get overwritten.

Hello, I'm testing a simple example with the Prometheus backend and the Prometheus exporter.

Error budget:

    error_budget_policy_ssm.yaml: |
      - error_budget_policy_step_name:  24 hours
        measurement_window_seconds:     86400
        alerting_burn_rate_threshold:   4
        urgent_notification:            true
        overburned_consequence_message: Page to defend the SLO
        achieved_consequence_message:   Last 24 hours on track

SLO:

    slo_prom_metrics_availability_query_sli.yaml: |
      service_name:    prom
      feature_name:    metrics
      slo_name:        availability
      slo_description: 99.9% of Prometheus requests return a good HTTP code
      slo_target:      0.999
      backend:
        class:         Prometheus
        method:        query_sli
        url:           ${PROMETHEUS_URL}
        measurement:
          expression:  >
            sum(rate(prometheus_http_requests_total{code=~"2.."}[window]))
            /
            sum(rate(prometheus_http_requests_total{}[window]))
      exporters:
      - class:         Prometheus
        url:           ${PROMETHEUS_PUSHGATEWAY_URL}

The only metrics present in the push gateway are slo_target and events_count.

No idea if I am doing anything wrong, but the only way I managed to get all the metrics in is #109: replacing PUT with POST when talking to the Pushgateway.
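For context, this is consistent with how the Pushgateway treats the two verbs: PUT replaces all metrics under a grouping key, while POST only replaces metrics with the same name. In the Prometheus Python client the two behaviors map to different helpers; a minimal sketch (the Pushgateway address and metric are illustrative):

from prometheus_client import (CollectorRegistry, Gauge, push_to_gateway,
                               pushadd_to_gateway)

registry = CollectorRegistry()
Gauge("slo_target", "SLO target", registry=registry).set(0.999)

# PUT: replaces *every* metric under the grouping key, so successive pushes
# of different metrics with the same job overwrite each other.
push_to_gateway("localhost:9091", job="slo-generator", registry=registry)

# POST: only replaces metrics with the same name, preserving the rest.
pushadd_to_gateway("localhost:9091", job="slo-generator", registry=registry)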

Running without arguments raises an exception

Running slo-generator without arguments raises a FileNotFoundError exception.

I think this exception should be handled! It would also be helpful to show the help message when slo-generator is run like that.

Version: 1.5.0
Example:

➜  slo-generator git:(master) slo-generator
DEBUG: 0
Traceback (most recent call last):
  File "/home/dgzlopes/.pyenv/versions/3.8.3/bin/slo-generator", line 8, in <module>
    sys.exit(main())
  File "/home/dgzlopes/.pyenv/versions/3.8.3/lib/python3.8/site-packages/slo_generator/cli.py", line 36, in main
    cli(args)
  File "/home/dgzlopes/.pyenv/versions/3.8.3/lib/python3.8/site-packages/slo_generator/cli.py", line 57, in cli
    eb_policy = utils.parse_config(eb_path)
  File "/home/dgzlopes/.pyenv/versions/3.8.3/lib/python3.8/site-packages/slo_generator/utils.py", line 94, in parse_config
    with open(path) as config:
FileNotFoundError: [Errno 2] No such file or directory: '/home/dgzlopes/workspace/k6/slo-generator/error_budget_policy.yaml'

Dynatrace backend count method

In the Dynatrace backend count method, why do you use return sum(values) / len(values)? It averages the number of events and flaws the SLO calculation (and it also sometimes returns a float).

As the Dynatrace API request returns a list with the number of events for each timestamp, we should just sum them.

The fix would be to just return sum(values)

(NB : I don't have the right to create a new branch and ask for a PR on this repo :( )

Build process should upload slo-generator image to GCR and not assume latest

When building slo-generator using cloudbuild.yaml, there are two issues I've encountered:

  1. The build process doesn't upload the image to GCR:
Finished Step #1
Starting Step #2 - "Run all tests"
Step #2 - "Run all tests": Pulling image: gcr.io/<REDACTED>/slo-generator
Step #2 - "Run all tests": Using default tag: latest
Step #2 - "Run all tests": Error response from daemon: manifest for gcr.io/<REDACTED>/slo-generator:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/<REDACTED>/slo-generator/manifests/latest".
.....
ERROR: failed to pull because we ran out of retries.
ERROR
ERROR: build step 2 "gcr.io/<REDACTED>/slo-generator" failed: error pulling build step 2 "gcr.io/<REDACTED>/slo-generator": generic::unknown: retry budget exhausted (10 attempts): step exited with non-zero status: 1
  2. The test step (id: Run all tests) assumes latest, which at best is less than ideal if you're building a custom version, and at worst doesn't work on a fresh repo because 'latest' isn't a valid tag (see the error above).

Running API in Kubernetes

Hi,
I wanted to run the generator as an API inside Kubernetes. The image I've built causes no issues when running as a job, so I figured getting the API running should pose no problems.

Unfortunately, the API exits with error code 255 right after starting. I made a few adjustments to try to find out what's causing it. First, I added a debug line before the Functions Framework's _cli is invoked. The logs looked like this:

DEBUG mode is enabled. DEBUG=1
slo_generator.cli - DEBUG - Invoking functions framework
DEBUG mode is enabled. DEBUG=1
Stream closed EOF for monitoring/slo-generator-5bf7c8dd96-j8bdp (slo-generator)

Then I thought that it was probably Gunicorn doing funny things with PIDs and exiting, so I forwarded the DEBUG flag to the Functions Framework in order to run with Werkzeug instead. And this one has me baffled. These are the logs:

slo-generator-67df458cbf-6n559 DEBUG mode is enabled. DEBUG=1
slo-generator-67df458cbf-6n559 slo_generator.cli - DEBUG - Invoking functions framework
slo-generator-67df458cbf-6n559 DEBUG mode is enabled. DEBUG=1
slo-generator-67df458cbf-6n559  * Serving Flask app 'run_compute' (lazy loading)
slo-generator-67df458cbf-6n559  * Environment: production
slo-generator-67df458cbf-6n559    WARNING: This is a development server. Do not use it in a production deployment.
slo-generator-67df458cbf-6n559    Use a production WSGI server instead.
slo-generator-67df458cbf-6n559  * Debug mode: on
slo-generator-67df458cbf-6n559 werkzeug - WARNING -  * Running on all addresses.
slo-generator-67df458cbf-6n559    WARNING: This is a development server. Do not use it in a production deployment.
slo-generator-67df458cbf-6n559 werkzeug - INFO -  * Running on http://10.83.7.250:8080/ (Press CTRL+C to quit)
slo-generator-67df458cbf-6n559 werkzeug - INFO -  * Restarting with watchdog (inotify)
slo-generator-67df458cbf-6n559 DEBUG mode is enabled. DEBUG=1
slo-generator-67df458cbf-6n559 slo_generator.cli - DEBUG - Invoking functions framework
slo-generator-67df458cbf-6n559 Traceback (most recent call last):
slo-generator-67df458cbf-6n559   File "/usr/local/bin/slo-generator", line 8, in <module>
slo-generator-67df458cbf-6n559 DEBUG mode is enabled. DEBUG=1
slo-generator-67df458cbf-6n559     sys.exit(main())
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
slo-generator-67df458cbf-6n559     return self.main(*args, **kwargs)
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/click/core.py", line 782, in main
slo-generator-67df458cbf-6n559     rv = self.invoke(ctx)
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
slo-generator-67df458cbf-6n559     return _process_result(sub_ctx.command.invoke(sub_ctx))
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
slo-generator-67df458cbf-6n559     return ctx.invoke(self.callback, **ctx.params)
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
slo-generator-67df458cbf-6n559     return callback(*args, **kwargs)
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 21, in new_func
slo-generator-67df458cbf-6n559     return f(get_current_context(), *args, **kwargs)
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/slo_generator/cli.py", line 153, in api
slo-generator-67df458cbf-6n559     ctx.invoke(_cli,
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
slo-generator-67df458cbf-6n559     return callback(*args, **kwargs)
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/functions_framework/_cli.py", line 37, in _cli
slo-generator-67df458cbf-6n559     app = create_app(target, source, signature_type)
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/functions_framework/__init__.py", line 292, in create_app
slo-generator-67df458cbf-6n559     function = _function_registry.get_user_function(source, source_module, target)
slo-generator-67df458cbf-6n559   File "/usr/local/lib/python3.9/site-packages/functions_framework/_function_registry.py", line 41, in get_user_function
slo-generator-67df458cbf-6n559     raise MissingTargetException(
slo-generator-67df458cbf-6n559 functions_framework.exceptions.MissingTargetException: File /usr/local/lib/python3.9/site-packages/slo_generator/api/main.py is expected to contain a function named http
Stream closed EOF for monitoring/slo-generator-67df458cbf-6n559 (slo-generator)

At first the app is created nicely (notice Serving Flask app 'run_compute' (lazy loading)), but then Werkzeug triggers a reload (although I don't see why it would restart the entire command), which ends with the traceback you see, where it magically wants to run http as a target instead of a signature type. I've gone through the Functions Framework's code a couple of times and I couldn't pinpoint anything that would be causing this.

When I try to debug the container, manually running the command /usr/local/bin/slo-generator api --config /etc/config/config.yaml works and runs properly, with no exits. With debug & Flask I see the same issue.

Any guesses as to what could be causing the server to exit would be a huge help.

Support custom metadata

It would be useful to have a mechanism for custom metadata, in order to support the following use cases:

  • As a user, I want to export metrics that have additional labels --> User metadata support in metrics exporter
  • As a user, I want to export SLO reports that have additional metadata --> User metadata support in SLO report exporters
  • As a user, I want to be able to define custom metadata directly in my SLO configuration YAML / JSON file

Fixed by #102

events_count labels in prometheus exporter

Hi,

did I misconfigure something, or is it intended for the Prometheus exporter to export event counts as labels? This would lead to extremely high cardinality, which is very bad for Prometheus.

BigQuery timestamp_human isn't correct

While looking at our latest SLO exports in BigQuery, we found that timestamp_human isn't correct.

Looking at it at 12:48 (Paris, France: UTC+2), it shows a time of 12:48 UTC.

That leads to a gap in our Grafana graphs.

Not sure if we can configure this, or if it's a problem in the get_human_time function:

def get_human_time(timestamp, timezone="Europe/Paris"):

Thanks for your help!

Add 'offset' field to handle metrics ingestion delays

Hi,

I'm currently trying to understand why the SLO-pipeline sees "bad events" (backend "Stackdriver") although there were no bad events when looking at the actual underlying Stackdriver metric (see screenshot below).

Example: Requests with code 201 are defined as good events, and valid events are 201 + 500..511. The chart with the log-based metrics shows that there are only requests with code 2xx and 4xx, but still the generator often exports metrics with "bad events" > 0 (values for burn rate and SLI are also affected).

Note: the service has a high rate of incoming requests, which should provide a meaningful signal during the alerting windows.

Any idea what can cause this?

SLO configuration for feature "carts-create"

service_name:     projects
feature_name:     carts-create
slo_description:  95% of cart create API HTTP responses are successful
slo_name:         availability
slo_target:       0.95
metadata:         {}
backend:
  class:          Stackdriver
  method:         good_bad_ratio
  project_id:     ${stackdriver_host_project_id}
  measurement:
    filter_good:  >
      project=${project_id}
      metric.type="logging.googleapis.com/user/carts_create_requests_total"
      metric.labels.http_status = 201
    filter_valid: >
      project=${project_id}
      metric.type="logging.googleapis.com/user/carts_create_requests_total"
      ( metric.labels.http_status = 201 OR
        metric.labels.http_status = 500 OR
        metric.labels.http_status = 501 OR
        metric.labels.http_status = 502 OR
        metric.labels.http_status = 503 OR
        metric.labels.http_status = 504 OR
        metric.labels.http_status = 505 OR
        metric.labels.http_status = 506 OR
        metric.labels.http_status = 507 OR
        metric.labels.http_status = 508 OR
        metric.labels.http_status = 509 OR
        metric.labels.http_status = 510 OR
        metric.labels.http_status = 511 )
exporters:
- class:          Stackdriver
  project_id:     ${stackdriver_host_project_id}

Metrics

  • left: error budget rate (feature: "carts-create")
  • right: request count split by label http_status -> no 5xx.

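As for the 'offset' field proposed in the title, one hypothetical shape, sketched against the v2 shared-config format shown earlier (the field is purely illustrative and does not exist at the time of this issue):

error_budget_policies:
  default:
    steps:
    - name: 1 hour
      window: 3600
      burn_rate_threshold: 9
      alert: true
      offset: 300  # hypothetical: evaluate the window ending 5 minutes in the past
      message_alert: Page to defend the SLO
      message_ok: Last hour on track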

Duplicated label `project_id` for exported Stackdriver Metrics

We recently switched to version 1.3.2 (we were running version 1.1.1 before). Since then we have noticed an issue with exporting the custom Stackdriver metrics to Prometheus (using the Prometheus stackdriver-exporter) due to a duplicated project_id label.

Example (note: the Prometheus stackdriver-exporter adds a prefix to the exported metric):

stackdriver_global_custom_googleapis_com_sli_measurement{alerting_burn_rate_threshold="9.0",error_budget_policy_step_name="1 hour",feature_name="products-update",project_id="test",project_id="test",service_name="projects",slo_name="availability",unit="",window="3600"}

There was a change recently here to update Stackdriver metric labels: 268058a#diff-0fabb485d257541a1302d5eec2d473307643ce28de6f1ffec68e7c50612b66e0R60

Note: before the version upgrade, exporting custom Stackdriver metrics like error_budget_burn_rate using the Prometheus stackdriver-exporter was working as expected.

Issue when no good events with good_bad_ratio method

Hello,

When using good_bad_ratio, what happens if one of my two filters, filter_good or filter_bad, has no data available?
For example, if my service has no problems, filter_good has data but filter_bad does not.
In this case, since the latest version of slo-generator, no SLI results are computed and sent to BigQuery.

In my example I use this log-based metric:

resource "google_logging_metric" "quality_count" {
  name        = "nfs/quality_count"
  description = "Metric based on logs provided by the Cypress Cronjob. Used to compute quality SLO on NFS"
  filter      = "resource.type=\"k8s_container\"\nresource.labels.container_name=\"crj-sre-homepage-fr\"\njsonPayload.country=\"fr\""
  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
    unit        = "1"
    labels {
      key         = "platform"
      value_type  = "STRING"
      description = "nfs environment"
    }
    labels {
      key         = "status_code"
      value_type  = "INT64"
      description = "status code"
    }
    labels {
      key         = "country"
      value_type  = "STRING"
      description = "country code"
    }
    labels {
      key         = "page"
      value_type  = "STRING"
      description = "page name"
    }
  }
  label_extractors = {
    "country"     = "EXTRACT(jsonPayload.country)",
    "page"        = "EXTRACT(jsonPayload.page)",
    "platform"    = "EXTRACT(jsonPayload.platform)",
    "status_code" = "EXTRACT(jsonPayload.status)"
  }
}

and my slo_definition is:

slo_definition:
  slo_name:        quality
  service_name:    nfs
  feature_name:    fr-hp
  slo_description: quality on nfs fr home page
  slo_target:      0.999
  backend:
    class:         Stackdriver
    project_id:    ${PROJECT_ID}
    method:        good_bad_ratio
    measurement:
      filter_good:
        resource.type="k8s_container"
        metric.type="logging.googleapis.com/user/nfs/quality_count"
        metric.label.page="homepage"
        metric.label.country="fr"
        metric.label.status_code=1
      filter_bad:
        resource.type="k8s_container"
        metric.type="logging.googleapis.com/user/nfs/quality_count"
        metric.label.page="homepage"
        metric.label.country="fr"
        metric.label.status_code=0

From Monitoring / Metrics Explorer, filter_bad has no results because no logs yet have status_code=0.

My proposal is: why not send SLI data when one of the two filters has data?
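A sketch of what that proposal could look like in a good_bad_ratio-style computation (illustrative, not the project's actual code):

def good_bad_ratio(good_counts, bad_counts):
    # Treat an empty result for one filter as zero events rather than
    # aborting; only skip the report when *both* filters return nothing.
    good = sum(good_counts) if good_counts else 0
    bad = sum(bad_counts) if bad_counts else 0
    if good + bad == 0:
        return None  # truly no data: nothing to report
    return good / (good + bad)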

Data points cannot be written more than 25h10s in the past

Hi,

The slo-pipeline Cloud Function (which exports SLO metrics to the BQ backend) suddenly started to fail and has kept crashing since then in one of our GCP projects running a slo-pipeline. All logged errors refer to an invalid interval end time value ("2020-10-20T15:33:00-07:00").

google.api_core.exceptions.InvalidArgument: 400 Field timeSeries[0].points[0].interval.end_time had an invalid value of "2020-10-20T15:33:00-07:00": Data points cannot be written more than 25h10s in the past.

Full Stacktrace:

2020-10-27 23:18:29.552 CET slo-pipeline2dxhzarq4bqc

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/env/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/env/local/lib/python3.7/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INVALID_ARGUMENT
    details = "Field timeSeries[0].points[0].interval.end_time had an invalid value of "2020-10-20T15:18:00-07:00": Data points cannot be written more than 25h10s in the past."
    debug_error_string = "{"created":"@1603837109.545536881","description":"Error received from peer ipv4:74.125.133.95:443","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Field timeSeries[0].points[0].interval.end_time had an invalid value of "2020-10-20T15:18:00-07:00": Data points cannot be written more than 25h10s in the past.","grpc_status":3}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v2.py", line 449, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v2.py", line 268, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker_v2.py", line 265, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 25, in main
    compute.export(data, exporters)
  File "/env/local/lib/python3.7/site-packages/slo_generator/compute.py", line 98, in export
    ret = exporter.export(data, **config)
  File "/env/local/lib/python3.7/site-packages/slo_generator/exporters/stackdriver.py", line 50, in export
    self.create_timeseries(data, **config)
  File "/env/local/lib/python3.7/site-packages/slo_generator/exporters/stackdriver.py", line 93, in create_timeseries
    result = self.client.create_time_series(project, [series])
  File "/env/local/lib/python3.7/site-packages/google/cloud/monitoring_v3/gapic/metric_service_client.py", line 1060, in create_time_series
    request, retry=retry, timeout=timeout, metadata=metadata
  File "/env/local/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 145, in __call__
    return wrapped_func(*args, **kwargs)
  File "/env/local/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
    on_error=on_error,
  File "/env/local/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
    return target()
  File "/env/local/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/env/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 Field timeSeries[0].points[0].interval.end_time had an invalid value of "2020-10-20T15:18:00-07:00": Data points cannot be written more than 25h10s in the past.

Any idea why we see this and how we can fix it?

docs: Re-organize documentation

  • Move the section in the README called "Extending the SLO Generator" to the "CONTRIBUTING.md" file.
  • Move the section about Datastudio to docs/deploy/datastudio_slo_report.md
  • Add a reference to the two sections above in the main "README"
  • Document color options
  • Add images to Datastudio report

[New exporter] Prometheus Remote Write

The remote write and remote read features of Prometheus allow transparently sending and receiving samples.

Right now slo-generator has support for the Prometheus Pushgateway, but I think having support for Remote Write would be really cool, because it:

  • Lets the user export metrics to a wide array of Remote Write-compatible storages.
  • Removes the need for a third component: the Pushgateway.

Inspiration: https://pypi.org/project/opentelemetry-exporter-prometheus-remote-write/

Add alert configuration

# alerting:
#   backends:
#     - name: cloud-monitoring
#       class: CloudMonitoring
#       projectId: slo-generator-ci-a2b4
#       notifications:
#         - channel-1
#         - channel-2

#     - name: prometheus-alert-manager
#       class: AlertManager
#       url: ${ALERT_MANAGER_URL}

# alerts:
#   - name: high-burn-rate
#     backend: cloud-monitoring
#     displayName: SLO burn rate too high
#     combiner: OR
#     conditions:
#     - displayName: SLO burn rate
#       conditionThreshold:
#         aggregations:
#           alignmentPeriod: 60s
#           crossSeriesReducer: REDUCE_MEAN
#           groupByFields:
#               - project
#               - resource.label.instance_id
#               - resource.label.zone
#           perSeriesAligner: ALIGN_MAX
#       comparison: COMPARISON_GT
#       duration: 900s
#       filter: >
#         metric.type=compute.googleapis.com/instance/cpu/utilization
#         resource.type=gce_instance
#       thresholdValue: auto
#       trigger:
#         count: 1

BigQuery Exporter: column alerting_burn_rate_threshold always null

As the title says, if you use the BigQuery exporter, the alerting_burn_rate_threshold column is always null (instead of containing the value from error_budget_policies.<name>.steps[x].burn_rate_threshold).

I suspect a mismatch between the field name in the configuration file and the column name.

BigQuery exporter: multiple SLO but one row

I have 3 SLOs (one per continent: Asia/Europe/America) that share the same config.

Only these 4 values change

  • metadata.name
  • metadata.labels['location']
  • spec.service_level_indicator.filter_good
  • spec.service_level_indicator.filter_valid
apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: entrypoint-ingestion-availability-asia
  labels:
    slo_name: availability
    service_name: entrypoint
    feature_name: ingestion
    scope: internal
    source: load-balancer
    location: asia-east2
    provider: gcp
spec:
  goal: 0.999
  description: xxx
  backend: cloud_monitoring/sre
  error_budget_policy: default
  exporters:
  - bigquery/sre

  method: good_bad_ratio
  service_level_indicator:
    filter_good: >
      project = "XXX"
      metric.type = "loadbalancing.googleapis.com/https/request_count"
      resource.type = "https_lb_rule"
      resource.label.backend_target_name = "entrypoint-default"
      metric.label.proxy_continent = "Asia"
      metric.label.response_code_class >= 200
      metric.label.response_code_class <= 400

    filter_valid: >
      project = "XXX"
      metric.type = "loadbalancing.googleapis.com/https/request_count"
      resource.type = "https_lb_rule"
      resource.label.backend_target_name = "entrypoint-default"
      metric.label.proxy_continent = "Asia"

When I run slo-generator compute -c config -f slo/ --export, only one row is written into BigQuery, but if I run each file independently, all the rows are correctly written into BigQuery.

Alerts in Kubernetes

Hello,
We are trying to use slo-generator in our Kubernetes environment, using the slo-generator-gke setup with Prometheus as the backend.
I can see it gives output like this:
INFO - availability-metrics | 48 hours | SLI: 41.0447 % | SLO: 99.9 % | Gap: -58.86% | BR: 589.6 / 4.0 | Alert: 1 | Good: 55 | Bad: 79
The alert is configured as true in config.yaml.
I want to understand how alerting works for Kubernetes. I was under the impression that alerting would be handled by Alertmanager, but I can't see any related alert in Alertmanager.

ModuleNotFoundError: No module named 'retrying' (chore: release 2.1.0)

Hello,
chore: release 2.1.0 (#188)

I have installed slo-generator as explained and created the slo_config & shared_config files.
I ran the command: slo-generator compute -f slo/dynatrace.json -c backends/config.yaml --export
With DEBUG mode activated, I got the error
ModuleNotFoundError: No module named 'retrying' (ERROR1)
(the module is imported in slo_generator/backends/dynatrace.py line 23).

I then installed the retrying module with pip install retrying and got this error:
Building wheel for retrying (setup.py) ... error (ERROR2)

ERROR1

slo_generator.utils - DEBUG - No module named 'retrying'
Traceback (most recent call last):
  File "/home/laurendeau/project/slo-generator-project/venv/lib/python3.8/site-packages/slo_generator/utils.py", line 337, in import_dynamic
    return getattr(importlib.import_module(package), name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/laurendeau/project/slo-generator-project/venv/lib/python3.8/site-packages/slo_generator/backends/dynatrace.py", line 23, in <module>
    from retrying import retry
ModuleNotFoundError: No module named 'retrying'

ERROR2

(venv) VirtualBox:~/project/slo-generator-project$ pip install retrying
Collecting retrying
  Using cached retrying-1.3.3.tar.gz (10 kB)
Requirement already satisfied: six>=1.7.0 in ./venv/lib/python3.8/site-packages (from retrying) (1.16.0)
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/laurendeau/project/slo-generator-project/venv/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-qxxveul1/retrying/setup.py'"'"'; __file__='"'"'/tmp/pip-install-qxxveul1/retrying/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-rzdbqjil
       cwd: /tmp/pip-install-qxxveul1/retrying/
  Complete output (6 lines):
  usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
     or: setup.py --help [cmd1 cmd2 ...]
     or: setup.py --help-commands
     or: setup.py cmd --help
  
  error: invalid command 'bdist_wheel'

  ----------------------------------------
  ERROR: Failed building wheel for retrying

v1.5.1 not available @ pypi.org?

I wanted to experiment with the API functionality; however, it appears that command is not available in v1.5.0. Is there a reason that v1.5.1 is not available to download from PyPI?
