
wg-prometheus's Introduction

OpenTelemetry Prometheus Working Group

This repository is used to track the progress of the Prometheus working group in addressing compatibility gaps between OpenTelemetry and Prometheus and improving OpenTelemetry's Prometheus support.

The progress of the group can be tracked through the issues in this repository.

wg-prometheus's People

Contributors

alolita, andreimatei, rakyll, sergeykanzhelev

wg-prometheus's Issues

Clarify the meaning and purpose of external labels

External labels were discussed in the 4/14 Prometheus-OTel-WG SIG meeting.

The Prometheus documentation describes external labels:

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

The documentation says only that the labels are "added"; no semantic interpretation is given. In practice:

  • some external labels describe the process being monitored (e.g., datacenter name)
  • some external labels describe how the process was monitored (e.g., replica name)
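As a concrete, hypothetical illustration, a global Prometheus configuration might set something like the following, where the label names are invented for the example:

  external_labels:
    datacenter: us-east-1          # describes the monitored process (descriptive)
    prometheus_replica: replica-2  # describes how it was monitored (non-identifying)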

It seems we have a mix of descriptive and non-identifying attributes. OpenTelemetry has not formally added a mechanism to distinguish different kinds of attribute, but it appears increasingly important that we do this. In today's Prometheus/Cortex environment, the backend system has to be configured to recognize duplicate streams of information. I would like for OTLP to include a formal way to encode duplicate streams of information, which means distinguishing identifying and descriptive attributes from those that are non-identifying.

The terminology used here is developed in open-telemetry/opentelemetry-specification#1298, where it seems we have three kinds of attribute: identifying (e.g., "job", "instance"), descriptive (e.g., data center, k8s node), and non-identifying (e.g., replica name).

One promising way to expose this information in the OTLP protocol is through schemas; see open-telemetry/oteps#152.

Add scrape target update endpoint to the Collector's Prometheus receiver

In order to support distribution of Prometheus scrape targets among a set of Collector instances, as required for #6, we will need a mechanism for updating the set of scrape targets used by any given Collector instance. The Prometheus receiver should be extended with a server that can process requests to update the scrape targets for a given job: it should accept a list of Prometheus static_config entries, which are written to a file the receiver watches via a file_sd_config. This service should only be active when enabled by a new configuration option for the Prometheus receiver that is disabled by default. The configuration should also allow the user to specify the port the service listens on, along with any other service configuration items that may be appropriate.
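As a rough sketch of the proposal (the file path is hypothetical and the update endpoint does not exist yet), the receiver would watch a file via file_sd_config, and the new endpoint would rewrite that file from the static_config entries it receives:

  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: 'example-job'
            file_sd_configs:
              - files: ['/var/lib/otelcol/targets/example-job.yaml']

  # Contents of /var/lib/otelcol/targets/example-job.yaml, rewritten by the
  # proposed update endpoint from the static_config entries it receives:
  - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    labels:
      team: 'example'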

This issue should be used to track the design of this service and to aggregate any other issues or PRs created during implementation.

Decide on performance benchmarking criteria

The work we planned for Phase 1 will be mainly about stability and performance. In order to achieve our performance goals, we will need to track improvements and regressions. We are considering benchmarking the entire Phase 1 pipeline (Prometheus receiver -> Collector -> Prometheus remote write exporter) and will potentially contribute micro-benchmarks as needed. We need to decide what to benchmark, which platforms to run the benchmarks on, and which dimensions to measure.

Prior work

Previously, we ran manual benchmarks on Kubernetes (EKS), on a cluster with 10 m5.8xlarge nodes. On Kubernetes, collection scales with how many jobs are running in the entire cluster and how many metrics are generated per job. The total number of jobs running in a cluster is capped by the resources available to the cluster. We used a simple app that exposes a lightweight HTTP server publishing a given number of metrics. The metrics are collected by the OTel Prometheus receiver and exported to Amazon Managed Service for Prometheus (AMP).

We published 40, 160, 400, and 1000 metrics from each server, ran 25, 50, 100, 250, and 500 replicas of the server, and measured resource usage, export rate (samples per second), and dropped vs. exported metric samples. The scraper was configured to scrape every 15 seconds, which is more aggressive than what most users will use. Scraping frequency only became a bottleneck when 1000 metrics were exported from 50+ replicas.
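For reference, the scrape settings were of roughly this shape (the job name and discovery details are illustrative, not the exact configuration used in the benchmark):

  scrape_configs:
    - job_name: 'benchmark-app'
      scrape_interval: 15s
      kubernetes_sd_configs:
        - role: pod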

This work mainly targeted Kubernetes; results might differ on platforms that use a different Prometheus service discovery mechanism.

Prometheus Service Discovery Configuration Interception

To enable the OTel Operator to perform Prometheus scrape target identification for a set of Collector instances in support of #6, we need to be able to identify and extract all *_sd_config and relabel_config entries from each scrape_config entry in the Prometheus receiver configuration. All *_sd_config entries should be replaced by a single file_sd_config entry referencing a file that can be updated by the Collector Prometheus receiver's target update mechanism (to be constructed), before the configuration is used to create a ConfigMap for Collector instances. The extracted configurations should be preserved for use by the target discovery and distribution mechanism to be built in the Operator.
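For example (job name, labels, and file path are hypothetical), a user-supplied scrape_config like this:

  scrape_configs:
    - job_name: 'kube-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]
          action: keep
          regex: my-app

would be rewritten, before being placed in the Collector ConfigMap, to something like:

  scrape_configs:
    - job_name: 'kube-pods'
      file_sd_configs:
        - files: ['/var/lib/otelcol/targets/kube-pods.yaml']

with the extracted kubernetes_sd_configs and relabel_configs retained by the Operator for target discovery and distribution.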

This issue should be used to track the design of the SD configuration interception mechanism in the OTel Operator and to aggregate any other issues or PRs created during implementation.

Support retry mechanisms similar to Prometheus server, allow fine tuning

Add StatefulSet support to OTel Operator

As outlined in #14, we want to be able to deploy the Collector in a horizontally scaled configuration as a StatefulSet. Combined with other work deriving from #6, this will enable efficient scaling to a large number of scrape targets.
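A sketch of what the custom resource might look like once this mode exists; the statefulset value is the proposal here, and the other fields are assumed to follow the existing OpenTelemetryCollector CRD shape:

  apiVersion: opentelemetry.io/v1alpha1
  kind: OpenTelemetryCollector
  metadata:
    name: prom-scrapers
  spec:
    mode: statefulset   # the new mode proposed by this issue
    replicas: 5
    config: |
      # Collector pipeline configuration (Prometheus receiver, exporters, ...)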

This issue should be used to track the design of StatefulSet management in the OTel Operator and to aggregate any other issues and PRs created during implementation.

Support write-ahead log (WAL) capabilities similar to Prometheus server

Support WAL capabilities similar to the Prometheus server's. The Grafana Cloud Agent provides a reference implementation; see https://github.com/grafana/agent

OTel Component: https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/prometheusreceiver

Also see Brian Brazil's post: https://www.robustperception.io/how-much-space-does-the-wal-take-up

@dashpole suggested that since WAL capabilities are generally useful and not only useful to a Prometheus server, it may be worth implementing them as a separate processor rather than inside the Prometheus receiver.
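Purely as an illustration of the configuration shape this could take (the wal block and its option names are hypothetical, and whether it belongs in the receiver, a dedicated processor, or the remote write exporter is exactly what this issue needs to decide):

  exporters:
    prometheusremotewrite:
      endpoint: https://example.com/api/v1/write
      # hypothetical WAL settings; names are illustrative only
      wal:
        directory: /var/lib/otelcol/wal
        truncate_frequency: 1m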

[Tracker Issue] Design a Prometheus-specific CRD for the Operator

We may build a Prometheus-specific Operator to manage Prometheus autoscaling, sharding, and other high-level deployment configuration. Design a CRD document covering deployment- and scheduling-specific configuration, and discuss it with the OpenTelemetry Operator project to identify any potential breaking changes.

Support Prometheus histograms

Prometheus uses le (less-than-or-equal) bucket bounds, while OTel uses ge (greater-than-or-equal) bucket bounds. The two are mathematically incompatible, and it is impossible to losslessly transform one into the other.
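A small worked illustration of why (observations and boundary chosen arbitrarily): take the observations 2, 5, and 9 with a single boundary at 5.

  le-bounded (Prometheus, cumulative):  count{le="5"} = 2,  count{le="+Inf"} = 3
  ge-bounded (hypothetical notation):   count{ge="5"} = 2,  count{ge="-Inf"} = 3

From the le counts alone we know that two observations are <= 5 and one is > 5, but not how many are exactly 5, so count{ge="5"} cannot be recovered from the le buckets (and vice versa).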

Which Prometheus "internal" metrics are required for conformance?

Currently, conformance tests require that the 'up' metric be present. However, there are other internal metrics, such as scrape_duration_seconds, scrape_samples_post_metric_relabeling, scrape_samples_scraped, and scrape_series_added, that a Prometheus server would also produce and that are designed to make debugging Prometheus endpoints easier.
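For reference, these are the series a Prometheus server synthesizes for each scrape target (the label values and sample values below are illustrative):

  up{job="app", instance="10.0.0.5:9100"} 1
  scrape_duration_seconds{job="app", instance="10.0.0.5:9100"} 0.042
  scrape_samples_scraped{job="app", instance="10.0.0.5:9100"} 137
  scrape_samples_post_metric_relabeling{job="app", instance="10.0.0.5:9100"} 137
  scrape_series_added{job="app", instance="10.0.0.5:9100"} 0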

I asked this question at the WG meeting today, and we weren't sure which of these would be required for conformance. There is a separate question of which are useful, but we should probably start with the ones that are required.

@RichiH agreed to raise this question with the Prometheus folks.

Prometheus Histogram edge case which we don't support

From the documentation, sum is not always present and MUST NOT be present in some scenarios; see:

Negative threshold buckets MAY be used, but then the Histogram MetricPoint MUST NOT contain a sum value as it would no longer be a counter semantically.

This is a very strange edge case to make things work with Prometheus, and I don't know how we can support this and enforce it.
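A hypothetical OpenMetrics exposition illustrating the case (metric name invented): a histogram with negative-threshold buckets exposes _bucket and _count samples but no _sum sample.

  # TYPE temperature_delta histogram
  temperature_delta_bucket{le="-5.0"} 2
  temperature_delta_bucket{le="0.0"} 5
  temperature_delta_bucket{le="5.0"} 9
  temperature_delta_bucket{le="+Inf"} 10
  temperature_delta_count 10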

One option is to have open-telemetry/opentelemetry-proto#187 in the data model. This will not work that well, because receiving an OpenMetrics point without a SUM may imply that negative buckets are "possibly" present, but that is not guaranteed, and the next point may contain a SUM? <- @RichiH is this possible?

Remote write compliance: TestRemoteWrite/otelcollector/InstanceLabel

The OpenTelemetry Collector is not passing the TestRemoteWrite/otelcollector/InstanceLabel test from https://github.com/prometheus/compliance/tree/main/remote_write.

=== CONT  TestRemoteWrite/otelcollector/InstanceLabel
    helpers.go:21:
        	Error Trace:	helpers.go:21
        	            				helpers.go:52
        	            				helpers.go:13
        	            				instance_label.go:26
        	            				main_test.go:101
        	            				main_test.go:65
        	Error:      	Should be true
        	Test:       	TestRemoteWrite/otelcollector/InstanceLabel
        	Messages:   	label 'instance' not found

PrometheusReceiver Ignores Timeseries (Histogram and Summary) Metrics without "_sum" counter

Currently, when time series data is scraped by the Prometheus scraper, it expects bucketed data and two counters along with those buckets: _count and _sum.
Some frameworks do not capture the sum and hence do not produce the _sum counter in the Prometheus exposition format. While the scraper and Prometheus itself work fine and display appropriate graphs without this counter, the Prometheus receiver expects it to be present for all such time series metrics and, if it is not, silently ignores the metric and its associated data points.
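As a hypothetical illustration (metric name invented), an exposition like the following, which has buckets and a _count but no _sum, is currently dropped by the receiver:

  # TYPE request_latency_seconds histogram
  request_latency_seconds_bucket{le="0.1"} 4
  request_latency_seconds_bucket{le="1"} 9
  request_latency_seconds_bucket{le="+Inf"} 10
  request_latency_seconds_count 10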

Is this really the desired behavior for PrometheusReceiver?

Clarify how Prometheus uses the OpenMetrics "Created" timestamp

The OpenMetrics specification states for Counter metrics:

A MetricPoint in a Metric with the type Counter SHOULD have a Timestamp value called Created. This can help ingestors discern between new metrics and long-running ones it did not see before.

A MetricPoint in a Metric's Counter's Total MAY reset to 0. If present, the corresponding Created time MUST also be set to the timestamp of the reset.
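For reference, a counter exposing the Created timestamp in the OpenMetrics text format looks roughly like this (metric name, label, and values are illustrative):

  # TYPE http_requests counter
  http_requests_total{code="200"} 1027
  http_requests_created{code="200"} 1623772800.0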

The OpenTelemetry data model agrees that this field is useful, and that it should be optional. We have argued that when the Created / Start time is not set, it is possible to miss process restarts, and thus undercount metrics for short-lived processes.

We are trying to define the proper translation into OTLP for metric points when the Created time is not known. This is relevant in https://github.com/lightstep/opentelemetry-prometheus-sidecar, which reads the WAL and writes OTLP metric streams. We believe that a Created / Start time can be filled in by any stateful observer that is able to remember the last value and its timestamp.

When a stateful observer possesses this information, we believe that the processor SHOULD fill in the missing start timestamp.

The issue here is investigatory. Does Prometheus have plans to use the OpenMetrics Created timestamp and eventually include that in its WAL?
