
OpenTelemetry Prometheus sidecar

This sidecar is no longer recommended. Please use the OpenTelemetry Collector Prometheus receiver instead; documentation on setting up and configuring the collector for Prometheus is available on the Lightstep Observability Learning Portal.

This repository contains a sidecar for the Prometheus Server that sends metrics data to an OpenTelemetry Metrics Protocol endpoint. This software is derived from the Stackdriver Prometheus Sidecar.

OpenTelemetry Prometheus Sidecar Diagram

Repository Status (2/21/2021)

This repository is being maintained by Lightstep and will be donated to OpenTelemetry. We are moving this repository into the OpenTelemetry organization and will continue development on a public fork of the upstream Stackdriver Prometheus sidecar repository.

This code base is 100% OpenTelemetry and Prometheus, not a Lightstep-specific sidecar: it reads data collected and written by Prometheus, converts it into the OpenTelemetry data model, and writes it to an OpenTelemetry endpoint.

This sidecar sends OpenTelemetry Protocol version 0.7 (or later versions) over gRPC.

Sidecar design

The sidecar includes:

  1. Prometheus write-ahead log (WAL) reader
  2. Metadata cache that tracks active instruments
  3. Configured settings:
  • Additional resource attributes to apply to all metric timeseries
  • Renaming and prefixing to change the name of metric timeseries
  • Filters to avoid reporting specific metric timeseries
  • Choice of int64 (optional) vs. double (default) protocol encoding

The sidecar operates by continually:

  1. Reading the Prometheus WAL (packages retrieval and tail);
  2. Refreshing its view of the instrument metadata (package metadata);
  3. Transforming WAL samples into OpenTelemetry Protocol (OTLP) metrics (package retrieval);
  4. Sending OTLP metrics to the destination endpoint (package otlp).

Resources for understanding how the WAL works can be found here and here.

Installation

Lightstep publishes Docker images of this binary named lightstep/opentelemetry-prometheus-sidecar:${VERSION}, with the latest release always tagged latest.
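
For example (the 0.26.0 tag below is only a placeholder for a real release version):

docker pull lightstep/opentelemetry-prometheus-sidecar:latest
docker pull lightstep/opentelemetry-prometheus-sidecar:0.26.0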

To build from source, please clone this repository. You will build a Docker image, push it to a private container registry, and then run the container as described below. To test and build a Docker image for the current operating system, run:

export DOCKER_IMAGE_NAME=my.image.reposito.ry/opentelemetry/prometheus-sidecar
export DOCKER_IMAGE_TAG=$(cat ./VERSION)
make docker
docker push ${DOCKER_IMAGE_NAME}:${DOCKER_IMAGE_TAG}

Deployment

The sidecar is deployed next to an already running Prometheus server.

An example command-line:

opentelemetry-prometheus-sidecar \
  --destination.endpoint=${DESTINATION} \
  --destination.header="lightstep-access-token=${VALUE}" \
  --destination.attribute="service.name=${SERVICE}" \
  --diagnostics.endpoint=${DIAGNOSTICS} \
  --diagnostics.header="lightstep-access-token=${VALUE}" \
  --diagnostics.attribute="service.name=${SERVICE}" \
  --prometheus.wal=${WAL} \
  --prometheus.endpoint=${PROMETHEUS}

where:

  • DESTINATION: Destination address https://host:port for sending Prometheus metrics
  • DIAGNOSTICS: Diagnostics address https://host:port for sending sidecar telemetry
  • VALUE: Value for the lightstep-access-token request header
  • SERVICE: Value for the service.name resource attribute
  • WAL: Prometheus' WAL directory, defaults to data/wal
  • PROMETHEUS: URL of the Prometheus UI.

See the list of command-line flags below.

Settings can also be passed through a configuration file. For example:

destination:
  endpoint: https://otlp.io:443
  headers:
    Custom-Header: custom-value
  attributes:
    service.name: my-service-name
diagnostics:
  endpoint: https://otlp.io:443
  headers:
    Custom-Header: custom-value
  attributes:
    service.name: my-service-name
prometheus:
  wal: /prometheus/wal
  endpoint: http://192.168.10.10:9191

See an example configuration YAML file here.

The sidecar requires write access to the directory to store its progress between restarts.

Kubernetes and Helm setup

The following sections show how to configure the sidecar for a Prometheus server installed using the Prometheus Community Helm Charts.

Sidecar with Prometheus Helm chart

To configure the core components of the Prometheus sidecar, add the following definition to your custom values.yaml:

server:
  sidecarContainers:
  - name: otel-sidecar
    image: lightstep/opentelemetry-prometheus-sidecar
    imagePullPolicy: Always
    args:
    - --prometheus.wal=/data/wal
    - --destination.endpoint=$(DESTINATION)
    - --destination.header=lightstep-access-token=AAAAAAAAAAAAAAAA
    - --diagnostics.endpoint=$(DIAGNOSTIC)
    - --diagnostics.header=lightstep-access-token=AAAAAAAAAAAAAAAA
    volumeMounts:
    - name: storage-volume
      mountPath: /data
    ports:
    - name: admin-port
      containerPort: 9091
    livenessProbe:
      httpGet:
        path: /-/health
        port: admin-port
      periodSeconds: 30
      failureThreshold: 2

Sidecar with kube-prometheus-stack Helm chart

To configure the sidecar using the Prometheus Operator via the kube-prometheus-stack helm chart:

NOTE: the volume configured in the sidecar must be the same volume that Prometheus uses

prometheus:
  prometheusSpec:
    containers:
      - name: otel-sidecar
        image: lightstep/opentelemetry-prometheus-sidecar:latest
        imagePullPolicy: Always

        args:
        - --prometheus.wal=/prometheus/prometheus-db/wal
        - --destination.endpoint=$(DESTINATION)
        - --destination.header=lightstep-access-token=AAAAAAAAAAAAAAAA
        - --diagnostics.endpoint=$(DIAGNOSTIC)
        - --diagnostics.header=lightstep-access-token=AAAAAAAAAAAAAAAA

        #####
        ports:
        - name: admin-port
          containerPort: 9091

        #####
        livenessProbe:
          httpGet:
            path: /-/health
            port: admin-port
          periodSeconds: 30
          failureThreshold: 2

        #####
        resources:
          requests:
            ephemeral-storage: "50M"

        volumeMounts:
        - name: prometheus-db
          mountPath: /prometheus

The upstream Stackdriver Prometheus sidecar Kubernetes README contains more examples of how to patch an existing Prometheus deployment or deploy the sidecar without using Helm.

Configuration

Most sidecar configuration settings can be set through command-line flags, while a few more rarely-used options are only settable through a yaml configuration file. To see all available command-line flags, run opentelemetry-prometheus-sidecar --help. The printed usage is shown below:

usage: opentelemetry-prometheus-sidecar [<flags>]

The OpenTelemetry Prometheus sidecar runs alongside the Prometheus
(https://prometheus.io/) Server and sends metrics data to an OpenTelemetry
(https://opentelemetry.io) Protocol endpoint.

Flags:
  -h, --help                     Show context-sensitive help (also try
                                 --help-long and --help-man).
      --version                  Show application version.
      --config-file=CONFIG-FILE  A configuration file.
      --destination.endpoint=DESTINATION.ENDPOINT
                                 Destination address of a OpenTelemetry Metrics
                                 protocol gRPC endpoint (e.g.,
                                 https://host:port). Use "http" (not "https")
                                 for an insecure connection.
      --destination.attribute=DESTINATION.ATTRIBUTE ...
                                 Destination resource attributes attached to
                                 OTLP data (e.g., MyResource=Value1). May be
                                 repeated.
      --destination.header=DESTINATION.HEADER ...
                                 Destination headers used for OTLP requests
                                 (e.g., MyHeader=Value1). May be repeated.
      --destination.timeout=DESTINATION.TIMEOUT
                                 Destination timeout used for OTLP Export()
                                 requests
      --destination.compression=DESTINATION.COMPRESSION
                                 Destination compression used for OTLP requests
                                 (e.g., snappy, gzip, none).
      --diagnostics.endpoint=DIAGNOSTICS.ENDPOINT
                                 Diagnostics address of a OpenTelemetry Metrics
                                 protocol gRPC endpoint (e.g.,
                                 https://host:port). Use "http" (not "https")
                                 for an insecure connection.
      --diagnostics.attribute=DIAGNOSTICS.ATTRIBUTE ...
                                 Diagnostics resource attributes attached to
                                 OTLP data (e.g., MyResource=Value1). May be
                                 repeated.
      --diagnostics.header=DIAGNOSTICS.HEADER ...
                                 Diagnostics headers used for OTLP requests
                                 (e.g., MyHeader=Value1). May be repeated.
      --diagnostics.timeout=DIAGNOSTICS.TIMEOUT
                                 Diagnostics timeout used for OTLP Export()
                                 requests
      --diagnostics.compression=DIAGNOSTICS.COMPRESSION
                                 Diagnostics compression used for OTLP requests
                                 (e.g., snappy, gzip, none).
      --prometheus.wal=PROMETHEUS.WAL
                                 Directory from where to read the Prometheus
                                 TSDB WAL. Default: data/wal
      --prometheus.endpoint=PROMETHEUS.ENDPOINT
                                 Endpoint where Prometheus hosts its UI, API,
                                 and serves its own metrics. Default:
                                 http://127.0.0.1:9090/
      --prometheus.max-point-age=PROMETHEUS.MAX-POINT-AGE
                                 Skip points older than this, to assist
                                 recovery. Default: 25h0m0s
      --prometheus.health-check-request-timeout=PROMETHEUS.HEALTH-CHECK-REQUEST-TIMEOUT  
                                 Timeout used for health-check requests to the prometheus endpoint. Default: 5s
      --prometheus.scrape-interval=PROMETHEUS.SCRAPE-INTERVAL ...
                                 Ignored. This is inferred from the Prometheus
                                 via api/v1/status/config
      --admin.port=ADMIN.PORT    Administrative port this process listens on.
                                 Default: 9091
      --admin.listen-ip=ADMIN.LISTEN-IP
                                 Administrative IP address this process listens
                                 on. Default: 0.0.0.0
      --security.root-certificate=SECURITY.ROOT-CERTIFICATE ...
                                 Root CA certificate to use for TLS connections,
                                 in PEM format (e.g., root.crt). May be
                                 repeated.
      --opentelemetry.max-bytes-per-request=OPENTELEMETRY.MAX-BYTES-PER-REQUEST
                                 Send at most this many bytes per request.
                                 Default: 65536
      --opentelemetry.min-shards=OPENTELEMETRY.MIN-SHARDS
                                 Min number of shards, i.e. amount of
                                 concurrency. Default: 1
      --opentelemetry.max-shards=OPENTELEMETRY.MAX-SHARDS
                                 Max number of shards, i.e. amount of
                                 concurrency. Default: 200
      --opentelemetry.metrics-prefix=OPENTELEMETRY.METRICS-PREFIX
                                 Customized prefix for exporter metrics. If not
                                 set, none will be used
      --filter=FILTER ...        PromQL metric and attribute matcher which must
                                 pass for a series to be forwarded to
                                 OpenTelemetry. If repeated, the series must
                                 pass any of the filter sets to be forwarded.
      --startup.timeout=STARTUP.TIMEOUT
                                 Timeout at startup to allow the endpoint to
                                 become available. Default: 10m0s
      --healthcheck.period=HEALTHCHECK.PERIOD
                                 Period for internal health checking; set at a
                                 minimum to the shortest Promethues scrape
                                 period
      --healthcheck.threshold-ratio=HEALTHCHECK.THRESHOLD-RATIO
                                 Threshold ratio for internal health checking.
                                 Default: 0.5
      --log.level=LOG.LEVEL      Only log messages with the given severity or
                                 above. One of: [debug, info, warn, error]
      --log.format=LOG.FORMAT    Output format of log messages. One of: [logfmt,
                                 json]
      --log.verbose=LOG.VERBOSE  Verbose logging level: 0 = off, 1 = some, 2 =
                                 more; 1 is automatically added when log.level
                                 is 'debug'; impacts logging from the gRPC
                                 library in particular
      --leader-election.enabled  Enable leader election to choose a single writer.
      --leader-election.k8s-namespace=LEADER-ELECTION.K8S-NAMESPACE  
                                 Namespace used for the leadership election lease.
      --disable-supervisor       Disable the supervisor.
      --disable-diagnostics      Disable diagnostics by default; if unset,
                                 diagnostics will be auto-configured to the
                                 primary destination

Two kinds of sidecar customization are available only through the configuration file. An example sidecar yaml configuration documents the available options.

Command-line flags and a configuration file can be used at the same time; command-line parameter values override configuration-file values, with one exception: settings that accept a map from string to string, including request headers and resource attributes, are combined from both sources.
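
For example, here is a sketch of how the merge behaves, assuming the configuration above is saved as a hypothetical sidecar.yaml: the endpoint given on the command line overrides the one in the file, while the header given on the command line is sent alongside the Custom-Header from the file.

opentelemetry-prometheus-sidecar \
  --config-file=sidecar.yaml \
  --destination.endpoint=https://other-otlp.example.com:443 \
  --destination.header="extra-header=extra-value"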

Startup safety

The sidecar waits until Prometheus finishes its first scrape(s) to begin processing the WAL, to ensure that target information is available before the sidecar tries loading its metadata cache.

This information is obtained through the Prometheus api/v1/status/config endpoint. The sidecar will log which intervals it is waiting for during startup. When using very long scrape intervals, raise the --startup.timeout setting so the sidecar will wait long enough to begin running.
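
For example, if some jobs use very long scrape intervals, the default 10m startup timeout can be raised (the 30m value below is only an illustration):

opentelemetry-prometheus-sidecar \
  --destination.endpoint=${DESTINATION} \
  --prometheus.wal=${WAL} \
  --startup.timeout=30m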

Throughput and memory configuration

The following settings give the user control over the amount of memory used by the sidecar for concurrent RPCs.

The sidecar uses a queue to manage distributing data points from the Prometheus WAL to a variable number of workers, referred to as "shards". Each shard assembles a limited size request, determined by --opentelemetry.max-bytes-per-request (default: 64kB).

The number of shards varies in response to load, based on the observed latency. Upper and lower bounds on the number of shards are configured by --opentelemetry.min-shards and --opentelemetry.max-shards.

CPU usage is determined by the number of shards and the workload. Use --opentelemetry.max-shards to limit the maximum CPU used by the sidecar.
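
As an illustrative sketch (the values below are arbitrary, not recommendations), a deployment that needs more throughput while bounding memory and CPU might set:

opentelemetry-prometheus-sidecar \
  --destination.endpoint=${DESTINATION} \
  --prometheus.wal=${WAL} \
  --opentelemetry.max-bytes-per-request=131072 \
  --opentelemetry.min-shards=2 \
  --opentelemetry.max-shards=50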

Validation errors

The sidecar reports validation errors using conventions established by Lightstep for conveying information about partial success when writing to the OTLP destination. These errors are returned using gRPC trailers (HTTP/2 trailing headers) and are output as metrics and logs. See the sidecar.metrics.failing metric to diagnose validation errors.

Metadata errors

The sidecar may encounter errors between itself and Prometheus, including failures to locate metadata for targets that Prometheus no longer knows about. Missing metadata, Prometheus API errors, and other forms of inconsistency are reported using sidecar.metrics.failing with key_reason and metric_name attributes.

OpenTelemetry Resource Attributes

Use the --destination.attribute=KEY=VALUE flag to add additional OpenTelemetry resource attributes to all exported timeseries.
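
For example, to attach a service name and an environment attribute to every exported timeseries (the values here are placeholders):

opentelemetry-prometheus-sidecar \
  --destination.endpoint=${DESTINATION} \
  --prometheus.wal=${WAL} \
  --destination.attribute="service.name=my-service" \
  --destination.attribute="deployment.environment=production"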

Prometheus External Labels

Prometheus external labels are used for diagnostic purposes but are not attached to exported timeseries.

Filters

The --filter flag provides filters that all series must pass before being sent to the destination. Filters use the same syntax as Prometheus instant vector selectors, e.g.:

opentelemetry-prometheus-sidecar --filter='{__name__!~"cadvisor_.+",job="k8s"}' ...

This drops all series which do not have a job label k8s and all metrics that have a name starting with cadvisor_.

For an equality filter on the metric name, you can use the simpler notation, e.g., --filter='metric_name{label="foo"}'.

The flag may be repeated to provide several sets of filters, in which case the metric will be forwarded if it matches at least one of them.
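
For example, the following sketch (with hypothetical job names) forwards series from the k8s job that are not cAdvisor metrics, plus all series from a node job:

opentelemetry-prometheus-sidecar \
  --destination.endpoint=${DESTINATION} \
  --prometheus.wal=${WAL} \
  --filter='{__name__!~"cadvisor_.+",job="k8s"}' \
  --filter='{job="node"}'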

Metric renames

To change the name of a metric as it is exported, use the metric_renames section in the configuration file:

metric_renames:
  - from: original_metric_name
    to: new_metric_name
# - ...

Static metadata

To change the output type, value type, or description of a metric instrument as it is exported, use the static_metadata section in the configuration file:

static_metadata:
  - metric: some_metric_name
    type: counter # or gauge, or histogram
    value_type: double # or int64
    help: an arbitrary help string
# - ...

Note:

  • All static_metadata entries must have type specified.
  • If value_type is specified, it will override the default value type for counters and gauges. All Prometheus metrics have a default type of double.

High Availability Prometheus Setup

In an HA Prometheus setup, a sidecar can be attached to each replica. By default, all sidecars write all metric points, and it is the responsibility of the chosen backend to deduplicate these points.

Leader election can be enabled to restrict which replica sends metrics to the Collector, reducing the amount of data transferred on the wire.

One of the sidecars is elected as the leader; it tails the Prometheus WAL, transforms the data, and sends OTLP metrics to the Collector. All other sidecars remain in standby mode: they tail the Prometheus WAL but do not send any data to the Collector.

If the leader sidecar fails, a new leader is elected and resumes sending data to the Collector.

Leader election uses the Kubernetes coordination API to elect a leader. Ensure that there is a service account for Prometheus in Kubernetes and bind it to a role with the following permissions:

rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs:
      - '*'

After the service account permissions are set up, pass the --leader-election.enabled flag to the Prometheus sidecar.

To change the namespace used for the leadership election lease, set --leader-election.k8s-namespace=LEADER-ELECTION.K8S-NAMESPACE.
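
Putting these together, a minimal sketch of enabling leader election (the monitoring namespace is only an example):

opentelemetry-prometheus-sidecar \
  --destination.endpoint=${DESTINATION} \
  --prometheus.wal=${WAL} \
  --leader-election.enabled \
  --leader-election.k8s-namespace=monitoring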

Monitoring

When run in the default configuration, the sidecar will self-monitor for liveness and kill the process when it becomes unhealthy. Sidecar liveness can be monitored directly using the /-/health endpoint. We recommend a period of 30 seconds and failureThreshold: 2, for example in your Kubernetes deployment:

    ports:
    - name: admin-port
      containerPort: 9091
    livenessProbe:
      httpGet:
        path: /-/health
        port: admin-port
      periodSeconds: 30
      failureThreshold: 2

Diagnostics

The sidecar is instrumented with the OpenTelemetry-Go SDK and runs with standard instrumentation packages, including runtime and host metrics.

By default, diagnostics are autoconfigured to the primary destination. Configuration options are available to disable or configure alternative destinations and options.

Configuration Options

Separate diagnostics settings can be configured to output OTLP, similar to configuring the primary destination. For example:

diagnostics:
  endpoint: https://otel-collector:443
  headers:
    Custom-Header: custom-value
  timeout: timeout-value
  attributes:
    extra.resource: extra-value

Likewise, these fields can be set on the command line using --diagnostics.endpoint, --diagnostics.header, --diagnostics.timeout, and --diagnostics.attribute.

Log levels

The Prometheus sidecar provides logging options to help diagnose issues.

  • We recommend starting by setting --log.level to one of debug, info, warn, or error.
  • Additional options are available to set the output format of the logs (--log.format: logfmt or json) and the amount of logging recorded (--log.verbose: 0 = off, 1 = some, 2 = more); see the example below.
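
A sketch of a debug-oriented invocation combining these flags (the other flags shown are the minimum from the deployment example above):

opentelemetry-prometheus-sidecar \
  --destination.endpoint=${DESTINATION} \
  --prometheus.wal=${WAL} \
  --log.level=debug \
  --log.format=json \
  --log.verbose=1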

Disabling Diagnostics

To disable diagnostics, set the argument flag --disable-diagnostics.

Diagnostic Outputs

The Prometheus sidecar has two main processes: a supervisor and a subordinate process. Diagnostic traces are sent from the supervisor process, which monitors lifecycle events, and diagnostic metrics are sent from the subordinate process.

Sidecar Supervisor Tracing

Traces from the supervisor process are most helpful for diagnosing unknown problems. The supervisor signals lifecycle events and captures the subordinate process's stderr, attaching it to the trace.

The sidecar outputs spans from its supervisor process with service.name=opentelemetry-prometheus-sidecar-supervisor. There are two kinds of spans: operation=health-client and operation=shutdown-report. These spans are also tagged with the attribute sidecar-health, with values ok, not yet ready, or first time ready.

Sidecar Metrics

Metrics from the subordinate process can help identify issues once the first metrics are successfully written. Three standard host and runtime metrics to monitor are listed under Host and Runtime Metrics below.

Key Success Metrics

These metrics are key to understanding the health of the sidecar. They are periodically printed to the console log to assist in troubleshooting.

| Metric Name | Type | Description | Additional Attributes |
| --- | --- | --- | --- |
| sidecar.points.produced | counter | number of points read from the prometheus WAL | |
| sidecar.points.dropped | counter | number of points dropped due to errors | key_reason: metadata, validation |
| sidecar.points.skipped | counter | number of points skipped by filters | |
| sidecar.queue.outcome | counter | outcome of the sample in the queue | outcome: success, failed, retry, aborted |
| sidecar.series.dropped | counter | number of series or metrics dropped | key_reason: metadata, validation |
| sidecar.series.current | gauge | number of series in the cache | status: live, filtered, invalid |
| sidecar.metrics.failing | gauge | failing metric names and explanations | key_reason, metric_name |

Progress Metrics

Two metrics indicate the sidecar's progress in processing the Prometheus write-ahead-log (WAL). These are measured in bytes and are based on the assumption that each WAL segment is a complete 128MB. Both of these figures are exported as current_segment_number * 128MB + segment_offset. The difference (i.e., sidecar.wal.size - sidecar.wal.offset) indicates how far the sidecar has to catch up, assuming complete WAL segments.
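
For example, assuming (hypothetically) that the writer is in segment 120 at offset 32 MB and the reader is in segment 118 at offset 96 MB:

sidecar.wal.size   = 120 * 128 MB + 32 MB = 15392 MB
sidecar.wal.offset = 118 * 128 MB + 96 MB = 15200 MB
lag                = 15392 MB - 15200 MB  =   192 MB

In this case the sidecar is roughly 192 MB, or about 1.5 segments, behind the writer.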

| Metric Name | Type | Description |
| --- | --- | --- |
| sidecar.wal.size | gauge | current writer position of the Prometheus write-ahead log |
| sidecar.wal.offset | gauge | current reader position in the Prometheus write-ahead log |

Host and Runtime Metrics

| Metric Name | Type | Description | Additional Attributes |
| --- | --- | --- | --- |
| process.cpu.time | counter | cpu seconds used | state: user, sys |
| system.network.io | counter | bytes sent and received | direction: read, write |
| runtime.go.mem.heap_alloc | gauge | memory in use | |

Internal Metrics

These metrics are diagnostic in nature, meant for characterizing performance of the code and individual Prometheus installations. Operators can safely ignore these metrics except to better understand sidecar performance.

| Metric Name | Type | Description | Additional Attributes |
| --- | --- | --- | --- |
| sidecar.connect.duration | histogram | how many attempts to connect (and how long) | error: true, false |
| sidecar.export.duration | histogram | how many attempts to export (and how long) | error: true, false |
| sidecar.monitor.duration | histogram | how many attempts to scrape Prometheus /metrics (and how long) | error: true, false |
| sidecar.metadata.fetch.duration | histogram | how many attempts to fetch metadata from Prometheus (and how long) | mode: single, batch; error: true, false |
| sidecar.queue.capacity | gauge | number of available slots for samples (i.e., points) in the queue; buffer size times current number of shards | |
| sidecar.queue.running | gauge | number of running shards, those which have not exited | |
| sidecar.queue.shards | gauge | number of current shards, as set by the queue manager | |
| sidecar.queue.size | gauge | number of samples (i.e., points) standing in a queue waiting to export | |
| sidecar.series.defined | counter | number of series defined in the WAL | |
| sidecar.metadata.lookups | counter | number of calls to lookup metadata | error: true, false |
| sidecar.refs.collected | counter | number of WAL series refs removed from memory by garbage collection | error: true, false |
| sidecar.refs.notfound | counter | number of WAL series refs that were not found during lookup | |
| sidecar.segment.opens | counter | number of WAL segment open() calls | |
| sidecar.segment.reads | counter | number of WAL segment read() calls | |
| sidecar.segment.bytes | counter | number of WAL segment bytes read | |
| sidecar.segment.skipped | counter | number of skipped WAL segments | |
| sidecar.leadership | gauge | indicates whether this sidecar is the leader | |

Upstream

This repository was copied into a private repository from this upstream fork of stackdriver-prometheus-sidecar, dated July 31, 2020.

Compatibility

The matrix below lists the versions of Prometheus Server and other dependencies that have been qualified to work with releases of opentelemetry-prometheus-sidecar. If the matrix does not state whether a combination is compatible, assume it has not been verified yet but may be compatible. Feel free to contribute to the matrix if you have run the end-to-end test between a version of opentelemetry-prometheus-sidecar and Prometheus Server.

| Sidecar Version | Compatible Prometheus Server Version(s) | Incompatible Prometheus Server Version(s) |
| --- | --- | --- |
| 0.1.x | 2.15, 2.16, 2.18, 2.19, 2.21, 2.22, 2.23, 2.24 | 2.5 |

Troubleshooting

The following describes known scenarios that can be problematic for the sidecar.

Sidecar is falling behind

In high-throughput scenarios, the sidecar may not have enough resources by default to keep up with the WAL. When this occurs, the following message appears in the sidecar's logs:

past WAL segment not found, sidecar may have dragged behind. Consider increasing min-shards, max-shards and max-bytes-per-request value

This message means that the sidecar is looking for a WAL segment file that has been removed, usually because Prometheus triggered a checkpoint. Look at the delta between sidecar.wal.size (the writer position) and sidecar.wal.offset (the reader position) to determine whether the sidecar has enough resources to keep up. If the offset falls increasingly far behind the size, increase the sidecar's throughput using the following configuration options:

  • --opentelemetry.max-bytes-per-request configures the maximum size, in bytes, of each request sent from the sidecar to the OTLP backend.
  • --opentelemetry.max-shards configures the maximum number of parallel goroutines and gRPC connections used to transmit the data.


opentelemetry-prometheus-sidecar's Issues

Graceful shutdown hangs in 0.11.0

The sidecar's supervisor is not exiting after a SIGTERM is delivered by kubernetes.

I suspect this is a result of the SIGTERM being delivered to the parent process, not the whole process group.

In any case, the result is that shutdown stalls, since the liveness and readiness handlers continue to report healthy.

Histogram metrics from Prometheus not supported in Lightstep

I have the sidecar deployed next to my prometheus server, sending metrics to lightstep.

I recently added a new metric (from a custom service) that I can see in my local prometheus, but is not showing up in lightstep. (The metric name is http_request_duration_seconds_count and my project, if helpful, is wetsnow-dev-550a21f7). I do see other metrics updating just fine though. So something is getting through.

Oddly enough, I see other metrics from this process, for example go_gc_duration_seconds and promhttp_metric_handler_requests_total.

Hmm, now that I look at it more closely, the missing metric is a histogram type:

# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="0.005"} 798
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="0.01"} 1043
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="0.025"} 1154
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="0.05"} 1183
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="0.1"} 1198
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="0.25"} 1200
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="0.5"} 1200
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="1"} 1200
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="2.5"} 1200
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="5"} 1200
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="10"} 1200
http_request_duration_seconds_bucket{code="2xx",handler="/metrics",method="GET",service="web-static",le="+Inf"} 1200
http_request_duration_seconds_sum{code="2xx",handler="/metrics",method="GET",service="web-static"} 8.134538902999997
http_request_duration_seconds_count{code="2xx",handler="/metrics",method="GET",service="web-static"} 1200

Is there something special that needs to be done to support histograms?

Here's the log from otel-sidecar showing it working:

level=info ts=2021-02-07T14:38:03.590Z caller=main.go:375 msg="Starting OpenTelemetry Prometheus sidecar" version="(version=0.13.2, branch=main, revision=b02d7fa73811b8dcce34feef0119433e1389424b)" build_context="(go=go1.15.6, [email protected], date=20210204-12:03:20)" host_details="(Linux 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64 prometheus-kube-prometheus-prometheus-0 (none))" fd_limits="(soft=1048576, hard=1048576)"
level=warn ts=2021-02-07T14:38:06.606Z caller=main.go:324 msg="Prometheus is not ready" status="503 Service Unavailable"
level=info ts=2021-02-07T14:38:08.556Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:08.591Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=warn ts=2021-02-07T14:38:09.596Z caller=main.go:324 msg="Prometheus is not ready" status="503 Service Unavailable"
level=warn ts=2021-02-07T14:38:12.596Z caller=main.go:324 msg="Prometheus is not ready" status="503 Service Unavailable"
level=info ts=2021-02-07T14:38:13.598Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:13.633Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=warn ts=2021-02-07T14:38:15.610Z caller=main.go:324 msg="Prometheus is not ready" status="503 Service Unavailable"
level=warn ts=2021-02-07T14:38:18.593Z caller=main.go:324 msg="Prometheus is not ready" status="503 Service Unavailable"
level=info ts=2021-02-07T14:38:18.636Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:18.674Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=warn ts=2021-02-07T14:38:21.634Z caller=main.go:324 msg="Prometheus is not ready" status="503 Service Unavailable"
level=info ts=2021-02-07T14:38:23.686Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:23.718Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=warn ts=2021-02-07T14:38:24.633Z caller=main.go:324 msg="Prometheus is not ready" status="503 Service Unavailable"
level=info ts=2021-02-07T14:38:28.720Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:28.752Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:38:33.754Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:33.786Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:38:38.787Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:38.831Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:38:43.832Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:43.868Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:38:48.870Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:48.905Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:38:53.906Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:53.939Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:38:58.942Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:38:58.974Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:39:03.976Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:39:04.008Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:39:09.009Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:39:09.043Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:39:14.045Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:39:14.079Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:39:19.081Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:39:19.113Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:39:24.115Z caller=supervisor.go:398 supervisor=true msg="sidecar is not ready"
level=error ts=2021-02-07T14:39:24.148Z caller=supervisor.go:189 supervisor=true msg="healthcheck failed" err="healthcheck: 503 Service Unavailable"
level=info ts=2021-02-07T14:39:27.783Z caller=main.go:272 msg="OpenTelemetry client started"
level=info ts=2021-02-07T14:39:27.784Z caller=manager.go:104 component=prom_wal msg="Starting Prometheus reader..."
level=info ts=2021-02-07T14:39:29.151Z caller=supervisor.go:388 supervisor=true msg="sidecar is ready"
level=info ts=2021-02-07T14:39:47.629Z caller=manager.go:166 component=prom_wal msg="reached first record after start offset" start_offset=59970585249 skipped_records=21387
level=info ts=2021-02-07T14:40:27.785Z caller=queue_manager.go:334 component=queue_manager msg="send queue resharding" from=1 to=2
level=info ts=2021-02-07T14:40:29.192Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=113759 queue.outcome{outcome:success}=23600
level=info ts=2021-02-07T14:40:42.785Z caller=queue_manager.go:334 component=queue_manager msg="send queue resharding" from=2 to=3
level=info ts=2021-02-07T14:40:57.786Z caller=queue_manager.go:334 component=queue_manager msg="send queue resharding" from=3 to=4
level=info ts=2021-02-07T14:41:27.784Z caller=queue_manager.go:334 component=queue_manager msg="send queue resharding" from=4 to=6
level=info ts=2021-02-07T14:41:29.333Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=521618 queue.outcome{outcome:success}=145870
level=info ts=2021-02-07T14:42:27.784Z caller=queue_manager.go:334 component=queue_manager msg="send queue resharding" from=6 to=5
level=info ts=2021-02-07T14:42:29.474Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=895346 queue.outcome{outcome:success}=262953
level=info ts=2021-02-07T14:43:29.633Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=1.084468e+06 queue.outcome{outcome:success}=322901
level=info ts=2021-02-07T14:44:29.782Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=1.298271e+06 queue.outcome{outcome:success}=391501
level=info ts=2021-02-07T14:44:42.785Z caller=queue_manager.go:334 component=queue_manager msg="send queue resharding" from=5 to=6
level=info ts=2021-02-07T14:45:29.914Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=1.499735e+06 queue.outcome{outcome:success}=455650
level=info ts=2021-02-07T14:45:57.788Z caller=queue_manager.go:334 component=queue_manager msg="send queue resharding" from=6 to=4
level=info ts=2021-02-07T14:46:30.088Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=1.701212e+06 queue.outcome{outcome:success}=520614
level=info ts=2021-02-07T14:47:12.784Z caller=queue_manager.go:334 component=queue_manager msg="send queue resharding" from=4 to=3
level=info ts=2021-02-07T14:47:30.238Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=1.902694e+06 queue.outcome{outcome:success}=584623
level=info ts=2021-02-07T14:48:27.790Z caller=queue_manager.go:334 component=queue_manager msg="send queue resharding" from=3 to=2
level=info ts=2021-02-07T14:48:30.388Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=2.104179e+06 queue.outcome{outcome:success}=649047
level=info ts=2021-02-07T14:49:30.620Z caller=supervisor.go:409 supervisor=true msg="sidecar is running" samples.processed{}=2.293325e+06 queue.outcome{outcome:success}=706858

Metadata conflict testing needed

I have observed a situation where a Histogram point being switched to a Gauge point causes the sidecar to enter an infinite loop in retrieval/transform.go, because the call to buildHistogram takes 0 points and leaves the caller to attempt the same thing. We should return an error when consumed is 0.

This is not a problem for Prometheus because the targets that produced the conflicting point kinds are distinguished by job and instance labels.

Label skipped timeseries data points by cause

There are several reasons why the sidecar will skip data points, among them cumulative timeseries reset, max-age-exceeded, and so on. These are not well labeled in the output diagnostics, although the total number of skipped points can be observed.

See the number of causes in retrieval/transform.go.

Update prometheus/prometheus library dependency

This is difficult because the project, while written in Go, is not expected to be used as a library by most users. The way Go modules work prevents the use of semantic version labels v2.x without using certain module paths. I need to work around this, somehow, in order to implement code newer than Prometheus v2.0.5, since that is the version at which they added a go.mod file. Something has to be done, but it's not clear how to work around this without forking Prometheus.

This is important because without upgrading, we cannot switch to using the OpenTelemetry SDK for telemetry.

tail/tail.go `openSegment` makes assumptions about order of results from ReadDir

The body of this function reads:

func openSegment(dir string, n int) (io.ReadCloser, error) {
	files, err := fileutil.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	for _, fn := range files {
		k, err := strconv.Atoi(fn)
		if err != nil || k < n {
			continue
		}
		if k > n {
			return nil, errors.Errorf("next segment %d too high, expected %d", n, k)
		}
		return wal.OpenReadSegment(filepath.Join(dir, fn))
	}
	return nil, tsdb.ErrNotFound
}

The purpose of this function is to find a filename with the matching number of unspecified width. The fix is to treat k != n as a non-error (i.e., continue).

Restore error-handling specific tests

There were several tests in client_test.go that were very Stackdriver-specific and/or depended on OpenCensus. These should be restored based on OTel mock trace and metrics libraries, ideally.

Key/value params aren't parsed as expected

Currently, passing a destination.header parameter that is quoted causes unexpected behaviour.

--destination.header="myheader=myvalue" will parse into: "myheader:myvalue"

This can cause gRPC protocol issues.

Sidecar crashes: corruption "unexpected non-zero byte in padded page"

My sidecar process seems to be stuck in a crash loop. I am not sure how to deal with this further. prometheus itself seems to be running.

This happened yesterday as well. I disabled the sidecar, waited a few hours, re-enabled it, and everything was working.

My understanding is that prometheus is supposed to fix any corruptions itself if I restart it, which doesn't seem to be happening here. I also don't see any errors in the prometheus container logs.

Any idea how to handle this?

level=info ts=2021-02-07T18:44:55.592Z caller=main.go:272 msg="OpenTelemetry client started"
level=info ts=2021-02-07T18:44:55.592Z caller=manager.go:104 component=prom_wal msg="Starting Prometheus reader..."
level=info ts=2021-02-07T18:44:57.931Z caller=supervisor.go:388 supervisor=true msg="sidecar is ready"
level=info ts=2021-02-07T18:45:09.526Z caller=manager.go:166 component=prom_wal msg="reached first record after start offset" start_offset=60326608895 skipped_records=11351
level=info ts=2021-02-07T18:45:23.875Z caller=manager.go:221 component=prom_wal msg="Done processing WAL."
level=info ts=2021-02-07T18:45:23.875Z caller=main.go:254 msg="Prometheus reader stopped"
level=info ts=2021-02-07T18:45:23.875Z caller=main.go:260 msg="Stopping Prometheus reader..."
level=info ts=2021-02-07T18:45:23.875Z caller=queue_manager.go:218 component=queue_manager msg="Stopping remote storage..."
level=info ts=2021-02-07T18:45:23.875Z caller=queue_manager.go:226 component=queue_manager msg="Remote storage stopped."
level=error ts=2021-02-07T18:45:23.875Z caller=main.go:285 err="corruption after 126484480 bytes: unexpected non-zero byte in padded page"
level=error ts=2021-02-07T18:45:24.055Z caller=supervisor.go:97 supervisor=true msg="sidecar failed" err="exit status 1"

Here are the logs from the pod around the time of failure:


{"log":"level=info ts=2021-02-07T16:50:41.898Z caller=supervisor.go:409 supervisor=true msg=\"sidecar is running\" samples.processed{}=1.1418224e+07 queue.outcome{outcome:success}=3.599843e+06\n","stream":"stderr","time":"2021-02-07T16:50:41.898720215Z"}
--
  |   | 2021-02-07 11:51:20 | {"log":"level=info ts=2021-02-07T16:51:19.785Z caller=queue_manager.go:334 component=queue_manager msg=\"send queue resharding\" from=3 to=2\n","stream":"stderr","time":"2021-02-07T16:51:19.78590219Z"}
  |   | 2021-02-07 11:51:42 | {"log":"level=info ts=2021-02-07T16:51:41.932Z caller=supervisor.go:409 supervisor=true msg=\"sidecar is running\" samples.processed{}=1.1619205e+07 queue.outcome{outcome:success}=3.663843e+06\n","stream":"stderr","time":"2021-02-07T16:51:41.932562028Z"}
  |   | 2021-02-07 11:52:42 | {"log":"level=info ts=2021-02-07T16:52:42.064Z caller=supervisor.go:409 supervisor=true msg=\"sidecar is running\" samples.processed{}=1.1820151e+07 queue.outcome{outcome:success}=3.727916e+06\n","stream":"stderr","time":"2021-02-07T16:52:42.064678513Z"}
  |   | 2021-02-07 11:52:45 | {"log":"level=info ts=2021-02-07T16:52:45.184Z caller=manager.go:221 component=prom_wal msg=\"Done processing WAL.\"\n","stream":"stderr","time":"2021-02-07T16:52:45.185012373Z"}
  |   | 2021-02-07 11:52:45 | {"log":"level=info ts=2021-02-07T16:52:45.184Z caller=main.go:254 msg=\"Prometheus reader stopped\"\n","stream":"stderr","time":"2021-02-07T16:52:45.18505515Z"}
  |   | 2021-02-07 11:52:45 | {"log":"level=info ts=2021-02-07T16:52:45.184Z caller=main.go:260 msg=\"Stopping Prometheus reader...\"\n","stream":"stderr","time":"2021-02-07T16:52:45.185424547Z"}
  |   | 2021-02-07 11:52:45 | {"log":"level=info ts=2021-02-07T16:52:45.184Z caller=queue_manager.go:218 component=queue_manager msg=\"Stopping remote storage...\"\n","stream":"stderr","time":"2021-02-07T16:52:45.186757534Z"}
  |   | 2021-02-07 11:52:45 | {"log":"level=info ts=2021-02-07T16:52:45.184Z caller=queue_manager.go:226 component=queue_manager msg=\"Remote storage stopped.\"\n","stream":"stderr","time":"2021-02-07T16:52:45.186778315Z"}
  |   | 2021-02-07 11:52:45 | {"log":"level=error ts=2021-02-07T16:52:45.185Z caller=main.go:285 err=\"corruption after 260538368 bytes: unexpected non-zero byte in padded page\"\n","stream":"stderr","time":"2021-02-07T16:52:45.187218782Z"}
  |   | 2021-02-07 11:52:45 | {"log":"level=error ts=2021-02-07T16:52:45.347Z caller=supervisor.go:97 supervisor=true msg=\"sidecar failed\" err=\"exit status 1\"\n","stream":"stderr","time":"2021-02-07T16:52:45.348109793Z"}
  |   | 2021-02-07 11:52:46 | {"log":"level=info ts=2021-02-07T16:52:46.573201004Z caller=operator.go:1161 component=prometheusoperator msg=\"sync prometheus\" key=monitoring/kube-prometheus-prometheus\n","stream":"stdout","time":"2021-02-07T16:52:46.573568658Z"}
  |   | 2021-02-07 11:52:47 | {"log":"level=info ts=2021-02-07T16:52:47.645Z caller=main.go:375 msg=\"Starting OpenTelemetry Prometheus sidecar\" version=\"(version=0.13.2, branch=main, revision=b02d7fa73811b8dcce34feef0119433e1389424b)\" build_context=\"(go=go1.15.6, [email protected], date=20210204-12:03:20)\" host_details=\"(Linux 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64 prometheus-kube-prometheus-prometheus-0 (none))\" fd_limits=\"(soft=1048576, hard=1048576)\"\n","stream":"stderr","time":"2021-02-07T16:52:47.645983216Z"}
  |   | 2021-02-07 11:52:48 | {"log":"level=info ts=2021-02-07T16:52:48.283138183Z caller=operator.go:1161 component=prometheusoperator msg=\"sync prometheus\" key=monitoring/kube-prometheus-prometheus\n","stream":"stdout","time":"2021-02-07T16:52:48.283704047Z"}
  |   | 2021-02-07 11:52:52 | {"log":"level=info ts=2021-02-07T16:52:52.607Z caller=supervisor.go:398 supervisor=true msg=\"sidecar is not ready\"\n","stream":"stderr","time":"2021-02-07T16:52:52.607620073Z"}
  |   | 2021-02-07 11:52:52 | {"log":"level=error ts=2021-02-07T16:52:52.638Z caller=supervisor.go:189 supervisor=true msg=\"healthcheck failed\" err=\"healthcheck: 503 Service Unavailable\"\n","stream":"stderr","time":"2021-02-07T16:52:52.638880154Z"}
  |   | 2021-02-07 11:52:57 | {"log":"level=info ts=2021-02-07T16:52:57.639Z caller=supervisor.go:398 supervisor=true msg=\"sidecar is not ready\"\n","stream":"stderr","time":"2021-02-07T16:52:57.639915602Z"}
  |   | 2021-02-07 11:52:57 | {"log":"level=error ts=2021-02-07T16:52:57.672Z caller=supervisor.go:189 supervisor=true msg=\"healthcheck failed\" err=\"healthcheck: 503 Service Unavailable\"\n","stream":"stderr","time":"2021-02-07T16:52:57.672399457Z"}

Here's the prometheus log since a recent restart if that helps at all:

level=info ts=2021-02-07T15:54:19.288Z caller=main.go:364 msg="Starting Prometheus" version="(version=2.24.1, branch=HEAD, revision=e4487274853c587717006eeda8804e597d120340)"
level=info ts=2021-02-07T15:54:19.288Z caller=main.go:369 build_context="(go=go1.15.6, user=root@0b5231a0de0f, date=20210120-00:09:36)"
level=info ts=2021-02-07T15:54:19.288Z caller=main.go:370 host_details="(Linux 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64 prometheus-kube-prometheus-prometheus-0 (none))"
level=info ts=2021-02-07T15:54:19.288Z caller=main.go:371 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-02-07T15:54:19.288Z caller=main.go:372 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-02-07T15:54:19.295Z caller=web.go:530 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-02-07T15:54:19.296Z caller=main.go:738 msg="Starting TSDB ..."
level=info ts=2021-02-07T15:54:19.307Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-02-07T15:54:19.310Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1611835200034 maxt=1611900000000 ulid=01EX6WT5X8MH9PG49HT3M3VCV4
level=info ts=2021-02-07T15:54:19.311Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1611900000051 maxt=1611964800000 ulid=01EX8TKGH9H7VSPKK5XJJDG8ZM
level=info ts=2021-02-07T15:54:19.317Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1611964800025 maxt=1612029600000 ulid=01EXARD8ZEN824ZQWG3RX3F78R
level=info ts=2021-02-07T15:54:19.320Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612029600053 maxt=1612094400000 ulid=01EXCP6H20XPDW9NKEF42MXM07
level=info ts=2021-02-07T15:54:19.321Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612094400104 maxt=1612159200000 ulid=01EXEM0CKYA2P2X3RMDTRCXYRM
level=info ts=2021-02-07T15:54:19.322Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612159200060 maxt=1612224000000 ulid=01EXGHSPCJEZW92YNGH84XG1R6
level=info ts=2021-02-07T15:54:19.322Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612224000059 maxt=1612288800000 ulid=01EXJFKF39XD9AWB8Y306ADREB
level=info ts=2021-02-07T15:54:19.323Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612288800056 maxt=1612353600000 ulid=01EXMDCRJB6CY1486F059DMT2R
level=info ts=2021-02-07T15:54:19.324Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612353600083 maxt=1612418400000 ulid=01EXPB6H9040M2KN1HY5SH3Y4A
level=info ts=2021-02-07T15:54:19.325Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612418400158 maxt=1612483200000 ulid=01EXR8ZJ11Q815XG7DWKHYYBFZ
level=info ts=2021-02-07T15:54:19.326Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612504800035 maxt=1612512000000 ulid=01EXRPPT2Z6M0KKYP1YZ7CDSWQ
level=info ts=2021-02-07T15:54:19.327Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612483200000 maxt=1612504800000 ulid=01EXRPPZSDV6ZKKQVHTSH67C8G
level=info ts=2021-02-07T15:54:19.328Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612512000017 maxt=1612519200000 ulid=01EXRXJGZKMFZTQR4RV79V203S
level=info ts=2021-02-07T15:54:19.329Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612519200022 maxt=1612526400000 ulid=01EXS4E89492CCK205MJ2SRY9W
level=info ts=2021-02-07T15:54:19.331Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612526400035 maxt=1612548000000 ulid=01EXT6S1DD10AYSY8YRQH3X2XP
level=info ts=2021-02-07T15:54:19.333Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612548000000 maxt=1612612800000 ulid=01EXVXPZFBHB6Q3VBZXZMB9HY3
level=info ts=2021-02-07T15:54:19.334Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612677600040 maxt=1612684800000 ulid=01EXXVG835NCBXAGN0W92D16RN
level=info ts=2021-02-07T15:54:19.335Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612684800035 maxt=1612692000000 ulid=01EXY2BZ0GJHSR97XHA5PCE96A
level=info ts=2021-02-07T15:54:19.337Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612612800035 maxt=1612677600000 ulid=01EXY2CSJ7NEZPW4QTEPNC47HC
level=info ts=2021-02-07T15:54:19.338Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612692000035 maxt=1612699200000 ulid=01EXY97PR1GYE3626XT82053ZS
level=info ts=2021-02-07T15:54:19.341Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1612699200000 maxt=1612706400000 ulid=01EXYG3DGC0C1AWCDHKS3JNM2J
level=info ts=2021-02-07T15:54:19.831Z caller=head.go:645 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-02-07T15:54:20.312Z caller=head.go:659 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=481.774886ms
level=info ts=2021-02-07T15:54:20.313Z caller=head.go:665 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-02-07T15:54:22.024Z caller=head.go:691 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2021-02-07T15:54:26.097Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=446 maxSegment=449
level=info ts=2021-02-07T15:54:27.376Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=447 maxSegment=449
level=info ts=2021-02-07T15:54:31.588Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=448 maxSegment=449
level=info ts=2021-02-07T15:54:31.589Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=449 maxSegment=449
level=info ts=2021-02-07T15:54:31.589Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=1.709557673s wal_replay_duration=9.564408403s total_replay_duration=11.758136451s
level=info ts=2021-02-07T15:54:31.795Z caller=main.go:758 fs_type=EXT4_SUPER_MAGIC
level=info ts=2021-02-07T15:54:31.795Z caller=main.go:761 msg="TSDB started"
level=info ts=2021-02-07T15:54:31.795Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2021-02-07T15:54:31.815Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.833Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.836Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.837Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.838Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.839Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.840Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.841Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.843Z caller=kubernetes.go:264 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.845Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=49.570546ms remote_storage=4.57µs web_handler=1.415µs query_engine=2.295µs scrape=1.01953ms scrape_sd=29.130118ms notify=33.572µs notify_sd=1.45868ms rules=472.205µs
level=info ts=2021-02-07T15:54:31.845Z caller=main.go:710 msg="Server is ready to receive web requests."
level=info ts=2021-02-07T15:54:31.847Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=error ts=2021-02-07T15:54:31.871Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: failed to list *v1.Service: Get \"https://10.245.0.1:443/api/v1/namespaces/kube-system/services?limit=500&resourceVersion=0\": context canceled"
level=error ts=2021-02-07T15:54:31.871Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get \"https://10.245.0.1:443/api/v1/namespaces/kube-system/endpoints?limit=500&resourceVersion=0\": context canceled"
level=info ts=2021-02-07T15:54:31.874Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.877Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.880Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.882Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.884Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.887Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.889Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.893Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.895Z caller=kubernetes.go:264 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-02-07T15:54:31.896Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=49.044377ms remote_storage=613.709µs web_handler=1.087µs query_engine=1.91µs scrape=116.957µs scrape_sd=33.877575ms notify=17.999µs notify_sd=1.230593ms rules=60.574µs
level=error ts=2021-02-07T15:54:31.906Z caller=klog.go:104 component=k8s_client_runtime func=Errorf msg="Unexpected error when reading response body: context canceled"
level=info ts=2021-02-07T17:00:08.726Z caller=compact.go:507 component=tsdb msg="write block" mint=1612706400000 maxt=1612713600000 ulid=01EXYPZ4S0GXFGCB302M5ZRBA8 duration=8.566188553s
level=info ts=2021-02-07T17:00:09.111Z caller=head.go:824 component=tsdb msg="Head GC completed" duration=289.024627ms
level=info ts=2021-02-07T17:00:09.177Z caller=checkpoint.go:95 component=tsdb msg="Creating checkpoint" from_segment=446 to_segment=447 mint=1612713600000
level=info ts=2021-02-07T17:00:13.289Z caller=head.go:921 component=tsdb msg="WAL checkpoint complete" first=446 last=447 duration=4.112404775s
level=info ts=2021-02-07T17:00:20.843Z caller=compact.go:448 component=tsdb msg="compact blocks" count=3 mint=1612677600040 maxt=1612699200000 ulid=01EXYPZHMM7J7XNYMVHJ1S2KPS sources="[01EXXVG835NCBXAGN0W92D16RN 01EXY2BZ0GJHSR97XHA5PCE96A 01EXY97PR1GYE3626XT82053ZS]" duration=7.510860831s
level=info ts=2021-02-07T17:00:20.913Z caller=db.go:1191 component=tsdb msg="Deleting obsolete block" block=01EXY2BZ0GJHSR97XHA5PCE96A
level=info ts=2021-02-07T17:00:20.923Z caller=db.go:1191 component=tsdb msg="Deleting obsolete block" block=01EXXVG835NCBXAGN0W92D16RN
level=info ts=2021-02-07T17:00:20.935Z caller=db.go:1191 component=tsdb msg="Deleting obsolete block" block=01EXY97PR1GYE3626XT82053ZS
level=warn ts=2021-02-07T18:34:13.274Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:429: watch of *v1.Pod ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:13.232Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:429: watch of *v1.Pod ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:13.213Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:427: watch of *v1.Endpoints ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:14.790Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:428: watch of *v1.Service ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:13.914Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:428: watch of *v1.Service ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:15.219Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:428: watch of *v1.Service ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:15.930Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:427: watch of *v1.Endpoints ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=error ts=2021-02-07T18:34:16.256Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: Get \"https://10.245.0.1:443/api/v1/services?allowWatchBookmarks=true&resourceVersion=9282352&timeout=8m18s&timeoutSeconds=498&watch=true\": http2: client connection lost"
level=warn ts=2021-02-07T18:34:16.370Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:427: watch of *v1.Endpoints ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=error ts=2021-02-07T18:34:16.334Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:427: Failed to watch *v1.Endpoints: Get \"https://10.245.0.1:443/api/v1/namespaces/db/endpoints?allowWatchBookmarks=true&resourceVersion=9317062&timeout=9m21s&timeoutSeconds=561&watch=true\": http2: client connection lost"
level=warn ts=2021-02-07T18:34:16.794Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:428: watch of *v1.Service ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=error ts=2021-02-07T18:34:16.723Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: Get \"https://10.245.0.1:443/api/v1/namespaces/auth/services?allowWatchBookmarks=true&resourceVersion=9282352&timeout=8m57s&timeoutSeconds=537&watch=true\": http2: client connection lost"
level=error ts=2021-02-07T18:34:20.115Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: Get \"https://10.245.0.1:443/api/v1/namespaces/monitoring/services?allowWatchBookmarks=true&resourceVersion=9282352&timeout=9m9s&timeoutSeconds=549&watch=true\": http2: client connection lost"
level=error ts=2021-02-07T18:34:20.210Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="/app/discovery/kubernetes/kubernetes.go:428: Failed to watch *v1.Service: Get \"https://10.245.0.1:443/api/v1/namespaces/monitoring/services?allowWatchBookmarks=true&resourceVersion=9282352&timeout=5m55s&timeoutSeconds=355&watch=true\": http2: client connection lost"
level=warn ts=2021-02-07T18:34:30.839Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:429: watch of *v1.Pod ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:32.198Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:429: watch of *v1.Pod ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:32.933Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:427: watch of *v1.Endpoints ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:32.808Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:427: watch of *v1.Endpoints ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:35.046Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:429: watch of *v1.Pod ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"
level=warn ts=2021-02-07T18:34:39.218Z caller=klog.go:88 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:429: watch of *v1.Pod ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"

Use a fixed resource and instrumentation library

The existing code, derived from the Stackdriver sidecar, performs no batching by resource, which would at the very least offer a form of compression when sending metric data. In fact, the existing otlp/queue_manager.go does the opposite: it prevents two timeseries with the same hash from appearing in a single request. This is not an OTLP requirement and can result in unnecessarily small requests.

Removing the shard[].seen map will improve performance by allowing larger requests.
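
A minimal sketch of what batching by resource could look like, assuming hypothetical series and point types rather than the sidecar's actual WAL/OTLP representations:

```go
package main

import "fmt"

// Hypothetical types for illustration; the sidecar has its own representations.
type point struct {
	timestampMs int64
	value       float64
}

type series struct {
	resourceKey string // e.g. a stable hash of the resource attributes
	metricName  string
	points      []point
}

// groupByResource collects every series that shares a resource, so that a
// single export request can carry one ResourceMetrics entry per resource
// instead of restricting which series may share a request.
func groupByResource(batch []series) map[string][]series {
	groups := make(map[string][]series, len(batch))
	for _, s := range batch {
		groups[s.resourceKey] = append(groups[s.resourceKey], s)
	}
	return groups
}

func main() {
	batch := []series{
		{resourceKey: "host=a", metricName: "http_requests_total"},
		{resourceKey: "host=a", metricName: "process_cpu_seconds_total"},
		{resourceKey: "host=b", metricName: "http_requests_total"},
	}
	for key, group := range groupByResource(batch) {
		fmt.Printf("%s -> %d series in one ResourceMetrics\n", key, len(group))
	}
}
```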

Stuck sender shards can block the reader

An indefinitely stuck send, whether caused by an unrecoverable error being reported as recoverable or by persistently long timeouts, has the potential to block the sender shard and, through it, the WAL reader.

The problem is mitigated in #87, but not fully resolved. Reproducing it requires a misbehaving OTLP server, and it has been reproduced using the stresstest program included in that PR together with a deliberately faulty test server.

#87 added a backstop that aborts sends after they have failed persistently for several export timeouts. This guarantees progress, but not good progress if whatever triggers the failures occurs frequently enough.

The current behavior ensures that the sidecar does not get stuck outright, but while this is happening the queue manager's resharding logic begins reducing the number of shards, which can leave the process nearly stuck even after #87.
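
A rough sketch of the backstop idea, using only the standard library; the function name, timeouts, and retry loop are illustrative and are not the code added in #87:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sendWithBackstop retries a single send but gives up entirely once an
// overall deadline of several export timeouts has passed, so a misbehaving
// OTLP server cannot pin a sender shard indefinitely.
func sendWithBackstop(ctx context.Context, exportTimeout time.Duration, maxAttempts int,
	send func(context.Context) error) error {

	backstop, cancel := context.WithTimeout(ctx, time.Duration(maxAttempts)*exportTimeout)
	defer cancel()

	var lastErr error
	for backstop.Err() == nil {
		attempt, cancelAttempt := context.WithTimeout(backstop, exportTimeout)
		lastErr = send(attempt)
		cancelAttempt()
		if lastErr == nil {
			return nil
		}
		time.Sleep(10 * time.Millisecond) // simplistic backoff between attempts
	}
	return fmt.Errorf("aborting stuck send after backstop (%v): last error: %w",
		backstop.Err(), lastErr)
}

func main() {
	err := sendWithBackstop(context.Background(), 100*time.Millisecond, 3,
		func(ctx context.Context) error {
			<-ctx.Done() // simulate a server that never responds
			return ctx.Err()
		})
	fmt.Println(err)
}
```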

Reported validation error: start time after end time

We have an unconfirmed report that data points sent from the sidecar have a start time after their end time. As printed in the sidecar log when sending to Lightstep:

level=warn ts=2021-01-28T21:02:58.422Z caller=queue_manager.go:506 component=queue_manager msg="unrecoverable write error" err="rpc error: code = InvalidArgument desc = [metric name]: must be positive for kind=delta|cumulative]"

This should be impossible; we should add a check in the sidecar to be absolutely sure.
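
A check of this kind would be cheap to add before points are queued for export; a hypothetical sketch:

```go
package main

import (
	"fmt"
	"time"
)

// validatePoint rejects a cumulative or delta point whose start time is not
// strictly before its end time, so such points are dropped (and counted)
// instead of being sent to the destination.
func validatePoint(start, end time.Time) error {
	if !start.Before(end) {
		return fmt.Errorf("invalid point: start %s is not before end %s", start, end)
	}
	return nil
}

func main() {
	now := time.Now()
	fmt.Println(validatePoint(now, now.Add(-time.Second))) // rejected
	fmt.Println(validatePoint(now.Add(-time.Minute), now)) // <nil>: valid
}
```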

Restore validation-errors test

A new test introduced in #157 has been disabled because although it passes reliably in a development environment (including with the race detector), it reliably times out in CI. See cmd/opentelemetry-prometheus-sidecar/validation_test.go.

Throughput limits are difficult to configure

The existing approach uses a separate gRPC connection per shard and allows configuring a limit on the number of shards, but not on the amount of memory used by gRPC. This makes it difficult to bound the sidecar's memory use while still giving users good options for configuring high-throughput sidecars.
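
One direction, sketched here with standard grpc-go dial options (not the sidecar's actual configuration), is to multiplex all shards over a single connection and bound message sizes explicitly, so memory use is roughly the number of concurrent requests times the per-message limit:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	const maxMsgBytes = 4 << 20 // example bound; per-request memory becomes predictable

	// A single ClientConn multiplexes all export RPCs over one HTTP/2
	// connection, instead of opening one gRPC connection per shard.
	conn, err := grpc.Dial("otlp.example.com:4317",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallSendMsgSize(maxMsgBytes),
			grpc.MaxCallRecvMsgSize(maxMsgBytes),
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Every sender shard would share conn for its export calls.
}
```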

Sidecar crashing

I am not sure why, but the sidecar is crashing for me. This is the full log:

level=info ts=2021-03-09T03:42:58.984Z caller=main.go:364 msg="starting OpenTelemetry Prometheus sidecar" version="(version=0.18.3, branch=main, revision=a99cc6dae6eba74f1875252c1f051d9a0b3c8fed)" build_context="(go=go1.15.8, user=runner@fv-az12-20, date=20210305-18:06:26)" host_details="(Linux 5.4.89+ #1 SMP Thu Jan 28 22:59:01 PST 2021 x86_64 prometheus-prom0-kube-prometheus-prometheus-0 (none))" fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-03-09T03:43:00.065Z caller=main.go:277 msg="starting OpenTelemetry writer"
level=info ts=2021-03-09T03:43:00.065Z caller=main.go:254 msg="starting Prometheus reader"
level=info ts=2021-03-09T03:43:00.065Z caller=manager.go:100 component=prom_wal msg="starting Prometheus reader"
level=warn ts=2021-03-09T03:43:00.072Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=apiserver_audit_event_total
level=info ts=2021-03-09T03:43:03.867Z caller=supervisor.go:396 supervisor=true msg="sidecar is initializing"
level=warn ts=2021-03-09T03:43:05.073Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=node_softnet_dropped_total
level=warn ts=2021-03-09T03:43:10.164Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=kube_node_status_condition
level=warn ts=2021-03-09T03:43:15.168Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=kube_ingress_info
level=warn ts=2021-03-09T03:43:20.267Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=ambassador_edge_stack_go_memstats_alloc_bytes_total
level=warn ts=2021-03-09T03:43:25.277Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=wildfly_ee_hung_thread_count
level=warn ts=2021-03-09T03:43:30.364Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=kafka_server_sessionexpirelistener_total_zookeepersaslauthenticationspersec_oneminuterate
level=warn ts=2021-03-09T03:43:35.364Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=kubelet_http_requests_total
level=warn ts=2021-03-09T03:43:40.365Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=kubelet_container_log_filesystem_used_bytes
level=info ts=2021-03-09T03:43:40.486Z caller=tail.go:457 component=wal_reader msg="transition to WAL segment" segment=27
level=info ts=2021-03-09T03:43:45.565Z caller=manager.go:174 component=prom_wal msg="reached first record after start offset" start_offset=3625161972 skipped_records=22
level=warn ts=2021-03-09T03:43:55.169Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=grafana_alerting_result_total
level=error ts=2021-03-09T03:43:59.899Z caller=queue_manager.go:546 component=queue_manager msg="unrecoverable write error" points=500 err="rpc error: code = InvalidArgument desc = [data.kind_and_type.kind: metric \"go_memstats_alloc_bytes\" kind mismatch: received cumulative, expected gauge]"
level=error ts=2021-03-09T03:44:01.867Z caller=queue_manager.go:546 component=queue_manager msg="unrecoverable write error" points=500 err="rpc error: code = InvalidArgument desc = [data.kind_and_type.kind: metric \"go_memstats_alloc_bytes\" kind mismatch: received cumulative, expected gauge]"
level=error ts=2021-03-09T03:44:02.841Z caller=queue_manager.go:546 component=queue_manager msg="unrecoverable write error" points=500 err="rpc error: code = InvalidArgument desc = [data.kind_and_type.kind: metric \"go_memstats_alloc_bytes\" kind mismatch: received cumulative, expected gauge]"
level=info ts=2021-03-09T03:44:04.065Z caller=supervisor.go:396 supervisor=true msg="sidecar is running" samples.processed{}=73117 queue.outcome{outcome:success}=14000 dropped.series{key_reason:metadata_not_found}=107178
level=warn ts=2021-03-09T03:44:04.764Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=apiserver_audit_event_total
level=error ts=2021-03-09T03:44:04.965Z caller=queue_manager.go:546 component=queue_manager msg="unrecoverable write error" points=500 err="rpc error: code = InvalidArgument desc = [data.kind_and_type.kind: metric \"go_memstats_alloc_bytes\" kind mismatch: received cumulative, expected gauge]"
level=warn ts=2021-03-09T03:44:09.769Z caller=series_cache.go:376 component=prom_wal msg="metadata not found" metric_name=kubelet_pleg_relist_interval_seconds_bucket
level=error ts=2021-03-09T03:44:10.464Z caller=queue_manager.go:546 component=queue_manager msg="unrecoverable write error" points=345 err="rpc error: code = InvalidArgument desc = [data.kind_and_type.kind: metric \"go_memstats_alloc_bytes\" kind mismatch: received cumulative, expected gauge]"
level=error ts=2021-03-09T03:44:14.564Z caller=queue_manager.go:546 component=queue_manager msg="unrecoverable write error" points=500 err="rpc error: code = InvalidArgument desc = [data.kind_and_type.kind: metric \"go_memstats_alloc_bytes\" kind mismatch: received cumulative, expected gauge]"
level=error ts=2021-03-09T03:44:21.280Z caller=supervisor.go:100 supervisor=true msg="sidecar failed" err="signal: killed"
Logs from 3/8/2021, 10:42:58 PM

Correct / fast startup: wait until Prom completes first scrapes

There is a --startup.delay flag that inserts an artificial wait at startup. This causes startup problems: sometimes the delay is not long enough to ensure a full scrape, and sometimes it just slows us down. We need a full scrape to have completed in order to load metadata.

We could be safe without waiting an arbitrary length of time.

Prometheus exports a metric about its scrape interval timing. It is emitted at the start of the second scrape.
We could use prometheus_target_interval_length_seconds to wait for the first scrapes to complete, since this ensures we have metadata for all active series. This won't entirely work when there are multiple intervals, so we should continue to support the --startup.delay flag.
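
A sketch of how the sidecar could wait on that metric through the standard Prometheus HTTP query API at /api/v1/query; the polling loop and names below are illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// scrapesCompleted reports whether Prometheus has recorded any
// prometheus_target_interval_length_seconds samples yet, which only happens
// after targets have been scraped more than once.
func scrapesCompleted(promURL string) (bool, error) {
	q := url.Values{"query": {"prometheus_target_interval_length_seconds"}}
	resp, err := http.Get(promURL + "/api/v1/query?" + q.Encode())
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var body struct {
		Status string `json:"status"`
		Data   struct {
			Result []json.RawMessage `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.Status == "success" && len(body.Data.Result) > 0, nil
}

func main() {
	for {
		ok, err := scrapesCompleted("http://localhost:9090")
		if err == nil && ok {
			fmt.Println("first scrapes complete; metadata should be available")
			return
		}
		time.Sleep(5 * time.Second)
	}
}
```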

Missing metadata at startup

The --prometheus.scrape-interval flag tells the sidecar which scrape intervals must be waited for before starting. When these are set incorrectly, the sidecar will wait indefinitely for Prometheus.

When these are not set correctly, there is also a chance the sidecar starts requesting metadata before Prometheus has finished identifying its active targets. This leads to sidecar.dropped.series, an error that causes permanent data loss for the affected series. It is also not counted well: the sidecar.dropped.series count is incremented once, and subsequent points are not counted as dropped (they appear the same as filtered points).

The unfortunate side effect is that not knowing the scrape intervals (when there is more than one) means potential data loss.

The upcoming release will auto-detect these intervals, for improved correctness.
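
One way to discover the configured intervals, sketched against the standard /api/v1/status/config endpoint and not necessarily how the upcoming release implements it:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// scrapeIntervals fetches the running Prometheus configuration and pulls out
// every scrape_interval value it mentions. A real implementation would parse
// the YAML properly; scanning lines keeps this sketch dependency-free.
func scrapeIntervals(promURL string) ([]string, error) {
	resp, err := http.Get(promURL + "/api/v1/status/config")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var body struct {
		Data struct {
			YAML string `json:"yaml"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}

	var intervals []string
	for _, line := range strings.Split(body.Data.YAML, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "scrape_interval:") {
			intervals = append(intervals, strings.TrimSpace(strings.TrimPrefix(line, "scrape_interval:")))
		}
	}
	return intervals, nil
}

func main() {
	intervals, err := scrapeIntervals("http://localhost:9090")
	fmt.Println(intervals, err)
}
```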

Re-instrument with OpenTelemetry APIs

This code base uses OpenCensus metrics and console logging for observability.
The Stackdriver codebase used a /statusz page to report status, which was removed from this code.

We would prefer OpenTelemetry spans and metrics with console logging for errors.
To satisfy our need to monitor the sidecar remotely, its observability should be configurable as a separate OTLP stream for reporting diagnostics.
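
For the diagnostics stream itself, a minimal sketch assuming a recent OpenTelemetry Go SDK; the endpoint and metric names are placeholders, not the sidecar's actual instrumentation:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Export the sidecar's own metrics to a separate diagnostics endpoint.
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("diagnostics.example.com:4317"),
	)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
	)
	defer provider.Shutdown(ctx)

	meter := provider.Meter("opentelemetry-prometheus-sidecar/diagnostics")
	counter, _ := meter.Int64Counter("sidecar.samples.processed")
	counter.Add(ctx, 1)
}
```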

sidecar exits when configured filter excludes all metrics

If a user configures a filter that excludes all metrics, the sidecar exits with a stack trace similar to this:

level=warn ts=2021-04-01T21:01:58.555Z caller=supervisor.go:410 supervisor=true msg="process is unhealthy" 
stackdump="goroutine 307 [running]:\ngithub.com/lightstep/opentelemetry-prometheus-sidecar/health.
(*alive).countFailure(...)...

This makes identifying the root cause difficult. At a minimum a more useful message should be printed, and ideally the sidecar would not exit at all when the number of skipped results continues to grow.

Sidecar stopped submitting stats abruptly

Hello @jmacd

We have a stackdriver-prometheus-sidecar working on our RKE cluster along with Prometheus containers. Prometheus continued collecting data without issues. However, the sidecar abruptly stopped submitting data to Stackdriver. We can manually query the data from Prometheus without issues; there is no missing data.

I have also raised an issue on stackdriver-prometheus-sidecar - Stackdriver/stackdriver-prometheus-sidecar#272

While exploring I found that you may be well aware of the issue and have fixed the potential problem in one of your contributions to opentelemetry-prometheus-sidecar.

I would love to use opentelemetry-prometheus-sidecar in our setup if this could solve the issue.

I have some questions. These are some of the arguments I am using for my stackdriver-prometheus-sidecar installation:
--stackdriver.project-id, --stackdriver.kubernetes.location, --stackdriver.kubernetes.cluster-name. Can you please let me know which options to use correspondingly in an opentelemetry-prometheus-sidecar deployment so that I can ship metrics to my Stackdriver project?

@jmacd, any insights here would be really helpful. Awaiting your response.

Logs from stackdriver-prometheus-sidecar

level=debug ts=2021-02-08T19:27:42.852Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.001727148154056503 upperBound=1.1
level=debug ts=2021-02-08T19:27:57.851Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=6.218261691304947e-68 samplesOut=6.218261691304947e-68 samplesOutDuration=3.435572350977865e-61 timePerSample=5.524972285714282e+06 sizeRate=7261.878464457889 offsetRate=2.684309866779702e-63 desiredShards=0.001394140958393718
level=debug ts=2021-02-08T19:27:57.851Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.001394140958393718 upperBound=1.1
level=debug ts=2021-02-08T19:28:12.851Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=4.974609353043958e-68 samplesOut=4.974609353043958e-68 samplesOutDuration=2.7484578807822917e-61 timePerSample=5.524972285714282e+06 sizeRate=7772.382771566311 offsetRate=2.1474478934237618e-63 desiredShards=0.001492147964082347
level=debug ts=2021-02-08T19:28:12.851Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.001492147964082347 upperBound=1.1
