
otel-collector-charts's Introduction

ServiceNow Cloud Observability (formerly Lightstep) OpenTelemetry Collector Helm Charts

This is the repository of recommended Helm charts for running an OpenTelemetry Collector with the OpenTelemetry Operator for Kubernetes. We recommend following the quick start documentation for using these charts.

โš ๏ธ These OpenTelemetry Helm charts are under active development and may have breaking changes between releases.

Charts

  • otel-cloud-stack - Recommended chart for sending Kubernetes metrics to ServiceNow Cloud Observability using OpenTelemetry-native metric collection and the OpenTelemetry Operator.
  • kube-otel-stack - Drop-in replacement for kube-prometheus-stack, which uses the same configuration for scraping Prometheus exporters and forwards metrics to Lightstep using the OpenTelemetry Operator. Use this chart if you want to compare Kubernetes monitoring in Prometheus with Kubernetes monitoring using ServiceNow Cloud Observability.
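Both charts install like any other Helm chart. Assuming the charts are published to this repository's GitHub Pages Helm repo (the URL below is an assumption based on the repository name, and the release name is illustrative), installation typically looks like: helm repo add lightstep https://lightstep.github.io/otel-collector-charts followed by helm install otel-cloud-stack lightstep/otel-cloud-stack.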

Arrow Usage

Note

Arrow usage is in beta; use it at your own risk, and reach out if you run into any issues.

To run an Arrow trace collector, you can either (1) use the prebuilt image available via the GitHub Container Registry (GHCR), or (2) build your own custom image.

1. Use the prebuilt Docker image

  1. We have built a Docker image using the recommended build config.
  2. This Docker image can be pulled by running: docker pull ghcr.io/lightstep/otel-collector-charts/otelarrowcol-experimental:latest
  3. You can use the collector config (/arrow/config/gateway-collector.yaml) by running: docker run -it -v $(pwd)/config/:/config --entrypoint /otelarrowcol ghcr.io/lightstep/otel-collector-charts/otelarrowcol-experimental:latest --config=/config/gateway-collector.yaml
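For orientation, a minimal gateway-style config for this image pairs an OTLP receiver with the OTel-Arrow exporter. The sketch below is illustrative and is not the contents of the repository's gateway-collector.yaml; the otelarrow exporter name, the ingest endpoint, and the LS_TOKEN environment variable are assumptions:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
exporters:
  otelarrow:
    endpoint: ingest.lightstep.com:443   # assumed Cloud Observability ingest endpoint
    headers:
      lightstep-access-token: "${LS_TOKEN}"   # assumed environment variable
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otelarrow]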

2. Build your own custom image

  1. We have supplied a collector builder config below.
  2. Once an image is available, simply apply your desired Helm chart with the values.yaml AND the arrow.yaml in the respective chart, as shown below.
  3. Make sure to replace the image in arrow.yaml with your custom-built image.
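For example, from a local checkout of this repository (the chart path and release name here are illustrative), that would look like: helm upgrade --install kube-otel-stack ./charts/kube-otel-stack -f values.yaml -f arrow.yaml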

Build configurations

Some of the features available in these charts are optional because they rely on components that have not yet been released in the OpenTelemetry Contrib Collector. Specifically, using the new OpenTelemetry Protocol with Apache Arrow support currently requires either the prebuilt image or a custom collector build.

See the recommended custom collector build configuration as a starting point.
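For orientation, an OpenTelemetry Collector Builder (ocb) manifest for such a build looks roughly like the sketch below; the module paths and versions are illustrative, so take the real ones from the repository's recommended build configuration rather than copying these verbatim:

dist:
  name: otelarrowcol
  description: Custom collector build with OTel-Arrow support
  output_path: ./otelarrowcol
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.96.0   # version illustrative
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.96.0   # version illustrative
exporters:
  - gomod: github.com/open-telemetry/otel-arrow/collector/exporter/otelarrowexporter v0.18.0   # path and version illustrative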


Contributing and Developing

Please see CONTRIBUTING.md.

otel-collector-charts's People

Contributors

adil-sultan, amburvill, amoreena, austinlparker, bismarck, cartersocha, djspoons, gdfast, jaronoff97, jdcrouse, jmacd, moh-osman3, paivagustavo, rivtoadd, rossmcdonald, sbaum1994, smithclay, taniyourstruly, toaddyan


otel-collector-charts's Issues

Metric Collector CrashLoopBackOff Due to Inability to Reach Target Allocator

Description

The metric collector enters a continuous crash backoff loop because it fails to connect to the target allocator. The issue seems to be caused by the target allocator service being unhealthy, as no endpoints are active under its service; however, there are no logs indicating why the service is unhealthy.

Steps to Reproduce (maybe?)

  1. Start with the kube-otel-stack deployed with Helm chart version 3.9.
  2. Upgrade the Helm chart to version 4.1 for the kube-otel-stack via GitOps.
  3. Observe the metric collector's behavior post-upgrade.

Expected Behavior

The metric collector should successfully connect to the target allocator service without entering a crash loop, or at least provide logs indicating the reason for connection failures.

Actual Behavior

The metric collector continually crashes with the following errors:

Error: cannot start pipelines: Get "http://si-kube-otel-stack-metrics-targetallocator:80/scrape_configs": dial tcp 172.20adf0: connect: connection refused
2024/02/27 08:05:56 collector server run finished with error: cannot start pipelines: Get "http://si-kube-otel-stack-metrics-targetallocator:80/scrape_configs": dial tcp 172.2dff0: connect: connection refused

Upon inspecting the target allocator service, it was found to have no active endpoints, indicating that the service was unhealthy. However, there were no logs or indicators explaining why the service was in this state.
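(For anyone debugging the same symptom: the endpoint check described above amounts to kubectl get endpoints si-kube-otel-stack-metrics-targetallocator, using the service name from the errors above; an empty ENDPOINTS column means no allocator pod is passing its readiness probe.)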

Temporary Fix

The issue was temporarily resolved by wiping the entire kube-otel-stack and rebuilding it. However, this is not a viable long-term solution as it involves downtime and potential data loss.

Environment

Kubernetes version: 1.28
Helm chart version: Upgraded from 3.9 to 4.1
GitOps tool: argocd

Additional Context

https://cloud-native.slack.com/archives/C033BJ8BASU/p1709021610781889

kube-otel-stack daemonset not propagating to ALL nodes

Situation: when deploying a daemonset on kube-otel-stack for the logs collector, I expect the collector to appear on EVERY node. (Note: I have a statefulset metrics collector and a daemonset logs collector.)
Problem: it doesn't actually appear on every node, whereas with the opentelemetry-collector Helm chart it DOES appear on every node.

  • Upon inspection, I noticed that an extra toleration of NoSchedule op=Exists is in the otel-collector Helm chart, which allows for a collector on EVERY node.

How would we fix this issue?

  • Is it as simple as adding an extra toleration (as sketched below) to propagate to the collector? Would this require re-formatting how this Helm chart is written?
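For reference, the toleration described above looks like this in a pod spec (this is the standard Kubernetes toleration syntax; where it would belong in this chart's values is exactly the open question):

tolerations:
  - operator: Exists
    effect: NoSchedule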

Support sampling in otelarrowcol

There aren't any sampling processors included in the binary. Other OTel Collector distributions support sampling via the probabilistic or tail sampling processors. Can we add them to the binary so that sampling is at least available?

Thanks.
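For context, in distributions that include them, these processors are configured roughly as follows (illustrative snippet for the contrib probabilistic_sampler processor, which keeps a fixed percentage of traces):

processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep ~10% of traces
service:
  pipelines:
    traces:
      processors: [probabilistic_sampler]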

metrics-collector should emit logs when it fails to connect to the target allocator

metricsCollector:
  targetallocator:
      limits:
        cpu: 1000m
        memory: 4000Mi
      requests:
        cpu: 500m
        memory: 2000Mi
  config:
    extensions:
      health_check:
        endpoint: "0.0.0.0:13133"
        path: "/"
        check_collector_pipeline:
          enabled: false
          interval: "5m"
          exporter_failure_threshold: 5
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"
          http:
            endpoint: "0.0.0.0:4318"
    processors:
      metricstransform/k8sservicename:
        transforms:
          - include: kube_service_info
            match_type: strict
            action: update
            operations:
              - action: update_label
                label: service
                new_label: k8s.service.name
      resourcedetection/env:
        detectors:
          - env
        timeout: 2s
        override: false
      k8sattributes:
        passthrough: false
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.name
      batch:
        send_batch_size: 1000
        timeout: 1s
        send_batch_max_size: 1500
      resource:
        attributes:
          - key: lightstep.helm_chart
            value: kube-otel-stack
            action: insert
          - key: job
            from_attribute: service.name
            action: insert
    exporters:
      prometheusremotewrite:
        endpoint: https://mimir/api/v1/push
        headers:
          "X-Scope-OrgID": "ORG_ID"
        external_labels:
          cluster: cluster_name
    service:
      extensions:
        - health_check
      pipelines:
        metrics:
          receivers:
            - prometheus
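            # note: not defined in the receivers section above; presumably injected
            # into the rendered config by the chart's target allocator integration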
          processors:
            - resource
            - resourcedetection/env
            - k8sattributes
            - metricstransform/k8sservicename
            - batch
          exporters:
            - prometheusremotewrite

Situation

kube-proxy went down (this was not known initially at the time of debugging). When the metrics collector was trying to connect to the target allocator, the health check extension would kill the pod before any log message could come out of the metrics collector, because it could not reach the service.

Problem

There was no log to tell me any symptom of the failure besides a "timeout".

Ask

Some kind of logging mechanism that isn't hidden by the health check extension would be nice.

Invalid `queue_settings` configuration included in default collector config

It looks like the changes introduced with #42 are invalid for the collector-contrib. Here is the error the collector crashes with:

Error: failed to get config: cannot unmarshal the configuration: 1 error(s) decoding:

* error decoding 'exporters': error reading configuration for "otlp": 1 error(s) decoding:

* '' has invalid keys: queue_settings

Note that I am deploying from main, and not from a release.
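If the error message is taken at face value, the exporterhelper-based exporters (including otlp) expect these options under sending_queue rather than queue_settings, i.e. something like the following (illustrative values):

exporters:
  otlp:
    sending_queue:
      enabled: true
      queue_size: 1000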
