OpenTelemetry Operator Sample

This repo hosts samples for working with the OpenTelemetry Operator on GCP.

Running the Operator

Prerequisites

For GKE Autopilot, install cert-manager with the following Helm commands:

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install \
  --create-namespace \
  --namespace cert-manager \
  --set installCRDs=true \
  --set global.leaderElection.namespace=cert-manager \
  --set extraArgs={--issuer-ambient-credentials=true} \
  cert-manager jetstack/cert-manager

Firewall rules

By default, private GKE clusters may not allow traffic from the control plane on the ports that cert-manager and the operator's admission webhook need, resulting in an error like the following:

Error from server (InternalError): error when creating "collector-config.yaml": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": context deadline exceeded

To fix this, create a firewall rule for your cluster with the following command:

gcloud compute firewall-rules create cert-manager-9443 \
  --source-ranges ${GKE_MASTER_CIDR} \
  --target-tags ${GKE_MASTER_TAG}  \
  --allow TCP:9443

$GKE_MASTER_CIDR and $GKE_MASTER_TAG can be found by following the steps in the GKE firewall rules documentation.
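Both values can usually be read straight from the cluster; a sketch (cluster, zone, and node names below are placeholders):

```shell
# Control plane CIDR of a private cluster (feeds --source-ranges):
gcloud container clusters describe CLUSTER_NAME --zone ZONE \
  --format 'value(privateClusterConfig.masterIpv4CidrBlock)'

# Network tags on a cluster node (feeds --target-tags):
gcloud compute instances describe NODE_NAME --zone ZONE \
  --format 'value(tags.items)'
```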

Installing the OpenTelemetry Operator

Install the Operator with:

kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.89.0/opentelemetry-operator.yaml

Starting the Collector

Set up an instance of the OpenTelemetry Collector by creating an OpenTelemetryCollector object. The one in this repo sets up a basic OTLP receiver and logging exporter:

kubectl apply -f collector-config.yaml
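For reference, a minimal sketch of such an OpenTelemetryCollector object (the actual collector-config.yaml in this repo is authoritative and may differ):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel            # name is illustrative
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      logging:          # prints telemetry to the collector's stdout
        loglevel: debug
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [logging]
```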

Auto-instrumenting Applications

The Operator offers auto-instrumentation of application pods by adding an annotation to the Pod spec.

First, create an Instrumentation Custom Resource that contains the settings for the instrumentation. We have provided a sample resource in instrumentation.yaml:

kubectl apply -f instrumentation.yaml
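A sketch of what such an Instrumentation resource and the opt-in annotation look like (the name and endpoint are assumptions; the repo's instrumentation.yaml is authoritative):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation        # illustrative name
spec:
  exporter:
    endpoint: http://otel-collector:4317   # assumes the Collector's service name
  propagators:
    - tracecontext
    - baggage
---
# A workload opts in via an annotation on its Pod template, e.g. for Python:
#   metadata:
#     annotations:
#       instrumentation.opentelemetry.io/inject-python: "true"
```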

With a Collector and auto-instrumentation set up, you can experiment with it using one of the sample applications, or skip right to the recipes if you already have an application running.

Sample Applications

The sample-apps/ folder contains basic apps to demonstrate collecting traces with the operator in various languages:

Each of these sample apps works well with the recipes listed below.

Recipes

The recipes directory holds different sample use cases for working with the operator and auto-instrumentation along with setup guides for each recipe. Currently there are:

Contributing

See CONTRIBUTING.md for details.

License

Apache 2.0; see LICENSE for details.

opentelemetry-operator-sample's People

Contributors

aabmass, apollo17march, damemi, dashpole, dependabot[bot], jsuereth, psx95

opentelemetry-operator-sample's Issues

Trying to deploy the sample Java app: issue while deploying the service

Command that is failing: `kubectl apply -f k8s/`
https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/blob/main/sample-apps/java/README.md

Error
2024-02-15 22:29:00.044 GMT
Error: LinkageError occurred while loading main class com.google.example.service.Main
2024-02-15 22:29:00.044 GMT
java.lang.UnsupportedClassVersionError: com/google/example/service/Main has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 55.0

I am building and running my application in Google Cloud Shell.
$ java -version
openjdk version "17.0.10" 2024-01-16
OpenJDK Runtime Environment (build 17.0.10+7-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 17.0.10+7-Debian-1deb11u1, mixed mode, sharing)
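Class file version 61.0 corresponds to Java 17 and version 55.0 to Java 11, so the app was compiled with JDK 17 but the container runs a Java 11 runtime. One likely fix (the image tag and jar path are illustrative, not taken from this repo) is to base the runtime image on Java 17:

```dockerfile
# Run on a JRE at least as new as the JDK used to compile the app.
FROM eclipse-temurin:17-jre
COPY target/app.jar /app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]
```

Alternatively, compile with `--release 11` to target the older runtime.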

Reduce privileges of beyla sample

We currently have this securityContext for the beyla daemonset:

securityContext:
  runAsUser: 0
  privileged: true

We should try to reduce privileges in a way that still works on GKE.
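One possible direction, sketched under the assumption that Beyla's eBPF probes can run with discrete capabilities instead of full privilege; the exact capability set is untested and would need validation on GKE:

```yaml
securityContext:
  runAsUser: 0
  privileged: false
  capabilities:
    add:
      # Assumed minimum for eBPF-based instrumentation; trim as testing allows.
      - BPF
      - PERFMON
      - SYS_PTRACE
      - NET_RAW
      - CHECKPOINT_RESTORE
      - DAC_READ_SEARCH
```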

Some potentially helpful links:

Needed to run a credentials helper for docker to make `make push` work

Thanks for making this great samples repo!

One small thing seemed like a missing step: when I ran `make setup`, it didn't actually create the artifact repository for me, so I created it manually.

I also needed to run `gcloud auth configure-docker us-central1-docker.pkg.dev` to authorize my local Docker client to talk to the GCP Artifact Registry before `make push` worked for the Go app.
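If `make setup` does not create the repository, the manual steps look roughly like this (repository name and region are placeholders):

```shell
# Create a Docker-format Artifact Registry repository:
gcloud artifacts repositories create REPO_NAME \
  --repository-format=docker \
  --location=us-central1

# Let the local Docker client authenticate to that region's registry:
gcloud auth configure-docker us-central1-docker.pkg.dev
```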

Add diagrams

For the samples, recipes, and even basic install, it would be nice to have diagrams so users can visualize the end state of each one.

Include cross-language samples

Use a combination of the existing samples to provide a walkthrough for deploying a multi-language sample, for example using the Node.js client and Java server. Add a recipe to auto-instrument an app like this.

Can't start collector - failed to call webhook - context deadline exceeded

This is a problem in GKE Autopilot cluster version 1.25.6-gke.1000.

I follow the guide from https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample#prerequisites.

I installed cert-manager and the OpenTelemetry Operator.
I can't start the collector; here is the error:

$ kubectl apply -f collector-config.yaml 
Error from server (InternalError): error when creating "collector-config.yaml": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": context deadline exceeded

Service Account

I have followed this, but there is no place to assign the `otel-collector` service account in the config. In the example, it just annotates the namespace. I get permission denied in my deployment.

Adding `serviceAccount: "otel-collector"` doesn't help.

2022/10/21 13:41:11 failed to export to Google Cloud Trace: rpc error: code = PermissionDenied desc = The caller does not have permission

I also changed the role to admin, with no luck.

Add Go sample project

Need a basic Go project following at least the client/server model of the other apps. This could be extended to show manual instrumentation or used in demos with auto-instrumentation (once it is available for Go)

Error on exporting Metrics to Google Cloud

There seems to be a problem with the metric exporter. While following the guide for exporting metrics to Google Cloud, I hit the following error.

Collector version

0.75.0

Environment information

Environment

GKE

OpenTelemetry Collector configuration

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  image: otel/opentelemetry-collector-contrib:latest
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
    exporters:
      googlecloud:
        metric:
          create_service_timeseries: true
          use_insecure: true
          prefix: "eevee"
      logging:
        loglevel: debug
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [googlecloud]
        metrics:
          receivers: [otlp]
          processors: []
          exporters: [googlecloud]

Log output

│ 2023-04-17T09:42:48.503Z    info    service/telemetry.go:90    Setting up own telemetry...                                                                                                                                                                                   │
│ 2023-04-17T09:42:48.503Z    info    service/telemetry.go:116    Serving Prometheus metrics    {"address": ":8888", "level": "Basic"}                                                                                                                                         │
│ Error: failed to build pipelines: failed to create "googlecloud" exporter for data type "metrics": no project set in config, or found with application default credentials                                                                                                   │
│ 2023/04/17 09:42:48 collector server run finished with error: failed to build pipelines: failed to create "googlecloud" exporter for data type "metrics": no project set in config, or found with application default credentials
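The error says no project was set in the config or found via Application Default Credentials. A likely fix, consistent with the exporter config shown elsewhere on this page, is to set the project explicitly:

```yaml
exporters:
  googlecloud:
    project: my-project   # replace with your GCP project ID
```

Alternatively, ensure the collector's service account provides ADC (e.g. via Workload Identity) so the project can be discovered.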

Also, please consider adding recipes for exporting metrics to Google Cloud.

Thank You!

Python app has dependency issue which causes exception to be raised

Repro:

cd sample-apps/python
docker build -t sample-python-app-repro .
docker run --rm sample-python-app-repro
[2023-10-06 19:52:03 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2023-10-06 19:52:03 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2023-10-06 19:52:03 +0000] [1] [INFO] Using worker: sync
[2023-10-06 19:52:03 +0000] [7] [INFO] Booting worker with pid: 7
[2023-10-06 19:52:03 +0000] [7] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 134, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python3.10/site-packages/gunicorn/util.py", line 359, in import_app
    mod = importlib.import_module(module)
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/src/app/server.py", line 16, in <module>
    from flask import Flask
  File "/usr/local/lib/python3.10/site-packages/flask/__init__.py", line 5, in <module>
    from .app import Flask as Flask
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 30, in <module>
    from werkzeug.urls import url_quote
ImportError: cannot import name 'url_quote' from 'werkzeug.urls' (/usr/local/lib/python3.10/site-packages/werkzeug/urls.py)
[2023-10-06 19:52:03 +0000] [7] [INFO] Worker exiting (pid: 7)
[2023-10-06 19:52:03 +0000] [1] [INFO] Shutting down: Master
[2023-10-06 19:52:03 +0000] [1] [INFO] Reason: Worker failed to boot.
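The traceback shows Flask importing `url_quote`, which Werkzeug removed in its 3.0 release, so an unpinned Werkzeug breaks older Flask versions. A likely workaround (exact versions are illustrative, not from this repo) is to pin compatible versions in requirements.txt, or upgrade Flask:

```text
# requirements.txt sketch
flask==2.2.5
werkzeug<3
```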

Beyla upgrade breaks sample

Updating the image to `grafana/beyla:1.0.2` causes the sample to stop emitting traces and metrics after about 2 minutes, based on the last metric produced:

StartTimestamp: 2023-12-08 18:48:33.302193405 +0000 UTC
Timestamp: 2023-12-08 18:50:03.301807418 +0000 UTC

The beyla logs contain:

time=2023-12-08T18:48:28.125Z level=INFO msg="system wide instrumentation. Creating a single instrumenter" component=discover.TraceAttacher
2023/12/08 18:49:55 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:49:58 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:49:59 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:50:01 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:50:03 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:50:04 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:50:05 failed to upload metrics: gRPC exporter is shutdown
time=2023-12-08T18:50:05.418Z level=WARN msg="error shutting down metrics provider" component=otel.MetricsReporter error="failed to upload metrics: gRPC exporter is shutdown"
2023/12/08 18:50:05 failed to upload metrics: gRPC exporter is shutdown
time=2023-12-08T18:50:15.417Z level=WARN msg="error shutting down metrics provider" component=otel.MetricsReporter error="failed to upload metrics: gRPC exporter is shutdown"

Beyla server spans have the wrong `server.address` for the given `k8s.pod.name`

I'm running the Beyla servicegraph example with the bookinfo app. Beyla produces server spans like the following

ResourceSpans #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.19.0
Resource attributes:
     -> k8s.deployment.name: Str(reviews-v3)
     -> k8s.namespace.name: Str(default)
     -> k8s.node.name: Str(gke-cluster-1-pool-1-6a06028c-h7v0)
     -> k8s.pod.name: Str(reviews-v3-5c5cc7b6d-95p7x)
     -> k8s.pod.start_time: Str(2024-03-07 05:32:33 +0000 UTC)
     -> k8s.pod.uid: Str(2d087ce0-bd15-46a7-85c1-e8451675d7bf)
     -> k8s.replicaset.name: Str(reviews-v3-5c5cc7b6d)
     -> service.instance.id: Str(beyla-agent-xsfzc-1668141)
     -> service.name: Str(reviews-v3)
     -> service.namespace: Str(default)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(beyla)
ScopeSpans #0
ScopeSpans SchemaURL: 
InstrumentationScope github.com/grafana/beyla 
Span #0
    Trace ID       : 090a0a68c4d3602e9a815b193db11e18
    Parent ID      : 
    ID             : 94489a8be3f6c080
    Name           : GET /**
    Kind           : Server
    Start time     : 2024-03-07 23:01:00.743158717 +0000 UTC                                                                                                          
    End time       : 2024-03-07 23:01:00.743865598 +0000 UTC
    Status code    : Unset
    Status message : 
Attributes:
     -> http.request.method: Str(GET)
     -> http.response.status_code: Int(200)
     -> url.path: Str(/ratings/0)
     -> client.address: Str(10.4.1.88)
     -> server.address: Str(10.8.10.91)
     -> server.port: Int(9080)
     -> http.request.body.size: Int(243)
     -> http.route: Str(/**)

Notably

  • k8s.pod.name: reviews-v3-5c5cc7b6d-95p7x
  • server.address: 10.8.10.91
  • client.address: 10.4.1.88
  • Kind: Server

When I look up the pod IP for reviews-v3-5c5cc7b6d-95p7x, it is 10.4.1.88:

kubectl get pod reviews-v3-5c5cc7b6d-95p7x -o wide 
NAME                         READY   STATUS    RESTARTS   AGE   IP          NODE                                 NOMINATED NODE   READINESS GATES                     
reviews-v3-5c5cc7b6d-95p7x   1/1     Running   0          17h   10.4.1.88   gke-cluster-1-pool-1-6a06028c-h7v0   <none>           <none>

And 10.8.10.91 is the ClusterIP for the ratings service:

kubectl get service ratings
NAME      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
ratings   ClusterIP   10.8.10.91   <none>        9080/TCP   23h

This makes me think Beyla is setting the span kind to Server when it is actually a client span (reviews calls ratings in the bookinfo app). service.name indicates that the span is coming from the reviews-v3 pod.

The otel-collector logs always have severity "ERROR"

I used the config below to deploy an OpenTelemetryCollector resource, but when I check the otel-collector logs on the GKE Workloads page, every entry has severity "ERROR", as in the attached screenshot.

Am I missing some configuration in the OpenTelemetryCollector resource? Thank you.

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  image: otel/opentelemetry-collector-contrib:latest
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 65
        spike_limit_percentage: 20

    exporters:
      googlecloud:
      logging:
        loglevel: debug

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [logging, googlecloud]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [googlecloud]
        logs:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [googlecloud]

[Screenshot: otel-col-errorlog]

Recipe for getting HPA external metrics working

Trying to find a way to solve these needs:

  • Ability for developers to see traces
  • Ability to get metrics (which can then be used for alerting)
  • Ability to configure HPA with metrics such as request latencies
  • Preferably with as few app code changes as possible, ideally zero

I have experimented with the following setup:

  • OpenTelemetry Operator, with auto-instrumentation and a Collector sending metrics and traces to the googlecloud exporter
  • Custom Metrics Stackdriver Adapter (new resource model) picking up the metrics from Monitoring
  • An HPA trying to get an external metric (latency) for scaling

With that setup:

  • Traces are going fine to Cloud Trace
  • Metrics are going fine to Monitoring. For my OTel auto-instrumented app, I can see workload.googleapis.com/http.server.request.duration. It is, however, of type DISTRIBUTION

It seems that the last mile causes problems. The Stackdriver adapter can only pick up metric types DOUBLE or INT64, so the HPA reports "unable to fetch metrics from external metrics API: Expected metric of type DoubleValue or Int64Value, but received TypedValue: { 0xc000d44900 [] []}"

Should I rethink my approach to solve the HPA case? What would be the way to do that? Add the googlemanagedprometheus exporter? If I continue using the existing OTLP receivers, would I need to do mapping for the time series?

Here is my current Collector config:

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20
  batch:
    send_batch_size: 10000
    timeout: 10s
exporters:
  googlecloud:
    project: my-project
    metric:
    trace:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter,batch]
      exporters: [googlecloud]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter,batch]
      exporters: [googlecloud]
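One possible direction for the HPA case, sketched and untested: route the metrics pipeline through the googlemanagedprometheus exporter so the metrics land in Managed Service for Prometheus, where an adapter can consume them, while keeping googlecloud for traces:

```yaml
exporters:
  googlecloud:
    project: my-project          # traces only
  googlemanagedprometheus:
    project: my-project
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [googlecloud]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [googlemanagedprometheus]
```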

It isn't obvious that the collector at the repo base doesn't write to GCP

A user simply following the commands in the repo README installs a collector that uses the logging exporter. This isn't called out very clearly, and at least one user was surprised that the Google Cloud sample doesn't write to Google Cloud. We either need to:

  • Make it clearer that it doesn't write to Google Cloud, or
  • Make the "default" sample write to Google Cloud
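If the second option is chosen, the change is roughly a one-line exporter swap in the collector spec (sketch only; the googlecloud exporter still needs credentials, e.g. Workload Identity, to resolve a project):

```yaml
exporters:
  googlecloud: {}        # replaces the logging exporter
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [googlecloud]
```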
