OpenTelemetry Operator Sample

This repo hosts samples for working with the OpenTelemetry Operator on GCP.

Running the Operator

Prerequisites

For GKE Autopilot, install cert-manager with the following Helm commands:

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install \
  --create-namespace \
  --namespace cert-manager \
  --set installCRDs=true \
  --set global.leaderElection.namespace=cert-manager \
  --set extraArgs={--issuer-ambient-credentials=true} \
  cert-manager jetstack/cert-manager

Firewall rules

By default, private GKE clusters may not allow traffic from the control plane on the ports that cert-manager and the operator's admission webhook need, resulting in an error like the following:

Error from server (InternalError): error when creating "collector-config.yaml": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": context deadline exceeded

To fix this, create a firewall rule for your cluster with the following command:

gcloud compute firewall-rules create cert-manager-9443 \
  --source-ranges ${GKE_MASTER_CIDR} \
  --target-tags ${GKE_MASTER_TAG}  \
  --allow TCP:9443

$GKE_MASTER_CIDR and $GKE_MASTER_TAG can be found by following the steps in the GKE firewall rules documentation.
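Both values can usually be read straight from the cluster; a sketch (cluster, zone, and node names below are placeholders):

```shell
# Control plane CIDR of a private cluster (feeds --source-ranges):
gcloud container clusters describe CLUSTER_NAME --zone ZONE \
  --format 'value(privateClusterConfig.masterIpv4CidrBlock)'

# Network tags on a cluster node (feeds --target-tags):
gcloud compute instances describe NODE_NAME --zone ZONE \
  --format 'value(tags.items)'
```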

Installing the OpenTelemetry Operator

Install the Operator with:

kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/download/v0.89.0/opentelemetry-operator.yaml

Starting the Collector

Set up an instance of the OpenTelemetry Collector by creating an OpenTelemetryCollector object. The one in this repo sets up a basic OTLP receiver and logging exporter:

kubectl apply -f collector-config.yaml
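For reference, a minimal sketch of such an OpenTelemetryCollector object (the actual collector-config.yaml in this repo is authoritative and may differ):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel            # name is illustrative
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    exporters:
      logging:          # prints telemetry to the collector's stdout
        loglevel: debug
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [logging]
```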

Auto-instrumenting Applications

The Operator offers auto-instrumentation of application pods by adding an annotation to the Pod spec.

First, create an Instrumentation Custom Resource that contains the settings for the instrumentation. We have provided a sample resource in instrumentation.yaml:

kubectl apply -f instrumentation.yaml
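A sketch of what such an Instrumentation resource and the opt-in annotation look like (the name and endpoint are assumptions; the repo's instrumentation.yaml is authoritative):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation        # illustrative name
spec:
  exporter:
    endpoint: http://otel-collector:4317   # assumes the Collector's service name
  propagators:
    - tracecontext
    - baggage
---
# A workload opts in via an annotation on its Pod template, e.g. for Python:
#   metadata:
#     annotations:
#       instrumentation.opentelemetry.io/inject-python: "true"
```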

With a Collector and auto-instrumentation set up, you can experiment with it using one of the sample applications, or skip right to the recipes if you already have an application running.

Sample Applications

The sample-apps/ folder contains basic apps to demonstrate collecting traces with the operator in various languages:

Each of these sample apps works well with the recipes listed below.

Recipes

The recipes directory holds different sample use cases for working with the operator and auto-instrumentation along with setup guides for each recipe. Currently there are:

Contributing

See CONTRIBUTING.md for details.

License

Apache 2.0; see LICENSE for details.

opentelemetry-operator-sample's People

Contributors

aabmass, apollo17march, damemi, dashpole, dependabot[bot], jsuereth, psx95

opentelemetry-operator-sample's Issues

Trying to deploy the sample Java app: issue while deploying the service

Command that is failing: `kubectl apply -f k8s/`
https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample/blob/main/sample-apps/java/README.md

Error
2024-02-15 22:29:00.044 GMT
Error: LinkageError occurred while loading main class com.google.example.service.Main
2024-02-15 22:29:00.044 GMT
java.lang.UnsupportedClassVersionError: com/google/example/service/Main has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 55.0

I am building and running my application in Google Cloud Shell.
$ java -version
openjdk version "17.0.10" 2024-01-16
OpenJDK Runtime Environment (build 17.0.10+7-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 17.0.10+7-Debian-1deb11u1, mixed mode, sharing)
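Class file version 61.0 corresponds to Java 17 and version 55.0 to Java 11, so the app was compiled with JDK 17 but the container runs a Java 11 runtime. One likely fix (the image tag and jar path are illustrative, not taken from this repo) is to base the runtime image on Java 17:

```dockerfile
# Run on a JRE at least as new as the JDK used to compile the app.
FROM eclipse-temurin:17-jre
COPY target/app.jar /app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]
```

Alternatively, compile with `--release 11` to target the older runtime.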

Reduce privileges of beyla sample

We currently have this securityContext for the beyla daemonset:

securityContext:
  runAsUser: 0
  privileged: true

We should try to reduce privileges in a way that still works on GKE.
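One possible direction, sketched under the assumption that Beyla's eBPF probes can run with discrete capabilities instead of full privilege; the exact capability set is untested and would need validation on GKE:

```yaml
securityContext:
  runAsUser: 0
  privileged: false
  capabilities:
    add:
      # Assumed minimum for eBPF-based instrumentation; trim as testing allows.
      - BPF
      - PERFMON
      - SYS_PTRACE
      - NET_RAW
      - CHECKPOINT_RESTORE
      - DAC_READ_SEARCH
```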

Some potentially helpful links:

Needed to run a credentials helper for docker to make `make push` work

Thanks for making this great samples repo!

One small thing seemed like a missing step: when I ran `make setup`, it didn't actually create the artifact repository for me, so I created it manually.

I also needed to run `gcloud auth configure-docker us-central1-docker.pkg.dev` to authorize my local Docker client to talk to the GCP Artifact Registry before `make push` worked for the Go app.
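If `make setup` does not create the repository, the manual steps look roughly like this (repository name and region are placeholders):

```shell
# Create a Docker-format Artifact Registry repository:
gcloud artifacts repositories create REPO_NAME \
  --repository-format=docker \
  --location=us-central1

# Let the local Docker client authenticate to that region's registry:
gcloud auth configure-docker us-central1-docker.pkg.dev
```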

Add diagrams

For the samples, recipes, and even basic install, it would be nice to have diagrams so users can visualize the end state of each one.

Include cross-language samples

Use a combination of the existing samples to provide a walkthrough for deploying a multi-language sample, for example using the Node.js client and Java server. Add a recipe to auto-instrument an app like this.

Can't start collector - failed to call webhook - context deadline exceeded

This is a problem in GKE Autopilot cluster version 1.25.6-gke.1000.

I follow the guide from https://github.com/GoogleCloudPlatform/opentelemetry-operator-sample#prerequisites.

I installed cert-manager and the OpenTelemetry Operator.
I can't start the collector; here is the error:

$ kubectl apply -f collector-config.yaml 
Error from server (InternalError): error when creating "collector-config.yaml": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": context deadline exceeded

Service Account

I have followed this, but there is no place to assign the `otel-collector` service account in the config. In the example, it just annotates the namespace. I get permission denied in my deployment.

Adding `serviceAccount: "otel-collector"` doesn't help.

2022/10/21 13:41:11 failed to export to Google Cloud Trace: rpc error: code = PermissionDenied desc = The caller does not have permission

I also changed the role to admin, with no luck.

Add Go sample project

Need a basic Go project following at least the client/server model of the other apps. This could be extended to show manual instrumentation or used in demos with auto-instrumentation (once it is available for Go)

Error on exporting Metrics to Google Cloud

There seems to be a problem with the metric exporter. While following the guide for exporting metrics to Google Cloud, I hit the following error.

Collector version

0.75.0

Environment information

Environment

GKE

OpenTelemetry Collector configuration

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  image: otel/opentelemetry-collector-contrib:latest
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
    exporters:
      googlecloud:
        metric:
          create_service_timeseries: true
          use_insecure: true
          prefix: "eevee"
      logging:
        loglevel: debug
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [googlecloud]
        metrics:
          receivers: [otlp]
          processors: []
          exporters: [googlecloud]

Log output

│ 2023-04-17T09:42:48.503Z    info    service/telemetry.go:90    Setting up own telemetry...                                                                                                                                                                                   │
│ 2023-04-17T09:42:48.503Z    info    service/telemetry.go:116    Serving Prometheus metrics    {"address": ":8888", "level": "Basic"}                                                                                                                                         │
│ Error: failed to build pipelines: failed to create "googlecloud" exporter for data type "metrics": no project set in config, or found with application default credentials                                                                                                   │
│ 2023/04/17 09:42:48 collector server run finished with error: failed to build pipelines: failed to create "googlecloud" exporter for data type "metrics": no project set in config, or found with application default credentials
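The error says no project was set in the config or found via Application Default Credentials. A likely fix, consistent with the exporter config shown elsewhere on this page, is to set the project explicitly:

```yaml
exporters:
  googlecloud:
    project: my-project   # replace with your GCP project ID
```

Alternatively, ensure the collector's service account provides ADC (e.g. via Workload Identity) so the project can be discovered.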

Also, please consider adding recipes for exporting metrics to Google Cloud.

Thank You!

Python app has dependency issue which causes exception to be raised

Repro:

cd sample-apps/python
docker build -t sample-python-app-repro .
docker run --rm sample-python-app-repro
[2023-10-06 19:52:03 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2023-10-06 19:52:03 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2023-10-06 19:52:03 +0000] [1] [INFO] Using worker: sync
[2023-10-06 19:52:03 +0000] [7] [INFO] Booting worker with pid: 7
[2023-10-06 19:52:03 +0000] [7] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 134, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python3.10/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python3.10/site-packages/gunicorn/util.py", line 359, in import_app
    mod = importlib.import_module(module)
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/src/app/server.py", line 16, in <module>
    from flask import Flask
  File "/usr/local/lib/python3.10/site-packages/flask/__init__.py", line 5, in <module>
    from .app import Flask as Flask
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 30, in <module>
    from werkzeug.urls import url_quote
ImportError: cannot import name 'url_quote' from 'werkzeug.urls' (/usr/local/lib/python3.10/site-packages/werkzeug/urls.py)
[2023-10-06 19:52:03 +0000] [7] [INFO] Worker exiting (pid: 7)
[2023-10-06 19:52:03 +0000] [1] [INFO] Shutting down: Master
[2023-10-06 19:52:03 +0000] [1] [INFO] Reason: Worker failed to boot.
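The traceback shows Flask importing `url_quote`, which Werkzeug removed in its 3.0 release, so an unpinned Werkzeug breaks older Flask versions. A likely workaround (exact versions are illustrative, not from this repo) is to pin compatible versions in requirements.txt, or upgrade Flask:

```text
# requirements.txt sketch
flask==2.2.5
werkzeug<3
```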

Beyla upgrade breaks sample

Updating the image to `grafana/beyla:1.0.2` causes the sample to stop emitting traces and metrics after about 2 minutes, based on the last metric produced:

StartTimestamp: 2023-12-08 18:48:33.302193405 +0000 UTC
Timestamp: 2023-12-08 18:50:03.301807418 +0000 UTC

The beyla logs contain:

time=2023-12-08T18:48:28.125Z level=INFO msg="system wide instrumentation. Creating a single instrumenter" component=discover.TraceAttacher
2023/12/08 18:49:55 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:49:58 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:49:59 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:50:01 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:50:03 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:50:04 failed to upload metrics: gRPC exporter is shutdown
2023/12/08 18:50:05 failed to upload metrics: gRPC exporter is shutdown
time=2023-12-08T18:50:05.418Z level=WARN msg="error shutting down metrics provider" component=otel.MetricsReporter error="failed to upload metrics: gRPC exporter is shutdown"
2023/12/08 18:50:05 failed to upload metrics: gRPC exporter is shutdown
time=2023-12-08T18:50:15.417Z level=WARN msg="error shutting down metrics provider" component=otel.MetricsReporter error="failed to upload metrics: gRPC exporter is shutdown"

Beyla server spans have the wrong `server.address` for the given `k8s.pod.name`

I'm running the Beyla servicegraph example with the bookinfo app. Beyla produces server spans like the following

ResourceSpans #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.19.0
Resource attributes:
     -> k8s.deployment.name: Str(reviews-v3)
     -> k8s.namespace.name: Str(default)
     -> k8s.node.name: Str(gke-cluster-1-pool-1-6a06028c-h7v0)
     -> k8s.pod.name: Str(reviews-v3-5c5cc7b6d-95p7x)
     -> k8s.pod.start_time: Str(2024-03-07 05:32:33 +0000 UTC)
     -> k8s.pod.uid: Str(2d087ce0-bd15-46a7-85c1-e8451675d7bf)
     -> k8s.replicaset.name: Str(reviews-v3-5c5cc7b6d)
     -> service.instance.id: Str(beyla-agent-xsfzc-1668141)
     -> service.name: Str(reviews-v3)
     -> service.namespace: Str(default)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(beyla)
ScopeSpans #0
ScopeSpans SchemaURL: 
InstrumentationScope github.com/grafana/beyla 
Span #0
    Trace ID       : 090a0a68c4d3602e9a815b193db11e18
    Parent ID      : 
    ID             : 94489a8be3f6c080
    Name           : GET /**
    Kind           : Server
    Start time     : 2024-03-07 23:01:00.743158717 +0000 UTC                                                                                                          
    End time       : 2024-03-07 23:01:00.743865598 +0000 UTC
    Status code    : Unset
    Status message : 
Attributes:
     -> http.request.method: Str(GET)
     -> http.response.status_code: Int(200)
     -> url.path: Str(/ratings/0)
     -> client.address: Str(10.4.1.88)
     -> server.address: Str(10.8.10.91)
     -> server.port: Int(9080)
     -> http.request.body.size: Int(243)
     -> http.route: Str(/**)

Notably

  • k8s.pod.name: reviews-v3-5c5cc7b6d-95p7x
  • server.address: 10.8.10.91
  • client.address: 10.4.1.88
  • Kind: Server

When I look up the pod IP for reviews-v3-5c5cc7b6d-95p7x, it is 10.4.1.88:

kubectl get pod reviews-v3-5c5cc7b6d-95p7x -o wide 
NAME                         READY   STATUS    RESTARTS   AGE   IP          NODE                                 NOMINATED NODE   READINESS GATES                     
reviews-v3-5c5cc7b6d-95p7x   1/1     Running   0          17h   10.4.1.88   gke-cluster-1-pool-1-6a06028c-h7v0   <none>           <none>

And 10.8.10.91 is the ClusterIP for the ratings service:

kubectl get service ratings
NAME      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
ratings   ClusterIP   10.8.10.91   <none>        9080/TCP   23h

This makes me think Beyla is setting the span kind to Server when it is actually a client span (reviews calls ratings in the bookinfo app). service.name indicates that the span is coming from the reviews-v3 pod.

The otel-collector logs always have severity "ERROR"

I used the config below to deploy an OpenTelemetryCollector resource, but when I check the otel-collector logs on the GKE Workloads page, every entry has severity "ERROR", as in the attached screenshot.

Am I missing some configuration in the OpenTelemetryCollector resource? Thank you.

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  image: otel/opentelemetry-collector-contrib:latest
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 65
        spike_limit_percentage: 20

    exporters:
      googlecloud:
      logging:
        loglevel: debug

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [logging, googlecloud]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [googlecloud]
        logs:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [googlecloud]

[Screenshot: otel-col-errorlog]

Recipe for getting HPA external metrics working

Trying to find a way to solve these needs:

  • Ability for developers to see traces
  • Ability to get metrics (which can then be used for alerting)
  • Ability to configure HPA with metrics such as request latencies
  • Preferably with as few app code changes as possible, ideally zero

I have experimented with the following setup:

  • OpenTelemetry Operator, with auto-instrumentation and a Collector sending metrics and traces to the googlecloud exporter
  • Custom Metrics Stackdriver Adapter (new resource model) picking up the metrics from Monitoring
  • An HPA trying to get an external metric (latency) for scaling

With that setup:

  • Traces are going fine to Cloud Trace
  • Metrics are going fine to Monitoring. For my OTel auto-instrumented app, I can see workload.googleapis.com/http.server.request.duration. It is, however, of type DISTRIBUTION

It seems that the last mile causes problems. The Stackdriver adapter can only pick up metric types DOUBLE or INT64, so the HPA reports "unable to fetch metrics from external metrics API: Expected metric of type DoubleValue or Int64Value, but received TypedValue: { 0xc000d44900 [] []}"

Should I rethink my approach to solve the HPA case? What would be the way to do that? Add the googlemanagedprometheus exporter? If I continue using the existing OTLP receivers, would I need to do mapping for the time series?

Here is my current Collector config:

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20
  batch:
    send_batch_size: 10000
    timeout: 10s
exporters:
  googlecloud:
    project: my-project
    metric:
    trace:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter,batch]
      exporters: [googlecloud]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter,batch]
      exporters: [googlecloud]
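One possible direction for the HPA case, sketched and untested: route the metrics pipeline through the googlemanagedprometheus exporter so the metrics land in Managed Service for Prometheus, where an adapter can consume them, while keeping googlecloud for traces:

```yaml
exporters:
  googlecloud:
    project: my-project          # traces only
  googlemanagedprometheus:
    project: my-project
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [googlecloud]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [googlemanagedprometheus]
```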

It isn't obvious that the collector at the repo base doesn't write to GCP

A user simply following the commands in the repo README installs a collector that uses the logging exporter. This isn't called out very clearly, and at least one user was surprised that the Google Cloud sample doesn't write to Google Cloud. We either need to:

  • Make it clearer that it doesn't write to Google Cloud, or
  • Make the "default" sample write to Google Cloud
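If the second option is chosen, the change is roughly a one-line exporter swap in the collector spec (sketch only; the googlecloud exporter still needs credentials, e.g. Workload Identity, to resolve a project):

```yaml
exporters:
  googlecloud: {}        # replaces the logging exporter
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [googlecloud]
```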
