
observatorium's Introduction

Observatorium


Configuration for a Multi-Tenant, Flexible, Scalable Observability Backend

Observatorium allows you to run and effectively operate a multi-tenant, easy-to-operate, scalable, open source observability system on Kubernetes. The system lets you ingest, store, and use common observability signals such as metrics, logs, and traces. Observatorium is a "meta project" that allows you to manage, integrate, and combine multiple well-established existing projects like Thanos, Loki, Tempo/Jaeger, Open Policy Agent, etc. under a single consistent system with well-defined tenancy APIs and signal-correlation capabilities.

As active maintainers of and contributors to the underlying projects, we created a reference configuration, with extra software that connects those open source solutions into one unified and easy-to-use service. It fills the gaps between those projects, adding the consistency, multi-tenancy, security, and resiliency pieces that are needed for a robust backend.

Read more in the High Level Architecture docs.

Context

As the Red Hat Monitoring Team, we have been focused on observability software and concepts since the CoreOS acquisition. From the beginning, one of our main goals was to establish stable in-cluster metric collection, querying, and alerting for OpenShift clusters. With the growth of managed OpenShift (OSD) clusters, the team's scope extended: we had to develop a scalable, global metric stack that can run in local as well as central locations for monitoring and telemetry purposes. We also worked together with the Red Hat Logging and Tracing teams to implement something similar for logging and tracing, and we are also working on Continuous Profiling aspects.

From the very beginning our teams have leveraged open source to accomplish all of those goals. We believe that working with the communities is the best way to build long-term, successful systems, share knowledge, and establish solid APIs. You might not have seen us, but members of our teams have been actively maintaining and contributing to major open source standards and projects like Prometheus, Thanos, Loki, Grafana, kube-state-metrics (KSM), prometheus-operator, kube-prometheus, Alertmanager, cluster-monitoring-operator (CMO), OpenMetrics, Jaeger, ConProf, Cortex, CNCF SIG Observability, Kubernetes SIG Instrumentation, and more.

What's Included

  • Observatorium is primarily defined in Jsonnet, which allows great flexibility and reusability. The main configuration resources are stored in the components directory, and they import further official resources such as kube-thanos.

  • We are aware that not everybody speaks Jsonnet, and not everybody has their own GitOps pipeline, so we designed alternative deployments based on the main Jsonnet resources. The Operator project delivers a plain Kubernetes Operator that operates Observatorium.

NOTE: Observatorium is a set of cloud-native, mostly stateless components that generally do not require special operating logic. For those operations that do require automation, specialized controllers were designed. Use the Operator only if it is your primary installation mechanism or if you don't have a CI pipeline.

NOTE2: The Operator is under heavy development. There are already plans to streamline its usage and redesign the current CustomResourceDefinition in the next version. Still, it is currently used in production by several larger users, so any changes will be made with care. A minimal custom resource skeleton is sketched below.
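
A minimal sketch of an Observatorium custom resource, assuming the core.observatorium.io/v1alpha1 API and the observatorium-xyz naming used by the example manifests in this repository; the spec contents are intentionally left out because they are defined by the CustomResourceDefinition and differ between versions:

# Minimal Observatorium custom resource skeleton (sketch only).
# apiVersion, kind, and name follow the examples in this repository;
# consult the CRD for the actual spec schema.
apiVersion: core.observatorium.io/v1alpha1
kind: Observatorium
metadata:
  name: observatorium-xyz
  namespace: observatorium
spec: {}  # fill in images, replicas, and object storage per the CRD schema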

  • The Thanos Receive Controller is a Kubernetes controller written in Go that distributes essential tenancy configuration to the desired pods.

  • The API is the entry point to the Observatorium service. It's a lightweight proxy written in Go that handles multi-tenancy concerns (isolation, cross-tenant requests, rate-limiting, roles, tracing). This proxy should be used for all external traffic to Observatorium.

  • OPA-AMS is our Go library for integrating Open Policy Agent with the Red Hat authorization service for a smooth OpenShift experience.

  • up is a useful Go service that periodically queries Observatorium and outputs vital metrics on the health and performance of the Observatorium read path over time.

  • token-refresher is a simple Go CLI for performing the OIDC token refresh flow.

Getting Started

Status: Work In Progress

While the metrics and logging parts, based on Thanos and Loki, are used in production at Red Hat, the documentation, full design, user guides, and support for different configurations are still in progress.

Stay Tuned!

Missing something or not sure?

Let us know! Visit our Slack channel or open a GitHub issue!

observatorium's People

Contributors

aditya-konarde, bill3tt, brancz, bwplotka, clyang82, dependabot[bot], douglascamata, esnible, jessicalins, joaobravecoding, jpkrohling, kakkoyun, krasi-georgiev, lilic, matej-g, metalmatze, moadz, morvencao, onprem, pavolloffay, periklis, philipgough, rollandf, rubenvp8510, saswatamcode, smarterclayton, squat, tareqmamari, thibaultmg, yeya24


observatorium's Issues

Loki components stuck on `CrashLoopBackOff` when deploying example manifests on GKE

Yesterday I was trying to deploy Observatorium on a fresh new GKE cluster.

To deploy Observatorium I followed the e2e test script in the operator repository.

The components loki-compactor, loki-ingester and loki-querier got stuck on CrashLoopBackOff, all with the same error:

level=error ts=2021-05-06T20:07:23.034407503Z caller=log.go:149 msg="error running loki" err="mkdir /data/loki: permission denied

After manually editing their StatefulSets and adding the securityContext below, it worked just fine:

      securityContext: 
        runAsGroup: 0
        runAsNonRoot: false
        runAsUser: 0

Of course, this is not the safest approach, but I was expecting that those manifests would work on any Kubernetes cluster without manual intervention. Should I make adjustments somewhere?
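
One possible alternative to running the pods as root, assuming the Loki image runs under a fixed non-root UID/GID (commonly 10001, but verify against the image), is to rely on fsGroup so the data volume becomes group-writable. This is only a sketch, not the project's confirmed fix:

# Less privileged securityContext sketch for the Loki StatefulSets.
# Assumes the Loki container runs as UID/GID 10001; fsGroup makes Kubernetes
# set group ownership of mounted volumes so /data/loki is writable without root.
securityContext:
  runAsNonRoot: true
  runAsUser: 10001    # assumed UID of the Loki image
  runAsGroup: 10001   # assumed GID
  fsGroup: 10001      # volume ownership follows this group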

Operator fails on invalid memory address or nil pointer dereference

It looks like some kind of race that resulted in:

[admin@spoke1 configuration]$ oc -n observatorium logs observatorium-operator-6ff57f6b77-7vvpq
W0510 15:09:32.528320       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
ts=2020-05-10T15:09:33.233019951Z caller=resource.go:123 msg="resources trigger started"
level=debug ts=2020-05-10T15:09:33.324971608Z caller=handlers.go:124 resource-handler=obs action=add key=openshift-monitoring/observatorium-xyz
level=debug ts=2020-05-10T15:09:33.325189766Z caller=handlers.go:127 resource-handler=obs msg="transformed key" original-key=openshift-monitoring/observatorium-xyz new-key=openshift-monitoring/observatorium-xyz
level=debug ts=2020-05-10T15:09:33.325392506Z caller=handlers.go:124 resource-handler=obs action=add key=observatorium/observatorium-xyz
level=debug ts=2020-05-10T15:09:33.325480897Z caller=handlers.go:127 resource-handler=obs msg="transformed key" original-key=observatorium/observatorium-xyz new-key=observatorium/observatorium-xyz
ts=2020-05-10T15:09:33.325461273Z caller=resource.go:185 msg="sync triggered" key=openshift-monitoring/observatorium-xyz
level=debug ts=2020-05-10T15:09:50.926212928Z caller=feedback.go:101 msg="initializing status" namespace=openshift-monitoring name=observatorium-xyz kind=Observatorium apiVersion=core.observatorium.io/v1alpha1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11a7261]

goroutine 219 [running]:
k8s.io/apimachinery/pkg/apis/meta/v1/unstructured.(*Unstructured).GetResourceVersion(...)
	/go/pkg/mod/k8s.io/[email protected]/pkg/apis/meta/v1/unstructured/unstructured.go:282
github.com/brancz/locutus/client.(*ResourceClient).prepareUnstructuredForUpdate(0xc004ee3a10, 0x0, 0xc000667c48, 0x2, 0x0)
	/go/pkg/mod/github.com/brancz/[email protected]/client/client.go:141 +0x41
github.com/brancz/locutus/client.(*ResourceClient).UpdateWithCurrent(0xc004ee3a10, 0x0, 0xc000667c48, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/brancz/[email protected]/client/client.go:135 +0x5a
github.com/brancz/locutus/rollout.(*CreateOrUpdateObjectAction).Execute(0x234a820, 0xc004ee3a10, 0xc000667c48, 0x0, 0x0)
	/go/pkg/mod/github.com/brancz/[email protected]/rollout/actions.go:34 +0x252
github.com/brancz/locutus/rollout.(*RolloutRunner).executeSingleAction(0xc00029c410, 0xc002d08c20, 0xe, 0xc000667c48, 0x234bc40, 0x175ba00)
	/go/pkg/mod/github.com/brancz/[email protected]/rollout/rollout.go:171 +0xd0
github.com/brancz/locutus/rollout.(*RolloutRunner).executeAction(0xc00029c410, 0xc002d08c20, 0xe, 0xc000667c48, 0x1, 0xa20000000156c93b)
	/go/pkg/mod/github.com/brancz/[email protected]/rollout/rollout.go:161 +0x9a
github.com/brancz/locutus/rollout.(*RolloutRunner).runStep(0xc00029c410, 0xc0037cfe40, 0xc000667c48, 0xe, 0xc0004bd3c0)
	/go/pkg/mod/github.com/brancz/[email protected]/rollout/rollout.go:143 +0x56
github.com/brancz/locutus/rollout.(*RolloutRunner).Execute(0xc00029c410, 0xc00021fda0, 0x0, 0x0)
	/go/pkg/mod/github.com/brancz/[email protected]/rollout/rollout.go:126 +0x3a0
github.com/brancz/locutus/config.(*ConfigPasser).Execute(0xc00042d320, 0xc00021fda0, 0x40e768, 0x30)
	/go/pkg/mod/github.com/brancz/[email protected]/config/config.go:34 +0x97
github.com/brancz/locutus/trigger/types.(*ExecutionRegister).Execute(0xc0004046c0, 0xc00021fda0, 0xc00046c600, 0x11fd)
	/go/pkg/mod/github.com/brancz/[email protected]/trigger/types/types.go:40 +0x6f
github.com/brancz/locutus/trigger/resource.(*ResourceTrigger).sync(0xc0004046c0, 0xc0003c6600, 0x26, 0x0, 0x10ca100)
	/go/pkg/mod/github.com/brancz/[email protected]/trigger/resource/resource.go:200 +0x2cf
github.com/brancz/locutus/trigger/resource.(*ResourceTrigger).processNextWorkItem(0xc0004046c0, 0xc0005d4f00)
	/go/pkg/mod/github.com/brancz/[email protected]/trigger/resource/resource.go:172 +0x100
github.com/brancz/locutus/trigger/resource.(*ResourceTrigger).worker(0xc0004046c0)
	/go/pkg/mod/github.com/brancz/[email protected]/trigger/resource/resource.go:161 +0x2b
created by github.com/brancz/locutus/trigger/resource.(*ResourceTrigger).Run
	/go/pkg/mod/github.com/brancz/[email protected]/trigger/resource/resource.go:125 +0x107

Move deploy directory

The deploy directory is currently confusing IMO; the entire repo contains deployment examples for the observatorium platform. The deploy directory on the other hand only includes configuration for deploying the operator. Let's move this into the operator directory instead.

cc @observatorium/maintainers @rollandf

Interact with Observatorium using `obsctl`

@onprem and I discussed this a bit today.

It would be a nice idea to have a dedicated CLI for interacting with Observatorium.

The CLI, obsctl, will allow users to:

  • Manage auth for multiple tenants by saving their configuration locally at a path like $(HOME)/.obs/config, so there is no need to manually configure OIDC flags on every request (a hypothetical config sketch follows below)
  • Configure rules for a tenant and view the configured rules
  • View series, labels, and eventually query metrics for a tenant

In the future, this can also allow setting Alertmanager configs for tenants (refer to tenant alerting proposal).

Open to feedback and suggestions! 🙂
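
To make the idea concrete, the locally saved configuration could look roughly like the sketch below. None of these field names exist in any current tool; they only illustrate the proposal:

# Hypothetical contents of $(HOME)/.obs/config, for illustration only.
current: team-a
tenants:
  team-a:
    api: https://observatorium.example.com/api/metrics/v1/team-a   # example URL
    oidc:
      issuerURL: https://dex.example.com/dex
      clientID: team-a-client
      clientSecretFile: /home/user/.obs/team-a-secret
  team-b:
    api: https://observatorium.example.com/api/metrics/v1/team-b
    oidc:
      issuerURL: https://dex.example.com/dex
      clientID: team-b-client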

build_dev fails with RUNTIME ERROR: Field does not exist: service

With commit 5c40043 when running ./build_dev.sh, we get the following error:

$ ./build_dev.sh 
+ set -o pipefail
+ rm -rf environments/dev/manifests
+ mkdir environments/dev/manifests
+ jsonnet -J vendor -m environments/dev/manifests environments/dev/main.jsonnet
+ xargs '-I{}' sh -c 'cat {} | gojsontoyaml > {}.yaml' -- '{}'
RUNTIME ERROR: Field does not exist: service
	components/observatorium.libsonnet:207:45-62	thunk from <object <anonymous>>
	components/observatorium.libsonnet:206:58-65	
	<std>:688:15-22	thunk <val> from <function <format_codes_arr>>
	<std>:695:27-30	thunk from <thunk <s> from <function <format_codes_arr>>>
	<std>:565:22-25	thunk from <function <format_code>>
	<std>:565:9-26	function <format_code>
	<std>:695:15-60	thunk <s> from <function <format_codes_arr>>
	<std>:700:24-25	thunk from <thunk <s_padded> from <function <format_codes_arr>>>
	<std>:475:30-33	thunk from <thunk from <function <pad_left>>>
	<std>:475:19-34	thunk from <function <pad_left>>
	...
	<std>:750:7-46	function <anonymous>
	<std>:227:7-23	function <anonymous>
		
	vendor/kube-thanos/kube-thanos-query.libsonnet:61:24-29	thunk from <thunk from <thunk <c> from <object <anonymous>>>>
	<std>:227:21-22	thunk from <function <anonymous>>
	<std>:749:17-21	thunk from <function <anonymous>>
	<std>:749:8-22	function <anonymous>
	<std>:227:7-23	function <anonymous>
		
	During manifestation	

Setup Jaeger backend

Setup a simple Jaeger backend, made up of:

  • Jaeger All in One with Badger (cc @burmanm). Once we figure out how HA can be achieved in our setup, we can split the deployments: collector in read-write mode, query in read-only mode.
  • Jaeger Service, exposing the UI port (16686) and the collector's gRPC (14250).

@aditya-konarde: how can we proceed with adding the Jaeger Agent as a sidecar to the Thanos pods? Should we start with the all-in-one+service first, and the sidecar in a follow-up issue? Or is there a deployment where I could already add the sidecar as part of the initial PR?

cc @objectiser

Images in the CRD should be optional, with defaults handled by operator

We would like to have defaults for the images in the Observatorium CR. There are currently 11 images to configure (6x Thanos, api, queryCache, memcached, memcached_exporter, receive-controller). This could be a hurdle for the user.

  1. Make images optional in the CRD.
  2. The operator will use preconfigured images if the user did not specify them in the CR.
  3. The user can override one or more images (see the illustrative CR below).
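
Under this proposal, a user could override only the images they care about and let the operator default the rest. The field paths below are illustrative; the actual CRD schema is authoritative:

# Illustrative only: override a single image, rely on operator defaults for the rest.
apiVersion: core.observatorium.io/v1alpha1
kind: Observatorium
metadata:
  name: observatorium-xyz
spec:
  api:
    image: quay.io/observatorium/api:latest   # example user-chosen override
  # thanos, queryCache, memcached, memcached_exporter, and receive-controller
  # images omitted: the operator would fall back to its preconfigured defaults.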

cortex-query-frontend config error resulting in CrashLoopBackOff

Tested locally, this is not failing the CI tests:

$ KUBECTL=./kubectl ; ./tests/e2e.sh deploy-operator  ; ./tests/e2e.sh test
Error from server (AlreadyExists): namespaces "minio" already exists
namespace/observatorium created
secret/thanos-objectstorage created
persistentvolumeclaim/minio created
deployment.apps/minio created
service/minio created
customresourcedefinition.apiextensions.k8s.io/observatoria.core.observatorium.io created
clusterrole.rbac.authorization.k8s.io/observatorium-operator created
clusterrolebinding.rbac.authorization.k8s.io/observatorium-operator created
deployment.apps/observatorium-operator created
observatorium.core.observatorium.io/observatorium-xyz created
Waiting for observatorium-xyz currentStatus=
Waiting for observatorium-xyz currentStatus=
Waiting for observatorium-xyz currentStatus=
Waiting for observatorium-xyz currentStatus=
Waiting for observatorium-xyz currentStatus=
Waiting for observatorium-xyz currentStatus=
Waiting for observatorium-xyz currentStatus=
Waiting for observatorium-xyz currentStatus=
Waiting for observatorium-xyz currentStatus=Not Started
Waiting for observatorium-xyz currentStatus=Not Started
observatorium-xyz CR status is now: Finished
deployment.extensions/minio condition met
deployment.extensions/observatorium-xyz-thanos-query condition met
job.batch/observatorium-up created
job.batch/observatorium-up condition met

Result:

[test@bkr-hv01 configuration]$ kubectl -n observatorium logs observatorium-xyz-cortex-query-frontend-86c9c6c798-pt25w
error loading config from /etc/cache-config/config.yaml: yaml: unmarshal errors:
  line 4: field log_queries_longer_than not found in type frontend.Config
  line 6: field query_range not found in type cortex.Config

Additional details

Name:         observatorium-xyz-cortex-query-frontend-86c9c6c798-pt25w
Namespace:    observatorium
Priority:     0
Node:         crc-w6th5-master-0/192.168.130.11
Start Time:   Sun, 15 Mar 2020 18:11:12 +0200
Labels:       app.kubernetes.io/component=query-cache
              app.kubernetes.io/instance=observatorium-xyz
              app.kubernetes.io/name=cortex-query-frontend
              app.kubernetes.io/part-of=observatorium
              app.kubernetes.io/version=master-8533a216
              pod-template-hash=86c9c6c798
Annotations:  k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.128.1.49"
                    ],
                    "dns": {},
                    "default-route": [
                        "10.128.0.1"
                    ]
                }]
              openshift.io/scc: restricted
Status:       Running
IP:           10.128.1.49
IPs:
  IP:           10.128.1.49
Controlled By:  ReplicaSet/observatorium-xyz-cortex-query-frontend-86c9c6c798
Containers:
  cortex-query-frontend:
    Container ID:  cri-o://2f23fbb1a78c242f6070976af78a269a06aa9d763996a327bec0c33ff6c82b58
    Image:         quay.io/cortexproject/cortex:master-8533a216
    Image ID:      quay.io/cortexproject/cortex@sha256:fddd93b98b5789761700f5103734a0f2af39b3b5bfeece26b224f5bbb53e4e4c
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      -log.level=debug
      -config.file=/etc/cache-config/config.yaml
      -querier.max-retries-per-request=0
      -frontend.downstream-url=http://observatorium-xyz-thanos-query.observatorium.svc.cluster.local.:9090
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 15 Mar 2020 18:14:10 +0200
      Finished:     Sun, 15 Mar 2020 18:14:10 +0200
    Ready:          False
    Restart Count:  5
    Environment:    <none>
    Mounts:
      /etc/cache-config/ from query-cache-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-vrqwc (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  query-cache-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      observatorium-xyz-cortex-query-frontend
    Optional:  false
  default-token-vrqwc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-vrqwc
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                         Message
  ----     ------     ----                  ----                         -------
  Normal   Scheduled  <unknown>             default-scheduler            Successfully assigned observatorium/observatorium-xyz-cortex-query-frontend-86c9c6c798-pt25w to crc-w6th5-master-0
  Normal   Pulled     4m8s (x5 over 5m33s)  kubelet, crc-w6th5-master-0  Container image "quay.io/cortexproject/cortex:master-8533a216" already present on machine
  Normal   Created    4m8s (x5 over 5m33s)  kubelet, crc-w6th5-master-0  Created container cortex-query-frontend
  Normal   Started    4m8s (x5 over 5m33s)  kubelet, crc-w6th5-master-0  Started container cortex-query-frontend
  Warning  BackOff    23s (x25 over 5m32s)  kubelet, crc-w6th5-master-0  Back-off restarting failed container

How is Observatorium Apache-licensed?

How is Observatorium Apache-licensed when Loki is AGPL? I see there is a plan to use Tempo for tracing, which is yet another AGPL-licensed project, so I wonder how Observatorium can be released under Apache 2.0. Is Observatorium using a fork of Loki? If so, what about the crucial fixes that went in recently, such as out-of-order writes?

Add "Vendor" directories

For offline builds, there is a need for a "vendor" directory.

  1. Add a vendor directory for Go libraries.
  2. Support a different vendor directory location for Jsonnet libraries.

API configured for traces even with tracing disabled

Notice that https://github.com/observatorium/observatorium/blob/main/configuration/examples/dev/manifests/api-deployment.yaml#L43 has

        - --traces.write.endpoint=observatorium-xyz-otel-collector:4317
        - --grpc.listen=0.0.0.0:8090

Of the three examples, only local has tracing enabled, yet all three examples now supply the flags to the API to forward traces (since #461). Tracing .enabled does correctly control the deployment of the tracing components.

I am looking into the fix but Jsonnet debugging is new to me. I have the syntax figured out, but Jsonnet's lazy declarative semantics make it tricky to figure out why part of the Jsonnet respects tracing .enabled and another part doesn't.

cc @pavolloffay

observatorium operator is CrashLoopBackOff

The observatorium operator is in CrashLoopBackOff due to:

W0811 08:43:28.497747       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0811 08:43:29.900357       1 request.go:621] Throttling request took 1.003361175s, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1beta1?timeout=32s
I0811 08:43:39.900375       1 request.go:621] Throttling request took 11.002939908s, request: GET:https://172.30.0.1:443/apis/apps.open-cluster-management.io/v1?timeout=32s
ts=2020-08-11T08:43:46.100131257Z caller=main.go:145 msg="failed to create resource trigger" err="failed to create client for Observatorium in core.observatorium.io/v1alpha1: discovering resource information failed for Observatorium in core.observatorium.io/v1alpha1: unable to retrieve the complete list of server APIs: proxy.open-cluster-management.io/v1beta1: the server is currently unable to handle the request"

I think the observatorium operator should not need to retrieve all server APIs.

Remove @jpkrohling as maintainer

My last day at Red Hat is Sep 30, and unfortunately, I do not intend on contributing to Observatorium further, as I do not think this aligns with my upcoming goals. As such, I would like to be removed as a maintainer. I had fun with the PRs I contributed and I'll be following the project from the outside.

Go module name

When I use the observatorium API, it throws:

go: finding github.com/observatorium/configuration latest
go: github.com/open-cluster-management/multicluster-monitoring-operator/cmd/manager imports
	github.com/observatorium/configuration/api/v1alpha1: github.com/observatorium/[email protected]: parsing go.mod:
	module declares its path as: github.com/configuration
	        but was required as: github.com/observatorium/configuration

I think we should rename the module name to github.com/observatorium/configuration in go.mod. Does it make sense? @nmagnezi

Add operator manifest files to jsonnet

Follow up from #461 (comment)

The getting-started installation requires installing the cert-manager, Jaeger, and OpenTelemetry operators before installing Observatorium (e.g. kubectl apply -f ./manifests/).

I have tried adding operator installation manifest files to the manifest directory (e.g. 00-cert-manager.yaml and 01-jaeger-operator.yaml), but it didn't work as expected:

kubectl apply -f ./manifests/
customresourcedefinition.apiextensions.k8s.io/certificaterequests.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/certificates.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/challenges.acme.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/clusterissuers.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/issuers.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/orders.acme.cert-manager.io created
namespace/cert-manager created
serviceaccount/cert-manager-cainjector created
serviceaccount/cert-manager created
serviceaccount/cert-manager-webhook created
configmap/cert-manager-webhook created
clusterrole.rbac.authorization.k8s.io/cert-manager-cainjector created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-issuers created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-clusterissuers created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-certificates created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-orders created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-challenges created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-ingress-shim created
clusterrole.rbac.authorization.k8s.io/cert-manager-view created
clusterrole.rbac.authorization.k8s.io/cert-manager-edit created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-approve:cert-manager-io created
clusterrole.rbac.authorization.k8s.io/cert-manager-controller-certificatesigningrequests created
clusterrole.rbac.authorization.k8s.io/cert-manager-webhook:subjectaccessreviews created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-cainjector created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-issuers created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-clusterissuers created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-certificates created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-orders created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-challenges created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-ingress-shim created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-approve:cert-manager-io created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-controller-certificatesigningrequests created
clusterrolebinding.rbac.authorization.k8s.io/cert-manager-webhook:subjectaccessreviews created
role.rbac.authorization.k8s.io/cert-manager-cainjector:leaderelection created
role.rbac.authorization.k8s.io/cert-manager:leaderelection created
role.rbac.authorization.k8s.io/cert-manager-webhook:dynamic-serving created
rolebinding.rbac.authorization.k8s.io/cert-manager-cainjector:leaderelection created
rolebinding.rbac.authorization.k8s.io/cert-manager:leaderelection created
rolebinding.rbac.authorization.k8s.io/cert-manager-webhook:dynamic-serving created
service/cert-manager created
service/cert-manager-webhook created
deployment.apps/cert-manager-cainjector created
deployment.apps/cert-manager created
deployment.apps/cert-manager-webhook created
mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
namespace/observability created
customresourcedefinition.apiextensions.k8s.io/jaegers.jaegertracing.io created
serviceaccount/jaeger-operator created
role.rbac.authorization.k8s.io/leader-election-role created
role.rbac.authorization.k8s.io/prometheus created
clusterrole.rbac.authorization.k8s.io/jaeger-operator-metrics-reader created
clusterrole.rbac.authorization.k8s.io/manager-role created
clusterrole.rbac.authorization.k8s.io/proxy-role created
rolebinding.rbac.authorization.k8s.io/leader-election-rolebinding created
rolebinding.rbac.authorization.k8s.io/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/jaeger-operator-proxy-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/manager-rolebinding created
service/jaeger-operator-metrics created
service/jaeger-operator-webhook-service created
deployment.apps/jaeger-operator created
mutatingwebhookconfiguration.admissionregistration.k8s.io/jaeger-operator-mutating-webhook-configuration created
validatingwebhookconfiguration.admissionregistration.k8s.io/jaeger-operator-validating-webhook-configuration created
configmap/observatorium-xyz-observatorium-api created
deployment.apps/observatorium-xyz-observatorium-api created
secret/observatorium-xyz-observatorium-api created
service/observatorium-xyz-observatorium-api created
serviceaccount/observatorium-xyz-observatorium-api created
deployment.apps/observatorium-xyz-gubernator created
role.rbac.authorization.k8s.io/observatorium-xyz-gubernator created
rolebinding.rbac.authorization.k8s.io/observatorium-xyz-gubernator created
service/observatorium-xyz-gubernator created
serviceaccount/observatorium-xyz-gubernator created
service/observatorium-xyz-loki-compactor-grpc created
service/observatorium-xyz-loki-compactor-http created
statefulset.apps/observatorium-xyz-loki-compactor created
configmap/observatorium-xyz-loki created
deployment.apps/observatorium-xyz-loki-distributor created
service/observatorium-xyz-loki-distributor-grpc created
service/observatorium-xyz-loki-distributor-http created
service/observatorium-xyz-loki-gossip-ring created
service/observatorium-xyz-loki-index-gateway-grpc created
service/observatorium-xyz-loki-index-gateway-http created
statefulset.apps/observatorium-xyz-loki-index-gateway created
service/observatorium-xyz-loki-ingester-grpc created
service/observatorium-xyz-loki-ingester-http created
statefulset.apps/observatorium-xyz-loki-ingester created
deployment.apps/observatorium-xyz-loki-querier created
service/observatorium-xyz-loki-querier-grpc created
service/observatorium-xyz-loki-querier-http created
deployment.apps/observatorium-xyz-loki-query-frontend created
service/observatorium-xyz-loki-query-frontend-grpc created
service/observatorium-xyz-loki-query-frontend-http created
deployment.apps/observatorium-xyz-loki-query-scheduler created
service/observatorium-xyz-loki-query-scheduler-grpc created
service/observatorium-xyz-loki-query-scheduler-http created
deployment.apps/minio unchanged
persistentvolumeclaim/minio unchanged
secret/loki-objectstorage created
secret/thanos-objectstorage created
service/minio unchanged
service/observatorium-xyz-thanos-compact created
serviceaccount/observatorium-xyz-thanos-compact created
statefulset.apps/observatorium-xyz-thanos-compact created
deployment.apps/observatorium-xyz-thanos-query created
service/observatorium-xyz-thanos-query-frontend-memcached created
serviceaccount/observatorium-xyz-thanos-query-frontend-memcached created
statefulset.apps/observatorium-xyz-thanos-query-frontend-memcached created
deployment.apps/observatorium-xyz-thanos-query-frontend created
service/observatorium-xyz-thanos-query-frontend created
serviceaccount/observatorium-xyz-thanos-query-frontend created
service/observatorium-xyz-thanos-query created
serviceaccount/observatorium-xyz-thanos-query created
configmap/observatorium-xyz-thanos-receive-controller-tenants created
deployment.apps/observatorium-xyz-thanos-receive-controller created
role.rbac.authorization.k8s.io/observatorium-xyz-thanos-receive-controller created
rolebinding.rbac.authorization.k8s.io/observatorium-xyz-thanos-receive-controller created
service/observatorium-xyz-thanos-receive-controller created
serviceaccount/observatorium-xyz-thanos-receive-controller created
service/observatorium-xyz-thanos-receive-default created
statefulset.apps/observatorium-xyz-thanos-receive-default created
serviceaccount/observatorium-xyz-thanos-receive created
service/observatorium-xyz-thanos-receive created
service/observatorium-xyz-thanos-rule created
serviceaccount/observatorium-xyz-thanos-rule created
statefulset.apps/observatorium-xyz-thanos-rule created
service/observatorium-xyz-thanos-store-memcached created
serviceaccount/observatorium-xyz-thanos-store-memcached created
statefulset.apps/observatorium-xyz-thanos-store-memcached created
serviceaccount/observatorium-xyz-thanos-store-shard created
service/observatorium-xyz-thanos-store-shard-0 created
statefulset.apps/observatorium-xyz-thanos-store-shard-0 created
Error from server (InternalError): error when creating "manifests/01-jaeger-operator.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.250.44:443: connect: connection refused
Error from server (InternalError): error when creating "manifests/01-jaeger-operator.yaml": Internal error occurred: failed calling webhook "webhook.cert-manager.io": Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.96.250.44:443: connect: connection refused
Error from server (InternalError): error when creating "manifests/tracing-jaeger-test-oidc.yaml": Internal error occurred: failed calling webhook "mjaeger.kb.io": Post "https://jaeger-operator-webhook-service.observability.svc:443/mutate-jaegertracing-io-v1-jaeger?timeout=10s": dial tcp 10.96.66.156:443: connect: connection refused
unable to recognize "manifests/tracing-otelcollector.yaml": no matches for kind "OpenTelemetryCollector" in version "opentelemetry.io/v1alpha1"


k get all -n observability
NAME                                   READY     STATUS              RESTARTS   AGE
pod/jaeger-operator-758fd794bf-vwv9t   0/2       ContainerCreating   0          10m

NAME                                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/jaeger-operator-metrics           ClusterIP   10.96.81.217   <none>        8443/TCP   10m
service/jaeger-operator-webhook-service   ClusterIP   10.96.66.156   <none>        443/TCP    10m

NAME                              READY     UP-TO-DATE   AVAILABLE   AGE
deployment.apps/jaeger-operator   0/1       1            0           10m

NAME                                         DESIRED   CURRENT   READY     AGE
replicaset.apps/jaeger-operator-758fd794bf   1         1         0         10m

kubectl apply -f does not wait between files/manifests until objects are ready. From the logs above it's obvious that the Jaeger and OTEL CRs could not be created; in addition, the Jaeger operator was not deployed correctly. The operators use cert-manager, and cert-manager has to be in the ready state before deploying the operators. The ready state should be verified via the cmctl check api tool (checking the cert-manager deployment is not sufficient).

kubectl apply -f will not work with the included operator manifest files; however, we could provide a make target to install the operators if that would help people get started. Note that operator installation varies by platform: on OpenShift the operators are installed via OLM, whereas on vanilla Kubernetes they are installed directly via manifests.

@periklis @bwplotka do you have any suggestions? If not, I will close this as won't-fix (at least I have documented the problems in this issue).

Detect configuration template generation errors.

Sorry for starting 🔥 again.. (:

I find it quite confusing that if I make a mistake like this, i.e. use a field that is not defined anywhere, there is no build error, and nothing changes in the generated files.

  store+: {
      image: obs.config.thanosImage,
      version: obs.config.thanosVersion,
      args+: [
       "try and error",
      ],
      objectStorageConfig: obs.config.objectStorageConfig,
      replicas: '${{THANOS_STORE_REPLICAS}}',
      resources: {
        requests: {
          cpu: '${THANOS_STORE_CPU_REQUEST}',
          memory: '${THANOS_STORE_MEMORY_REQUEST}',
        },
        limits: {
          cpu: '${THANOS_STORE_CPU_LIMIT}',
          memory: '${THANOS_STORE_MEMORY_LIMIT}',
        },
      },

I know that this is a silly mistake of mine; I am probably modifying some config struct rather than the container itself. But as a newbie I have no idea where the field I want lives, or whether it is called args, arg, or arguments, without manually looking into some deeply nested (3rd-level, because configuration -> kube-thanos -> ksonnet) repo at an exact commit.

AC:

  • Changing undefined fields should cause a generation error.

I believe it's because of our language choice. I would add this issue as another example of jsonnet not being suitable for this kind of production configuration.

Curious, would CUE @metalmatze help with that?

dev up client manifest fails to spawn a working pod

The up manifest under dev fails with the following:

kubectl -n observatorium logs observatorium-xyz-observatorium-up-5978d796c-8p555
level=error caller=main.go:153 ts=2020-05-27T07:59:26.4076684Z name=up msg="could not parse command line flags" err="--queries-file is invalid: open /etc/up/queries.yaml: no such file or directory"

This is not being used in CI, is it there by mistake?
This should either be fixed or removed.

P.S.
The up pod created by the job does not use --queries-file
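
For reference, the flag expects a queries file mounted at /etc/up/queries.yaml. The exact schema should be verified against the up repository; assuming it is a simple list of named PromQL queries, it might look like:

# Assumed shape of an up queries file; verify the schema in the up repository.
queries:
- name: query-path-probe
  query: vector(1)                       # trivial PromQL used as a read-path probe
- name: api-availability
  query: up{job="observatorium-api"}     # example query against ingested metrics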

Installing Observatorium on OpenShift issues

Adding this as a tracking issue for problems that were hit while getting Observatorium to work on an OpenShift cluster rather than just on kind.

Based on using the configuration/examples/dev files.

  1. The Dex namespace is not created by default, so applying the manifests fails for dex.
  2. The test configmap is not part of the tests manifests. Needed to apply configuration/tests/manifests/test-ca-tls.yaml
  3. Missing Dex configmap. Needed to apply tests/manifests/observatorium-xyz-tls-dex.yaml
  4. By default there is no observatorium Route created, so we cannot access the API endpoint. Created it with this:
apiVersion: v1
kind: Route
items:
- apiVersion: route.openshift.io/v1
  kind: Route
  metadata:
    name: observatorium
    namespace: observatorium
  spec:
    host: observatorium-xyz-observatorium.apps.my.obs.stuff.devshift.org
    port:
      targetPort: 8080
    to:
      kind: Service
      name: observatorium-xyz-observatorium-api
      weight: 100
    wildcardPolicy: None
  5. The observatorium-xyz-thanos-query and observatorium-xyz-thanos-query-frontend service accounts do not have privileges to use the anyuid SCC. Needed to add them via:
# oc adm policy add-scc-to-user anyuid -z observatorium-xyz-thanos-query -n observatorium
# oc adm policy add-scc-to-user anyuid -z observatorium-xyz-thanos-query-frontend -n observatorium
  6. The observatorium API would start before dex in some cases. I would need to manually delete the observatorium-xyz-observatorium-api pod so it would restart and find the tenant.

After this I was able to port-forward to dex to get the bearer token with curl via:

# oc port-forward -n dex svc/dex 5556
# curl --request POST --url https://127.0.0.1:5556/dex/token --insecure --data grant_type=password --data [email protected] --data password=password --data client_id=test --data client_secret=ZXhhbXBsZS1hcHAtc2VjcmV0 --data scope="openid email" | sed 's/^{.*"id_token":[^"]*"\([^"]*\)".*}/\1/'

I was then able to add it as a data source to Grafana.

Much thanks to @ianbillett and @matej-g !!

Generate Operator manifests with jsonnet

Currently, all of the manifests for deploying the operator, i.e. the RBAC and deployment, are written by hand. These should be generated via jsonnet just like all other Observatorium components.

make generate broken in master

When I tried to rebase PR-273 after PR-281, I noticed it fails to generate the jsonnet manifests. I tested this directly on the master branch and got the very same result:

➜  deployments git:(master) ✗ make generate
cd operator; GO111MODULE="on" go build -o /Users/nmagnezi/Desktop/deployments/tmp/bin/controller-gen sigs.k8s.io/controller-tools/cmd/controller-gen

cd operator/jsonnet; /Users/nmagnezi/Desktop/deployments/tmp/bin/jb install
make jsonnetfmt
jsonnetfmt -n 2 --max-blank-lines 2 --string-style s --comment-style s -i ./example/main.jsonnet ./environments/dev/main.jsonnet ./environments/base/default-config.libsonnet ./environments/base/main.jsonnet ./environments/base/observatorium.jsonnet ./tests/main.jsonnet ./components/observatorium-configure.libsonnet ./components/jaeger-agent.libsonnet ./components/oauth-proxy.libsonnet ./components/up-job.libsonnet ./components/jaeger-collector.libsonnet ./components/minio.libsonnet ./components/memcached.libsonnet ./components/loki-caches.libsonnet ./components/etcd.libsonnet ./components/loki.libsonnet ./components/dex.libsonnet ./components/cortex-query-frontend.libsonnet ./components/up.libsonnet ./components/observatorium.libsonnet ./operator/jsonnet/obs-operator.jsonnet ./operator/jsonnet/main.jsonnet ./operator/jsonnet/operator-config.libsonnet
rm -rf environments/dev/manifests
mkdir environments/dev/manifests
jsonnet -J operator/jsonnet/vendor -m environments/dev/manifests environments/dev/main.jsonnet | xargs -I{} sh -c 'cat {} | gojsontoyaml > {}.yaml' -- {}
RUNTIME ERROR: field does not exist: member
	environments/dev/../base/../../components/./loki.libsonnet:632:8-18	object <anonymous>
	environments/dev/../base/../../components/./loki.libsonnet:(509:20)-(567:4)	object <anonymous>
	environments/dev/../base/../../components/observatorium.libsonnet:295:34-52	thunk <o>
	std.jsonnet:1268:24
	std.jsonnet:1268:5-33	function <anonymous>
	environments/dev/../base/../../components/observatorium.libsonnet:295:17-53	thunk <a>
	environments/dev/../base/../../components/observatorium.libsonnet:(293:7)-(296:4)	function <anonymous>
	environments/dev/../base/../../components/observatorium.libsonnet:(293:7)-(296:4)	object <anonymous>
	environments/dev/main.jsonnet:120:1-14
make: *** [environments/dev/manifests] Error 1

volumeClaimTemplate does not work

From https://github.com/observatorium/deployments/blob/master/example/manifests/observatorium.yaml we can see that it is allowed to define a volumeClaimTemplate for the store/rule/receiver/compact components. However, the operator does not apply this definition to each component; the components are still using emptyDir: {}. For example: https://github.com/observatorium/deployments/blob/d5ba8dea045be9f219e69357928ee9453a7d6784/environments/base/manifests/thanos-rule-statefulSet.yaml#L77-L80

It seems kube-thanos does not support it either: https://github.com/thanos-io/kube-thanos/blob/c070408f18b155fca804cd3da7709313b1717965/jsonnet/kube-thanos/kube-thanos-receive.libsonnet#L108-L110
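
For reference, the kind of definition being ignored looks roughly like the excerpt below in the Observatorium CR (field paths are illustrative); the expectation is that the operator turns it into a volumeClaimTemplates entry on the component's StatefulSet instead of an emptyDir:

# Illustrative excerpt of a component volumeClaimTemplate in the Observatorium CR.
# Field paths may differ from the actual CRD schema.
spec:
  rule:
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: standard   # example value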

Make Loki optional

Observatorium is a unified service to alert and troubleshoot infrastructure. It combines observability signals such as metrics, logs, traces, profiles and more in a single experience to reduce mean time to resolution.

In my understanding, metrics, logs, traces, profiles, etc. are the components of Observatorium. They should not be designed as required components; they should be optional. We should only require at least one component to be enabled in order to install the Observatorium operator.

@brancz @squat any comments? Thanks.

Rename repository to observatorium/deployments

With more and more things being added to this repository, it's not only about the initial jsonnet configuration anymore.
Therefore we propose to rename this repository to observatorium/deployments.
Hopefully, it'll be a bit more obvious that this repository contains a few different ways of deploying Observatorium going forward.

/cc @observatorium/maintainers @rollandf @nmagnezi

add CI stage for jsonnet verification

Add a stage to the .drone.yml config to verify that the generated files are correct and match those in the repo; otherwise we can't differentiate hand-made changes from properly generated ones. A minimal sketch follows below.
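
A minimal sketch of such a stage, assuming Drone's standard docker pipeline syntax and an image that has jsonnet, jb, and gojsontoyaml available (the image name below is a placeholder):

# Sketch of a .drone.yml pipeline that regenerates manifests and fails
# if the committed files are out of date.
kind: pipeline
type: docker
name: verify-generated

steps:
- name: verify-jsonnet
  image: golang:1.17        # placeholder; needs jsonnet, jb, and gojsontoyaml
  commands:
  - make generate
  - git diff --exit-code    # non-zero exit if generated files were edited by hand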

`make generate` fails on macOS

make generate fails on OSX, producing a file named {}.yaml and deleting most of the .yaml deployment examples.

To reproduce, on a Mac,

cd configuration
touch components/observatorium.libsonnet
make generate
git status

The problem is the very small default xargs limits on Darwin. I am preparing a PR.

Note: This was the most confusing bug I have seen in a long time. It took me a long time to realize that make generate was only deleting 1/3 of the example deployment .yaml files, rather than all of them. I spent a long time looking for problems with shell, filename, and escape sequence differences between Mac and Linux.

Proposal add tracing API and OpenTelemetry collector

This is a proposal to expose tracing ingestion API in the Observatorium backed by the OpenTelemetry collector.

The OTEL collector does not provide any persistence; it is just a data forwarder as of now. However, it can be configured to send data to a variety of observability vendors or OSS projects like Jaeger or Zipkin, or to simply log data to standard output. After this proposal is done we will work on adding a storage platform for trace data (e.g. Jaeger).

This proposal contains two main action items:

  1. Expose tracing ingestion API in the API service
  2. Add deployments manifests for OpenTelemetry collector

Expose tracing ingestion API

There are a couple of open-source tracing protocols out there: Zipkin, Jaeger, and OpenTelemetry. At the moment OpenTelemetry seems to be the most popular and is projected to have the biggest adoption. Hence I would propose using OpenTelemetry as the trace ingestion API.

The OpenTelemetry protocol primarily supports gRPC for sending traces (I will refer to it as OTLP gRPC) with proto encoding. HTTP with protobuf encoding is supported as well; JSON encoding is still experimental. The majority of users use OTLP gRPC, and HTTP is used in environments where gRPC cannot be used (e.g. mobile clients). Because Observatorium supports HTTP, I would propose starting with OTLP HTTP and at the same time discussing how OTLP gRPC could be supported.

Multitenancy

Multitenancy should be handled the same way as it is for metrics and logs. The API service could add an HTTP header or an attribute/label to the data; this label would then be used in the collector to identify the tenant.

Add deployment manifests for OpenTelemetry collector

This is a pretty straightforward task. There is an OpenTelemetry operator and a Helm chart, but we could just as well use plain Kubernetes manifests given the simplicity of this stateless component. A configuration sketch follows below.
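
A minimal collector configuration for this setup could look like the sketch below: receive OTLP, optionally tag spans with a tenant attribute, and export them somewhere (here only to logs as a stand-in for Jaeger/Tempo). The tenant key and all endpoints are assumptions:

# OpenTelemetry Collector configuration sketch for trace forwarding.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  attributes:
    actions:
    - key: tenant_id          # assumed label used to identify the tenant
      value: test-oidc        # in practice set per request by the API layer
      action: upsert

exporters:
  logging:                    # stand-in for a real tracing backend
    loglevel: info

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes]
      exporters: [logging]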

Other related changes

The Observatorium API is instrumented with the OpenTelemetry API. During deployment it could be automatically configured to report data to the OTEL collector as an internal user.

Pin Hugo version

Right now the Hugo version is not pinned for users:

.PHONY: web
web:
	cd $(WEBSITE_DIR) && \
	hugo -b $(WEBSITE_BASE_URL)

.PHONY: web-serve
web-serve:
	cd $(WEBSITE_DIR) && \
	hugo serve

Docs: Improve getting started documentation

Currently, the getting started documentation applies only to running Observatorium locally on kind. It would be nice to have some guidance for users who want to try running Observatorium on a cluster with the manifests from observatorium/configuration/examples/dev, since the procedure differs from running it locally.

Create proposal for tenant rate limits

Quotas we discussed:

  • Series pushed (remote write)
  • Series touched (query)

Open question:
This does not cover samples; should we include them too? (It matters whether users touch 1000 samples or 1, but not as much as series.) A strawman configuration sketch follows below.
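
As a strawman for the proposal, per-tenant limits could be expressed along these lines. This is purely illustrative; neither the file format nor the field names exist today:

# Illustrative per-tenant quota configuration for the proposal.
tenants:
- name: team-a
  limits:
    write:
      activeSeries: 1000000          # series pushed via remote write
      samplesPerSecond: 50000        # open question: limit samples as well?
    read:
      seriesTouchedPerQuery: 500000  # series touched by a single query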
