thanos-io / kube-thanos
Kubernetes specific configuration for deploying Thanos.
License: Apache License 2.0
The manifests aren't up to date with the latest code on master; would it be possible to re-generate them? Another option would be to remove the folder and instead generate the manifests on tag creation, as a release artifact.
The store flag is deprecated and is going to be replaced by the endpoint flag. The new flag should be supported by this library.
I need to inject some environment variables into all Thanos pods (to configure tracing). Since this is currently not supported by the library, I'll probably be doing this by patching multiple components in the generated JSON.
Would adding support for it be something you'd be interested in? I'm happy to prepare a PR.
For example, for kube-thanos-store it could be something like:
(...)
env: [
  // (existing env config)
] + (
  if std.length(ts.config.extraEnv) > 0 then
    ts.config.extraEnv
  else []
),
(...)
not sure who the maintainers are, pinging @yeya24 @brancz @metalmatze for some feedback
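To illustrate the consumer side, usage could look like the sketch below. `extraEnv` is a hypothetical config field name pending maintainer feedback, and the import path is assumed:

```jsonnet
// Sketch: pass extra environment variables to the store component.
// 'extraEnv' is a hypothetical field name; the import path is an assumption.
local ts = (import 'kube-thanos/kube-thanos-store.libsonnet') + {
  config+:: {
    extraEnv: [
      // e.g. point a tracing client at the node-local agent
      { name: 'JAEGER_AGENT_HOST', valueFrom: { fieldRef: { fieldPath: 'status.hostIP' } } },
    ],
  },
};

ts
```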
YAML file:
https://github.com/thanos-io/kube-thanos/blob/e1a68590f56034ca1a43d59401f03e72fd01ac5f/examples/all/manifests/thanos-query-deployment.yaml
config:
- args:
- query
- --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local
- --store=dnssrv+_grpc._tcp.thanos-store-0.thanos.svc.cluster.local
- --store=dnssrv+_grpc._tcp.thanos-store-1.thanos.svc.cluster.local
- --store=dnssrv+_grpc._tcp.thanos-store-2.thanos.svc.cluster.local
Hey! I've spent a good part of today combining kube-thanos and kube-prometheus into one jsonnet file. I think the process could have been a lot smoother with examples (like a full combination of the two) and if some defaults were set to mesh more nicely with kube-prometheus (such as the namespace).
I could potentially share my config with the developers once I've had the opportunity to remove confidential information. Would that be of use?
Just got this error: Error: secret "thanos-objectstorage" not found
I checked the YAML files and could not find a manifest for that secret.
# ls manifests/
thanos-query-deployment.yaml thanos-query-serviceMonitor.yaml thanos-store-serviceAccount.yaml thanos-store-service.yaml
thanos-query-serviceAccount.yaml thanos-query-service.yaml thanos-store-serviceMonitor.yaml thanos-store-statefulSet.yaml
It'd be neat if we could pass config elements that map to the --max-time and --min-time thanos store flags, to implement time-based partitioning (https://thanos.io/tip/components/store.md/#time-based-partitioning). At the moment, I'm doing this by patching the jsonnet-generated statefulset to append to the container's args.
I'm happy to make a PR, if this is a feature the maintainers want.
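For reference, the patch I apply today looks roughly like the following sketch. The import path is assumed, the `thanos-store` container name matches the manifests shown elsewhere in this document, and the time values are examples:

```jsonnet
// Append time-partitioning flags to the generated store StatefulSet.
local store = import 'store.libsonnet';  // the rendered kube-thanos store object (assumed path)

store {
  statefulSet+: {
    spec+: {
      template+: {
        spec+: {
          containers: std.map(
            function(c)
              if c.name == 'thanos-store'
              then c { args+: ['--min-time=-8w', '--max-time=-2w'] }
              else c,
            super.containers
          ),
        },
      },
    },
  },
}
```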
I was wondering if I could add a kustomization.yaml that lists the manifest files. kube-prometheus has one, and it would make it easier for teams using kustomize to just pull upstream changes into their kustomization when they are building Thanos.
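For illustration, a minimal kustomization.yaml over the generated manifests could look like this (file names taken from the example output listed elsewhere in this document):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- thanos-query-deployment.yaml
- thanos-query-service.yaml
- thanos-query-serviceAccount.yaml
- thanos-query-serviceMonitor.yaml
- thanos-store-service.yaml
- thanos-store-serviceAccount.yaml
- thanos-store-serviceMonitor.yaml
- thanos-store-statefulSet.yaml
```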
I work in an organisation where we are heavy users of Kubernetes running on Microsoft Azure AKS.
Thanos and kube-thanos have worked out great for us. However, Thanos requires more memory than what we have on ordinary application servers. The solution is to schedule Thanos on a different node pool with more memory than normal applications. To achieve this, one can use a combination of two Kubernetes features: taints and tolerations, and node affinity.
In the current version of kube-thanos these two fields are not configurable. I hope to contribute a pull request to the community where these two sections can be set up via jsonnet.
The end result should contain tolerations to all objects of kind: Deployment:
apiVersion: apps/v1
kind: Deployment
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-query
app.kubernetes.io/name: thanos-query
template:
metadata:
labels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-query
app.kubernetes.io/name: thanos-query
app.kubernetes.io/version: v0.19.0
spec:
tolerations:
- effect: NoSchedule
key: CriticalAddonsOnly
operator: Equal
value: "true"
...
The end result should also contain nodeAffinity to all objects of kind: Deployment:
apiVersion: apps/v1
kind: Deployment
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-query
app.kubernetes.io/name: thanos-query
template:
metadata:
labels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-query
app.kubernetes.io/name: thanos-query
app.kubernetes.io/version: v0.19.0
spec:
...
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: agentpool
operator: In
values:
- systempool
podAntiAffinity:
...
A working solution should build on standard kubernetes configuration, and be generic enough to fit into a similar setup on all major cloud providers.
There might be other ways to achieve the same result on Azure Kubernetes Service, i.e. to run Thanos on dedicated hardware; my proposal might not be the only good solution.
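In jsonnet, the overlay producing the YAML shown above could be sketched as follows. The import path is an assumption, and the values mirror the example output; treat this as a shape, not a final API:

```jsonnet
// Sketch: add tolerations and node affinity to the query Deployment.
local query = import 'query.libsonnet';  // the rendered kube-thanos query object (assumed path)

query {
  deployment+: {
    spec+: {
      template+: {
        spec+: {
          tolerations: [{
            effect: 'NoSchedule',
            key: 'CriticalAddonsOnly',
            operator: 'Equal',
            value: 'true',
          }],
          affinity+: {
            nodeAffinity: {
              requiredDuringSchedulingIgnoredDuringExecution: {
                nodeSelectorTerms: [{
                  matchExpressions: [
                    { key: 'agentpool', operator: 'In', values: ['systempool'] },
                  ],
                }],
              },
            },
          },
        },
      },
    },
  },
}
```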
We use command names for the names of files; the only two exceptions are querier and compactor. Should we rename them? What do you think?
Thanos components run as root by default:
$ docker run --rm -ti --entrypoint= quay.io/thanos/thanos:v0.14.0 id
uid=0(root) gid=0(root) groups=10(wheel)
Pods should probably have a restricted security context, I currently run them with the following (except receive and rule that I do not have):
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
Do you have an opinion on the uid/gid to use? 65534/nobody/nogroup seems popular too, but not everyone thinks this is what they should be used for.
In the manifests, I see YAML for a service account. What is the service account used for? Just for description?
CircleCI is broken and it is time to switch to GitHub Actions.
Links in the examples are broken. For example, here:
Line 113 in add27c3
should be https://thanos.io/tip/thanos/service-discovery.md/#dns-service-discovery
I think there could be various ways to do this, but my first hunch goes towards allowing to pass a PodSecurityPolicy name to bind each component to.
By setting terminationMessagePolicy to FallbackToLogsOnError, the last chunk of the container's log is used as the termination message when the container exits with an error, so we keep a snapshot of the logs if a pod errored out previously.
Add convenience functions for Thanos Receive to demonstrate separate Ingester and Router functionality.
As Thanos Query does not currently support mTLS per store, the recommended pattern is to add an Envoy proxy for each remote store that can terminate the TLS connection. It would be nice to add this to kube-thanos, to make it easier to deploy with Thanos.
In the thanos-rule manifest, there is a mistake in the args:
- --alert.label-drop="rule_replica"
should be
- --alert.label-drop=rule_replica
PR opened #116
Are there any recommendations on how to import the mixins?
I tried just importing mixin.libsonnet, since it will import the other configs, but now I'm getting an error saying there are duplicated rules, and Prometheus goes into a crash loop.
level=info ts=2019-09-24T17:44:32.693Z caller=main.go:670 msg="TSDB started"
level=info ts=2019-09-24T17:44:32.693Z caller=main.go:740 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2019-09-24T17:44:32.697Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-09-24T17:44:32.698Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-09-24T17:44:32.699Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-09-24T17:44:32.700Z caller=kubernetes.go:192 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=error ts=2019-09-24T17:44:32.749Z caller=manager.go:833 component="rule manager" msg="loading groups failed" err="groupname: \"thanos-querier.rules\" is repeated in the same file"
level=error ts=2019-09-24T17:44:32.749Z caller=manager.go:833 component="rule manager" msg="loading groups failed" err="groupname: \"thanos-receive.rules\" is repeated in the same file"
level=error ts=2019-09-24T17:44:32.749Z caller=manager.go:833 component="rule manager" msg="loading groups failed" err="groupname: \"thanos-store.rules\" is repeated in the same file"
level=error ts=2019-09-24T17:44:32.749Z caller=main.go:759 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
level=info ts=2019-09-24T17:44:32.749Z caller=main.go:523 msg="Stopping scrape discovery manager..."
level=info ts=2019-09-24T17:44:32.749Z caller=main.go:537 msg="Stopping notify discovery manager..."
level=info ts=2019-09-24T17:44:32.750Z caller=main.go:559 msg="Stopping scrape manager..."
level=info ts=2019-09-24T17:44:32.750Z caller=main.go:519 msg="Scrape discovery manager stopped"
level=error ts=2019-09-24T17:44:32.750Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=error ts=2019-09-24T17:44:32.750Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=error ts=2019-09-24T17:44:32.750Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=info ts=2019-09-24T17:44:32.751Z caller=main.go:533 msg="Notify discovery manager stopped"
level=error ts=2019-09-24T17:44:32.751Z caller=endpoints.go:131 component="discovery manager notify" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=info ts=2019-09-24T17:44:32.751Z caller=manager.go:815 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-09-24T17:44:32.751Z caller=manager.go:821 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-09-24T17:44:32.753Z caller=notifier.go:602 component=notifier msg="Stopping notification manager..."
level=info ts=2019-09-24T17:44:32.753Z caller=main.go:724 msg="Notifier manager stopped"
level=info ts=2019-09-24T17:44:32.753Z caller=main.go:553 msg="Scrape manager stopped"
level=error ts=2019-09-24T17:44:32.753Z caller=main.go:733 err="error loading config from \"/etc/prometheus/config_out/prometheus.env.yaml\": one or more errors occurred while applying the new configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\")"
Is it a good practice to create matching version tags with Thanos itself? And cut releases accordingly?
There might be several versions out there and depending on the version Thanos uses different flags, for example. It would be really good to know which versions are compatible with the current state.
What do you think?
kube-thanos has shaped up pretty nicely into a modular library. A concern that users still face is that they must render all components manually, even though they just pass a single config. See here for an example: not only is this leaky configuration with layering violations, it's also inconvenient and error-prone (it took multiple attempts to get certain parts right in the linked example).
It would be great if each individual component could offer a .manifests field that could be recursively used to build the above without leaking configuration into the final rendering act.
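To sketch the idea: the `.manifests` field name is the proposal itself, while the import path, `commonConfig` contents, and component call shape are illustrative assumptions following the style used elsewhere in this document:

```jsonnet
// Hypothetical: each component exposes its rendered objects under 'manifests',
// so a top-level file can collect them without re-plumbing configuration.
local t = import 'kube-thanos/thanos.libsonnet';  // assumed import path
local commonConfig = { namespace: 'thanos' };     // placeholder config

local store = t.store(commonConfig);
local query = t.query(commonConfig);

{
  ['store-' + name]: store.manifests[name]
  for name in std.objectFields(store.manifests)
} + {
  ['query-' + name]: query.manifests[name]
  for name in std.objectFields(query.manifests)
}
```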
I am following this article: https://programmer.group/how-to-use-thanos-to-implement-prometheus-multi-cluster-monitoring.html
Everything is fine until I use the kube-thanos build script. I can build the manifests for store and query, but when I apply them I keep getting an unbound immediate PersistentVolumeClaim on thanos-store. The OBJSTORE_CONFIG to access MinIO works for the Prometheus sidecar statefulset, but not for the store.
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/component: object-store-gateway
app.kubernetes.io/instance: thanos-store
app.kubernetes.io/name: thanos-store
app.kubernetes.io/version: v0.17.0
name: thanos-store
namespace: monit
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/component: object-store-gateway
app.kubernetes.io/instance: thanos-store
app.kubernetes.io/name: thanos-store
serviceName: thanos-store
template:
metadata:
labels:
app.kubernetes.io/component: object-store-gateway
app.kubernetes.io/instance: thanos-store
app.kubernetes.io/name: thanos-store
app.kubernetes.io/version: v0.17.0
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- thanos-store
- key: app.kubernetes.io/instance
operator: In
values:
- thanos-store
namespaces:
- monit
topologyKey: kubernetes.io/hostname
weight: 100
containers:
- args:
- store
- --log.level=info
- --log.format=logfmt
- --data-dir=/var/thanos/store
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- --objstore.config=$(OBJSTORE_CONFIG)
- --ignore-deletion-marks-delay=24h
env:
- name: OBJSTORE_CONFIG
valueFrom:
secretKeyRef:
key: thanos.yaml
name: thanos-objectstorage
image: quay.io/thanos/thanos:v0.17.0
livenessProbe:
failureThreshold: 8
httpGet:
path: /-/healthy
port: 10902
scheme: HTTP
periodSeconds: 30
name: thanos-store
ports:
- containerPort: 10901
name: grpc
- containerPort: 10902
name: http
readinessProbe:
failureThreshold: 20
httpGet:
path: /-/ready
port: 10902
scheme: HTTP
periodSeconds: 5
resources: {}
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /var/thanos/store
name: data
readOnly: false
terminationGracePeriodSeconds: 120
volumes: []
volumeClaimTemplates:
When I add a storageClass to the volumeClaimTemplates spec section, it is basically ignored; no matter what, I keep getting this error. Can you please shed some light on what's missing here?
I am getting the following error. Please let me know what is missing.
level=info ts=2020-02-25T12:27:04.082433095Z caller=main.go:149 msg="Tracing will be disabled"
level=info ts=2020-02-25T12:27:04.082519571Z caller=factory.go:43 msg="loading bucket configuration"
level=info ts=2020-02-25T12:27:04.099206846Z caller=inmemory.go:167 msg="created in-memory index cache" maxItemSizeBytes=131072000 maxSizeBytes=262144000 maxItems=math.MaxInt64
level=info ts=2020-02-25T12:27:04.099521724Z caller=options.go:20 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-02-25T12:27:04.099760677Z caller=store.go:288 msg="starting store node"
level=info ts=2020-02-25T12:27:04.099850267Z caller=prober.go:127 msg="changing probe status" status=healthy
level=info ts=2020-02-25T12:27:04.099887538Z caller=http.go:53 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2020-02-25T12:27:04.099942146Z caller=store.go:243 msg="initializing bucket store"
level=info ts=2020-02-25T12:27:04.131914347Z caller=prober.go:107 msg="changing probe status" status=ready
level=info ts=2020-02-25T12:27:04.131935092Z caller=http.go:78 service=http/server component=store msg="internal server shutdown" err="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
level=info ts=2020-02-25T12:27:04.131976289Z caller=prober.go:137 msg="changing probe status" status=not-healthy reason="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
level=info ts=2020-02-25T12:27:04.131983955Z caller=grpc.go:98 service=gRPC/server component=store msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=warn ts=2020-02-25T12:27:04.131999302Z caller=prober.go:117 msg="changing probe status" status=not-ready reason="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
level=info ts=2020-02-25T12:27:04.13202616Z caller=grpc.go:117 service=gRPC/server component=store msg="gracefully stopping internal server"
level=info ts=2020-02-25T12:27:04.132126286Z caller=grpc.go:129 service=gRPC/server component=store msg="internal server shutdown" err="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
level=error ts=2020-02-25T12:27:04.13216824Z caller=main.go:194 msg="running command failed" err="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
following is my configuration:
- args:
- store
- --data-dir=/var/thanos/store
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- |
--objstore.config=type: s3
config:
bucket: "amjad-thanos"
endpoint: "s3.eu-west-1.amazonaws.com"
access_key: "kajshdajkshd87098098"
secret_key: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxas"
I would be interested in integrating G-Research/thanos-remote-read into kube-thanos to provide remote_read functionality to prometheus-operator deployments.
As prometheus-operator is very clear about its scope (sidecar and ThanosRuler), I guess this would be the right place to implement that feature.
Currently the example jsonnet at all.jsonnet causes the StatefulSet for thanos-store to be duplicated, once with memcached and once without. This is probably due to
Line 157 in a6a0027
I suggest that the example be updated with a boolean variable that checks whether memcached is required and updates the StatefulSet in place. Let me know how that sounds and I can create a PR for the same.
Hey! So I have Istio installed in my cluster and am attempting to install the Thanos components as well. With Istio installed, none of my gRPC connections are successful, and querier receives the following for all services:
rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: connection failure
I've renamed the grpc service ports to http2 to no avail. I'm using the dnssrv+ approach for local service discovery, --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local (--store=dnssrv+_http2._tcp.thanos-store.monitoring.svc.cluster.local with the name swap).
Istio mutual TLS is disabled and the istio-proxy logs don't give me any clues; without Istio, however, everything discovers one another fine and works as intended.
Any ideas?
Although I'm the one who introduced those selectors, I realized they are error-prone and could do more harm than good. Better to remove them for more reliable alerting rules.
Store gateway supports hash sharding and it would be really good to have this support for compactor as well.
Currently all error-rate graphs show 0-100 percent, including the success rate. We're just not interested in the success rate in these graphs.
We should instead remove the success rate from those graphs and only show the errors. Lastly, we want to remove the Y-max, currently set to 1, and leave it open, so we can also see small error rates.
The containers are missing liveness/readiness probes.
As it's a Kubernetes best practice, we should get them added.
I can work on it once I figure out why jb is crashing locally :)
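For reference, the probes could mirror what the store statefulset already uses elsewhere in this document, e.g. for a component exposing the standard HTTP port (thresholds are up for discussion):

```yaml
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 10902
    scheme: HTTP
  periodSeconds: 30
  failureThreshold: 8
readinessProbe:
  httpGet:
    path: /-/ready
    port: 10902
    scheme: HTTP
  periodSeconds: 5
  failureThreshold: 20
```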
Hi,
I am using the Thanos sidecar for Prometheus (deployed via the operator). After block generation, the block is not uploading successfully and some irregular files are being generated. These are the logs:
level=warn ts=2020-11-26T03:03:51.020333618Z caller=s3.go:399 msg="could not guess file size for multipart upload; upload might be not optimized" name=debug/metas/01ER17WJWGR8Z44CHR42DQW3SY.json err="unsupported type of io.Reader"
level=error ts=2020-11-26T03:04:12.102790195Z caller=shipper.go:349 msg="failed to clean upload directory" err="unlinkat /prometheus/thanos/upload/01ER17WJWGR8Z44CHR42DQW3SY/chunks: directory not empty"
Hi there,
Before the recent refactor, I was able to add by nodeSelector restriction, like the following example:
local s =
t.store +
t.store.withVolumeClaimTemplate +
t.store.withServiceMonitor +
commonConfig + {
config+:: {
name: 'thanos-store',
replicas: 1,
},
statefulSet+: {
spec+: {
template+: {
spec+: {
nodeSelector+: {
'k8s.scaleway.com/pool-name': 'kubernetes-infra',
},
},
},
},
},
};
With the new format, how can I achieve the same behavior?
I tried the following:
local s = t.store(commonConfig {
replicas: 1,
serviceMonitor: true,
statefulSet+: {
spec+: {
template+: {
spec+: {
nodeSelector+: {
'k8s.scaleway.com/pool-name': 'kubernetes-infra',
},
},
},
},
},
});
Thanks in advance,
It would be convenient to add basic tracing configuration to the jsonnet files. I am not sure if this is in the scope of this repo?
I tried thanos-store, thanos-store-0, thanos-store-1 and thanos-store-2, but the pod does not start. Could you advise?
level=info ts=2022-03-24T07:39:58.779840562Z caller=caching_bucket_factory.go:71 msg="loading caching bucket configuration"
level=error ts=2022-03-24T07:39:58.781341992Z caller=resolver.go:99 msg="failed to lookup SRV records" host=_client._tcp.thanos-store.thanos.svc.cluster.local err="no such host"
level=error ts=2022-03-24T07:39:58.781506043Z caller=main.go:132 err="no server address resolved for \nfailed to create memcached client\ngithub.com/thanos-io/thanos/pkg/store/cache.NewCachingBucketFromYaml\n\t/home/circleci/project/pkg/store/cache/caching_bucket_factory.go:92\nmain.runStore\n\t/home/circleci/project/cmd/thanos/store.go:260\nmain.registerStore.func1\n\t/home/circleci/project/cmd/thanos/store.go:195\nmain.main\n\t/home/circleci/project/cmd/thanos/main.go:130\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\ncreate caching bucket\nmain.runStore\n\t/home/circleci/project/cmd/thanos/store.go:262\nmain.registerStore.func1\n\t/home/circleci/project/cmd/thanos/store.go:195\nmain.main\n\t/home/circleci/project/cmd/thanos/main.go:130\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\npreparing store command failed\nmain.main\n\t/home/circleci/project/cmd/thanos/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"
We need similar changes as we had in #188 for Receivers.
cc @craigfurman
The sidecar.libsonnet adds a store with prometheus-k8s.monitoring, which is the regular Prometheus service, instead of pointing to prometheus-operated.monitoring, which is the headless service.
The flag --experimental.enable-index-cache-postings-compression was removed in Thanos v0.17.0, and its behaviour is now the default.
At the time of writing, kube-thanos sets this flag when an index cache is configured: https://github.com/thanos-io/kube-thanos/blob/master/jsonnet/kube-thanos/kube-thanos-store.libsonnet#L64. This is itself a little strange to me, because an in-memory index cache is used by default, even if it's not explicitly configured.
I'm new to this project and am not sure yet what the best way forward is. At the very least I think kube-thanos will have to pull the version from config and not set this flag when v0.17.0+ is used. That also avoids a breaking change.
I'm happy to PR this, but figured it'd be best to get maintainer opinions first.
It is not uncommon to deploy tracing agents as daemonsets on every node. The tracing libraries are then typically configured via an environment variable set through the Kubernetes downward API. It seems paradoxical to force this onto downstream users when the tracing config is already available.
Since it is not harmful to users that don't use this environment variable, I propose we automatically add the HOST_IP_ADDRESS environment variable to each Thanos component container, so that it can be used directly, for example for tracing purposes.
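Concretely, the injected variable would use the downward API, along these lines:

```yaml
env:
- name: HOST_IP_ADDRESS
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP  # the node's IP, where the tracing agent listens
```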
Can we add support for --compact.concurrency? As stated in https://thanos.io/tip/components/compact.md/#cpu, it is possible to spawn more worker threads in the thanos-compact component. Unfortunately, this flag is not currently available.
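As a workaround until the flag is exposed, the generated args can be patched. A sketch, where the import path and the 'thanos-compact' container name are assumptions and the concurrency value is an example:

```jsonnet
// Append --compact.concurrency to the generated compact StatefulSet.
local compact = import 'compact.libsonnet';  // the rendered kube-thanos compact object (assumed path)

compact {
  statefulSet+: {
    spec+: {
      template+: {
        spec+: {
          containers: std.map(
            function(c)
              if c.name == 'thanos-compact'
              then c { args+: ['--compact.concurrency=4'] }
              else c,
            super.containers
          ),
        },
      },
    },
  },
}
```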
The current kustomization.yaml file doesn't support all the components of the Thanos stack. Can it be updated to use the manifests from the examples/all-manifests folder, or can we add a kustomization.yaml file to that folder?
Would be happy to create a PR if it can be decided where the best location is.
Thanks
I'm not sure if this is a problem, but the compact statefulset doesn't have a headless service corresponding to the value set in serviceName, like the other statefulsets have.
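For parity with the other statefulsets, the missing headless service could look like this (the name, labels, and port are assumed to follow the conventions of the other manifests in this document):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-compact
  labels:
    app.kubernetes.io/name: thanos-compact
spec:
  clusterIP: None  # headless, so serviceName resolves to stable per-pod DNS names
  selector:
    app.kubernetes.io/name: thanos-compact
  ports:
  - name: http
    port: 10902
    targetPort: 10902
```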
We have a case requiring AWS STS support. It is supported by Thanos:
STS Endpoint
If you want to use an IAM credential retrieved from an instance profile, Thanos needs to authenticate through AWS STS. For this purpose you can specify your own STS endpoint.
By default Thanos will use endpoint https://sts.amazonaws.com/ and the AWS region's corresponding endpoints.
In order to support STS/ROSA clusters, I need to add annotations to the ServiceAccounts "thanos-store-shard", "thanos-compact", "thanos-receive", and "thanos-receive-controller" to provide the permission policy's ARN.
Hello everyone, can somebody help me with the ingress configuration for thanos-receive?
I have a pretty well-working Thanos stack running on minikube, but when I deploy it to the testing k8s environment, I get a lot of errors in the logs:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
annotations:
kubernetes.io/ingress.class: nginx
nginx.ingress.kubernetes.io/auth-realm: Authentication Required
# nginx.ingress.kubernetes.io/backend-protocol: GRPC
nginx.ingress.kubernetes.io/proxy-buffer-size: "128k"
nginx.ingress.kubernetes.io/proxy-connect-timeout: "40"
nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
nginx.ingress.kubernetes.io/proxy-buffering: "on"
nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "1024m"
nginx.ingress.kubernetes.io/auth-secret: prometheus-auth
nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/proxy-body-size: "20m"
name: thanos-receive
namespace: monitoring
spec:
rules:
logs of ingress pod:
2021/02/24 09:19:32 [error] 89#89: *1227 upstream timed out (110: Connection timed out) while connecting to upstream, client: 1.2.3.4, server: receive-thanos.sandbox.k8s.sandbox.example.com, request: "POST /api/v1/receive HTTP/1.1", upstream: "http://100.96.24.19:19291/api/v1/receive", host: "receive-thanos.sandbox.k8s.sandbox.example.com"
logs of receive:
level=error ts=2021-02-24T09:37:28.552040714Z caller=handler.go:330 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error" level=debug ts=2021-02-24T09:37:20.829019499Z caller=handler.go:315 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
logs of prometheus
ts=2021-02-24T10:06:20.725Z caller=dedupe.go:112 component=remote level=warn remote_name=b29d86 url=https://receive-thanos.sandbox.k8s.sandbox.example.com/api/v1/receive msg="Failed to send batch, retrying" err="server returned HTTP status 502 Bad Gateway: <html>"
All components use the default service account right now, which is problematic from a security standpoint: in GCP, for example, object storage bucket permissions are granted through the service account via workload identity, so even components that don't need object storage access currently get it.
I'll prepare a PR to create a ServiceAccount per component.
Store shards deployed before #199 was merged cannot easily be updated with the latest revision of kube-thanos, because of two changes made to StatefulSet.spec: volumeClaimTemplates and selector. The error I'm seeing from Kubernetes is: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden.
This isn't a huge problem, since it's possible to jsonnet-patch around it, but it was still a little time-consuming to diagnose and fix, and I wonder if there's a reasonable place to document this to save future users the time.
I was reading through the README and it says to use another repo; is this correct?
jb install github.com/metalmatze/kube-thanos/jsonnet/kube-thanos@master
According to the docs, backing thanos-store with a PVC is optional:
It [thanos-store] acts primarily as an API gateway and therefore does not need significant amounts of local disk space. It joins a Thanos cluster on startup and advertises the data it can access. It keeps a small amount of information about all remote blocks on local disk and keeps it in sync with the bucket. This data is generally safe to delete across restarts at the cost of increased startup times.
When compiling a jsonnet configuration that does not contain config.volumeClaimTemplate, the statefulset's pods are never created because the "data" volume cannot be found:
$ kubectl -n monitoring describe statefulset/thanos-store
Name: thanos-store
Namespace: monitoring
CreationTimestamp: Mon, 22 Nov 2021 15:10:46 +0000
Selector: app.kubernetes.io/component=object-store-gateway,app.kubernetes.io/instance=thanos-store,app.kubernetes.io/name=thanos-store
Labels: app.kubernetes.io/component=object-store-gateway
app.kubernetes.io/instance=thanos-store
app.kubernetes.io/name=thanos-store
app.kubernetes.io/version=v0.22.0
kustomize.toolkit.fluxcd.io/name=kube-prometheus-thanos
kustomize.toolkit.fluxcd.io/namespace=monitoring
Annotations: <none>
Replicas: 1 desired | 0 total
Update Strategy: RollingUpdate
Partition: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app.kubernetes.io/component=object-store-gateway
app.kubernetes.io/instance=thanos-store
app.kubernetes.io/name=thanos-store
app.kubernetes.io/version=v0.22.0
Service Account: thanos-store
Containers:
thanos-store:
Image: quay.io/thanos/thanos:v0.22.0
Ports: 10901/TCP, 10902/TCP
Host Ports: 0/TCP, 0/TCP
Args:
store
--log.level=info
--log.format=logfmt
--data-dir=/var/thanos/store
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--objstore.config=$(OBJSTORE_CONFIG)
--ignore-deletion-marks-delay=24h
Liveness: http-get http://:10902/-/healthy delay=0s timeout=1s period=30s #success=1 #failure=8
Readiness: http-get http://:10902/-/ready delay=0s timeout=1s period=5s #success=1 #failure=20
Environment:
OBJSTORE_CONFIG: <set to the key 'thanos.yaml' in secret 'thanos-objectstorage'> Optional: false
HOST_IP_ADDRESS: (v1:status.hostIP)
Mounts:
/var/thanos/store from data (rw)
Volumes: <none>
Volume Claims: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 8m41s (x19 over 30m) statefulset-controller create Pod thanos-store-0 in StatefulSet thanos-store failed error: Pod "thanos-store-0" is invalid: spec.containers[0].volumeMounts[0].name: Not found: "data"
Logic exists to ensure any passed volumeClaimTemplate is qualified for use with thanos-store: https://github.com/thanos-io/kube-thanos/blob/main/jsonnet/kube-thanos/kube-thanos-store.libsonnet#L28
This logic should be extended to add an emptyDir volume definition for "data" to the StatefulSet when no volumeClaimTemplate is passed.
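The fallback could be sketched like this inside the store library, in the fragment style used elsewhere in this document (assuming `ts.config.volumeClaimTemplate` defaults to an empty object when unset):

```jsonnet
// Fall back to an emptyDir volume when no volumeClaimTemplate is configured,
// matching the "data" volumeMount the container already declares.
statefulSet+: {
  spec+: {
    template+: {
      spec+: {
        volumes:
          if ts.config.volumeClaimTemplate == {} then
            [{ name: 'data', emptyDir: {} }]
          else
            [],
      },
    },
  },
},
```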