thanos-io / kube-thanos
Kubernetes specific configuration for deploying Thanos.
License: Apache License 2.0
The manifests aren't up to date with the latest code on master; would it be possible to re-generate them? Another option would be to remove the folder and instead generate the manifests on tag creation, as a release artifact.
The store flag is deprecated and is going to be replaced by the endpoint flag. The new flag should be supported by this library.
I need to inject some environment variables into all Thanos pods (to configure tracing). Since this is currently not supported by the library, I'll probably be doing this by patching multiple components in the generated JSON.
Would adding support for it be something you'd be interested in? I'm happy to prepare a PR.
For example, for kube-thanos-store it could be something like:
(...)
env: [
  // (existing env config)
] + (
  if std.length(ts.config.extraEnv) > 0 then
    ts.config.extraEnv
  else []
),
(...)
not sure who the maintainers are, pinging @yeya24 @brancz @metalmatze for some feedback
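To illustrate the consumer side, usage could look like the sketch below. `extraEnv` is a hypothetical config field name pending maintainer feedback, and the import path is assumed:

```jsonnet
// Sketch: pass extra environment variables to the store component.
// 'extraEnv' is a hypothetical field name; the import path is an assumption.
local ts = (import 'kube-thanos/kube-thanos-store.libsonnet') + {
  config+:: {
    extraEnv: [
      // e.g. point a tracing client at the node-local agent
      { name: 'JAEGER_AGENT_HOST', valueFrom: { fieldRef: { fieldPath: 'status.hostIP' } } },
    ],
  },
};

ts
```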
YAML file:
https://github.com/thanos-io/kube-thanos/blob/e1a68590f56034ca1a43d59401f03e72fd01ac5f/examples/all/manifests/thanos-query-deployment.yaml
config:
- args:
- query
- --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local
- --store=dnssrv+_grpc._tcp.thanos-store-0.thanos.svc.cluster.local
- --store=dnssrv+_grpc._tcp.thanos-store-1.thanos.svc.cluster.local
- --store=dnssrv+_grpc._tcp.thanos-store-2.thanos.svc.cluster.local
Hey! I've spent a good part of today combining kube-thanos and kube-prometheus into one jsonnet file. I think the process could have been a lot smoother with examples (like a full combination of the two) and if some defaults were set to mesh more nicely with kube-prometheus (such as the namespace).
I could potentially share my config with the developers once I've had the opportunity to remove confidential information. Would that be of use?
Just got this error: Error: secret "thanos-objectstorage" not found
I checked the YAML files and could not find a manifest for that secret.
# ls manifests/
thanos-query-deployment.yaml thanos-query-serviceMonitor.yaml thanos-store-serviceAccount.yaml thanos-store-service.yaml
thanos-query-serviceAccount.yaml thanos-query-service.yaml thanos-store-serviceMonitor.yaml thanos-store-statefulSet.yaml
It'd be neat if we could pass config elements that map to the --max-time and --min-time thanos store flags, to implement time-based partitioning (https://thanos.io/tip/components/store.md/#time-based-partitioning). At the moment, I'm doing this by patching the jsonnet-generated statefulset to append to the container's args.
I'm happy to make a PR, if this is a feature the maintainers want.
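For reference, the patch I apply today looks roughly like the following sketch. The import path is assumed, the `thanos-store` container name matches the manifests shown elsewhere in this document, and the time values are examples:

```jsonnet
// Append time-partitioning flags to the generated store StatefulSet.
local store = import 'store.libsonnet';  // the rendered kube-thanos store object (assumed path)

store {
  statefulSet+: {
    spec+: {
      template+: {
        spec+: {
          containers: std.map(
            function(c)
              if c.name == 'thanos-store'
              then c { args+: ['--min-time=-8w', '--max-time=-2w'] }
              else c,
            super.containers
          ),
        },
      },
    },
  },
}
```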
I was wondering if I could add a kustomization.yaml that lists the manifest files. kube-prometheus has one, and it would make it easier for teams using kustomize to just pull upstream changes into their kustomization when they are building Thanos.
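For illustration, a minimal kustomization.yaml over the generated manifests could look like this (file names taken from the example output listed elsewhere in this document):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- thanos-query-deployment.yaml
- thanos-query-service.yaml
- thanos-query-serviceAccount.yaml
- thanos-query-serviceMonitor.yaml
- thanos-store-service.yaml
- thanos-store-serviceAccount.yaml
- thanos-store-serviceMonitor.yaml
- thanos-store-statefulSet.yaml
```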
I work in an organisation where we are heavy users of Kubernetes running on Microsoft Azure AKS.
Thanos and kube-thanos have worked out great for us. However, Thanos requires more memory than what we have on ordinary application servers. The solution is to schedule Thanos on a different node pool with more memory than normal applications. To achieve this, one can use a combination of two Kubernetes features: taints and tolerations, and node affinity.
In the current version of kube-thanos these two fields are not configurable. I hope to contribute a pull request to the community where these two sections can be set up via jsonnet.
The end result should contain tolerations to all objects of kind: Deployment:
apiVersion: apps/v1
kind: Deployment
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-query
app.kubernetes.io/name: thanos-query
template:
metadata:
labels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-query
app.kubernetes.io/name: thanos-query
app.kubernetes.io/version: v0.19.0
spec:
tolerations:
- effect: NoSchedule
key: CriticalAddonsOnly
operator: Equal
value: "true"
...
The end result should also contain nodeAffinity to all objects of kind: Deployment:
apiVersion: apps/v1
kind: Deployment
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-query
app.kubernetes.io/name: thanos-query
template:
metadata:
labels:
app.kubernetes.io/component: query-layer
app.kubernetes.io/instance: thanos-query
app.kubernetes.io/name: thanos-query
app.kubernetes.io/version: v0.19.0
spec:
...
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: agentpool
operator: In
values:
- systempool
podAntiAffinity:
...
A working solution should build on standard kubernetes configuration, and be generic enough to fit into a similar setup on all major cloud providers.
There might be other ways to achieve the same result on Azure Kubernetes Service, i.e. to run Thanos on dedicated hardware; my proposal might not be the only good solution.
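In jsonnet, the overlay producing the YAML shown above could be sketched as follows. The import path is an assumption, and the values mirror the example output; treat this as a shape, not a final API:

```jsonnet
// Sketch: add tolerations and node affinity to the query Deployment.
local query = import 'query.libsonnet';  // the rendered kube-thanos query object (assumed path)

query {
  deployment+: {
    spec+: {
      template+: {
        spec+: {
          tolerations: [{
            effect: 'NoSchedule',
            key: 'CriticalAddonsOnly',
            operator: 'Equal',
            value: 'true',
          }],
          affinity+: {
            nodeAffinity: {
              requiredDuringSchedulingIgnoredDuringExecution: {
                nodeSelectorTerms: [{
                  matchExpressions: [
                    { key: 'agentpool', operator: 'In', values: ['systempool'] },
                  ],
                }],
              },
            },
          },
        },
      },
    },
  },
}
```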
We use command names for the names of files; the only two exceptions are querier and compactor. Should we rename them? What do you think?
Thanos components run as root by default:
$ docker run --rm -ti --entrypoint= quay.io/thanos/thanos:v0.14.0 id
uid=0(root) gid=0(root) groups=10(wheel)
Pods should probably have a restricted security context, I currently run them with the following (except receive and rule that I do not have):
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
Do you have an opinion on the uid/gid to use? 65534/nobody/nogroup seems popular too, but not everyone thinks this is what they should be used for.
In the manifests, I see YAML for a service account. What is the service account used for? Just for description?
CircleCI is broken and it is time to switch to GitHub Actions.
Links in the examples are broken. For example, here:
Line 113 in add27c3
should be https://thanos.io/tip/thanos/service-discovery.md/#dns-service-discovery
I think there could be various ways to do this, but my first hunch goes towards allowing to pass a PodSecurityPolicy name to bind each component to.
By setting terminationMessagePolicy to FallbackToLogsOnError, the last chunk of the container's log is used as the termination message when the container exits with an error, so we keep a snapshot of the logs if a pod errored out previously.
Add convenience functions for Thanos Receive to demonstrate separate Ingester and Router functionality.
As Thanos Query does not currently support mTLS per store, the recommended pattern is to add an Envoy proxy for each remote store that can terminate the TLS connection. It would be nice to add this to kube-thanos, to make it easier to deploy with Thanos.
In the thanos-rule manifest, there is a mistake in the args:
- --alert.label-drop="rule_replica"
should be
- --alert.label-drop=rule_replica
PR opened #116
Are there any recommendations on how to import the mixins?
I tried just importing mixin.libsonnet, since it will import the other configs, but now I'm getting an error saying there are duplicated rules, and Prometheus goes into a crash loop.
level=info ts=2019-09-24T17:44:32.693Z caller=main.go:670 msg="TSDB started"
level=info ts=2019-09-24T17:44:32.693Z caller=main.go:740 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2019-09-24T17:44:32.697Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-09-24T17:44:32.698Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-09-24T17:44:32.699Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-09-24T17:44:32.700Z caller=kubernetes.go:192 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=error ts=2019-09-24T17:44:32.749Z caller=manager.go:833 component="rule manager" msg="loading groups failed" err="groupname: \"thanos-querier.rules\" is repeated in the same file"
level=error ts=2019-09-24T17:44:32.749Z caller=manager.go:833 component="rule manager" msg="loading groups failed" err="groupname: \"thanos-receive.rules\" is repeated in the same file"
level=error ts=2019-09-24T17:44:32.749Z caller=manager.go:833 component="rule manager" msg="loading groups failed" err="groupname: \"thanos-store.rules\" is repeated in the same file"
level=error ts=2019-09-24T17:44:32.749Z caller=main.go:759 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
level=info ts=2019-09-24T17:44:32.749Z caller=main.go:523 msg="Stopping scrape discovery manager..."
level=info ts=2019-09-24T17:44:32.749Z caller=main.go:537 msg="Stopping notify discovery manager..."
level=info ts=2019-09-24T17:44:32.750Z caller=main.go:559 msg="Stopping scrape manager..."
level=info ts=2019-09-24T17:44:32.750Z caller=main.go:519 msg="Scrape discovery manager stopped"
level=error ts=2019-09-24T17:44:32.750Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=error ts=2019-09-24T17:44:32.750Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=error ts=2019-09-24T17:44:32.750Z caller=endpoints.go:131 component="discovery manager scrape" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=info ts=2019-09-24T17:44:32.751Z caller=main.go:533 msg="Notify discovery manager stopped"
level=error ts=2019-09-24T17:44:32.751Z caller=endpoints.go:131 component="discovery manager notify" discovery=k8s role=endpoint msg="endpoints informer unable to sync cache"
level=info ts=2019-09-24T17:44:32.751Z caller=manager.go:815 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-09-24T17:44:32.751Z caller=manager.go:821 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-09-24T17:44:32.753Z caller=notifier.go:602 component=notifier msg="Stopping notification manager..."
level=info ts=2019-09-24T17:44:32.753Z caller=main.go:724 msg="Notifier manager stopped"
level=info ts=2019-09-24T17:44:32.753Z caller=main.go:553 msg="Scrape manager stopped"
level=error ts=2019-09-24T17:44:32.753Z caller=main.go:733 err="error loading config from \"/etc/prometheus/config_out/prometheus.env.yaml\": one or more errors occurred while applying the new configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\")"
Is it a good practice to create matching version tags with Thanos itself? And cut releases accordingly?
There might be several versions out there and depending on the version Thanos uses different flags, for example. It would be really good to know which versions are compatible with the current state.
What do you think?
kube-thanos has shaped up pretty nicely into a modular library. A concern that users still face is that they must render all components manually, even though they just pass a single config. See here for an example: not only is this leaky configuration with layering violations, it's also inconvenient and error-prone (it took multiple attempts to get certain parts right in the linked example).
It would be great if each individual component could offer a .manifests field that could be recursively used to build the above without leaking configuration into the final rendering act.
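To sketch the idea: the `.manifests` field name is the proposal itself, while the import path, `commonConfig` contents, and component call shape are illustrative assumptions following the style used elsewhere in this document:

```jsonnet
// Hypothetical: each component exposes its rendered objects under 'manifests',
// so a top-level file can collect them without re-plumbing configuration.
local t = import 'kube-thanos/thanos.libsonnet';  // assumed import path
local commonConfig = { namespace: 'thanos' };     // placeholder config

local store = t.store(commonConfig);
local query = t.query(commonConfig);

{
  ['store-' + name]: store.manifests[name]
  for name in std.objectFields(store.manifests)
} + {
  ['query-' + name]: query.manifests[name]
  for name in std.objectFields(query.manifests)
}
```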
I am following this article: https://programmer.group/how-to-use-thanos-to-implement-prometheus-multi-cluster-monitoring.html
Everything is fine until I use the kube-thanos build script. I can build the manifests for store and query, but when I apply them I keep getting an unbound immediate PersistentVolumeClaim on thanos-store. The OBJSTORE_CONFIG to access MinIO works for the Prometheus sidecar statefulset, but not for the store.
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/component: object-store-gateway
app.kubernetes.io/instance: thanos-store
app.kubernetes.io/name: thanos-store
app.kubernetes.io/version: v0.17.0
name: thanos-store
namespace: monit
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/component: object-store-gateway
app.kubernetes.io/instance: thanos-store
app.kubernetes.io/name: thanos-store
serviceName: thanos-store
template:
metadata:
labels:
app.kubernetes.io/component: object-store-gateway
app.kubernetes.io/instance: thanos-store
app.kubernetes.io/name: thanos-store
app.kubernetes.io/version: v0.17.0
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- thanos-store
- key: app.kubernetes.io/instance
operator: In
values:
- thanos-store
namespaces:
- monit
topologyKey: kubernetes.io/hostname
weight: 100
containers:
- args:
- store
- --log.level=info
- --log.format=logfmt
- --data-dir=/var/thanos/store
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- --objstore.config=$(OBJSTORE_CONFIG)
- --ignore-deletion-marks-delay=24h
env:
- name: OBJSTORE_CONFIG
valueFrom:
secretKeyRef:
key: thanos.yaml
name: thanos-objectstorage
image: quay.io/thanos/thanos:v0.17.0
livenessProbe:
failureThreshold: 8
httpGet:
path: /-/healthy
port: 10902
scheme: HTTP
periodSeconds: 30
name: thanos-store
ports:
- containerPort: 10901
name: grpc
- containerPort: 10902
name: http
readinessProbe:
failureThreshold: 20
httpGet:
path: /-/ready
port: 10902
scheme: HTTP
periodSeconds: 5
resources: {}
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /var/thanos/store
name: data
readOnly: false
terminationGracePeriodSeconds: 120
volumes: []
volumeClaimTemplates:
When I add a storageClass to the volumeClaimTemplates spec section, it is basically ignored; no matter what, I keep getting this error. Can you please shed some light on what's missing here?
I am getting the following error. Please let me know what is missing.
level=info ts=2020-02-25T12:27:04.082433095Z caller=main.go:149 msg="Tracing will be disabled"
level=info ts=2020-02-25T12:27:04.082519571Z caller=factory.go:43 msg="loading bucket configuration"
level=info ts=2020-02-25T12:27:04.099206846Z caller=inmemory.go:167 msg="created in-memory index cache" maxItemSizeBytes=131072000 maxSizeBytes=262144000 maxItems=math.MaxInt64
level=info ts=2020-02-25T12:27:04.099521724Z caller=options.go:20 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-02-25T12:27:04.099760677Z caller=store.go:288 msg="starting store node"
level=info ts=2020-02-25T12:27:04.099850267Z caller=prober.go:127 msg="changing probe status" status=healthy
level=info ts=2020-02-25T12:27:04.099887538Z caller=http.go:53 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2020-02-25T12:27:04.099942146Z caller=store.go:243 msg="initializing bucket store"
level=info ts=2020-02-25T12:27:04.131914347Z caller=prober.go:107 msg="changing probe status" status=ready
level=info ts=2020-02-25T12:27:04.131935092Z caller=http.go:78 service=http/server component=store msg="internal server shutdown" err="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
level=info ts=2020-02-25T12:27:04.131976289Z caller=prober.go:137 msg="changing probe status" status=not-healthy reason="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
level=info ts=2020-02-25T12:27:04.131983955Z caller=grpc.go:98 service=gRPC/server component=store msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=warn ts=2020-02-25T12:27:04.131999302Z caller=prober.go:117 msg="changing probe status" status=not-ready reason="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
level=info ts=2020-02-25T12:27:04.13202616Z caller=grpc.go:117 service=gRPC/server component=store msg="gracefully stopping internal server"
level=info ts=2020-02-25T12:27:04.132126286Z caller=grpc.go:129 service=gRPC/server component=store msg="internal server shutdown" err="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
level=error ts=2020-02-25T12:27:04.13216824Z caller=main.go:194 msg="running command failed" err="bucket store initial sync: sync block: MetaFetcher: iter bucket: Get https://amjad-thanos.s3.dualstack.eu-west-1.amazonaws.com/?delimiter=%2F&encoding-type=url&prefix=: 301 response missing Location header"
following is my configuration:
- args:
- store
- --data-dir=/var/thanos/store
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- |
--objstore.config=type: s3
config:
bucket: "amjad-thanos"
endpoint: "s3.eu-west-1.amazonaws.com"
access_key: "kajshdajkshd87098098"
secret_key: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxas"
I would be interested in integrating G-Research/thanos-remote-read into kube-thanos to provide remote_read functionality to prometheus-operator deployments.
As prometheus-operator is very clear about its scope (sidecar and ThanosRuler), I guess this would be the right place to implement that feature.
Currently the example jsonnet at all.jsonnet causes the StatefulSet for thanos-store to be duplicated, once with memcached and once without. This is probably due to
Line 157 in a6a0027
I suggest that the example be updated with a boolean variable that checks whether memcached is required and updates the StatefulSet in place. Let me know how that sounds and I can create a PR for the same.
Hey! So I have Istio installed in my cluster and am attempting to install the Thanos components as well. With Istio installed, none of my gRPC connections are successful, and querier receives the following for all services:
rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: connection failure
I've renamed the grpc service ports to http2 to no avail. I'm using the dnssrv+ approach for local service discovery, --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local (--store=dnssrv+_http2._tcp.thanos-store.monitoring.svc.cluster.local with the name swap).
Istio mutual TLS is disabled and the istio-proxy logs don't give me any clues; without Istio, however, everything discovers one another fine and works as intended.
Any ideas?
Although I'm the one who introduced those selectors, I realized they are error-prone and could do more harm than good. Better to remove them for more reliable alerting rules.
Store gateway supports hash sharding and it would be really good to have this support for compactor as well.
Currently all error-rate graphs show 0-100 percent, including the success rate. We're just not interested in the success rate in these graphs.
We should instead remove the success rate from those graphs and only show the errors. Lastly, we want to remove the Y-max, currently set to 1, and leave it open, so we can also see small error rates.
The containers are missing liveness/readiness probes.
As it's a Kubernetes best practice, we should get them added.
I can work on it once I figure out why jb is crashing locally :)
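For reference, the probes could mirror what the store statefulset already uses elsewhere in this document, e.g. for a component exposing the standard HTTP port (thresholds are up for discussion):

```yaml
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 10902
    scheme: HTTP
  periodSeconds: 30
  failureThreshold: 8
readinessProbe:
  httpGet:
    path: /-/ready
    port: 10902
    scheme: HTTP
  periodSeconds: 5
  failureThreshold: 20
```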
Hi,
I am using the Thanos sidecar for Prometheus (deployed via the operator). After block generation, the block is not uploading successfully and some irregular files are being generated. These are the logs:
level=warn ts=2020-11-26T03:03:51.020333618Z caller=s3.go:399 msg="could not guess file size for multipart upload; upload might be not optimized" name=debug/metas/01ER17WJWGR8Z44CHR42DQW3SY.json err="unsupported type of io.Reader"
level=error ts=2020-11-26T03:04:12.102790195Z caller=shipper.go:349 msg="failed to clean upload directory" err="unlinkat /prometheus/thanos/upload/01ER17WJWGR8Z44CHR42DQW3SY/chunks: directory not empty"
Hi there,
Before the recent refactor, I was able to add by nodeSelector restriction, like the following example:
local s =
t.store +
t.store.withVolumeClaimTemplate +
t.store.withServiceMonitor +
commonConfig + {
config+:: {
name: 'thanos-store',
replicas: 1,
},
statefulSet+: {
spec+: {
template+: {
spec+: {
nodeSelector+: {
'k8s.scaleway.com/pool-name': 'kubernetes-infra',
},
},
},
},
},
};
With the new format, how can I achieve the same behavior?
I tried the following:
local s = t.store(commonConfig {
replicas: 1,
serviceMonitor: true,
statefulSet+: {
spec+: {
template+: {
spec+: {
nodeSelector+: {
'k8s.scaleway.com/pool-name': 'kubernetes-infra',
},
},
},
},
},
});
Thanks in advance,
It would be convenient to add basic tracing configuration to the jsonnet files. I am not sure if this is in the scope of this repo?
I tried thanos-store, thanos-store-0, thanos-store-1 and thanos-store-2, but the pod does not start. Could you advise?
level=info ts=2022-03-24T07:39:58.779840562Z caller=caching_bucket_factory.go:71 msg="loading caching bucket configuration"
level=error ts=2022-03-24T07:39:58.781341992Z caller=resolver.go:99 msg="failed to lookup SRV records" host=_client._tcp.thanos-store.thanos.svc.cluster.local err="no such host"
level=error ts=2022-03-24T07:39:58.781506043Z caller=main.go:132 err="no server address resolved for \nfailed to create memcached client\ngithub.com/thanos-io/thanos/pkg/store/cache.NewCachingBucketFromYaml\n\t/home/circleci/project/pkg/store/cache/caching_bucket_factory.go:92\nmain.runStore\n\t/home/circleci/project/cmd/thanos/store.go:260\nmain.registerStore.func1\n\t/home/circleci/project/cmd/thanos/store.go:195\nmain.main\n\t/home/circleci/project/cmd/thanos/main.go:130\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\ncreate caching bucket\nmain.runStore\n\t/home/circleci/project/cmd/thanos/store.go:262\nmain.registerStore.func1\n\t/home/circleci/project/cmd/thanos/store.go:195\nmain.main\n\t/home/circleci/project/cmd/thanos/main.go:130\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\npreparing store command failed\nmain.main\n\t/home/circleci/project/cmd/thanos/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"
We need similar changes as we had in #188 for Receivers.
cc @craigfurman
The sidecar.libsonnet adds a store with prometheus-k8s.monitoring, which is the regular Prometheus service, instead of pointing to prometheus-operated.monitoring, which is the headless service.
The flag --experimental.enable-index-cache-postings-compression was removed in Thanos v0.17.0, and its behaviour is now the default.
At the time of writing, kube-thanos sets this flag when an index cache is configured: https://github.com/thanos-io/kube-thanos/blob/master/jsonnet/kube-thanos/kube-thanos-store.libsonnet#L64. This is itself a little strange to me, because an in-memory index cache is used by default, even if it's not explicitly configured.
I'm new to this project and am not sure yet what the best way forward is. At the very least I think kube-thanos will have to pull the version from config and not set this flag when v0.17.0+ is used. That also avoids a breaking change.
I'm happy to PR this, but figured it'd be best to get maintainer opinions first.
It is not uncommon to deploy tracing agents as daemonsets on every node. The tracing libraries are then typically configured via an environment variable set through the Kubernetes downward API. It seems paradoxical to force this onto downstream users when the tracing config is already available.
Since it is not harmful to users that don't use this environment variable, I propose we automatically add the HOST_IP_ADDRESS environment variable to each Thanos component container, so that it can be used directly, for example for tracing purposes.
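Concretely, the injected variable would use the downward API, along these lines:

```yaml
env:
- name: HOST_IP_ADDRESS
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP  # the node's IP, where the tracing agent listens
```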
Can we add support for --compact.concurrency? As stated in https://thanos.io/tip/components/compact.md/#cpu, it is possible to spawn more worker threads in the thanos-compact component. Unfortunately, this flag is not currently available.
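As a workaround until the flag is exposed, the generated args can be patched. A sketch, where the import path and the 'thanos-compact' container name are assumptions and the concurrency value is an example:

```jsonnet
// Append --compact.concurrency to the generated compact StatefulSet.
local compact = import 'compact.libsonnet';  // the rendered kube-thanos compact object (assumed path)

compact {
  statefulSet+: {
    spec+: {
      template+: {
        spec+: {
          containers: std.map(
            function(c)
              if c.name == 'thanos-compact'
              then c { args+: ['--compact.concurrency=4'] }
              else c,
            super.containers
          ),
        },
      },
    },
  },
}
```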
The current kustomization.yaml file doesn't support all the components of the Thanos stack. Can it be updated to use the manifests from the examples/all-manifests folder, or can we add a kustomization.yaml file to that folder?
Would be happy to create a PR if it can be decided where the best location is.
Thanks
I'm not sure if this is a problem, but the compact statefulset doesn't have a headless service corresponding to the value set in serviceName, like the other statefulsets have.
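For parity with the other statefulsets, the missing headless service could look like this (the name, labels, and port are assumed to follow the conventions of the other manifests in this document):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-compact
  labels:
    app.kubernetes.io/name: thanos-compact
spec:
  clusterIP: None  # headless, so serviceName resolves to stable per-pod DNS names
  selector:
    app.kubernetes.io/name: thanos-compact
  ports:
  - name: http
    port: 10902
    targetPort: 10902
```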
We have a case requiring AWS STS support. It is supported by Thanos:
STS Endpoint
If you want to use an IAM credential retrieved from an instance profile, Thanos needs to authenticate through AWS STS. For this purpose you can specify your own STS endpoint.
By default Thanos will use endpoint https://sts.amazonaws.com/ and the AWS region's corresponding endpoints.
In order to support STS/ROSA clusters, I need to add annotations to the ServiceAccounts "thanos-store-shard", "thanos-compact", "thanos-receive", and "thanos-receive-controller" to provide the permission policy's ARN.
Hello everyone, can somebody help me with the ingress configuration for thanos-receive?
I have a pretty well-working Thanos stack running on minikube, but when I deploy it to the testing k8s environment, I get a lot of errors in the logs:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
annotations:
kubernetes.io/ingress.class: nginx
nginx.ingress.kubernetes.io/auth-realm: Authentication Required
# nginx.ingress.kubernetes.io/backend-protocol: GRPC
nginx.ingress.kubernetes.io/proxy-buffer-size: "128k"
nginx.ingress.kubernetes.io/proxy-connect-timeout: "40"
nginx.ingress.kubernetes.io/proxy-buffers-number: "8"
nginx.ingress.kubernetes.io/proxy-buffering: "on"
nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "1024m"
nginx.ingress.kubernetes.io/auth-secret: prometheus-auth
nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/proxy-body-size: "20m"
name: thanos-receive
namespace: monitoring
spec:
rules:
logs of ingress pod:
2021/02/24 09:19:32 [error] 89#89: *1227 upstream timed out (110: Connection timed out) while connecting to upstream, client: 1.2.3.4, server: receive-thanos.sandbox.k8s.sandbox.example.com, request: "POST /api/v1/receive HTTP/1.1", upstream: "http://100.96.24.19:19291/api/v1/receive", host: "receive-thanos.sandbox.k8s.sandbox.example.com"
logs of receive:
level=error ts=2021-02-24T09:37:28.552040714Z caller=handler.go:330 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error" level=debug ts=2021-02-24T09:37:20.829019499Z caller=handler.go:315 component=receive component=receive-handler msg="failed to handle request" err="context deadline exceeded"
logs of prometheus
ts=2021-02-24T10:06:20.725Z caller=dedupe.go:112 component=remote level=warn remote_name=b29d86 url=https://receive-thanos.sandbox.k8s.sandbox.example.com/api/v1/receive msg="Failed to send batch, retrying" err="server returned HTTP status 502 Bad Gateway: <html>"
All components use the default service account right now, which is problematic from a security standpoint: in GCP, for example, object storage bucket permissions are granted through the service account via workload identity, so even components that don't need object storage access currently get it.
I'll prepare a PR to create a ServiceAccount per component.
Store shards deployed before #199 was merged cannot easily be updated with the latest revision of kube-thanos, because of two changes made to StatefulSet.spec: volumeClaimTemplates and selector. The error I'm seeing from Kubernetes is: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden.
This isn't a huge problem, since it's possible to jsonnet-patch around it, but it was still a little time-consuming to diagnose and fix, and I wonder if there's a reasonable place to document this to save future users the time.
I was reading through the README and it says to use another repo; is this correct?
jb install github.com/metalmatze/kube-thanos/jsonnet/kube-thanos@master
According to the docs, backing thanos-store with a PVC is optional:
It [thanos-store] acts primarily as an API gateway and therefore does not need significant amounts of local disk space. It joins a Thanos cluster on startup and advertises the data it can access. It keeps a small amount of information about all remote blocks on local disk and keeps it in sync with the bucket. This data is generally safe to delete across restarts at the cost of increased startup times.
When compiling a jsonnet configuration that does not contain config.volumeClaimTemplate, the statefulset's pods are never created because the "data" volume cannot be found:
$ kubectl -n monitoring describe statefulset/thanos-store
Name: thanos-store
Namespace: monitoring
CreationTimestamp: Mon, 22 Nov 2021 15:10:46 +0000
Selector: app.kubernetes.io/component=object-store-gateway,app.kubernetes.io/instance=thanos-store,app.kubernetes.io/name=thanos-store
Labels: app.kubernetes.io/component=object-store-gateway
app.kubernetes.io/instance=thanos-store
app.kubernetes.io/name=thanos-store
app.kubernetes.io/version=v0.22.0
kustomize.toolkit.fluxcd.io/name=kube-prometheus-thanos
kustomize.toolkit.fluxcd.io/namespace=monitoring
Annotations: <none>
Replicas: 1 desired | 0 total
Update Strategy: RollingUpdate
Partition: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app.kubernetes.io/component=object-store-gateway
app.kubernetes.io/instance=thanos-store
app.kubernetes.io/name=thanos-store
app.kubernetes.io/version=v0.22.0
Service Account: thanos-store
Containers:
thanos-store:
Image: quay.io/thanos/thanos:v0.22.0
Ports: 10901/TCP, 10902/TCP
Host Ports: 0/TCP, 0/TCP
Args:
store
--log.level=info
--log.format=logfmt
--data-dir=/var/thanos/store
--grpc-address=0.0.0.0:10901
--http-address=0.0.0.0:10902
--objstore.config=$(OBJSTORE_CONFIG)
--ignore-deletion-marks-delay=24h
Liveness: http-get http://:10902/-/healthy delay=0s timeout=1s period=30s #success=1 #failure=8
Readiness: http-get http://:10902/-/ready delay=0s timeout=1s period=5s #success=1 #failure=20
Environment:
OBJSTORE_CONFIG: <set to the key 'thanos.yaml' in secret 'thanos-objectstorage'> Optional: false
HOST_IP_ADDRESS: (v1:status.hostIP)
Mounts:
/var/thanos/store from data (rw)
Volumes: <none>
Volume Claims: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 8m41s (x19 over 30m) statefulset-controller create Pod thanos-store-0 in StatefulSet thanos-store failed error: Pod "thanos-store-0" is invalid: spec.containers[0].volumeMounts[0].name: Not found: "data"
Logic exists to ensure any passed volumeClaimTemplate is qualified for use with thanos-store: https://github.com/thanos-io/kube-thanos/blob/main/jsonnet/kube-thanos/kube-thanos-store.libsonnet#L28
This logic should be extended to add an emptyDir volume definition for "data" to the StatefulSet when no volumeClaimTemplate is passed.
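The fallback could be sketched like this inside the store library, in the fragment style used elsewhere in this document (assuming `ts.config.volumeClaimTemplate` defaults to an empty object when unset):

```jsonnet
// Fall back to an emptyDir volume when no volumeClaimTemplate is configured,
// matching the "data" volumeMount the container already declares.
statefulSet+: {
  spec+: {
    template+: {
      spec+: {
        volumes:
          if ts.config.volumeClaimTemplate == {} then
            [{ name: 'data', emptyDir: {} }]
          else
            [],
      },
    },
  },
},
```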