k8s-stackdriver's People

Contributors

44past4, astropuffin, bmoyles0117, brett-elliott, bryan-hz, bskiba, catherinef-dev, cezarygerard, dependabot[bot], directxman12, douglas-reid, erocchi, jacobx33, kawych, kwiesmueller, laoj2, loburm, malusgreen, masijiaqiu, olagacek, osalau, piosz, serathius, shoucongc, shuaich, solodov, sophieliu15, x13n, yungdarek122, znirzej

k8s-stackdriver's Issues

Metrics from multiple google projects (external)

I have a GKE cluster in one Google Cloud project, and Google Cloud resources (Pub/Sub, Bigtable) in a different project.

I would like to autoscale via the External Metrics API based on those resources.

Is it possible to pass the project ID to the adapter? My main project already receives Stackdriver metrics from all the other projects.

Thank you.

prometheus-to-sd --dynamic-source is resolving pods in kube-system

Hi, I have this DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{ template "prometheus.fullname" . }}
  labels:
    app: {{ template "prometheus.name" . }}
    chart: {{ template "qpipeline.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  selector:
    matchLabels:
      monitor: kamon-to-prometheus
  template:
    metadata:
      labels:
        monitor: kamon-to-prometheus
    spec:
      containers:
        - name: prometheus-to-sd
          image: {{ .Values.stackdriver.image }}
          command: ["/monitor", "--stackdriver-prefix={{ .Values.stackdriver.prefix }}",
                    "--dynamic-source=mix:http://:{{ .Values.stackdriver.port }}{{ .Values.stackdriver.endpoint }}?podIdLabel=kamon-to-prometheus&namespaceIdLabel=default",
                    "--namespace-id=default"]

The DaemonSet runs in the default namespace and should resolve pods in the default namespace as well, but I cannot force that and I get:

main.go:123] pods is forbidden: User "system:serviceaccount:default:default" cannot list pods in the namespace "kube-system": Unknown user "system:serviceaccount:default:default"

which means it tries to do service discovery in kube-system instead of default. It is hardcoded here to kube-system:

podNamespace = "kube-system"

I mean, I'm running this DaemonSet in the default namespace and all the pods it should discover live in the default namespace too, but the kube-system namespace is hardcoded. Shouldn't it use the --namespace-id flag instead of a constant?
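
For reference, a minimal sketch (not the project's actual code) of how the discovery namespace could be taken from the existing --namespace-id flag, falling back to kube-system only when the flag is unset:

package main

import (
	"flag"
	"fmt"
)

// Illustrative only: reuse the --namespace-id flag for pod discovery instead
// of a hardcoded constant.
var namespaceID = flag.String("namespace-id", "", "namespace the monitored pods run in")

func discoveryNamespace() string {
	if *namespaceID != "" {
		return *namespaceID
	}
	return "kube-system" // current hardcoded behavior as the fallback
}

func main() {
	flag.Parse()
	fmt.Println("resolving pods in namespace:", discoveryNamespace())
}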

Event is sent to SD even though newEvent.Count != oldEvent.Count+1

if newEvent.Count != oldCount+1 {
	// Sink doesn't send a LogEntry to Stackdriver, b/c event compression might
	// indicate that part of the watch history was lost, which may result in
	// multiple events being compressed. This may create an unnecessary
	// flood in Stackdriver. Also this is a perfectly valid behavior for the
	// configuration with empty backing storage.
	glog.V(2).Infof("Event count has increased by %d != 1.\n"+
		"\tOld event: %+v\n\tNew event: %+v", newEvent.Count-oldCount, oldEvent, newEvent)
}
receivedEntryCount.WithLabelValues(newEvent.Source.Component).Inc()
logEntry := s.logEntryFactory.FromEvent(newEvent)
s.logEntryChannel <- logEntry
receivedEntryCount.WithLabelValues(newEvent.Source.Component).Inc()
logEntry := s.logEntryFactory.FromEvent(newEvent)
s.logEntryChannel <- logEntry

I'm trying to better understand the behavior when the watcher sends an UPDATE event.
As I understand the comment, if newEvent.Count is more than one greater than oldEvent.Count, we don't want to send a logEntry to Stackdriver. However, the if block doesn't have a return statement to break out early.
It looks like we are sending the logEntry/newEvent regardless of the count.

Is this the expected behavior?
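
For reference, a minimal sketch (with stand-in types, not the sink's real ones) of how an early return would make the code match the comment:

package main

import "fmt"

// event is a stand-in for the exporter's event type.
type event struct{ Count int }

// shouldSend returns false when the count did not increase by exactly one,
// so the caller can skip the entry instead of falling through and sending it.
func shouldSend(oldEvent, newEvent event) bool {
	if newEvent.Count != oldEvent.Count+1 {
		// Compression may have collapsed several events; skip to avoid
		// flooding Stackdriver.
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldSend(event{Count: 1}, event{Count: 3})) // false
	fmt.Println(shouldSend(event{Count: 1}, event{Count: 2})) // true
}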

[kubelet-to-gcm] UsageNanoCores missing from CPUStats

From the logs:

W0604 12:26:23.181684       1 poll.go:46] Failed to create time series request: Failed to translate data from summary &{***}: UsageNanoCores missing from CPUStats &{2018-06-04 12:26:23 +0000 UTC <nil> 0xc420280b78}
kubectl version
Client Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.8-gke.0", GitCommit:"6e5b33a290a99c067003632e0fd6be0ead48b233", GitTreeState:"clean", BuildDate:"2018-02-16T18:28:23Z", GoVersion:"go1.8.3b4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.8-gke.0", GitCommit:"6e5b33a290a99c067003632e0fd6be0ead48b233", GitTreeState:"clean", BuildDate:"2018-02-16T18:26:58Z", GoVersion:"go1.8.3b4", Compiler:"gc", Platform:"linux/amd64"}

kubelet-to-gcm version: 1.2.4
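
The error indicates the kubelet summary returned CPUStats without UsageNanoCores set. A sketch (with a stand-in struct; the real kubelet summary type uses a *uint64 field) of guarding against the missing value instead of failing the whole time-series request:

package main

import "fmt"

// cpuStats stands in for the kubelet summary CPUStats type.
type cpuStats struct {
	UsageNanoCores *uint64
}

// cpuUsage reports whether the kubelet actually provided a usage sample.
func cpuUsage(s *cpuStats) (uint64, bool) {
	if s == nil || s.UsageNanoCores == nil {
		return 0, false
	}
	return *s.UsageNanoCores, true
}

func main() {
	if _, ok := cpuUsage(&cpuStats{}); !ok {
		fmt.Println("UsageNanoCores missing; skipping CPU sample")
	}
}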

fluent-plugin-systemd out of date

Hi,
I'm running a customized fluentd-gcp image (it just adds the fluent-plugin-rewrite-tag-filter plugin). I had to upgrade to the latest fluentd-gcp image recently because fluentd started crashing on GKE with the version we were running. However, I'm having trouble using the latest fluentd-gcp image (2.0.18). When I run fluentd with fluent-plugin-rewrite-tag-filter ~>2.1 installed, it fails with a dependency conflict:

Unable to activate fluent-plugin-systemd-0.0.11, because fluentd-1.2.5 conflicts with fluentd (~> 0.12) path=nil error_class=Gem::ConflictError error="Unable to activate fluent-plugin-systemd-0.0.11, because fluentd-1.2.5 conflicts with fluentd (~> 0.12)

I'm not sure how much testing would be required to upgrade the fluent-plugin-systemd plugin to the latest version?

direct-to-sd sample not working anymore

When running the direct-to-sd example on my GKE cluster I get the following error:

2018/03/02 10:27:36 Failed to write time series data: googleapi: Error 500: One or more TimeSeries could not be written:
An internal error occurred: timeSeries[0], backendError

Could this be related to the new Stackdriver model mentioned in #86?

GKE Stackdriver Kubernetes Beta: regional or multi-zone clusters do not show pod metrics

Hi

I've tried the new Stackdriver Kubernetes Beta monitoring for a GKE cluster (region: us-east4). In a zonal cluster everything works great: all resources (pods, nodes, namespaces, services, deployments, etc.) show up in Stackdriver -> Resources -> Kubernetes Beta. But if I use a regional or multi-zone cluster, there are only aggregated metrics for namespaces and nothing about pod or node metrics.

I didn't find any existing bug reports about this, which is why I've opened the issue here. If this should be redirected to the GCP Support Team, please let me know.

Thank you.

update README with correct image hosting location

I'm pretty sure that this image is hosted on gcr.io now and that the location listed in the README file is out of date.

The current location seems to be:

gcr.io/google-containers/fluentd-gcp

There is a 2.0.2 tag at that location that was updated within the last 6 months, whereas the old location has not been updated for two years: https://hub.docker.com/r/kubernetes/fluentd-gcp/tags/

A general overhaul of the README file would be great.

[prometheus-to-sd] Support custom paths

Right now the /metrics path is hardcoded in prometheus-to-sd's source (and it doesn't even seem to be documented...). I had to do some code-digging to find where I should serve my metrics.

However, I think there should be a way to configure this path. My use case is that I want to expose two endpoints with different paths: one for pod-level metrics and another for service-level metrics (the latter are harder to calculate, and I want to compute them only once per service per interval, not once per pod per interval).
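
A hypothetical sketch of what a configurable path could look like (the --metrics-path flag name is illustrative, not an existing option):

package main

import (
	"flag"
	"fmt"
	"net/url"
)

var (
	host        = flag.String("host", "localhost", "host to scrape")
	port        = flag.Uint("port", 8080, "port to scrape")
	metricsPath = flag.String("metrics-path", "/metrics", "path of the Prometheus endpoint")
)

// scrapeURL builds the endpoint URL from flags instead of hardcoding /metrics.
func scrapeURL() string {
	u := url.URL{
		Scheme: "http",
		Host:   fmt.Sprintf("%s:%d", *host, *port),
		Path:   *metricsPath,
	}
	return u.String()
}

func main() {
	flag.Parse()
	fmt.Println("scraping", scrapeURL())
}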

custom-metrics-stackdriver-adapter should use DefaultTokenSource

Although Stackdriver Monitoring supports not only GCP but also multi-cloud and hybrid-cloud deployments,
custom-metrics-stackdriver-adapter is strongly tied to GCE/GKE because it uses ComputeTokenSource.
DefaultTokenSource would be a more widely useful option because it allows using GOOGLE_APPLICATION_CREDENTIALS.

I have not tested it yet, but it should be a small patch like the one below:
apstndb@5cc3399
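
For context, a minimal sketch of the approach (using the public oauth2 and monitoring client helpers; not the adapter's actual code):

package main

import (
	"context"
	"log"

	"golang.org/x/oauth2/google"
	monitoring "google.golang.org/api/monitoring/v3"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()

	// DefaultTokenSource honors GOOGLE_APPLICATION_CREDENTIALS and falls back
	// to the GCE metadata server, so it works both on and off GCP, whereas
	// ComputeTokenSource only works on GCE/GKE.
	ts, err := google.DefaultTokenSource(ctx, monitoring.MonitoringScope)
	if err != nil {
		log.Fatalf("failed to build token source: %v", err)
	}

	svc, err := monitoring.NewService(ctx, option.WithTokenSource(ts))
	if err != nil {
		log.Fatalf("failed to create monitoring client: %v", err)
	}
	_ = svc
}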

Support configuration of scrape-interval + export-interval in prometheus-to-sd-kube-state-metrics.yaml

According to the README, prometheus-to-sd supports scrape-interval and export-interval flags to configure the rate of scraping and exporting:
https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/prometheus-to-sd#scrape-interval-vs-export-interval

However, the Helm chart does not support configuring those values:
https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/prometheus-to-sd/kubernetes/prometheus-to-sd-kube-state-metrics.yaml#L32

We can work around this by forking the YAML, but it would be nice if the official version supported this :)

Where are the metrics sent from event-exporter?

I spent the day digging around but could not figure out where I can see the metrics being sent from event-exporter.

Besides the one running in the 1.7+ GKE cluster, I even ran my own (the log shows it is successfully sending entries to Stackdriver, and I created the ClusterRoleBinding); however, I'm not sure where to actually see the metrics in Stackdriver (I tried to create a chart with custom metrics, but none appear).

Tailing the container gives me:

...
I0912 20:56:31.575763       1 sink.go:160] Sending 4 entries to Stackdriver
I0912 20:56:31.685767       1 sink.go:167] Successfully sent 4 entries to Stackdriver
...

Further, I want to confirm: is it exporting the events shown by kubectl get event?

Any guidance or a README update on this would be great.

Label addition to metrics with prefix "container.googleapis.com" should not break GCP/GKE?

Currently, adding a label breaks GKE. I checked https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/prometheus-to-sd: for metrics with the "container.googleapis.com" prefix (e.g., etcd metrics), if the definition of a metric is changed (a label is part of the definition), the metric is marked as broken and is no longer pushed:

func (cache *MetricDescriptorCache) ValidateMetricDescriptors(metrics map[string]*dto.MetricFamily, whitelisted []string) {

prometheus-to-sd only calls UpdateMetricDescriptors if the metrics prefix starts with "custom.googleapis.com":

if strings.HasPrefix(commonConfig.GceConfig.MetricsPrefix, customMetricsPrefix) {

However, many label additions are conceptually backward compatible, since their introduction does not break existing readers that are unaware of the label. So I would expect a backward-compatible label addition not to break GCP/GKE.

Here is an example: etcd-io/etcd#10022

prometheus-to-sd: warning log spam "Metric ... not found in the cache."

I hope this is the right place to report this bug. I am running a GKE cluster with version v1.7.5.

I get many (~10/minute) warnings like these in the stackdriver logging output:

W  Metric stackdriver_sink_request_count was not found in the cache. 
W  Metric stackdriver_sink_received_entry_count was not found in the cache. 
W  Metric stackdriver_sink_successfully_sent_entry_count was not found in the cache. 

This only started recently, after a Kubernetes upgrade. Am I doing something wrong, or how can I reduce these warnings?

Detailed log output

{
 insertId:  "1p24kzgfok4w2d"   
 labels: {
  compute.googleapis.com/resource_name:  "gke-live-cluster-pool-XXXXXX"    
  container.googleapis.com/namespace_name:  "kube-system"    
  container.googleapis.com/pod_name:  "event-exporter-1421584133-43n5j"    
  container.googleapis.com/stream:  "stderr"    
 }
 logName:  "projects/XXXXXX/logs/prometheus-to-sd-exporter"   
 receiveTimestamp:  "2017-09-13T10:36:48.347526394Z"   
 resource: {
  labels: {
   cluster_name:  "live-cluster"     
   container_name:  "prometheus-to-sd-exporter"     
   instance_id:  "XXXXXX"     
   namespace_id:  "kube-system"     
   pod_id:  "event-exporter-1421584133-43n5j"     
   project_id:  "wcd-production"     
   zone:  "europe-west1-d"     
  }
  type:  "container"    
 }
 severity:  "WARNING"   
 textPayload:  "Metric stackdriver_sink_successfully_sent_entry_count was not found in the cache."   
 timestamp:  "2017-09-13T10:36:46Z"   
}

event-exporter deployment

{
  "kind": "Deployment",
  "apiVersion": "extensions/v1beta1",
  "metadata": {
    "name": "event-exporter",
    "namespace": "kube-system",
    "selfLink": "/apis/extensions/v1beta1/namespaces/kube-system/deployments/event-exporter",
    "uid": "65301c2e-917d-11e7-8857-42010a840101",
    "resourceVersion": "9533781",
    "generation": 1,
    "creationTimestamp": "2017-09-04T14:29:01Z",
    "labels": {
      "addonmanager.kubernetes.io/mode": "Reconcile",
      "k8s-app": "event-exporter",
      "kubernetes.io/cluster-service": "true",
      "version": "v0.1.5"
    },
    "annotations": {
      "deployment.kubernetes.io/revision": "1",
      "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"apps/v1beta1\",\"kind\":\"Deployment\",\"metadata\":{\"annotations\":{},\"labels\":{\"addonmanager.kubernetes.io/mode\":\"Reconcile\",\"k8s-app\":\"event-exporter\",\"kubernetes.io/cluster-service\":\"true\",\"version\":\"v0.1.5\"},\"name\":\"event-exporter\",\"namespace\":\"kube-system\"},\"spec\":{\"replicas\":1,\"template\":{\"metadata\":{\"labels\":{\"k8s-app\":\"event-exporter\",\"version\":\"v0.1.5\"}},\"spec\":{\"containers\":[{\"command\":[\"/event-exporter\"],\"image\":\"gcr.io/google-containers/event-exporter:v0.1.5\",\"name\":\"event-exporter\"},{\"command\":[\"/monitor\",\"--component=event_exporter\",\"--stackdriver-prefix=container.googleapis.com/internal/addons\",\"--whitelisted-metrics=stackdriver_sink_received_entry_count,stackdriver_sink_request_count,stackdriver_sink_successfully_sent_entry_count\"],\"image\":\"gcr.io/google-containers/prometheus-to-sd:v0.2.1\",\"name\":\"prometheus-to-sd-exporter\",\"volumeMounts\":[{\"mountPath\":\"/etc/ssl/certs\",\"name\":\"ssl-certs\"}]}],\"serviceAccountName\":\"event-exporter-sa\",\"terminationGracePeriodSeconds\":30,\"volumes\":[{\"hostPath\":{\"path\":\"/etc/ssl/certs\"},\"name\":\"ssl-certs\"}]}}}}\n"
    }
  },
  "spec": {
    "replicas": 1,
    "selector": {
      "matchLabels": {
        "k8s-app": "event-exporter",
        "version": "v0.1.5"
      }
    },
    "template": {
      "metadata": {
        "creationTimestamp": null,
        "labels": {
          "k8s-app": "event-exporter",
          "version": "v0.1.5"
        }
      },
      "spec": {
        "volumes": [
          {
            "name": "ssl-certs",
            "hostPath": {
              "path": "/etc/ssl/certs"
            }
          }
        ],
        "containers": [
          {
            "name": "event-exporter",
            "image": "gcr.io/google-containers/event-exporter:v0.1.5",
            "command": [
              "/event-exporter"
            ],
            "resources": {},
            "terminationMessagePath": "/dev/termination-log",
            "terminationMessagePolicy": "File",
            "imagePullPolicy": "IfNotPresent"
          },
          {
            "name": "prometheus-to-sd-exporter",
            "image": "gcr.io/google-containers/prometheus-to-sd:v0.2.1",
            "command": [
              "/monitor",
              "--component=event_exporter",
              "--stackdriver-prefix=container.googleapis.com/internal/addons",
              "--whitelisted-metrics=stackdriver_sink_received_entry_count,stackdriver_sink_request_count,stackdriver_sink_successfully_sent_entry_count"
            ],
            "resources": {},
            "volumeMounts": [
              {
                "name": "ssl-certs",
                "mountPath": "/etc/ssl/certs"
              }
            ],
            "terminationMessagePath": "/dev/termination-log",
            "terminationMessagePolicy": "File",
            "imagePullPolicy": "IfNotPresent"
          }
        ],
        "restartPolicy": "Always",
        "terminationGracePeriodSeconds": 30,
        "dnsPolicy": "ClusterFirst",
        "serviceAccountName": "event-exporter-sa",
        "serviceAccount": "event-exporter-sa",
        "securityContext": {},
        "schedulerName": "default-scheduler"
      }
    },
    "strategy": {
      "type": "RollingUpdate",
      "rollingUpdate": {
        "maxUnavailable": "25%",
        "maxSurge": "25%"
      }
    },
    "revisionHistoryLimit": 2,
    "progressDeadlineSeconds": 600
  },
  "status": {
    "observedGeneration": 1,
    "replicas": 1,
    "updatedReplicas": 1,
    "readyReplicas": 1,
    "availableReplicas": 1,
    "conditions": [
      {
        "type": "Progressing",
        "status": "True",
        "lastUpdateTime": "2017-09-04T14:29:20Z",
        "lastTransitionTime": "2017-09-04T14:29:02Z",
        "reason": "NewReplicaSetAvailable",
        "message": "ReplicaSet \"event-exporter-1421584133\" has successfully progressed."
      },
      {
        "type": "Available",
        "status": "True",
        "lastUpdateTime": "2017-09-13T09:43:49Z",
        "lastTransitionTime": "2017-09-13T09:43:49Z",
        "reason": "MinimumReplicasAvailable",
        "message": "Deployment has minimum availability."
      }
    ]
  }
}

Cannot retrieve metric timeseries.

Hi,

I'm following...

https://cloud.google.com/kubernetes-engine/docs/tutorials/custom-metrics-autoscaling
and
https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/custom-metrics-stackdriver-adapter/examples/prometheus-to-sd

...to get custom-metric autoscaling working on a GKE v1.9.3 cluster.

  • I can view the metrics in Stackdriver successfully.
  • I can see that the custom-metrics-stackdriver-adapter can see the metrics:
JEG-CON-GEL0068:helm-haproxy-ingress james.masson$ curl -s http://localhost:8001/apis/custom.metrics.k8s.io/v1beta1/ | head -20
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "pods/go_goroutines",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "pods/go_memstats_alloc_bytes",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
  • I cannot get the custom-metrics-stackdriver-adapter to retrieve the values:
JEG-CON-GEL0068:helm-haproxy-ingress james.masson$ curl http://localhost:8001/apis/custom.metrics.k8s.io/v1beta1/namespaces/shared-services/pods/*/go_goroutines
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "the server could not find the metric go_goroutines for pods",
  "reason": "NotFound",
  "code": 404

Nothing interesting in the logs

JEG-CON-GEL0068:helm-haproxy-ingress james.masson$ kubectl -n shared-services logs po/custom-metrics-stackdriver-adapter-5fc6b6d856-s62tz
I0220 11:00:41.810065       1 serving.go:283] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0220 11:00:43.110267       1 serve.go:85] Serving securely on 0.0.0.0:443
I0220 11:00:52.050474       1 trace.go:76] Trace[629458047]: "List /apis/custom.metrics.k8s.io/v1beta1/namespaces/shared-services/pods/*/go_memstats_alloc_bytes" (started: 2018-02-20 11:00:51.28992615 +0000 UTC m=+12.274162011) (total time: 760.486521ms):
Trace[629458047]: [760.486521ms] [760.47448ms] END

I have haproxy/exporter/prometheus-to-sd configured like this

      - name: haproxy-ingress
        image: quay.io/jcmoraisjr/haproxy-ingress:v0.5-beta.1
        args:
        - --default-backend-service=kube-system/default-http-backend
        - --default-ssl-certificate=$(POD_NAMESPACE)/tls-secret
        - --configmap=$(POD_NAMESPACE)/haproxy-ingress
        - --publish-service=$(POD_NAMESPACE)/haproxy-ingress
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10253
            scheme: HTTP
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10253
            scheme: HTTP
          initialDelaySeconds: 10
          timeoutSeconds: 1
        ports:
        - name: http
          containerPort: 80
        - name: https
          containerPort: 443
        - name: stat
          containerPort: 1936
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      - name: haproxy-exporter
        image: prom/haproxy-exporter:v0.9.0
        args:
        - '--haproxy.scrape-uri=http://localhost:1936/?stats;csv'
        ports:
        - name: prometheus
          containerPort: 9101
      - name: prometheus-to-sd
        image: gcr.io/google-containers/prometheus-to-sd:v0.2.3
        command:
          - /monitor
          - --stackdriver-prefix=custom.googleapis.com
          - --source=:http://localhost:9101
          - --pod-id=$(POD_NAME)
          - --namespace-id=$(POD_NAMESPACE)
        env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace

Any hints on where to look next?

Thanks,

James M

Empty Stackdriver Traces

I'm on Google Cloud, and some of these tools are running automatically. I've been trying to track down empty traces in Stackdriver Trace, and I suspect they're coming from this pod. There are a whole bunch that look like this:

[Screenshot: empty traces, 2017-07-23]

If they're coming from here, is it possible to turn them off?

Authentication issue with prometheus-to-sd

I'm trying to use prometheus-to-sd in my project (as shown in the example kube config), but I keep getting this error:

Error while sending request to Stackdriver googleapi: Error 403: Request had insufficient authentication scopes., forbidden

I also tried adding auth credentials for a service account that has the "monitoring admin" and "metrics writer" roles, but no dice. This is what the kube config for my prometheus-to-sd container looks like:

        - name: prometheus-to-sd
          image: gcr.io/google-containers/prometheus-to-sd:v0.2.1
          ports:
            - name: profiler
              containerPort: 6060
          command:
            - /monitor
            - --stackdriver-prefix=custom.googleapis.com
            - --source=kube-state-metrics:http://localhost:5000
            - --pod-id=$(POD_NAME)
            - --namespace-id=$(POD_NAMESPACE)
          volumeMounts:
            - name: prometheus-to-sd-cred
              mountPath: /etc/cred
              readOnly: true
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/cred/prometheus-to-sd-cred.json
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

Metrics not sent after certain amount of time using prometheus-to-sd

I am using prometheus-to-sd v0.3.1 to send metrics from a Spring Boot application.
When the application is running in isolation, i.e. no other service is hitting this service, it is able to send metrics to Stackdriver. But as soon as I switch traffic to this service, i.e. other services start calling it, it stops sending metrics.
Also, if I let the application run in isolation for a few hours, it stops sending metrics after 3-4 hours.

When I check the logs in the Stackdriver Logging interface, there are no logs from the prometheus-to-sd container from the point at which it stops sending metrics.
Could you let me know where I could find more info on this problem?

GKE cluster version: 1.10.6.2
Kubernetes version: 1.10

        - name: applicationName-sd
          image: gcr.io/google-containers/prometheus-to-sd:v0.3.1
          command:
            - /monitor
            - --source=:http://localhost:42802/prometheus
            - --stackdriver-prefix=custom.googleapis.com/applicationName
            - --pod-id=$(POD_ID)
            - --namespace-id=$(POD_NAMESPACE)
          env:
            - name: POD_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

applicationName and the port have the right values in my config.

Logs from application start-up:

GCE config: &{Project:projectName Zone:zone Cluster:dev-v2 Instance:insatnceName MetricsPrefix:custom.googleapis.com/applicationName}
Taking source configs from flags 
Taking source configs from kubernetes api server 
Built the following source configs: [{ localhost 42802 /prometheus [] {applciationName-5f75979b6b-b4jh6 namespace}}] 
Running prometheus-to-sd, monitored target is  localhost:42802 
GCE config: &{Project:projectName Zone:us-east4-b Cluster:dev-v2 Instance:gke-dev-v2-standard8-10c2b7d3-xx6m MetricsPrefix:custom.googleapis.com/applicationName} 
Taking source configs from flags 
Taking source configs from kubernetes api server 
Built the following source configs: [{ localhost 42802 /prometheus [] {applicationName-5f75979b6b-n5tt6 namespace}}] 
Running prometheus-to-sd, monitored target is  localhost:42802 
GCE config: &{Project:projectName Zone:us-east4-b Cluster:dev-v2 Instance:gke-dev-v2-standard8-10c2b7d3-d6dd MetricsPrefix:custom.googleapis.com/applicationName} 
Taking source configs from flags 
Taking source configs from kubernetes api server 
Built the following source configs: [{ localhost 42802 /prometheus [] {applicationName-5f75979b6b-5fdlb namespace}}] 
Running prometheus-to-sd, monitored target is  localhost:42802 
GCE config: &{Project:projectName Zone:us-east4-c Cluster:dev-v2 Instance:gke-dev-v2-standard8-541c8362-kv5l MetricsPrefix:custom.googleapis.com/applicationName} 
Taking source configs from flags 
Taking source configs from kubernetes api server 
Built the following source configs: [{ localhost 42802 /prometheus [] {applicationName-5f75979b6b-tqjcw namespace}}] 
Running prometheus-to-sd, monitored target is  localhost:42802 
GCE config: &{Project:projectName Zone:us-east4-a Cluster:dev-v2 Instance:gke-dev-v2-standard8-61da8de0-zb18 MetricsPrefix:custom.googleapis.com/applicationName} 
Taking source configs from flags 
Taking source configs from kubernetes api server 
Built the following source configs: [{ localhost 42802 /prometheus [] {applicationName-5f75979b6b-74vnt namespace}}] 
Running prometheus-to-sd, monitored target is  localhost:42802 
GCE config: &{Project:projectName Zone:us-east4-b Cluster:dev-v2 Instance:gke-dev-v2-standard8-10c2b7d3-5scm MetricsPrefix:custom.googleapis.com/applicationName} 
Taking source configs from flags 
Taking source configs from kubernetes api server 
Built the following source configs: [{ localhost 42802 /prometheus [] {applicationName-5f75979b6b-l46x8 namespace}}] 
Running prometheus-to-sd, monitored target is  localhost:42802
Metric process_start_time_seconds invalid or not defined. Using 1970-01-01 00:00:01 +0000 UTC instead. Cumulative metrics might be inaccurate.

There are warning messages like the following after start-up, but it was still sending metrics to Stackdriver:
Metric process_start_time_seconds invalid or not defined. Using 1970-01-01 00:00:01 +0000 UTC instead. Cumulative metrics might be inaccurate

The port that exposes the metrics is also the port used by the liveness and readiness probes, but they hit a different URL.

I have 3+ pods running.
Let me know what more details I could provide.

[event-exporter] Does not validate sink-opts

Running event-exporter with the command

event-exporter -sink-opts="-location=europe-west2-b"

does not fail, but the location is not set.

I would expect it to fail on badly passed arguments.

[custom-metrics-stackdriver-adapter] new release ?

Looking at the container registry, the latest release of the metrics adapter was in July 2018, over 6 months ago. Since then, the ability to query metrics across projects, as well as to use GOOGLE_APPLICATION_CREDENTIALS, has been merged.

Can we get a new release published so these new features can be used? (Thanks!) Happy to assist with that however I can, as my org has a large need for cross-project metric queries for autoscaling to go GA in GKE.

[prometheus-to-sd] Allow adding a prefix/suffix to the metrics names

This is a feature request: I'd like to add a prefix or suffix to the scraped metrics' names. When collecting metrics from components whose metric names are difficult to change, I'd like to be able to distinguish those metrics in Stackdriver, especially by the environment they were collected from (e.g., staging or production).

Linking traces to log entries

I'm running a Golang HTTP server on K8s using OpenCensus for tracing. In my logs I set the JSON field trace to "projects/$PROJECT/traces/$TRACE_ID".

If I click on the "Log" link in a Stackdriver trace (see screenshot) I end up with zero log results. However, if I manually replace trace=... with jsonPayload.trace=... the filter works and I get the log entries associated with the trace I clicked on.

Is there anything I can do to make this work automatically? I.e., is there any way to set the trace field directly instead of just jsonPayload.trace?

[Screenshot: trace]
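
One thing that might help (hedged: whether it applies depends on how your logs reach Stackdriver): when logs are collected by the Cloud Logging agent, writing the trace under the structured-log key "logging.googleapis.com/trace" lets the agent promote it to the LogEntry trace field instead of leaving it in jsonPayload. A minimal Go sketch:

package main

import (
	"encoding/json"
	"fmt"
)

// logLine uses the special key so the logging agent can lift the trace out of
// the JSON payload and into the LogEntry trace field.
type logLine struct {
	Severity string `json:"severity"`
	Message  string `json:"message"`
	Trace    string `json:"logging.googleapis.com/trace"`
}

func main() {
	line := logLine{
		Severity: "INFO",
		Message:  "handled request",
		Trace:    "projects/$PROJECT/traces/$TRACE_ID",
	}
	b, _ := json.Marshal(line)
	fmt.Println(string(b)) // one structured log line on stdout
}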

prometheus-to-sd: Unknown metric error logged every minute

I'm running two clusters that were upgraded to 1.8.3-gke.0, and are running gcr.io/google-containers/prometheus-to-sd:v0.2.2. Every minute, this container logs:

Error while sending request to Stackdriver googleapi: Error 400: Unknown metric: container.googleapis.com/internal/addons/event_exported/stackdriver_sink_received_entry_count, badRequest

It seems like this must be some sort of mismatch between the expected metric names and the actual names in Stackdriver?

[Feature Request] Ability to override resource labels with prometheus labels

I am working on adding a cAdvisor DaemonSet with an example integration with the prometheus-to-sd sidecar.

prometheus-to-sd currently just uses the namespace_id, pod_id, and container_name provided by its config. However, for cAdvisor, or other monitoring DaemonSets that publish metrics about other pods/containers in the cluster, adding namespace_id, pod_id, and container_name as Prometheus labels results in Stackdriver labels pointing to the "correct" container, while the monitored resource still points to the cAdvisor container (as that is where the metrics originate from).

I propose either allowing Prometheus labels to override the monitored-resource attributes by default (if provided), or adding the ability to configure which Prometheus label is used for each monitored-resource attribute, e.g. --container-name-label, --pod-id-label, etc., while still allowing only these, or the --pod-id and --namespace-id flags, to be set. The latter has the advantage of being able to "remap" labels from the target pod to monitored-resource attributes in Stackdriver, and means we don't need to match labels exactly (e.g. I could specify --namespace-id-label=namespace_name if that was how the producer labeled metrics).
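
A small sketch of the proposed remapping (the flag names above are suggestions, not existing options): pick the monitored-resource attribute from the scraped metric's own labels when an override label is configured, otherwise fall back to the flag value.

package main

import "fmt"

// resourceAttribute returns the value of overrideLabel from the metric's
// labels if it is set and present, otherwise the value passed via flags.
func resourceAttribute(metricLabels map[string]string, overrideLabel, flagValue string) string {
	if overrideLabel != "" {
		if v, ok := metricLabels[overrideLabel]; ok {
			return v
		}
	}
	return flagValue
}

func main() {
	labels := map[string]string{"namespace_name": "default", "pod_name": "my-pod"}
	// e.g. --namespace-id-label=namespace_name with --namespace-id=kube-system as fallback
	fmt.Println(resourceAttribute(labels, "namespace_name", "kube-system")) // default
}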

[prometheus-to-sd] Resource types: None

I am running the prometheus-to-sd sidecar that scrapes the Prometheus metrics endpoint of my service and sends metrics to Stackdriver. There is a metric called handler_error_count which represents API handler errors, e.g.:

handler_error_count{caller="some_caller",error_type="client",outcome="invalid_argument",procedure="some_procedure"} 1

I can find the metric in Stackdriver's Metrics Explorer but it says that its resource type is None. For this reason I am unable to create charts or set up alert policies using this metric.

Unless I am missing something, the resource type should be gke_container for all metrics.

Any suggestions on how to fix this issue would be appreciated.

[prometheus-to-sd] Support for SUMMARY metric types

It seems that prometheus-to-sd is unable to translate summary metrics and throws errors every time it scrapes them:

W0817 15:04:56.690842       1 translator.go:61] Error while processing metric http_request_duration_microseconds: Metric type SUMMARY of family http_request_duration_microseconds not supported

However, the stackdriver-prometheus project is actually able to parse this kind of metric and push it to Stackdriver (https://github.com/Stackdriver/stackdriver-prometheus/blob/9f5c836bfc3abea91b7473f4a57c107806099bf9/stackdriver/translate.go#L118).

Is there any reason that prometheus-to-sd does not support this kind of metric? Are you planning to add support for summary metrics? I can work on it if you accept PRs.
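
For illustration, one possible mapping (a sketch with stand-in types, not a proposed implementation for this repo): export the summary's running sum and count as cumulative series and each quantile as a gauge.

package main

import "fmt"

// summary stands in for the relevant fields of a Prometheus summary family.
type summary struct {
	SampleCount uint64
	SampleSum   float64
	Quantiles   map[float64]float64
}

// sample is a simplified time-series point with its Stackdriver metric kind.
type sample struct {
	Name  string
	Kind  string // CUMULATIVE or GAUGE
	Value float64
}

func translateSummary(name string, s summary) []sample {
	out := []sample{
		{Name: name + "_sum", Kind: "CUMULATIVE", Value: s.SampleSum},
		{Name: name + "_count", Kind: "CUMULATIVE", Value: float64(s.SampleCount)},
	}
	for q, v := range s.Quantiles {
		out = append(out, sample{Name: fmt.Sprintf("%s_quantile_%g", name, q), Kind: "GAUGE", Value: v})
	}
	return out
}

func main() {
	fmt.Println(translateSummary("http_request_duration_microseconds",
		summary{SampleCount: 10, SampleSum: 1234, Quantiles: map[float64]float64{0.5: 100, 0.99: 400}}))
}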

Error while sending request to Stackdriver ... unsupported protocol scheme """

I'm getting this error logged every ~30 seconds: Error while sending request to Stackdriver Post /v3/projects/{my project}/timeSeries?alt=json: unsupported protocol scheme """

I've just been watching my kube logs via the kubectl logs CLI this whole time, and this wasn't appearing in any of those logs. All my services are working as expected. But when I checked the GCP logging interface on the website, apparently this has been going on for at least a week, perhaps longer.

Any ideas?

[kubelet-to-gcm] Dropping metrics for containers with same name and start time

Looks like the mechanism for deduplicating data points (

if container.StartTime.Time.Before(metricsSeen[containerName]) || container.StartTime.Time.Equal(metricsSeen[containerName]) {
)
does not handle cases where we have two pods with containers that have the same name and start time.
Identical start times are common when pods are deployed as kubelet static pods.

/cc @loburm
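
A sketch of the idea (stand-in code, not the kubelet-to-gcm source): key the "already seen" map by pod and container name together, so two static pods whose containers share a name and start time are not collapsed into one series.

package main

import (
	"fmt"
	"time"
)

type containerKey struct {
	Pod       string
	Container string
}

// accept reports whether a sample should be kept, deduplicating per
// pod+container rather than per container name alone.
func accept(seen map[containerKey]time.Time, pod, container string, start time.Time) bool {
	k := containerKey{Pod: pod, Container: container}
	if last, ok := seen[k]; ok && !start.After(last) {
		return false // duplicate or older sample
	}
	seen[k] = start
	return true
}

func main() {
	seen := map[containerKey]time.Time{}
	t := time.Now()
	fmt.Println(accept(seen, "static-pod-a", "app", t)) // true
	fmt.Println(accept(seen, "static-pod-b", "app", t)) // true: same name and start time, different pod
}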

prometheus-to-stackdriver why create cumulative metrics for histograms

I have been using histograms in my Python app to monitor the response time of my application. Histograms in Prometheus naturally have the value type Distribution in Stackdriver, but I'm unclear why this component chooses Cumulative as the metric kind.

func extractMetricKind(mType dto.MetricType) string {

Cumulative metrics seem appropriate for counters, but for Prometheus histograms wouldn't Delta (or Gauge) be a more suitable kind? My response-time metrics aren't graphing properly in the UI, and I think it's because of this.
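
For context, an illustrative mapping (not the project's actual extractMetricKind): counters and histograms carry monotonically growing totals since process start, so they map naturally to CUMULATIVE, with rates derived at query time via an aligner such as ALIGN_RATE or ALIGN_DELTA; gauges and summaries map to GAUGE.

package main

import (
	"fmt"

	dto "github.com/prometheus/client_model/go"
)

func metricKind(t dto.MetricType) string {
	switch t {
	case dto.MetricType_COUNTER, dto.MetricType_HISTOGRAM:
		return "CUMULATIVE"
	default:
		return "GAUGE"
	}
}

func main() {
	fmt.Println(metricKind(dto.MetricType_HISTOGRAM)) // CUMULATIVE
}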

Overwrites MonitoredDescriptor, dropping all historical data

Prometheus-to-sd tries to update the Stackdriver MonitoredDescriptor if the installed one is incompatible with the Prometheus metric. This is a lossy operation that drops all the historical data, and it can be triggered by unexpected changes or bugs in the Prometheus exporters.

I propose that we refuse the write and log an error. The metric descriptor update should be pulled out into a separate tool that requires manual invocation, for users who are OK with losing the data.

The code in question is here: https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/prometheus-to-sd/translator/stackdriver.go#L88

MonitoredDescriptor.Create is documented here; I've asked the owners to document this behavior explicitly: https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.metricDescriptors/create

h/t to quentin

[prometheus-to-sd] json: unsupported value: +Inf

The following error keeps showing up in the logs:

E0509 10:17:28.743204 1 stackdriver.go:58] Error while sending request to Stackdriver json: error calling MarshalJSON for type *monitoring.CreateTimeSeriesRequest: json: error calling MarshalJSON for type *monitoring.TimeSeries: json: error calling MarshalJSON for type *monitoring.Point: json: error calling MarshalJSON for type *monitoring.TypedValue: json: error calling MarshalJSON for type *monitoring.Distribution: json: unsupported value: +Inf
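
For reference, a small sketch of one place this can come from: Prometheus histograms end with a +Inf bucket boundary, while Stackdriver's explicit-bucket distributions treat the overflow bucket as implicit, and +Inf cannot be marshaled to JSON at all. Dropping a trailing +Inf bound before building the Distribution avoids the error (illustrative code, not the translator's actual fix):

package main

import (
	"fmt"
	"math"
)

// sanitizeBounds removes a trailing +Inf boundary, since the overflow bucket
// is implicit in explicit-bucket distributions and +Inf is not valid JSON.
func sanitizeBounds(bounds []float64) []float64 {
	if n := len(bounds); n > 0 && math.IsInf(bounds[n-1], +1) {
		return bounds[:n-1]
	}
	return bounds
}

func main() {
	fmt.Println(sanitizeBounds([]float64{0.1, 0.5, 1, math.Inf(+1)})) // [0.1 0.5 1]
}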

Default Value Type for prometheus-to-sd

The current default value type is "INT64".

This breaks floating-point metrics such as "custom.googleapis.com/kube-state-metrics/kube_pod_container_resource_requests_cpu_cores".

Since the variable that holds the value is already of type float64, I think the default type should be "DOUBLE" instead of "INT64".
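
A sketch of the proposal (illustrative, not the current implementation): default scalar metrics to DOUBLE so fractional values survive, keeping DISTRIBUTION for histograms.

package main

import (
	"fmt"

	dto "github.com/prometheus/client_model/go"
)

func valueType(t dto.MetricType) string {
	switch t {
	case dto.MetricType_HISTOGRAM:
		return "DISTRIBUTION"
	default:
		// Prometheus sample values are float64, so DOUBLE avoids truncating
		// metrics like kube_pod_container_resource_requests_cpu_cores.
		return "DOUBLE"
	}
}

func main() {
	fmt.Println(valueType(dto.MetricType_GAUGE)) // DOUBLE
}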

[prometheus-to-sd] metrics are missing labels in stack driver

I'm trying to get kube-state-metrics set up and I can see metrics in Stackdriver, but none of them appear to have labels. When I look at the prometheus-to-sd logs I see a bunch of errors like this:

Error in attempt to update metric descriptor googleapi: Error 400: 
Field labels had an invalid value: 
When creating metric custom.googleapis.com/kube-state-metrics/kube_service_labels: 
the metric has more than 10 labels., badRequest

I'm not quite sure what the issue is, but I really need the labels for this to work. Has anyone seen this before?

Log spam: Unrecognized metric label

We are seeing a lot of log entries in Stackdriver of the form:

E0509 04:26:19.168197 1 stackdriver.go:58] Error while sending request to Stackdriver googleapi: Error 400: Field timeSeries[0].metric.labels[1] had an invalid value of "code": Unrecognized metric label., badRequest

This is on a GKE cluster, 1.9.7-gke.0

This is coming from the prometheus-to-sd container

Dead Link in the README.md

Hi Googlers,

thanks for your great work and contribution to the open-source community!
I just wanted to let you know that the stackdriverSite link in the README returns a 404.

That's all :)

Cheers, Maiky
