jirwin / burrow_exporter
Prometheus exporter for burrow
License: Apache License 2.0
Hi
We have a situation where a Kafka consumer was stopped for an upgrade, and we expected the lag to increase while the upgrade was taking place. However, the metrics reported in Prometheus didn't show the lag increasing. After looking at the Burrow API and burrow_exporter, it seems that burrow_exporter reports the lag at the last offset commit (end.lag) instead of the current lag (current_lag) of the partition.
$ curl -s http://burrow.service.docker:8000/v3/kafka/local/consumer/GROUP/lag | jq '.'
{
  "error": false,
  "message": "consumer status returned",
  "status": {
    "cluster": "local",
    "group": "GROUP",
    "status": "ERR",
    "complete": 1,
    "partitions": [
      {
        "topic": "short",
        "partition": 0,
        "owner": "",
        "status": "STOP",
        "start": {
          "offset": 12490020,
          "timestamp": 1547163289948,
          "lag": 0
        },
        "end": {
          "offset": 12490085,
          "timestamp": 1547163309431,
          "lag": 0
        },
        "current_lag": 70131,
        "complete": 1
      },
      ...
The above is from the Burrow v3 API and shows end.lag as zero while current_lag is positive.
I'm relatively new to Kafka and Burrow, so this may be by design, but it seems to me that the current lag would be more useful than the lag at the last commit.
Thanks
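Under this reading of the v3 payload, a fix would be to prefer current_lag when it is available. A minimal sketch of the idea, assuming hypothetical struct and function names (these are not the exporter's actual types; field tags follow the JSON shown above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Offset mirrors the "start"/"end" objects in the v3 lag response.
type Offset struct {
	Offset    int64 `json:"offset"`
	Timestamp int64 `json:"timestamp"`
	Lag       int64 `json:"lag"`
}

// Partition mirrors one entry of "partitions", including current_lag.
type Partition struct {
	Topic      string `json:"topic"`
	Partition  int32  `json:"partition"`
	Status     string `json:"status"`
	End        Offset `json:"end"`
	CurrentLag int64  `json:"current_lag"`
}

// reportedLag prefers the broker-derived current_lag over the lag recorded
// at the last offset commit, so a stopped consumer still shows growing lag.
func reportedLag(p Partition) int64 {
	if p.CurrentLag > 0 {
		return p.CurrentLag
	}
	return p.End.Lag
}

func main() {
	raw := `{"topic":"short","partition":0,"status":"STOP",
	         "end":{"offset":12490085,"timestamp":1547163309431,"lag":0},
	         "current_lag":70131}`
	var p Partition
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		panic(err)
	}
	fmt.Println(reportedLag(p)) // prints 70131
}
```

With the sample payload above, this would report 70131 for the stopped partition instead of the committed-offset lag of 0.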
Hi,
it would be nice to have a Docker build for this package. Something like
docker run -p 80:3000 -e BURROW_HOME="http://{burrow_host}/v2/kafka" -e PROMETHEUS_ENDPOINT="/metrics" -d burrow-exporter
At least a README.md explaining how to get started would help.
Any chance you could have a look at this? This is currently the only project converting Burrow metrics to Prometheus.
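A sketch of what such a build could look like, as a multi-stage Dockerfile; the Go base image, paths, and exposed port are assumptions, not a tested recipe for this repository:

```dockerfile
# Build stage: compile the exporter from source (Go version is an assumption).
FROM golang:1.12 AS build
WORKDIR /go/src/github.com/jirwin/burrow_exporter
COPY . .
RUN go build -o /burrow-exporter .

# Runtime stage: small image with just the binary.
FROM debian:stretch-slim
COPY --from=build /burrow-exporter /burrow-exporter
EXPOSE 8080
ENTRYPOINT ["/burrow-exporter"]
```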
Hi,
I was testing this project on our cluster, but we are facing an issue with the current version.
Our cluster has more than 2k topics and even more consumer groups. Apparently this exporter launches a goroutine per cluster, then per topic, and then per consumer group to scrape Burrow's REST API in parallel. The problem is that too many simultaneous calls to the REST API were made, causing the exporter to crash with "unable to open socket: too many open files".
For now we simply disabled all the goroutine calls (exporter.go, lines 139, 148, and 191) and everything works fine; the exporter is able to scrape the whole API in less than 10 seconds.
Hi,
I have a working Grafana with a Prometheus server (on one node). On a different node I am running Burrow and burrow_exporter (as two Docker services). For burrow_exporter I have the configuration below. In the burrow_exporter logs I don't see any errors (docker logs -f shows "Scraping burrow..." and "Finished scraping burrow..."), but I don't see any entries in my Prometheus server with the substring "burrow", so clearly the scraped data is not getting across.
If you can give me a way to get data from burrow_exporter into an external Prometheus server at an arbitrary IP, that would be a great help.
services:
  burrow:
    build: .
    environment:
      BURROW_ADDR: http://172.31.18.137:8000
      METRICS_ADDR: 172.31.18.137:9090
      INTERVAL: 5
      API_VERSION: 3
    volumes:
      - ../burrow-master/docker-config:/etc/burrow/
      - ${PWD}/tmp:/var/tmp/burrow
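One thing worth noting: burrow_exporter does not push data anywhere; Prometheus pulls from the exporter's /metrics endpoint. So the external Prometheus server needs a scrape job pointing at the address in METRICS_ADDR. A minimal sketch, reusing the IP and port from the compose file above (assumed reachable from the Prometheus node):

```yaml
scrape_configs:
  - job_name: 'burrow'
    static_configs:
      - targets: ['172.31.18.137:9090']  # METRICS_ADDR from the compose file
```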
Hi,
I'm trying to identify how a consumer group's status maps to the exported metric. I receive a number for the status of a consumer group, and I want to know what that number means relative to the list of status strings in Burrow.
The numbers I have so far are assumptions based on tests I did; I'm not completely sure about them. I looked through the documentation for Burrow and for this project and couldn't find anything.
Does this information exist? Can someone here provide it?
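The exporter presumably converts Burrow's status strings to a number before exporting. The shape of such a mapping might look like the sketch below, but the values here are assumptions for illustration only; the authoritative mapping is whatever the exporter's source defines and should be verified there:

```go
package main

import "fmt"

// statusNumber is an ASSUMED mapping from Burrow status strings to metric
// values, in the order Burrow lists them. Verify the real values against
// the exporter's source before relying on them in alerts.
var statusNumber = map[string]float64{
	"NOTFOUND": 1,
	"OK":       2,
	"WARN":     3,
	"ERR":      4,
	"STOP":     5,
	"STALL":    6,
}

func main() {
	fmt.Println(statusNumber["ERR"]) // prints 4 under this assumed mapping
}
```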
Hi,
I am using the latest code built as a Docker image
docker run --name burrow-exporter -d -p 8090:8080 --link burrow -e BURROW_ADDR="http://burrow:8000/" -e METRICS_ADDR="0.0.0.0:8080" -e API_VERSION="3" -e INTERVAL="30" jirwin/burrow_exporter
The only metric showing up at the /metrics endpoint is kafka_burrow_topic_partition_offset. Why are the other metrics absent, such as the lag?
Many thanks
It would be nice to have a metric that indicates whether burrow_exporter has been able to scrape Burrow and whether the health check succeeded.
I have developed Helm charts which use your Docker image for the Burrow exporter.
Helm chart reference: https://github.com/Yolean/kubernetes-kafka/blob/master/linkedin-burrow/burrow.yml
However, when things are not in good shape with the Kafka cluster, things go wrong with the Burrow exporter as well. Following is the log:
time="2019-07-24T02:16:09Z" level=error msg="error listing clusters. Continuing." err="Get http://localhost:8000/v3/kafka: dial tcp 127.0.0.1:8000: connect: connection refused"
time="2019-07-24T02:16:39Z" level=info msg="Scraping burrow..." timestamp=1563934599815846635
time="2019-07-24T02:16:39Z" level=error msg="error making request" endpoint="http://localhost:8000/v3/kafka" err="Get http://localhost:8000/v3/kafka: dial tcp 127.0.0.1:8000: connect: connection refused"
time="2019-07-24T02:16:39Z" level=error msg="error retrieving cluster details" err="Get http://localhost:8000/v3/kafka: dial tcp 127.0.0.1:8000: connect: connection refused"
time="2019-07-24T02:16:39Z" level=error msg="error listing clusters. Continuing." err="Get http://localhost:8000/v3/kafka: dial tcp 127.0.0.1:8000: connect: connection refused"
We do not have liveness and readiness probes defined for the Burrow exporter; how do we check it?
My concern is that an issue with the Kafka cluster should not put burrow-exporter into a CrashLoopBackOff state.
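One option, assuming the exporter serves /metrics on port 8080 (as METRICS_ADDR suggests), is a plain HTTP probe against that endpoint. Note the caveat: this only verifies that the exporter process is serving, not that Burrow or Kafka is reachable, which is arguably what you want here, since a Kafka outage should not restart the exporter. A sketch:

```yaml
livenessProbe:
  httpGet:
    path: /metrics
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /metrics
    port: 8080
  periodSeconds: 10
```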
Has anyone tried running the Docker image in Kubernetes and had issues passing either command-line arguments or environment variables to the Go binary? It seems to drop a -
in my args, although it works in Docker. I have tried everything I can think of; I have Burrow running in the same pod just fine, passing flags. See the shortened config below:
image: 715666668144.dkr.ecr.us-east-1.amazonaws.com/burrow_exporter:latest
command:
  - ./burrow-exporter
env:
  - name: BURROW_ADDR
    value: http://localhost:8000
  - name: METRICS_ADDR
    value: 0.0.0.0:8080
ports:
  - name: web
    containerPort: 8080
Or as the following, where the -- turns into -, visible in the container logs:
image: 715666668144.dkr.ecr.us-east-1.amazonaws.com/burrow_exporter:latest
command:
  - ./burrow-exporter
args:
  - --burrow-addr http://localhost:8000
ports:
  - name: web
    containerPort: 8080
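A likely cause: in Kubernetes, each element of args is passed to the binary as a single argv entry, so `--burrow-addr http://localhost:8000` arrives as one argument containing a space, which the flag parser cannot interpret. Splitting the flag and its value into separate list items, or joining them with `=`, usually resolves this:

```yaml
args:
  - --burrow-addr=http://localhost:8000
# or, equivalently, as two separate argv entries:
#  - --burrow-addr
#  - http://localhost:8000
```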
Hi,
I'm running burrow and burrow_exporter, and seeing this a lot in the logs:
time="2019-03-21T17:15:12Z" level=error msg="error retrieving consumer group topic details" cluster=staging err="Get http://localhost:31363/v3/kafka/staging/topic/topic_name dial tcp 127.0.0.1:31363: connect: cannot assign requested address" topic=topic_name
Any idea what could be happening?
Hello,
Is there a way to use this tool without Docker, i.e. to compile it into a standalone executable?
This would be very useful, as we do not have Docker on our Linux machines.
Thanks for your help!
Julien
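Since this is a plain Go program, a standard `go build` should produce a static binary. A sketch, assuming a working Go toolchain; the flag names are inferred from the environment variables used elsewhere in this tracker (BURROW_ADDR, METRICS_ADDR, API_VERSION), so check `--help` for the real ones:

```shell
git clone https://github.com/jirwin/burrow_exporter
cd burrow_exporter
go build -o burrow-exporter .

# Run the resulting standalone binary (flag names are an assumption):
./burrow-exporter --burrow-addr http://localhost:8000 \
                  --metrics-addr 0.0.0.0:8080 \
                  --api-version 3
```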
This may be a gap in my usage knowledge, but I am only getting Go stats from /metrics.
Is there a sub-URL that I am missing?
Burrow is working, and I am not seeing any errors connecting via Docker.
Burrow output:
{
  "error": false,
  "message": "cluster list returned",
  "clusters": ["local"],
  "request": {
    "url": "/v3/kafka",
    "host": "70fb55e70e02"
  }
}
/metrics output:
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.0003077
go_gc_duration_seconds{quantile="0.25"} 0.0003077
go_gc_duration_seconds{quantile="0.5"} 0.0003139
go_gc_duration_seconds{quantile="0.75"} 0.0003139
go_gc_duration_seconds{quantile="1"} 0.0003139
go_gc_duration_seconds_sum 0.0006216
go_gc_duration_seconds_count 2
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 15
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 437784
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 5.018448e+06
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.444015e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 15182
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 9.598504609245178e-07
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.371584e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 437784
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 6.4847872e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.572864e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 2168
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.6420736e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5433399375312576e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 17350
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 6912
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 30248
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 49152
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.294409e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 688128
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 688128
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.2284408e+07
# HELP go_threads Number of OS threads created
# TYPE go_threads gauge
go_threads 13
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.31
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.1558912e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.54333909606e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.15638272e+08
I'd like to export the status of each partition too.
We could always write some logic on the Prometheus end, but Burrow already does this well.
https://github.com/linkedin/Burrow/wiki/http-request-consumer-group-status
These are the valid status strings: NOTFOUND, OK, WARN, ERR, STOP, STALL
Edit:
We could model them as separate time series:
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="OK"} 1
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STOP"} 1
something like https://www.robustperception.io/exposing-the-software-version-to-prometheus/, which I hope to visualize in Grafana.
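The proposal above can be sketched as a plain text-format renderer: the partition's current state gets value 1 and every other known state gets 0, so a PromQL selector like kafka_burrow_partition_state{state="STOP"} works directly. The function and metric names follow the issue text, but this is an illustration, not the exporter's current output:

```go
package main

import "fmt"

// The valid Burrow status strings, per the wiki page linked above.
var states = []string{"NOTFOUND", "OK", "WARN", "ERR", "STOP", "STALL"}

// renderPartitionState emits one exposition-format line per known state,
// with 1 for the partition's current state and 0 for all others.
func renderPartitionState(cluster, group, topic string, partition int, current string) []string {
	var lines []string
	for _, s := range states {
		v := 0
		if s == current {
			v = 1
		}
		lines = append(lines, fmt.Sprintf(
			`kafka_burrow_partition_state{cluster=%q,group=%q,partition="%d",state=%q,topic=%q} %d`,
			cluster, group, partition, s, topic, v))
	}
	return lines
}

func main() {
	for _, l := range renderPartitionState("MY_CLUSTER", "MY_GROUP", "MY_TOPIC", 13, "STOP") {
		fmt.Println(l)
	}
}
```

Emitting explicit zeroes (rather than only the active state) avoids stale series lingering in Prometheus when a partition transitions between states.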
The Burrow API has been upgraded to v3, while the client.go module still references v2 endpoints.
New API
The tag is at 0.0.6, but the exporter still claims it's 0.0.4: https://github.com/jirwin/burrow_exporter/blob/master/burrow-exporter.go#L17
It states
"The status of a partition as reported by burrow."
but should likely say
"The status of a consumer group as reported by burrow."
Source: https://github.com/jirwin/burrow_exporter/blob/master/burrow_exporter/metrics.go#L55
Prometheus support was added to Burrow itself in linkedin/Burrow#628 in 2020.
Is this project still relevant? May I suggest that you update your documentation, as found here.