grobian / carbon-c-relay
Enhanced C implementation of Carbon relay, aggregator and rewriter
License: Apache License 2.0
I'm seeing a strange issue where the metric path gets garbled. My metric path looks something like:
vCenter.Datastore.<datastore name>.x.y.z
which results in a variation of different metric paths:
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 DavCenter
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 DatastovCenter
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 DvCenter
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 DatavCenter
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 DatasvCenter
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 DatvCenter
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 DatastvCenter
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 DatastorevCenter
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 DatastorvCenter
drwxr-xr-x 3 _graphite _graphite 4096 Mar 17 08:33 vCenter
drwxr-xr-x 66 _graphite _graphite 4096 Mar 17 08:33 Datastore
When I send the metric directly to carbon, things work ok. I've tested this with version 0.36 and with the latest v0.37 (50b6a1).
Any advice on how to troubleshoot this?
Hi,
my setup:
here's the config:
cluster all
forward
graphiteA:2003
graphiteB:2003
graphiteC:2003
;
aggregate
^stats\.dc1pweb[0-9]+\.request_by_country\.([^.]*)
every 60 seconds
expire after 60 seconds
compute sum write to
stats._sum_dc1pweb.request_by_country.\1
;
aggregate
^stats\.dc1[^.]+\.warehouselogger\.([^.]*)
every 60 seconds
expire after 60 seconds
compute sum write to
stats._sum_dc1.warehouselogger.\1
;
match _sum_dc1pweb.request_by_country
send to all
;
match _sum_dc1.warehouselogger
send to all
;
Now, the thing is, I didn't want to maintain the A/B/C list in carbon-c-relay as well, because that's what carbon-relay-ng is for, and this list changes too often.
So I modified the config to forward to carbon-relay-ng:2003 instead of A/B/C, feeding the output of carbon-c-relay back into carbon-relay-ng, which can then feed the aggregations into everywhere they need to go.
That means that stuff also feeds back into carbon-c-relay.
I didn't expect this to be an issue, because the regexes are carefully written to match only the input metrics, and not the relay's own aggregation outputs. However, it was. This is the amount of metrics coming into carbon-relay-ng (it flattened back out because carbon-relay-ng hit 100% CPU and couldn't process anymore).
I did some network sniffing on the relay machine and saw way too many metrics with sum_dc1pweb in them.
Did I misconfigure the relay? Is there any reason why this behavior would make sense? Might it be a bug in the relay?
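One way to guard against such a feedback loop (a sketch, assuming the aggregation outputs keep the stats._sum_ prefixes shown above) is to route the aggregate outputs with a stop before the aggregate rules, since rules are evaluated in order:

```
# evaluated before the aggregate rules: aggregate outputs are
# forwarded and then dropped from any further processing
match ^stats\._sum_dc1
    send to all
    stop
    ;
```

With this in front, re-injected aggregation output can never reach the aggregate rules again, regardless of how the regexes there are written.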
Does this build/run on windows?
Under high load, carbon-c-relay sometimes dies and the kernel ring buffer reports:
relay[12386] general protection ip:3ee520798e sp:7fc14c4bde00 error:0 in libpthread-2.12.so[3ee5200000+17000]
relay[29214]: segfault at b983c6a8 ip 0000000000409027 sp 00007fa3b983bcb0 error 4 in relay[400000+e000]
on CentOS release 6.3 (Final) 2.6.32-279.5.2.el6.x86_64. I'd be happy to provide more information as needed.
Hi,
We make extensive use of rewrite rules in our carbon/graphite setup.
Any plans to implement some way of performing rewriting of metric names in carbon-c-relay?
Bob
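For what it's worth, carbon-c-relay later gained a rewrite statement for exactly this purpose; a minimal sketch (the pattern and replacement below are made-up examples, not from this setup):

```
# collapse a per-server prefix into a common one (hypothetical names)
rewrite ^servers\.([^.]+)\.cpu\.(.+)
    into hosts.\1.cpu.\2
    ;
```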
Hi,
I think I have run into an issue with the any_of cluster type. I have not tested the other cluster types, and I have not tested whether any timeouts longer than 10 seconds are involved.
It appears that the any_of cluster option always elects to send the same metric to the same cluster node; it is sticky and does not fail over.
From the documentation:
cluster send-to-any-one
any_of 10.1.0.1:2010 10.1.0.1:2011;
This would implement a fail-over scenario, where two servers are used and the load between them is spread, but should any of them fail, all metrics are sent to the remaining one. This typically works well for upstream relays, or for balancing carbon-cache processes running on the same machine. Should any member become unavailable, for instance due to a rolling restart, the other members receive the traffic.
To test this I used the following config with release 32:
$ relay -p4000 -w 4 -b 10 -q 25000 -H maxwell -f relay.conf -d
[2014-09-16 17:37:26] starting carbon-c-relay v0.33 (2014-09-16)
configuration:
relay hostname = twiki501.back.test.bc.local
listen port = 4000
workers = 4
send batch size = 10
server queue size = 25000
debug = true
routes configuration = relay.conf
parsed configuration follows:
cluster one
any_of
localhost:5000
localhost:6000
localhost:7000
;
match ^carbon\.relays\..*$
send to blackhole
stop
;
match *
send to one
;
I started 3 nc instances listening on ports 5000 6000 7000.
I sent the following metrics:
orang-utan 4 1410887768
spider 4 1410887768
chimp 4 1410887768
I observed the following results:
orang-utans always go to port 5000
spiders always go to port 6000
chimps always go to port 7000
This is the case even if there is nothing listening on the target port.
Please let me know if I can help in any way. I am afraid that my C has about 16 years of rust on it but I am tempted to relearn.
Thanks for what is, by my testing, the fastest way to relay carbon metrics.
Hello,
I've found high CPU usage on the worker threads of carbon-c-relay. The symptom and impact are mostly the same as the issue reported in #8, but the trigger is different. Somehow some data containing only "." was sent from our client to our carbon-c-relay and caused this issue.
Steps to reproduce are below. Could you help me fix this please?
echo -n "." | nc localhost 2003
I am trying to incorporate carbon-c-relay into an existing environment and am finding that the carbon_ch hashing method does not line up with that of the standard carbon tools.
Should
cluster my_store
carbon_ch
10.0.0.100:2113
10.0.0.200:2113
10.0.0.300:2113
10.0.0.400:2113
;
behave the same as?
DESTINATIONS = 10.0.0.100:2120:a,10.0.0.200:2120:a,10.0.0.300:2120:a,10.0.0.400:2120:a
In my testing it was not the case.
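One difference worth checking (an assumption on my part, not confirmed in this thread): the carbon DESTINATIONS entries carry instance names (the trailing :a), which participate in carbon's consistent hash, while the cluster above hashes on host:port alone; note too that the ports differ between the two configs (2113 vs 2120). carbon-c-relay can include an instance name in the hash key with an = suffix, sketched as:

```
cluster my_store
    carbon_ch
        10.0.0.100:2113=a
        10.0.0.200:2113=a
        10.0.0.300:2113=a
        10.0.0.400:2113=a
    ;
```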
[[email protected] bin]$ ./carbon-c-relay -f /opt/graphite/conf/carbon-c-relay2.conf -q 100000 -s
[2015-01-26 22:54:12] starting carbon-c-relay v0.37 (2015-01-26)
configuration:
relay hostname = carbon01
listen port = 2003
workers = 2
send batch size = 2500
server queue size = 100000
statistics submission interval = 60s
submission = true
routes configuration = /opt/graphite/conf/carbon-c-relay2.conf
parsed configuration follows:
cluster graphite
fnv1a_ch replication 1
127.0.0.1:2104
127.0.0.1:2106
127.0.0.1:2108
127.0.0.1:2110
;
match *
send to graphite
;
listening on tcp4 0.0.0.0 port 2003
listening on tcp6 :: port 2003
listening on udp4 0.0.0.0 port 2003
listening on udp6 :: port 2003
listening on UNIX socket /tmp/.s.carbon-c-relay.2003
starting 2 workers
starting statistics collector
[2015-01-26 22:54:12] failed to write() to 127.0.0.1:2108: Broken pipe
[2015-01-26 22:54:12] failed to write() to 127.0.0.1:2106: Broken pipe
[2015-01-26 22:54:12] failed to write() to 127.0.0.1:2104: Broken pipe
[2015-01-26 22:54:12] failed to write() to 127.0.0.1:2110: Broken pipe
[2015-01-26 22:54:12] server 127.0.0.1:2108: OK
[2015-01-26 22:54:12] failed to write() to 127.0.0.1:2108: Broken pipe
[2015-01-26 22:54:12] server 127.0.0.1:2104: OK
[2015-01-26 22:54:12] failed to write() to 127.0.0.1:2104: Broken pipe
[2015-01-26 22:54:12] server 127.0.0.1:2106: OK
[2015-01-26 22:54:12] failed to write() to 127.0.0.1:2106: Broken pipe
[2015-01-26 22:54:12] server 127.0.0.1:2110: OK
[2015-01-26 22:54:12] failed to write() to 127.0.0.1:2110: Broken pipe
[... the "server OK" / "failed to write(): Broken pipe" cycle repeats across ports 2104-2110 for the next couple of seconds; log truncated ...]
Hi. Could you provide some configuration examples?
For example, how to integrate such carbon-aggregator configuration:
aggregation-rules.conf:
stats.app...all..count (15) = sum stats.app...(\d+)-(\d+)-(\d+)-(\d+).<>.count
Does carbon-c-relay support named captures in output patterns like carbon-aggregator?
Hi,
I'm trying to use carbon-c-relay and perform some aggregations. One of the rules is:
aggregate
^([^-]+)-[^.]+\.cacti_nginx\.cacti_nginx_sockets\.nginx_(requests|handled|accepts)
every 30 seconds
expire after 35 seconds
compute sum write to
aggregates.\1.nginx.sockets.\2
compute sum write to
aggregates.all.nginx.sockets.\2
;
However, there are some strange gaps (it looks like 1 or 2 points are missing).
I can't understand why there can be no data for the aggregate but data for the points themselves, especially because the data is obtained by collectd and the points carry accurate timestamps.
With format=json it looks like that:
[20622.722064, 1419838410], [null, 1419838440], [21996.078082, 1419838470]
[18649.835853, 1419838410], [18739.188967999995, 1419838440], [19948.72801599999, 1419838470]
hi,
I'm pretty sure that this is not a bug; I am just totally failing to grok something...
If you spot what I am doing wrong, please let me know. Otherwise, please feel free to close this issue.
When carbon-c-relay is fed metrics from collectd, all relayed metrics have a '_' appended.
If I capture the output of collectd to a file and then play it back to carbon-c-relay with netcat, everything is fine.
Scenario
collectd -> carbon c relay -> carbon-cache.
I am running carbon c relay release 0.27 (need to update..)
carbon cache is complaining in /var/log/carbon/listener.log
27/08/2014 19:14:07 :: invalid line received from client 127.0.0.1:55633, ignoring
27/08/2014 19:14:07 :: invalid line received from client 127.0.0.1:55633, ignoring
27/08/2014 19:14:07 :: invalid line received from client 127.0.0.1:55633, ignoring
27/08/2014 19:14:07 :: invalid line received from client 127.0.0.1:55633, ignoring
27/08/2014 19:14:07 :: invalid line received from client 127.0.0.1:55633, ignoring
27/08/2014 19:14:07 :: invalid line received from client 127.0.0.1:55633, ignoring
stopping carbon-cache and replacing it with netcat:
$ nc -l 2023
vbox_twiki01.processes-httpd.ps_data 1005916160.000000 1409166917_
vbox_twiki01.processes-httpd.ps_code 190201856.000000 1409166917_
vbox_twiki01.processes-httpd.ps_stacksize 13432.000000 1409166917_
vbox_twiki01.processes-httpd.ps_cputime.user 0.000000 1409166917_
vbox_twiki01.processes-httpd.ps_cputime.syst 0.000000 1409166917_
vbox_twiki01.processes-httpd.ps_count.processes 10.000000 1409166917_
vbox_twiki01.processes-httpd.ps_count.threads 10.000000 1409166917_
vbox_twiki01.processes-httpd.ps_pagefaults.minflt 0.000000 1409166917_
vbox_twiki01.processes-httpd.ps_pagefaults.majflt 0.000000 1409166917_
vbox_twiki01.processes-httpd.ps_disk_octets.read 0.000000 1409166917_
vbox_twiki01.processes-httpd.ps_disk_octets.write 0.000000 1409166917_
checking that collectd is not sending weirdness by replacing cc relay with netcat:
$ nc -l 2103
vbox_twiki01.processes-httpd.ps_count.processes 10.000000 1409167037
vbox_twiki01.processes-httpd.ps_count.threads 10.000000 1409167037
vbox_twiki01.processes-httpd.ps_pagefaults.minflt 0.000000 1409167037
vbox_twiki01.processes-httpd.ps_pagefaults.majflt 0.000000 1409167037
vbox_twiki01.processes-httpd.ps_disk_octets.read 0.000000 1409167037
vbox_twiki01.processes-httpd.ps_disk_octets.write 0.000000 1409167037
vbox_twiki01.processes-httpd.ps_disk_ops.read 0.000000 1409167037
vbox_twiki01.processes-httpd.ps_disk_ops.write 0.000000 1409167037
vbox_twiki01.processes-collectd.ps_vm 674430976.000000 1409167037
vbox_twiki01.processes-collectd.ps_rss 2420736.000000 1409167037
vbox_twiki01.processes-collectd.ps_data 642191360.000000 1409167037
vbox_twiki01.processes-collectd.ps_code 2719744.000000 1409167037
vbox_twiki01.processes-collectd.ps_stacksize 2208.000000 1409167037
replacing collectd with netcat piping output into cc relay and catching cc_relay output with netcat:
terminal1$ cat ./test_metrics.txt | nc 127.0.0.1 2103
terminal2$ nc -l 2023
vbox_twiki01.processes-monit.ps_vm 127668224.000000 1409165887
vbox_twiki01.processes-monit.ps_rss 933888.000000 1409165887
vbox_twiki01.processes-monit.ps_data 78196736.000000 1409165887
vbox_twiki01.processes-monit.ps_code 7028736.000000 1409165887
vbox_twiki01.processes-monit.ps_stacksize 752.000000 1409165887
vbox_twiki01.processes-monit.ps_cputime.user 0.000000 1409165887
vbox_twiki01.processes-monit.ps_cputime.syst 0.000000 1409165887
vbox_twiki01.processes-monit.ps_count.processes 1.000000 1409165887
vbox_twiki01.processes-monit.ps_count.threads 2.000000 1409165887
vbox_twiki01.processes-monit.ps_pagefaults.minflt 0.000000 1409165887
vbox_twiki01.processes-monit.ps_pagefaults.majflt 0.000000 1409165887
vbox_twiki01.processes-monit.ps_disk_octets.read 10299.084952 1409165887
configuration for carbon c relay:
cluster localhost
forward
127.0.0.1:2023
;
match *
send to localhost
stop
;
start command for cc relay:
/usr/bin/relay -p 2103 -w 4 -d -s -f /etc/relay.conf
collectd write_graphite configuration:
<Plugin write_graphite>
<Node "cc_relay">
Host "localhost"
Port "2103"
Protocol "tcp"
LogSendErrors true
</Node>
</Plugin>
Hello,
We are running into a degradation issue with carbon-c-relay: the symptoms are increased CPU usage, eventually getting stuck at 100% per thread, and performance degradation (less throughput and dropped metrics) resulting from the increased CPU usage.
We have tested with one worker and with multiple workers; same behaviour. With multiple workers, threads start consuming 100% CPU until all of them do.
The stuck threads have the following output from strace:
<..>
read(5, 0x2b78c00008c5, 8095) = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x2b78c00008c5, 8095) = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x2b78c00008c5, 8095) = -1 EAGAIN (Resource temporarily unavailable)
[... the same EAGAIN read on fd 5 repeats dozens of times ...]
nanosleep({0, 213000000}, NULL) = 0
<repeating..>
Whereas healthy threads have output similar to:
read(6, 0x2aea80004925, 8095) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 215000000}, NULL) = 0
read(6, 0x2aea80004925, 8095) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 298000000}, NULL) = 0
read(6, 0x2aea80004925, 8095) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 209000000}, NULL) = 0
read(6, 0x2aea80004925, 8095) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 207000000}, NULL) = 0
read(6, 0x2aea80004925, 8095) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 257000000}, NULL) = 0
read(6, 0x2aea80004925, 8095) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({0, 133000000}, NULL) = 0
read(6, "<host>.<metric>"..., 8062) = 8062
It appears that in the high-CPU case, many reads are done between each sleep on what appears to be a socket with incoming metrics, where normally there would be one read between sleeps.
We have not been able to replicate the issue with fake data in a self-contained test as of yet; we will follow up as soon as we do. In the meantime, we would appreciate any pointers to where the issue might lie, if you have any ideas.
Thanks!
It seems to be missing support for listening on a UDP port. I'm looking to do UDP -> TCP or UDP -> UDP relaying. Am I missing an option?
my config:
interval 10
expire 10
method sum
Every point represents 10s. However, the output of the aggregator always lags 10s behind the result computed via graphite's sumSeries function, which is annoying.
It would be helpful if aggregation supported rate calculation.
For example, network inflow or outflow data collected by collectd is monotonically increasing; we need to subtract two points and divide by the elapsed time to get bps.
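Until such support exists, the derivation is simple enough to sketch outside the relay. A hypothetical helper (not part of carbon-c-relay) that turns two samples of a monotonically increasing counter into a per-second rate:

```python
def counter_rate(prev, curr):
    """Derive a per-second rate from two (value, timestamp) samples of a
    monotonically increasing counter, e.g. collectd interface octets."""
    (v0, t0), (v1, t1) = prev, curr
    if t1 <= t0 or v1 < v0:  # counter reset or out-of-order sample: no rate
        return None
    return (v1 - v0) / (t1 - t0)

# 1048576 bytes transferred over 10 s -> 104857.6 bytes/s
# (multiply by 8 for bits per second)
rate = counter_rate((1000000, 1419838400), (2048576, 1419838410))
```

The reset guard matters for exactly the collectd case described: when an interface counter wraps or the daemon restarts, emitting no point beats emitting a huge negative rate.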
After reviewing the documentation, I don't know how to send to a cluster in fail-over mode.
consider this cluster configuration
cluster A
ip_A_datacenter1
ip_A_datacenter2
cluster B
ip_B_datacenter1
ip_B_datacenter2
cluster C
ip_C_datacenter1
ip_C_datacenter2
match "^metricstoA\.*"
send to A
match "^metricstoB\.*"
send to B
match "^metricstoC\.*"
send to C
What I need is to send only to the first host in each cluster (ip_X_datacenter1) and, only if that one fails, switch to the other (ip_X_datacenter2); it should also check ip_X_datacenter1's availability so traffic comes back to the main server after the host has recovered.
How can I do that?
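Newer versions of carbon-c-relay provide a failover cluster type that behaves this way: all metrics go to the first server, and later servers are used only while earlier ones are down. A sketch for one of the clusters, using the placeholder hostnames from the question and an assumed port:

```
cluster A
    failover
        ip_A_datacenter1:2003
        ip_A_datacenter2:2003
    ;
match ^metricstoA\.
    send to A
    ;
```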
It would be good to have logrotate-friendly log support inside carbon-c-relay, with reopening of log files on SIGUSR1 or similar.
Good evening,
It looks like there may be some issue with the routing of the performance statistics when not running with debug enabled:
Running relay with debug enabled
$ relay -f /etc/cc_relay.conf -p 2103 -w 2 -b 2500 -q 25000 -d -H maxwell
[2014-09-02 20:14:19] starting carbon-c-relay v0.31 (2014-09-02)
configuration:
relay hostname = maxwell
listen port = 2103
workers = 2
send batch size = 2500
server queue size = 25000
debug = true
routes configuration = /etc/cc_relay.conf
parsed configuration follows:
cluster localhost
forward
127.0.0.1:2003
;
match *
send to localhost
;
listening on tcp4 0.0.0.0 port 2103
listening on UNIX socket /tmp/.s.carbon-c-relay.2103
starting 2 workers
starting statistics collector
[2014-09-02 20:14:38] failed to send() to 127.0.0.1:2003: Broken pipe
[2014-09-02 20:14:41] server 127.0.0.1:2003: OK
carbon.relays.maxwell.dispatcher1.metricsReceived 920 1409685319
carbon.relays.maxwell.dispatcher1.wallTime_us 15045 1409685319
carbon.relays.maxwell.dispatcher2.metricsReceived 1250 1409685319
carbon.relays.maxwell.dispatcher2.wallTime_us 21912 1409685319
carbon.relays.maxwell.metricsReceived 2170 1409685319
carbon.relays.maxwell.dispatch_wallTime_us 36957 1409685319
carbon.relays.maxwell.dispatch_busy 0 1409685319
carbon.relays.maxwell.dispatch_idle 2 1409685319
carbon.relays.maxwell.destinations.127_0_0_1:2003.sent 2170 1409685319
carbon.relays.maxwell.destinations.127_0_0_1:2003.queued 0 1409685319
carbon.relays.maxwell.destinations.127_0_0_1:2003.dropped 0 1409685319
carbon.relays.maxwell.destinations.127_0_0_1:2003.wallTime_us 19995 1409685319
carbon.relays.maxwell.destinations.internal.sent 0 1409685319
carbon.relays.maxwell.destinations.internal.queued 0 1409685319
carbon.relays.maxwell.destinations.internal.dropped 0 1409685319
carbon.relays.maxwell.destinations.internal.wallTime_us 0 1409685319
carbon.relays.maxwell.metricsSent 2170 1409685319
carbon.relays.maxwell.metricsQueued 0 1409685319
carbon.relays.maxwell.metricsDropped 0 1409685319
carbon.relays.maxwell.server_wallTime_us 19995 1409685319
carbon.relays.maxwell.connections 217 1409685319
carbon.relays.maxwell.disconnects 217 1409685319
^Ccaught SIGINT, terminating...
[2014-09-02 20:15:56] shutting down...
[2014-09-02 20:15:56] listener for port 2103 closed
[2014-09-02 20:15:57] collector stopped
[2014-09-02 20:15:57] stopped worker 1 2 3 (2014-09-02 20:15:58)
[2014-09-02 20:15:58] routing stopped
Running relay without debug enabled:
relay -f /etc/cc_relay.conf -p 2103 -w 2 -b 2500 -q 25000 -H maxwell
[2014-09-02 20:19:12] starting carbon-c-relay v0.31 (2014-09-02)
configuration:
relay hostname = maxwell
listen port = 2103
workers = 2
send batch size = 2500
server queue size = 25000
routes configuration = /etc/cc_relay.conf
parsed configuration follows:
cluster localhost
forward
127.0.0.1:2003
;
match *
send to localhost
;
listening on tcp4 0.0.0.0 port 2103
listening on UNIX socket /tmp/.s.carbon-c-relay.2103
starting 2 workers
starting statistics collector
[2014-09-02 20:20:12] failed to send() to internal:2103: Socket operation on non-socket
[2014-09-02 20:20:12] failed to send() to internal:2103: Socket operation on non-socket
[2014-09-02 20:20:12] failed to send() to internal:2103: Socket operation on non-socket
[2014-09-02 20:20:13] failed to send() to internal:2103: Socket operation on non-socket
[2014-09-02 20:20:13] failed to send() to internal:2103: Socket operation on non-socket
[2014-09-02 20:20:13] failed to send() to internal:2103: Socket operation on non-socket
[2014-09-02 20:20:13] failed to send() to internal:2103: Socket operation on non-socket
[2014-09-02 20:20:14] failed to send() to internal:2103: Socket operation on non-socket
[2014-09-02 20:20:14] failed to send() to internal:2103: Socket operation on non-socket
...........
I also noticed that a UNIX socket has been opened when not running in debug mode
Cheers,
Matthew.
Currently we cannot use 0 in expire, thus the output from the aggregator lags behind that computed by graphite functions (sum, avg, ...), which is annoying.
Hi,
I was recently setting up carbon-c-relay and had trouble getting aggregations to work. It seems that the aggregation configs must appear before the 'match' sections.
Eg. this did not work:
cluster caches
fnv1a_ch
127.0.0.1:2401 proto udp
127.0.0.1:2402 proto udp
127.0.0.1:2403 proto udp
127.0.0.1:2404 proto udp
;
match *
send to remotes
stop
;
# Metric format: metrics.<dc>.<type>.host.<host>.<metric>
aggregate
^metrics\.([^.]+)\.([^.]+)\.host\.([^.]+)\.(.+)$
every 10 seconds
expire after 10 seconds
compute sum write to
metrics.\1.\2.all.sum.\4
compute average write to
metrics.\1.\2.all.avg.\4
;
But this did:
cluster caches
fnv1a_ch
127.0.0.1:2401 proto udp
127.0.0.1:2402 proto udp
127.0.0.1:2403 proto udp
127.0.0.1:2404 proto udp
;
# Metric format: metrics.<dc>.<type>.host.<host>.<metric>
aggregate
^metrics\.([^.]+)\.([^.]+)\.host\.([^.]+)\.(.+)$
every 10 seconds
expire after 10 seconds
compute sum write to
metrics.\1.\2.all.sum.\4
compute average write to
metrics.\1.\2.all.avg.\4
;
match *
send to remotes
stop
;
This was running with version 0.39.
Is this behavior intentional, or a bug? The example configs in the README follow the non-working format, so at the least the README should be updated to avoid confusion.
Note that the above config is slightly modified from what I actually ran. Apologies, but I haven't had time yet to put together a verified set of configs and script to reproduce the problem, but the above should work. If it doesn't, I can get back to you with those (or better, a Vagrantfile).
Finally, thank you for your work on carbon-c-relay. I find it to be a real improvement over the carbon-relay and aggregator: easier configuration, more flexible routing, and much more efficient.
Thanks again
-Evan
Would be helpful if there was an option to run this process as a daemon, allowing easy init control.
How are carbon_ch/fnv1a_ch hashes calculated?
On start of carbon-c-relay?
Where is this information stored?
Basically, I want to know: if I use carbon_ch/fnv1a_ch hashing, is the hash recalculated every time carbon-c-relay is restarted? And how do I rebalance if that's the case? I'm thinking a rule-based approach may be better.
All questions above are in regard to sharding, not to distributing among local carbon-caches.
-Thanks
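For intuition, here is a generic consistent-hash ring (a simplified sketch, not carbon-c-relay's actual implementation): the ring is derived purely from the configured server list, so it is rebuilt identically on every restart and nothing needs to be stored on disk. Rebalancing is only a concern when the server list itself changes.

```python
import hashlib

def ring_positions(servers, replicas=100):
    """Build a consistent-hash ring: each server is hashed onto the ring
    at several positions derived purely from its name."""
    ring = []
    for server in servers:
        for i in range(replicas):
            h = hashlib.md5(f"{server}:{i}".encode()).hexdigest()
            ring.append((int(h[:8], 16), server))
    ring.sort()
    return ring

def server_for(metric, ring):
    """Walk the ring clockwise from the metric's hash position."""
    point = int(hashlib.md5(metric.encode()).hexdigest()[:8], 16)
    for pos, server in ring:
        if pos >= point:
            return server
    return ring[0][1]  # wrap around past the highest position

ring = ring_positions(["10.0.0.1:2003", "10.0.0.2:2003"])
```

Because ring_positions depends only on the server names, restarting the process cannot change which server a metric maps to; only adding or removing servers moves (a fraction of) the keys.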
Hello!
I have a problem with consistent-hashing.
My cluster configuration:
node1 - haproxy and 3 carbon-relays in consistent-hashing mode, replication factor 2
node2, node3, node4 - a carbon-relay and 4 carbon-caches on each.
When I try to use carbon-c-relay on node1 instead of the 3 carbon-relays, I face the following problem: the way carbon-c-relay routes metrics differs from the way the original graphite does it.
The order of nodes and the replication factor in the carbon-c-relay config are the same as in carbon.conf. I'm ready to give some extra information.
carbon-c-relay.conf
cluster oldcluster
carbon_ch replication 2
mynode2:2013 proto tcp
mynode3:2013 proto tcp
mynode4:2013 proto tcp
;
match *
send to oldcluster
stop
;
carbon.conf
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 2
DESTINATIONS = mynode2:2014:1, mynode3:2014:2, mynode4:2014:3
[relay:1]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
[relay:2]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
[relay:3]
LINE_RECEIVER_PORT = 2303
PICKLE_RECEIVER_PORT = 2304
Hi,
we use carbon c relay on both graphite/carbon servers and on clients.
One of the things we want to do on clients is to aggregate cpu metrics as provided by collectd every 10 seconds.
To do this I understand that we shall need a configuration something like:
aggregate
*.system.cpu-[0-9]+.cpu-idle
every 10 seconds
expire after 5 seconds
compute average write to
maxwell.vbox.cpu-all.average
for each of the processor states idle, interrupt, nice, softirq, steal, system, user, wait, assuming that as this is a client server it will see metrics only from itself.
It would be really useful, from a configuration management point of view, if the configuration could be written as:
aggregate
*.system.cpu-[0-9]+.cpu-idle
every 10 seconds
expire after 5 seconds
compute average write to
$HOSTNAME.cpu-all.average
I saw the previous feature request for back references and that would be the best of all. However, I understand that the amount of effort to do so would be great.
Another thing. It has been some time since I have done any C but I have looked at your code and I hope to be able to contribute a small patch to vary the time between statistic emission. Our carbon stack emits statistics every 10 seconds, it would be nice for carbon c relay to do likewise.
Can we have a configuration option to change the default prefix for the relay's own metrics?
Meaning, instead of always using carbon.relays, use a configured value. This exists in carbon's cache and relay, and is used by some setups instead of the default carbon.<..> prefixes.
Also helps with situations where carbon-c-relay is talking to multiple carbon-caches across multiple nodes where you'd need multiple carbon-c-relays to replicate the built in carbon-relay setup and avoid infinite loops.
In this case both carbon-c-relays would be trying to write metrics to the same location, overwriting each other.
Thanks for the great work as always.
Greetings,
In issue #14, one of your terminal outputs shows that you have carbon-c-relay listening on ::, thus having the ability to listen on an IPv6 address.
Looking through the code, it isn't obvious to me how to get carbon-c-relay to listen on both IPv4 and IPv6. I admit my C knowledge is sub-par (I mostly deal with Bash, Python, Ruby, and Go), but is there any way to get the daemon to listen on both address families?
From a quick reading, it seems to me that migrating to C++ would make the code more natural: "dispatcher.c" and the queue already look like OOP written in C. Migrating to C++ (using only a small subset of it, e.g. classes and perhaps the new atomic primitives) would make the code more readable without any performance penalty.
When using carbon_ch
, the last node in the list always receives the metrics:
cluster default
carbon_ch
127.0.0.1:2003
127.0.0.1:2103
127.0.0.1:2203
127.0.0.1:2303
;
match *
send to default
stop
;
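For what it's worth, this is the behaviour I would expect from a consistent-hash cluster. The sketch below models a carbon_ch-style ring in Python (32-bit FNV-1a with virtual nodes; illustrative only, not carbon-c-relay's actual hashing), where every node ends up owning a share of the keys rather than the last node receiving everything:

```python
from bisect import bisect
from collections import Counter

def fnv1a(data: bytes) -> int:
    # 32-bit FNV-1a; deterministic, cheap, reasonable spread for a sketch.
    h = 0x811C9DC5
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

NODES = ["127.0.0.1:2003", "127.0.0.1:2103", "127.0.0.1:2203", "127.0.0.1:2303"]
REPLICAS = 100  # virtual points per server; smooths the distribution

# The ring: sorted (hash, node) pairs over every virtual point.
ring = sorted(
    (fnv1a(f"{node}-{i}".encode()), node)
    for node in NODES
    for i in range(REPLICAS)
)
hashes = [h for h, _ in ring]

def node_for(metric: str) -> str:
    # Walk clockwise to the first ring point at or after the metric's hash.
    return ring[bisect(hashes, fnv1a(metric.encode())) % len(ring)][1]

counts = Counter(node_for(f"servers.host{i}.cpu.idle") for i in range(10000))
# A healthy ring gives every node a share; one node receiving everything
# points at ring construction rather than at the inputs.
assert len(counts) == len(NODES)
```

If a ring like this still sent everything to one node, the fault would be in how ring points are inserted or searched, not in the metric names.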
Hi @grobian, as I'm discussing on the carbon issue (graphite-project/carbon#333), I would like to implement a scheme that seems impossible with the standard carbon-cache.py and carbon-relay.py.
Could carbon-c-relay replace carbon-relay.py in that situation? What would a config file for this scheme look like?
Thank you very much!
Is it possible to support multi-threading for the aggregator?
I have a scenario where performance seems to be bottlenecked at the aggregation step: I receive millions of metrics every minute, and all of them need to be aggregated. The configuration file looks like this:
cluster local_carbon
forward
192.168.0.138:2013
;
match ^metrics.all.*
send to local_carbon
stop
;
aggregate
metrics.*.api.ac\.([^.]+)\.([^.]+)\.([^.]+)\.([^.]+)\.([^.]+)\.count
every 10 seconds
expire after 50 seconds
compute sum write to
metrics.all.api.ac.\1.\2.\3.\4.\5.count
compute sum write to
metrics.all.api.ac.\1.\2.all.\4.\5.count
compute sum write to
metrics.all.api.ac.\1.\2.\3.all.\5.count
compute sum write to
metrics.all.api.ac.\1.\2.all.all.\5.count
;
and I have glanced at the code (aggregator.c) a little; maybe it could be changed to bind one thread to each aggregator, instead of one thread for all aggregators?
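To illustrate the idea, here is a sketch of sharding aggregation buckets across locks so that the same metric always lands on the same shard and no global lock is needed; the class and names are hypothetical and not taken from aggregator.c:

```python
import threading
from collections import defaultdict

class ShardedAggregator:
    """Hypothetical sketch: one lock and bucket map per shard, instead of
    a single aggregator owning every bucket."""

    def __init__(self, nshards: int = 4):
        self.nshards = nshards
        self.locks = [threading.Lock() for _ in range(nshards)]
        self.sums = [defaultdict(float) for _ in range(nshards)]

    def _shard(self, metric: str) -> int:
        # The same metric always maps to the same shard, so per-metric
        # totals stay consistent without a global lock.
        return hash(metric) % self.nshards

    def put(self, metric: str, value: float) -> None:
        s = self._shard(metric)
        with self.locks[s]:
            self.sums[s][metric] += value

    def total(self, metric: str) -> float:
        s = self._shard(metric)
        with self.locks[s]:
            return self.sums[s][metric]

agg = ShardedAggregator()

def feed():
    for _ in range(1000):
        agg.put("metrics.all.api.count", 1.0)

threads = [threading.Thread(target=feed) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Four threads each adding 1.0 a thousand times under the shard lock.
assert agg.total("metrics.all.api.count") == 4000.0
```

Metrics for different shards can then be processed concurrently, which is the gain multi-threading would buy here.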
This may not be the proper place for this question but can carbon-c-relay accept/send metrics via pickle? We are moving away from the original carbon-relay and have some applications sending pickled data.
Carbon-c-relay does not accept tabs as separators in line-protocol metrics, but carbon-cache and the built-in carbon-relay do.
This causes issues with clients that send tab-separated metrics, such as sensu, which the built-in tools accept but carbon-c-relay does not.
For a critical production environment it would be interesting if carbon-c-relay could periodically reload its configuration files itself, without restarting the process.
Would that be possible in the mid term?
Some systems emit metrics over UDP; it would be better if carbon-c-relay supported UDP input.
Morning,
It looks like rewrites are performed before matches, so that a match placed before a rewrite is sent data which has already been subject to the rewrite.
From the documentation:
match * send to old;
rewrite ... ;
match * send to new;
I understand that this is not the intended behaviour.
Methodology:
relay invocation in terminal session 1:
$ /usr/bin/relay -p 4000 -w 2 -b 2500 -q 25000 -H maxwell -f ./relay.test.conf
[2014-09-11 11:07:19] starting carbon-c-relay v0.32 (2014-09-10)
configuration:
relay hostname = maxwell
listen port = 4000
workers = 2
send batch size = 2500
server queue size = 25000
routes configuration = ./relay.test.conf
parsed configuration follows:
cluster new
forward
127.0.0.1:5000
;
cluster old
forward
127.0.0.1:6000
;
match ^carbon\.relays\..*$
send to blackhole
stop
;
match *
send to old
;
rewrite ^foo\.(.*)
into bar.\1
;
match *
send to new
;
listening on tcp4 0.0.0.0 port 4000
listening on UNIX socket /tmp/.s.carbon-c-relay.4000
starting 2 workers
starting statistics collector
Listener for cluster old in terminal 2:
nc -kl 6000
Listener for cluster new in terminal 3:
nc -kl 5000
Feed data into graphite from terminal 4:
echo "foo.monkey 4 `date +%s`" | nc 127.0.0.1 4000
Expected result in cluster old:
foo.monkey 4 1410433961
Expected result in cluster new:
bar.monkey 4 1410433961
Actual result in cluster old:
bar.monkey 4 1410433961
Actual result in cluster new:
bar.monkey 4 1410433961
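For clarity, the top-to-bottom semantics I expected from the documentation can be modelled like this (a toy sketch, not the relay's router):

```python
import re

def route(metric: str, rules):
    # Rules applied in order: a match listed before a rewrite should see
    # the original name; one listed after should see the rewritten name.
    seen = {}
    for kind, pattern, target in rules:
        if kind == "match":
            if re.match(pattern, metric):
                seen[target] = metric
        elif kind == "rewrite":
            metric = re.sub(pattern, target, metric)
    return seen

rules = [
    ("match", r".*", "old"),
    ("rewrite", r"^foo\.(.*)", r"bar.\1"),
    ("match", r".*", "new"),
]

result = route("foo.monkey", rules)
assert result == {"old": "foo.monkey", "new": "bar.monkey"}
```

Under these semantics, cluster old receives foo.monkey and cluster new receives bar.monkey, matching the expected results above.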
Is it possible to use a rewrite rule to convert all metric names to lowercase?
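For concreteness, the effect I'm after, modelled outside the relay (a client-side Python sketch, not relay rewrite syntax):

```python
def lowercase_name(line: str) -> str:
    # Lowercase only the metric name; the value and timestamp parts of
    # the line-protocol record are left untouched.
    name, sep, rest = line.partition(" ")
    return name.lower() + sep + rest

assert lowercase_name("Foo.BAR.baz 4 1410433961") == "foo.bar.baz 4 1410433961"
```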
On a RHEL based system, a service relay start fails with exit code 1, if the service is already running. I would prefer the init script to behave like described in the LSB: http://refspecs.linuxbase.org/LSB_3.1.1/LSB-Core-generic/LSB-Core-generic/iniscrptact.html.
Pull request follows.
I've got a simple setup using any_of
in my config:
cluster default
any_of
127.0.0.1:2003
127.0.0.1:2103
127.0.0.1:2203
127.0.0.1:2303
;
match *
send to default
stop
;
After a small amount of time, relay
segfaults with:
Mar 13 02:52:25 sg1infmnp006 kernel: [46512476.556928] carbon-relay[3562]: segfault at 38 ip 00007fef26404e84 sp 00007fef25c19f40 error 4 in libpthread-2.15.so[7fef263fb000+18000]
Looking at it with gdb, I see:
Program terminated with signal 11, Segmentation fault.
#0 0x00007f8afad37e84 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
This is on Ubuntu 12.04, which currently uses 2.15-0ubuntu10.5 of libc6. I've tested it out using carbon_ch instead and haven't had it segfault at all.
I know the relay can send its own statistics into the stats stream, because I'm sure I saw them during my testing. Now, however, they no longer show up. I've tried starting with debug, and they do print to stdout.
# Define the local carbon cache cluster
cluster local_cache
any_of
127.0.0.1:2003
127.0.0.1:2103
127.0.0.1:2203
127.0.0.1:2303
;
# Define the AWS storage cluster
cluster aws_store
carbon_ch
<ip>:2113
<ip>:2113
<ip>:2113
<ip>:2113
;
# Send everything to local
match *
send to local_cache
;
# Send everything to AWS
match *
send to aws_store
;
Hi,
In my environment some systems send metrics containing '*' and '/'.
I created some rules to change those characters to dots and remove the resulting double dots.
I tested them with the -t flag and they work like a charm, but when the metrics are routed to the next relay they come out as if some other rule had been applied.
Example:
input metric
bus.telemetry.feature.http:/some.url.com/api/sky/list*.GET.p999
rewritten metric
bus.telemetry.feature.http:.some.url.com.api.sky.list.GET.p999
output metric
bus.telemetry.feature.http:_some.url.com_api_sky_list_.GET.p999
What am I missing?
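In plain terms, the transformation my rules are meant to perform is this (a Python model of the intent, built from the example above; not the relay rules themselves):

```python
def intended_rewrite(metric: str) -> str:
    # Model of what the rewrite rules are meant to do: drop '*', turn
    # '/' into '.', and collapse any double dots that result.
    out = metric.replace("*", "").replace("/", ".")
    while ".." in out:
        out = out.replace("..", ".")
    return out

src = "bus.telemetry.feature.http:/some.url.com/api/sky/list*.GET.p999"
assert intended_rewrite(src) == "bus.telemetry.feature.http:.some.url.com.api.sky.list.GET.p999"
```

The output metric above instead shows '_' where the dropped or converted characters were, which is what makes it look as though a different rule ran on the way out.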
I've noticed that carbon-c-relay stops listening on the udp socket after some time (about a day). This is on rhel 6.6 with carbon-c-relay 0.39.
Can you please elaborate on the following options:
Options:
-w use <workers> worker threads, defaults to 16
-b server send batch size, defaults to 2500
-q server queue size, defaults to 25000
-d debug mode: currently writes statistics to log
-s submission mode: write info about errors to log
-t config test mode: prints rule matches from input on stdin
How do -w, -b and -q affect performance? -w seems to do nothing for me, and -q seems only to lower the number of metrics sent to carbon-cache.
Is there any chance you could license this under Apache 2 or MIT? GPL2 is completely out of the question for us.
Here's a trace from upstart logs:
[2014-09-17 10:54:56] failed to write() to 127.0.0.1:2013: Broken pipe
*** buffer overflow detected ***: /usr/local/bin/relay terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7fe2dc89c287]
/lib/x86_64-linux-gnu/libc.so.6(+0x10a180)[0x7fe2dc89b180]
/lib/x86_64-linux-gnu/libc.so.6(+0x10b23e)[0x7fe2dc89c23e]
/usr/local/bin/relay[0x40886c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7fe2dcb58e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe2dc88573d]
======= Memory map: ========
00400000-0040e000 r-xp 00000000 09:02 1185849 /usr/local/bin/relay
0060d000-0060e000 r--p 0000d000 09:02 1185849 /usr/local/bin/relay
0060e000-0060f000 rw-p 0000e000 09:02 1185849 /usr/local/bin/relay
02018000-02039000 rw-p 00000000 00:00 0 [heap]
7fe29c000000-7fe29c021000 rw-p 00000000 00:00 0
7fe29c021000-7fe2a0000000 ---p 00000000 00:00 0
7fe2a4000000-7fe2a4021000 rw-p 00000000 00:00 0
7fe2a4021000-7fe2a8000000 ---p 00000000 00:00 0
7fe2a8000000-7fe2a8021000 rw-p 00000000 00:00 0
7fe2a8021000-7fe2ac000000 ---p 00000000 00:00 0
7fe2adeef000-7fe2b0000000 rw-p 00000000 00:00 0
7fe2b0000000-7fe2b007b000 rw-p 00000000 00:00 0
7fe2b007b000-7fe2b4000000 ---p 00000000 00:00 0
7fe2b4000000-7fe2b407c000 rw-p 00000000 00:00 0
7fe2b407c000-7fe2b8000000 ---p 00000000 00:00 0
7fe2b8000000-7fe2b807d000 rw-p 00000000 00:00 0
7fe2b807d000-7fe2bc000000 ---p 00000000 00:00 0
7fe2bc000000-7fe2bc07a000 rw-p 00000000 00:00 0
7fe2bc07a000-7fe2c0000000 ---p 00000000 00:00 0
7fe2c0000000-7fe2c007c000 rw-p 00000000 00:00 0
7fe2c007c000-7fe2c4000000 ---p 00000000 00:00 0
7fe2c4000000-7fe2c4087000 rw-p 00000000 00:00 0
7fe2c4087000-7fe2c8000000 ---p 00000000 00:00 0
7fe2c8000000-7fe2c807a000 rw-p 00000000 00:00 0
7fe2c807a000-7fe2cc000000 ---p 00000000 00:00 0
7fe2cc000000-7fe2cc079000 rw-p 00000000 00:00 0
7fe2cc079000-7fe2d0000000 ---p 00000000 00:00 0
7fe2d0000000-7fe2d10a9000 rw-p 00000000 00:00 0
7fe2d10a9000-7fe2d4000000 ---p 00000000 00:00 0
7fe2d5953000-7fe2d5968000 r-xp 00000000 09:02 9175092 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fe2d5968000-7fe2d5b67000 ---p 00015000 09:02 9175092 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fe2d5b67000-7fe2d5b68000 r--p 00014000 09:02 9175092 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fe2d5b68000-7fe2d5b69000 rw-p 00015000 09:02 9175092 /lib/x86_64-linux-gnu/libgcc_s.so.1
7fe2d5b69000-7fe2d5b6a000 ---p 00000000 00:00 0
7fe2d5b6a000-7fe2d636a000 rw-p 00000000 00:00 0
7fe2d636a000-7fe2d636b000 ---p 00000000 00:00 0
7fe2d636b000-7fe2d6b6b000 rw-p 00000000 00:00 0
7fe2d6b6b000-7fe2d6b6c000 ---p 00000000 00:00 0
7fe2d6b6c000-7fe2d736c000 rw-p 00000000 00:00 0
7fe2d736c000-7fe2d736d000 ---p 00000000 00:00 0
7fe2d736d000-7fe2d7b6d000 rw-p 00000000 00:00 0
7fe2d7b6d000-7fe2d7b6e000 ---p 00000000 00:00 0
7fe2d7b6e000-7fe2d836e000 rw-p 00000000 00:00 0
7fe2d836e000-7fe2d836f000 ---p 00000000 00:00 0
7fe2d836f000-7fe2d8b6f000 rw-p 00000000 00:00 0
7fe2d8b6f000-7fe2d8b70000 ---p 00000000 00:00 0
7fe2d8b70000-7fe2d9370000 rw-p 00000000 00:00 0
7fe2d9370000-7fe2d9371000 ---p 00000000 00:00 0
7fe2d9371000-7fe2d9b71000 rw-p 00000000 00:00 0
7fe2d9b71000-7fe2d9b72000 ---p 00000000 00:00 0
7fe2d9b72000-7fe2da372000 rw-p 00000000 00:00 0
7fe2da372000-7fe2da373000 ---p 00000000 00:00 0
7fe2da373000-7fe2dab73000 rw-p 00000000 00:00 0
7fe2dab73000-7fe2dab74000 ---p 00000000 00:00 0
7fe2dab74000-7fe2db374000 rw-p 00000000 00:00 0
7fe2db374000-7fe2db375000 ---p 00000000 00:00 0
7fe2db375000-7fe2dbb75000 rw-p 00000000 00:00 0
7fe2dbb75000-7fe2dbb76000 ---p 00000000 00:00 0
7fe2dbb76000-7fe2dc376000 rw-p 00000000 00:00 0
7fe2dc376000-7fe2dc38c000 r-xp 00000000 09:02 9175268 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fe2dc38c000-7fe2dc58b000 ---p 00016000 09:02 9175268 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fe2dc58b000-7fe2dc58c000 r--p 00015000 09:02 9175268 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fe2dc58c000-7fe2dc58d000 rw-p 00016000 09:02 9175268 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7fe2dc58d000-7fe2dc58f000 r-xp 00000000 09:02 9178486 /lib/x86_64-linux-gnu/libdl-2.15.so
7fe2dc58f000-7fe2dc78f000 ---p 00002000 09:02 9178486 /lib/x86_64-linux-gnu/libdl-2.15.so
7fe2dc78f000-7fe2dc790000 r--p 00002000 09:02 9178486 /lib/x86_64-linux-gnu/libdl-2.15.so
7fe2dc790000-7fe2dc791000 rw-p 00003000 09:02 9178486 /lib/x86_64-linux-gnu/libdl-2.15.so
7fe2dc791000-7fe2dc946000 r-xp 00000000 09:02 9178488 /lib/x86_64-linux-gnu/libc-2.15.so
7fe2dc946000-7fe2dcb46000 ---p 001b5000 09:02 9178488 /lib/x86_64-linux-gnu/libc-2.15.so
7fe2dcb46000-7fe2dcb4a000 r--p 001b5000 09:02 9178488 /lib/x86_64-linux-gnu/libc-2.15.so
7fe2dcb4a000-7fe2dcb4c000 rw-p 001b9000 09:02 9178488 /lib/x86_64-linux-gnu/libc-2.15.so
7fe2dcb4c000-7fe2dcb51000 rw-p 00000000 00:00 0
7fe2dcb51000-7fe2dcb69000 r-xp 00000000 09:02 9178479 /lib/x86_64-linux-gnu/libpthread-2.15.so
7fe2dcb69000-7fe2dcd68000 ---p 00018000 09:02 9178479 /lib/x86_64-linux-gnu/libpthread-2.15.so
7fe2dcd68000-7fe2dcd69000 r--p 00017000 09:02 9178479 /lib/x86_64-linux-gnu/libpthread-2.15.so
7fe2dcd69000-7fe2dcd6a000 rw-p 00018000 09:02 9178479 /lib/x86_64-linux-gnu/libpthread-2.15.so
7fe2dcd6a000-7fe2dcd6e000 rw-p 00000000 00:00 0
7fe2dcd6e000-7fe2dcf1f000 r-xp 00000000 09:02 9178635 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7fe2dcf1f000-7fe2dd11f000 ---p 001b1000 09:02 9178635 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7fe2dd11f000-7fe2dd13a000 r--p 001b1000 09:02 9178635 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7fe2dd13a000-7fe2dd145000 rw-p 001cc000 09:02 9178635 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7fe2dd145000-7fe2dd149000 rw-p 00000000 00:00 0
7fe2dd149000-7fe2dd16b000 r-xp 00000000 09:02 9178480 /lib/x86_64-linux-gnu/ld-2.15.so
7fe2dd2fc000-7fe2dd362000 rw-p 00000000 00:00 0
7fe2dd367000-7fe2dd36b000 rw-p 00000000 00:00 0
7fe2dd36b000-7fe2dd36c000 r--p 00022000 09:02 9178480 /lib/x86_64-linux-gnu/ld-2.15.so
7fe2dd36c000-7fe2dd36e000 rw-p 00023000 09:02 9178480 /lib/x86_64-linux-gnu/ld-2.15.so
7fff93756000-7fff93777000 rw-p 00000000 00:00 0 [stack]
7fff937ff000-7fff93800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
[2014-09-17 10:55:01] starting carbon-c-relay v0.33 (02696d)
configuration:
relay hostname = IAD01-GRAPHITE02.INTERNAL.NET
listen port = 2003
workers = 8
send batch size = 2500
server queue size = 25000
routes configuration = /etc/relay.conf
parsed configuration follows:
cluster all
fnv1a_ch replication 1
127.0.0.1:2013
127.0.0.1:2113
;
match *
send to all
stop
;
listening on tcp4 0.0.0.0 port 2003
listening on UNIX socket /tmp/.s.carbon-c-relay.2003
starting 8 workers
starting statistics collector
[2014-09-17 10:55:01] failed to connect() to 127.0.0.1:2013: Connection refused
[2014-09-17 10:55:02] server 127.0.0.1:2013: OK
What other information can I provide?
I ran clang's scan-build (using my branch; see the PR for simple logging from my working account, with fixes for clang):
CC="clang" scan-build -o analyze make -j4
I then opened the report (an HTML page describing the full code flow) and saw some errors I think are worth reporting here:
Bug Group | Bug Type | File | Function/Method | Line | Path Length
Dead store | Dead assignment | router.c | router_optimise | 1231 | 1
Dead store | Dead assignment | collector.c | collector_runner | 187 | 1
Logic error | Dereference of null pointer | consistent-hash.c | ch_get_nodes | 238 | 11
Logic error | Dereference of null pointer | consistent-hash.c | ch_addnode | 179 | 12
Logic error | Dereference of null pointer | router.c | router_route_intern | 1538 | 14
Memory error | Memory leak | relay.c | main | 302 | 36
Memory error | Memory leak | router.c | router_readconfig | 747 | 54
Logic error | Result of operation is garbage or undefined | aggregator.c | aggregator_putmetric | 223 | 15
Logic error | Result of operation is garbage or undefined | aggregator.c | aggregator_putmetric | 242 | 19
The memory leaks are not that bad; they are corner cases in socket binding and config parsing where not every malloc'ed variable is freed before the program terminates (in router_readconfig you allocate w but don't free it when you hit 'unexpected end of file'; in relay.c you allocate workers but never free them if binding the socket fails).
About the two null-pointer dereferences in consistent-hash.c: in ch_get_nodes the flagged line is `ret[i].dest = w->server;`. I don't know if it can really happen other than as the result of a poorly crafted config, but still. In ch_addnode, if `ring->entries != NULL` but either w is NULL or `ring->hash_replicas` is less than 1, you get a null pointer dereference at line 179 (in my fork); the flagged line is `last->next = w`, because last will be NULL and you only check that w is not NULL. Again, I think ring->hash_replicas should never be 0, but there may be conditions where it is possible. Maybe these are all false positives, but it would be good if you had a look at the analyzer's output.
It is at least right about the dead assignments: at https://github.com/grobian/carbon-c-relay/blob/master/router.c#L1231 you set rwalk to 'bwalk->firstroute' in the "then" branch and immediately overwrite it with 'bwalk->lastroute' on line 1235. The same goes for https://github.com/grobian/carbon-c-relay/blob/master/collector.c#L187: the assignment is useless, because the value is never read.
I think it would be a good idea to have a sort of "fast restart" ability: e.g. when you add new aggregates, there should be a way to perform a fast reload of the rules.
I work with @nareshov. We are seeing the same thing as reported in issue #17; I just rebuilt from the latest on GitHub:
*** buffer overflow detected ***: /usr/local/bin/relay terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x37)[0x7f8dce1c3387]
/lib/x86_64-linux-gnu/libc.so.6(+0x109280)[0x7f8dce1c2280]
/lib/x86_64-linux-gnu/libc.so.6(+0x10a33e)[0x7f8dce1c333e]
/usr/local/bin/relay[0x40a449]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f8dce47fe9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f8dce1ac8bd]
======= Memory map: ========
00400000-00410000 r-xp 00000000 09:02 5375089 /usr/local/bin/relay
00610000-00611000 r--p 00010000 09:02 5375089 /usr/local/bin/relay
00611000-00612000 rw-p 00011000 09:02 5375089 /usr/local/bin/relay
0095f000-009a5000 rw-p 00000000 00:00 0 [heap]
7f8d71eef000-7f8d74000000 rw-p 00000000 00:00 0
7f8d74000000-7f8d74021000 rw-p 00000000 00:00 0
7f8d74021000-7f8d78000000 ---p 00000000 00:00 0
7f8d78000000-7f8d78021000 rw-p 00000000 00:00 0
7f8d78021000-7f8d7c000000 ---p 00000000 00:00 0
7f8d7c000000-7f8d7c300000 rw-p 00000000 00:00 0
7f8d7c300000-7f8d80000000 ---p 00000000 00:00 0
7f8d80000000-7f8d8044a000 rw-p 00000000 00:00 0
7f8d8044a000-7f8d84000000 ---p 00000000 00:00 0
7f8d84000000-7f8d843fd000 rw-p 00000000 00:00 0
7f8d843fd000-7f8d88000000 ---p 00000000 00:00 0
7f8d88000000-7f8d8848f000 rw-p 00000000 00:00 0
7f8d8848f000-7f8d8c000000 ---p 00000000 00:00 0
7f8d8c000000-7f8d8c4bd000 rw-p 00000000 00:00 0
7f8d8c4bd000-7f8d90000000 ---p 00000000 00:00 0
7f8d90000000-7f8d903be000 rw-p 00000000 00:00 0
7f8d903be000-7f8d94000000 ---p 00000000 00:00 0
7f8d94000000-7f8d9444d000 rw-p 00000000 00:00 0
7f8d9444d000-7f8d98000000 ---p 00000000 00:00 0
7f8d98000000-7f8d983af000 rw-p 00000000 00:00 0
7f8d983af000-7f8d9c000000 ---p 00000000 00:00 0
7f8d9c000000-7f8d9c4b4000 rw-p 00000000 00:00 0
7f8d9c4b4000-7f8da0000000 ---p 00000000 00:00 0
7f8da0000000-7f8da0403000 rw-p 00000000 00:00 0
7f8da0403000-7f8da4000000 ---p 00000000 00:00 0
7f8da4000000-7f8da4022000 rw-p 00000000 00:00 0
7f8da4022000-7f8da8000000 ---p 00000000 00:00 0
7f8da8000000-7f8da84e4000 rw-p 00000000 00:00 0
7f8da84e4000-7f8dac000000 ---p 00000000 00:00 0
7f8dac000000-7f8dac3bb000 rw-p 00000000 00:00 0
7f8dac3bb000-7f8db0000000 ---p 00000000 00:00 0
7f8db0000000-7f8db0527000 rw-p 00000000 00:00 0
7f8db0527000-7f8db4000000 ---p 00000000 00:00 0
7f8db4000000-7f8db4452000 rw-p 00000000 00:00 0
7f8db4452000-7f8db8000000 ---p 00000000 00:00 0
7f8db87f9000-7f8db87fa000 ---p 00000000 00:00 0
7f8db87fa000-7f8db8ffa000 rw-p 00000000 00:00 0
7f8db8ffa000-7f8db8ffb000 ---p 00000000 00:00 0
7f8db8ffb000-7f8db97fb000 rw-p 00000000 00:00 0
7f8db97fb000-7f8db97fc000 ---p 00000000 00:00 0
7f8db97fc000-7f8db9ffc000 rw-p 00000000 00:00 0
7f8db9ffc000-7f8db9ffd000 ---p 00000000 00:00 0
7f8db9ffd000-7f8dba7fd000 rw-p 00000000 00:00 0
7f8dba7fd000-7f8dba7fe000 ---p 00000000 00:00 0
7f8dba7fe000-7f8dbaffe000 rw-p 00000000 00:00 0
7f8dbaffe000-7f8dbafff000 ---p 00000000 00:00 0
7f8dbafff000-7f8dbb7ff000 rw-p 00000000 00:00 0
7f8dbb7ff000-7f8dbb800000 ---p 00000000 00:00 0
7f8dbb800000-7f8dbc000000 rw-p 00000000 00:00 0
7f8dbc000000-7f8dbc4f6000 rw-p 00000000 00:00 0
7f8dbc4f6000-7f8dc0000000 ---p 00000000 00:00 0
7f8dc00fe000-7f8dc00ff000 ---p 00000000 00:00 0
7f8dc00ff000-7f8dc08ff000 rw-p 00000000 00:00 0
7f8dc17fb000-7f8dc17fc000 ---p 00000000 00:00 0
7f8dc17fc000-7f8dc1ffc000 rw-p 00000000 00:00 0
7f8dc1ffc000-7f8dc1ffd000 ---p 00000000 00:00 0
7f8dc1ffd000-7f8dc27fd000 rw-p 00000000 00:00 0
7f8dc27fd000-7f8dc27fe000 ---p 00000000 00:00 0
7f8dc27fe000-7f8dc2ffe000 rw-p 00000000 00:00 0
7f8dc2ffe000-7f8dc2fff000 ---p 00000000 00:00 0
7f8dc2fff000-7f8dc37ff000 rw-p 00000000 00:00 0
7f8dc37ff000-7f8dc3800000 ---p 00000000 00:00 0
7f8dc3800000-7f8dc4000000 rw-p 00000000 00:00 0
7f8dc4000000-7f8dc44fe000 rw-p 00000000 00:00 0
7f8dc44fe000-7f8dc8000000 ---p 00000000 00:00 0
7f8dc80fe000-7f8dc80ff000 ---p 00000000 00:00 0
7f8dc80ff000-7f8dc88ff000 rw-p 00000000 00:00 0
7f8dc88ff000-7f8dc8900000 ---p 00000000 00:00 0
7f8dc8900000-7f8dc9100000 rw-p 00000000 00:00 0
7f8dc9100000-7f8dc9101000 ---p 00000000 00:00 0
7f8dc9101000-7f8dc9901000 rw-p 00000000 00:00 0
7f8dc9901000-7f8dc9902000 ---p 00000000 00:00 0
7f8dc9902000-7f8dca102000 rw-p 00000000 00:00 0
7f8dca102000-7f8dca103000 ---p 00000000 00:00 0
7f8dca103000-7f8dca903000 rw-p 00000000 00:00 0
7f8dca903000-7f8dca904000 ---p 00000000 00:00 0
7f8dca904000-7f8dcb104000 rw-p 00000000 00:00 0
7f8dcbf77000-7f8dcbf8c000 r-xp 00000000 09:02 9175092 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f8dcbf8c000-7f8dcc18b000 ---p 00015000 09:02 9175092 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f8dcc18b000-7f8dcc18c000 r--p 00014000 09:02 9175092 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f8dcc18c000-7f8dcc18d000 rw-p 00015000 09:02 9175092 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f8dcc18d000-7f8dcc18e000 ---p 00000000 00:00 0
7f8dcc18e000-7f8dccb15000 rw-p 00000000 00:00 0
7f8dccb15000-7f8dccb16000 ---p 00000000 00:00 0
7f8dccb16000-7f8dcd49d000 rw-p 00000000 00:00 0
7f8dcd49d000-7f8dcd49e000 ---p 00000000 00:00 0
7f8dcd49e000-7f8dcdc9e000 rw-p 00000000 00:00 0
7f8dcdc9e000-7f8dcdcb4000 r-xp 00000000 09:02 9175268 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7f8dcdcb4000-7f8dcdeb3000 ---p 00016000 09:02 9175268 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7f8dcdeb3000-7f8dcdeb4000 r--p 00015000 09:02 9175268 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7f8dcdeb4000-7f8dcdeb5000 rw-p 00016000 09:02 9175268 /lib/x86_64-linux-gnu/libz.so.1.2.3.4
7f8dcdeb5000-7f8dcdeb7000 r-xp 00000000 09:02 9178524 /lib/x86_64-linux-gnu/libdl-2.15.so
7f8dcdeb7000-7f8dce0b7000 ---p 00002000 09:02 9178524 /lib/x86_64-linux-gnu/libdl-2.15.so
7f8dce0b7000-7f8dce0b8000 r--p 00002000 09:02 9178524 /lib/x86_64-linux-gnu/libdl-2.15.so
7f8dce0b8000-7f8dce0b9000 rw-p 00003000 09:02 9178524 /lib/x86_64-linux-gnu/libdl-2.15.so
7f8dce0b9000-7f8dce26d000 r-xp 00000000 09:02 9178502 /lib/x86_64-linux-gnu/libc-2.15.so
7f8dce26d000-7f8dce46d000 ---p 001b4000 09:02 9178502 /lib/x86_64-linux-gnu/libc-2.15.so
7f8dce46d000-7f8dce471000 r--p 001b4000 09:02 9178502 /lib/x86_64-linux-gnu/libc-2.15.so
7f8dce471000-7f8dce473000 rw-p 001b8000 09:02 9178502 /lib/x86_64-linux-gnu/libc-2.15.so
7f8dce473000-7f8dce478000 rw-p 00000000 00:00 0
7f8dce478000-7f8dce490000 r-xp 00000000 09:02 9178504 /lib/x86_64-linux-gnu/libpthread-2.15.so
7f8dce490000-7f8dce68f000 ---p 00018000 09:02 9178504 /lib/x86_64-linux-gnu/libpthread-2.15.so
7f8dce68f000-7f8dce690000 r--p 00017000 09:02 9178504 /lib/x86_64-linux-gnu/libpthread-2.15.so
7f8dce690000-7f8dce691000 rw-p 00018000 09:02 9178504 /lib/x86_64-linux-gnu/libpthread-2.15.so
7f8dce691000-7f8dce695000 rw-p 00000000 00:00 0
7f8dce695000-7f8dce847000 r-xp 00000000 09:02 9179198 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f8dce847000-7f8dcea46000 ---p 001b2000 09:02 9179198 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f8dcea46000-7f8dcea61000 r--p 001b1000 09:02 9179198 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f8dcea61000-7f8dcea6c000 rw-p 001cc000 09:02 9179198 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
7f8dcea6c000-7f8dcea70000 rw-p 00000000 00:00 0
7f8dcea70000-7f8dcea92000 r-xp 00000000 09:02 9178520 /lib/x86_64-linux-gnu/ld-2.15.so
7f8dceac9000-7f8dceafe000 r--s 00000000 09:02 9830433 /var/cache/nscd/hosts
7f8dceafe000-7f8dcec89000 rw-p 00000000 00:00 0
7f8dcec8e000-7f8dcec92000 rw-p 00000000 00:00 0
7f8dcec92000-7f8dcec93000 r--p 00022000 09:02 9178520 /lib/x86_64-linux-gnu/ld-2.15.so
7f8dcec93000-7f8dcec95000 rw-p 00023000 09:02 9178520 /lib/x86_64-linux-gnu/ld-2.15.so
7fff38f96000-7fff38fb7000 rw-p 00000000 00:00 0 [stack]
7fff38fff000-7fff39000000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
I have noticed that carbon-c-relay replaces the hash symbol with an underscore. Due to some technical debt, we actually have hash signs in our metric names in two locations. One example is below.
env.app.POS#.yadda.count
Other than a rewrite, is there a method to allow this or certain other characters in metric names?
Additionally, a rewrite didn't fix my issue but produced the error below. As I understand it, when a rewrite occurs before a match, the rewritten metric is not cleansed again; it would basically be attempting to rewrite a hash symbol into a hash symbol.
Error:
router_route: failed to rewrite metric: newmetric size too small to hold replacement
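To make the behaviour concrete, this is a model of the substitution we observe; the accepted character set here is inferred from examples (':' survives, '#', '*' and '/' do not), not taken from the relay's code:

```python
ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789-_:."
)

def relay_normalize(metric: str) -> str:
    # Model of the observed behaviour: characters outside the accepted
    # set are replaced with underscores.
    return "".join(c if c in ALLOWED else "_" for c in metric)

assert relay_normalize("env.app.POS#.yadda.count") == "env.app.POS_.yadda.count"
```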
Hi @grobian, are the carbon-c-relay daemon's own statistics documented anywhere?
I can't see any metrics related to resource consumption for each process (and perhaps each thread); do you have plans to add them?
Lots of thanks for your great work!