bookingcom / carbonapi
High-performance Graphite frontend in Go
License: Other
When trying to load a .toml config, I get an error:
2019-01-13T13:50:55.626+0100 FATAL main Failed to parse config file {"error": "yaml: unmarshal errors:\n line 5: cannot unmarshal !!str `concure...` into cfg.preAPI"}
Also, I see only yaml annotations everywhere in the config code. If .toml configs don't work, we should consider removing support for them.
Probably needs to be tweaked.
On the last commit I get the error "Unknown content type 'application/octet-stream'" when receiving a response from a backend.
From commit 25bf30e everything works fine.
This is strange, because there are no changes in pkg/backend/net/net.go and only "application/x-protobuf" and "application/protobuf" are allowed.
As a quick fix I added "application/octet-stream" to the Render and Find functions in pkg/backend/net/net.go.
Stacktrace:
{"level":"ERROR","timestamp":"2019-01-16T15:06:30.313+0300","logger":"render","message":"find error","carbonapi_uuid":"714f6d17-1a35-4a14-8aaa-02d0d7468250","username":"devops","metric":"test","error":"All backend requests failed: 1 backends: Unknown content type 'application/octet-stream'",
"errorVerbose":"Unknown content type 'application/octet-stream'
github.com/bookingcom/carbonapi/pkg/backend/net.Backend.Find
/go/src/github.com/bookingcom/carbonapi/pkg/backend/net/net.go:462
github.com/bookingcom/carbonapi/pkg/backend/net.(*Backend).Find
:1
github.com/bookingcom/carbonapi/pkg/backend.Finds.func1
/go/src/github.com/bookingcom/carbonapi/pkg/backend/rpc.go:134
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:2361
1 backends\nAll backend requests failed"}
When errors are returned by backends, only the number of errors is written to the logs.
It would be handy to have some identification of the failed backends written to the logs as well.
The logic of making find requests before render requests has evolved over time, but the caching logic has not kept pace. Even when sendGlobsAsIs is true, we make a find request in order to determine whether to send the query metric by metric or in one go. Caching the find result would help avoid this extra request.
DiffSeries panics when the first series is nil, and carbonapi's behaviour differs from graphite-web's. Graphite-web misbehaves with this function as well.
For legacy reasons (compatibility with graphite-web), carbonapi find requests return HTTP code 200 even in case of 404s. We need to keep track of this.
The requests served from cache have drastically different characteristics than ones served with remote requests. It makes sense to keep separate statistics for them.
Hence, we need three histograms:
When a query is executed on a whisper archive crossing its left boundary, the results of
movingMedian, movingSum, movingMin, movingMax, movingAverage (there might be more)
are shifted to the left query boundary.
Let's say you have a whisper file with a single archive storing 2 days of secondly data:
{
"10.10.10.10:8080": {
"name": "secondly.views_sum",
"aggregationMethod": "Sum",
"maxRetention": 172800,
"retentions": [
{
"secondsPerPoint": 1,
"numberOfPoints": 172800
}
]
}
}
Querying it over the range from=now-3d until=now, you get 2 days of data padded with nulls from the LEFT.
Applying a moving* function on top of it shifts the result to the left, padding it with nulls from the RIGHT.
This issue does not exist in the ewma function, for instance.
Add an automatic system test that covers the chain:
go-carbon
← zipper
← carbonapi
Add the URI to the chained log for errNoResponses and errNoMetricsFetched (returned from Render, Find and Info in zipper/zipper.go) to make timeouts and "no metrics" issues easier to troubleshoot. Currently only the stack is logged.
After refactoring, some code became dead.
When there is no graphite option specified in the zipper config, it tries to push metrics to 127.0.0.1:3002 anyway. If no graphite is running there (and there is not supposed to be one when the config is empty), it logs packs of errors.
Zipper must not push to graphite when it is not specified in the config.
Prometheus metrics parameters are hard-coded and can only be changed by changing the code. We need to move them to the config.
This requires a small refactoring.
We have 3 parameters:
sendGlobsAsIs: true|false
alwaysSendGlobsAsIs: true|false
maxBatchSize: int
Their logic is convoluted:
sendGlobsAsIs set to true works together with maxBatchSize: carbonapi sends a find request and, depending on the number of metrics returned from the stores, acts differently.
sendGlobsAsIs set to false always sends a find request and ignores maxBatchSize.
alwaysSendGlobsAsIs, I suppose, overrides the sendGlobsAsIs setting and always sends a render request without a find.
Proposing to deprecate sendGlobsAsIs and alwaysSendGlobsAsIs in favor of
resolveGlobs: true|false in conjunction with maxBatchSize:
resolveGlobs: false -> send the render query as it is
resolveGlobs: true -> send a find query, and if:
count of metrics < maxBatchSize -> send the query as it is
count of metrics > maxBatchSize -> group the metrics into batches of maxBatchSize and send them in batches
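The proposed logic can be sketched as a small decision function. planRender and resolveFind are hypothetical names for illustration; resolveFind stands in for the find request to the stores. (The sketch treats a count equal to maxBatchSize as "send as-is"; the boundary case is not specified above.)

```go
package main

import "fmt"

// planRender returns the target lists for the render requests to send:
// either the raw glob, or batches of resolved metric names.
func planRender(target string, resolveGlobs bool, maxBatchSize int,
	resolveFind func(string) []string) [][]string {

	if !resolveGlobs {
		// resolveGlobs: false -> send the render query as it is.
		return [][]string{{target}}
	}
	// resolveGlobs: true -> send a find query first.
	metrics := resolveFind(target)
	if len(metrics) <= maxBatchSize {
		// Few enough metrics -> still send the glob as it is.
		return [][]string{{target}}
	}
	// Otherwise group the resolved metrics into batches of maxBatchSize.
	var batches [][]string
	for i := 0; i < len(metrics); i += maxBatchSize {
		end := i + maxBatchSize
		if end > len(metrics) {
			end = len(metrics)
		}
		batches = append(batches, metrics[i:end])
	}
	return batches
}

func main() {
	find := func(string) []string {
		return []string{"a.1", "a.2", "a.3", "a.4", "a.5"}
	}
	// 5 metrics with maxBatchSize=2 -> 3 render batches.
	fmt.Println(len(planRender("a.*", true, 2, find))) // 3
}
```

Two booleans collapse into one, and maxBatchSize only matters when resolveGlobs is true, which is much easier to document.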
It seems the graphiteWeb function that is here is not used, and it is not clear if it ever was.
We need to figure out whether it is still needed and clean it up if not.
When you query carbonapi's /lb_check, it should report:
ok - if at least one backend (either zipper or storage) is present and connected
fail - if none are present/configured or all failed the heartbeat
Also document it here:
carbonapi/app/carbonapi/http_handlers.go
Line 1087 in e56cfc0
Add docker-compose with clickhouse
backend and following components:
clickhouse
graphite-clickhouse
carbon-clickhouse
carbonapi
zipper
Currently, there are the following problems with logging that need to be fixed:
We need to make the following changes to resolve this:
Re-structure directories in the project to follow a single convention.
This one seems like a good choice. Also, it looks like this struct is already halfway there.
Steps to reproduce:
When either carbonapi or carbonzipper services are restarted, the RAM consumption drops roughly fivefold.
We need to investigate why this happens, because it may be a sign of a memory leak. Another explanation would be caching.
Create a new metric and make sure it is on disk, but that go-carbon's indexer has not indexed it yet.
Send a render request.
It will fail, because we send a find request in advance, and it won't find the metric until go-carbon indexes it.
Wait ~5 minutes and send the render request again: it returns data.
This conflicts with #73, where the metric does not actually exist on one host out of many. In this case the metric in question was stored on a single host.
Looking at the go-carbon logs, I see a lot of find ERRORs not related to the metrics stored on this host:
{"level":"ERROR","timestamp":"2019-02-27T10:46:33.115+0100","logger":"access","message":"find failed","handler":"find","url":"/metrics/find/?format=protobuf&query=metric.name.","peer":"10.1.8.2:60779","carbonapi_uuid":"01134b54-cc10-4bb5-8451-f297a42a299a","query":["metric.name."],"format":"protobuf","runtime_seconds":0.000046948,"reason":"Not Found","error":"Not Found","http_code":404}
We should be smarter about the scope of servers that receive our find requests, and limit the list of target hosts by utilizing the path cache instead of fanning out to all of them.
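The idea can be sketched as a lookup that consults a prefix-to-backends map before falling back to a full fan-out. backendsFor and the flat map structure are hypothetical simplifications; carbonapi's actual path cache is more involved.

```go
package main

import (
	"fmt"
	"strings"
)

// backendsFor limits find fan-out: if the path cache knows which
// backends hold metrics under a matching prefix, only those backends
// are queried; otherwise we fall back to all backends.
func backendsFor(query string, pathCache map[string][]string, all []string) []string {
	for prefix, backends := range pathCache {
		if strings.HasPrefix(query, prefix) {
			return backends
		}
	}
	return all
}

func main() {
	cache := map[string][]string{"metric.name": {"store-1"}}
	all := []string{"store-1", "store-2", "store-3"}
	fmt.Println(backendsFor("metric.name.cpu", cache, all))   // [store-1]
	fmt.Println(len(backendsFor("other.metric", cache, all))) // 3
}
```

This turns the 404 noise on unrelated hosts into targeted requests, at the cost of keeping the cache reasonably fresh.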
The majority of errors are suppressed in carbonapi, remaining invisible, and there is no clear metric of the error rate.
We are limited in what HTTP codes we can return by compatibility concerns, so it's best to expose this as a metric and think about response codes later.
Errors characteristics:
Right now we return 200 + an empty body for almost all error scenarios. Some systems may not handle this well.
At the same time, we need to stay consistent with Python graphite-web.
Also, what should we return in case of cancellation and timeout?
Support all current functionality while using graphite-clickhouse as a backend instead of go-carbon.
Tags will not be supported for now.
from go-carbon logs:
{
"level": "ERROR",
"timestamp": "2019-02-05T14:27:17.576+0100",
"logger": "access",
"message": "find failed",
"handler": "find",
"url": "/metrics/find/?format=protobuf&query=bla.%2A.clusters.%2A.foo.active",
"peer": "10.10.10.10:47549",
"carbonapi_uuid": "7652d624-c031-4384-8bc2-c96039beb7be",
"query": [
"bla.*.clusters.*.foo.active"
],
"format": "protobuf",
"runtime_seconds": 5.7953e-05,
"reason": "Not Found",
"error": "Not Found",
"http_code": 404
}
and immediately after that
{
"level": "ERROR",
"timestamp": "2019-02-05T14:27:18.903+0100",
"logger": "access",
"message": "fetch failed",
"handler": "render",
"url": "/render/?format=protobuf&from=1549344432&target=bla.%2A.clusters.%2A.foo.active&until=1549373232",
"peer": "10.10.10.10:50115",
"carbonapi_uuid": "7652d624-c031-4384-8bc2-c96039beb7be",
"format": "carbonapi_v2_pb",
"targets": [
"bla.*.clusters.*.foo.active"
],
"runtime_seconds": 5.4372e-05,
"reason": "no metrics found",
"http_code": 404
}
The issue is described in go-graphite/carbonapi#315
We need to test and check if it is reproducible in our fork.
@hanzhinstas Could you give some tips on reproducing this?
Decide what coverage provider to use, get permissions, and add it to the repo with a badge.
A lot of functions in the code are way too complex and long. Here's the gometalinter log:
expr/functions/cairo/png/cairo.go:1289::warning: cyclomatic complexity 65 of function setupTwoYAxes() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:1025::warning: cyclomatic complexity 65 of function drawGraph() is high (> 10) (gocyclo)
expr/functions/asPercent/function.go:34::warning: cyclomatic complexity 55 of function (*asPercent).Do() is high (> 10) (gocyclo)
app/carbonapi/http_handlers.go:181::warning: cyclomatic complexity 42 of function (*App).renderHandler() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:2227::warning: cyclomatic complexity 40 of function drawLines() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:1616::warning: cyclomatic complexity 38 of function setupYAxis() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:673::warning: cyclomatic complexity 34 of function EvalExprGraph() is high (> 10) (gocyclo)
expr/functions/tukey/function.go:33::warning: cyclomatic complexity 29 of function (*tukey).Do() is high (> 10) (gocyclo)
app/carbonapi/app.go:282::warning: cyclomatic complexity 29 of function setUpConfig() is high (> 10) (gocyclo)
expr/functions/nonNegativeDerivative/function.go:31::warning: cyclomatic complexity 27 of function (*nonNegativeDerivative).Do() is high (> 10) (gocyclo)
expr/functions/perSecond/function.go:32::warning: cyclomatic complexity 27 of function (*perSecond).Do() is high (> 10) (gocyclo)
expr/functions/pearsonClosest/function.go:33::warning: cyclomatic complexity 25 of function (*pearsonClosest).Do() is high (> 10) (gocyclo)
pkg/parser/parser.go:433::warning: cyclomatic complexity 23 of function parseArgList() is high (> 10) (gocyclo)
expr/functions/graphiteWeb/function.go:78::warning: cyclomatic complexity 21 of function New() is high (> 10) (gocyclo)
expr/functions/moving/function.go:32::warning: cyclomatic complexity 21 of function (*moving).Do() is high (> 10) (gocyclo)
expr/functions/summarize/function.go:33::warning: cyclomatic complexity 21 of function (*summarize).Do() is high (> 10) (gocyclo)
pkg/parser/parser.go:119::warning: cyclomatic complexity 20 of function (*expr).Metrics() is high (> 10) (gocyclo)
date/date.go:49::warning: cyclomatic complexity 20 of function DateParamToEpoch() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:2025::warning: cyclomatic complexity 20 of function drawGridLines() is high (> 10) (gocyclo)
pkg/parser/parser.go:415::warning: cyclomatic complexity 20 of function IsNameChar() is high (> 10) (gocyclo)
This makes the code convoluted. It makes sense to refactor to reach at least complexity <20.
Presenting the carbonapi_uuid to the user along with the render result will make troubleshooting easier for users.
The suggested header name is X-Carbonapi-uuid.
Add pre-generated functions documentation to the repo together with the script that would do the generation.
This should be a follow-up to #70
We can use a docker-compose setup for benchmarking if long-term benchmark comparison is left out. Docker is not suited to tracking benchmark performance over the long term with a track record, since the environment cannot always be reproduced.
If we are able to make the environment reproducible, this harness can be used for long-term benchmark book-keeping and analysis.
We currently mirror expvars into graphite. We need to mirror the Prometheus metrics as well.
After this is done, we could clean up some of the expvars and graphite metrics, since the Prometheus ones will replace them.
First guess where to start is here
The functional compatibility between carbonapi and graphite-web is maintained as a separate document, COMPATIBILITY.md.
This document can get out of sync very easily and is quite likely out of date now.
We need to implement its automatic generation by calling and comparing the outputs of the /functions endpoints.
When you render a graphOnly=1 PNG, there is a space on the left in carbonapi, but no space in graphite-web.
This also makes it impossible to render smaller graphs, e.g. width=50 height=20.
Make a Docker config that would include the following:
carbonapi
carbonzipper
go-carbon
This will allow people to try carbonapi easily, and to contribute as well.

Our requests limiter works like a semaphore now. The requests are not processed in a FIFO queue but are picked randomly from an unordered pool. The requests have a timeout. This means that some requests can get unlucky, will not be picked up for longer than needed, and will be timed out.
Say we have 10 requests to be processed, each taking 1 second, but every second a new request comes in, so the number of requests to be processed remains constant, i.e. 10. Requests are processed one at a time.
Intuitively, the waiting time for a request should be 9 seconds.
With random picking, the probability that a given request is not picked at each processing cycle is 0.9. The chance that a request stays in the queue for >30 seconds is 0.9^29 ~ 5%. This is much longer than needed, and chances are the request will be timed out and dropped.
With a FIFO queue instead, each request waits 9 seconds and is then processed. The waiting time is more predictable and fair.
Sometimes logs are partly in the classic log format and partly in JSON format. This makes them hard to read.
Global and per-backend timeouts don't work in the general case: they are only applied while the request waits in the "queue", not during actual processing.
We need to expose a counter of requests that were dropped because of context cancellation. This will include both real cancellations and timeouts.
This should be added to Prometheus first.
see title
Gather request statistics separately for requests on different endpoints: /render and /find.
do this on
Currently, we don't handle termination signals gracefully.
The graceful shutdown should include
Rename https://github.com/bookingcom/carbonapi/tree/master/config/carbonzipper.conf to use the .yaml extension.
is this being used?
carbonapi/config/carbonapi.yaml
Line 14 in 133e554
why is this in the upstreams section?
carbonapi/config/carbonapi.yaml
Line 77 in 133e554
What is this? Some explanatory comments are needed.
carbonapi/config/carbonapi.yaml
Lines 130 to 132 in 133e554
keepalive every 30s? really?
carbonapi/config/carbonzipper.conf
Line 34 in 133e554
What do we cache exactly?
carbonapi/config/carbonzipper.conf
Line 40 in 133e554
Dead:
carbonapi/config/carbonzipper.conf
Line 51 in 133e554
what is this?
https://github.com/bookingcom/carbonapi/blob/133e554781d1dfd46c1675479eef362dca99ac90/config/graphiteWeb.yaml
Do we use it? Where is the config parameter for this file?
https://github.com/bookingcom/carbonapi/blob/133e554781d1dfd46c1675479eef362dca99ac90/config/graphTemplates.yaml
Hello.
Do you have any plans to support Graphite tags?
This includes several points:
These should give insight into how long a request waits to be processed and how much load is on the system.