bookingcom / carbonapi
High-performance Graphite frontend in Go
License: Other
When trying to load a .toml config, I get an error:
2019-01-13T13:50:55.626+0100 FATAL main Failed to parse config file {"error": "yaml: unmarshal errors:\n line 5: cannot unmarshal !!str `concure...` into cfg.preAPI"}
Also, I see only yaml annotations everywhere in the config code. If .toml configs don't work, we should consider removing support for them.
Probably needs to be tweaked.
On the last commit I get the error "Unknown content type 'application/octet-stream'" when receiving a response from a backend.
From commit 25bf30e everything works fine.
This is strange, because there are no changes in pkg/backend/net/net.go and only "application/x-protobuf" and "application/protobuf" are allowed.
As a quick fix I added "application/octet-stream" to the Render and Find functions in pkg/backend/net/net.go.
Stacktrace:
{"level":"ERROR","timestamp":"2019-01-16T15:06:30.313+0300","logger":"render","message":"find error","carbonapi_uuid":"714f6d17-1a35-4a14-8aaa-02d0d7468250","username":"devops","metric":"test","error":"All backend requests failed: 1 backends: Unknown content type 'application/octet-stream'",
"errorVerbose":"Unknown content type 'application/octet-stream'
github.com/bookingcom/carbonapi/pkg/backend/net.Backend.Find
/go/src/github.com/bookingcom/carbonapi/pkg/backend/net/net.go:462
github.com/bookingcom/carbonapi/pkg/backend/net.(*Backend).Find
:1
github.com/bookingcom/carbonapi/pkg/backend.Finds.func1
/go/src/github.com/bookingcom/carbonapi/pkg/backend/rpc.go:134
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:2361
1 backends\nAll backend requests failed"}
When errors are returned by backends, only the number of errors is written to the logs.
It would be handy to have some identification of the failed backends written to the logs as well.
The logic of making find requests before render requests has evolved over time, but the caching logic has not kept pace. Even when sendGlobsAsIs is true, we make a find request in order to determine whether to send the query metric by metric or in one go. Caching the find result would help avoid this extra request.
DiffSeries panics when the first series is nil, and carbonapi's behaviour differs from graphite-web's. Graphite-web misbehaves with this function as well.
For legacy reasons (compatibility with graphite-web), carbonapi find requests return HTTP code 200 even in case of 404s. We need to keep track of this.
The requests served from cache have drastically different characteristics than ones served with remote requests. It makes sense to keep separate statistics for them.
Hence, we need three histograms:
When a query is executed on a whisper archive crossing its left boundary, the results of
movingMedian, movingSum, movingMin, movingMax, movingAverage (there might be more)
are shifted to the left query boundary.
Let's say you have a whisper file with a single archive storing 2 days of secondly data:
{
"10.10.10.10:8080": {
"name": "secondly.views_sum",
"aggregationMethod": "Sum",
"maxRetention": 172800,
"retentions": [
{
"secondsPerPoint": 1,
"numberOfPoints": 172800
}
]
}
}
Querying it over the range from=now-3d until=now, you get 2 days of data padded with nulls from the LEFT.
Applying a moving* function on top of it shifts the result to the left, padding it with nulls from the RIGHT.
This issue does not exist in the ewma function, for instance.
Add an automatic system test that covers the chain:
go-carbon
← zipper
← carbonapi
Add the URI to the chained log for errNoResponses and errNoMetricsFetched (returned from Render, Find and Info in zipper/zipper.go) to make timeouts and "no metrics" issues easier to troubleshoot. Currently only the stack is logged.
After refactoring, some code became dead.
When there is no graphite option specified in the zipper config, it tries to push metrics to 127.0.0.1:3002 anyway. If no graphite is running there (and there is not supposed to be one when the config is empty), it logs packs of errors.
Zipper must not push to graphite when it is not specified in the config.
Prometheus metrics parameters are hard-coded and can only be changed by changing the code. We need to move them to the config.
This requires a small refactoring.
We have 3 parameters:
sendGlobsAsIs: true|false
alwaysSendGlobsAsIs: true|false
maxBatchSize: int
Their logic is convoluted:
sendGlobsAsIs set to true works together with maxBatchSize: carbonapi sends a find request and, depending on the number of metrics returned from the stores, acts differently.
sendGlobsAsIs set to false always sends a find request and ignores maxBatchSize.
alwaysSendGlobsAsIs, I suppose, overrides the sendGlobsAsIs setting and always sends a render request without a find.
Proposing to deprecate sendGlobsAsIs and alwaysSendGlobsAsIs in favor of
resolveGlobs: true|false in conjunction with maxBatchSize:
resolveGlobs: false -> send the render query as it is
resolveGlobs: true -> send a find query, and if:
count of metrics < maxBatchSize -> send the query as it is
count of metrics > maxBatchSize -> group the metrics into batches of maxBatchSize and send them in batches
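The proposed logic can be sketched as a small decision function. planRender and resolveFind are hypothetical names for illustration; resolveFind stands in for the find request to the stores. (The sketch treats a count equal to maxBatchSize as "send as-is"; the boundary case is not specified above.)

```go
package main

import "fmt"

// planRender returns the target lists for the render requests to send:
// either the raw glob, or batches of resolved metric names.
func planRender(target string, resolveGlobs bool, maxBatchSize int,
	resolveFind func(string) []string) [][]string {

	if !resolveGlobs {
		// resolveGlobs: false -> send the render query as it is.
		return [][]string{{target}}
	}
	// resolveGlobs: true -> send a find query first.
	metrics := resolveFind(target)
	if len(metrics) <= maxBatchSize {
		// Few enough metrics -> still send the glob as it is.
		return [][]string{{target}}
	}
	// Otherwise group the resolved metrics into batches of maxBatchSize.
	var batches [][]string
	for i := 0; i < len(metrics); i += maxBatchSize {
		end := i + maxBatchSize
		if end > len(metrics) {
			end = len(metrics)
		}
		batches = append(batches, metrics[i:end])
	}
	return batches
}

func main() {
	find := func(string) []string {
		return []string{"a.1", "a.2", "a.3", "a.4", "a.5"}
	}
	// 5 metrics with maxBatchSize=2 -> 3 render batches.
	fmt.Println(len(planRender("a.*", true, 2, find))) // 3
}
```

Two booleans collapse into one, and maxBatchSize only matters when resolveGlobs is true, which is much easier to document.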
It seems the graphiteWeb function that is here is not used, and it is not clear if it ever was.
We need to figure out whether it is still needed and clean it up if not.
When you query carbonapi's /lb_check, it should report:
ok - if at least one backend (either zipper or storage) is present and connected
fail - if none are present/configured or all failed the heartbeat
Also document it here:
carbonapi/app/carbonapi/http_handlers.go
Line 1087 in e56cfc0
Add docker-compose with clickhouse
backend and following components:
clickhouse
graphite-clickhouse
carbon-clickhouse
carbonapi
zipper
Currently, there are the following problems with logging that need to be fixed:
We need to make the following changes to resolve this:
Re-structure directories in the project to follow a single convention.
This one seems like a good choice. Also, it looks like this struct is already halfway there.
Steps to reproduce:
When either carbonapi or carbonzipper services are restarted, the RAM consumption drops roughly fivefold.
We need to investigate why this happens, because it may be a sign of a memory leak. Another explanation would be caching.
Create a new metric and make sure it is on disk, but that go-carbon's indexer has not indexed it yet.
Send a render request.
It will fail, because we send a find request in advance, and it won't find the metric until go-carbon indexes it.
Wait ~5 minutes and send the render request again: it returns data.
This conflicts with #73, where the metric does not actually exist on one host out of many. In this case the metric in question was stored on a single host.
Looking at the go-carbon logs, I see a lot of find ERRORs not related to the metrics stored on this host:
{"level":"ERROR","timestamp":"2019-02-27T10:46:33.115+0100","logger":"access","message":"find failed","handler":"find","url":"/metrics/find/?format=protobuf&query=metric.name.","peer":"10.1.8.2:60779","carbonapi_uuid":"01134b54-cc10-4bb5-8451-f297a42a299a","query":["metric.name."],"format":"protobuf","runtime_seconds":0.000046948,"reason":"Not Found","error":"Not Found","http_code":404}
We should be smarter about the scope of servers that receive our find requests, and limit the list of target hosts by utilizing the path cache instead of fanning out to all of them.
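The idea can be sketched as a lookup that consults a prefix-to-backends map before falling back to a full fan-out. backendsFor and the flat map structure are hypothetical simplifications; carbonapi's actual path cache is more involved.

```go
package main

import (
	"fmt"
	"strings"
)

// backendsFor limits find fan-out: if the path cache knows which
// backends hold metrics under a matching prefix, only those backends
// are queried; otherwise we fall back to all backends.
func backendsFor(query string, pathCache map[string][]string, all []string) []string {
	for prefix, backends := range pathCache {
		if strings.HasPrefix(query, prefix) {
			return backends
		}
	}
	return all
}

func main() {
	cache := map[string][]string{"metric.name": {"store-1"}}
	all := []string{"store-1", "store-2", "store-3"}
	fmt.Println(backendsFor("metric.name.cpu", cache, all))   // [store-1]
	fmt.Println(len(backendsFor("other.metric", cache, all))) // 3
}
```

This turns the 404 noise on unrelated hosts into targeted requests, at the cost of keeping the cache reasonably fresh.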
The majority of errors are suppressed in carbonapi, remaining invisible, and there is no clear metric of the error rate.
We are limited in what HTTP codes we can return by compatibility concerns, so it's best to expose this as a metric and think about response codes later.
Errors characteristics:
Right now we return 200 + an empty body for almost all error scenarios. Some systems may not handle this well.
At the same time, we need to stay consistent with Python graphite-web.
Also, what should we return in case of cancellation and timeout?
Support all current functionality while using graphite-clickhouse as a backend instead of go-carbon.
Tags will not be supported for now.
from go-carbon logs:
{
"level": "ERROR",
"timestamp": "2019-02-05T14:27:17.576+0100",
"logger": "access",
"message": "find failed",
"handler": "find",
"url": "/metrics/find/?format=protobuf&query=bla.%2A.clusters.%2A.foo.active",
"peer": "10.10.10.10:47549",
"carbonapi_uuid": "7652d624-c031-4384-8bc2-c96039beb7be",
"query": [
"bla.*.clusters.*.foo.active"
],
"format": "protobuf",
"runtime_seconds": 5.7953e-05,
"reason": "Not Found",
"error": "Not Found",
"http_code": 404
}
and immediately after that
{
"level": "ERROR",
"timestamp": "2019-02-05T14:27:18.903+0100",
"logger": "access",
"message": "fetch failed",
"handler": "render",
"url": "/render/?format=protobuf&from=1549344432&target=bla.%2A.clusters.%2A.foo.active&until=1549373232",
"peer": "10.10.10.10:50115",
"carbonapi_uuid": "7652d624-c031-4384-8bc2-c96039beb7be",
"format": "carbonapi_v2_pb",
"targets": [
"bla.*.clusters.*.foo.active"
],
"runtime_seconds": 5.4372e-05,
"reason": "no metrics found",
"http_code": 404
}
The issue is described in go-graphite/carbonapi#315
We need to test and check if it is reproducible in our fork.
@hanzhinstas Could you give some tips on reproducing this?
Decide what coverage provider to use, get permissions, and add it to the repo with a badge.
A lot of functions in the code are way too complex and long. Here's the gometalinter log:
expr/functions/cairo/png/cairo.go:1289::warning: cyclomatic complexity 65 of function setupTwoYAxes() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:1025::warning: cyclomatic complexity 65 of function drawGraph() is high (> 10) (gocyclo)
expr/functions/asPercent/function.go:34::warning: cyclomatic complexity 55 of function (*asPercent).Do() is high (> 10) (gocyclo)
app/carbonapi/http_handlers.go:181::warning: cyclomatic complexity 42 of function (*App).renderHandler() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:2227::warning: cyclomatic complexity 40 of function drawLines() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:1616::warning: cyclomatic complexity 38 of function setupYAxis() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:673::warning: cyclomatic complexity 34 of function EvalExprGraph() is high (> 10) (gocyclo)
expr/functions/tukey/function.go:33::warning: cyclomatic complexity 29 of function (*tukey).Do() is high (> 10) (gocyclo)
app/carbonapi/app.go:282::warning: cyclomatic complexity 29 of function setUpConfig() is high (> 10) (gocyclo)
expr/functions/nonNegativeDerivative/function.go:31::warning: cyclomatic complexity 27 of function (*nonNegativeDerivative).Do() is high (> 10) (gocyclo)
expr/functions/perSecond/function.go:32::warning: cyclomatic complexity 27 of function (*perSecond).Do() is high (> 10) (gocyclo)
expr/functions/pearsonClosest/function.go:33::warning: cyclomatic complexity 25 of function (*pearsonClosest).Do() is high (> 10) (gocyclo)
pkg/parser/parser.go:433::warning: cyclomatic complexity 23 of function parseArgList() is high (> 10) (gocyclo)
expr/functions/graphiteWeb/function.go:78::warning: cyclomatic complexity 21 of function New() is high (> 10) (gocyclo)
expr/functions/moving/function.go:32::warning: cyclomatic complexity 21 of function (*moving).Do() is high (> 10) (gocyclo)
expr/functions/summarize/function.go:33::warning: cyclomatic complexity 21 of function (*summarize).Do() is high (> 10) (gocyclo)
pkg/parser/parser.go:119::warning: cyclomatic complexity 20 of function (*expr).Metrics() is high (> 10) (gocyclo)
date/date.go:49::warning: cyclomatic complexity 20 of function DateParamToEpoch() is high (> 10) (gocyclo)
expr/functions/cairo/png/cairo.go:2025::warning: cyclomatic complexity 20 of function drawGridLines() is high (> 10) (gocyclo)
pkg/parser/parser.go:415::warning: cyclomatic complexity 20 of function IsNameChar() is high (> 10) (gocyclo)
This makes the code convoluted. It makes sense to refactor to reach at least complexity <20.
Presenting the carbonapi_uuid to the user along with the render result will make troubleshooting easier for users.
The suggested header name is X-Carbonapi-uuid.
Add pre-generated functions documentation to the repo together with the script that would do the generation.
This should be a follow-up to #70
We can use a docker-compose setup for benchmarking if long-term benchmark comparison is left out. Docker is not suited to tracking benchmark performance over the long term with a track record, since the environment cannot always be reproduced.
If we are able to make the environment reproducible, this harness can be used for long-term benchmark book-keeping and analysis.
We currently mirror expvars into graphite. We need to mirror the Prometheus metrics as well.
After this is done, we could clean up some of the expvars and graphite metrics, since the Prometheus ones will replace them.
First guess where to start is here
The functional compatibility between carbonapi and graphite-web is maintained as a separate document, COMPATIBILITY.md.
This document can get out of sync very easily and is quite likely out of date now.
We need to implement its automatic generation by calling and comparing the outputs of the /functions endpoints.
When you render a graphOnly=1 PNG, there is a space on the left in carbonapi, but no space in graphite-web.
This also makes it impossible to render smaller graphs, e.g. width=50 height=20.
Make a Docker config that would include the following:
carbonapi
carbonzipper
go-carbon
This will allow people to try carbonapi easily, and to contribute as well.

Our requests limiter works like a semaphore now. The requests are not processed in a FIFO queue but are picked randomly from an unordered pool. The requests have a timeout. This means that some requests can get unlucky, will not be picked up for longer than needed, and will be timed out.
Say we have 10 requests to be processed, each taking 1 second, but every second a new request comes in, so the number of requests to be processed remains constant, i.e. 10. Requests are processed one at a time.
Intuitively, the waiting time for a request should be 9 seconds.
With random picking, the probability that a given request is not picked at each processing cycle is 0.9. The chance that a request stays in the queue for >30 seconds is 0.9^29 ~ 5%. This is much longer than needed, and chances are the request will be timed out and dropped.
With a FIFO queue instead, each request waits 9 seconds and is then processed. The waiting time is more predictable and fair.
Sometimes logs are partly in the classic log format and partly in JSON format. This makes them hard to read.
Global and per-backend timeouts don't work in the general case: they are only applied while the request waits in the "queue", not during actual processing.
We need to expose a counter of requests that were dropped because of context cancellation. This will include both real cancellations and timeouts.
This should be added to Prometheus first.
see title
Gather request statistics separately for requests on different endpoints: /render and /find.
do this on
Currently, we don't handle termination signals gracefully.
The graceful shutdown should include
Rename https://github.com/bookingcom/carbonapi/tree/master/config/carbonzipper.conf to use the .yaml extension.
is this being used?
carbonapi/config/carbonapi.yaml
Line 14 in 133e554
why is this in the upstreams section?
carbonapi/config/carbonapi.yaml
Line 77 in 133e554
What is this? Some explanatory comments are needed.
carbonapi/config/carbonapi.yaml
Lines 130 to 132 in 133e554
keepalive every 30s? really?
carbonapi/config/carbonzipper.conf
Line 34 in 133e554
What do we cache exactly?
carbonapi/config/carbonzipper.conf
Line 40 in 133e554
Dead:
carbonapi/config/carbonzipper.conf
Line 51 in 133e554
what is this?
https://github.com/bookingcom/carbonapi/blob/133e554781d1dfd46c1675479eef362dca99ac90/config/graphiteWeb.yaml
Do we use it? Where is the config parameter for this file?
https://github.com/bookingcom/carbonapi/blob/133e554781d1dfd46c1675479eef362dca99ac90/config/graphTemplates.yaml
Hello.
Do you have any plans to support Graphite tags?
This includes several points:
These should give insight into how long a request waits to be processed and how much load is on the system.