geras's Introduction

Geras provides a Thanos Store API for the OpenTSDB HTTP API. This makes it possible to query OpenTSDB via PromQL, through Thanos.

Since Thanos's StoreAPI is designed for unified data access and is not too Prometheus-specific, Geras can implement it as a proxy onto the OpenTSDB HTTP API. This gives the ability to query OpenTSDB using PromQL, and even enables unified queries (including joins) over Prometheus and OpenTSDB.

Build

go get github.com/G-Research/geras/cmd/geras

After the build you will have a self-contained binary (geras). It writes logs to stdout.

A Dockerfile is also provided (see docker-compose.yaml for an example of using it).

Deployment

At a high level:

  • Run Geras somewhere and point it to OpenTSDB: -opentsdb-address opentsdb:4242;
  • Configure a Thanos query instance with --store=geras:19000 (i.e. the gRPC listen address).

Geras additionally listens on an HTTP port for Prometheus /metrics queries and some debug details (using x/net/trace; see for example /debug/requests and /debug/events).

Usage

  -grpc-listen string
        Service will expose the Store API on this address (default "localhost:19000")
  -http-listen string
        Where to serve HTTP debugging endpoints (like /metrics) (default "localhost:19001")
  -trace-enabled
        Enable tracing of requests, which is shown at /debug/requests (default true)
  -trace-dumpbody
        Include TSDB request and response bodies in traces (can be expensive) (default false)
  -label value
        Label to expose on the Store API, of the form '<key>=<value>'. May be repeated.
  -log.format string
        Log format. One of [logfmt, json] (default "logfmt")
  -log.level string
        Log filtering level. One of [debug, info, warn, error] (default "error")
  -healthcheck-metric
        A metric to query as a readiness health check (default "tsd.rpc.recieved")
  -metrics-refresh-interval duration
        Time between metric name refreshes. Use negative duration to disable refreshes. (default 15m0s)
  -metrics-refresh-timeout
        Timeout for metric refreshes (default 2m0s)
  -metrics-suggestions
        Enable metric suggestions (can be expensive) (default true)
  -opentsdb-address string
        <host>:<port>
  -metrics-allowed-regexp regexp
        A regular expression specifying the allowed metrics. Default is `.*`,
        i.e. everything. A good value if your metric names all match OpenTSDB
        style of `service.metric.name` could be `^\w+\..*$`. Disallowed metrics
        are simply not queried and no error is returned -- the purpose is to
        not send traffic to OpenTSDB when the metric source is Prometheus.
  -metrics-blocked-regexp regexp
        A regular expression of metrics to block. Default is empty and means to
        not block anything. The expected use of this is to block problematic
        queries as a fast mitigation; therefore an error is returned when a
        metric is blocked.
  -metrics-name-response-rewriting
        Rewrite '.' to a defined character and other bad characters to '_' in all responses (Prometheus
        remote_read won't accept these, while Thanos will) (default true)
  -period-character-replace
        Rewrite '.' to a defined character that Prometheus will handle better. (default ':')

When specifying multiple labels, you will need to repeat the argument name, e.g:

./geras -label label1=value1 -label label2=value2
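
A repeated flag of this kind is commonly implemented in Go with a custom flag.Value. The following is a minimal sketch of that technique, not the Geras source: the labelFlags type and the parseLabels helper are hypothetical names introduced for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// labelFlags collects every occurrence of a repeated -label flag.
// Hypothetical type for illustration; not taken from Geras.
type labelFlags map[string]string

// String implements flag.Value.
func (l labelFlags) String() string { return fmt.Sprint(map[string]string(l)) }

// Set implements flag.Value; the flag package calls it once per occurrence.
func (l labelFlags) Set(s string) error {
	key, value, ok := strings.Cut(s, "=")
	if !ok || key == "" {
		return fmt.Errorf("label %q is not of the form <key>=<value>", s)
	}
	l[key] = value
	return nil
}

// parseLabels parses args and returns all labels given via -label.
func parseLabels(args []string) (map[string]string, error) {
	labels := labelFlags{}
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.Var(labels, "label", "Label of the form '<key>=<value>'. May be repeated.")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return labels, nil
}

func main() {
	labels, err := parseLabels([]string{"-label", "label1=value1", "-label", "label2=value2"})
	if err != nil {
		panic(err)
	}
	fmt.Println(labels["label1"], labels["label2"]) // prints "value1 value2"
}
```

Because Set is invoked for each occurrence, a single -label with a comma-separated list would be treated as one key and one value in this sketch.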

Limitations

  • PromQL supports queries without __name__. This is not possible in OpenTSDB and no results will be returned if the query doesn't match on a metric name.
  • Geras periodically loads metric names from OpenTSDB and keeps them in memory to support queries like {__name__=~"regexp"}.
  • Thanos' primary timeseries backend is Prometheus, which doesn't support unquoted dots in metric names. However, OpenTSDB metrics generally use . as a separator within names. In order to query names containing a . you will need to either:
    • Replace all . with another character (we like :).
    • Use the __name__ label to specify the metric name, e.g. {__name__="cpu.percent"}
    • Also watch out for - (dashes) in your metric names
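
The character replacement described above can be sketched as follows. This is an illustration under assumptions, not the Geras implementation: rewriteName is a hypothetical helper, and the accepted character set is the usual Prometheus one for unquoted metric names (letters, digits, _ and :).

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteName maps an OpenTSDB metric name onto characters that Prometheus
// accepts unquoted: '.' becomes periodReplacement and any other character
// outside [a-zA-Z0-9_:] becomes '_'.
// Hypothetical helper; it does not handle names that start with a digit.
func rewriteName(name string, periodReplacement rune) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == '.':
			return periodReplacement
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_', r == ':':
			return r
		default:
			return '_'
		}
	}, name)
}

func main() {
	fmt.Println(rewriteName("cpu.percent", ':'))  // cpu:percent
	fmt.Println(rewriteName("disk-io.read", ':')) // disk_io:read
}
```

Note that the mapping is lossy: disk-io.read and disk_io.read both become disk_io:read, which is one reason to watch out for dashes.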

geras's Issues

Filter for metric names

As Geras has knowledge of __name__ and resolves it itself we can fairly easily implement a filter on names. Add allowed and disallowed metric names regexes (probably simple flags with disallowed checked first to allow easy blocking).

Error handling needs some thought. Returning an error to the user for blocked metrics makes sense, to make it clear that they are actually blocked (I'm thinking this would mostly be used as a rapid response to problematic queries). A query that isn't allowed should not be treated as an error: for example, an allow pattern could be \w+\..* as OpenTSDB metrics usually have a dot in them, while things like up can go to both thanos-store and Geras, and we don't want to return errors for those.

Build and test Dockerfile with Circleci

  • Add dockerfile
  • Add CircleCI build for Docker (#42)
  • Add integration test
    • FakeTSDB (#43)
    • Spin up Geras, Thanos query and FakeTSDB in CircleCI config
    • Run queries against prometheus v1 query API (via Thanos)

Add Dockerfile

Potentially use circleci to check our docker build, maybe use for some kind of integration test? (Probably needs docker swarm, seems like circleci supports that somehow...)

Optimise PromQL regexp matches

Turn regexp matches on lists (e.g. as generated by Grafana list dropdowns), into literal_or.

Potentially also useful to implement NRE.
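
A conservative way to detect such lists is to accept only patterns that are a plain alternation of literal values and fall back to a regexp filter otherwise. The sketch below assumes that approach; literalOrValues is a hypothetical helper, and the set of characters treated as literal is deliberately narrow (no dots, since . is a regexp metacharacter).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// literalAlternation matches patterns made only of literal alternatives,
// such as "web01|web02|web03".
var literalAlternation = regexp.MustCompile(`^[A-Za-z0-9_-]+(\|[A-Za-z0-9_-]+)*$`)

// literalOrValues returns the individual values and true when pattern is a
// pure alternation of literals (optionally wrapped in one pair of
// parentheses); otherwise it returns nil and false.
// Hypothetical helper for illustration.
func literalOrValues(pattern string) ([]string, bool) {
	if strings.HasPrefix(pattern, "(") && strings.HasSuffix(pattern, ")") {
		pattern = pattern[1 : len(pattern)-1]
	}
	if !literalAlternation.MatchString(pattern) {
		return nil, false
	}
	return strings.Split(pattern, "|"), true
}

func main() {
	vals, ok := literalOrValues("(web01|web02|web03)")
	fmt.Println(strings.Join(vals, "|"), ok) // web01|web02|web03 true

	_, ok = literalOrValues("web.*")
	fmt.Println(ok) // false
}
```

The joined values are in the pipe-separated form that an OpenTSDB literal_or filter takes.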

Investigate forcing data as stale

https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf
https://www.robustperception.io/staleness-and-promql

We could add staleness markers, e.g. if data is regularly arriving, add a marker after expected interval * 2... But this won't work for all metrics. Consider a scheme like geras_allowstale: as a prefix when querying?

Alternatively, maybe this is inferring too much about the data, and we should force users to work this out via range queries and/or timestamp().
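
For reference, the first option (a marker after expected interval * 2) might be sketched as below. Everything here is an assumption for illustration: the sample type and the withStaleMarkers helper are hypothetical, and the expected interval is taken as a known input rather than inferred. The marker value is the NaN bit pattern that Prometheus uses for staleness.

```go
package main

import (
	"fmt"
	"math"
)

// staleNaNBits is the NaN bit pattern Prometheus uses as a staleness marker.
const staleNaNBits uint64 = 0x7ff0000000000002

// sample is a hypothetical minimal data point.
type sample struct {
	t int64   // timestamp in milliseconds
	v float64 // value
}

// isStale reports whether v is the staleness marker.
func isStale(v float64) bool { return math.Float64bits(v) == staleNaNBits }

// withStaleMarkers inserts a marker at prev.t + 2*intervalMs whenever the gap
// to the next sample is larger than 2*intervalMs. It adds nothing after the
// last sample, since the end of the data is not known here.
func withStaleMarkers(in []sample, intervalMs int64) []sample {
	out := make([]sample, 0, len(in))
	for i, s := range in {
		if i > 0 {
			prev := in[i-1]
			if s.t-prev.t > 2*intervalMs {
				marker := sample{t: prev.t + 2*intervalMs, v: math.Float64frombits(staleNaNBits)}
				out = append(out, marker)
			}
		}
		out = append(out, s)
	}
	return out
}

func main() {
	in := []sample{{0, 1}, {1000, 2}, {10000, 3}}
	for _, s := range withStaleMarkers(in, 1000) {
		fmt.Println(s.t, isStale(s.v))
	}
}
```

As the issue notes, this only makes sense for metrics that arrive at a regular interval.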

Improve handling of long timeseries windows

Currently geras will send a single query to OpenTSDB for each Series request, therefore potentially needing an unbounded amount of memory.

Split up queries (needs thought about how to combine with downsampling if done, #31) and stream back the data as found.

Later:

  • Makes it possible to parallelise queries, limit overall data volume, etc.

Healthchecks dependent on backend availability

Make Info return an error if unable to talk to backend (probably via calling /api/version, until OpenTSDB supports a better status API, see OpenTSDB/opentsdb #1742).

Also consider an HTTP endpoint that does similar for use with Kubernetes as a readiness probe.

Errors are ignored

Geras doesn't actually look at the ErrorMsg response from opentsdb-goclient. However the fix for this is a bit tricky -- we now rely on not sending an error back when the metric name doesn't exist.

The errors we want to ignore look like {"error":{"code":400,"message":"No such name for 'metrics': 'metric.name.here'","details":"[same as message]","trace":"net.opentsdb.BadRequestException: ... Caused by net.opentsdb.uid.NoSuchUniqueName..."}}

The "net.opentsdb.uid.NoSuchUniqueName" looks like the most useful way to filter, but probably just code == 400 is best rather than relying on implementation details.
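
The code == 400 approach could be sketched like this, decoding the error body shown above directly. The tsdbError type and classify function are hypothetical; how opentsdb-goclient exposes the same information is not assumed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tsdbError mirrors the shape of the OpenTSDB error body quoted above
// (hypothetical type; only the fields used here are declared).
type tsdbError struct {
	Error struct {
		Code    int    `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// classify returns nil when the body carries no error or a 400 (treated as
// "no such metric", i.e. an empty result), and an error for anything else.
func classify(body []byte) error {
	var e tsdbError
	if err := json.Unmarshal(body, &e); err != nil {
		return fmt.Errorf("decoding OpenTSDB error: %w", err)
	}
	switch e.Error.Code {
	case 0, 400:
		return nil
	default:
		return fmt.Errorf("OpenTSDB error %d: %s", e.Error.Code, e.Error.Message)
	}
}

func main() {
	notFound := []byte(`{"error":{"code":400,"message":"No such name for 'metrics': 'metric.name.here'"}}`)
	failure := []byte(`{"error":{"code":500,"message":"backend failure"}}`)
	fmt.Println(classify(notFound))
	fmt.Println(classify(failure))
}
```

The trade-off is that every 400 is swallowed, including genuinely malformed queries, which is the price of not matching on the exception name.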

Label RE matching does partial matches

x:y:z{foo=~"y"} returns foo="xyz" series. Same for __name__.

Against real Prometheus this does full matches (e.g. up{job=~"rometheus"} returns nothing).
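
One common way to get full-match behaviour from Go's regexp package is to wrap the pattern in an anchored non-capturing group before compiling it. A minimal sketch, with compileAnchored as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"regexp"
)

// compileAnchored compiles pattern so that it must match the whole value.
// The non-capturing group keeps alternations such as "a|b" intact.
func compileAnchored(pattern string) (*regexp.Regexp, error) {
	return regexp.Compile("^(?:" + pattern + ")$")
}

func main() {
	partial := regexp.MustCompile("rometheus")
	full, err := compileAnchored("rometheus")
	if err != nil {
		panic(err)
	}
	fmt.Println(partial.MatchString("prometheus")) // true
	fmt.Println(full.MatchString("prometheus"))    // false
	fmt.Println(full.MatchString("rometheus"))     // true
}
```

Without the group, anchoring "a|b" as "^a|b$" would still match any value that merely starts with a or ends with b.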

The `-grpc-listen` argument should have a default

The -grpc-listen argument professes to have a default:

  -grpc-listen string
    	service will expose the store api on this address (default "localhost:19000")

but if you don't pass a string, it complains:

$ go run cmd/geras/main.go -grpc-listen
flag needs an argument: -grpc-listen

Either the help message should change or there should be a default. Either way is fine -- probably having the default is a reasonable choice here.
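
For context, this is how Go's standard flag package treats any string flag: the default applies when the flag is omitted entirely, while naming the flag without a value is a parse error. The sketch below reproduces both cases with a throwaway FlagSet; parseListen is a hypothetical helper, not Geras code.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseListen parses args with a single string flag that has a default.
func parseListen(args []string) (string, error) {
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	addr := fs.String("grpc-listen", "localhost:19000", "Service will expose the Store API on this address")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *addr, nil
}

func main() {
	addr, err := parseListen(nil) // flag omitted: the default is used
	fmt.Println(addr, err)

	_, err = parseListen([]string{"-grpc-listen"}) // flag named without a value
	fmt.Println(err)                               // flag needs an argument: -grpc-listen
}
```

Under that reading, the help text is accurate for the omitted-flag case, and a bare -grpc-listen will always be rejected by the parser.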

Downsample in OpenTSDB

It may be possible to do some aggregations in OpenTSDB to avoid sending the raw OpenTSDB data back. Not all the aggregations make sense; in particular, the Thanos aggrchunks are mostly about downsampling, and the labels don't seem to change.

Maybe useful to do #15 first so we can actually measure the impact.
