
ts-bridge's Issues

Containerize ts-bridge, remove GAE dependencies, and make it deployable in K8S

ts-bridge is currently implemented as a GAE application, which is fine for most potential users, but we've already faced challenges trying to deploy ts-bridge in environments that are included in Google VPC Service Controls service perimeters. An attempt to deploy ts-bridge ends with an error:
{"errors":[{"code":"DENIED","message":"Request is prohibited by organization's policy. vpcServiceControlsUniqueIdentifier: "}]}.

GAE is currently not supported by VPC service controls.

The proposal is to containerize ts-bridge and remove any GAE-specific dependencies (ideally, Cloud Datastore dependencies as well). As a result, it will be possible to deploy ts-bridge inside a K8S cluster (e.g. in GKE), as a GAE Flex application, or in any other environment that supports running containers.

Run tests with different versions of Go

Currently, the GitHub Action (previously Travis CI) only runs tests with the latest version of Go. We should extend this in the future to run for different versions of Go.

Add support for importing counters to Stackdriver

Currently query results get written to Stackdriver as GAUGE metrics.

This does not work well for importing SLIs, since an SLI is typically a ratio of two counters (good_events / valid_events). To allow calculating a ratio over an arbitrarily long time window, we'll need ts-bridge to import both counters (good_events and valid_events) to Stackdriver separately as cumulative metrics.

This is not trivial, since CUMULATIVE metrics in Stackdriver require the client to report start_time (metric reset time) with each point. This timestamp is usually set to the start time of the process exporting a counter, and is used by Stackdriver to track counter resets.

I believe we should be able to use Datadog's cumsum function to get a cumulative counter, and then export it to Stackdriver as a CUMULATIVE metric.

This will require keeping some internal state per counter, and I think we can use metric records for that.

Refactor handler and tasks to use the same struct

In all the funcs inside tasks/tasks.go, the variables tsbridge.MetricConfig, tsbridge.Metrics, tsbridge.Config and storage.Manager are required repeatedly. Instead of passing them every time, a struct could be made to store these values. This would make the function calls much cleaner to read.

There is already a struct named web.Handler which stores these values, so the code should be refactored to use the same struct in both handlers.go and tasks.go. Currently, handlers.go is only used for AppEngine, whereas tasks.go functions are used in both AppEngine and non-AppEngine cases. This must be accounted for when refactoring.

See also: post by @nerdinary in #103 (comment)
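
A rough sketch of the idea, with stand-in types only: `Config`, `Metrics`, `Manager`, `Deps` and `SyncMetrics` below are hypothetical placeholders for tsbridge.Config, tsbridge.Metrics, storage.Manager and the task functions, not the real definitions. The point is just that a method on a shared struct replaces a long, repeated parameter list.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real ts-bridge types.
type (
	Config  struct{ Name string }
	Metrics struct{ Names []string }
	Manager interface{ Close() error }
)

// nopStore is a placeholder storage manager used only for this example.
type nopStore struct{}

func (nopStore) Close() error { return nil }

// Deps bundles the values that the task functions currently receive as
// separate arguments. Both the HTTP handlers and the tasks could hold one.
type Deps struct {
	Config  *Config
	Metrics *Metrics
	Store   Manager
}

// SyncMetrics stands in for a task function. It previously would have taken
// config, metrics and store as parameters; now it reads them from Deps.
// It returns the number of metrics it would have processed.
func (d *Deps) SyncMetrics() (int, error) {
	if d.Config == nil || d.Metrics == nil || d.Store == nil {
		return 0, errors.New("deps are not fully initialised")
	}
	return len(d.Metrics.Names), nil
}

func main() {
	d := &Deps{
		Config:  &Config{Name: "example"},
		Metrics: &Metrics{Names: []string{"total_bytes_rcvd"}},
		Store:   nopStore{},
	}
	n, err := d.SyncMetrics()
	fmt.Println(n, err)
}
```

Because the AppEngine handlers and the non-AppEngine code path would both construct the same struct, the difference between the two cases stays confined to where `Deps` is built.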

Skip very fresh points from Datadog

Some folks from Datadog that I talked to mentioned that the most recent point returned by the API might contain incomplete data if a query aggregates values reported from multiple machines and data coming from some of the machines is slightly delayed.

Since it's impossible to edit values that have already been submitted to Stackdriver, I suggest ts-bridge ignore points from Datadog that are less than 1 minute old to allow data to be fully processed and aggregated.

Fix TSBridge UserAgent presentation

Currently TSBridge just sends requests with the standard google-cloud-go user agent, leaving the application part empty.

Let's model the user agent on Packer's:
google-api-go-client/0.5 Packer/1.6.3 (+https://www.packer.io/; go1.15.2; linux/amd64),gzip(gfe)

The proposed format is:

google-api-go-client/%VERSION% ts-bridge/%VERSION% (go%VERSION%; %PLATFORM%)

e.g.:

google-api-go-client/0.5 ts-bridge/0.1 (go1.13.2; linux/amd64)
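
One way the string could be assembled, as a sketch: the `userAgent` helper and its two version parameters are hypothetical, while runtime.Version, runtime.GOOS and runtime.GOARCH come from the standard library. Note that runtime.Version already includes the "go" prefix.

```go
package main

import (
	"fmt"
	"runtime"
)

// userAgent builds a user agent string in the proposed format.
// clientVersion and bridgeVersion are supplied by the caller; how ts-bridge
// would obtain them (build flags, constants) is not decided here.
func userAgent(clientVersion, bridgeVersion string) string {
	return fmt.Sprintf("google-api-go-client/%s ts-bridge/%s (%s; %s/%s)",
		clientVersion, bridgeVersion,
		runtime.Version(), // already prefixed with "go" on release builds
		runtime.GOOS, runtime.GOARCH)
}

func main() {
	fmt.Println(userAgent("0.5", "0.1"))
}
```

If the Google clients are configured through google.golang.org/api/option, option.WithUserAgent looks like the natural place to pass the resulting string, though that wiring is not shown here.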

Automate release with Github Actions

Currently, changelogs and release descriptions are written manually for every release. We would like to use GitHub Actions to do the following:

  • create/update CHANGELOG.md
  • create releases containing changelog which summarises all changes since previous release
  • Build and release binaries for platforms where possible: linux x86_64, arm, macOS
  • Release source archives in tar.gz format

Support Datadog labels

It would be great to support Datadog metrics labels in the query field, or translate Datadog metric labels to Cloud Monitoring metric labels.

ts-bridge needs argument credentials support

Currently we don't support passing credentials to ts-bridge directly. It's all done through google-cloud-go library support, so classic application default credentials.

Design:

  • App would support additional 2 methods of auth:
    • `--google-json-key` argument and associated env variable, allowing a JSON account key file to be passed to the app
    • `--google-json-key-string` argument and associated env variable, allowing the same account key to be passed as a string
  • If set, those flags must override other methods of selecting auth (library already does it so it should be enough to just be careful when piping the options through)

This should require just piping through the options from the lib, described here: https://github.com/googleapis/google-cloud-go#authorization
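
A sketch of the flag handling using only the standard library. The environment variable names (`GOOGLE_JSON_KEY`, `GOOGLE_JSON_KEY_STRING`), the `credentials` type, the `parseCredentials` helper and the rule that the two options are mutually exclusive are all assumptions made for illustration; the design above does not specify them.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// credentials describes which explicit credential source, if any, was chosen.
// Both fields empty means "fall back to application default credentials".
type credentials struct {
	File string // path to a JSON account key file
	JSON string // JSON account key contents
}

// parseCredentials reads the two proposed flags. Each flag defaults to the
// value of an (assumed) environment variable, so a flag overrides its variable.
func parseCredentials(args []string, getenv func(string) string) (credentials, error) {
	fs := flag.NewFlagSet("ts-bridge", flag.ContinueOnError)
	file := fs.String("google-json-key", getenv("GOOGLE_JSON_KEY"), "path to a JSON account key file")
	str := fs.String("google-json-key-string", getenv("GOOGLE_JSON_KEY_STRING"), "JSON account key as a string")
	if err := fs.Parse(args); err != nil {
		return credentials{}, err
	}
	if *file != "" && *str != "" {
		// Assumption: supplying both sources is treated as a configuration error.
		return credentials{}, errors.New("set only one of --google-json-key and --google-json-key-string")
	}
	return credentials{File: *file, JSON: *str}, nil
}

func main() {
	creds, err := parseCredentials(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	switch {
	case creds.File != "":
		fmt.Println("using key file", creds.File)
	case creds.JSON != "":
		fmt.Println("using inline JSON key")
	default:
		fmt.Println("using application default credentials")
	}
}
```

In the real application, a non-empty `File` would presumably be handed to option.WithCredentialsFile and a non-empty `JSON` to option.WithCredentialsJSON from google.golang.org/api/option; those calls are left out so the example has no external dependencies.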

1.16.x (just released) - test failure

Go 1.16 was just released, and something seems to have changed that makes a couple of tests fail only on the 1.16.x build, notably:

  • TestUpdateAllMetrics
  • TestUpdateAllMetricsErrors

See: https://github.com/google/ts-bridge/runs/1933024386

Assigning to Adi to take a look later (this is not very urgent). Sorry for not doing it myself, but 1.16 just released and this didn't fail in RC so I couldn't have known.

Refactor dependency handlers & external connections

Connections to external dependencies, such as the Prometheus and Stackdriver exporters, are established only after the initial sync loop is started.

A few examples:

https://github.com/google/ts-bridge/blob/master/tasks/tasks.go#L61
To prevent multiple dependency connections from being opened, this is wrapped in a sync.Once call.

c.PromExporter, err = pmexporter.NewExporter(pmexporter.Options{

The Prometheus exporter is again wrapped in the parent sync.Once call, which does not allow passing the ServeHTTP handler back to main. This makes it difficult to serve the /metrics handler from a different port, and reuses the same handler that is established in main (https://github.com/google/ts-bridge/blob/master/app/main.go#L139)

There might be other examples - see the sync loop for all external dependencies established.

TODO:

  1. Dial / establish all external connections inside main at program startup. This allows failing fast if the dependencies are unavailable or return an error.
  2. Pass the pointers to these handlers via a struct. Either re-use the TSBridge.Config struct, or it might make sense to have a separate one.
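
The two TODO items could look roughly like this. The `exporter`, `dialer`, `Deps` and `newDeps` names are hypothetical stand-ins, not the real Prometheus or Stackdriver exporter types; the sketch only shows the shape of "create everything once in main, return an error immediately, and hand the result around in a struct".

```go
package main

import (
	"errors"
	"fmt"
)

// exporter is a hypothetical stand-in for an external dependency handle.
type exporter struct{ name string }

// dialer creates one dependency, or reports why it could not.
type dialer func() (*exporter, error)

// errUnavailable is used by the example to simulate a failing dependency.
var errUnavailable = errors.New("dependency unavailable")

// Deps holds the connections that are established once at startup and then
// passed to the sync loop and the HTTP handlers.
type Deps struct {
	Prom *exporter
	SD   *exporter
}

// newDeps dials every dependency up front and stops at the first failure, so
// a misconfigured process exits at startup instead of during the first sync.
func newDeps(dialProm, dialSD dialer) (*Deps, error) {
	prom, err := dialProm()
	if err != nil {
		return nil, fmt.Errorf("prometheus exporter: %w", err)
	}
	sd, err := dialSD()
	if err != nil {
		return nil, fmt.Errorf("stackdriver exporter: %w", err)
	}
	return &Deps{Prom: prom, SD: sd}, nil
}

func main() {
	ok := func() (*exporter, error) { return &exporter{name: "ok"}, nil }
	bad := func() (*exporter, error) { return nil, errUnavailable }

	if _, err := newDeps(ok, bad); err != nil {
		fmt.Println("startup failed:", err)
	}
	if deps, err := newDeps(ok, ok); err == nil {
		fmt.Println("ready:", deps.Prom.name, deps.SD.name)
	}
}
```

With the exporter created in main, its HTTP handler is also available there, which is what would make serving /metrics on a separate port possible.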

Avoid writing points to Stackdriver too frequently

Occasionally seeing the following errors from Stackdriver:

total_bytes_rcvd: failed to write to Stackdriver: CreateTimeSeries error: rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[0], timeseries: metric:<type:"custom.googleapis.com/datadog/total_bytes_rcvd" > resource:<type:"global" > metric_kind:GAUGE value_type:DOUBLE points:<interval:<end_time:<seconds:1540994261 > > value:<double_value:14838.339990234375 > >

Datadog returns more than 1 point per minute (for this specific metric points are 15sec apart), while Stackdriver generally recommends sending 1 point per minute.

Stackdriver adapter should probably just drop points that are less than 1 minute apart from the one previously written.

Disable regular flushing of internal metrics to Stackdriver

I am occasionally seeing the following error message in logs:

StatsCollector: rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[0-2]

What I believe is happening:

  1. When an App Engine instance is launched (on the first incoming request), a new OpenCensus view worker is started which flushes stats to all exporters every 10 seconds.
  2. ts-bridge runs a sync operation every minute. As part of the sync operation, a new Stackdriver exporter is registered, used to report stats, and then immediately unregistered (which flushes accumulated stats).
  3. Since the view worker is long-lived, its attempts to flush stats will happen every 10s until the process dies. Most of the time nothing will happen, since there are no registered exporters. However, if a regular flush attempt happens while the sync operation is running, that flush attempt will succeed and report metrics to SD.
  4. If a regular flush attempt happens to succeed during the sync operation, it will be almost immediately (within several seconds) followed by another flush attempt initiated by ts-bridge unregistering the SD exporter. Since these two attempts follow each other in quick succession, Stackdriver will complain about points being written too frequently.

There is no way to stop the default view worker, however we can call SetReportingPeriod to reset its timer. By setting a very long reporting period in the beginning of every sync operation (every time we register the exporter) we will effectively disable regular flushing of metrics and will only rely on flushing that happens at the end of the sync operation.
