
ts-bridge's Introduction

Time Series Bridge is a tool that can be used to import metrics from one monitoring system into another. It regularly runs a specific query against a source monitoring system (currently Datadog & InfluxDB) and writes new time series results into the destination system (currently only Stackdriver).

Table of Contents

  1. Setup Guide
  2. metrics.yaml Configuration
  3. App Configuration
  4. Status Page
  5. Internal Monitoring
  6. Troubleshooting
  7. Development
  8. Support

Setup Guide

In brief, to set up the ts-bridge app:

  1. Create a GCP project that will host the app
  2. Configure metrics for import
  3. Deploy the app and let it auto-import your metrics every minute

The following sections will guide you through this process.

Create and Set Up a Google Cloud Project

We recommend making the project that hosts ts-bridge separate from the rest of your infrastructure so that infrastructure failures will not affect monitoring and monitoring failures will not affect infrastructure.

  1. Log in to GCP and create a new Google Cloud project
  2. Ensure the new project is linked to a billing account (Note that the Stackdriver free tier can accommodate up to about 220 metrics. If you have already consumed your free quota with other usage, the incremental cost per metric is between US$0.55 and US$2.32 per month, depending on which pricing tier you are already in.)
  3. Enable Stackdriver Monitoring for the new project. When prompted:
    • Create a new Stackdriver account for the project
    • Monitor only the new project (it should be selected by default)
    • Skip AWS setup and agent installation
    • Choose whether to receive email status reports for the project

Set Up A Dev Environment

We recommend using Cloud Shell to prepare ts-bridge for deployment to ensure a consistent and stable working environment. If you need a dev environment that you can share among multiple users, consider using a git repository and open-in-cloud-shell links.

  1. If you are not using Cloud Shell:
    • Install go
    • Download and install the Cloud SDK
      • Initialize with the following commands to set the linked project and application default credentials:
      • gcloud init
      • gcloud auth application-default login
  2. Clone the ts-bridge source
    • go get github.com/google/ts-bridge/...
    • The ts-bridge source code should appear in $GOPATH/src/github.com/google/ts-bridge/

End To End Test (Dev Server)

  1. Ensure that you either have Owner permissions for the whole Cloud project, or at minimum the Monitoring Editor role

  2. Create a ts-bridge config with no metrics

    • cd $GOPATH/src/github.com/google/ts-bridge; cp metrics.yaml.example metrics.yaml
    • Edit the yaml file, remove the datadog_metrics and influxdb_metrics sample content, and copy the name of the project you just created into the stackdriver_destinations section.
    • Your metrics.yaml file should look like this:
    datadog_metrics:
    influxdb_metrics:
    stackdriver_destinations:
      - name: stackdriver
        project_id: "your_project_name"
    
  3. Turn on the status page (uncomment #ENABLE_STATUS_PAGE: "yes" in app.yaml)

  4. Update SD_PROJECT_FOR_INTERNAL_METRICS in your app.yaml to match the name of your GCP project.

  5. Launch a dev server

    • dev_appserver.py app.yaml --port 18080
  6. Test via localhost/sync

    • curl http://localhost:18080/sync
  7. Verify that no error messages are shown. Troubleshooting guide:

    Error message: ERROR: StatsCollector: rpc error: code = PermissionDenied desc = The caller does not have permission
    Remedy: Ensure the authenticating user has at least the "Monitoring Editor" role
  8. Configure metrics by following the instructions below.

  9. Test metric ingestion via localhost/sync

    • curl http://localhost:18080/sync
  10. Verify that metrics are visible on status page

    • In Cloud Shell, click the ‘web preview’ button and change the port to 18080
    • If running on a local workstation, browse to http://localhost:18080/
  11. Verify that metrics are visible in the Stackdriver UI

  12. Kill the local dev server

  13. Revert SD_PROJECT_FOR_INTERNAL_METRICS to "" in app.yaml

Docker

Authorization

ts-bridge relies on the Google Cloud Go library to provide authorization and should support all options available for it. Generally, there are three ways to provide credentials:

  • Run gcloud auth application-default login (suitable for local development / dev environments)
  • Set the GOOGLE_APPLICATION_CREDENTIALS="[PATH]" environment variable to point at a credentials file
  • Use GCP platform-provided credentials, such as Workload Identity for GKE

For more information, see the google-cloud-go authorization documentation: https://github.com/googleapis/google-cloud-go#authorization
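
For example, a minimal sketch of the service-account-key approach when running the ts-bridge binary directly (the key path and binary location are placeholders; the flags are described under App Configuration below):

export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/ts-bridge-sa.json"
./ts-bridge --metric-config=metrics.yaml --storage-engine=boltdb --enable-status-page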

Building the image

  1. Build the image from the supplied Dockerfile:
docker build -t ts-bridge:VERSION -t some-other-tag .

Running the image

The image sets the ts-bridge binary as the entrypoint, so it can simply be run with command-line arguments, with configuration files mounted into the working directory (/ts-bridge), e.g.:

docker run -p 8080:8080 \
 -v ${PWD}/metrics.yaml:/ts-bridge/metrics.yaml \
 -v ~/.gcp/my-account-key.json:/ts-bridge/gcp_account_key.json \
 -e "GOOGLE_APPLICATION_CREDENTIALS=/ts-bridge/gcp_account_key.json" \
 ts-bridge:VERSION \
 --debug \
 --storage-engine=boltdb \
 --enable-status-page \
 --stats-sd-project=my-project \
 --update-parallelism=4 \
 --sync-period=10s

Deploy In Production

  1. Ensure that you either have Owner permissions for the whole Cloud project, or at minimum the App Engine Admin and Cloud Scheduler Admin roles
  2. Disable the status page (comment out ENABLE_STATUS_PAGE: "yes" in app.yaml)
    • See below if you'd like to keep the status page enabled in prod.
  3. Create the App Engine application
    • gcloud app create
    • Choose the App Engine region. If you are using ts-bridge to import metrics originating from a system running on GCP, you should run ts-bridge in a different Cloud region from the system itself to ensure independent failure domains.
  4. Deploy app
    • gcloud app deploy --project <your_project_name> --version live
  5. Verify in the Stackdriver metrics explorer that metrics are being imported once a minute

CI

.github/workflows contains a number of GitHub Actions used to automate releases, security scans, tests and dev builds for ts-bridge.

There are two builds for this project's Docker images; see the workflow definitions in .github/workflows for details.

metrics.yaml Configuration

Metric sources and targets are configured in the app/metrics.yaml file.

Metric Sources

See the per-source READMEs in this repository for how to import metrics from the supported metric sources (Datadog and InfluxDB).

Metric Destinations

Stackdriver

Imported metrics can be written to multiple destination Stackdriver projects, even though in practice we expect a single instance of Time Series Bridge to write to a single project (usually matching the GCP project where ts-bridge is running).

Stackdriver destinations are listed in the stackdriver_destinations section of the app/metrics.yaml file. The following parameters can be specified for each destination:

  • name: name of the Stackdriver destination. It's only used internally by ts-bridge to match imported metrics with destinations.
  • project_id: name of the Stackdriver project that metrics will be written to. This parameter is optional; if not specified, the same project where ts-bridge is running will be used.

If you are using ts-bridge to write metrics to a different Stackdriver project from the one it's running in, you will need to grant the roles/monitoring.editor IAM role in the destination project to the service account used by the ts-bridge App Engine app so that it can read and write Stackdriver metrics there.
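
For example, a sketch of granting that role to the default App Engine service account with gcloud (the project IDs are placeholders):

gcloud projects add-iam-policy-binding DESTINATION_PROJECT_ID \
  --member="serviceAccount:TS_BRIDGE_PROJECT_ID@appspot.gserviceaccount.com" \
  --role="roles/monitoring.editor"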

App Configuration

Importing period

Time Series Bridge attempts to import all configured metrics regularly. This is driven by the App Engine Cron Service which is configured in app/cron.yaml. By default metrics are imported every minute.
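
For reference, a minimal sketch of the kind of configuration app/cron.yaml contains (the file in the repository is authoritative; the description text here is illustrative):

cron:
- description: "import metrics"
  url: /sync
  schedule: every 1 minutes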

Global settings

Some other settings can be set globally as environment variables or command-line flags. On App Engine, these variables are configured in the env_variables section of app/app.yaml.

  • DEBUG (--debug): enable debug logging.
  • PORT (--port): ts-bridge server port.
  • CONFIG_FILE (--metric-config): name of the metric configuration file (metrics.yaml).
  • SD_LOOKBACK_INTERVAL (--sd-lookback-interval): time interval used while searching for recent data in Stackdriver. This is also the default backfill interval for when no recent points are found. This interval should be kept reasonably short to avoid fetching too much data from Stackdriver on each update.
    • You might be tempted to increase this significantly to backfill historic values. Please keep in mind that Stackdriver does not allow writing points that are more than 24 hours old. Also, Datadog downsamples values to keep the number of points in each response below ~300. This means that a single request can only cover a time period of 5 hours if you are aiming to get a point per minute.
  • UPDATE_TIMEOUT (--update-timeout): the total time that updating all metrics is allowed to take. The incoming HTTP request from App Engine Cron will fail if it takes longer than this, and a subsequent update will be triggered again.
  • UPDATE_PARALLELISM (--update-parallelism): number of metric updates that are performed in parallel. Parallel updates are scheduled using goroutines and still happen in the context of a single incoming HTTP request, and setting this value too high might result in the App Engine instance running out of RAM.
  • MIN_POINT_AGE (--min-point-age): minimum age of a data point returned by a metric source that makes it eligible for being written. Points that are very fresh (default is 1.5 minutes) are ignored, since the metric source might return incomplete data for them if some input data is delayed.
  • COUNTER_RESET_INTERVAL (--counter-reset-interval): while importing counters, ts-bridge needs to reset 'start time' regularly to keep the query time window small enough. This parameter defines how often a new start time is chosen, and defaults to 30 minutes. See Cumulative metrics section below for more details.
  • STORAGE_ENGINE (--storage-engine): storage engine to use for storing metric metadata, defaults to datastore.
    • datastore - use AppEngine Datastore
    • boltdb - use BoltDB via BoltHold
      • BOLTDB_PATH (--boltdb-path) - path to BoltDB store, e.g. /data/bolt.db (defaults to $PWD/bolt.db)
  • ENABLE_STATUS_PAGE (--enable-status-page): can be set to 'yes' to enable the status web page (disabled by default).

You can use the --env_var flag to override these environment variables when running the app via dev_appserver.py.
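
For example, a sketch of overriding a couple of variables on a local dev server:

dev_appserver.py app.yaml --port 18080 \
  --env_var DEBUG=true \
  --env_var ENABLE_STATUS_PAGE=yes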

Cumulative metrics

Stackdriver supports cumulative metrics, which are monotonically increasing counters. Such metrics allow calculating deltas and rates over different alignment periods.

While neither Datadog nor InfluxDB has first-class support for cumulative metrics, both have cumulative functions that allow their queries to retrieve a cumulative sum. Time Series Bridge can use the results of such queries and import them as cumulative metrics, but such queries need to be explicitly annotated by setting the cumulative option to true in metrics.yaml.

For queries that are marked as cumulative, ts-bridge will regularly choose a 'start time' and then issue queries from that time. As a result, Datadog and InfluxDB will return a monotonically increasing time series with a sum of all measurements since 'start time'. To avoid processing too many points as the cumulative interval grows, 'start time' regularly gets moved forward, keeping the query time window short (see COUNTER_RESET_INTERVAL). Such resets are handled correctly by Stackdriver, since it requires an explicit start time to be provided for cumulative metric points.
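
As an illustration, a hypothetical Datadog counter marked as cumulative might look like this in metrics.yaml. The metric name and query are made up, and only the cumulative flag is the point of this sketch; other fields follow the sample metrics.yaml.example, and additional fields may be required:

datadog_metrics:
  - name: good_events
    query: "cumsum(sum:myapp.good_events{*}.as_count())"
    cumulative: true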

Status Page

If the ENABLE_STATUS_PAGE environment variable is set to 'yes', the index page of the App Engine app shows a list of configured metrics along with the import status of each metric. This can be useful for debugging; however, it is disabled by default to avoid publicly exposing the list of configured metrics (App Engine HTTP endpoints are publicly accessible by default).

If you choose to leave the status page enabled, we recommend configuring Identity-Aware Proxy (IAP) for the Cloud project in which ts-bridge is running. You can use IAP to restrict access to ts-bridge to a specific Google group or a list of Google accounts.

Internal Monitoring

Time Series Bridge uses OpenCensus to report several metrics to Stackdriver:

  • metric_import_latencies: per-metric import latency (in ms). This metric has a metric_name field.
  • import_latencies: total time it took to import all metrics (in ms). If this becomes larger than UPDATE_TIMEOUT, some metrics might not be imported, and you might need to increase UPDATE_PARALLELISM or UPDATE_TIMEOUT.
  • oldest_metric_age: the longest time since the last written point, taken across all metrics (in ms), i.e. the age of the most stale metric. This metric can be used to detect queries that no longer return any data.

All metrics are reported as Stackdriver custom metrics and have names prefixed by custom.googleapis.com/opencensus/ts_bridge/
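
For example, the total import latency metric can be located in Metrics Explorer with a filter along these lines (a sketch; the metric type simply follows the prefix above):

metric.type = "custom.googleapis.com/opencensus/ts_bridge/import_latencies"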

The examples/ directory in this repository contains a suggested Stackdriver Alerting Policy you can use to receive alerts when metric importing breaks.

Troubleshooting

This section describes common issues you might experience with ts-bridge.

Writing points to Stackdriver too frequently

If your query returns more than one point per minute, you might see the following error from Stackdriver:

One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.

The Stackdriver documentation recommends not adding points to the same time series more often than once per minute. If your metric query returns multiple points per minute, use aggregation to reduce the number of points.
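
For example, with Datadog you can use the rollup function to aggregate the series down to one point per minute (the metric name below is a placeholder):

avg:myapp.requests.count{*}.rollup(avg, 60)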

Development

  • Set up a dev environment as per the Setup Guide above.
  • Create a metrics.yaml file in app/
  • Run the app locally using dev_appserver: cd app/ && dev_appserver.py app.yaml --port 18080
  • The app should be available at http://localhost:18080/
    • Note, dev_appserver does not support App Engine cron, so you'll need to run curl http://localhost:18080/sync to import metrics
  • Run tests: go test ./...
    • If you've changed interfaces, run go generate ./... to update mocks
  • If you've changed dependencies, run dep ensure to update vendored libraries and Gopkg.lock

If you'd like to contribute a patch, please see contribution guidelines in CONTRIBUTING.md.

Support

This is not an officially supported Google product.

ts-bridge's People

Contributors

dependabot[bot], dnefedkin, dparrish, knyar, kynan, myktaylor, nerdinary, soaphia, temikus, wizardoffaraz, yanske

ts-bridge's Issues

ts-bridge needs argument credentials support

Currently we don't support passing credentials to ts-bridge directly. It's all handled through google-cloud-go library support, i.e. classic application default credentials.

Design:

  • The app would support two additional methods of auth:
    • a --google-json-key flag and associated env variable allowing a JSON account key file to be passed to the app
    • a --google-json-key-string flag and associated env variable allowing the same account key to be passed as a string
  • If set, those flags must override other methods of selecting auth (the library already does this, so it should be enough to just be careful when piping the options through)

This should require just piping through the options from the lib, described here: https://github.com/googleapis/google-cloud-go#authorization

Fix TSBridge UserAgent presentation

Currently ts-bridge just sends requests with the standard google-cloud-go user agent, leaving the application part empty.

Let's model the user agent on Packer's format:

google-api-go-client/0.5 Packer/1.6.3 (+https://www.packer.io/; go1.15.2; linux/amd64),gzip(gfe)

The proposed format for ts-bridge is:

google-api-go-client/%VERSION% ts-bridge/%VERSION% (go%VERSION%; %PLATFORM%)

e.g.:

google-api-go-client/0.5 ts-bridge/0.1 (go1.13.2; linux/amd64)
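
A possible implementation sketch, assuming the Cloud Monitoring client is constructed somewhere in the Stackdriver adapter and using option.WithUserAgent from google.golang.org/api/option (the version constant and function name are hypothetical):

package stackdriver

import (
	"context"
	"fmt"
	"runtime"

	monitoring "cloud.google.com/go/monitoring/apiv3"
	"google.golang.org/api/option"
)

const tsBridgeVersion = "0.1" // hypothetical version constant

// newMetricClient creates a Cloud Monitoring client that identifies
// ts-bridge in the request user agent.
func newMetricClient(ctx context.Context) (*monitoring.MetricClient, error) {
	ua := fmt.Sprintf("ts-bridge/%s (%s; %s/%s)",
		tsBridgeVersion, runtime.Version(), runtime.GOOS, runtime.GOARCH)
	return monitoring.NewMetricClient(ctx, option.WithUserAgent(ua))
}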

Refactor dependency handlers & external connections

Connections to external dependencies, such as the Prometheus & Stackdriver exporters, are established only after the initial sync loop is started.

A few examples:

https://github.com/google/ts-bridge/blob/master/tasks/tasks.go#L61
To prevent multiple dependency connections being opened, this is wrapped in a sync.Once function:

c.PromExporter, err = pmexporter.NewExporter(pmexporter.Options{

The Prometheus exporter is again wrapped in the parent sync.Once call, which does not allow passing back the ServeHTTP handler to main. This makes it difficult to serve the /metrics handler from a different port, and reuses the same handler that is established in main (https://github.com/google/ts-bridge/blob/master/app/main.go#L139).

There might be other examples; see the sync loop for all the external dependencies that get established there.

TODO:

  1. Dial / establish all external connections inside main at program startup. This will allow failing fast should the dependencies be unavailable or error out.
  2. Pass the pointers to these handlers via a struct. Either re-use the TSBridge.Config struct, or it might make sense to have a separate one.

Disable regular flushing of internal metrics to Stackdriver

I am occasionally seeing the following error message in logs:

StatsCollector: rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[0-2]

What I believe is happening:

  1. when App Engine instance is launched (on first incoming request), a new OpenCensus view worker is started which flushes stats to all exporters every 10 seconds.
  2. ts-bridge runs a sync operation every minute. As part of the sync operation, a new Stackdriver exporter is registered, used to report stats, and then immediately unregistered (which flushes accumulated stats).
  3. since the view worker is long-lived, its attempts to flush stats will happen every 10s until the process dies. Most of the time nothing will happen, since there are no registered exporters. However, if a regular flush attempt happens while the sync operation is running, such a flush attempt will succeed and report metrics to SD.
  4. if a regular flush attempt happens to succeed during the sync operation, it will be almost immediately (within several seconds) followed by another flush attempt initiated by ts-bridge unregistering the SD exporter. Since these two attempts follow each other in quick succession, Stackdriver will complain about points being written too frequently.

There is no way to stop the default view worker; however, we can call SetReportingPeriod to reset its timer. By setting a very long reporting period at the beginning of every sync operation (every time we register the exporter), we will effectively disable regular flushing of metrics and rely only on the flushing that happens at the end of the sync operation.
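
An illustrative sketch of that workaround, assuming the exporter is registered at the start of each sync operation (the function name is hypothetical):

package tsbridge

import (
	"time"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats/view"
)

// registerStatsExporter registers the Stackdriver exporter for the duration
// of one sync operation. Setting a very long reporting period effectively
// disables the view worker's periodic flushes; stats are only flushed when
// the exporter is unregistered at the end of the sync.
func registerStatsExporter(sd *stackdriver.Exporter) {
	view.SetReportingPeriod(time.Hour)
	view.RegisterExporter(sd)
}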

Refactor handler and tasks to use the same struct

In all the funcs inside tasks/tasks.go, the variables tsbridge.MetricConfig, tsbridge.Metrics, tsbridge.Config and storage.Manager are required repeatedly. Instead of passing them every time, a struct could be made to store these values. This would make the function calls much cleaner to read.

There is already a struct named web.Handler which stores these values, so the code should be refactored to use the same struct in both handlers.go and tasks.go. Currently, handlers.go is only used for AppEngine, whereas tasks.go functions are used in both AppEngine and non-AppEngine cases. This must be accounted for when refactoring.
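
A hypothetical sketch of such a struct (the type names mirror those mentioned above; the import paths and field layout are assumptions):

package tasks

import (
	"github.com/google/ts-bridge/storage"
	"github.com/google/ts-bridge/tsbridge"
)

// Deps bundles the values that the task functions currently receive
// individually, so they can be passed (and stored on web.Handler) as one unit.
type Deps struct {
	Config       *tsbridge.Config
	MetricConfig *tsbridge.MetricConfig
	Metrics      *tsbridge.Metrics
	Store        storage.Manager // assumed to be an interface type
}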

See also: post by @nerdinary in #103 (comment)

Skip very fresh points from Datadog

Some folks from Datadog that I talked to mentioned that the most recent point returned by the API might contain incomplete data if a query aggregates values reported from multiple machines and data coming from some of the machines is slightly delayed.

Since it's impossible to edit values that have already been submitted to Stackdriver, I suggest ts-bridge ignore points from Datadog that are less than 1 minute old to allow data to be fully processed and aggregated.

Support Datadog labels

It would be great to support Datadog metrics labels in the query field, or translate Datadog metric labels to Cloud Monitoring metric labels.

Add support for importing counters to Stackdriver

Currently query results get written to Stackdriver as GAUGE metrics.

This does not work well for importing SLIs, since an SLI is typically a ratio of two counters (good_events / valid_events). To allow calculating a ratio over an arbitrary long time window, we'll need ts-bridge to import both counters (good_events and valid_events) to Stackdriver separately as cumulative metrics.

This is not trivial, since CUMULATIVE metrics in Stackdriver require the client to report start_time (metric reset time) with each point. This timestamp is usually set to the start time of the process exporting a counter, and is used by Stackdriver to track counter resets.

I believe we should be able to use Datadog's cumsum function to get a cumulative counter, and then export it to Stackdriver as a CUMULATIVE metric.

This will require keeping some internal state per counter, and I think we can use metric records for that.
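
For example, the two counters could be imported with Datadog queries along these lines (the metric names are made up):

cumsum(sum:myapp.good_events{*}.as_count())
cumsum(sum:myapp.valid_events{*}.as_count())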

1.16.x (just released) - test failure

So Go 1.16 was just released and it seems that something has changed that makes a couple of tests fail, but only for the 1.16.x build, notably:

  • TestUpdateAllMetrics
  • TestUpdateAllMetricsErrors

See: https://github.com/google/ts-bridge/runs/1933024386

Assigning to Adi to take a look later (this is not very urgent). Sorry for not doing it myself, but 1.16 was just released and this didn't fail in the RC, so I couldn't have known.

Containerize ts-bridge, remove GAE dependencies, and make it deployable in K8S

ts-bridge is currently implemented as a GAE application, which is fine for most potential users, but we have already faced challenges trying to deploy ts-bridge in environments that are included in Google VPC Service Controls service perimeters. An attempt to deploy ts-bridge ends with an error:
{"errors":[{"code":"DENIED","message":"Request is prohibited by organization's policy. vpcServiceControlsUniqueIdentifier: "}]}.

GAE is currently not supported by VPC service controls.

The proposal is to containerize ts-bridge and remove any GAE-specific dependencies (ideally, Cloud Datastore dependencies as well). As a result, it will be possible to deploy ts-bridge inside a K8S cluster (e.g. GKE), as a GAE Flex application, or in any other environment that supports running containers.

Run tests with different versions of Go

Currently, the GitHub Action (previously Travis CI) only runs tests with the latest version of Go. We should extend this in the future to run for different versions of Go.

Automate release with Github Actions

Currently, changelogs and release descriptions are written manually for every release. We would like to use GitHub Actions to do the following:

  • create/update CHANGELOG.md
  • create releases containing a changelog that summarises all changes since the previous release
  • Build and release binaries for platforms where possible: linux x86_64, arm, macOS
  • Release source archives in tar.gz format

Avoid writing points to Stackdriver too frequently

Occasionally seeing the following errors from Stackdriver:

total_bytes_rcvd: failed to write to Stackdriver: CreateTimeSeries error: rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[0], timeseries: metric:<type:"custom.googleapis.com/datadog/total_bytes_rcvd" > resource:<type:"global" > metric_kind:GAUGE value_type:DOUBLE points:<interval:<end_time:<seconds:1540994261 > > value:<double_value:14838.339990234375 > >

Datadog returns more than 1 point per minute (for this specific metric points are 15sec apart), while Stackdriver generally recommends sending 1 point per minute.

The Stackdriver adapter should probably just drop points that are less than 1 minute apart from the previously written one.
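
An illustrative sketch of that filtering (not the actual adapter code; the point type is a stand-in):

package sdadapter

import "time"

// point is a hypothetical stand-in for an adapter data point.
type point struct {
	ts    time.Time
	value float64
}

// throttle keeps only points that are at least minGap apart, assuming the
// input is sorted oldest-first. Points closer than minGap to the previously
// kept point are dropped before writing to Stackdriver.
func throttle(points []point, minGap time.Duration) []point {
	var out []point
	var last time.Time
	for _, p := range points {
		if !last.IsZero() && p.ts.Sub(last) < minGap {
			continue
		}
		out = append(out, p)
		last = p.ts
	}
	return out
}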
