load-watcher's Introduction

Load Watcher

The load watcher aggregates cluster-wide resource usage metrics, such as CPU, memory, network, and IO stats, over time windows from a metrics provider such as SignalFx, Prometheus, or Kubernetes Metrics Server. It was developed for Trimaran: Real Load Aware Scheduling in Kubernetes. It stores the metrics in its local cache, which scheduler plugins can query.

The following metrics provider clients are currently supported:

  1. SignalFx
  2. Kubernetes Metrics Server
  3. Prometheus

These clients currently fetch CPU usage; support for other resources will be added later as needed.

Tutorial

This tutorial shows how to build the load watcher Docker image, which can be deployed to work with the Trimaran scheduler plugins.

The default main.go is configured to watch Kubernetes Metrics Server. You can change this to any metrics provider available in pkg/metricsprovider. To build a client for a new metrics provider, implement the FetcherClient interface.

From the root folder, run the following commands to build the load watcher Docker image, tag it, and push it to your Docker repository:

docker build -t load-watcher:<version> .
docker tag load-watcher:<version> <your-docker-repo>:<version>
docker push <your-docker-repo>:<version>

Note that load watcher runs on port 2020 by default. Once it is deployed, you can use the following API to read watcher metrics:

GET /watcher

This returns metrics for all nodes. To filter by host, add the host query parameter.
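
As a rough illustration of the API above, the following Go sketch builds the request URL and reads the raw response. The base address http://localhost:2020 and the host name node-1 are assumptions made for the example, and the function names are hypothetical. The response schema is not described here, so the body is printed without decoding.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// watcherURL builds the GET /watcher URL for a load watcher base address.
// An empty host asks for all nodes; otherwise the host query parameter
// restricts the response to that node.
func watcherURL(base, host string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/watcher"
	if host != "" {
		q := url.Values{}
		q.Set("host", host)
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

// fetchMetrics returns the raw response body without decoding it.
func fetchMetrics(client *http.Client, target string) ([]byte, error) {
	resp, err := client.Get(target)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Assumed address: a load watcher reachable locally on the default port.
	target, err := watcherURL("http://localhost:2020", "node-1")
	if err != nil {
		fmt.Println("bad base address:", err)
		return
	}
	fmt.Println("GET", target)

	client := &http.Client{Timeout: 5 * time.Second}
	body, err := fetchMetrics(client, target)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

Passing an empty host to watcherURL yields the unfiltered request for all nodes.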

Metrics Provider Configuration

  • By default, the Kubernetes Metrics Server client is configured. If running out of cluster, set the KUBE_CONFIG environment variable to the path of your Kubernetes client configuration file.

  • To use the Prometheus client, set the environment variables METRICS_PROVIDER_NAME, METRICS_PROVIDER_ADDRESS and METRICS_PROVIDER_TOKEN to Prometheus, the Prometheus address and the auth token respectively. Do not set METRICS_PROVIDER_TOKEN if no authentication is needed to access the Prometheus APIs. The default address for the Prometheus client is http://prometheus-k8s:9090.

  • To use the SignalFx client, set the environment variables METRICS_PROVIDER_NAME, METRICS_PROVIDER_ADDRESS and METRICS_PROVIDER_TOKEN to SignalFx, the SignalFx address and the auth token respectively. The default address for the SignalFx client is https://api.signalfx.com.
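
The configuration rules above can be sketched in Go as follows. This is not load-watcher's own code: the providerOpts type, the defaultAddresses table and the resolveOpts function are hypothetical names for this example. Only the environment variable names, provider names and default addresses come from the text above.

```go
package main

import (
	"fmt"
	"os"
)

// providerOpts mirrors the three environment variables described above.
// The type is illustrative, not load-watcher's own.
type providerOpts struct {
	Name      string
	Address   string
	AuthToken string
}

// defaultAddresses holds the documented per-provider default addresses.
var defaultAddresses = map[string]string{
	"Prometheus": "http://prometheus-k8s:9090",
	"SignalFx":   "https://api.signalfx.com",
}

// resolveOpts reads the settings through lookup, so it can be exercised
// without touching the real environment. A missing address falls back to
// the provider's default; a missing token stays empty (no authentication).
func resolveOpts(lookup func(string) (string, bool)) providerOpts {
	var opts providerOpts
	opts.Name, _ = lookup("METRICS_PROVIDER_NAME")
	opts.Address, _ = lookup("METRICS_PROVIDER_ADDRESS")
	opts.AuthToken, _ = lookup("METRICS_PROVIDER_TOKEN")
	if opts.Address == "" {
		opts.Address = defaultAddresses[opts.Name]
	}
	return opts
}

func main() {
	opts := resolveOpts(os.LookupEnv)
	if opts.Name == "" {
		fmt.Println("METRICS_PROVIDER_NAME not set; the Kubernetes Metrics Server client is the default")
		return
	}
	// The token is deliberately not printed.
	fmt.Printf("provider=%s address=%s token set=%t\n", opts.Name, opts.Address, opts.AuthToken != "")
}
```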

Deploy load-watcher as a service

To deploy load-watcher as a monitoring service in your Kubernetes cluster, replace the values in [] with those of your own cluster monitoring stack, then run the following command.

> kubectl create -f manifests/load-watcher-deployment.yaml

Using load-watcher client

  • load-watcher-client.go shows an example of using load-watcher packages as libraries in client mode. When load-watcher runs as a service exposing an endpoint in a cluster, a client such as the Trimaran plugins can use these libraries to create a client that gets the latest metrics.

load-watcher's People

Contributors

dependabot[bot], huang-wei, jpedro1992, linericyang, ridv, wangchen615, zorro786

load-watcher's Issues

Prometheus metrics may already be aggregated

The current load-watcher Prometheus package uses the instance:node_cpu:ratio metric to calculate node utilization.
However, while this value was still below 60%, I found that another metric, instance:node_cpu_utilisation:rate1m, was much higher, at around 90%. Apparently the Prometheus metric is smoothed, and the one we use may already be smoothed over a time window larger than 1m, perhaps 5m.

We are not sure which Prometheus metric is consistent with the metric obtained directly from the metrics server, so more testing is needed.

[Screenshot: Prometheus graph, captured 2021-04-23]

Are there some errors?

  1. I found baseUrl = "/Watcher" in pkg/watcher/watcher.go rather than "/watcher".
    This does not match the Trimaran plugin, which defines it as "/watcher".

  2. The build environment should set CGO_ENABLED=0, since the binary is used in Alpine.

Prometheus Client is Missing

The load watcher should work with the Prometheus monitoring stack to provide statistics of resource usage data.
The plan is to support CPU and memory resource usage and statistical aggregations, such as the average and the standard deviation, over rolling windows including 5m, 10m, and 15m.
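
The aggregation described above could look roughly like the Go sketch below. The sample type, the function names and the made-up CPU readings are all assumptions for the example. The population standard deviation is also a choice made here, not something the plan specifies.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// sample is one utilization reading; the type is illustrative only.
type sample struct {
	at    time.Time
	value float64
}

// valuesWithin returns the values of samples taken in (now-window, now].
func valuesWithin(samples []sample, now time.Time, window time.Duration) []float64 {
	cutoff := now.Add(-window)
	var out []float64
	for _, s := range samples {
		if s.at.After(cutoff) && !s.at.After(now) {
			out = append(out, s.value)
		}
	}
	return out
}

// meanStdDev returns the average and the population standard deviation.
// It returns zeros for an empty slice.
func meanStdDev(values []float64) (mean, stdDev float64) {
	if len(values) == 0 {
		return 0, 0
	}
	n := float64(len(values))
	var sum float64
	for _, v := range values {
		sum += v
	}
	mean = sum / n
	var squares float64
	for _, v := range values {
		d := v - mean
		squares += d * d
	}
	return mean, math.Sqrt(squares / n)
}

func main() {
	now := time.Now()
	// Made-up CPU utilization percentages, one reading per minute.
	var samples []sample
	for i := 0; i < 15; i++ {
		samples = append(samples, sample{
			at:    now.Add(-time.Duration(i) * time.Minute),
			value: 40 + float64(i%5)*5,
		})
	}
	for _, w := range []time.Duration{5 * time.Minute, 10 * time.Minute, 15 * time.Minute} {
		mean, sd := meanStdDev(valuesWithin(samples, now, w))
		fmt.Printf("window=%v mean=%.2f stddev=%.2f\n", w, mean, sd)
	}
}
```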

Cannot use the LibraryClient with the watcher

The current design of libraryClient in the release version includes watcher as follows:

// Client for Watcher APIs as a library
type libraryClient struct {
	fetcherClient watcher.MetricsProviderClient
	watcher       *watcher.Watcher
}

Watcher libraries should only be used for load-watcher. libraryClient should not include starting the watcher when the libraryClient is also used by scheduler plugins.

// Creates a new watcher client when using watcher as a library
func NewLibraryClient(opts watcher.MetricsProviderOpts) (Client, error) {
	var err error
	client := libraryClient{}
	switch opts.Name {
	case watcher.PromClientName:
		client.fetcherClient, err = metricsprovider.NewPromClient(opts)
	case watcher.SignalFxClientName:
		client.fetcherClient, err = metricsprovider.NewSignalFxClient(opts)
	default:
		client.fetcherClient, err = metricsprovider.NewMetricsServerClient()
	}
	if err != nil {
		return client, err
	}

	client.watcher = watcher.NewWatcher(client.fetcherClient)
	client.watcher.StartWatching()
	return client, nil
}

The LibraryClient fails to get the latest metrics (getLatestMetrics) when it is used in plugins without select {}.

Add a CI for code build/tests

Currently, there is no job configured on PRs, which can lead to inadvertently broken builds. The build/test checks must be passed before any PR is merged.

Add code style checks and contribution guidelines

It would be nice to have contribution guidelines defined. Also, a script that checks for basic code formatting issues would save time in PR reviews and avoid unintentional oversights. This can be added as an additional check that a PR must pass.

Add a health check API

Currently, load-watcher has no health check API. One could be used, for example, for liveness/readiness checks in Kubernetes. It could be an additional API in the existing web server.
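
A minimal sketch of such an endpoint, using only Go's standard library, might look like this. The /healthz path and the handler and helper names are assumptions, not part of load-watcher. A real check would probably also verify that the metrics provider is reachable.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthHandler replies 200 OK to GET requests and 405 to other methods.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "ok")
}

// newMux registers the health check on a hypothetical /healthz path.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthHandler)
	return mux
}

// probe sends one in-process request to the mux and returns the status code.
func probe(method, path string) int {
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	// Exercise the handler in-process instead of binding a real port.
	fmt.Println("GET /healthz ->", probe(http.MethodGet, "/healthz"))
	fmt.Println("POST /healthz ->", probe(http.MethodPost, "/healthz"))
}
```

In a deployment, the same handler could be registered on the existing web server and referenced from Kubernetes liveness and readiness probes.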

Add Changelog & Release Notes

The changelog file is missing from the repo. Additionally, the release notes under releases are not descriptive and lack information. It might be a good idea to automate this with GitHub tools.

Create version compatibility table in README

Currently, it is hard to tell which load-watcher version is compatible with which kube-scheduler/scheduler-plugins version. We should have a table that declares release compatibility.

Provide more details about cpu and memory prometheus metric.

It took me some time to find out what exactly the instance:node_cpu:ratio metric is. It seems the CPU and memory metrics come from the helm-charts/charts/kube-prometheus-stack/templates/prometheus/rules/kube-prometheus-node-recording.rules.yaml rule, which has been removed and seems to have been replaced by the instance:node_load1_per_cpu:ratio rule in a later version. I think it would be better to have a detailed description of the CPU and memory metrics and to provide a way to configure their names.

support other metrics in load-watcher

I wonder if supporting other metrics in load-watcher is on the roadmap. I can help with the development if that would make sense :)

For instance, I would like to include bandwidth metrics retrieved from Prometheus:

node_network_transmit_bytes and node_network_receive_bytes

Maintaining consistent release version

The load-watcher can be used either as a microservice or as a library.

  • We need to maintain release versions so that Trimaran scheduler plugins can always work with a stable release version while others continue to contribute to other branches.
  • Besides, this will make sure a specific version of the Trimaran plugins works with the same release version of the Docker image and the library, regardless of whether it runs in client mode or in package mode.
  • Need an official docker hub account for the image.

how to deploy the load-watcher as a service?

Hi all,
I found this line in README:
kubectl create -f manifests/load-watcher-deployment.yaml
but I did not find manifests/load-watcher-deployment.yaml in the repo.
Maybe a sample deployment file is needed?
thanks.

Enable logs with line numbers

Currently, logs do not print line numbers, which makes debugging a bit difficult. The logging library we use, sirupsen/logrus, has an option to enable this.

SignalFx FetchAllHostMetrics returns FQDN node names

In the SignalFx client's implementation of FetchAllHostMetrics(), the query fetches all metrics from the SignalFx server keyed by FQDN, which SignalFx adds to Kubernetes node names. For example,

a node named "test" is stored with its FQDN as "test.gcp.us-central.com". We would like to strip the FQDN down to the node name.
We need a filter per cluster to avoid conflicts across clusters when we strip, and also for better query performance.
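
The stripping step could be sketched in Go as below. This only models the idea: the per-cluster filter is represented as a domain suffix match, which is an assumption for illustration. In practice the filter would more likely be part of the SignalFx query itself. Function names and values are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// shortNodeName returns the part of an FQDN before the first dot.
// A name without dots is returned unchanged.
func shortNodeName(fqdn string) string {
	if i := strings.IndexByte(fqdn, '.'); i >= 0 {
		return fqdn[:i]
	}
	return fqdn
}

// filterByCluster keeps only entries whose FQDN ends with clusterSuffix and
// re-keys them by short node name, so equal node names from other clusters
// cannot collide after stripping.
func filterByCluster(metricsByFQDN map[string]float64, clusterSuffix string) map[string]float64 {
	out := make(map[string]float64)
	for fqdn, value := range metricsByFQDN {
		if strings.HasSuffix(fqdn, clusterSuffix) {
			out[shortNodeName(fqdn)] = value
		}
	}
	return out
}

func main() {
	// Made-up values keyed by FQDN; two clusters share the node name "test".
	metrics := map[string]float64{
		"test.gcp.us-central.com": 42,
		"test.aws.us-east.com":    7,
	}
	fmt.Println(filterByCluster(metrics, ".gcp.us-central.com"))
}
```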

Need lib to run `load-watcher` client

When Trimaran plugins run clients to obtain metrics from a load-watcher service, they can either issue HTTP queries to the load-watcher service or use a load-watcher client library to get metrics.

  1. We do not have a load-watcher client library now.
  2. We should allow users, such as Trimaran plugins, to use load-watcher as a library for all the clients load-watcher supports.
  3. The existing load-watcher clients, such as Prometheus, k8s, and SignalFx, should have the same interface as the new load-watcher client to be added.
