platform-aware-scheduling's Introduction

Platform Aware Scheduling

Platform Aware Scheduling (PAS) contains a group of related projects designed to expose platform-specific attributes to the Kubernetes scheduler using a modular, policy-driven approach. The project contains a core library and information for building custom scheduler extensions, as well as specific implementations that can be used in a working cluster or leveraged as a reference for creating new Kubernetes scheduler extensions.

Telemetry Aware Scheduling is the initial reference implementation of Platform Aware Scheduling. It can expose any platform-level metric to the Kubernetes scheduler for policy-driven filtering and prioritization of workloads. You can read more about TAS here.

GPU Aware Scheduling is the implementation of the GPU resource aware Kubernetes scheduler extension.

Kubernetes Scheduler Extenders

Platform Aware Scheduling leverages the power of Kubernetes Scheduling Extenders. These extenders allow the core Kubernetes scheduler to make HTTP calls to an external service which can then modify scheduling decisions. This can be used to provide workload specific scheduling direction based on attributes not normally exposed to the Kubernetes scheduler.

The extender package at the top-level of this repo can be used to quickly create a working scheduler extender.
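To illustrate the shape of such a service, below is a minimal sketch of a filter endpoint written directly against the upstream k8s.io/kube-scheduler extender v1 types rather than this repo's extender package; the /scheduler/filter path and port 9001 simply mirror the example configuration in the next section.

package main

import (
	"encoding/json"
	"log"
	"net/http"

	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// filter receives the candidate nodes from kube-scheduler and returns the
// subset this extender accepts. Here every candidate passes; a real extender
// would consult platform metrics before deciding.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := extenderv1.ExtenderFilterResult{
		Nodes:       args.Nodes,     // populated when the scheduler sends full node objects
		NodeNames:   args.NodeNames, // populated when the extender is nodeCacheCapable
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(result); err != nil {
		log.Printf("failed to encode filter result: %v", err)
	}
}

func main() {
	http.HandleFunc("/scheduler/filter", filter)
	// Plain HTTP for brevity; the example configuration below uses HTTPS with mutual TLS.
	log.Fatal(http.ListenAndServe(":9001", nil))
}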

Enabling a scheduler extender

Scheduler extenders are enabled by providing a scheduling configuration file to the default Kubernetes scheduler. An example of a configuration file:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
  - urlPrefix: "https://tas-service.telemetry-aware-scheduling.svc.cluster.local:9001"
    prioritizeVerb: "scheduler/prioritize"
    filterVerb: "scheduler/filter"
    weight: 1
    enableHTTPS: true
    managedResources:
      - name: "telemetry/scheduling"
        ignoredByScheduler: true
    ignorable: true
    tlsConfig:
      insecure: false
      certFile: "/host/certs/client.crt"
      keyFile: "/host/certs/client.key"

There are a number of options available to us under the "extenders" configuration object. Some of these fields, such as urlPrefix, filterVerb and prioritizeVerb, are necessary to point the Kubernetes scheduler to our scheduling service, while the tlsConfig section deals with the certificates used for mutual TLS. The remaining fields tune the behavior of the scheduler: managedResources specifies which pods should be scheduled using this service, in this case pods which request the dummy resource telemetry/scheduling; ignorable tells the scheduler what to do if it can't reach our extender; and weight sets the relative influence our extender has on prioritization decisions.
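For example, a pod opts in to scheduling through this extender by requesting the dummy resource named under managedResources (the pod name and image below are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: demo-workload              # illustrative name
spec:
  containers:
    - name: app
      image: busybox               # illustrative image
      resources:
        limits:
          telemetry/scheduling: 1  # matches the managedResources entry above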

With a configuration like the above included in the Kubernetes scheduler configuration, the identified webhook becomes part of the scheduling process.
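On a kubeadm-managed cluster this is typically done by passing the file to the scheduler's --config flag in its static pod manifest; a sketch of the relevant fragment, with illustrative file paths, looks like the following. The configure-scheduler.sh script mentioned in the issues below automates an edit of this kind.

# fragment of /etc/kubernetes/manifests/kube-scheduler.yaml
spec:
  containers:
    - command:
        - kube-scheduler
        - --config=/etc/kubernetes/scheduler-config.yaml   # the KubeSchedulerConfiguration above
      volumeMounts:
        - name: scheduler-config
          mountPath: /etc/kubernetes/scheduler-config.yaml
          readOnly: true
  volumes:
    - name: scheduler-config
      hostPath:
        path: /etc/kubernetes/scheduler-config.yaml
        type: File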

To read more about scheduler extenders see the official docs.

Adding a new extender to Platform Aware Scheduling

Platform Aware Scheduling is a single repo designed to host multiple hardware-enabling Kubernetes scheduler extenders. A new scheduler extender can be added by opening an issue and a pull request.

Each project under the top-level repo has its own Go module, dependency model and lifecycle. There is no single top-level go.mod for the project. Some development tools and testing workflows may need to be run in the context of the Go module they target, i.e. by changing into the directory that contains that go module.
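For example, building and testing Telemetry Aware Scheduling with the standard Go tooling is done from inside its module directory:

cd telemetry-aware-scheduling
go build ./...
go test ./...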

Communication and contribution

Report a bug by filing a new issue.

Contribute by opening a pull request.

Learn about pull requests.

Reporting a Potential Security Vulnerability: If you have discovered a potential security vulnerability in TAS, please send an e-mail to [email protected]. For issues related to Intel Products, please visit the Intel Security Center.

It is important to include the following details:

  • The projects and versions affected
  • Detailed description of the vulnerability
  • Information on known exploits

Vulnerability information is extremely sensitive. Please encrypt all security vulnerability reports using our PGP key.

A member of the Intel Product Security Team will review your e-mail and contact you to collaborate on resolving the issue. For more information on how Intel works to resolve security issues, see: vulnerability handling guidelines.

platform-aware-scheduling's Issues

Intermittent issue in TAS adding and deleting strategy

TAS seems to have intermittent issues when adding and deleting strategies. These issues are not clear in the logs created by TAS, and I haven't found a way to reproduce them deterministically. The problem possibly stems from the way TAS adds and deletes policies from the cache. This issue will be used to track the investigation and resolution.

#55 has a test which shows this issue by repeating the TASFilter end-to-end test.
Test run https://github.com/intel/platform-aware-scheduling/actions/runs/913029293 shows the issue at work.

Major repo structure change

This week will see the current Telemetry Aware Scheduling repo transformed into the Platform Aware Scheduling repo. The initial changes will involve a major restructuring which moves the top-level files down one level, so that files currently at github.com/intel/telemetry-aware-scheduling will be moved to github.com/intel/platform-aware-scheduling/telemetry-aware-scheduling.

The change in structure will not result in a change in functionality. Commands that previously worked will still work as before from their new directory location. For any automated build or deployment process, Telemetry Aware Scheduling should be built as before from its new lower-level directory, i.e. adding cd telemetry-aware-scheduling to scripts etc. should bring any build, test or deploy process up to date.

A new point release will be made just before migrating to the new repo structure, for those who want to continue using the older layout.

Can't schedule power hungry application

I have tried to follow the documentation for telemetry-aware-scheduling with power. Everything works until the power hungry application is deployed. It remains in the Pending state and complains about:

Warning FailedScheduling 6s (x4243 over 3d1h) default-scheduler 0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 Insufficient telemetry/scheduling.

I will share more information about my deployment below:

kubectl get nodes
NAME                 STATUS   ROLES                  AGE     VERSION
kind-control-plane   Ready    control-plane,master   7d21h   v1.21.1
kind-worker          Ready    <none>                 7d21h   v1.21.1
kind-worker2         Ready    <none>                 7d21h   v1.21.1
kubectl get pods -A
NAMESPACE            NAME                                          READY   STATUS    RESTARTS   AGE
custom-metrics       custom-metrics-apiserver-56f94d4666-77s48     1/1     Running   1          7d20h
default              master-kube-state-metrics-789b8cb4c-7c84s     1/1     Running   1          7d20h
default              power-hungry-application-6f97755559-mlk92     0/1     Pending   0          3d1h
default              telemetry-aware-scheduling-5f4c847598-7mh82   1/1     Running   0          3d18h
kube-system          coredns-558bd4d5db-8xbf5                      1/1     Running   1          7d21h
kube-system          coredns-558bd4d5db-dfqtf                      1/1     Running   1          7d21h
kube-system          etcd-kind-control-plane                       1/1     Running   1          7d21h
kube-system          kindnet-2pksl                                 1/1     Running   1          7d21h
kube-system          kindnet-8rdlj                                 1/1     Running   1          7d21h
kube-system          kindnet-pzfgx                                 1/1     Running   1          7d21h
kube-system          kube-apiserver-kind-control-plane             1/1     Running   1          7d21h
kube-system          kube-controller-manager-kind-control-plane    1/1     Running   1          7d21h
kube-system          kube-proxy-cgt7k                              1/1     Running   1          7d21h
kube-system          kube-proxy-nljxf                              1/1     Running   1          7d21h
kube-system          kube-proxy-q8zkv                              1/1     Running   1          7d21h
kube-system          kube-scheduler-kind-control-plane             1/1     Running   1          7d21h
kube-system          node-exporter-cljrh                           1/1     Running   1          7d21h
kube-system          node-exporter-pg2dl                           1/1     Running   1          7d21h
local-path-storage   local-path-provisioner-547f784dff-4n4px       1/1     Running   1          7d21h
monitoring           collectd-lvnqs                                1/1     Running   1          7d21h
monitoring           collectd-wc2sp                                1/1     Running   1          7d21h
monitoring           prometheus-deployment-58db6c496b-sjvws        1/1     Running   1          7d19h
kubectl describe taspolicy power-sensitive-scheduling-policy
Name:         power-sensitive-scheduling-policy
Namespace:    default
Labels:       <none>
Annotations:  API Version:  telemetry.intel.com/v1alpha1
Kind:         TASPolicy
Metadata:
  Creation Timestamp:  2021-12-13T13:15:53Z
  Generation:          6
  Managed Fields:
    API Version:  telemetry.intel.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:strategies:
          .:
          f:dontschedule:
            .:
            f:rules:
          f:scheduleonmetric:
            .:
            f:rules:
    Manager:         kubectl
    Operation:       Update
    Time:            2021-12-14T07:13:11Z
  Resource Version:  795315
  UID:               2c8107d9-6416-4003-a3be-c9e5cd385fa7
Spec:
  Strategies:
    Dontschedule:
      Rules:
        Metricname:  package_power_avg
        Operator:    GreaterThan
        Target:      60
    Scheduleonmetric:
      Rules:
        Metricname:  package_power_avg
        Operator:    LessThan
Events:              <none>
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/nodes/*/package_power_avg | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/nodes/%2A/package_power_avg"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Node",
        "name": "kind-worker",
        "apiVersion": "/v1"
      },
      "metricName": "package_power_avg",
      "timestamp": "2021-12-17T08:31:52Z",
      "value": "43065m",
      "selector": null
    },
    {
      "describedObject": {
        "kind": "Node",
        "name": "kind-worker2",
        "apiVersion": "/v1"
      },
      "metricName": "package_power_avg",
      "timestamp": "2021-12-17T08:31:52Z",
      "value": "41128m",
      "selector": null
    }
  ]
}
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/node_package_power_per_pod" | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/node_package_power_per_pod"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "master-kube-state-metrics-789b8cb4c-7c84s",
        "apiVersion": "/v1"
      },
      "metricName": "node_package_power_per_pod",
      "timestamp": "2021-12-17T08:44:09Z",
      "value": "38760m",
      "selector": null
    }
  ]
}

Warning about critical-pod non-functional

When running helm install node-exporter telemetry-aware-scheduling/deploy/charts/prometheus_node_exporter_helm_chart, a warning about critical-pod pops up:

W1212 13:54:31.447663 280921 warnings.go:70] spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead

This file makes use of the critical-pod annotation.

While this does not stop the helm chart from being deployed, I am uncertain what behavior will result from the use of the non-functional property.

Metrics present on custom metrics endpoint but controller says metric not found

Hello, I tried to add a custom metric myself but my pods stay in Pending. Then I simply tried to follow the health demo tutorial and I have the same problem. Upon inspection, the controller's logs say the metric is not present ("no metric health_metric found" component="controller"), but if I run kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/nodes/*/health_metric" | jq . I can see them, so scraping from Prometheus to Kubernetes works. Do you have any idea what the problem could be?

configure-scheduler.sh script breaks Kubernetes config

Updating the Kubernetes scheduler config using the script and the (GAS) config in this repo breaks the Kubernetes 1.26 config:
./telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh -f ./gpu-aware-scheduling/deploy/extender-configuration/scheduler-config-tas+gas.yaml

It messed up the last volume specification in the config, leaving a couple of extra lines for that volume.

Besides fixing that in the script, I think the script should take a backup of the config file and show the user a diff of the changes it made.

Maybe also ask the user whether the changes should be accepted (unless something like "--yes" is specified), like e.g. Debian configuration package scripts do on updates.

You might also consider using kustomize to do the file updates, as that is actually designed for semantic updating of k8s YAML files.

(Which still leaves the issue of upgrade overwriting the changes, as mentioned in #86.)
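One possible shape of the kustomize approach suggested above, assuming a working directory containing a copy of the static manifest named kube-scheduler.yaml (file names and the config path are illustrative):

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - kube-scheduler.yaml
patches:
  - target:
      kind: Pod
      name: kube-scheduler
    patch: |-
      - op: add
        path: /spec/containers/0/command/-
        value: --config=/etc/kubernetes/scheduler-config.yaml

Rendering the patched manifest back into place would then be a matter of kubectl kustomize . > /etc/kubernetes/manifests/kube-scheduler.yaml, which keeps the change expressed as a reviewable patch rather than an in-place edit.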

'make mock' fails for GPU Aware Scheduling

The GPU Aware Scheduling code generation step (make mock) fails with older mockery versions such as 1.0.0, which had been used for the code generation. This started happening around the introduction of Go 1.18. Mockery 2.10.6 seems to work. The generated code is the same as before apart from the generated comments, which show the new version.

should modify kubeadm config rather than edit the kube-scheduler static manifest

The existing setup instructions modify the already-existing kube-scheduler static manifest file but don't edit the kubeadm configmap. If someone updates the kubeadm configmap and regenerates the static manifest files (for kube-apiserver, kube-scheduler, etc.) then I think it'll overwrite the changes to the kube-scheduler static manifest file.

Wouldn't it make more sense to edit the kube-scheduler section of the kubeadm configmap and regenerate the static manifest files with "kubeadm init phase control-plane"? That way anyone else making changes won't end up accidentally overwriting the changes.
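For reference, a sketch of what that could look like in the ClusterConfiguration stored in the kubeadm-config ConfigMap (field names follow kubeadm's v1beta3 API; the config file path is illustrative):

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
scheduler:
  extraArgs:
    config: /etc/kubernetes/scheduler-config.yaml
  extraVolumes:
    - name: scheduler-config
      hostPath: /etc/kubernetes/scheduler-config.yaml
      mountPath: /etc/kubernetes/scheduler-config.yaml
      readOnly: true
      pathType: File

The static manifest could then be regenerated with kubeadm init phase control-plane scheduler, as suggested above.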

Kube-scheduler is not choosing the node prioritized by Telemetry Aware Scheduling

In the TAS logs we can see that it received the prioritize request:

I1027 19:48:52.748142 1 telemetryscheduler.go:55] "Received prioritize request" component="extender"
I1027 19:48:52.754649 1 telemetryscheduler.go:190] "health_metric for nodes: [ worker1 :0] [ worker2 :1]" component="extender"
I1027 19:48:52.756819 1 telemetryscheduler.go:131] "node priorities returned: [{worker1 10} {worker2 9}]" component="extender"

health_metric for worker1 is 0 and for worker2 is 1, and TAS prioritizes worker1, as shown in the logs.
But the pod is scheduled to worker2.

demo-app-55b7cffd5d-hn476   0/1   Pending             0   0s   <none>         worker2   <none>   <none>
demo-app-55b7cffd5d-hn476   0/1   ContainerCreating   0   0s   <none>         worker2   <none>   <none>
demo-app-55b7cffd5d-hn476   0/1   ContainerCreating   0   1s   <none>         worker2   <none>   <none>
demo-app-55b7cffd5d-hn476   1/1   Running             0   2s   10.10.189.96   worker2   <none>   <none>

Here is the TASPolicy:

apiVersion: telemetry.intel.com/v1alpha1
kind: TASPolicy
metadata:
  name: demo-policy
  namespace: default
spec:
  strategies:
    scheduleonmetric:
      rules:
        - metricname: health_metric
          operator: LessThan

Why is the pod scheduled to worker2? Is there some misconfiguration in my deployment?

All pods on the same node give the same power values

In our cluster we have 2 worker nodes, and node_package_power_per_pod gives the same value for all pods that are on the same node. Below is the metrics output that we are getting:

  • node_package_power_per_pod{created_by_kind="DaemonSet",created_by_name="calico-node",host_ip="192.168.111.101",host_network="true",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="kube-system",node="test1-775547f794-4sw9c",pod="calico-node-955qp",pod_ip="192.168.111.101",priority_class="system-node-critical"} 16.22732102030295

  • node_package_power_per_pod{created_by_kind="DaemonSet",created_by_name="calico-node",host_ip="192.168.111.102",host_network="true",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="kube-system",node="test1-775547f794-744hm",pod="calico-node-q7gkd",pod_ip="192.168.111.102",priority_class="system-node-critical"} 11.588657717541842

  • node_package_power_per_pod{created_by_kind="DaemonSet",created_by_name="collectd",host_ip="192.168.111.101",host_network="false",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="monitoring",node="test1-775547f794-4sw9c",pod="collectd-bzv9z",pod_ip="192.168.47.65",uid="f079c1d0-57c8-4d67-99d3-8333e7f43185"} 16.22732102030295

  • node_package_power_per_pod{created_by_kind="DaemonSet",created_by_name="collectd",host_ip="192.168.111.102",host_network="false",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="monitoring",node="test1-775547f794-744hm",pod="collectd-bcsvl",pod_ip="192.168.50.3",uid="f2118a48-da62-4136-a83c-1c68239bd8d7"} 11.588657717541842

  • node_package_power_per_pod{created_by_kind="DaemonSet",created_by_name="kube-proxy",host_ip="192.168.111.101",host_network="true",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="kube-system",node="test1-775547f794-4sw9c",pod="kube-proxy-l7qkl",pod_ip="192.168.111.101",priority_class="system-node-critical"} 16.22732102030295

  • node_package_power_per_pod{created_by_kind="DaemonSet",created_by_name="kube-proxy",host_ip="192.168.111.102",host_network="true",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="kube-system",node="test1-775547f794-744hm",pod="kube-proxy-s7frc",pod_ip="192.168.111.102",priority_class="system-node-critical"} 11.588657717541842

  • node_package_power_per_pod{created_by_kind="DaemonSet",created_by_name="node-exporter",host_ip="192.168.111.101",host_network="true",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="kube-system",node="test1-775547f794-4sw9c",pod="node-exporter-dbnmr",pod_ip="192.168.111.101",priority_class="system-node-critical"} 16.22732102030295

  • node_package_power_per_pod{created_by_kind="DaemonSet",created_by_name="node-exporter",host_ip="192.168.111.102",host_network="true",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="kube-system",node="test1-775547f794-744hm",pod="node-exporter-mhdt2",pod_ip="192.168.111.102",priority_class="system-node-critical"} 11.588657717541842

  • node_package_power_per_pod{created_by_kind="ReplicaSet",created_by_name="custom-metrics-apiserver-58dd7fb956",host_ip="192.168.111.101",host_network="false",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="custom-metrics",node="test1-775547f794-4sw9c",pod="custom-metrics-apiserver-58dd7fb956-db7tc",pod_ip="192.168.47.67"} 16.22732102030295

  • node_package_power_per_pod{created_by_kind="ReplicaSet",created_by_name="descheduler-75c7857565",host_ip="192.168.111.102",host_network="false",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="kube-system",node="test1-775547f794-744hm",pod="descheduler-75c7857565-nbtst",pod_ip="192.168.50.9",priority_class="system-cluster-critical"} 11.588657717541842

  • node_package_power_per_pod{created_by_kind="ReplicaSet",created_by_name="prometheus-deployment-bcddbd4cd",host_ip="192.168.111.101",host_network="false",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="monitoring",node="test1-775547f794-4sw9c",pod="prometheus-deployment-bcddbd4cd-q84nx",pod_ip="192.168.47.80"} 16.22732102030295

  • node_package_power_per_pod{created_by_kind="ReplicaSet",created_by_name="stable-kube-state-metrics-77d8bd798",host_ip="192.168.111.102",host_network="false",instance="stable-kube-state-metrics.default.svc.cluster.local:8080",job="kube-state-metrics",namespace="default",node="test1-775547f794-744hm",pod="stable-kube-state-metrics-77d8bd798-n2n97",pod_ip="192.168.50.4"} 11.588657717541842

correct way to post resource telemetry/scheduling to nodes

Hi there,

I'm recently playing around with this project, but I ran into a problem when going through the health-metric demo. I found that the resource "telemetry/scheduling" is not added to the nodes automatically. I'm now using the following command to add it manually:

curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/telemetry~1scheduling", "value": "1"}]' \
https://localhost:8080/api/v1/nodes/worknode-0/status

With this, the scheduler works now, but I guess there should be a better way to do this?

Thanks a lot!

GAS schedules to cards on the same node, ignores podAntiAffinity

Describe the bug
When multiple cards are present on the system, and pods request full cards (i.e. millicores=1000), GAS will place pods on the same node regardless of a podAntiAffinity that is set to prefer nodes the pods are not already placed on.

To Reproduce

  • Configure GAS and GPU Plugin with sharedDevNum > 1 and resource management true
  • set millicores: 1000, i915: 1, enable secondary scheduler
  • set podAntiAffinity with something similar to:
- weight: 99
  podAffinityTerm:
    labelSelector:
      matchLabels:
        app: applabel
    topologyKey: "kubernetes.io/hostname"
    namespaceSelector: {}
  • GAS schedules pods to cards on the same node, ignoring the podAntiAffinity rule

Expected behavior
GAS should use the podAntiAffinity rule to schedule pods to other nodes before choosing cards on the same node. With the default scheduler, this behavior is observed.

Logs
If applicable, add the relevant logs to help explain your problem.
Please consider adding at least the logs from the kube-scheduler and telemetry-aware-scheduling pods (if installed).

kube-scheduler logs

telemetry-aware-scheduling logs

Environment (please complete the following information):
Let us know what K8s version, distribution, and if you are deploying in BM, VM, or within a Cloud provider.
Baremetal OpenShift 4.13.11
DevicePlugins v0.28
GAS v0.5.5-0-g50d1879

Additional context
If relevant, add any other context about the problem here.

Irrelevant TAS topic

This repository has the TAS topic, which links to the topic page for tool-assisted speedrunning, not Telemetry Aware Scheduling. A more accurate topic to use would be telemetry-aware-scheduling, or simply removing the topic from the repository entirely.

Working with descheduler

Hi there,

I'm using the descheduler to reschedule a pod to another node. However, the descheduler complains that every node has insufficient resource "telemetry/scheduling", which prevents the pod from being rescheduled.
(I've checked the source code of the descheduler, and it only evicts pods that don't fit the current node but fit some other node; see the code below from node_affinity.go.)

pods, err := podutil.ListPodsOnANode(
	node.Name,
	getPodsAssignedToNode,
	podutil.WrapFilterFuncs(podFilter, func(pod *v1.Pod) bool {
		return evictorFilter.Filter(pod) &&
			!nodeutil.PodFitsCurrentNode(getPodsAssignedToNode, pod, node) &&
			nodeutil.PodFitsAnyNode(getPodsAssignedToNode, pod, nodes)
	}),
)
if err != nil {
	klog.ErrorS(err, "Failed to get pods", "node", klog.KObj(node))
}

for _, pod := range pods {
	if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil && pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
		klog.V(1).InfoS("Evicting pod", "pod", klog.KObj(pod))
		if _, err := podEvictor.EvictPod(ctx, pod, node, "NodeAffinity"); err != nil {
			klog.ErrorS(err, "Error evicting pod")
			break
		}
	}
}

I'm using the setup of the health-metric demo. The logs of the descheduler look like:

I0628 14:32:11.838531   72888 node_affinity.go:78] "Processing node" node="minikube-m02"
I0628 14:32:11.838554   72888 node.go:183] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube-m02"
I0628 14:32:11.838557   72888 node.go:185] "insufficient telemetry/scheduling"
I0628 14:32:11.838568   72888 node.go:166] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube"
I0628 14:32:11.838571   72888 node.go:168] "insufficient telemetry/scheduling"
I0628 14:32:11.838579   72888 node.go:166] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube-m02"
I0628 14:32:11.838582   72888 node.go:168] "insufficient telemetry/scheduling"
I0628 14:32:11.838591   72888 node.go:166] "Pod does not fit on node" pod="default/demo-app-77fbd8745b-hhsxl" node="minikube-m03"
I0628 14:32:11.838619   72888 node.go:168] "pod node selector does not match the node label"
I0628 14:32:11.838624   72888 node.go:168] "insufficient telemetry/scheduling"
I0628 14:32:11.839395   72888 descheduler.go:312] "Number of evicted pods" totalEvicted=0

In your instructions for the health demo, the pod is simply re-scheduled to another node, so I'm wondering how you work around this problem?

Many thanks!

Pod stays pending even though a node has become schedulable

Describe the problem

  1. There are no schedulable nodes
  2. A node becomes schedulable (metric becomes 1, similar to health_metric demo)
  3. I would like the Pending workloads to be scheduled on the schedulable node.

To Reproduce
Here's my policy:

apiVersion: telemetry.intel.com/v1alpha1
kind: TASPolicy
metadata:
  name: schedule-until-at-capacity
  namespace: default
spec:
  strategies:
    dontschedule:
      rules:
        - metricname: node_schedulable
          operator: Equals
          target: 0
    scheduleonmetric:
      rules:
        - metricname: node_schedulable
          operator: GreaterThan

node_schedulable is scraped from my own endpoint where I set 0 and 1 at will.

Expected behavior
I expect that the Pod gets scheduled after at least 1 node becomes schedulable (node_schedulable becomes 1 for that node).

Logs
When there are no schedulable nodes, for example from TAS:

I0619 12:37:38.462395       1 telemetryscheduler.go:211] "Filter request received" component="extender"
I0619 12:37:38.462799       1 strategy.go:43] "ecoqube-wkld-dev-default-worker-topo-ptlck-656fc68575x6l666dn4t node_schedulable = 0" component="controller"
I0619 12:37:38.462807       1 strategy.go:57] "ecoqube-wkld-dev-default-worker-topo-ptlck-656fc68575x6l666dn4t violating : node_schedulable Equals 0" component="controller"
I0619 12:37:38.462810       1 strategy.go:43] "ecoqube-wkld-dev-default-worker-topo-ptlck-656fc68575x6l66jl4qm node_schedulable = 0" component="controller"
I0619 12:37:38.462816       1 strategy.go:57] "ecoqube-wkld-dev-default-worker-topo-ptlck-656fc68575x6l66jl4qm violating : node_schedulable Equals 0" component="controller"
I0619 12:37:38.462818       1 strategy.go:43] "ecoqube-wkld-dev-default-worker-topo-ptlck-656fc68575x6l66wbw54 node_schedulable = 0" component="controller"
I0619 12:37:38.462821       1 strategy.go:57] "ecoqube-wkld-dev-default-worker-topo-ptlck-656fc68575x6l66wbw54 violating : node_schedulable Equals 0" component="controller"

Pod is pending:

$ kubectl get pods
default                   500m-cpu-stresstest-252504ce-trbm4                     0/1     Pending     0             55s

I wait until a node is schedulable, then I can spawn new Pods, but the Pending ones stay pending.

Environment (please complete the following information):
K8s version: v1.27.3
Deployed using Cluster API.

Additional context
I had another policy where pods would be scheduled once it became possible, so I am not sure why the behavior is different now.

Renaming repo to Platform Aware Scheduling

In the coming week this repo will be renamed to Platform Aware Scheduling. The structure of the repo will also change.

The new top level repo will contain:
1) libraries to write a scheduler extender
2) Telemetry Aware Scheduling implementation.

This change is intended to allow other scheduler extenders to easily reuse the scheduling server interface from Telemetry Aware Scheduling. It will also allow new implementations to be added to this repo.

Custom metric not registered

Describe the problem

All the pods are running as expected; however, when I checked the custom-metrics-apiserver pod, it says node_health_metric is not registered.

k logs -n custom-metrics custom-metrics-apiserver-7877996fb6-bw4j4 --tail=10

I0503 14:41:21.242625 1 series_registry.go:156] metric nodes/node_health_metric not registered
I0503 14:41:21.242742 1 httplog.go:89] "HTTP" verb="GET" URI="/apis/custom.metrics.k8s.io/v1beta1/nodes/%2A/node_health_metric" latency="6.315329ms" userAgent="extender/v0.0.0 (linux/amd64) kubernetes/$Format" srcIP="10.32.0.1:63075" resp=404

However, I am able to scrape the metric using the command below:

 kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/nodes/*/health_metric" | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/nodes/%2A/health_metric"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Node",
        "name": "master-node",
        "apiVersion": "/v1"
      },
      "metricName": "health_metric",
      "timestamp": "2023-05-03T14:44:27Z",
      "value": "2",
      "selector": null
    },
    {
      "describedObject": {
        "kind": "Node",
        "name": "worker-node-1",
        "apiVersion": "/v1"
      },
      "metricName": "health_metric",
      "timestamp": "2023-05-03T14:44:27Z",
      "value": "2",
      "selector": null
    }
  ]
}

Because of this problem, I believe the demo pod is scheduled and running even though node_health_metric is set to 2 on both the master and worker nodes.

To Reproduce
I consolidated all the commands into a script and ran it. Here is the script:

cat k8s_tas.sh 
#!/bin/bash
kubectl create secret tls extender-secret --cert /etc/kubernetes/pki/ca.crt --key /etc/kubernetes/pki/ca.key
kubectl apply -f deploy/
sh /home/gowtham_kanakasabapathy/telemetry-aware-scheduling/deploy/extender-configuration/configure-scheduler.sh
kubectl create namespace monitoring
/usr/local/bin/helm install node-exporter deploy/charts/prometheus_node_exporter_helm_chart/
/usr/local/bin/helm install prometheus deploy/charts/prometheus_helm_chart/
cd deploy
export PURPOSE=serving
openssl req -x509 -sha256 -new -nodes -days 365 -newkey rsa:2048 -keyout ${PURPOSE}-ca.key -out ${PURPOSE}-ca.crt -subj "/CN=ca"
echo '{"signing":{"default":{"expiry":"43800h","usages":["signing","key encipherment","'${PURPOSE}'"]}}}' > "${PURPOSE}-ca-config.json"


export SERVICE_NAME=custom_metrics_service
export ALT_NAMES='"custom_metrics_service.custom_metrics","custom_metrics_service.custom_metrics.svc"'
echo '{"CN":"'${SERVICE_NAME}'","hosts":['${ALT_NAMES}'],"key":{"algo":"rsa","size":2048}}' | cfssl gencert -ca=serving-ca.crt -ca-key=serving-ca.key -config=serving-ca-config.json - | cfssljson -bare apiserver

kubectl create namespace custom-metrics

kubectl -n custom-metrics create secret tls cm-adapter-serving-certs --cert=serving-ca.crt --key=serving-ca.key

/usr/local/bin/helm install prometheus-adapter /home/gowtham_kanakasabapathy/telemetry-aware-scheduling/deploy/charts/prometheus_custom_metrics_helm_chart/

kubectl apply -f /home/gowtham_kanakasabapathy/telemetry-aware-scheduling/deploy/health-metric-demo/health-policy.yaml

Expected behavior

Pods to be scheduled per TAS policy

Logs
Added above

Environment (please complete the following information):

Kube version 1.22.0

Centos stream 8

GCP - VM

Additional context
I am able to see the metric both on prometheus dashboard and node_exporter.

TAS policy update

An issue has appeared when trying to update a running TAS policy. If the target value under some strategy rule in the policy spec is changed, the new value should replace the old value. However, the new value is instead added as a new rule to that strategy. This is being investigated together with the already raised issue #56, for which a resolution is to be released in the coming weeks.

Community work related questions

Hi!

We have been going through this project recently, found it very interesting, and would like to get involved in the community work of this repo. Here are some questions it would be nice to get answers to, mostly from a community-work perspective:

  • Do you organize any community meetings on a weekly/biweekly/monthly basis?
  • Is there any channel or means of communication (Slack/IRC or a mailing list) for users or contributors who have questions about the project or want to discuss feature requests beforehand with the maintainers?
  • What is the agreed procedure, if any, for proposing new design features to the project?

Thanks.
