k8s-source's Introduction

Kubernetes Source

Installation

Create an API Key in Overmind under Account settings > API Keys

Install the source into your Kubernetes cluster using Helm:

helm repo add overmind https://overmindtech.github.io/k8s-source
helm install overmind-kube-source overmind/overmind-kube-source --set source.apiKey=ovm_api_YOURKEY_HERE

To upgrade:

helm upgrade overmind-kube-source overmind/overmind-kube-source

To uninstall:

helm uninstall overmind-kube-source

NOTE: Currently the source won't appear in your "Sources" list in Overmind since it's running on your infrastructure, not ours. We'll improve this soon.

Support

This source supports all Kubernetes versions that are currently maintained by the Kubernetes project. The list can be found here: https://kubernetes.io/releases/

Search

The backends in this package implement the Search() method. The query they expect is a JSON object containing one or both of the labelSelector and fieldSelector keys, each a string in the corresponding Kubernetes selector format.

An example would be:

{
    "labelSelector": "app=wordpress"
}

or

{
    "labelSelector": "environment=production,tier!=frontend",
    "fieldSelector": "metadata.namespace!=default"
}

Other fields can also be set if advanced querying is required. These fields must match the JSON schema for ListOptions: https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#ListOptions
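
For illustration, a query like the above could be unmarshalled straight into metav1.ListOptions (which already uses the labelSelector/fieldSelector JSON keys) and passed to a client-go list call. This is a minimal sketch, not the source's actual Search() implementation:

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	query := `{"labelSelector": "environment=production,tier!=frontend", "fieldSelector": "metadata.namespace!=default"}`

	// metav1.ListOptions already uses the labelSelector/fieldSelector JSON keys,
	// so the query can be unmarshalled directly into it
	var opts metav1.ListOptions
	if err := json.Unmarshal([]byte(query), &opts); err != nil {
		log.Fatalf("invalid search query: %v", err)
	}

	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("failed to load config: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to create clientset: %v", err)
	}

	// List pods across all namespaces using the selectors from the query
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(context.Background(), opts)
	if err != nil {
		log.Fatalf("failed to list pods: %v", err)
	}
	for _, p := range pods.Items {
		fmt.Println(p.Namespace + "/" + p.Name)
	}
}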

Development

Testing

The tests for this package rely on having a Kubernetes cluster to interact with. This is handled using kind when the tests are started. Please make sure that you have the required software installed; at a minimum you will need kind.

IMPORTANT: If you already have kubectl configured and are connected to a cluster, that cluster is what will be used for testing. Resources will be cleaned up, with the exception of the testing namespace. If a cluster is not configured, or not available, one will be created (and destroyed) using kind. This behavior may change in the future, as it's a bit risky: it could accidentally run the tests against a production cluster, though that would also be a good way to validate real-world use-cases.
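
As a rough sketch of that behaviour, assuming the sigs.k8s.io/kind cluster provider API and a hypothetical cluster name (the real test setup may differ):

package main

import (
	"log"

	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/kind/pkg/cluster"
)

func main() {
	// Load whatever kubeconfig the default rules find (~/.kube/config, $KUBECONFIG, ...)
	rules := clientcmd.NewDefaultClientConfigLoadingRules()
	kubeconfig, err := rules.Load()

	if err == nil && kubeconfig.CurrentContext != "" {
		// An existing context is configured: the tests run against it
		log.Printf("using existing context %q for tests", kubeconfig.CurrentContext)
		return
	}

	// Otherwise create a throwaway kind cluster and delete it afterwards
	provider := cluster.NewProvider()
	if err := provider.Create("k8s-source-tests"); err != nil {
		log.Fatalf("failed to create kind cluster: %v", err)
	}
	defer func() {
		if err := provider.Delete("k8s-source-tests", ""); err != nil {
			log.Printf("failed to delete kind cluster: %v", err)
		}
	}()

	// ... run the tests here ...
}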

k8s-source's People

Contributors

dylanratcliffe, renovate[bot], davids-ovm, dependabot[bot], tphoney

Watchers

Lucian

k8s-source's Issues

Investigate performance

Performance doesn't seem to change much as max-parallel is increased. I suspect this is a bug and need to look into why.

After some testing it doesn't seem that the issue is related to the discovery library; new tests show that it is performing as expected, so the issue is likely something in this source. Interestingly, increasing --max-parallel actually slows the source down rather than speeding it up, though not in a linear fashion, e.g. a --max-parallel of 999 takes about the same amount of time as 99,999,999.

Create terraform & docs mappings

We need to map the existing k8s sources to Terraform, and create the mappings so that things can be documented with DocGPT:

  • clusterrolebinding
  • clusterrole
  • configmap
  • cronjob
  • daemonset
  • deployment
  • endpoints
  • endpointslice
  • generic_source
  • horizontalpodautoscaler
  • ingress
  • job
  • limitrange
  • networkpolicy
  • node
  • persistentvolumeclaim
  • persistentvolume
  • poddisruptionbudget
  • pods
  • priorityclass
  • replicaset
  • replicationcontroller
  • resourcequota
  • rolebinding
  • role
  • secret
  • serviceaccount
  • service
  • statefulset
  • storageclass
  • volumeattachment

Pods not linking to config maps

I'm finding that pods aren't linking to the config maps they mount. For example, a pod has the following attributes:

spec:
  dnsPolicy: ClusterFirst
  schedulerName: default-scheduler
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-neo4j-0
  - name: neo4j-conf
    projected:
      sources:
      - configMap:
          name: neo4j-default-config
      - configMap:
          name: neo4j-user-config
      - configMap:
          name: neo4j-k8s-config

However, the ConfigMap isn't connected to anything.
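
For context, a link extractor that handles both plain and projected ConfigMap volumes might look roughly like this (an illustrative sketch, not the source's actual link-extraction code; the package and function names are hypothetical):

package sources

import (
	corev1 "k8s.io/api/core/v1"
)

// configMapNames collects the names of ConfigMaps referenced by a pod's volumes
func configMapNames(pod *corev1.Pod) []string {
	names := []string{}
	for _, vol := range pod.Spec.Volumes {
		// Plain configMap volumes
		if vol.ConfigMap != nil {
			names = append(names, vol.ConfigMap.Name)
		}
		// Projected volumes can also reference ConfigMaps, which appears to be
		// the case that is currently missed
		if vol.Projected != nil {
			for _, src := range vol.Projected.Sources {
				if src.ConfigMap != nil {
					names = append(names, src.ConfigMap.Name)
				}
			}
		}
	}
	// env/envFrom references on containers are omitted here for brevity
	return names
}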

Work out how this source will be run/deployed

In Cluster

The simplest way for us to allow customers to run this would be to have them run it in their own cluster. We could create a helm chart that included:

  • A role that confers the required Get and Watch permissions
  • A serviceaccount that the pod/s would use
  • A rolebinding probably to bind the role to the serviceaccount
  • A deployment to run the source within the cluster. We would need resource limits and to ensure that we always have one running

The most difficult part about this is the fact that the source will need to communicate directly with NATS, meaning that it will need to get a token from the API. In order to do that it'll need an OAuth token. This might bring back into contention the complexity of having sources run as specific OAuth apps. It's not the end of the world, but it'll be a bit of work. We'll need some way of giving the source a token that it can refresh, without exposing the client secret, which is doable but once again a bit harder.

Config

Required config:

  • Create/delete (does this count?)
  • Auth

TODO:

  • Look to see if there is anything good that we could use like auto-updating helm charts or something

Since the source is running in the customer's cluster, it would be hard to manage in the GUI. Currently we rely on srcman to manage the entire lifecycle of our sources, and I would really like to keep the experience as close to SaaS as possible, so forcing the user to make manual changes to their cluster is strongly discouraged.

Implementation

One way we could implement this is to have the source run exactly the same as a regular source, but launched in a "watcher" mode whose entire job is to watch the source that the customer starts and report its health through srcman like everything else. This will still require some changes to srcman, as there will need to be a way to show the user the auth details that they would need to provide to helm.

Hosted

If we could host these ourselves we'd be in a much better position since we could manage the lifecycle in the same way that we do for everything else. The problem with this is that there is a good chance that the API for the kube clusters is going to be pretty locked down because of how important it is.

There are a few options for EKS cluster endpoints. As expected, many of them are private, so they really won't be much help for us. I think that running in the cluster is probably the best way for the time being.

Next Steps

  • Create helm charts since we'll need that no matter what
  • Once we know exactly what config the helm charts will need and how that config gets provided, think about the changes required for srcman to be able to tell users which options they need to pass to helm

Support Helm

Helm is built on top of Kubernetes, but Helm releases aren't Kubernetes objects in their own right. In order to read Helm release info we will need to use the Helm Go libraries: https://pkg.go.dev/helm.sh/helm

GPT-4 suggests that we can query helm releases as follows:

package main

import (
	"fmt"
	"log"
	"os"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/cli"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	helmDriver := os.Getenv("HELM_DRIVER")
	settings := cli.New()

	actionConfig := new(action.Configuration)
	kubeconfig := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{},
	)
	namespace, _, _ := kubeconfig.Namespace()
	err := actionConfig.Init(settings.RESTClientGetter(), namespace, helmDriver, func(format string, v ...interface{}) {
		// debug log callback for the Helm action configuration
		log.Printf(format, v...)
	})
	if err != nil {
		log.Fatalf("Failed to connect to Kubernetes: %v", err)
	}

	// create a new List action
	listAction := action.NewList(actionConfig)

	// list releases across all namespaces (by default only the current namespace is listed)
	listAction.AllNamespaces = true

	// retrieve releases
	releases, err := listAction.Run()
	if err != nil {
		log.Fatalf("Failed to retrieve releases: %v", err)
	}

	// print release names
	for _, rls := range releases {
		fmt.Println(rls.Name)
	}
}

Decide what to do about different versions of the same resource

It's possible in Kubernetes to have different APIs that serve the same resource, like when something moves from a beta API to a stable one. Our sources are only able to find data for one type, and one API version each, so a source that works for v1 Pods wouldn't work for v2 Pods (just an example). In order to support people that are using older resources, we really should be looking at the possible versions for things and creating additional sources. This then raises the question of which version we should use. Kube does have APIs that report which versions of things are available, but more research is needed to determine how this works exactly and how we should be using it.
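
For reference, the client-go discovery client can report which groups and versions the server actually serves. A minimal sketch (it only lists the options; deciding which version to use is still the open question):

package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("failed to load config: %v", err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatalf("failed to create discovery client: %v", err)
	}

	groups, err := dc.ServerGroups()
	if err != nil {
		log.Fatalf("failed to list API groups: %v", err)
	}
	for _, g := range groups.Groups {
		// PreferredVersion is the server's recommendation; all served
		// versions are listed in g.Versions
		fmt.Printf("%s (preferred: %s, %d version(s) served)\n",
			g.Name, g.PreferredVersion.GroupVersion, len(g.Versions))
	}
}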

End-to-end test

I need to do a full end to end test once https://github.com/overmindtech/deploy/issues/435 is complete. This will involve:

  • Deleting all sources and API keys
  • Adding a source for prod & dogfood to the production Overmind
  • Making sure we can get data from both of them
  • Ensure that the defaults are correct, e.g. the NATS URL
  • Configure mappings as per documentation
  • Make sure blast radius is accurate
  • Move management to terraform

Create tests for all sources

Currently only some sources have tests, but there is a good test framework in place. It shouldn't be too hard to create tests for everything.

Expose rate limiting config

These events are common:

I0720 11:21:47.684093       1 request.go:696] Waited for 1.033262583s due to client-side throttling, not priority and fairness, request: GET:https://[fd16:6e37:bf5d::1]:443/api/v1/namespaces/ku
I0720 11:21:57.684261       1 request.go:696] Waited for 3.990879119s due to client-side throttling, not priority and fairness, request: GET:https://[fd16:6e37:bf5d::1]:443/api/v1/namespaces/ku
I0720 11:23:43.378810       1 request.go:696] Waited for 1.147997843s due to client-side throttling, not priority and fairness, request: GET:https://[fd16:6e37:bf5d::1]:443/api/v1/namespaces/de
I0720 11:23:53.578883       1 request.go:696] Waited for 3.995475924s due to client-side throttling, not priority and fairness, request: GET:https://[fd16:6e37:bf5d::1]:443/api/v1/namespaces/de
I0720 11:24:03.578965       1 request.go:696] Waited for 3.988223247s due to client-side throttling, not priority and fairness, request: GET:https://[fd16:6e37:bf5d::1]:443/api/v1/namespaces/de
I0720 11:24:13.778828       1 request.go:696] Waited for 3.99516626s due to client-side throttling, not priority and fairness, request: GET:https://[fd16:6e37:bf5d::1]:443/api/v1/namespaces/def
I0720 11:24:57.409218       1 request.go:696] Waited for 1.143622363s due to client-side throttling, not priority and fairness, request: GET:https://[fd16:6e37:bf5d::1]:443/api/v1/namespaces/de

We should make sure these client-side rate limits are exposed as configuration.
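
For illustration, the limits in question live on rest.Config. A minimal sketch of exposing them, where the package name and the kube-client-qps/kube-client-burst config keys are hypothetical:

package config // hypothetical package

import (
	"log"

	"github.com/spf13/viper"
	"k8s.io/client-go/rest"
)

// newRestConfig shows where the client-side limits would be set
func newRestConfig() *rest.Config {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("failed to load config: %v", err)
	}
	// client-go defaults to QPS=5 and Burst=10, which is what produces the
	// "Waited for ... due to client-side throttling" messages above
	cfg.QPS = float32(viper.GetFloat64("kube-client-qps")) // hypothetical flag
	cfg.Burst = viper.GetInt("kube-client-burst")          // hypothetical flag
	return cfg
}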

Can't grant request:recieve scope

When creating an API token, I can't grant the request:recieve scope. I think this is because interactive users don't have this scope themselves. Need to test in dogfood...

leaking nats connections

This source is seemingly leaking NATS connections, leading to unnecessary resource usage in the source process as well as on NATS:

time="2023-07-05T09:21:22Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                      
time="2023-07-05T09:21:32Z" level=info msg="NATS connected" ServerID=NA3WG7EBM4TOVT67EWZ4Z6HUMSOAQE7REZKOSWZR5JZ2HNIMKY4BHWHU URL:="nats://nats:4222"                         
time="2023-07-05T09:21:32Z" level=info msg="Listing namespaces"                                                                                                               
time="2023-07-05T09:21:32Z" level=info msg="got 6 namespaces"                                                                                                                 
time="2023-07-05T09:21:32Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                      
time="2023-07-05T09:21:32Z" level=info msg="NATS connected" ServerID=NA3WG7EBM4TOVT67EWZ4Z6HUMSOAQE7REZKOSWZR5JZ2HNIMKY4BHWHU URL:="nats://nats:4222"                         
time="2023-07-05T09:21:32Z" level=info msg="Listing namespaces"                                                                                                               
time="2023-07-05T09:21:32Z" level=info msg="got 6 namespaces"                                                                                                                 


Create fallback source

It would be possible to create a source that is completely generic. This would mean that we could scan the server for the available resource types, and for all the ones we don't have real sources for, create a generic one on the fly. This wouldn't have as many links but would give us a lot of coverage for relatively little work.
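
A minimal sketch of the idea using the discovery and dynamic clients (link extraction and the wiring into our source interfaces are omitted):

package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// One entry per preferred group/version, listing the resources it serves
	lists, err := dc.ServerPreferredResources()
	if err != nil && lists == nil {
		log.Fatal(err)
	}
	for _, list := range lists {
		gv, err := schema.ParseGroupVersion(list.GroupVersion)
		if err != nil {
			continue
		}
		for _, res := range list.APIResources {
			// Skip subresources like pods/status; a real implementation would
			// also skip types that already have dedicated sources
			if strings.Contains(res.Name, "/") {
				continue
			}
			gvr := gv.WithResource(res.Name)
			items, err := dyn.Resource(gvr).List(context.Background(), metav1.ListOptions{Limit: 1})
			if err != nil {
				continue
			}
			fmt.Printf("%v: sampled %d item(s)\n", gvr, len(items.Items))
		}
	}
}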

I think we should try this before we make a start on #18 and #17

Low priority k8s source backlog

  • componentstatuses
  • apiservices
  • csistoragecapacities
  • customresourcedefinitions
  • eniconfigs
  • fieldexports
  • flowschemas
  • globalclusters
  • leases
  • limitranges
  • priorityclasses
  • prioritylevelconfigurations
  • validatingwebhookconfigurations
  • mutatingwebhookconfigurations
  • runtimeclasses

Locking up after long usage

The kube source has been hovering around 1.5% CPU usage through the night. Oddly, the source does not respond to queries even though it is still happily connected to both the kube API and NATS:

~ $ netstat -atp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 :::http-alt             :::*                    LISTEN      1/source
tcp        0      0 demo-overmind-kube-source-5b868897d-vqxdn:54732 nats.default.svc.cluster.local:4222 ESTABLISHED 1/source
tcp        0      0 demo-overmind-kube-source-5b868897d-vqxdn:50986 kubernetes.default.svc.cluster.local:https ESTABLISHED 1/source
~ $ 

Nats-top:

NATS server version 2.9.20 (uptime: 20h13m50s)
Server:
  ID:   NB2LR3QJDH6IYAJ5D4OXBEPVYGDQBT6YDJAYTUVXZRVBKIXQPZPP57J5
  Load: CPU:  0.0%  Memory: 19.8M  Slow Consumers: 0
  In:   Msgs: 17.0M  Bytes: 460.8M  Msgs/Sec: 327.2  Bytes/Sec: 7.7K 
  Out:  Msgs: 86.2K  Bytes: 38.5M  Msgs/Sec: 0.0  Bytes/Sec: 0

Connections Polled: 5
  HOST                                CID    NAME                                                                   SUBS    PENDING     MSGS_TO     MSGS_FROM   BYTES_TO    BYTES_FROM  LANG     VERSION  UPTIME   LAST_ACTIVITY
  2a05:d01c:997:3200:9a2f::8:56112    16     source.default.f63e5fff-fd97-4723-984a-eb65e8a35b6f-58d4f7cdf5-2d8jw   10      0           7.3K        37.6K       987.2K      34.3M       go       1.27.1   20h13m17s  2023-07-20 06:35:22.741581908 +0000 UTC
  2a05:d01c:997:3200:e9b4::c:49632    17     source.default.2919c5eb-ba0d-4495-beed-fbd80404e924-57489b896b-4vdbn   4       0           7.4K        1.6K        1015.1K     672.2K      go       1.27.1   20h13m7s  2023-07-20 06:35:22.735401345 +0000 UTC 
  2a05:d01c:997:3200:e9b4::d:33344    18     source.default.313a0824-416b-4a9a-b455-369a38e62245-6587ff8c8b-rglj5   14      0           6.8K        32.2K       894.7K      9.7M        go       1.27.1   20h13m6s  2023-07-20 06:35:22.750481085 +0000 UTC 
  2a05:d01c:997:3200:e9b4::7:44966    19     revlink-6ffbd7567f-qnt79                                               2       0           6.3K        842         4.4M        152.1K      go       1.27.1   20h12m54s  2023-07-20 06:35:22.631274491 +0000 UTC
  2a05:d01c:997:3200:9a2f::4:54732    34     .demo-overmind-kube-source-5b868897d-vqxdn                             20      0           6.8K        16.9M       889.9K      410.6M      go       1.27.1   15h40m11s  2023-07-20 07:39:48.900815265 +0000 UTC

It's constantly sending a large number of messages.

Initial steps

  • Add tracing: basically copy over the various tracing.go / tracing/ files and hook the startup/shutdown methods into the root cmd
  • Fix that godforsaken tracing log message in everything
  • Restart the pod with the required tracing so that we can catch this tomorrow at least
  • Audit the response sender logic in discovery, see if there is any way for it to not realise a query is done
  • Audit what we're doing when we get new namespaces

Engine restarting due to namespace event every 10 seconds

This only seems to happen after the source has been running for a while. Example of the output:

time="2023-05-31T12:00:29Z" level=info msg="Restarting engine due to namespace event: "
time="2023-05-31T12:00:29Z" level=info msg="Listing namespaces"
time="2023-05-31T12:00:29Z" level=info msg="NATS disconnected" address= error="<nil>"
time="2023-05-31T12:00:29Z" level=info msg="NATS connection closed" error="<nil>"
time="2023-05-31T12:00:29Z" level=info msg="got 7 namespaces"
time="2023-05-31T12:00:29Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"
time="2023-05-31T12:00:39Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LB
time="2023-05-31T12:00:39Z" level=info msg="Restarting engine due to namespace event: "
time="2023-05-31T12:00:39Z" level=info msg="Listing namespaces"
time="2023-05-31T12:00:39Z" level=info msg="NATS disconnected" address= error="<nil>"
time="2023-05-31T12:00:39Z" level=info msg="NATS connection closed" error="<nil>"
time="2023-05-31T12:00:39Z" level=info msg="got 7 namespaces"
time="2023-05-31T12:00:40Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"
time="2023-05-31T12:00:40Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LB
time="2023-05-31T12:00:40Z" level=info msg="Restarting engine due to namespace event: "
time="2023-05-31T12:00:40Z" level=info msg="NATS disconnected" address= error="<nil>"
time="2023-05-31T12:00:40Z" level=info msg="NATS connection closed" error="<nil>"
time="2023-05-31T12:00:40Z" level=info msg="Listing namespaces"
time="2023-05-31T12:00:40Z" level=info msg="got 7 namespaces"
time="2023-05-31T12:00:40Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"
time="2023-05-31T12:00:40Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LB
time="2023-05-31T12:00:40Z" level=info msg="Restarting engine due to namespace event: "
time="2023-05-31T12:00:40Z" level=info msg="Listing namespaces"
time="2023-05-31T12:00:40Z" level=info msg="NATS disconnected" address= error="<nil>"
time="2023-05-31T12:00:40Z" level=info msg="NATS connection closed" error="<nil>"
time="2023-05-31T12:00:40Z" level=info msg="got 7 namespaces"
time="2023-05-31T12:00:40Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"
time="2023-05-31T12:00:50Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LB
time="2023-05-31T12:00:50Z" level=info msg="Restarting engine due to namespace event: "
time="2023-05-31T12:00:50Z" level=info msg="NATS disconnected" address= error="<nil>"
time="2023-05-31T12:00:50Z" level=info msg="NATS connection closed" error="<nil>"
time="2023-05-31T12:00:50Z" level=info msg="Listing namespaces"
time="2023-05-31T12:00:51Z" level=info msg="got 7 namespaces"
time="2023-05-31T12:00:51Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"
time="2023-05-31T12:00:51Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LB
time="2023-05-31T12:00:51Z" level=info msg="Restarting engine due to namespace event: "
time="2023-05-31T12:00:51Z" level=info msg="Listing namespaces"
time="2023-05-31T12:00:51Z" level=info msg="NATS disconnected" address= error="<nil>"
time="2023-05-31T12:00:51Z" level=info msg="NATS connection closed" error="<nil>"
time="2023-05-31T12:00:51Z" level=info msg="got 7 namespaces"
time="2023-05-31T12:00:51Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"
time="2023-05-31T12:01:01Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LB
time="2023-05-31T12:01:01Z" level=info msg="Restarting engine due to namespace event: "
time="2023-05-31T12:01:01Z" level=info msg="Listing namespaces"
time="2023-05-31T12:01:01Z" level=info msg="NATS disconnected" address= error="<nil>"
time="2023-05-31T12:01:01Z" level=info msg="NATS connection closed" error="<nil>"
time="2023-05-31T12:01:02Z" level=info msg="got 7 namespaces"
time="2023-05-31T12:01:02Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"

Create helm chart to deploy

If users are to run this on their clusters they will need a helm chart to deploy it. This chart will need to contain:

  • A deployment to run the pod
  • A configmap for the NATS and other config
  • A secret for the NKey seed & the token

Source is looping

For some reason the source is still looping, like what happened previously in #53

time="2023-06-09T14:28:41Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:28:41Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ time="2023-06-09T14:28:42Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:28:42Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ time="2023-06-09T14:28:42Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:28:42Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ time="2023-06-09T14:28:43Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:28:43Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ time="2023-06-09T14:28:53Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:28:53Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ I0609 14:28:55.086434       1 request.go:696] Waited for 1.151082353s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/api/v1/namespaces                                                             │
│ time="2023-06-09T14:28:55Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:28:55Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ time="2023-06-09T14:28:55Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:28:55Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ time="2023-06-09T14:28:56Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:28:56Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ time="2023-06-09T14:28:56Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:28:56Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ time="2023-06-09T14:28:57Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:28:57Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ I0609 14:29:07.489119       1 request.go:696] Waited for 1.002376188s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/api/v1/namespaces/kube-system/replicationcontrollers                          │
│ time="2023-06-09T14:29:07Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:29:07Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ time="2023-06-09T14:29:08Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:29:08Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ time="2023-06-09T14:29:08Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:29:08Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ time="2023-06-09T14:29:09Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:29:09Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ time="2023-06-09T14:29:09Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:29:09Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ time="2023-06-09T14:29:11Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:29:11Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ time="2023-06-09T14:29:11Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:29:11Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ time="2023-06-09T14:29:12Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:29:12Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ I0609 14:29:18.083232       1 request.go:696] Waited for 1.045901276s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings                         │
│ time="2023-06-09T14:29:22Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:29:22Z" level=info msg="Listing namespaces"                                                                                                                                                                                   │
│ time="2023-06-09T14:29:24Z" level=info msg="got 7 namespaces"                                                                                                                                                                                     │
│ time="2023-06-09T14:29:24Z" level=info msg="NATS connecting" servers="nats://nats:4222,nats://nats:4223"                                                                                                                                          │
│ I0609 14:29:28.084857       1 request.go:696] Waited for 1.198547805s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/networking.k8s.io/v1/namespaces/buildkit/networkpolicies                 │
│ time="2023-06-09T14:29:34Z" level=info msg="NATS connected" ServerID=NDKSO4EZTKGJ5MCON5FZ4UPUGQHBNIM3F6V37OJTGPBSL4IOLZ2LBQM7 URL:="nats://nats:4222"                                                                                             │
│ time="2023-06-09T14:29:34Z" level=info msg="Listing namespaces"  

Implement medium priority k8s sources

  • securitygrouppolicies
  • resourcequotas
  • networkpolicies
  • ingressclasses
  • ingressclassparams
  • certificatesigningrequests
  • csidrivers
  • csinodes
  • dbclusterparametergroups
  • dbclusters
  • dbinstances
  • dbparametergroups
  • dbproxies
  • dbsubnetgroups
  • podtemplates
  • targetgroupbindings

Decide how auth/lifecycle should work for self-hosted sources

Auth

At the moment the source is using the same auth method, which assumes that we're able to get NKeys and a token to the source securely. This is okay in srcman as it's all hosted by us, but for self-hosted sources we need to be a bit smarter. The things I don't like about the current approach are:

  • There is no need for the NKey seed to be sent over the wire, it could be generated locally and then just the public key sent back to get a token
  • The token has a really long lifetime and can't be revoked

It would be cool if we could have the user install the helm chart without having to pass any parameters. I'm thinking maybe it generates an NKey seed, and then shows the user something that they can provide to us to prove the source is theirs, like a URL that they can click which adds the source to their account. The thing is, once they have clicked this, we need to get the JWT to the source somehow. Maybe it would need to make an unauthenticated request to the API which is "approved" by the user clicking the link.
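
As a sketch of the first point, assuming the github.com/nats-io/nkeys library (which underpins NATS auth), the seed could be generated locally and only the public key ever sent to us:

package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nkeys"
)

func main() {
	// Generate a user key pair locally, inside the cluster
	kp, err := nkeys.CreateUser()
	if err != nil {
		log.Fatal(err)
	}

	// The seed stays with the source (e.g. in a Kubernetes secret) and is never transmitted
	seed, err := kp.Seed()
	if err != nil {
		log.Fatal(err)
	}
	_ = seed

	// Only the public key is sent to the API in exchange for a (revocable, short-lived) token
	pub, err := kp.PublicKey()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("public key:", pub)
}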

Lifecycle

The next problem is: how do we tell what sources you have? Currently we use Kubernetes as a database for sources and all data is stored there. We could customise this so that source data can be stored without it actually starting a source, maybe? It does however raise some questions:

  • How would you delete it?
  • How do we track the state?

Handle temporary connection issues better

The alert in

sentry.CaptureException(err)

is a bit overzealous and should wait for a bit before raising an alert.

Sentry Issue: BACKEND-15

syscall.Errno: connection refused
*os.SyscallError: connect: connection refused
*net.OpError: dial tcp [fd16:6e37:bf5d::1]:443: connect: connection refused
*url.Error: Get "https://[fd16:6e37:bf5d::1]:443/api/v1/namespaces": dial tcp [fd16:6e37:bf5d::1]:443: connect: connection refused
*rest.wrapPreviousError: Get "https://[fd16:6e37:bf5d::1]:443/api/v1/namespaces": dial tcp [fd16:6e37:bf5d::1]:443: connect: connection refused - error from a previous attempt: read tcp [2a05:d01c:997:3200:9a2f::9]:43548->[fd16:6e37:bf5d::1]:443: read: connection reset by peer
  File "/workspace/cmd/root.go", line 275, in run.func3

Review current state of k8s integration

David has already done a bunch of work to get the old source compiling. I need to review it and create tickets for what else needs to be done

We are going to need this no matter what so I need to get cracking on creating/resurrecting the k8s integration. Since we already have an example source the work should start by reviewing that: https://github.com/overmindtech/k8s-source

We should then go and create the source using the newest template and the knowledge that we've gained from the AWS source. I'm thinking that we'll almost certainly be able to use generics for the sources themselves

Memory leak

In kube we're seeing this pod being eventually evicted with the following errors:

  • The node was low on resource: memory. Container overmind-kube-source was using 1822764Ki, which exceeds its request of 0
  • The node was low on resource: memory. Container overmind-kube-source was using 2515040Ki, which exceeds its request of 0
  • The node was low on resource: memory. Container overmind-kube-source was using 3094324Ki, which exceeds its request of 0

This is a super massive amount of memory so it must be leaking somewhere.
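
To track this down, the standard net/http/pprof handlers could be exposed on a debug port so heap profiles can be compared over time (a sketch; not currently wired into the source, and the port is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// `go tool pprof http://<pod>:6060/debug/pprof/heap` can then be used to
	// take and diff heap snapshots
	log.Fatal(http.ListenAndServe(":6060", nil))
}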

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

This repository currently has no open or pending branches.

Detected dependencies

dockerfile
build/package/Dockerfile
  • golang 1.22-alpine
  • alpine 3.20
github-actions
.github/workflows/docker-cleanup.yml
  • actions/delete-package-versions v5
.github/workflows/release-charts.yml
  • actions/checkout v4
  • azure/setup-helm v4
  • helm/chart-releaser-action v1.6.0
.github/workflows/test-build.yml
  • actions/checkout v4
  • actions/setup-go v5
  • actions/checkout v4
  • docker/metadata-action v5
  • docker/login-action v3
  • docker/login-action v3
  • depot/setup-action v1
  • depot/build-push-action v1
  • actions/upload-artifact v4
  • actions/checkout v4
  • actions/upload-artifact v4
gomod
go.mod
  • go 1.22.4
  • github.com/MrAlias/otel-schema-utils v0.2.1-alpha
  • github.com/getsentry/sentry-go v0.28.1
  • github.com/google/uuid v1.6.0
  • github.com/overmindtech/discovery v0.27.6
  • github.com/overmindtech/sdp-go v0.79.0
  • github.com/overmindtech/sdpcache v1.6.4
  • github.com/sirupsen/logrus v1.9.3
  • github.com/spf13/cobra v1.8.1
  • github.com/spf13/pflag v1.0.5
  • github.com/spf13/viper v1.19.0
  • github.com/uptrace/opentelemetry-go-extra/otellogrus v0.3.1
  • go.opentelemetry.io/contrib/detectors/aws/ec2 v1.28.0
  • go.opentelemetry.io/otel v1.28.0
  • go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.28.0
  • go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.28.0
  • go.opentelemetry.io/otel/sdk v1.28.0
  • go.uber.org/automaxprocs v1.5.3
  • k8s.io/api v0.30.2
  • k8s.io/apimachinery v0.30.2
  • k8s.io/client-go v0.30.2
  • sigs.k8s.io/kind v0.23.0
  • go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.53.0
  • k8s.io/client-go v1.4.0
  • k8s.io/client-go v1.5.0
  • k8s.io/client-go v1.5.1
  • k8s.io/client-go v2.0.0+incompatible
  • k8s.io/client-go v3.0.0+incompatible
  • k8s.io/client-go v4.0.0+incompatible
  • k8s.io/client-go v5.0.0+incompatible
  • k8s.io/client-go v5.0.1+incompatible
  • k8s.io/client-go v6.0.0+incompatible
  • k8s.io/client-go v7.0.0+incompatible
  • k8s.io/client-go v8.0.0+incompatible
  • k8s.io/client-go v9.0.0-invalid+incompatible
  • k8s.io/client-go v9.0.0+incompatible
  • k8s.io/client-go v10.0.0+incompatible
  • k8s.io/client-go v11.0.0+incompatible
helm-values
deployments/overmind-kube-source/values.yaml


Create generic kube source

Using the existing reflection implementation as an example, I need to create a k8s source using generics (a minimal sketch follows the checklist below).

  • Implement a source using generics that satisfies the discovery.Source interface
  • Create a pod & node source as an example of namespaced and non-namespaced
  • Review developer experience and move duplication into the source
  • Create comprehensive tests
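
A minimal sketch of the generics idea; the real discovery.Source interface from overmindtech/discovery is not reproduced here, and all names below are hypothetical:

package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// KubeSource wraps a typed client-go lister so that per-resource sources only
// need to provide a List function and an item extractor
type KubeSource[Object any, ObjectList any] struct {
	TypeName string
	List     func(ctx context.Context, opts metav1.ListOptions) (ObjectList, error)
	Extract  func(list ObjectList) []Object
}

// ListAll lists every item the wrapped client can see
func (s *KubeSource[Object, ObjectList]) ListAll(ctx context.Context) ([]Object, error) {
	list, err := s.List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	return s.Extract(list), nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Namespaced example: pods in the default namespace
	podSource := &KubeSource[corev1.Pod, *corev1.PodList]{
		TypeName: "Pod",
		List: func(ctx context.Context, opts metav1.ListOptions) (*corev1.PodList, error) {
			return cs.CoreV1().Pods("default").List(ctx, opts)
		},
		Extract: func(list *corev1.PodList) []corev1.Pod { return list.Items },
	}

	pods, err := podSource.ListAll(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s: %d items\n", podSource.TypeName, len(pods))
}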

Auto-determine cluster name in EKS

This is possible and would save us relying on the user configuring the source correctly, which is always a good thing for everyone involved. A rough sketch of the lookup follows below.

  • We can use the IMDSv2 API to determine the details of the instance such as the region, ID etc.
  • We can then use the EC2 API to get the tags from the instance
  • From the tags we can determine the name of the cluster; this can be validated by getting the cluster details

TODO: Work out how this would work in Fargate
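
A rough sketch using the AWS SDK for Go v2; the aws:eks:cluster-name tag key (set on managed node group instances) is an assumption that should be confirmed, and the Fargate case is still open:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// 1. Ask IMDSv2 which instance we are running on
	doc, err := imds.NewFromConfig(cfg).GetInstanceIdentityDocument(ctx, &imds.GetInstanceIdentityDocumentInput{})
	if err != nil {
		log.Fatal(err)
	}
	cfg.Region = doc.Region

	// 2. Read the tags of that instance via the EC2 API
	out, err := ec2.NewFromConfig(cfg).DescribeTags(ctx, &ec2.DescribeTagsInput{
		Filters: []ec2types.Filter{
			{Name: aws.String("resource-id"), Values: []string{doc.InstanceID}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// 3. Look for the cluster name tag (tag key is an assumption to confirm),
	//    then validate it by describing the cluster
	for _, tag := range out.Tags {
		if aws.ToString(tag.Key) == "aws:eks:cluster-name" {
			fmt.Printf("cluster %q in region %s\n", aws.ToString(tag.Value), doc.Region)
		}
	}
}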

Implement high priority k8s sources

Implement the following sources:

  • refactor clusterrolebindings
  • refactor clusterroles
  • refactor configmaps
  • refactor cronjobs
  • refactor daemonsets
  • refactor deployments
  • refactor endpoints
  • refactor endpointslices
  • refactor horizontalpodautoscalers
  • refactor ingresses
  • refactor jobs
  • refactor limitranges
  • refactor namespaces (I'm going to skip this one)
  • refactor networkpolicy
  • refactor nodes
  • refactor persistentvolumeclaims
  • refactor persistentvolumes
  • refactor poddisruptionbudgets
  • refactor pods
  • create podsecuritypolicies (doesn't work)
  • create priorityclass
  • refactor replicasets
  • refactor replicationcontrollers
  • create resourcequota
  • refactor rolebindings
  • refactor roles
  • refactor secrets
  • refactor serviceaccounts
  • refactor services
  • refactor statefulsets
  • refactor storageclasses
  • create volumeattachments

Once this is complete, go through all the shared stuff and delete anything that isn't required.

Then go through the start command and make sure all of these get loaded properly, including reload on changed namespaces

Final k8s integration work

Work out the final bits we need in order for people to use the Kubernetes source. This means:

  • Test and document helm charts
  • #85
  • Create a modal or whatever for adding a kube source. This should allow the user to add an API key for that source that already has the correct scopes, and once that is done it gives you output to copy that creates the source from helm
  • Allow srcman to support external sources (ones that it doesn't need to run)
