We just successfully killed our k8s control plane nodes by deploying the …

kubeshark deployment DOSes `kube-apiserver` if k8s audit events enabled about kubeshark HOT 10 CLOSED

MMquant commented on May 27, 2024 1

kubeshark deployment DOSes `kube-apiserver` if k8s audit events enabled

from kubeshark.

Comments (10)

MMquant commented on May 27, 2024 1

Is it cloud, on-prem, some test env with kind/minikube/etc

The k8s cluster is deployed on-prem on proxmox. 3 master nodes, 3 worker nodes.

Number of nodes and approximate number of pods

±150

Nodes resources (CPU, RAM, storage)

maple@ubuntu:tk8s-mon$ k top nodes
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
tk8sm01   423m         10%    4086Mi          53%
tk8sm02   316m         7%     3477Mi          45%
tk8sm03   252m         6%     3216Mi          42%
tk8sw01   381m         4%     11859Mi         37%
tk8sw02   671m         8%     8623Mi          27%
tk8sw03   1051m        13%    18054Mi         56%

Exact audit policy used

https://falco.org/docs/install-operate/third-party/learning/#falco-with-multiple-sources

ELK event count screenshot from opensearch

You can see that the normal event rate[5m] is around 6k-7k. When kubeshark was deployed, the rate jumped to 13k.
The empty bars are kube-apiserver crashes. We had to temporarily disable auditing in kube-apiserver.yaml to be able to uninstall the kubeshark helm release.

Additionally see the screenshots from Grafana

kubeshark daemonset memory load

kube-apiserver memory load

At the moment we are going to test the following:

upgrade nodes to use cgroup v2
limit the kube-apiserver memory resources so that next time at least the node itself survives
limit the kubeshark memory resources according to the guide https://docs.kubeshark.co/en/performance
scope the kubeshark to specific namespace

You are right that the audit policy definition has a huge impact on the number of events generated. The one we are using is pretty verbose, as that is needed for analysis by Falco.

corest commented on May 27, 2024 1

Thx for the info @MMquant

To confirm whether Kubeshark itself generates those events, can you please exclude the Kubeshark service account from auditing?
If you installed Kubeshark in the default namespace, this can be done with the following rule:

  - level: None
    userGroups: ["system:serviceaccounts"]
    users: ["system:serviceaccount:default:kubeshark-service-account"]

So, when you have time: remove Kubeshark, add the rule, restart the API servers, install Kubeshark again, and check whether the number of events is that high again.
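One way to check which identity produces the events is to count audit log entries per user. The following is a minimal sketch, not something posted in this thread. It assumes the kube-apiserver log backend writes one JSON event per line and that each event carries a `user.username` field, as in the `audit.k8s.io/v1` event format. The function name and the log path in the usage note are hypothetical.

```python
import json
from collections import Counter


def count_events_by_user(lines):
    """Count audit events per user.username in JSON-lines audit log output."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(event, dict):
            continue
        user = event.get("user") or {}
        counts[user.get("username", "<unknown>")] += 1
    return counts
```

Usage, assuming the audit log lives at a path such as `/var/log/kubernetes/audit.log` (this depends on the `--audit-log-path` setting of the cluster): open the file, pass the file object to `count_events_by_user`, and inspect `counts.most_common(10)`. If `system:serviceaccount:default:kubeshark-service-account` dominates before the exclusion rule and disappears after it, the rule is working as intended.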

MMquant commented on May 27, 2024 1

Hi @corest ,
we have just tested the rule and it seems that the rule indeed filtered out the "DOS events".

alongir commented on May 27, 2024

@MMquant thanks for reporting this. We are actively looking into this and will report back our findings.

corest commented on May 27, 2024

Hi @MMquant
Thx for reporting this issue.

I've tried to verify this in our test environment, an EKS cluster (5 t3.large nodes, ~100 pods).

This graph shows the number of audit log events in the cluster.
I installed Kubeshark at 7:30.
There was a small spike in the number of events at this point, which stabilized after Kubeshark made its initial discovery. After that, no anomalies in the number of events were detected.
Also, there is no visible additional load on the Kubernetes API server.
We did similar tests before on a cluster with 100 nodes and ~1000 pods and didn't find any issues.
This doesn't prove that there are no such issues though.
Maybe EKS does not have that verbose audit policy, dunno.

Please provide more details on your setup.

Is it cloud, on-prem, some test env with kind/minikube/etc
Number of nodes and approximate number of pods
Nodes resources (CPU, RAM, storage)
Exact audit policy used

Also maybe ELK can provide some details on anomalies? You wrote that no common "DDOS" events were found, but maybe you can provide at least the difference in the count of events before Kubeshark and after.
E.g. average count before Kubeshark was 1k events/s and after Kubeshark was installed - 10k events/s.
That would help us to identify the magnitude of the issue at least.
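The before/after comparison requested above can be sketched as follows. This is an illustrative helper under stated assumptions, not part of Kubeshark or ELK: it assumes you have already extracted event timestamps (for example the `requestReceivedTimestamp` field of each audit event) as RFC 3339 strings, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an RFC 3339 timestamp such as 2024-02-27T09:24:01.000000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def rates_around(timestamps, install_time, window):
    """Return (before, after) average events per second.

    Counts events within `window` before and after `install_time`.
    """
    before = sum(1 for t in timestamps if install_time - window <= t < install_time)
    after = sum(1 for t in timestamps if install_time <= t < install_time + window)
    seconds = window.total_seconds()
    return before / seconds, after / seconds
```

Calling `rates_around(timestamps, parse_ts("2024-02-27T07:30:00Z"), timedelta(hours=1))` would give the average rate in the hour before and the hour after an install at that (example) time, which is the kind of figure asked for here.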

alongir commented on May 27, 2024

@MMquant The logs don't show a problem. Do you still experience the containers failing? Also, one log line implies you're using an older version. It will be good to test out one of the recent versions (e.g. the latest one).

On Tue, Feb 27, 2024 at 1:36 AM Petr Javorik wrote:

Hi @corest,
we have just tested the rule and it seems that the rule indeed filtered out the "DOS events".

However we are concurrently facing another issue with kubeshark which we don't know if it's related to this events issue. Kubeshark containers are in crashLoopBackoff state

Defaulted container "sniffer" out of: sniffer, tracer
{"level":"debug","time":"2024-02-27T09:24:01Z","message":"packet-capture flag is deprecated!"}
2024-02-27T09:24:01Z INF main.go:75 > Starting worker...
2024-02-27T09:24:01Z INF misc/data.go:25 > Set the data directory to: data-dir=/app/data
2024-02-27T09:24:01Z INF kubernetes/memory/limit.go:47 > Memory limit is set to limit=8301034833169294539
2024-02-27T09:24:01Z INF main.go:106 > Starting worker...
2024-02-27T09:24:01Z WRN kubernetes/resolver/resolver.go:126 > Failed reading the name resolution history dump: error="open /app/data/name_resolution_history.json: no such file or directory" path=/app/data/name_resolution_history.json
2024-02-27T09:24:01Z INF main.go:126 > Linux kernel: version=4.18.0-513.11.1.el8_9.x86_64
2024-02-27T09:24:01Z INF utils/kernel/loader.go:80 > Downloading kernel module: dst=/app/kernel_modules/pf_ring.ko url=https://api.kubeshark.co/kernel-modules/4.18.0-513.11.1.el8_9.x86_64/pf_ring.ko
2024-02-27T09:24:01Z INF kubernetes/resolver/target.go:115 > Targeted pod: ......-01-0
2024-02-27T09:24:01Z INF kubernetes/resolver/target.go:115 > Targeted pod: .......-02-0
2024-02-27T09:24:01Z INF kubernetes/resolver/target.go:115 > Targeted pod: ....dsr7k
2024-02-27T09:24:01Z INF kubernetes/resolver/target.go:115 > Targeted pod: linux-tools
2024-02-27T09:24:02Z WRN main.go:131 > error="bad response code: 404"
2024-02-27T09:24:02Z INF assemblers/tcp_streams_map.go:75 > Using 1000 ms as the close timed out TCP stream channels interval
2024-02-27T09:24:02Z WRN source/tcp_packet_source.go:88 > Can't use PF_RING socket error="pfring NewRing error: address family not supported by protocol"
2024-02-27T09:24:02Z INF source/tcp_packet_source.go:103 > Using AF_PACKET socket as the capture source
2024-02-27T09:24:02Z INF server/server.go:62 > Starting the server... port=30001

MMquant commented on May 27, 2024

@corest I think we can close this issue, can't we?
@alongir The logs I posted have been sorted out by our devops team. Moreover I wouldn't discuss that log error in this issue as I think these things are not related.

corest commented on May 27, 2024

@MMquant we will keep this open for now as I have a few things to work on:

Create an environment with an extensive audit policy + Falco to replicate your issue.
Find out why Kubeshark generates so many events.
Update the docs on our side regarding excluding Kubeshark from audit events.

corest commented on May 27, 2024

Recreated a cluster with 3 nodes and the audit policy provided by Falco.
Installed Falco and some workloads. Left the cluster for 1h. Average rate of audit events: 345 events/minute.
Installed Kubeshark and enabled scripts to generate some activity. Left it for 1h. The average rate of audit events increased to 352 events/minute.

Overall in 1h Kubeshark service account generated ~300 events and that is expected and normal.
Also, there was no visible additional load on the Kubernetes API.
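As a quick sanity check of the figures above, the arithmetic can be worked through in a few lines. This is only an illustration using the two averages quoted in this comment; the function name is hypothetical.

```python
def compare_rates(before_per_min, after_per_min, minutes=60):
    """Return (extra events over the period, relative increase)."""
    extra_per_min = after_per_min - before_per_min
    return extra_per_min * minutes, extra_per_min / before_per_min


extra, relative = compare_rates(345, 352)
print(f"{extra} extra events per hour, {relative:.1%} increase")
# prints "420 extra events per hour, 2.0% increase"
```

An increase of about 2% is of the same order as the ~300 events per hour attributed to the Kubeshark service account, and far from the jump from 6k-7k to 13k that the reporter described for their cluster.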

So for this issue, I think the reason behind the high volume of events is very specific to the cluster setup and can't be fixed on the Kubeshark side for now.

Last thing for this issue - I'll add section here https://docs.kubeshark.co/en/troubleshooting on how to exclude kubeshark audit events from monitoring

FYI @alongir

corest commented on May 27, 2024

Done
