Comments (10)

MMquant commented on May 27, 2024
  1. Is it cloud, on-prem, some test env with kind/minikube/etc

The k8s cluster is deployed on-prem on Proxmox: 3 master nodes, 3 worker nodes.

  2. Number of nodes and approximate number of pods

±150

  3. Nodes resources (CPU, RAM, storage)

maple@ubuntu:tk8s-mon$ k top nodes
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
tk8sm01   423m         10%    4086Mi          53%
tk8sm02   316m         7%     3477Mi          45%
tk8sm03   252m         6%     3216Mi          42%
tk8sw01   381m         4%     11859Mi         37%
tk8sw02   671m         8%     8623Mi          27%
tk8sw03   1051m        13%    18054Mi         56%

  4. Exact audit policy used

https://falco.org/docs/install-operate/third-party/learning/#falco-with-multiple-sources

ELK event count screenshot from opensearch

image

You can see that the normal event rate [5m] is around 6k-7k. When Kubeshark was deployed, the rate jumped to 13k.
The empty bars are kube-apiserver crashes. We needed to temporarily disable auditing in kube-apiserver.yaml so that we could uninstall the Kubeshark Helm release.

Additionally see the screenshots from Grafana

kubeshark daemonset memory load

Screenshot 2024-02-20 at 13 42 08

kube-apiserver memory load

Screenshot 2024-02-20 at 11 14 33

At the moment we are going to test the following:

  • upgrade nodes to use cgroup v2
  • limit the kube-apiserver memory resources so that next time at least the node itself survives
  • limit the Kubeshark memory resources according to the guide https://docs.kubeshark.co/en/performance
  • scope Kubeshark to a specific namespace

You are right that the audit policy definition has a huge impact on the number of events generated. The one we are using is pretty verbose, as that is needed for analysis by Falco.

from kubeshark.

corest commented on May 27, 2024

Thx for the info @MMquant

To confirm whether Kubeshark itself generates those events, can you please exclude the Kubeshark service account from auditing?
If you installed Kubeshark in the default namespace, this can be done with the rule:

  - level: None
    userGroups: ["system:serviceaccounts"]
    users: ["system:serviceaccount:default:kubeshark-service-account"]

So, when you have time, remove Kubeshark, add the rule, restart the API servers, install Kubeshark again, and check whether the number of events is that high again.
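
One way to confirm which identity produces the extra events is to count audit events per `user.username` in the API server's audit log. The sketch below assumes the log backend writes one JSON event per line; the function name `count_events_by_user` is hypothetical and not part of Kubeshark or Kubernetes tooling.

```python
import json
from collections import Counter
from typing import Iterable


def count_events_by_user(lines: Iterable[str]) -> Counter:
    """Count audit events per requesting user from JSON-lines audit log text."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        # Audit events carry the requester under "user" -> "username".
        user = event.get("user") or {}
        counts[user.get("username", "<unknown>")] += 1
    return counts
```

Passing an open file handle for the audit log (the file configured with `--audit-log-path`) and printing `most_common(10)` on the result should show whether `system:serviceaccount:default:kubeshark-service-account` dominates the event count.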

MMquant commented on May 27, 2024

Hi @corest,
we have just tested the rule, and it seems that it did indeed filter out the "DOS events".

alongir commented on May 27, 2024

@MMquant thanks for reporting this. We are actively looking into this and will report back our findings.

corest commented on May 27, 2024

Hi @MMquant
Thx for reporting this issue.

I've tried to verify this in our test environment, an EKS cluster (5 t3.large nodes, ~100 pods).
image

This graph shows the number of audit log events in the cluster.
I installed Kubeshark at 7:30.
There was a small spike in the number of events at that point, which stabilized after Kubeshark finished its initial discovery. After that, no anomalies in the number of events were detected.
Also, there is no visible additional load on the Kubernetes API server.
We did similar tests before on a cluster with 100 nodes and ~1000 pods and didn't find any issues.
This doesn't prove that there are no such issues, though.
Maybe EKS does not have that verbose an audit policy, dunno.

Please provide more details on your setup.

  1. Is it cloud, on-prem, some test env with kind/minikube/etc
  2. Number of nodes and approximate number of pods
  3. Nodes resources (CPU, RAM, storage)
  4. Exact audit policy used

Also, maybe ELK can provide some details on the anomalies? You wrote that no common "DDOS" events were found, but maybe you can provide at least the difference in the event count before and after Kubeshark.
E.g. the average count before Kubeshark was 1k events/s, and after Kubeshark was installed it was 10k events/s.
That would at least help us identify the magnitude of the issue.

alongir commented on May 27, 2024

MMquant commented on May 27, 2024

@corest I think we can close this issue, can't we?
@alongir The logs I posted have been sorted out by our devops team. Moreover, I wouldn't discuss that log error in this issue, as I think these things are not related.

corest commented on May 27, 2024

@MMquant we will keep this open for now, as I have a few things to work on:

  1. Create an environment with an extensive audit policy + Falco to replicate your issue.
  2. Find out why Kubeshark generates so many events.
  3. Update the docs on our side regarding excluding Kubeshark from audit events.

corest commented on May 27, 2024

  1. Recreated a cluster with 3 nodes and the audit policy provided by Falco.
  2. Installed Falco and some workloads. Left the cluster for 1h. Average rate of audit events: 345 events/minute.
  3. Installed Kubeshark and enabled scripts to have some activity. Left it for 1h. Average rate of audit events increased to 352 events/minute.

Overall, in 1h the Kubeshark service account generated ~300 events, which is expected and normal.
Also, no visible additional load was generated on the Kubernetes API.
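
To put the two observations in this thread side by side, the relative increase in audit event rate can be computed as below. This is only illustrative arithmetic: the 345 and 352 events/minute figures come from this comment, the 13k figure from the reporter's comment, and the 6.5k baseline is an assumed midpoint of the reported 6k-7k range.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the relative change from `before` to `after` as a percentage."""
    if before <= 0:
        raise ValueError("baseline rate must be positive")
    return (after - before) / before * 100.0


# Reproduction cluster (events per minute), as stated in this comment.
repro = relative_increase(345, 352)

# Reporter's cluster (events per 5 minutes); 6500 is an assumed midpoint
# of the reported 6k-7k baseline, used for illustration only.
reported = relative_increase(6500, 13000)

print(f"reproduction: +{repro:.1f}%")   # prints "reproduction: +2.0%"
print(f"reported:     +{reported:.1f}%")  # prints "reported:     +100.0%"
```

A change of roughly 2% versus a doubling is consistent with the conclusion that the high volume depends on the specific cluster setup.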

So for this issue, I think the reason behind the high volume of events is very specific to the cluster setup and can't be fixed on the Kubeshark side for now.

Last thing for this issue: I'll add a section at https://docs.kubeshark.co/en/troubleshooting on how to exclude Kubeshark audit events from monitoring.
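
Audit policy rules are evaluated in order and the first matching rule sets the level, so an exclusion like the one suggested earlier in this thread needs to sit ahead of the verbose rules. A minimal sketch of that placement follows; the helper `exclude_service_account` and the one-rule stand-in policy are hypothetical and are not the Falco policy or the text of the Kubeshark docs.

```python
import copy
import json


def exclude_service_account(policy: dict, namespace: str, name: str) -> dict:
    """Return a copy of `policy` with a level-None rule for one service account placed first."""
    rule = {
        "level": "None",
        "userGroups": ["system:serviceaccounts"],
        "users": [f"system:serviceaccount:{namespace}:{name}"],
    }
    updated = copy.deepcopy(policy)
    # Prepend so the exclusion matches before any more verbose rule.
    updated["rules"] = [rule] + list(updated.get("rules", []))
    return updated


# Hypothetical stand-in for a verbose policy, for illustration only.
policy = {
    "apiVersion": "audit.k8s.io/v1",
    "kind": "Policy",
    "rules": [{"level": "RequestResponse"}],
}

updated = exclude_service_account(policy, "default", "kubeshark-service-account")
print(json.dumps(updated, indent=2))
```

The namespace and service account name are the ones quoted in this thread for a default-namespace install; other installs would need their own values.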

FYI @alongir

corest commented on May 27, 2024

Done
