
Comments (17)

Nickmman commented on September 21, 2024

@rkachach I fixed the osd/pool usage last night by recreating each of the osds, since they were not freeing up space from deleted block images. I hope I don't have to do this frequently, but I haven't been able to find any documentation about cleaning up the osds aside from recreating them.

Regarding the original issue, I have also increased the CPU limit to double (or more than double) what was previously assigned to the mgr pods. This seems to have resolved the issue, though it is still unclear why presenting a dashboard needs so much processing power; it is already reaching the CPU limit I have assigned.

Should this happen again, I will try to obtain the mgr logs (nothing jumps out aside from constant DBG messages about the PGs). I did run top, however, and the process that is consuming near 100% (and sometimes exactly 100%) CPU is the dashboard:
[screenshot: top output showing the dashboard process near 100% CPU]
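
For reference, the mgr logs can be pulled with something like the following; the namespace, deployment, and container names are the usual Rook defaults (rook-ceph, rook-ceph-mgr-a, mgr) and may differ in other setups:

# dump the logs of the active mgr container into a file for later inspection
kubectl -n rook-ceph logs deploy/rook-ceph-mgr-a -c mgr > mgr.log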

rkachach commented on September 21, 2024

@Nickmman Thanks for the feedback, I'm glad that your setup is working after the upgrade. The patch has not been released yet and it will take some time, as it depends on the next Ceph Reef release (v18.2.2).

Nickmman commented on September 21, 2024

Thanks @rkachach, closing this as the original issue was resolved.

rkachach commented on September 21, 2024

@Nickmman please, can you have a look at issue #12877? The symptoms seem really similar, and it may be related to some issues that arise when SSL is enabled, but from your spec I can see that's not the case!

Nickmman commented on September 21, 2024

> @Nickmman please, can you have a look at issue #12877? The symptoms seem really similar, and it may be related to some issues that arise when SSL is enabled, but from your spec I can see that's not the case!

Hi @rkachach, I can confirm that I am not running rook with SSL enabled; SSL is handled by nginx-ingress instead. That also wouldn't explain the behavior when hitting the pod directly, which should return a regular HTTP reply. I can confirm, however, that I did notice a spike in CPU usage by the active mgr pod, see below:
[screenshot: CPU usage spike on the active mgr pod]
Though this may be due to cleanup after deleting a block image (since I'm using RBD).
I look forward to seeing what else could be causing this issue.

Nickmman commented on September 21, 2024

I have confirmed that the increased CPU usage only occurs when accessing the dashboard: it is high again now that I have the dashboard open. The dashboard initially worked for a couple of pages but has started timing out again and is unusable.

rkachach commented on September 21, 2024

@Nickmman I'm curious about the ingress configuration:

nginx.ingress.kubernetes.io/ssl-redirect: "true"

Would this convert http traffic to https when directing it to the dashboard?
Can you set this option to false and see if that fixes your issue?

Nickmman commented on September 21, 2024

@rkachach This is just to redirect frontend users who come in over HTTP to the HTTPS version of the endpoint, via a 308. It is not related to the backend endpoint, which is the mgr service. Backend traffic to the mgr service stays plain HTTP, as per nginx.ingress.kubernetes.io/backend-protocol: HTTP.
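
For reference, the relevant part of such an Ingress would look roughly like the sketch below; the host, service name, and port are assumptions based on a typical Rook setup with the dashboard's SSL disabled, and only the two annotations come from this discussion:

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: ceph-dashboard
    namespace: rook-ceph
    annotations:
      # 308-redirect HTTP clients to HTTPS; the connection to the backend stays plain HTTP
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
      nginx.ingress.kubernetes.io/backend-protocol: HTTP
  spec:
    ingressClassName: nginx
    rules:
      - host: ceph.example.com                     # hypothetical hostname
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: rook-ceph-mgr-dashboard    # default dashboard service created by Rook
                  port:
                    number: 7000                   # default dashboard HTTP port when SSL is disabled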

rkachach commented on September 21, 2024

I see, thanks for the clarification @Nickmman. BTW: your cluster has two important warnings:

            2 nearfull osd(s)
            2 pool(s) nearfull

It would be good to see whether you continue experiencing the slowness issue after fixing them. Meanwhile, can you please attach the mgr logs (especially from when the dashboard starts going slow)?

It would also be useful to see which thread inside the mgr is consuming the CPU. For this, you have to:

  1. run a shell inside the mgr container
  2. run top
  3. switch to "thread" view by using the key "H"

This view should show which mgr module is causing the high CPU consumption.
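
A sketch of those steps from outside the cluster; the namespace, deployment, and container names below assume the Rook defaults (rook-ceph, rook-ceph-mgr-a, mgr):

# 1) open a shell inside the active mgr container
kubectl -n rook-ceph exec -it deploy/rook-ceph-mgr-a -c mgr -- bash

# 2) + 3) run top in per-thread view ("-H" is equivalent to pressing "H" inside top)
top -H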

rkachach commented on September 21, 2024

Great, thanks. Can you please run the following from inside the mgr pod and report the output?

ss -tulpn
ss -anp
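
For reference, the output can also be captured without an interactive shell, again assuming the default rook-ceph namespace and rook-ceph-mgr-a deployment name:

kubectl -n rook-ceph exec deploy/rook-ceph-mgr-a -c mgr -- ss -tulpn > ss-tulpn.txt
kubectl -n rook-ceph exec deploy/rook-ceph-mgr-a -c mgr -- ss -anp > ss-anp.txt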

Nickmman commented on September 21, 2024

@rkachach Output is attached. The dashboard is timing out again, I've attached the logs from the active mgr as well. I've created a zipfile containing both files since one of them is pretty large in size.
rookmgr-output-and-logs.zip

rkachach commented on September 21, 2024

Hi @Nickmman, thank you for providing the logs. As far as I can see, there are a lot of connections (> 1000) stuck in TIME_WAIT, probably causing high CPU consumption on the server side. Some exceptions are being raised, such as:

Internal Server Error
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/_base_controller.py", line 258, in inner
    ret = func(*args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 293, in minimal
    return self.health_minimal.all_health()
  File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 150, in all_health
    result['hosts'] = self.host_count()
  File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 213, in host_count
    return len(get_hosts())
  File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 204, in get_hosts
    return merge_hosts_by_hostname(ceph_hosts, orch.hosts.list())
  File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 161, in merge_hosts_by_hostname
    host['hostname'], host['services'])
  File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 170, in populate_service_instances
    for d in orch.services.list_daemons(hostname=hostname)))
  File "/usr/share/ceph/mgr/dashboard/services/orchestrator.py", line 39, in inner
    raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in raise_if_exception
    raise e

I also see a couple of exceptions like the following, but TBH I don't really know what's wrong with your pods that is causing the crash:

Internal Server Error
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47, in dashboard_exception_handler
    return handler(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/_base_controller.py", line 258, in inner
    ret = func(*args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 293, in minimal
    return self.health_minimal.all_health()
  File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 150, in all_health
    result['hosts'] = self.host_count()
  File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 213, in host_count
    return len(get_hosts())
  File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 204, in get_hosts
    return merge_hosts_by_hostname(ceph_hosts, orch.hosts.list())
  File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 161, in merge_hosts_by_hostname
    host['hostname'], host['services'])
  File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 170, in populate_service_instances
    for d in orch.services.list_daemons(hostname=hostname)))
  File "/usr/share/ceph/mgr/dashboard/services/orchestrator.py", line 39, in inner
    raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in raise_if_exception
    raise e
kubernetes.client.rest.ApiException: ({'type': 'ERROR', 'object': {'api_version': 'v1',
 'kind': 'Status',
 'metadata': {'annotations': None,
              'cluster_name': None,
              'creation_timestamp': None,
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'initializers': None,
              'labels': None,
              'managed_fields': None,
              'name': None,
              'namespace': None,
              'owner_references': None,
              'resource_version': None,
              'self_link': None,
              'uid': None},
 'spec': None,
 'status': {'conditions': None,
            'container_statuses': None,
            'host_ip': None,
            'init_container_statuses': None,
            'message': None,
            'nominated_node_name': None,
            'phase': None,
            'pod_ip': None,
            'qos_class': None,
            'reason': None,
            'start_time': None}}, 'raw_object': {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 422996437 (423036909)', 'reason': 'Expired', 'code': 410}})
Reason: None

[10.152.8.1:48900] [GET] [500] [9.816s] [admin] [513.0B] /api/health/minimal
[b'{"status": "500 Internal Server Error", "detail": "The server encountered an unexpected condition which prevented it from fulfilling the request.", "request_id": "ce8d0572-d6e8-4013-abc1-0e7bc5049f9c"}                                                                                                                                                                                                                                                                                  

From the BUG description I see that you are running ceph v17.2.6, which is the previous version. The latest version, v18.2.1, has several fixes, so it could potentially fix the issue you are experiencing, but it has this BUG with metrics, and at the moment users are relying on the image docker.io/rkachach/ceph:v18.2.1_patched_v1 until a stable ceph release with the fix is made available.

I'd recommend upgrading to the latest Rook release and using the above patched ceph image to see if that fixes your issue.
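
For illustration, pointing the cluster at that image is done through the CephCluster spec; this is a sketch only, with the name/namespace set to the usual Rook defaults and the rest of the existing spec left unchanged:

  apiVersion: ceph.rook.io/v1
  kind: CephCluster
  metadata:
    name: rook-ceph
    namespace: rook-ceph
  spec:
    cephVersion:
      # temporary patched image mentioned above, until a stable Ceph release with the fix is available
      image: docker.io/rkachach/ceph:v18.2.1_patched_v1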

Nickmman commented on September 21, 2024

@rkachach Thank you for your findings and comments. I have gone ahead and updated the deployment; I will be watching and testing to see if this resolves the issue. As far as the connections go, the only thing I can think of is the prometheus scraper, which shows up in the mgr logs frequently. Other than that, the dashboard isn't frequently accessed.

rkachach commented on September 21, 2024

@Nickmman please let me know if the upgrade has solved your issues or not.

Nickmman commented on September 21, 2024

@rkachach The issue seems to be resolved. I am now wondering about the resources used by the dashboard component, mainly CPU. Currently, the default values are:

  resources:
    mgr:
      limits:
        cpu: "1000m"
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "512Mi"

But based on watching this issue and the subsequent increase in resources, the mgr seems to be constantly maxing out its CPU during dashboard usage. Is this normal, or is the dashboard/mgr just using more resources than it should?
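
For comparison, the kind of increase described earlier (roughly doubling the CPU limit) would look like this; a sketch only, with the memory values left at their defaults:

  resources:
    mgr:
      limits:
        cpu: "2000m"       # doubled from the default 1000m
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "512Mi"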

rkachach commented on September 21, 2024

Good news that you don't see the issue anymore.

What do you mean by "maxing out"? Is it reaching more than 100% CPU usage?

Nickmman commented on September 21, 2024

Hi @rkachach, I think the issue is resolved with the upgrade; I am not seeing the increased CPU usage anymore. Going back to the version upgrade, has the patch been released upstream in a new helm chart version, or should I keep using the alternative image?
