Comments (17)
@rkachach I fixed the OSD/pool usage last night by recreating each of the OSDs, since they were not freeing up space from deleted block images. I hope I don't have to do this frequently, but I haven't come across any documentation about cleaning up the OSDs aside from recreating them.
Regarding the original issue, I have also increased the CPU limit to double (or more than double) what was assigned to the mgr pods. This seems to have resolved the issue, though it is still unclear why it needs so much processing power to present a dashboard. It is already reaching the CPU limit I have assigned.
Should this happen again, I will try to obtain the mgr logs (nothing jumps out aside from constant DBG messages about the PGs). I did run top, however, and the process consuming near 100% (and sometimes a full 100%) CPU is dashboard.
@Nickmman Thanks for the feedback, I'm glad that your setup is working after the upgrade. The patch has not been released yet, and it will take some time, as it depends on the next Ceph Reef release (v18.2.2).
Thanks @rkachach, closing this as the original issue was resolved.
@Nickmman please, can you have a look at issue #12877? The symptoms seem really similar, and it may be related to some issues that arise when SSL is enabled, but from your spec I can see that's not the case!
Hi @rkachach, I can confirm that I am not using rook with SSL enabled; rather, that is taken care of by nginx-ingress. This also would not explain the behavior when hitting the pod directly, which should return a plain HTTP reply. I can confirm, however, that I did notice a spike in CPU usage by the active mgr pod, see below:
Though this may be due to cleanup from deleting a block image (since I'm using RBD).
I look forward to seeing what else could be causing this issue.
I have confirmed that the increased CPU usage occurs only when accessing the dashboard: it is high again now that I have the dashboard open. It initially worked for a couple of pages but has started timing out again and is unusable.
@Nickmman I'm curious about the ingress configuration:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
Would this convert http traffic to https when directing it to the dashboard?
Can you set this option to false and see if that fixes your issue?
@rkachach This is just to redirect front-end users from HTTP to the HTTPS version of the endpoint, via a 308. It is not related to the backend endpoint, which is the mgr service. Backend traffic to the mgr service remains HTTP, as per nginx.ingress.kubernetes.io/backend-protocol: HTTP
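For reference, a minimal sketch of how these two annotations coexist on an ingress-nginx Ingress (the annotation names come from the thread above; the surrounding fields are illustrative, not taken from the actual spec):

```yaml
# Hypothetical ingress-nginx annotation excerpt: clients speaking HTTP are
# 308-redirected to HTTPS, while traffic to the mgr backend stays plain HTTP.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"      # client-facing redirect only
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"  # mgr service is plain HTTP
```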
I see, thanks for the clarification @Nickmman. BTW: your cluster has two important warnings:
- 2 nearfull osd(s)
- 2 pool(s) nearfull
It would be good to see whether the slowness persists after fixing them. Meanwhile, can you please attach the mgr logs (especially from when the dashboard starts going slow)?
It would also be useful to see which thread inside the mgr is consuming the CPU. For this, you have to:
- run a shell inside the mgr container
- run top
- switch to "thread" view by pressing the "H" key
This view should show which mgr module is causing the high CPU consumption.
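As a non-interactive alternative to top's "H" view, per-thread CPU time can be read straight from /proc (a sketch assuming a Linux /proc filesystem; it uses the current shell's own PID so it runs anywhere, and you would substitute the ceph-mgr PID inside the container):

```shell
# Print per-thread CPU time (utime + stime, in clock ticks) for a process,
# similar to what top's thread view shows. $$ is this shell's PID; replace
# it with the ceph-mgr PID when running inside the mgr container.
pid=$$
for task in /proc/"$pid"/task/*; do
  # /proc/<pid>/task/<tid>/stat fields: 1 = tid, 14 = utime, 15 = stime
  awk '{print $1, $14 + $15}' "$task/stat"
done
```

Sorting that output numerically on the second column would surface the hottest thread.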
Great, thanks. Can you please run the following from inside the mgr pod and report the output?
ss -tulpn
ss -anp
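To summarize that output by TCP state (which is what later turns out to matter), a small pipeline works; this is a sketch run against sample ss-style lines, since I can't run it against the actual pod, and the awk/sort/uniq part is what you would reuse:

```shell
# Count connections per TCP state from `ss -tan`-style output.
# On the mgr pod you would pipe the real command instead of this sample:
#   ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
printf 'State Recv-Q Send-Q Local:Port Peer:Port\nTIME-WAIT 0 0 10.0.0.1:7000 10.0.0.2:48900\nTIME-WAIT 0 0 10.0.0.1:7000 10.0.0.3:48901\nESTAB 0 0 10.0.0.1:7000 10.0.0.4:48902\n' \
  | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
# prints the TIME-WAIT count first (2 here), then 1 ESTAB
```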
@rkachach Output is attached. The dashboard is timing out again; I've attached the logs from the active mgr as well. I've created a zip file containing both, since one of them is pretty large.
rookmgr-output-and-logs.zip
Hi @Nickmman, thank you for providing the logs. I can see there are a lot of open connections (> 1000) left in TIME_WAIT, probably causing high CPU consumption on the server side. Some exceptions are being raised, such as:
Internal Server Error
Traceback (most recent call last):
File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47, in dashboard_exception_handler
return handler(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/_base_controller.py", line 258, in inner
ret = func(*args, **kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 293, in minimal
return self.health_minimal.all_health()
File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 150, in all_health
result['hosts'] = self.host_count()
File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 213, in host_count
return len(get_hosts())
File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 204, in get_hosts
return merge_hosts_by_hostname(ceph_hosts, orch.hosts.list())
File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 161, in merge_hosts_by_hostname
host['hostname'], host['services'])
File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 170, in populate_service_instances
for d in orch.services.list_daemons(hostname=hostname)))
File "/usr/share/ceph/mgr/dashboard/services/orchestrator.py", line 39, in inner
raise_if_exception(completion)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in raise_if_exception
raise e
I also see a couple of exceptions like the following, but TBH I don't really know what's wrong with your pods that is causing the crash:
Internal Server Error
Traceback (most recent call last):
File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 47, in dashboard_exception_handler
return handler(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/_base_controller.py", line 258, in inner
ret = func(*args, **kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 293, in minimal
return self.health_minimal.all_health()
File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 150, in all_health
result['hosts'] = self.host_count()
File "/usr/share/ceph/mgr/dashboard/controllers/health.py", line 213, in host_count
return len(get_hosts())
File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 204, in get_hosts
return merge_hosts_by_hostname(ceph_hosts, orch.hosts.list())
File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 161, in merge_hosts_by_hostname
host['hostname'], host['services'])
File "/usr/share/ceph/mgr/dashboard/controllers/host.py", line 170, in populate_service_instances
for d in orch.services.list_daemons(hostname=hostname)))
File "/usr/share/ceph/mgr/dashboard/services/orchestrator.py", line 39, in inner
raise_if_exception(completion)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in raise_if_exception
raise e
kubernetes.client.rest.ApiException: ({'type': 'ERROR', 'object': {'api_version': 'v1',
'kind': 'Status',
'metadata': {'annotations': None,
'cluster_name': None,
'creation_timestamp': None,
'deletion_grace_period_seconds': None,
'deletion_timestamp': None,
'finalizers': None,
'generate_name': None,
'generation': None,
'initializers': None,
'labels': None,
'managed_fields': None,
'name': None,
'namespace': None,
'owner_references': None,
'resource_version': None,
'self_link': None,
'uid': None},
'spec': None,
'status': {'conditions': None,
'container_statuses': None,
'host_ip': None,
'init_container_statuses': None,
'message': None,
'nominated_node_name': None,
'phase': None,
'pod_ip': None,
'qos_class': None,
'reason': None,
'start_time': None}}, 'raw_object': {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'too old resource version: 422996437 (423036909)', 'reason': 'Expired', 'code': 410}})
Reason: None
[10.152.8.1:48900] [GET] [500] [9.816s] [admin] [513.0B] /api/health/minimal
[b'{"status": "500 Internal Server Error", "detail": "The server encountered an unexpected condition which prevented it from fulfilling the request.", "request_id": "ce8d0572-d6e8-4013-abc1-0e7bc5049f9c"}
From the bug description I see that you are running Ceph v17.2.6, which is the previous version. The latest version, v18.2.1, has several fixes, so it could potentially fix the issue you are experiencing, but it has a bug with metrics, and at this moment users are relying on the image docker.io/rkachach/ceph:v18.2.1_patched_v1 until a stable Ceph release with the fix is made available.
I'd recommend upgrading to the latest Rook release and using the above patched Ceph image to see if that fixes your issue.
@rkachach Thank you for your findings and comment. I have gone ahead and updated the deployment, I will be watching and testing to see if this resolves the issue. As far as the connections go, the only thing I can think of is the prometheus scraper, which shows up in the mgr logs frequently. Other than that, the dashboard isn't frequently accessed.
@Nickmman please let me know whether the upgrade has solved your issues or not.
@rkachach The issue seems to be resolved. I am now questioning the resources used by the dashboard component, mainly CPU. Currently, the default values are:
resources:
  mgr:
    limits:
      cpu: "1000m"
      memory: "1Gi"
    requests:
      cpu: "500m"
      memory: "512Mi"
But watching this issue and the subsequent increase of resources, it seems that the mgr is constantly maxing out during usage. Is this normal, or is the dashboard/mgr just using more resources than it should?
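For context, in a Rook CephCluster custom resource these values live under spec.resources; a sketch of where a raised mgr limit would sit (the metadata values and the 2000m figure are illustrative, chosen to match the "double the limit" experiment described earlier, not taken from the actual spec):

```yaml
# Hypothetical excerpt of a Rook CephCluster spec showing where the mgr
# resource limits/requests above are set (other fields omitted).
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  resources:
    mgr:
      limits:
        cpu: "2000m"     # doubled from the 1000m default, as tried above
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "512Mi"
```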
Good news that you don't see the issue anymore.
What do you mean by "maxing out"? Is it reaching more than 100% of CPU usage?
Hi @rkachach, I think that with the upgrade the issue is resolved; I am not seeing the increased CPU usage anymore. Going back to the version upgrade: has the patch been released upstream in a new Helm chart version, or should I keep using the alternative image?