TheJJ / ceph-balancer
An alternative Ceph placement optimizer, aiming for maximum storage capacity through equal OSD utilization.
License: GNU General Public License v3.0
I have access to a cluster created long ago and then expanded by adding new OSDs. I found that, in order to balance it properly, I had to add `--osdsize device balance --osdused delta`. Otherwise, its idea of how full an OSD is disagrees with what `ceph osd df` says, and disagrees differently for different OSDs.
Today, with the help of my colleagues, we root-caused this: old OSDs have `bluestore_min_alloc_size=65536`, while new ones have `bluestore_min_alloc_size=4096`. It means that the average per-object overhead is different. This overhead is what makes the sum of PG sizes (i.e., the sum of all stored object sizes) different from the used space on the OSD.
Please assume by default that each stored object comes with an overhead of `bluestore_min_alloc_size / 2`, and take this into account when figuring out how much space would be used or freed by PG moves. On Ceph 17.2.7 and later, you can get this value from `ceph osd metadata`.
For example, an OSD that has a total of 56613739 objects in all PGs would have 1.7 TB of overhead with `bluestore_min_alloc_size=65536`, but only about 100 GB of overhead with `bluestore_min_alloc_size=4096`.
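The suggested rule of thumb can be sketched as follows (the helper name is mine, not from the script):

```python
def estimated_alloc_overhead(num_objects: int, min_alloc_size: int) -> int:
    """Expected extra bytes used on disk, beyond the stored object sizes:
    on average each object wastes half an allocation unit."""
    return num_objects * min_alloc_size // 2

objs = 56613739
print(estimated_alloc_overhead(objs, 65536))  # ~1.86e12 B, i.e. ~1.7 TiB
print(estimated_alloc_overhead(objs, 4096))   # ~1.16e11 B, i.e. ~108 GiB
```

This reproduces the 1.7 TB vs ~100 GB figures above.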
Here is `ceph osd df` (please ignore the first bunch of OSDs with only 0.75% utilization; they are outside of the CRUSH root, waiting for an "ok" to be placed in the proper hierarchy):
ceph-osd-df.txt
Here is `ceph pg ls-by-osd 221` (this one was redeployed recently, so it has `bluestore_min_alloc_size=4096`):
ceph-pg-ls-by-osd-221.txt
Here is `ceph pg ls-by-osd 223`:
ceph-pg-ls-by-osd-223.txt
As you can see, these two OSDs have almost the same size and almost the same number of PGs (differing only by 1), yet their utilization differs by 1.9 TB, which roughly matches the overhead calculation presented above.
Sorry, I am not allowed to post the full osdmap.
P.S. I am also going to file the same bug against the built-in Ceph balancer.
Added two new storage nodes to a cluster that was completely balanced by placementoptimizer.py.
Performed upmap-remapped.py to set all PGs to active+clean:
https://github.com/HeinleinSupport/cern-ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
Expected subsequent runs of placementoptimizer.py to shift data to the new nodes.
[2022-10-05 14:40:45,104] BAD => osd.350 already has too many of pool=1 (84 >= 79.38328132674108)
[2022-10-05 14:40:45,104] TRY-1 move 1.33d3 osd.154 => osd.154 (1256/1256)
[2022-10-05 14:40:45,104] BAD move to source OSD makes no sense
[2022-10-05 14:40:45,104] SKIP pg 1.1a18 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.3a7e since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.7d8 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.2d59 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.152d since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.1b63 since pool (1) can't be balanced more
...
Instead: the entire pool is blacklisted and no further optimisations can be found.
Disabling this code fixes the issue:
ceph-balancer/placementoptimizer.py
Line 2041 in d0cd6a8
I'm wondering: what is the purpose of `unsuccessful_pools`?
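To make the report concrete, here is a sketch of the behaviour the log above suggests (this is my reconstruction, not the actual code at that line): a single failed move candidate blacklists the whole pool, so every later PG of that pool is skipped even if it has a valid move.

```python
unsuccessful_pools = set()  # pools blacklisted for the rest of the run

def consider_pg_move(pg, pool_id, find_move):
    """Return a candidate move for pg, or None; blacklists the pool on failure."""
    if pool_id in unsuccessful_pools:
        return None  # logged as: SKIP pg ... since pool (...) can't be balanced more
    move = find_move(pg)
    if move is None:
        # one failed PG marks the *whole* pool as unbalanceable
        unsuccessful_pools.add(pool_id)
    return move

# one bad candidate for pool 1 ...
consider_pg_move("1.33d3", 1, lambda pg: None)
# ... and every later PG of pool 1 is skipped, even with a valid move available
print(consider_pg_move("1.1a18", 1, lambda pg: ("osd.350", "osd.999")))  # None
```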
On a fairly large cluster running Ceph 15.2.16, `placementoptimizer.py balance` generates `pg-upmap-items` that violate the CRUSH failure-domain constraint by trying to map multiple OSDs of an EC PG to the same rack. Any ideas on what I am doing wrong would be greatly appreciated.
[root@ceph-admin ~]# ./placementoptimizer.py show
poolid name type size min pg_num stored used avail shrdsize crush
10 fs.meta.mirror repl 3 2 1024 2.1G 2.1G 1.2T 2.1M 6:replicated_chassis_mddm default~mddm*1.000
11 fs.meta.user repl 3 2 1024 2.7G 2.7G 1.2T 2.7M 7:replicated_chassis_mddu default~mddu*1.000
12 fs.data.mirror.ssd.rep repl 3 2 8192 0.0B 0.0B 267.4T 0.0B 1:replicated_chassis_ssd default~ssd*1.000
13 fs.data.mirror.nvme.rep repl 3 2 2048 0.0B 0.0B 53.4T 0.0B 3:replicated_chassis_nvme default~nvme*1.000
14 fs.data.user.ssd.rep repl 3 2 8192 94.4T 94.4T 267.4T 11.8G 1:replicated_chassis_ssd default~ssd*1.000
15 fs.data.user.nvme.rep repl 3 2 2048 0.0B 0.0B 53.4T 0.0B 3:replicated_chassis_nvme default~nvme*1.000
16 fs.data.mirror.ssd.ec ec6+2 8 7 4096 356.8T 356.8T 601.6T 14.9G 2:fs.data.mirror.ssd.ec default~ssd*1.000
17 fs.data.mirror.hdd.ec ec6+2 8 7 4096 1.2P 1.2P 315.0T 53.1G 4:fs.data.mirror.hdd.ec default~hdd*1.000
18 fs.data.user.hdd.ec ec6+2 8 7 4096 3.5G 3.5G 315.0T 148.3K 5:fs.data.user.hdd.ec default~hdd*1.000
20 device_health_metrics repl 3 2 32 3.1G 3.1G 53.4T 99.2M 3:replicated_chassis_nvme default~nvme*1.000
21 fs.data.user.ssd.ec ec6+2 8 7 4096 82.5T 82.5T 601.6T 3.4G 0:fs.data.user.ssd.ec default~ssd*1.000
default~mddm 2.07G 2.07G 1.625%
default~mddu 2.73G 2.73G 1.708%
default~ssd 533.62T 533.62T 40.830%
default~nvme 3.10G 3.10G 0.157%
default~hdd 1.24P 1.24P 51.871%
sum 1.76P 1.76P
While 128 PGs are active+remapped+backfilling from the mgr balancer, I tried running the JJ balancer to see if it can do a better job, since I currently have OSDs ranging from 22% to 82% %USE. However, the `pg-upmap` items fail the CRUSH failure-domain constraint by trying to use two OSDs in the same rack:
[root@ceph1 ~]# ./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps.2
[2022-06-25 15:28:31,297] gathering cluster state via ceph api...
[2022-06-25 15:29:10,227] running pg balancer
[2022-06-25 15:29:10,506] current OSD fill rate per crushclasses:
[2022-06-25 15:29:10,506] mddu: average=0.20%, median=0.20%, without_placement_constraints=1.67%
[2022-06-25 15:29:10,506] mddm: average=0.15%, median=0.15%, without_placement_constraints=1.58%
[2022-06-25 15:29:10,506] nvme: average=0.01%, median=0.00%, without_placement_constraints=0.15%
[2022-06-25 15:29:10,509] ssd: average=44.80%, median=44.48%, without_placement_constraints=40.82%
[2022-06-25 15:29:10,510] hdd: average=93.10%, median=58.60%, without_placement_constraints=51.84%
[2022-06-25 15:29:10,510] smr: average=0.00%, median=0.00%, without_placement_constraints=0.05%
[2022-06-25 15:29:10,517] cluster variance for crushclasses:
[2022-06-25 15:29:10,517] mddu: 0.000
[2022-06-25 15:29:10,517] mddm: 0.000
[2022-06-25 15:29:10,517] nvme: 0.000
[2022-06-25 15:29:10,517] ssd: 6.502
[2022-06-25 15:29:10,517] hdd: 3915.508
[2022-06-25 15:29:10,517] smr: 0.000
[2022-06-25 15:29:10,517] min osd.253 0.000%
[2022-06-25 15:29:10,517] max osd.2323 385.361%
[2022-06-25 15:29:10,517] osd.2353 has calculated usage >= 100%: 100.63859057273221%
[2022-06-25 15:29:10,517] osd.447 has calculated usage >= 100%: 100.95507949049384%
...
[2022-06-25 15:29:12,803] osd.1960 has calculated usage >= 100%: 288.98203225266775%
[2022-06-25 15:29:12,803] osd.2323 has calculated usage >= 100%: 289.00484092949876%
[2022-06-25 15:29:12,803] enough remaps found
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] generated 10 remaps.
[2022-06-25 15:29:12,803] total movement size: 532.0G.
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] old cluster variance per crushclass:
[2022-06-25 15:29:12,803] mddu: 0.000
[2022-06-25 15:29:12,803] mddm: 0.000
[2022-06-25 15:29:12,803] nvme: 0.000
[2022-06-25 15:29:12,803] ssd: 6.502
[2022-06-25 15:29:12,803] hdd: 3915.508
[2022-06-25 15:29:12,803] smr: 0.000
[2022-06-25 15:29:12,803] old min osd.253 0.000%
[2022-06-25 15:29:12,803] old max osd.2323 385.361%
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] new min osd.253 0.000%
[2022-06-25 15:29:12,803] new max osd.2323 289.005%
[2022-06-25 15:29:12,803] new cluster variance:
[2022-06-25 15:29:12,803] mddu: 0.000
[2022-06-25 15:29:12,804] mddm: 0.000
[2022-06-25 15:29:12,804] nvme: 0.000
[2022-06-25 15:29:12,804] ssd: 6.502
[2022-06-25 15:29:12,804] hdd: 3655.503
[2022-06-25 15:29:12,804] smr: 0.000
[2022-06-25 15:29:12,804] --------------------------------------------------------------------------------
ceph osd pg-upmap-items 17.966 2323 2128
ceph osd pg-upmap-items 17.ce8 1986 2128
ceph osd pg-upmap-items 17.be4 2323 2128
ceph osd pg-upmap-items 17.55e 1986 2128
ceph osd pg-upmap-items 17.afb 2397 2099
ceph osd pg-upmap-items 17.67a 1982 2128
ceph osd pg-upmap-items 17.8c4 2177 2099
ceph osd pg-upmap-items 17.1c3 2215 2127
ceph osd pg-upmap-items 17.450 2313 2128
ceph osd pg-upmap-items 17.b03 2397 2099
[root@ceph-admin ~]# ceph osd pg-upmap-items 17.b03 2397 2099
set 17.b03 pg_upmap_items mapping to [2397->2099]
[root@ceph1 ceph]# tail -f /var/log/ceph/ceph-mon.ceph1.log
...
2022-06-25T15:30:31.076-0700 7fe8a0f67700 -1 verify_upmap multiple osds 2099,2141 come from same failure domain -4487
2022-06-25T15:30:31.076-0700 7fe8a0f67700 0 check_pg_upmaps verify_upmap of pg 17.b03 returning -22
which is true, since both osd.2099 and osd.2141 are in the same rack:
[root@ceph-admin ~]# ceph osd find 2099 | grep rack
"rack": "s14",
[root@ceph-admin ~]# ceph osd find 2141 | grep rack
"rack": "s14",
And for reference, here is the PG the balancer attempted to modify:
[root@ceph-admin ~]# ceph pg dump | awk '$1 == "17.b03"'
dumped all
17.b03 82138 0 0 0 0 343693292761 0 0 2150 2150 active+clean 2022-06-22T21:31:45.461326-0700 706729'221745 706729:3645914 [2343,1947,2496,2397,2210,2141,465,1839] 2343 [2343,1947,2496,2397,2210,2141,465,1839] 2343 671601'208723 2022-06-22T21:31:45.461159-0700 671601'208723 2022-06-22T21:31:45.461159-0700 0
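The violation can be checked mechanically. This is a sketch assuming an osd-to-rack mapping (e.g. built from `ceph osd find`); all rack names except s14 are invented for the example:

```python
from collections import Counter

def violates_failure_domain(acting_osds, rack_of):
    """True if two shards of the PG land in the same rack."""
    counts = Counter(rack_of[osd] for osd in acting_osds)
    return any(n > 1 for n in counts.values())

# only the s14 entries are from the report; the other racks are placeholders
rack_of = {2343: "s01", 1947: "s02", 2496: "s03", 2397: "s04",
           2210: "s05", 2141: "s14", 465: "s06", 1839: "s07", 2099: "s14"}

before = [2343, 1947, 2496, 2397, 2210, 2141, 465, 1839]
after = [2343, 1947, 2496, 2099, 2210, 2141, 465, 1839]  # upmap 2397 -> 2099
print(violates_failure_domain(before, rack_of))  # False
print(violates_failure_domain(after, rack_of))   # True: 2099 and 2141 share s14
```

This is essentially what the monitor's `verify_upmap` complained about in the log above.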
Hi,
on Ceph Quincy 17.2.7, with an EC pool using this CRUSH rule:
{
"rule_id": 10,
"rule_name": "ec33hdd_rule",
"type": 3,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -2,
"item_name": "default~hdd"
},
{
"op": "choose_indep",
"num": 3,
"type": "datacenter"
},
{
"op": "choose_indep",
"num": 2,
"type": "osd"
},
{
"op": "emit"
}
]
}
EC profile:
crush-device-class=hdd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=3
plugin=jerasure
technique=reed_sol_van
w=8
I originally had PGs distributed over 2 OSDs per DC, but after running this balancer I found that this distribution is broken for a lot of PGs: some DCs now hold 3 OSDs while another holds only 1.
It looks to me like it's ignoring the custom CRUSH rule for EC pools.
It is also strange that `pg-upmap-items` allows this; according to the docs, it shouldn't apply if it breaks the CRUSH rule.
Let me know if you need more details to debug this; for now, I wrote a little script to fix the issue on my cluster.
Thank you!
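For reference, my fix script essentially verifies the invariant implied by the rule above (3 datacenters, `choose_indep 2 osd` each). A minimal sketch, assuming an osd-to-datacenter mapping (the mapping below is made up):

```python
from collections import Counter

def dc_distribution_ok(shard_osds, dc_of, num_dcs=3, per_dc=2):
    """True if the k+m=6 EC shards land exactly per_dc per datacenter."""
    counts = Counter(dc_of[osd] for osd in shard_osds)
    return len(counts) == num_dcs and all(n == per_dc for n in counts.values())

dc_of = {1: "dc1", 2: "dc1", 7: "dc1", 3: "dc2", 4: "dc2", 5: "dc3", 6: "dc3"}
print(dc_distribution_ok([1, 2, 3, 4, 5, 6], dc_of))  # True: 2+2+2
print(dc_distribution_ok([1, 2, 7, 4, 5, 6], dc_of))  # False: 3 in dc1, 1 in dc2
```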
# ./placementoptimizer.py showremapped
Traceback (most recent call last):
File "/root/./placementoptimizer.py", line 5403, in <module>
exit(main())
File "/root/./placementoptimizer.py", line 5358, in main
state = ClusterState(args.state, osdsize_method=osdsize_method)
File "/root/./placementoptimizer.py", line 720, in __init__
self.load(statefile)
File "/root/./placementoptimizer.py", line 768, in load
raise RuntimeError("Cluster topology changed during information gathering (e.g. a pg changed state). "
RuntimeError: Cluster topology changed during information gathering (e.g. a pg changed state). Wait for things to calm down and try again
Surely this precaution against cluster topology changes does not make sense for read-only, purely informational subcommands such as `showremapped`?
I would like to run this very useful script on a cluster where the Ceph client can currently only be invoked via cephadm, i.e. `sudo cephadm shell ceph`.
Would you consider adding an option to handle that situation cleanly?
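For illustration, something along these lines would work for me (the helper and flag names are just placeholders, not a proposal for the actual interface): prefix every ceph invocation with `cephadm shell --`.

```python
import subprocess

def build_ceph_cmd(*args, via_cephadm=False):
    """Build the argv for a ceph call, optionally wrapped in `cephadm shell`."""
    prefix = ["sudo", "cephadm", "shell", "--"] if via_cephadm else []
    return [*prefix, "ceph", *args]

def ceph_output(*args, via_cephadm=False):
    return subprocess.check_output(build_ceph_cmd(*args, via_cephadm=via_cephadm))

print(build_ceph_cmd("osd", "df", via_cephadm=True))
```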
Hello!
I would like to use the JSON output of the `show` command to parse shard sizes. Running `python3 placementoptimizer.py show --format json` prints the output, but at the end there is an error:
[...]
"pools_acting": [1, 2, 3, 6, 9, 18], "pg_count_up": {"2": 41, "3": 23, "9": 2, "1": 5, "6": 2, "18": 1}, "pg_count_acting": {"2": 41, "3": 23, "9": 2, "1": 5, "6": 2, "18": 1}, "pg_num_up": 74, "pgs_up": Traceback (most recent call last):
File "placementoptimizer.py", line 5086, in <module>
exit(main())
File "placementoptimizer.py", line 5063, in main
show(args, state)
File "placementoptimizer.py", line 4850, in show
json.dump(ret, sys.stdout)
File "/usr/lib/python3.8/json/__init__.py", line 179, in dump
for chunk in iterable:
File "/usr/lib/python3.8/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/usr/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/usr/lib/python3.8/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/usr/lib/python3.8/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type set is not JSON serializable
At the same time, the standard plain-text output works fine.
Python version: Python 3.8.10 (default, Nov 14 2022, 12:59:47)
Ceph version: Quincy 17.2.5
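The error can be reproduced and worked around with plain `json` (the PG ids below are invented; the point is only the `set` value, as in the traceback). Passing a `default=` hook that turns sets into sorted lists would be one way to fix the dump:

```python
import json

data = {"pg_num_up": 74, "pgs_up": {266, 267, 268}}  # a set, as in the traceback
try:
    json.dumps(data)
except TypeError as e:
    print(e)  # Object of type set is not JSON serializable

# converting sets to sorted lists via default= makes the dump succeed
out = json.dumps(data, default=lambda o: sorted(o) if isinstance(o, set) else str(o))
print(out)
```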
Hi,
I have one Ceph cluster with a high STDDEV number, taken from the `ceph osd df tree class hdd` command (I didn't include the whole output here, as there are hundreds of OSDs):
-34 566.02417 - 515 TiB 423 TiB 422 TiB 50 MiB 1.2 TiB 91 TiB 82.24 1.00 - host stg4-osd8
11 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 2.9 MiB 19 GiB 1019 GiB 86.32 1.05 87 up osd.11
12 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.5 MiB 18 GiB 1019 GiB 86.33 1.05 87 up osd.12
56 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 1017 KiB 19 GiB 1019 GiB 86.33 1.05 87 up osd.56
58 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.8 MiB 19 GiB 1020 GiB 86.31 1.05 87 up osd.58
66 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 330 KiB 18 GiB 1023 GiB 86.27 1.05 87 up osd.66
81 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 492 KiB 18 GiB 1018 GiB 86.33 1.05 87 up osd.81
94 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.6 MiB 18 GiB 1021 GiB 86.30 1.05 87 up osd.94
123 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 246 KiB 17 GiB 1022 GiB 86.28 1.05 87 up osd.123
143 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.8 MiB 18 GiB 1022 GiB 86.28 1.05 87 up osd.143
151 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 847 KiB 19 GiB 1021 GiB 86.30 1.05 87 up osd.151
187 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.4 MiB 18 GiB 1020 GiB 86.31 1.05 87 up osd.187
250 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.76 0.96 195 up osd.250
264 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.76 0.96 195 up osd.264
282 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.76 0.96 195 up osd.282
297 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.77 0.96 195 up osd.297
318 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.75 0.96 195 up osd.318
333 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.77 0.96 195 up osd.333
349 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.4 TiB 79.16 0.96 196 up osd.349
362 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.4 TiB 79.14 0.96 196 up osd.362
381 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 39 GiB 3.4 TiB 79.17 0.96 196 up osd.381
399 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 38 GiB 3.5 TiB 78.78 0.96 195 up osd.399
415 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.76 0.96 195 up osd.415
435 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.75 0.96 195 up osd.435
463 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 3.0 MiB 41 GiB 2.3 TiB 85.94 1.05 195 up osd.463
467 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 2.6 MiB 41 GiB 2.2 TiB 86.35 1.05 196 up osd.467
480 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.4 TiB 79.16 0.96 196 up osd.480
503 hdd 14.00052 1.00000 13 TiB 11 TiB 11 TiB 3.1 MiB 33 GiB 1.8 TiB 86.14 1.05 152 up osd.503
519 hdd 16.00090 1.00000 15 TiB 13 TiB 13 TiB 1.2 MiB 35 GiB 2.0 TiB 86.24 1.05 174 up osd.519
539 hdd 16.00090 1.00000 15 TiB 13 TiB 13 TiB 3.0 MiB 36 GiB 2.0 TiB 86.29 1.05 174 up osd.539
562 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 5.2 MiB 42 GiB 2.2 TiB 86.38 1.05 196 up osd.562
573 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 4.4 MiB 40 GiB 2.2 TiB 86.37 1.05 196 up osd.573
589 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 1.1 MiB 42 GiB 2.3 TiB 85.95 1.05 195 up osd.589
606 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 2.2 MiB 42 GiB 2.3 TiB 85.94 1.05 195 up osd.606
610 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 615 KiB 40 GiB 2.3 TiB 85.95 1.05 195 up osd.610
645 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.78 0.96 195 up osd.645
658 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.76 0.96 195 up osd.658
673 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.4 TiB 79.15 0.96 196 up osd.673
681 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.77 0.96 195 up osd.681
-37 566.02417 - 515 TiB 423 TiB 422 TiB 61 MiB 1.2 TiB 92 TiB 82.21 1.00 - host stg4-osd9
13 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 364 KiB 18 GiB 1021 GiB 86.30 1.05 87 up osd.13
14 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 1013 KiB 19 GiB 1020 GiB 86.31 1.05 87 up osd.14
54 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 1.7 MiB 18 GiB 1022 GiB 86.28 1.05 87 up osd.54
57 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.3 MiB 18 GiB 1020 GiB 86.31 1.05 87 up osd.57
68 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.6 MiB 18 GiB 1024 GiB 86.26 1.05 87 up osd.68
82 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 1.6 MiB 18 GiB 1022 GiB 86.28 1.05 87 up osd.82
93 hdd 16.00090 1.00000 15 TiB 13 TiB 13 TiB 5.0 MiB 36 GiB 2.0 TiB 86.28 1.05 174 up osd.93
95 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.0 MiB 18 GiB 1022 GiB 86.28 1.05 87 up osd.95
109 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 1.2 MiB 19 GiB 1020 GiB 86.32 1.05 87 up osd.109
122 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.2 MiB 18 GiB 1023 GiB 86.28 1.05 87 up osd.122
136 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.2 MiB 18 GiB 1019 GiB 86.33 1.05 87 up osd.136
147 hdd 8.00156 1.00000 7.3 TiB 6.3 TiB 6.3 TiB 3.8 MiB 18 GiB 1020 GiB 86.32 1.05 87 up osd.147
246 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.4 TiB 79.15 0.96 196 up osd.246
276 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.77 0.96 195 up osd.276
286 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.4 TiB 79.15 0.96 196 up osd.286
310 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.78 0.96 195 up osd.310
319 hdd 14.00052 1.00000 13 TiB 11 TiB 11 TiB 5.3 MiB 33 GiB 1.8 TiB 86.15 1.05 152 up osd.319
327 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.77 0.96 195 up osd.327
342 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.78 0.96 195 up osd.342
375 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.76 0.96 195 up osd.375
391 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 39 GiB 3.5 TiB 78.78 0.96 195 up osd.391
406 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 37 GiB 3.5 TiB 78.78 0.96 195 up osd.406
423 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 38 GiB 3.4 TiB 79.16 0.96 196 up osd.423
425 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.79 0.96 195 up osd.425
443 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.4 TiB 79.15 0.96 196 up osd.443
469 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 3.4 MiB 42 GiB 2.3 TiB 85.93 1.05 195 up osd.469
470 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 3.4 MiB 41 GiB 2.3 TiB 85.93 1.05 195 up osd.470
486 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.77 0.96 195 up osd.486
533 hdd 16.00090 1.00000 15 TiB 13 TiB 13 TiB 989 KiB 35 GiB 2.0 TiB 86.27 1.05 174 up osd.533
552 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 3.8 MiB 41 GiB 2.3 TiB 85.95 1.05 195 up osd.552
578 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 3.2 MiB 43 GiB 2.3 TiB 85.95 1.05 195 up osd.578
590 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 3.7 MiB 40 GiB 2.3 TiB 85.94 1.05 195 up osd.590
617 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 2.2 MiB 40 GiB 2.2 TiB 86.37 1.05 196 up osd.617
621 hdd 18.00020 1.00000 16 TiB 14 TiB 14 TiB 3.6 MiB 41 GiB 2.3 TiB 85.94 1.05 195 up osd.621
631 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.76 0.96 195 up osd.631
646 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.75 0.96 195 up osd.646
663 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 35 GiB 3.5 TiB 78.75 0.96 195 up osd.663
678 hdd 18.00020 1.00000 16 TiB 13 TiB 13 TiB 0 B 36 GiB 3.5 TiB 78.80 0.96 195 up osd.678
TOTAL 8.0 PiB 6.6 PiB 6.6 PiB 846 MiB 19 TiB 1.4 PiB 82.20
MIN/MAX VAR: 0.96/1.05 STDDEV: 3.72
The HDD sizes used there are 8 TB, 16 TB and 18 TB. The failure domain is host, and all OSDs are members of only one EC pool.
As you can see, there is a high difference between the min and max VAR numbers even though the PG count looks pretty well distributed.
I started using your balancer to try to solve the issue (to get a lower STDDEV, i.e. equal OSD utilization), but with no luck: it doesn't make any proposal for moving any PG to another OSD.
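To illustrate with three rows from the `ceph osd df tree` excerpt above (weights rounded): the PG count per unit of CRUSH weight is already nearly identical across drive sizes, so the remaining %USE spread must come from unequal PG sizes rather than PG counts.

```python
# (name, CRUSH weight in TiB (rounded), PG count, %USE) from the excerpt above
osds = [
    ("osd.11",  8.0,  87, 86.32),
    ("osd.250", 18.0, 195, 78.76),
    ("osd.463", 18.0, 195, 85.94),
]
for name, weight, pgs, use in osds:
    print(f"{name}: pgs/weight = {pgs / weight:.2f}, %USE = {use:.2f}")
```

All three OSDs carry about 10.8 PGs per TiB of weight, yet their utilization differs by more than 7 points.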
I've tried it with the following options:
/usr/bin/placementoptimizer.py -v balance --max-pg-moves 1000 --ensure-variance-decrease --only-pool cephfs_data --max-full-move-attempts 100 --allow-move-below-target-pgcount
Any advice on how to run it so that the OSDs are utilized evenly?
Thank you
While trying to rebalance an especially broken cluster, my colleague found this exception:
# ./placementoptimizer.py --osdsize device balance --osdused delta --max-pg-moves 50 --osdfrom fullest
Traceback (most recent call last):
File "./placementoptimizer.py", line 5475, in <module>
exit(main())
File "./placementoptimizer.py", line 5470, in main
run()
File "./placementoptimizer.py", line 5434, in <lambda>
run = lambda: balance(args, state)
File "./placementoptimizer.py", line 4600, in balance
need_simulation=True)
File "./placementoptimizer.py", line 3260, in __init__
self.init_analyzer.analyze(self)
File "./placementoptimizer.py", line 4264, in analyze
self._update_stats()
File "./placementoptimizer.py", line 4350, in _update_stats
self.cluster_variance = self.pg_mappings.get_cluster_variance()
File "./placementoptimizer.py", line 3771, in get_cluster_variance
for crushclass, usages in self.get_class_osd_usages().items():
File "./placementoptimizer.py", line 3509, in get_class_osd_usages
ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
File "./placementoptimizer.py", line 3509, in <dictcomp>
ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
File "./placementoptimizer.py", line 3757, in get_osd_usage
used = self.get_osd_usage_size(osdid, add_size)
File "./placementoptimizer.py", line 3714, in get_osd_usage_size
used += self.cluster.osd_transfer_remainings[osdid]
KeyError: 66
Note that osd.66 is the only OSD which has the hdd_test class:
$ ceph osd tree | grep test
66 hdd_test 14.55269 osd.66 up 1.00000 1.00000
As we are not permitted to publicly post anything containing UUIDs that can be used to identify the customer's cluster, I am going to submit the debug info via private email.
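The failure mode itself is easy to reproduce in isolation: osd.66 (the one with the `hdd_test` class) is simply absent from the per-OSD dict, so plain indexing raises. The dict contents below are invented; whether defaulting to 0 is the right fix is for the author to judge:

```python
osd_transfer_remainings = {0: 0, 1: 4096}  # osd 66 absent, as on our cluster
osdid = 66
used = 0
try:
    used += osd_transfer_remainings[osdid]
except KeyError as e:
    print(f"KeyError: {e}")  # KeyError: 66

# a defensive variant that treats missing OSDs as having no pending transfers
used += osd_transfer_remainings.get(osdid, 0)
print(used)  # 0
```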
./placementoptimizer.py -v balance
[2023-01-26 08:42:54,343] gathering cluster state via ceph api...
[2023-01-26 08:43:06,262] running pg balancer
[2023-01-26 08:43:06,267] current OSD fill rate per crushclasses:
[2023-01-26 08:43:06,268] hdd: average=66.15%, median=60.57%, crushclass_usage=64.46%
[2023-01-26 08:43:06,268] cluster variance for crushclasses:
[2023-01-26 08:43:06,269] hdd: 284.034
[2023-01-26 08:43:06,269] min osd.2 56.273%
[2023-01-26 08:43:06,269] max osd.12 121.969%
[2023-01-26 08:43:06,269] osd.12 has calculated usage >= 100%: 121.9685%
[2023-01-26 08:43:06,270] osd.3 is source osd in pg <__main__.PGMoveChecker object at 0x7f2f9d773e20>
[2023-01-26 08:43:06,270] self.pg_osds is [7, 13, 10, 12, 1, 6]
Traceback (most recent call last):
File "/root/install/ceph-balancer/./placementoptimizer.py", line 2138, in <module>
pool_pg_count_ideal = pg_mappings.pool_pg_count_ideal(pg_pool, try_pg_move.get_osd_candidates(osd_from))
File "/root/install/ceph-balancer/./placementoptimizer.py", line 890, in get_osd_candidates
root_name = self.root_names[pg_osd_idx]
IndexError: list index out of range
I added a little bit of debugging to determine that `osd_from` is not in `pg_osd_idx`.
186 active+clean
30 active+remapped+backfilling
4 active+clean+scrubbing+deep
4 active+remapped+backfill_toofull
1 active+clean+scrubbing
Is the remapped+backfill_toofull state breaking this? What additional debug output should I try?
We have some CRUSH rules that originate at datacenter buckets rather than root buckets, and these are filtered out at https://github.com/TheJJ/ceph-balancer/blob/master/placementoptimizer.py#L1932
Hi! I have used this balancer to great success on my cluster. Lately I reformatted some OSDs and moved their DB to an NVMe device. Since then I get the following error:
./placementoptimizer.py -v balance --max-pg-moves 20 --max-move-attempts=20 | tee balance-upmaps
Traceback (most recent call last):
File "/home/lolhens/./placementoptimizer.py", line 5060, in <module>
exit(main())
File "/home/lolhens/./placementoptimizer.py", line 5028, in main
balance(args, state)
File "/home/lolhens/./placementoptimizer.py", line 4393, in balance
try_pg_move.prepare_crush_check()
File "/home/lolhens/./placementoptimizer.py", line 2496, in prepare_crush_check
raise RuntimeError(f"no trace found for {pg_osd} in {current_root_name}")
RuntimeError: no trace found for 2147483647 in default~hdd
I modified the script a bit to print out the root_osds, and I get this:
[20, 8, 2147483647, 11]
Sadly, I don't know the Ceph internals well enough to figure out what is going on. I saw that you made quite a few changes today and tried the new version, but I get the same error.
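For what it's worth, 2147483647 is 0x7fffffff, which Ceph uses as CRUSH_ITEM_NONE: the placeholder for a missing OSD in an up/acting set, e.g. an EC shard that currently has no OSD assigned (plausible right after reformatting OSDs). A crush-trace lookup would presumably need to skip it; a sketch:

```python
CRUSH_ITEM_NONE = 0x7fffffff  # == 2147483647, Ceph's "no OSD here" marker

root_osds = [20, 8, CRUSH_ITEM_NONE, 11]  # the set printed above
placed = [osd for osd in root_osds if osd != CRUSH_ITEM_NONE]
print(placed)  # [20, 8, 11]
```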
We have a 5-node cluster in 5 racks with only replicated rules. Sometimes the script dies with the following AssertionError:
-> ./placementoptimizer.py -v balance --max-pg-moves 20 | tee 2021-12-02_balance-upmaps_6
[2021-12-02 16:24:07,886] running pg balancer
[2021-12-02 16:24:07,900] current OSD fill rate per crushclasses:
[2021-12-02 16:24:07,901] ssd: average=53.83%, median=52.49%, without_placement_constraints=51.36%
[2021-12-02 16:24:07,902] cluster variance for crushclasses:
[2021-12-02 16:24:07,902] ssd: 12.010
[2021-12-02 16:24:07,902] min osd.22 48.726%
[2021-12-02 16:24:07,902] max osd.11 59.926%
[2021-12-02 16:24:07,910] SAVE move 6.1f osd.11 => osd.8 (size=8.5G)
[2021-12-02 16:24:07,910] => variance new=11.354265826467204 < 12.009990670791638=old
[2021-12-02 16:24:07,910] new min osd.22 48.726%
[2021-12-02 16:24:07,910] max osd.25 59.629%
[2021-12-02 16:24:07,910] new cluster variance:
[2021-12-02 16:24:07,910] ssd: 11.354
[2021-12-02 16:24:07,917] SAVE move 8.1e osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:24:07,917] => variance new=10.442261655175155 < 11.354265826467204=old
[2021-12-02 16:24:07,917] new min osd.8 49.928%
[2021-12-02 16:24:07,917] max osd.4 59.142%
[2021-12-02 16:24:07,917] new cluster variance:
[2021-12-02 16:24:07,917] ssd: 10.442
[2021-12-02 16:24:07,924] SAVE move 8.64 osd.4 => osd.8 (size=12.6G)
[2021-12-02 16:24:07,924] => variance new=9.682937891685194 < 10.442261655175155=old
[2021-12-02 16:24:07,924] new min osd.18 49.975%
[2021-12-02 16:24:07,924] max osd.12 59.027%
[2021-12-02 16:24:07,924] new cluster variance:
[2021-12-02 16:24:07,925] ssd: 9.683
[2021-12-02 16:24:07,931] SAVE move 8.ac osd.12 => osd.21 (size=12.4G)
[2021-12-02 16:24:07,931] => variance new=8.953738197324748 < 9.682937891685194=old
[2021-12-02 16:24:07,931] new min osd.18 49.975%
[2021-12-02 16:24:07,932] max osd.11 58.975%
[2021-12-02 16:24:07,932] new cluster variance:
[2021-12-02 16:24:07,932] ssd: 8.954
[2021-12-02 16:24:07,939] SAVE move 6.11 osd.11 => osd.18 (size=8.5G)
[2021-12-02 16:24:07,939] => variance new=8.427594885907181 < 8.953738197324748=old
[2021-12-02 16:24:07,939] new min osd.22 50.116%
[2021-12-02 16:24:07,939] max osd.25 58.239%
[2021-12-02 16:24:07,939] new cluster variance:
[2021-12-02 16:24:07,939] ssd: 8.428
[2021-12-02 16:24:07,946] SAVE move 8.d9 osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:24:07,946] => variance new=7.783740064149625 < 8.427594885907181=old
[2021-12-02 16:24:07,946] new min osd.1 50.187%
[2021-12-02 16:24:07,946] max osd.11 58.027%
[2021-12-02 16:24:07,946] new cluster variance:
[2021-12-02 16:24:07,946] ssd: 7.784
[2021-12-02 16:24:07,953] SAVE move 6.e osd.11 => osd.1 (size=8.4G)
[2021-12-02 16:24:07,953] => variance new=7.334701606818667 < 7.783740064149625=old
[2021-12-02 16:24:07,954] new min osd.9 50.477%
[2021-12-02 16:24:07,954] max osd.6 57.919%
[2021-12-02 16:24:07,954] new cluster variance:
[2021-12-02 16:24:07,954] ssd: 7.335
[2021-12-02 16:24:07,961] SAVE move 8.f1 osd.6 => osd.9 (size=13.0G)
[2021-12-02 16:24:07,961] => variance new=6.732430912736088 < 7.334701606818667=old
[2021-12-02 16:24:07,961] new min osd.18 50.923%
[2021-12-02 16:24:07,961] max osd.4 57.731%
[2021-12-02 16:24:07,961] new cluster variance:
[2021-12-02 16:24:07,961] ssd: 6.732
[2021-12-02 16:24:07,968] SAVE move 8.1a osd.4 => osd.29 (size=12.5G)
[2021-12-02 16:24:07,969] => variance new=6.21303665183377 < 6.732430912736088=old
[2021-12-02 16:24:07,969] new min osd.18 50.923%
[2021-12-02 16:24:07,969] max osd.12 57.635%
[2021-12-02 16:24:07,969] new cluster variance:
[2021-12-02 16:24:07,969] ssd: 6.213
[2021-12-02 16:24:07,975] SAVE move 8.2c osd.12 => osd.14 (size=12.4G)
[2021-12-02 16:24:07,975] => variance new=5.7103134469222105 < 6.21303665183377=old
[2021-12-02 16:24:07,976] new min osd.18 50.923%
[2021-12-02 16:24:07,976] max osd.16 57.633%
[2021-12-02 16:24:07,976] new cluster variance:
[2021-12-02 16:24:07,976] ssd: 5.710
[2021-12-02 16:24:07,982] SAVE move 6.1f osd.16 => osd.18 (size=8.5G)
[2021-12-02 16:24:07,982] => variance new=5.332574718339378 < 5.7103134469222105=old
[2021-12-02 16:24:07,982] new min osd.26 51.099%
[2021-12-02 16:24:07,982] max osd.11 57.083%
[2021-12-02 16:24:07,982] new cluster variance:
[2021-12-02 16:24:07,982] ssd: 5.333
[2021-12-02 16:24:07,988] SAVE move 6.8 osd.11 => osd.26 (size=8.4G)
[2021-12-02 16:24:07,988] => variance new=5.004755568105044 < 5.332574718339378=old
[2021-12-02 16:24:07,989] new min osd.1 51.131%
[2021-12-02 16:24:07,989] max osd.23 57.009%
[2021-12-02 16:24:07,989] new cluster variance:
[2021-12-02 16:24:07,989] ssd: 5.005
Traceback (most recent call last):
File "./placementoptimizer.py", line 1917, in <module>
try_pg_move.prepare_crush_check()
File "./placementoptimizer.py", line 984, in prepare_crush_check
assert reuses == uses
AssertionError
I will be honest: I didn't really try to dive into the code and algorithms, but based on the error message I have no idea what I did wrong. Is it even supposed to work on a cluster as small as mine?
Btw, I can fix it by limiting the number of moves with `--max-pg-moves`, so I can avoid it.
I will be happy to give you any debug info, just tell me what I can do.
All daemons: ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)
# ./placementoptimizer.py -v show
Traceback (most recent call last):
File "./placementoptimizer.py", line 343, in <module>
raise Exception(f"on osd.{id} calculated pg num acting: "
Exception: on osd.3 calculated pg num acting: 180 != 179
ceph dumps: ceph-balancer.zip
Having an issue where the script dies during a balance run.
$ python3 ./jj-balancer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps
[2023-01-10 10:20:34,509] gathering cluster state via ceph api...
[2023-01-10 10:20:40,984] running pg balancer
[2023-01-10 10:20:41,011] current OSD fill rate per crushclasses:
[2023-01-10 10:20:41,012] hdd: average=60.71%, median=60.40%, crushclass_usage=75.52%
[2023-01-10 10:20:41,012] ssd: average=58.43%, median=58.23%, crushclass_usage=62.43%
[2023-01-10 10:20:41,013] cluster variance for crushclasses:
[2023-01-10 10:20:41,013] hdd: 3.238
[2023-01-10 10:20:41,013] ssd: 9.636
[2023-01-10 10:20:41,013] min osd.53 52.013%
[2023-01-10 10:20:41,013] max osd.85 66.192%
[2023-01-10 10:20:41,022] SAVE move 40.a9 osd.85 => osd.298
[2023-01-10 10:20:41,023] props: size=35.5G remapped=False upmaps=0
[2023-01-10 10:20:41,023] => variance new=3.1273334935204544 < 3.237997993456637=old
[2023-01-10 10:20:41,023] new min osd.53 52.013%
[2023-01-10 10:20:41,023] max osd.232 65.897%
[2023-01-10 10:20:41,023] new cluster variance:
[2023-01-10 10:20:41,023] hdd: 3.127
[2023-01-10 10:20:41,023] ssd: 9.636
Traceback (most recent call last):
File "./jj-balancer.py", line 2166, in <module>
try_pg_move.prepare_crush_check()
File "./jj-balancer.py", line 1098, in prepare_crush_check
raise Exception(f"could not find item type {choose_type} "
Exception: could not find item type chassis requested by rule step {'op': 'chooseleaf_firstn', 'num': -1, 'type': 'chassis'}
I am assuming that it is due to a slightly non-standard crush topology/ruleset.
I have an hdd-root where the crush topology is root -> rack -> chassis -> host -> osd; then I have an ssd-root where the topology is root -> rack -> host -> osd (no chassis).
This is due to having some 8T hosts with 3x8T disks (2 per chassis, so 1 chassis = ~48T), some 8T hosts with 6x8T disks (1 per chassis), and some 24x2T hosts (1 per chassis), so that all of the chassis are ~48T and my crush rulesets take from chassis rather than host.
The SSD rulesets use host instead of chassis.
But I also have some "hybrid" rulesets where I take 1 from ssd-host, and take -1 from hdd-chassis.
So I'm guessing this is why it breaks on {'op': 'chooseleaf_firstn', 'num': -1, 'type': 'chassis'}.
Let me know if there is anything I can provide to help.
Attaching a tree view of the host topology to hopefully make it easier to visualize.
This was pulled from 30f09f0; the file is just renamed to jj-balancer.py for reasons.
Python is 3.8.10.
├── ROOT-hdd
│   └── RACK-rack-hdd
│       ├── CHASSIS-ceph-hdd-2t-01
│       │   └── HOST-ceph-hdd-2t-01
│       ├── CHASSIS-ceph-hdd-2t-02
│       │   └── HOST-ceph-hdd-2t-02
│       ├── CHASSIS-ceph-hdd-2t-03
│       │   └── HOST-ceph-hdd-2t-03
│       ├── CHASSIS-ceph-hdd-2t-04
│       │   └── HOST-ceph-hdd-2t-04
│       ├── CHASSIS-ceph-hdd-2t-05
│       │   └── HOST-ceph-hdd-2t-05
│       ├── CHASSIS-ceph-hdd-2t-06
│       │   └── HOST-ceph-hdd-2t-06
│       ├── CHASSIS-ceph-hdd-2t-07
│       │   └── HOST-ceph-hdd-2t-07
│       ├── CHASSIS-ceph-hdd-2t-08
│       │   └── HOST-ceph-hdd-2t-08
│       ├── CHASSIS-ceph-hdd-8t-0102
│       │   ├── HOST-ceph-hdd-8t-01
│       │   └── HOST-ceph-hdd-8t-02
│       ├── CHASSIS-ceph-hdd-8t-0304
│       │   ├── HOST-ceph-hdd-8t-03
│       │   └── HOST-ceph-hdd-8t-04
│       ├── CHASSIS-ceph-hdd-8t-0506
│       │   ├── HOST-ceph-hdd-8t-05
│       │   └── HOST-ceph-hdd-8t-06
│       ├── CHASSIS-ceph-hdd-8t-0708
│       │   ├── HOST-ceph-hdd-8t-07
│       │   └── HOST-ceph-hdd-8t-08
│       └── CHASSIS-ceph-hdd-8t-09
│           └── HOST-ceph-hdd-8t-09
└── ROOT-ssd
    └── RACK-rack-ssd
        ├── HOST-ceph-ssd-01
        ├── HOST-ceph-ssd-02
        ├── HOST-ceph-ssd-03
        ├── HOST-ceph-ssd-04
        ├── HOST-ceph-ssd-05
        └── HOST-ceph-ssd-06
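One conceivable way to handle the missing 'chassis' level on the ssd side would be to fall back to the next finer bucket type that does exist under the take-root. This is only a sketch under that assumption, with invented names, not the balancer's real code:

```python
# Illustrative fallback when a CRUSH rule step requests a bucket type
# that does not exist under the current take-root: degrade to the next
# finer type that is present, e.g. chassis -> host in an ssd tree
# without a chassis level. All names here are invented.

def resolve_step_type(requested, types_in_subtree, type_order):
    if requested in types_in_subtree:
        return requested
    # type_order lists types from coarse to fine
    for candidate in type_order[type_order.index(requested) + 1:]:
        if candidate in types_in_subtree:
            return candidate
    raise ValueError(f"no usable type at or below {requested!r}")

order = ["root", "rack", "chassis", "host", "osd"]
ssd_types = {"root", "rack", "host", "osd"}  # the ssd root above has no chassis
assert resolve_step_type("chassis", ssd_types, order) == "host"
assert resolve_step_type("host", ssd_types, order) == "host"
```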
When trying to run the script against my Ceph cluster, either across the whole cluster or on a particular pool, I get the following error:
./placementoptimizer.py -v balance --only-pool cephec_ecdata --max-pg-moves 10 | tee /tmp/balance-upmaps
[2023-03-20 18:47:32,626] gathering cluster state via ceph api...
[2023-03-20 18:47:48,224] running pg balancer
[2023-03-20 18:47:48,224] only considering pools {12}
[2023-03-20 18:47:48,235] current OSD fill rate per crushclasses:
[2023-03-20 18:47:48,236] hdd: average=75.24%, median=76.04%, crushclass_usage=75.55%
[2023-03-20 18:47:48,237] cluster variance for crushclasses:
[2023-03-20 18:47:48,238] hdd: 56.797
[2023-03-20 18:47:48,238] min osd.26 53.096%
[2023-03-20 18:47:48,238] max osd.75 90.596%
Traceback (most recent call last):
File "/root/./placementoptimizer.py", line 2133, in <module>
pool_pg_count_ideal = pg_mappings.pool_pg_count_ideal(pg_pool, try_pg_move.get_osd_candidates(osd_from))
File "/root/./placementoptimizer.py", line 885, in get_osd_candidates
root_name = self.root_names[pg_osd_idx]
IndexError: list index out of range
I have a single root in my crush map with all 6 hosts within it.
Hi JJ!
First, thanks for an AWESOME balancer! I'm in shock and awe at how good, efficient and simple this is - it has achieved a virtually perfect balance on our system with lots of mixed HDD sizes and nodes :-)
However, in addition to our large-storage volumes we also have a partition where we use 3-fold replication on 1 SSD combined with 2 HDDs. At least for our (relatively read-intensive) setup this works great in combination with NVMe DB/WAL devices for the HDDs. We get close to pure SSD performance on writes, and exactly the same read performance as a pure SSD array - but at 1/3 of the cost.
But... the JJ balancer fails for this pool. I have started to debug, and it seems to be caused by the trace computation in prepare_crush_check, where the code likely assumes all OSDs in the crush rule have the same class.
I will keep working on it, but I suspect there might be a close-to-trivial workaround to continue when OSDs have the "wrong" class, so I figured I should submit an issue in case it's a 5-minute fix for somebody who knows the code better.
Here's the error with a bit of debug context; osd.277 is class ssd, osd.218 and osd.37 class hdd.
[2022-11-10 18:35:29,334] TRY-0 moving pg 5.3c1 (36/58) with 78.3G from osd.37
[2022-11-10 18:35:29,335] OK => taking pg 5.3c1 from source osd.37 since it has too many of pool=5 (13 > 11.98997752947604)
[2022-11-10 18:35:29,335] prepare crush check for pg 5.3c1 currently up=[277, 218, 37]
[2022-11-10 18:35:29,335] rule:
{'name': '1ssd_2hdd',
'steps': [{'item': -52, 'item_name': 'default~ssd', 'op': 'take'},
{'num': 1, 'op': 'chooseleaf_firstn', 'type': 'host'},
{'op': 'emit'},
{'item': -24, 'item_name': 'default~hdd', 'op': 'take'},
{'num': -1, 'op': 'chooseleaf_firstn', 'type': 'host'},
{'op': 'emit'}]}
[2022-11-10 18:35:29,335] allowed reuses per rule step, starting at root: [2, 2, 2, 2, 1, 1]
[2022-11-10 18:35:29,336] processing crush step {'op': 'take', 'item': -52, 'item_name': 'default~ssd'} with tree_depth=0, rule_depth=0, item_uses=defaultdict(<class 'dict'>, {})
[2022-11-10 18:35:29,336] trace for 277: [{'id': -52, 'type_name': 'root'}, {'id': -77, 'type_name': 'host'}, {'id': 277, 'type_name': 'osd'}]
[2022-11-10 18:35:29,336] trace for 218: None
Traceback (most recent call last):
File "./placementoptimizer.py", line 2081, in <module>
try_pg_move.prepare_crush_check()
File "./placementoptimizer.py", line 1009, in prepare_crush_check
raise Exception(f"no trace found for {pg_osd} in {rule_root_name}")
Exception: no trace found for 218 in default~ssd
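If the root cause is indeed a single-root assumption, one conceivable fix would be to trace each up-OSD only from the take-root whose subtree actually contains it, since a multi-emit rule like 1ssd_2hdd places replicas under several roots. A rough sketch under that assumption (trace_osd and osd_members are invented names, not the real prepare_crush_check):

```python
# Illustrative: for a rule with multiple 'take' steps, match each replica
# to the take-root that actually contains it instead of expecting all
# replicas under one root.

def trace_osd(rule_roots, osd_members, osd):
    """Return the first take-root whose subtree contains the OSD, or None."""
    for root in rule_roots:
        if osd in osd_members[root]:
            return root
    return None

# Assumed membership for the pg 5.3c1 example above:
members = {
    "default~ssd": {277},
    "default~hdd": {218, 37},
}
roots = ["default~ssd", "default~hdd"]
up = [277, 218, 37]
traced = {osd: trace_osd(roots, members, osd) for osd in up}
assert traced == {277: "default~ssd", 218: "default~hdd", 37: "default~hdd"}
```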
Another attempt at balancing the same half-broken cluster as in #35, after some CRUSH map cleanup, yields this:
# ./placementoptimizer.py -v --osdsize device balance --osdused delta --max-pg-moves 100 --osdfrom fullest --only-crushclass hdd
[2024-03-15 16:03:45,878] gathering cluster state via ceph api...
Traceback (most recent call last):
File "./placementoptimizer.py", line 5475, in <module>
exit(main())
File "./placementoptimizer.py", line 5431, in main
state.preprocess()
File "./placementoptimizer.py", line 2061, in preprocess
metadata_estimate = int(meta_amount * pg_objects / osd_objs_acting)
ZeroDivisionError: division by zero
The debug archive will be sent via email; note that this is a large cluster, so the archive exceeds your message size limit and I will have to split it.
Sometimes the balancer emits upmap commands that are not valid according to the active CRUSH rule.
I have a 5-node cluster with replicated rules only, with rack as the failure domain. Many of the generated commands are completely OK and lead to better balancing, but sometimes a command (or part of it) is not valid and Ceph doesn't insert it into the configuration (silently, which is a bit confusing).
Cluster has following topology:
-> ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 26.17169 root default
-13 26.17169 datacenter dc1
-14 26.17169 room room1
-15 5.23602 rack rack2
-3 5.23602 host ceph1
0 ssd 0.87209 osd.0 up 1.00000 1.00000
2 ssd 0.87279 osd.2 up 1.00000 1.00000
3 ssd 0.87279 osd.3 up 1.00000 1.00000
4 ssd 0.87279 osd.4 up 1.00000 1.00000
28 ssd 0.87279 osd.28 up 1.00000 1.00000
29 ssd 0.87279 osd.29 up 1.00000 1.00000
-16 5.23392 rack rack3
-5 5.23392 host ceph2
5 ssd 0.87209 osd.5 up 1.00000 1.00000
6 ssd 0.87209 osd.6 up 1.00000 1.00000
7 ssd 0.87209 osd.7 up 1.00000 1.00000
8 ssd 0.87209 osd.8 up 1.00000 1.00000
26 ssd 0.87279 osd.26 up 1.00000 1.00000
27 ssd 0.87279 osd.27 up 1.00000 1.00000
-17 5.23392 rack rack5
-7 5.23392 host ceph3
1 ssd 0.87209 osd.1 up 1.00000 1.00000
11 ssd 0.87209 osd.11 up 1.00000 1.00000
14 ssd 0.87209 osd.14 up 1.00000 1.00000
17 ssd 0.87209 osd.17 up 1.00000 1.00000
24 ssd 0.87279 osd.24 up 1.00000 1.00000
25 ssd 0.87279 osd.25 up 1.00000 1.00000
-18 5.23392 rack rack6
-9 5.23392 host ceph4
9 ssd 0.87209 osd.9 up 1.00000 1.00000
12 ssd 0.87209 osd.12 up 1.00000 1.00000
15 ssd 0.87209 osd.15 up 1.00000 1.00000
18 ssd 0.87209 osd.18 up 1.00000 1.00000
22 ssd 0.87279 osd.22 up 1.00000 1.00000
23 ssd 0.87279 osd.23 up 1.00000 1.00000
-25 5.23392 rack rack7
-11 5.23392 host ceph5
10 ssd 0.87209 osd.10 up 1.00000 1.00000
13 ssd 0.87209 osd.13 up 1.00000 1.00000
16 ssd 0.87209 osd.16 up 1.00000 1.00000
19 ssd 0.87209 osd.19 up 1.00000 1.00000
20 ssd 0.87279 osd.20 up 1.00000 1.00000
21 ssd 0.87279 osd.21 up 1.00000 1.00000
-> ./placementoptimizer.py -v balance --max-pg-moves 12 | tee 2021-12-02_balance-upmaps_2
[2021-12-02 16:51:16,001] running pg balancer
[2021-12-02 16:51:16,008] current OSD fill rate per crushclasses:
[2021-12-02 16:51:16,008] ssd: average=53.85%, median=52.51%, without_placement_constraints=51.39%
[2021-12-02 16:51:16,009] cluster variance for crushclasses:
[2021-12-02 16:51:16,009] ssd: 12.017
[2021-12-02 16:51:16,009] min osd.22 48.749%
[2021-12-02 16:51:16,009] max osd.11 59.951%
[2021-12-02 16:51:16,013] SAVE move 6.1f osd.11 => osd.8 (size=8.5G)
[2021-12-02 16:51:16,013] => variance new=11.36060690389384 < 12.016851844400833=old
[2021-12-02 16:51:16,013] new min osd.22 48.749%
[2021-12-02 16:51:16,013] max osd.25 59.655%
[2021-12-02 16:51:16,013] new cluster variance:
[2021-12-02 16:51:16,013] ssd: 11.361
[2021-12-02 16:51:16,016] SAVE move 8.1e osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:51:16,016] => variance new=10.447990002211988 < 11.36060690389384=old
[2021-12-02 16:51:16,016] new min osd.8 49.952%
[2021-12-02 16:51:16,016] max osd.4 59.171%
[2021-12-02 16:51:16,016] new cluster variance:
[2021-12-02 16:51:16,017] ssd: 10.448
[2021-12-02 16:51:16,020] SAVE move 8.64 osd.4 => osd.8 (size=12.6G)
[2021-12-02 16:51:16,020] => variance new=9.687735955065486 < 10.447990002211988=old
[2021-12-02 16:51:16,020] new min osd.18 50.000%
[2021-12-02 16:51:16,020] max osd.12 59.057%
[2021-12-02 16:51:16,020] new cluster variance:
[2021-12-02 16:51:16,020] ssd: 9.688
[2021-12-02 16:51:16,024] SAVE move 8.ac osd.12 => osd.21 (size=12.4G)
[2021-12-02 16:51:16,024] => variance new=8.957778334870069 < 9.687735955065486=old
[2021-12-02 16:51:16,024] new min osd.18 50.000%
[2021-12-02 16:51:16,024] max osd.11 58.999%
[2021-12-02 16:51:16,024] new cluster variance:
[2021-12-02 16:51:16,024] ssd: 8.958
[2021-12-02 16:51:16,027] SAVE move 6.11 osd.11 => osd.18 (size=8.5G)
[2021-12-02 16:51:16,027] => variance new=8.431644281707726 < 8.957778334870069=old
[2021-12-02 16:51:16,028] new min osd.22 50.140%
[2021-12-02 16:51:16,028] max osd.25 58.265%
[2021-12-02 16:51:16,028] new cluster variance:
[2021-12-02 16:51:16,028] ssd: 8.432
[2021-12-02 16:51:16,031] SAVE move 8.d9 osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:51:16,031] => variance new=7.787358409360783 < 8.431644281707726=old
[2021-12-02 16:51:16,031] new min osd.1 50.210%
[2021-12-02 16:51:16,031] max osd.11 58.052%
[2021-12-02 16:51:16,031] new cluster variance:
[2021-12-02 16:51:16,031] ssd: 7.787
[2021-12-02 16:51:16,035] SAVE move 6.e osd.11 => osd.1 (size=8.4G)
[2021-12-02 16:51:16,035] => variance new=7.337994734535253 < 7.787358409360783=old
[2021-12-02 16:51:16,035] new min osd.9 50.500%
[2021-12-02 16:51:16,035] max osd.6 57.947%
[2021-12-02 16:51:16,035] new cluster variance:
[2021-12-02 16:51:16,036] ssd: 7.338
[2021-12-02 16:51:16,039] SAVE move 8.f1 osd.6 => osd.9 (size=13.0G)
[2021-12-02 16:51:16,039] => variance new=6.735266750261725 < 7.337994734535253=old
[2021-12-02 16:51:16,039] new min osd.18 50.948%
[2021-12-02 16:51:16,039] max osd.4 57.760%
[2021-12-02 16:51:16,039] new cluster variance:
[2021-12-02 16:51:16,039] ssd: 6.735
[2021-12-02 16:51:16,043] SAVE move 8.1a osd.4 => osd.29 (size=12.5G)
[2021-12-02 16:51:16,043] => variance new=6.215466556962361 < 6.735266750261725=old
[2021-12-02 16:51:16,043] new min osd.18 50.948%
[2021-12-02 16:51:16,043] max osd.12 57.665%
[2021-12-02 16:51:16,043] new cluster variance:
[2021-12-02 16:51:16,043] ssd: 6.215
[2021-12-02 16:51:16,046] SAVE move 8.2c osd.12 => osd.14 (size=12.4G)
[2021-12-02 16:51:16,046] => variance new=5.712574199748361 < 6.215466556962361=old
[2021-12-02 16:51:16,046] new min osd.18 50.948%
[2021-12-02 16:51:16,047] max osd.16 57.660%
[2021-12-02 16:51:16,047] new cluster variance:
[2021-12-02 16:51:16,047] ssd: 5.713
[2021-12-02 16:51:16,050] SAVE move 6.1f osd.16 => osd.18 (size=8.5G)
[2021-12-02 16:51:16,050] => variance new=5.334505158854212 < 5.712574199748361=old
[2021-12-02 16:51:16,050] new min osd.26 51.127%
[2021-12-02 16:51:16,050] max osd.11 57.107%
[2021-12-02 16:51:16,050] new cluster variance:
[2021-12-02 16:51:16,050] ssd: 5.335
[2021-12-02 16:51:16,054] SAVE move 6.8 osd.11 => osd.26 (size=8.4G)
[2021-12-02 16:51:16,054] => variance new=5.006761816792269 < 5.334505158854212=old
[2021-12-02 16:51:16,054] new min osd.1 51.155%
[2021-12-02 16:51:16,054] max osd.23 57.032%
[2021-12-02 16:51:16,054] new cluster variance:
[2021-12-02 16:51:16,054] ssd: 5.007
[2021-12-02 16:51:16,054] enough remaps found
[2021-12-02 16:51:16,054] --------------------------------------------------------------------------------
[2021-12-02 16:51:16,054] generated 12 remaps.
[2021-12-02 16:51:16,054] total movement size: 130.1G.
[2021-12-02 16:51:16,054] --------------------------------------------------------------------------------
[2021-12-02 16:51:16,054] old cluster variance per crushclass:
[2021-12-02 16:51:16,055] ssd: 12.017
[2021-12-02 16:51:16,055] old min osd.22 48.749%
[2021-12-02 16:51:16,055] old max osd.11 59.951%
[2021-12-02 16:51:16,055] --------------------------------------------------------------------------------
[2021-12-02 16:51:16,055] new min osd.1 51.155%
[2021-12-02 16:51:16,055] new max osd.23 57.032%
[2021-12-02 16:51:16,055] new cluster variance:
[2021-12-02 16:51:16,055] ssd: 5.007
[2021-12-02 16:51:16,055] --------------------------------------------------------------------------------
ceph osd pg-upmap-items 6.1f 11 8 16 18
ceph osd pg-upmap-items 8.1e 25 22
ceph osd pg-upmap-items 8.64 4 8
ceph osd pg-upmap-items 8.ac 4 19 12 21
ceph osd pg-upmap-items 6.11 11 18
ceph osd pg-upmap-items 8.d9 25 22
ceph osd pg-upmap-items 6.e 11 1
ceph osd pg-upmap-items 8.f1 6 9
ceph osd pg-upmap-items 8.1a 11 9 4 29
ceph osd pg-upmap-items 8.2c 25 26 12 14
ceph osd pg-upmap-items 6.8 11 26
But e.g. the second move of PG 6.1f would violate the CRUSH rule, because it would colocate the 2nd and 3rd replicas on the same host, ceph4:
-> ceph pg dump | grep -F 6.1f
dumped all
6.1f 2262 0 0 0 0 9118765056 0 0 2918 2918 active+clean 2021-12-02T01:29:46.271762+0000 1043596'3570596 1043603:54558341 [11,12,16] 11 [11,12,16] 11 1034824'3565764 2021-12-01T16:09:13.967493+0000 1023825'3552322 2021-11-30T10:06:06.894016+0000 0
Also, Ceph will (silently) refuse to apply the command when I try to run it:
-> ceph osd dump | grep -F 6.1f
-> ceph osd pg-upmap-items 6.1f 11 8 16 18
set 6.1f pg_upmap_items mapping to [11->8,16->18]
-> ceph osd dump | grep -F 6.1f
But I can do the first relocation just fine:
-> ceph osd pg-upmap-items 6.1f 11 8
set 6.1f pg_upmap_items mapping to [11->8]
-> ceph osd dump | grep -F 6.1f
pg_upmap_items 6.1f [11,8]
pg_temp 6.1f [11,12,16]
Please let me know if I can be of help in solving this issue.
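The generated double remap could perhaps be caught by validating the would-be result against the failure domain before emitting it. A minimal sketch, assuming the host assignments from the ceph osd tree output above (upmap_is_valid is an invented helper, not the balancer's code):

```python
# Hypothetical pre-flight check for a generated pg-upmap-items entry:
# reject a mapping whose resulting set would place two replicas in the
# same failure domain (here, the host).

def upmap_is_valid(up_set, mappings, osd_host):
    new_set = [mappings.get(osd, osd) for osd in up_set]
    hosts = [osd_host[osd] for osd in new_set]
    return len(hosts) == len(set(hosts))

# Host assignments of the involved OSDs, read from the tree above:
osd_host = {8: "ceph2", 11: "ceph3", 12: "ceph4", 16: "ceph5", 18: "ceph4"}
# 'ceph osd pg-upmap-items 6.1f 11 8 16 18' means {11: 8, 16: 18}:
assert upmap_is_valid([11, 12, 16], {11: 8}, osd_host)          # first move alone is fine
assert not upmap_is_valid([11, 12, 16], {11: 8, 16: 18}, osd_host)  # 12 and 18 share ceph4
```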
Hi.
We are having trouble using the balancer on the Reef version; it throws an error while decoding the JSON from 'ceph osd dump --format json', but the output of this command is valid JSON. Do you know where the issue might be?
root@app001 ~/ceph-balancer # git pull
Updating 1c90248..b48ffbf
Fast-forward
placementoptimizer.py | 2453 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
1 file changed, 2089 insertions(+), 364 deletions(-)
root@app001 ~/ceph-balancer # ./placementoptimizer.py balance
Traceback (most recent call last):
File "/root/ceph-balancer/./placementoptimizer.py", line 5060, in <module>
exit(main())
File "/root/ceph-balancer/./placementoptimizer.py", line 5024, in main
state = ClusterState(args.state, osdsize_method=osdsize_method)
File "/root/ceph-balancer/./placementoptimizer.py", line 592, in __init__
self.load(statefile)
File "/root/ceph-balancer/./placementoptimizer.py", line 619, in load
osd_dump=jsoncall("ceph osd dump --format json".split()),
File "/root/ceph-balancer/./placementoptimizer.py", line 274, in jsoncall
return json.loads(rawdata.decode())
File "/usr/lib64/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 2365 (char 2365)
root@app001 ~/ceph-balancer # ceph version
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
root@app001 ~/ceph-balancer # python --version
Python 3.9.17
Thanks
Michal
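In my experience, "Expecting value" at a mid-stream offset usually means some non-JSON bytes (a warning line, a control or escape sequence) ended up in the command output. A small sketch that surfaces the text around the failing offset might help pinpoint it (load_json_debug is a hypothetical helper, not the script's jsoncall):

```python
import json

# Parse JSON; on failure, return the text surrounding the error offset
# so the offending bytes can be inspected directly.

def load_json_debug(raw):
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as err:
        ctx = raw[max(0, err.pos - 40): err.pos + 40]
        return None, f"JSON error at char {err.pos}: {ctx!r}"

ok, err = load_json_debug('{"a": 1}')
assert ok == {"a": 1} and err is None
bad, err = load_json_debug('{"a": \x1b[1m1}')  # stray ANSI escape byte
assert bad is None and "char 6" in err
```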
The script just throws an error.
Output:
./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps
[2022-03-17 15:29:12,622] gathering cluster state via ceph api...
[2022-03-17 15:29:15,735] running pg balancer
[2022-03-17 15:29:15,737] current OSD fill rate per crushclasses:
[2022-03-17 15:29:15,738] hdd: average=121.93%, median=107.85%, without_placement_constraints=64.31%
[2022-03-17 15:29:15,738] ssd: average=47.30%, median=47.30%, without_placement_constraints=44.71%
Traceback (most recent call last):
File "/root/ceph-balancer/./placementoptimizer.py", line 1945, in <module>
init_cluster_variance = get_cluster_variance(enabled_crushclasses, pg_mappings)
File "/root/ceph-balancer/./placementoptimizer.py", line 1870, in get_cluster_variance
class_variance = statistics.variance(osd_usages)
File "/usr/lib/python3.9/statistics.py", line 739, in variance
raise StatisticsError('variance requires at least two data points')
statistics.StatisticsError: variance requires at least two data points
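statistics.variance() needs at least two samples, so a crush class that currently holds only one OSD will crash the variance computation. A tiny sketch of a guard (safe_variance is an invented name); treating a single-OSD class as zero variance seems like one reasonable choice:

```python
import statistics

# Guard against crush classes with fewer than two OSDs, for which
# statistics.variance() raises StatisticsError.

def safe_variance(usages):
    return statistics.variance(usages) if len(usages) >= 2 else 0.0

assert safe_variance([47.3]) == 0.0          # single-OSD class: no crash
assert round(safe_variance([40.0, 50.0]), 2) == 50.0
```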
$ ./placementoptimizer.py showremapped --state broken-cluster-1.xz --by-osd
Traceback (most recent call last):
File "/tmp/bug-report/./placementoptimizer.py", line 5496, in <module>
exit(main())
^^^^^^
File "/tmp/bug-report/./placementoptimizer.py", line 5490, in main
run()
File "/tmp/bug-report/./placementoptimizer.py", line 5460, in <lambda>
run = lambda: showremapped(args, state)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/bug-report/./placementoptimizer.py", line 5346, in showremapped
print(f"{osdname}: {cluster.osds[osdid]['host_name']} =>{sum_to} {sum_data_to_pp} <={sum_from} {sum_data_from_pp}"
~~~~~~~~~~~~^^^^^^^
KeyError: -1
Without the --by-osd
flag, it works:
$ ./placementoptimizer.py showremapped --state broken-cluster-1.xz
pg 28.28c degraded+waiting 55.7G: 0 of 1008986, 0.0%, -1->120
pg 28.272 degraded+waiting 55.6G: 570 of 1003522, 0.1%, -1->120
pg 28.277 degraded+waiting 55.5G: 0 of 1004690, 0.0%, -1->84
...
broken-cluster-1.xz will be sent by email.
Hi,
the script fails on an OS with older Python (< 3.9):
File "<fstring>", line 1
(pool_type=)
^
SyntaxError: invalid syntax
Running placementoptimizer.py on my Pacific (15.2.16) cluster increases the epoch count, which triggers the following exception even when the cluster is idle:
[root@ceph-admin ~]# ./placementoptimizer.py -v balance --max-pg-moves 10
fsid 227d9741-3db8-4984-a522-6442c1739578
[2022-03-25 09:18:38,347] gathering cluster state via ceph api...
Traceback (most recent call last):
File "./placementoptimizer.py", line 233, in <module>
raise Exception("Cluster topology changed during information gathering (e.g. a pg changed state). "
Exception: Cluster topology changed during information gathering (e.g. a pg changed state). Wait for things to calm down and try again
What am I doing wrong?
Hi,
you have a typo:
for new_from, new_to in resulting_upmaps:
    if new_from == new_to:
        raise Exception(f"somewhere something went wrong, we map {idpg} from osd.{new_from} to osd.{new_to}")
idpg should be pgid.
Hello!
I've been trying to use the balancer on my EC pool, and I am receiving the following error:
Traceback (most recent call last):
File "./placementoptimizer.py", line 2080, in <module>
try_pg_move.prepare_crush_check()
File "./placementoptimizer.py", line 1060, in prepare_crush_check
raise Exception(f"during emit, rule step {idx} item {item} was used {uses} != {reuses} expected")
Exception: during emit, rule step 0 item -15 was used 11 != 12 expected
I assume I am getting this error because, during the sanity check, I have 11 chunks of data (k=8, m=3) while the crush rule selects 4 hosts x 3 osds (12 chunks)?
I am using Proxmox 7.2, Ceph Pacific 16.2.7. Here is my CRUSH map.
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class nvme
device 33 osd.33 class nvme
device 34 osd.34 class nvme
device 35 osd.35 class nvme
device 36 osd.36 class hdd
device 37 osd.37 class hdd
device 38 osd.38 class hdd
device 39 osd.39 class hdd
device 40 osd.40 class nvme
device 41 osd.41 class nvme
device 42 osd.42 class hdd
device 43 osd.43 class hdd
device 44 osd.44 class hdd
device 45 osd.45 class hdd
device 46 osd.46 class hdd
device 47 osd.47 class hdd
device 48 osd.48 class hdd
device 49 osd.49 class hdd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
host server1 {
id -3 # do not change unnecessarily
id -16 class nvme # do not change unnecessarily
id -11 class hdd # do not change unnecessarily
# weight 31.765
alg straw2
hash 0 # rjenkins1
item osd.18 weight 2.729
item osd.19 weight 2.729
item osd.22 weight 2.729
item osd.23 weight 2.729
item osd.24 weight 2.729
item osd.34 weight 0.873
item osd.16 weight 5.458
item osd.20 weight 5.458
item osd.38 weight 5.458
item osd.1 weight 0.873
}
host server2 {
id -5 # do not change unnecessarily
id -17 class nvme # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
# weight 31.765
alg straw2
hash 0 # rjenkins1
item osd.4 weight 2.729
item osd.5 weight 2.729
item osd.6 weight 2.729
item osd.8 weight 2.729
item osd.9 weight 2.729
item osd.32 weight 0.873
item osd.7 weight 5.458
item osd.10 weight 5.458
item osd.31 weight 5.458
item osd.3 weight 0.873
}
host server3 {
id -7 # do not change unnecessarily
id -18 class nvme # do not change unnecessarily
id -13 class hdd # do not change unnecessarily
# weight 31.765
alg straw2
hash 0 # rjenkins1
item osd.11 weight 2.729
item osd.12 weight 2.729
item osd.14 weight 2.729
item osd.17 weight 2.729
item osd.2 weight 0.873
item osd.15 weight 5.458
item osd.36 weight 5.458
item osd.13 weight 2.729
item osd.37 weight 5.458
item osd.33 weight 0.873
}
host server4 {
id -9 # do not change unnecessarily
id -19 class nvme # do not change unnecessarily
id -14 class hdd # do not change unnecessarily
# weight 31.765
alg straw2
hash 0 # rjenkins1
item osd.25 weight 2.729
item osd.27 weight 2.729
item osd.26 weight 2.729
item osd.28 weight 2.729
item osd.30 weight 2.729
item osd.35 weight 0.873
item osd.0 weight 0.873
item osd.21 weight 5.458
item osd.29 weight 5.458
item osd.39 weight 5.458
}
host server5 {
id -2 # do not change unnecessarily
id -4 class nvme # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 31.766
alg straw2
hash 0 # rjenkins1
item osd.40 weight 0.873
item osd.41 weight 0.873
item osd.42 weight 5.458
item osd.43 weight 5.458
item osd.44 weight 2.729
item osd.45 weight 5.458
item osd.46 weight 2.729
item osd.47 weight 2.729
item osd.48 weight 2.729
item osd.49 weight 2.729
}
root default {
id -1 # do not change unnecessarily
id -20 class nvme # do not change unnecessarily
id -15 class hdd # do not change unnecessarily
# weight 158.826
alg straw2
hash 0 # rjenkins1
item server1 weight 31.765
item server2 weight 31.765
item server3 weight 31.765
item server4 weight 31.765
item server5 weight 31.766
}
rule storage_metadata {
id 2
type replicated
min_size 2
max_size 3
step take default class nvme
step chooseleaf firstn 0 type host
step emit
}
rule storage_data {
id 3
type erasure
min_size 10
max_size 11
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 4 type host
step chooseleaf indep 3 type osd
step emit
}
# end crush map
EC Profile
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=8
m=3
plugin=jerasure
technique=reed_sol_van
w=8
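That reading seems plausible: the rule's two choose steps multiply out to 12 slots, while a k=8, m=3 pool only emits 11 shards, so one slot stays unused and the use-count check sees 11 != 12. A back-of-the-envelope sketch with the numbers from this issue (rule_slots is an invented helper handling only positive 'choose indep N' steps):

```python
# Count the candidate positions a simple nested choose rule produces,
# and compare with the number of EC shards the pool actually fills.

def rule_slots(steps):
    slots = 1
    for num, _type in steps:  # only positive 'choose indep N' steps here
        slots *= num
    return slots

slots = rule_slots([(4, "host"), (3, "osd")])  # 'choose indep 4 host' x 'chooseleaf indep 3 osd'
pool_shards = 8 + 3                            # k + m
assert slots == 12
assert pool_shards == 11
assert slots - pool_shards == 1                # one emitted slot never receives a shard
```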
Getting a ZeroDivisionError when we run the balance command
# ./placementoptimizer.py -v balance --max-pg-moves 10
[2021-11-13 16:42:46,885] running pg balancer
Traceback (most recent call last):
File "./placementoptimizer.py", line 1745, in <module>
pg_mappings = PGMappings(pgs, osds)
File "./placementoptimizer.py", line 1225, in __init__
pg_obj_size = shardsize / pg_objs
ZeroDivisionError: division by zero
I'm guessing it's because we have a couple of OSDs down with zero PGs?
# ceph osd df | sort -k17 | head
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
MIN/MAX VAR: 0.07/1.28 STDDEV: 17.89
TOTAL 8.7 PiB 6.0 PiB 5.9 PiB 986 GiB 21 TiB 2.8 PiB 68.37
1073 hdd 12.87889 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
71 hdd 12.87889 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 down
1050 hdd 12.87889 1.00000 13 TiB 1.4 TiB 1.1 TiB 52 KiB 3.8 GiB 12 TiB 10.77 0.16 9 up
1054 hdd 12.87889 1.00000 13 TiB 1.4 TiB 1.1 TiB 132 KiB 3.9 GiB 12 TiB 10.98 0.16 7 up
1063 hdd 12.87889 1.00000 13 TiB 1.6 TiB 1.4 TiB 52 KiB 4.6 GiB 11 TiB 12.64 0.18 10 up
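If down OSDs with zero objects are indeed the trigger, a guard on the division in the traceback, roughly like this hypothetical helper, would sidestep the crash:

```python
# Average shard size per object; a PG on a down/empty OSD can report
# zero objects, so fall back to 0 instead of dividing by zero.

def pg_obj_size(shardsize, pg_objs):
    return shardsize / pg_objs if pg_objs else 0

assert pg_obj_size(0, 0) == 0          # empty PG: no ZeroDivisionError
assert pg_obj_size(1024, 4) == 256.0
```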
To simplify debugging, the balancer should be able to work on an imported cluster state. To generate that state, it also needs to be able to produce a state bundle.
The easiest approach would be generating a huge json output, where we just put in all the collected data from various ceph commands. This file can then be shared, for direct debugging and testing, without needing access to the live cluster.
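A minimal sketch of what such a bundle could look like, assuming a hand-picked (and certainly incomplete) set of ceph commands; dump_state writes one xz-compressed JSON file that load_state can later read back without any cluster access:

```python
import json
import lzma
import subprocess

# Illustrative command set -- the real balancer would need everything it
# queries (pg dump, crush map, pool info, ...).
COMMANDS = {
    "osd_dump": "ceph osd dump --format json",
    "df_dump": "ceph df detail --format json",
    "pg_dump": "ceph pg dump --format json",
}

def dump_state(path):
    """Collect all command outputs into one compressed JSON bundle."""
    state = {name: json.loads(subprocess.check_output(cmd.split()))
             for name, cmd in COMMANDS.items()}
    with lzma.open(path, "wt") as f:
        json.dump(state, f)

def load_state(path):
    """Load a previously dumped bundle instead of querying a live cluster."""
    with lzma.open(path, "rt") as f:
        return json.load(f)
```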
It might be worth handling osd = 2^31-1 (aka NONE) better when there is a missing OSD in a PG:
[root@ceph-admin ~]# ./placementoptimizer.py -v balance --max-pg-moves 10 --max-full-move-attempts=100 | tee /tmp/balance-upmaps
[2022-06-27 15:09:30,803] gathering cluster state via ceph api...
[2022-06-27 15:10:05,578] running pg balancer
[2022-06-27 15:10:05,865] current OSD fill rate per crushclasses:
[2022-06-27 15:10:05,865] mddu: average=0.21%, median=0.21%, without_placement_constraints=1.16%
[2022-06-27 15:10:05,866] mddm: average=0.16%, median=0.16%, without_placement_constraints=1.09%
[2022-06-27 15:10:05,866] nvme: average=0.01%, median=0.01%, without_placement_constraints=0.11%
[2022-06-27 15:10:05,868] ssd: average=44.91%, median=44.56%, without_placement_constraints=40.89%
[2022-06-27 15:10:05,870] hdd: average=98.72%, median=61.03%, without_placement_constraints=54.26%
[2022-06-27 15:10:05,870] smr: average=0.00%, median=0.00%, without_placement_constraints=0.03%
[2022-06-27 15:10:05,877] cluster variance for crushclasses:
[2022-06-27 15:10:05,877] mddu: 0.000
[2022-06-27 15:10:05,877] mddm: 0.000
[2022-06-27 15:10:05,877] nvme: 0.000
[2022-06-27 15:10:05,877] ssd: 7.102
[2022-06-27 15:10:05,877] hdd: 4505.848
[2022-06-27 15:10:05,877] smr: 0.000
[2022-06-27 15:10:05,877] min osd.253 0.000%
[2022-06-27 15:10:05,877] max osd.1986 405.902%
[2022-06-27 15:10:05,877] osd.2231 has calculated usage >= 100%: 100.29076207793798%
[2022-06-27 15:10:05,878] osd.365 has calculated usage >= 100%: 100.55524411861991%
...
[2022-06-27 15:10:06,411] osd.2215 has calculated usage >= 100%: 341.60815893352543%
[2022-06-27 15:10:06,411] osd.2221 has calculated usage >= 100%: 342.0460392118455%
Traceback (most recent call last):
File "./placementoptimizer.py", line 2079, in <module>
try_pg_move.prepare_crush_check()
File "./placementoptimizer.py", line 1007, in prepare_crush_check
raise Exception(f"no trace found for {pg_osd} in {rule_root_name}")
Exception: no trace found for 2147483647 in default~hdd
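A sketch of the suggested handling: filter the NONE marker out of a PG's up/acting set before tracing it in the crush hierarchy (real_osds is an invented helper):

```python
# 2**31 - 1 marks a missing OSD in a PG's up/acting set; such entries
# cannot be traced in the crush hierarchy, so skipping them seems safer
# than raising.

OSD_NONE = 2**31 - 1  # 2147483647

def real_osds(pg_osds):
    return [osd for osd in pg_osds if osd != OSD_NONE]

assert real_osds([12, 2147483647, 98]) == [12, 98]
assert real_osds([2147483647]) == []
```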
I am using the script to watch the progress of backfills on a broken cluster. Yet, it shows an exception:
Traceback (most recent call last):
File "/root/./placementoptimizer.py", line 5496, in <module>
exit(main())
File "/root/./placementoptimizer.py", line 5451, in main
state.preprocess()
File "/root/./placementoptimizer.py", line 2183, in preprocess
raise RuntimeError(f"pg {pg_incoming} to be moved to osd.{osdid} is misplaced "
RuntimeError: pg 18.6a to be moved to osd.117 is misplaced with -198781.0<0 objects already transferred
I will send you the dump via email. Yes, I know that one PG is not recoverable without a manual export/import.
Hello,
In one of my small clusters, the balancer attempted to move the first PG of a pool, did not find any suitable target, and skipped the pool completely, so it generated no move suggestions for this pool. I think the balancer should have checked whether the rest of the PGs in the same pool could be moved to other OSDs.
PS: The related logic is at lines 2126 to 2128.
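The suggested behavior could look roughly like this: skip only the stuck PG and keep evaluating the pool's remaining PGs (find_target here is a stand-in for the balancer's candidate search, not its real API):

```python
# Keep iterating over a pool's PGs even when one of them has no valid
# move target, instead of abandoning the whole pool.

def candidate_moves(pgs, find_target):
    moves = []
    for pg in pgs:
        target = find_target(pg)
        if target is None:
            continue  # this PG is stuck, but later PGs may still be movable
        moves.append((pg, target))
    return moves

targets = {"12.0": None, "12.1": "osd.5", "12.2": "osd.9"}
assert candidate_moves(list(targets), targets.get) == [("12.1", "osd.5"), ("12.2", "osd.9")]
```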
Hi,
trying to run with the options
-v balance --osdsize device --osdused delta --max-pg-moves 2 --osdfrom fullest
I get this error:
[2024-04-15 12:31:58,685] gathering cluster state via ceph api...
[2024-04-15 12:32:33,852] running pg balancer
Traceback (most recent call last):
File "jj.py", line 5496, in <module>
exit(main())
File "jj.py", line 5490, in main
run()
File "jj.py", line 5454, in <lambda>
run = lambda: balance(args, state)
File "jj.py", line 4607, in balance
pg_mappings = PGMappings(cluster,
File "jj.py", line 3265, in __init__
self.init_analyzer.analyze(self)
File "jj.py", line 4288, in analyze
self._update_stats()
File "jj.py", line 4374, in _update_stats
self.cluster_variance = self.pg_mappings.get_cluster_variance()
File "jj.py", line 3788, in get_cluster_variance
for crushclass, usages in self.get_class_osd_usages().items():
File "jj.py", line 3526, in get_class_osd_usages
ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
File "jj.py", line 3526, in <dictcomp>
ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
File "jj.py", line 3774, in get_osd_usage
used = self.get_osd_usage_size(osdid, add_size)
File "jj.py", line 3731, in get_osd_usage_size
used += self.cluster.osd_transfer_remainings[osdid]
KeyError: 3
The cluster runs Ceph Quincy with about 1500 OSDs (SSDs and HDDs) and several pools; the main and most utilized one is EC.
Any other details you need?
Thank you
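A minimal sketch of one possible fix for the KeyError above, assuming `osd_transfer_remainings` is a plain dict keyed by OSD id: fall back to 0 for OSDs that have no recorded transfer remainder.

```python
class Cluster:
    """Hypothetical stand-in for the script's cluster state object."""
    def __init__(self, osd_transfer_remainings):
        self.osd_transfer_remainings = osd_transfer_remainings

def get_osd_usage_size(cluster, osdid, base_used):
    """Sketch: .get() with a default of 0 avoids the KeyError when
    osdid (here 3) has no transfer-remainder entry."""
    return base_used + cluster.osd_transfer_remainings.get(osdid, 0)
```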
The following happened to me when I migrated all data from ssd-class OSDs to nvme ones without removing class ssd from CRUSH. I didn't remove it because I might re-add those OSDs later and don't want to get rid of the existing CRUSH rules.
./placementoptimizer.py -v balance --max-pg-moves 20 | tee -a "$(date +%Y-%m-%d)_balance-upmaps_1"
[2022-04-06 20:10:41,020] gathering cluster state via ceph api...
Traceback (most recent call last):
File "./placementoptimizer.py", line 269, in <module>
class_df_stats = CLUSTER_STATE["df_dump"]["stats_by_class"][crush_class]
KeyError: 'ssd'
I completely understand what's happening here; I just expect the script to simply ignore classes without OSDs.
./placementoptimizer.py -v balance --max-pg-moves 1000 --max-full-move-attempts=1000
generates pg-upmap-items that Ceph rejects with:
Error EINVAL: num of osd pairs (4) > pool size (3)
For example, here are some large OSD lists for a triply replicated pool:
ceph osd pg-upmap-items 14.36e7 2020 1591 2019 1612 1591 1438 1612 597
ceph osd pg-upmap-items 14.12c6 1093 1767 2018 638
ceph osd pg-upmap-items 14.2f63 817 1357 710 1613 1357 1366 1613 630
ceph osd pg-upmap-items 14.2790 1312 674 822 1763 674 1740 1763 1470 1740 1445 1470 601
Taking a closer look at the last entry, there are indeed OSDs listed that are not used by that PG:
[root@ceph-admin ~]# ceph pg dump | awk '$1 == "14.2790"'
dumped all
14.2790 2214 0 0 0 0 8615245675 0 0 887 887 active+clean 2022-08-01T07:56:09.531220-0700 858000'9325 871388:788167 [180,1312,605] 180 [180,1312,605] 180 858000'9325 2022-08-01T07:56:09.530894-0700 858000'9325 2022-08-01T07:56:09.530894-0700 0
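One way the pair count could be kept within the pool size is to collapse chained remaps, e.g. 2020→1591 followed by 1591→1438 becomes 2020→1438. This is only a sketch, under the assumption that the extra pairs arise from such chains; it is not the script's actual logic.

```python
def collapse_upmap_pairs(pairs):
    """Collapse chained (src, dst) remap pairs so that intermediate
    OSDs (which are not in the PG's acting set) disappear and the
    pair count shrinks. Sketch only, assuming pairs chain in order."""
    mapping = {}
    for src, dst in pairs:
        # if src is the destination of an earlier pair, rewrite that
        # earlier pair to point directly at dst
        for k, v in list(mapping.items()):
            if v == src:
                mapping[k] = dst
                break
        else:
            mapping[src] = dst
    return list(mapping.items())
```

Applied to the pairs of PG 14.36e7 above, this yields two pairs instead of four, which fits within the pool size of 3.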
Why does placementoptimizer.py showremapped raise an exception when the cluster topology has changed?
raise Exception("Cluster topology changed during information gathering (e.g. a pg changed state). "
Since this is presumably a read-only operation, how about changing that exception to a warning and outputting the currently remapped PGs nonetheless?
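A minimal sketch of that suggestion (hypothetical function names, not the script's actual code): log the topology-change message as a warning and return the remapped PGs anyway.

```python
import logging

def show_remapped(topology_changed, remapped_pgs):
    """Sketch: warn instead of raising when the topology changed
    during information gathering, since the operation is read-only,
    and still report the currently remapped PGs."""
    if topology_changed:
        logging.warning("Cluster topology changed during information "
                        "gathering (e.g. a pg changed state).")
    return remapped_pgs
```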