ceph-balancer's People

Contributors

dext0r, raminnietzsche, thejj

ceph-balancer's Issues

Feature request: take bluestore_min_alloc_size into account

I have access to a cluster created long ago and then expanded by adding new OSDs. I found that, in order to balance it properly, I had to add --osdsize device balance --osdused delta. Otherwise, its idea of how full an OSD is disagrees with what ceph osd df says, and disagrees differently for different OSDs.

Today, with the help of my colleagues, we root-caused this: old OSDs have bluestore_min_alloc_size=65536, while new ones have bluestore_min_alloc_size=4096. It means that the average per-object overhead is different. This overhead is what makes the sum of PG sizes (i.e., the sum of all stored object sizes) different from the used space on the OSD.

Please assume by default that each stored object comes with an overhead of bluestore_min_alloc_size / 2, and take this into account when figuring out how much space would be used or freed by PG moves. On Ceph 17.2.7 and later, you can get this from ceph osd metadata.

For example, an OSD that has a total of 56613739 objects in all PGs would have 1.7 TB of overhead with bluestore_min_alloc_size=65536, but only 100 GB of overhead with bluestore_min_alloc_size=4096.
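To illustrate, here is that overhead estimate as a quick back-of-the-envelope calculation (the helper name below is just for illustration; the object count is the one from the example above):

# Rough allocation overhead per OSD: on average each object wastes
# about half of bluestore_min_alloc_size due to allocation granularity.
def alloc_overhead_bytes(num_objects, min_alloc_size):
    return num_objects * min_alloc_size // 2

objects = 56_613_739  # objects across all PGs of the example OSD

for min_alloc in (65536, 4096):
    overhead_tib = alloc_overhead_bytes(objects, min_alloc) / 2**40
    print(f"bluestore_min_alloc_size={min_alloc}: ~{overhead_tib:.2f} TiB overhead")

# bluestore_min_alloc_size=65536: ~1.69 TiB overhead
# bluestore_min_alloc_size=4096: ~0.11 TiB overhead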

Here is ceph osd df (please ignore the first bunch of OSDs with only 0.75% utilization - they are outside of the CRUSH root, waiting for an "ok" to be placed in the proper hierarchy):
ceph-osd-df.txt

Here is ceph pg ls-by-osd 221 (this one was redeployed recently, so it has bluestore_min_alloc_size=4096):
ceph-pg-ls-by-osd-221.txt

Here is ceph pg ls-by-osd 223:
ceph-pg-ls-by-osd-223.txt

As you can see, these two OSDs have almost the same size and almost the same number of PGs (differing only by one), but their utilization differs by 1.9 TB, which matches (although not perfectly) the overhead calculation presented above.

Sorry, I am not allowed to post the full osdmap.

P.S. I am also going to file the same bug against the built-in Ceph balancer.

unsuccessful_pools prevents rebalancing when new nodes are added.

Added two new storage nodes to a cluster that was completely balanced by placementoptimizer.py

Ran upmap-remapped.py to set all PGs to active+clean:
https://github.com/HeinleinSupport/cern-ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

Expected subsequent runs of placementoptimizer.py to shift data to the new nodes.

[2022-10-05 14:40:45,104]  BAD => osd.350 already has too many of pool=1 (84 >= 79.38328132674108)
[2022-10-05 14:40:45,104] TRY-1 move 1.33d3 osd.154 => osd.154 (1256/1256)
[2022-10-05 14:40:45,104]  BAD move to source OSD makes no sense
[2022-10-05 14:40:45,104] SKIP pg 1.1a18 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.3a7e since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.7d8 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.2d59 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.152d since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.1b63 since pool (1) can't be balanced more
...

Instead: the entire pool is blacklisted and no further optimisations can be found.

Disabling this code fixes the issue:

if pg_pool in unsuccessful_pools:

I'm wondering: what is the purpose of unsuccessful_pools?

no suitable improvement found for rack-distributed cluster

On a fairly large cluster running Ceph 15.2.16, placementoptimizer.py balance generates pg-upmap-items that violate the CRUSH failure domain constraint by trying to map multiple OSDs of an EC PG to the same rack. Any ideas on what I am doing wrong would be greatly appreciated.

[root@ceph-admin ~]# ./placementoptimizer.py show

poolid name                    type     size min pg_num  stored    used   avail shrdsize crush
    10 fs.meta.mirror          repl        3   2   1024    2.1G    2.1G    1.2T     2.1M 6:replicated_chassis_mddm default~mddm*1.000
    11 fs.meta.user            repl        3   2   1024    2.7G    2.7G    1.2T     2.7M 7:replicated_chassis_mddu default~mddu*1.000
    12 fs.data.mirror.ssd.rep  repl        3   2   8192    0.0B    0.0B  267.4T     0.0B 1:replicated_chassis_ssd default~ssd*1.000
    13 fs.data.mirror.nvme.rep repl        3   2   2048    0.0B    0.0B   53.4T     0.0B 3:replicated_chassis_nvme default~nvme*1.000
    14 fs.data.user.ssd.rep    repl        3   2   8192   94.4T   94.4T  267.4T    11.8G 1:replicated_chassis_ssd default~ssd*1.000
    15 fs.data.user.nvme.rep   repl        3   2   2048    0.0B    0.0B   53.4T     0.0B 3:replicated_chassis_nvme default~nvme*1.000
    16 fs.data.mirror.ssd.ec   ec6+2       8   7   4096  356.8T  356.8T  601.6T    14.9G 2:fs.data.mirror.ssd.ec default~ssd*1.000
    17 fs.data.mirror.hdd.ec   ec6+2       8   7   4096    1.2P    1.2P  315.0T    53.1G 4:fs.data.mirror.hdd.ec default~hdd*1.000
    18 fs.data.user.hdd.ec     ec6+2       8   7   4096    3.5G    3.5G  315.0T   148.3K 5:fs.data.user.hdd.ec default~hdd*1.000
    20 device_health_metrics   repl        3   2     32    3.1G    3.1G   53.4T    99.2M 3:replicated_chassis_nvme default~nvme*1.000
    21 fs.data.user.ssd.ec     ec6+2       8   7   4096   82.5T   82.5T  601.6T     3.4G 0:fs.data.user.ssd.ec default~ssd*1.000
default~mddm                                              2.07G   2.07G    1.625%
default~mddu                                              2.73G   2.73G    1.708%
default~ssd                                             533.62T 533.62T   40.830%
default~nvme                                              3.10G   3.10G    0.157%
default~hdd                                               1.24P   1.24P   51.871%
sum                                                       1.76P   1.76P

While 128 PGs are active+remapped+backfilling from the mgr balancer, I tried running the JJ balancer to see if it can do a better job, since I currently have OSDs ranging from 22% to 82% %USE. However, the pg-upmap items fail the CRUSH failure domain constraint by trying to use two OSDs in the same rack:

[root@ceph1 ~]# ./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps.2
[2022-06-25 15:28:31,297] gathering cluster state via ceph api...
[2022-06-25 15:29:10,227] running pg balancer
[2022-06-25 15:29:10,506] current OSD fill rate per crushclasses:
[2022-06-25 15:29:10,506]   mddu: average=0.20%, median=0.20%, without_placement_constraints=1.67%
[2022-06-25 15:29:10,506]   mddm: average=0.15%, median=0.15%, without_placement_constraints=1.58%
[2022-06-25 15:29:10,506]   nvme: average=0.01%, median=0.00%, without_placement_constraints=0.15%
[2022-06-25 15:29:10,509]   ssd: average=44.80%, median=44.48%, without_placement_constraints=40.82%
[2022-06-25 15:29:10,510]   hdd: average=93.10%, median=58.60%, without_placement_constraints=51.84%
[2022-06-25 15:29:10,510]   smr: average=0.00%, median=0.00%, without_placement_constraints=0.05%
[2022-06-25 15:29:10,517] cluster variance for crushclasses:
[2022-06-25 15:29:10,517]   mddu: 0.000
[2022-06-25 15:29:10,517]   mddm: 0.000
[2022-06-25 15:29:10,517]   nvme: 0.000
[2022-06-25 15:29:10,517]   ssd: 6.502
[2022-06-25 15:29:10,517]   hdd: 3915.508
[2022-06-25 15:29:10,517]   smr: 0.000
[2022-06-25 15:29:10,517] min osd.253 0.000%
[2022-06-25 15:29:10,517] max osd.2323 385.361%
[2022-06-25 15:29:10,517] osd.2353 has calculated usage >= 100%: 100.63859057273221%
[2022-06-25 15:29:10,517] osd.447 has calculated usage >= 100%: 100.95507949049384%
...
[2022-06-25 15:29:12,803] osd.1960 has calculated usage >= 100%: 288.98203225266775%
[2022-06-25 15:29:12,803] osd.2323 has calculated usage >= 100%: 289.00484092949876%
[2022-06-25 15:29:12,803] enough remaps found
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] generated 10 remaps.
[2022-06-25 15:29:12,803] total movement size: 532.0G.
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] old cluster variance per crushclass:
[2022-06-25 15:29:12,803]   mddu: 0.000
[2022-06-25 15:29:12,803]   mddm: 0.000
[2022-06-25 15:29:12,803]   nvme: 0.000
[2022-06-25 15:29:12,803]   ssd: 6.502
[2022-06-25 15:29:12,803]   hdd: 3915.508
[2022-06-25 15:29:12,803]   smr: 0.000
[2022-06-25 15:29:12,803] old min osd.253 0.000%
[2022-06-25 15:29:12,803] old max osd.2323 385.361%
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] new min osd.253 0.000%
[2022-06-25 15:29:12,803] new max osd.2323 289.005%
[2022-06-25 15:29:12,803] new cluster variance:
[2022-06-25 15:29:12,803]   mddu: 0.000
[2022-06-25 15:29:12,804]   mddm: 0.000
[2022-06-25 15:29:12,804]   nvme: 0.000
[2022-06-25 15:29:12,804]   ssd: 6.502
[2022-06-25 15:29:12,804]   hdd: 3655.503
[2022-06-25 15:29:12,804]   smr: 0.000
[2022-06-25 15:29:12,804] --------------------------------------------------------------------------------
ceph osd pg-upmap-items 17.966 2323 2128
ceph osd pg-upmap-items 17.ce8 1986 2128
ceph osd pg-upmap-items 17.be4 2323 2128
ceph osd pg-upmap-items 17.55e 1986 2128
ceph osd pg-upmap-items 17.afb 2397 2099
ceph osd pg-upmap-items 17.67a 1982 2128
ceph osd pg-upmap-items 17.8c4 2177 2099
ceph osd pg-upmap-items 17.1c3 2215 2127
ceph osd pg-upmap-items 17.450 2313 2128
ceph osd pg-upmap-items 17.b03 2397 2099
[root@ceph-admin ~]# ceph osd pg-upmap-items 17.b03 2397 2099
set 17.b03 pg_upmap_items mapping to [2397->2099]
[root@ceph1 ceph]# tail -f /var/log/ceph/ceph-mon.ceph1.log
...
2022-06-25T15:30:31.076-0700 7fe8a0f67700 -1 verify_upmap multiple osds 2099,2141 come from same failure domain -4487
2022-06-25T15:30:31.076-0700 7fe8a0f67700  0 check_pg_upmaps verify_upmap of pg 17.b03 returning -22

which is true, since both osd.2099 and osd.2141 are in the same rack:

[root@ceph-admin ~]# ceph osd find 2099 | grep rack
        "rack": "s14",
[root@ceph-admin ~]# ceph osd find 2141 | grep rack
        "rack": "s14",

And for reference, here is the PG being modified:

[root@ceph-admin ~]# ceph pg dump | awk '$1 == "17.b03"'
dumped all
17.b03     82138                   0         0          0        0  343693292761            0           0   2150      2150                 active+clean  2022-06-22T21:31:45.461326-0700  706729'221745  706729:3645914   [2343,1947,2496,2397,2210,2141,465,1839]        2343   [2343,1947,2496,2397,2210,2141,465,1839]            2343  671601'208723  2022-06-22T21:31:45.461159-0700    671601'208723  2022-06-22T21:31:45.461159-0700              0
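For reference, a quick way to check whether a prospective up set would put two shards into the same rack; this is only a sketch shelling out to the ceph CLI, with the proposed 2397 -> 2099 replacement applied by hand:

import json
import subprocess
from collections import defaultdict

def rack_of(osd_id):
    # "ceph osd find" reports the OSD's crush location (rack, host, ...)
    out = subprocess.check_output(["ceph", "osd", "find", str(osd_id), "--format", "json"])
    return json.loads(out)["crush_location"].get("rack", "?")

# up set of pg 17.b03 with the proposed upmap 2397 -> 2099 applied by hand
proposed_up = [2343, 1947, 2496, 2099, 2210, 2141, 465, 1839]

shards_per_rack = defaultdict(list)
for osd in proposed_up:
    shards_per_rack[rack_of(osd)].append(osd)

for rack, osds in shards_per_rack.items():
    if len(osds) > 1:
        print(f"rack {rack} would hold {len(osds)} shards: {osds}")  # e.g. s14: [2099, 2141]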

It's breaking the CRUSH rule

Hi

On Ceph Quincy 17.2.7, with an EC pool using this CRUSH rule:

{
    "rule_id": 10,
    "rule_name": "ec33hdd_rule",
    "type": 3,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "datacenter"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}

EC profile:

crush-device-class=hdd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=3
plugin=jerasure
technique=reed_sol_van
w=8

Originally I had the PGs distributed over 2 OSDs per DC, but after running this balancer I found that for a lot of PGs this distribution is broken: in some DCs there are now 3 OSDs and only 1 OSD in another.

Looks to me like it's ignoring the custom CRUSH rule for EC pools.

It is also strange that pg-upmap-items allows this; according to the docs it shouldn't apply if it breaks the CRUSH rule.

Let me know if you need more details to debug; for now I wrote a little script to fix this issue on my cluster.
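For reference, broken mappings like this can be undone by removing the offending pg_upmap_items entries so CRUSH remaps those PGs itself; a simplified sketch of such a cleanup (the per-DC threshold of 2 matches the rule above, not necessarily the exact script):

import json
import subprocess
from collections import Counter

def ceph_json(*args):
    return json.loads(subprocess.check_output(["ceph", *args, "--format", "json"]))

def datacenter_of(osd_id):
    return ceph_json("osd", "find", str(osd_id))["crush_location"].get("datacenter", "?")

# every PG that currently has an upmap exception
for entry in ceph_json("osd", "dump").get("pg_upmap_items", []):
    pgid = entry["pgid"]
    up_set = ceph_json("pg", "map", pgid)["up"]
    per_dc = Counter(datacenter_of(osd) for osd in up_set)
    if any(n > 2 for n in per_dc.values()):  # rule wants at most 2 shards per DC
        print(f"ceph osd rm-pg-upmap-items {pgid}  # {dict(per_dc)}")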

Thank you!

./placementoptimizer.py showremapped balks at cluster topology changes

# ./placementoptimizer.py showremapped
Traceback (most recent call last):
  File "/root/./placementoptimizer.py", line 5403, in <module>
    exit(main())
  File "/root/./placementoptimizer.py", line 5358, in main
    state = ClusterState(args.state, osdsize_method=osdsize_method)
  File "/root/./placementoptimizer.py", line 720, in __init__
    self.load(statefile)
  File "/root/./placementoptimizer.py", line 768, in load
    raise RuntimeError("Cluster topology changed during information gathering (e.g. a pg changed state). "
RuntimeError: Cluster topology changed during information gathering (e.g. a pg changed state). Wait for things to calm down and try again

Surely this precaution against cluster topology changes does not make sense for read-only and purely informational subcommands such as showremapped?

JSON TypeError on show command when json output is used.

Hello!

I would like to use the JSON output of the show command to parse shard sizes.

python3 placementoptimizer.py show --format json

It prints the output, but at the end there is an error:

[...]
"pools_acting": [1, 2, 3, 6, 9, 18], "pg_count_up": {"2": 41, "3": 23, "9": 2, "1": 5, "6": 2, "18": 1}, "pg_count_acting": {"2": 41, "3": 23, "9": 2, "1": 5, "6": 2, "18": 1}, "pg_num_up": 74, "pgs_up": Traceback (most recent call last):
  File "placementoptimizer.py", line 5086, in <module>
    exit(main())
  File "placementoptimizer.py", line 5063, in main
    show(args, state)
  File "placementoptimizer.py", line 4850, in show
    json.dump(ret, sys.stdout)
  File "/usr/lib/python3.8/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/usr/lib/python3.8/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.8/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/usr/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type set is not JSON serializable

At the same time, the standard plain-text output works fine.

Python version: Python 3.8.10 (default, Nov 14 2022, 12:59:47)
Ceph version: Quincy 17.2.5
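For reference, the usual Python-side workaround for this kind of crash is to pass a default= converter to json.dump that turns sets into lists; a minimal illustration of the mechanism (not a patch to the script):

import json
import sys

data = {"pgs_up": {1, 2, 3}}  # a set, which is what the encoder trips over

# json.dump(data, sys.stdout)  # -> TypeError: Object of type set is not JSON serializable

json.dump(data, sys.stdout,
          default=lambda o: sorted(o) if isinstance(o, (set, frozenset)) else str(o))
print()  # -> {"pgs_up": [1, 2, 3]}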

Can't get OSDs utilized evenly

Hi,
I have one Ceph cluster with a high STDDEV number, taken from the ceph osd df tree class hdd command (I didn't paste the whole output here, as there are hundreds of OSDs):

-34          566.02417         -  515 TiB  423 TiB  422 TiB    50 MiB  1.2 TiB    91 TiB  82.24  1.00    -              host stg4-osd8
 11    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   2.9 MiB   19 GiB  1019 GiB  86.32  1.05   87      up          osd.11
 12    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.5 MiB   18 GiB  1019 GiB  86.33  1.05   87      up          osd.12
 56    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB  1017 KiB   19 GiB  1019 GiB  86.33  1.05   87      up          osd.56
 58    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.8 MiB   19 GiB  1020 GiB  86.31  1.05   87      up          osd.58
 66    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   330 KiB   18 GiB  1023 GiB  86.27  1.05   87      up          osd.66
 81    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   492 KiB   18 GiB  1018 GiB  86.33  1.05   87      up          osd.81
 94    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.6 MiB   18 GiB  1021 GiB  86.30  1.05   87      up          osd.94
123    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   246 KiB   17 GiB  1022 GiB  86.28  1.05   87      up          osd.123
143    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.8 MiB   18 GiB  1022 GiB  86.28  1.05   87      up          osd.143
151    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   847 KiB   19 GiB  1021 GiB  86.30  1.05   87      up          osd.151
187    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.4 MiB   18 GiB  1020 GiB  86.31  1.05   87      up          osd.187
250    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.76  0.96  195      up          osd.250
264    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.76  0.96  195      up          osd.264
282    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.76  0.96  195      up          osd.282
297    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.77  0.96  195      up          osd.297
318    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.75  0.96  195      up          osd.318
333    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.77  0.96  195      up          osd.333
349    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.4 TiB  79.16  0.96  196      up          osd.349
362    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.4 TiB  79.14  0.96  196      up          osd.362
381    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   39 GiB   3.4 TiB  79.17  0.96  196      up          osd.381
399    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   38 GiB   3.5 TiB  78.78  0.96  195      up          osd.399
415    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.76  0.96  195      up          osd.415
435    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.75  0.96  195      up          osd.435
463    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   3.0 MiB   41 GiB   2.3 TiB  85.94  1.05  195      up          osd.463
467    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   2.6 MiB   41 GiB   2.2 TiB  86.35  1.05  196      up          osd.467
480    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.4 TiB  79.16  0.96  196      up          osd.480
503    hdd    14.00052   1.00000   13 TiB   11 TiB   11 TiB   3.1 MiB   33 GiB   1.8 TiB  86.14  1.05  152      up          osd.503
519    hdd    16.00090   1.00000   15 TiB   13 TiB   13 TiB   1.2 MiB   35 GiB   2.0 TiB  86.24  1.05  174      up          osd.519
539    hdd    16.00090   1.00000   15 TiB   13 TiB   13 TiB   3.0 MiB   36 GiB   2.0 TiB  86.29  1.05  174      up          osd.539
562    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   5.2 MiB   42 GiB   2.2 TiB  86.38  1.05  196      up          osd.562
573    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   4.4 MiB   40 GiB   2.2 TiB  86.37  1.05  196      up          osd.573
589    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   1.1 MiB   42 GiB   2.3 TiB  85.95  1.05  195      up          osd.589
606    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   2.2 MiB   42 GiB   2.3 TiB  85.94  1.05  195      up          osd.606
610    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   615 KiB   40 GiB   2.3 TiB  85.95  1.05  195      up          osd.610
645    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.78  0.96  195      up          osd.645
658    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.76  0.96  195      up          osd.658
673    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.4 TiB  79.15  0.96  196      up          osd.673
681    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.77  0.96  195      up          osd.681
-37          566.02417         -  515 TiB  423 TiB  422 TiB    61 MiB  1.2 TiB    92 TiB  82.21  1.00    -              host stg4-osd9
 13    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   364 KiB   18 GiB  1021 GiB  86.30  1.05   87      up          osd.13
 14    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB  1013 KiB   19 GiB  1020 GiB  86.31  1.05   87      up          osd.14
 54    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   1.7 MiB   18 GiB  1022 GiB  86.28  1.05   87      up          osd.54
 57    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.3 MiB   18 GiB  1020 GiB  86.31  1.05   87      up          osd.57
 68    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.6 MiB   18 GiB  1024 GiB  86.26  1.05   87      up          osd.68
 82    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   1.6 MiB   18 GiB  1022 GiB  86.28  1.05   87      up          osd.82
 93    hdd    16.00090   1.00000   15 TiB   13 TiB   13 TiB   5.0 MiB   36 GiB   2.0 TiB  86.28  1.05  174      up          osd.93
 95    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.0 MiB   18 GiB  1022 GiB  86.28  1.05   87      up          osd.95
109    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   1.2 MiB   19 GiB  1020 GiB  86.32  1.05   87      up          osd.109
122    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.2 MiB   18 GiB  1023 GiB  86.28  1.05   87      up          osd.122
136    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.2 MiB   18 GiB  1019 GiB  86.33  1.05   87      up          osd.136
147    hdd     8.00156   1.00000  7.3 TiB  6.3 TiB  6.3 TiB   3.8 MiB   18 GiB  1020 GiB  86.32  1.05   87      up          osd.147
246    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.4 TiB  79.15  0.96  196      up          osd.246
276    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.77  0.96  195      up          osd.276
286    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.4 TiB  79.15  0.96  196      up          osd.286
310    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.78  0.96  195      up          osd.310
319    hdd    14.00052   1.00000   13 TiB   11 TiB   11 TiB   5.3 MiB   33 GiB   1.8 TiB  86.15  1.05  152      up          osd.319
327    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.77  0.96  195      up          osd.327
342    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.78  0.96  195      up          osd.342
375    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.76  0.96  195      up          osd.375
391    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   39 GiB   3.5 TiB  78.78  0.96  195      up          osd.391
406    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   37 GiB   3.5 TiB  78.78  0.96  195      up          osd.406
423    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   38 GiB   3.4 TiB  79.16  0.96  196      up          osd.423
425    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.79  0.96  195      up          osd.425
443    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.4 TiB  79.15  0.96  196      up          osd.443
469    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   3.4 MiB   42 GiB   2.3 TiB  85.93  1.05  195      up          osd.469
470    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   3.4 MiB   41 GiB   2.3 TiB  85.93  1.05  195      up          osd.470
486    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.77  0.96  195      up          osd.486
533    hdd    16.00090   1.00000   15 TiB   13 TiB   13 TiB   989 KiB   35 GiB   2.0 TiB  86.27  1.05  174      up          osd.533
552    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   3.8 MiB   41 GiB   2.3 TiB  85.95  1.05  195      up          osd.552
578    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   3.2 MiB   43 GiB   2.3 TiB  85.95  1.05  195      up          osd.578
590    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   3.7 MiB   40 GiB   2.3 TiB  85.94  1.05  195      up          osd.590
617    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   2.2 MiB   40 GiB   2.2 TiB  86.37  1.05  196      up          osd.617
621    hdd    18.00020   1.00000   16 TiB   14 TiB   14 TiB   3.6 MiB   41 GiB   2.3 TiB  85.94  1.05  195      up          osd.621
631    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.76  0.96  195      up          osd.631
646    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.75  0.96  195      up          osd.646
663    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   35 GiB   3.5 TiB  78.75  0.96  195      up          osd.663
678    hdd    18.00020   1.00000   16 TiB   13 TiB   13 TiB       0 B   36 GiB   3.5 TiB  78.80  0.96  195      up          osd.678
                           TOTAL  8.0 PiB  6.6 PiB  6.6 PiB   846 MiB   19 TiB   1.4 PiB  82.20
MIN/MAX VAR: 0.96/1.05  STDDEV: 3.72

The HDD sizes used there are 8 TB, 16 TB and 18 TB. The failure domain is host, and all OSDs are members of a single EC pool.
As you can see, there is a large difference between the min and max VAR even though the PG count per OSD looks well distributed.
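A rough calculation on two of the 18 TB OSDs above (using the rounded TiB numbers from the table) suggests the imbalance comes from how much data each PG carries rather than from the PG count:

# numbers taken from the ceph osd df tree excerpt above
osds = {
    "osd.250": {"used_tib": 13, "pgs": 195, "use_pct": 78.76},
    "osd.467": {"used_tib": 14, "pgs": 196, "use_pct": 86.35},
}

for name, o in osds.items():
    per_pg_gib = o["used_tib"] * 1024 / o["pgs"]
    print(f"{name}: {o['pgs']} PGs at {o['use_pct']}% used -> ~{per_pg_gib:.0f} GiB per PG")

# osd.250: 195 PGs at 78.76% used -> ~68 GiB per PG
# osd.467: 196 PGs at 86.35% used -> ~73 GiB per PG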

I started using your balancer to try to solve this (to get a lower STDDEV, i.e. the same utilization across OSDs), but with no luck: it doesn't propose moving any PG to another OSD.

I've tried it with the following options:

/usr/bin/placementoptimizer.py -v balance --max-pg-moves 1000 --ensure-variance-decrease --only-pool cephfs_data --max-full-move-attempts 100 --allow-move-below-target-pgcount

Any advice on how to run it to get the OSDs utilized evenly?

Thank you

KeyError: 66 (in used += self.cluster.osd_transfer_remainings[osdid])

While trying to rebalance an especially broken cluster, my colleague found this exception:

# ./placementoptimizer.py --osdsize device balance --osdused delta --max-pg-moves 50 --osdfrom fullest
Traceback (most recent call last):
  File "./placementoptimizer.py", line 5475, in <module>
    exit(main())
  File "./placementoptimizer.py", line 5470, in main
    run()
  File "./placementoptimizer.py", line 5434, in <lambda>
    run = lambda: balance(args, state)
  File "./placementoptimizer.py", line 4600, in balance
    need_simulation=True)
  File "./placementoptimizer.py", line 3260, in __init__
    self.init_analyzer.analyze(self)
  File "./placementoptimizer.py", line 4264, in analyze
    self._update_stats()
  File "./placementoptimizer.py", line 4350, in _update_stats
    self.cluster_variance = self.pg_mappings.get_cluster_variance()
  File "./placementoptimizer.py", line 3771, in get_cluster_variance
    for crushclass, usages in self.get_class_osd_usages().items():
  File "./placementoptimizer.py", line 3509, in get_class_osd_usages
    ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
  File "./placementoptimizer.py", line 3509, in <dictcomp>
    ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
  File "./placementoptimizer.py", line 3757, in get_osd_usage
    used = self.get_osd_usage_size(osdid, add_size)
  File "./placementoptimizer.py", line 3714, in get_osd_usage_size
    used += self.cluster.osd_transfer_remainings[osdid]
KeyError: 66

Note that osd.66 is the only OSD which has the hdd_test class:

$ ceph osd tree | grep test
 66  hdd_test     14.55269          osd.66              up   1.00000  1.00000

As we are not permitted to publicly post anything containing UUIDs that can be used to identify the customer's cluster, I am going to submit the debug info via private email.

Fatal error in get_osd_candidates

./placementoptimizer.py -v balance
[2023-01-26 08:42:54,343] gathering cluster state via ceph api...
[2023-01-26 08:43:06,262] running pg balancer
[2023-01-26 08:43:06,267] current OSD fill rate per crushclasses:
[2023-01-26 08:43:06,268]   hdd: average=66.15%, median=60.57%, crushclass_usage=64.46%
[2023-01-26 08:43:06,268] cluster variance for crushclasses:
[2023-01-26 08:43:06,269]   hdd: 284.034
[2023-01-26 08:43:06,269] min osd.2 56.273%
[2023-01-26 08:43:06,269] max osd.12 121.969%
[2023-01-26 08:43:06,269] osd.12 has calculated usage >= 100%: 121.9685%
[2023-01-26 08:43:06,270] osd.3 is source osd in pg <__main__.PGMoveChecker object at 0x7f2f9d773e20>
[2023-01-26 08:43:06,270] self.pg_osds is [7, 13, 10, 12, 1, 6]
Traceback (most recent call last):
  File "/root/install/ceph-balancer/./placementoptimizer.py", line 2138, in <module>
    pool_pg_count_ideal = pg_mappings.pool_pg_count_ideal(pg_pool, try_pg_move.get_osd_candidates(osd_from))
  File "/root/install/ceph-balancer/./placementoptimizer.py", line 890, in get_osd_candidates
    root_name = self.root_names[pg_osd_idx]
IndexError: list index out of range

I added a little bit of debugging to determine that osd_from is not in pg_osd_idx

    186  active+clean
     30  active+remapped+backfilling
      4  active+clean+scrubbing+deep
      4  active+remapped+backfill_toofull
      1  active+clean+scrubbing

Is the remapped+backfill_toofull state breaking this? What additional debug output should I try?

no trace found for 2147483647 in default~hdd

Hi! I have used this balancer to great success on my cluster. Lately I reformatted some OSDs and moved their DB to an NVMe device. Since then I get the following error:
./placementoptimizer.py -v balance --max-pg-moves 20 --max-move-attempts=20 | tee balance-upmaps

Traceback (most recent call last):
  File "/home/lolhens/./placementoptimizer.py", line 5060, in <module>
    exit(main())
  File "/home/lolhens/./placementoptimizer.py", line 5028, in main
    balance(args, state)
  File "/home/lolhens/./placementoptimizer.py", line 4393, in balance
    try_pg_move.prepare_crush_check()
  File "/home/lolhens/./placementoptimizer.py", line 2496, in prepare_crush_check
    raise RuntimeError(f"no trace found for {pg_osd} in {current_root_name}")
RuntimeError: no trace found for 2147483647 in default~hdd

I modified the script a bit to print out the root_osds and I get this:
[20, 8, 2147483647, 11]

Sadly I don't know the Ceph internals well enough to figure out what is going on. I saw that you made quite a few changes today and tried it with the new version, but I get the same error.
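As far as I understand, 2147483647 is 0x7fffffff, the value CRUSH uses as its "no OSD mapped here" placeholder, so the printed set has a hole at its third position rather than a real OSD (the PG is probably undersized/degraded). A tiny illustration:

CRUSH_ITEM_NONE = 0x7fffffff  # 2147483647, CRUSH's "no OSD mapped" placeholder

root_osds = [20, 8, 2147483647, 11]  # the list printed by the modified script

holes = [pos for pos, osd in enumerate(root_osds) if osd == CRUSH_ITEM_NONE]
print(f"shard positions without a mapped OSD: {holes}")  # -> [2]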

Sometimes the script dies on an AssertionError

We have a 5-node cluster in 5 racks with only replicated rules. Sometimes the script dies with the following AssertionError:

-> ./placementoptimizer.py -v balance --max-pg-moves 20 | tee 2021-12-02_balance-upmaps_6
[2021-12-02 16:24:07,886] running pg balancer
[2021-12-02 16:24:07,900] current OSD fill rate per crushclasses:
[2021-12-02 16:24:07,901]   ssd: average=53.83%, median=52.49%, without_placement_constraints=51.36%
[2021-12-02 16:24:07,902] cluster variance for crushclasses:
[2021-12-02 16:24:07,902]   ssd: 12.010
[2021-12-02 16:24:07,902] min osd.22 48.726%
[2021-12-02 16:24:07,902] max osd.11 59.926%
[2021-12-02 16:24:07,910]   SAVE move 6.1f osd.11 => osd.8 (size=8.5G)
[2021-12-02 16:24:07,910]     => variance new=11.354265826467204 < 12.009990670791638=old
[2021-12-02 16:24:07,910]     new min osd.22 48.726%
[2021-12-02 16:24:07,910]         max osd.25 59.629%
[2021-12-02 16:24:07,910]     new cluster variance:
[2021-12-02 16:24:07,910]       ssd: 11.354
[2021-12-02 16:24:07,917]   SAVE move 8.1e osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:24:07,917]     => variance new=10.442261655175155 < 11.354265826467204=old
[2021-12-02 16:24:07,917]     new min osd.8 49.928%
[2021-12-02 16:24:07,917]         max osd.4 59.142%
[2021-12-02 16:24:07,917]     new cluster variance:
[2021-12-02 16:24:07,917]       ssd: 10.442
[2021-12-02 16:24:07,924]   SAVE move 8.64 osd.4 => osd.8 (size=12.6G)
[2021-12-02 16:24:07,924]     => variance new=9.682937891685194 < 10.442261655175155=old
[2021-12-02 16:24:07,924]     new min osd.18 49.975%
[2021-12-02 16:24:07,924]         max osd.12 59.027%
[2021-12-02 16:24:07,924]     new cluster variance:
[2021-12-02 16:24:07,925]       ssd: 9.683
[2021-12-02 16:24:07,931]   SAVE move 8.ac osd.12 => osd.21 (size=12.4G)
[2021-12-02 16:24:07,931]     => variance new=8.953738197324748 < 9.682937891685194=old
[2021-12-02 16:24:07,931]     new min osd.18 49.975%
[2021-12-02 16:24:07,932]         max osd.11 58.975%
[2021-12-02 16:24:07,932]     new cluster variance:
[2021-12-02 16:24:07,932]       ssd: 8.954
[2021-12-02 16:24:07,939]   SAVE move 6.11 osd.11 => osd.18 (size=8.5G)
[2021-12-02 16:24:07,939]     => variance new=8.427594885907181 < 8.953738197324748=old
[2021-12-02 16:24:07,939]     new min osd.22 50.116%
[2021-12-02 16:24:07,939]         max osd.25 58.239%
[2021-12-02 16:24:07,939]     new cluster variance:
[2021-12-02 16:24:07,939]       ssd: 8.428
[2021-12-02 16:24:07,946]   SAVE move 8.d9 osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:24:07,946]     => variance new=7.783740064149625 < 8.427594885907181=old
[2021-12-02 16:24:07,946]     new min osd.1 50.187%
[2021-12-02 16:24:07,946]         max osd.11 58.027%
[2021-12-02 16:24:07,946]     new cluster variance:
[2021-12-02 16:24:07,946]       ssd: 7.784
[2021-12-02 16:24:07,953]   SAVE move 6.e osd.11 => osd.1 (size=8.4G)
[2021-12-02 16:24:07,953]     => variance new=7.334701606818667 < 7.783740064149625=old
[2021-12-02 16:24:07,954]     new min osd.9 50.477%
[2021-12-02 16:24:07,954]         max osd.6 57.919%
[2021-12-02 16:24:07,954]     new cluster variance:
[2021-12-02 16:24:07,954]       ssd: 7.335
[2021-12-02 16:24:07,961]   SAVE move 8.f1 osd.6 => osd.9 (size=13.0G)
[2021-12-02 16:24:07,961]     => variance new=6.732430912736088 < 7.334701606818667=old
[2021-12-02 16:24:07,961]     new min osd.18 50.923%
[2021-12-02 16:24:07,961]         max osd.4 57.731%
[2021-12-02 16:24:07,961]     new cluster variance:
[2021-12-02 16:24:07,961]       ssd: 6.732
[2021-12-02 16:24:07,968]   SAVE move 8.1a osd.4 => osd.29 (size=12.5G)
[2021-12-02 16:24:07,969]     => variance new=6.21303665183377 < 6.732430912736088=old
[2021-12-02 16:24:07,969]     new min osd.18 50.923%
[2021-12-02 16:24:07,969]         max osd.12 57.635%
[2021-12-02 16:24:07,969]     new cluster variance:
[2021-12-02 16:24:07,969]       ssd: 6.213
[2021-12-02 16:24:07,975]   SAVE move 8.2c osd.12 => osd.14 (size=12.4G)
[2021-12-02 16:24:07,975]     => variance new=5.7103134469222105 < 6.21303665183377=old
[2021-12-02 16:24:07,976]     new min osd.18 50.923%
[2021-12-02 16:24:07,976]         max osd.16 57.633%
[2021-12-02 16:24:07,976]     new cluster variance:
[2021-12-02 16:24:07,976]       ssd: 5.710
[2021-12-02 16:24:07,982]   SAVE move 6.1f osd.16 => osd.18 (size=8.5G)
[2021-12-02 16:24:07,982]     => variance new=5.332574718339378 < 5.7103134469222105=old
[2021-12-02 16:24:07,982]     new min osd.26 51.099%
[2021-12-02 16:24:07,982]         max osd.11 57.083%
[2021-12-02 16:24:07,982]     new cluster variance:
[2021-12-02 16:24:07,982]       ssd: 5.333
[2021-12-02 16:24:07,988]   SAVE move 6.8 osd.11 => osd.26 (size=8.4G)
[2021-12-02 16:24:07,988]     => variance new=5.004755568105044 < 5.332574718339378=old
[2021-12-02 16:24:07,989]     new min osd.1 51.131%
[2021-12-02 16:24:07,989]         max osd.23 57.009%
[2021-12-02 16:24:07,989]     new cluster variance:
[2021-12-02 16:24:07,989]       ssd: 5.005
Traceback (most recent call last):
  File "./placementoptimizer.py", line 1917, in <module>
    try_pg_move.prepare_crush_check()
  File "./placementoptimizer.py", line 984, in prepare_crush_check
    assert reuses == uses
AssertionError

To be honest, I didn't really try to dive into the code and algorithms, but based on the error message I have no idea what I did wrong. Is it even supposed to work on a cluster as small as mine?

Btw, I can avoid it by limiting the number of moves with --max-pg-moves.

I will be happy to give you any debug info, just tell me what I can do.

pg num acting exception (was: KeyError)

All daemons: ceph version 14.2.16 (5d5ae817209e503a412040d46b3374855b7efe04) nautilus (stable)

# ./placementoptimizer.py -v show 
Traceback (most recent call last):
  File "./placementoptimizer.py", line 343, in <module>
    raise Exception(f"on osd.{id} calculated pg num acting: "
Exception: on osd.3 calculated pg num acting: 180 != 179

ceph dumps: ceph-balancer.zip

Crush topology exception

Having an issue where the script dies during a balance run.

$ python3 ./jj-balancer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps
[2023-01-10 10:20:34,509] gathering cluster state via ceph api...
[2023-01-10 10:20:40,984] running pg balancer
[2023-01-10 10:20:41,011] current OSD fill rate per crushclasses:
[2023-01-10 10:20:41,012]   hdd: average=60.71%, median=60.40%, crushclass_usage=75.52%
[2023-01-10 10:20:41,012]   ssd: average=58.43%, median=58.23%, crushclass_usage=62.43%
[2023-01-10 10:20:41,013] cluster variance for crushclasses:
[2023-01-10 10:20:41,013]   hdd: 3.238
[2023-01-10 10:20:41,013]   ssd: 9.636
[2023-01-10 10:20:41,013] min osd.53 52.013%
[2023-01-10 10:20:41,013] max osd.85 66.192%
[2023-01-10 10:20:41,022]   SAVE move 40.a9 osd.85 => osd.298
[2023-01-10 10:20:41,023]     props: size=35.5G remapped=False upmaps=0
[2023-01-10 10:20:41,023]     => variance new=3.1273334935204544 < 3.237997993456637=old
[2023-01-10 10:20:41,023]     new min osd.53 52.013%
[2023-01-10 10:20:41,023]         max osd.232 65.897%
[2023-01-10 10:20:41,023]     new cluster variance:
[2023-01-10 10:20:41,023]       hdd: 3.127
[2023-01-10 10:20:41,023]       ssd: 9.636
Traceback (most recent call last):
  File "./jj-balancer.py", line 2166, in <module>
    try_pg_move.prepare_crush_check()
  File "./jj-balancer.py", line 1098, in prepare_crush_check
    raise Exception(f"could not find item type {choose_type} "
Exception: could not find item type chassis requested by rule step {'op': 'chooseleaf_firstn', 'num': -1, 'type': 'chassis'}

I am assuming that it is due to a slightly non-standard crush topology/ruleset.
I have an hdd-root where the crush topology is root -> rack -> chassis -> host -> osd; then I have an ssd-root where the topology is root -> rack -> host -> osd (no chassis).

This is due to having some hosts with 3x8T disks (2 hosts per chassis, so 1 chassis = ~48T), some hosts with 6x8T disks (1 per chassis), and some hosts with 24x2T disks (1 per chassis), so that all of the chassis are ~48T and my CRUSH rulesets take from chassis rather than host.

The SSD rulesets use host instead of chassis.
But I also have some "hybrid" rulesets where I take 1 from ssd-host, and take -1 from hdd-chassis.

So I'm guessing that this is why it is breaking on the {'op': 'chooseleaf_firstn', 'num': -1, 'type': 'chassis'}.

Let me know if there is anything I can provide to help.
Attaching a tree view of the host topology to hopefully visualize more easily.
This was pulled from 30f09f0 and the file is just renamed to jj-balancer.py for reasons.
Python is 3.8.10.

├── ROOT-hdd
│   └── RACK-rack-hdd
│       ├── CHASSIS-ceph-hdd-2t-01
│       │   └── HOST-ceph-hdd-2t-01
│       ├── CHASSIS-ceph-hdd-2t-02
│       │   └── HOST-ceph-hdd-2t-02
│       ├── CHASSIS-ceph-hdd-2t-03
│       │   └── HOST-ceph-hdd-2t-03
│       ├── CHASSIS-ceph-hdd-2t-04
│       │   └── HOST-ceph-hdd-2t-04
│       ├── CHASSIS-ceph-hdd-2t-05
│       │   └── HOST-ceph-hdd-2t-05
│       ├── CHASSIS-ceph-hdd-2t-06
│       │   └── HOST-ceph-hdd-2t-06
│       ├── CHASSIS-ceph-hdd-2t-07
│       │   └── HOST-ceph-hdd-2t-07
│       ├── CHASSIS-ceph-hdd-2t-08
│       │   └── HOST-ceph-hdd-2t-08
│       ├── CHASSIS-ceph-hdd-8t-0102
│       │   ├── HOST-ceph-hdd-8t-01
│       │   └── HOST-ceph-hdd-8t-02
│       ├── CHASSIS-ceph-hdd-8t-0304
│       │   ├── HOST-ceph-hdd-8t-03
│       │   └── HOST-ceph-hdd-8t-04
│       ├── CHASSIS-ceph-hdd-8t-0506
│       │   ├── HOST-ceph-hdd-8t-05
│       │   └── HOST-ceph-hdd-8t-06
│       ├── CHASSIS-ceph-hdd-8t-0708
│       │   ├── HOST-ceph-hdd-8t-07
│       │   └── HOST-ceph-hdd-8t-08
│       └── CHASSIS-ceph-hdd-8t-09
│           └── HOST-ceph-hdd-8t-09
└── ROOT-ssd
    └── RACK-rack-ssd
        ├── HOST-ceph-ssd-01
        ├── HOST-ceph-ssd-02
        ├── HOST-ceph-ssd-03
        ├── HOST-ceph-ssd-04
        ├── HOST-ceph-ssd-05
        └── HOST-ceph-ssd-06
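To make that guess a bit more concrete, here is a toy model of the bucket-type chains under the two roots (taken from the tree above); resolving a 'chassis' step only works under the hdd root:

# bucket-type levels under each rule root, as in the tree above
levels = {
    "ROOT-hdd": ["root", "rack", "chassis", "host", "osd"],
    "ROOT-ssd": ["root", "rack", "host", "osd"],  # no chassis level here
}

step = {"op": "chooseleaf_firstn", "num": -1, "type": "chassis"}

for root, chain in levels.items():
    found = step["type"] in chain
    print(f"{root}: item type '{step['type']}' {'found' if found else 'NOT found'} in {chain}")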

IndexError: list index out of range

When trying to run the script against my Ceph cluster, either across the whole cluster or against a particular pool, I get the following error:

./placementoptimizer.py -v balance --only-pool cephec_ecdata --max-pg-moves 10 | tee /tmp/balance-upmaps 
[2023-03-20 18:47:32,626] gathering cluster state via ceph api...
[2023-03-20 18:47:48,224] running pg balancer
[2023-03-20 18:47:48,224] only considering pools {12}
[2023-03-20 18:47:48,235] current OSD fill rate per crushclasses:
[2023-03-20 18:47:48,236]   hdd: average=75.24%, median=76.04%, crushclass_usage=75.55%
[2023-03-20 18:47:48,237] cluster variance for crushclasses:
[2023-03-20 18:47:48,238]   hdd: 56.797
[2023-03-20 18:47:48,238] min osd.26 53.096%
[2023-03-20 18:47:48,238] max osd.75 90.596%
Traceback (most recent call last):
  File "/root/./placementoptimizer.py", line 2133, in <module>
    pool_pg_count_ideal = pg_mappings.pool_pg_count_ideal(pg_pool, try_pg_move.get_osd_candidates(osd_from))
  File "/root/./placementoptimizer.py", line 885, in get_osd_candidates
    root_name = self.root_names[pg_osd_idx]
IndexError: list index out of range

I have a single root in my CRUSH map with all 6 hosts within it.

Balancer fails with crush rule combining 1SSD+2HDD

Hi JJ!

First, thanks for an AWESOME balancer! I'm in shock and awe at how good, efficient and simple this is - it has achieved a virtually perfect balance on our system with lots of mixed HDD sizes and nodes :-)

However, in addition to our large-storage volumes we also have a partition where we use 3-fold replication on 1 SSD combined with 2 HDDs. At least for our (relatively read-intensive) setup this works great in combination with NVMe DB/WAL devices for the HDDs. We get close to pure SSD performance on writes, and exactly the same read performance as a pure SSD array - but at 1/3 of the cost.

But... the JJbalancer fails for this pool. I have started to debug, and it seems to be caused by traces in prepare_crush_check where the code likely assumes all OSDs in the crush rule are the same class?

I will keep working on it, but I suspect there might be a close-to-trivial workaround to continue when OSDs have the "wrong" class, so I figured I should submit an issue in case it's a 5-minute fix for somebody who knows the code better.

Here's the error with a bit of debug context; osd.277 is class ssd, osd.218 and osd.37 class hdd.

[2022-11-10 18:35:29,334] TRY-0 moving pg 5.3c1 (36/58) with 78.3G from osd.37
[2022-11-10 18:35:29,335]   OK => taking pg 5.3c1 from source osd.37 since it has too many of pool=5 (13 > 11.98997752947604)
[2022-11-10 18:35:29,335] prepare crush check for pg 5.3c1 currently up=[277, 218, 37]
[2022-11-10 18:35:29,335] rule:
{'name': '1ssd_2hdd',
 'steps': [{'item': -52, 'item_name': 'default~ssd', 'op': 'take'},
           {'num': 1, 'op': 'chooseleaf_firstn', 'type': 'host'},
           {'op': 'emit'},
           {'item': -24, 'item_name': 'default~hdd', 'op': 'take'},
           {'num': -1, 'op': 'chooseleaf_firstn', 'type': 'host'},
           {'op': 'emit'}]}
[2022-11-10 18:35:29,335] allowed reuses per rule step, starting at root: [2, 2, 2, 2, 1, 1]
[2022-11-10 18:35:29,336] processing crush step {'op': 'take', 'item': -52, 'item_name': 'default~ssd'} with tree_depth=0, rule_depth=0, item_uses=defaultdict(<class 'dict'>, {})
[2022-11-10 18:35:29,336]    trace for  277: [{'id': -52, 'type_name': 'root'}, {'id': -77, 'type_name': 'host'}, {'id': 277, 'type_name': 'osd'}]
[2022-11-10 18:35:29,336]    trace for  218: None
Traceback (most recent call last):
  File "./placementoptimizer.py", line 2081, in <module>
    try_pg_move.prepare_crush_check()
  File "./placementoptimizer.py", line 1009, in prepare_crush_check
    raise Exception(f"no trace found for {pg_osd} in {rule_root_name}")
Exception: no trace found for 218 in default~ssd

ZeroDivisionError: division by zero (osd_objs_acting)

Another attempt at balancing the same half-broken cluster as in #35, after some CRUSH map cleanup, yields this:

# ./placementoptimizer.py -v --osdsize device balance --osdused delta --max-pg-moves 100 --osdfrom fullest --only-crushclass hdd
[2024-03-15 16:03:45,878] gathering cluster state via ceph api...
Traceback (most recent call last):
  File "./placementoptimizer.py", line 5475, in <module>
    exit(main())
  File "./placementoptimizer.py", line 5431, in main
    state.preprocess()
  File "./placementoptimizer.py", line 2061, in preprocess
    metadata_estimate = int(meta_amount * pg_objects / osd_objs_acting)
ZeroDivisionError: division by zero

The debug archive will be sent via email, but note that this is a large cluster, so it exceeds your message size limit and I will have to split the archive.
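For what it's worth, the failing expression is the per-PG metadata estimate, which divides by the OSD's acting object count; a guarded version of that calculation (just a sketch to show the idea, not a patch) would be:

def metadata_estimate(meta_amount, pg_objects, osd_objs_acting):
    # Apportion the OSD's metadata to one PG by its share of objects.
    # If the OSD reports zero acting objects there is nothing to apportion,
    # so return 0 instead of dividing by zero.
    if osd_objs_acting <= 0:
        return 0
    return int(meta_amount * pg_objects / osd_objs_acting)

print(metadata_estimate(2**30, 500, 0))       # 0 instead of ZeroDivisionError
print(metadata_estimate(2**30, 500, 100000))  # 5368709 (proportional share)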

Balancer advises invalid PG moves

Sometimes the balancer gives output containing upmap commands which are not valid according to the active CRUSH rule.

I have a 5-node cluster with replicated rules only, with rack as the failure domain. Many of the generated commands are completely OK and lead to better balancing, but sometimes a command (or part of it) is not valid and Ceph doesn't insert it into the configuration (silently, which is a bit confusing).

The cluster has the following topology:

-> ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                      STATUS  REWEIGHT  PRI-AFF
 -1         26.17169  root default                                            
-13         26.17169      datacenter dc1                                
-14         26.17169          room room1                                       
-15          5.23602              rack rack2                                  
 -3          5.23602                  host ceph1                           
  0    ssd   0.87209                      osd.0          up   1.00000  1.00000
  2    ssd   0.87279                      osd.2          up   1.00000  1.00000
  3    ssd   0.87279                      osd.3          up   1.00000  1.00000
  4    ssd   0.87279                      osd.4          up   1.00000  1.00000
 28    ssd   0.87279                      osd.28         up   1.00000  1.00000
 29    ssd   0.87279                      osd.29         up   1.00000  1.00000
-16          5.23392              rack rack3                                  
 -5          5.23392                  host ceph2                           
  5    ssd   0.87209                      osd.5          up   1.00000  1.00000
  6    ssd   0.87209                      osd.6          up   1.00000  1.00000
  7    ssd   0.87209                      osd.7          up   1.00000  1.00000
  8    ssd   0.87209                      osd.8          up   1.00000  1.00000
 26    ssd   0.87279                      osd.26         up   1.00000  1.00000
 27    ssd   0.87279                      osd.27         up   1.00000  1.00000
-17          5.23392              rack rack5                                  
 -7          5.23392                  host ceph3                           
  1    ssd   0.87209                      osd.1          up   1.00000  1.00000
 11    ssd   0.87209                      osd.11         up   1.00000  1.00000
 14    ssd   0.87209                      osd.14         up   1.00000  1.00000
 17    ssd   0.87209                      osd.17         up   1.00000  1.00000
 24    ssd   0.87279                      osd.24         up   1.00000  1.00000
 25    ssd   0.87279                      osd.25         up   1.00000  1.00000
-18          5.23392              rack rack6                                  
 -9          5.23392                  host ceph4                           
  9    ssd   0.87209                      osd.9          up   1.00000  1.00000
 12    ssd   0.87209                      osd.12         up   1.00000  1.00000
 15    ssd   0.87209                      osd.15         up   1.00000  1.00000
 18    ssd   0.87209                      osd.18         up   1.00000  1.00000
 22    ssd   0.87279                      osd.22         up   1.00000  1.00000
 23    ssd   0.87279                      osd.23         up   1.00000  1.00000
-25          5.23392              rack rack7                                  
-11          5.23392                  host ceph5                           
 10    ssd   0.87209                      osd.10         up   1.00000  1.00000
 13    ssd   0.87209                      osd.13         up   1.00000  1.00000
 16    ssd   0.87209                      osd.16         up   1.00000  1.00000
 19    ssd   0.87209                      osd.19         up   1.00000  1.00000
 20    ssd   0.87279                      osd.20         up   1.00000  1.00000
 21    ssd   0.87279                      osd.21         up   1.00000  1.00000
-> ./placementoptimizer.py -v balance --max-pg-moves 12 | tee 2021-12-02_balance-upmaps_2
[2021-12-02 16:51:16,001] running pg balancer
[2021-12-02 16:51:16,008] current OSD fill rate per crushclasses:
[2021-12-02 16:51:16,008]   ssd: average=53.85%, median=52.51%, without_placement_constraints=51.39%
[2021-12-02 16:51:16,009] cluster variance for crushclasses:
[2021-12-02 16:51:16,009]   ssd: 12.017
[2021-12-02 16:51:16,009] min osd.22 48.749%
[2021-12-02 16:51:16,009] max osd.11 59.951%
[2021-12-02 16:51:16,013]   SAVE move 6.1f osd.11 => osd.8 (size=8.5G)
[2021-12-02 16:51:16,013]     => variance new=11.36060690389384 < 12.016851844400833=old
[2021-12-02 16:51:16,013]     new min osd.22 48.749%
[2021-12-02 16:51:16,013]         max osd.25 59.655%
[2021-12-02 16:51:16,013]     new cluster variance:
[2021-12-02 16:51:16,013]       ssd: 11.361
[2021-12-02 16:51:16,016]   SAVE move 8.1e osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:51:16,016]     => variance new=10.447990002211988 < 11.36060690389384=old
[2021-12-02 16:51:16,016]     new min osd.8 49.952%
[2021-12-02 16:51:16,016]         max osd.4 59.171%
[2021-12-02 16:51:16,016]     new cluster variance:
[2021-12-02 16:51:16,017]       ssd: 10.448
[2021-12-02 16:51:16,020]   SAVE move 8.64 osd.4 => osd.8 (size=12.6G)
[2021-12-02 16:51:16,020]     => variance new=9.687735955065486 < 10.447990002211988=old
[2021-12-02 16:51:16,020]     new min osd.18 50.000%
[2021-12-02 16:51:16,020]         max osd.12 59.057%
[2021-12-02 16:51:16,020]     new cluster variance:
[2021-12-02 16:51:16,020]       ssd: 9.688
[2021-12-02 16:51:16,024]   SAVE move 8.ac osd.12 => osd.21 (size=12.4G)
[2021-12-02 16:51:16,024]     => variance new=8.957778334870069 < 9.687735955065486=old
[2021-12-02 16:51:16,024]     new min osd.18 50.000%
[2021-12-02 16:51:16,024]         max osd.11 58.999%
[2021-12-02 16:51:16,024]     new cluster variance:
[2021-12-02 16:51:16,024]       ssd: 8.958
[2021-12-02 16:51:16,027]   SAVE move 6.11 osd.11 => osd.18 (size=8.5G)
[2021-12-02 16:51:16,027]     => variance new=8.431644281707726 < 8.957778334870069=old
[2021-12-02 16:51:16,028]     new min osd.22 50.140%
[2021-12-02 16:51:16,028]         max osd.25 58.265%
[2021-12-02 16:51:16,028]     new cluster variance:
[2021-12-02 16:51:16,028]       ssd: 8.432
[2021-12-02 16:51:16,031]   SAVE move 8.d9 osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:51:16,031]     => variance new=7.787358409360783 < 8.431644281707726=old
[2021-12-02 16:51:16,031]     new min osd.1 50.210%
[2021-12-02 16:51:16,031]         max osd.11 58.052%
[2021-12-02 16:51:16,031]     new cluster variance:
[2021-12-02 16:51:16,031]       ssd: 7.787
[2021-12-02 16:51:16,035]   SAVE move 6.e osd.11 => osd.1 (size=8.4G)
[2021-12-02 16:51:16,035]     => variance new=7.337994734535253 < 7.787358409360783=old
[2021-12-02 16:51:16,035]     new min osd.9 50.500%
[2021-12-02 16:51:16,035]         max osd.6 57.947%
[2021-12-02 16:51:16,035]     new cluster variance:
[2021-12-02 16:51:16,036]       ssd: 7.338
[2021-12-02 16:51:16,039]   SAVE move 8.f1 osd.6 => osd.9 (size=13.0G)
[2021-12-02 16:51:16,039]     => variance new=6.735266750261725 < 7.337994734535253=old
[2021-12-02 16:51:16,039]     new min osd.18 50.948%
[2021-12-02 16:51:16,039]         max osd.4 57.760%
[2021-12-02 16:51:16,039]     new cluster variance:
[2021-12-02 16:51:16,039]       ssd: 6.735
[2021-12-02 16:51:16,043]   SAVE move 8.1a osd.4 => osd.29 (size=12.5G)
[2021-12-02 16:51:16,043]     => variance new=6.215466556962361 < 6.735266750261725=old
[2021-12-02 16:51:16,043]     new min osd.18 50.948%
[2021-12-02 16:51:16,043]         max osd.12 57.665%
[2021-12-02 16:51:16,043]     new cluster variance:
[2021-12-02 16:51:16,043]       ssd: 6.215
[2021-12-02 16:51:16,046]   SAVE move 8.2c osd.12 => osd.14 (size=12.4G)
[2021-12-02 16:51:16,046]     => variance new=5.712574199748361 < 6.215466556962361=old
[2021-12-02 16:51:16,046]     new min osd.18 50.948%
[2021-12-02 16:51:16,047]         max osd.16 57.660%
[2021-12-02 16:51:16,047]     new cluster variance:
[2021-12-02 16:51:16,047]       ssd: 5.713
[2021-12-02 16:51:16,050]   SAVE move 6.1f osd.16 => osd.18 (size=8.5G)
[2021-12-02 16:51:16,050]     => variance new=5.334505158854212 < 5.712574199748361=old
[2021-12-02 16:51:16,050]     new min osd.26 51.127%
[2021-12-02 16:51:16,050]         max osd.11 57.107%
[2021-12-02 16:51:16,050]     new cluster variance:
[2021-12-02 16:51:16,050]       ssd: 5.335
[2021-12-02 16:51:16,054]   SAVE move 6.8 osd.11 => osd.26 (size=8.4G)
[2021-12-02 16:51:16,054]     => variance new=5.006761816792269 < 5.334505158854212=old
[2021-12-02 16:51:16,054]     new min osd.1 51.155%
[2021-12-02 16:51:16,054]         max osd.23 57.032%
[2021-12-02 16:51:16,054]     new cluster variance:
[2021-12-02 16:51:16,054]       ssd: 5.007
[2021-12-02 16:51:16,054] enough remaps found
[2021-12-02 16:51:16,054] --------------------------------------------------------------------------------
[2021-12-02 16:51:16,054] generated 12 remaps.
[2021-12-02 16:51:16,054] total movement size: 130.1G.
[2021-12-02 16:51:16,054] --------------------------------------------------------------------------------
[2021-12-02 16:51:16,054] old cluster variance per crushclass:
[2021-12-02 16:51:16,055]   ssd: 12.017
[2021-12-02 16:51:16,055] old min osd.22 48.749%
[2021-12-02 16:51:16,055] old max osd.11 59.951%
[2021-12-02 16:51:16,055] --------------------------------------------------------------------------------
[2021-12-02 16:51:16,055] new min osd.1 51.155%
[2021-12-02 16:51:16,055] new max osd.23 57.032%
[2021-12-02 16:51:16,055] new cluster variance:
[2021-12-02 16:51:16,055]   ssd: 5.007
[2021-12-02 16:51:16,055] --------------------------------------------------------------------------------
ceph osd pg-upmap-items 6.1f 11 8 16 18
ceph osd pg-upmap-items 8.1e 25 22
ceph osd pg-upmap-items 8.64 4 8
ceph osd pg-upmap-items 8.ac 4 19 12 21
ceph osd pg-upmap-items 6.11 11 18
ceph osd pg-upmap-items 8.d9 25 22
ceph osd pg-upmap-items 6.e 11 1
ceph osd pg-upmap-items 8.f1 6 9
ceph osd pg-upmap-items 8.1a 11 9 4 29
ceph osd pg-upmap-items 8.2c 25 26 12 14
ceph osd pg-upmap-items 6.8 11 26

But e.g. the second move of PG 6.1f would violate the CRUSH rule, because it would colocate the 2nd and 3rd replicas on the same host, ceph4:

-> ceph pg dump | grep -F 6.1f
dumped all
6.1f        2262                   0         0          0        0   9118765056            0           0  2918      2918  active+clean  2021-12-02T01:29:46.271762+0000    1043596'3570596   1043603:54558341  [11,12,16]          11  [11,12,16]              11    1034824'3565764  2021-12-01T16:09:13.967493+0000    1023825'3552322  2021-11-30T10:06:06.894016+0000              0

Ceph also (silently) refuses to apply the command when I try to run it:

-> ceph osd dump | grep -F 6.1f

-> ceph osd pg-upmap-items 6.1f 11 8 16 18
set 6.1f pg_upmap_items mapping to [11->8,16->18]

-> ceph osd dump | grep -F 6.1f

But I can apply the first relocation just fine:

-> ceph osd pg-upmap-items 6.1f 11 8
set 6.1f pg_upmap_items mapping to [11->8]

-> ceph osd dump | grep -F 6.1f
pg_upmap_items 6.1f [11,8]
pg_temp 6.1f [11,12,16]

Please let me know if I can help with solving this issue.
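
For illustration, here is a minimal standalone sketch (not part of placementoptimizer.py) of the kind of check that could catch such a move before applying it; the osd-to-host assignment below is made up to mirror the reported situation:

# Check whether applying upmap items to a PG's up set would colocate
# two replicas on the same host (host assignments are example data).
def violates_host_rule(up_set, upmap_items, osd_host):
    remapped = [dict(upmap_items).get(osd, osd) for osd in up_set]
    hosts = [osd_host[osd] for osd in remapped]
    return len(hosts) != len(set(hosts))

osd_host = {8: "ceph2", 11: "ceph3", 12: "ceph4", 16: "ceph5", 18: "ceph4"}
# PG 6.1f currently up on [11, 12, 16], proposed items 11->8 and 16->18:
print(violates_host_rule([11, 12, 16], [(11, 8), (16, 18)], osd_host))  # True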

JSONDecodeError in Reef

Hi.

We are having trouble using the balancer on the Reef version; it throws an error while decoding the JSON from 'ceph osd dump --format json', but the output of this command is valid JSON. Do you know where the issue might be?

root@app001 ~/ceph-balancer # git pull
Updating 1c90248..b48ffbf
Fast-forward
 placementoptimizer.py | 2453 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
 1 file changed, 2089 insertions(+), 364 deletions(-)
root@app001 ~/ceph-balancer # ./placementoptimizer.py balance
Traceback (most recent call last):
  File "/root/ceph-balancer/./placementoptimizer.py", line 5060, in <module>
    exit(main())
  File "/root/ceph-balancer/./placementoptimizer.py", line 5024, in main
    state = ClusterState(args.state, osdsize_method=osdsize_method)
  File "/root/ceph-balancer/./placementoptimizer.py", line 592, in __init__
    self.load(statefile)
  File "/root/ceph-balancer/./placementoptimizer.py", line 619, in load
    osd_dump=jsoncall("ceph osd dump --format json".split()),
  File "/root/ceph-balancer/./placementoptimizer.py", line 274, in jsoncall
    return json.loads(rawdata.decode())
  File "/usr/lib64/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 2365 (char 2365)
root@app001 ~/ceph-balancer # ceph version
ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
root@app001 ~/ceph-balancer # python --version
Python 3.9.17

Thanks
Michal
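
A small diagnostic sketch (assuming it reproduces the failing call) can show what sits around the byte that the JSON decoder rejects, which should narrow down whether the problem is a stray control character, invalid UTF-8, or something else:

import json
import subprocess

# Re-run the same command the balancer uses and locate the parse error.
raw = subprocess.check_output("ceph osd dump --format json".split())
text = raw.decode(errors="replace")
try:
    json.loads(text)
    print("parsed fine")
except json.JSONDecodeError as err:
    print(f"parse error at char {err.pos}:")
    print(repr(text[max(0, err.pos - 80):err.pos + 80]))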

variance requires at least two data points

The script just throws an error

  • ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable)
  • Python 3.9.2
  • Script version: latest

Output:

./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps
[2022-03-17 15:29:12,622] gathering cluster state via ceph api...
[2022-03-17 15:29:15,735] running pg balancer
[2022-03-17 15:29:15,737] current OSD fill rate per crushclasses:
[2022-03-17 15:29:15,738]   hdd: average=121.93%, median=107.85%, without_placement_constraints=64.31%
[2022-03-17 15:29:15,738]   ssd: average=47.30%, median=47.30%, without_placement_constraints=44.71%
Traceback (most recent call last):
  File "/root/ceph-balancer/./placementoptimizer.py", line 1945, in <module>
    init_cluster_variance = get_cluster_variance(enabled_crushclasses, pg_mappings)
  File "/root/ceph-balancer/./placementoptimizer.py", line 1870, in get_cluster_variance
    class_variance = statistics.variance(osd_usages)
  File "/usr/lib/python3.9/statistics.py", line 739, in variance
    raise StatisticsError('variance requires at least two data points')
statistics.StatisticsError: variance requires at least two data points
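
A minimal sketch of a guard (not necessarily the project's actual fix): only compute the variance for a crush class that has at least two OSDs and fall back to 0 otherwise:

import statistics

def safe_variance(osd_usages):
    # variance() needs >= 2 data points; a single-OSD class is trivially balanced
    return statistics.variance(osd_usages) if len(osd_usages) >= 2 else 0.0

print(safe_variance([47.3]))        # single-OSD crush class -> 0.0
print(safe_variance([44.0, 50.6]))  # normal case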

./placementoptimizer.py showremapped --by-osd throws KeyError if there is recovery

$ ./placementoptimizer.py showremapped --state broken-cluster-1.xz --by-osd
Traceback (most recent call last):
  File "/tmp/bug-report/./placementoptimizer.py", line 5496, in <module>
    exit(main())
         ^^^^^^
  File "/tmp/bug-report/./placementoptimizer.py", line 5490, in main
    run()
  File "/tmp/bug-report/./placementoptimizer.py", line 5460, in <lambda>
    run = lambda: showremapped(args, state)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/bug-report/./placementoptimizer.py", line 5346, in showremapped
    print(f"{osdname}: {cluster.osds[osdid]['host_name']}  =>{sum_to} {sum_data_to_pp} <={sum_from} {sum_data_from_pp}"
                        ~~~~~~~~~~~~^^^^^^^
KeyError: -1

Without the --by-osd flag, it works:

$ ./placementoptimizer.py showremapped --state broken-cluster-1.xz
pg 28.28c degraded+waiting   55.7G: 0 of 1008986, 0.0%, -1->120
pg 28.272 degraded+waiting   55.6G: 570 of 1003522, 0.1%, -1->120
pg 28.277 degraded+waiting   55.5G: 0 of 1004690, 0.0%, -1->84
...

broken-cluster-1.xz will be sent by email.
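
The traceback suggests osd id -1 (no source OSD known yet during recovery) is being looked up in cluster.osds. Here is a standalone sketch of the kind of fallback that would avoid the KeyError (helper and dict shapes are hypothetical):

def osd_host_name(osds, osdid):
    # osd id -1 means "no concrete OSD", so return a placeholder instead of indexing
    if osdid < 0:
        return "(none)"
    return osds.get(osdid, {}).get("host_name", "(unknown)")

print(osd_host_name({120: {"host_name": "ceph7"}}, -1))   # (none)
print(osd_host_name({120: {"host_name": "ceph7"}}, 120))  # ceph7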

can't run it with python < 3.9

Hi,
the script fails on systems with Python older than 3.9:

  File "<fstring>", line 1
    (pool_type=)
              ^
SyntaxError: invalid syntax
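
The failing construct is the self-documenting f-string specifier (f"{pool_type=}"), which only newer Python 3 releases understand. A sketch of a backwards-compatible spelling:

# Equivalent output without the '=' specifier, works on older Python 3 as well.
pool_type = "erasure"
print(f"pool_type={pool_type}")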

Cluster topology change recursion

Running placementoptimizer.py on my Pacific (15.2.16) cluster increases the epoch count, which triggers the following exception even when the cluster is idle:

[root@ceph-admin ~]# ./placementoptimizer.py -v balance --max-pg-moves 10
fsid 227d9741-3db8-4984-a522-6442c1739578
[2022-03-25 09:18:38,347] gathering cluster state via ceph api...
Traceback (most recent call last):
  File "./placementoptimizer.py", line 233, in <module>
    raise Exception("Cluster topology changed during information gathering (e.g. a pg changed state). "
Exception: Cluster topology changed during information gathering (e.g. a pg changed state). Wait for things to calm down and try again

What am I doing wrong?

typo on line 1643

Hi,

you have a typo:

        for new_from, new_to in resulting_upmaps:
            if new_from == new_to:
                raise Exception(f"somewhere something went wrong, we map {idpg} from osd.{new_from} to osd.{new_to}")

idpg should be pgid

uses != reuses | k=8,m=3 EC pool

Hello!

I've been trying to use the balancer on my EC pool, and I am receiving the following error:

Traceback (most recent call last):
  File "./placementoptimizer.py", line 2080, in <module>
    try_pg_move.prepare_crush_check()
  File "./placementoptimizer.py", line 1060, in prepare_crush_check
    raise Exception(f"during emit, rule step {idx} item {item} was used {uses} != {reuses} expected")
Exception: during emit, rule step 0 item -15 was used 11 != 12 expected

I assume this happens during the sanity check because I have 11 chunks of data (k=8, m=3) while the CRUSH rule specifies 4 hosts / 3 OSDs (12 chunks). Is that why I am getting this error?
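
As a back-of-the-envelope check of that suspicion (assuming the sanity check simply counts how often each CRUSH item must be emitted), the numbers do line up with the error message:

k, m = 8, 3
pool_shards = k + m                  # 11 shards are actually placed
hosts, osds_per_host = 4, 3
rule_emits = hosts * osds_per_host   # 12 placements described by the rule
print(pool_shards, rule_emits)       # 11 vs 12, matching "was used 11 != 12 expected"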

I am using Proxmox 7.2 and Ceph Pacific 16.2.7. Here is my CRUSH map.

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd
device 24 osd.24 class hdd
device 25 osd.25 class hdd
device 26 osd.26 class hdd
device 27 osd.27 class hdd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class nvme
device 33 osd.33 class nvme
device 34 osd.34 class nvme
device 35 osd.35 class nvme
device 36 osd.36 class hdd
device 37 osd.37 class hdd
device 38 osd.38 class hdd
device 39 osd.39 class hdd
device 40 osd.40 class nvme
device 41 osd.41 class nvme
device 42 osd.42 class hdd
device 43 osd.43 class hdd
device 44 osd.44 class hdd
device 45 osd.45 class hdd
device 46 osd.46 class hdd
device 47 osd.47 class hdd
device 48 osd.48 class hdd
device 49 osd.49 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host server1 {
	id -3		# do not change unnecessarily
	id -16 class nvme		# do not change unnecessarily
	id -11 class hdd		# do not change unnecessarily
	# weight 31.765
	alg straw2
	hash 0	# rjenkins1
	item osd.18 weight 2.729
	item osd.19 weight 2.729
	item osd.22 weight 2.729
	item osd.23 weight 2.729
	item osd.24 weight 2.729
	item osd.34 weight 0.873
	item osd.16 weight 5.458
	item osd.20 weight 5.458
	item osd.38 weight 5.458
	item osd.1 weight 0.873
}
host server2 {
	id -5		# do not change unnecessarily
	id -17 class nvme		# do not change unnecessarily
	id -12 class hdd		# do not change unnecessarily
	# weight 31.765
	alg straw2
	hash 0	# rjenkins1
	item osd.4 weight 2.729
	item osd.5 weight 2.729
	item osd.6 weight 2.729
	item osd.8 weight 2.729
	item osd.9 weight 2.729
	item osd.32 weight 0.873
	item osd.7 weight 5.458
	item osd.10 weight 5.458
	item osd.31 weight 5.458
	item osd.3 weight 0.873
}
host server3 {
	id -7		# do not change unnecessarily
	id -18 class nvme		# do not change unnecessarily
	id -13 class hdd		# do not change unnecessarily
	# weight 31.765
	alg straw2
	hash 0	# rjenkins1
	item osd.11 weight 2.729
	item osd.12 weight 2.729
	item osd.14 weight 2.729
	item osd.17 weight 2.729
	item osd.2 weight 0.873
	item osd.15 weight 5.458
	item osd.36 weight 5.458
	item osd.13 weight 2.729
	item osd.37 weight 5.458
	item osd.33 weight 0.873
}
host server4 {
	id -9		# do not change unnecessarily
	id -19 class nvme		# do not change unnecessarily
	id -14 class hdd		# do not change unnecessarily
	# weight 31.765
	alg straw2
	hash 0	# rjenkins1
	item osd.25 weight 2.729
	item osd.27 weight 2.729
	item osd.26 weight 2.729
	item osd.28 weight 2.729
	item osd.30 weight 2.729
	item osd.35 weight 0.873
	item osd.0 weight 0.873
	item osd.21 weight 5.458
	item osd.29 weight 5.458
	item osd.39 weight 5.458
}
host server5 {
	id -2		# do not change unnecessarily
	id -4 class nvme		# do not change unnecessarily
	id -6 class hdd		# do not change unnecessarily
	# weight 31.766
	alg straw2
	hash 0	# rjenkins1
	item osd.40 weight 0.873
	item osd.41 weight 0.873
	item osd.42 weight 5.458
	item osd.43 weight 5.458
	item osd.44 weight 2.729
	item osd.45 weight 5.458
	item osd.46 weight 2.729
	item osd.47 weight 2.729
	item osd.48 weight 2.729
	item osd.49 weight 2.729
}
root default {
	id -1		# do not change unnecessarily
	id -20 class nvme		# do not change unnecessarily
	id -15 class hdd		# do not change unnecessarily
	# weight 158.826
	alg straw2
	hash 0	# rjenkins1
	item server1 weight 31.765
	item server2 weight 31.765
	item server3 weight 31.765
	item server4 weight 31.765
	item server5 weight 31.766
}

rule storage_metadata {
	id 2
	type replicated
	min_size 2
	max_size 3
	step take default class nvme
	step chooseleaf firstn 0 type host
	step emit
}
rule storage_data {
	id 3
	type erasure
	min_size 10
	max_size 11
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step take default class hdd
	step choose indep 4 type host
	step chooseleaf indep 3 type osd
	step emit
}
# end crush map

EC Profile

crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=8
m=3
plugin=jerasure
technique=reed_sol_van
w=8

ZeroDivisionError

Getting a ZeroDivisionError when we run the balance command

# ./placementoptimizer.py -v balance --max-pg-moves 10
[2021-11-13 16:42:46,885] running pg balancer
Traceback (most recent call last):
  File "./placementoptimizer.py", line 1745, in <module>
    pg_mappings = PGMappings(pgs, osds)
  File "./placementoptimizer.py", line 1225, in __init__
    pg_obj_size = shardsize / pg_objs
ZeroDivisionError: division by zero

I'm guessing it's because we have a couple of OSDs down with zero PGs?

# ceph osd df | sort -k17 | head
ID   CLASS WEIGHT   REWEIGHT SIZE    RAW USE DATA     OMAP    META    AVAIL    %USE  VAR  PGS STATUS 
MIN/MAX VAR: 0.07/1.28  STDDEV: 17.89
                       TOTAL 8.7 PiB 6.0 PiB  5.9 PiB 986 GiB  21 TiB  2.8 PiB 68.37                 
1073   hdd 12.87889        0     0 B     0 B      0 B     0 B     0 B      0 B     0    0   0   down 
  71   hdd 12.87889        0     0 B     0 B      0 B     0 B     0 B      0 B     0    0   0   down 
1050   hdd 12.87889  1.00000  13 TiB 1.4 TiB  1.1 TiB  52 KiB 3.8 GiB   12 TiB 10.77 0.16   9     up 
1054   hdd 12.87889  1.00000  13 TiB 1.4 TiB  1.1 TiB 132 KiB 3.9 GiB   12 TiB 10.98 0.16   7     up 
1063   hdd 12.87889  1.00000  13 TiB 1.6 TiB  1.4 TiB  52 KiB 4.6 GiB   11 TiB 12.64 0.18  10     up 
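
A minimal sketch of a guard (not necessarily the repo's actual fix): avoid dividing by the object count when a PG shard reports zero objects, e.g. on a down OSD:

def shard_object_size(shardsize, pg_objs):
    # empty shard -> average object size of 0 instead of ZeroDivisionError
    return shardsize / pg_objs if pg_objs else 0

print(shard_object_size(0, 0))          # empty PG shard -> 0
print(shard_object_size(8_000_000, 4))  # normal case -> 2000000.0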

Export and import of all needed cluster state as files

To simplify debugging, the balancer should be able to work on an imported cluster state. To generate that state, it also needs to be able to produce a state bundle.

The easiest approach would be to generate one large JSON output containing all the data collected from the various ceph commands. This file can then be shared for direct debugging and testing, without needing access to the live cluster.
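
A rough sketch of what such a bundle could look like (the command list is illustrative, not necessarily what the tool's --state support ended up gathering):

import json
import subprocess

COMMANDS = {
    "osd_dump": "ceph osd dump --format json",
    "osd_df": "ceph osd df --format json",
    "pg_dump": "ceph pg dump --format json",
    "crush_dump": "ceph osd crush dump --format json",
    "df": "ceph df detail --format json",
}

# Collect every command's JSON output into one dict and write it as a single file.
state = {name: json.loads(subprocess.check_output(cmd.split()))
         for name, cmd in COMMANDS.items()}

with open("cluster-state.json", "w") as fd:
    json.dump(state, fd)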

prepare_crush_check exception for osd.2^31-1

It might be worth handling osd = 2^31-1 (a.k.a. NONE, the placeholder for a missing OSD in a PG) more gracefully:

[root@ceph-admin ~]# ./placementoptimizer.py -v balance --max-pg-moves 10 --max-full-move-attempts=100 | tee /tmp/balance-upmaps
[2022-06-27 15:09:30,803] gathering cluster state via ceph api...
[2022-06-27 15:10:05,578] running pg balancer
[2022-06-27 15:10:05,865] current OSD fill rate per crushclasses:
[2022-06-27 15:10:05,865]   mddu: average=0.21%, median=0.21%, without_placement_constraints=1.16%
[2022-06-27 15:10:05,866]   mddm: average=0.16%, median=0.16%, without_placement_constraints=1.09%
[2022-06-27 15:10:05,866]   nvme: average=0.01%, median=0.01%, without_placement_constraints=0.11%
[2022-06-27 15:10:05,868]   ssd: average=44.91%, median=44.56%, without_placement_constraints=40.89%
[2022-06-27 15:10:05,870]   hdd: average=98.72%, median=61.03%, without_placement_constraints=54.26%
[2022-06-27 15:10:05,870]   smr: average=0.00%, median=0.00%, without_placement_constraints=0.03%
[2022-06-27 15:10:05,877] cluster variance for crushclasses:
[2022-06-27 15:10:05,877]   mddu: 0.000
[2022-06-27 15:10:05,877]   mddm: 0.000
[2022-06-27 15:10:05,877]   nvme: 0.000
[2022-06-27 15:10:05,877]   ssd: 7.102
[2022-06-27 15:10:05,877]   hdd: 4505.848
[2022-06-27 15:10:05,877]   smr: 0.000
[2022-06-27 15:10:05,877] min osd.253 0.000%
[2022-06-27 15:10:05,877] max osd.1986 405.902%
[2022-06-27 15:10:05,877] osd.2231 has calculated usage >= 100%: 100.29076207793798%
[2022-06-27 15:10:05,878] osd.365 has calculated usage >= 100%: 100.55524411861991%
...
[2022-06-27 15:10:06,411] osd.2215 has calculated usage >= 100%: 341.60815893352543%
[2022-06-27 15:10:06,411] osd.2221 has calculated usage >= 100%: 342.0460392118455%
Traceback (most recent call last):
  File "./placementoptimizer.py", line 2079, in <module>
    try_pg_move.prepare_crush_check()
  File "./placementoptimizer.py", line 1007, in prepare_crush_check
    raise Exception(f"no trace found for {pg_osd} in {rule_root_name}")
Exception: no trace found for 2147483647 in default~hdd
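
A sketch of the idea (hypothetical helper, not the repo's code): treat 2^31-1 as "no OSD assigned" and filter it out before looking up CRUSH traces:

OSD_NONE = 2**31 - 1  # 2147483647, Ceph's placeholder for a missing OSD in a PG

def real_osds(pg_up_set):
    # drop the NONE placeholder so only actually-assigned OSDs are traced
    return [osd for osd in pg_up_set if osd != OSD_NONE]

print(real_osds([180, 2147483647, 605]))  # [180, 605]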

RuntimeError: pg 18.6a to be moved to osd.117 is misplaced with -198781.0<0 objects already transferred

I am using the script to watch the progress of backfills on a broken cluster, but it throws an exception:

Traceback (most recent call last):
  File "/root/./placementoptimizer.py", line 5496, in <module>
    exit(main())
  File "/root/./placementoptimizer.py", line 5451, in main
    state.preprocess()
  File "/root/./placementoptimizer.py", line 2183, in preprocess
    raise RuntimeError(f"pg {pg_incoming} to be moved to osd.{osdid} is misplaced "
RuntimeError: pg 18.6a to be moved to osd.117 is misplaced with -198781.0<0 objects already transferred

I will send you the dump via email. Yes, I know that one PG is not recoverable without the manual export/import.

Do Not Skip the Entire Pool When Balancing PGs

Hello,

In one of my small clusters, the balancer attempted to move the first PG of a pool, did not find a suitable target, and skipped this pool completely, so it generated no move suggestions for this pool at all. I think the balancer should have checked whether the remaining PGs of the same pool could be moved to other OSDs.

PS: the related logic is at lines 2126 to 2128.
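
One possible reading of the request, as a hedged sketch with hypothetical names: only give up on a pool after several of its PGs in a row found no target, instead of after the first one:

MAX_FAILED_PGS_PER_POOL = 5
failed_in_pool = {}  # poolid -> consecutive placement attempts without a target

def note_failed_move(poolid, unsuccessful_pools):
    failed_in_pool[poolid] = failed_in_pool.get(poolid, 0) + 1
    if failed_in_pool[poolid] >= MAX_FAILED_PGS_PER_POOL:
        unsuccessful_pools.add(poolid)

pools_to_skip = set()
for _ in range(MAX_FAILED_PGS_PER_POOL):
    note_failed_move(14, pools_to_skip)
print(pools_to_skip)  # {14}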

key error - on healthy cluster

Hi,
I am trying to run it with the options

 -v balance --osdsize device --osdused delta --max-pg-moves 2 --osdfrom fullest

and I am getting this error:

[2024-04-15 12:31:58,685] gathering cluster state via ceph api...
[2024-04-15 12:32:33,852] running pg balancer
Traceback (most recent call last):
  File "jj.py", line 5496, in <module>
    exit(main())
  File "jj.py", line 5490, in main
    run()
  File "jj.py", line 5454, in <lambda>
    run = lambda: balance(args, state)
  File "jj.py", line 4607, in balance
    pg_mappings = PGMappings(cluster,
  File "jj.py", line 3265, in __init__
    self.init_analyzer.analyze(self)
  File "jj.py", line 4288, in analyze
    self._update_stats()
  File "jj.py", line 4374, in _update_stats
    self.cluster_variance = self.pg_mappings.get_cluster_variance()
  File "jj.py", line 3788, in get_cluster_variance
    for crushclass, usages in self.get_class_osd_usages().items():
  File "jj.py", line 3526, in get_class_osd_usages
    ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
  File "jj.py", line 3526, in <dictcomp>
    ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
  File "jj.py", line 3774, in get_osd_usage
    used = self.get_osd_usage_size(osdid, add_size)
  File "jj.py", line 3731, in get_osd_usage_size
    used += self.cluster.osd_transfer_remainings[osdid]
KeyError: 3

The cluster is Ceph Quincy with 1500 OSDs (SSDs and HDDs) and several pools; the main and most utilized pool is erasure-coded.
Do you need any other details?

Thank you
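
Judging from the traceback, osd_transfer_remainings has no entry for OSD 3. A defensive-lookup sketch (the attribute and dict shapes only mirror the traceback, they are not the repo's code):

osd_transfer_remainings = {5: 123_456, 17: 0}  # example data

def usage_with_remainder(used, osdid):
    # fall back to 0 when an OSD has no recorded transfer remainder
    return used + osd_transfer_remainings.get(osdid, 0)

print(usage_with_remainder(1000, 3))  # OSD 3 missing -> 1000 unchanged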

Balancer script expects that every OSD class is present in cluster

The following happened to me after I migrated all data from ssd-class OSDs to nvme ones, but did not remove the ssd class from CRUSH. I kept it because I might re-add such OSDs later and don't want to get rid of the existing CRUSH rules.

./placementoptimizer.py -v balance --max-pg-moves 20 | tee -a "$(date +%Y-%m-%d)_balance-upmaps_1"
[2022-04-06 20:10:41,020] gathering cluster state via ceph api...
Traceback (most recent call last):
  File "./placementoptimizer.py", line 269, in <module>
    class_df_stats = CLUSTER_STATE["df_dump"]["stats_by_class"][crush_class]
KeyError: 'ssd'

I completely understand what's happening here; I just expect the script to simply ignore classes without OSDs.
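
A sketch of the requested behaviour (the dict shapes are illustrative): look the class up defensively and skip classes that no longer have any OSDs reporting stats:

stats_by_class = {"nvme": {"total_bytes": 10}, "hdd": {"total_bytes": 20}}  # example data
crush_classes = ["nvme", "hdd", "ssd"]

for crush_class in crush_classes:
    class_df_stats = stats_by_class.get(crush_class)
    if class_df_stats is None:
        print(f"ignoring crush class {crush_class!r}: no OSDs report stats")
        continue
    print(crush_class, class_df_stats["total_bytes"])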

balancer generates EINVAL num of osd pairs

./placementoptimizer.py -v balance --max-pg-moves 1000 --max-full-move-attempts=1000

generates pg-upmap-items commands that fail with:

Error EINVAL: num of osd pairs (4) > pool size (3)

For example, here are some overly long OSD pair lists for a triply replicated pool:

ceph osd pg-upmap-items 14.36e7 2020 1591 2019 1612 1591 1438 1612 597
ceph osd pg-upmap-items 14.12c6 1093 1767 2018 638
ceph osd pg-upmap-items 14.2f63 817 1357 710 1613 1357 1366 1613 630
ceph osd pg-upmap-items 14.2790 1312 674 822 1763 674 1740 1763 1470 1740 1445 1470 601

Taking a closer look at the last entry, there are indeed OSDs listed that are not used by that PG:

[root@ceph-admin ~]# ceph pg dump | awk '$1 == "14.2790"'
dumped all
14.2790     2214                   0         0          0        0    8615245675            0           0    887       887                 active+clean  2022-08-01T07:56:09.531220-0700    858000'9325   871388:788167                             [180,1312,605]         180                             [180,1312,605]             180    858000'9325  2022-08-01T07:56:09.530894-0700      858000'9325  2022-08-01T07:56:09.530894-0700              0
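
The long lists look like chained moves (1312->674, 674->1740, ...). Below is a sketch of collapsing such chains into their net effect, which would keep the pair count within the pool size; this is purely illustrative, not the balancer's actual code:

def collapse_upmap_chain(items):
    mapping = dict(items)
    collapsed = {}
    for src in mapping:
        if src in mapping.values():
            continue                  # src is itself a target, i.e. mid-chain
        dst = mapping[src]
        while dst in mapping:         # follow the chain to its final target
            dst = mapping[dst]
        if src != dst:
            collapsed[src] = dst
    return sorted(collapsed.items())

chain = [(1312, 674), (822, 1763), (674, 1740), (1763, 1470), (1740, 1445), (1470, 601)]
print(collapse_upmap_chain(chain))  # [(822, 601), (1312, 1445)]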

showremapped output even if topology change

Why does placementoptimizer.py showremapped raise an exception if the cluster topology has changed?

raise Exception("Cluster topology changed during information gathering (e.g. a pg changed state). "

Since this is presumably a read-only operation, how about changing that exception to a warning and outputting the currently remapped PGs nonetheless?
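
A sketch of the suggested softening (the surrounding names are hypothetical): warn instead of aborting when a read-only subcommand notices an epoch change while gathering data:

import logging

def check_epoch(epoch_before, epoch_after, readonly):
    if epoch_before == epoch_after:
        return
    msg = "Cluster topology changed during information gathering"
    if readonly:
        logging.warning("%s - showremapped output may be slightly stale", msg)
    else:
        raise Exception(f"{msg}. Wait for things to calm down and try again")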
