ceph / ceph-salt


Deploy Ceph clusters using cephadm

License: MIT License

Python 89.33% SaltStack 7.82% Groovy 0.30% Shell 0.08% Roff 1.68% Jinja 0.79%

cephadm saltstack-formula

ceph-salt's People


rjfd lenzgr ricardoasmarques gekios smithfarm sebastian-philipp bk201 jschmid1 toabctl votdev swiftgist jan--f mgfritch tserong isabella232 johnz521 yemo-memeda p-se kshtsk giubacc

ceph-salt's Issues

No default value for `/Containers/Images/ceph`

/Containers/Images/ceph should not have a default value.

MGRs are using the localhost network as in `127.0.0.1:6800`

$ ceph mgr dump
...
    "active_addrs": {
        "addrvec": [
            {
                "type": "v2",
                "addr": "127.0.0.1:6800",
                "nonce": 1
            },
            {
                "type": "v1",
                "addr": "127.0.0.1:6801",
                "nonce": 1
            }
        ]
    },

We should make sure that MGRs use proper addresses.

Admin role

ATM, ceph-common and /etc/ceph/ceph.client.admin.keyring are installed on all "mons".

Instead, we should have a dedicated "Admin" role, and only install ceph tools on minions that have that role.

Links
ceph/ceph#33793

OSD deployment fails due to invalid drive groups syntax

Problem

Due to a recently introduced regression, ceph-salt stopped deploying OSDs.

Symptom

The "Deploying OSD groups 1/1" step finishes immediately, and then sesdev create octopus ... --qa-test ... fails because the cluster has zero OSDs.

Analysis

ceph-salt uses a command like the following to deploy OSDs:

echo '{\"testing_dg_admin\": {\"host_pattern\": \"admin*\", \"data_devices\": {\"all\": true}}}' | ceph orch osd create -i -

When I try this command manually, it fails:

admin:~ # echo '{\"testing_dg_admin\": {\"host_pattern\": \"admin*\", \"data_devices\": {\"all\": true}}}' | ceph orch osd create -i -
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1070, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 191, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 309, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 153, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 144, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 437, in _create_osd
    dgs = DriveGroupSpecs(yaml.load(inbuf))
  File "/usr/lib/python3.6/site-packages/ceph/deployment/drive_group.py", line 117, in __init__
    self.build_drive_groups()
  File "/usr/lib/python3.6/site-packages/ceph/deployment/drive_group.py", line 122, in build_drive_groups
    (drive_group_spec, name=drive_group_name))
  File "/usr/lib/python3.6/site-packages/ceph/deployment/drive_group.py", line 228, in from_json
    "Feature <{}> is not supported".format(applied_filter))
ceph.deployment.drive_group.DriveGroupValidationError: Failed to validate Drive Group: Feature <\"host_pattern\"> is not supported

admin:~ # echo $?
22

When I change the single quotes that are around the JSON in the echo statement to double quotes, it succeeds. It also succeeds with the single-quotes, provided I remove the backslashes from the JSON.
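The quoting behaviour described above can be reproduced without a cluster. The following is a minimal sketch in Python, assuming the payload is built from a dictionary; `build_osd_create_command` and `is_valid_json` are hypothetical helpers for illustration, not part of ceph-salt:

```python
import json
import shlex

# What the shell passes on when backslash-escaped JSON sits inside single
# quotes: the backslashes are kept literally, so the payload is not valid JSON.
broken_payload = r'{\"testing_dg_admin\": {\"host_pattern\": \"admin*\"}}'


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def build_osd_create_command(drive_groups):
    """Hypothetical helper: serialize once, then let shlex do the quoting."""
    payload = json.dumps(drive_groups)
    return "echo {} | ceph orch osd create -i -".format(shlex.quote(payload))


spec = {"testing_dg_admin": {"host_pattern": "admin*",
                             "data_devices": {"all": True}}}
command = build_osd_create_command(spec)
```

Serializing with `json.dumps` and quoting with `shlex.quote` avoids hand-written escapes, so the shell receives the JSON without stray backslashes.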

Check if reboot is needed on non-SUSE distros

After updating packages, ceph-bootstrap checks if a reboot is needed.

ATM this check is only implemented for SUSE (zypper ps), but other distributions should also be supported.

Remove minions only if no error is thrown

When I try to remove one minion that has roles I receive an error message, but the internal state was partially changed.

/Cluster/Minions> ls /Cluster
o- Cluster ..................................................................................................... [...]
  o- Minions ............................................................................................ [Minions: 2]
  | o- node1.ceph.com ..................................................................................... [no roles]
  | o- node2.ceph.com .......................................................................................... [mgr]
  o- Roles ..................................................................................... [Minions w/ roles: 1]
    o- Mgr .............................................................................................. [Minions: 1]
    | o- node2.ceph.com ............................................................................. [no other roles]
    o- Mon .............................................................................................. [no minions]



/Cluster/Minions> rm node2*
Cannot remove host 'node2.ceph.com' because it has roles defined: {'mgr'}


/Cluster/Minions> rm node2*
No minions matched "node2*".
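One way to avoid the partial change is to validate every matched minion before mutating anything. The sketch below assumes a simplified data model (a list of minion IDs and a dict mapping role names to sets of IDs); `remove_minions` and `MinionHasRolesError` are hypothetical names, not the actual ceph-bootstrap code:

```python
import fnmatch


class MinionHasRolesError(Exception):
    """Raised when a minion that still has roles is about to be removed."""


def remove_minions(minions, roles, pattern):
    """Remove every minion matching pattern, or change nothing at all.

    minions: list of minion IDs; roles: dict mapping role name -> set of IDs.
    """
    matched = [m for m in minions if fnmatch.fnmatch(m, pattern)]
    if not matched:
        raise ValueError('No minions matched "{}".'.format(pattern))
    # Phase 1: validate every candidate before touching any state.
    for minion in matched:
        assigned = {role for role, members in roles.items() if minion in members}
        if assigned:
            raise MinionHasRolesError(
                "Cannot remove host '{}' because it has roles defined: {}"
                .format(minion, assigned))
    # Phase 2: all checks passed, so the removal cannot stop halfway.
    for minion in matched:
        minions.remove(minion)
    return matched
```

With this ordering, a second `rm node2*` after the error would report the same error again instead of "No minions matched".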

The "bootstrap_minion" requires both "mon" and "mgr" roles

We should only set the "bootstrap_minion" value if the minion has both "mon" and "mgr" roles.
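The rule can be expressed as a set intersection. This is a sketch under the assumption that roles are stored as a dict of sets; `pick_bootstrap_minion` is a hypothetical name, and taking the alphabetically first candidate is an arbitrary tie-breaker chosen only to make the result deterministic:

```python
def pick_bootstrap_minion(roles):
    """Return a minion that has both 'mon' and 'mgr' roles, or None."""
    candidates = roles.get("mon", set()) & roles.get("mgr", set())
    # Assumption: any minion with both roles qualifies; pick one predictably.
    return min(candidates) if candidates else None
```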

Implement `status` command

The ceph-bootstrap status command should perform some validations to guarantee that ceph-bootstrap can work correctly on the installed system, and show the status of those checks. If a check fails, it should suggest possible actions to the user.

List of validations:

Salt
- check if salt-master package is installed, or just that salt-master is available (it might be installed from source)
- check if we can communicate with the salt-master, execute operations, etc...
ceph-salt-formula
- check if package is installed (it should be if ceph-bootstrap was installed through the RPM)
- check if state files are present
- check if salt pillar is correctly configured by looking at top.sls file
  - if not, it should suggest to run ceph-bootstrap init (Issue #8) that should take care of making sure
    the pillar gets correctly configured.
Config
- check that at least one minion is both 'mgr' and 'mon'

Error applying "ceph-salt" state without roles

I have the following configuration, without roles:

admin:~ # ceph-bootstrap config ls
o- / ........................................................................................................... [...]
  o- Cluster ................................................................................................... [...]
  | o- Minions .......................................................................................... [Minions: 1]
  | | o- node1.octopusipv6.com ............................................................................ [no roles]
  | o- Roles ................................................................................... [Minions w/ roles: 0]
  |   o- Mgr ............................................................................................ [no minions]
  |   o- Mon ............................................................................................ [no minions]
  o- Containers ................................................................................................ [...]
  | o- Images .................................................................................................. [...]
  |   o- ceph .................................................................... [docker.io/ceph/daemon-base:latest]
  o- Deployment ................................................................................................ [...]
  | o- Bootstrap .......................................................................................... [disabled]
  | o- Dashboard ............................................................................................... [...]
  | | o- password ............................................................................... [randomly generated]
  | | o- username ............................................................................................ [admin]
  | o- Mgr ................................................................................................ [disabled]
  | o- Mon ................................................................................................ [disabled]
  | o- OSD ................................................................................................ [disabled]
  o- Network ................................................................................................... [...]
  | o- Address_Family .......................................................................................... [ip4]
  o- SSH ........................................................................................... [no key pair set]
  | o- Private_Key .............................................................................. [no private key set]
  | o- Public_Key ................................................................................ [no public key set]
  o- Storage ................................................................................................... [...]
  | o- Drive_Groups .......................................................................................... [empty]
  o- Time_Server ........................................................................................... [enabled]
    o- External_Servers ...................................................................................... [empty]
    o- Server_Hostname ....................................................................... [node1.octopusipv6.com]

When applying the ceph-salt state, I'm expecting ceph-bootstrap to configure timeserver:

salt -G 'ceph-salt:member' state.apply ceph-salt

But I get the following error:

admin:~ # salt -G 'ceph-salt:member' state.apply ceph-salt
node1.octopusipv6.com:
    Data failed to compile:
----------
    Rendering SLS 'base:ceph-salt' failed: Jinja variable 'dict object' has no attribute 'bootstrap_mon'
ERROR: Minions returned with non-zero exit code

MDS deployment

Support IPv6

It should be possible to set if we want to use IPv4 or IPv6.

A new config entry should be created:

/Network/Address_Family

with the following commands available:

set ip4
set ip6

Improve states and steps descriptions

For instance, some stages start with Install ... others with Setting up ....

We should find a more consistent description for "states" and "steps".

Use salt-event bus to notify about execution progress

Salt is very silent when running a salt formula or any salt state. The objective of this feature is to use the salt-event bus to notify about the execution progress of the several steps performed by the ceph-salt formula.

The event tags should have the following structure ceph-salt/<step_name>/[started, running, finished]

The payload of each event should depend on the <step_name> and the type of event [started, running, finished]. Each step should always trigger a started event when it starts, and a finished event when it finishes. The running event is for communicating progress information, for instance, when the step takes a long time to execute and it is possible to send some kind of progress percentage.

The finished event should include the information of whether the operation was successful or not, and if not, it should describe the failure.
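The tag structure above could be built as follows. This is only a sketch: the tag format comes from the description, while `build_event` and the payload field names (`progress`, `success`, `error`) are assumptions made for illustration:

```python
VALID_PHASES = ("started", "running", "finished")


def build_event(step_name, phase, success=None, error=None, progress=None):
    """Return (tag, payload) for one step event.

    The tag follows ceph-salt/<step_name>/<phase>; the payload keys are
    hypothetical and only illustrate the rules described in the issue.
    """
    if phase not in VALID_PHASES:
        raise ValueError("unknown phase: {}".format(phase))
    tag = "ceph-salt/{}/{}".format(step_name, phase)
    payload = {}
    if phase == "running" and progress is not None:
        payload["progress"] = progress
    if phase == "finished":
        # A finished event always says whether the step succeeded.
        payload["success"] = bool(success)
        if not success:
            payload["error"] = error or "unknown failure"
    return tag, payload
```

The resulting tag and payload could then be handed to Salt's event system from within the formula.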

Rename "bootstrap_mon" to "bootstrap_minion"

"bootstrap_mon" is the cluster’s first manager and monitor, so it should be renamed to "bootstrap_minion".

Salt Formula: Install podman

ceph/ceph@b16f19e

As podman is no longer a dependency of cephadm, we have to make sure it is installed on the minions.

Add "Apparmor" option group to config shell

ceph-bootstrap

Add an Apparmor option group to the config shell. This group should be able to configure all options required by the ceph-salt-formula:apparmor state.

ceph-salt-formula

Currently the apparmor state is not doing much. We should check what was being done in DeepSea (https://github.com/SUSE/DeepSea/tree/master/srv/salt/ceph/apparmor) and add those things to ceph-salt formula.

Also, we need to check what cephadm is doing in this regard, and make sure that what the ceph-salt formula performs is compatible with cephadm.

Removing the last MON does not remove the "bootstrap_mon" from Pillar

Removing the last MON should remove the "bootstrap_mon" from Pillar.

How to reproduce:

admin:~ # ceph-bootstrap config /Cluster ls
o- Cluster ............................................................................. [...]
  o- Minions .................................................................... [no minions]
  o- Roles ............................................................. [Minions w/ roles: 0]
    o- Mgr ...................................................................... [no minions]
    o- Mon ...................................................................... [no minions]

admin:~ # salt 'node1.ceph.com' pillar.get ceph-salt
node1.ceph.com:

admin:~ # ceph-bootstrap config /Cluster/Minions add node1.ceph.com
1 minion added.

admin:~ # ceph-bootstrap config /Cluster/Roles/Mon add node1.ceph.com
1 minion added.

admin:~ # salt 'node1.ceph.com' pillar.get ceph-salt
node1.ceph.com:
    ----------
    bootstrap_mon:
        node1.ceph.com
    minions:
        ----------
        all:
            - node1
        mgr:
        mon:
            ----------
            node1:
                10.20.39.201

admin:~ # ceph-bootstrap config /Cluster/Roles/Mon rm node1.ceph.com
1 minion removed.

admin:~ # salt 'node1.ceph.com' pillar.get ceph-salt
node1.ceph.com:
    ----------
    bootstrap_mon:
        node1.ceph.com
    minions:
        ----------
        all:
        mgr:
        mon:
            ----------

Mention ceph-salt on openSUSE:Ceph wiki page

See

Deployment must be idempotent

ATM, OSD deployment fails when ceph-bootstrap deploy is executed a second time:

Use jinja macros to access the pillar data

There is a lot of duplicate code to fetch data from the pillar, especially due to all the handling of non-existent keys in the pillar.

The objective is to create some jinja macros that can be used to reduce the amount of duplicate code.

See https://jinja.palletsprojects.com/en/2.10.x/templates/#macros
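The issue asks for Jinja macros; the underlying idea, a single lookup that tolerates missing keys, is shown here in Python purely as an illustration. `pillar_get` is a hypothetical helper and the colon-delimited path is an assumption modelled on how Salt addresses nested pillar keys:

```python
def pillar_get(pillar, path, default=None, delimiter=":"):
    """Walk a nested dict along a delimited path, returning default on a miss."""
    node = pillar
    for key in path.split(delimiter):
        # Stop as soon as the path leaves the dictionary structure.
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node
```

A macro with the same contract would let every state file replace its own chain of existence checks with one call.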

config shell: `ValueError: No such path unknown-key` not caught

admin:~ # ceph-bootstrap config
/> /Cluster/Roles/Mmg add node1*
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/configshell_fb/shell.py", line 811, in _execute_command
    target = self._current_node.get_node(path)
  File "/usr/lib/python3.6/site-packages/configshell_fb/node.py", line 1854, in get_node
    return next_node.get_node(next_path)
  File "/usr/lib/python3.6/site-packages/configshell_fb/node.py", line 1862, in get_node
    return next_node.get_node(next_path)
  File "/usr/lib/python3.6/site-packages/configshell_fb/node.py", line 1862, in get_node
    return next_node.get_node(next_path)
  File "/usr/lib/python3.6/site-packages/configshell_fb/node.py", line 1865, in get_node
    return adjacent_node(path)
  File "/usr/lib/python3.6/site-packages/configshell_fb/node.py", line 1827, in adjacent_node
    return self.get_child(name)
  File "/usr/lib/python3.6/site-packages/configshell_fb/node.py", line 1796, in get_child
    % (self.path.rstrip('/'), name))
ValueError: No such path /Cluster/Roles/Mmg

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/ceph-bootstrap", line 11, in <module>
    load_entry_point('ceph-bootstrap==15.0.2+1580743520.g1c1e49b', 'console_scripts', 'ceph-bootstrap')()
  File "/usr/lib/python3.6/site-packages/ceph_bootstrap/__init__.py", line 55, in ceph_bootstrap_main
    cli(prog_name='ceph-bootstrap')
  File "/usr/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ceph_bootstrap/__init__.py", line 85, in config_shell
    if not run_config_shell():
  File "/usr/lib/python3.6/site-packages/ceph_bootstrap/config_shell.py", line 803, in run_config_shell
    shell.run_interactive()
  File "/usr/lib/python3.6/site-packages/configshell_fb/shell.py", line 905, in run_interactive
    self._cli_loop()
  File "/usr/lib/python3.6/site-packages/configshell_fb/shell.py", line 734, in _cli_loop
    self.run_cmdline(cmdline)
  File "/usr/lib/python3.6/site-packages/configshell_fb/shell.py", line 848, in run_cmdline
    self._execute_command(path, command, pparams, kparams)
  File "/usr/lib/python3.6/site-packages/configshell_fb/shell.py", line 813, in _execute_command
    raise ExecutionError(str(msg))
configshell_fb.node.ExecutionError: No such path /Cluster/Roles/Mmg
admin:~ #

Looks like we should catch this exception and just print out the error.
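A minimal sketch of that handling, assuming the interactive loop calls a function per command line. The `ExecutionError` class below is a local stand-in for the exception shown in the traceback, and `run_cmdline_safely` is a hypothetical wrapper:

```python
class ExecutionError(Exception):
    """Stand-in for the configshell ExecutionError, used only in this sketch."""


def run_cmdline_safely(run_cmdline, cmdline, out=print):
    """Run one command line; report an ExecutionError instead of crashing.

    Returns True if the command ran, False if an error was printed.
    """
    try:
        run_cmdline(cmdline)
    except ExecutionError as exc:
        # Show only the message, e.g. "No such path /Cluster/Roles/Mmg".
        out(str(exc))
        return False
    return True
```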

Root privileges required

If the user does not have root privileges, then ceph-bootstrap should return immediately.

See: https://github.com/SUSE/DeepSea/blob/master/cli/common.py#L48-L56
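A check of this kind can be sketched as below. It assumes a Unix system (`os.geteuid` is not available elsewhere); the function names, the message and the exit code are hypothetical and not taken from DeepSea or ceph-bootstrap:

```python
import os
import sys


def is_root(euid=None):
    """Return True if the effective user ID is 0 (root)."""
    if euid is None:
        euid = os.geteuid()
    return euid == 0


def check_root_privileges(euid=None):
    """Return an exit code: 0 for root, 1 (after printing a hint) otherwise."""
    if is_root(euid):
        return 0
    sys.stderr.write("Root privileges are required to run this tool\n")
    return 1
```

Calling `check_root_privileges()` first in the entry point and exiting on a non-zero result would stop the tool before any Salt call is made.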

Allow explicitly setting the Mon IP

ATM, ceph-bootstrap always uses the IP address from the fqdn_ip4 grain to configure Monitors.

This address can be used by default, but it should be possible to explicitly change it to a different value.

This can be done in the /Deployment group, e.g.:

 ...
 o- Deployment ............................................................... [...]
  | ...
  | o- Mon ............................................................... [enabled]
  | | o- node1.ceph.com .......................................... [192.169.100.201]
 ...

When implementing this we should remove the CephNodeFqdnResolvesToLoopback validation when adding a minion, and only validate IPs before the deployment.
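Validating the addresses just before deployment could look like the sketch below. It assumes the configured Mon addresses are available as a dict of minion ID to address string; `validate_mon_ips` is a hypothetical name:

```python
import ipaddress


def validate_mon_ips(mon_ips):
    """Return a dict of minion -> problem for every unusable Mon address."""
    problems = {}
    for minion, address in mon_ips.items():
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            problems[minion] = "not a valid IP address: {!r}".format(address)
            continue
        # A loopback address would make the Mon unreachable from other hosts.
        if ip.is_loopback:
            problems[minion] = "resolves to loopback: {}".format(ip)
    return problems
```

An empty result means the deployment can proceed; otherwise the messages can be shown to the user per minion.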

Display "bootstrap_minion" in the UI

ATM the "bootstrap_minion" value is automatically set by "ceph-bootstrap config", but this value is not visible in the UI.

Need a way to get the configuration database in JSON

Please implement a way to dump the configuration database in JSON. For example, a --format json switch for ceph-bootstrap config ls. Thanks!
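As a rough illustration of the request, the sketch below renders the same nested configuration either as an indented tree or as JSON. The `dump_config` function, its `fmt` parameter and the dict layout are assumptions; the real config shell stores its data differently:

```python
import json


def dump_config(config, fmt="text"):
    """Render a nested config dict as JSON or as a simple indented tree."""
    if fmt == "json":
        return json.dumps(config, indent=2, sort_keys=True)
    lines = []

    def walk(node, depth):
        for key in sorted(node):
            value = node[key]
            if isinstance(value, dict):
                lines.append("{}o- {}".format("  " * depth, key))
                walk(value, depth + 1)
            else:
                lines.append("{}o- {} [{}]".format("  " * depth, key, value))

    walk(config, 0)
    return "\n".join(lines)
```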

cephadm's default image

In case someone manually starts cephadm on a minion for debugging purposes, there is a chance that a user will not specify an image.

If we could specify the default image for cephadm in a config file, we no longer need to worry about users accidentally pulling the default image.

Changing the default image doesn't have an impact on the image used by calls from the ceph-mgr, as the mgr always specifies the image when calling cephadm.

Thoughts?

click: `de_DE.UTF-8`

scalability-master:~ # ceph-bootstrap deploy
Traceback (most recent call last):
  File "/usr/bin/ceph-bootstrap", line 11, in <module>
    load_entry_point('ceph-bootstrap==15.1.0+1581935293.g7a3134c', 'console_scripts', 'ceph-bootstrap')()
  File "/usr/lib/python3.6/site-packages/ceph_bootstrap/__init__.py", line 55, in ceph_bootstrap_main
    cli(prog_name='ceph-bootstrap')
  File "/usr/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 676, in main
    _verify_python3_env()
  File "/usr/lib/python3.6/site-packages/click/_unicodefun.py", line 118, in _verify_python3_env
    'for mitigation steps.' + extra)
RuntimeError: Click will abort further execution because Python 3 was configured to use ASCII as encoding for the environment.  Consult http://click.pocoo.org/python3/for mitigation steps.

This system supports the C.UTF-8 locale which is recommended.
You might be able to resolve your issue by exporting the
following environment variables:

    export LC_ALL=C.UTF-8
    export LANG=C.UTF-8

Click discovered that you exported a UTF-8 locale
but the locale system could not pick up from it because
it does not exist.  The exported locale is "de_DE.UTF-8" but it
is not supported

deploy: Cluster is already deployed

scalability-master:~ # ceph-bootstrap deploy --non-interactive
Checking if ceph-salt formula is available...
Syncing minions with the master...
Checking existing deployment...
Cluster is already deployed, please apply the deployment to a single minion at a time: "ceph-bootstrap deploy <minion_id>"

But the cluster actually doesn't exist:

scalability-master:~ # salt '*' cmd.run 'cephadm'
scalability-monitor-2.openstack.local:
    /bin/sh: cephadm: command not found
scalability-osd-5.openstack.local:
    /bin/sh: cephadm: command not found
scalability-osd-1.openstack.local:
    /bin/sh: cephadm: command not found
scalability-osd-4.openstack.local:
    /bin/sh: cephadm: command not found
scalability-monitor-1.openstack.local:
    /bin/sh: cephadm: command not found
scalability-monitor-3.openstack.local:
    /bin/sh: cephadm: command not found
scalability-osd-3.openstack.local:
    /bin/sh: cephadm: command not found
scalability-osd-2.openstack.local:
    /bin/sh: cephadm: command not found
scalability-master.openstack.local:
    /bin/sh: cephadm: command not found
ERROR: Minions returned with non-zero exit code
scalability-master:~ # salt '*' cmd.run 'ceph'
scalability-monitor-2.openstack.local:
    /bin/sh: ceph: command not found
scalability-osd-5.openstack.local:
    /bin/sh: ceph: command not found
scalability-osd-1.openstack.local:
    /bin/sh: ceph: command not found
scalability-osd-4.openstack.local:
    /bin/sh: ceph: command not found
scalability-monitor-3.openstack.local:
    /bin/sh: ceph: command not found
scalability-osd-2.openstack.local:
    /bin/sh: ceph: command not found
scalability-monitor-1.openstack.local:
    /bin/sh: ceph: command not found
scalability-osd-3.openstack.local:
    /bin/sh: ceph: command not found
scalability-master.openstack.local:
    /bin/sh: ceph: command not found
ERROR: Minions returned with non-zero exit code

config:

scalability-master:~ # ceph-bootstrap config
/> ls
o- / ......................................................................................................................... [...]
  o- Cluster ................................................................................................................. [...]
  | o- Minions ........................................................................................................ [Minions: 9]
  | | o- scalability-master.openstack.local ............................................................................. [no roles]
  | | o- scalability-monitor-1.openstack.local .......................................................................... [mgr, mon]
  | | o- scalability-monitor-2.openstack.local .......................................................................... [no roles]
  | | o- scalability-monitor-3.openstack.local .......................................................................... [no roles]
  | | o- scalability-osd-1.openstack.local .............................................................................. [no roles]
  | | o- scalability-osd-2.openstack.local .............................................................................. [no roles]
  | | o- scalability-osd-3.openstack.local .............................................................................. [no roles]
  | | o- scalability-osd-4.openstack.local .............................................................................. [no roles]
  | | o- scalability-osd-5.openstack.local .............................................................................. [no roles]
  | o- Roles ........................................ [Bootstrap minion: scalability-monitor-1.openstack.local, Minions w/ roles: 1]
  |   o- Mgr .......................................................................................................... [Minions: 1]
  |   | o- scalability-monitor-1.openstack.local ................................................................ [other roles: mon]
  |   o- Mon .......................................................................................................... [Minions: 1]
  |     o- scalability-monitor-1.openstack.local ................................................................ [other roles: mgr]
  o- Containers .............................................................................................................. [...]
  | o- Images ................................................................................................................ [...]
  |   o- ceph ........................ [registry.suse.de/suse/sle-15-sp2/update/products/ses7/milestones/containers/ses/7/ceph/ceph]
  o- Deployment .............................................................................................................. [...]
  | o- Bootstrap ......................................................................................................... [enabled]
  | o- Dashboard ............................................................................................................. [...]
  | | o- password ............................................................................................. [randomly generated]
  | | o- username .......................................................................................................... [admin]
  | o- Mgr .............................................................................................................. [disabled]
  | o- Mon .............................................................................................................. [disabled]
  | o- OSD .............................................................................................................. [disabled]
  o- SSH ............................................................................................................ [Key Pair set]
  | o- Private_Key ............................................................... [00:34:a1:d6:44:a6:4a:35:f1:38:88:47:cc:bc:04:42]
  | o- Public_Key ................................................................ [00:34:a1:d6:44:a6:4a:35:f1:38:88:47:cc:bc:04:42]
  o- Storage ................................................................................................................. [...]
  | o- Drive_Groups ........................................................................................................ [empty]
  o- System_Update ........................................................................................................... [...]
  | o- Packages .......................................................................................................... [enabled]
  | o- Reboot ............................................................................................................ [enabled]
  o- Time_Server ......................................................................................................... [enabled]
    o- External_Servers ........................................................................................................ [1]
    | o- ntp.suse.cz ......................................................................................................... [...]
    o- Server_Hostname ........................................................................ [scalability-master.openstack.local]
/> exit

cephadm bootstrap prints valuable information

After cephadm bootstrap succeeds, it prints some valuable information:

INFO:cephadm:Ceph Dashboard is now available at:

             URL: https://ubuntu1804:8443/
            User: admin
        Password: wgcrj2t3ka

INFO:cephadm:You can access the Ceph CLI with:

        sudo ./cephadm shell --fsid 146d0150-4e66-11ea-a110-5254005c2d4e -c ceph.conf -k ceph.client.admin.keyring

INFO:cephadm:Bootstrap complete.

We should forward the dashboard URL, user and password to users, and provide a convenience wrapper for cephadm shell.
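Extracting the dashboard details from output shaped like the sample above could be sketched as follows. The regular expression only assumes the `URL:`, `User:` and `Password:` lines shown in that sample; `parse_dashboard_info` is a hypothetical helper, and the output format of other cephadm versions may differ:

```python
import re

# Matches lines such as "        Password: wgcrj2t3ka".
_FIELD = re.compile(r"^\s*(URL|User|Password):\s*(\S+)\s*$", re.MULTILINE)


def parse_dashboard_info(output):
    """Return a dict with 'url', 'user' and 'password' when present."""
    return {name.lower(): value for name, value in _FIELD.findall(output)}
```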

ceph-salt-formula succeeds even when "ceph orch osd create" command fails

Problem

Due to a recently introduced regression, ceph-salt stopped deploying OSDs, but ceph-salt-formula ignores this and reports that OSD groups were created successfully.

Symptom

The "Deploying OSD groups 1/1" step finishes immediately (with success), yet sesdev create octopus ... --qa-test ... fails because the cluster has zero OSDs.

Analysis

ceph-salt uses a command like the following to deploy OSDs:

echo '{\"testing_dg_admin\": {\"host_pattern\": \"admin*\", \"data_devices\": {\"all\": true}}}' | ceph orch osd create -i -

When I try this command manually, it fails:

admin:~ # echo '{\"testing_dg_admin\": {\"host_pattern\": \"admin*\", \"data_devices\": {\"all\": true}}}' | ceph orch osd create -i -
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1070, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 191, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 309, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 153, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 144, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 437, in _create_osd
    dgs = DriveGroupSpecs(yaml.load(inbuf))
  File "/usr/lib/python3.6/site-packages/ceph/deployment/drive_group.py", line 117, in __init__
    self.build_drive_groups()
  File "/usr/lib/python3.6/site-packages/ceph/deployment/drive_group.py", line 122, in build_drive_groups
    (drive_group_spec, name=drive_group_name))
  File "/usr/lib/python3.6/site-packages/ceph/deployment/drive_group.py", line 228, in from_json
    "Feature <{}> is not supported".format(applied_filter))
ceph.deployment.drive_group.DriveGroupValidationError: Failed to validate Drive Group: Feature <\"host_pattern\"> is not supported

admin:~ # echo $?
22

However, ceph-salt-formula ignores this error.
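The missing piece is a check of the exit status. The sketch below shows the general pattern in Python with `subprocess`; `run_checked` is a hypothetical helper and does not reflect how the formula actually invokes the command:

```python
import subprocess


def run_checked(argv, stdin_text=""):
    """Run a command, feed it stdin_text, and fail loudly on a non-zero exit."""
    result = subprocess.run(
        argv,
        input=stdin_text,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        # Surface the exit code (22 in the report above) and the error text.
        raise RuntimeError(
            "command {!r} failed with exit code {}: {}".format(
                argv, result.returncode, result.stderr.strip()))
    return result.stdout
```

With a check like this, the "Deploying OSD groups" step would be reported as failed instead of successful.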

Implement `init` command

Before running ceph-bootstrap we need to make sure that the salt pillar is correctly configured to find the ceph-salt.sls file created by ceph-bootstrap. This should be done automatically by the ceph-bootstrap init command.

Currently these are the steps we run manually before running ceph-bootstrap config:

cat <<EOF > /srv/pillar/top.sls
base:
  '*':
    - ceph-salt
EOF
touch /srv/pillar/ceph-salt.sls
chown -R salt:salt /srv/pillar
salt \* saltutil.pillar_refresh
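The file-related part of these manual steps could be automated roughly as below. This is a sketch only: `init_pillar` is a hypothetical function, it overwrites top.sls just as the manual steps do, and it leaves out the ownership change and the pillar refresh:

```python
import os

TOP_SLS = "base:\n  '*':\n    - ceph-salt\n"


def init_pillar(pillar_dir="/srv/pillar"):
    """Write top.sls and make sure ceph-salt.sls exists in pillar_dir."""
    os.makedirs(pillar_dir, exist_ok=True)
    with open(os.path.join(pillar_dir, "top.sls"), "w") as top:
        top.write(TOP_SLS)
    # Equivalent of `touch`: create the file if missing, keep existing content.
    with open(os.path.join(pillar_dir, "ceph-salt.sls"), "a"):
        pass
```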

Command "ceph orchestrator" was renamed to "ceph orch"

Since ceph/ceph#33131 ceph orchestrator was renamed to ceph orch, so ceph-bootstrap must be adapted accordingly.

Provide a mechanism for setting "ceph.conf" configuration values so they will be in effect during "cephadm bootstrap" phase

Nowadays, Ceph has a MON store for cluster configuration, but certain options still have to be set in ceph.conf before the cluster is bootstrapped.

One example is osd crush chooseleaf type = 0. If this is not provided on the cephadm bootstrap command line via the -c option, the initial CRUSH map created by cephadm bootstrap will have the failure domain set to "host" and there is no easy way to change that.

It's possible that there are other options like this one, which must be set via cephadm bootstrap -c in order to properly take effect.

Therefore, I am proposing that ceph-salt provide a mechanism for setting these options.

UPDATE: I found another situation where this is needed (and I think it's quite likely that there are more):

If someone needs to run cephadm bootstrap with MGR debugging turned up, the only way to do this is via cephadm bootstrap -c.
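One possible shape for such a mechanism, sketched in Python: collect the options as a dict and render them into a small ceph.conf that can then be handed to cephadm bootstrap -c. The function name `render_bootstrap_conf` and placing everything in the `[global]` section are assumptions, not an agreed design.

```python
import configparser
import io


def render_bootstrap_conf(options):
    """Render a minimal ceph.conf with the given options in [global]."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser["global"] = {name: str(value) for name, value in options.items()}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # The example option from this issue.
    print(render_bootstrap_conf({"osd crush chooseleaf type": 0}))
```

The rendered text would be written to a file on the bootstrap minion and its path passed via -c.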

Error deploying config without external time server

When deploying the following configuration, without external time server:

  o- Time_Server ................................................................. [enabled]
    o- External_Servers ............................................................ [empty]
    o- Server_Hostname ................................................. [node1.octopus.com]

I'm getting the following error:

----------
          ID: /etc/chrony.conf
    Function: file.managed
      Result: False
     Comment: Unable to manage file: Jinja variable 'dict object' has no attribute 'external_time_servers'
     Started: 10:37:14.223639
    Duration: 47.736 ms
     Changes:   
----------

To fix this, we should check if any external time server was configured before setting it up.
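The actual fix belongs in the Jinja template for /etc/chrony.conf, but the guard can be sketched in plain Python. The key name `external_time_servers` is taken from the error message; the function name, the dict argument, and the `server ... iburst` line format are assumptions for illustration.

```python
def chrony_server_lines(time_server_cfg):
    """Return chrony 'server' lines, or an empty list if none are configured."""
    # .get() avoids the failure seen above when the key is missing entirely.
    servers = time_server_cfg.get("external_time_servers") or []
    return ["server {} iburst".format(host) for host in servers]


if __name__ == "__main__":
    print(chrony_server_lines({}))
    print(chrony_server_lines({"external_time_servers": ["0.pt.pool.ntp.org"]}))
```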

Ensure ceph-salt-formula is loaded by the salt-master before deploy

When running ceph-bootstrap deploy we need to make sure that the ceph-salt-formula state files are already loaded by the salt-master, otherwise the deployment will fail.

We can check whether the ceph-salt formula is loaded by running the following command: salt \* state.sls_exists ceph-salt.
We should also always sync any state, module or pillar changes before starting the deployment, using the following command: salt \* saltutil.sync_all
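A small Python sketch of how the result of that check could be evaluated. It assumes the state.sls_exists results have already been collected into a dict mapping minion id to return value (how they are collected is not shown), and the function name is hypothetical.

```python
def minions_missing_formula(sls_exists_result):
    """Return the sorted minion ids whose state.sls_exists result is not True."""
    return sorted(
        minion for minion, ok in sls_exists_result.items() if ok is not True
    )


if __name__ == "__main__":
    result = {"node1.octopus.com": True, "node2.octopus.com": False}
    missing = minions_missing_formula(result)
    if missing:
        print("ceph-salt formula not available on: " + ", ".join(missing))
```

Anything other than True, including an error payload, is treated as "not loaded", which seems the safer default before a deployment.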

Always pull image before service deployment

Use the cephadm pull command to pull images before executing cephadm bootstrap.

Minions without role are not added to the orchestrator

imaster:~ # ceph-bootstrap config "/Cluster/Minions ls"
o- Minions ............................................................... [Minions: 9]
  o- imaster.ceph ....................................................... [no roles]
  o- imonitor1.ceph ..................................................... [mon, mgr]
  o- imonitor2.ceph ..................................................... [mon, mgr]
  o- imonitor3.ceph ..................................................... [mon, mgr]
  o- iosd-node1.ceph .................................................... [no roles]
  o- iosd-node2.ceph .................................................... [no roles]
  o- iosd-node3.ceph .................................................... [no roles]
  o- iosd-node4.ceph .................................................... [no roles]
  o- iosd-node5.ceph .................................................... [no roles]
OK



imonitor1:~ # ceph orchestrator host ls
HOST      LABELS
imonitor3
imonitor2
imonitor1

RGW deployment

Roles only available in advanced mode

Simple Mode

By default, roles should be disabled, so the following entries should not be visible:

/Cluster/Roles
/Deployment
/Storage

The following new entry should be available:

/Cluster/Bootstrap_Minion

Setting the /Cluster/Bootstrap_Minion will:

set the bootstrap_mon value
clear all the existing roles of bootstrap_mon
add both mon and mgr roles to the bootstrap_mon

Advanced Mode

When running in "advanced mode", all entries should be visible except /Cluster/Bootstrap_Minion.

Adding a role to a minion should set the bootstrap_mon value with a minion that has both mgr and mon roles.
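The two behaviours described above can be sketched as follows. The data structure (a dict mapping minion name to a set of roles), the function names, and the alphabetical tie-break when several minions qualify are all assumptions made for the sake of the example.

```python
def set_bootstrap_minion(roles, minion):
    """Simple mode: clear the minion's roles, give it mon and mgr, return it."""
    roles[minion] = {"mon", "mgr"}
    return minion


def pick_bootstrap_minion(roles):
    """Advanced mode: return a minion that has both mon and mgr, or None."""
    for minion in sorted(roles):
        if {"mon", "mgr"} <= roles[minion]:
            return minion
    return None


if __name__ == "__main__":
    roles = {"node1": {"mon"}, "node2": {"mgr"}}
    print(pick_bootstrap_minion(roles))          # no minion qualifies yet
    roles["node2"].add("mon")
    print(pick_bootstrap_minion(roles))
    print(set_bootstrap_minion(roles, "node1"), roles["node1"])
```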

After "ceph-bootstrap deploy", cluster is in HEALTH_WARN ("CEPHADM_STRAY_HOST: 3 stray host(s) with 10 service(s) not managed by cephadm")

$ sesdev ssh octopus_test1
Warning: Permanently added '192.168.121.4' (ECDSA) to the list of known hosts.
Have a lot of fun...
admin:~ # ceph -s
  cluster:
    id:     be09d766-42d5-11ea-bbb8-52540088717e
    health: HEALTH_WARN
            3 stray host(s) with 10 service(s) not managed by cephadm
 
  services:
    mon: 3 daemons, quorum node1.octopus_test1.com,node2,node3 (age 23m)
    mgr: hwboak(active, since 23m), standbys: icdsnv, ytvgxh
    osd: 6 osds: 6 up (since 22m), 6 in (since 22m)
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   6.0 GiB used, 42 GiB / 48 GiB avail
    pgs:     
 
admin:~ # ceph health detail
HEALTH_WARN 3 stray host(s) with 10 service(s) not managed by cephadm
[WRN] CEPHADM_STRAY_HOST: 3 stray host(s) with 10 service(s) not managed by cephadm
    stray host node1.octopus_test1.com has 4 stray daemons: ['mgr.hwboak', 'mon.node1.octopus_test1.com', 'osd.0', 'osd.1']
    stray host node2.octopus_test1.com has 3 stray daemons: ['mgr.icdsnv', 'osd.2', 'osd.3']
    stray host node3.octopus_test1.com has 3 stray daemons: ['mgr.ytvgxh', 'osd.4', 'osd.5']
admin:~ # ceph versions
{
    "mon": {
        "ceph version 15.0.0-9865-gd2c6620fea (d2c6620fea8e44e5b0bc24a0effaa6347315be7e) octopus (dev)": 3
    },
    "mgr": {
        "ceph version 15.0.0-9865-gd2c6620fea (d2c6620fea8e44e5b0bc24a0effaa6347315be7e) octopus (dev)": 3
    },
    "osd": {
        "ceph version 15.0.0-9865-gd2c6620fea (d2c6620fea8e44e5b0bc24a0effaa6347315be7e) octopus (dev)": 6
    },
    "mds": {},
    "overall": {
        "ceph version 15.0.0-9865-gd2c6620fea (d2c6620fea8e44e5b0bc24a0effaa6347315be7e) octopus (dev)": 12
    }
}
admin:~ # ceph --version
ceph version 15.0.0-9865-gd2c6620fea (d2c6620fea8e44e5b0bc24a0effaa6347315be7e) octopus (dev)

UPDATE: The workaround is to explicitly add the hosts - e.g.:

# ceph orchestrator host add admin.octopus_test1.com
Added host 'admin.octopus_test1.com'

ceph-bootstrap config displaying Dashboard password

ceph-bootstrap config displays the Dashboard password in clear text:

# ceph-bootstrap config /Deployment/Dashboard/password set admin
# ceph-bootstrap config /Deployment/Dashboard ls
o- Dashboard ........................................ [...]
  o- password ..................................... [admin]
  o- username ..................................... [admin]
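A minimal sketch of masking the value when the option summary is rendered. The set of secret option names and the function name are hypothetical; the `***` placeholder is only one possible choice.

```python
SECRET_OPTIONS = {"password"}  # hypothetical list of options to mask


def display_value(name, value):
    """Return the text to show in the 'ls' summary for an option."""
    if name in SECRET_OPTIONS and value:
        return "***"
    return str(value)


if __name__ == "__main__":
    for option in ("password", "username"):
        print("o- {} [{}]".format(option, display_value(option, "admin")))
```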

Add "System_Update" option group to config shell

ceph-bootstrap

Add a System_Update option group to the config shell. This group should be able to configure all options required by ceph-salt-formula:software state.

ceph-salt-formula

Currently the only thing the software state file does is run the pkg.upgrade module function, which upgrades all packages of the system to the latest available version.

We should allow finer-grained configuration, for example to enable or disable kernel upgrades or upgrades of any other package.

Also, we should find a solution to the problem of package upgrades that require a reboot afterwards.

Solution proposal:
Have a flag that enables/disables rebooting the machine when any package upgrade requires a reboot. If the flag is enabled, the state should send a message on the salt-event bus stating that the minion will reboot, then finish the state execution and issue a reboot.

Since the ceph-salt formula should be idempotent, the user of the formula, after being notified that a node is rebooting, can run the formula again on all minions.

Prevent some actions after the initial deployment

Solution 1

After the initial deployment, we should:

Mark minions as "Managed by orchestrator" (based on ceph orchestrator inventory?), e.g.:

o- / ....................................................................................... [...]
  o- Cluster ............................................................................... [...]
  | o- Minions ...................................................................... [Minions: 4]
  | | o- node1.octopus.com ............................................. [Managed by orchestrator]
  | | ...

Minions that are managed by orchestrator should not appear on /Cluster/Roles, and "Bootstrap minion" should not be displayed:

o- / ....................................................................................... [...]
  o- Cluster ............................................................................... [...]
  | o- Minions ...................................................................... [Minions: 5]
  | | o- node1.octopus.com ............................................. [Managed by orchestrator]
  | | ...
  | | o- node5.octopus.com ............................................................ [mgr, mon]
  | o- Roles ............................................................... [Minions w/ roles: 1]
  |   o- Mgr ........................................................................ [Minions: 1]
  |   | o- node5.octopus.com .................................................. [other roles: mon]
  |   o- Mon ........................................................................ [Minions: 2]
  |     o- node5.octopus.com .................................................. [other roles: mgr]

And the following options should be disabled:

/Cluster/Deployment/Bootstrap
/Cluster/Deployment/Dashboard
/SSH

Solution 2

After the initial deployment, the following operations should be "disabled":

/Cluster/Roles
/Deployment
/Storage
/SSH

Default values shared between "configshell code" and "ceph-salt-formulas"

ATM, default values are duplicated in "configshell code" and "ceph-salt-formulas".

It would be great to have a way to share constants/default values between the "configshell code" and "ceph-salt-formulas" (e.g., the default dashboard username).

ceph-salt issues wrong "ceph orch" command and gets "Error: 2 hosts provided, expected 3"

Very odd. . .

sesdev command line:

sesdev create octopus --ceph-salt-repo https://github.com/smithfarm/ceph-salt.git --ceph-salt-branch wip-fix-broken-osd-deploy --ceph-container-image="registry.opensuse.org/filesystems/ceph/master/upstream/images/ceph/ceph" --no-deploy-mons --no-deploy-mgrs --no-deploy-osds octopus_test1

Results in:

    admin:   |   o- Mgr .......................................................................................................... [Minions: 3]
    admin:   |   | o- node1.octopus_test1.com .............................................................................. [other roles: mon]
    admin:   |   | o- node2.octopus_test1.com .............................................................................. [other roles: mon]
    admin:   |   | o- node3.octopus_test1.com .............................................................................. [other roles: mon]
...
    admin:   o- Deployment .............................................................................................................. [...]
    admin:   | o- Bootstrap ......................................................................................................... [enabled]
    admin:   | o- Dashboard ............................................................................................................. [...]
    admin:   | | o- password ............................................................................................................ [***]
    admin:   | | o- username .......................................................................................................... [admin]
    admin:   | o- Mgr ............................................................................................................... [enabled]
    admin:   | o- Mon .............................................................................................................. [disabled]
    admin:   | o- OSD .............................................................................................................. [disabled]

and:

    admin: [2020-02-25 16:49:11.286782] [node1.octopus_te] Finished with failures
    admin: Failure in minion: node1.octopus_test1.com
    admin: __id__: deploy remaining mgrs
    admin: __run_num__: 55
    admin: __sls__: ceph-salt.ceph-mgr
    admin: changes:
    admin:   pid: 14979
    admin:   retcode: 22
    admin:   stderr: "Error EINVAL: Traceback (most recent call last):\n  File \"/usr/share/ceph/mgr/mgr_module.py\"\
    admin:     , line 1070, in _handle_command\n    return self.handle_command(inbuf, cmd)\n\
    admin:     \  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 191, in handle_command\n\
    admin:     \    return dispatch[cmd['prefix']].call(self, cmd, inbuf)\n  File \"/usr/share/ceph/mgr/mgr_module.py\"\
    admin:     , line 309, in call\n    return self.func(mgr, **kwargs)\n  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\"\
    admin:     , line 153, in <lambda>\n    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args,\
    admin:     \ **l_kwargs)\n  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line\
    admin:     \ 144, in wrapper\n    return func(*args, **kwargs)\n  File \"/usr/share/ceph/mgr/orchestrator/module.py\"\
    admin:     , line 668, in _apply_mgr\n    completion = self.apply_mgr(spec)\n  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\"\
    admin:     , line 1694, in inner\n    completion = self._oremote(method_name, args, kwargs)\n\
    admin:     \  File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1764, in _oremote\n\
    admin:     \    return mgr.remote(o, meth, *args, **kwargs)\n  File \"/usr/share/ceph/mgr/mgr_module.py\"\
    admin:     , line 1432, in remote\n    args, kwargs)\nRuntimeError: Remote method threw exception:\
    admin:     \ Traceback (most recent call last):\n  File \"/usr/share/ceph/mgr/cephadm/module.py\"\
    admin:     , line 2155, in apply_mgr\n    len(spec.placement.hosts), num_new_mgrs))\nRuntimeError:\
    admin:     \ Error: 2 hosts provided, expected 3"
    admin:   stdout: ''
    admin: comment: 'Command "ceph orch apply mgr 3 node2 node3
    admin:
    admin:   " run'
    admin: duration: 403.004
    admin: name: 'ceph orch apply mgr 3 node2 node3
    admin:
    admin:   '
    admin: result: false
    admin: start_time: '17:49:10.863246'
    admin: state: 'cmd_|-deploy remaining mgrs_|-ceph orch apply mgr 3 node2 node3
    admin:
    admin:   _|-run'
    admin: Finished execution of ceph-salt formula
    admin: Summary: Total=4 Succeeded=3 Failed=1
Command '['vagrant', 'up']' failed: ret=1 stderr:
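The failing command passes a count of 3 but lists only two hosts. One way to keep the two consistent is sketched below, under the assumption that the count should always equal the number of hosts passed; this issue does not settle whether node1 should have been listed instead. The function name is hypothetical.

```python
def apply_mgr_command(hosts):
    """Build a 'ceph orch apply mgr' command whose count matches the host list."""
    hosts = list(hosts)
    if not hosts:
        raise ValueError("at least one host is required")
    return ["ceph", "orch", "apply", "mgr", str(len(hosts))] + hosts


if __name__ == "__main__":
    print(" ".join(apply_mgr_command(["node2", "node3"])))
```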

Remove salt python API terminal output from ceph-bootstrap commands

Currently, whenever ceph-bootstrap runs a salt command through the salt python API, the API prints messages to the terminal in some situations. Example:

No minions matched the target. No command was sent, no jid was assigned.
No minions matched the target. No command was sent, no jid was assigned.
No minions matched the target. No command was sent, no jid was assigned.

This is a problem in the salt python API, which should send those messages to the logger instead of directly to stdout. Since we can't change that part here, ceph-bootstrap should wrap the salt python API calls in a way that captures their stdout output and prevents it from reaching the user's terminal.

We have an example of how that can be achieved here: https://github.com/SUSE/DeepSea/blob/master/cli/common.py#L19-L45
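For reference, a minimal sketch of such a wrapper using only the Python standard library. It is not necessarily how the linked DeepSea code does it; the logger name and the helper name `call_quietly` are hypothetical. Note that contextlib.redirect_stdout only intercepts writes that go through sys.stdout (such as print), not output written directly to the file descriptor.

```python
import contextlib
import io
import logging

logger = logging.getLogger("ceph_bootstrap.salt")  # hypothetical logger name


def call_quietly(func, *args, **kwargs):
    """Call func, capture anything it prints to stdout, and log it instead."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        result = func(*args, **kwargs)
    for line in buf.getvalue().splitlines():
        if line.strip():
            logger.info("salt stdout: %s", line)
    return result


if __name__ == "__main__":
    def noisy_salt_call():
        # Stand-in for a salt python API call that prints to the terminal.
        print("No minions matched the target. No command was sent, no jid was assigned.")
        return {}

    print(call_quietly(noisy_salt_call))
```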

No bootstrap minion configured if no node has both MON and MGR roles

Since issue #42, a minion is tagged as the bootstrap minion only when it has both MON and MGR roles. If this is a requirement, should we document it (maybe in sesdev or the downstream docs) or add a check before deploying?

I tested sesdev with the following command:

sesdev create octopus --roles="[admin], [mon], [mgr], [storage]" dev

The deployment failed with:

    admin: ++ ceph-bootstrap config ls
    admin: o- / ......................................................................................................................... [...]
    admin:   o- Cluster ................................................................................................................. [...]
    admin:   | o- Minions ........................................................................................................ [Minions: 4]
    admin:   | | o- admin.dev.com .................................................................................................. [no roles]
    admin:   | | o- node1.dev.com ....................................................................................................... [mon]
    admin:   | | o- node2.dev.com ....................................................................................................... [mgr]
    admin:   | | o- node3.dev.com .................................................................................................. [no roles]
    admin:   | o- Roles ......................................................................... [Bootstrap minion: None, Minions w/ roles: 2]
    admin:   |   o- Mgr .......................................................................................................... [Minions: 1]
    admin:   |   | o- node2.dev.com .......................................................................................... [no other roles]
    admin:   |   o- Mon .......................................................................................................... [Minions: 1]
    admin:   |     o- node1.dev.com .......................................................................................... [no other roles]
    admin:   o- Containers .............................................................................................................. [...]
    admin:   | o- Images ................................................................................................................ [...]
    admin:   |   o- ceph ..................................................................... [docker.io/ceph/daemon-base:latest-master-devel]
    admin:   o- Deployment .............................................................................................................. [...]
    admin:   | o- Bootstrap ......................................................................................................... [enabled]
    admin:   | o- Dashboard ............................................................................................................. [...]
    admin:   | | o- password ............................................................................................................ [***]
    admin:   | | o- username .......................................................................................................... [admin]
    admin:   | o- Mgr ............................................................................................................... [enabled]
    admin:   | o- Mon ............................................................................................................... [enabled]
    admin:   | o- OSD ............................................................................................................... [enabled]
    admin:   o- SSH ............................................................................................................ [Key Pair set]
    admin:   | o- Private_Key ............................................................... [a4:b9:b1:3e:e7:a2:ba:fe:c6:d6:e5:82:e6:99:d3:24]
    admin:   | o- Public_Key ................................................................ [a4:b9:b1:3e:e7:a2:ba:fe:c6:d6:e5:82:e6:99:d3:24]
    admin:   o- Storage ................................................................................................................. [...]
    admin:   | o- Drive_Groups ............................................................................................................ [1]
    admin:   |   o- {"testing_dg_node3": {"host_pattern": "node3*", "data_devices": {"all": true}}} ..................................... [...]
    admin:   o- System_Update ........................................................................................................... [...]
    admin:   | o- Packages .......................................................................................................... [enabled]
    admin:   | o- Reboot ............................................................................................................ [enabled]
    admin:   o- Time_Server ......................................................................................................... [enabled]
    admin:     o- External_Servers ........................................................................................................ [1]
    admin:     | o- 0.pt.pool.ntp.org ................................................................................................... [...]
    admin:     o- Server_Hostname ............................................................................................. [admin.dev.com]
    admin: ++ zypper lr -upEP
    admin: #  | Alias               | Name                        | Enabled | GPG Check | Refresh | Priority | URI                                                                                              
    admin: ---+---------------------+-----------------------------+---------+-----------+---------+----------+--------------------------------------------------------------------------------------------------
    admin:  1 | octopus-repo1       | octopus-repo1               | Yes     | (r ) Yes  | No      |   98     | https://download.opensuse.org/repositories/filesystems:/ceph:/master:/upstream/openSUSE_Leap_15.2
    admin:  6 | repo-non-oss        | Non-OSS Repository          | Yes     | (r ) Yes  | No      |   99     | http://download.opensuse.org/distribution/leap/15.2/repo/non-oss/                                
    admin:  7 | repo-oss            | Main Repository             | Yes     | (r ) Yes  | No      |   99     | http://download.opensuse.org/distribution/leap/15.2/repo/oss/                                    
    admin: 10 | repo-update         | Main Update Repository      | Yes     | (r ) Yes  | No      |   99     | http://download.opensuse.org/update/leap/15.2/oss/                                               
    admin: 11 | repo-update-non-oss | Update Repository (Non-Oss) | Yes     | (r ) Yes  | No      |   99     | http://download.opensuse.org/update/leap/15.2/non-oss/                                           
    admin: ++ zypper info cephadm
    admin: ++ grep -E '(^Repo|^Version)'
    admin: Repository     : octopus-repo1                       
    admin: Version        : 15.1.0-lp152.833.1                  
    admin: ++ ceph-bootstrap --version
    admin: ceph-bootstrap 15.1.0+1581935293.g7a3134c
    admin: ++ stdbuf -o0 ceph-bootstrap -ldebug deploy --non-interactive
    admin: Checking if ceph-salt formula is available...
    admin: salt-master will be restarted to load ceph-salt formula
    admin: Could not find ceph-salt formula. Please check if ceph-salt-formula package is installed
Command '['vagrant', 'up']' failed: ret=1 stderr:
==> admin: An error occurred. The error will be shown after all tasks complete.
An error occurred while executing multiple actions in parallel.
Any errors that occurred are shown below.

The error message doesn't make sense because the formula was installed correctly. After digging for a while, I found that the minion log file provides some insight into the cause:

2020-02-19 08:41:27,500 [salt.minion      :1491][INFO    ][7109] User sudo_vagrant Executing command state.sls_exists with jid 20200219084127497627
2020-02-19 08:41:27,551 [salt.minion      :1618][INFO    ][11686] Starting a new job 20200219084127497627 with PID 11686
2020-02-19 08:41:27,701 [salt.state       :967 ][INFO    ][11686] Loading fresh modules for state activity
2020-02-19 08:41:27,764 [salt.utils.templates:180 ][ERROR   ][11686] Rendering exception occurred
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/templates.py", line 394, in render_jinja_tmpl
    output = template.render(**decoded_context)
  File "/usr/lib/python3.6/site-packages/jinja2/asyncsupport.py", line 76, in render
    return original_render(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/usr/lib/python3.6/site-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/lib/python3.6/site-packages/jinja2/_compat.py", line 37, in reraise
    raise value.with_traceback(tb)
  File "<template>", line 13, in top-level template code
jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'bootstrap_minion'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/templates.py", line 169, in render_tmpl
    output = render_str(tmplstr, context, tmplpath)
  File "/usr/lib/python3.6/site-packages/salt/utils/templates.py", line 404, in render_jinja_tmpl
    buf=tmplstr)
salt.exceptions.SaltRenderError: Jinja variable 'dict object' has no attribute 'bootstrap_minion'
2020-02-19 08:41:27,764 [salt.state       :3516][CRITICAL][11686] Rendering SLS 'base:ceph-salt' failed: Jinja variable 'dict object' has no attribute 'bootstrap_minion'
2020-02-19 08:41:27,764 [salt.minion      :1946][INFO    ][11686] Returning information for job: 20200219084127497627
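If a pre-deploy check is the way to go, it could be as small as the following sketch. The input is assumed to be a dict mapping each minion to its list of roles, and the function name and error wording are hypothetical.

```python
def predeploy_errors(minion_roles):
    """Return a list of problems that should block the deployment."""
    errors = []
    has_candidate = any(
        {"mon", "mgr"} <= set(roles) for roles in minion_roles.values()
    )
    if not has_candidate:
        errors.append(
            "no minion has both the mon and mgr roles, "
            "so no bootstrap minion can be selected"
        )
    return errors


if __name__ == "__main__":
    # Role layout from the failed sesdev run above.
    layout = {
        "admin.dev.com": [],
        "node1.dev.com": ["mon"],
        "node2.dev.com": ["mgr"],
        "node3.dev.com": [],
    }
    for problem in predeploy_errors(layout):
        print("ERROR: " + problem)
```

Failing early with a message like this would be clearer than the misleading "Could not find ceph-salt formula" error.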

Node-Exporter deployment

It should be possible to install Node-Exporter from ceph-bootstrap.

ceph/ceph#33123

Implement `deploy` command

Currently, after using ceph-bootstrap to configure all the options required by ceph-salt-formula, we run ceph-salt-formula by issuing the following salt command:

salt -G 'ceph-salt:member' state.apply ceph-salt

The above command is completely silent until the minions start responding after running the whole formula. The objective of the ceph-bootstrap deploy command is to run the ceph-salt formula but give real-time execution progress feedback to the user.

The idea for the implementation is to listen on the salt-event bus for events generated by the ceph-salt formula, and show the current status on the terminal.
Related issue in ceph-salt-formula: SUSE/ceph-salt-formula#2

We can re-use the code in DeepSea CLI (https://github.com/SUSE/DeepSea/blob/master/cli/salt_event.py) to listen on the salt-event bus using the Listener pattern.
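To make the Listener idea concrete, here is a self-contained sketch that does not touch salt at all: events are plain dicts fed from any iterable, and listeners are notified for events whose tag matches a prefix. The event shape (`tag`, `data` with `id` and `desc`), the `ceph-salt/` tag prefix, and the class names are assumptions for illustration; they are not taken from the DeepSea code.

```python
class PrintingListener:
    """Listener that prints one progress line per event and remembers it."""

    def __init__(self):
        self.lines = []

    def on_event(self, minion, description):
        line = "[{}] {}".format(minion, description)
        self.lines.append(line)
        print(line)


class EventDispatcher:
    """Forwards events whose tag starts with tag_prefix to all listeners."""

    def __init__(self, tag_prefix):
        self.tag_prefix = tag_prefix
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def dispatch(self, events):
        for event in events:
            if not event["tag"].startswith(self.tag_prefix):
                continue  # not generated by the formula, ignore
            data = event["data"]
            for listener in self.listeners:
                listener.on_event(data["id"], data["desc"])


if __name__ == "__main__":
    dispatcher = EventDispatcher("ceph-salt/")
    dispatcher.add_listener(PrintingListener())
    dispatcher.dispatch([
        {"tag": "salt/job/123/new", "data": {"id": "node1", "desc": "ignored"}},
        {"tag": "ceph-salt/stage/begin", "data": {"id": "node1", "desc": "installing packages"}},
    ])
```

In the real command, the iterable would be replaced by whatever yields events from the salt-event bus.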

Prometheus and Grafana deployment

It should be possible to install and configure Prometheus and Grafana from ceph-bootstrap.

Useful links:

ceph/ceph#33073
