
provision-content's Introduction

END OF LIFE NOTICE FOR VERSION 2.x

Version 2 of Digital Rebar is no longer being maintained. All development has shifted to v3 (Digital Rebar Provision) which is being actively supported. If you need help migrating, please contact RackN for assistance.

Ref: https://github.com/digitalrebar/provision

Welcome to Digital Rebar

Digital Rebar is a container-ready cloud & hardware provisioning platform that delivers the best of software deployment automation and orchestration without locking you into a single platform or operating system. Our mission is to embrace the heterogeneous nature of cloud and data center operations.

While it has been completely rebuilt by DevOps artisans, Digital Rebar's history includes years of battle-tested operational lessons from the founders of the Crowbar Project.

Getting Started & Documentation

Documentation is maintained in Read The Docs and sourced from our doc repository.

Help & Community

Commercial support for Digital Rebar is available from RackN Inc.

Codebase History

  • Fall 2016: Digital Rebar / DigitalRebar converged into a single repo + workloads
  • Summer 2015: Digital Rebar / Core (and many repos) restructured into microservice containers
  • Fall 2013: OpenCrowbar Rearchitecture (v2) was a complete rewrite of Crowbar to be composable
  • Spring 2011: Original project: Crowbar (still maintained by SUSE) was a Chef wrapper layer

Dangerous and Fun Quick Start Command

We know you'll ignore this advice, but we recommend that you read the Deployment Install Guide on Read the Docs first!

WARNING: This is only for clean systems. If you've already cloned this repo, use the run-in-system script!

From the home directory of a user with sudo capabilities, run:

curl -fsSL https://raw.githubusercontent.com/digitalrebar/digitalrebar/master/deploy/quickstart.sh | bash

This command will turn the current node into a Digital Rebar admin node in Host mode, using the IP address on the default gateway interface. For cloud instances, this means it will use the private network and will only safely manage nodes on that private network. The UX and API will be available through the public IP of the cloud instance, assuming HTTPS is allowed through the cloud's network protections.

You may add additional arguments to bash to enable features or change the IP address that the admin node will use.

For example:

curl -fsSL https://raw.githubusercontent.com/digitalrebar/digitalrebar/master/deploy/quickstart.sh | bash -s -- --con-provisioner --con-dhcp --admin-ip=1.1.2.3/24

This enables the DHCP server and provisioner for the admin node. You will need to edit the admin-internal network to boot nodes properly. It also sets admin-ip to 1.1.2.3/24 in the configuration files; this last part is needed if you are using an AWS or Google instance and want to use your admin node for things not directly in your VPC/network.

NOTE: When enabling the provisioner, you will need about 20GB of disk space. Plan accordingly.

provision-content's People

Contributors

benagricola, bicchi, brunnels, corfr, dano0b, devopsvlogger, dspeichert, funkypenguin, galthaus, holmsten, isnuryusuf, lae, mclamb, meshiest, michaelrice, niemanme, peterrosell, svallebro, sygibson, t0rrant, tim-rackn, victorlowther, zehicle

provision-content's Issues

krib: install on coreos fails, kernel module br_netfilter not loaded

When installing Kubernetes on CoreOS using the KRIB stages, it fails on krib-config and krib-config.sh.tmpl. I used the CoreOS live boot environment, CoreOS version 2023.4.0. The KRIB content version is v1.12.0-tip-151-31915dd7430d7119be359e4d684cdab30afda538.
It fails in the section that prints MAKE SURE bridge-nf-call-iptables CONTAINS 1 - kubeadm requirement.
To solve it, I added these lines before the section mentioned above:

echo "MAKE SURE br_netfilter is enabled - kubeadm requirement"
if [ ! -f /proc/sys/net/bridge/bridge-nf-call-iptables ]; then
    modprobe br_netfilter
    echo "br_netfilter" > /etc/modules-load.d/br_netfilter.conf
fi

I plan to create a PR that solves this.

Issue creating krib-ha cluster

Hi,

I'm trying to run the krib-live-cluster and I have an issue during the etcd-config task.
The makeroot action seems to be missing on the machine.
I am running:
provision v3.9.0-tip-4-fd645195a539acebc6e45f8d44bb745677343d8c
Feature Flags
api-v3, sane-exit-codes, common-blob-size, change-stage-map, job-exit-states, package-repository-handling, profileless-machine, threaded-log-levels, plugin-v2, fsm-runner, plugin-v2-safe-config, workflows, default-workflow, http-range-header, roles, tenants, secure-params, seperate-meta-api, slim-objects

krib content v1.9.0-tip-15-f7a61da2534f67fc292598e62964b164b3500597

Here is the log of the task:

Log for Job: 95530bdc-7a1b-4578-af40-50615e105d12
Starting task etcd-config on 59ea578f-c892-4ae6-81b9-7c4d284ecbbd
Starting command ./etcd-config-etcd-config.sh.tmpl

Command running
Configure the etcd cluster
Add initial variables to track members.
Error: Key, etcd/servers, already present on profile k8s
Error: Key, etcd/servers-done, already present on profile k8s
Creating 1 servers
Electing Members to k8s
Error: INVOKE: machines/59ea578f-c892-4ae6-81b9-7c4d284ecbbd: Action makeroot on machines: Not Found
Command exited with status 1
Action etcd-config.sh.tmpl finished
Task etcd-config failed
Marked machine 59ea578f-c892-4ae6-81b9-7c4d284ecbbd as not runnable
Updated job 95530bdc-7a1b-4578-af40-50615e105d12 to failed
Task signalled that it failed

Br Simon

krib: krib hangs at etcd-config stage

This happens if the certs-data profile is not added to the machine, or the certs-data profile is empty because the certs plugin is not installed.

There's no indication of what the problem is. The task log only shows:
Log for Job: 7ed91edc-7002-43ae-afdb-b7e5502ffc0d

krib: install fails, api-server, scheduler and controller manager can't start

Installation of Kubernetes fails while waiting for its base services, which can't start. Looking at krib-kubeadm.cfg.tmpl, it provides extra volumes to controllerManager, scheduler, and apiServer.
Krib content-package version is v1.12.0-tip-151-31915dd7430d7119be359e4d684cdab30afda538.

  extraVolumes:
    - name: hyperkube
      hostPath: /k8s/hyperkube
      mountPath: /hyperkube

These three containers fail to start. By removing the extra volumes from the kubeadm config all three services start up as expected.

Is this mount of an external hyperkube binary used in the CentOS installation, or is it just something left over from testing?

krib: krib-dashboard task invalid url in default krib/dashboard-config param

./krib-dashboard-krib-dashboard.sh.tmpl@259(): wget -O /tmp/kubernetes-dashboard.yaml https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
--2018-12-26 13:33:55--  https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-12-26 13:33:56 ERROR 404: Not Found.

Workaround: create a new profile with a krib/dashboard-config parameter containing the tagged version:

{"krib/dashboard-config":"https://raw.githubusercontent.com/kubernetes/dashboard/v1.10.1/src/deploy/recommended/kubernetes-dashboard.yaml"}

Sledgehammer lsblk update for JSON output

Updated versions of lsblk from the util-linux metapackage can produce JSON output of block devices, which would be very useful for iterating over raw storage. Versions 2.27 and above have this functionality.

I can only find community builds for 2.29; it looks like they're used for rkt: https://cbs.centos.org/koji/buildinfo?buildID=15169

Any chance this can be pulled in during the sledgehammer build process?
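
For reference, a quick sketch of what the JSON output enables (assuming util-linux 2.27+ and jq available; the fields are standard lsblk columns):

# Emit block devices as JSON, then pick out the raw disks with jq
lsblk -J -o NAME,SIZE,TYPE,MOUNTPOINT > /tmp/blockdevs.json
jq -r '.blockdevices[] | select(.type == "disk") | .name' /tmp/blockdevs.json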

Template config incorrect on Ubuntu when "access-ssh-root-mode" parameter set

echo "PermitRootLogin {{if .ParamExists "access-ssh-root-mode"}}{{.Param "access-ssh-root-mode"}}{{else}}without-password{{end}}" >> /etc/ssh/sshd_config
does not succeed in changing the behavior of "PermitRootLogin" on Ubuntu.

The generated post-install.sh file for an Ubuntu machine using a Profile that sets the access-ssh-root-mode parameter to "yes" has this line in it:

echo "PermitRootLogin yes" >> /etc/ssh/sshd_config

This results in the actual sshd_config file having two active PermitRootLogin lines, e.g.:

PermitRootLogin without-password
  [...]
PermitRootLogin yes

This is undefined, but empirically in my testing, sshd seems to honor only the first of any duplicate stanzas.

The current stock CentOS sshd_config has the following lines with PermitRootLogin in them:

#PermitRootLogin yes
# the setting of "PermitRootLogin without-password".

The current stock Ubuntu sshd_config has the following lines with PermitRootLogin in them:

PermitRootLogin prohibit-password
# the setting of "PermitRootLogin without-password".

Considering the "yes" value case, this sed command line would end up doing the right thing for either CentOS or Ubuntu:
sed --in-place -r -e '/^#?PermitRootLogin/ s/^#//' -e '/^#?PermitRootLogin/ s/prohibit-password/yes/' /etc/ssh/sshd_config

Of course, the value ("yes" in my case) needs to be parameterized to support the valid alternate values [without-password|yes|no|forced-commands-only]

So: replace the "echo ..." line in the template with the "sed ..." line, with the relevant parameterization, as sketched below.
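
A minimal sketch of that replacement, reusing the template's existing parameter lookup (an illustration, not the shipped template):

# Rendered by the template engine before the script runs; keeps the old default
MODE="{{if .ParamExists "access-ssh-root-mode"}}{{.Param "access-ssh-root-mode"}}{{else}}without-password{{end}}"
# Uncomment any PermitRootLogin stanza, then force it to the requested mode
sed --in-place -r \
    -e '/^#?PermitRootLogin/ s/^#//' \
    -e "s/^PermitRootLogin .*/PermitRootLogin ${MODE}/" /etc/ssh/sshd_config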

Sledgehammer laptops suspend if lid is closed

The Sledgehammer builder defaults cause it to suspend laptops when the lid is closed. This is fixed by staging a custom version of /etc/systemd/logind.conf in the sledgehammer-stage-bits.yaml task, overriding two values:

  - Name: systemd-logind-do-not-suspend-laptops
    Path: /etc/systemd/logind.conf
    Contents: |
      #  This file is part of systemd.
      #
      #  systemd is free software; you can redistribute it and/or modify it
      #  under the terms of the GNU Lesser General Public License as published by
      #  the Free Software Foundation; either version 2.1 of the License, or
      #  (at your option) any later version.
      #
      # Entries in this file show the compile time defaults.
      # You can change settings by editing this file.
      # Defaults can be restored by simply deleting this file.
      #
      # See logind.conf(5) for details.
      
      [Login]
      #NAutoVTs=6
      #ReserveVT=6
      #KillUserProcesses=no
      #KillOnlyUsers=
      #KillExcludeUsers=root
      #InhibitDelayMaxSec=5
      #HandlePowerKey=poweroff
      #HandleSuspendKey=suspend
      #HandleHibernateKey=hibernate
      HandleLidSwitch=ignore
      #HandleLidSwitchDocked=ignore
      #PowerKeyIgnoreInhibited=no
      #SuspendKeyIgnoreInhibited=no
      #HibernateKeyIgnoreInhibited=no
      LidSwitchIgnoreInhibited=no
      #HoldoffTimeoutSec=30s
      #IdleAction=ignore
      #IdleActionSec=30min
      #RuntimeDirectorySize=10%
      #RemoveIPC=yes
      #InhibitorsMax=8192
      #SessionsMax=8192
      #UserTasksMax=33%

kubeadm preflight checks error

As a user of KRIB content, I would like all the kubeadm preflight checks to pass so the KRIB automation works without manual intervention.
https://kubernetes.io/docs/reference/setup-tools/kubeadm/implementation-details/

Error:

[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1

Workaround:

ssh into the bare metal server and run echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
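
A sketch of a more persistent variant of that workaround (standard sysctl.d mechanics; this is not the stock KRIB fix):

# Load the bridge module and make the kubeadm-required sysctl survive reboots
modprobe br_netfilter
echo "net.bridge.bridge-nf-call-iptables = 1" > /etc/sysctl.d/99-kubeadm.conf
sysctl --system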

efi: stuck in local boot env after ubuntu install

Steps to reproduce:

  • wipe disk
  • select 'ubuntu-base' as workflow
  • wait for install to complete
  • select 'sledgehammer' as workflow

Expected: booting on 'sledgehammer'
Actual: booted on local system (ubuntu)

When the target reboots, since 'ubuntu' has the highest priority in the EFI boot order, the system boots from it:

root@ubuntu:~# efibootmgr 
BootCurrent: 0004
Timeout: 5 seconds
BootOrder: 0004,0002,0003,0001,0000
Boot0000* EFI Misc Device
Boot0001* EFI Internal Shell
Boot0002* EFI Network 001320FE4CA5 IPv4
Boot0003* EFI Network 001320FE4CA5 IPv6
Boot0004* ubuntu

I would expect the system to first try to boot from PXE, with syslinux or iPXE chaining to the bootloader on the local disk (which seems to be what the 'local' bootenv does).

There is a reorder-uefi-bootorder stage (and an always-pxe-in-uefi-first task), but neither seems to be used anywhere. Is there a reason?
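
As a manual workaround, something along these lines puts PXE back in front (entry numbers taken from the efibootmgr listing above; adjust for your hardware):

# One-shot: boot the IPv4 PXE entry on the next reboot only
efibootmgr -n 0002
# Permanent: move the network entries ahead of the 'ubuntu' entry
efibootmgr -o 0002,0003,0004,0001,0000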

Consolidate all "d-i partman*" directives into part-scheme-default.tmpl

At the current time, preseed "partman" directives are split between the base preseed template (net-seed.tmpl) and the default disk partitioning sub-template (part-scheme-default.tmpl). I have found that some of the "d-i partman*" directives in net-seed.tmpl are incorrect for custom partitioning sub-templates (which can be selected by setting an appropriate string value in the part-scheme param). It would be better, and more logical, if all "d-i partman*" directives were located in the default partitioning template (or a custom version thereof).

Kickstart post install scripts fail after upgrade

After upgrading to dR 3.11.0 and community-content 1.11.0, kickstart-based installations using e.g. centos-7.ks.tmpl fail to run drpcli for Machine tasks during post-installation. The error indicated is:

/tmp/ks-script-CkENlf: line 75: drpcli: command not found

Please find attached the full /mnt/sysimage/root/post-install.log from the kickstart installation.

post-install.log.gz

ipxe-shell bootenvs not working

I've tried to use the bootenv from dev-library/bootenvs/ipxeshell.yaml and can't get it to work.
This is the syslinux config:

      DEFAULT local
      PROMPT 0
      TIMEOUT 10
      COM32 chain.c32
      APPEND grub/grub.pxe

The first thing it complains about is that local doesn't exist, so I added LABEL local before COM32 chain.c32, but then it also complains about the APPEND format, and I can't find anything in https://wiki.syslinux.org/wiki/index.php?title=Comboot/chain.c32 about tftp.

However, it does seem like pxechn.c32 would do the job (https://wiki.syslinux.org/wiki/index.php?title=Pxechn.c32), so I tried that, and it fails to run grub.pxe.

It says 'Booting ...', prints some garbage characters, and stops.

I also tried using an ipxe.pxe with an embedded config instead, something like:

DEFAULT local
PROMPT 0
TIMEOUT 10
LABEL local
  COM32 pxechn.c32
  APPEND ipxe-mymachine.pxe

and it worked.

The original solution mentions something about EFI, but it doesn't seem like pxechn.c32 would work in EFI mode (I haven't tested).

rhel bootenvs does not use select-kickseed parameter

Steps to reproduce

  1. Try to install RHEL from the os-other package using a custom kickstart file.

Expected result

The kickstart file should be changeable via the select-kickseed param, the same way it is for centos/debian/ubuntu.

Actual result

compute.ks in the redhat-7.0-install, redhat-6.5-install, and rhel-server-7-install bootenvs is hardcoded to centos-6.ks.tmpl, centos-7.ks.tmpl, and rhel-7.ks.tmpl respectively.
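
For reference, the usage that works on the centos/debian/ubuntu bootenvs looks roughly like this (a sketch; the UUID variable and template name are placeholders):

# Point a machine's install at a custom kickstart template via the param
drpcli machines set $RS_UUID param select-kickseed to '"my-rhel7.ks.tmpl"'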

Rebar KRIB plugin fails with `Task etcd-config failed` during cert setup stage (or just after)

KRIB is failing with Task etcd-config failed

Full error:

Log for Job: aa7e391c-87fe-4344-ba5b-3e7ce81e7d42
Starting task krib-install-cluster:etcd-config:etcd-config on machine-2
Starting command ./etcd-config-etcd-config.sh.tmpl


Command running
Configure the etcd cluster
Add initial variables to track members.
[]
[]
Creating 1 servers
Electing etcd members to cluster profile: k8s
Certs plugin detected....setting up CA
We are first machine in cluster, setting up the root certs...
  Client CA Exists, but we did not set password.  Need to reset!!
Command exited with status 1
Action etcd-config.sh.tmpl finished
Task etcd-config failed
Marked machine machine-2 as not runnable
Updated job for krib-install-cluster:etcd-config:etcd-config to failed
Task signalled that it failed

It looks like it's falling into this block: https://github.com/digitalrebar/provision-content/blob/master/krib/templates/etcd-config.sh.tmpl#L68

It seems like the command

drpcli machines runaction $RS_UUID getca certs/root $CLIENT_CA 2>/dev/null

is failing, so the script thinks the certs are already created when they are not.
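
A sketch of a more defensive version of that call, which would surface the failure instead of masking it (assuming drpcli's exit status reflects whether the action succeeded; variables as in the template):

# Fail loudly if the getca action errors out, rather than treating
# empty output as "CA already exists"
if ! drpcli machines runaction $RS_UUID getca certs/root $CLIENT_CA; then
    echo "getca action failed - cannot verify existing CA" >&2
    exit 1
fi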

The steps to recreate below assume Rebar is already installed. We started with the KRIB content plugin, following the video https://youtu.be/rzBq3BsYQTM?t=1295. We also assume you have created a portal.rackn.io account (required for bulk actions and for adding the content package).

Steps to recreate.

  1. Go to https://portal.rackn.io/#/e/147.75.196.129:8092/machines and login

  2. Go to https://portal.rackn.io/#/e/147.75.196.129:8092/plugins/packet-ipmi and enter a machine name, count and click create. This will send an api request to create the nodes.

  3. Go to https://portal.rackn.io/#/e/147.75.196.129:8092/machines, and wait for nodes to be discovered.

  4. Go to krib profile https://portal.rackn.io/#/e/147.75.196.129:8092/profiles/example-ha-krib

  5. Clone

  6. Enter name for new profile e.g k8s

  7. Set the etcd/cluster-profile param to match the profile name "k8s"

  8. Set the krib/cluster-profile param to match the profile name "k8s"

  9. Save new profile

  10. Go to https://portal.rackn.io/#/e/147.75.196.129:8092/bulk

  11. Select all machines to be used in new k8s cluster

  12. Use profile drop down to choose the new "k8s" profile

  13. Click the + symbol to add the profile to the machines

  14. Use Workflows drop down to choose the krib-install-cluster workflow

  15. Click the Change Workflow button (play / skip button) - this starts the k8s deployment

Expected:

  • Running Kubernetes cluster

Result:

  • KRIB is failing with Task etcd-config failed

Auth Tokens invalidated in future due to user.ChangePassword()

Versions: tip from installer and current code in master both exhibit the same behaviour.

TL;DR: I'm writing an Ansible playbook to deploy DRP, which does the following, in order:

  1. Download and install DRP Binaries
  2. Configure Service & Start
  3. Create user accounts
  4. Set passwords of created user accounts
  5. Delete rocketskates user
  6. Download and activate DRP content packs using newly created admin account

Expressed as shell commands, the relevant part is something like this:

drpcli users create '{"Name":"admin","Description":"Admin User","Roles":["superuser"],"Available":true}'
drpcli users password admin "password"
drpcli users destroy rocketskates
export RS_TOKEN=`drpcli users token admin | jq -r '.Token'`
drpcli contents create dev-library.json # Works
drpcli contents create drp-community-content.json # Fails
drpcli contents create ansible.json # Fails
...

It doesn't matter what order the contents create lines are in; it always fails on the second one.

I built a local copy of dr-provision from master and added debug logging to frontend.userAuth(), which showed that the userSecret and grantorSecret were changing between the first and second contents create calls around here.

Checking this manually with drpcli users list showed the user's Secret being updated as well.

I tracked this back to user.ChangePassword(), which modifies the Secret whenever a user's password is changed but does not appear to save it to the User store at the same time.

Currently, the users password call changes the PasswordHash and the Secret, but does not commit the new Secret to the store (changing the Secret happens after the User is saved).

A token is then generated from this User - either using the 'old' Secret which had been previously persisted, or possibly the 'new' Secret which is stored ephemerally in the existing User object but not persisted to the backing store (not sure which one happens here).

The first content create call (or some other consistently occurring action) then either:

  • Causes the User to be reloaded from the store, invalidating any Tokens generated using the ephemeral Secret, or
  • Causes the ephemeral User to be saved to the store, invalidating any Tokens generated using the previously stored Secret.

To test if this is the cause, I made the following change:

diff --git a/backend/user.go b/backend/user.go
index dea0cca1..da0cd801 100644
--- a/backend/user.go
+++ b/backend/user.go
@@ -95,9 +95,9 @@ func (u *User) ChangePassword(rt *RequestTracker, newPass string) error {
                return err
        }
        u.PasswordHash = ph
-       _, err = rt.Save(u)
        // When a user changes their password, invalidate any previous cached auth tokens.
        u.Secret = randString(16)
+       _, err = rt.Save(u)
        return err
 }

This appears to fix the issue and causes a consistent Secret to be returned and used for all runs of contents create.

Krib failing at krib-config stage - pin kubelet version or ignore-preflight-errors?

In the krib-config stage of the krib-live-boot workflow, I encountered the following error on the master node when using KRIB on on-prem servers. TL;DR: you can fix it by pinning the kubelet version in the Kubernetes install to 1.11.1. When using --ignore-preflight-errors=all on kubeadm init in the krib-config stage, I get the error outlined below, but that option is probably not right since it ignores all the preflight checks. Documentation is thin on how to ignore specific errors, such as the control plane / kubelet version match, which might not be desirable anyway.

Initial Error

Command running
Configure kubeadm master and minions...
MAKE SURE SWAP IS OFF!- kubeadm requirement
MAKE SURE bridge-nf-call-iptables CONTAINS 1 - kubeadm requirement
net.bridge.bridge-nf-call-iptables = 1
My Master index is 0
I am master - run kubeadm
Starting Master kubeadm init process.
"xxx"
[init] using Kubernetes version: v1.11.1
[preflight] running pre-flight checks
	[WARNING Hostname]: hostname "sm-2" could not be reached
	[WARNING Hostname]: hostname "sm-2" lookup sm-2 on 10.1.2.2:53: server misbehaving
[preflight] Some fatal errors occurred:
	[ERROR KubeletVersion]: the kubelet version is higher than the control plane version. This is not a supported version skew and may lead to a malfunctional cluster. Kubelet version: "1.12.2" Control plane version: "1.11.1"
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
Command exited with status 2

which shows that the kubelet is a different version from the control plane. What's odd is that, on inspecting the kubernetes-install stage, I see that all the same versions were installed.

Installed:
  kubeadm.x86_64 0:1.12.2-0              kubectl.x86_64 0:1.12.2-0             

Dependency Installed:
  cri-tools.x86_64 0:1.12.0-0              kubelet.x86_64 0:1.12.2-0            
  kubernetes-cni.x86_64 0:0.6.0-0          socat.x86_64 0:1.7.3.2-2.el7     

So to get around that, I pinned the kubelet install to version 1.11.1, which returned this:

Installing:
 kubeadm              x86_64       1.12.2-0              kubernetes       7.2 M
 kubectl              x86_64       1.12.2-0              kubernetes       7.7 M
 kubelet              x86_64       1.11.1-0              kubernetes        18 M
Installing for dependencies:
 cri-tools            x86_64       1.12.0-0              kubernetes       4.2 M
 kubernetes-cni       x86_64       0.6.0-0               kubernetes       8.6 M
 socat                x86_64       1.7.3.2-2.el7         base             290 k

Transaction Summary
Install  3 Packages (+3 Dependent packages)
Total download size: 46 M
Installed size: 228 M
Downloading packages:
Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
warning: /var/cache/yum/x86_64/7/kubernetes/packages/53edc739a0e51a4c17794de26b13ee5df939bd3161b37f503fe2af8980b41a89-cri-tools-1.12.0-0.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 3e1ba8d5: NOKEY
Public key for 53edc739a0e51a4c17794de26b13ee5df939bd3161b37f503fe2af8980b41a89-cri-tools-1.12.0-0.x86_64.rpm is not installed

Total                                               16 MB/s |  46 MB  00:02     
Retrieving key from https://packages.cloud.google.com/yum/doc/yum-key.gpg
Importing GPG key 0xA7317B0F:
 Userid     : "Google Cloud Packages Automatic Signing Key <[email protected]>"
 Fingerprint: d0bc 747f d8ca f711 7500 d6fa 3746 c208 a731 7b0f
 From       : https://packages.cloud.google.com/yum/doc/yum-key.gpg
Retrieving key from https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
Importing GPG key 0x3E1BA8D5:
 Userid     : "Google Cloud Packages RPM Signing Key <[email protected]>"
 Fingerprint: 3749 e1ba 95a8 6ce0 5454 6ed2 f09c 394c 3e1b a8d5
 From       : https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : kubectl-1.12.2-0.x86_64                                      1/6 
  Installing : socat-1.7.3.2-2.el7.x86_64                                   2/6 
  Installing : kubernetes-cni-0.6.0-0.x86_64                                3/6 
  Installing : kubelet-1.11.1-0.x86_64                                      4/6 
  Installing : cri-tools-1.12.0-0.x86_64                                    5/6 
  Installing : kubeadm-1.12.2-0.x86_64                                      6/6 
  Verifying  : kubeadm-1.12.2-0.x86_64                                      1/6 
  Verifying  : kubelet-1.11.1-0.x86_64                                      2/6 
  Verifying  : cri-tools-1.12.0-0.x86_64                                    3/6 
  Verifying  : kubernetes-cni-0.6.0-0.x86_64                                4/6 
  Verifying  : socat-1.7.3.2-2.el7.x86_64                                   5/6 
  Verifying  : kubectl-1.12.2-0.x86_64                                      6/6 

Installed:
  kubeadm.x86_64 0:1.12.2-0 kubectl.x86_64 0:1.12.2-0 kubelet.x86_64 0:1.11.1-0

Dependency Installed:
  cri-tools.x86_64 0:1.12.0-0           kubernetes-cni.x86_64 0:0.6.0-0         
  socat.x86_64 0:1.7.3.2-2.el7         

Complete!

And then it worked. Very weird.
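
For reference, the pin described above amounts to something like this in the install stage (a sketch; package versions match the yum output above):

# Pin kubelet to the control-plane version while kubeadm/kubectl float
yum install -y kubelet-1.11.1-0 kubeadm kubectl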

When trying --ignore-preflight-errors=all on kubeadm init on master and worker, I get this error.

[discovery] Trying to connect to API Server "10.1.33.102:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.1.33.102:6443"
[discovery] Requesting info from "https://10.1.33.102:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "10.1.33.102:6443"
[discovery] Successfully established connection with API Server "10.1.33.102:6443"
[kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.12" ConfigMap in the kube-system namespace
configmaps "kubelet-config-1.12" is forbidden: User "system:bootstrap:fedcba" cannot get configmaps in the namespace "kube-system"
Command exited with status 1
Action krib-config-clone.sh.tmpl finished
Task krib-config-clone failed
Marked machine sm-1 as not runnable
Updated job for krib-live-cluster-clone:krib-config-clone:krib-config-clone to failed
Task signalled that it failed

Though I don't think ignoring "all" is the right approach.

Anyway, it works with version pinning; just thought I'd give a heads-up.

krib: install on coreos fails, missing /etc/fstab

I've been playing around with setting up a workflow that installs Kubernetes on CoreOS. I used CoreOS version 2023.4.0.
Unfortunately, it fails in krib-config.sh.tmpl on this line:

sed -i /swap/d /etc/fstab

This is because CoreOS doesn't have that file at all.
Adding a check that the file exists before running the command solves the problem. I plan to create a PR that solves it.
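
The guard is simple; a sketch of the planned fix:

# CoreOS has no /etc/fstab, so only strip swap entries when the file exists
if [ -f /etc/fstab ]; then
    sed -i /swap/d /etc/fstab
fi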

krib-dashboard.sh.tmpl - wget: command not found

As a user of the krib-install-cluster workflow, I would like the workflow to deploy a fully functional K8s cluster without errors. Currently I get a wget: command not found error during the krib-config stage, krib-dashboard task.

Logs:

./krib-dashboard-krib-dashboard.sh.tmpl@224(): wget -O /tmp/kubernetes-dashboard.yaml https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
./krib-dashboard-krib-dashboard.sh.tmpl: line 224: wget: command not found
Command exited with status 127
Action krib-dashboard.sh.tmpl finished
Task krib-dashboard failed
Marked machine 9832c35d-ae4c-4945-b4e9-dc1ed9113e6d as not runnable
Updated job 732e297a-eaac-49f6-beb7-851999f94f59 to failed
Task signalled runner to reboot

Proposed Resolution:

I believe we just need to add wget to this line:
https://github.com/digitalrebar/provision-content/blob/master/krib/templates/krib-dashboard.sh.tmpl#L38
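
Alternatively, the template could tolerate a missing wget by falling back to curl (a sketch; the URL is the one from the task log above):

# Prefer wget if present, otherwise fall back to curl
URL=https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
if command -v wget >/dev/null 2>&1; then
    wget -O /tmp/kubernetes-dashboard.yaml "$URL"
else
    curl -fsSL -o /tmp/kubernetes-dashboard.yaml "$URL"
fi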

krib reset fails if krib hadn't completed.

The issue is that if the profile is missing vars, the drpcli remove parameter call fails, which fails the whole task. The task needs to either ignore the failure or test for set parameters before deleting.

Rebar OpenStack content addon, ROSE, fails during content upload with a "missing required feature sprig" error

Error: Error: STORE_ERROR: contents: dr-provision missing required feature sprig

Prereqs:

  • Digital Rebar installed and accessible from the internet eg. https://147.75.196.129:8092
  • Username and password for API access to Digital Rebar
  • Linux host with docker installed to run commands

Steps to recreate

  1. Get the ROSE content package: git clone https://github.com/digitalrebar/provision-content.git

  2. cd provision-content/rose

  3. Get access to the RackN CLI: docker run -v $(pwd):/rose -ti digitalrebar/provision /bin/bash

  4. cd /rose

  5. /provision/drpcli -E https://147.75.196.129:8092 -U USERNAME -P PASSWORD contents bundle rose.yaml

  6. /provision/drpcli -E https://147.75.196.129:8092 -U USERNAME -P PASSWORD contents upload rose.yaml

Full error:

bash-4.3# /provision/drpcli -E https://147.75.196.129:8092 -U USERNAME -P PASSWORD contents upload rose.yaml            
Error: STORE_ERROR: contents: dr-provision missing required feature  sprig
bash-4.3# cd /rose/
bash-4.3# cd /rose && /provision/drpcli -E https://147.75.196.129:8092 -U USERNAME -P PASSWORD contents bundle rose.yaml && /provision/drpcli -E https://147.75.196.129:8092 -U USERNAME -P PASSWORD contents upload rose.yaml
Error: STORE_ERROR: contents: dr-provision missing required feature  sprig

krib-reset-cluster should not fail

If you try to reset an incomplete KRIB config, krib-reset-cluster will sometimes fail. I don't think it should fail; maybe log a warning if a parameter it wants to clear doesn't exist.

Error: DELETE: params/krib/cluster-bootstrap-token: Not Found
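
A sketch of the tolerant behavior (the profile variable is hypothetical; the param is the one from the error above):

# Don't fail the whole reset when a param was never set; warn instead
drpcli profiles remove "$CLUSTER_PROFILE" param krib/cluster-bootstrap-token 2>/dev/null \
    || echo "WARN: krib/cluster-bootstrap-token not set; skipping"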

Cluster-add still assumes cluster/machines is a list of ids

The cluster-add task adds the id of the current node to the cluster/machines param here.

This works fine; however, when visiting the cluster profile through the UI, a validation error is shown, because cluster/machines should be a list of objects.

Key 'cluster/machines': invalid val '[ca1b130d-2bc4-49ae-9922-313e5ccdde85 a8f32e7e-b8f1-4656-ab5b-e016b4488992 cf085c85-3e6f-4e2a-8cbb-2e4445262052]': 0: Invalid type. Expected: object, given: string 1: Invalid type. Expected: object, given: string 2: Invalid type. Expected: object, given: string
