openshift-checks's Introduction

openshift-checks

A set of scripts to run basic checks on an OpenShift cluster. PRs welcome!

โš ๏ธ This is an unofficial tool, don't blame us if it breaks your cluster

Usage

$ ./openshift-checks.sh -h
Usage: openshift-checks.sh [-h]

This script will run a minimum set of checks to an OpenShift cluster

Available options:

-h, --help                               Print this help and exit
-v, --verbose                            Print script debug info
-l, --list                               Lists the available checks
-s <script>, --single <script>           Executes only the provided script
--no-info                                Disable cluster info commands (default: enabled)
--no-checks                              Disable cluster check commands (default: enabled)
--no-ssh                                 Disable ssh-based check commands (default: enabled)
--prechecks path/to/install-config.yaml  Executes only prechecks (default: disabled)
--results-only                           Only shows pass/fail results from checks (default: disabled)

With no options, it will run all checks and info commands with no debug info
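For example, to run only the prechecks against an install-config.yaml (the path below is just a placeholder), or to get a quick pass/fail summary while skipping the ssh-based checks:

$ ./openshift-checks.sh --prechecks ~/clusterconfigs/install-config.yaml
$ ./openshift-checks.sh --results-only --no-ssh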

Container

An automated container build with the content of this repository's main branch is available at quay.io/rhsysdeseng/openshift-checks.

You can run it with your own kubeconfig file and any required parameters:

$ podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig quay.io/rhsysdeseng/openshift-checks:latest -h

You can even create a handy alias:

$ alias openshift-checks="podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig quay.io/rhsysdeseng/openshift-checks:latest"

Then, simply run it as:

$ openshift-checks -s info/00-clusterversion
Using default/api-foobar-example-com:6443/system:admin context
...

Note: If your kubeconfig file doesn't have the proper permissions, you may get the error "KUBECONFIG not set". In that case, verify that the kubeconfig file is readable by the user used inside the container, or just run chmod o+r kubeconfig on your host.

Build your own container

You can build your own container with the included Containerfile:

$ podman build --tag foobar/openshiftchecks .
STEP 1: FROM registry.access.redhat.com/ubi8/ubi:latest
...
$ podman push foobar/openshiftchecks
...

Then, run it by replacing quay.io/rhsysdeseng/openshift-checks:latest with your own image, such as foobar/openshiftchecks:latest:

$ podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig foobar/openshiftchecks:latest -h
Usage: openshift-checks.sh [-h]
...

CronJob

The checks can be scheduled to run periodically in an OpenShift cluster by creating a CronJob.

Check the cronjob.yaml example.
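If you would rather not maintain a manifest, a rough equivalent can be created with oc create cronjob (the name, namespace and schedule below are placeholders; the pod still needs a kubeconfig or a service account with sufficient permissions, which is what the cronjob.yaml example takes care of):

$ oc create cronjob openshift-checks \
    --image=quay.io/rhsysdeseng/openshift-checks:latest \
    --schedule='0 */6 * * *'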

How it works

The openshift-checks.sh script is just a wrapper around bash scripts located in the info, checks or ssh directories.
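Because each entry is a standalone bash script, you can also run one directly, bypassing the wrapper (a sketch; it assumes a valid KUBECONFIG and that the script has no dependencies on the wrapper's helper functions):

$ export KUBECONFIG=/home/foobar/kubeconfig
$ ./checks/pdb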

Checks

Check each script and its description in checks.md

Note: This file is autogenerated when running: ./scripts/update-checksmd > checks.md

Environment variables

INTEL_IDS (default: 8086:158b): Intel device IDs to check for firmware. Can be overridden for non-supported NICs.
OCDEBUGIMAGE (default: registry.redhat.io/rhel8/support-tools:latest): Used by oc debug.
OSETOOLSIMAGE (default: registry.redhat.io/openshift4/ose-tools-rhel8:latest): Used by oc debug in ethtool-firmware-version.
RESTART_THRESHOLD (default: 10): Used by the restarts script.
THRASHING_THRESHOLD (default: 10): Used by the port-thrashing script.
PARALLELJOBS (default: 1): oc debug commands run serially by default; set this to a value greater than 1 to run them in parallel.
OVN_MEMORY_LIMIT (default: 5000): Used by the ovn-pods-memory-usage script; memory limit (in Mi) above which a warning is triggered.
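Assuming the scripts read these values from the environment, they can be overridden per run or passed into the container with -e (the values below are arbitrary examples):

$ RESTART_THRESHOLD=5 PARALLELJOBS=4 ./openshift-checks.sh --no-ssh
$ podman run -it --rm -v /home/foobar/kubeconfig:/kubeconfig:Z -e KUBECONFIG=/kubeconfig -e PARALLELJOBS=4 quay.io/rhsysdeseng/openshift-checks:latest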

About firmware version

The current script only checks the firmware version of the NICs supported by the SR-IOV operator (as of OpenShift 4.6).

You can add your own device IDs if needed by modifying the script (hint: the variable is called IDS and the format is vendorID_A:deviceID_A vendorID_B:deviceID_B).
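For example, to also check a second Intel device ID alongside the default one, the variable would look like this (the extra device ID 8086:1572 is only an illustration):

IDS="8086:158b 8086:1572"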

Collaborate

Add a new script that gathers some information or performs a check to the appropriate folder and create a pull request.

Make sure you include a # description: $TEXT comment; it is later used to populate the checks.md file with the description.
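A minimal sketch of what such a script could look like (the file name, logic and exit convention are illustrative rather than taken from the repository; only the # description comment is the documented convention):

#!/usr/bin/env bash
# description: Reports nodes that are marked as unschedulable
set -euo pipefail

# Hypothetical check: list cordoned/unschedulable nodes using oc and jq
UNSCHEDULABLE=$(oc get nodes -o json | jq -r '.items[] | select(.spec.unschedulable == true) | .metadata.name')
if [ -n "${UNSCHEDULABLE}" ]; then
  echo "Unschedulable nodes found: ${UNSCHEDULABLE}"
  exit 1
fi
echo "No issues found"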

Tips & Tricks

Send an email if some check fails

You can pipe the script output to mail so that an email is sent if any check fails.

First, configure postfix (already included in RHEL 8) as a relay host (see https://access.redhat.com/solutions/217503). For example:

  • Append the following settings in /etc/postfix/main.cf:
myhostname = kni1-bootstrap.example.com
relayhost = smtp.example.com
  • Restart the postfix service:
sudo systemctl restart postfix
  • Test it:
echo "Hola" | mail -s 'Subject' [email protected]

Then, run the script as:

/openshift-checks.sh > /tmp/oc-errors 2>&1 || mail -s "Something has failed" [email protected] < /tmp/oc-errors

As a bonus, you can include this in a cronjob for periodic checks.
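For instance, a crontab entry along these lines (the schedule, paths and address are placeholders; KUBECONFIG must also be set in the cron environment):

0 */6 * * * KUBECONFIG=/home/foobar/kubeconfig /home/foobar/openshift-checks/openshift-checks.sh > /tmp/oc-errors 2>&1 || mail -s "openshift-checks failed" foobar@example.com < /tmp/oc-errors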

Get JSON and HTML output

This requires installing the Python dependencies listed in the requirements.txt file, ideally inside a virtual environment. Once they are installed, execute:

./risu.py -l

This automatically executes the checks against the current environment and generates two output files:

  • osc.json
  • osc.html

When served over a web server, the HTML file pulls the JSON file via AJAX and presents the results of the checks graphically.
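Any static web server works for this; for a quick local look (assuming osc.json and osc.html are in the current directory):

$ python3 -m http.server 8080
$ xdg-open http://localhost:8080/osc.html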

openshift-checks's People

Contributors

albertcard, dcritch, dependabot[bot], e-minguez, iranzo, loganmc10, ptrnull, ribua, sronanrh


openshift-checks's Issues

CLI and HTML report not consistent

Hi,

Reporting an inconsistency: checks that work from the CLI show the following messages.

./openshift-checks.sh -s checks/zombies
Using system:serviceaccount:ribu-test:sa-kubeconfig context
Collecting zombie processes... (using oc debug, it can take a while)
No issues found

The same check, when run via the HTML report, only shows the information below:

(screenshot omitted)

The page doesn't refresh with the new information or final results.

Ribu

Clean up some of the code

Most of the checks are executed via oc debug. It might declutter the output a bit if we remove the 'please wait' message

msg "Checking Intel firmware version (${BLUE}using oc debug, it can take a while${NOCOLOR})"

and similar in many places in the code

[Bug] mellanox-firmware-version doesn't report version properly

[loganmc10@toolbox openshift-checks]$ ./openshift-checks.sh -s info/mellanox-firmware-version 
Using system:admin context
Checking Mellanox firmware version (using oc debug, it can take a while)
node/lmcnaugh-d17u05-b10.cloud.lab.eng.bos.redhat.com:
0000:3b:00.0 => 0000:3b:00.1 => 
No issues found

The version check fails, but only because it isn't parsing the version number properly:

[loganmc10@toolbox openshift-checks]$ ./openshift-checks.sh -s checks/mellanox-firmware-version
Using system:admin context
Checking Mellanox firmware version (using oc debug, it can take a while)
Firmware for Mellanox card 0000:3b:00.0 on node/lmcnaugh-d17u05-b10.cloud.lab.eng.bos.redhat.com is below the minimum recommended version. Please upgrade to at least 16.28.
Total issues found: 1

This is a Supermicro server; here is some lspci output:

sh-4.4# for id in 15b3:1015 15b3:1017 15b3:1013 15b3:101b; do echo $(lspci -D -d "${id}");done
0000:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] 0000:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

Intel NIC Firmware version is not being read using the oc debug container

The oc debug container does not return the Intel firmware version.

ethtool may be another way to get this info:

ethtool -i ens3f0
driver: i40e
version: 2.8.20-k
firmware-version: 6.00 0x800036cb 1.1747.0
expansion-rom-version:
bus-info: 0000:d8:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

0000:12:00.0 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
        Subsystem: Hewlett Packard Enterprise Ethernet 10/25/Gb 2-port 661SFP28 Adapter
        Physical Slot: 1
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 26
        NUMA node: 0
        Region 0: Memory at df000000 (64-bit, prefetchable) [size=16M]
        Region 3: Memory at e1000000 (64-bit, prefetchable) [size=32K]
        [virtual] Expansion ROM at e1080000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00001000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <16us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP- SDES+ TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 16, stride: 1, Device ID: 154c
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 00000c7fffa00000 (64-bit, prefetchable)
                Region 3: Memory at 00000c7ffff00000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                No steering table available
        Capabilities: [1b0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [1d0 v1] #19
        Kernel driver in use: i40e
        Kernel modules: i40e

Running a single script runs the ssh checks as well

$ ./openshift-checks.sh -v -s checks/pdb
+ shift
+ :
+ case "${1-}" in
+ SINGLE=1
+ SCRIPT_PROVIDED=checks/pdb
+ shift
+ :
+ case "${1-}" in
+ break
+ return 0
+ setup_colors
+ [[ -t 2 ]]
+ [[ -z '' ]]
+ [[ xterm-256color != \d\u\m\b ]]
+ NOCOLOR='\033[0m'
+ RED='\033[0;31m'
+ GREEN='\033[0;32m'
+ ORANGE='\033[0;33m'
+ BLUE='\033[0;34m'
+ PURPLE='\033[0;35m'
+ CYAN='\033[0;36m'
+ YELLOW='\033[1;33m'
+ main -v -s checks/pdb
+ '[' 0 -ne 0 ']'
+ for i in oc yq jq curl column
+ check_command oc
+ command -v oc
+ for i in oc yq jq curl column
+ check_command yq
+ command -v yq
+ for i in oc yq jq curl column
+ check_command jq
+ command -v jq
+ for i in oc yq jq curl column
+ check_command curl
+ command -v curl
+ for i in oc yq jq curl column
+ check_command column
+ command -v column
+ '[' 0 -gt 0 ']'
+ kubeconfig
++ oc whoami
+ CONTEXT=system:admin
+ '[' -z system:admin ']'
+ '[' 1 -ne 0 ']'
+ msg 'Using \033[0;32msystem:admin\033[0m context'
+ echo -e 'Using \033[0;32msystem:admin\033[0m context'
Using system:admin context
++ oc_whoami
+++ oc whoami
++ WHOAMI=system:admin
++ '[' -z system:admin ']'
++ echo system:admin
+ OCUSER=system:admin
+ '[' 1 -ne 0 ']'
+ INFO=0
+ CHECKS=0
+ PRE=0
+ checks/pdb
PodDisruptionBudget with 0 disruptions allowed: {
  "name": "my-pdb"
}
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 1 -gt 0 ']'
+ msg 'Running ssh-based health checks as \033[0;32msystem:admin\033[0m'
+ echo -e 'Running ssh-based health checks as \033[0;32msystem:admin\033[0m'
Running ssh-based health checks as system:admin
+ for ssh in ./ssh/*
+++ cat /tmp/tmp.0NMHGRAT4U
++ expr 1 + 0
+ export errors=1
+ errors=1
+ ./ssh/bz1941840
Checking for a hung kubelet...
Error running crictl stats openshift-authentication-operator/
+++ cat /tmp/tmp.0NMHGRAT4U
++ expr 1 + 0
+ export errors=1
+ errors=1
+ '[' 1 -gt 0 ']'
+ die '\033[0;31mTotal issues found: 1\033[0m'
+ local 'msg=\033[0;31mTotal issues found: 1\033[0m'
+ local code=1
+ error '\033[0;31mTotal issues found: 1\033[0m'
+ echo -e '\033[0;31mTotal issues found: 1\033[0m'
Total issues found: 1
+ exit 1
+ rm /tmp/tmp.0NMHGRAT4U

[RFE] filter the nodes by a label selector

oc debug node is quite time-consuming, and a user may want to check only a specific group of nodes or a single node. A parameter like the following would be really useful:
-f sriov

For example, running the checks on the nodes with the label sriov would select the following nodes and execute the checks only on them:

$ oc get node -l sriov
NAME                STATUS   ROLES    AGE    VERSION
rna1-master-0.*   Ready    worker   104d   v1.17.1
rna1-master-1.*   Ready    worker   104d   v1.17.1
rna1-master-2.*   Ready    worker   104d   v1.17.1

[RFE] Check mellanox network cards firmware

With Mellanox Technologies MT27710 Family [ConnectX-4 Lx]:

for device in $(lspci -D | grep -i mellanox |grep -vi virtual | awk '{ print $1 }');do echo -n "${device} => "; grep -aoP '(?<=FV)[0-9,.]{8}' /sys/bus/pci/devices/${device}/vpd; done
0000:19:00.0 => 14.26.60
0000:19:00.1 => 14.26.60
0000:3b:00.0 => 14.26.60
0000:3b:00.1 => 14.26.60

[RFE] Add prechecks

It would be nice to also include prechecks so they can be executed before a cluster deployment to prevent failed deployments.
Using the install-config.yaml as the entry point, the script could check the IPs, IPMI, DNS, etc., as well as perform some health checks on the provisioner host such as disk space, available memory, libvirtd status, etc.

[issue] MTU script syntax error

Getting this error when no MTU modifications have been implemented:

./info/mtu: line 26: syntax error near unexpected token `<'
./info/mtu: line 26: `mapfile -t MTUS < <( oc debug "${node}" -- chroot /host sh -c 'export EXTBR="br-ex"; export OVNBR="ovn-k8s-mp0"; export BMINTERFACE=$(ovs-vsctl list-ports "${EXTBR}" | grep -v patch) ; echo "${BMINTERFACE}"; nmcli -g GENERAL.MTU dev show "${BMINTERFACE}"; nmcli -g GENERAL.MTU dev show "${EXTBR}"; nmcli -g GENERAL.MTU dev show "${OVNBR}"' 2> /dev/null )'

[RFE] Automate quay.io builds

It seems the current container image build is not automated. It would be nice if the container were rebuilt after PRs are merged so the latest builds are available automatically.

NotReady nodes are not shown

nodes_not_ready=$(oc get nodes -o json | jq '.items[] | { name: .metadata.name, type: .status.conditions[] } | select ((.type.type == "Ready") and (.type.status == "False"))')

type.status should be != "True" because, for example, this node shows Unknown instead of False:

cat nodes-notready.json | jq '.items[] | { name: .metadata.name, type: .status.conditions[] } | select ((.type.type == "Ready") and (.type.status != "True"))'
{
  "name": "kni1-master-1.example.com",
  "type": {
    "lastHeartbeatTime": "2021-04-15T09:11:05Z",
    "lastTransitionTime": "2021-04-15T09:12:06Z",
    "message": "Kubelet stopped posting node status.",
    "reason": "NodeStatusUnknown",
    "status": "Unknown",
    "type": "Ready"
  }
}

[RFE] Run just a single script

It would be nice to be able to run only a single script, such as:

openshift-check.sh -s mtu

And to get a list of scripts as:

openshift-check.sh -l
