
check_ib_switch's Introduction

check_ib_switch.py

This tool monitors unmanaged InfiniBand switches made by Mellanox. It uses mlxreg_ext to query registers on the switch. Some reported values may be wrong, since interpreting the registers involves guesswork.

Requirements

  • mft (Mellanox Firmware Tools, which provides mlxreg_ext)
  • python3

Usage

This script is intended to be used with NRPE.

A switch can be accessed with its GUID or the combination of its name and the path to a node name map file.
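As an illustration of the name-based lookup, the sketch below resolves a switch name to a GUID. It assumes the node name map uses the common InfiniBand diagnostics layout, one `GUID "name"` pair per line with `#` comment lines; the function names are hypothetical and are not taken from check_ib_switch.py.

```python
import re

# Assumed line format (not confirmed by this README):
#   0x0002c90000000001 "switch0"
_MAP_LINE = re.compile(r'^\s*(0x[0-9a-fA-F]+)\s+"([^"]*)"')


def parse_node_name_map(text):
    """Return a {name: guid} dict built from node name map text."""
    names = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = _MAP_LINE.match(line)
        if match:
            guid, name = match.groups()
            names[name] = guid.lower()
    return names


def guid_for_name(text, name):
    """Return the GUID mapped to `name`, or None when it is not listed."""
    return parse_node_name_map(text).get(name)
```

In practice the text would come from the file given to --node_name_map, for example `open("/etc/node-name-map").read()`.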

The --cable option also queries each transceiver for its temperature and displays cable information such as part number and length.

usage: check_ib_switch.py [-h] [-v] [--guid GUID]
                          [--node_name_map NODE_NAME_MAP] [--name NAME]
                          [--fan] [--cable] [--psu] [--temp]

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase output verbosity
  --guid GUID           Switch GUID to check
  --node_name_map NODE_NAME_MAP
                        Node name map file path
  --name NAME           Switch name used in node-name-map
  --fan                 Check fans
  --cable               Check cables
  --psu                 Check PSUs
  --temp                Check temperatures

Example

Serial numbers and the switch name are redacted.

check_ib_switch.py --name switch0 --node_name_map /etc/node-name-map --psu --temp --fan --cable
Switch OK | PSU0_W=58;;30:100;; PSU1_W=65;;30:100;; Fan1_RPM=8493;;6000:10000;; Fan2_RPM=7232;;6000:10000;; Fan3_RPM=8389;;6000:10000;; Fan4_RPM=7119;;6000:10000;; Fan5_RPM=8389;;6000:10000;; Fan6_RPM=7156;;6000:10000;; Fan7_RPM=8441;;6000:10000;; Fan8_RPM=7232;;6000:10000;; Temperature1_C=22.4;;5:45;; Temperature2_C=31.2;;5:45;; Temperature3_C=25.6;;5:45;; Temperature4_C=30.4;;5:45;; Temperature5_C=27.2;;5:45;; Temperature6_C=32.8;;5:45;;
GUID=0xXXXXXXXXXXXXXXXX LID=31 Name=switch0
Scorpion2 IBEDRUnmanaged PN=057FVR Rev=A05 SN=XX
Cable #1, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #2, Mellanox PN=MFA1A00-E030 SN=XX Rev=B1 FW=538837208, 30M
Cable #3, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #4, Mellanox PN=MFA1A00-E030 SN=XX Rev=B1 FW=538837208, 30M
Cable #5, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #6, Mellanox PN=MFA1A00-E030 SN=XX Rev=B1 FW=538837208, 30M
Cable #7, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #8, Mellanox PN=0H4TJX SN=XX Rev=E2 FW=0, 3M
Cable #9, Mellanox PN=0684G2 SN=XX Rev=E3 FW=538837208, 10M
Cable #10, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #11, Mellanox PN=0684G2 SN=XX Rev=E3 FW=538837208, 10M
Cable #12, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #13, Mellanox PN=0684G2 SN=XX Rev=E3 FW=538837208, 10M
Cable #14, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #15,  PN= SN= Rev= FW=0, 0M
Cable #16, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #17,  PN= SN= Rev= FW=0, 0M
Cable #18, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #19, Mellanox PN=MFA1A00-E030 SN=XX Rev=B1 FW=538837208, 30M
Cable #20, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #21, Mellanox PN=MFA1A00-E030 SN=XX Rev=B1 FW=538837208, 30M
Cable #22, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #23, Mellanox PN=MFA1A00-E030 SN=XX Rev=B1 FW=538837208, 30M
Cable #24, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #25, Mellanox PN=0H4TJX SN=XX Rev=E2 FW=0, 3M
Cable #26, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #27, Mellanox PN=0H4TJX SN=XX Rev=E2 FW=0, 3M
Cable #28, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #29, Mellanox PN=0H4TJX SN=XX Rev=E2 FW=0, 3M
Cable #30, Mellanox PN=0684G2 SN=XX Rev=E3 FW=538837208, 10M
Cable #31, Mellanox PN=0H4TJX SN=XX Rev=E2 FW=0, 3M
Cable #32, Mellanox PN=0684G2 SN=XX Rev=E3 FW=538837208, 10M
Cable #33, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #34, Mellanox PN=0684G2 SN=XX Rev=E3 FW=538837208, 10M
Cable #35, Mellanox PN=02F00T SN=XX Rev=E2 FW=0, 2M
Cable #36, Mellanox PN=0684G2 SN=XX Rev=E3 FW=538837208, 10M
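The part of the first output line after the `|` follows the Nagios performance data layout `label=value;warn;crit;min;max`. In the example above only the third field is filled, with a `low:high` range. The sketch below shows how such a line could be assembled; it assumes the range is treated as the critical threshold, and the function names and the CRITICAL wording are hypothetical rather than taken from the script.

```python
def perfdata_item(label, value, low, high):
    """Format one item as label=value;warn;crit;min;max with only crit set."""
    return f"{label}={value};;{low}:{high};;"


def summarize(readings, low, high):
    """Return (status_line, exit_code) for a {label: value} dict.

    Exit codes follow the Nagios convention: 0 is OK, 2 is CRITICAL.
    """
    failed = [label for label, value in readings.items()
              if not low <= value <= high]
    items = " ".join(perfdata_item(label, value, low, high)
                     for label, value in readings.items())
    if failed:
        return f"Switch CRITICAL: {', '.join(failed)} | {items}", 2
    return f"Switch OK | {items}", 0
```

Calling `summarize({"PSU0_W": 58, "PSU1_W": 65}, 30, 100)` reproduces the PSU portion of the example line with exit code 0.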


check_ib_switch's Issues

Support HDR switches (MQM8790-HS2F)

The --psu and --temp options are currently broken:

python3 check_ib_switch.py --guid changeme --psu
Traceback (most recent call last):
  File "check_ib_switch.py", line 182, in <module>
    if psus[psu_watt] < 30:
KeyError: 'watt_0'
python3 check_ib_switch.py --guid changeme --temp
Traceback (most recent call last):
  File "check_ib_switch.py", line 212, in <module>
    temperature = temp_info['temperature']/10
KeyError: 'temperature'

The --fan option does work:

# python3 check_ib_switch.py --guid changeme --fan
Switch OK | Fan1_RPM=12027;;4500:13000;; Fan2_RPM=10846;;4500:13000;; Fan3_RPM=11813;;4500:13000;; Fan4_RPM=10846;;4500:13000;; Fan5_RPM=12027;;4500:13000;; Fan6_RPM=10846;;4500:13000;; Fan7_RPM=12137;;4500:13000;; Fan8_RPM=10935;;4500:13000;;
GUID=changeme LID=737 Name=changeme
Jaguar UnmngIB200 PN=MQM8790-HS2F Rev=AF SN=changeme
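The tracebacks above indicate that the dictionaries parsed for this switch lack the `watt_0` and `temperature` keys. One possible way to make this failure clearer, sketched below, is to turn a missing key into a descriptive error that the caller can report as UNKNOWN. The dictionary shapes are assumed from the traceback lines only, and `MissingField`, `read_field`, `psu_watts`, and `temperature_c` are hypothetical names.

```python
UNKNOWN = 3  # Nagios exit code for "status could not be determined"


class MissingField(Exception):
    """Raised when the switch did not report an expected register field."""


def read_field(info, key, source):
    """Return info[key], or raise MissingField with a readable message."""
    try:
        return info[key]
    except KeyError:
        raise MissingField(
            f"{source}: field '{key}' not reported by this switch") from None


def psu_watts(psus, index):
    """Power draw of PSU `index`, using the key seen in the traceback."""
    return read_field(psus, f"watt_{index}", "PSU")


def temperature_c(temp_info):
    """Temperature in Celsius; the raw value is in tenths of a degree."""
    return read_field(temp_info, "temperature", "temperature sensor") / 10
```

A caller could catch `MissingField`, print its message, and exit with `UNKNOWN` instead of showing a traceback to NRPE.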

Replacing mlxreg_ext with mlxlink for cable detection

mlxreg_ext is hard to use and currently broken for HDR cables.

Example of the output for EDR and HDR cables:

Copper HDR splitter:

mlxlink -d lid-726 -p 1 --show_module

Operational Info
----------------
State                           : Active
Physical state                  : LinkUp
Speed                           : IB-HDR
Width                           : 2x
FEC                             : RS-FEC - (544,514) + PLR
Loopback Mode                   : No Loopback
Auto Negotiation                : ON

Supported Info
--------------
Enabled Link Speed              : 0x00000075 (HDR,EDR,FDR,QDR,SDR)
Supported Cable Speed           : 0x0000007f (HDR,EDR,FDR,FDR10,QDR,DDR,SDR)

Troubleshooting Info
--------------------
Status Opcode                   : 0
Group Opcode                    : N/A
Recommendation                  : No issue was observed.

Tool Information
----------------
Firmware Version                : 27.2012.3040
amBER Version                   : 1.64
MFT Version                     : mft 4.18.0-106

Module Info
-----------
Identifier                      : QSFP28
Compliance                      : N/A
Cable Technology                : Copper cable unequalized
Cable Type                      : Passive copper cable
OUI                             : Mellanox
Vendor Name                     : Mellanox
Vendor Part Number              : MCP7H50-H002R26
Vendor Serial Number            : changeme
Rev                             : A3
Wavelength [nm]                 : N/A
Transfer Distance [m]           : 2
Attenuation (5g,7g,12g) [dB]    : 5,6,9
FW Version                      : N/A
Digital Diagnostic Monitoring   : No
Power Class                     : N/A
CDR RX                          : N/A
CDR TX                          : N/A
LOS Alarm                       : N/A
Temperature [C]                 : N/A
Voltage [mV]                    : N/A
Bias Current [mA]               : N/A
Rx Power Current [dBm]          : N/A
Tx Power Current [dBm]          : N/A
IB Cable Width                  : 1x,2x
Memory Map Revision             : 8
Linear Direct Drive             : 0
Cable Breakout                  : Channels implemented [1,2,3,4]/2 far-ends with 2 channels implemented in each (i.e. 2x2 break out)
SMF Length                      : N/A
MAX Power                       : 0
Cable Rx AMP                    : N/A
Cable Rx Emphasis               : N/A
Cable Rx Post Emphasis          : N/A
Cable Tx Equalization           : N/A
Wavelength Tolerance            : N/A
Module State                    : N/A
DataPath state [per lane]       : N/A,N/A
Rx Output Valid                 : 0,0
Rx Input Valid                  : 0,0
Nominal bit rate                : 26.500Gb/s
Rx Power Type                   : OMA
Manufacturing Date              : 17_00_21
Active Set Host Compliance Code : N/A
Active Set Media Compliance Code: N/A
Error Code Response             : N/A
Module FW Fault                 : N/A
DataPath FW Fault               : N/A
Tx Fault [per lane]             : N/A
Tx LOS [per lane]               : N/A
Tx CDR LOL [per lane]           : N/A
Rx LOS [per lane]               : N/A
Rx CDR LOL [per lane]           : N/A
Tx Adaptive EQ Fault [per lane] : N/A

Copper EDR cable:

mlxlink -d lid-726 -p 7 --show_module

Operational Info
----------------
State                           : Active
Physical state                  : LinkUp
Speed                           : IB-EDR
Width                           : 4x
FEC                             : Standard LL RS-FEC - RS(271,257)
Loopback Mode                   : No Loopback
Auto Negotiation                : ON

Supported Info
--------------
Enabled Link Speed              : 0x00000035 (EDR,FDR,QDR,SDR)
Supported Cable Speed           : 0x0000003f (EDR,FDR,FDR10,QDR,DDR,SDR)

Troubleshooting Info
--------------------
Status Opcode                   : 0
Group Opcode                    : N/A
Recommendation                  : No issue was observed.

Tool Information
----------------
Firmware Version                : 27.2012.3040
amBER Version                   : 1.64
MFT Version                     : mft 4.18.0-106

Module Info
-----------
Identifier                      : QSFP+
Compliance                      : N/A
Cable Technology                : Copper cable unequalized
Cable Type                      : Passive copper cable
OUI                             : Mellanox
Vendor Name                     : Mellanox
Vendor Part Number              : MCP1600-E002E30
Vendor Serial Number            : changeme
Rev                             : A2
Wavelength [nm]                 : N/A
Transfer Distance [m]           : 2
Attenuation (5g,7g,12g) [dB]    : 6,8,12
FW Version                      : N/A
Digital Diagnostic Monitoring   : No
Power Class                     : N/A
CDR RX                          : N/A
CDR TX                          : N/A
LOS Alarm                       : N/A
Temperature [C]                 : N/A
Voltage [mV]                    : N/A
Bias Current [mA]               : N/A
Rx Power Current [dBm]          : N/A
Tx Power Current [dBm]          : N/A
IB Cable Width                  : 1x,2x,4x
Memory Map Revision             : 6
Linear Direct Drive             : 0
Cable Breakout                  : Channels implemented [1,2,3,4]/Cable with single far-end with 4 channels implemented, or separable module with a 4-channel connector
SMF Length                      : N/A
MAX Power                       : 0
Cable Rx AMP                    : N/A
Cable Rx Emphasis               : N/A
Cable Rx Post Emphasis          : N/A
Cable Tx Equalization           : N/A
Wavelength Tolerance            : N/A
Module State                    : N/A
DataPath state [per lane]       : N/A,N/A,N/A,N/A
Rx Output Valid                 : 0,0,0,0
Rx Input Valid                  : 0,0,0,0
Nominal bit rate                : 25.750Gb/s
Rx Power Type                   : OMA
Manufacturing Date              : 30_12_21
Active Set Host Compliance Code : N/A
Active Set Media Compliance Code: N/A
Error Code Response             : N/A
Module FW Fault                 : N/A
DataPath FW Fault               : N/A
Tx Fault [per lane]             : N/A
Tx LOS [per lane]               : N/A
Tx CDR LOL [per lane]           : N/A
Rx LOS [per lane]               : N/A
Rx CDR LOL [per lane]           : N/A
Tx Adaptive EQ Fault [per lane] : N/A

Optical HDR cable:

mlxlink -d lid-726 -p 17 --show_module

Operational Info
----------------
State                           : Active
Physical state                  : LinkUp
Speed                           : IB-HDR
Width                           : 4x
FEC                             : LL-FEC - (271,257) + PLR
Loopback Mode                   : No Loopback
Auto Negotiation                : ON

Supported Info
--------------
Enabled Link Speed              : 0x00000061 (HDR,EDR,SDR)
Supported Cable Speed           : 0x00000061 (HDR,EDR,SDR)

Troubleshooting Info
--------------------
Status Opcode                   : 0
Group Opcode                    : N/A
Recommendation                  : No issue was observed.

Tool Information
----------------
Firmware Version                : 27.2012.3040
amBER Version                   : 1.64
MFT Version                     : mft 4.18.0-106

Module Info
-----------
Identifier                      : QSFP28
Compliance                      : N/A
Cable Technology                : 850 nm VCSEL
Cable Type                      : Active cable (active copper / optics)
OUI                             : Mellanox
Vendor Name                     : Mellanox
Vendor Part Number              : MFS1S00-H020E
Vendor Serial Number            : changeme
Rev                             : A7
Wavelength [nm]                 : 850
Transfer Distance [m]           : 20
Attenuation (5g,7g,12g) [dB]    : N/A
FW Version                      : 37.50.322
Digital Diagnostic Monitoring   : Yes
Power Class                     : 5.0 W max
CDR RX                          : ON,ON,ON,ON
CDR TX                          : ON,ON,ON,ON
LOS Alarm                       : N/A
Temperature [C]                 : 48 [-10..80]
Voltage [mV]                    : 3236 [3100..3500]
Bias Current [mA]               : 7.386,7.392,7.390,7.392 [5.492..8.5]
Rx Power Current [dBm]          : 0,0,0,0 [-14..6]
Tx Power Current [dBm]          : 0,0,0,0 [-12..6]
IB Cable Width                  : 1x,2x,4x
Memory Map Revision             : 8
Linear Direct Drive             : 0
Cable Breakout                  : Channels implemented [1,2,3,4]/Cable with single far-end with 4 channels implemented, or separable module with a 4-channel connector
SMF Length                      : N/A
MAX Power                       : 0
Cable Rx AMP                    : 0
Cable Rx Emphasis               : 0
Cable Rx Post Emphasis          : 0
Cable Tx Equalization           : 8
Wavelength Tolerance            : 3000nm
Module State                    : N/A
DataPath state [per lane]       : N/A,N/A,N/A,N/A
Rx Output Valid                 : 0,0,0,0
Rx Input Valid                  : 0,0,0,0
Nominal bit rate                : 26.500Gb/s
Rx Power Type                   : Average power
Manufacturing Date              : 14_03_21
Active Set Host Compliance Code : N/A
Active Set Media Compliance Code: N/A
Error Code Response             : N/A
Module FW Fault                 : N/A
DataPath FW Fault               : N/A
Tx Fault [per lane]             : 0,0,0,0
Tx LOS [per lane]               : 0,0,0,0
Tx CDR LOL [per lane]           : 0,0,0,0
Rx LOS [per lane]               : 0,0,0,0
Rx CDR LOL [per lane]           : 0,0,0,0
Tx Adaptive EQ Fault [per lane] : 0,0,0,0
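The three outputs above share a `Key : Value` layout, so they could be parsed with a few lines of Python. The following is a minimal sketch under that assumption, not the project's implementation; `parse_mlxlink` and `module_temperature` are hypothetical names, and the temperature pattern is based only on the `48 [-10..80]` sample shown for the optical cable.

```python
import re


def parse_mlxlink(text):
    """Parse 'Key : Value' lines into a dict.

    Section titles, dashed rules, and the command line contain no colon
    and are skipped. Splitting on the first colon also handles keys that
    have no space before it.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Matches values such as "48 [-10..80]": reading, then low and high limits.
_TEMP = re.compile(r"^(-?\d+(?:\.\d+)?)\s*\[(-?\d+)\.\.(-?\d+)\]$")


def module_temperature(fields):
    """Return (value, low, high), or None when the module reports N/A."""
    match = _TEMP.match(fields.get("Temperature [C]", "N/A"))
    if match is None:
        return None
    value, low, high = match.groups()
    return float(value), int(low), int(high)
```

Passive copper cables report `N/A` for the temperature, so `module_temperature` returns None for them and a check can skip those ports.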
