I downgraded to ledmon-0.79-1.el6.x86_64, and it works perfectly.
I also noticed something quite serious: it was preventing the array from rebuilding onto the hot spare.
With ledmon 0.94 I had to force the rebuild, whereas with ledmon-0.79 the rebuild starts as expected, with the same array configuration.
from ledmon.
Hi,
Is it a SAS JBOD connected to the "Intel C600 Series Chipset SAS RAID Controller"?
Please send us the output of "lsscsi" and "lsscsi -H".
Could you also check how ledctl behaves?
Please stop ledmon and set any state (failure, locate, normal) on the /dev/sde drive via ledctl.
You will see which drive is blinking.
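The ledctl check described above could be scripted roughly as follows. This is a minimal sketch, assuming root privileges and a systemd unit named "ledmon"; the function name check_slot_leds and the PAUSE variable are hypothetical. It finishes with the "normal" pattern so the drive is left in its default state.

```shell
#!/bin/sh
# Cycle one drive through a few ledctl patterns while ledmon is stopped,
# so you can see which physical slot reacts.
# Assumes: root privileges and a systemd unit called "ledmon".

check_slot_leds() {
    dev=${1:?usage: check_slot_leds /dev/sdX}
    pause=${PAUSE:-10}               # seconds to watch each pattern (hypothetical knob)

    systemctl stop ledmon            # otherwise ledmon may overwrite the LED states

    for pattern in locate failure normal; do
        echo "Setting $pattern on $dev"
        ledctl "$pattern=$dev" || return 1
        sleep "$pause"
    done
}

# Example:
#   check_slot_leds /dev/sde
#   systemctl start ledmon           # restore monitoring afterwards
```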
Please also send us full debug logs for both ledmon versions:
for v0.94, add --all to the command; all logs will be written to /var/log/ledmon.log.
I don't know how to enable debug logs for v0.79; you will need to check that yourself.
Thanks,
Mariusz
Hi,
We use software RAID with mdadm.
[0:0:0:0] disk ATA HGST HUS726060AL T907 /dev/sdc
[0:0:1:0] disk ATA HGST HUS726060AL T907 /dev/sdd
[0:0:2:0] disk ATA HGST HUS726060AL T907 /dev/sde
[0:0:3:0] disk ATA HGST HUS726060AL T907 /dev/sdf
[0:0:4:0] disk ATA HGST HUS726060AL T907 /dev/sdg
[0:0:5:0] disk ATA HGST HUS726060AL T907 /dev/sdh
[0:0:6:0] disk ATA HGST HUS726060AL T907 /dev/sdi
[0:0:7:0] disk ATA HGST HUS726060AL T907 /dev/sdj
[0:0:8:0] disk ATA HGST HUS726060AL T907 /dev/sdk
[0:0:9:0] disk ATA HGST HUS726060AL T907 /dev/sdl
[0:0:10:0] disk ATA HGST HUS726060AL T907 /dev/sdm
[0:0:11:0] disk ATA HGST HUS726060AL T907 /dev/sdn
[0:0:12:0] disk ATA HGST HUS726060AL T907 /dev/sdo
[0:0:13:0] disk ATA HGST HUS726060AL T907 /dev/sdp
[0:0:14:0] disk ATA HGST HUS726060AL T907 /dev/sdq
[0:0:15:0] disk ATA HGST HUS726060AL T907 /dev/sdr
[0:0:16:0] disk ATA HGST HUS726060AL T907 /dev/sds
[0:0:17:0] disk ATA HGST HUS726060AL T907 /dev/sdt
[0:0:18:0] disk ATA HGST HUS726060AL T907 /dev/sdu
[0:0:19:0] disk ATA HGST HUS726060AL T907 /dev/sdv
[0:0:20:0] disk ATA HGST HUS726060AL T907 /dev/sdw
[0:0:21:0] disk ATA HGST HUS726060AL T907 /dev/sdx
[0:0:22:0] disk ATA HGST HUS726060AL T907 /dev/sdy
[0:0:23:0] disk ATA HGST HUS726060AL T907 /dev/sdz
[0:0:24:0] enclosu PROMISE 4U-SAS-24-12G-BP 0100 -
[1:0:0:0] disk ATA MICRON_M510DC_MT 0013 /dev/sda
[2:0:0:0] disk ATA MICRON_M510DC_MT 0013 /dev/sdb
Number Major Minor RaidDevice State
5 8 32 0 active sync /dev/sdc
1 8 64 1 active sync /dev/sde
0 8 48 2 active sync /dev/sdd
3 8 96 3 active sync /dev/sdg
4 8 112 4 active sync /dev/sdh
2 8 80 - spare /dev/sdf
Below are the logs and images.
**Using version 0.79**
LEDs work as expected, with /dev/sdf as the hot spare.
0x5f2a5dc7:0x00028dda ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:11.4 - enclosure management not supported.
0x5f2a5dc7:0x000291d5 ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:1f.2 - enclosure management not supported.
0x5f2a5dc7:0x0003a65b DEBUG: (raid_device_init) path: md0, level=6, state=6, degraded=0, disks=5, type=1
0x5f2a5dc7:0x0003a71a DEBUG: (raid_device_init) path: md127, level=2, state=6, degraded=0, disks=2, type=1
0x5f2a5dc7:0x0003ac3e DEBUG: (_set_block_state): device: sdc, state: Off
0x5f2a5dc7:0x0003ac5c DEBUG: (_set_block_state): device: sdd, state: Off
0x5f2a5dc7:0x0003ac6d DEBUG: (_set_block_state): device: sde, state: Off
0x5f2a5dc7:0x0003ac7d DEBUG: (_set_block_state): device: sdf, state: Hotspare
0x5f2a5dc7:0x0003ac8d DEBUG: (_set_block_state): device: sdg, state: Off
0x5f2a5dc7:0x0003ac9c DEBUG: (_set_block_state): device: sdh, state: Off
Simulating a failure on /dev/sdd: the LED is correct.
ledctl failure=/dev/sdd
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:11.4 - enclosure management not supported.
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:1f.2 - enclosure management not supported.
ledctl: (raid_device_init) path: md0, level=6, state=6, degraded=0, disks=5, type=1
ledctl: (raid_device_init) path: md127, level=2, state=7, degraded=0, disks=2, type=1
ledctl: (_set_block_state): device: sdc, state: NORMAL
ledctl: (_set_block_state): device: sdd, state: NORMAL
ledctl: (_set_block_state): device: sde, state: NORMAL
ledctl: (_set_block_state): device: sdf, state: HOTSPARE
ledctl: (_set_block_state): device: sdg, state: NORMAL
ledctl: (_set_block_state): device: sdh, state: NORMAL
**Using version 0.94** (same RAID configuration)
We can see that the LED layout is incorrect: there is an offset. /dev/sdg looks like a hot spare, and the same goes for /dev/sdc.
Aug 05 09:30:08 DEBUG: (raid_device_init) path: md0, level=6, state=6, degraded=0, disks=5, type=1
Aug 05 09:30:08 DEBUG: (raid_device_init) path: md127, level=2, state=6, degraded=0, disks=2, type=1
Aug 05 09:30:08 DEBUG: (_set_block_state): device: sdc, state: Off
Aug 05 09:30:08 DEBUG: (_set_block_state): device: sdd, state: Off
Aug 05 09:30:08 DEBUG: (_set_block_state): device: sde, state: Off
Aug 05 09:30:08 DEBUG: (_set_block_state): device: sdf, state: Hotspare
Aug 05 09:30:08 DEBUG: (_set_block_state): device: sdg, state: Off
Aug 05 09:30:08 DEBUG: (_set_block_state): device: sdh, state: Off
Aug 05 09:30:08 DEBUG: (_set_block_state): device: sda, state: Off
Aug 05 09:30:08 DEBUG: (_set_block_state): device: sdb, state: Off
Simulating a failure on /dev/sdd: the LED is incorrect. There is an offset, and /dev/sde lights up instead of /dev/sdd.
# ledctl failure=/dev/sdd
ledctl: (raid_device_init) path: md0, level=6, state=6, degraded=0, disks=5, type=1
ledctl: (raid_device_init) path: md127, level=2, state=6, degraded=0, disks=2, type=1
ledctl: (_set_block_state): device: sdc, state: NORMAL
ledctl: (_set_block_state): device: sdd, state: NORMAL
ledctl: (_set_block_state): device: sde, state: NORMAL
ledctl: (_set_block_state): device: sdf, state: HOTSPARE
ledctl: (_set_block_state): device: sdg, state: NORMAL
ledctl: (_set_block_state): device: sdh, state: NORMAL
ledctl: (_set_block_state): device: sda, state: NORMAL
ledctl: (_set_block_state): device: sdb, state: NORMAL
Hi,
Thanks for the detailed response.
Could you use git bisect to find the bad commit?
https://git-scm.com/docs/git-bisect
Mariusz
You're welcome. I've been using the RPM package, and I'm not familiar with git bisect.
I didn't mention it, but in production we are using version 0.90, which also has this bug.
All necessary dependencies are listed in README.md; please try to build it manually.
git clone https://github.com/intel/ledmon.git
cd ledmon
(you will be using ledmon from before the migration to autotools)
git bisect start
git bisect good v0.79
git bisect bad v0.90
1) git bisect will then automatically check out a commit somewhere between those two versions. Then run:
make clean
make
./src/ledctl failure=/dev/sdd
(check the result, then clear the LED)
./src/ledctl normal=/dev/sdd
If it works, mark it as good:
git bisect good
Otherwise, mark it as bad:
git bisect bad
git bisect will then check out another commit; go back to step 1).
At the end, git bisect will report the first commit that introduced the regression.
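The manual loop above can also be driven by "git bisect run", which runs a command at each step and reads its exit code (0 = good, 125 = skip, any other code from 1 to 127 = bad). Because the verdict here depends on looking at the LEDs, the sketch below still asks you for a y/n answer. It is a sketch under assumptions: the script name bisect-led.sh and the LEDCTL and ANSWER_SRC variables are hypothetical, and the script should be kept outside the ledmon checkout so that checkouts during bisection do not touch it.

```shell
#!/bin/sh
# bisect-led.sh (hypothetical name): one "git bisect run" step for the LED
# offset regression. Exit codes follow git bisect run conventions:
#   0 = good, 1 = bad, 125 = skip (this revision cannot be tested).

LEDCTL=${LEDCTL:-./src/ledctl}       # binary path used in the manual steps
ANSWER_SRC=${ANSWER_SRC:-/dev/tty}   # read the verdict from the terminal

bisect_step() {
    dev=${1:-/dev/sdd}

    make clean >/dev/null 2>&1
    if ! make >/dev/null 2>&1; then
        return 125                   # build is broken here: let bisect skip it
    fi

    "$LEDCTL" failure="$dev"
    printf 'Is the failure LED on the slot of %s? [y/n] ' "$dev" >&2
    answer=
    read -r answer < "$ANSWER_SRC"
    "$LEDCTL" normal="$dev"          # clear the LED before the next revision

    case $answer in
        y|Y|yes) return 0 ;;
        *)       return 1 ;;         # anything else (including no answer) = bad
    esac
}

# Only run a step when executed as the script, not when sourced.
if [ "${0##*/}" = "bisect-led.sh" ]; then
    bisect_step "$@"
    exit $?
fi

# Usage, from inside the ledmon checkout (script kept one level up):
#   git bisect start
#   git bisect good v0.79
#   git bisect bad v0.90
#   git bisect run sh ../bisect-led.sh /dev/sdd
```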
I hope this helps. We look forward to your feedback.
Mariusz
Thanks, it stops working here
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[bd19f72] ses: load page10 only when necessary
Great, thanks for that.
We will start working on it soon.
Mariusz
Hi Mariusz,
Any progress on the issue? I may force version 0.79 into production for the time being.
Cheers.
Hi,
I'm working on it. Can you send me the contents of /sys/class/enclosure from that system? Something like this will be good enough:
tar cfz enclosure.tar.gz /sys/class/enclosure/*/*
You can ignore the errors; not all files are readable.
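To look at the slot numbering directly instead of through a tarball, something like the following prints each enclosure component with its slot number and attached block device. It is a sketch under two assumptions: that the kernel exposes a "slot" attribute for each enclosure component (older kernels may not), and that the component's "device" link leads to the SCSI device whose "block" directory names the disk. The function name list_enclosure_slots is hypothetical.

```shell
#!/bin/sh
# Print "enclosure/component <TAB> slot <TAB> block device" for every
# enclosure component that has a readable "slot" attribute.

list_enclosure_slots() {
    root=${1:-/sys/class/enclosure}
    for comp in "$root"/*/*/; do
        comp=${comp%/}                      # strip the trailing slash from the glob
        [ -r "$comp/slot" ] || continue     # skip non-components and kernels without "slot"
        slot=$(cat "$comp/slot")
        blk=-                               # "-" means no disk attached to this slot
        if [ -d "$comp/device/block" ]; then
            blk=$(ls "$comp/device/block" | head -n 1)
        fi
        printf '%s\t%s\t%s\n' "${comp#"$root"/}" "$slot" "$blk"
    done
}

# Example:
#   list_enclosure_slots | sort -t "$(printf '\t')" -k 2 -n
```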
Thanks,
Artur
Thanks, please find the logs attached.
enclosure.tar.gz
Thank you, this is very helpful. It turns out that the enclosure slots on your system are numbered starting from 1, while ledmon assumes they start from 0, which is true on my platform with a RHEL7 kernel. I'll try to figure out why. Can you try this patch and see if it helps? Thanks.
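The mismatch is a plain off-by-one: if slots are numbered from 1 but are indexed as if they started from 0, every lookup lands one slot away, which is consistent with the one-slot offset reported earlier. The sketch below only illustrates one way to normalize the numbering, by treating the lowest reported slot number as index 0. It assumes the slot numbers are contiguous, uses a hypothetical function name, and is not the actual ledmon patch.

```shell
#!/bin/sh
# Read slot numbers (one per line) and print zero-based indices, using the
# lowest reported number as the base. Works for 0-based and 1-based
# numbering alike, provided the numbers are contiguous.

to_zero_based() {
    awk 'NF {
             v[NR] = $1                          # remember input order
             if (!seen || $1 < min) { min = $1; seen = 1 }
         }
         END {
             for (i = 1; i <= NR; i++)
                 if (i in v) print v[i] - min    # shift so the lowest slot is 0
         }'
}

# Example, assuming the kernel exposes a per-component "slot" attribute:
#   cat /sys/class/enclosure/*/*/slot | to_zero_based
```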
How do you want me to apply the patch? Against which version?
Against the buggy version. The patch is just a one-line change; if it fails to apply, you can easily make the change by hand.
Also, could you send the output of these commands? If you don't have sg_ses, please install the sg3_utils package.
sg_ses --page=0 /dev/bsg/10:0:24:0
sg_ses --page=1 /dev/bsg/10:0:24:0
sg_ses --page=2 /dev/bsg/10:0:24:0
sg_ses --page=10 /dev/bsg/10:0:24:0
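The four sg_ses calls above can be run in one go, saving each page to its own file for attaching to the issue. A minimal sketch: the /dev/bsg path is the one from this thread and will differ on other systems, and the function name dump_ses_pages is hypothetical. Page 0 lists the supported diagnostic pages, page 1 is Configuration, page 2 is Enclosure Status, and page 10 (0x0a) is Additional Element Status.

```shell
#!/bin/sh
# Save SES diagnostic pages 0, 1, 2 and 10 (0x0a) of one enclosure,
# one file per page. Requires sg_ses from sg3_utils.

dump_ses_pages() {
    dev=${1:?usage: dump_ses_pages /dev/bsg/H:C:T:L [outdir]}
    outdir=${2:-.}
    for page in 0 1 2 10; do
        # Page numbers are given in decimal here, so 10 is page 0x0a.
        if ! sg_ses --page="$page" "$dev" > "$outdir/ses_page_$page.txt" 2>&1; then
            echo "sg_ses failed for page $page, see $outdir/ses_page_$page.txt" >&2
        fi
    done
}

# Example (device path from this thread, adjust for your system):
#   dump_ses_pages /dev/bsg/10:0:24:0 /tmp
```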
Thanks, it does work now with this patch.
Great, thanks for checking it. That information you provided should be enough for me to make a proper fix.
@ylemouel would you like to check the linked pull request to verify that it's working correctly on your setup?
Thanks,
Artur
Thanks, it does work; no more offset detected.
Cheers.
I noticed an error in the log during and after a rebuild.
Restarting the ledmon service helped; the error is gone.
Could you please take a look?
Sep 03 09:13:40 DEBUG: (raid_device_init) path: md0, level=3, state=6, degraded=0, disks=6, type=1
Sep 03 09:13:40 DEBUG: (raid_device_init) path: md1, level=6, state=6, degraded=0, disks=6, type=1
Sep 03 09:13:40 DEBUG: (raid_device_init) path: md127, level=2, state=7, degraded=0, disks=2, type=1
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdc, state: Hotspare
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdd, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sde, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdf, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdg, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdh, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdi, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdj, state: Hotspare
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdk, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdl, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdm, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdn, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdo, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdp, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sda, state: Off
Sep 03 09:13:40 DEBUG: (_set_block_state): device: sdb, state: Off
Sep 03 09:13:40 DEBUG: DETACHED DEV 'host10/port-10:0/expander-10:0/port-10:0:0/end_device-10:0:0/target10:0:0/10:0:0:0/block/sdc' in failed state
State : clean
Active Devices : 6
Working Devices : 7
Failed Devices : 0
Spare Devices : 1
Layout : near=2
Chunk Size : 512K
Consistency Policy : bitmap
Number Major Minor RaidDevice State
1 8 64 0 active sync set-A /dev/sde
0 8 48 1 active sync set-B /dev/sdd
2 8 80 2 active sync set-A /dev/sdf
3 8 96 3 active sync set-B /dev/sdg
4 8 112 4 active sync set-A /dev/sdh
5 8 128 5 active sync set-B /dev/sdi
6 8 32 - spare /dev/sdc
* ledmon.service - ledmon
Loaded: loaded (/etc/systemd/system/ledmon.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2020-09-02 11:16:08 CEST; 22h ago
Main PID: 16239 (ledmon)
CGroup: /system.slice/ledmon.service
`-16239 /usr/sbin/ledmon --all
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdj, state: Hotspare
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdk, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdl, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdm, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdn, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdo, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdp, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sda, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdb, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: DETACHED DEV 'host10/port-10:0/expander-10:0/port-10:0:0/end_device-10:0:0/target10:0:0/10:0:0:0/block/sdc' in failed state
# systemctl restart ledmon.service
# systemctl status ledmon.service
* ledmon.service - ledmon
Loaded: loaded (/etc/systemd/system/ledmon.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2020-09-03 09:17:48 CEST; 1min 29s ago
Process: 20979 ExecStart=/usr/sbin/ledmon --all (code=exited, status=0/SUCCESS)
Main PID: 20980 (ledmon)
CGroup: /system.slice/ledmon.service
`-20980 /usr/sbin/ledmon --all
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdi, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdj, state: Hotspare
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdk, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdl, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdm, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdn, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdo, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdp, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sda, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdb, state: Off
Hi,
About your first question: the best we can do is report the bug to the OSV and let them pick up the change from upstream.
It will be included in upcoming releases.
I don't see any errors here, only debug logs. Please remove the "--all" parameter from the service file.
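One way to drop "--all" is to edit the ExecStart line of the unit shown in the status output above, then reload systemd and restart the service. A sketch, assuming the unit path /etc/systemd/system/ledmon.service from this thread and a sed that supports -i (GNU or BSD); the function name disable_ledmon_debug is hypothetical, and a .bak copy of the unit file is kept.

```shell
#!/bin/sh
# Remove "--all" (log everything) from ledmon's ExecStart line, then reload
# systemd and restart the service. Keeps a .bak copy of the unit file.

disable_ledmon_debug() {
    unit=${1:-/etc/systemd/system/ledmon.service}
    # Handle "--all" both at the end of the line and followed by other options.
    sed -i.bak -e '/^ExecStart=/s/ --all$//' \
               -e '/^ExecStart=/s/ --all / /' "$unit" || return 1
    systemctl daemon-reload && systemctl restart ledmon.service
}

# Example (as root):
#   disable_ledmon_debug
#   systemctl status ledmon.service
```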
You're right, I left debug logging on.
Thanks!