Comments (22)

ylemouel commented on September 27, 2024

I downgraded to ledmon-0.79-1.el6.x86_64, and it works perfectly.
I also noticed something quite serious: the newer version was preventing the array from rebuilding onto a hot spare.
With ledmon 0.94 I had to force the rebuild, whereas with ledmon-0.79 the rebuild starts as expected, with the same array configuration.

from ledmon.

mtkaczyk commented on September 27, 2024

Hi,
Is this a SAS JBOD connected to an "Intel C600 Series Chipset SAS RAID Controller"?
Please send us the output of "lsscsi" and "lsscsi -H".

Could you verify how ledctl behaves?
Please stop ledmon and use ledctl to set any state (failure, locate, normal) on the /dev/sde drive.
You will see which drive is blinking.
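
A minimal Python sketch of that check, assuming ledctl is installed and on the PATH. The helper names (`ledctl_command`, `set_led`) are hypothetical; only the `state=device` form that appears in this thread is used, and by default nothing is executed.

```python
import subprocess

# The three states named in the request above.
STATES = ("failure", "locate", "normal")


def ledctl_command(state: str, device: str) -> list:
    """Build the argument list for one ledctl call (hypothetical helper)."""
    if state not in STATES:
        raise ValueError(f"unsupported state: {state}")
    return ["ledctl", f"{state}={device}"]


def set_led(state: str, device: str, dry_run: bool = True) -> list:
    """Return the command; run it only when dry_run is False."""
    command = ledctl_command(state, device)
    if not dry_run:
        # Needs root privileges and an installed ledctl binary.
        subprocess.run(command, check=True)
    return command


# Dry run: print what would be executed for each state.
for state in STATES:
    print(" ".join(set_led(state, "/dev/sde")))
```

Stop ledmon first, then call `set_led(..., dry_run=False)` one state at a time and note which drive bay blinks.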

Please also send us full debug logs for both ledmon versions.
For v0.94, add --all to the command; all logs will be written to /var/log/ledmon.log.
I don't know how to enable debug logs for v0.79, so you will need to check that yourself.

Thanks,
Mariusz

ylemouel commented on September 27, 2024

Hi,

We use software RAID with mdadm.

[0:0:0:0]    disk    ATA      HGST HUS726060AL T907  /dev/sdc
[0:0:1:0]    disk    ATA      HGST HUS726060AL T907  /dev/sdd
[0:0:2:0]    disk    ATA      HGST HUS726060AL T907  /dev/sde
[0:0:3:0]    disk    ATA      HGST HUS726060AL T907  /dev/sdf
[0:0:4:0]    disk    ATA      HGST HUS726060AL T907  /dev/sdg
[0:0:5:0]    disk    ATA      HGST HUS726060AL T907  /dev/sdh
[0:0:6:0]    disk    ATA      HGST HUS726060AL T907  /dev/sdi
[0:0:7:0]    disk    ATA      HGST HUS726060AL T907  /dev/sdj
[0:0:8:0]    disk    ATA      HGST HUS726060AL T907  /dev/sdk
[0:0:9:0]    disk    ATA      HGST HUS726060AL T907  /dev/sdl
[0:0:10:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdm
[0:0:11:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdn
[0:0:12:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdo
[0:0:13:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdp
[0:0:14:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdq
[0:0:15:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdr
[0:0:16:0]   disk    ATA      HGST HUS726060AL T907  /dev/sds
[0:0:17:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdt
[0:0:18:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdu
[0:0:19:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdv
[0:0:20:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdw
[0:0:21:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdx
[0:0:22:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdy
[0:0:23:0]   disk    ATA      HGST HUS726060AL T907  /dev/sdz
[0:0:24:0]   enclosu PROMISE  4U-SAS-24-12G-BP 0100  -
[1:0:0:0]    disk    ATA      MICRON_M510DC_MT 0013  /dev/sda
[2:0:0:0]    disk    ATA      MICRON_M510DC_MT 0013  /dev/sdb
    Number   Major   Minor   RaidDevice State
       5       8       32        0      active sync   /dev/sdc
       1       8       64        1      active sync   /dev/sde
       0       8       48        2      active sync   /dev/sdd
       3       8       96        3      active sync   /dev/sdg
       4       8      112        4      active sync   /dev/sdh

       2       8       80        -      spare   /dev/sdf

The logs and images are below.

**Using version 0.79**

The LEDs work as expected, with /dev/sdf as the hot spare.
[image: version79]

0x5f2a5dc7:0x00028dda   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:11.4 - enclosure management not supported.
0x5f2a5dc7:0x000291d5   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:1f.2 - enclosure management not supported.
0x5f2a5dc7:0x0003a65b   DEBUG: (raid_device_init) path: md0, level=6, state=6, degraded=0, disks=5, type=1
0x5f2a5dc7:0x0003a71a   DEBUG: (raid_device_init) path: md127, level=2, state=6, degraded=0, disks=2, type=1
0x5f2a5dc7:0x0003ac3e   DEBUG: (_set_block_state): device: sdc, state: Off
0x5f2a5dc7:0x0003ac5c   DEBUG: (_set_block_state): device: sdd, state: Off
0x5f2a5dc7:0x0003ac6d   DEBUG: (_set_block_state): device: sde, state: Off
0x5f2a5dc7:0x0003ac7d   DEBUG: (_set_block_state): device: sdf, state: Hotspare
0x5f2a5dc7:0x0003ac8d   DEBUG: (_set_block_state): device: sdg, state: Off
0x5f2a5dc7:0x0003ac9c   DEBUG: (_set_block_state): device: sdh, state: Off

Simulating a failure on /dev/sdd: the LED is correct.
[image: version79-failure]

ledctl failure=/dev/sdd
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:11.4 - enclosure management not supported.
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:1f.2 - enclosure management not supported.
ledctl: (raid_device_init) path: md0, level=6, state=6, degraded=0, disks=5, type=1
ledctl: (raid_device_init) path: md127, level=2, state=7, degraded=0, disks=2, type=1
ledctl: (_set_block_state): device: sdc, state: NORMAL
ledctl: (_set_block_state): device: sdd, state: NORMAL
ledctl: (_set_block_state): device: sde, state: NORMAL
ledctl: (_set_block_state): device: sdf, state: HOTSPARE
ledctl: (_set_block_state): device: sdg, state: NORMAL
ledctl: (_set_block_state): device: sdh, state: NORMAL

**Using version 0.94**, with the same RAID configuration

[image: version94]
We can see that the LED layout is incorrect. There is an offset: /dev/sdg looks like a hot spare, and the same goes for /dev/sdc.

Aug 05 09:30:08   DEBUG: (raid_device_init) path: md0, level=6, state=6, degraded=0, disks=5, type=1
Aug 05 09:30:08   DEBUG: (raid_device_init) path: md127, level=2, state=6, degraded=0, disks=2, type=1
Aug 05 09:30:08   DEBUG: (_set_block_state): device: sdc, state: Off
Aug 05 09:30:08   DEBUG: (_set_block_state): device: sdd, state: Off
Aug 05 09:30:08   DEBUG: (_set_block_state): device: sde, state: Off
Aug 05 09:30:08   DEBUG: (_set_block_state): device: sdf, state: Hotspare
Aug 05 09:30:08   DEBUG: (_set_block_state): device: sdg, state: Off
Aug 05 09:30:08   DEBUG: (_set_block_state): device: sdh, state: Off
Aug 05 09:30:08   DEBUG: (_set_block_state): device: sda, state: Off
Aug 05 09:30:08   DEBUG: (_set_block_state): device: sdb, state: Off

Simulating a failure on /dev/sdd: the LED is incorrect. There is an offset, and /dev/sde lights up instead of /dev/sdd.
[image: version94-failure]

# ledctl failure=/dev/sdd
ledctl: (raid_device_init) path: md0, level=6, state=6, degraded=0, disks=5, type=1
ledctl: (raid_device_init) path: md127, level=2, state=6, degraded=0, disks=2, type=1
ledctl: (_set_block_state): device: sdc, state: NORMAL
ledctl: (_set_block_state): device: sdd, state: NORMAL
ledctl: (_set_block_state): device: sde, state: NORMAL
ledctl: (_set_block_state): device: sdf, state: HOTSPARE
ledctl: (_set_block_state): device: sdg, state: NORMAL
ledctl: (_set_block_state): device: sdh, state: NORMAL
ledctl: (_set_block_state): device: sda, state: NORMAL
ledctl: (_set_block_state): device: sdb, state: NORMAL

mtkaczyk commented on September 27, 2024

Hi,
Thanks for the detailed response.

Could you use git bisect to find the bad commit?
https://git-scm.com/docs/git-bisect

Mariusz

ylemouel commented on September 27, 2024

You're welcome. I've been using the rpm package, and I'm not familiar with git bisect.

I didn't mention it, but in production we are using version 0.90, which also has this bug.

mtkaczyk commented on September 27, 2024

README.md lists all the necessary dependencies; please try building it manually.

git clone https://github.com/intel/ledmon.git
cd ledmon

(you will be using ledmon from before the migration to autotools)

git bisect start
git bisect good v0.79
git bisect bad  v0.90

1) Git will then automatically check out a commit somewhere between those two. Then run:

make clean
make
./src/ledctl failure=/dev/sdd

(check the result, then clear the LED)

./src/ledctl normal=/dev/sdd

If it works, mark it as good:

git bisect good

Otherwise, mark it as bad:

git bisect bad

Bisect will then jump to another commit, so you go back to step 1).
At the end, git will print the first commit that contains the regression.
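
For anyone new to bisect, the loop above is a binary search over the commit history. Here is a minimal Python sketch of the idea, not git's implementation: the commit names are made up, and `manual_check` stands in for the build-and-look-at-the-LEDs step.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is
    bad as well.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # the regression is at mid or earlier
        else:
            good = mid   # the regression is after mid
    return commits[bad]


# Hypothetical history: c0 plays the role of v0.79, c9 the role of v0.90.
history = [f"c{i}" for i in range(10)]
tested = []


def manual_check(commit: str) -> bool:
    # Stand-in for "make, run ledctl, look at the LEDs".
    tested.append(commit)
    return int(commit[1:]) >= 6  # pretend c6 introduced the bug


culprit = first_bad(history, manual_check)
print(culprit, "found after testing", tested)
```

Each answer halves the remaining range, which is why only a handful of builds are needed even for a long history.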

I hope this helps. We look forward to your feedback.

Mariusz

ylemouel commented on September 27, 2024

Thanks, it stops working here:
Bisecting: 12 revisions left to test after this (roughly 4 steps)
[bd19f72] ses: load page10 only when necessary

mtkaczyk commented on September 27, 2024

Great, thanks for that.
We will start working on it soon.

Mariusz

ylemouel commented on September 27, 2024

Hi Mariusz,

Any progress on the issue? I may force version 0.79 into production for the time being.

Cheers.

apaszkie commented on September 27, 2024

Hi,

I'm working on it. Can you send me the contents of /sys/class/enclosure from that system? Something like this will be good enough:

tar cfz enclosure.tar.gz /sys/class/enclosure/*/*

You can ignore the errors, not all files are readable.
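
If tar is inconvenient, the slot numbers alone can be collected with a short script. This is a sketch that assumes each component directory under /sys/class/enclosure/<enclosure>/ exposes a `slot` attribute file; the function name `read_slot_numbers` is hypothetical.

```python
import os


def read_slot_numbers(root: str = "/sys/class/enclosure") -> dict:
    """Return {enclosure: {component: slot_text}} for readable `slot` files."""
    slots = {}
    if not os.path.isdir(root):
        return slots
    for enclosure in sorted(os.listdir(root)):
        enc_dir = os.path.join(root, enclosure)
        if not os.path.isdir(enc_dir):
            continue
        for component in sorted(os.listdir(enc_dir)):
            slot_file = os.path.join(enc_dir, component, "slot")
            try:
                with open(slot_file) as handle:
                    text = handle.read().strip()
            except OSError:
                # Not a component directory, or the file is not readable.
                continue
            slots.setdefault(enclosure, {})[component] = text
    return slots
```

Calling `read_slot_numbers()` on the affected system and printing the result would show at a glance whether the lowest slot number is 0 or 1.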

Thanks,
Artur

ylemouel commented on September 27, 2024

Thanks, please find the logs attached.
enclosure.tar.gz

apaszkie commented on September 27, 2024

Thank you, very helpful. It turns out that the enclosure slots on your system are numbered starting from 1, while ledmon assumes they start from 0, which is the case on my platform with a RHEL7 kernel. I'll try to figure out why. Can you try this patch and see if it helps? Thanks.

patch.txt
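
To illustrate the off-by-one, here is a toy model in Python. It is not ledmon's code or the patch itself: the drive list, `slot_of` and both function names are hypothetical, and slots are assumed to follow drive order.

```python
def led_index(slot: int, first_slot: int) -> int:
    """Map an enclosure slot number to a zero-based LED element index."""
    return slot - first_slot


# Hypothetical enclosure whose slots are numbered from 1, in drive order.
drives = ["sdc", "sdd", "sde", "sdf", "sdg", "sdh"]
slot_of = {name: position + 1 for position, name in enumerate(drives)}


def lit_drive(target: str, assumed_first_slot: int) -> str:
    """Return the drive whose LED lights up when `target` is addressed."""
    return drives[led_index(slot_of[target], assumed_first_slot)]


print(lit_drive("sdd", assumed_first_slot=0))  # prints "sde": shifted by one
print(lit_drive("sdd", assumed_first_slot=1))  # prints "sdd": correct
```

With the wrong base, every LED lands one bay further along, which matches the symptom reported earlier: the failure pattern for /dev/sdd showed up on /dev/sde.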

ylemouel commented on September 27, 2024

How do you want me to apply the patch? Against which version?

apaszkie commented on September 27, 2024

The buggy version. The patch is just a one-line change; if it fails to apply, you can easily make the change by hand.

Also, could you send the output of these commands? If you don't have sg_ses, please install the sg3_utils package.

sg_ses --page=0 /dev/bsg/10:0:24:0
sg_ses --page=1 /dev/bsg/10:0:24:0
sg_ses --page=2 /dev/bsg/10:0:24:0
sg_ses --page=10 /dev/bsg/10:0:24:0

ylemouel commented on September 27, 2024

Thanks, it does work now with this patch.

sg_ses.txt

apaszkie commented on September 27, 2024

Great, thanks for checking it. The information you provided should be enough for me to make a proper fix.

apaszkie commented on September 27, 2024

@ylemouel would you like to check the linked pull request to verify that it's working correctly on your setup?

Thanks,
Artur

ylemouel commented on September 27, 2024

Thanks, it does work; no more offset detected.
Cheers.

ylemouel commented on September 27, 2024

I noticed an error in the log during and after a rebuild.
Restarting the ledmon service helped; the error is gone.
Could you please take a look?

Sep 03 09:13:40   DEBUG: (raid_device_init) path: md0, level=3, state=6, degraded=0, disks=6, type=1
Sep 03 09:13:40   DEBUG: (raid_device_init) path: md1, level=6, state=6, degraded=0, disks=6, type=1
Sep 03 09:13:40   DEBUG: (raid_device_init) path: md127, level=2, state=7, degraded=0, disks=2, type=1
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdc, state: Hotspare
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdd, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sde, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdf, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdg, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdh, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdi, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdj, state: Hotspare
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdk, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdl, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdm, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdn, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdo, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdp, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sda, state: Off
Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdb, state: Off
Sep 03 09:13:40   DEBUG: DETACHED DEV 'host10/port-10:0/expander-10:0/port-10:0:0/end_device-10:0:0/target10:0:0/10:0:0:0/block/sdc' in failed state
             State : clean
    Active Devices : 6
   Working Devices : 7
    Failed Devices : 0
     Spare Devices : 1

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

    Number   Major   Minor   RaidDevice State
       1       8       64        0      active sync set-A   /dev/sde
       0       8       48        1      active sync set-B   /dev/sdd
       2       8       80        2      active sync set-A   /dev/sdf
       3       8       96        3      active sync set-B   /dev/sdg
       4       8      112        4      active sync set-A   /dev/sdh
       5       8      128        5      active sync set-B   /dev/sdi

       6       8       32        -      spare   /dev/sdc
* ledmon.service - ledmon
   Loaded: loaded (/etc/systemd/system/ledmon.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-09-02 11:16:08 CEST; 22h ago
 Main PID: 16239 (ledmon)
   CGroup: /system.slice/ledmon.service
           `-16239 /usr/sbin/ledmon --all

Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdj, state: Hotspare
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdk, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdl, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdm, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdn, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdo, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdp, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sda, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: (_set_block_state): device: sdb, state: Off
Sep 03 09:16:50 cs-ccr-pvsstmp.cern.ch ledmon[16239]: DETACHED DEV 'host10/port-10:0/expander-10:0/port-10:0:0/end_device-10:0:0/target10:0:0/10:0:0:0/block/sdc' in failed state
# systemctl restart ledmon.service
# systemctl status ledmon.service
* ledmon.service - ledmon
   Loaded: loaded (/etc/systemd/system/ledmon.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-09-03 09:17:48 CEST; 1min 29s ago
  Process: 20979 ExecStart=/usr/sbin/ledmon --all (code=exited, status=0/SUCCESS)
 Main PID: 20980 (ledmon)
   CGroup: /system.slice/ledmon.service
           `-20980 /usr/sbin/ledmon --all

Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdi, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdj, state: Hotspare
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdk, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdl, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdm, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdn, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdo, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdp, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sda, state: Off
Sep 03 09:19:10 cs-ccr-pvsstmp.cern.ch ledmon[20980]: (_set_block_state): device: sdb, state: Off

mtkaczyk commented on September 27, 2024

Hi,
About your first question: the best we can do is report the bug to the OSV and let them pick up the change from upstream.
It will be included in upcoming releases.

I don't see any errors here, only debug logs. Please remove the "--all" parameter from the service file.
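
A quick way to check which levels a log excerpt contains is to count them. Here is a minimal Python sketch, assuming each line carries its level as an upper-case word followed by a colon, as in the excerpts in this thread; `count_levels` is a hypothetical helper.

```python
import re
from collections import Counter

# First upper-case word that is preceded by whitespace and followed by ": ".
LEVEL = re.compile(r"\s([A-Z]+):\s")


def count_levels(lines) -> Counter:
    """Count log lines per level; lines without a level are ignored."""
    counts = Counter()
    for line in lines:
        match = LEVEL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the logs quoted in this thread.
sample = [
    "Sep 03 09:13:40   DEBUG: (_set_block_state): device: sdc, state: Hotspare",
    "Sep 03 09:13:40   DEBUG: DETACHED DEV 'host10/port-10:0/expander-10:0/"
    "port-10:0:0/end_device-10:0:0/target10:0:0/10:0:0:0/block/sdc' in failed state",
    "0x5f2a5dc7:0x00028dda   ERROR: controller discovery: "
    "/sys/devices/pci0000:00/0000:00:11.4 - enclosure management not supported.",
]
print(count_levels(sample))
```

In this sample the "DETACHED DEV ... in failed state" line is counted under DEBUG, which is consistent with the reply above that the excerpt contains only debug output.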

ylemouel commented on September 27, 2024

You're right, I left the debug logs on.
Thanks!
