
Comments (19)

sanoj-stec commented on September 28, 2024

EnhanceIO is designed to work with partitions, so the issue is valid. However, I was unable to reproduce it with the following setup:

[root@Eio ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdc4[1] sdc3[0]
975860 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sdc2[1] sdc1[0]
974836 blocks super 1.2 [2/2] [UU]

[root@Eio ~]# eio_cli info
Cache Name : c2
Source Device : /dev/md1
SSD Device : /dev/sdd2
Policy : lru
Mode : Read Only
Block Size : 4096
Associativity : 256
Cache Name : c1
Source Device : /dev/md0
SSD Device : /dev/sdd1
Policy : lru
Mode : Read Only
Block Size : 4096
Associativity : 256

from enhanceio.

sanoj-stec commented on September 28, 2024

Hi mgmartin,
could you please share some more information about your setup?
Specifically:
the workload that was running
the md device type (linear/stripe/..?)
the size of the source device and SSD partitions
whether this error is consistently reproducible on your setup


mgmartin commented on September 28, 2024

I've looked into this some more, and it appears it's not partition related. It looks to be an issue between the cache device and one of the md devices I was using. In the simplest case, I can consistently reproduce the "attempt to access beyond end of device" errors when using the entire SSD drive as a cache device for this particular md device. Using a partition also causes the read errors, but it seems easier to rule partitioning out of the equation for now. Using this same SSD with the other md device (4 USB 3.0 drives) does not produce any errors.

The md device that has issues is a ProBox (Mediasonic) 8-disk JBOD tower, currently connected via eSATA. Suffice it to say, the box is rather flaky to boot. When the workstation does boot up and sees the drives correctly, I've had no issues. I pass libata.force=norst for some of the ports on the box to avoid massive reset stalls and errors when the workstation boots; this seems to let Linux come up without major controller reset issues.

The read errors happen immediately on creation of the cache device, with minimal reads/writes of the md device. No panics in my testing so far today, just the immediate read errors spitting out in the log. With no panic, I can delete the cache device; then, once I recreate it, the errors start again.

mdadm detail:

/dev/md/probox:
        Version : 1.2
  Creation Time : Sat Jan 19 15:49:44 2013
     Raid Level : raid10
     Array Size : 3906766592 (3725.78 GiB 4000.53 GB)
  Used Dev Size : 1953383296 (1862.89 GiB 2000.26 GB)
   Raid Devices : 4
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Jan 31 07:00:10 2013
          State : active 
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : offset=2
     Chunk Size : 64K

           Name : gandalf:probox  (local to host gandalf)
           UUID : a0645fb8:dbb836c4:15c1d8db:84de62a7
         Events : 25

    Number   Major   Minor   RaidDevice State
       0       8       96        0      active sync   /dev/sdg
       1       8      112        1      active sync   /dev/sdh
       2       8      128        2      active sync   /dev/sdi
       3       8      144        3      active sync   /dev/sdj

       4       8      160        -      spare   /dev/sdk

SSD detail:

/dev/sdl:

ATA device, with non-removable media
    Model Number:       M4-CT256M4SSD2                          
    Serial Number:      
    Firmware Revision:  040H    
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Used: unknown (minor revision code 0x0028) 
    Supported: 9 8 7 6 5 
    Likely used: 9
Configuration:
    Logical     max current
    cylinders   16383   16383
    heads       16  16
    sectors/track   63  63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors:  500118192
    Logical  Sector size:                   512 bytes
    Physical Sector size:                   512 bytes
    Logical Sector-0 offset:                  0 bytes
    device size with M = 1024*1024:      244198 MBytes
    device size with M = 1000*1000:      256060 MBytes (256 GB)
    cache/buffer size  = unknown
    Form Factor: 2.5 inch
    Nominal Media Rotation Rate: Solid State Device

Command to create the cache device:

eio_cli  create -m ro -d /dev/md/probox  -s /dev/disk/by-id/scsi-SM4-CT256M4SSD2_570000000000  -c proboxcache

Read errors:

[ 1094.403973] attempt to access beyond end of device
[ 1094.403986] sdl: rw=1, want=4702425608, limit=500118192
[ 1094.404000] attempt to access beyond end of device
[ 1094.404006] sdl: rw=1, want=4702425616, limit=500118192
[ 1094.404052] io_callback: io error -5 block 6358956624 action 5
[ 1094.404071] io_callback: io error -5 block 6847963904 action 5
[ 1094.404326] attempt to access beyond end of device
[ 1094.404339] sdl: rw=1, want=4702425608, limit=500118192
[ 1094.404400] io_callback: io error -5 block 6847963912 action 5
[ 1094.405260] attempt to access beyond end of device
[ 1094.405272] sdl: rw=1, want=4557625864, limit=500118192
[ 1094.405338] io_callback: io error -5 block 6847963896 action 5
[ 1094.405741] attempt to access beyond end of device
[ 1094.405750] sdl: rw=1, want=4658039304, limit=500118192
[ 1094.405771] io_callback: io error -5 block 6166292704 action 5
[ 1094.408929] attempt to access beyond end of device
[ 1094.408945] sdl: rw=1, want=4638382600, limit=500118192
[ 1094.408966] io_callback: io error -5 block 7340447864 action 5
[ 1094.409413] attempt to access beyond end of device
[ 1094.409426] sdl: rw=1, want=4309080584, limit=500118192
[ 1094.409444] io_callback: io error -5 block 5173308472 action 5
[ 1094.409803] attempt to access beyond end of device
[ 1094.409816] sdl: rw=1, want=4557625864, limit=500118192
[ 1094.409872] io_callback: io error -5 block 5879040448 action 5
[ 1094.410225] attempt to access beyond end of device
[ 1094.410237] sdl: rw=1, want=4473678344, limit=500118192

Stats for drives in the array:

ATA device, with non-removable media
    Model Number:       WDC WD2002FAEX-007BA0                   
    Serial Number:      
    Firmware Revision:  05.01D05
    Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6
Standards:
    Supported: 8 7 6 5 
    Likely used: 8
Configuration:
    Logical     max current
    cylinders   16383   16383
    heads       16  16
    sectors/track   63  63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors: 3907029168
    Logical/Physical Sector size:           512 bytes
    device size with M = 1024*1024:     1907729 MBytes
    device size with M = 1000*1000:     2000398 MBytes (2000 GB)
    cache/buffer size  = unknown
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, with device specific minimum
    R/W multiple sector transfer: Max = 16  Current = 0
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4 
         Cycle time: no flow control=120ns  IORDY flow control=120ns


onlyjob commented on September 28, 2024

I experienced this issue as well:

sudo eio_cli create -d /dev/md6 -s /dev/md0 -c R6_CACHE -p lru -m wt

The following errors appeared in /var/log/messages (here are only some of them):

attempt to access beyond end of device
md0: rw=1, want=4314536648, limit=39091232
attempt to access beyond end of device
md0: rw=1, want=4318304968, limit=39091232
attempt to access beyond end of device
md0: rw=1, want=4318304976, limit=39091232
attempt to access beyond end of device
md0: rw=1, want=4318304968, limit=39091232
attempt to access beyond end of device
md0: rw=1, want=4314690248, limit=39091232

Within minutes the system hung completely, without leaving anything else in the logs.

blockdev --getsz /dev/md0 returns 39091232
blockdev --getsz /dev/md6 returns 19533829120

Linux kernel: Debian 3.8.2-1~experimental.1 x86_64 GNU/Linux

mdadm --detail /dev/md0 (cropped):

Version : 1.0
Raid Level : linear
Array Size : 19545616 (18.64 GiB 20.01 GB)

This issue is not specific to mdadm, as I reproduced it with a raw physical device as the cache. A miscalculation?

Please advise.


sanoj-stec commented on September 28, 2024

Could you describe the workload that triggered this issue?
It would also be helpful if you change the printk logging level to debug:

echo 8 > /proc/sys/kernel/printk


onlyjob commented on September 28, 2024

In my case the workload is not too sophisticated -- I just copy a bunch of files from the cached partition using rsync while walking some directories with a file manager.

As for echo 8 > /proc/sys/kernel/printk, I'll keep it in mind for the next time I have the opportunity to crash my server, which may not happen until you fix this bug. :)

I suspect a miscalculation because the same configuration worked when I tested it on another, less important machine with a smaller disk. It looks like the problem is triggered by the size of the device that I'm trying to attach the cache to...


sanoj-stec commented on September 28, 2024

I agree. I will try reproducing with a source and cache device of the same size and update here.


sanoj-stec commented on September 28, 2024

The above reference was accidental (please ignore it)


mgmartin commented on September 28, 2024

I can reproduce this consistently now with a much simpler config: one drive and one SSD device -- no md or other block layers. The SSD device I'm using now is a 32 GB USB flash drive. I tried two different hard drives, each attached to a USB/SATA external adapter.

I suspect 4k sectors or size may have something to do with it.

1. A WD 1 TB drive with 512-byte sectors. This drive works fine; no issues.
2. A WD 3 TB drive with 4k sectors. This drive causes the panic immediately on a mkfs.xfs, with no load on the box.

[  308.437984] sdd: rw=1, want=4317270184, limit=62607360
[  308.437997] io_callback: io error -5 block 5860530832 action 2
[  308.437997] ------------[ cut here ]------------
[  308.437999] kernel BUG at drivers/block/enhanceio/eio_main.c:475!
[  308.438004] invalid opcode: 0000 [#1] SMP
[  308.438095] CPU 2
[  308.438095] Pid: 57, comm: kworker/u:3 Tainted: G        W    3.8.2+ #1 Sony Corporation VPCZ112GX/VAIO
[  308.438102] RIP: 0010:[<ffffffffa05c95d4>]  [<ffffffffa05c95d4>] eio_post_io_callback+0x3bc/0x58f [enhanceio]
[  308.438103] RSP: 0018:ffff880150643dd8  EFLAGS: 00010206
[  308.438104] RAX: 000000015d509a90 RBX: ffff88015148d000 RCX: 00000000b797b797
[  308.438105] RDX: 000000005d509a90 RSI: 00000000002a1300 RDI: ffff88015148d000
[  308.438105] RBP: ffff8801514427c0 R08: 000000000000000a R09: 00000000fffffffb
[  308.438106] R10: 0000000000000000 R11: 0000000000002600 R12: ffff88015262ea00
[  308.438107] R13: 00000000fffffffb R14: 00000000002a1300 R15: 000000000000002a
[  308.438108] FS:  0000000000000000(0000) GS:ffff880157c80000(0000) knlGS:0000000000000000
[  308.438109] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  308.438110] CR2: 00007ffecda1b000 CR3: 000000000160c000 CR4: 00000000000007e0
[  308.438111] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  308.438111] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  308.438112] Process kworker/u:3 (pid: 57, threadinfo ffff880150642000, task ffff88015063aab0)
[  308.438113] Stack:
[  308.438115]  0000000000000000 ffff88015262e9f0 0000000000002a13 00000000000126c0
[  308.438117]  ffff88015063aab0 ffff880152ad2240 ffffffff81889e00 ffff88015106d200
[  308.438118]  ffff88015262ea00 0000000000000000 0000000000000000 ffffffff81049e2c
[  308.438119] Call Trace:
[  308.438123]  [<ffffffff81049e2c>] ? process_one_work+0x15d/0x252
[  308.438127]  [<ffffffff810490e6>] ? cwq_activate_delayed_work+0x1e/0x28
[  308.438129]  [<ffffffff8104a1e7>] ? worker_thread+0x118/0x1b2
[  308.438130]  [<ffffffff8104a0cf>] ? rescuer_thread+0x188/0x188
[  308.438133]  [<ffffffff8104de2f>] ? kthread+0x81/0x89
[  308.438135]  [<ffffffff8104ddae>] ? __kthread_parkme+0x5b/0x5b
[  308.438140]  [<ffffffff8138243c>] ? ret_from_fork+0x7c/0xb0
[  308.438141]  [<ffffffff8104ddae>] ? __kthread_parkme+0x5b/0x5b
[  308.438153] Code: 00 00 4c 89 f6 48 89 df e8 2e e0 ff ff 4c 89 f6 48 89 df 41 88 c7 e8 65 e5 ff ff 8b 93 04 01 00 00 f7 da 23 55 18 48 39 d0 74 02 <0f> 0b 41 f6 c7 60 75 02 0f 0b 45 85 ed 75 1e 41 80 ff 62 75 2d
[  308.438156] RIP  [<ffffffffa05c95d4>] eio_post_io_callback+0x3bc/0x58f [enhanceio]
[  308.438157]  RSP <ffff880150643dd8>
[  308.438158] ---[ end trace 2dc9f07886b80dd4 ]---


onlyjob commented on September 28, 2024

4k sectors may be a different issue. In my case both devices have 512-byte sectors, yet "attempt to access beyond end of device" still occurs. If you create a linear mdadm device from your 3 TB disk, does it still use 4k sectors? What does blockdev --getss return for your devices? Thanks.


mgmartin commented on September 28, 2024

I think you're right--sector size is not the issue. blockdev --getss returns 512 .

For my current single 3 TB disk setup, the issue seems to be related to the size of the partition on the device. If I create a +2000G partition, all is well. If I bump it up over 2 TB to +2050G, the error occurs.

I can swap out the 32 GB cache device and stick in a 1 GB cache device; all is fine until the partition on the hard drive goes over 2 TB. Something goes wrong in the mapping of the hard drive to the cache device once the -d device is over 2 TB. SSD size doesn't seem to matter, but something causes the code to address values > 2^32 on the SSD when the hard drive partition is over 2 TB.


mgmartin commented on September 28, 2024

I found a way to replicate this issue with sparse files and loopback devices. This makes it possible to reproduce the issue without a physical device > 2 TB in size, and hopefully on other systems, to help track down the cause. I don't think the sparse files should matter, and the "attempt to access beyond end of device" issue is triggered.

Create a sparse disk file and a sparse SSD file, attach the loopback devices, then create the eio device. If the seek size is 2000G or less, there are no issues; increase it to 2100G and I get the kernel panic. Ignore the udev file-creation errors when you create the device. I use "mkfs.ext4" or "mkfs.xfs -f" to trigger the panic. The eio mode doesn't make a difference.

cd /
dd if=/dev/zero of=disk bs=1 count=0 seek=2100G
dd if=/dev/zero of=ssd bs=1 count=0 seek=10G
losetup  /dev/loop0 /disk
losetup  /dev/loop1 /ssd
eio_cli  create -d /dev/loop0  -s /dev/loop1 -c test -m ro
mkfs.ext4 /dev/loop0


sanoj-stec commented on September 28, 2024

@mgmartin Great script. I was able to reproduce the issue with it:

[ 952.747153] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 953.126775] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 953.505966] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 953.885530] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 954.265955] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 954.645157] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 955.025498] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 955.404669] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 955.784615] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 956.163309] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 956.543932] eio_map: I/O with Discard flag received. Discard flag is not supported.
[ 958.822234] attempt to access beyond end of device
[ 959.060168] loop1: rw=1, want=4295050096, limit=20971520
[ 959.322939] attempt to access beyond end of device
[ 959.322954] io_callback: io error -5 block 4294967296 action 2[ 959.322966] ------------[ cut here ]------------
[ 959.322967] kernel BUG at drivers/block/enhanceio/eio_main.c:475!
[ 959.322968] invalid opcode: 0000 [#1] SMP
[ 959.322969] Modules linked in: enhanceio_lru enhanceio_fifo enhanceio lockd sunrpc bnep bluetooth rfkill ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack coretemp kvm_intel e1000e iTCO_wdt kvm joydev iTCO_vendor_support lpc_ich crc32c_intel ghash_clmulni_intel mfd_core hpilo hpwdt microcode serio_raw pcspkr uinput
[ 959.322990] CPU 1 [ 959.322992] Pid: 87, comm: kworker/u:6 Not tainted 3.7.9 #1 HP ProLiant ML110 G7
[ 959.322993] RIP: 0010:[] [] eio_post_io_callback+0x84d/0x910 [enhanceio]
[ 959.322999] RSP: 0018:ffff8803f585fd78 EFLAGS: 00010287
[ 959.323000] RAX: 0000000100000000 RBX: ffff8803c21737d0 RCX: 000000000000002a
[ 959.323001] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000246
[ 959.323002] RBP: ffff8803f585fdc8 R08: ffffffff81e371e0 R09: ffffffff81e60b5e
[ 959.323003] R10: 0000000000000000 R11: 0000000000040000 R12: ffff8803f7239000
[ 959.323004] R13: ffff8803ebdf37e0 R14: 00000000fffffffb R15: 0000000000000081
[ 959.323005] FS: 0000000000000000(0000) GS:ffff8803fac20000(0000) knlGS:0000000000000000
[ 959.323006] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 959.323007] CR2: 00007f615d91a000 CR3: 0000000001c0b000 CR4: 00000000000407e0
[ 959.323008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 959.323010] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 959.323011] Process kworker/u:6 (pid: 87, threadinfo ffff8803f585e000, task ffff8803f5f1dbc0)
[ 959.323012] Stack:
[ 959.323013] ffff880300000000 0000000000000000 ffff8803fac33c80 ffff8803c21737c0
[ 959.323017] 0000000000000000 ffff8803f5f87680 ffff8803c21737d0 ffff8803f55af000
[ 959.323027] 0000000000000000 ffffffff81e7a910 ffff8803f585fe38 ffffffff810788c7
[ 959.323030] Call Trace:
[ 959.323031] [] process_one_work+0x147/0x480
[ 959.323036] [] ? eio_disk_io+0x220/0x220 [enhanceio]
[ 959.323039] [] worker_thread+0x15e/0x450
[ 959.323042] [] ? busy_worker_rebind_fn+0x100/0x100
[ 959.323044] [] kthread+0xc0/0xd0
[ 959.323046] [] ? ftrace_raw_event_xen_mmu_flush_tlb_others+0xb0/0xe0
[ 959.323050] [] ? kthread_create_on_node+0x120/0x120
[ 959.323052] [] ret_from_fork+0x7c/0xb0
[ 959.323054] [] ? kthread_create_on_node+0x120/0x120
[ 959.323057] Code: 54 02 00 00 01 e9 bc f8 ff ff 0f 0b 41 83 bc 24 18 01 00 00 03 0f 84 84 00 00 00 41 83 84 24 54 02 00 00 01 e9 13 fa ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 80 fa 42 0f 84 40 f9 ff ff 41 f6 84 24 34 01
[ 959.323089] RIP [] eio_post_io_callback+0x84d/0x910 [enhanceio]
[ 959.323091] RSP
[ 959.323094] ---[ end trace ebcf66a09a2e82af ]---
[ 959.323130] BUG: unable to handle kernel paging request at ffffffffffffffd8
[ 959.323131] IP: [] kthread_data+0x10/0x20
[ 959.323133] PGD 1c0d067 PUD 1c0e067 PMD 0
[ 959.323134] Oops: 0000 [#2] SMP
[ 959.323135] Modules linked in: enhanceio_lru enhanceio_fifo enhanceio lockd sunrpc bnep bluetooth rfkill ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack coretemp kvm_intel e1000e iTCO_wdt kvm joydev iTCO_vendor_support lpc_ich crc32c_intel ghash_clmulni_intel mfd_core hpilo hpwdt microcode serio_raw pcspkr uinput
[ 959.323144] CPU 1 [ 959.323145] Pid: 87, comm: kworker/u:6 Tainted: G D 3.7.9 #1 HP ProLiant ML110 G7
[ 959.323146] RIP: 0010:[] [] kthread_data+0x10/0x20
[ 959.323147] RSP: 0018:ffff8803f585fa28 EFLAGS: 00010092
[ 959.323148] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
[ 959.323148] RDX: ffffffff81e7b220 RSI: 0000000000000001 RDI: ffff8803f5f1dbc0
[ 959.323149] RBP: ffff8803f585fa28 R08: ffff8803f5f1dc30 R09: 0000000000000800
[ 959.323149] R10: 0000000000000000 R11: 000000000000002f R12: ffff8803fac33c80
[ 959.323150] R13: 0000000000000001 R14: ffff8803f5f1dbb0 R15: ffff8803f5f1dbc0
[ 959.323150] FS: 0000000000000000(0000) GS:ffff8803fac20000(0000) knlGS:0000000000000000
[ 959.323151] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 959.323151] CR2: ffffffffffffffd8 CR3: 0000000001c0b000 CR4: 00000000000407e0
[ 959.323152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 959.323152] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 959.323153] Process kworker/u:6 (pid: 87, threadinfo ffff8803f585e000, task ffff8803f5f1dbc0)
[ 959.323153] Stack:
[ 959.323154] ffff8803f585fa48 ffffffff8107b845 ffff8803f585fa48 ffff8803f5f1df78
[ 959.323155] ffff8803f585fab8 ffffffff81614aa2 ffff8803f5f1dbc0 ffff8803f585ffd8
[ 959.323156] ffff8803f585ffd8 ffff8803f585ffd8 ffff8803f585faa8 ffff8803f5f1dbc0
[ 959.323157] Call Trace:
[ 959.323158] [] wq_worker_sleeping+0x15/0xc0
[ 959.323159] [] __schedule+0x5c2/0x7a0
[ 959.323163] [] schedule+0x29/0x70
[ 959.323164] [] do_exit+0x59a/0x8b0
[ 959.323166] [] ? printk+0x61/0x63
[ 959.323169] [] oops_end+0x9d/0xe0
[ 959.323170] [] die+0x58/0x90
[ 959.323172] [] do_trap+0x6b/0x170
[ 959.323173] [] ? __atomic_notifier_call_chain+0x12/0x20
[ 959.323175] [] do_invalid_op+0x9c/0xb0
[ 959.323178] [] ? eio_post_io_callback+0x84d/0x910 [enhanceio]
[ 959.323180] [] ? down_trylock+0x36/0x50
[ 959.323183] [] ? console_trylock+0x1c/0x70
[ 959.323186] [] invalid_op+0x1e/0x30
[ 959.323188] [] ? eio_post_io_callback+0x84d/0x910 [enhanceio]
[ 959.323190] [] ? eio_post_io_callback+0x7e3/0x910 [enhanceio]
[ 959.323193] [] process_one_work+0x147/0x480
[ 959.323195] [] ? eio_disk_io+0x220/0x220 [enhanceio]
[ 959.323197] [] worker_thread+0x15e/0x450
[ 959.323199] [] ? busy_worker_rebind_fn+0x100/0x100
[ 959.323201] [] kthread+0xc0/0xd0
[ 959.323203] [] ? ftrace_raw_event_xen_mmu_flush_tlb_others+0xb0/0xe0
[ 959.323205] [] ? kthread_create_on_node+0x120/0x120
[ 959.323207] [] ret_from_fork+0x7c/0xb0
[ 959.323208] [] ? kthread_create_on_node+0x120/0x120
[ 959.323210] Code: 00 48 89 e5 5d 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 87 60 03 00 00 55 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90
[ 959.323259] RIP [] kthread_data+0x10/0x20
[ 959.323261] RSP
[ 959.323262] CR2: ffffffffffffffd8
[ 959.323263] ---[ end trace ebcf66a09a2e82b0 ]---
[ 959.323264] Fixing recursive fault but reboot is needed!

[ 991.721594] loop1: rw=1, want=4299243368, limit=20971520
[ 991.993587] attempt to access beyond end of device
[ 992.231028] loop1: rw=1, want=4303437672, limit=20971520
[ 992.493638] attempt to access beyond end of device
[ 992.730994] loop1: rw=1, want=4307631976, limit=20971520
[ 992.988870] attempt to access beyond end of device
[ 993.225584] loop1: rw=1, want=4311826280, limit=20971520
[ 993.488703] attempt to access beyond end of device
[ 993.725462] loop1: rw=1, want=4295133032, limit=20971520
[ 993.988555] attempt to access beyond end of device
[ 994.226067] loop1: rw=1, want=4299327336, limit=20971520
[ 994.488876] attempt to access beyond end of device
[ 994.726021] loop1: rw=1, want=4303521640, limit=20971520
[ 994.989017] attempt to access beyond end of device
[ 995.227007] loop1: rw=1, want=4295049064, limit=20971520
[ 995.490357] attempt to access beyond end of device
[ 995.727418] loop1: rw=1, want=4299243392, limit=20971520
[ 995.990688] attempt to access beyond end of device
[ 996.227948] loop1: rw=1, want=4303437680, limit=20971520
[ 996.491557] attempt to access beyond end of device
[ 996.728777] loop1: rw=1, want=4307631992, limit=20971520
[ 996.991570] attempt to access beyond end of device
[ 997.229560] loop1: rw=1, want=4311826288, limit=20971520
[ 997.493093] attempt to access beyond end of device
[ 997.729996] loop1: rw=1, want=4295133048, limit=20971520
[ 997.992842] attempt to access beyond end of device
[ 998.230431] loop1: rw=1, want=4299327344, limit=20971520
[ 998.493616] attempt to access beyond end of device
[ 998.731114] loop1: rw=1, want=4303521656, limit=20971520
[ 998.994174] attempt to access beyond end of device
[ 999.232368] loop1: rw=1, want=4295049072, limit=20971520
[ 999.495860] attempt to access beyond end of device
[ 999.732686] loop1: rw=1, want=4299243400, limit=20971520
[ 999.996145] attempt to access beyond end of device
[ 1000.234050] loop1: rw=1, want=4303437688, limit=20971520
[ 1000.497560] attempt to access beyond end of device
[ 1000.734574] loop1: rw=1, want=4307632000, limit=20971520
[ 1000.997396] attempt to access beyond end of device
[ 1001.234624] loop1: rw=1, want=4311826296, limit=20971520
[ 1001.497578] attempt to access beyond end of device
[ 1001.735269] loop1: rw=1, want=4295133056, limit=20971520
[ 1001.998806] attempt to access beyond end of device
[ 1002.236193] loop1: rw=1, want=4299327352, limit=20971520
[ 1002.500013] attempt to access beyond end of device
[ 1002.737273] loop1: rw=1, want=4303521672, limit=20971520
[ 1003.001057] attempt to access beyond end of device
[ 1003.239464] loop1: rw=1, want=4295049080, limit=20971520
[ 1003.503444] attempt to access beyond end of device
[ 1003.740585] loop1: rw=1, want=4299243408, limit=20971520
[ 1004.004122] attempt to access beyond end of device
[ 1004.241073] loop1: rw=1, want=4299244408, limit=20971520
[ 1004.504735] attempt to access beyond end of device
[ 1004.741579] loop1: rw=1, want=4303435624, limit=20971520
[ 1005.004861] attempt to access beyond end of device
[ 1005.242088] loop1: rw=1, want=4303435632, limit=20971520
[ 1005.505472] attempt to access beyond end of device
[ 1005.743105] loop1: rw=1, want=4303435640, limit=20971520
[ 1006.005590] attempt to access beyond end of device
[ 1006.242699] loop1: rw=1, want=4303435648, limit=20971520
[ 1006.505706] attempt to access beyond end of device
[ 1006.743616] loop1: rw=1, want=4303435656, limit=20971520
[ 1007.007243] attempt to access beyond end of device
[ 1007.244523] loop1: rw=1, want=4303435664, limit=20971520
[ 1007.508089] attempt to access beyond end of device
[ 1007.745320] loop1: rw=1, want=4303435672, limit=20971520
[ 1008.008389] attempt to access beyond end of device
[ 1008.245863] loop1: rw=1, want=4303435680, limit=20971520
[ 1008.509006] attempt to access beyond end of device
[ 1008.745990] loop1: rw=1, want=4303435688, limit=20971520
[ 1009.008993] attempt to access beyond end of device
[ 1009.247142] loop1: rw=1, want=4303435696, limit=20971520
[ 1009.511308] attempt to access beyond end of device
[ 1009.748784] loop1: rw=1, want=4303435704, limit=20971520
[ 1010.011577] attempt to access beyond end of device
[ 1010.248782] loop1: rw=1, want=4303435712, limit=20971520
[ 1010.512055] attempt to access beyond end of device
[ 1010.749650] loop1: rw=1, want=4303435720, limit=20971520
[ 1011.012799] attempt to access beyond end of device
[ 1011.249963] loop1: rw=1, want=4303435728, limit=20971520
[ 1011.512890] attempt to access beyond end of device
[ 1011.750124] loop1: rw=1, want=4303435736, limit=20971520
[ 1012.013656] attempt to access beyond end of device
[ 1012.250944] loop1: rw=1, want=4303435744, limit=20971520
[ 1033.857776] ------------[ cut here ]------------
[ 1034.086646] WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
[ 1034.469878] Hardware name: ProLiant ML110 G7
[ 1034.680956] Watchdog detected hard LOCKUP on cpu 0
[ 1034.909836] Modules linked in: enhanceio_lru enhanceio_fifo enhanceio lockd sunrpc bnep bluetooth rfkill ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack coretemp kvm_intel e1000e iTCO_wdt kvm joydev iTCO_vendor_support lpc_ich crc32c_intel ghash_clmulni_intel mfd_core hpilo hpwdt microcode serio_raw pcspkr uinput
[ 1036.636435] Pid: 0, comm: swapper/0 Tainted: G D 3.7.9 #1
[ 1036.946900] Call Trace:
[ 1037.067698] [] warn_slowpath_common+0x7f/0xc0
[ 1037.395293] [] warn_slowpath_fmt+0x46/0x50
[ 1037.680481] [] ? touch_nmi_watchdog+0x80/0x80
[ 1037.978331] [] watchdog_overflow_callback+0x9c/0xd0
[ 1038.301614] [] __perf_event_overflow+0x9d/0x230
[ 1038.607734] [] ? x86_perf_event_set_period+0xd7/0x160
[ 1038.939282] [] perf_event_overflow+0x14/0x20
[ 1039.232738] [] intel_pmu_handle_irq+0x1ae/0x330
[ 1039.539153] [] perf_event_nmi_handler+0x1d/0x20
[ 1039.845431] [] nmi_handle.isra.0+0x51/0x80
[ 1040.131074] [] do_nmi+0x179/0x350
[ 1040.377467] [] end_repeat_nmi+0x1e/0x2e
[ 1040.649578] [] ? _raw_spin_lock+0x25/0x30
[ 1040.929843] [] ? _raw_spin_lock+0x25/0x30
[ 1041.210441] [] ? _raw_spin_lock+0x25/0x30
[ 1041.491036] <> [] sched_rt_period_timer+0x10e/0x370
[ 1041.871292] [] __run_hrtimer+0x73/0x1d0
[ 1042.138426] [] ? dequeue_task_rt+0x50/0x50
[ 1042.423384] [] hrtimer_interrupt+0xf7/0x230
[ 1042.712281] [] smp_apic_timer_interrupt+0x69/0x99
[ 1043.026932] [] apic_timer_interrupt+0x6d/0x80
[ 1043.324527] [] ? intel_idle+0xed/0x150
[ 1043.618943] [] ? intel_idle+0xce/0x150
[ 1043.886957] [] cpuidle_enter+0x19/0x20
[ 1044.154887] [] cpuidle_idle_call+0xa9/0x260
[ 1044.443033] [] cpu_idle+0xaf/0x120
[ 1044.692953] [] rest_init+0x72/0x80
[ 1044.942838] [] start_kernel+0x3b9/0x3c6
[ 1045.213741] [] ? repair_env_string+0x5e/0x5e
[ 1045.506609] [] x86_64_start_reservations+0x131/0x135
[ 1045.834499] [] x86_64_start_kernel+0x100/0x10f
[ 1046.135784] ---[ end trace ebcf66a09a2e82b1 ]---


CoRpO commented on September 28, 2024

I do experience this problem myself. Kernel 3.8.3; 480 GB SSD, 36 TB raid 5; x64 arch.

It seems to be related to the size of the HDD: as soon as it is > 2 TB, accesses are attempted on the SSD with LBAs belonging to the HDD.
Has nobody used eio with a > 2 TB HDD? I really doubt that ...

Using the full device or a partition doesn't change anything (for the SSD and/or the HDD), and the problem arises immediately in my case.


onlyjob commented on September 28, 2024

Hi CoRpO,

Please forgive me for off-topic but if you really have 36 TB RAID-5 (that is not made of redundant RAID-6 bricks) then you have a bigger problem:

http://raid6.com.au/posts/RAID5_considered_harmful/
http://www.miracleas.com/BAARF/BAARF2.html
http://miracleas.com/BAARF/RAID5_versus_RAID10.txt
http://www.standalone-sysadmin.com/blog/2012/08/i-come-not-to-praise-raid-5/
http://www.zdnet.com/blog/storage/raidfail-dont-use-raid-5-on-small-arrays/483
http://www.reddit.com/r/sysadmin/comments/ydi6i/dell_raid_5_is_no_longer_recommended_for_any/


CoRpO commented on September 28, 2024

@onlyjob: I don't really mind losing those files, so RAID 5 is fine for me.

I've noticed that "attempt to access beyond end of device" always happens when writing to the SSD (either in WRITECACHE or READFILL, depending on the cache mode). I'm trying to find the logical flaw (apparently related to a 32-bit value overflowing), but I'm not good at that :/


CoRpO commented on September 28, 2024

The following commands are issued:

io_callback: io block 6442453072 cache 51166528 action 3 (READDISK)
io_callback: io block 6442453072 cache 4298617840 action 5 (READFILL)
io_callback: io error -5 block 6442453072 cache 4298617840 action 5
attempt to access beyond end of device
sdb: rw=1, want=4298617848, limit=937637552

(I added the "cache" data; it is the value of job_io_regions.cache.sector inside eio_post_io_callback.)

When caching a < 2 TB block device, output looks like:

io_callback: io block 33555096 cache 55838231592 action 3 (READDISK)
io_callback: io block 33555096 cache 37203072 action 5 (READFILL)

Hope this helps. In the working log, cache.sector is greater than the cache size when READDISK occurs, so I guess it is not used in that case?


sanoj-stec commented on September 28, 2024

It seems the issue appears whenever a write/read happens beyond 2 TB.
Both of these operations may perform READFILL and WRITECACHE
(which involve writes to the SSD).
The issue is somewhere in the math where we compute and fill where.sector.


bhansaliakhil commented on September 28, 2024

Hello All,

I have created a fix for this and submitted a pull request to Sanoj.

The bug was in the "EIO_ROUND_SECTOR" macro, where a bitwise AND operation was truncating the sector value.

The root cause of the problem was performing a bitwise AND between a 64-bit value and a 32-bit mask. I have changed the mask value from unsigned (i.e. 32-bit) to unsigned long (64-bit).

Hope this helps.
~Akhil

