
Comments (30)

apulsifer commented on September 26, 2024

Took about 20 minutes. They all match.

Fri May 24 12:12:33 UTC 2024
{
"height": 844917,
"bestblock": "000000000000000000017dd5f59b73629f6f88797e90017a9df39c5e435296bf",
"txouts": 181984254,
"bogosize": 13994337434,
"muhash": "b97a3fb13a61e8c889668d064f7ae5408a78ee7e4c6a4fdebedb30ecb2f23378",
"total_amount": 19702649.24256659,
"transactions": 125720653,
"disk_size": 12220356091
}
Fri May 24 12:39:58 UTC 2024

from bitcoin.

apulsifer commented on September 26, 2024

> Could you explain the paging in a bit more detail?

The machines page pretty hard to SSD for about 20 seconds after each new block arrives. That info comes from "sar -d 10", which I logged for a while when I was setting up the first machine.

> Is the node constantly being bombarded with lots of simultaneous RPC calls (which ones?) or is some other interface used?

None of these machines has at this point serviced a single RPC call (I'm still trying to get things set up). No other interface is being used, just the bitcoind peer network, half of the machines with direct IPv6, half via torproxy.

> Also, from your log it appears that this happened while the node was catching up with the tip (almost but not completely synced yet), receiving blocks quickly. Is that typical, or does it usually happen when the node is synced and receives blocks as they are mined?

Syncing has been pretty typical at the moment, since I'm still setting things up and bitcoind has been started and stopped from time to time to try out different settings.

The initial sync from the genesis block was done on a machine with 16 GB RAM. Then on May 7, bitcoind on that machine was stopped and the /bdata directory was copied to these 10 machines with only 512 MB RAM that are only expected to keep up with new blocks as they arrive. Since that time, two of the ten machines have had no data corruption. The other eight machines have had data corruption one or more times.

The problem doesn't just show up during block sync however. I do know that on Monday at end of day, all the machines were running and fully synced, and by Tuesday morning, two machines had data corruption. I was busy with other things and left all the machines alone, and by the time I went to fix them Wednesday afternoon, four machines had data corruption.

I can't rule out that starting and stopping bitcoind, or rebooting the machines, has contributed to this problem. The files could sit in the bdata directory for a while, and it's possible one gets corrupted when bitcoind shuts down or the machine reboots, but bitcoind doesn't notice until sometime later when it reads the file. Note that bitcoind is started and stopped by a systemd service file (attached below) with a 10-minute stop timeout: systemd will politely ask bitcoind to stop, and if it hasn't exited after 10 minutes, systemd will kill it. Another thing worth noting from the service file is that I set up bitcoind to run at Nice=16, which might contribute to triggering the problem.

As of late last night, all machines are running and fully synced again. By next week, I'm going to start leaving them alone to run autonomously (with the exception of rebuilding a datadir if needed), so that will be a much better test of what happens when the machines are fully synced and bitcoind is running continuously.

[Service]
WorkingDirectory=/home/ec2-user
ExecStart=/home/ec2-user/bitcoin/bin/bitcoind -conf=/home/ec2-user/bitcoin.conf
Restart=always
RestartSec=60
TimeoutStopSec=600
Nice=16
User=ec2-user
Group=ec2-user
StandardOutput=journal
StandardError=journal

[Unit]
After=network-online.target

[Install]
WantedBy=multi-user.target
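
The stop behaviour described above (a polite request, then a forced kill once TimeoutStopSec expires) can be sketched in a few lines of Python. This only illustrates the pattern and is not how systemd itself is implemented; the function name stop_with_timeout and its return strings are made up for the example.

```python
import subprocess
import sys


def stop_with_timeout(proc, timeout_s):
    """Ask a child process to stop, then force-kill it if it ignores the request.

    Mirrors the idea behind TimeoutStopSec: SIGTERM first, SIGKILL after the
    deadline. Returns "exited" or "killed" (hypothetical labels for this sketch).
    """
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=timeout_s)
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()  # forced kill (SIGKILL on POSIX)
        proc.wait()
        return "killed"
```

A process that is killed this way gets no chance to flush its in-memory state, which is why a long stop timeout is used for a database-backed daemon.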

apulsifer commented on September 26, 2024

ok, thx, I updated all the servers to bitcoin-27.1-aarch64-linux-gnu.tar.gz
The last data corruption occurred on July 13 and 14. I'll let you know next time I see it.

maflcko commented on September 26, 2024

I suspect a bug in the code is causing some thread to write to an incorrect memory location, possibly a memory use-after-free/reallocation/reorganization bug.

Would it be possible for you to compile and run with asan, or a similar sanitizer?

Also, what filesystem are you using on the drives? Something like df --print-type --human-readable /bdata should print it.

apulsifer commented on September 26, 2024

xfs on the root and data drive

sudo mkswap /dev/nvme[4 GB disk]
sudo swapon /dev/nvme[4 GB disk]

sudo mkdir /bdata
sudo mkfs -t xfs /dev/nvme[40 GB disk]
lsblk -o name,size,type,uuid
sudo nano /etc/fstab

add to fstab:
UUID=[4 GB disk uuid] swap swap defaults 0 0
UUID=[40 GB disk uuid] /bdata xfs defaults,nofail 0 2

sudo mount -a

Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 4.0M 0 4.0M 0% /dev
tmpfs tmpfs 210M 0 210M 0% /dev/shm
tmpfs tmpfs 84M 552K 84M 1% /run
/dev/nvme0n1p1 xfs 14G 3.4G 11G 25% /
tmpfs tmpfs 210M 0 210M 0% /tmp
/dev/nvme1n1 xfs 40G 19G 22G 46% /bdata
/dev/nvme0n1p128 vfat 10M 1.4M 8.7M 14% /boot/efi
tmpfs tmpfs 42M 0 42M 0% /run/user/1000

apulsifer commented on September 26, 2024

> Would it be possible for you to compile and run with asan, or a similar sanitizer?

I don't see where I would get a chunk of time to do that right now.... But as I mentioned, I copied the corrupted files, and that might give some clues to someone familiar with their format (especially if the corruption is ASCII in the middle of binary, or vice versa).

maflcko commented on September 26, 2024

> I don't see where I would get a chunk of time to do that right now....

Sure, no rush. It'll probably take some time to pin this down. (I don't have an AWS account, so I can't test it, but maybe someone else has).

Some other ideas to test in the meantime:

  • Try another filesystem instead of xfs
  • Try the master branch (not for production, just for testing whether the issue still happens there)

apulsifer commented on September 26, 2024

xfs is used on the root drive of every Amazon EC2 instance running Amazon Linux. If xfs on Amazon EC2 were the problem, a lot of critical infrastructure would be failing right now. And as I mentioned, these machines are also running bitcoin cash in an almost identical configuration (data file path and ports changed) and it has had zero problems. Since the data corruption only happens about once per week when running on mainnet, I think figuring this problem out will probably take a customized and instrumented version of bitcoind being fed blocks at high speed with random jitter and waits and synthetic memory pressure. This could probably be done on a virtual machine anywhere, like Xen, KVM, VirtualBox, etc.

maflcko commented on September 26, 2024

> Since the data corruption only happens about once per week

Once per week is a lot and if this was a broader problem, I'd assume that more people were complaining. Given that you can consistently reproduce it on different machines, it seems there is a real bug somewhere. However, Bitcoin Core is running fine in a lot of other places, so there has to be some hardware or configuration setting (or combination thereof) that triggers this bug on your side. It would be good to know which one it is.

mzumsande commented on September 26, 2024

(edited first question out, I misunderstood)

From your log it appears that this happened while the node was catching up with the tip (almost but not completely synced yet), receiving blocks quickly. Is that typical, or does it usually happen when the node is synced and receives blocks as they are mined?

apulsifer commented on September 26, 2024

> Once per week is a lot and if this was a broader problem, I'd assume that more people were complaining.

It could just be that the memory pressure is uncovering the problem. Of course, any machine can experience memory pressure at times, but one thing that's unique is that these machines start paging hard to SSD for about 20 seconds after each new block arrives.

maflcko commented on September 26, 2024

I think that if you call the RPC gettxoutsetinfo muhash on all nodes (when they are synced to the same block) and the result matches for all of them, then the chainstate leveldb at that point in time is probably fine. I presume all failures happened in the /bdata/bitcoin-data/chainstate/ leveldb?

Edit: Calling that RPC will take a long time on your machines, I suspect.
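
As a rough sketch of the comparison suggested here: once each node's gettxoutsetinfo muhash output has been collected as JSON text, the nodes can be grouped by the fields that should agree. The helper name group_by_utxo_stats and the way the outputs are gathered are assumptions for illustration; only the field names height, bestblock, and muhash come from the RPC output shown in this thread.

```python
import json

# Fields that should be identical on every node synced to the same block.
FIELDS = ("height", "bestblock", "muhash")


def group_by_utxo_stats(results):
    """Group node names by their (height, bestblock, muhash) tuple.

    `results` maps a node name to the JSON text printed by
    `bitcoin-cli gettxoutsetinfo muhash` on that node.
    """
    groups = {}
    for node, text in results.items():
        info = json.loads(text)
        key = tuple(info[field] for field in FIELDS)
        groups.setdefault(key, []).append(node)
    return groups
```

If the returned dictionary has exactly one key, all nodes agree. More than one key means either the nodes were at different heights when queried or at least one UTXO set differs.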

apulsifer commented on September 26, 2024

Yes, all the checksum errors are in numerically-named NNNNNN.ldb files in /bdata/bitcoin-data/chainstate/

I had no luck with gettxoutsetinfo muhash:

bitcoin/bin/bitcoin-cli -rpcwaittimeout=0 -conf=/home/ec2-user/bitcoin.conf getblockcount
844826

bitcoin/bin/bitcoin-cli -rpcwaittimeout=0 -conf=/home/ec2-user/bitcoin.conf gettxoutsetinfo muhash
error: timeout on transient error: Could not connect to the server 127.0.0.1:8332 (error code 0 - "timeout reached")

maflcko commented on September 26, 2024

The RPC will take a long time (probably hours), so you'll have to disable the client timeout with -rpcclienttimeout=0.

bitcoin/bin/bitcoin-cli -rpcclienttimeout=0 -rpcwaittimeout=0 -conf=/home/ec2-user/bitcoin.conf gettxoutsetinfo muhash

apulsifer commented on September 26, 2024

Update: Since leaving these servers alone for a week and not rebooting them or restarting bitcoind, they have stayed perfectly in sync without issues. So it looks like the problem is triggered by starting and stopping bitcoind (which I can live with: if I do have to restart a server and I get data corruption, I'll image the data from another server).

apulsifer commented on September 26, 2024

Update: After running continuously since 05-23 (no reboots or restarting bitcoind), one of the servers failed this morning. So it seems the data corruption bug occurs even when bitcoind is running continuously, although at a much lower rate.

Started bitcoind.service.
2024-05-23T11:23:40Z Bitcoin Core version v25.2.0 (release build)
2024-05-23T11:23:40Z InitParameterInteraction: parameter interaction: -blocksonly=1 -> setting -whitelistrelay=0
2024-05-23T11:23:40Z Using the 'arm_shani(1way,2way)' SHA256 implementation
2024-05-23T11:23:40Z Default data directory /home/ec2-user/.bitcoin
2024-05-23T11:23:40Z Using data directory /bdata/bitcoin-data
2024-05-23T11:23:40Z Config file: /home/ec2-user/bitcoin.conf
2024-05-23T11:23:40Z Config file arg: blockreconstructionextratxn="1"
2024-05-23T11:23:40Z Config file arg: blocksonly="1"
2024-05-23T11:23:40Z Config file arg: datadir="/bdata/bitcoin-data"
2024-05-23T11:23:40Z Config file arg: dbcache="200"
2024-05-23T11:23:40Z Config file arg: debuglogfile=false
2024-05-23T11:23:40Z Config file arg: disablewallet="1"
2024-05-23T11:23:40Z Config file arg: discover="0"
2024-05-23T11:23:40Z Config file arg: dns="0"
2024-05-23T11:23:40Z Config file arg: dnsseed="0"
2024-05-23T11:23:40Z Config file arg: listen="1"
2024-05-23T11:23:40Z Config file arg: maxconnections="24"
2024-05-23T11:23:40Z Config file arg: maxmempool="5"
2024-05-23T11:23:40Z Config file arg: maxorphantx="1"
2024-05-23T11:23:40Z Config file arg: maxsigcachesize="4"
2024-05-23T11:23:40Z Config file arg: mempoolexpiry="1"
2024-05-23T11:23:40Z Config file arg: par="1"
2024-05-23T11:23:40Z Config file arg: persistmempool="0"
2024-05-23T11:23:40Z Config file arg: printtoconsole="1"
2024-05-23T11:23:40Z Config file arg: prune="550"
2024-05-23T11:23:40Z Config file arg: rest="1"
2024-05-23T11:23:40Z Config file arg: rpcallowip="127.0.0.1"
2024-05-23T11:23:40Z Config file arg: rpcthreads="1"
2024-05-23T11:23:40Z Config file arg: rpcworkqueue="40"
2024-05-23T11:23:40Z Config file arg: server="1"
2024-05-23T11:23:40Z Config file arg: [main] bind="127.0.0.1:8334=onion"
2024-05-23T11:23:40Z Config file arg: [main] rpcbind="127.0.0.1:8332"
2024-05-23T11:23:40Z Command-line arg: conf="/home/ec2-user/bitcoin.conf"
2024-05-23T11:23:40Z Using at most 24 automatic connections (65535 file descriptors available)
2024-05-23T11:23:40Z Using 2 MiB out of 2 MiB requested for signature cache, able to store 65536 elements
2024-05-23T11:23:40Z Using 2 MiB out of 2 MiB requested for script execution cache, able to store 65536 elements
2024-05-23T11:23:40Z Script verification uses 0 additional threads
2024-05-23T11:23:40Z Wallet disabled!
2024-05-23T11:23:40Z scheduler thread start
2024-05-23T11:23:40Z Binding RPC on address 127.0.0.1 port 8332
2024-05-23T11:23:40Z [http] creating work queue of depth 40
2024-05-23T11:23:40Z [http] starting 1 worker threads
2024-05-23T11:23:40Z Using /16 prefix for IP bucketing
2024-05-23T11:23:40Z init message: Loading P2P addresses…
2024-05-23T11:23:41Z Loaded 67288 addresses from peers.dat 1001ms
2024-05-23T11:23:41Z init message: Loading banlist…
2024-05-23T11:23:41Z SetNetworkActive: true
2024-05-23T11:23:41Z Cache configuration:
2024-05-23T11:23:41Z * Using 2.0 MiB for block index database
2024-05-23T11:23:41Z * Using 8.0 MiB for chain state database
2024-05-23T11:23:41Z * Using 190.0 MiB for in-memory UTXO set (plus up to 4.8 MiB of unused mempool space)
2024-05-23T11:23:41Z init message: Loading block index…
2024-05-23T11:23:41Z Assuming ancestors of block 000000000000000000035c3f0d31e71a5ee24c5aaf3354689f65bd7b07dee632 have valid signatures.
2024-05-23T11:23:41Z Setting nMinimumChainWork=000000000000000000000000000000000000000044a50fe819c39ad624021859
2024-05-23T11:23:41Z Prune configured to target 550 MiB on disk for block and undo files.
2024-05-23T11:23:41Z Opening LevelDB in /bdata/bitcoin-data/blocks/index
2024-05-23T11:23:41Z Opened LevelDB successfully
2024-05-23T11:23:41Z Using obfuscation key for /bdata/bitcoin-data/blocks/index: 0000000000000000
2024-05-23T11:23:50Z LoadBlockIndexDB: last block file = 4298
2024-05-23T11:23:50Z LoadBlockIndexDB: last block file info: CBlockFileInfo(blocks=12, size=18765314, heights=844726...844737, time=2024-05-23...2024-05-23)
2024-05-23T11:23:50Z Checking all blk files are present...
2024-05-23T11:23:51Z LoadBlockIndexDB(): Block files have previously been pruned
2024-05-23T11:23:53Z Initializing chainstate Chainstate [ibd] @ height -1 (null)
2024-05-23T11:23:53Z Opening LevelDB in /bdata/bitcoin-data/chainstate
2024-05-23T11:23:53Z Opened LevelDB successfully
2024-05-23T11:23:53Z Using obfuscation key for /bdata/bitcoin-data/chainstate: 27687fc922c5e117
2024-05-23T11:23:59Z Loaded best chain: hashBestChain=000000000000000000027d7ef87e117148fb2f0fd86daa593be6a9ab60d90b55 height=844737 date=2024-05-23T11:14:21Z progress=0.999998
2024-05-23T11:23:59Z [snapshot] allocating all cache to the IBD chainstate
2024-05-23T11:23:59Z Opening LevelDB in /bdata/bitcoin-data/chainstate
2024-05-23T11:23:59Z Opened LevelDB successfully
2024-05-23T11:23:59Z Using obfuscation key for /bdata/bitcoin-data/chainstate: 27687fc922c5e117
2024-05-23T11:23:59Z [Chainstate [ibd] @ height 844737 (000000000000000000027d7ef87e117148fb2f0fd86daa593be6a9ab60d90b55)] resized coinsdb cache to 8.0 MiB
2024-05-23T11:23:59Z [Chainstate [ibd] @ height 844737 (000000000000000000027d7ef87e117148fb2f0fd86daa593be6a9ab60d90b55)] resized coinstip cache to 190.0 MiB
2024-05-23T11:23:59Z init message: Verifying blocks…
2024-05-23T11:23:59Z Verifying last 6 blocks at level 3
2024-05-23T11:23:59Z Verification progress: 0%
2024-05-23T11:24:08Z Verification progress: 16%
2024-05-23T11:24:13Z Verification progress: 33%
2024-05-23T11:24:16Z Verification progress: 50%
2024-05-23T11:24:21Z Verification progress: 66%
2024-05-23T11:24:26Z Verification progress: 83%
2024-05-23T11:24:30Z Verification progress: 99%
2024-05-23T11:24:30Z Verification: No coin database inconsistencies in last 6 blocks (17307 transactions)
2024-05-23T11:24:30Z block index 49358ms
2024-05-23T11:24:30Z init message: Pruning blockstore…
2024-05-23T11:24:30Z Leaving InitialBlockDownload (latching to false)
2024-05-23T11:24:30Z block tree size = 844738
2024-05-23T11:24:30Z nBestHeight = 844737
2024-05-23T11:24:30Z loadblk thread start
2024-05-23T11:24:30Z loadblk thread exit
2024-05-23T11:24:30Z torcontrol thread start
2024-05-23T11:24:30Z Bound to 127.0.0.1:8334
2024-05-23T11:24:30Z init message: Starting network threads…
2024-05-23T11:24:30Z DNS seeding disabled
2024-05-23T11:24:30Z init message: Done loading
2024-05-23T11:24:30Z opencon thread start
2024-05-23T11:24:30Z net thread start
2024-05-23T11:24:30Z addcon thread start
2024-05-23T11:24:30Z msghand thread start
2024-05-23T11:24:30Z New outbound peer connected: version: 70016, blocks=844737, peer=1 (manual)
2024-05-23T11:24:31Z New outbound peer connected: version: 70016, blocks=844737, peer=4 (manual)
2024-05-23T11:24:31Z New outbound peer connected: version: 70016, blocks=844737, peer=5 (manual)
2024-05-23T11:24:31Z New outbound peer connected: version: 70016, blocks=844737, peer=6 (manual)
2024-05-23T11:25:09Z New outbound peer connected: version: 70016, blocks=844737, peer=10 (manual)
2024-05-23T11:26:10Z New outbound peer connected: version: 70016, blocks=844737, peer=13 (manual)
2024-05-23T11:26:13Z Saw new header hash=00000000000000000003508531e1ec11798f1972e307235a54ef91bf945e246c height=844738
2024-05-23T11:26:58Z UpdateTip: new best=00000000000000000003508531e1ec11798f1972e307235a54ef91bf945e246c height=844738 version=0x322d6000 log2_work=94.940996 tx=1009663503 date='2024-05-23T11:22:20Z' progress=0.999999 cache=1.8MiB(11945txo)
2024-05-23T11:26:58Z Saw new header hash=00000000000000000000c18685513156cfd695edd2378ca2ba819d785866a571 height=844739
2024-05-23T11:27:08Z UpdateTip: new best=00000000000000000000c18685513156cfd695edd2378ca2ba819d785866a571 height=844739 version=0x29a3e000 log2_work=94.941009 tx=1009668305 date='2024-05-23T11:25:43Z' progress=1.000000 cache=2.6MiB(18123txo)
2024-05-23T11:27:39Z New outbound peer connected: version: 70016, blocks=844737, peer=15 (manual)
2024-05-23T11:27:55Z New outbound peer connected: version: 70016, blocks=844737, peer=14 (manual)
2024-05-23T11:36:44Z Saw new header hash=00000000000000000000e9be8cfceef5f4d12313ab2657d7fcf4e617dc9bb839 height=844740
2024-05-23T11:36:51Z UpdateTip: new best=00000000000000000000e9be8cfceef5f4d12313ab2657d7fcf4e617dc9bb839 height=844740 version=0x224c8000 log2_work=94.941023 tx=1009673632 date='2024-05-23T11:36:11Z' progress=1.000000 cache=3.9MiB(26582txo)
2024-05-23T11:47:09Z Saw new header hash=0000000000000000000152fee6b2cb2779c2fe0ce34aaad57f9034c1613463a0 height=844741
2024-05-23T11:47:15Z UpdateTip: new best=0000000000000000000152fee6b2cb2779c2fe0ce34aaad57f9034c1613463a0 height=844741 version=0x24000000 log2_work=94.941037 tx=1009679243 date='2024-05-23T11:46:34Z' progress=1.000000 cache=4.9MiB(34377txo)
2024-05-23T11:49:55Z Saw new header hash=000000000000000000025c416dc7962405d500d87238bf392c95aa9610c3a71e height=844742
2024-05-23T11:49:58Z UpdateTip: new best=000000000000000000025c416dc7962405d500d87238bf392c95aa9610c3a71e height=844742 version=0x2e000000 log2_work=94.941051 tx=1009686227 date='2024-05-23T11:49:28Z' progress=1.000000 cache=5.3MiB(37584txo)
2024-05-23T11:59:40Z Saw new header hash=000000000000000000014072f1d5d67100bf6c097e971cd2af2d579b32a30f93 height=844743
2024-05-23T11:59:46Z UpdateTip: new best=000000000000000000014072f1d5d67100bf6c097e971cd2af2d579b32a30f93 height=844743 version=0x2652e000 log2_work=94.941064 tx=1009691358 date='2024-05-23T11:59:05Z' progress=1.000000 cache=6.8MiB(45760txo)
2024-05-23T12:04:32Z Saw new header hash=00000000000000000002b692a4141102da57b78d667d8a3f9a461fe85106a4c5 height=844744

...

2024-06-04T06:23:12Z Saw new header hash=00000000000000000001de5e312d55f873e73d14f3cd8a8ee656a392dbc28236 height=846459
2024-06-04T06:23:52Z UpdateTip: new best=00000000000000000001de5e312d55f873e73d14f3cd8a8ee656a392dbc28236 height=846459 version=0x2403a000 log2_work=94.964467 tx=1017950010 date='2024-06-04T06:22:45Z' progress=1.000000 cache=87.4MiB(578849txo)
2024-06-04T06:24:43Z Saw new header hash=00000000000000000000225789427db9f0e8f310d8bc0a205f884a8ce68a2aaf height=846460
2024-06-04T06:25:09Z UpdateTip: new best=00000000000000000000225789427db9f0e8f310d8bc0a205f884a8ce68a2aaf height=846460 version=0x25ed2000 log2_work=94.964481 tx=1017955017 date='2024-06-04T06:24:38Z' progress=1.000000 cache=88.0MiB(583022txo)
2024-06-04T06:37:08Z Saw new header hash=0000000000000000000012f2d726f8a033a2bfb5eada30cd92e15e6e1d196ce7 height=846461
2024-06-04T06:37:47Z UpdateTip: new best=0000000000000000000012f2d726f8a033a2bfb5eada30cd92e15e6e1d196ce7 height=846461 version=0x23c16000 log2_work=94.964494 tx=1017959189 date='2024-06-04T06:36:37Z' progress=1.000000 cache=89.1MiB(590143txo)
2024-06-04T07:36:10Z Saw new header hash=000000000000000000031f97130e48c0a7797547416d16ccd3d7dd8a6cc6d0b0 height=846462
2024-06-04T07:36:52Z UpdateTip: new best=000000000000000000031f97130e48c0a7797547416d16ccd3d7dd8a6cc6d0b0 height=846462 version=0x2001e000 log2_work=94.964508 tx=1017962502 date='2024-06-04T07:35:52Z' progress=1.000000 cache=90.6MiB(602543txo)
2024-06-04T07:42:53Z Saw new header hash=000000000000000000002bde133693a19d84616a4cf1db767f8864b5288cce6b height=846463
2024-06-04T07:44:01Z UpdateTip: new best=000000000000000000002bde133693a19d84616a4cf1db767f8864b5288cce6b height=846463 version=0x21aea000 log2_work=94.964521 tx=1017965939 date='2024-06-04T07:42:39Z' progress=1.000000 cache=11.0MiB(0txo)
2024-06-04T08:01:04Z Saw new header hash=00000000000000000001d5e0369520ead2dc646b7b592b8bafff8dc02e368600 height=846464
2024-06-04T08:01:40Z LevelDB read failure: Corruption: block checksum mismatch: /bdata/bitcoin-data/chainstate/5978736.ldb
2024-06-04T08:01:40Z Fatal LevelDB error: Corruption: block checksum mismatch: /bdata/bitcoin-data/chainstate/5978736.ldb
2024-06-04T08:01:40Z You can use -debug=leveldb to get more complete diagnostic messages
2024-06-04T08:01:40Z Error: Error reading from database, shutting down.
Error: Error reading from database, shutting down.
2024-06-04T08:01:40Z Error reading from database: Fatal LevelDB error: Corruption: block checksum mismatch: /bdata/bitcoin-data/chainstate/5978736.ldb
bitcoind.service: Main process exited, code=dumped, status=6/ABRT
bitcoind.service: Failed with result 'core-dump'.
bitcoind.service: Consumed 3h 46min 58.907s CPU time.
bitcoind.service: Scheduled restart job, restart counter is at 1.
Stopped bitcoind.service.
bitcoind.service: Consumed 3h 46min 58.907s CPU time.
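
With ten machines to watch, it may help to pull the corruption events out of the journal automatically. The following is a minimal sketch that matches the "LevelDB read failure" line format seen in the log above; the regular expression and the function name find_corruption_events are assumptions based only on that excerpt, so other versions or error types may need a different pattern.

```python
import re

# Matches lines such as:
# 2024-06-04T08:01:40Z LevelDB read failure: Corruption: block checksum mismatch: /path/NNN.ldb
CORRUPTION_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z) "
    r"LevelDB read failure: Corruption: (?P<reason>.+?): (?P<path>/\S+\.ldb)$"
)


def find_corruption_events(lines):
    """Return (timestamp, reason, file path) for each LevelDB read failure line."""
    events = []
    for line in lines:
        match = CORRUPTION_RE.match(line.strip())
        if match:
            events.append((match["ts"], match["reason"], match["path"]))
    return events
```

Collecting the timestamps and file names from every machine in one place would make it easier to see whether failures cluster around particular blocks.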

maflcko commented on September 26, 2024

Another thing you could try to debug this further is to put a swapfile, and the datadir on the same AWS gp3 SSD filesystem.

I am happy to create an AWS account to test this, but it would be good if there was a single (bash) script, which can be deployed to AWS, so that it is easy for anyone to reproduce your exact setup.

apulsifer commented on September 26, 2024

IMO, the first thing to do would be for someone who's familiar with the format of these block files to look at the corrupted files and see if they can figure out what code may have stomped on the blocks (it might be obvious, like a fragment of p2p networking data in the middle of a block -- you never know until you look).

The most likely scenario is that this is a latent software bug that will show up on any machine if it's under memory pressure and heavy paging. In my experience, finding low-incidence, seemingly random problems like this requires instrumenting the code (or using automated tools) with frequent memory buffer guard checks, injected faults such as networking jitter, stalls, disconnects, and invalid data, and random waits before and after memory is allocated, freed, and used (including networking and I/O buffers) and locks are acquired and released. I myself am more familiar with troubleshooting these problems under Windoze than Linux tho.

apulsifer commented on September 26, 2024

Update: After seeing no data corruption for over a week, 5 of the 10 servers experienced data corruption over the last two days. This makes me suspect the problem is not completely random, but depends on contents of the blocks. A replay of the mainnet blocks generated from 2024-06-28 to 2024-06-31 might make good test data (starting with the best blocks, and possibly also including the orphan blocks).

apulsifer commented on September 26, 2024

Another update: After again seeing no data corruption for a while (since the last update above), five of the ten servers again got data corruption over a two-day period. This reinforces my beliefs that (a) this data corruption depends on the contents of the blocks received by bitcoind, and that whatever triggers it is currently relatively infrequent (about once every two weeks); and (b) the data corruption happens at some earlier time and is not detected by bitcoind until sometime later when it attempts to read the corrupted block from the disk.
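
The second belief above, that damage can sit unnoticed until the next read, follows from how per-block checksums work: the checksum is computed at write time and only re-checked when the block is read back. Here is a toy sketch of that idea. The class ChecksummedStore is hypothetical, and zlib.crc32 is used as a stand-in; LevelDB's on-disk format uses a different CRC variant (CRC32C).

```python
import zlib


class ChecksummedStore:
    """Toy block store: checksum is recorded on write and verified only on read."""

    def __init__(self):
        self._blocks = {}

    def write(self, key, data):
        # Record the checksum of the data as it was when written.
        self._blocks[key] = (bytearray(data), zlib.crc32(data))

    def corrupt(self, key, offset):
        # Simulate a stray write: flip one byte without touching the checksum.
        self._blocks[key][0][offset] ^= 0xFF

    def read(self, key):
        data, expected = self._blocks[key]
        if zlib.crc32(data) != expected:
            raise IOError(f"block checksum mismatch: {key}")
        return bytes(data)
```

Calling corrupt() raises nothing; the mismatch only surfaces on the next read() of that key, which could be hours or days later for a rarely touched block.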

maflcko commented on September 26, 2024

Ok, I spun up two machines to see if I can reproduce. I left zram and put everything on one SSD. Also, my config has some debug logging enabled. Also, I am using a recent guix build, instead of a source compile of 25.x.

Let me know when this happens again, so that I can check if it happened to me as well. I'll then try to debug this further.

sh-5.2$ nproc
2
sh-5.2$ uname --kernel-release --kernel-version 
6.1.97-104.177.amzn2023.aarch64 #1 SMP Tue Jul 16 15:18:22 UTC 2024
sh-5.2$ free --human
               total        used        free      shared  buff/cache   available
Mem:           419Mi       354Mi        11Mi       0.0Ki        52Mi        54Mi
Swap:          4.4Gi       435Mi       4.0Gi
sh-5.2$ df --print-type --human-readable ./ 
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/nvme0n1p1 xfs    30G   19G   12G  63% /
sh-5.2$ cat ./bitcoin.conf 
discover=0
listen=1
maxconnections=24

par=1

blocksonly=1
dbcache=200
maxsigcachesize=4
prune=550

maxmempool=5
blockreconstructionextratxn=1
maxorphantx=1
mempoolexpiry=1
persistmempool=0

disablewallet=1

server=1
rpcallowip=127.0.0.1
rpcuser=btc
rpcpassword=btc
rpcworkqueue=40
rpcthreads=1

printtoconsole=1
nodebuglogfile=1

[main]
rpcport=8332
rpcbind=127.0.0.1:8332
bind=[::]:9333
bind=127.0.0.1:8334=onion

# extra args
logthreadnames=1
logsourcelocations=1
debug=1
debugexclude=libevent
#debugexclude=leveldb
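
Since reproducing the problem depends on matching the setup, one way to spot differences between two nodes is to diff the "Config file arg:" lines from a startup log against a bitcoin.conf like the one above. This is a sketch under assumptions: the function names are made up, the log pattern is taken only from the excerpt earlier in this thread, and negated options such as nodebuglogfile=1 are compared literally, so they may be reported as a difference even when the effective setting is the same.

```python
import re

# Matches e.g.: Config file arg: dbcache="200"  or  Config file arg: [main] bind="..."
LOG_ARG_RE = re.compile(r"Config file arg: (?:\[\w+\] )?(\w+)=(.*)$")


def args_from_log(lines):
    """Collect option -> list of values from 'Config file arg:' log lines."""
    args = {}
    for line in lines:
        match = LOG_ARG_RE.search(line)
        if match:
            args.setdefault(match.group(1), []).append(match.group(2).strip('"'))
    return args


def args_from_conf(text):
    """Collect option -> list of values from bitcoin.conf text (sections ignored)."""
    args = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, _, value = line.partition("=")
        args.setdefault(key.strip(), []).append(value.strip())
    return args


def diff_args(a, b):
    """Return option -> (values in a, values in b) for every option that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

Options that appear on only one side show up with None on the other, which makes settings like extra bind addresses or debug flags easy to pick out.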

maflcko commented on September 26, 2024

In the meantime it could make sense for you to consider upgrading to a more recent version of Bitcoin Core. According to https://bitcoincore.org/en/lifecycle/ and https://bitcoincore.org/en/security-advisories/ , 25.x will be EOL soon and "Medium and High severity bugs will be disclosed 2 weeks after the last affected release goes EOL. This is a year after a fixed version was first released. A pre-announcement will be made 2 weeks prior to disclosure."

(I don't know if there will be any disclosures, but given the advisory, it seems better to attempt an upgrade, than not to)

apulsifer commented on September 26, 2024

FYI, I'm now seeing an outbreak of data corruption, 1 server yesterday, 2 more today, won't be surprised to see more later today or tomorrow...

maflcko commented on September 26, 2024

Both of mine are still up.

What type of RPCs are you calling?

I am not calling any RPCs and I am running a version after #30094 (which may be related).

apulsifer commented on September 26, 2024

Several times a day, we call "bitcoin-cli -rpcclienttimeout=0 getblockcount" to check if bitcoind is still operating. (This is done thru an SSH tunnel, which is the easiest way to check all servers.) That's the only RPC call we've been making at this point.

maflcko commented on September 26, 2024

> getblockcount

yeah, that shouldn't cause OOM, because all it does is serialize a single integer to JSON. I'll downgrade my two versions to 27.1 and see what happens then. Otherwise, I'll try to use different SSDs (c.f. #30159 (comment)).

Can you try if using a single SSD works around the problem for you for now?

If that doesn't pin down the problem, I am not sure how to proceed, because without steps to reproduce, this will be close to impossible to debug or diagnose.

apulsifer commented on September 26, 2024

At some point last night, another server got data corruption, so up to 4 of 10 on this outbreak. The incidence rate is still only about half the machines every two weeks, so it's going to take a while to duplicate this using only data from the live mainnet. (Note, also very unlikely getblockcount is causing OOM because it's only called about 100 times between data corruption issues.)
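
To put a rough number on "it's going to take a while", the quoted rate (about half the machines every two weeks) can be turned into the chance that a small test fleet sees at least one failure. This back-of-envelope sketch assumes failures are independent and occur at a constant rate, which the clustered outbreaks reported in this thread suggest is not really true, so treat the result as an order-of-magnitude guide only. The function name and parameters are made up for the example.

```python
def p_any_failure(machines, days, p_fail=0.5, window_days=14.0):
    """Probability that at least one of `machines` fails within `days`.

    Assumes each machine independently fails with probability `p_fail`
    per `window_days` (defaults: half the machines every two weeks).
    """
    # Per-machine probability of surviving a single day.
    daily_survival = (1.0 - p_fail) ** (1.0 / window_days)
    # All machines must survive every day for there to be no failure at all.
    return 1.0 - daily_survival ** (machines * days)
```

For example, p_any_failure(2, 14) estimates the chance that a two-machine test sees a failure within two weeks under these assumptions.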

Another difference in our setup is that our servers do all of their communication over IPv6, while it looks like yours is only configured with IPv4. (There is some very weak evidence this might be related to the problem -- I have two servers that are only configured with IPv4, and these two servers have never had a data corruption issue, but there are other differences in these servers: one is running testnet, not mainnet, and the other was only running bitcoind for about 2 weeks before bitcoind was shut down because all of that machine's resources were needed for something else.)

Configuring an EC2 instance with IPv6 is a little cumbersome, but works something like this (from my notes):

edit VPC to assign an Amazon-provided IPv6 CIDR address block
edit VPC route table to create a route from ::/0 to the Internet Gateway
(a local route will also automatically be created for the new route table IPv6 CIDR)

edit all subnets:
disable auto IPv4 assignment
[note: IPv4 is still enabled and a private IPv4 address is assigned; if IPv4 were disabled, the instance's NTP client might have to be reconfigured]
add IPv6 /64 address block (each subnet must have a unique [sequential] prefix)

edit security group to allow:
incoming SSH from admin IPv6 address
(an outgoing rule will also automatically be created to allow all traffic to ::/0)

maflcko commented on September 26, 2024

Did you get a chance to see if you can reproduce the crash, if you put all data and swap on a single root SSD?

maflcko commented on September 26, 2024

In the meantime I booted up 4 more machines with zram disabled, two of which use three SSDs, as in your setup.

I'd be highly surprised if leveldb corruption has something to do with ipv6, so I won't be testing that.

I'll let them run for a month or two, but if they don't find anything, I am not sure what to do here.

Without exact and full steps to reproduce every single step, including the exact AWS VM setup, as well as the internal VM setup, there is little that can be done here.

maflcko commented on September 26, 2024

The very first machine I set up on AWS refused ssh access and had to be rebooted (fine so far after a reboot). Maybe all of this is just AWS hardware failures?
