[bug]: No idea why lnd is stuck - could it be compacting for 18+ hours?

bensig commented on May 26, 2024

[bug]: No idea why lnd is stuck - could it be compacting for 18+ hours?

from lnd.

Comments (18)

hieblmi commented on May 26, 2024

Did you recently experience any downtime or outages of your node?
You should do the following:

Stop lnd
Copy the channel.db file into a separate folder
In that separate folder, run chantools compactdb and check the result (see https://github.com/lightninglabs/chantools and https://github.com/lightninglabs/chantools/blob/master/doc/chantools_compactdb.md)
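
The three steps above could be scripted roughly as follows. This is a minimal sketch, not an official procedure: the function names `stage_db_copy` and `compact_copy` and the scratch directory are hypothetical, and the `chantools compactdb` flags are the ones quoted later in this thread. lnd must already be stopped before either function is called.

```shell
#!/bin/sh
# Sketch only: paths and the scratch directory are assumptions.

# Copy the database into a scratch directory so the original stays untouched.
stage_db_copy() {
    src="$1"        # e.g. ~/.lnd/data/graph/mainnet/channel.db
    workdir="$2"    # hypothetical scratch directory
    mkdir -p "$workdir/results" || return 1
    cp "$src" "$workdir/channel.db" || return 1
    # Succeeds only if the copy is byte-identical to the source.
    cmp -s "$src" "$workdir/channel.db"
}

# Run the compaction against the copy only.
compact_copy() {
    workdir="$1"
    ( cd "$workdir" && \
      chantools compactdb --sourcedb channel.db --destdb ./results/compacted.db )
}

# Usage (after stopping lnd):
#   stage_db_copy ~/.lnd/data/graph/mainnet/channel.db /tmp/lnd-compact \
#       && compact_copy /tmp/lnd-compact
```

Working on a copy means a failed or stalled compaction cannot make the original file any worse.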


bensig commented on May 26, 2024

> Can you check the actual standard out of the daemon (and not just the log)? Do you see a panic/stack trace?

How do I do that exactly?

When I run lnd I get:

2024-03-13 11:47:14.649 [INF] LTND: Opening the main database, this might take a few minutes...
2024-03-13 11:47:14.649 [INF] LTND: Opening bbolt database, sync_freelist=false, auto_compact=false
panic: freepages: failed to get all reachable pages (page 256270: multiple references (stack: [256270 255578 256270]))

goroutine 273 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2()
go.etcd.io/bbolt@<version>/db.go:1178 +0x8d
created by go.etcd.io/bbolt.(*DB).freepages in goroutine 1
go.etcd.io/bbolt@<version>/db.go:1176 +0x1e5


guggero commented on May 26, 2024

Yeah, unfortunately if you can't compact the DB anymore it's very likely in a borked state and won't start up again. Restoring from seed and SCB is the safest option, although that will close all channels, which sucks.
But I'm not aware of any way to fix a borked database, unfortunately. Did you have a power outage, or did you have to kill lnd (an unclean shutdown) at some point, which could explain the problem?

If you restore the node from the seed, you might want to start with an SQLite backend, which is more resilient against those sorts of issues.

I'm closing the issue, since there's not really anything more to do here.


hieblmi commented on May 26, 2024

Thanks for flagging this issue. Indeed this takes a long time. Do you know how big the db was when you started the compaction? Have you ever compacted the db before?


bensig commented on May 26, 2024

I believe the db is compacted automatically whenever the process is restarted.

Last restart appears to have happened on Jan 31:

2024-01-31T00:43:15Z

Unsure when that last occurred before this.

Here is my lnd/data/graph/mainnet dir (user redacted):

-rw-r--r-- 1 user user 1023M Mar  5 10:09 /home/user/.lnd/data/graph/mainnet/channel.db
-rw------- 1 user user     8 Jan 30 18:51 /home/user/.lnd/data/graph/mainnet/channel.db.last-compacted
-rw-r--r-- 1 user user  132M Mar  5 10:08 /home/user/.lnd/data/graph/mainnet/sphinxreplay.db
-rw------- 1 user user     8 Jan 30 18:51 /home/user/.lnd/data/graph/mainnet/sphinxreplay.db.last-compacted
-rw-r--r-- 1 user user   65M Mar  8 04:11 /home/user/.lnd/data/graph/mainnet/temp-dont-use.db
-rw-r--r-- 1 user user   33M Mar  5 10:08 /home/user/.lnd/data/graph/mainnet/wtclient.db
-rw------- 1 user user     8 Jan 30 18:51 /home/user/.lnd/data/graph/mainnet/wtclient.db.last-compacted


bensig commented on May 26, 2024

I have now tried to restart without compacting the bbolt database to see if that helps, but it seems to be stuck at "opening" for over an hour already. I'm really not sure about the difference between the bbolt db and the channel db, or what could be causing this...

Changed lnd.conf to:

db.bolt.auto-compact=false

logs:

2024-03-08 03:48:17.282 [INF] LTND: Opening the main database, this might take a few minutes...
2024-03-08 03:48:17.282 [INF] LTND: Opening bbolt database, sync_freelist=false, auto_compact=false
2024-03-08 04:48:26.182 [INF] LTND: Opening the main database, this might take a few minutes...
2024-03-08 04:48:26.182 [INF] LTND: Opening bbolt database, sync_freelist=false, auto_compact=false


C-Otto commented on May 26, 2024

It might help (a lot) to run cat channel.db > /dev/null while lnd is starting/compacting. It seems your system only managed to write 65 MByte after 18 hours of compacting, you might want to check your disk (or storage backend) health.
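
To put that estimate in numbers, here is a quick back-of-the-envelope calculation. It assumes the roughly 65 MB temp file and the 18 hours mentioned in this thread are accurate; the helper name is made up for illustration.

```shell
# Average write rate in bytes per second for a given size (MiB) and duration (hours).
avg_bytes_per_sec() {
    echo $(( $1 * 1024 * 1024 / ($2 * 3600) ))
}

avg_bytes_per_sec 65 18   # prints 1051
```

That is about 1 KiB per second, far below what any healthy disk sustains, which is why a storage problem or a process that is not actually making progress is the natural suspicion.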


C-Otto commented on May 26, 2024

The message "Opening the main database" should only appear once during startup. In your logs it is shown twice, with almost exactly one hour in between. Do you maybe re-start lnd automatically (if it doesn't start up completely within a timeout)?
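
One way to check for a restart loop is to count how often that startup message appears in the log. A small sketch, assuming a plain-text lnd log file; the function names and the log path in the usage comment are hypothetical.

```shell
# Count how many times lnd reached the "open database" step.
count_db_opens() {
    grep -c 'Opening the main database' "$1"
}

# Print the date and time of each such attempt (first two log fields).
startup_times() {
    grep 'Opening the main database' "$1" | awk '{print $1, $2}'
}

# Usage (path is an assumption):
#   count_db_opens ~/.lnd/logs/bitcoin/mainnet/lnd.log
#   startup_times  ~/.lnd/logs/bitcoin/mainnet/lnd.log
```

A count greater than one within a single intended start, with regular gaps between the timestamps, suggests a supervisor is restarting the process.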


bensig commented on May 26, 2024

> It might help (a lot) to run cat channel.db > /dev/null while lnd is starting/compacting. It seems your system only managed to write 65 MByte after 18 hours of compacting, you might want to check your disk (or storage backend) health.

Disk health is good. This is running on a server cluster that is quite healthy.

Running cat /home/user/.lnd/data/graph/mainnet/channel.db >> /dev/null does not seem to help.

Log keeps repeating - I assume this is lnd crashing and the service auto-restarting:

2024-03-08 06:57:12.929 [INF] LTND: Version: 0.17.3-beta commit=v0.17.3-beta, build=production, logging=default, debuglevel=info
2024-03-08 06:57:12.929 [INF] LTND: Active chain: Bitcoin (network=mainnet)
2024-03-08 06:57:12.930 [INF] RPCS: RPC server listening on [::]:10009
2024-03-08 06:57:12.930 [INF] RPCS: RPC server listening on 0.0.0.0:10009
2024-03-08 06:57:12.935 [INF] RPCS: gRPC proxy started at [::]:10080
2024-03-08 06:57:12.935 [INF] RPCS: gRPC proxy started at 0.0.0.0:10080
2024-03-08 06:57:12.935 [INF] LTND: Opening the main database, this might take a few minutes...
2024-03-08 06:57:12.935 [INF] LTND: Opening bbolt database, sync_freelist=false, auto_compact=false
2024-03-08 06:57:24.181 [INF] LTND: Version: 0.17.3-beta commit=v0.17.3-beta, build=production, logging=default, debuglevel=info
2024-03-08 06:57:24.181 [INF] LTND: Active chain: Bitcoin (network=mainnet)
2024-03-08 06:57:24.182 [INF] RPCS: RPC server listening on [::]:10009
2024-03-08 06:57:24.182 [INF] RPCS: RPC server listening on 0.0.0.0:10009
2024-03-08 06:57:24.185 [INF] RPCS: gRPC proxy started at [::]:10080
2024-03-08 06:57:24.186 [INF] RPCS: gRPC proxy started at 0.0.0.0:10080
2024-03-08 06:57:24.186 [INF] LTND: Opening the main database, this might take a few minutes...
2024-03-08 06:57:24.186 [INF] LTND: Opening bbolt database, sync_freelist=false, auto_compact=false


C-Otto commented on May 26, 2024

How do you start lnd? Maybe systemd is doing weird things?


guggero commented on May 26, 2024

Can you check the actual standard out of the daemon (and not just the log)? Do you see a panic/stack trace?


bensig commented on May 26, 2024

> How do you start lnd? Maybe systemd is doing weird things?

Yeah I am using systemd services and docker...


bensig commented on May 26, 2024

> Can you check the actual standard out of the daemon (and not just the log)? Do you see a panic/stack trace?

How do I do that exactly?
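
For a setup with systemd and docker, the daemon's standard output usually ends up in the journal or in the container logs rather than in lnd's own log file. A hedged sketch of how one might look for a panic there: the unit name `lnd.service` and the container name `lnd` are assumptions, and `find_panic` is a hypothetical helper.

```shell
# Print any line containing a Go panic plus the 10 lines that follow it.
# Reads from stdin so it works with any log source.
find_panic() {
    grep -A 10 'panic:'
}

# Usage (unit and container names are hypothetical):
#   journalctl -u lnd.service --no-pager | find_panic
#   docker logs lnd 2>&1 | find_panic
```

Running lnd once in the foreground from a terminal shows the same output directly, without going through the service manager.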


guggero commented on May 26, 2024

Ugh, that looks like data corruption in the freepages. Did you attempt the steps in this comment? #8532 (comment)


bensig commented on May 26, 2024

> Ugh, that looks like data corruption in the freepages. Did you attempt the steps in this comment? #8532 (comment)

Yes, actually I did try it. I ran this a few times:

chantools compactdb --sourcedb channel.db --destdb ./results/compacted.db

...and it seems to stall with no errors in the log and the compacted version seems to be stuck at a size of 57M.

I'll run it in a screen for a few hours to see if it is still stuck at 57M.
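
Rather than watching the file by hand, a small helper can report whether the output has stopped growing. This is only a sketch, under the assumption that a stalled compaction shows up as an unchanged file size; the function name and the interval are illustrative.

```shell
# Succeeds (exit 0) if the file size did not change during the interval.
is_stalled() {
    file="$1"
    interval="$2"   # seconds to wait between the two measurements
    before=$(wc -c < "$file")
    sleep "$interval"
    after=$(wc -c < "$file")
    [ "$before" -eq "$after" ]
}

# Usage: check the compacted copy over ten minutes.
#   is_stalled ./results/compacted.db 600 && echo "no growth in 10 minutes"
```

An unchanged size over a long interval does not prove a hang on its own, but combined with no log output it is a strong hint.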


bensig commented on May 26, 2024

It does not seem to get past 57M @guggero @hieblmi

Any other ideas?

Or do I just clear the data and restore from the seed and channel backup...


bensig commented on May 26, 2024

What are the steps to restore from seed and scb?


hieblmi commented on May 26, 2024

You can find the steps here https://docs.lightning.engineering/lightning-network-tools/lnd/disaster-recovery.

