superfly / litefs Goto Github PK

FUSE-based file system for replicating SQLite databases across a cluster of machines

License: Apache License 2.0

Go 99.88% Dockerfile 0.09% Makefile 0.01% Shell 0.02%

litefs's Introduction

LiteFS

LiteFS is a FUSE-based file system for replicating SQLite databases across a cluster of machines. It works as a passthrough file system that intercepts writes to SQLite databases in order to detect transaction boundaries and record changes on a per-transaction level in LTX files.

This project is actively maintained but is currently in a beta state. Please report any bugs as an issue on the GitHub repository.

You can find a Getting Started guide on LiteFS' documentation site. Please see the ARCHITECTURE.md design document for details about how LiteFS works.

SQLite TCL Test Suite

It's a goal of LiteFS to pass the SQLite TCL test suite, however, this is currently a work in progress. LiteFS doesn't have database deletion implemented yet so that causes many tests to fail during teardown.

To run a test from the suite against LiteFS, you can use the Dockerfile.test to run it in isolation. First build the Dockerfile:

docker build -t litefs-test -f Dockerfile.test .

Then run it with the filename of the test you want to run. In this case, we are running select1.test:

docker run --device /dev/fuse --cap-add SYS_ADMIN -it litefs-test select1.test

Contributing

LiteFS contributions work a little different than most GitHub projects. If you have a small bug fix or typo fix, please PR directly to this repository.

If you would like to contribute a feature, please follow these steps:

Discuss the feature in an issue on this GitHub repository.
Create a pull request to your fork of the repository.
Post a link to your pull request in the issue for consideration.

This project has a roadmap and features are added and tested in a certain order. Additionally, it's likely that code style, implementation details, and test coverage will need to be tweaked so it's easier to for me to grab your implementation as a starting point when implementing a feature.

litefs's People

Contributors

Stargazers

Watchers

Forkers

freekrai hvt jamesyang1986 seeekr nanderoo tanbinh123 akbartrilaksana lromano72 sohankunkerkar richardsonjf crt-fork davidalphafox mtlynch cuulee leokhoa super-rain jwhear r00tc0d3 suryatmodulus isgasho darthshadow anietieakpan frew shnwang xxj123go tide999 jimafisk ryanrussell zeina1i fireflank warmchang tiborvass anacrolix doytsujin cxz david00154 gemwrt dangra ralferoo kiminohero gitsrc renesugar dastbe dylanlingelbach legionxiong walterwanderley markiannucci qhxin gx14ac gkyildirim abcdlsj seanwatters koeng101 ccll renanoliveira0 artisdom shakahl ming535 infdahai registrar-shyran tosunkaya pearl39962118 laiga-pofu fairhopeweb suryamodulus earlofurl xmonader andyli029 fatelei rqlite h0rowitz tachunwu ajayk thesamecat2 justanotherdot jamestiotio xeronith cloudleague brunoscaglione mattcburns imneov vsujeesh pghildiyal wolfi-chainguard-demo muhtasimtanmoy wangqiang1588 devdevdany luis-pinto-fanduel ahdyt

litefs's Issues

Streaming S3 Backups

LiteFS provides some redundancy by running in a cluster, however, losing all nodes would cause all data to be lost. Replicating to S3 in manner similar to Litestream would provide high durability (11 9s) as well as allow point-in-time restores.

As opposed to Litestream, LiteFS is designed for efficient compaction of transaction files so restore time should be much faster.

Require data directory to be set

Right now, the data directory is automatically derived from the mount directory. However, this can be confusing so it's probably better to explicitly set the mount and data directory.

Originally mentioned by David Cameron on Twitter: https://twitter.com/dave_cameron/status/1571514794989191169

Implement STATFS FUSE call

Support Kubernetes Lease API

Kubernetes has something similar to consul sessions called the Lease API. Could we use it to simplify deployments in k8s so there is no dependency on consul?

Investigate `SingleThreaded` race error

Currently, the race detector reports an unusual race when running go-fuse as-is: #5

This has been fixed by enabling SingleThreaded but that shouldn't be necessary. More investigation will be necessary to figure out what's going on.

Enforce LTX retention

Currently, LiteFS will retain LTX files locally forever. That's obviously not ideal. Retention enforcement wasn't added because it wasn't clear at first whether databases would be served out of the LTX pages or from the database file itself. Now it looks like serving from the database is the best approach as it make LiteFS a passthrough file system.

LTX files should be retained based on a time limit for now. Maybe 5 minutes by default? A size based retention should be implemented at some point in the future too but a time-based retention should be a good first start.

is a proxy needed ?

web servers using litefs don't know which litefs server to read from or write from.

SO i presume a proxy ir needed, so that this is automatically managed for them ?

also what about a SQL statement that has a write and read in it ? Again i presume the proxy would handle this ?

https://github.com/CECTC/dbpack is an golang db proxy that does what i am suggesting.

Mmap the underlying file

Would it be possible to open the underlying database file with mmap for better performance or would it not be very useful since all queries need to pass through the FUSE layer anyway, which would be the bigger bottleneck?

Litefs crash

Was restarting litefs and it was connecting back to the primary and it happened. Can't seem to reproduce this, but this is the stacktrace off of the build of 37d2e5d.

root@9d625ce1:/mnt/litefs# fgstream connected
stream disconnected
http: panic serving [fdaa:0:2fff:a7b:2c00:163d:ee67:2]:51588: runtime error: invalid memory address or nil pointer dereference
goroutine 132 [running]:
net/http.(*conn).serve.func1()
        /gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:1802 +0xb9
panic({0x949b80, 0xe43c90})
        /gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/runtime/panic.go:1047 +0x266
github.com/superfly/litefs.(*DB).Pos(0xc00027ee28)
        /current/litefs/db.go:116 +0x39
github.com/superfly/litefs/http.(*Server).streamDB(0xc000170aa0, {0xaa7f10, 0xc0001f8f00}, {0xaa5090, 0xc0004027e0}, 0x6a6f85, 0x1000000009bf760)
        /current/litefs/http/server.go:215 +0x31e
github.com/superfly/litefs/http.(*Server).handlePostStream(0xc000170aa0, {0xaa5090, 0xc0004027e0}, 0xc000284c00)
        /current/litefs/http/server.go:181 +0x549
github.com/superfly/litefs/http.(*Server).serveHTTP(0xc000170aa0, {0xaa5090, 0xc0004027e0}, 0xc000284c00)
        /current/litefs/http/server.go:138 +0x251
net/http.HandlerFunc.ServeHTTP(0x0, {0xaa5090, 0xc0004027e0}, 0x10)
        /gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:2047 +0x2f
net/http.serverHandler.ServeHTTP({0xaa37d8}, {0xaa5090, 0xc0004027e0}, 0xc000284c00)
        /gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:2879 +0x43b
net/http.(*conn).serve(0xc000416780, {0xaa7fb8, 0xc000273cb0})
        /gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:1930 +0xb08
created by net/http.(*Server).Serve
        /gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:3034 +0x4e8

Consul Node Registration

Using Consul's session API works fine without an associated node in a single tenant environment. However, when using Consul in a multitenant environment with ACLs, the session_prefix will apply to the node name in the session and will be rejected as a permission error if one is not present.

The fix is to provide a means for specifying a node name that will be registered on start up. LiteFS uses time-based sessions and will eventually hand off sessions between nodes so we want to register a single node for all LiteFS instances. This node will not have any checks associated with it.

Cannot find .primary file

Here's my litefs config: https://github.com/kentcdodds/kentcdodds.com/blob/dev/other/litefs.yml

So based on what I've read and the example, I would expect to find the .primary file at /litefs/data/.primary. I only have two regions and I've SSH-ed into both of them and cannot find that file. I also checked at /data/.primary and it wasn't there either.

What have I got wrong?

Prometheus Metrics

LiteFS should expose a set of Prometheus metrics via a GET /metrics endpoint on the HTTP server.

Metrics

Store metrics

Number of databases
Primary status
Number of subscribers
Frames sent by database/type

Database metrics

LTX corruption

After failover testing done by fly restart vm:

2022-08-25T03:43:05Z app[8cd3b300] sea [info]cannot open store: open databases: open database: db=00000001 err=recover ltx: read ltx file header (0000000000000062-0000000000000062.ltx): unmarshal header: invalid LTX file

As far as I know, there was no activity for probably 10-30 seconds before the restart. Possibly longer. I'm attaching the entire directory as a base64 in hope that it is of some use to you.

root@8cd3b300:~/db/00000001/ltx# ls -latr
total 236
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000050-0000000000000050.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000051-0000000000000051.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000052-0000000000000052.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000053-0000000000000053.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000054-0000000000000054.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000055-0000000000000055.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000056-0000000000000056.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000057-0000000000000057.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000058-0000000000000058.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 0000000000000059-0000000000000059.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005a-000000000000005a.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005b-000000000000005b.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005c-000000000000005c.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005d-000000000000005d.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005e-000000000000005e.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005f-000000000000005f.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 0000000000000060-0000000000000060.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 0000000000000061-0000000000000061.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 0000000000000062-0000000000000062.ltx

$ base64 ltx.tar.gz 
H4sIAAAAAAAAA+3dcWxUdwHA8d+1TMAxD1mAIIrHKUhnr7z37t4dbBCHs4FGttD2gLa0tO/ag5XQ
NvRu1CUkBEKmkf4BmLS6RMymfzANMGOAhYxW0dhlY0khyBBHNFoSp53RZMPUwfTeXdvAQ37bvVvf
3ct9P0B/l+v9+rsfb/+wfO/3FCVNXS6mjrlARNdFZinFOqYfq6GgourhSDioCkVVQiFN+PQpfE+T
nkkkjW6fT3R3dSVlr/uo77uUMnH9dya/NVX/DWR//VVNV7n+Trjr+it3CasB6xMVqVfZWyMcCt3n
+mupX3de/9Tzqq6GdOFTPvnt3qvIr//6aJ2aeTTb/FKa+uMZ/5ZhGUVf5ORLu3sPnbKxjKe2en17
Mu7b1tXdYSR9wdRqHo94PPX3nvn5pXe8tiT1Z9qdcz/GzzdERfPnHjInez8lvLtSXwAAAAAAyJv2
kunzFyzw7AskjdjOeGvXM53JROZr6RM1lWuilb7omq+vr/T5M0/6fcv87W1+X9VT0cq1lTW+DTVV
T66pqfd9s7K+3Ofvjm9v7+r0+6KVddHy8SmTry3bN8szff7y5Z59a9NrJVqfjncYzR3t27uNZGpa
4p4nSu5+B/d833wzu+PdifSa//8dtXcm4t3JeFuzkcy8reZvpH5itOrJyrLU5kvNf58/4L0ovDe9
73pHvH9IPQQAAAAAAIVuxgPTxMOejnZj0fRpqUczjY7E9NKJByXjD2Z4zBcl4ka6MxjZ89s3jwcO
fO1I59u+3UNXD+R7C8VO0v/obQHrEw70P1rY7H+Capj+xwmS/qfJMoqbz3/w+oPBwYiNZWT9T5PI
vf9pov8BAAAAABQO+h/6HwAAAAAAXCa7/idgTjlw8ciJ4SM/O3fse77a9cc65+R7C8VO1v9ss/Y/
25zofyLp/kcL0v84QdL/NFtG8eKsstJXd/X5bSwj63+aRe79TzP9DwAAAACgcND/0P8AAAAAAOAy
2fU/6duLDa99fPv24f3H/nN69qk5Twe0fG+h2Mn6n1Zr/9Pq4Pk/IfofJ0j6n0bLKMZ23HhvZnR4
v41lZP1Po8i9/2mk/wEAAAAAFA76H/ofAAAAAABcJrv+p9ycMnEPoauDn/nL787+9Ev53kKxk/U/
qrX/Uae+/9FD6fN/VF2j/3GCpP+ptoxioHfu3lcGq9bZWEbW/1SL3PufavofAAAAAEDhoP+h/wEA
AAAAwGWy63/Src+FoT0LWxc1nX9nVfn1X998c26+t1DsZP1P3Nr/xJ07/0dT6H+cIOl/tlrGyXv3
2VhG1v9sFbn3P1vpfwAAAAAAhYP+h/4HAAAAAACXya7/qTCnvDirrPTVXX3+kfdLd5z/9+If5XsL
xU7S/4SVgPUJJ/qf8fN/VPofJ0j6nxbLOHnvPhvLyPqfFpF7/9NC/wMAAAAAKBz0P/Q/AAAAAAC4
THb9Tzrn6IucfGl376FTfaMHj/krL9/K9xaKnez8n6D1/J+gA/f/0oOZ/idI/+MESf9TaxnF6LWh
097bLU02lpH1P7Ui9/6nlv4HAAAAAFA46H/ofwAAAAAAcJns+p8l5pSjh09cutyYfOvgh5/v/+z8
xXbOEsEnSNb/hK39T9iJ/kdP9z9h+h9HSPqfTZZR3Pjb77/z0PPev9pYRtb/bBK59z+b6H8AAAAA
AIWD/of+BwAAAAAAl8mu/1lmTnmtJ7bEeFbr+fmHyox/PHXtVL63UOxk/Y9u7X90J/qfUKb/Ueh/
nCDpfzZaRtH/wsujVx5bGLOxjKz/2Shy73820v8AAAAAAAoH/Q/9DwAAAAAALpNd//MVc8rEGSLf
/f7DaldgLJLvLRQ7Wf8Ts/Y/sanvf8Ja2Ox/gqpG/+MESf+zxTKKg4t7D116/zczbCwj63+2iNz7
ny30PwAAAACAwkH/Q/8DAAAAAIDLZNf/fNWcMrbjxnszo8P7fzBnef3ALxbMzvcWip2s/zGs/Y/h
RP+Tvv9XUInQ/zhB0v80WEZx/FZk9cC8sX4by8j6nwaRe//TQP8DAAAAACgc9D/0PwAAAAAAuEx2
/c8j5pSJM0S+Xb3quc7yuefyvYViJ+t/FGv/ozhw/69Q+vwfNUT/4whJ/7PBMoozF688cv1Xr9/Z
6nxcsv5ng8i9/9lA/wMAAAAAKBz0P/Q/AAAAAAC4THb9j9+cMtA7d+8rg1Xr3r4+Uv/Bc/9tzfcW
ip2s/9Gs/Y/mQP+jK5n+J0z/4wRJ/1NjGcWFoT0LWxc1nbexjKz/qRG59z819D8AAAAAgMJB/0P/
AwAAAACAy2TX/3zZnDJ6bei093ZL0xsN3pF158I/zPcWip2s/1lp7X9WOnH/r1C6/wnT/zhC0v/U
W0bxxzNb6+Nn3v2CjWVk/U+9yL3/qaf/AQAAAAAUDvof+h8AAAAAAFwmu/7H/H8A4vityOqBeWP9
b0S+OPinRIMjiQfuT9b/rLD2PyscOP8nomb6H53+xwmS/qfOMoqzqx5cc/bq0ts2lpH1P3Ui9/6n
jv4HAAAAAFA46H/ofwAAAAAAcJns+p9l5pSJM0T+9eO/HyrpaazI9xaKnaz/iVj7n4gT/U/m/l9h
jf7HCZL+Z7NlFK/1xJYYz2o9NpaR9T+bRe79z2b6HwAAAABA4aD/of8BAAAAAMBlbJz/M3GGyPmy
vYGTlw//M99bKHaS/iesBaxPOHD/r2Dm/B9dof8BAAAAAAAAAAAAAAAAPors/J+Q9fyfkAPn/+ih
TP8Tpv9xguT8n6hlFEcPn7h0uTH5lo1lZOf/REXu5/9EOf8HAAAAAFA4OP+H838AAAAAAHCZ7M7/
WWpO6X/h5dErjy2M/aR2dWDvp5cezfcWit1k/9NmJI2YkYhP0Rr373+C6cd3n/+jqRr3/3KEpMuJ
idy7nBhdDgAAAACgcBRvl2P+29zj/bNI/QYAAAAAAK4178CFX555Z+iJUhHUFE0LKCsCmh5VtEcV
/VFtJZ/JAQAAAADAhbL7TI42+fmPTqNjKj77YZJ//sNcXR///IceCqrm+a+qFgny+Q8nJOOJZFus
IrFrZ3tyqq4/AAAAAAAAgPz4H9qMJFcA+AIA

Include journal filenames in root listing

Listing files in the mounted directory doesn't include journal files, although it works because looking up for individual filenames returns the correct handles.

$ ls dbs/ -lh
total 0
-rw-rw-rw- 1 daniel daniel 8.0K Oct 19 19:27 state.db

$ ls dbs/state.db-wal -lh
-rw-rw-rw- 1 daniel daniel 57K Oct 19 19:27 dbs/state.db-wal

$ ls dbs/state.db-shm -lh
-rw-rw-rw- 1 daniel daniel 32K Oct 19 19:27 dbs/state.db-shm

As expected, the files exists on the data dir:

$ ls .litefs/dbs/state.db/ -lh
total 376K
-rw-rw-r-- 1 daniel daniel 8.0K Oct 19 19:28 database
drwxrwxr-x 2 daniel daniel 4.0K Oct 19 19:29 ltx
-rw-rw-r-- 1 daniel daniel  32K Oct 19 19:29 shm
-rw-rw-r-- 1 daniel daniel 326K Oct 19 19:29 wal

This is confusing but it doesn't affect SQLite clients per Ben's words:

it's the FUSE ReadDir() implementation I have. It's just listing out the databases right now. it doesn't affect the functionality as SQLite uses LOOKUP calls instead of READDIR but I should fix that.

Clarification on post-split-brain behavior

The architecture doc states:

When the old primary node connects to the new primary node, it will see that its checksum is different even though its transaction ID could be the same. At this point, it will resnapshot the database from the new primary to ensure consistency.

How does this appear to the application?

Exit gracefully on clean exec exit

Candidates

Currently, every node in a LiteFS cluster is a candidate to become the primary. However, in practice, users may want to keep their primaries in a single region for consistent performance. The litefs.yml config file should have a flag to indicate whether a node is a candidate or can only be a replica.

etcd support

Hi there,

Thanks for creating the project, and litefs seems a good fit for my use case, but internally we use etcd instead of consul, and we would like to reduce dependency as much as possible (which is why users are using sqlite in many cases). I wonder if there is any chance we add an additional abstraction for consul so that both etcd and consul can be used? Thanks.

Disk IO errors with simple usage

I've tried working through the README directions but consistently run into Disk IO errors. I'm using the 0.1.0 release. I've distilled everything down to a simple Bash script for reproducing:

#!/bin/bash
set -xe

# Kernel and OS info for debugging
uname -r
cat /etc/os-release

# Assumes you want to place a specific version of the LiteFS binary in the local
#  directory for testing
LITEFS_CMD="./litefs"

function cleanup {
    # Terminate all child processes
    pkill -P $$
}
trap cleanup EXIT

consul agent -dev &
sleep 2 # give Consul a few seconds to start

# We'll create everything in a scratch directory called "repro"
# Blow away left-overs from any previous run
rm -rf repro
mkdir -p repro/data

cat <<EOF >repro/litefs.yml
mount-dir: repro/data
debug: true
http:
    addr: ":20202"

consul:
    url: "http://localhost:8500"
    advertise-url: "http://localhost:20202"
EOF

$LITEFS_CMD -config repro/litefs.yml &
sleep 2 # give LiteFS a few seconds to start

echo "====== Running SQLite against the test database now ========"
sqlite3 -column repro/data/test.db <<EOF
PRAGMA journal_mode;
CREATE TABLE test (foo INT);
INSERT INTO test VALUES (1);
EOF

echo "Success!"

The output of running this is attached
repro.log

Continually improve test coverage

As LiteFS is a database replication tool, test coverage should be high—likely 80-90% test coverage. Some areas of LiteFS may be difficult to report test coverage around such as the FUSE file system but coverage should still exist even if it is not reported.

Allow litefs to keep running if an error occurs

If litefs encounters an error on startup, it will return an error and exit immediately. This is problematic for ephemeral systems as it's impossible to debug the state of the system or any mounted volumes.

Instead, default litefs to report an error but keep running until it receives a signal to stop. Add a flag called ExitOnError bool to the config to change this behavior.

/cc @kentcdodds

Gracefully handle WAL PRAGMA

Until #14 is implemented, LiteFS should gracefully prevent WAL mode from being enabled.

Support WAL Mode

LiteFS currently only supports the rollback journal, however, it should also be able to convert WAL writes to LTX files as well.

Primary-Initiated Replication

A comment on HN had an interesting idea of having the primary walled off from the other nodes and having the primary reach out to replicas instead of replicas connecting to the primary. It could provide security benefits as the primary node(s) could have tighter firewall controls.

Specifying same data & mount directory causes system to hang

Hello,

I tried latest litefs 0.2.0 and it seems to stuck at some point. It is working with 0.1.1 though.
When I start with litefs I can not access database and even list the mount directory.

This is how I start:

litefs -config litefs_SPA.yml

litefs_SPA.yml:

mount-dir: '/mnt/db'
data-dir: '/mnt/db'

http:
  addr: ':20202'

consul:
  url: 'http://10.16.18.230:8500'
  advertise-url: 'http://10.16.18.228:20202'

Also there is kernel info message:

Tue Oct 25 09:15:15 2022]  </TASK>
[Tue Oct 25 09:15:15 2022] INFO: task ls:2307 blocked for more than 120 seconds.
[Tue Oct 25 09:15:15 2022]       Tainted: P           OE     5.15.0-52-generic #58~20.04.1-Ubuntu
[Tue Oct 25 09:15:15 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Oct 25 09:15:15 2022] task:ls              state:D stack:    0 pid: 2307 ppid:  1986 flags:0x00004004
[Tue Oct 25 09:15:15 2022] Call Trace:
[Tue Oct 25 09:15:15 2022]  <TASK>
[Tue Oct 25 09:15:15 2022]  __schedule+0x2cd/0x8a0
[Tue Oct 25 09:15:15 2022]  schedule+0x4e/0xc0
[Tue Oct 25 09:15:15 2022]  request_wait_answer+0x136/0x210
[Tue Oct 25 09:15:15 2022]  ? wait_woken+0x60/0x60
[Tue Oct 25 09:15:15 2022]  fuse_simple_request+0x1ac/0x2f0
[Tue Oct 25 09:15:15 2022]  fuse_do_getattr+0xd7/0x340
[Tue Oct 25 09:15:15 2022]  fuse_getattr+0xa9/0x130
[Tue Oct 25 09:15:15 2022]  vfs_getattr_nosec+0xba/0xe0
[Tue Oct 25 09:15:15 2022]  vfs_getattr+0x37/0x50
[Tue Oct 25 09:15:15 2022]  vfs_statx+0x89/0x110
[Tue Oct 25 09:15:15 2022]  __do_sys_newlstat+0x3e/0x80
[Tue Oct 25 09:15:15 2022]  __x64_sys_newlstat+0x16/0x20
[Tue Oct 25 09:15:15 2022]  do_syscall_64+0x59/0xc0
[Tue Oct 25 09:15:15 2022]  ? exit_to_user_mode_prepare+0x3d/0x1c0
[Tue Oct 25 09:15:15 2022]  ? do_user_addr_fault+0x1e0/0x660
[Tue Oct 25 09:15:15 2022]  ? irqentry_exit_to_user_mode+0x9/0x20
[Tue Oct 25 09:15:15 2022]  ? irqentry_exit+0x1d/0x30
[Tue Oct 25 09:15:15 2022]  ? exc_page_fault+0x89/0x170
[Tue Oct 25 09:15:15 2022]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Tue Oct 25 09:15:15 2022] RIP: 0033:0x7f7244ed557a
[Tue Oct 25 09:15:15 2022] RSP: 002b:00007ffca76b3b08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
[Tue Oct 25 09:15:15 2022] RAX: ffffffffffffffda RBX: 00005555e6434550 RCX: 00007f7244ed557a
[Tue Oct 25 09:15:15 2022] RDX: 00005555e6434568 RSI: 00005555e6434568 RDI: 00007ffca76b3b10
[Tue Oct 25 09:15:15 2022] RBP: 00007ffca76b3ec0 R08: 0000000000000001 R09: 00000000e643ab00
[Tue Oct 25 09:15:15 2022] R10: 00007ffca76b3b14 R11: 0000000000000246 R12: 00005555e643ab43
[Tue Oct 25 09:15:15 2022] R13: 0000000000000003 R14: 00007ffca76b3b10 R15: 00005555e6434568
[Tue Oct 25 09:15:15 2022]  </TASK>

Maybe this has something to do with fuse but dont know how to dig it. As I said it is working with 0.1.1.

Thanks for this great project and let me know if I can help more.

Non-default page size

LiteFS is tested against the default 4KB page size, however, it should be able to handle any page size from 512 bytes to 64K.

Recover on matching TXID with checksum mismatch

I just deployed LiteFS to den and maa and I'm getting a bunch of logs that I'm concerned about:

2022-10-25T22:15:01.166 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:01.286 app[2a69c631] den [info] stream connected

2022-10-25T22:15:01.287 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:01.420 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:02.670 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:02.790 app[2a69c631] den [info] stream connected

2022-10-25T22:15:02.924 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:04.182 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:04.437 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:04.753 app[2a69c631] den [info] HEAD / 200 89545 - 49.571 ms

2022-10-25T22:15:04.755 app[2a69c631] den [info] GET /healthcheck 200 - - 64.283 ms

2022-10-25T22:15:05.658 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:05.780 app[2a69c631] den [info] stream connected

2022-10-25T22:15:05.780 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:05.913 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:07.138 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:07.259 app[2a69c631] den [info] stream connected

2022-10-25T22:15:07.262 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:07.394 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:08.615 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:08.735 app[2a69c631] den [info] stream connected

2022-10-25T22:15:08.737 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:08.875 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

2022-10-25T22:15:10.107 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica

2022-10-25T22:15:10.228 app[2a69c631] den [info] stream connected

2022-10-25T22:15:10.228 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320

2022-10-25T22:15:10.362 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3

I'm not sure what these mean 🤔

Support file-based SQLite database initialization

Ideally, it'd be nice to support the creation or replacement of a database with simple file commands. e.g.

cp /path/to/src.db /mnt/target.db

Synchronous Replication

LiteFS should allow users to specify that they require acknowledgement of a write before accepting new writes.

Getting non-matching LTX checksum on fresh volume

https://github.com/kentcdodds/kentcdodds.com/actions/runs/3316512422/jobs/5478478215

cannot open store: open databases: open database("sqlite.db"): verify database file: database checksum (e3d3906d74cc0273) does not match latest LTX checksum (0000000000000000)

This volume is brand new and completely empty. @benbjohnson said this is a bug that needs fixing and asked me to open this issue. More context at https://www.youtube.com/watch?v=vTNPJGKqsYQ

Thanks!

PERSIST & TRUNCATE Journal Mode

LiteFS currently handles the DELETE rollback journal mode. It should also handle PERSIST & TRUNCATE.

Write Forwarding

Currently, LiteFS supports a single primary node that performs all the writes. However, there are situations where it would be useful to have multiple nodes that can write—even if it means taking a performance hit. Two common examples are background jobs & out-of-band migrations.

This could work by having the primary handoff the write lock to another node temporarily:

Given N₁ is the primary and N₂ is a replica.
N₂ sends a request to acquire a write lock from N₁.
N₁ acquires the write lock on behalf of N₂ and holds it for the duration of the request.
N₂ ensures it has the latest transaction data.
N₂ executes its write transaction locally.
N₂ sends the LTX file for the transaction back to N₁
If N₁ still holds the lock, it commits the transaction and notifies N₂
If N₁ no longer holds the lock or is demoted, it rejects the transaction and notifies N₂

It is to be determined exactly how the lock handoff is requested by the client application. It could be transparent but that could cause users to experience slow performance if they are not correctly forwarding writes when they can. Maybe this should be a flag in the config to enable it?

LTX Event Data

The LTX file format is designed to store optional event data (similar to SSE) for each transaction file. This can be useful as replicas may need to know when and how data changes instead of polling their local copy for changes.

This will need to be implemented as something like a file handle so the application can write events to it. Notifications can also be implemented as a file handle or an HTTP endpoint.

Snapshots

Since LTX files need to be removed after a while (#48), we need to be able to take a snapshot of the full database as an LTX file. For the initial implementation, obtaining a SHARED lock on the database and then copying the pages to an ltx.Writer should suffice. This will prevent writes during a snapshot but this can be alleviated by supporting WAL mode (#14).

A snapshot LTX file should be served to HTTP clients if they connect and their TXID is not available as an LTX file.

LTX Encryption

LTX files are designed to support encryption so that remote storage, such as AWS S3, will not be able to read the underlying data. Currently, I'm leaning toward using AES-GCM-SIV from the Tink project.

Slow loading of sqlite dump file

Not sure if the project is considered stable enough for such testing right now, but gave it a try anyway. Feel free to close the issue if it's not suitable right now and I can give it a try again later.

I tried loading one of my bigger databases (~ 4 GB) by dumping it into a sql file (via .dump) and then loading it into a new DB on a litefs mount (via .read) and these are my observations for the same.

My first try was with the regular sqlite3 tool on a regular filesystem (non-litefs mount) just to see the time taken (~5 mins).

Trying the same on a litefs mount ran for around ~16 mins before failing with an EIO error of nonsequential page numbers in snapshot transaction. Trying the same multiple times gave me the same error.

I thought it was due to the single transaction (added by sqlite as part of the .dump command to the output) and perhaps litefs was not handling such a large transaction properly so I removed the BEGIN TRANSACTION & COMMIT lines from the file before trying again.

This did run for longer but the speeds slowed down drastically. It was able to process only 300 MB (verified by checking the size of the underlying database file) of the ~ 4 GB in 70+ mins file but generated ~ 11 GB of pages in the ltx folder consisting of 650k+ individual .ltx files.

Is this expected for now till the performance improvements come in later or can something be done?

Local setup guide

I am trying this project for the first time on my laptop. I tried building both litefs and the example from source.
I got the dependencies from the github pipelines.
Installed consul on my laptop too.

Everything compiled and runnable. Unfortunately, I couldn't make the replication work.

I tried tweaking the litefs.yml file configurations but it doesn't replicate. I am pretty sure i am doing something wrong here.

Thanks 🙏

Expose primary information

Directing writes to the primary node is the job of the application, so the primary's hostname needs to be exposed somehow. At present the application can get this by querying Consul, but if #37 and/or #23 are implemented this will no longer be the case. I propose exposing this information via a file in the shadow directory, similar to Litestream's .primary.

SQLite Test Suite

While LiteFS is mostly a passthrough file system, it'd be good to run the SQLite test suite against it to verify correctness. Most of the test suite does not test persistence so only a subset is valuable. The main limiting factors for implementing this is my lack of Tcl knowledge and WAL mode being unsupported right now (#14).

Support for other embedded databases (e.g., DuckDB)

Wonderful project, @benbjohnson! Thank you for open sourcing it!

As the name implies, the primary litefs use case is around SQLite. I'm curious how tightly coupled to SQLite internals the intercept-writes-at-the-file system-level approach is. What challenges you envision in supporting other embedded databases like DuckDB?

(Apologies in advance if you prefer not to use issues to answer questions. I'm happy to ask on another forum if that's your preference. Thank you again for your wonderful software.)

Expose TXID

Because LiteFS requires that the application layer direct writes to the primary node, many use cases will need the transaction ID to be exposed to the application layer. The simplest example of this is a replica node sending a write to the primary, waiting for it to succeed, then querying its local database with the expectation that the write will have appeared locally. This can be accomplished by having the application on the primary node return the transaction ID of the write and the application on the replica busy-loop until the local transaction ID is equal or greater to the returned ID. Litestream exposes a position file that allows this kind of functionality and it seems reasonable for LiteFS to do the same by exposing TXID and checksum as files in the shadow directory.

Some implementation considerations from my brief experience working with Litestream:

Storing checksum and txid in separate files would be convenient for readers but we could get into trouble with atomicity: if the timing is just so, a reader could observe a checksum and a txid that are out of sync. Storing both in the same file probably makes it much easier to guarantee consistency.
Make the file(s) fixed length and thus mmap friendly. A naive getPosition/getTXID function would open, read, and close: three system calls to fetch what essentially boils down to an incrementing integer. Much more efficient to simply mmap the file on startup and thus have a live view of the data in the process memory.
The Unix way is to make the file text so that anyone can cat it and see the current id. But it would be ever so much more convenient (for me) if it were stored in binary form in the system endianness: getting the current, ready for comparison TXID would then be as simple as a pointer dereference. That said, storing as zero-padded text is still probably the Right thing to do.
This is getting nitty-gritty on performance, but one problem with the Litestream position file is that the three components are stored with 1-byte separators. I was originally mapping the whole file into memory as a struct where the three members were a union { char[16], u128 } but this resulted in unaligned loads which are not great for performance. I'd recommend either storing data in the file such that it's naturally aligned or having readers make separate mmap calls using the offset and length arguments to slice out each piece of the file (e.g. in LiteFS's case this might be txid and checksum).
Really nitty-gritty but something I discovered while working with Litestream: if storing an integer as zero-padded 16 characters, you can check equality and less-than/greater-than without parsing text -> integer. Specifically, it is about half the cycles to reinterpret the memory as a u128, do a byte swap, then compare the resulting u128s. Mileage obviously varies on parsing implementation, my benchmark was done using Zig's std.fmt.parseInt and @byteSwap builtin.

Support `exec` command from CLI arguments

Per suggestion on the Fly community board, we should add support for executing the subprocess from the command line arguments instead of only from the exec field in litefs.yml.

See: https://community.fly.io/t/how-do-you-properly-configure-a-litefs-deployment/8126/6

Primary not being set on failover

When I was testing failover, I found some odd behaviour testing against two nodes (sea and mia).

Assuming mia was the primary, on reboot occasionally it would re-acquire the lease. It would advertise that it the lease was acquired, but no ltx were being shipped to sea.

2022-08-25T02:30:13Z app[bfd1ef5d] mia [info]primary lease acquired, advertising as http://fdaa:0:2fff:a7b:2cc3:1:932e:2:20202

In sea, the .primary contains:

root@199f22d0:/# cat /mnt/litefs/db/.primary 
fdaa:0:2fff:a7b:2dbb:1:932d:2

And the ip address of mia is:

root@bfd1ef5d:/# cat /etc/hosts|grep 6pn
fdaa:0:2fff:a7b:2cc3:1:932e:2   bfd1ef5d.vm.litefs-liveview.internal bfd1ef5d fly-local-6pn

It looks like that sea is not acquiring the new primary information from consul?

Now if we reboot mia and it does not re-acquire the lease:

2022-08-25T02:46:41Z app[bfd1ef5d] mia [info]existing primary found (fdaa:0:2fff:a7b:2cc3:1:932e:2), connecting as replica

root@199f22d0:/# env|grep REGI
FLY_REGION=sea
root@199f22d0:/# cat /etc/hosts|grep 6pn
fdaa:0:2fff:a7b:2dbb:1:932d:2   199f22d0.vm.litefs-liveview.internal 199f22d0 fly-local-6pn
root@199f22d0:/# cat /mnt/litefs/db/.primary 
fdaa:0:2fff:a7b:2dbb:1:932d:2

We do not see any lease acquisition logs from sea. Both hosts believe they are replicas. Trying to insert into the database manually gives:

root@199f22d0:/# sqlite3 /mnt/litefs/db/testdb.sqlite 
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> insert into counts values(2, "ams", 1);
Error: unable to open database file
2022-08-25T02:59:01Z app[199f22d0] sea [info]fuse: create(): cannot create journal: read only replica

root@bfd1ef5d:/# sqlite3 /mnt/litefs/db/testdb.sqlite 
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> insert into counts values(5, "ams", 1);
Error: unable to open database file
2022-08-25T03:01:53Z app[bfd1ef5d] mia [info]fuse: create(): cannot create journal: read only replica

Also, I built a library for elixir assuming the .primary file would be deleted on after the lease was acquired on the primary. Essentially, it connects to all the nodes and checks if the .primary does not exist. This way if the user is on different platforms, etc. we don't have to parse the contents of the file to figure out the location of the primary. Can you confirm that is the expected behaviour? And if it isn't, can it be changed so only the replicas have the .primary?

Thanks.

Fixed Leader

Currently, LiteFS uses Consul for leader election but it could be useful to have a single, fixed leader that replicates out to read-only replicas.

Multiple databases

Just wondering - are multiple databases / sqlite files supported per fuse mount?

Primary Handoff

Currently, replicas will automatically obtain the lease quickly after the primary shuts down. However, this incurs a small window of downtime. We can remove this window by having the primary handoff the session ID to an up-to-date replica and have that replica become the new primary.

Replication is broken for WAL journal mode

Replica fails to sync when WAL journal mode is set

Primary's config:

mount-dir: "./dbs"
data-dir: "./.litefs"
exec: sleep inf
static:
  primary: true
  hostname: "${HOSTNAME}"
  advertise-url: "http://${HOSTNAME}:20202"

Replica's config:

mount-dir: "./dbs-replica"
data-dir: "./.litefs-replica"
exec: sleep inf
static:
  primary: false
  hostname: "primary"
  advertise-url: "http://localhost:20202"
http:
  addr: ":20203"

Primary's output:

$ ./litefs -config litefs.yml
Using static primary: is-primary=true hostname=nucbox advertise-url=http://nucbox:20202
primary lease acquired, advertising as http://nucbox:20202
LiteFS mounted to: ./dbs
http server listening on: http://localhost:20202
waiting to connect to cluster
connected to cluster, ready
starting subprocess: sleep [inf]
removing ltx file, per retention: db=state.db file=0000000000000001-0000000000000001.ltx
removing ltx file, per retention: db=state.db file=0000000000000002-0000000000000002.ltx
removing ltx file, per retention: db=state.db file=0000000000000003-0000000000000003.ltx
removing ltx file, per retention: db=state.db file=0000000000000004-0000000000000004.ltx
removing ltx file, per retention: db=state.db file=0000000000000005-0000000000000005.ltx
removing ltx file, per retention: db=state.db file=0000000000000006-0000000000000006.ltx
removing ltx file, per retention: db=state.db file=0000000000000007-0000000000000007.ltx
removing ltx file, per retention: db=state.db file=0000000000000008-0000000000000008.ltx
removing ltx file, per retention: db=state.db file=0000000000000009-0000000000000009.ltx
removing ltx file, per retention: db=state.db file=000000000000000a-000000000000000a.ltx
removing ltx file, per retention: db=state.db file=000000000000000b-000000000000000b.ltx
removing ltx file, per retention: db=state.db file=000000000000000c-000000000000000c.ltx
removing ltx file, per retention: db=state.db file=000000000000000d-000000000000000d.ltx
removing ltx file, per retention: db=state.db file=000000000000000e-000000000000000e.ltx
removing ltx file, per retention: db=state.db file=000000000000000f-000000000000000f.ltx
removing ltx file, per retention: db=state.db file=0000000000000010-0000000000000010.ltx
removing ltx file, per retention: db=state.db file=0000000000000011-0000000000000011.ltx
removing ltx file, per retention: db=state.db file=0000000000000012-0000000000000012.ltx
removing ltx file, per retention: db=state.db file=0000000000000013-0000000000000013.ltx
removing ltx file, per retention: db=state.db file=0000000000000014-0000000000000014.ltx
removing ltx file, per retention: db=state.db file=0000000000000015-0000000000000015.ltx
removing ltx file, per retention: db=state.db file=0000000000000016-0000000000000016.ltx
removing ltx file, per retention: db=state.db file=0000000000000017-0000000000000017.ltx
removing ltx file, per retention: db=state.db file=0000000000000018-0000000000000018.ltx
removing ltx file, per retention: db=state.db file=0000000000000019-0000000000000019.ltx
removing ltx file, per retention: db=state.db file=000000000000001a-000000000000001a.ltx
removing ltx file, per retention: db=state.db file=000000000000001b-000000000000001b.ltx
removing ltx file, per retention: db=state.db file=000000000000001c-000000000000001c.ltx
removing ltx file, per retention: db=state.db file=000000000000001d-000000000000001d.ltx
removing ltx file, per retention: db=state.db file=000000000000001e-000000000000001e.ltx
removing ltx file, per retention: db=state.db file=000000000000001f-000000000000001f.ltx
removing ltx file, per retention: db=state.db file=0000000000000020-0000000000000020.ltx
removing ltx file, per retention: db=state.db file=0000000000000021-0000000000000021.ltx
removing ltx file, per retention: db=state.db file=0000000000000022-0000000000000022.ltx
removing ltx file, per retention: db=state.db file=0000000000000023-0000000000000023.ltx
removing ltx file, per retention: db=state.db file=0000000000000024-0000000000000024.ltx
stream connected
http: error: stream error: db="state.db" err=stream ltx: pos=0
stream disconnected
stream connected
send frame<ltx>: db="state.db" tx=(0000000000000001,0000000000000025) chksum=(0,ee25d886681e95a3) (snapshot)
send frame<ltx>: db="state.db" tx=0000000000000026-0000000000000026 size=8320
send frame<ltx>: db="state.db" tx=0000000000000027-0000000000000027 size=8320
send frame<ltx>: db="state.db" tx=0000000000000028-0000000000000028 size=4220
send frame<ltx>: db="state.db" tx=0000000000000029-0000000000000029 size=4220
send frame<ltx>: db="state.db" tx=000000000000002a-000000000000002a size=4220

Replica's output:

$ ./litefs -config litefs-replica.yml
Using static primary: is-primary=false hostname=primary advertise-url=http://localhost:20202
existing primary found (primary), connecting as replica
LiteFS mounted to: ./dbs-replica
http server listening on: http://localhost:20203
waiting to connect to cluster
recv frame<ltx>: db="state.db" tx=0000000000000001-0000000000000025 size=8320
recv frame<ready>
connected to cluster, ready
starting subprocess: sleep [inf]
replica disconnected, retrying: process ltx stream frame: position mismatch on db "state.db": 0000000000000025/ee25d886681e95a3 <> 0000000000000025/f151190dd71ae66b
existing primary found (primary), connecting as replica
replica disconnected, retrying: process ltx stream frame: position mismatch on db "state.db": 0000000000000025/ee25d886681e95a3 <> 0000000000000025/f151190dd71ae66b
existing primary found (primary), connecting as replica
(cut)

Hosting options.

Looks very nice. Like it’s async and recovery approach.

It would be good at sone stage to highlight hosting options.

for example google Cloud Run allows using a remote File System these days. Not saying that this is compatible with the way Litefs works, but just giving one example as a reference.

part of the reason is to make it easy for me to host anywhere but also benchmark anywhere.

so it would be good if this was eventually documented in the repo, and devs can then try out different hosting options .

I would prefer to not be dependent on needing k8 in order to host.

LTX Compression

LTX files are currently sent to replicas as uncompressed blobs. LTX files should support compression internally.

macOS support