superfly / litefs
FUSE-based file system for replicating SQLite databases across a cluster of machines
License: Apache License 2.0
Because LiteFS requires that the application layer direct writes to the primary node, many use cases will need the transaction ID to be exposed to the application layer. The simplest example of this is a replica node sending a write to the primary, waiting for it to succeed, then querying its local database with the expectation that the write will have appeared locally. This can be accomplished by having the application on the primary node return the transaction ID of the write and having the application on the replica busy-loop until the local transaction ID is greater than or equal to the returned ID. Litestream exposes a position file that allows this kind of functionality, and it seems reasonable for LiteFS to do the same by exposing the TXID and checksum as files in the shadow directory.
Some implementation considerations from my brief experience working with Litestream:
- Make the file mmap-friendly. A naive getPosition/getTXID function would open, read, and close: three system calls to fetch what essentially boils down to an incrementing integer. It is much more efficient to simply mmap the file on startup and thus have a live view of the data in process memory.
- Storing the value as zero-padded text means you can cat the file and see the current ID. But it would be ever so much more convenient (for me) if it were stored in binary form in the system endianness: getting the current, ready-for-comparison TXID would then be as simple as a pointer dereference. That said, storing as zero-padded text is still probably the Right thing to do.
- I tried reading the file through a union { char[16], u128 }, but this resulted in unaligned loads, which are not great for performance. I'd recommend either storing data in the file such that it's naturally aligned or having readers make separate mmap calls using the offset and length arguments to slice out each piece of the file (e.g. in LiteFS's case this might be txid and checksum). For parsing the text form, std.fmt.parseInt and the @byteSwap builtin work well.

I am trying this project for the first time on my laptop. I tried building both litefs and the example from source.
I got the dependencies from the github pipelines.
Installed consul on my laptop too.
Everything compiled and ran. Unfortunately, I couldn't make the replication work.
I tried tweaking the litefs.yml file configuration but it doesn't replicate. I am pretty sure I am doing something wrong here.
Thanks 🙏
LiteFS currently only supports the rollback journal; however, it should be able to convert WAL writes to LTX files as well.
When I was testing failover, I found some odd behaviour testing against two nodes (sea and mia).
2022-08-25T02:30:13Z app[bfd1ef5d] mia [info]primary lease acquired, advertising as http://fdaa:0:2fff:a7b:2cc3:1:932e:2:20202
In sea, the .primary contains:
root@199f22d0:/# cat /mnt/litefs/db/.primary
fdaa:0:2fff:a7b:2dbb:1:932d:2
And the ip address of mia is:
root@bfd1ef5d:/# cat /etc/hosts|grep 6pn
fdaa:0:2fff:a7b:2cc3:1:932e:2 bfd1ef5d.vm.litefs-liveview.internal bfd1ef5d fly-local-6pn
It looks like sea is not acquiring the new primary information from Consul?
2022-08-25T02:46:41Z app[bfd1ef5d] mia [info]existing primary found (fdaa:0:2fff:a7b:2cc3:1:932e:2), connecting as replica
root@199f22d0:/# env|grep REGI
FLY_REGION=sea
root@199f22d0:/# cat /etc/hosts|grep 6pn
fdaa:0:2fff:a7b:2dbb:1:932d:2 199f22d0.vm.litefs-liveview.internal 199f22d0 fly-local-6pn
root@199f22d0:/# cat /mnt/litefs/db/.primary
fdaa:0:2fff:a7b:2dbb:1:932d:2
We do not see any lease acquisition logs from sea. Both hosts believe they are replicas. Trying to insert into the database manually gives:
root@199f22d0:/# sqlite3 /mnt/litefs/db/testdb.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> insert into counts values(2, "ams", 1);
Error: unable to open database file
2022-08-25T02:59:01Z app[199f22d0] sea [info]fuse: create(): cannot create journal: read only replica
root@bfd1ef5d:/# sqlite3 /mnt/litefs/db/testdb.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> insert into counts values(5, "ams", 1);
Error: unable to open database file
2022-08-25T03:01:53Z app[bfd1ef5d] mia [info]fuse: create(): cannot create journal: read only replica
Also, I built a library for Elixir assuming the .primary file would be deleted after the lease was acquired on the primary. Essentially, it connects to all the nodes and checks whether .primary exists. This way, if the user is on different platforms, etc., we don't have to parse the contents of the file to figure out the location of the primary. Can you confirm that this is the expected behaviour? And if it isn't, can it be changed so that only the replicas have the .primary file?
Thanks.
Ideally, it'd be nice to support the creation or replacement of a database with simple file commands. e.g.
cp /path/to/src.db /mnt/target.db
While LiteFS is mostly a passthrough file system, it'd be good to run the SQLite test suite against it to verify correctness. Most of the test suite does not test persistence, so only a subset is valuable. The main limiting factors for implementing this are my lack of Tcl knowledge and WAL mode being unsupported right now (#14).
Here's my litefs config: https://github.com/kentcdodds/kentcdodds.com/blob/dev/other/litefs.yml
So based on what I've read and the example, I would expect to find the .primary file at /litefs/data/.primary. I only have two regions, and I've SSH-ed into both of them and cannot find that file. I also checked at /data/.primary and it wasn't there either.
What have I got wrong?
Currently, replicas will automatically obtain the lease quickly after the primary shuts down. However, this incurs a small window of downtime. We can remove this window by having the primary hand off the session ID to an up-to-date replica and having that replica become the new primary.
Web servers using LiteFS don't know which LiteFS server to read from or write to.
So I presume a proxy is needed, so that this is automatically managed for them?
Also, what about a SQL statement that has both a write and a read in it? Again, I presume the proxy would handle this?
https://github.com/CECTC/dbpack is a Golang DB proxy that does what I am suggesting.
Directing writes to the primary node is the job of the application, so the primary's hostname needs to be exposed somehow. At present the application can get this by querying Consul, but if #37 and/or #23 are implemented this will no longer be the case. I propose exposing this information via a file in the shadow directory, similar to Litestream's .primary.
Replica fails to sync when WAL journal mode is set
Primary's config:
mount-dir: "./dbs"
data-dir: "./.litefs"
exec: sleep inf
static:
  primary: true
  hostname: "${HOSTNAME}"
  advertise-url: "http://${HOSTNAME}:20202"
Replica's config:
mount-dir: "./dbs-replica"
data-dir: "./.litefs-replica"
exec: sleep inf
static:
  primary: false
  hostname: "primary"
  advertise-url: "http://localhost:20202"
http:
  addr: ":20203"
Primary's output:
$ ./litefs -config litefs.yml
Using static primary: is-primary=true hostname=nucbox advertise-url=http://nucbox:20202
primary lease acquired, advertising as http://nucbox:20202
LiteFS mounted to: ./dbs
http server listening on: http://localhost:20202
waiting to connect to cluster
connected to cluster, ready
starting subprocess: sleep [inf]
removing ltx file, per retention: db=state.db file=0000000000000001-0000000000000001.ltx
removing ltx file, per retention: db=state.db file=0000000000000002-0000000000000002.ltx
removing ltx file, per retention: db=state.db file=0000000000000003-0000000000000003.ltx
removing ltx file, per retention: db=state.db file=0000000000000004-0000000000000004.ltx
removing ltx file, per retention: db=state.db file=0000000000000005-0000000000000005.ltx
removing ltx file, per retention: db=state.db file=0000000000000006-0000000000000006.ltx
removing ltx file, per retention: db=state.db file=0000000000000007-0000000000000007.ltx
removing ltx file, per retention: db=state.db file=0000000000000008-0000000000000008.ltx
removing ltx file, per retention: db=state.db file=0000000000000009-0000000000000009.ltx
removing ltx file, per retention: db=state.db file=000000000000000a-000000000000000a.ltx
removing ltx file, per retention: db=state.db file=000000000000000b-000000000000000b.ltx
removing ltx file, per retention: db=state.db file=000000000000000c-000000000000000c.ltx
removing ltx file, per retention: db=state.db file=000000000000000d-000000000000000d.ltx
removing ltx file, per retention: db=state.db file=000000000000000e-000000000000000e.ltx
removing ltx file, per retention: db=state.db file=000000000000000f-000000000000000f.ltx
removing ltx file, per retention: db=state.db file=0000000000000010-0000000000000010.ltx
removing ltx file, per retention: db=state.db file=0000000000000011-0000000000000011.ltx
removing ltx file, per retention: db=state.db file=0000000000000012-0000000000000012.ltx
removing ltx file, per retention: db=state.db file=0000000000000013-0000000000000013.ltx
removing ltx file, per retention: db=state.db file=0000000000000014-0000000000000014.ltx
removing ltx file, per retention: db=state.db file=0000000000000015-0000000000000015.ltx
removing ltx file, per retention: db=state.db file=0000000000000016-0000000000000016.ltx
removing ltx file, per retention: db=state.db file=0000000000000017-0000000000000017.ltx
removing ltx file, per retention: db=state.db file=0000000000000018-0000000000000018.ltx
removing ltx file, per retention: db=state.db file=0000000000000019-0000000000000019.ltx
removing ltx file, per retention: db=state.db file=000000000000001a-000000000000001a.ltx
removing ltx file, per retention: db=state.db file=000000000000001b-000000000000001b.ltx
removing ltx file, per retention: db=state.db file=000000000000001c-000000000000001c.ltx
removing ltx file, per retention: db=state.db file=000000000000001d-000000000000001d.ltx
removing ltx file, per retention: db=state.db file=000000000000001e-000000000000001e.ltx
removing ltx file, per retention: db=state.db file=000000000000001f-000000000000001f.ltx
removing ltx file, per retention: db=state.db file=0000000000000020-0000000000000020.ltx
removing ltx file, per retention: db=state.db file=0000000000000021-0000000000000021.ltx
removing ltx file, per retention: db=state.db file=0000000000000022-0000000000000022.ltx
removing ltx file, per retention: db=state.db file=0000000000000023-0000000000000023.ltx
removing ltx file, per retention: db=state.db file=0000000000000024-0000000000000024.ltx
stream connected
http: error: stream error: db="state.db" err=stream ltx: pos=0
stream disconnected
stream connected
send frame<ltx>: db="state.db" tx=(0000000000000001,0000000000000025) chksum=(0,ee25d886681e95a3) (snapshot)
send frame<ltx>: db="state.db" tx=0000000000000026-0000000000000026 size=8320
send frame<ltx>: db="state.db" tx=0000000000000027-0000000000000027 size=8320
send frame<ltx>: db="state.db" tx=0000000000000028-0000000000000028 size=4220
send frame<ltx>: db="state.db" tx=0000000000000029-0000000000000029 size=4220
send frame<ltx>: db="state.db" tx=000000000000002a-000000000000002a size=4220
Replica's output:
$ ./litefs -config litefs-replica.yml
Using static primary: is-primary=false hostname=primary advertise-url=http://localhost:20202
existing primary found (primary), connecting as replica
LiteFS mounted to: ./dbs-replica
http server listening on: http://localhost:20203
waiting to connect to cluster
recv frame<ltx>: db="state.db" tx=0000000000000001-0000000000000025 size=8320
recv frame<ready>
connected to cluster, ready
starting subprocess: sleep [inf]
replica disconnected, retrying: process ltx stream frame: position mismatch on db "state.db": 0000000000000025/ee25d886681e95a3 <> 0000000000000025/f151190dd71ae66b
existing primary found (primary), connecting as replica
replica disconnected, retrying: process ltx stream frame: position mismatch on db "state.db": 0000000000000025/ee25d886681e95a3 <> 0000000000000025/f151190dd71ae66b
existing primary found (primary), connecting as replica
(cut)
Just wondering - are multiple databases / sqlite files supported per fuse mount?
LiteFS should allow users to specify that they require acknowledgement of a write before accepting new writes.
Currently, LiteFS uses Consul for leader election but it could be useful to have a single, fixed leader that replicates out to read-only replicas.
I've tried working through the README directions but consistently run into Disk IO errors. I'm using the 0.1.0 release. I've distilled everything down to a simple Bash script for reproducing:
#!/bin/bash
set -xe
# Kernel and OS info for debugging
uname -r
cat /etc/os-release
# Assumes you want to place a specific version of the LiteFS binary in the local
# directory for testing
LITEFS_CMD="./litefs"
function cleanup {
  # Terminate all child processes
  pkill -P $$
}
trap cleanup EXIT
consul agent -dev &
sleep 2 # give Consul a few seconds to start
# We'll create everything in a scratch directory called "repro"
# Blow away left-overs from any previous run
rm -rf repro
mkdir -p repro/data
cat <<EOF >repro/litefs.yml
mount-dir: repro/data
debug: true
http:
  addr: ":20202"
consul:
  url: "http://localhost:8500"
  advertise-url: "http://localhost:20202"
EOF
$LITEFS_CMD -config repro/litefs.yml &
sleep 2 # give LiteFS a few seconds to start
echo "====== Running SQLite against the test database now ========"
sqlite3 -column repro/data/test.db <<EOF
PRAGMA journal_mode;
CREATE TABLE test (foo INT);
INSERT INTO test VALUES (1);
EOF
echo "Success!"
The output of running this is attached
repro.log
LTX files are currently sent to replicas as uncompressed blobs. LTX files should support compression internally.
Hello,
I tried the latest litefs 0.2.0 and it seems to get stuck at some point. It is working with 0.1.1, though.
When I start with litefs I can not access the database or even list the mount directory.
This is how I start:
litefs -config litefs_SPA.yml
litefs_SPA.yml:
mount-dir: '/mnt/db'
data-dir: '/mnt/db'
http:
  addr: ':20202'
consul:
  url: 'http://10.16.18.230:8500'
  advertise-url: 'http://10.16.18.228:20202'
Also there is kernel info message:
[Tue Oct 25 09:15:15 2022] </TASK>
[Tue Oct 25 09:15:15 2022] INFO: task ls:2307 blocked for more than 120 seconds.
[Tue Oct 25 09:15:15 2022] Tainted: P OE 5.15.0-52-generic #58~20.04.1-Ubuntu
[Tue Oct 25 09:15:15 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Oct 25 09:15:15 2022] task:ls state:D stack: 0 pid: 2307 ppid: 1986 flags:0x00004004
[Tue Oct 25 09:15:15 2022] Call Trace:
[Tue Oct 25 09:15:15 2022] <TASK>
[Tue Oct 25 09:15:15 2022] __schedule+0x2cd/0x8a0
[Tue Oct 25 09:15:15 2022] schedule+0x4e/0xc0
[Tue Oct 25 09:15:15 2022] request_wait_answer+0x136/0x210
[Tue Oct 25 09:15:15 2022] ? wait_woken+0x60/0x60
[Tue Oct 25 09:15:15 2022] fuse_simple_request+0x1ac/0x2f0
[Tue Oct 25 09:15:15 2022] fuse_do_getattr+0xd7/0x340
[Tue Oct 25 09:15:15 2022] fuse_getattr+0xa9/0x130
[Tue Oct 25 09:15:15 2022] vfs_getattr_nosec+0xba/0xe0
[Tue Oct 25 09:15:15 2022] vfs_getattr+0x37/0x50
[Tue Oct 25 09:15:15 2022] vfs_statx+0x89/0x110
[Tue Oct 25 09:15:15 2022] __do_sys_newlstat+0x3e/0x80
[Tue Oct 25 09:15:15 2022] __x64_sys_newlstat+0x16/0x20
[Tue Oct 25 09:15:15 2022] do_syscall_64+0x59/0xc0
[Tue Oct 25 09:15:15 2022] ? exit_to_user_mode_prepare+0x3d/0x1c0
[Tue Oct 25 09:15:15 2022] ? do_user_addr_fault+0x1e0/0x660
[Tue Oct 25 09:15:15 2022] ? irqentry_exit_to_user_mode+0x9/0x20
[Tue Oct 25 09:15:15 2022] ? irqentry_exit+0x1d/0x30
[Tue Oct 25 09:15:15 2022] ? exc_page_fault+0x89/0x170
[Tue Oct 25 09:15:15 2022] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Tue Oct 25 09:15:15 2022] RIP: 0033:0x7f7244ed557a
[Tue Oct 25 09:15:15 2022] RSP: 002b:00007ffca76b3b08 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
[Tue Oct 25 09:15:15 2022] RAX: ffffffffffffffda RBX: 00005555e6434550 RCX: 00007f7244ed557a
[Tue Oct 25 09:15:15 2022] RDX: 00005555e6434568 RSI: 00005555e6434568 RDI: 00007ffca76b3b10
[Tue Oct 25 09:15:15 2022] RBP: 00007ffca76b3ec0 R08: 0000000000000001 R09: 00000000e643ab00
[Tue Oct 25 09:15:15 2022] R10: 00007ffca76b3b14 R11: 0000000000000246 R12: 00005555e643ab43
[Tue Oct 25 09:15:15 2022] R13: 0000000000000003 R14: 00007ffca76b3b10 R15: 00005555e6434568
[Tue Oct 25 09:15:15 2022] </TASK>
Maybe this has something to do with FUSE, but I don't know how to dig into it. As I said, it is working with 0.1.1.
Thanks for this great project and let me know if I can help more.
Looks very nice. I like its async and recovery approach.
It would be good at some stage to highlight hosting options.
For example, Google Cloud Run allows using a remote file system these days. I'm not saying that this is compatible with the way LiteFS works, just giving one example as a reference.
Part of the reason is to make it easy for me to host anywhere but also benchmark anywhere.
So it would be good if this was eventually documented in the repo, and devs could then try out different hosting options.
I would prefer not to be dependent on Kubernetes in order to host.
Right now, the data directory is automatically derived from the mount directory. However, this can be confusing so it's probably better to explicitly set the mount and data directory.
Originally mentioned by David Cameron on Twitter: https://twitter.com/dave_cameron/status/1571514794989191169
Since LTX files need to be removed after a while (#48), we need to be able to take a snapshot of the full database as an LTX file. For the initial implementation, obtaining a SHARED lock on the database and then copying the pages to an ltx.Writer should suffice. This will prevent writes during a snapshot, but that can be alleviated by supporting WAL mode (#14).
A snapshot LTX file should be served to HTTP clients if they connect and their TXID is not available as an LTX file.
LTX files are designed to support encryption so that remote storage, such as AWS S3, will not be able to read the underlying data. Currently, I'm leaning toward using AES-GCM-SIV from the Tink project.
Currently, every node in a LiteFS cluster is a candidate to become the primary. However, in practice, users may want to keep their primaries in a single region for consistent performance. The litefs.yml config file should have a flag to indicate whether a node is a candidate or can only be a replica.
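A hypothetical config sketch of what such a flag could look like (the candidate key is an assumption for illustration, not an existing LiteFS option):

```yaml
# litefs.yml on a node that should never become primary
mount-dir: "/mnt/litefs"
consul:
  url: "http://localhost:8500"
  advertise-url: "http://localhost:20202"
candidate: false   # hypothetical flag: replica-only, never acquire the lease
```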
https://github.com/kentcdodds/kentcdodds.com/actions/runs/3316512422/jobs/5478478215
cannot open store: open databases: open database("sqlite.db"): verify database file: database checksum (e3d3906d74cc0273) does not match latest LTX checksum (0000000000000000)
This volume is brand new and completely empty. @benbjohnson said this is a bug that needs fixing and asked me to open this issue. More context at https://www.youtube.com/watch?v=vTNPJGKqsYQ
Thanks!
LiteFS currently handles the DELETE rollback journal mode. It should also handle PERSIST & TRUNCATE.
Hi there,
The project only releases a binary for Linux currently; I wonder if macOS will be supported in the future. This is a Go project and SQLite can run on macOS as well, and there are projects like macFUSE, so I assume this is technically possible. It would be great if macOS could be used for development and experimentation purposes. Thanks.
As LiteFS is a database replication tool, test coverage should be high, likely 80-90%. Some areas of LiteFS, such as the FUSE file system, may be difficult to report test coverage for, but coverage should still exist even if it is not reported.
After failover testing done by fly restart vm:
2022-08-25T03:43:05Z app[8cd3b300] sea [info]cannot open store: open databases: open database: db=00000001 err=recover ltx: read ltx file header (0000000000000062-0000000000000062.ltx): unmarshal header: invalid LTX file
As far as I know, there was no activity for probably 10-30 seconds before the restart. Possibly longer. I'm attaching the entire directory as a base64 in hope that it is of some use to you.
root@8cd3b300:~/db/00000001/ltx# ls -latr
total 236
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000050-0000000000000050.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000051-0000000000000051.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000052-0000000000000052.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000053-0000000000000053.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000054-0000000000000054.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000055-0000000000000055.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000056-0000000000000056.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000057-0000000000000057.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:41 0000000000000058-0000000000000058.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 0000000000000059-0000000000000059.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005a-000000000000005a.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005b-000000000000005b.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005c-000000000000005c.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005d-000000000000005d.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005e-000000000000005e.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 000000000000005f-000000000000005f.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 0000000000000060-0000000000000060.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 0000000000000061-0000000000000061.ltx
-rw-r--r-- 1 root root 8320 Aug 25 03:42 0000000000000062-0000000000000062.ltx
$ base64 ltx.tar.gz
H4sIAAAAAAAAA+3dcWxUdwHA8d+1TMAxD1mAIIrHKUhnr7z37t4dbBCHs4FGttD2gLa0tO/ag5XQ
NvRu1CUkBEKmkf4BmLS6RMymfzANMGOAhYxW0dhlY0khyBBHNFoSp53RZMPUwfTeXdvAQ37bvVvf
3ct9P0B/l+v9+rsfb/+wfO/3FCVNXS6mjrlARNdFZinFOqYfq6GgourhSDioCkVVQiFN+PQpfE+T
nkkkjW6fT3R3dSVlr/uo77uUMnH9dya/NVX/DWR//VVNV7n+Trjr+it3CasB6xMVqVfZWyMcCt3n
+mupX3de/9Tzqq6GdOFTPvnt3qvIr//6aJ2aeTTb/FKa+uMZ/5ZhGUVf5ORLu3sPnbKxjKe2en17
Mu7b1tXdYSR9wdRqHo94PPX3nvn5pXe8tiT1Z9qdcz/GzzdERfPnHjInez8lvLtSXwAAAAAAyJv2
kunzFyzw7AskjdjOeGvXM53JROZr6RM1lWuilb7omq+vr/T5M0/6fcv87W1+X9VT0cq1lTW+DTVV
T66pqfd9s7K+3Ofvjm9v7+r0+6KVddHy8SmTry3bN8szff7y5Z59a9NrJVqfjncYzR3t27uNZGpa
4p4nSu5+B/d833wzu+PdifSa//8dtXcm4t3JeFuzkcy8reZvpH5itOrJyrLU5kvNf58/4L0ovDe9
73pHvH9IPQQAAAAAAIVuxgPTxMOejnZj0fRpqUczjY7E9NKJByXjD2Z4zBcl4ka6MxjZ89s3jwcO
fO1I59u+3UNXD+R7C8VO0v/obQHrEw70P1rY7H+Capj+xwmS/qfJMoqbz3/w+oPBwYiNZWT9T5PI
vf9pov8BAAAAABQO+h/6HwAAAAAAXCa7/idgTjlw8ciJ4SM/O3fse77a9cc65+R7C8VO1v9ss/Y/
25zofyLp/kcL0v84QdL/NFtG8eKsstJXd/X5bSwj63+aRe79TzP9DwAAAACgcND/0P8AAAAAAOAy
2fU/6duLDa99fPv24f3H/nN69qk5Twe0fG+h2Mn6n1Zr/9Pq4Pk/IfofJ0j6n0bLKMZ23HhvZnR4
v41lZP1Po8i9/2mk/wEAAAAAFA76H/ofAAAAAABcJrv+p9ycMnEPoauDn/nL787+9Ev53kKxk/U/
qrX/Uae+/9FD6fN/VF2j/3GCpP+ptoxioHfu3lcGq9bZWEbW/1SL3PufavofAAAAAEDhoP+h/wEA
AAAAwGWy63/Src+FoT0LWxc1nX9nVfn1X998c26+t1DsZP1P3Nr/xJ07/0dT6H+cIOl/tlrGyXv3
2VhG1v9sFbn3P1vpfwAAAAAAhYP+h/4HAAAAAACXya7/qTCnvDirrPTVXX3+kfdLd5z/9+If5XsL
xU7S/4SVgPUJJ/qf8fN/VPofJ0j6nxbLOHnvPhvLyPqfFpF7/9NC/wMAAAAAKBz0P/Q/AAAAAAC4
THb9Tzrn6IucfGl376FTfaMHj/krL9/K9xaKnez8n6D1/J+gA/f/0oOZ/idI/+MESf9TaxnF6LWh
097bLU02lpH1P7Ui9/6nlv4HAAAAAFA46H/ofwAAAAAAcJns+p8l5pSjh09cutyYfOvgh5/v/+z8
xXbOEsEnSNb/hK39T9iJ/kdP9z9h+h9HSPqfTZZR3Pjb77/z0PPev9pYRtb/bBK59z+b6H8AAAAA
AIWD/of+BwAAAAAAl8mu/1lmTnmtJ7bEeFbr+fmHyox/PHXtVL63UOxk/Y9u7X90J/qfUKb/Ueh/
nCDpfzZaRtH/wsujVx5bGLOxjKz/2Shy73820v8AAAAAAAoH/Q/9DwAAAAAALpNd//MVc8rEGSLf
/f7DaldgLJLvLRQ7Wf8Ts/Y/sanvf8Ja2Ox/gqpG/+MESf+zxTKKg4t7D116/zczbCwj63+2iNz7
ny30PwAAAACAwkH/Q/8DAAAAAIDLZNf/fNWcMrbjxnszo8P7fzBnef3ALxbMzvcWip2s/zGs/Y/h
RP+Tvv9XUInQ/zhB0v80WEZx/FZk9cC8sX4by8j6nwaRe//TQP8DAAAAACgc9D/0PwAAAAAAuEx2
/c8j5pSJM0S+Xb3quc7yuefyvYViJ+t/FGv/ozhw/69Q+vwfNUT/4whJ/7PBMoozF688cv1Xr9/Z
6nxcsv5ng8i9/9lA/wMAAAAAKBz0P/Q/AAAAAAC4THb9j9+cMtA7d+8rg1Xr3r4+Uv/Bc/9tzfcW
ip2s/9Gs/Y/mQP+jK5n+J0z/4wRJ/1NjGcWFoT0LWxc1nbexjKz/qRG59z819D8AAAAAgMJB/0P/
AwAAAACAy2TX/3zZnDJ6bei093ZL0xsN3pF158I/zPcWip2s/1lp7X9WOnH/r1C6/wnT/zhC0v/U
W0bxxzNb6+Nn3v2CjWVk/U+9yL3/qaf/AQAAAAAUDvof+h8AAAAAAFwmu/7H/H8A4vityOqBeWP9
b0S+OPinRIMjiQfuT9b/rLD2PyscOP8nomb6H53+xwmS/qfOMoqzqx5cc/bq0ts2lpH1P3Ui9/6n
jv4HAAAAAFA46H/ofwAAAAAAcJns+p9l5pSJM0T+9eO/HyrpaazI9xaKnaz/iVj7n4gT/U/m/l9h
jf7HCZL+Z7NlFK/1xJYYz2o9NpaR9T+bRe79z2b6HwAAAABA4aD/of8BAAAAAMBlbJz/M3GGyPmy
vYGTlw//M99bKHaS/iesBaxPOHD/r2Dm/B9dof8BAAAAAAAAAAAAAAAAPors/J+Q9fyfkAPn/+ih
TP8Tpv9xguT8n6hlFEcPn7h0uTH5lo1lZOf/REXu5/9EOf8HAAAAAFA4OP+H838AAAAAAHCZ7M7/
WWpO6X/h5dErjy2M/aR2dWDvp5cezfcWit1k/9NmJI2YkYhP0Rr373+C6cd3n/+jqRr3/3KEpMuJ
idy7nBhdDgAAAACgcBRvl2P+29zj/bNI/QYAAAAAAK4178CFX555Z+iJUhHUFE0LKCsCmh5VtEcV
/VFtJZ/JAQAAAADAhbL7TI42+fmPTqNjKj77YZJ//sNcXR///IceCqrm+a+qFgny+Q8nJOOJZFus
IrFrZ3tyqq4/AAAAAAAAgPz4H9qMJFcA+AIA
I was restarting litefs and it was connecting back to the primary when this happened. I can't seem to reproduce it, but this is the stack trace from the build of 37d2e5d.
root@9d625ce1:/mnt/litefs# fg
stream connected
stream disconnected
http: panic serving [fdaa:0:2fff:a7b:2c00:163d:ee67:2]:51588: runtime error: invalid memory address or nil pointer dereference
goroutine 132 [running]:
net/http.(*conn).serve.func1()
/gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:1802 +0xb9
panic({0x949b80, 0xe43c90})
/gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/runtime/panic.go:1047 +0x266
github.com/superfly/litefs.(*DB).Pos(0xc00027ee28)
/current/litefs/db.go:116 +0x39
github.com/superfly/litefs/http.(*Server).streamDB(0xc000170aa0, {0xaa7f10, 0xc0001f8f00}, {0xaa5090, 0xc0004027e0}, 0x6a6f85, 0x1000000009bf760)
/current/litefs/http/server.go:215 +0x31e
github.com/superfly/litefs/http.(*Server).handlePostStream(0xc000170aa0, {0xaa5090, 0xc0004027e0}, 0xc000284c00)
/current/litefs/http/server.go:181 +0x549
github.com/superfly/litefs/http.(*Server).serveHTTP(0xc000170aa0, {0xaa5090, 0xc0004027e0}, 0xc000284c00)
/current/litefs/http/server.go:138 +0x251
net/http.HandlerFunc.ServeHTTP(0x0, {0xaa5090, 0xc0004027e0}, 0x10)
/gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:2047 +0x2f
net/http.serverHandler.ServeHTTP({0xaa37d8}, {0xaa5090, 0xc0004027e0}, 0xc000284c00)
/gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:2879 +0x43b
net/http.(*conn).serve(0xc000416780, {0xaa7fb8, 0xc000273cb0})
/gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:1930 +0xb08
created by net/http.(*Server).Serve
/gnu/store/d06665qgp3zqp05fr0q1sdbfnpvxywsc-go-1.17.11/lib/go/src/net/http/server.go:3034 +0x4e8
Currently, the race detector reports an unusual race when running go-fuse as-is: #5
This has been fixed by enabling SingleThreaded, but that shouldn't be necessary. More investigation will be needed to figure out what's going on.
Currently, LiteFS will retain LTX files locally forever. That's obviously not ideal. Retention enforcement wasn't added because it wasn't clear at first whether databases would be served out of the LTX pages or from the database file itself. Now it looks like serving from the database file is the best approach, as it makes LiteFS a passthrough file system.
LTX files should be retained based on a time limit for now. Maybe 5 minutes by default? Size-based retention should be implemented at some point in the future too, but time-based retention is a good start.
Would it be possible to open the underlying database file with mmap for better performance or would it not be very useful since all queries need to pass through the FUSE layer anyway, which would be the bigger bottleneck?
If litefs encounters an error on startup, it will return an error and exit immediately. This is problematic for ephemeral systems, as it's impossible to debug the state of the system or any mounted volumes.
Instead, litefs should default to reporting the error but keep running until it receives a signal to stop. Add a flag called ExitOnError bool to the config to change this behavior.
/cc @kentcdodds
Listing files in the mounted directory doesn't include journal files, although things still work because looking up individual filenames returns the correct handles.
$ ls dbs/ -lh
total 0
-rw-rw-rw- 1 daniel daniel 8.0K Oct 19 19:27 state.db
$ ls dbs/state.db-wal -lh
-rw-rw-rw- 1 daniel daniel 57K Oct 19 19:27 dbs/state.db-wal
$ ls dbs/state.db-shm -lh
-rw-rw-rw- 1 daniel daniel 32K Oct 19 19:27 dbs/state.db-shm
As expected, the files exist in the data dir:
$ ls .litefs/dbs/state.db/ -lh
total 376K
-rw-rw-r-- 1 daniel daniel 8.0K Oct 19 19:28 database
drwxrwxr-x 2 daniel daniel 4.0K Oct 19 19:29 ltx
-rw-rw-r-- 1 daniel daniel 32K Oct 19 19:29 shm
-rw-rw-r-- 1 daniel daniel 326K Oct 19 19:29 wal
This is confusing but it doesn't affect SQLite clients per Ben's words:
it's the FUSE ReadDir() implementation I have. It's just listing out the databases right now. it doesn't affect the functionality as SQLite uses LOOKUP calls instead of READDIR but I should fix that.
The architecture doc states:
When the old primary node connects to the new primary node, it will see that its checksum is different even though its transaction ID could be the same. At this point, it will resnapshot the database from the new primary to ensure consistency.
How does this appear to the application?
Hi there,
Thanks for creating the project; litefs seems a good fit for my use case. However, internally we use etcd instead of Consul, and we would like to reduce dependencies as much as possible (which is why users use SQLite in many cases). I wonder if there is any chance we could add an abstraction over Consul so that both etcd and Consul can be used? Thanks.
Currently, LiteFS supports a single primary node that performs all the writes. However, there are situations where it would be useful to have multiple nodes that can write—even if it means taking a performance hit. Two common examples are background jobs & out-of-band migrations.
This could work by having the primary hand off the write lock to another node temporarily:
It is to be determined exactly how the lock handoff is requested by the client application. It could be transparent but that could cause users to experience slow performance if they are not correctly forwarding writes when they can. Maybe this should be a flag in the config to enable it?
Per a suggestion on the Fly community board, we should add support for executing the subprocess from the command line arguments instead of only from the exec field in litefs.yml.
See: https://community.fly.io/t/how-do-you-properly-configure-a-litefs-deployment/8126/6
LiteFS is tested against the default 4KB page size; however, it should be able to handle any page size from 512 bytes to 64KB.
Wonderful project, @benbjohnson! Thank you for open sourcing it!
As the name implies, the primary litefs use case is around SQLite. I'm curious how tightly coupled to SQLite internals the intercept-writes-at-the-filesystem-level approach is. What challenges do you envision in supporting other embedded databases like DuckDB?
(Apologies in advance if you prefer not to use issues to answer questions. I'm happy to ask on another forum if that's your preference. Thank you again for your wonderful software.)
Until #14 is implemented, LiteFS should gracefully prevent WAL mode from being enabled.
A comment on HN had an interesting idea of having the primary walled off from the other nodes and having the primary reach out to replicas instead of replicas connecting to the primary. It could provide security benefits as the primary node(s) could have tighter firewall controls.
Kubernetes has something similar to consul sessions called the Lease API. Could we use it to simplify deployments in k8s so there is no dependency on consul?
The LTX file format is designed to store optional event data (similar to SSE) for each transaction file. This can be useful as replicas may need to know when and how data changes instead of polling their local copy for changes.
This will need to be implemented as something like a file handle so the application can write events to it. Notifications can also be implemented as a file handle or an HTTP endpoint.
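As a rough illustration of the consumer side, a replica could tail such an events file. This is a speculative sketch, assuming one JSON object per line (SSE-like); the file path, event schema, and polling interval are all invented here:

```python
import json
import time

def tail_events(path):
    """Yield events appended to a (hypothetical) per-database events file.

    Each event is assumed to be one JSON object per line, similar to SSE.
    Blocks by polling when it reaches the current end of the file.
    """
    with open(path) as f:
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)   # at EOF; wait for the writer to append more
                continue
            yield json.loads(line)
```

An HTTP endpoint could serve the same stream without the polling loop; the file-handle form has the advantage of needing no extra client library.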
LiteFS should expose a set of Prometheus metrics via a `GET /metrics` endpoint on the HTTP server.
LiteFS provides some redundancy by running in a cluster; however, losing all nodes would cause all data to be lost. Replicating to S3 in a manner similar to Litestream would provide high durability (11 9s) as well as allow point-in-time restores.
Unlike Litestream, LiteFS is designed for efficient compaction of transaction files, so restore time should be much faster.
Not sure if the project is considered stable enough for such testing right now, but I gave it a try anyway. Feel free to close the issue if it's not suitable right now and I can try again later.
I tried loading one of my bigger databases (~4 GB) by dumping it into a SQL file (via `.dump`) and then loading it into a new DB on a LiteFS mount (via `.read`), and these are my observations.
My first try was with the regular `sqlite3` tool on a regular filesystem (non-LiteFS mount) just to see the time taken (~5 mins).
Trying the same on a LiteFS mount ran for around ~16 mins before failing with an EIO error of `nonsequential page numbers in snapshot transaction`. Trying the same multiple times gave me the same error.
I thought it was due to the single transaction (added by sqlite as part of the `.dump` command to the output) and perhaps LiteFS was not handling such a large transaction properly, so I removed the `BEGIN TRANSACTION` and `COMMIT` lines from the file before trying again.
This did run for longer, but the speed slowed down drastically. It was able to process only ~300 MB of the ~4 GB file in 70+ mins (verified by checking the size of the underlying database file), but generated ~11 GB of pages in the ltx folder, consisting of 650k+ individual `.ltx` files.
Is this expected for now till the performance improvements come in later or can something be done?
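In case it helps others hitting this, one workaround pending those improvements might be to commit in medium-sized batches rather than one giant transaction or one autocommitted transaction per statement. A minimal, LiteFS-agnostic sketch using Python's stdlib `sqlite3` (the batch size is a guess to be tuned):

```python
import sqlite3

def load_statements(conn, statements, batch_size=10_000):
    """Execute SQL statements, committing every batch_size of them so that
    no single transaction grows unboundedly large."""
    cur = conn.cursor()
    for i, stmt in enumerate(statements, start=1):
        cur.execute(stmt)
        if i % batch_size == 0:
            conn.commit()   # flush the current batch as one transaction
    conn.commit()           # commit any trailing partial batch
```

Each commit becomes one replicated transaction, so the batch size trades per-transaction overhead (the 650k+ `.ltx` files above) against transaction size (the snapshot-transaction failure above).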
I just deployed LiteFS to `den` and `maa` and I'm getting a bunch of logs that I'm concerned about:
2022-10-25T22:15:01.166 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica
2022-10-25T22:15:01.286 app[2a69c631] den [info] stream connected
2022-10-25T22:15:01.287 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320
2022-10-25T22:15:01.420 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3
2022-10-25T22:15:02.670 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica
2022-10-25T22:15:02.790 app[2a69c631] den [info] stream connected
2022-10-25T22:15:02.924 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3
2022-10-25T22:15:04.182 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica
2022-10-25T22:15:04.437 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3
2022-10-25T22:15:04.753 app[2a69c631] den [info] HEAD / 200 89545 - 49.571 ms
2022-10-25T22:15:04.755 app[2a69c631] den [info] GET /healthcheck 200 - - 64.283 ms
2022-10-25T22:15:05.658 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica
2022-10-25T22:15:05.780 app[2a69c631] den [info] stream connected
2022-10-25T22:15:05.780 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320
2022-10-25T22:15:05.913 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3
2022-10-25T22:15:07.138 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica
2022-10-25T22:15:07.259 app[2a69c631] den [info] stream connected
2022-10-25T22:15:07.262 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320
2022-10-25T22:15:07.394 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3
2022-10-25T22:15:08.615 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica
2022-10-25T22:15:08.735 app[2a69c631] den [info] stream connected
2022-10-25T22:15:08.737 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320
2022-10-25T22:15:08.875 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3
2022-10-25T22:15:10.107 app[6c9dc779] maa [info] existing primary found (2a69c631), connecting as replica
2022-10-25T22:15:10.228 app[2a69c631] den [info] stream connected
2022-10-25T22:15:10.228 app[2a69c631] den [info] send frame<ltx>: db="sqlite.db" tx=0000000000000252-0000000000000252 size=49320
2022-10-25T22:15:10.362 app[6c9dc779] maa [info] replica disconected, retrying: process ltx stream frame: position mismatch on db "sqlite.db": 0000000000000251/ce552e44f23fbbdd <> 0000000000000251/e5f3198f64a74ef3
I'm not sure what these mean 🤔
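Reading the log lines themselves, the `position mismatch` error appears to compare a TXID/checksum pair: both nodes report TXID `0000000000000251` but with different checksums (`ce552e44f23fbbdd` vs `e5f3198f64a74ef3`), which suggests the replica's copy diverged from the primary's history rather than merely lagging behind. A minimal sketch of that comparison, assuming a position is just a `(txid, checksum)` tuple (the function and exception names are invented here):

```python
class PositionMismatchError(Exception):
    pass

def check_position(local, incoming):
    """local and incoming are (txid, checksum) pairs. A replica may only
    apply the next frame if its position matches the primary's exactly;
    the same txid with a different checksum means the histories diverged."""
    if local != incoming:
        raise PositionMismatchError("position mismatch: %s/%s <> %s/%s"
                                    % (local + incoming))
    # positions match: safe to apply the next transaction frame
```

Because a diverged replica can never catch up by replaying frames, the retry loop in the logs keeps failing at the same position.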
Using Consul's session API works fine without an associated node in a single-tenant environment. However, when using Consul in a multi-tenant environment with ACLs, the `session_prefix` will apply to the node name in the session, and the session will be rejected with a permission error if a node name is not present.
The fix is to provide a means of specifying a node name that will be registered on startup. LiteFS uses time-based sessions and will eventually hand off sessions between nodes, so we want to register a single node for all LiteFS instances. This node will not have any checks associated with it.
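A sketch of what that configuration could look like — the `node` key name (and the URL value) is hypothetical, pending the actual implementation:

```yaml
# litefs.yml — sketch; the exact key name for the node is undecided
consul:
  url: "http://localhost:8500"
  node: "litefs"   # shared node registered on startup, with no checks attached
```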