
viewstamped-replication-made-famous's Introduction

tigerbeetle

TigerBeetle is a financial transactions database designed for mission critical safety and performance to power the next 30 years of OLTP.

Quickstart

First, download a prebuilt copy of TigerBeetle.

# macOS
curl -Lo tigerbeetle.zip https://mac.tigerbeetle.com && unzip tigerbeetle.zip && ./tigerbeetle version

# Linux
curl -Lo tigerbeetle.zip https://linux.tigerbeetle.com && unzip tigerbeetle.zip && ./tigerbeetle version

# Windows
powershell -command "curl.exe -Lo tigerbeetle.zip https://windows.tigerbeetle.com; Expand-Archive tigerbeetle.zip .; .\tigerbeetle version"

Want to build from source locally?

git clone https://github.com/tigerbeetle/tigerbeetle && cd tigerbeetle
./zig/download.sh # or .bat if you're on Windows.
./zig/zig build
./tigerbeetle version

Running TigerBeetle

Then create the TigerBeetle data file.

./tigerbeetle format --cluster=0 --replica=0 --replica-count=1 --development 0_0.tigerbeetle
info(io): creating "0_0.tigerbeetle"...
info(io): allocating 660.140625MiB...

And start the replica.

./tigerbeetle start --addresses=3000 --development 0_0.tigerbeetle
info(io): opening "0_0.tigerbeetle"...
info(main): 0: cluster=0: listening on 127.0.0.1:3000

Using the CLI Client

Now that you've got a cluster running, let's connect to it and do some accounting!

First let's create two accounts. (Don't worry about the details, you can read about them later.)

./tigerbeetle repl --cluster=0 --addresses=3000
TigerBeetle Client
  Hit enter after a semicolon to run a command.

Examples:
  create_accounts id=1 code=10 ledger=700 flags=linked|history,
                  id=2 code=10 ledger=700;
  create_transfers id=1 debit_account_id=1 credit_account_id=2 amount=10 ledger=700 code=10;
  lookup_accounts id=1;
  lookup_accounts id=1, id=2;
  get_account_transfers account_id=1 flags=debits|credits;
  get_account_balances account_id=1 flags=debits|credits;
create_accounts id=1 code=10 ledger=700,
                id=2 code=10 ledger=700;

Now create a transfer of 10 (of some amount/currency) between the two accounts.

create_transfers id=1 debit_account_id=1 credit_account_id=2 amount=10 ledger=700 code=10;

Now, the amount of 10 has been credited to account 2 and debited from account 1. Let's query TigerBeetle for these two accounts to verify!

lookup_accounts id=1, id=2;
{
  "id": "1",
  "user_data": "0",
  "ledger": "700",
  "code": "10",
  "flags": "",
  "debits_pending": "0",
  "debits_posted": "10",
  "credits_pending": "0",
  "credits_posted": "0"
}
{
  "id": "2",
  "user_data": "0",
  "ledger": "700",
  "code": "10",
  "flags": "",
  "debits_pending": "0",
  "debits_posted": "0",
  "credits_pending": "0",
  "credits_posted": "10"
}

And indeed you can see that account 1 has debits_posted as 10 and account 2 has credits_posted as 10. The 10 amount is fully accounted for!
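If you also want to list the transfers touching an account, the get_account_transfers command from the examples above should do it (output omitted here, and the exact fields may vary by version):

get_account_transfers account_id=1 flags=debits|credits;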


Next Steps

Watch an introduction to TigerBeetle on The Primeagen's channel covering our design decisions around performance, safety, and the debit/credit primitives of financial accounting:

The FASTEST and SAFEST Database

Read more about the history of TigerBeetle, the problem of balance tracking at scale, and the solution of a purpose-built financial transactions database.

Check out our DESIGN doc to see an overview of TigerBeetle's data structures, take a look at our roadmap, and join one of our communities to stay in the loop about fixes and features!

Documentation

Check out docs.tigerbeetle.com.

Here are a few key pages you might be interested in:

Clients

Community

Contributing

Read docs/HACKING.md.

Roadmap

See #259.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

viewstamped-replication-made-famous's People

Contributors

joblerstune, jorangreef, koekiebox, threefx, traviscrist


viewstamped-replication-made-famous's Issues

Liveness: overflow in prepare_timeout.attempts leads to node crash

Description and Impact

Once the prepare_timeout.attempts counter (a u8, incremented in on_prepare_timeout) overflows, the assertion self.prepare_timeout.attempts > 0 fails and the node crashes with an assertion failure. This is a liveness issue, as crashing a node reduces system availability.

// in src/vsr/replica.zig

fn on_prepare_timeout(self: *Self) void {
    // [snip]

    // increases prepare_timeout.attempts
    self.prepare_timeout.backoff(&self.prng);
    
    // [snip]

    // Cycle through the list to reach live replicas and get around partitions:
    
    // This is where the overflow comes back to bite us.
    assert(self.prepare_timeout.attempts > 0);
    const replica = waiting[self.prepare_timeout.attempts % waiting_len];
    assert(replica != self.replica);


    // [snip]
}
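As a standalone illustration (not from the repo), the wrapping behaviour of a u8 counter incremented with +%= is what eventually trips that assertion:

const std = @import("std");
const assert = std.debug.assert;

pub fn main() void {
    var attempts: u8 = 255;
    attempts +%= 1; // wrapping addition: 255 +% 1 wraps around to 0
    assert(attempts == 0); // at this point `attempts > 0` no longer holds
}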

Steps to Reproduce the Bug

  1. Apply patch
  2. Run ./vopr.sh 15502461157524088066 -OReleaseSafe
    • After 25 transitions, the system will crash with the following message (line number may differ):
thread 18126 panic: reached unreachable code
./vopr.sh: line 23: 18126 Aborted                 (core dumped) zig run src/simulator.zig $BUILD_MODE -- $1
  3. Run ./vopr.sh 15502461157524088066 -ODebug 2> /tmp/logfile
    • Note: Takes a long time, about half an hour on my RX5900
    • Stacktrace:
thread 11349 panic: reached unreachable code
/usr/lib/zig/std/debug.zig:226:14: 0x22f90b in std.debug.assert (simulator)
    if (!ok) unreachable; // assertion failure
             ^
/home/bfiedler/projects/tigerbeetle-challenge/src/vsr/replica.zig:1479:19: 0x27c073 in vsr.replica.Replica(test.state_machine.StateMachine,test.message_bus.MessageBus,test.storage.Storage,test.time.Time).on_prepare_timeout (simulator)
            assert(self.prepare_timeout.attempts > 0);
                  ^
/home/bfiedler/projects/tigerbeetle-challenge/src/vsr/replica.zig:398:70: 0x2621bd in vsr.replica.Replica(test.state_machine.StateMachine,test.message_bus.MessageBus,test.storage.Storage,test.time.Time).tick (simulator)
            if (self.prepare_timeout.fired()) self.on_prepare_timeout();
                                                                     ^
/home/bfiedler/projects/tigerbeetle-challenge/src/simulator.zig:177:25: 0x258a92 in main (simulator)
            replica.tick();
                        ^
/usr/lib/zig/std/start.zig:458:37: 0x25084a in std.start.callMain (simulator)
            const result = root.main() catch |err| {
                                    ^
/usr/lib/zig/std/start.zig:400:12: 0x23431e in std.start.callMainWithArgs (simulator)
    return @call(.{ .modifier = .always_inline }, callMain, .{});
           ^
/usr/lib/zig/std/start.zig:319:17: 0x233325 in std.start.posixCallMainAndExit (simulator)
    std.os.exit(@call(.{ .modifier = .always_inline }, callMainWithArgs, .{ argc, argv, envp }));
                ^
/usr/lib/zig/std/start.zig:244:5: 0x233082 in std.start._start (simulator)
    @call(.{ .modifier = .never_inline }, posixCallMainAndExit, .{});
    ^

diff --git a/src/config.zig b/src/config.zig
index f4c3781..d976dad 100644
--- a/src/config.zig
+++ b/src/config.zig
@@ -6,14 +6,14 @@ pub const log_level = 6;
 
 /// The maximum number of replicas allowed in a cluster.
 /// This has been limited to 5 just to decrease the amount of memory required by the VOPR simulator.
-pub const replicas_max = 5;
+pub const replicas_max = 6;
 
 /// The maximum number of clients allowed per cluster, where each client has a unique 128-bit ID.
 /// This impacts the amount of memory allocated at initialization by the server.
 /// This determines the size of the VR client table used to cache replies to clients by client ID.
 /// Each client has one entry in the VR client table to store the latest `message_size_max` reply.
 /// This has been limited to 3 just to decrease the amount of memory required by the VOPR simulator.
-pub const clients_max = 3;
+pub const clients_max = 20;
 
 /// The minimum number of nodes required to form a quorum for replication:
 /// Majority quorums are only required across view change and replication phases (not within).
diff --git a/src/simulator.zig b/src/simulator.zig
index 89a7dca..210052e 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -55,14 +55,14 @@ pub fn main() !void {
     var prng = std.rand.DefaultPrng.init(seed);
     const random = &prng.random;
 
-    const replica_count = 1 + prng.random.uintLessThan(u8, config.replicas_max);
-    const client_count = 1 + prng.random.uintLessThan(u8, config.clients_max);
+    const replica_count = config.replicas_max; //1 + prng.random.uintLessThan(u8, config.replicas_max);
+    const client_count = config.clients_max; //1 + prng.random.uintLessThan(u8, config.clients_max);
     const node_count = replica_count + client_count;
 
     const ticks_max = 100_000_000;
     const transitions_max = config.journal_size_max / config.message_size_max;
-    const request_probability = 1 + prng.random.uintLessThan(u8, 99);
-    const idle_on_probability = prng.random.uintLessThan(u8, 20);
+    const request_probability = 100; //1 + prng.random.uintLessThan(u8, 99);
+    const idle_on_probability = 0; //prng.random.uintLessThan(u8, 20);
     const idle_off_probability = 10 + prng.random.uintLessThan(u8, 10);
 
     cluster = try Cluster.create(allocator, &prng.random, .{
@@ -74,13 +74,13 @@ pub fn main() !void {
             .packet_simulator_options = .{
                 .node_count = node_count,
                 .seed = prng.random.int(u64),
-                .one_way_delay_mean = 3 + prng.random.uintLessThan(u16, 10),
-                .one_way_delay_min = prng.random.uintLessThan(u16, 3),
-                .packet_loss_probability = prng.random.uintLessThan(u8, 30),
-                .path_maximum_capacity = 20 + prng.random.uintLessThan(u8, 20),
+                .one_way_delay_mean = 0, //3 + prng.random.uintLessThan(u16, 10),
+                .one_way_delay_min = 0, //prng.random.uintLessThan(u16, 3),
+                .packet_loss_probability = 0, //prng.random.uintLessThan(u8, 30),
+                .path_maximum_capacity = 100, //20 + prng.random.uintLessThan(u8, 20),
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
                 .path_clog_probability = prng.random.uintLessThan(u8, 2),
-                .packet_replay_probability = prng.random.uintLessThan(u8, 50),
+                .packet_replay_probability = 80, // prng.random.uintLessThan(u8, 50),
             },
         },
         .storage_options = .{
diff --git a/src/test/storage.zig b/src/test/storage.zig
index ed3cbfa..d81270a 100644
--- a/src/test/storage.zig
+++ b/src/test/storage.zig
@@ -281,52 +281,25 @@ pub const Storage = struct {
         // We need to ensure there is message_size_max fault-free padding
         // between faulty areas of memory so that a single message
         // cannot straddle the corruptable areas of a majority of replicas.
-        switch (replica_count) {
-            1 => {
-                // If there is only one replica in the cluster, storage faults are not recoverable.
-                out[0] = .{ .first_offset = size, .period = 1 };
-            },
-            2 => {
-                //  0123456789
-                // 0X   X   X
-                // 1  X   X   X
-                out[0] = .{ .first_offset = 0 * message_size_max, .period = 4 * message_size_max };
-                out[1] = .{ .first_offset = 2 * message_size_max, .period = 4 * message_size_max };
-            },
-            3 => {
-                //  0123456789
-                // 0X     X
-                // 1  X     X
-                // 2    X     X
-                out[0] = .{ .first_offset = 0 * message_size_max, .period = 6 * message_size_max };
-                out[1] = .{ .first_offset = 2 * message_size_max, .period = 6 * message_size_max };
-                out[2] = .{ .first_offset = 4 * message_size_max, .period = 6 * message_size_max };
-            },
-            4 => {
-                //  0123456789
-                // 0X   X   X
-                // 1X   X   X
-                // 2  X   X   X
-                // 3  X   X   X
-                out[0] = .{ .first_offset = 0 * message_size_max, .period = 4 * message_size_max };
-                out[1] = .{ .first_offset = 0 * message_size_max, .period = 4 * message_size_max };
-                out[2] = .{ .first_offset = 2 * message_size_max, .period = 4 * message_size_max };
-                out[3] = .{ .first_offset = 2 * message_size_max, .period = 4 * message_size_max };
-            },
-            5 => {
-                //  0123456789
-                // 0X     X
-                // 1X     X
-                // 2  X     X
-                // 3  X     X
-                // 4    X     X
-                out[0] = .{ .first_offset = 0 * message_size_max, .period = 6 * message_size_max };
-                out[1] = .{ .first_offset = 0 * message_size_max, .period = 6 * message_size_max };
-                out[2] = .{ .first_offset = 2 * message_size_max, .period = 6 * message_size_max };
-                out[3] = .{ .first_offset = 2 * message_size_max, .period = 6 * message_size_max };
-                out[4] = .{ .first_offset = 4 * message_size_max, .period = 6 * message_size_max };
-            },
-            else => unreachable,
+        var i: usize = 0;
+        while (i < config.replicas_max) : (i += 3) {
+            //  0123456789
+            // 0X     X
+            // 1  X     X
+            // 2    X     X
+            // 3X     X
+            // 4  X     X
+            // 5    X     X
+            // ...
+            out[i] = .{ .first_offset = 0 * message_size_max, .period = 6 * message_size_max };
+
+            if (i + 1 < config.replicas_max) {
+                out[i + 1] = .{ .first_offset = 2 * message_size_max, .period = 6 * message_size_max };
+            }
+
+            if (i + 2 < config.replicas_max) {
+                out[i + 2] = .{ .first_offset = 4 * message_size_max, .period = 6 * message_size_max };
+            }
         }
 
         prng.shuffle(FaultyAreas, out[0..replica_count]);

Suggested Fix

A short-term fix is to widen the attempts counter to a type that doesn't overflow as quickly. However, that does not take care of the root cause.

I'm not sure why that assertion is there; I think it is a bug and the assertion should simply be removed, since the timer code explicitly states that it allows overflow (a sketch of the removal follows the Timeout definition below):

pub const Timeout = struct {
    // [snip]
    attempts: u8 = 0,
    // [snip]

    /// Increments the attempts counter and resets the timeout with exponential backoff and jitter.
    /// Allows the attempts counter to wrap from time to time.
    /// The overflow period is kept short to surface any related bugs sooner rather than later.
    /// We do not saturate the counter as this would cause round-robin retries to get stuck.
    pub fn backoff(self: *Timeout, prng: *std.rand.DefaultPrng) void {
        assert(self.ticking);

        self.ticks = 0;
        self.attempts +%= 1;

        log.debug("{}: {s} backing off", .{ self.id, self.name });
        self.set_after_for_rtt_and_attempts(prng);
    }
    // [snip]
}
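Concretely, the removal I have in mind is just the following (a sketch only, assuming the rest of on_prepare_timeout stays unchanged):

// in src/vsr/replica.zig, on_prepare_timeout:

// Cycle through the list to reach live replicas and get around partitions.
// attempts is allowed to wrap (see Timeout.backoff above), so attempts == 0 is
// legitimate here and the old `attempts > 0` assertion is simply dropped:
const replica = waiting[self.prepare_timeout.attempts % waiting_len];
assert(replica != self.replica);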

There do not seem to be similar bugs lurking in the code (at least according to grep 'attempts > 0').

The Story Behind the Bug

Just playing around. My current idea is stress-testing the system by replaying a lot of messages and generating high load (20 clients, each generating many requests), which led to the overflow during this prepare timeout.

Songs Enjoyed During the Production of This Issue

You guessed it: Liquicity Yearmix 2020

I'll throw out a recommendation though: "High" by Polygon and Lois Lauri

Literature

No response

The Last Word

No response

Correctness: 3 corrupt writes in a cluster of 9 lead to data loss

Description and Impact

It is claimed that "I/O corruptions are allowed across all replicas but only for a minority per op". The problem in this case is quorum_replication_max, which allows corruption on only a minority of replicas to cause data loss.

Below we have a cluster of 9 replicas with one client. I/O operations fail only on replicas 0, 3 and 6, not across a majority. With quorum_replication_max set to 3, this can lead to the loss of op 6, as shown below.

[Diagram omitted: replicas 0, 3 and 6 acknowledge op 6 and then each corrupt their copy of it, while the remaining replicas never receive it.]
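As a back-of-the-envelope check (my own sketch; I'm assuming the replication quorum works out to roughly min(quorum_replication_max, majority)):

const assert = @import("std").debug.assert;

pub fn main() void {
    const replica_count: u8 = 9;
    const quorum_replication_max: u8 = 3;

    const majority: u8 = replica_count / 2 + 1; // 5 of 9
    const quorum_replication: u8 =
        if (quorum_replication_max < majority) quorum_replication_max else majority; // 3

    // An op can therefore be committed with copies on only 3 replicas (0, 3 and 6 here),
    // and corrupting exactly those 3 -- still a minority of 9 -- destroys every copy.
    assert(quorum_replication < majority);
}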

Steps to Reproduce the Bug

  1. Apply patch
  2. Run ./vopr.sh 4995708898024657020

Parameters:

replicas=9
clients=1
request_probability=1%
idle_on_probability=0%
idle_off_probability=10%
one_way_delay_mean=5 ticks
one_way_delay_min=1 ticks
packet_loss_probability=0%
path_maximum_capacity=250 messages
path_clog_duration_mean=236 ticks
path_clog_probability=0%
packet_replay_probability=0%
packet_misdeliver_probability=0%
partition_mode = PartitionMode.fixed, // Always partitions 0, 3 and 6
partition_probability = 100%,
unpartition_probability = 100%,
partition_stability = 10000 ticks,
unpartition_stability = 800 ticks,
read_latency_min=0
read_latency_mean=0
write_latency_min=0
write_latency_mean=0
read_fault_probability=0%
write_fault_probability=80% // could happen anytime this is >0%.

Inspecting the VOPR log reveals the following interesting sequence of events

  1. 0 is elected leader of view 0
  2. 0, 3 and 6 are partitioned
  3. Client sends op 6 to leader 0
  4. 0 accepts the request and sends out replication messages
  5. All replication messages to 1, 2, 4, 5, 7 and 8 are lost because of the partition
  6. 0: write: view=0 op=6 offset=2347008 len=668878: 231234353740128973364628986099286746060 starting
  7. corrupting sector at offset 135168 during write by replica 0
  8. 3: write: view=0 op=6 offset=2347008 len=668878: 231234353740128973364628986099286746060 starting
  9. corrupting sector at offset 151552 during write by replica 3
  10. 6: write: view=0 op=6 offset=2347008 len=668878: 231234353740128973364628986099286746060 starting
  11. corrupting sector at offset 151552 during write by replica 6
  12. 3 and 6 send prepare_ok to 0
  13. 0: on_prepare_ok: quorum received, context=231234353740128973364628986099286746060
  14. 0: commit_op: executing view=0 true op=6 checksum=231234353740128973364628986099286746060
  15. 0: commit_op: replying to client: Header{ .checksum = 5181793799176001691513400008763912183, .checksum_body = 175091269644185851891738809765339973858, .parent = 149345613266326242233401785838854728312, .client = 278261561401712285609078112622347422273, .context = 0, .request = 5, .cluster = 0, .epoch = 0, .view = 0, .op = 6, .commit = 6, .offset = 0, .size = 144, .replica = 0, .command = Command.reply, .operation = Operation(3), .version = 0 }
  16. client: on_message: Header{ .checksum = 5181793799176001691513400008763912183, .checksum_body = 175091269644185851891738809765339973858, .parent = 149345613266326242233401785838854728312, .client = 278261561401712285609078112622347422273, .context = 0, .request = 5, .cluster = 0, .epoch = 0, .view = 0, .op = 6, .commit = 6, .offset = 0, .size = 144, .replica = 0, .command = Command.reply, .operation = Operation(3), .version = 0 }
  17. Replicas 0, 3 and 6 cannot read operation 6 anymore, because the body is corrupted, which means that operation 6 is lost after it was committed
  18. No other replica knows about operation 6
3: read_prepare: op=6 checksum=231234353740128973364628986099286746060: corrupt body after read
6: read_prepare: op=6 checksum=231234353740128973364628986099286746060: corrupt body after read
0: read_prepare: op=6 checksum=231234353740128973364628986099286746060: corrupt body after read

diff --git a/src/config.zig b/src/config.zig
index f4c3781..32ecc92 100644
--- a/src/config.zig
+++ b/src/config.zig
@@ -6,14 +6,14 @@ pub const log_level = 6;
 
 /// The maximum number of replicas allowed in a cluster.
 /// This has been limited to 5 just to decrease the amount of memory required by the VOPR simulator.
-pub const replicas_max = 5;
+pub const replicas_max = 9;
 
 /// The maximum number of clients allowed per cluster, where each client has a unique 128-bit ID.
 /// This impacts the amount of memory allocated at initialization by the server.
 /// This determines the size of the VR client table used to cache replies to clients by client ID.
 /// Each client has one entry in the VR client table to store the latest `message_size_max` reply.
 /// This has been limited to 3 just to decrease the amount of memory required by the VOPR simulator.
-pub const clients_max = 3;
+pub const clients_max = 1;
 
 /// The minimum number of nodes required to form a quorum for replication:
 /// Majority quorums are only required across view change and replication phases (not within).
diff --git a/src/simulator.zig b/src/simulator.zig
index f89a899..5c6eae2 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -10,6 +10,7 @@ const Header = @import("vsr.zig").Header;
 const Replica = @import("test/cluster.zig").Replica;
 const StateChecker = @import("test/state_checker.zig").StateChecker;
 const StateMachine = @import("test/cluster.zig").StateMachine;
+const PartitionMode = @import("test/packet_simulator.zig").PartitionMode;
 
 /// The `log` namespace in this root file is required to implement our custom `log` function.
 const output = std.log.scoped(.state_checker);
@@ -57,14 +58,14 @@ pub fn main() !void {
     var prng = std.rand.DefaultPrng.init(seed);
     const random = &prng.random;
 
-    const replica_count = 1 + prng.random.uintLessThan(u8, config.replicas_max);
-    const client_count = 1 + prng.random.uintLessThan(u8, config.clients_max);
+    const replica_count = config.replicas_max; //1 + prng.random.uintLessThan(u8, config.replicas_max);
+    const client_count = config.clients_max; //1 + prng.random.uintLessThan(u8, config.clients_max);
     const node_count = replica_count + client_count;
 
     const ticks_max = 100_000_000;
     const transitions_max = config.journal_size_max / config.message_size_max;
-    const request_probability = 1 + prng.random.uintLessThan(u8, 99);
-    const idle_on_probability = prng.random.uintLessThan(u8, 20);
+    const request_probability = 1; //1 + prng.random.uintLessThan(u8, 99);
+    const idle_on_probability = 0; //prng.random.uintLessThan(u8, 20);
     const idle_off_probability = 10 + prng.random.uintLessThan(u8, 10);
 
     cluster = try Cluster.create(allocator, &prng.random, .{
@@ -74,25 +75,36 @@ pub fn main() !void {
         .seed = prng.random.int(u64),
         .network_options = .{
             .packet_simulator_options = .{
+                .replica_count = replica_count,
+                .client_count = client_count,
                 .node_count = node_count,
                 .seed = prng.random.int(u64),
-                .one_way_delay_mean = 3 + prng.random.uintLessThan(u16, 10),
-                .one_way_delay_min = prng.random.uintLessThan(u16, 3),
-                .packet_loss_probability = prng.random.uintLessThan(u8, 30),
-                .path_maximum_capacity = 20 + prng.random.uintLessThan(u8, 20),
+
+                .one_way_delay_mean = 5, //3 + prng.random.uintLessThan(u16, 100),
+                .one_way_delay_min = 1, //prng.random.uintLessThan(u16, 3),
+                .packet_loss_probability = 0, //prng.random.uintLessThan(u8, 30),
+                .path_maximum_capacity = 250, //20 + prng.random.uintLessThan(u8, 20),
+
+                .partition_mode = .fixed, //random_enum(PartitionMode, random),
+                .partition_probability = 100, //prng.random.uintLessThan(u8, 3),
+                .unpartition_probability = 100, //1 + prng.random.uintLessThan(u8, 10),
+                .partition_stability = 10_000, //100 + prng.random.uintLessThan(u32, 100),
+                .unpartition_stability = 800, //prng.random.uintLessThan(u32, 20),
+
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
-                .path_clog_probability = prng.random.uintLessThan(u8, 2),
-                .packet_replay_probability = prng.random.uintLessThan(u8, 50),
+                .path_clog_probability = 0, //prng.random.uintLessThan(u8, 2),
+                .packet_replay_probability = 0, //prng.random.uintLessThan(u8, 50),
+                .packet_misdeliver_probability = 0, //prng.random.uintLessThan(u8, 10),
             },
         },
         .storage_options = .{
             .seed = prng.random.int(u64),
-            .read_latency_min = prng.random.uintLessThan(u16, 3),
-            .read_latency_mean = 3 + prng.random.uintLessThan(u16, 10),
-            .write_latency_min = prng.random.uintLessThan(u16, 3),
-            .write_latency_mean = 3 + prng.random.uintLessThan(u16, 10),
-            .read_fault_probability = prng.random.uintLessThan(u8, 10),
-            .write_fault_probability = prng.random.uintLessThan(u8, 10),
+            .read_latency_min = 0, //prng.random.uintLessThan(u16, 3),
+            .read_latency_mean = 0, //3 + prng.random.uintLessThan(u16, 10),
+            .write_latency_min = 0, //prng.random.uintLessThan(u16, 3),
+            .write_latency_mean = 0, //3 + prng.random.uintLessThan(u16, 10),
+            .read_fault_probability = 0, //prng.random.uintLessThan(u8, 10),
+            .write_fault_probability = 80, //prng.random.uintLessThan(u8, 10),
         },
     });
     defer cluster.destroy();
@@ -121,6 +133,12 @@ pub fn main() !void {
         \\          path_clog_duration_mean={} ticks
         \\          path_clog_probability={}%
         \\          packet_replay_probability={}%
+        \\          packet_misdeliver_probability={}%
+        \\          partition_mode = {},
+        \\          partition_probability = {}%,
+        \\          unpartition_probability = {}%,
+        \\          partition_stability = {} ticks,
+        \\          unpartition_stability = {} ticks,
         \\          read_latency_min={}
         \\          read_latency_mean={}
         \\          write_latency_min={}
@@ -142,6 +160,14 @@ pub fn main() !void {
         cluster.options.network_options.packet_simulator_options.path_clog_duration_mean,
         cluster.options.network_options.packet_simulator_options.path_clog_probability,
         cluster.options.network_options.packet_simulator_options.packet_replay_probability,
+        cluster.options.network_options.packet_simulator_options.packet_misdeliver_probability,
+
+        cluster.options.network_options.packet_simulator_options.partition_mode,
+        cluster.options.network_options.packet_simulator_options.partition_probability,
+        cluster.options.network_options.packet_simulator_options.unpartition_probability,
+        cluster.options.network_options.packet_simulator_options.partition_stability,
+        cluster.options.network_options.packet_simulator_options.unpartition_stability,
+
         cluster.options.storage_options.read_latency_min,
         cluster.options.storage_options.read_latency_mean,
         cluster.options.storage_options.write_latency_min,
@@ -263,6 +289,12 @@ fn client_callback(
     assert(user_data == 0);
 }
 
+fn random_enum(comptime EnumType: type, random: *std.rand.Random) EnumType {
+    const typeInfo = @typeInfo(EnumType).Enum;
+    const enumAsInt = random.uintAtMost(typeInfo.tag_type, typeInfo.fields.len - 1);
+    return @intToEnum(EnumType, enumAsInt);
+}
+
 fn parse_seed(bytes: []const u8) u64 {
     return std.fmt.parseUnsigned(u64, bytes, 10) catch |err| switch (err) {
         error.Overflow => @panic("seed exceeds a 64-bit unsigned integer"),
diff --git a/src/test/packet_simulator.zig b/src/test/packet_simulator.zig
index adb4942..d277f97 100644
--- a/src/test/packet_simulator.zig
+++ b/src/test/packet_simulator.zig
@@ -11,9 +11,19 @@ pub const PacketSimulatorOptions = struct {
 
     packet_loss_probability: u8,
     packet_replay_probability: u8,
+    packet_misdeliver_probability: u8,
     seed: u64,
+
+    replica_count: u8,
+    client_count: u8,
     node_count: u8,
 
+    partition_mode: PartitionMode,
+    partition_stability: u32,
+    unpartition_stability: u32,
+    partition_probability: u8,
+    unpartition_probability: u8,
+
     /// The maximum number of in-flight packets a path can have before packets are randomly dropped.
     path_maximum_capacity: u8,
 
@@ -27,11 +37,19 @@ pub const Path = struct {
     target: u8,
 };
 
+pub const PartitionMode = enum {
+    uniform_size,
+    uniform_partition,
+    isolate_single,
+    fixed,
+};
+
 /// A fully connected network of nodes used for testing. Simulates the fault model:
 /// Packets may be dropped.
 /// Packets may be delayed.
 /// Packets may be replayed.
 pub const PacketStatistics = enum(u8) {
+    dropped_due_to_partition,
     dropped_due_to_congestion,
     dropped,
     replay,
@@ -58,6 +76,11 @@ pub fn PacketSimulator(comptime Packet: type) type {
         stats: [@typeInfo(PacketStatistics).Enum.fields.len]u32 = [_]u32{0} **
             @typeInfo(PacketStatistics).Enum.fields.len,
 
+        is_partitioned: bool,
+        partition: []bool,
+        replicas: []u8,
+        stability: u32,
+
         pub fn init(allocator: *std.mem.Allocator, options: PacketSimulatorOptions) !Self {
             assert(options.one_way_delay_mean >= options.one_way_delay_min);
             var self = Self{
@@ -71,8 +94,17 @@ pub fn PacketSimulator(comptime Packet: type) type {
                 ),
                 .options = options,
                 .prng = std.rand.DefaultPrng.init(options.seed),
+
+                .is_partitioned = false,
+                .stability = options.unpartition_stability,
+                .partition = try allocator.alloc(bool, @as(usize, options.replica_count)),
+                .replicas = try allocator.alloc(u8, @as(usize, options.replica_count)),
             };
 
+            for (self.replicas) |_, i| {
+                self.replicas[i] = @intCast(u8, i);
+            }
+
             for (self.paths) |*queue| {
                 queue.* = std.PriorityQueue(Data).init(allocator, Self.order_packets);
                 try queue.ensureCapacity(options.path_maximum_capacity);
@@ -134,6 +166,10 @@ pub fn PacketSimulator(comptime Packet: type) type {
             return self.prng.random.uintAtMost(u8, 100) < self.options.packet_replay_probability;
         }
 
+        fn should_misdeliver(self: *Self) bool {
+            return self.prng.random.uintAtMost(u8, 100) < self.options.packet_misdeliver_probability;
+        }
+
         /// Return a value produced using an exponential distribution with
         /// the minimum and mean specified in self.options
         fn one_way_delay(self: *Self) u64 {
@@ -142,14 +178,83 @@ pub fn PacketSimulator(comptime Packet: type) type {
             return min + @floatToInt(u64, @intToFloat(f64, mean - min) * self.prng.random.floatExp(f64));
         }
 
+        fn random_choice(self: *Self, probability: u8) bool {
+            return self.prng.random.uintLessThan(u8, 100) < probability;
+        }
+
+        fn partition_network(
+            self: *Self,
+        ) void {
+            self.is_partitioned = true;
+            self.stability = self.options.partition_stability;
+
+            switch (self.options.partition_mode) {
+                .uniform_size => {
+                    const sz = self.prng.random.uintAtMost(u8, self.options.replica_count);
+                    self.prng.random.shuffle(u8, self.replicas);
+                    for (self.replicas) |r, i| {
+                        self.partition[r] = i < sz;
+                    }
+                },
+                .uniform_partition => {
+                    for (self.replicas) |_, i| {
+                        self.partition[i] = self.random_choice(50);
+                    }
+                },
+                .isolate_single => {
+                    for (self.replicas) |_, i| {
+                        self.partition[i] = false;
+                    }
+                    const n = self.prng.random.uintLessThan(u8, self.options.replica_count);
+                    self.partition[n] = true;
+                },
+                .fixed => {
+                    // XXX: You can make this do anything you want
+                    var i: usize = 0;
+                    while (i < self.replicas.len) : (i += 3) {
+                        self.partition[i] = false;
+                        if (i + 1 < self.replicas.len) self.partition[i + 1] = true;
+                        if (i + 2 < self.replicas.len) self.partition[i + 2] = true;
+                    }
+                },
+            }
+        }
+
+        fn unpartition_network(
+            self: *Self,
+        ) void {
+            self.is_partitioned = false;
+            self.stability = self.options.unpartition_stability;
+
+            for (self.replicas) |_, i| {
+                self.partition[i] = false;
+            }
+        }
+
         pub fn tick(self: *Self) void {
             self.ticks += 1;
 
+            if (self.stability > 0) {
+                self.stability -= 1;
+            } else {
+                if (self.is_partitioned) {
+                    if (self.random_choice(self.options.unpartition_probability)) {
+                        self.unpartition_network();
+                        log.alert("unpartitioned network: partition={d}", .{self.partition});
+                    }
+                } else {
+                    if (self.random_choice(self.options.partition_probability)) {
+                        self.partition_network();
+                        log.alert("partitioned network: partition={d}", .{self.partition});
+                    }
+                }
+            }
+
             var from: u8 = 0;
             while (from < self.options.node_count) : (from += 1) {
                 var to: u8 = 0;
                 while (to < self.options.node_count) : (to += 1) {
-                    const path = .{ .source = from, .target = to };
+                    var path = .{ .source = from, .target = to };
                     if (self.is_clogged(path)) continue;
 
                     const queue = self.path_queue(path);
@@ -157,14 +262,33 @@ pub fn PacketSimulator(comptime Packet: type) type {
                         if (data.expiry > self.ticks) break;
                         _ = queue.remove();
 
+                        const is_misdeliver = self.options.node_count > 1 and self.should_misdeliver();
+                        var orig_to: u8 = self.options.node_count;
+                        if (is_misdeliver) {
+                            orig_to = to;
+                            to = self.prng.random.uintLessThan(u8, self.options.node_count - 1);
+                            if (to >= orig_to) to += 1;
+                            assert(to < self.options.node_count and to != orig_to);
+
+                            path.target = to;
+                            log.debug("misrouting packet from={} orig_to={} new_to={}", .{ from, orig_to, to });
+                        }
+
+                        if (self.is_partitioned) {
+                            if (from < self.options.replica_count and to < self.options.replica_count and self.partition[from] != self.partition[to]) {
+                                self.stats[@enumToInt(PacketStatistics.dropped_due_to_partition)] += 1;
+                                log.alert("dropped packet (different partitions): from={} to={}", .{ from, to });
+                                data.packet.deinit(path);
+                                continue;
+                            }
+                        }
+
                         if (self.should_drop()) {
                             self.stats[@enumToInt(PacketStatistics.dropped)] += 1;
                             log.alert("dropped packet from={} to={}.", .{ from, to });
                             data.packet.deinit(path);
                             continue;
-                        }
-
-                        if (self.should_replay()) {
+                        } else if (self.should_replay()) {
                             self.submit_packet(data.packet, data.callback, path);
 
                             log.debug("replayed packet from={} to={}", .{ from, to });
@@ -176,6 +300,12 @@ pub fn PacketSimulator(comptime Packet: type) type {
                             data.callback(data.packet, path);
                             data.packet.deinit(path);
                         }
+
+                        if (is_misdeliver) {
+                            assert(orig_to < self.options.node_count);
+                            to = orig_to;
+                            path.target = to;
+                        }
                     }
 
                     const reverse_path: Path = .{ .source = to, .target = from };

Suggested Fix

The problem here is that replicas do not verify that their writes are successful before committing.

One possibility would be to require each replica to read back and verify its write before committing. This rules out this specific error case, where the write itself corrupts all copies of an operation; however, it does not protect against disk corruption occurring after the write.
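A very rough sketch of what I mean, with made-up names and a synchronous flow (the real journal API is asynchronous, so this only illustrates the idea):

// Hypothetical sketch only: write_prepare, read_prepare and send_prepare_ok are
// stand-ins for whatever the real journal/replica API looks like.
fn write_prepare_verified(self: *Self, prepare: *Message) void {
    self.journal.write_prepare(prepare);

    // Read the entry back and compare checksums before acknowledging:
    const stored = self.journal.read_prepare(prepare.header.op);
    if (stored == null or stored.?.header.checksum != prepare.header.checksum) {
        // The write itself corrupted the entry: repair or retry instead of acking.
        return;
    }

    self.send_prepare_ok(prepare);
}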

I don't think that this is something the protocol can deal with; instead, I think that the guarantee has to be reduced from

I/O corruptions are allowed across all replicas but only for a minority per op

to

I/O corruptions are allowed across all replicas but only for less than quorum_replication_max per op

The Story Behind the Bug

I found the claim "I/O corruptions are allowed across all replicas but only for a minority per op" to sound too strong given that commits can be done with fewer than the majority of replicas. I came up with the scenario depicted in my drawing first, and then specifically set the system parameters to reproduce it.

Songs Enjoyed During the Production of This Issue

Liquicity 2020 Yearmix

This time I can recommend "Feels So Good" by Logistics

Literature

No response

The Last Word

No response

Issue: What is the status of the challenge? [Will eventually be re-opened]

Description

Hello,

I can see that the challenge is still being actively marketed, e.g. https://tigerbeetle.com/20k-challenge.html .

However, the repositories claim:

The challenge will run until 6pm UTC on November 30, 2022, or until the total challenge bounty of $20k is fully awarded, whichever comes first.

I think that the first condition has been met, because November 30, 2022 has already passed, so the challenge is no longer active. Should we update the date or make it clear that the challenge has expired?

Kind regards

Correctness: Fast Beetle Search Results

Description and Impact

Searching online for "fast beetle" yields information and images of VW Beetle vehicles as the top search result. Cars are not beetles. Tiger beetles are beetles, and incredibly fast ones at that! Therefore, Tiger beetles should be the top search result.

If Tiger beetles don't easily pop up when people search for "fast beetles", then people risk missing out on some fascinating facts about the world they live in, and that loss of information would be a genuine shame.

Steps to Reproduce the Bug

  1. Open a Chrome browser.
  2. Use the Google search engine.
  3. Search for "fast beetle".
  4. View the results.

Suggested Fix

This issue cannot be fixed immediately. But there are steps that can be taken.

  • Get people to read this issue and get inspired to search for "fast beetle" but only click on links that lead to articles about Tiger beetles.
  • When people read about Tiger beetles they will hopefully want to keep reading. Tiger beetles are mighty impressive little critters.
  • All these new Google search hits will bump genuine Tiger beetles up as a search result.
  • Another potential fix is to post more articles and images online to create more awareness. Tiger beetles should get the recognition they deserve. They are the fastest beetle after all!

The Story Behind the Bug

This issue was discovered while the TigerTeam were deciding on a name for the TigerBeetle accounting database.
Don suggested, "What about TigerBeetle? A really fast beetle. Also to go with the idea that it survives anything." Three minutes later we had decided this was the perfect name, so Joran asks "How did you know about them?" and Don replies "I didn't, I googled 'fast beetle'" to which Joran replies "That was a very good google! This is what I get for 'fast beetle' - Google images of Volkswagen Beetle cars"...

Songs Enjoyed During the Production of This Issue

Paul McCartney's rocking live version of The Beatles' "Drive My Car".

Literature

Tiger Beetles: The Fastest Bugs on Six Legs by Debbie Hadley
The Tiger beetle Wikipedia page
When tiger beetles chase prey at high speeds they go blind temporarily, Cornell entomologists learn by Blaine Friedlander

The Last Word

I found it incredible that Tiger beetles run so fast that their eyes can't keep focus and they have to slow down to get a view of the world around them. When they are on the move the world is literally a blur around them.

Liveness: Replicas crash on receiving `Command.reply` messages

Description and Impact

A replica that receives a Command.reply message hits an unreachable statement and crashes. Replicas must be able to deal with this according to the "messages may be misrouted" clause of the network fault model.

// in: src/vsr/replica.zig

/// Called by the MessageBus to deliver a message to the replica.
pub fn on_message(self: *Self, message: *Message) void {
    // [snip]
    assert(message.header.replica < self.replica_count);
    switch (message.header.command) {
        .ping => self.on_ping(message),
        .pong => self.on_pong(message),
        .request => self.on_request(message),
        .prepare => self.on_prepare(message),
        .prepare_ok => self.on_prepare_ok(message),
        .commit => self.on_commit(message),
        .start_view_change => self.on_start_view_change(message),
        .do_view_change => self.on_do_view_change(message),
        .start_view => self.on_start_view(message),
        .recovery => self.on_recovery(message),
        .request_start_view => self.on_request_start_view(message),
        .request_prepare => self.on_request_prepare(message),
        .request_headers => self.on_request_headers(message),
        .headers => self.on_headers(message),
        .nack_prepare => self.on_nack_prepare(message),
        // command == Command.reply lands here
        else => unreachable,
    }
    // [snip]
}

Steps to Reproduce the Bug

  1. Apply patch
  2. Run ./vopr.sh 2669933095950905174
    • crashes after the 18th transition; stack trace:
[debug] (packet_simulator): misdelivering packet from=0 orig_to=2 new_to=0
[debug] (network): deliver_message: Process{ .replica = 0 } > Process{ .replica = 0 }: Command.reply
[debug] (replica): 0: on_message: view=0 status=normal Header{ .checksum = 226925266819502557119179983641806389019, .checksum_body = 298413714882050206950453565564009099268, .parent = 162187561933345702967133189502892135902, .client = 205787606504586803876943260068027083215, .context = 0, .request = 11, .cluster = 0, .epoch = 0, .view = 0, .op = 20, .commit = 20, .offset = 0, .size = 144, .replica = 0, .command = Command.reply, .operation = Operation(3), .version = 0 }
thread 9741 panic: reached unreachable code
/home/bfiedler/projects/tigerbeetle-challenge/src/vsr/replica.zig:459:25: 0x26b7b0 in vsr.replica.Replica(test.state_machine.StateMachine,test.message_bus.MessageBus,test.storage.Storage,test.time.Time).on_message (simulator)
                else => unreachable,
                        ^
/home/bfiedler/projects/tigerbeetle-challenge/src/test/message_bus.zig:59:27: 0x26f9ee in test.message_bus.struct:57:35.wrapper (simulator)
                on_message(@intToPtr(Context, @ptrToInt(_context)), message);
                          ^
/home/bfiedler/projects/tigerbeetle-challenge/src/test/network.zig:170:41: 0x2a6a53 in test.network.Network.deliver_message (simulator)
        target_bus.on_message_callback.?(target_bus.on_message_context, message);
                                        ^
/home/bfiedler/projects/tigerbeetle-challenge/src/test/packet_simulator.zig:185:42: 0x259fa3 in test.packet_simulator.PacketSimulator(test.network.Packet).tick (simulator)
                            data.callback(data.packet, new_path);
                                         ^
/home/bfiedler/projects/tigerbeetle-challenge/src/simulator.zig:168:46: 0x24fe18 in main (simulator)
        cluster.network.packet_simulator.tick();
                                             ^
/usr/lib/zig/std/start.zig:458:37: 0x247b8a in std.start.callMain (simulator)
            const result = root.main() catch |err| {
                                    ^
/usr/lib/zig/std/start.zig:400:12: 0x22b65e in std.start.callMainWithArgs (simulator)
    return @call(.{ .modifier = .always_inline }, callMain, .{});
           ^
/usr/lib/zig/std/start.zig:319:17: 0x22a665 in std.start.posixCallMainAndExit (simulator)
    std.os.exit(@call(.{ .modifier = .always_inline }, callMainWithArgs, .{ argc, argv, envp }));
                ^
/usr/lib/zig/std/start.zig:244:5: 0x22a3c2 in std.start._start (simulator)
    @call(.{ .modifier = .never_inline }, posixCallMainAndExit, .{});
    ^
./vopr.sh: line 23:  9741 Aborted                 (core dumped) zig run src/simulator.zig $BUILD_MODE -- $1

diff --git a/src/simulator.zig b/src/simulator.zig
index 89a7dca..8dde542 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -81,6 +81,7 @@ pub fn main() !void {
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
                 .path_clog_probability = prng.random.uintLessThan(u8, 2),
                 .packet_replay_probability = prng.random.uintLessThan(u8, 50),
+                .packet_misdeliver_probability = prng.random.uintLessThan(u8, 10),
             },
         },
         .storage_options = .{
@@ -118,6 +119,7 @@ pub fn main() !void {
         \\          path_clog_duration_mean={} ticks
         \\          path_clog_probability={}%
         \\          packet_replay_probability={}%
+        \\          packet_misdeliver_probability={}%
         \\          read_latency_min={}
         \\          read_latency_mean={}
         \\          write_latency_min={}
@@ -141,6 +143,7 @@ pub fn main() !void {
         cluster.options.network_options.packet_simulator_options.path_clog_duration_mean,
         cluster.options.network_options.packet_simulator_options.path_clog_probability,
         cluster.options.network_options.packet_simulator_options.packet_replay_probability,
+        cluster.options.network_options.packet_simulator_options.packet_misdeliver_probability,
 
         cluster.options.storage_options.read_latency_min,
         cluster.options.storage_options.read_latency_mean,
diff --git a/src/test/packet_simulator.zig b/src/test/packet_simulator.zig
index 0ec2bc2..a7e2f1c 100644
--- a/src/test/packet_simulator.zig
+++ b/src/test/packet_simulator.zig
@@ -11,6 +11,7 @@ pub const PacketSimulatorOptions = struct {
 
     packet_loss_probability: u8,
     packet_replay_probability: u8,
+    packet_misdeliver_probability: u8,
     seed: u64,
     node_count: u8,
 
@@ -134,6 +135,10 @@ pub fn PacketSimulator(comptime Packet: type) type {
             return self.prng.random.uintAtMost(u8, 100) < self.options.packet_replay_probability;
         }
 
+        fn should_misdeliver(self: *Self) bool {
+            return self.prng.random.uintAtMost(u8, 100) < self.options.packet_misdeliver_probability;
+        }
+
         /// Return a value produced using an exponential distribution with
         /// the minimum and mean specified in self.options
         fn one_way_delay(self: *Self) u64 {
@@ -162,15 +167,25 @@ pub fn PacketSimulator(comptime Packet: type) type {
                             log.alert("dropped packet from={} to={}.", .{ from, to });
                             data.packet.deinit(path);
                             continue;
-                        }
-
-                        if (self.should_replay()) {
+                        } else if (self.should_replay()) {
                             self.submit_packet(data.packet, data.callback, path);
 
                             log.debug("replayed packet from={} to={}", .{ from, to });
                             self.stats[@enumToInt(PacketStatistics.replay)] += 1;
 
                             data.callback(data.packet, path);
+                        } else if (self.options.node_count > 1 and self.should_misdeliver()) {
+                            const orig_to = path.target;
+                            var new_to = self.prng.random.uintLessThan(u8, self.options.node_count - 1);
+                            if (new_to >= orig_to) new_to += 1;
+                            assert(new_to < self.options.node_count and new_to != orig_to);
+
+                            const new_path = Path{ .source = path.source, .target = new_to };
+                            log.debug("misdelivering packet from={} orig_to={} new_to={}", .{ from, orig_to, new_to });
+                            data.callback(data.packet, new_path);
+
+                            // only source matters here
+                            data.packet.deinit(path);
                         } else {
                             log.debug("delivering packet from={} to={}", .{ from, to });
                             data.callback(data.packet, path);

Suggested Fix

Either explicitly ignore .reply commands or ignore all unknown messages in on_message.

// in: src/vsr/replica.zig

/// Called by the MessageBus to deliver a message to the replica.
pub fn on_message(self: *Self, message: *Message) void {
    // [snip]
    assert(message.header.replica < self.replica_count);
    switch (message.header.command) {
        .ping => self.on_ping(message),
        .pong => self.on_pong(message),
        .request => self.on_request(message),
        .prepare => self.on_prepare(message),
        .prepare_ok => self.on_prepare_ok(message),
        .commit => self.on_commit(message),
        .start_view_change => self.on_start_view_change(message),
        .do_view_change => self.on_do_view_change(message),
        .start_view => self.on_start_view(message),
        .recovery => self.on_recovery(message),
        .request_start_view => self.on_request_start_view(message),
        .request_prepare => self.on_request_prepare(message),
        .request_headers => self.on_request_headers(message),
        .headers => self.on_headers(message),
        .nack_prepare => self.on_nack_prepare(message),
        // Option 1: explicitly ignore misrouted replies (optionally logging them):
        .reply => {},
        else => unreachable,
        // Option 2: alternatively, replace the `else => unreachable` above with
        // `else => {}` to log and ignore all unexpected commands.
    }
    // [snip]
}

The Story Behind the Bug

Misdelivering packets was the next thing on my todo-list :) I was surprised that I triggered a bug this quickly.

Songs Enjoyed During the Production of This Issue

Liquicity Yearmix 2020

Another recommendation: "EYE HAVE YOU" from Maretu. Sounds like it could be from Undertale/an Undertale mashup.

Literature

No response

The Last Word

No response

Issue: VOPR fails to run on Windows

Description

Following the instructions in 'How can I run the implementation? How many batteries are included? Do you mean I can even run the VOPR?' in the README on Windows causes Zig to fail with an error about an unexpected number of arguments.

Reproduction

  • Clone the viewstamped-replication-made-famous repository
  • Manually download the prebuilt Zig 0.8.0 linked in 'How can ... the VOPR?'
  • Try to run the VOPR with either given command (zig.exe run src/simulator.zig -OReleaseSafe or zig.exe run src/simulator.zig -- 123)

Relevant system specs

  • OS: Windows 10 Pro
  • Zig version: 0.8.0
  • Shell: both cmd and MSys2

Console output

https://pastebin.com/KsbaVbYQ

Liveness: Client trusts all received pongs

Description and Impact

Client pings are answered by isolated replicas (which are not part of any quorum). This may lead to the client learning a view number which is higher than what the quorum agrees on, and subsequently the client's requests are ignored by the quorum, since its view isn't new enough.

Steps to Reproduce the Bug

  1. Apply the below patch
  2. Run ./vopr.sh 5271112275961929105 -OReleaseSafe
    • no progress is made after operation 20
    • it times out after 100_000_000 ticks
  3. Run ./vopr.sh 5271112275961929105
    • on line ~9.8k, the client accepts a newer view number from the isolated replica 2 (search for on_pong:)
    • this view number is included in subsequent requests, causing the quorum of 0, 1, 3 and 4 to ignore the client's messages
    • no retry logic is embedded in the client, thus no recovery takes place and we stall forever

Note that although I isolated only replica 2 from the cluster, this can already happen with a "simple" full node crash (and no packet drops):

  1. Client sends ping to all replicas.
  2. Replica 2 fires its view change status timeout
  3. Replica 2 increases its view number
  4. Client ping is delivered to Replica 2
  5. Replica 2 responds with view+1
  6. Replica 2 crashes before sending out view_change messages
  7. Client accepts view+1 in on_pong
  8. Cluster stays at view

diff --git a/src/config.zig b/src/config.zig
index ed6ac8f..ea47420 100644
--- a/src/config.zig
+++ b/src/config.zig
@@ -13,7 +13,7 @@ pub const replicas_max = 5;
 /// This determines the size of the VR client table used to cache replies to clients by client ID.
 /// Each client has one entry in the VR client table to store the latest `message_size_max` reply.
 /// This has been limited to 3 just to decrease the amount of memory required by the VOPR simulator.
-pub const clients_max = 3;
+pub const clients_max = 1;
 
 /// The minimum number of nodes required to form a quorum for replication:
 /// Majority quorums are only required across view change and replication phases (not within).
diff --git a/src/simulator.zig b/src/simulator.zig
index 89a7dca..b6fd4a9 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -55,8 +55,8 @@ pub fn main() !void {
     var prng = std.rand.DefaultPrng.init(seed);
     const random = &prng.random;
 
-    const replica_count = 1 + prng.random.uintLessThan(u8, config.replicas_max);
-    const client_count = 1 + prng.random.uintLessThan(u8, config.clients_max);
+    const replica_count = config.replicas_max; //1 + prng.random.uintLessThan(u8, config.replicas_max);
+    const client_count = config.clients_max; //1 + prng.random.uintLessThan(u8, config.clients_max);
     const node_count = replica_count + client_count;
 
     const ticks_max = 100_000_000;
@@ -72,15 +72,26 @@ pub fn main() !void {
         .seed = prng.random.int(u64),
         .network_options = .{
             .packet_simulator_options = .{
+                .replica_count = replica_count,
+                .client_count = client_count,
                 .node_count = node_count,
+
                 .seed = prng.random.int(u64),
+
                 .one_way_delay_mean = 3 + prng.random.uintLessThan(u16, 10),
                 .one_way_delay_min = prng.random.uintLessThan(u16, 3),
-                .packet_loss_probability = prng.random.uintLessThan(u8, 30),
-                .path_maximum_capacity = 20 + prng.random.uintLessThan(u8, 20),
+
+                .partition_mode = .isolate_single,
+                .partition_probability = 100,
+                .unpartition_probability = 0,
+                .partition_stability = 10,
+
+                .path_maximum_capacity = 250, // + prng.random.uintLessThan(u8, 20),
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
-                .path_clog_probability = prng.random.uintLessThan(u8, 2),
-                .packet_replay_probability = prng.random.uintLessThan(u8, 50),
+                .path_clog_probability = 0, //prng.random.uintLessThan(u8, 2),
+
+                .packet_loss_probability = 0, //prng.random.uintLessThan(u8, 30),
+                .packet_replay_probability = 0, //prng.random.uintLessThan(u8, 50),
             },
         },
         .storage_options = .{
@@ -118,6 +129,10 @@ pub fn main() !void {
         \\          path_clog_duration_mean={} ticks
         \\          path_clog_probability={}%
         \\          packet_replay_probability={}%
+        \\          partition_mode = {},
+        \\          partition_probability = {}%,
+        \\          unpartition_probability = {}%,
+        \\          partition_stability = {} ticks,
         \\          read_latency_min={}
         \\          read_latency_mean={}
         \\          write_latency_min={}
@@ -142,6 +157,11 @@ pub fn main() !void {
         cluster.options.network_options.packet_simulator_options.path_clog_probability,
         cluster.options.network_options.packet_simulator_options.packet_replay_probability,
 
+        cluster.options.network_options.packet_simulator_options.partition_mode,
+        cluster.options.network_options.packet_simulator_options.partition_probability,
+        cluster.options.network_options.packet_simulator_options.unpartition_probability,
+        cluster.options.network_options.packet_simulator_options.partition_stability,
+
         cluster.options.storage_options.read_latency_min,
         cluster.options.storage_options.read_latency_mean,
         cluster.options.storage_options.write_latency_min,
diff --git a/src/test/packet_simulator.zig b/src/test/packet_simulator.zig
index 0ec2bc2..3ef918e 100644
--- a/src/test/packet_simulator.zig
+++ b/src/test/packet_simulator.zig
@@ -12,8 +12,16 @@ pub const PacketSimulatorOptions = struct {
     packet_loss_probability: u8,
     packet_replay_probability: u8,
     seed: u64,
+
+    replica_count: u8,
+    client_count: u8,
     node_count: u8,
 
+    partition_mode: PartitionMode,
+    partition_stability: u32,
+    partition_probability: u8,
+    unpartition_probability: u8,
+
     /// The maximum number of in-flight packets a path can have before packets are randomly dropped.
     path_maximum_capacity: u8,
 
@@ -27,11 +35,18 @@ pub const Path = struct {
     target: u8,
 };
 
+pub const PartitionMode = enum {
+    uniform_size,
+    uniform_partition,
+    isolate_single,
+};
+
 /// A fully connected network of nodes used for testing. Simulates the fault model:
 /// Packets may be dropped.
 /// Packets may be delayed.
 /// Packets may be replayed.
 pub const PacketStatistics = enum(u8) {
+    dropped_due_to_partition,
     dropped_due_to_congestion,
     dropped,
     replay,
@@ -58,6 +73,11 @@ pub fn PacketSimulator(comptime Packet: type) type {
         stats: [@typeInfo(PacketStatistics).Enum.fields.len]u32 = [_]u32{0} **
             @typeInfo(PacketStatistics).Enum.fields.len,
 
+        is_partitioned: bool,
+        partition: []bool,
+        replicas: []u8,
+        stability: u32,
+
         pub fn init(allocator: *std.mem.Allocator, options: PacketSimulatorOptions) !Self {
             assert(options.one_way_delay_mean >= options.one_way_delay_min);
             var self = Self{
@@ -71,8 +91,17 @@ pub fn PacketSimulator(comptime Packet: type) type {
                 ),
                 .options = options,
                 .prng = std.rand.DefaultPrng.init(options.seed),
+
+                .is_partitioned = false,
+                .stability = options.partition_stability,
+                .partition = try allocator.alloc(bool, @as(usize, options.replica_count)),
+                .replicas = try allocator.alloc(u8, @as(usize, options.replica_count)),
             };
 
+            for (self.replicas) |_, i| {
+                self.replicas[i] = @intCast(u8, i);
+            }
+
             for (self.paths) |*queue| {
                 queue.* = std.PriorityQueue(Data).init(allocator, Self.order_packets);
                 try queue.ensureCapacity(options.path_maximum_capacity);
@@ -142,9 +171,69 @@ pub fn PacketSimulator(comptime Packet: type) type {
             return min + @floatToInt(u64, @intToFloat(f64, mean - min) * self.prng.random.floatExp(f64));
         }
 
+        fn random_choice(self: *Self, probability: u8) bool {
+            return self.prng.random.uintLessThan(u8, 100) < probability;
+        }
+
+        fn partition_network(
+            self: *Self,
+        ) void {
+            self.is_partitioned = true;
+            self.stability = self.options.partition_stability;
+
+            switch (self.options.partition_mode) {
+                .uniform_size => {
+                    const sz = self.prng.random.uintAtMost(u8, self.options.replica_count);
+                    self.prng.random.shuffle(u8, self.replicas);
+                    for (self.replicas) |r, i| {
+                        self.partition[r] = i < sz;
+                    }
+                },
+                .uniform_partition => {
+                    for (self.replicas) |_, i| {
+                        self.partition[i] = self.random_choice(50);
+                    }
+                },
+                .isolate_single => {
+                    for (self.replicas) |_, i| {
+                        self.partition[i] = false;
+                    }
+                    const n = self.prng.random.uintLessThan(u8, self.options.replica_count);
+                    self.partition[n] = true;
+                },
+            }
+        }
+
+        fn unpartition_network(
+            self: *Self,
+        ) void {
+            self.is_partitioned = false;
+            self.stability = self.options.partition_stability;
+
+            for (self.replicas) |_, i| {
+                self.partition[i] = false;
+            }
+        }
+
         pub fn tick(self: *Self) void {
             self.ticks += 1;
 
+            if (self.stability > 0) {
+                self.stability -= 1;
+            } else {
+                if (self.is_partitioned) {
+                    if (self.random_choice(self.options.unpartition_probability)) {
+                        self.unpartition_network();
+                        log.alert("unpartitioned network: partition={d}", .{self.partition});
+                    }
+                } else {
+                    if (self.random_choice(self.options.partition_probability)) {
+                        self.partition_network();
+                        log.alert("partitioned network: partition={d}", .{self.partition});
+                    }
+                }
+            }
+
             var from: u8 = 0;
             while (from < self.options.node_count) : (from += 1) {
                 var to: u8 = 0;
@@ -157,6 +246,15 @@ pub fn PacketSimulator(comptime Packet: type) type {
                         if (data.expiry > self.ticks) break;
                         _ = queue.remove();
 
+                        if (self.is_partitioned) {
+                            if (from < self.options.replica_count and to < self.options.replica_count and self.partition[from] != self.partition[to]) {
+                                self.stats[@enumToInt(PacketStatistics.dropped_due_to_partition)] += 1;
+                                log.alert("dropped packet (different partitions): from={} to={}", .{ from, to });
+                                data.packet.deinit(path);
+                                continue;
+                            }
+                        }
+
                         if (self.should_drop()) {
                             self.stats[@enumToInt(PacketStatistics.dropped)] += 1;
                             log.alert("dropped packet from={} to={}.", .{ from, to });

Suggested Fix

I see three reasonable approaches here:

  1. Have the client disconnect from and reconnect to the cluster after some number of failed requests.
    • This is a good option in general, as it increases the cluster's robustness to similar issues (a client believing it is ahead of the cluster).
    • This is probably what I would implement, since it is the simplest possible solution to this exact issue.
  2. Forbid replicas from responding to (client) pings in non-normal operation, since a replica in that state is not guaranteed to return to normal status (see the sketch after this list).
    • Not sure about the implications of this one, but it seems sensible given that the client fully trusts the pong messages it receives.
  3. Do random view changes in regular operation.
    • This guarantees that the cluster will not "stagnate" at one view and will eventually catch up to whatever view the client has received.
    • Not sure about the wider implications of this one.
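
For illustration, here is a minimal sketch of approach 2. The handler and field names below are assumptions made for the sake of the example, not the actual replica.zig API, and the snippet is untested:

// file: src/vsr/replica.zig (hypothetical sketch of approach 2, not the real handler)

fn on_ping_client(self: *Self, message: *Message) void {
    // Only advertise our view while in normal status, so that an isolated
    // replica that bumped its view during an incomplete view change cannot
    // poison the client's view number via a pong.
    if (self.status != .normal) {
        log.debug("{}: on_ping_client: ignoring (status={})", .{ self.replica, self.status });
        return;
    }
    // [snip] reply with a pong carrying self.view, as before.
}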

The Story Behind the Bug

I've implemented network partitioning and am playing around with 5 replicas and one client. The 5-replica case is interesting since it allows me to isolate one replica completely without (theoretically) compromising either correctness or liveness.

Songs Enjoyed During the Production of This Issue

Liquicity Yearmix 2020

Literature

No response

The Last Word

I'm having a lot of fun :)

Issue: Read faults use write_fault_probability

Description

Read faults use write_fault_probability

diff --git a/src/test/storage.zig b/src/test/storage.zig
index e1fdd86..ed3cbfa 100644
--- a/src/test/storage.zig
+++ b/src/test/storage.zig
@@ -176,7 +176,7 @@ pub const Storage = struct {
 
     fn read_sectors_finish(storage: *Storage, read: *Storage.Read) void {
         const faulty = storage.faulty_sectors(read.offset, read.buffer.len);
-        if (faulty.len > 0 and storage.x_in_100(storage.options.write_fault_probability)) {
+        if (faulty.len > 0 and storage.x_in_100(storage.options.read_fault_probability)) {
             // Randomly corrupt one of the faulty sectors the read targeted
             // TODO: inject more realistic and varied storage faults as described above.
             const sector_count = @divExact(faulty.len, config.sector_size);

Liveness: race condition when handling successive requests from two clients leads to deadlock

Description and Impact

This bug occurs (only?) with one or two replicas. We assume one replica for all of the following. It seems like this issue is resolved by leader election following a view expiration for n >= 3 replicas.

Successive client requests may lead to a deadlock: the write lock acquired by a replica in journal.write_prepare may be held until after the next op should be written, causing that write to fail and never be retried. In the one-replica case it seems that the replica never recovers from this error and deadlocks.

// file: src/vsr/journal.zig

pub fn write_prepare(
    self: *Self,
    callback: fn (self: *Replica, wrote: ?*Message, trigger: Write.Trigger) void,
    message: *Message,
    trigger: Self.Write.Trigger,
) void {
    // [snip]
    // right here there's a race condition between two messages
    const write = self.writes.acquire() orelse {
        self.write_prepare_debug(message.header, "no IOP available");
        callback(replica, null, trigger);
        return;
    };
    // [snip]
}

This leads to a dirty operation clogging the journal. The replica believes that the op is still going to be written: on_prepare_timeout reports "0: on_prepare_timeout: waiting for journal" and backs off. In repair_op(self, op), the replica cannot choose_any_other_replica(), so the uncommitted, dirty op stays uncommitted and dirty.

At some point the replica gets stuck in a preparing state for each client and no new requests can be answered, so the system deadlocks. In this exact example I have 14 clients: op 12 gets stuck, and the last op number assigned is 25, which makes sense.

This only happens sometimes during testing: it must be the case that two requests from clients i and j (we assume i < j wlog) are both en route to the target replica, and i's request is delivered immediately before j's request. In a distributed setting (i.e. with multiple replicas) this does not occur with high probability, because replicas have to exchange prepare messages before writing the log (it might with the message delay set to 0; I have not tested this). Still, I have no reason to believe that this is only a testing artifact, as this timing might occur naturally in the one-replica setting (imagine co-located clients issuing requests at the same time).

Steps to Reproduce the Bug

  1. Apply the patch at the bottom
  2. Run ./vopr.sh 14432206721076428670 -OReleaseSafe
    • The system will deadlock after two transitions
  3. Run ./vopr.sh 14432206721076428670 to get a detailed log.
diff --git a/src/config.zig b/src/config.zig
index ed6ac8f..f212a24 100644
--- a/src/config.zig
+++ b/src/config.zig
@@ -6,14 +6,14 @@ pub const log_level = 7;
 
 /// The maximum number of replicas allowed in a cluster.
 /// This has been limited to 5 just to decrease the amount of memory required by the VOPR simulator.
-pub const replicas_max = 5;
+pub const replicas_max = 11;
 
 /// The maximum number of clients allowed per cluster, where each client has a unique 128-bit ID.
 /// This impacts the amount of memory allocated at initialization by the server.
 /// This determines the size of the VR client table used to cache replies to clients by client ID.
 /// Each client has one entry in the VR client table to store the latest `message_size_max` reply.
 /// This has been limited to 3 just to decrease the amount of memory required by the VOPR simulator.
-pub const clients_max = 3;
+pub const clients_max = 30;
 
 /// The minimum number of nodes required to form a quorum for replication:
 /// Majority quorums are only required across view change and replication phases (not within).
diff --git a/src/simulator.zig b/src/simulator.zig
index aab7380..7631fd1 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -75,7 +75,7 @@ pub fn main() !void {
                 .one_way_delay_mean = 3 + prng.random.uintLessThan(u16, 10),
                 .one_way_delay_min = prng.random.uintLessThan(u16, 3),
                 .packet_loss_probability = prng.random.uintLessThan(u8, 30),
-                .path_maximum_capacity = 20 + prng.random.uintLessThan(u8, 20),
+                .path_maximum_capacity = 20, // must deterministically set because it screws up the randomness otherwise
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
                 .path_clog_probability = prng.random.uintLessThan(u8, 2),
                 .packet_replay_probability = prng.random.uintLessThan(u8, 50),

Suggested Fix

A "obviously correct" fix is to always wait for writes, however this is infeasible.

I believe that the responsible path in repair_prepare could be adjusted to account for the one replica case. I have not tested any modifications, but this seems like a promising start. I am available on the TigerBeetle Discord for further debugging if needed.

// file: src/vsr/replica.zig

fn repair_prepare(self: *Self, op: u64) bool {
    // [snip]

    if (self.status == .view_change and op > self.commit_max) {
        // [snip]
    } else {
        const nature = if (op > self.commit_max) "uncommitted" else "committed";
        const reason = if (self.journal.faulty.bit(op)) "faulty" else "dirty";
        log.debug("{}: repair_prepare: op={} checksum={} ({s}, {s})", .{
            self.replica,
            op,
            checksum,
            nature,
            reason,
        });

        // We expect that `repair_prepare()` is called in reverse chronological order:
        // Any uncommitted ops should have already been dealt with.
        // We never roll back committed ops, and thus never regard `nack_prepare` responses.
        // Alternatively, we may not be the leader, in which case we do distinguish anyway.
        assert(self.nack_prepare_op == null);
        assert(request_prepare.context == checksum);

        // TODO: account for the one replica case here?
        if (self.choose_any_other_replica()) |replica| {
            self.send_header_to_replica(replica, request_prepare);
        }
    }

    return true;
}
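
As a rough illustration of the TODO above, the final branch could fall back to retrying the local write when there is no other replica to ask. The field names and control flow below are assumptions and untested, in the spirit of the suggestion rather than a definitive fix:

// Hypothetical sketch (untested): in the single-replica case there is no
// other replica to repair from, so instead of silently dropping the repair,
// retry the local journal write for this op once a write IOP frees up.
if (self.replica_count == 1) {
    // [snip] look up the op's header in the journal and re-issue
    // journal.write_prepare() for it (or arm a timeout that does so).
    return true;
}

if (self.choose_any_other_replica()) |replica| {
    self.send_header_to_replica(replica, request_prepare);
}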

The Story Behind the Bug

I had (have?) a feeling that many clients might lead to problems/race conditions, so I increased the client count. Then I ran ./vopr.sh with the increased client numbers and stumbled upon this.

I was not specifically targeting the one replica case, and was really lucky to get such a small trace to analyze (only roughly the first 1.5k relevant log lines contain all the necessary info).

Songs Enjoyed During the Production of This Issue

Liquicity Yearmix 2020

Literature

No response

The Last Word

No response
