Jepsen is a framework for testing di

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Towards Jepsen-like tests about quickcheck-state-machine HOT 7 OPEN

advancedtelematic commented on May 26, 2024 2

Towards Jepsen-like tests

from quickcheck-state-machine.

Comments (7)

stevana commented on May 26, 2024

There's also the case when an action that returns a reference fails. In that case the subsequent actions that use the reference and have a precondition that the reference must exist in the model, will fail saying that the precondition was false.

from quickcheck-state-machine.

stevana commented on May 26, 2024

Crashes

If an operation does not complete for some reason (perhaps because it timed out or a critical component crashed) that operation has no completion time, and must, in general, be considered concurrent with every operation after its invocation. It may or may not execute.

A process with an operation is in this state is effectively stuck, and can never invoke another operation again. If it were to invoke another operation, it would violate our single-threaded constraint: processes only do one thing at a time.

Source: https://jepsen.io/consistency

from quickcheck-state-machine.

kderme commented on May 26, 2024

@stevana my understanding is that this #208 is a prerequisite for Jepsen-like tests right?

from quickcheck-state-machine.

stevana commented on May 26, 2024

@kderme: No, I don't think so. (It might make some tests easier to write, but shouldn't be necessary.)

from quickcheck-state-machine.

stevana commented on May 26, 2024

The above PR adds the ability to complete histories. To be able to complete a
history is needed when exceptions are thrown in the semantics,
because of e.g. request timeouts (as will happen with partitions). We
don't want to catch those exceptions because this would lead to a
non-deterministic model, because if a timeout happen for a write-like
command then we have no way to tell if it the request to the node
timed out (database didn't change) or if the response back from
the node timed out (database updated). And thus if we would catch this
exception we would need to update the model with both possibilities,
otherwise later read operations would fail.

What completing history does is that it simply appends a response to the
end of the history for all threads/pids/workers that crashed/threw an
exception. In the case of a write-like command the response is a simple
"Ack". Now because of the way of how linearisability works, it will try
all possible interleavings of this write-like command with all the later
read-like commands and check if there's a possible interleaving that is
consistent. So the linearisability checker handles/hides the
non-determinism of the model!

What's still not clear to me is: can we find bugs without completing, e.g.
maybe with partitions and some (simpler) version of accounting for this
non-determinism in the model? Or somehow controlling the fault injection in a more
precise way so that we know exactly if the timeout happened to or from
the node, and thus know if the database was updated or not?

from quickcheck-state-machine.

stevana commented on May 26, 2024

@kderme is currently working on adding an example which uses rqlite (a distributed version of sqlite that uses Raft for consensus), hopefully this can serve as a test bed for experimenting with distributed systems and fault injection. See the following work-in-progress branch.

As a first experiment, the idea is to try to trigger a stale read in the weak read consistency mode by either stopping and restarting nodes or by causing partitions (perhaps using blockade).

from quickcheck-state-machine.

stevana commented on May 26, 2024

What's still not clear to me is: can we find bugs without completing, e.g.
maybe with partitions and some (simpler) version of accounting for this
non-determinism in the model? [...]

I think the answer is yes and that there's a trade-off here:

Simple model that doesn't account for faults, needs completion, will be slow
Complex model that accounts for faults, doesn't need completion, will be fast(er)

I've also learned why Jepsen doesn't have a completion function, if an operation crashes it advances the model purely from the request. This isn't possible in our case as our transition function also involves the response.

Lets make things a bit more concrete with an example, consider a simple counter that starts at 0 and can be incremented:

thread 1, request: increment
thread 2, request: increment
thread 1, response: ok
thread 2, response: timeout

At this point the value of the counter could be 1 or 2 depending on if the request on the second thread timed out while going to the server (the counter didn't get updated), or if the response going back to the client timed out (the counter got updated). Let's continue execution:

thread 1, request: read
thread 3, request: read
thread 1, response: read -> 1
thread 3, response: read -> 2

This seems weird, how can read return two different values, without an increment happening in between? Remember that crashing operations (e.g. the timeout on the second thread) are concurrent with all operations after it, so the effect of the increment could happen between the responses of the reads and thus making the history linearise.

Also note that the second thread cannot be used as that could break the "single-threaded constraint: processes only do one thing at a time", as per the comment above.

from quickcheck-state-machine.

Towards Jepsen-like tests about quickcheck-state-machine HOT 7 OPEN

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent