Git Product home page Git Product logo

Comments (7)

stevana avatar stevana commented on May 26, 2024

There's also the case when an action that returns a reference fails. In that case the subsequent actions that use the reference and have a precondition that the reference must exist in the model, will fail saying that the precondition was false.

from quickcheck-state-machine.

stevana avatar stevana commented on May 26, 2024

Crashes

If an operation does not complete for some reason (perhaps because it timed out or a critical component crashed) that operation has no completion time, and must, in general, be considered concurrent with every operation after its invocation. It may or may not execute.

A process with an operation is in this state is effectively stuck, and can never invoke another operation again. If it were to invoke another operation, it would violate our single-threaded constraint: processes only do one thing at a time.

Source: https://jepsen.io/consistency

from quickcheck-state-machine.

kderme avatar kderme commented on May 26, 2024

@stevana my understanding is that this #208 is a prerequisite for Jepsen-like tests right?

from quickcheck-state-machine.

stevana avatar stevana commented on May 26, 2024

@kderme: No, I don't think so. (It might make some tests easier to write, but shouldn't be necessary.)

from quickcheck-state-machine.

stevana avatar stevana commented on May 26, 2024

The above PR adds the ability to complete histories. To be able to complete a
history is needed when exceptions are thrown in the semantics,
because of e.g. request timeouts (as will happen with partitions). We
don't want to catch those exceptions because this would lead to a
non-deterministic model, because if a timeout happen for a write-like
command then we have no way to tell if it the request to the node
timed out (database didn't change) or if the response back from
the node timed out (database updated). And thus if we would catch this
exception we would need to update the model with both possibilities,
otherwise later read operations would fail.

What completing history does is that it simply appends a response to the
end of the history for all threads/pids/workers that crashed/threw an
exception. In the case of a write-like command the response is a simple
"Ack". Now because of the way of how linearisability works, it will try
all possible interleavings of this write-like command with all the later
read-like commands and check if there's a possible interleaving that is
consistent. So the linearisability checker handles/hides the
non-determinism of the model!

What's still not clear to me is: can we find bugs without completing, e.g.
maybe with partitions and some (simpler) version of accounting for this
non-determinism in the model? Or somehow controlling the fault injection in a more
precise way so that we know exactly if the timeout happened to or from
the node, and thus know if the database was updated or not?

from quickcheck-state-machine.

stevana avatar stevana commented on May 26, 2024

@kderme is currently working on adding an example which uses rqlite (a distributed version of sqlite that uses Raft for consensus), hopefully this can serve as a test bed for experimenting with distributed systems and fault injection. See the following work-in-progress branch.

As a first experiment, the idea is to try to trigger a stale read in the weak read consistency mode by either stopping and restarting nodes or by causing partitions (perhaps using blockade).

from quickcheck-state-machine.

stevana avatar stevana commented on May 26, 2024

What's still not clear to me is: can we find bugs without completing, e.g.
maybe with partitions and some (simpler) version of accounting for this
non-determinism in the model? [...]

I think the answer is yes and that there's a trade-off here:

  1. Simple model that doesn't account for faults, needs completion, will be slow
  2. Complex model that accounts for faults, doesn't need completion, will be fast(er)

I've also learned why Jepsen doesn't have a completion function, if an operation crashes it advances the model purely from the request. This isn't possible in our case as our transition function also involves the response.

Lets make things a bit more concrete with an example, consider a simple counter that starts at 0 and can be incremented:

thread 1, request: increment
thread 2, request: increment
thread 1, response: ok
thread 2, response: timeout

At this point the value of the counter could be 1 or 2 depending on if the request on the second thread timed out while going to the server (the counter didn't get updated), or if the response going back to the client timed out (the counter got updated). Let's continue execution:

thread 1, request: read
thread 3, request: read
thread 1, response: read -> 1
thread 3, response: read -> 2

This seems weird, how can read return two different values, without an increment happening in between? Remember that crashing operations (e.g. the timeout on the second thread) are concurrent with all operations after it, so the effect of the increment could happen between the responses of the reads and thus making the history linearise.

Also note that the second thread cannot be used as that could break the "single-threaded constraint: processes only do one thing at a time", as per the comment above.

from quickcheck-state-machine.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.