Git Product home page Git Product logo

Comments (13)

mjameson-se avatar mjameson-se commented on September 18, 2024 1

I've updated my PR with some tweaks based on our discussion:

  1. Allowed separately configuring the election-related request timeout
  2. Tweaked the timeout implementation to not create a cancellation token source if it is not needed

from dotnext.

sakno avatar sakno commented on September 18, 2024

Hi @mjameson-se ,

I'm wondering how exactly your algorithm of distributed lock based on Raft looks like. There are two possible approaches:

  • If you need to have guarantees that some portion of code (critical section) executing on the single node then you can simply use ICluster.Leader property and then check its IsRemote property. If it's true then you can safely execute your code. However, min election timeout must be equal to or greater than the execution time of such code. This is important because another node will not become a leader during execution of the critical section.
  • If you want to have fully-featured framework for distributed locks with the ability to have multiple named distributed locks then yes, you have to use IMessageBus.LeaderRouter property. Distributed locks were added by me to this repo but removed afterwards. If you're interested in this implementation (it was even fully tested) please check 7d4f7b3 commit and its parent commits. This feature was removed due to few reasons: 1) distributed locks can be implemented in masterless architecture without Raft at all, 2) persistent write-ahead log must be used; as a result, this requirement turns your application into stateful service, 3) I didn't want to support the code

Let's return back to HttpClient timeout. First of all, it's highly recommended to configure HttpMessageHandler properly. The most important thing here is Connection Timeout, not Request Timeout. My implementation doesn't rely on Request Timeout. By default, it's equal to 60 seconds for all HTTP requests. Don't use LeaderRouter for long-running operations because the request will be aborted due to timeout. You can assume to use potentially long-running request for lock acquisition on the master node because you need to create log record indicating a new state of distributed lock (and you have to use persistent write-ahead log for that). If so, this is bad decision. By design, distributed lock must have so called lease period. It's amount of time a lock can be held without renewal requests. This time may be much greater than HTTP request timeout. Therefore, the only one valid strategy here is to use long-polling. Just replace AcquireLock RPC call via LeaderRouter with TryAcquire.

from dotnext.

mjameson-se avatar mjameson-se commented on September 18, 2024

Lets set aside the question of how to properly implement distributed locks and focus on the core of what is not working for me: an operation that takes longer than the upper election timeout (in my application currently set to 500ms) gets cancelled by the sender and when using the leader router it automatically gets repeated until it completes normally. With some basic tweaks I was able to get it to work as expected: https://github.com/mjameson-se/dotNext/tree/my/mj/longPolling

For more context, I started off this application by copying the RaftNode example which uses the SocketsHttpHandler with a very low connect timeout. I have not tested to see if the default http handler behaves differently.

from dotnext.

sakno avatar sakno commented on September 18, 2024

@mjameson-se , please check the commit mentioned above. Unfortunately, there is no way to specify request timeout individually for each request. This is limitation of HttpClient class. The commit contains a new configuration property requestTimeout that allows to override this value. Is this what you want?

from dotnext.

mjameson-se avatar mjameson-se commented on September 18, 2024

While HttpClient does not, in the strictest sense, support specifying a timeout per request, the documentation suggests that you can accomplish the same goal by cancelling your cancellation token after the desired time using CancelAfter; the risk with setting the timeout for all requests to be higher than the election timeout is that certain failure modes may not be detected promptly

from dotnext.

sakno avatar sakno commented on September 18, 2024

I checked the consequences of setting timeout higher than election timeout. This makes sense when the node in candidate or leader state. In case of candidate state, the token will be canceled automatically by candidate timeout regardless of actual HttpClient timeout. In case of leader state if some node doesn't respond then another node will start election. The current leader will receive a new vote message with increased term and downgrades the current state. This action calls LeaderState.StopAsync and cancels the token associated with this state. As a result, the request will be aborted.

Individual timeout for each request could be challenging. Unfortunately, the implementation expects that cancellation can be caused by two reasons:

  • Caller code cancels the token
  • Request is timed out

Both raises OperationCanceledException.

from dotnext.

sakno avatar sakno commented on September 18, 2024

One more thing: allocating CancellationTokenSource for each request is not optimal from my point of view. I understand that underlying implementation of HttpClient doing the same thing to control the timeout. However, this code is a part of .NET, not mine, and can be fixed by .NET contributors in any time. I would prefer do not allocate token source twice for each request.

from dotnext.

mjameson-se avatar mjameson-se commented on September 18, 2024

Having a hung node cause the leader to hang sounds like a bug to me 👎 In fact in light of this I think having the request timeout for the heartbeat messages set to the upper election timeout is probably still too high because with the way its implemented in DotNext this will almost certainly cause a change in leadership.

from dotnext.

sakno avatar sakno commented on September 18, 2024

Here is a good discussion about connect/request timeouts in Raft: https://groups.google.com/g/raft-dev/c/YWTYeq_7QmQ/m/-BYwWbA_BAAJ
Opinions were divided: use upperElectionTimeout/2 or upperElectionTimeout. My choice was the last one. It depends on interpretation. Hung leader will be reverted to Follower state regardless of the client timeout.

from dotnext.

sakno avatar sakno commented on September 18, 2024

@mjameson-se , many thanks for your contribution. I'm using your PR as a base for code changes. There is more elegant way to override timeout for individual request based on message, so I did this directly in develop branch. Additionally, it's necessary to port these changes to UDP/TCP transports for Raft as well. You can examine commit 094ace3 with all necessary modifications.

As a result, your PR conflicting with my changes. I would appreciate if you able to apply similar changes to UDP/TCP transports. If not, I'll do that, no problem here.

from dotnext.

mjameson-se avatar mjameson-se commented on September 18, 2024

@sakno I've tested your version and it does solve my problem. Thank you.

As for the UDP/TCP transports, I haven't looked at them enough to even know where to start. TLS is a hard requirement for me so the AspNetCore version was the only option I considered. I will close my PR

from dotnext.

sakno avatar sakno commented on September 18, 2024

Probably I'll add TLS support for TCP transport in the next versions.

from dotnext.

sakno avatar sakno commented on September 18, 2024

DotNext.Net.Cluster 2.11.0 and DotNext.AspNetCore.Cluster 2.11.0 released on NuGet.

from dotnext.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.