Comments (13)
I've updated my PR with some tweaks based on our discussion:
- Allowed separately configuring the election-related request timeout
- Tweaked the timeout implementation to not create a cancellation token source if it is not needed
from dotnext.
Hi @mjameson-se ,
I'm wondering how exactly your algorithm of distributed lock based on Raft looks like. There are two possible approaches:
- If you need to have guarantees that some portion of code (critical section) executing on the single node then you can simply use
ICluster.Leader
property and then check itsIsRemote
property. If it's true then you can safely execute your code. However, min election timeout must be equal to or greater than the execution time of such code. This is important because another node will not become a leader during execution of the critical section. - If you want to have fully-featured framework for distributed locks with the ability to have multiple named distributed locks then yes, you have to use
IMessageBus.LeaderRouter
property. Distributed locks were added by me to this repo but removed afterwards. If you're interested in this implementation (it was even fully tested) please check 7d4f7b3 commit and its parent commits. This feature was removed due to few reasons: 1) distributed locks can be implemented in masterless architecture without Raft at all, 2) persistent write-ahead log must be used; as a result, this requirement turns your application into stateful service, 3) I didn't want to support the code
Let's return back to HttpClient
timeout. First of all, it's highly recommended to configure HttpMessageHandler
properly. The most important thing here is Connection Timeout, not Request Timeout. My implementation doesn't rely on Request Timeout. By default, it's equal to 60 seconds for all HTTP requests. Don't use LeaderRouter
for long-running operations because the request will be aborted due to timeout. You can assume to use potentially long-running request for lock acquisition on the master node because you need to create log record indicating a new state of distributed lock (and you have to use persistent write-ahead log for that). If so, this is bad decision. By design, distributed lock must have so called lease period. It's amount of time a lock can be held without renewal requests. This time may be much greater than HTTP request timeout. Therefore, the only one valid strategy here is to use long-polling. Just replace AcquireLock RPC call via LeaderRouter
with TryAcquire.
from dotnext.
Lets set aside the question of how to properly implement distributed locks and focus on the core of what is not working for me: an operation that takes longer than the upper election timeout (in my application currently set to 500ms) gets cancelled by the sender and when using the leader router it automatically gets repeated until it completes normally. With some basic tweaks I was able to get it to work as expected: https://github.com/mjameson-se/dotNext/tree/my/mj/longPolling
For more context, I started off this application by copying the RaftNode example which uses the SocketsHttpHandler with a very low connect timeout. I have not tested to see if the default http handler behaves differently.
from dotnext.
@mjameson-se , please check the commit mentioned above. Unfortunately, there is no way to specify request timeout individually for each request. This is limitation of HttpClient
class. The commit contains a new configuration property requestTimeout
that allows to override this value. Is this what you want?
from dotnext.
While HttpClient
does not, in the strictest sense, support specifying a timeout per request, the documentation suggests that you can accomplish the same goal by cancelling your cancellation token after the desired time using CancelAfter
; the risk with setting the timeout for all requests to be higher than the election timeout is that certain failure modes may not be detected promptly
from dotnext.
I checked the consequences of setting timeout higher than election timeout. This makes sense when the node in candidate or leader state. In case of candidate state, the token will be canceled automatically by candidate timeout regardless of actual HttpClient timeout. In case of leader state if some node doesn't respond then another node will start election. The current leader will receive a new vote message with increased term and downgrades the current state. This action calls LeaderState.StopAsync
and cancels the token associated with this state. As a result, the request will be aborted.
Individual timeout for each request could be challenging. Unfortunately, the implementation expects that cancellation can be caused by two reasons:
- Caller code cancels the token
- Request is timed out
Both raises OperationCanceledException
.
from dotnext.
One more thing: allocating CancellationTokenSource
for each request is not optimal from my point of view. I understand that underlying implementation of HttpClient
doing the same thing to control the timeout. However, this code is a part of .NET, not mine, and can be fixed by .NET contributors in any time. I would prefer do not allocate token source twice for each request.
from dotnext.
Having a hung node cause the leader to hang sounds like a bug to me 👎 In fact in light of this I think having the request timeout for the heartbeat messages set to the upper election timeout is probably still too high because with the way its implemented in DotNext this will almost certainly cause a change in leadership.
from dotnext.
Here is a good discussion about connect/request timeouts in Raft: https://groups.google.com/g/raft-dev/c/YWTYeq_7QmQ/m/-BYwWbA_BAAJ
Opinions were divided: use upperElectionTimeout/2 or upperElectionTimeout. My choice was the last one. It depends on interpretation. Hung leader will be reverted to Follower state regardless of the client timeout.
from dotnext.
@mjameson-se , many thanks for your contribution. I'm using your PR as a base for code changes. There is more elegant way to override timeout for individual request based on message, so I did this directly in develop
branch. Additionally, it's necessary to port these changes to UDP/TCP transports for Raft as well. You can examine commit 094ace3 with all necessary modifications.
As a result, your PR conflicting with my changes. I would appreciate if you able to apply similar changes to UDP/TCP transports. If not, I'll do that, no problem here.
from dotnext.
@sakno I've tested your version and it does solve my problem. Thank you.
As for the UDP/TCP transports, I haven't looked at them enough to even know where to start. TLS is a hard requirement for me so the AspNetCore version was the only option I considered. I will close my PR
from dotnext.
Probably I'll add TLS support for TCP transport in the next versions.
from dotnext.
DotNext.Net.Cluster 2.11.0
and DotNext.AspNetCore.Cluster 2.11.0
released on NuGet.
from dotnext.
Related Issues (20)
- DotNext vs Community toolkit
- How to use Optional<string> as query parameter? HOT 5
- Support for non-LTS .NET releases HOT 1
- Various trimming warnings in DotNext.Metaprogramming HOT 15
- Potential addition of an `OrderedDictionary<TKey, TValue>` type HOT 7
- raft leader loses leadership and the node gets stuck HOT 19
- Metaprogramming: Try-catch not catching exception HOT 2
- ask for set up workaround to make AOT works HOT 4
- Directly reference algorithm used in int sqrt HOT 3
- Not seeing expected improvement in throughput of RaftCluster.ReplicateAsync method when cluster minority is inaccessible HOT 20
- Seemingly random NullReferenceException in async state machine HOT 13
- Cache bound by weighted count HOT 1
- AsyncReaderWriterLock overload parity between Enter and Acquire HOT 1
- Add `IDisposable`-returning extension method for upgrading read lock to write lock HOT 7
- DotNext.Net.Cluster crash in production since I think version 5.4.0 HOT 24
- DotNext.Net.Cluster: System.ArgumentOutOfRangeException: Non-negative number required. (Parameter 'length') HOT 44
- If ColdStarted node is down, new leader is not elected HOT 1
- Did you remove "IEnumerable<T>.FirstOrNone"? HOT 11
- Suggestions about UserDataStorage.CopyTo() and UserDataStorage.Clear() HOT 2
- Support for enumeration in TypeMap HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dotnext.