
Comments (78)

potrusil-osi avatar potrusil-osi commented on July 17, 2024 1

@sakno, after all the optimizations I'm getting much better numbers... The suggestion to have a buffer with a different size during compaction (described in #56) still holds. I have modified dotNext locally to give me that option, and writing to the snapshot file is much faster thanks to that. I set the buffer to approximately the size of the snapshot.


sakno avatar sakno commented on July 17, 2024 1

Great news! Currently I'm working on buffered network I/O that allows copying log entry content passed over the network before calling AppendAsync. This will reduce lock contention between the replication process and user code calling AppendAsync manually. Additionally, your suggestion has been added to the task list.


potrusil-osi avatar potrusil-osi commented on July 17, 2024 1

I'm getting about 50% performance increase without that flag! I wonder if I need to test everything again to figure out the ideal buffer sizes.


sakno avatar sakno commented on July 17, 2024 1

Now the WriteThrough flag is disabled by default but can be enabled using the Options.WriteThrough property. The changes are pushed to the develop branch.


sakno avatar sakno commented on July 17, 2024 1

@potrusil-osi , I've finished the buffering API for log entries. It has two use cases:

  1. For buffering before calling AppendAsync from user code
  2. For buffering before calling AppendAsync by transport infrastructure

In the case of the Interpreter Framework, you can use it easily as follows:

var entry = interpreter.CreateLogEntry<MyCommand>(...);
using var buffered = await BufferedRaftLogEntry.CopyAsync(entry, options, token);
await log.AppendAsync(buffered, token);

This makes sense only for large log entries. Also, you need to experiment with the settings of the RaftLogEntryBufferingOptions class, an instance of which must be passed to the CopyAsync method. It has a MemoryThreshold option: if the size of a log entry is larger than this threshold, then disk persistence will be used for buffering. This is relevant for transport-level buffering because reading from disk is faster than reading over the network, but it probably makes no sense for custom log entries constructed by the Interpreter. As a result, use buffered log entries only for appropriate commands.
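
For illustration, constructing the options might look roughly like this (a sketch continuing the snippet above; the 4 KB threshold is just an example value):

var bufferingOptions = new RaftLogEntryBufferingOptions
{
    MemoryThreshold = 4096 // sketch: entries larger than this fall back to disk-backed buffering
};

using var bufferedEntry = await BufferedRaftLogEntry.CopyAsync(entry, bufferingOptions, token);
await log.AppendAsync(bufferedEntry, token);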


sakno avatar sakno commented on July 17, 2024 1

Now I'm halfway to a new version of PersistentState with parallel compaction. It will have the following characteristics:

  • Log compaction and appending new entries can be done in parallel
  • Parallel reads are allowed
  • Reads are mutually exclusive with log compaction or writes (so you cannot read in parallel with log compaction)

At the moment I've implemented one innovative thing: a concurrent sorted linked list for partitions. This hand-written data structure is much better than SortedDictionary<K, V> from .NET, which has O(log n) search in comparison to the O(1) average search provided by the new structure. Moreover, the new structure provides lock-free add/remove operations.

The remaining half of the way is writing a special lock type on top of AsyncTrigger.


sakno avatar sakno commented on July 17, 2024 1

SnapshotBuilder.AdjustIndex has been added to the develop branch. In your case, all skipped partitions remain unread and are removed at the final stage of the compaction. But a little performance overhead still exists: the partition table is now organized as a sorted linked list. If you switch the index in a random way, then finding the right partition cannot be done in O(1). Anyway, this overhead is very small in comparison to I/O operations.


sakno avatar sakno commented on July 17, 2024 1

The Interpreter Framework is not tightly coupled with PersistentState, so there is no way to do that directly. However, you can use System.Threading.AsyncLocal<T> to pass the index from PersistentState.ApplyAsync to the formatter.
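
A minimal sketch of that workaround, assuming a WAL derived from PersistentState and assuming the entry's index is available inside ApplyAsync (the Index property below is hypothetical):

private static readonly AsyncLocal<long> CurrentEntryIndex = new();

protected override async ValueTask ApplyAsync(LogEntry entry)
{
    CurrentEntryIndex.Value = entry.Index;    // hypothetical index source
    await interpreter.InterpretAsync(entry);  // the formatter reads CurrentEntryIndex.Value during deserialization
}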


sakno avatar sakno commented on July 17, 2024 1

You can remember this index using the ApplyAsync method. Check the IsSnapshot property of the LogEntry type. But you need to be careful here: compaction can be executed in parallel with the commit process. Therefore, ApplyAsync can be called concurrently.
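
A sketch of that capture, assuming the entry exposes its index, with an interlocked write because of the possible concurrency mentioned above:

private long lastSnapshotIndex;

protected override ValueTask ApplyAsync(LogEntry entry)
{
    if (entry.IsSnapshot)
        Interlocked.Exchange(ref lastSnapshotIndex, entry.Index); // safe even if ApplyAsync runs concurrently

    // ...apply the command or snapshot to the in-memory state here...
    return ValueTask.CompletedTask;
}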


sakno avatar sakno commented on July 17, 2024 1

It's an error that has existed for a long time, since 1.x. Usually, it has no effect on the application and you can ignore it.


sakno avatar sakno commented on July 17, 2024 1

One more thing: PersistentState.ApplyAsync may receive a snapshot not only at startup. If your node is too far behind the leader, then the snapshot installation procedure may be triggered by it. In this case, the leader will send the snapshot instead of log entries to that node, and you'll get the snapshot entry in the method.


sakno avatar sakno commented on July 17, 2024 1

@potrusil-osi , I have moved partitions cleanup out of the lock for all compaction modes.


potrusil-osi avatar potrusil-osi commented on July 17, 2024 1

Not now. We just moved the testing to a testing cluster which might reveal some more things.


potrusil-osi avatar potrusil-osi commented on July 17, 2024 1

Hi @sakno, I like both optimizations! Hopefully I will have time to take advantage of them before the research project this is part of is finished.


potrusil-osi avatar potrusil-osi commented on July 17, 2024 1

After getting the latest develop:

Method                                      Mean       Error      StdDev
ReadLogEntriesWithFullCacheAsync             6.208 us  0.0295 us  0.0247 us
ReadLogEntriesWithMetadataCacheAsync        59.048 us  0.2609 us  0.2313 us
ReadLogEntriesWithoutMetadataCacheAsync     89.919 us  0.5084 us  0.4507 us


sakno avatar sakno commented on July 17, 2024 1

I missed a call to FlushAsync, my bad. Now it's added. Flushing happens only if the partition is switched or the end of the log is reached, for better I/O performance, especially if cached log entries are small.


afink-osi avatar afink-osi commented on July 17, 2024 1

@sakno We have completed our research project, unfortunately, we also did not have time to dig into the counters that you added before our time window closed. Thank you for your quick and thorough responses and improvements throughout this work! You may proceed with your release


potrusil-osi avatar potrusil-osi commented on July 17, 2024 1

Hi @sakno, I also wanted to thank you for all the work. It's been a great experience to work with you on improving the WAL performance!


sakno avatar sakno commented on July 17, 2024

@potrusil-osi, you were right about SkipAsync. The bug was in the StreamSegment.Position property setter. Now it's fixed. Also, a second minor performance improvement has been pushed to the develop branch: the command identifier used by the Interpreter Framework is now part of the log entry metadata and doesn't require serialization. However, there is another side to this change: a log entry representing a snapshot must not be passed to the interpreter. A snapshot doesn't have its own identifier.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

@sakno, it is unfortunate that I cannot pass the snapshot to the interpreter anymore. I currently use it to store the in-memory state, and the snapshot command is how it is often initialized. It is easy and elegant. Isn't there a way to work around the limitation?


sakno avatar sakno commented on July 17, 2024

Could you please provide a simple example of how you use a custom command type for the snapshot? Do you serialize the identifier manually inside of SnapshotBuilder?


sakno avatar sakno commented on July 17, 2024

@potrusil-osi , support for snapshot entries has been brought back. The CommandHandler attribute has a special parameter indicating that the marked method is a handler for the snapshot log entry:

[CommandHandler(IsSnapshotHandler = true)]
public ValueTask HandleSnapshot(SnapshotCommand command, CancellationToken token)
{
    // rebuild the in-memory state from the snapshot command here
    return ValueTask.CompletedTask;
}


potrusil-osi avatar potrusil-osi commented on July 17, 2024

Thanks for such a fast response! I guess you don't want me to give you the example anymore... But no, I was not serializing the identifier inside SnapshotBuilder - it just used its own instance of CommandInterpreter to apply individual entries (including the previous snapshot) and then wrote the SnapshotCommand-based entry at the very end. It worked.

After getting the latest changes (and updating the CommandHandler for snapshot) I'm consistently getting this exception in a follower:

Unhandled exception. DotNext.Net.Cluster.Consensus.Raft.PersistentState+MissingLogEntryException: Log entry with relative index 1 doesn't exist in partition ...\Logs\partition-0\sp-0\0
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.Partition.ReadAsync(StreamSegment reader, Memory`1 buffer, Int32 index, Boolean refreshStream, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.Partition.cs:line 86
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.ReadAsync[TResult](LogEntryConsumer`2 reader, DataAccessSession session, Int64 startIndex, Int64 endIndex, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 170
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.ReadAsync[TResult](LogEntryConsumer`2 reader, Int64 startIndex, Int64 endIndex, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 234
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.<DotNext.Net.Cluster.Consensus.Raft.IRaftStateMachine.MoveToCandidateState>g__PreVoteAsync|76_0(Int64 currentTerm) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 750
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.DotNext.Net.Cluster.Consensus.Raft.IRaftStateMachine.MoveToCandidateState() in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 717
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

The 0 partition file mentioned in the error is completely empty (unlike the same file in the leader).


sakno avatar sakno commented on July 17, 2024

The physical layout of log entries on the disk in 3.1.0 is not compatible with previous versions. As a result, if the new PersistentState uses files generated by a previous version, you may get unpredictable results. I'll include this fact in the release notes. Currently, you need to delete the entire directory with the log entries and snapshot.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

I delete the logs every time before my tests. I'll keep digging into what could be causing it...


sakno avatar sakno commented on July 17, 2024

Oh, the binary form of data representation is efficient when transferring over the wire but very fragile and requires a lot of testing... I fixed an issue that caused invalid calculation of the Content-Length header when transferring log entries over the wire. Probably, a corrupted stream of log entries could cause the error you described above. The fix is pushed.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

The last fix seems to resolve the error. But I'm getting new errors now.

This one occurs periodically on the node that is the very first leader:

System.Net.Http.HttpRequestException: Response status code does not indicate success: 500 (Internal Server Error).
   at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
   at DotNext.Net.Cluster.Consensus.Raft.Http.RaftClusterMember.SendAsync[TResult,TMessage](TMessage message, CancellationToken token) in /_/src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftClusterMember.cs:line 132

And this occurs on startup if I do not clear the logs from the previous run:

System.ArgumentException: Log entry must have associated command identifier (Parameter 'entry')
   at ...SdsQueueAuditTrail.ApplyAsync(LogEntry entry) in .../SdsQueueAuditTrail.cs:line 32
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.ApplyAsync(Int64 startIndex, CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 812
   at DotNext.Net.Cluster.Consensus.Raft.PersistentState.ReplayAsync(CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/PersistentState.cs:line 857
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.StartAsync(CancellationToken token) in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 354
   at DotNext.Net.Cluster.Consensus.Raft.Http.RaftHttpCluster.StartAsync(CancellationToken token) in /_/src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/RaftHttpCluster.cs:line 127
   at DotNext.Net.Cluster.Consensus.Raft.Http.Hosting.RaftHostedCluster.StartAsync(CancellationToken token) in /_/src/cluster/DotNext.AspNetCore.Cluster/Net/Cluster/Consensus/Raft/Http/Hosting/RaftHostedCluster.cs:line 88
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)


sakno avatar sakno commented on July 17, 2024

Fixed. Command id was not passed over the wire correctly.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

Everything works well now. Thanks!


sakno avatar sakno commented on July 17, 2024

Did you change the buffer size passed to the FileStream constructor? There may be another issue. I'm using FileOptions.WriteThrough to bypass any intermediate buffers. For me, it's strange that bufferSize has any effect. Could you please remove the WriteThrough flag and check performance again? Without this flag, I/O is less durable; however, it might be more performant. If so, I'll add this flag to the configuration of PersistentState.


sakno avatar sakno commented on July 17, 2024

I guess without the WriteThrough flag the size of the buffer has only a small impact on performance.


sakno avatar sakno commented on July 17, 2024

Buffer sizes are now separate settings in the Options class.


sakno avatar sakno commented on July 17, 2024

@potrusil-osi , everything is ready for beta-testing. It was really challenging 😎 There are a few options for background compaction available:

Option # 1:
If you're using the UsePersistenceEngine extension method to register a custom WAL derived from PersistentState and PersistentState.Options.CompactionMode is set to Background, then the library automatically registers a background compaction worker.

By default, the worker implements incremental compaction. It means that only one partition will be compacted at a time. It gives you the best performance with minimal interference with writes. However, this approach has a drawback: if there are too many writes, then disk space will grow faster than compaction can reclaim it.

To replace incremental compaction, you need to implement the DotNext.IO.Log.ILogCompactionSupport interface in your custom WAL. It requires only one method, which will be called by the compaction worker on every commit. As a result, you are able to define how deep the compaction should go. PersistentState.ForceCompactionAsync(long count, CancellationToken) now gives this opportunity. The first parameter accepts the compaction factor. Zero means that no compaction should be performed; 1 means that compaction should be minimal (as in the incremental approach). In other words, this factor is how many partitions you want to squash. The number of partitions available for squashing can be requested via the PersistentState.CompactionCount property.
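
A rough sketch of this, assuming your WAL already derives from PersistentState and assuming the interface member is ForceCompactionAsync(CancellationToken):

sealed class MyWal : PersistentState, ILogCompactionSupport
{
    public MyWal(string path, int recordsPerPartition, Options configuration)
        : base(path, recordsPerPartition, configuration)
    {
    }

    // Invoked by the built-in background compaction worker on every commit (assumed member name).
    ValueTask ILogCompactionSupport.ForceCompactionAsync(CancellationToken token)
        => ForceCompactionAsync(Math.Min(2L, CompactionCount), token); // squash at most two partitions per commit
}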

Option # 2:
If you're not using the UsePersistenceEngine extension method, then you need to write a hosted service and register it in the DI container. The service is responsible for running compaction in the background. You can use the built-in background compaction worker as an example.

I would appreciate it if you were able to contribute a good benchmark. Unfortunately, BenchmarkDotNet is not applicable for load testing. Also, it would be great if you could share the final measurements of your application with background compaction.


sakno avatar sakno commented on July 17, 2024

As an option, an EWMA can be used as the basis for an adaptive background compaction algorithm. During the application lifetime, the algorithm can compute an exponentially weighted moving average of the compaction factor and pass it to the ForceCompactionAsync method. It's really trivial. However, its applicability is limited to situations where the application has a floating workload. If not, the EWMA will converge to the maximum compaction factor, which leads to high interference with writes. Moreover, the existing incremental approach should work fine with a floating workload.
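
For illustration only (my own sketch of the idea, not library code), the moving average could be maintained like this, where Alpha controls how quickly it reacts:

const double Alpha = 0.3;        // smoothing factor in (0, 1]
double ewmaCompactionFactor;

void OnCommit(long availablePartitions)
    => ewmaCompactionFactor = Alpha * availablePartitions + (1 - Alpha) * ewmaCompactionFactor;

// somewhere in the background worker:
// await wal.ForceCompactionAsync((long)Math.Round(ewmaCompactionFactor), token);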


potrusil-osi avatar potrusil-osi commented on July 17, 2024

hey @sakno, thank you for so much work! I'll try to take advantage of the most recent changes as soon as possible.

I have two questions:

  1. What exactly did you mean when you asked me for "a good benchmark"? So far, I'm testing the code as part of a bigger application. It would be possible to extract just the state machine (the "simple queue" I described earlier) and simulate its usage by the application, but then the question is whether that specific state machine is a good benchmark for you, since it is so simple and specific to our use case (it supports just adding items and clearing the queue up to a certain sequence number). Which leads to the next question...
  2. I already took advantage of the first version of ForceCompactionAsync which helped in many areas. But to take it to the next level I'd like to optimize the clear operation so that:
    • Compaction runs only after the clear is committed - that is already done.
    • When the compaction runs, it doesn't even try to read those partition files that contain only items that have been cleared. This is obviously not possible now. I have optimized this process by skipping the deserialization of most of the content of such items, but the partition files are still processed, metadata read, etc. I'm not sure how exactly to support such a use case. Can you think of anything? Or am I asking for too much here?


sakno avatar sakno commented on July 17, 2024

  1. I understand that it may be problematic to convert a portion of a real app into a benchmark. If it's not possible or too hard, you may share only the measurements.
  2. SkipAsync is very cheap if IAsyncBinaryReader is backed by a stream (which is true for PersistentState). I cannot implement app-specific optimizations because the current WAL is general-purpose. However, you can reorganize your binary format and place BLOBs at the beginning of the stream so they can be skipped easily with one call (see the sketch below). And, of course, you don't need to save the skipped content to the new snapshot during compaction.
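
A sketch of such a layout on the read side, given an IAsyncBinaryReader named reader (the exact layout and field names are illustrative):

// Layout: [payload length][payload BLOB][sequence number and other small fields]
var payloadLength = await reader.ReadInt32Async(true, token);   // little-endian length prefix
await reader.SkipAsync(payloadLength, token);                   // cheap skip during compaction
var sequenceNumber = await reader.ReadInt64Async(true, token);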


sakno avatar sakno commented on July 17, 2024

Another compaction mode has been added in the latest commit. Now there are three compaction modes:

  • Sequential, which is the only compaction mode available in 3.0.x or earlier. It offers the best optimization of disk space at the cost of commit performance.
  • Background, which allows running compaction in parallel with writes.
  • A new one, Foreground, which runs during the commit process in parallel with applying the committed entries. It requires no extra ceremony, unlike Background, but is much more efficient than Sequential.

The Foreground compaction mode tries to squash, in parallel, the same number of previously committed entries as the number of entries requested for the commit. For instance, suppose you have 3 previously committed entries and you want to commit another 3. In this case, the compaction process squashes the 3 previously committed entries in parallel with applying the 3 newly committed ones. If the overridden ApplyAsync method has approximately the same performance as the overridden SnapshotBuilder.ApplyAsync, then the overhead will be close to zero.
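
If the option names follow this description, picking a mode is a single configuration property (a sketch; the exact location of the enum is assumed):

var configuration = new PersistentState.Options
{
    CompactionMode = PersistentState.CompactionMode.Foreground // or Background / Sequential
};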


sakno avatar sakno commented on July 17, 2024

Also, I have an idea about your second question. I can provide an optional interface called IRandomAccessSnapshotSupport with a single method. You can implement this interface in the class derived from SnapshotBuilder. In this case I'll be able to provide the existing snapshot as a read-only memory-mapped file and you will get random access to the snapshot contents at high speed instead of sequential reading and skipping. Of course, at the cost of your RAM (but without GC pressure). What do you think?


potrusil-osi avatar potrusil-osi commented on July 17, 2024

I think I didn't explain my use case very well. The biggest pain point is that when the compaction is executed in the background there are many partition files full of add operations (the snapshot file is quite small). When a clear is committed and the compaction is triggered, almost all of the partitions (including the previous snapshot) can be dropped, except for the add entries in the last few partitions (which are squashed into a relatively small new snapshot).

To support that specific use case, the state implementation would probably need to be able to set custom metadata for the snapshot and all the partitions, access it during compaction, and influence the compaction algorithm to skip/delete certain partition files based on custom logic. But I'm not sure all this would make any sense for a general purpose WAL.


sakno avatar sakno commented on July 17, 2024

Metadata will not help, because replication relies on the IAuditTrail interface, which may be implemented by any custom WAL. Log compaction with snapshotting is not the only strategy, and a developer may choose to implement their own WAL based on a different compaction algorithm, such as B-tree merging.

Compaction always runs from the first committed log entry to the last committed log entry. There is no way to look forward. This is the nature and main property of a state machine: the state of the node can be reconstructed by replaying the applied log entries.

However, I can add the partition number to the LogEntry type and you can remember the partition (or the absolute index, which is better) in which the clear command sits. This can be captured in your ApplyAsync method and saved into the snapshot. During the compaction, you can skip entire entries without calling SkipAsync using the condition entry.Index <= firstCleanCommandOccurrence.

UPD: Also, I can add an AdjustStartIndex(ref long index) virtual method to SnapshotBuilder that will help you adjust the starting position of the compaction and skip unnecessary entries using the condition described above.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

All this would be nice if the clear operation always cleared everything before it. But that is not the case. It has a sequence number parameter, which means that all items with sequence numbers less than the provided one should be cleared. Those sequence numbers are provided by the application. That's why I was thinking about custom metadata assigned to the snapshot or partition files - in my case it would be the highest sequence number in that file.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

On the other hand, I could easily remember mapping between sequence numbers and entry indexes... In that case the AdjustStartIndex(ref long index) would be really interesting. Especially if you optimized it even more - if the new start index is bigger than the index of the last entry in a file, then the file can be skipped...


sakno avatar sakno commented on July 17, 2024

Yes, it will allow skipping partition files without reading from them.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

@sakno, I'm looking how to create the mapping between sequence numbers and entry indexes and it is currently not that easy. I'm using the interpreter framework which deserializes an entry through the formatter. But because the formatter doesn't have access to the LogEntry object anymore, there is no way to create the <sequence number, entry index> pair (the sequence number is being deserialized). Well, I could work around it by setting the "latest entry index" property on the interpreter (by whoever uses the interpreter, i.e. the state or the snapshot builder), but it is not very elegant for multiple reasons. Could you rather make the current entry index available somehow (either in the formatter or the interpreter)?


potrusil-osi avatar potrusil-osi commented on July 17, 2024

I'm trying to use the new version of ForceCompactionAsync(long count, CancellationToken token) and as the first argument I'd like to provide only as many partitions as necessary for the clear command. In other words, I have a specific entry index and would like to translate it to the count argument. It seems it is currently not possible because I do not have the snapshot index. Any thoughts?


potrusil-osi avatar potrusil-osi commented on July 17, 2024

I have the latest version and have seen a new exception when the application was shutting down:

Unhandled exception. System.AggregateException: One or more errors occurred. (Cannot access a disposed object.
Object name: 'AsyncExclusiveLock'.)
 ---> System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'AsyncExclusiveLock'.
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at DotNext.Threading.Tasks.CompletedTask`2.WhenCanceled(Task`1 task) in /_/src/DotNext/Threading/Tasks/CompletedTask.cs:line 24
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location where exception was thrown ---
   at DotNext.Threading.AsyncLock.TryAcquireAsync(TimeSpan timeout, Boolean suppressCancellation, CancellationToken token) in /_/src/DotNext.Threading/Threading/AsyncLock.cs:line 252
   at DotNext.Net.Cluster.Consensus.Raft.RaftCluster`1.DotNext.Net.Cluster.Consensus.Raft.IRaftStateMachine.MoveToCandidateState() in /_/src/cluster/DotNext.Net.Cluster/Net/Cluster/Consensus/Raft/RaftCluster.cs:line 717
   at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_1(Object state)
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()


potrusil-osi avatar potrusil-osi commented on July 17, 2024

Hi @sakno, it was a bit tricky to figure out the snapshot index. First, I was struggling with where exactly I should be waiting for the snapshot in order to capture its index. It cannot be PersistentState.ApplyAsync because it is not called with the latest snapshot. It could theoretically be SnapshotBuilder.ApplyAsync, but I should not be capturing the index of a snapshot entry but rather the index of the last entry. The problem with this approach is that SnapshotBuilder.ApplyAsync doesn't need to be called at all in case all partitions being compacted are skipped (thanks for the new AdjustIndex). So what I ended up doing was capturing the second argument of AdjustIndex, endIndex. It is a bit tricky, but it works.
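
Roughly, the capture looks like this in my snapshot builder (a sketch; the parameter names are how I read the new API from this thread, with endIndex as the second argument):

protected override void AdjustIndex(long startIndex, long endIndex, ref long snapshotIndex)
{
    lastCompactedIndex = endIndex; // remembered instead of the snapshot entry's own index
}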

I was benchmarking compaction with the latest changes and it was quite slow in my specific case (a lot of partition files, all of them skipped). In fact, it was so slow (hundreds of milliseconds) that it often caused the leader to lose its leadership because of the compaction lock being held the whole time. Almost all the compaction time was spent in RemovePartitionsAsync. When I locally modified that method to execute the partition disposal in the background (the code after the partitions are detached), everything was way faster. After that change the compaction takes only a couple of milliseconds and the leadership is stable. Would it be possible to do something about it officially?


sakno avatar sakno commented on July 17, 2024

PersistentState.ApplyAsync is called with the latest snapshot just once - at application startup (if ReplayOnInitialize is configured). This is when you can remember it. Afterwards, you can catch it in SnapshotBuilder.ApplyAsync and update it easily.

I'm curious what exactly is slow in RemovePartitionsAsync? There are only two potentially heavy operations:

  1. await partition.DisposeAsync().ConfigureAwait(false)
  2. File.Delete(fileName)

It would be great if you had a chance to measure these two lines of code separately. DisposeAsync should not be slow because all buffers are already flushed at the time of the call. File.Delete probably is, but usually the OS just erases file metadata from the filesystem table without zeroing its content. Removing partitions outside of the lock is very dangerous:

  1. Readers may want to acquire some entries from these partitions
  2. Another compaction may be triggered

What I can do:

  1. Remove partitions in parallel. This will have no effect if your disk drive is a classic HDD and not an SSD. Moreover, if it's an SSD, then the controller may not support many parallel writes. So I don't like this idea at all.
  2. Move partitions from the main sorted linked list to another collection and then call DisposeAsync and File.Delete without the lock.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

I think it was the DisposeAsync that took the longest. But I will measure it again.

For my local testing I did what you describe as option 2). I understood that modifying the linked list without a lock was dangerous, but even though it worked to call DisposeAsync and File.Delete without the lock in my tests, I wasn't sure if it was ultimately safe or not.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

DisposeAsync takes a bit longer than File.Delete but not much. They are both quite expensive.


sakno avatar sakno commented on July 17, 2024

This is because of the tests. In the real world, some of your cluster nodes may be too far behind in terms of WAL state. That will force the leader to read from the beginning of the WAL. And this is the moment when the app can crash due to a concurrent read/write on the list, or even continue to work with a damaged state. I think this is irreproducible in any synthetic unit/integration test, as are many situations in distributed/multi-threaded scenarios, such as deadlocks.


sakno avatar sakno commented on July 17, 2024

Anyway, I'm trying to implement option 2: I need to remove the head partitions from the linked list while holding the lock and detach them. Then, I can safely release the lock and continue to do heavy I/O operations on these detached partitions without worrying about potential concurrency.


sakno avatar sakno commented on July 17, 2024

Do you have any additional suggestions? If no, I'll start preparing the release.


sakno avatar sakno commented on July 17, 2024

@potrusil-osi , there is an additional performance optimization added recently: you can enable copy-on-read to reduce lock contention between writes and the replication process. Replication requires a read lock, which cannot coexist with the write/compaction lock. As a result, the duration of the read lock depends on network latency. It can be eliminated if we have a copy of the log entries to be replicated outside of the read lock. Now this behavior can be enabled via the CopyOnReadOptions configuration property of the PersistentState.Options class; it increases write throughput at the cost of RAM.
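
In configuration terms this looks roughly as follows (a sketch; the shape of the options object is assumed and the defaults are used):

var configuration = new PersistentState.Options
{
    CopyOnReadOptions = new() // entries to be replicated are copied to memory outside of the read lock
};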


sakno avatar sakno commented on July 17, 2024

@potrusil-osi , final performance improvements are now available. There are several options to achieve the best performance with the persistent WAL:

  • Copy-on-read allows eliminating lock contention between the replication process and writers of new log entries (see the previous comment).
  • Added AppendAsync(TEntry, bool, CancellationToken) and AppendAndEnsureCommitAsync overloads. The bool parameter specifies whether the entry should be cached in memory. If passed as true, then you don't need to manually create a buffered copy of the log entry. In this case, the entry will be copied to memory outside of the lock and then stored in the internal cache as well as on disk. However, subsequent reads (caused by replication, commit or snapshotting) will be served from memory without the overhead of disk I/O. This works only if the Options.UseCaching property is enabled. Moreover, you can control the eviction policy of the cache using the Options.CacheEvictionPolicy property. A configuration sketch follows this list.
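
A sketch of both options together (the addToCache naming is taken from this thread; the eviction policy is left at its default):

var configuration = new PersistentState.Options
{
    UseCaching = true
    // CacheEvictionPolicy = ... // optional: controls when cached copies are evicted
};

// later, when appending a single entry:
await log.AppendAsync(entry, true, token); // true = keep a copy of the entry in the in-memory cache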


sakno avatar sakno commented on July 17, 2024

Here is a benchmark of various PersistentState caching modes on my NVMe SSD:

Method                                      Mean       Error      StdDev
ReadLogEntriesWithFullCacheAsync             4.230 us  0.0179 us  0.0139 us
ReadLogEntriesWithMetadataCacheAsync        22.512 us  0.4412 us  0.8605 us
ReadLogEntriesWithoutMetadataCacheAsync     42.370 us  0.8241 us  1.4214 us

You can try this benchmark on your machine. Run DotNext.Benchmarks project with dotnet run -c Bench and select PersistentStateBenchmark.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

My results are not as good as yours...

Method                                      Mean        Error      StdDev
ReadLogEntriesWithFullCacheAsync              8.268 us  0.0766 us   0.0716 us
ReadLogEntriesWithMetadataCacheAsync        213.049 us  8.0940 us  23.8652 us
ReadLogEntriesWithoutMetadataCacheAsync     227.656 us  4.4002 us   3.9007 us


sakno avatar sakno commented on July 17, 2024

The actual result depends on the hardware. In the case of the ReadLogEntriesWithFullCacheAsync benchmark, the result depends on the CPU because the log entry is fully cached in memory and disk I/O is not needed. The last two benchmarks depend on disk I/O. In my case, it's an NVMe SSD which uses PCIe as a bus. In your case it's probably an SSD with a SATA interface or even an HDD. The ReadLogEntriesWithMetadataCacheAsync result looks very suspicious because of its high standard deviation. It seems like some other task was running at the same time as the benchmark.


sakno avatar sakno commented on July 17, 2024

@potrusil-osi , do you have an estimate of the time needed to complete your research work? I'm ready to publish the release.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

We have two more weeks. After the last two optimizations you provided, the local results on my computer are pretty impressive. In our testing cluster the results are not as great, so we are digging deeper into what the bottlenecks are.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

@sakno, back to your analysis of the benchmark results: I also have an NVMe SSD drive (Samsung PM981 1TB). Any thoughts on why my results are so much worse than yours? Are you running the benchmark on Windows?


sakno avatar sakno commented on July 17, 2024

No, I ran these tests on Linux. But I have a machine with Windows 10:

Method                                      Mean        Error      StdDev
ReadLogEntriesWithFullCacheAsync              9.568 us  0.1045 us  0.0872 us
ReadLogEntriesWithMetadataCacheAsync        226.568 us  4.5002 us  3.9893 us
ReadLogEntriesWithoutMetadataCacheAsync     284.843 us  5.6811 us  5.3141 us

The results are close to yours. But the Windows 10 machine has a SATA SSD instead of an NVMe SSD on PCIe. I think your laptop has two drives: a system partition running on NVMe and a user partition running on SATA. The benchmark creates the WAL in the temp directory. In the case of two drives, this directory is usually mounted on the SATA SSD by default.

Also, in your case 23.8652 us is a very abnormal StdDev. Try to run this benchmark again without any background tasks.


potrusil-osi avatar potrusil-osi commented on July 17, 2024

There is only one drive. But after disabling half the services the results are a bit better:

Method                                      Mean        Error      StdDev
ReadLogEntriesWithFullCacheAsync              6.396 us  0.0646 us  0.0604 us
ReadLogEntriesWithMetadataCacheAsync        106.715 us  0.7805 us  0.6517 us
ReadLogEntriesWithoutMetadataCacheAsync     140.691 us  0.6169 us  0.5771 us


sakno avatar sakno commented on July 17, 2024

In the benchmark I didn't try to choose optimal configuration parameters. However, I changed the default buffer size to 4096 bytes, which is equal to the filesystem cluster size, and now the results are much better on a SATA SSD but remain the same for NVMe. Pull the latest changes from the develop branch and try again.


sakno avatar sakno commented on July 17, 2024

Thanks for sharing the results! Also, I managed to cut off a few microseconds by optimizing hot paths in the UnsafeReadAsync method, which is the workhorse for all read operations.


afink-osi avatar afink-osi commented on July 17, 2024

Hi @sakno, I am a colleague of @potrusil-osi who is out of the office this week. We have deployed our application to a cluster on dedicated hardware to continue testing. Unfortunately, we are not seeing the same degree of benefit from the optimizations that have been seen when testing locally, as our overall throughput numbers are only about 80% of what was achieved in the local testing. In particular, we are seeing two issues that we would like to get your thoughts on.

  1. Write Lock Acquisition

We are seeing that the write lock acquisition in AppendCached() is taking significantly more time (~36ms) as compared to the writing of the entry to disk in UnsafeAppend() which takes on average (~1ms). With the compaction improvements, we are seeing an average compaction time of ~9ms. Do you have any thoughts on why acquiring the write lock in AppendCached() takes so long?

  2. Read Lock Acquisition

In the ReadBuffered() and ReadUnbuffered() methods, we are seeing that a read lock is being acquired, which on average takes ~4ms. We have configured PersistentState.Options with CopyOnReadOptions, which, from the previous discussion, indicates that this operation shouldn't require a read lock at all. Is it expected that there should still be a lock required for this operation?


sakno avatar sakno commented on July 17, 2024

Hi @afink-osi ,
How many parallel writes do you have? Are reads and writes intermixed? What's the average size of a log entry record?


sakno avatar sakno commented on July 17, 2024

We are seeing that the write lock acquisition in AppendCached() is taking significantly more time (~36ms) as compared to the writing of the entry to disk in UnsafeAppend() which takes on average (~1ms).

It depends on the number of parallel writes. For instance, if you have 36 parallel writes and each write operation takes ~1 ms, then the last writer needs to wait ~36 ms.


sakno avatar sakno commented on July 17, 2024

Additionally, there is another project, FASTER, that offers a general-purpose, high-performance WAL. It might be a good candidate for Raft but requires a lot of research.


sakno avatar sakno commented on July 17, 2024

@afink-osi , @potrusil-osi , let's try again. What I've done: a cached log entry will not be persisted until commit. As a result, AppendCachedAsync will complete synchronously in most cases and lock contention will be low.


afink-osi avatar afink-osi commented on July 17, 2024

Hi @sakno, thanks for the quick response and for the clarification! Yesterday when running tests, we were seeing replication throughput at ~28% of the throughput of our baseline (the application running without replication). Today, we were able to run some tests using the same storage volume as before but with your latest changes, and the results are very impressive! We are now running at about 65% of the baseline, which is much more in line with the local tests that @potrusil-osi was running before, and we are seeing minimal time waiting for locks. We also tested using memory instead of the SSD disk for log storage (which may be acceptable for our use case) and pushed this further to ~72% of the baseline.

One more question: In the new method PersistCachedEntryAsync(), there is a call to WriteAsync() but we do not see any call to a Flush() method. Are these entries being flushed somewhere else?


sakno avatar sakno commented on July 17, 2024

Also, I added a special TryGetBufferWriter method to the IAsyncBinaryWriter interface that enables fast synchronous serialization. When you append log entries with the addToCache parameter, the writer will return a non-null result. This method is supported only if IAsyncBinaryWriter is backed by a pipe or a dynamic in-memory buffer. In other cases, it returns null.
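
A sketch of how a custom formatter can branch on it (WriteCommand and WriteCommandAsync are hypothetical helpers; writer is the IAsyncBinaryWriter passed to the formatter):

if (writer.TryGetBufferWriter() is { } buffer)
{
    WriteCommand(command, buffer);                    // fast synchronous path over IBufferWriter<byte>
}
else
{
    await WriteCommandAsync(command, writer, token);  // asynchronous fallback for stream-backed writers
}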


sakno avatar sakno commented on July 17, 2024

@potrusil-osi , @afink-osi , I've added support for diagnostic events to the persistent WAL. You can configure them via the PersistentState.Options type:

  • LockContentionCounter measures the rate of lock contention
  • LockDurationCounter measures the duration of the lock in acquired state, in ms
  • ReadCounter measures the rate of reads, in the number of log entries
  • WriteCounter measures the rate of writes, in the number of log entries
  • CompactionCounter measures the rate of log compaction, in the number of log entries
  • CommitCounter measures the rate of commits, in the number of log entries

You can use the dotnet-counters tool from the .NET SDK to track the counters or submit the values to a monitoring tool.


sakno avatar sakno commented on July 17, 2024

@potrusil-osi , @afink-osi , I would like to ask you guys to check the develop branch again. It contains another optimization of writes based on memory-mapped files. Read performance remains the same. Also, the replication process is now optimized to reduce the number of reads from the WAL. Thanks in advance!


afink-osi avatar afink-osi commented on July 17, 2024

@sakno The new counters look very helpful; I didn't get a chance to try them yet but will hopefully look at them tomorrow. We were able to test the new changes on the develop branch, and we are seeing an additional 11% increase in throughput in our application compared to before these latest optimizations!


sakno avatar sakno commented on July 17, 2024

Nice to hear that, @afink-osi ! Let me know when you check these counters.


sakno avatar sakno commented on July 17, 2024

@afink-osi , @potrusil-osi , did you have a chance to finalize your research project? If so, I would like to release new version of .NEXT.


sakno avatar sakno commented on July 17, 2024

.NEXT 3.1.0 has been released. Closing issue.

