greptimeteam / bunshin Goto Github PK
View Code? Open in Web Editor NEWA distributed Write-Ahead log implementation for cloud
A distributed Write-Ahead log implementation for cloud
Shared-data architecture binds dedicated local storage to every computing node, which has good performance for small IO operations like continous insertions. But, the replication and failover of shared-data style systems is complicated and error-prone. Shared-nothing architecture, on the other hand, benifits from the inherent replication and scalibility of cloud storages like S3, but the write performance may degrade when handling continous IO due the overhead of cloud storage operations.
Thus we need a distributed WAL to bridge the gap between shared-data and shared-nothing. Data inserted first go to this WAL to ensure durability and then apply to datanode. Insertions are buffered on datanode's local storage and write to cloud storage at a time to increase throughput. The data buffered on datanode is transient. Once datanode crashes, those data can be easily recovered from WAL.
Some systems like RedPanda introduces Raft to replicate logs across nodes. But using such consensus algorithm for each log record is quite expensive. On the premise that GreptimeDB already has a metasrv to coordinate datanodes, we can use quorum-base replication on data plane and leave consensus to metasrv.
Quorum-base systems read from and write to a subset of copies of data, namely read set Vr and write set Vw respectively. To ensure consistency, Vr and Vw must overlap at least one node so that when data is retrived, at least one node at Vr contains the latest version of value.
Quorum-base replication is widely used on existing database systems like Amazon Aurora and Amazon DynamoDB.
In Bunshin, ensemble is the pool of all available nodes, while quorum is a subset of ensemble that a log entry is actually written to. When ensemble size > quorum size, adjacent entries will be written to different quorum, which is called "data striping".
Entry is the basic unit of log. Entry has a monotonic increasing sequence number generated by writer, Bunshin ensures that once a sequence number is acked, then all entries with lower sequence numbers are also acked.
In Bunshin, an unbounded log entry sequence is called a "stream". A stream can have only one writer but many readers. Unbounded data structure is hard to deal with, so we partition the stream into "segment"s. Segment is a batch of log entries that accepts append operations only. Once a segment is closed (or sealed), it's immutable and cannot be reopened for writing. Since segments are immutable once closed, we can easily migrate it to other storage systems or simply delete it when it's safe to reclaim disk spaces.
Nodes in quorum may fail, permanently or transiently. When writer finds some node in write quorum is unable to handle the request, instead of just hangs there and waits, it intiates a write quorum change and picks another node to write. A sequence of log entries in a segment that are written to the same write quorum is logically called a "chunk". Different chunks in a segment may consist of different quorum nodes, but the quorum size must be the same.
Writer opens a stream by creating a segment metadata in metasrv. To write an entry, it chooses N nodes to form the write quorum and sends write request in parallel to these nodes. In order to successfully replicate entries, writer waits ACK response from up to K nodes. Here K is an configurable option for each stream. Increasing K brings higher reliability, but may also result into spiking latency.
When writer fails to write to some node in quorum, it initiates a quorum change and open a new chunk with the new write quorum.
Since every entry carries a sequence number, we can simplify the quorum read to a read from single node.
Readers reads entries by entry id. As reader already knows the ensemble of a fragment, it can infer the write quorum of that entry and sends read request to any of those nodes to fetch the content of the entry.
TBD
TBD
TBD
Bunshin uses RocksDB as entry storage, we won't focus on writing an append-only storage from scratch since RocksDB provides a relatively good batch write performance while point query is also fast.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.