Git Product home page Git Product logo

bunshin's People

Contributors

killme2008 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

bunshin's Issues

Distributed logging for cloud-native databases

Background

Shared-data architecture binds dedicated local storage to every computing node, which has good performance for small IO operations like continous insertions. But, the replication and failover of shared-data style systems is complicated and error-prone. Shared-nothing architecture, on the other hand, benifits from the inherent replication and scalibility of cloud storages like S3, but the write performance may degrade when handling continous IO due the overhead of cloud storage operations.

Thus we need a distributed WAL to bridge the gap between shared-data and shared-nothing. Data inserted first go to this WAL to ensure durability and then apply to datanode. Insertions are buffered on datanode's local storage and write to cloud storage at a time to increase throughput. The data buffered on datanode is transient. Once datanode crashes, those data can be easily recovered from WAL.

Design goals

  • Repeatable read consistency
  • Namespace isolation and TTL
  • Automatic failover
  • High write throughput and low tailing read latency
  • Provide easy access for downstream subscriber to facilitate tasks like realtime materializing

Proposal

Some systems like RedPanda introduces Raft to replicate logs across nodes. But using such consensus algorithm for each log record is quite expensive. On the premise that GreptimeDB already has a metasrv to coordinate datanodes, we can use quorum-base replication on data plane and leave consensus to metasrv.

Quorum-based replication

Quorum-base systems read from and write to a subset of copies of data, namely read set Vr and write set Vw respectively. To ensure consistency, Vr and Vw must overlap at least one node so that when data is retrived, at least one node at Vr contains the latest version of value.

Quorum-base replication is widely used on existing database systems like Amazon Aurora and Amazon DynamoDB.

Concepts

Quorum, ensemble and data striping

In Bunshin, ensemble is the pool of all available nodes, while quorum is a subset of ensemble that a log entry is actually written to. When ensemble size > quorum size, adjacent entries will be written to different quorum, which is called "data striping".
image

Entry and sequence

Entry is the basic unit of log. Entry has a monotonic increasing sequence number generated by writer, Bunshin ensures that once a sequence number is acked, then all entries with lower sequence numbers are also acked.

Stream and segment

In Bunshin, an unbounded log entry sequence is called a "stream". A stream can have only one writer but many readers. Unbounded data structure is hard to deal with, so we partition the stream into "segment"s. Segment is a batch of log entries that accepts append operations only. Once a segment is closed (or sealed), it's immutable and cannot be reopened for writing. Since segments are immutable once closed, we can easily migrate it to other storage systems or simply delete it when it's safe to reclaim disk spaces.

Chunks

Nodes in quorum may fail, permanently or transiently. When writer finds some node in write quorum is unable to handle the request, instead of just hangs there and waits, it intiates a write quorum change and picks another node to write. A sequence of log entries in a segment that are written to the same write quorum is logically called a "chunk". Different chunks in a segment may consist of different quorum nodes, but the quorum size must be the same.

Protocol

Write

Writer opens a stream by creating a segment metadata in metasrv. To write an entry, it chooses N nodes to form the write quorum and sends write request in parallel to these nodes. In order to successfully replicate entries, writer waits ACK response from up to K nodes. Here K is an configurable option for each stream. Increasing K brings higher reliability, but may also result into spiking latency.

When writer fails to write to some node in quorum, it initiates a quorum change and open a new chunk with the new write quorum.

Read

Since every entry carries a sequence number, we can simplify the quorum read to a read from single node.
Readers reads entries by entry id. As reader already knows the ensemble of a fragment, it can infer the write quorum of that entry and sends read request to any of those nodes to fetch the content of the entry.

Recovery and repairing

TBD

Metadata storage

Segment meta and chunk meta

TBD

Exclusive writer and epoch

TBD

Underlying Storage

Bunshin uses RocksDB as entry storage, we won't focus on writing an append-only storage from scratch since RocksDB provides a relatively good batch write performance while point query is also fast.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.