Git Product home page Git Product logo

hm2's Introduction

HM2 - key/value NoSQL database that utilize LSM trees


Why another key/value?

We worked a lot with Apache Cassandra, but we never needed its distributed functionality. We also worked with Redis, but problem there is data must fit in memory.

Goals of the project are.

  • Key/Value store
  • Redis / Memcached protocol
  • Async network I/O
  • Data not need to fit in memory
  • Speed
  • Consistent
  • High quality code

Architecture

Architecture is derived from Apache Cassandra. There is memtable and several disk files.

Writes will be done in Memtable. Reading will be done first in memtable and if data will not be found there, system will try read each disktable until find the key. This is called "LSM tree" or much simpler - "Differential files".

Writes should be always fast. Reads should be fast if there are not too many disktables.


Memtable

Memtable is stored in memory in skiplist. Skiplist is very fast O(Log N) structure, very similar to binary tree.

It is much faster than a vector, but slower than hashtable O(1).


Testing memtable with other structures

Prior skiplist we tested with hashtables, vectors (dynamic array with shifting) and linked lists.

General speed of hashtables was much faster than skiplists, but there were some problems, such:

  • Hashtable buckets must be fixed, because data must later be stored on the disk with LSM tree.
  • Because of hashtable buckets, disktables have more complicated structure and are initially larger - all bucket must be stored.
  • Slow creation of disktables. Because buckets are fixed, just as it is in Tokyo Cabinet, one may choose high number of buckets, say 1M. This means even "empty" memtable is translated to huge disktable with at least 1M "records" on the disk.
  • Merging of disktable with "buckets" is more difficult, but also waste more unused space in the file. This is not discussed yet, but merging two tables with sizes N1 and N2 elements, will need allocation of N1 + N2 space on the "output" disktable.

Vector performance was very good, but worse than skiplist.

Linked list performance was poor.

All tests and code should be in "structure_testing" branch of the repository.


Disktables

Periodicaly, the memtable is flushed on the disk in disktable. Disktable file have similar order - vector/array like structure with sorted elements.

Disktables are immutable. This means very good OS cache and easy backup. Downside is that deletes must be implemented by markers called tombstones (same key, NULL value).

Disktables header and control data are 64bit integers (uint64_t) stored as BigEndian. Simple 1 byte XOR check checksum is included for each pair.

Disktables ment to be load using MMAP(). Effors are made things to works on 32bit OS-es, but soon or later these will run out of address space. This means that for production you will definitely need 64bit OS.


LSM tree notes.

Here is what Wikipedia say about:

Log structured merge tree

However there is 1976's research by Dennis G. Severance and Guy M. Lohman on the similar topic here:

Differential files: their application to the maintenance of large databases - University of Minnesota, Minneapolis

When database works, many disktables will be created. This will slow down the reads very much. By this reason disktables must be merged.

Because disktables are read only, merge can be implemented as separate user space process. For read only databases you may choose not to start it at all.

Unlike Apache Cassandra, there are a way to compact several tables into single file. However this will complicate the things and probably only basic two table merge will be implemented.

hm2's People

Contributors

nmmmnu avatar

Stargazers

Eason Wang avatar  avatar Robert Winslow avatar

Watchers

 avatar James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.