pub-task's Issues

In place shrink memory: jemalloc: xallocx

When MemTableRep::MarkReadOnly() is called, we should do something like vector::shrink_to_fit() to free the unused extra memory. Since we use a large memory block, the unused memory is the upper part of the
memory block.

Such a memory block can be shrunk by realloc, but realloc is not guaranteed to shrink memory in place. This is not an issue with a single thread, but it is a big issue with multiple threads:

  1. Thread-1 calls realloc; realloc allocates a smaller memory block, copies the old contents into it, then frees the old block.
  2. Thread-2 is still reading the old memory block....

The good news is that jemalloc has a dedicated function, xallocx:

// jemalloc specific:
size_t xallocx(void *ptr, size_t size, size_t extra, int flags);

The jemalloc documentation says:

The xallocx() function resizes the allocation at ptr in place to be at least size bytes, and returns the real size of the allocation. If extra is non-zero, an attempt is made to resize the allocation to be at least (size + extra) bytes, though inability to allocate the extra byte(s) will not by itself result in failure to resize. Behavior is undefined if size is 0, or if (size + extra > SIZE_T_MAX).

The jemalloc documentation also says:

The realloc(), rallocx(), and xallocx() functions may resize allocations without moving them under limited circumstances. Unlike the allocx() API, the standard API does not officially round up the usable size of an allocation to the nearest size class, so technically it is necessary to call realloc() to grow e.g. a 9-byte allocation to 16 bytes, or shrink a 16-byte allocation to 9 bytes. Growth and shrinkage trivially succeeds in place as long as the pre-size and post-size both round up to the same size class. No other API guarantees are made regarding in-place resizing, but the current implementation also tries to resize large allocations in place, as long as the pre-size and post-size are both large. For shrinkage to succeed, the extent allocator must support splitting (see arena.<i>.extent_hooks). Growth only succeeds if the trailing memory is currently available, and the extent allocator supports merging.

The conclusion is even better: for any real-world realloc, shrinking happens in place when the memory block is large. The "large" threshold in jemalloc is quite small for our usage, so we can pick a larger threshold of our own, such as 2M, to get a stronger guarantee.
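
A minimal sketch of using xallocx for this, assuming jemalloc is linked (shrink_in_place is a hypothetical helper, not existing code):

#include <jemalloc/jemalloc.h>

// Try to shrink the allocation at ptr down to used_size bytes (used_size > 0).
// xallocx only ever resizes in place: on failure it returns the current usable
// size and leaves the block untouched, so concurrent readers of the old block
// are always safe.
size_t shrink_in_place(void* ptr, size_t used_size) {
    return xallocx(ptr, used_size, 0, 0);
}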

MemTable with array based Threaded Red-Black Tree

This memory pool is aimed at showing the advantages of our MemTable refactoring:

  1. Memory is dumpable
  2. No RocksDB key/value prefix-length encoding
  3. Dynamic pluggable MemTable (MemTable Abstract Factory: #18)

Using an array-based Threaded Red-Black Tree:

struct Node {
    uint32_t  left;
    uint32_t  right;
    uint64_t  offset : 39; // offset to key value data in mempool
    uint64_t  color  :  1; // red or black
    uint64_t  keylen : 24; // valuelen = nodes[idx+1].offset - nodes[idx].offset - keylen
};
  1. Nodes and KeyValue data share a single mempool
  2. Nodes start at the beginning of the mempool, growing upward
  3. Offsets (KeyValue data) start at the end of the mempool, growing downward
  4. Max node num is 2^32 - 2 (left/right are 32-bit indices)
  5. Max KeyValue mempool capacity is 512G (offset is 39 bits: 2^39 = 512G)
  6. Max KeyLen is 2^24 - 1 (keylen is 24 bits)

terark-core: Implement a thread-cached mempool

Priority: Low

  1. Same as the existing mempool: memory is identified by offset.
  2. The interface should be the same as the existing mempool.
  3. The object layout should be consistent with the existing mempool (e.g. MemPool_CompileX).
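
A hedged sketch of the intended shape (ThreadCacheMemPool is a placeholder name; the real class should mirror the existing mempool's API):

#include <cstddef>

// Placeholder interface, not existing code: memory is identified by offset,
// and each thread keeps a local cache of free chunks to reduce contention.
class ThreadCacheMemPool {
public:
    size_t alloc(size_t len);             // returns an offset into the pool
    void   sfree(size_t pos, size_t len); // free by offset + length
    void*  at(size_t pos);                // translate offset -> address
};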

patricia: File read+write mmap memory pool

With a file-backed read+write mmap memory pool, writing to the device will not introduce CPU & memory waste.

This needs RocksDB to create a file for the memtable.

If the memory pool file is on NVM, this produces a non-volatile memtable.

This improvement is less useful for distributed RocksDB.
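
A minimal POSIX sketch of the idea (map_pool_file is an illustrative helper; error handling and sizing policy are simplified):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file read+write and shared: dirty pages are written back to the
// device by the kernel, with no extra user-space copy.
void* map_pool_file(const char* path, size_t cap) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, (off_t)cap) != 0) { close(fd); return nullptr; }
    void* base = mmap(nullptr, cap, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); // the mapping stays valid after the fd is closed
    return base == MAP_FAILED ? nullptr : base;
}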

patricia: Reimplement MultiReadMultiWrite insert

Priority: Low

  1. Lazy List and MemPool need refactoring
  2. memcmp for the copy-on-write check
  3. Shard the update-parent mutex by 127 (see the sketch after this list), because:
    • 127 mutexes is a moderate number
    • 127 is a prime number
    • x % 127 can be optimized by the compiler
  4. Other unexpected things
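
A small illustrative sketch of the sharding (ParentMutexShards is a hypothetical name, not existing code):

#include <cstddef>
#include <mutex>

// 127 shards: a prime modulus spreads node ids evenly across shards, and the
// compiler lowers x % 127 to a multiply/shift sequence instead of a division.
struct ParentMutexShards {
    std::mutex shards[127];
    std::mutex& get(size_t node_id) { return shards[node_id % 127]; }
};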

terarksql-cluster design

  1. Make the storage engine distributed: communicate between nodes
    • 1 writer, n readers / master-slave (a slave is a reader)
    • Data synced by the storage engine: RocksDB WAL log (async between nodes after the master fsyncs the WAL to SSD)
  2. Master election?
  3. The MySQL layer is stateless
    • MySQL layer master/slave is the same as at the storage engine level
    • The MySQL layer knows whether it is master or slave
    • Disable MySQL master/slave replication
  4. Configuration syncs between nodes: storage engine config & MySQL config
    • Use ZeroMQ for WAL log sync
    • First release: only one configured master

Add db change callback

In RocksDB, add a callback so the DB can notify the application of DB changes.

For example, the MyRocks storage engine may have state that is used as a cache for some metadata.

In distributed MyRocks, slave nodes read the WAL log to sync the master's changes to the DB, so the cached metadata state may become stale. With a db-change callback, this state can be updated.

This can also be implemented in the slave's WAL-log syncer.
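
A hedged sketch of what such a callback could look like (DBChangeListener and OnDBChanged are illustrative names, not existing RocksDB API):

#include <cstdint>

// Hypothetical interface: the application registers a listener and the DB
// invokes it after changes are applied, so metadata caches can be refreshed.
class DBChangeListener {
public:
    virtual ~DBChangeListener() = default;
    virtual void OnDBChanged(uint64_t sequence_number) = 0;
};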

terark-fsa: Write an optimized dynamic Patricia Trie

We could not find an optimized dynamic Patricia Trie implementation, so we write one ourselves.

  1. Integrate into the terark-fsa architecture
  2. Small memory usage
    1. Use a single memory block; be dumpable
  3. Fast lookup
  4. Fast iterator

This Patricia Trie will be used as a RocksDB memtable.

NLT core string pool: use end mark instead of length

Find a byte that does not occur in any core string and use it as the end mark.

The core string vector will be sorted in right-aligned bytewise order for compression, using the end mark. Right-aligned bytewise order compares from the end of each string toward the front; a longer string sorts as less:

int compare(fstring x, fstring y) {
   size_t n = std::min(x.size(), y.size());
   auto px = (const byte_t*)x.data() + x.size();
   auto py = (const byte_t*)y.data() + y.size();
   while (n) {
       if (*--px != *--py)
           return *px < *py ? -1 : +1;
       --n;
   }
   if (x.size() > y.size())  return -1; // longer is less
   if (x.size() < y.size())  return +1; // shorter is greater
   return 0;
}

Distributed RocksDB: Data sync & file lock

There are two kinds of readonly readers: syncing-readonly and nonsyncing-readonly.

  1. The writer publishes DB changes: WAL log, manifest, ...
  2. When a readonly node comes up
    1. it needs to connect to the writer node to acquire a snapshot
      • if the writer is not online, ...
      • in any case, it needs some mechanism to lock the needed files (prevent those files from being deleted)
    2. if it is syncing-readonly, it subscribes to topics for DB changes
    3. if it is nonsyncing-readonly, it just does reads
  3. Use ZooKeeper

  • 2018-10-18 10:57: In the gluster-env implementation, create a sync-channel for log files (file name pattern is log); this may simplify the overall implementation.

Dump MemTable as SST

A PatriciaTrie-based MemTable can be dumped to disk and loaded by mmap. Dumping a MemTable is much faster than flushing it, which yields many advantages:

  1. Dump speed can reach the SSD's max write speed (e.g. 3GB/sec), much faster than a MemTable flush.
  2. MemTable flush can be replaced by dump, thus reducing the compression workload.
  3. Fast startup: a MemTable dump is very likely to succeed, which avoids WAL log replay on startup.

rocksdb: 1-write n-read, engine level distribution

  • Sync in-memory changes
  • Load SSTs after compaction/flush
  • Since the writer/master may delete a file once it is no longer needed, it must "know" which files are in use by the readers/slaves
    • Explicit communication between writer & readers is required
    • Tailing the WAL log is simpler but slower than ZeroMQ
    • Manifest files also need to be monitored
  • There are two kinds of readonly instance
    • Readonly & syncs the writer's updates to the DB, used for OLTP
    • Readonly & does not sync the writer's updates, used for OLAP (Spark, ...) and distributed compaction

Blocked Louds Succinct Trie

  1. Divide LOUDS into blocks; the block size can be 4K, 8K, 16K, ...
  2. Do not nest
  3. Compress in-block redundancy
    • Compressed data must be searchable at high speed
    • Maximize the compression ratio
  4. Building speed should be fast

Add Abstract Factory for MemTable

RocksDB currently has a Factory for MemTable classes, but it lacks an Abstract Factory, so we cannot create a MemTable just by class name.

So add an Abstract Factory keyed by MemTable class name.
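
A hedged sketch of the idea (the registry below is illustrative, not RocksDB's actual API):

#include <functional>
#include <map>
#include <string>

class MemTableRepFactory; // RocksDB's per-class MemTable factory

// Abstract factory: map a class name to a creator for that class's factory.
using FactoryCreator = std::function<MemTableRepFactory*()>;

std::map<std::string, FactoryCreator>& MemTableFactoryRegistry() {
    static std::map<std::string, FactoryCreator> registry;
    return registry;
}

MemTableRepFactory* NewMemTableFactory(const std::string& name) {
    auto& reg = MemTableFactoryRegistry();
    auto it = reg.find(name);
    return it == reg.end() ? nullptr : it->second();
}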

terark_zip_table: store GroupSize of Multi-Value pack in a separate array

Priority: Low

terark_zip_table currently has zvType, but we don't know the GroupSize of each MultiValue pack before unzipping it. To get the GroupSize before unzipping, using a succinct structure:

  1. Needs a 2-bit rank: rank0, rank1, rank2, rank3
  2. Build a RankIndex for zvType
  3. Store GroupSize in a separate array

High level code:

size_t groupIndex = zvType.rank3(recId); // 3 == ZipValueType::kMulti; rank3 counts kMulti entries before recId
groupSize = groupSizeVec[groupIndex];    // GroupSize array is indexed by that rank
// ....

NLT: fixed length suffix

After building the outer trie, the inner strVec may be fixed length, or nearly fixed length; we can save the fixed-length suffixes into a parallel array, thus omitting the storage for the offset array.

MemPool: Thread Local Cache MemPool

Use Thread Local Cache to minimize CPU cache conflicts.

2M blocks for the sfree implementation:

void sfree(size_t pos, size_t len) {
    auto tc = m_thread_cache[pos >> 21]; // >> 21 divides by 2M to pick the owning block's thread cache
    tc->sfree(pos, len);
}

rocksdb: add gluster filesystem env

Use the Gluster native API (gfapi); do not use Gluster FUSE.


2018-10-15 16:00

  1. Do reads by gfapi, for reading value data
  2. Support mmap for reading index data

RandomAccessFile::FsRead has been added for this purpose: d1d5621a

rocksdb: Implement a memtable by boost intrusive rbtree

The point of implementing such a memtable is to use insertion hints to optimize sequential writes.

The hint should be end(). boost intrusive rbtree packs the color bit into a pointer and has a parent pointer, which is why it can be optimized with a hint; an rbtree without a parent pointer is very hard to optimize with a hint.
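
A small illustrative sketch (Entry and sequential_insert are assumptions, not the actual memtable code): with ascending keys, a hinted insert at end() skips the descent from the root.

#include <cstdint>
#include <boost/intrusive/set.hpp>

struct Entry : boost::intrusive::set_base_hook<> {
    uint64_t key;
    explicit Entry(uint64_t k) : key(k) {}
    friend bool operator<(const Entry& a, const Entry& b) { return a.key < b.key; }
};

using Tree = boost::intrusive::set<Entry>;

void sequential_insert(Tree& tree, Entry& e) {
    // For keys arriving in ascending order the hint end() is always correct,
    // so the insert is amortized O(1) instead of O(log n).
    tree.insert(tree.end(), e);
}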

Brute force & simple stupid concurrent MemTable

Create an instance-thread-local Sub-MemTable for such a MemTable; this MemTable is optimized for writes and has read amplification.

  • Each write thread writes KeyValue pairs into its own Sub-MemTable
  • Reader threads find or iterate over all Sub-MemTables
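
A brute-force sketch of the structure (class and member names are assumptions; registration and cleanup are simplified to a single MemTable instance):

#include <mutex>
#include <vector>

struct SubMemTable { /* e.g. a PatriciaTrie-backed single-writer table */ };

class ConcurrentSubMemTables {
public:
    // Writer path: each thread lazily creates and registers its own Sub-MemTable.
    SubMemTable* Local() {
        thread_local SubMemTable* mine = nullptr;
        if (!mine) {
            mine = new SubMemTable;
            std::lock_guard<std::mutex> lk(mu_);
            subs_.push_back(mine);
        }
        return mine;
    }
    // Reader path: every find/iterate must visit all Sub-MemTables,
    // which is the read amplification mentioned above.
    std::vector<SubMemTable*> All() {
        std::lock_guard<std::mutex> lk(mu_);
        return subs_;
    }
private:
    std::mutex mu_;
    std::vector<SubMemTable*> subs_;
};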

We will first implement such a MemTable on top of PatriciaTrie:

  • PatriciaTrie requiring a fixed memory size may be an issue
  • realloc to shrink_to_fit; we assume realloc will not relocate memory when shrinking
    • jemalloc has the function xallocx, which is dedicated to resizing memory in place

Once the MemTable is no longer a bottleneck, the advantage of distributed compaction will show even more...

Compare: pika + (TerarkDB vs RocksDB)

Use Wikipedia data, with multiple threads/connections.

Uniform random read:

  1. get
  2. mget
  3. pipeline get

Write:

  1. Fill data sequentially
  2. Fill data randomly
  3. Random update

Test glusterfs tailing performance

If glusterfs's tailing performance is strong enough, we will not need a messaging system like ZeroMQ to sync the WAL log.

The tail test appends to a file on one node and reads the tail of the file on another node. glusterfs ships its own tail command, implemented with glusterfs's user-space API, so it is more efficient than mounting through FUSE and using the Linux tail command.

We may need to write our own test code, focusing on measuring tail latency.

rocksdb: distributed compaction

In the 1-write n-read framework, the write node just handles client write requests and SST flushes.

Background compactions run on other nodes; once a compaction finishes, notify all nodes to update: load the new SSTs and discard the old ones.
