kraret / pub-task

Terark public developing

License: MIT License


pub-task's Issues

In place shrink memory: jemalloc: xallocx

When MemTableRep::MarkReadOnly() is called, we should do something like vector::shrink_to_fit() to free the unused extra memory. Since we use large memory blocks, the unused memory is the upper part of the block.

Such a memory block can be shrunk by realloc, but realloc is not guaranteed to shrink memory in place. This is not an issue with a single thread, but it is a big issue with multiple threads:

Thread-1 calls realloc; realloc allocates a smaller memory block, copies the old contents into it, then frees the old memory block.
Thread-2 is still reading the old memory block....

The good news is that jemalloc has a dedicated function, xallocx:

// jemalloc specific:
size_t xallocx(void *ptr, size_t size, size_t extra, int flags);

The jemalloc documentation says:

The xallocx() function resizes the allocation at ptr in place to be at least size bytes, and returns the real size of the allocation. If extra is non-zero, an attempt is made to resize the allocation to be at least (size + extra) bytes, though inability to allocate the extra byte(s) will not by itself result in failure to resize. Behavior is undefined if size is 0, or if (size + extra > SIZE_T_MAX).

The jemalloc documentation also says:

The realloc(), rallocx(), and xallocx() functions may resize allocations without moving them under limited circumstances. Unlike the allocx() API, the standard API does not officially round up the usable size of an allocation to the nearest size class, so technically it is necessary to call realloc() to grow e.g. a 9-byte allocation to 16 bytes, or shrink a 16-byte allocation to 9 bytes. Growth and shrinkage trivially succeeds in place as long as the pre-size and post-size both round up to the same size class. No other API guarantees are made regarding in-place resizing, but the current implementation also tries to resize large allocations in place, as long as the pre-size and post-size are both large. For shrinkage to succeed, the extent allocator must support splitting (see arena.<i>.extent_hooks). Growth only succeeds if the trailing memory is currently available, and the extent allocator supports merging.

The conclusion is encouraging: for any real-world `realloc`, shrinking is in place when the memory block is `large`. What jemalloc defines as `large` is actually `small` for our usage, so we can set our own `large` threshold higher, such as `2M`, to get a stronger guarantee.
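The in-place shrink described above could be wrapped roughly as follows. This is a minimal sketch, not the project's implementation: `ShrinkInPlace`, `InPlaceResizeFn` and `kMinShrinkBlock` are hypothetical names, and the resize function is passed in as a pointer so the sketch compiles without jemalloc headers. With jemalloc linked, one would pass `xallocx` itself, whose signature matches the one quoted above.

```cpp
#include <cassert>
#include <cstddef>

// Same signature as jemalloc's xallocx(). Taking it as a parameter is an
// assumption of this sketch, made so it does not depend on jemalloc headers.
using InPlaceResizeFn = std::size_t (*)(void* ptr, std::size_t size,
                                        std::size_t extra, int flags);

// Hypothetical threshold: only blocks of at least 2M are shrunk.
constexpr std::size_t kMinShrinkBlock = std::size_t(2) << 20;

// Tries to release the unused upper part of `block` without moving it.
// Returns the capacity of the block afterwards (unchanged if nothing was done).
inline std::size_t ShrinkInPlace(void* block, std::size_t capacity,
                                 std::size_t used, InPlaceResizeFn resize) {
    assert(used > 0 && used <= capacity);  // xallocx: size == 0 is undefined
    if (capacity < kMinShrinkBlock)
        return capacity;                   // below our own "large" threshold
    std::size_t real = resize(block, used, /*extra=*/0, /*flags=*/0);
    // The pointer never changes, so concurrent readers of the lower part of
    // the block stay valid whether or not the shrink succeeded.
    return real < capacity ? real : capacity;
}
```

Because the block address is never replaced, the reader in Thread-2 above is not affected even when the shrink fails; the only cost of a failure is that the extra memory stays allocated.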

MemTable with array based Threaded Red-Black Tree

This memory pool is intended to show the advantages of our MemTable refactoring:

Memory dumpable
Without RocksDB key value prefix len encoding
Dynamic Pluggable MemTable (MemTable Abstract Factory: #18)

Using array based Threaded Red-Black Tree:

struct Node {
    uint32_t  left;
    uint32_t  right;
    uint64_t  offset : 39; // offset to key value data in mempool
    uint64_t  color  :  1; // red or black
    uint64_t  keylen : 24; // valuelen = offset[idx+1] - offset.nodes[idx] - keylen;
};

Nodes and KeyValue data share a single mempool
Nodes start at the beginning of the mempool, growing upward
Offsets start at the end of the mempool, growing downward (KeyValue data)
Max node count is 2³² - 2
Max KeyValue mempool capacity is 512G
Max KeyLen is 2²⁴ - 1
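The limits listed above follow directly from the bit-field widths. The sketch below restates the `Node` layout and derives them; the constant names and the `MakeNode` helper are hypothetical, and the 16-byte size assumes the bit-field packing used by mainstream compilers (GCC, Clang, MSVC).

```cpp
#include <cassert>
#include <cstdint>

struct Node {
    uint32_t left;
    uint32_t right;
    uint64_t offset : 39;  // offset to key value data in mempool
    uint64_t color  :  1;  // red or black
    uint64_t keylen : 24;
};

// 4 + 4 bytes of child indices plus 39 + 1 + 24 = 64 bits of packed fields.
static_assert(sizeof(Node) == 16, "Node is expected to pack into 16 bytes");

constexpr uint64_t kMaxPoolBytes = uint64_t(1) << 39;        // 512G
constexpr uint32_t kMaxKeyLen    = (uint32_t(1) << 24) - 1;  // 2^24 - 1
constexpr uint64_t kMaxNodeNum   = (uint64_t(1) << 32) - 2;  // as stated above

// Hypothetical helper: builds a node and checks that the fields fit.
inline Node MakeNode(uint64_t offset, uint32_t keylen) {
    assert(offset < kMaxPoolBytes && keylen <= kMaxKeyLen);
    Node n{};
    n.offset = offset;
    n.keylen = keylen;
    return n;
}
```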

Compare: TerarkDB + (fuse vs ceph)

terark-zip-rocksdb: Use arena for Index's Iterator

NestLoudsTrie now supports user-provided memory and does not need to call malloc. terark-zip-table should be sped up by this feature.

terark-core: Implement a thread-cached mempool

Priority: Low

Same as the existing mempool: memory is identified by offset.
Interface should be the same as the existing mempool.
Object layout should be consistent with the existing mempool, as in MemPool_CompileX.

rocksdb: Add light weight compaction: merge multiple sorted runs

Keys in RocksDB SSTs are often range-clustered; that is, points where key ranges of different SSTs overlap across levels are rare.

Pick several sorted runs, then create an index that maps each key range to an SST number. We call this a lightweight compaction.
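The key-range-to-SST index could look roughly like this. It is a minimal sketch under a simplifying assumption not stated in the issue: the ranges are non-overlapping, so each range can be identified by its largest key. `RangeIndex` and its methods are hypothetical names, not RocksDB APIs.

```cpp
#include <cassert>
#include <map>
#include <string>

// Maps the largest key of each (non-overlapping) range to an SST number.
class RangeIndex {
public:
    // Registers a range that ends at `largest_key` (inclusive).
    void AddRange(const std::string& largest_key, int sst_number) {
        upper_bounds_[largest_key] = sst_number;
    }

    // Returns the SST number whose range may contain `key`,
    // or -1 if `key` is greater than every registered range.
    int Find(const std::string& key) const {
        auto it = upper_bounds_.lower_bound(key);
        return it == upper_bounds_.end() ? -1 : it->second;
    }

private:
    std::map<std::string, int> upper_bounds_;
};
```

A lookup costs one ordered-map search and no data is rewritten, which is what makes this kind of merge light compared with a regular compaction.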

patricia: File read+write mmap memory pool

With a file-backed read+write mmap memory pool, writing to the device will not waste CPU or memory.

This needs RocksDB to create a file for the memtable.

If the memory pool file is in NVM, this will produce a non-volatile memtable.

This improvement is less useful for distributed rocksdb.

Second pass scan: Populate intersection set with first pass scan

In compaction, TerarkSST needs a two-pass scan. Currently we require both passes to yield identical results. This restriction can be removed: just keep the intersection set of the two passes!

TerarkDB vs RocksDB: reverse scan performance

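TerarkDB's reverse scan is fast; we need to produce benchmark results to show it.

Items to compare:

Reverse scan of the whole database
Reverse scans of different lengths: each Iterator.Seek is followed by a number of Iterator.Prev calls (10, 20, ...)
Reverse scan performance versus forward scan performance
Memory limit
- Unlimited memory
- Just enough memory for TerarkDB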

patricia: Reimplement MultiReadMultiWrite insert

Priority: Low

Lazy List and MemPool need refactoring
memcmp for copy-on-write check
update-parent mutex sharded by 127, because:
- 127 mutexes is a moderate number
- 127 is a prime number
- x % 127 can be optimized by the compiler
other unexpected things
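The 127-way mutex sharding mentioned above might be sketched as follows. `ParentLocks` and the use of a 32-bit parent position as the shard key are assumptions made for illustration; the issue does not say what value is reduced modulo 127.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// A fixed pool of 127 mutexes; each parent node maps to one of them.
class ParentLocks {
public:
    static constexpr std::size_t kShards = 127;  // prime, moderate count

    // The divisor is a compile-time constant, so the compiler can replace
    // the division with a multiply-and-shift sequence.
    static std::size_t ShardOf(std::uint32_t parent_pos) {
        return parent_pos % kShards;
    }

    std::mutex& For(std::uint32_t parent_pos) {
        return locks_[ShardOf(parent_pos)];
    }

private:
    std::array<std::mutex, kShards> locks_;
};
```

Two parents that land in the same shard simply contend on the same mutex, which costs some parallelism but never correctness.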

terark-zip-rocksdb: madvise index as sequential access

Currently mmap data is madvised as RANDOM, which is not optimal for index data such as NLT & zbs offsets.
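Switching the advice for an index region is a single call to POSIX/Linux `madvise`. The wrapper below is a minimal sketch; `AdviseSequential` is a hypothetical name, and deciding which part of the mapped file counts as index data is left to the caller.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>

// Marks [region, region + len) as sequentially accessed so the kernel can
// read ahead more aggressively. `region` must be page-aligned, for example
// the start of an mmap'ed range. Returns true on success.
inline bool AdviseSequential(void* region, std::size_t len) {
    assert(region != nullptr && len > 0);
    return madvise(region, len, MADV_SEQUENTIAL) == 0;
}
```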

rocksdb: WriteBatchWithIndex: use per ColumnFamily index

In WriteBatchWithIndex, use one Index per ColumnFamily to reduce compare overhead and improve overall performance.

terarksql-cluster design

Make storage engine distributed: communication between nodes
- 1-writer, n-reader / master-slave (slave is reader)
- data sync by storage engine: RocksDB WAL log (async between nodes after master fsyncs WAL to SSD)
Master election?
MySQL layer is stateless
- MySQL layer master/slave is the same as at the storage engine level
- MySQL layer knows whether it is master or slave
- Disable MySQL master/slave
Configuration syncs between nodes: Storage Engine config & MySQL config
- Using ZeroMQ for WAL log sync
- First release: Only one configured master

Add db change callback

In rocksdb, add a callback that lets the db notify the application of db changes.

For example, the MyRocks storage engine may hold state that is used as a cache for some metadata.

In distributed MyRocks, a slave node reads the wal-log to sync the master's changes to the db, so the state cached for metadata may become stale. With a db-change callback, this state can be updated.

This can also be implemented by the slave's wal-log syncer.

Huffman Encoding bug fix

terark-fsa: Write an optimized dynamic Patricia Trie

We could not find an optimized dynamic Patricia Trie implementation, so we are writing one ourselves.

Integrate to terark-fsa architecture
Small memory usage
1. Use a single memory block, be dumpable
Fast lookup
Fast Iterator

This Patricia Trie will be used as the rocksdb memtable.

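terark-zip: Make full use of value files written by speculative execution

Value files written by speculative execution are currently discarded when the speculation fails. Make use of this data to reduce the amount of data read by the second-pass scan.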

RangeMerge Support

Priority: Low

As described in RocksDB issue

This is a low-priority job for us; let the RocksDB community grab it first.

NLT core string pool: use end mark instead of length

Find a byte that does not exist in any core string, and use this byte as the end mark.

The core string vector will be sorted in right-aligned bytewise order, for compression and for use of the end mark: strings are compared from the last byte backwards, and when one string is a suffix of the other, the longer one is less:

int compare(fstring x, fstring y) {
   size_t n = std::min(x.size(), y.size());
   auto px = (const byte_t*)x.data() + x.size();
   auto py = (const byte_t*)y.data() + y.size();
   while (n) {
       if (*--px != *--py)
           return *px < *py ? -1 : +1;
       --n;
   }
   if (x.size() > y.size())  return -1; // longer is less
   if (x.size() < y.size())  return +1; // shorter is greater
   return 0;
}
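Choosing the end mark described above amounts to finding a byte value that none of the core strings use. A minimal sketch, using `std::string` in place of the project's `fstring`; `FindEndMark` is a hypothetical name:

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Returns a byte value in [0, 255] that occurs in none of the strings,
// or -1 if every byte value is used (no end mark is possible then).
inline int FindEndMark(const std::vector<std::string>& core_strings) {
    std::array<bool, 256> seen{};  // all false initially
    for (const std::string& s : core_strings)
        for (unsigned char c : s)
            seen[c] = true;
    for (int b = 0; b < 256; ++b)
        if (!seen[b])
            return b;
    return -1;
}
```

When the function returns -1, the pool would have to fall back to explicit lengths, since no single byte can safely terminate a string.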

glusterfs: investigate mix of fuse + mmap and libgfapi glfs_pread

Index region should use fuse + mmap
Data region should use glfs_pread

Investigate the feasibility of mixing fuse + mmap and libgfapi glfs_pread

Distributed RocksDB: Data sync & file lock

There are two kinds of read-only readers: syncing-readonly and nonsyncing-readonly.

Writer publishes DB changes: wal-log, manifest...
When a readonly node comes up
1. it needs to connect to the writer node to acquire a snapshot.
  - if writer is not online, ...
  - in any case, it needs some mechanism to lock the needed files (prevent such files from being deleted)
2. if it is syncing-readonly, it subscribes to topics for DB changes
3. if it is nonsyncing-readonly, it just reads
using ZooKeeper

2018-10-18 10:57: In the gluster-env implementation, create a sync-channel for log files: file name pattern is log, we create such a sync-channel, which may simplify the overall implementation.

NLT: Divide large NLT into blocks

Divide large NLT into blocks to improve sequential scan performance.

Dump MemTable as SST

A PatriciaTrie based MemTable can be dumped to disk and loaded by mmap.

Dumping a MemTable is much faster than a MemTable flush, which yields many advantages:

Dump speed can reach the SSD's max write speed (such as 3GB/sec), much faster than MemTable flush.
MemTable flush can be replaced by dump, which reduces compression workload.
Fast start up: a MemTable dump is very likely to succeed, which avoids wal log replay on start up.

terark-zip: composite-index: minimum discriminate prefix

Priority: Low

refactoring
nlt as index prefix

rocksdb: 1-write n-read, engine level distribution

sync in-memory changes
load sst after compaction/flush
since the writer/master may delete files that are no longer needed, it must "know" which files are in use by reader/slave
- explicit communication between writer & reader is required
- tailing the wal-log is simpler & slower than ZeroMQ
- manifest files also need to be monitored
there are two kinds of readonly instance
- readonly & syncs writer's updates to the DB, used for OLTP
- readonly & does not sync writer's updates, used for OLAP (spark...) and distributed compaction

Implement TableReader::Prefetch(begin, end)

Priority: Low

For future feature DB::Prefetch(begin, end), which can be used by MyRocks...

Blocked Louds Succinct Trie

Divide LOUDS into blocks, block size can be 4K, 8K, 16K...
No nesting
Compress in-block redundancy
- Compressed data must be searchable with fast speed
- Maximize compression ratio
Building speed should be fast

Add Abstract Factory for MemTable

RocksDB now has a Factory for MemTable classes, but it lacks an Abstract Factory: we cannot create a MemTable just by class name.

So add an Abstract Factory keyed by MemTable class name.
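Creation by class name is commonly done with a registry of creator functions. The following is a minimal sketch of that pattern, not RocksDB code: `MemTableRepLike`, `DemoPatriciaRep` and `MemTableRegistry` are hypothetical stand-ins for the real MemTable classes and for whatever registration mechanism is eventually chosen.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in for the MemTable base class (hypothetical).
struct MemTableRepLike {
    virtual ~MemTableRepLike() = default;
    virtual const char* Name() const = 0;
};

// A demo implementation, used only to illustrate registration.
struct DemoPatriciaRep : MemTableRepLike {
    const char* Name() const override { return "PatriciaTrieRep"; }
};

// Maps class names to functions that create instances of that class.
class MemTableRegistry {
public:
    using Creator = std::function<std::unique_ptr<MemTableRepLike>()>;

    void Register(const std::string& class_name, Creator creator) {
        creators_[class_name] = std::move(creator);
    }

    // Returns a new instance, or nullptr if the name is not registered.
    std::unique_ptr<MemTableRepLike> Create(const std::string& class_name) const {
        auto it = creators_.find(class_name);
        if (it == creators_.end())
            return nullptr;
        return it->second();
    }

private:
    std::map<std::string, Creator> creators_;
};
```

With such a registry, a MemTable type named in a configuration string can be instantiated without the caller knowing the concrete class.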

TerarkCompositeUintIndex under SortedUint impl bug

glusterfs: Large File random read performance

Large File is larger than 10GB
Random read IOPS of length: 128B, 256B, 512B, 1K, 2K, 4K, 8K, 16K
Multi thread read

terark_zip_table: store GroupSize of Multi-Value pack in separated array

Priority: Low

terark_zip_table currently has zvType, but we don't know the GroupSize of each MultiValue pack before unzipping it. To get GroupSize before unzipping, using a succinct structure:

Needs 2-bit rank: rank0, rank1, rank2, rank3
Build RankIndex for zvType
Store GroupSize in a separated array

High level code:

size_t groupIndex = zvType.rank3(recId); // 3 == ZipValueType::kMulti
groupSize = groupSizeVec[groupIndex];
// ....
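To make the lookup concrete, here is a naive, non-succinct stand-in for the 2-bit rank index. It assumes, as the high level code suggests, that `rank3(pos)` counts the entries equal to 3 in `[0, pos)`. `TypeRank` is a hypothetical name, and a real implementation would pack the types into 2 bits each instead of one byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stores one type value (0..3) per record plus block-wise prefix counts of 3s.
class TypeRank {
public:
    explicit TypeRank(std::vector<uint8_t> types) : types_(std::move(types)) {
        uint32_t count = 0;
        for (std::size_t i = 0; i < types_.size(); ++i) {
            if (i % kBlock == 0)
                prefix_.push_back(count);  // number of 3s before this block
            if (types_[i] == 3)
                ++count;
        }
        prefix_.push_back(count);          // sentinel for pos == size()
    }

    // Number of entries equal to 3 in [0, pos).
    std::size_t rank3(std::size_t pos) const {
        assert(pos <= types_.size());
        std::size_t block = pos / kBlock;
        std::size_t r = prefix_[block];
        for (std::size_t i = block * kBlock; i < pos; ++i)
            r += (types_[i] == 3);
        return r;
    }

private:
    static constexpr std::size_t kBlock = 64;
    std::vector<uint8_t> types_;
    std::vector<uint32_t> prefix_;
};
```

For a record `recId` whose type is 3, `groupSizeVec[rank.rank3(recId)]` then yields its GroupSize, because the rank is exactly the number of multi-value records that precede it.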

NLT: fixed length suffix

After building the outer trie, the inner strVec may be fixed length, or nearly fixed length. We can save the fixed-length suffixes in a parallel array and thus omit the storage for the offset array.

MemPool: Thread Local Cache MemPool

Use Thread Local Cache to minimize CPU cache conflicts.

2M block for sfree implementation:

 void sfree(size_t pos, size_t len) {
    auto tc = m_thread_cache[pos >> 21]; // >>21 for div 2M
    tc->sfree(pos, len);
 }

rocksdb: allow memtable per prefix

A dynamic patricia trie should be fast, but concurrent writes are hard to implement, so use prefix id to multiplex writes.

rocksdb: add gluster filesystem env

use the gluster native api, not gluster fuse

2018-10-15 16:00

do read by gfapi, for reading value data
support mmap for reading index data

RandomAccessFile::FsRead has been added for this purpose: d1d5621a

Push modify "Delay verify compaction output table" to official

https://github.com/Terark/rocksdb/commit/f102374127b7325607d7b37341b31729d4f3dbb8

rocksdb: Implement a memtable by boost intrusive rbtree

The purpose of implementing such a memtable is to use a hint to optimize sequential writes.

The hint should be end(). boost intrusive rbtree packs the color bit into a pointer and has a parent pointer, which is why it can be optimized by hint. An rbtree without parent pointers is very hard to optimize by hint.
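The effect of an end() hint can be illustrated with `std::map`, which is used here instead of boost intrusive rbtree only to keep the sketch dependency-free. `HintedTable` is a hypothetical name. `emplace_hint` is amortized constant time when the new element is inserted just before the hint, which is the case for ascending keys hinted with end().

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// A toy ordered table whose inserts are hinted with end().
class HintedTable {
public:
    void Put(const std::string& key, const std::string& value) {
        // For sequential (ascending) writes the key belongs right before
        // end(), so the tree does not need a full descent from the root.
        auto it = table_.emplace_hint(table_.end(), key, value);
        it->second = value;  // overwrite when the key already existed
    }

    const std::string* Get(const std::string& key) const {
        auto it = table_.find(key);
        return it == table_.end() ? nullptr : &it->second;
    }

    std::size_t Size() const { return table_.size(); }

private:
    std::map<std::string, std::string> table_;
};
```

Keys that arrive out of order are still inserted correctly; the hint is simply not useful for them and the insert costs the usual logarithmic time.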

Brute force & simple stupid concurrent MemTable

Create an instance-thread-local Sub-MemTable for such a MemTable. Such a MemTable is optimized for writes and has read amplification.

Each write thread writes KeyValue into its own Sub-MemTable
Reader threads find or iterate over all Sub-MemTables

We first implement such a MemTable on top of PatriciaTrie

PatriciaTrie requiring a fixed memory size may be an issue
realloc to shrink_to_fit, assuming realloc will not relocate memory when shrinking.
- jemalloc has a function xallocx which is dedicated to resizing memory in place

Once MemTable is no longer a bottleneck, the advantage of distributed compaction will become more apparent...
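The per-thread Sub-MemTable scheme above can be sketched as follows. This is a deliberately simple illustration, not the planned PatriciaTrie-based design: `std::map` stands in for the trie, the global sequence number used to pick the newest value across Sub-MemTables is an assumption of the sketch, and every name is hypothetical. A real implementation would cache the Sub-MemTable pointer in thread-local storage rather than look it up under a lock on every write.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <thread>

class BruteForceMemTable {
    struct Entry { uint64_t seq; std::string value; };
    struct Sub { std::mutex mu; std::map<std::string, Entry> kv; };

public:
    // Each writer thread writes only into its own Sub-MemTable.
    void Put(const std::string& key, const std::string& value) {
        Sub& sub = LocalSub();
        uint64_t seq = next_seq_.fetch_add(1) + 1;  // seq 0 means "not found"
        std::lock_guard<std::mutex> lock(sub.mu);
        sub.kv[key] = Entry{seq, value};
    }

    // Readers pay the read amplification: every Sub-MemTable is searched
    // and the value with the highest sequence number wins.
    bool Get(const std::string& key, std::string* value) const {
        std::lock_guard<std::mutex> lock(subs_mu_);
        uint64_t best = 0;
        for (const auto& p : subs_) {
            std::lock_guard<std::mutex> sub_lock(p.second->mu);
            auto it = p.second->kv.find(key);
            if (it != p.second->kv.end() && it->second.seq > best) {
                best = it->second.seq;
                *value = it->second.value;
            }
        }
        return best != 0;
    }

private:
    Sub& LocalSub() {
        std::lock_guard<std::mutex> lock(subs_mu_);
        std::unique_ptr<Sub>& slot = subs_[std::this_thread::get_id()];
        if (!slot)
            slot.reset(new Sub);
        return *slot;
    }

    mutable std::mutex subs_mu_;
    std::map<std::thread::id, std::unique_ptr<Sub>> subs_;
    std::atomic<uint64_t> next_seq_{0};
};
```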

Compare: pika + (TerarkDB vs RocksDB)

Use wikipedia data, multi thread/connection.

Uniform random read:

get
mget
pipeline get

Write:

Fill data sequentially
Fill data randomly
Random update

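Test glusterfs tailing performance

If glusterfs tailing performance is good enough, we do not need a messaging system such as ZeroMQ to sync the WAL Log.

The tail test appends to a file on one node and reads the end of the file on another node. glusterfs ships its own tail command, implemented with the glusterfs user-level API, which is more efficient than mounting over fuse and using the linux tail command.

We may need to write our own code, focusing on measuring tail latency.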
