cockroachdb / pebble Goto Github PK

RocksDB/LevelDB inspired key-value database in Go

License: BSD 3-Clause "New" or "Revised" License

Go 99.32% Makefile 0.07% CSS 0.02% JavaScript 0.47% Shell 0.12%

pebble's Introduction

Pebble ^_{_Coverage}

Nightly benchmarks

Pebble is a LevelDB/RocksDB inspired key-value store focused on performance and internal usage by CockroachDB. Pebble inherits the RocksDB file formats and a few extensions such as range deletion tombstones, table-level bloom filters, and updates to the MANIFEST format.

Pebble intentionally does not aspire to include every feature in RocksDB and specifically targets the use case and feature set needed by CockroachDB:

Block-based tables
Checkpoints
Indexed batches
Iterator options (lower/upper bound, table filter)
Level-based compaction
Manual compaction
Merge operator
Prefix bloom filters
Prefix iteration
Range deletion tombstones
Reverse iteration
SSTable ingestion
Single delete
Snapshots
Table-level bloom filters

RocksDB has a large number of features that are not implemented in Pebble:

Backups
Column families
Delete files in range
FIFO compaction style
Forward iterator / tailing iterator
Hash table format
Memtable bloom filter
Persistent cache
Pin iterator key / value
Plain table format
SSTable ingest-behind
Sub-compactions
Transactions
Universal compaction style

WARNING: Pebble may silently corrupt data or behave incorrectly if used with a RocksDB database that uses a feature Pebble doesn't support. Caveat emptor!

Production Ready

Pebble was introduced as an alternative storage engine to RocksDB in CockroachDB v20.1 (released May 2020) and was used in production successfully at that time. Pebble was made the default storage engine in CockroachDB v20.2 (released Nov 2020). Pebble is being used in production by users of CockroachDB at scale and is considered stable and production ready.

Advantages

Pebble offers several improvements over RocksDB:

Faster reverse iteration via backwards links in the memtable's skiplist.
Faster commit pipeline that achieves better concurrency.
Seamless merged iteration of indexed batches. The mutations in the batch conceptually occupy another memtable level.
L0 sublevels and flush splitting for concurrent compactions out of L0 and reduced read-amplification during heavy write load.
Faster LSM edits in LSMs with large numbers of sstables through use of a copy-on-write B-tree to hold file metadata.
Delete-only compactions that drop whole sstables that fall within the bounds of a range deletion.
Block-property collectors and filters that enable iterators to skip tables, index blocks and data blocks that are irrelevant, according to user-defined properties over key-value pairs.
Range keys API, allowing KV pairs defined over a range of keyspace with user-defined semantics and interleaved during iteration.
Smaller, more approachable code base.

See the Pebble vs RocksDB: Implementation Differences doc for more details on implementation differences.

RocksDB Compatibility

Pebble strives for forward compatibility with RocksDB 6.2.1 (the latest version of RocksDB used by CockroachDB). Forward compatibility means that a DB generated by RocksDB 6.2.1 can be upgraded for use by Pebble. Pebble versions in the v1 series may open DBs generated by RocksDB 6.2.1. Since its introduction, Pebble has adopted various backwards-incompatible format changes that are gated behind new 'format major versions'. The Pebble master branch does not support opening DBs generated by RocksDB. DBs generated by RocksDB may only be used with recent versions of Pebble after migrating them through format major version upgrades using previous versions of Pebble. See the below section of format major versions.

Even the RocksDB-compatible versions of Pebble only provide compatibility with the subset of functionality and configuration used by CockroachDB. The scope of RocksDB functionality and configuration is too large to adequately test and document all the incompatibilities. The list below contains known incompatibilities.

Pebble's use of WAL recycling is only compatible with RocksDB's kTolerateCorruptedTailRecords WAL recovery mode. Older versions of RocksDB would automatically map incompatible WAL recovery modes to kTolerateCorruptedTailRecords. New versions of RocksDB will disable WAL recycling.
Column families. Pebble does not support column families, nor does it attempt to detect their usage when opening a DB that may contain them.
Hash table format. Pebble does not support the hash table sstable format.
Plain table format. Pebble does not support the plain table sstable format.
SSTable format version 3 and 4. Pebble does not support version 3 and version 4 format sstables. The sstable format version is controlled by the BlockBasedTableOptions::format_version option. See #97.

Format major versions

Over time Pebble has introduced new physical file formats. Backwards incompatible changes are made through the introduction of 'format major versions'. By default, when Pebble opens a database, it defaults to the lowest supported version. In v1, this is FormatMostCompatible, which is bi-directionally compatible with RocksDB 6.2.1 (with the caveats described above).

Databases created by RocksDB or Pebble versions v1 and earlier must be upgraded to a compatible format major version before running newer Pebble versions. Newer Pebble versions will refuse to open databases in no longer supported formats.

To opt into new formats, a user may set FormatMajorVersion on the Options supplied to Open, or upgrade the format major version at runtime using DB.RatchetFormatMajorVersion. Format major version upgrades are permanent; There is no option to return to an earlier format.

The table below outlines the history of format major versions, along with what range of Pebble versions support that format.

Name	Value	Migration	Pebble support
FormatMostCompatible	1	No	v1
FormatVersioned	3	No	v1
FormatSetWithDelete	4	No	v1
FormatBlockPropertyCollector	5	No	v1
FormatSplitUserKeysMarked	6	Background	v1
FormatSplitUserKeysMarkedCompacted	7	Blocking	v1
FormatRangeKeys	8	No	v1
FormatMinTableFormatPebblev1	9	No	v1
FormatPrePebblev1Marked	10	Background	v1
FormatSSTableValueBlocks	12	No	v1
FormatFlushableIngest	13	No	v1, master
FormatPrePebblev1MarkedCompacted	14	Blocking	v1, master
FormatDeleteSizedAndObsolete	15	No	v1, master
FormatVirtualSSTables	16	No	v1, master

Upgrading to a format major version with 'Background' in the migration column may trigger background activity to rewrite physical file formats, typically through compactions. Upgrading to a format major version with 'Blocking' in the migration column will block until a migration is complete. The database may continue to serve reads and writes if upgrading a live database through RatchetFormatMajorVersion, but the method call will not return until the migration is complete.

For reference, the table below lists the range of supported Pebble format major versions for CockroachDB releases.

CockroachDB release	Earliest supported	Latest supported
20.1 through 21.1	FormatMostCompatible	FormatMostCompatible
21.2	FormatMostCompatible	FormatSetWithDelete
21.2	FormatMostCompatible	FormatSetWithDelete
22.1	FormatMostCompatible	FormatSplitUserKeysMarked
22.2	FormatMostCompatible	FormatPrePebblev1Marked
23.1	FormatSplitUserKeysMarkedCompacted	FormatFlushableIngest
23.2	FormatSplitUserKeysMarkedCompacted	FormatVirtualSSTables

Pedigree

Pebble is based on the incomplete Go version of LevelDB:

https://github.com/golang/leveldb

The Go version of LevelDB is based on the C++ original:

https://github.com/google/leveldb

Optimizations and inspiration were drawn from RocksDB:

https://github.com/facebook/rocksdb

Getting Started

Example Code

package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	key := []byte("hello")
	if err := db.Set(key, []byte("world"), pebble.Sync); err != nil {
		log.Fatal(err)
	}
	value, closer, err := db.Get(key)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s %s\n", key, value)
	if err := closer.Close(); err != nil {
		log.Fatal(err)
	}
	if err := db.Close(); err != nil {
		log.Fatal(err)
	}
}

pebble's People

Contributors

Stargazers

Watchers

Forkers

ariesdevil dialtr nvanbenschoten ajkr tbg justinj maimuderizi zhanglei isgasho c1dx promiseteam cih-y2k jwasinger stanxii yaojingguo ryanfsdf starius cybvon happyhakka andyli029 jeffrey-xiao itsbilal backwardn hueypark kuafou sumeerbhola itoolabs growler ahdong2007 asubiotto rohansuri andypeng2015 holajiawei heidawei mbrukman wangdong026 aayushshah15 jahau pbardea forkkit genkiapp zhiqiangxu jbowens ibytechaos cedrickchee brunotm miretskiy clucle kkxiaotikk gitsrc glorv otan-cockroach oibe petermattis lni flaskeer hopkings2008 alrs liuxp0827 xxxmailk dioptre v8lab chanyoung jaydenwen123 aadityasondhi topecongiro glycerine dergus cristaloleg rjl493456442 chainpunk fengfengzi1202 suzaku ksharpdabu h2atecnologia 13meimei ifls foxxnuaa xszyh opmindo buffernihility tangdi stjordanis zhangbitao mbtamuli skyzh ijsong res-web3 gjaldon douxu kuailehaibin dt ben1009 joeuo rheehot daniel-007 bugfyi rosegaller kennytm zhaox1n

pebble's Issues

cache: add support for "weak" cache handles

Add support for "weak" cache handles which allow a cache entry to be evicted but also allow a fast test to see if a cache entry is still present. Use these weak cache handles for caching the sstable filter and index blocks for which we want fast access in the normal case, but also want to allow being evicted from the cache under memory pressure.

A weak cache handle could be implemented as a struct containing an unsafe.Pointer:

type WeakHandle struct {
  ptr unsafe.Pointer
}

func (h *WeakHandle) Get() unsafe.Pointer {
  return atomic.LoadPointer(&h.ptr)
}

func (h *WeakHandle) clear() {
  return atomic.StorePointer(&h.ptr, nil)
}

The actual cache valued is stored by reference in WeakHandle.ptr. The cache internally would hand out the same WeakHandle object for a given entry and maintain a pointer to it. When an entry is evicted, the cache calls WeakHandle.clear().

db: add max_manifest_file_size option

The RocksDB max_manifest_file_size option limits how large the manifest can become:

// manifest file is rolled over on reaching this limit.
// The older manifest file be deleted.
// The default value is 1GB so that the manifest file can grow, but not
// reach the limit of storage capacity.

CRDB sets this to 128MB as we found that 1GB can cause an excessive startup time as the manifest is "replayed". Currently, Pebble has no support for rolling the manifest when it becomes too large. There is a TODO in the code that this needs to be addressed.

compaction: manual compactions

Implement DB.Compact which compacts a range of keys down to the lowest level of the LSM. Determine if it is worthwhile to be able to specify a target level for the compaction.

perf: maybe throttle concurrent compactions

Similar to #46, Pebble only allows a single compaction to occur at any time. RocksDB allows multiple concurrent compactions. The benefit of allowing multiple concurrent compactions is a bit unclear, as on the surface allowing a concurrent compaction would seem to simply divide write bandwidth across the concurrent compactions. The exception to this is the benefit in allowing L0 compactions due to the need to quickly move L0 sstables to lower levels in order to avoid write throttling due to too many L0 sstables. This points to an alternative to concurrent compactions: allow compactions to be preempted (or paused), preferentially performing work on higher priority compactions (or flushes).

Cc @ajkr

perf: avoid holding DB.mu during log writes

Mutual exclusion for appending to the log (in memory) is currently achieved by holding DB.mu. This is unfortunate as a log write can result in a disk write and thus hold DB.mu for a significant amount of time. RocksDB has a separate log-write mutex for providing mutual exclusion for log writes. We should investigate performing a similar separation of locking duties in Pebble.

Merge operator

Add Options.MergeOperator and implement DB.Merge. Nothing particularly magical here, though care needs to be taken with the MergeOperator API.

db: debug/event logging

Add a debug/event logging facility for events such as flushes and compactions.

compaction: support CompactionFilter

I am migrating code from gorocksdb to pebble.

Ideally CompactionFilter can be added to skip some entry by key or value. The CompactionFilter is defined as this:

https://github.com/tecbot/gorocksdb/blob/master/compaction_filter.go#L7

perf: add AbbreviatedKey support to mergingIter

mergingIter.{Next,Prev} currently perform a full key comparison. The AbbreviatedKey facility used by batchskl could be used here as well to do a quick check on a fixed prefix of the key. An initial prototype showed this to provide a modest improvement in performance.

cache: experiment with manually managing memory

cache.Cache currently caches entries where the value has been allocated by the normal Go allocator (i.e. each value is individually allocated). The upside to this approach is simplicity and that correctness is easily achieved by leveraging the Go GC. The downside is that this adds significant work for the Go GC. For example, consider a 16 GB cache composed of 32 KB blocks (the default block size in Pebble). The cache would contain 512K blocks. Each GC cycle needs to scan the pointers in memory, looking for unreachable data. The more individual pointers, the slower the GC cycle.

Rather than using the Go allocator for each cached block, we could manually manage the memory. The thinking here is to allocate a relatively small number of large chunks of memory and to break that memory into pieces. We'd want to do this in a way that the cache memory is hidden from the Go GC (otherwise we'd just be creating an equivalent number of pointers for the GC to follow). This can be accomplished by using mmap to allocate chunks of memory. Examples of doing this in Go are easily found. The challenge with this approach is correctness: how to determine when a chunk of memory retrieved from the cache is no longer being used. The brute force approach is to require every user of the cache to release memory that is retrieved from the cache. Or we can try to do something more subtle using finalizers: retrieving an object from the cache allocates a new object with a finalizer attached. When the finalizer runs it decrements a reference count on the entry. This partially leverages the Go GC and might be simpler to implement, though this usage of finalizers would have to be thoroughly tested.

Before doing anything significant here, it will be useful to develop a benchmark showing the overhead of the current (Go allocator) approach for multi-gigabyte size caches.

compaction: sstable output size

The compaction sstable output size is unimplemented. The ShouldStopBefore mechanism/heuristic used in both RocksDB and LevelDB needs to be implemented. This heuristic adjusts the output sstable size based on the number of bytes the sstable overlaps in the "grandparent" level (the level below the output sstable level). Care should be taken to ensure that this takes into account range deletions in addition to puts, merges and point deletions.

iterator: TableFilter option

Add support for prefix-same-as-start (though a better name would be welcome), iterate {upper,lower} bound, and iterator table-filter.

Provide an API for cloning an iterator (with new options) that returns a new iterator pinned to the same visible sequence number.

use raw block device for db

Thanks for this great project, i known that ceph uses in their bluestore rocksdb, does it possible to use raw block device for pebble?
In case of using raw block device we don't need to know how to tune filesystem, and use kv directly...

Transactions

Didn't see it on the [TODO] or any roadmap but are transactions on the horizon or out of scope?

perf: lock-free reads

DB.Get and DB.NewIter both currently grab DB.mu in order to atomically reference the current memtable, the immutable memtables, and the current version. In RocksDB, the SuperVersion holds a reference to the this state. By bundling this state into the SuperVersion, a single reference count can be bumped to ensure that the state is stable for the lifetime of a read operation (Get or Iterator). The current SuperVersion is retrieved from a thread-local variable (see ColumnFamilyData::GetThreadLocalSuperVersion). It is unclear how to do something similar in Go given the lack of thread-local variables, but at the very least we could bundle the read state into a single struct and "acquire" it with an atomic operation.

perf: compaction / user iterator readahead

RocksDB has separate knobs for controlling readahead for compaction and user iterators. RocksDB recommends a value of 2MB for compaction_readahead_size when running on spinning disk. This knob defaults to off. Similarly ReadOptions.readahead_size controls readahead for user iterators. A similar recommendation of 2MB is made for improving forward iteration performance on spinning disk. It isn't clear if either knob is useful when using buffered IO.

Related to readahead is the OS caching behavior. RocksDB specifies fadvise(..., POSIX_FADV_DONTNEED) when writing sstables to avoid having the data for those tables retained in the OS block cache.

Cc @ajkr. Do you know how useful these knobs and fadvice-dont-need are in practice?

db: DeleteRange

Implement {DB,Batch}.DeleteRange. Similar to RocksDB, range tombstones will be stored in a parallel memtable. In sstables, range tombstones will be stored in the range-del meta block. The initial implementation should take a similar approach to RocksDB and aggregate range tombstones during iteration into something akin to the RangeDelAggregator structure.

During iteration, range tombstones need to be checked to see if a key is deleted. Additionally, range tombstones should be used to skip spans of deleted keys and entire sstables. During compaction, range tombstones should be taken into consideration when determining sstable boundaries.

cmd/pebble: add many more benchmarks

The RocksDB db_bench tool contains a variety of benchmarks:

Available benchmarks:
      	fillseq       -- write N values in sequential key order in async mode
      	fillseqdeterministic       -- write N values in the specified key order and keep the shape of the LSM tree
      	fillrandom    -- write N values in random key order in async mode
      	filluniquerandomdeterministic       -- write N values in a random key order and keep the shape of the LSM tree
      	overwrite     -- overwrite N values in random key order in async mode
      	fillsync      -- write N/100 values in random key order in sync mode
      	fill100K      -- write N/1000 100K values in random order in async mode
      	deleteseq     -- delete N keys in sequential order
      	deleterandom  -- delete N keys in random order
      	readseq       -- read N times sequentially
      	readtocache   -- 1 thread reading database sequentially
      	readreverse   -- read N times in reverse order
      	readrandom    -- read N times in random order
      	readmissing   -- read N missing keys in random order
      	readwhilewriting      -- 1 writer, N threads doing random reads
      	readwhilemerging      -- 1 merger, N threads doing random reads
      	readwhilescanning     -- 1 thread doing full table scan, N threads doing random reads
      	readrandomwriterandom -- N threads doing random-read, random-write
      	updaterandom  -- N threads doing read-modify-write for random keys
      	xorupdaterandom  -- N threads doing read-XOR-write for random keys
      	appendrandom  -- N threads doing read-modify-write with growing values
      	mergerandom   -- same as updaterandom/appendrandom using merge operator. Must be used with merge_operator
      	readrandommergerandom -- perform N random read-or-merge operations. Must be used with merge_operator
      	newiterator   -- repeated iterator creation
      	seekrandom    -- N random seeks, call Next seek_nexts times per seek
      	seekrandomwhilewriting -- seekrandom and 1 thread doing overwrite
      	seekrandomwhilemerging -- seekrandom and 1 thread doing merge
      	crc32c        -- repeated crc32c of 4K of data
      	xxhash        -- repeated xxHash of 4K of data
      	acquireload   -- load N*1000 times
      	fillseekseq   -- write N values in sequential key, then read them by seeking to each key
      	randomtransaction     -- execute N random transactions and verify correctness
      	randomreplacekeys     -- randomly replaces N keys by deleting the old version and putting the new version
      	timeseries            -- 1 writer generates time series data and multiple readers doing random reads on id

The number of options is intimidating, yet there are undoubtedly good ideas that should be incorporated into Pebble benchmarking.

Additionally, we need infrastructure for verifying that real-world performance does not regress. This is complicated as we need to watch for dips in performance, not just compare a few numbers at the end of a run.

perf: investigate WAL file reuse

RocksDB supports preallocating space for the WAL which can reduce the data that needs to be synced whenever fdatasync is called. In particular, if we pre-extend the size of the file, calling fdatasync will only need to write the data blocks and not the inode. Under RocksDB this is achieved by setting fallocate_with_keep_size == false.

See preallocate.go, preallocate_unix.go, and preallocate_darwin.go for the etcd code which performs the OS specific system calls.

sstable: ingestion

Implement DB.Ingest for atomically ingesting a set of externally created sstables. The entries in the sstable are all given the same sequence number using the global-seqnum sstable property. That means the ingested sstables must not self overlap. Sstables should be ingested into the lowest level of the LSM tree possible, though that might mean flushing the memtable and ingesting them into L0.

iterator: ensure iterators are always closed

And a debug mechanism that verifies that iterators are always closed and can pinpoint the source of an iterator leak.

*: testing

Testing is inadequate in various areas:

Data driven test infrastructure should be added. In particular, the compaction heuristics would benefit from data driven test infrastructure to allow the easy specification of compaction scenarios.

Full-table bloom filter

Add support for RocksDB full-table bloom filters.

*: doc comments

There are various areas of the code which need further commentary to explain the implementation:

Batch: full indexing support (including merge and delete-range)
Batch: large (flushable) batches
Batch: commit pipeline
Cache: weak handles
Delete range: fragmentation, batches, compaction, get, iterator, ingestion
Memtable: max size (see flushable batches)
Memtable: reverse iteration
Table ingestion

perf: performance stats

RocksDB has an extensive set of performance statistics. These take the form of tickers, histograms, IO stats, and per-thread performance counters.

The default Statistics implementation (used for tickers and histograms) utilizes core-local variables to minimize the overhead for gathering these stats. Go doesn't provide any facilities for core-local or thread-local variables making this difficult to translate, though it isn't clear if doing so is necessary.

It is possible that Pebble should simply provide an interface for Statistics (or Metrics) and a mechanism to control which are enabled and then allow the user to decide how they are to be recorded. CRDB would likely want to use its existing metrics package.

perf: optimize {Lower,Upper}Bound handling

IterOptions.{Lower,Upper}Bound handling is currently performed by checking the current key against the upper bound during forward iteration and against lower bound during reverse iteration. This comparison is performed on every call to Iterator.{Next,Prev}. Rather than performing this check in Iterator, we can push this check down to the leaves of the iteration tree. For example, a levelIter can check to see if the entire table is before UpperBound and then perform no further checks for keys within the table. Similarly, sstable.Iterator can do a check to see if all keys within a block are before UpperBound.

A bit of experimental code shows that this should be good for an 8-10% improvement in forward iteration speed. A similar improvement should be present for reverse iteration speed.

perf: investigate rent-or-buy compaction heuristics

Pebble currently implements compaction heuristics that match the RocksDB kOldestSmallestSeqFirst setting combined with level_compaction_dynamic_level_bytes. The kOldestSmallestSeqFirst setting has this comment:

// First compact files whose range hasn't been compacted to the next level
// for the longest. If your updates are random across the key space,
// write amplification is slightly better with this option.

CRDB currently uses a different compaction heuristic, kMinOverlappingRatio:

// First compact files whose ratio between overlapping size in next level
// and its size is the smallest. It in many cases can optimize write
// amplification.

The KOldestSmallestSeqFirst heuristic has the advantage of being simpler, and it avoids starving an sstable from being compacted. kMinOverlappingRatio is the default heuristic in RocksDB. My understanding is that it can propagate deletion tombstones to lower levels quicker.

It is unclear to me if kMinOverlappingRatio is significantly better for CRDBs use case. Note that CRDB also know marks any sstable containing a range deletion tombstone as requiring compaction, which will tend to result in range deletion tombstones quickly compacting down to the bottom of the LSM and thus freeing up space. Incorporating better heuristics around range deletion tombstones into compaction is worth investigating.

Cc @ajkr

perf: pool allocate Iterator, sstable.Iterator, and arenaskl.Iterator

Iterator, sstable.Iterator, and arenaskl.Iterator have known lifetimes due to the existence of the Close method, making them easy candidates for pool allocation which will reduce GC pressure.

inconsistent batch count in `pebble sync`

$ pebble sync -c 24 -d 0 bench
dir bench
concurrency 24
panic: pebble: inconsistent batch count

goroutine 1 [running]:
github.com/petermattis/pebble.(*memTable).apply(0xc00012a0f0, 0xc00010f7a8, 0x38f2afc, 0xc0081b5cf0, 0x152)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/mem_table.go:147 +0x2d1
github.com/petermattis/pebble.(*DB).replayWAL(0xc0000fe800, 0xc00010fa58, 0x659b20, 0x7ea320, 0xc00413abf0, 0x10, 0x38f2b01, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/open.go:266 +0x3bd
github.com/petermattis/pebble.Open(0x7ffd65180606, 0x5, 0xc000108210, 0x0, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/open.go:158 +0xedd
main.runTest(0x7ffd65180606, 0x5, 0xc00010fc70, 0xc00010fc90, 0xc00010fc80)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/cmd/pebble/test.go:193 +0x2a2
main.runSync(0x7c9800, 0xc0000f60a0, 0x1, 0x5)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/cmd/pebble/sync.go:30 +0xb1
github.com/petermattis/pebble/vendor/github.com/spf13/cobra.(*Command).execute(0x7c9800, 0xc0000f6050, 0x5, 0x5, 0x7c9800, 0xc0000f6050)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/vendor/github.com/spf13/cobra/command.go:766 +0x2cc
github.com/petermattis/pebble/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x7c9a60, 0x0, 0x625ce8, 0x16)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/vendor/github.com/spf13/cobra/command.go:852 +0x2fd
github.com/petermattis/pebble/vendor/github.com/spf13/cobra.(*Command).Execute(0x7c9a60, 0x7b2360, 0x6220cc)
        /home/tschottdorf/go/src/github.com/petermattis/pebble/vendor/github.com/spf13/cobra/command.go:800 +0x2b
main.main()
        /home/tschottdorf/go/src/github.com/petermattis/pebble/cmd/pebble/main.go:49 +0x34a

I had previously run the same command for around an hour on the same directory. The error first occurred when I terminated the program and ran <same> | log by accident, but now the error also appears right away using <same>

iterators deviating from RocksDB merge semantics in reverse iteration

https://github.com/petermattis/pebble/blob/31f4d2c3b39e95b7027b2c72bb2701995ae8bae2/iterator.go#L151-L169

This looks like we’re merging values in old-to-new order when iterating backwards, and in new-to-old order when going forward. I think the correct (“RocksDB”) behavior is new-to-old, which is why on reverse iteration all the merge operands are queued up until the newest one has been reached:

https://github.com/cockroachdb/rocksdb/blob/c905cbca7b5ac1ef59046cfa58e146458f499eec/db/db_iter.cc#L947-L948

Since floating point operations don't commute, this would lead to different values observed depending on the direction of iteration.

Maybe that's OK, but it seems very awkward. OTOH, buffering all of the internal versions of a key for a merge is, too.

Btw, not the right venue, but isn't this a bug when there are merges? I think each merge entry increments that int, so eventually we'll seek and just miss versions? Hope I'm wrong.

https://github.com/cockroachdb/rocksdb/blob/c905cbca7b5ac1ef59046cfa58e146458f499eec/db/db_iter.cc#L910-L916

DeleteRange truncation issues

There are a couple issues with range tombstone truncation exposed in these test cases: https://gist.github.com/ajkr/9619663ad7dc5586ac79b29bbc06921f

I believe the current approach (implicit truncation) works nicely for point lookups and iterator next/prev. But the above test cases shows it doesn't handle all cases of iterator seek or writing compaction outputs.

For iterator seek, it should be enough to limit the seek over shadowed data to the sstable boundaries.

For writing compaction outputs it is a bit more difficult. We need to truncate tombstones before writing them but can't simply use the input sstables' boundaries due to cases like facebook/rocksdb#4432 (comment). In RocksDB we solved it by truncating to the compaction unit boundary. I believe the analogous thing should work here, even though compaction units are computed differently.

db: add support for the OPTIONS file

The RocksDB OPTIONS file provides an easy facility for determining the options that are set for a database. Supporting all of the RocksDB options is a non-starter, but providing a similar facility is useful for debugging/introspection purposes.

perf: user-level IO scheduler

ScyllaDB claims their user-level IO scheduler provides better latencies under any circumstance.

sstable: support two level indexes

Pebble currently only supports a single level of index in sstables. This is the original layout of sstables. RocksDB added support for two level sstable indexes, where the root index block points to a second level of index blocks which in turn point to data blocks. This additional level of indexing can reduce the size of the root index block for sstables which have small data blocks and/or large keys.

db: support arbitrary size write batches

LevelDB and RocksDB both support write batches that are larger than the configured memtable size because the memtable size is a soft-limit. The Pebble memtable size is a hard limit. Supporting write batches larger than the memtable size could be accomplished by allocating a custom-sized memtable when a write batch is larger than the memtable.

perf: pacing user writes, flushes and compactions

Add a mechanism to pace (rate limit) user writes, flushes and compactions. Some level of rate limiting of user writes is necessary to prevent user write from blowing up memory (if flushes can't keep up) or creating too many L0 tables (if compaction can't keep up). The existing control mechanisms mirror what are present in RocksDB: Options.MemTableStopWritesThreshold, Options.L0CompactionThreshold and Options.L0SlowdownWritesThreshold. These control mechanisms are blunt resulting in undesirable hiccups in write performance. The controller object was an initial attempt at providing smoother write throughput and it achieved some success, but is too fragile.

The problem here is akin to the problem with pacing a concurrent garbage collector. The Go GC pacer design should be inspiration. We want to balance the rate of dirty data arriving in memtables with the rate of flushes and compactions. Flushes and compactions should also be throttled to run at the rate necessary to keep up with incoming user writes and no faster so as to leave CPU available for user reads and other operations. A challenge here is to adjust quickly to changing user load.

cache: reserve memtable memory from block cache

Provide better control over memory usage by reserving memory in the block cache for each memtable. The main work item here is to provide an API on cache for creating and releasing memory reservations in the cache.

perf: fractional cascading on seeks

RocksDB performs fractional cascading on lookups in some restricted circumstances. The reason behind not supporting it everywhere is complexity. We should investigate whether this optimization can be applied to mergingIter to reduce the seek time within a level.

cache: evict sstables from cache on deletion

Evict all cached blocks for an sstable from the cache when the sstable is deleted. This prevents growth of the cache up to its max size when the database is small but there is a steady small amount of background churn causing sstables to be compacted.

Have you considered WiscKey ?

Have you considered separates keys from values ie. WiscKey just like Badger and TiDB?

perf: read-based compaction heuristic

Re-introduce a read-based compaction heuristic. LevelDB had a heuristic to trigger a compaction of a table if that table was being read frequently and there were tables from multiple levels involved in the read. This essentially reduced read-amplification in read heavy workloads. A read-based compaction is only selected if a size-based compaction is not needed.

In addition to reducing read-amplification, a read compaction would also squash deletion tombstones. So we'd get a scan performance benefit in not needing to skip the deletion tombstones.

compaction: flesh out heuristics

The existing compaction heuristics are simplistic and incomplete: looping over every level that needs compaction and compacting the first sstable in the level. We need to pull in the more advanced heuristics used in RocksDB. In particular, the compensated-file-size heuristic takes deletions into account when determining which sstables to compact. And the dynamic-level-bytes heuristic dynamically adjusts the target size for levels which improves performance when some of the LSM levels are empty.

sstable: prefix bloom filter

Add support for constructing the bloom filter on a prefix of the user key. Add a Comparator.PrefixExtractor function for extracting the prefix.

One question here is whether this should be considered a prefix extractor or restricted to user keys which have a version. An alternate api would be to provide Comparator.Split to split a user key into a prefix and a version suffix. This might be useful in the future for segregating historic versions of user keys into a separate storage area.

perf: faster table/block cache

The existing block cache implementation uses the Clock-PRO replacement policy. The block cache is important for performance. The raw speed of accessing the block cache is an important component of read performance. Concurrency of the block cache directly affects concurrent read performance. There are three areas worth exploring here:

Should the block cache be placed outside of the visibility of the Go GC. This could be accomplished by implementing our own hash table on top of arena allocated memory.
Improve the concurrency of the block cache, either via sharding or lock-free approaches.
Use a better cache replacement policy such as TinyLFU. See also Caffeine, a high-performance Java caching library.

perf: faster ingestion

DB.Ingest currently experiences a hiccup if the table being ingested overlaps with a memtable: ingestion needs to wait for the memtable to be flushed. This is necessary because the ingested sstable is given a sequence number newer than entries in the memtable. We cannot add the ingested table to the LSM until overlapping entries with older sequence numbers are written to L0.

One way to avoid the hiccup is to have ingestion lazily add the table to the LSM. If the ingested table overlaps with a memtable, an entry is added to the WAL (ensuring that the action will be completed in the face of a crash) and the ingested table is appended to the list of memtables. We'll have to add a small wrapper around the sstable so that it implements the flushable interface and add some logic so that the table isn't actually flushed, but it would then simply be added to the L0 metadata when it is time to flush (i.e. a new table would not be created).

perf: add support for storing the WAL in a separate directory

RocksDB supports storing the WAL in a separate directory. That directory can be on a different disk than sstables, and that disk might be a different media that is significantly faster. For example, the WAL could be stored on SSD or Optane, while sstables are stored in spinning disk. This feature is relatively straightforward to implement, mainly just plumbing a configuration option.

perf: investigate concurrent flushes

Pebble only allows a single flush of a memtable to L0 to occur at any time. RocksDB allows multiple concurrent flushes, though note that the resulting sstables need to be installed "in order". The benefit of allowing multiple concurrent flushes is a bit unclear, as on the surface allowing a concurrent flush would seem to simply divide write bandwidth across the concurrent flushes.

Cc @ajkr

cockroachdb / pebble Goto Github PK

pebble's Introduction

Pebble Coverage

Production Ready

Advantages

RocksDB Compatibility

Format major versions

Pedigree

Getting Started

Example Code

pebble's People

Contributors

Stargazers

Watchers

Forkers

pebble's Issues

Recommend Projects

Recommend Topics

Recommend Org

Pebble ^_{_Coverage}