engula / engula

Engula is a distributed key-value store, used as a cache, database, and storage engine.

Home Page: https://engula.io

License: Apache License 2.0

Rust 97.79% Shell 0.45% Python 1.72% Makefile 0.05%

storage-engine database rust cache

engula's Introduction

Engula

Engula is a distributed key-value store, used as a cache, database, and storage engine.

Architecture

See design doc for more details.

Quick start

Build

make build

Deploy a cluster

bash scripts/bootstrap.sh setup

Verify

cargo run -- shell

Run and enjoy it.

Contributing

Thanks for your help in improving the project! We have a contributing guide to help you get involved in the Engula project.

More information

For informal discussions, please go to the forum.

engula's Issues

Write down the first version CODE_OF_CONDUCT.md

@huachaohuang thanks for your reply.

Sure. Please help to add the CODE_OF_CONDUCT.

I've read the CoC of Rust Community and really like its section "Moderation". I'll create a dedicated issue for continuing the work of CoC - it deserves a dedicated thread.

Originally posted by @tisonkun in #19 (comment)

Provide a kernel abstraction for engines

A kernel is responsible for the persistent part of an engine. It provides the ability to store events, objects, metadata, etc.

Reduce CI time cost

Engula uses some Rust tools, such as cargo-udeps and cargo-audit. Maybe we should reduce the CI time cost to improve the contributor experience.

Most importantly, maybe we should download prebuilt binaries rather than running cargo install for every job.

elaborate the contributing guide

As the project grows, I answer similar questions again and again. Our current CONTRIBUTING.md lacks information and is not tailored to our project. Thus, I propose to elaborate the contributing guide a bit to cover the following topics:

Get started
- Prepare the development environment
- Run examples
Communications (why communications are important, how we organize discussions)
- Discussion forum
- Zulip chat room
Contributions
- Principles (reference)
- Report issues
- Review patches (it should be intuitive but many (potential) contributors ask me about this topic)
- Contribute code (it should be intuitive too, as long as our CI tasks are satisfied)
- Licenses (many contributors are confused about where a license header should be placed)

I'm going to prepare a PR this week to catch up with the exposure of the v0.2 release. The points above are my initial thoughts. If things get complex (I hope not yet), the guide can be factored out to the docs/ folder.

Reference:

cc @huachaohuang

cc @Xuanwo @zojw @DCjanus is there any other information you think should be included?

Roadmap v0.2 - Engine

Provide a simple hash engine that supports get/set/delete key-value pairs.
The engine should be able to use different kinds of kernels.

#168

Support `delete` operation

#143 only supports get/set, we should support delete too. This is not as easy as it might sound. We need to add tombstones for deleted records.
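Why tombstones are needed can be sketched as follows. This is a minimal illustration, not the engine's actual design: `HashTable`, `Record`, `recent`, and `base` are hypothetical names, and the two-level layout is an assumption made only to show how a delete has to shadow an older copy of the same key.

```rust
use std::collections::HashMap;

/// A record is either a live value or a tombstone marking a deletion.
#[derive(Clone, Debug, PartialEq)]
enum Record {
    Value(Vec<u8>),
    Tombstone,
}

/// Hypothetical two-level table: `recent` holds new writes and shadows `base`.
#[derive(Default)]
struct HashTable {
    recent: HashMap<Vec<u8>, Record>,
    base: HashMap<Vec<u8>, Vec<u8>>,
}

impl HashTable {
    fn set(&mut self, key: &[u8], value: &[u8]) {
        self.recent.insert(key.to_vec(), Record::Value(value.to_vec()));
    }

    /// Writes a tombstone instead of removing the key, so that an older
    /// copy of the key in `base` stays hidden from readers.
    fn delete(&mut self, key: &[u8]) {
        self.recent.insert(key.to_vec(), Record::Tombstone);
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        match self.recent.get(key) {
            Some(Record::Value(v)) => Some(v.clone()),
            Some(Record::Tombstone) => None,
            None => self.base.get(key).cloned(),
        }
    }

    /// Merges `recent` into `base`. A tombstone can be dropped once it has
    /// been applied to every older copy of the key.
    fn flush(&mut self) {
        for (key, record) in self.recent.drain() {
            match record {
                Record::Value(v) => {
                    self.base.insert(key, v);
                }
                Record::Tombstone => {
                    self.base.remove(&key);
                }
            }
        }
    }
}
```

Simply removing the key from `recent` would let a later `get` fall through to the stale value in `base`, which is the case the tombstone prevents.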

Roadmap v0.2 - Microunit

Non-goals:

Data persistence
Leader election and failover
Resource isolation and management

Tasks:

A file kernel that stores everything in local files

A file kernel integrates a file journal, storage, manifest, etc.

Can we use learned indexes to build a new file format?

There are a few papers about applying machine learning to construct indexes on sorted records.
These learned indexes can reduce memory usage significantly.
Maybe we can explore the possibilities to build a new file format with these learned indexes.

Specifically, the PGM index looks very interesting.
However, it seems only applicable to integers instead of arbitrary byte arrays.
Maybe we can apply some order-preserving minimal perfect hash functions here.

Just some immature thoughts here; discussion is welcome.
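To make the idea concrete, here is a rough sketch of a learned index over sorted integer keys. It is not the PGM index itself: it fits a single linear segment and records the worst prediction error, whereas PGM builds many segments under a fixed error bound. `LinearIndex` and its methods are hypothetical names.

```rust
/// Hypothetical single-segment learned index over sorted `u64` keys.
/// It models position ~ slope * (key - first) and remembers the worst
/// prediction error, so a lookup only searches a small window.
pub struct LinearIndex {
    keys: Vec<u64>,
    first: u64,
    slope: f64,
    max_error: usize,
}

impl LinearIndex {
    /// `keys` must be sorted and deduplicated.
    pub fn build(keys: Vec<u64>) -> Self {
        let first = keys.first().copied().unwrap_or(0);
        let last = keys.last().copied().unwrap_or(0);
        let slope = if last > first {
            (keys.len() - 1) as f64 / (last - first) as f64
        } else {
            0.0
        };
        let mut index = Self { keys, first, slope, max_error: 0 };
        // Measure how far the model is off for every stored key.
        let max_error = index
            .keys
            .iter()
            .enumerate()
            .map(|(pos, &key)| index.predict(key).abs_diff(pos))
            .max()
            .unwrap_or(0);
        index.max_error = max_error;
        index
    }

    /// Predicts the position of `key`, clamped to the valid range.
    fn predict(&self, key: u64) -> usize {
        let offset = key.saturating_sub(self.first) as f64;
        let pos = (offset * self.slope).round() as usize;
        pos.min(self.keys.len().saturating_sub(1))
    }

    /// Returns the position of `key`, searching only within `max_error`
    /// of the predicted position.
    pub fn get(&self, key: u64) -> Option<usize> {
        if self.keys.is_empty() {
            return None;
        }
        let guess = self.predict(key);
        let lo = guess.saturating_sub(self.max_error);
        let hi = (guess + self.max_error + 1).min(self.keys.len());
        self.keys[lo..hi].binary_search(&key).ok().map(|i| lo + i)
    }
}
```

The only state besides the keys is three numbers per segment, which is where the memory saving comes from. The sketch also shows the limitation raised above: the model needs keys that can be subtracted and scaled, which arbitrary byte arrays do not offer directly.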

Reference:

Roadmap v0.2 - Background

Tasks:

Consider following the Cargo.toml conventions

There is a Cargo.toml style guide here. We can consider following it.

Example does not have output

Although it does do some work and writes under /tmp/engula_test/hello, a user-friendly message would improve the experience.

For example, tell users that they can find the output under that directory, or just list the files. It seems that when the example fails, the framework itself can give an error message.

Adopt new chatroom Zulip

Discussed in #141

Originally posted by tisonkun November 30, 2021
TL;DR: Join Engula Zulip by https://engula.zulipchat.com.

In #33 we migrated the auxiliary online instant messaging tool from Gitter to Discord. However, from our experience these days and direct discussions with several contributors, we found several shortcomings that push us to find a replacement:

Discord cannot be accessed from some places. Connections to Discord are unstable or blocked in some regions, which likely prevents contributors in those regions from participating.
Discord's app isn't available on some platforms in some regions either. In addition, Discord doesn't provide a web version on mobile. This makes it significantly inconvenient to join discussions from anywhere.
One of the top reasons we chose Discord is that it provides a voice channel out of the box. However, it turns out to be unstable.
Another reason we chose Discord is that the Rust community uses it. But they migrated to Zulip years ago.

So I investigated Zulip, and here is my report on why we can consider Zulip as a replacement:

Zulip can be accessed where Discord cannot, as described above. Personally, I can connect to the chatroom from anywhere.
Zulip provides meeting support based on Zoom.
Zulip supports Markdown syntax to render messages, which is much more concise than posting links on Discord.
As mentioned above, many Rust teams migrated to Zulip, and there are endorsements from developers. Meanwhile, Discord is designed for gamers to chat online.
As a personal preference: because it is designed for gamers, Discord has a rich-text style, while Zulip is almost plain-text. I like the latter.

This is not an urgent issue, but I hope someone can help by trying out Zulip. You can join the organization on Zulip via this invite link ( https://engula.zulipchat.com/join/eicksrlhhokl4274ouijr7s6/ ). We still host all truth on GitHub and recommend Discord as the auxiliary online chatroom in the README, until it turns out that Zulip is a better choice.

cc @huachaohuang @levy5307

Install PrometheusBuilder causes panic?

engula/engula/bin/engula.rs

Line 49 in fdfae5d

// This panic in some cases, haven't figure it out.

@huachaohuang I see this comment but don't know what "in some cases" means. Could you please use a TODO comment and also comment under this issue with the error stack?

It seems the doc of install already states that:

An error will be returned if there's an issue with creating the HTTP server or with installing the recorder as the global recorder.

Merge Bucket into Storage

We have three traits for the storage component now: Storage, Bucket, Object. This brings two problems:

The trait bound is a bit complicated, like Storage<Object, Bucket>
There are some corner cases dealing with individual buckets, for example, #86 (comment)

A simpler solution is to merge Bucket into Storage, so Storage will provide the following interfaces:

list buckets
create bucket
delete bucket
list objects in a bucket
upload an object to a bucket
delete an object from a bucket

We can still keep Object and ObjectUploader as usual. Note that only Object is performance-critical here, as users may read from an object frequently. Other storage operations don't need to care about performance very much.
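The merged interface listed above could look roughly like the sketch below. It is an illustration under assumptions, not the project's API: the method names, `StorageError`, and `MemStorage` are hypothetical, and the methods are synchronous for brevity, whereas the real traits would presumably be async and return Object/ObjectUploader handles.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum StorageError {
    BucketNotFound,
    BucketExists,
}

/// Hypothetical merged trait: bucket operations live on `Storage` itself,
/// so no separate `Bucket` trait or type parameter is needed.
trait Storage {
    fn list_buckets(&self) -> Vec<String>;
    fn create_bucket(&mut self, bucket: &str) -> Result<(), StorageError>;
    fn delete_bucket(&mut self, bucket: &str) -> Result<(), StorageError>;
    fn list_objects(&self, bucket: &str) -> Result<Vec<String>, StorageError>;
    fn upload_object(&mut self, bucket: &str, object: &str, data: Vec<u8>)
        -> Result<(), StorageError>;
    fn delete_object(&mut self, bucket: &str, object: &str) -> Result<(), StorageError>;
}

/// In-memory implementation used only to exercise the trait.
#[derive(Default)]
struct MemStorage {
    buckets: BTreeMap<String, BTreeMap<String, Vec<u8>>>,
}

impl Storage for MemStorage {
    fn list_buckets(&self) -> Vec<String> {
        self.buckets.keys().cloned().collect()
    }

    fn create_bucket(&mut self, bucket: &str) -> Result<(), StorageError> {
        if self.buckets.contains_key(bucket) {
            return Err(StorageError::BucketExists);
        }
        self.buckets.insert(bucket.to_owned(), BTreeMap::new());
        Ok(())
    }

    fn delete_bucket(&mut self, bucket: &str) -> Result<(), StorageError> {
        self.buckets
            .remove(bucket)
            .map(|_| ())
            .ok_or(StorageError::BucketNotFound)
    }

    fn list_objects(&self, bucket: &str) -> Result<Vec<String>, StorageError> {
        let objects = self.buckets.get(bucket).ok_or(StorageError::BucketNotFound)?;
        Ok(objects.keys().cloned().collect())
    }

    fn upload_object(&mut self, bucket: &str, object: &str, data: Vec<u8>)
        -> Result<(), StorageError> {
        let objects = self.buckets.get_mut(bucket).ok_or(StorageError::BucketNotFound)?;
        objects.insert(object.to_owned(), data);
        Ok(())
    }

    fn delete_object(&mut self, bucket: &str, object: &str) -> Result<(), StorageError> {
        let objects = self.buckets.get_mut(bucket).ok_or(StorageError::BucketNotFound)?;
        objects.remove(object);
        Ok(())
    }
}
```

With the bucket name passed as a plain argument, the trait bound shrinks from something like Storage<Object, Bucket> to a single Storage.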

Remove branch filter on workflow trigger strategy

@huachaohuang @ZhongliGao I don't know why you changed the branch name, but if that's the case and we only have branches of interest on the upstream repository (i.e., no personal branches), we can trigger the workflow on every "push" and "pull request" event.

ci - cargo clippy should be run with --tests

Otherwise tests aren't covered.

Investigate how to test with s3

ref #99 (comment)

We need some way to automatically test against AWS services such as S3 in CI without leaking the access and secret keys. IMHO, we can approach it from two directions:

Unit test: mock the AWS SDK. An existing issue indicates that the SDK supports TestConnection for mocking/recording request-response pairs. We made a few attempts, but it seems more encapsulation work is needed to make it easy to use.
Integration test: deploy a temporary S3-like service (for example, https://min.io/) in the CI pipeline and run the test cases against it.

Advice and code contributions are welcome 😄

Roadmap v0.2 - Journal

Tasks:

journal - make timestamp generic

What do you mean by "make it generic"? Does it mean:

pub struct Event<TS> {
    pub ts: TS,
    pub data: Vec<u8>,
}

or Timestamp<T>? Is there similar stuff in other projects?

cc @huachaohuang

Originally posted by @tisonkun in #82 (comment)

The Event<TS> one, but it should be more complicated than that. The rationale is to let users customize the type of the timestamp. Depending on how users implement their databases, the usage of the timestamp can be very different. For example, it can be a simple increasing counter, a logical clock, or a hybrid logical clock, which may need different representations.

Originally posted by @huachaohuang in #82 (comment)
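One way the generic timestamp could be expressed is sketched below. Only `Event<TS>` comes from the discussion above; the `Timestamp` bound, the `Hlc` type, and `MemJournal` are hypothetical and exist only to show that a plain counter and a hybrid logical clock can share one journal implementation.

```rust
/// Journal event generic over the timestamp type, as in the `Event<TS>`
/// shape quoted above.
#[derive(Clone, Debug, PartialEq)]
pub struct Event<TS> {
    pub ts: TS,
    pub data: Vec<u8>,
}

/// Hypothetical bound: any ordered, copyable type can act as a timestamp.
pub trait Timestamp: Ord + Copy + std::fmt::Debug {}
impl<T: Ord + Copy + std::fmt::Debug> Timestamp for T {}

/// Hypothetical hybrid logical clock timestamp. The derived ordering
/// compares the physical part first, then the logical counter.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Hlc {
    pub physical_ms: u64,
    pub logical: u32,
}

/// Minimal in-memory journal that rejects out-of-order appends.
pub struct MemJournal<TS: Timestamp> {
    events: Vec<Event<TS>>,
}

impl<TS: Timestamp> MemJournal<TS> {
    pub fn new() -> Self {
        Self { events: Vec::new() }
    }

    /// Appends an event whose timestamp must be greater than the last one.
    pub fn append(&mut self, ts: TS, data: Vec<u8>) -> Result<(), String> {
        if let Some(last) = self.events.last() {
            if ts <= last.ts {
                return Err(format!("timestamp {:?} is not greater than {:?}", ts, last.ts));
            }
        }
        self.events.push(Event { ts, data });
        Ok(())
    }

    /// Iterates over events with a timestamp greater than or equal to `ts`.
    pub fn read_from(&self, ts: TS) -> impl Iterator<Item = &Event<TS>> {
        self.events.iter().filter(move |event| event.ts >= ts)
    }
}
```

`MemJournal<u64>` covers the simple increasing counter, while `MemJournal<Hlc>` covers the hybrid case without any change to the journal code.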

Implement a BloomFilter

A Filter trait has been added without an implementation.
We can add a BloomFilter like LevelDB.

Reference: https://github.com/google/leveldb/blob/master/util/bloom.cc
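A possible starting point is sketched below. It follows the double-hashing idea used by LevelDB's bloom filter but is not a port of it: the hash function is swapped for the standard library's `DefaultHasher`, the type is standalone rather than an implementation of the Filter trait, and `BloomFilter`, `insert`, and `may_contain` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical bloom filter backed by a bit array.
pub struct BloomFilter {
    bits: Vec<u8>,
    num_bits: u64,
    num_probes: u32,
}

impl BloomFilter {
    /// Sizes the filter for `num_keys` keys at `bits_per_key` bits each.
    pub fn new(num_keys: usize, bits_per_key: usize) -> Self {
        // About bits_per_key * ln(2) probes minimizes the false positive rate.
        let num_probes = ((bits_per_key as f64 * 0.69) as u32).clamp(1, 30);
        let wanted_bits = (num_keys * bits_per_key).max(64);
        let num_bytes = (wanted_bits + 7) / 8;
        Self {
            bits: vec![0; num_bytes],
            num_bits: num_bytes as u64 * 8,
            num_probes,
        }
    }

    fn hash(key: &[u8]) -> u64 {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        hasher.finish()
    }

    /// Yields the bit positions for `key` via double hashing:
    /// h, h + delta, h + 2 * delta, ...
    fn positions(&self, key: &[u8]) -> impl Iterator<Item = u64> + '_ {
        let h = Self::hash(key);
        let delta = h.rotate_left(17) | 1;
        (0..self.num_probes as u64)
            .map(move |i| h.wrapping_add(i.wrapping_mul(delta)) % self.num_bits)
    }

    pub fn insert(&mut self, key: &[u8]) {
        let positions: Vec<u64> = self.positions(key).collect();
        for pos in positions {
            self.bits[(pos / 8) as usize] |= 1 << (pos % 8);
        }
    }

    /// Returns false only if `key` was definitely never inserted.
    pub fn may_contain(&self, key: &[u8]) -> bool {
        self.positions(key)
            .all(|pos| (self.bits[(pos / 8) as usize] & (1 << (pos % 8))) != 0)
    }
}
```

A filter persisted in a file format would need a hash function with a stable, documented output, which `DefaultHasher` does not promise across Rust releases.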

Add a simple command to do local checks before sending PRs

We need to run several commands to check tests, style, license, unused dependencies before sending a PR, which is tedious. Maybe we can add a simple way to check everything at once locally.

Implement an LRU cache

A Cache trait has been added with no concrete implementation yet.
We can add a simple LRU cache like RocksDB.
For example, an LRUCache can be partitioned into multiple LRUShards, each of which can be a simple HashMap protected by a RwLock.

Reference: https://github.com/facebook/rocksdb/blob/main/cache/lru_cache.h
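The sharding idea can be sketched as follows. This is a simplified illustration, not RocksDB's design and not an implementation of the Cache trait: `LruCache` and `LruShard` are hypothetical, eviction scans for the oldest entry instead of maintaining a linked list, and a `Mutex` replaces the suggested RwLock because a lookup has to update recency.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// One shard: a map from key to (value, last-used tick). Eviction scans
/// for the smallest tick, which is O(n) and only acceptable for a sketch.
struct LruShard<K, V> {
    map: HashMap<K, (V, u64)>,
    capacity: usize,
    tick: u64,
}

impl<K: Hash + Eq + Clone, V: Clone> LruShard<K, V> {
    fn get(&mut self, key: &K) -> Option<V> {
        self.tick += 1;
        let tick = self.tick;
        self.map.get_mut(key).map(|entry| {
            entry.1 = tick; // mark as most recently used
            entry.0.clone()
        })
    }

    fn insert(&mut self, key: K, value: V) {
        self.tick += 1;
        if !self.map.contains_key(&key) && self.map.len() >= self.capacity {
            let oldest = self
                .map
                .iter()
                .min_by_key(|(_, (_, tick))| *tick)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.map.remove(&k);
            }
        }
        self.map.insert(key, (value, self.tick));
    }
}

/// Hypothetical sharded cache: each key hashes to one shard, so threads
/// working on different shards do not contend on the same lock.
pub struct LruCache<K, V> {
    shards: Vec<Mutex<LruShard<K, V>>>,
}

impl<K: Hash + Eq + Clone, V: Clone> LruCache<K, V> {
    pub fn new(num_shards: usize, capacity_per_shard: usize) -> Self {
        let shards = (0..num_shards.max(1))
            .map(|_| {
                Mutex::new(LruShard {
                    map: HashMap::new(),
                    capacity: capacity_per_shard.max(1),
                    tick: 0,
                })
            })
            .collect();
        Self { shards }
    }

    fn shard(&self, key: &K) -> &Mutex<LruShard<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[(hasher.finish() % self.shards.len() as u64) as usize]
    }

    pub fn get(&self, key: &K) -> Option<V> {
        self.shard(key).lock().unwrap().get(key)
    }

    pub fn insert(&self, key: K, value: V) {
        self.shard(&key).lock().unwrap().insert(key, value);
    }
}
```

Because capacity is enforced per shard, the cache as a whole evicts in approximate rather than strict LRU order.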

Apply license header to engula work

First of all, I must clarify that this project still seems to be in an early stage, so this issue is a suggestion to consider as the project grows.

According to APL 2.0, it is common practice to apply the license explicitly by:

To apply the Apache License to specific files in your work, attach the following boilerplate declaration, replacing the fields enclosed by brackets "[]" with your own identifying information. (Don't include the brackets!) Enclose the text in the appropriate comment syntax for the file format. We also recommend that you include a file or class name and description of purpose on the same "printed page" as the copyright notice for easier identification within third-party archives.
Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

If we decide to stick to APL 2.0, we'll face this proposal sooner or later. Contributors will probably complain that a template is too heavy to carry while we're still in an early stage, when we don't care how others make use of the demo. But once you think your work has taken shape, it is a signal that you should consider this proposal.

Also, thinking about the name of "copyright owner" is essential for the project.

Last, without audit tools, license headers can easily be missing or contain typos. skywalking-eyes would be a great language-agnostic solution that integrates with GitHub Actions.

Find a practice to return asynchronous paged results

We have some APIs that need to return a list of values. For example, get a list of object names from a bucket. The simplest way is to return a Vec<T>. But if the list goes too long, it is not a good idea to return it all at once. A common solution for this kind of problem is paging. Rust provides a concept called Stream, which is like an asynchronous iterator that allows the caller to get one value at a time.

I think the most mature way to do that is to use Stream from the futures crate. But it is not a good idea for our public APIs to depend on a third-party trait, which is not stable either. Fortunately, Rust is working on a built-in Stream trait (nightly), so we may consider using that in our public APIs when it is stabilized.
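Until a built-in Stream is available, explicit paging with a continuation token is one option that needs no third-party trait. The sketch below is only an illustration: `Bucket`, `Page`, and `list` are hypothetical names, and the method is synchronous for brevity, although the same shape works as an async function.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// One page of results plus a token for fetching the next page.
#[derive(Debug, PartialEq)]
pub struct Page {
    pub names: Vec<String>,
    /// `None` means there are no more results.
    pub next_token: Option<String>,
}

/// Hypothetical bucket that lists object names in sorted order.
#[derive(Default)]
pub struct Bucket {
    objects: BTreeSet<String>,
}

impl Bucket {
    pub fn insert(&mut self, name: &str) {
        self.objects.insert(name.to_owned());
    }

    /// Returns up to `limit` names strictly after `token`, where `token`
    /// is the last name of the previous page (or `None` for the first page).
    pub fn list(&self, token: Option<&str>, limit: usize) -> Page {
        let lower: Bound<&str> = match token {
            Some(t) => Bound::Excluded(t),
            None => Bound::Unbounded,
        };
        let upper: Bound<&str> = Bound::Unbounded;
        let mut iter = self.objects.range::<str, _>((lower, upper));
        // Always return at least one name so that callers make progress.
        let names: Vec<String> = iter.by_ref().take(limit.max(1)).cloned().collect();
        let next_token = if iter.next().is_some() {
            names.last().cloned()
        } else {
            None
        };
        Page { names, next_token }
    }
}
```

A caller loops by passing `next_token` back in until it comes back as `None`. Since the token is just the last returned name, the server keeps no per-caller state between pages.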

Settle down the project layout

Rust crates:

engula: the integral library and binary
- engula-engine: contains different kinds of storage engines
- engula-kernel: contains the kernel abstraction and some built-in implementations
- engula-journal: contains the journal abstraction and some built-in implementations
- engula-storage: contains the storage abstraction and some built-in implementations

TODO:

Manifest abstraction to store kernel metadata

Kernel needs a metadata store for its metadata persistence. We have two choices:

Add a general metadata abstraction at the same level as Journal and Storage
Add a sub-module inside Kernel that only handles kernel-specific metadata

For v0.2, we can choose the second solution for simplicity. We can still move to the first solution in the future if the second one proves to be inadequate.

Roadmap v0.2 - Storage

Tasks:

Most tasks have been done, but we still need to refactor some implementations before releasing v0.2.

Setup workflow to check unnecessary dependencies

ref #99 (comment)

Maybe we can add https://github.com/est31/cargo-udeps to the GitHub workflow, just like cargo audit.

Before we have a clear branch strategy, simplify GitHub workflow

engula/.github/workflows/checkcode.yml

Lines 4 to 10 in b27a515

  pull_request:
  push:
    branches:
      - '*'
      - '!staging.tmp'
    tags:
      - '*'

This is confusing and too complex for now. We can use main only and see where we go later.

Consider separate executable and library

In #78 we simplify the project layout into:

├── Cargo.lock
├── Cargo.toml
├── src
│   ├── api
│   ├── background
│   ├── hello_unit.rs
│   ├── main.rs
│   ├── manifest
│   ├── microunit
│   ├── node.rs
│   └── storage

However, the root Cargo config file defines both the executable and the library, while a consumer of the library doesn't want the executable part. We may consider releasing the executable and the library separately. @zojw suggested learning from how tokio does releases.

cc @huachaohuang @PsiACE

Find a tool and add a GitHub Action to check TOML files

We have two TOML styles now. For example, this file uses the style:

tokio = {version = "1.13", features = ["full"]}

But this file uses the style:

tokio = { version = "1.13", features = ["full"] }

Note the whitespace between {}. It would be nice to keep toml files consistent for all contributors.

Improve Cargo.toml descriptions

We should complete some common fields like license, homepage, description before we publish crates to crates.io.

Find a best practice to define async traits

While async-trait works fine, it results in one heap allocation per function call. So for performance-critical scenarios, we still need a better way to define async traits.
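The allocation is easier to see when the trait is written by hand in roughly the shape that async-trait expands to. The sketch below is an approximation for illustration: `Storage`, `MemStorage`, and the tiny `block_on` executor are hypothetical and use only the standard library.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Roughly what an async trait method returns after expansion:
/// a boxed, type-erased future.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

trait Storage {
    fn get<'a>(&'a self, key: &'a str) -> BoxFuture<'a, Option<Vec<u8>>>;
}

struct MemStorage;

impl Storage for MemStorage {
    fn get<'a>(&'a self, key: &'a str) -> BoxFuture<'a, Option<Vec<u8>>> {
        // `Box::pin` is the heap allocation paid on every call.
        Box::pin(async move {
            if key == "hello" {
                Some(b"world".to_vec())
            } else {
                None
            }
        })
    }
}

/// Waker that does nothing; enough for futures that never wait.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for this sketch only: polls in a loop until ready.
fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = Box::pin(future);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
    }
}
```

For example, `block_on(MemStorage.get("hello"))` drives the boxed future to completion. Avoiding the box means giving each implementation its own concrete future type, for example through an associated type on the trait, at the cost of a more verbose definition.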

A grpc kernel that stores everything in a remote grpc kernel service

A grpc kernel integrates a grpc journal, storage, etc. It consists of a client and a server part. We need to provide a binary to start the kernel server.

Roadmap for demo 1

The project has just started and is still in the demo stage.
The primary goal in this stage is to explore the possibility of our designs.

We plan to release the first demo at the end of Sep 2021.
We are going to achieve the following objectives:

A working demo with simple read/write APIs on AWS
A demo report about what we did and the lessons learned

Tasks:

However, we are short of hands.
If you think we are behind schedule, please push us with a 👍

Update: this demo has ended, please check the report for more details.

Roadmap 0.2

Overview

Goals:

Set up basic project management
Present the fundamental ideas and usages of Engula

Non-goals:

Reliability
Performance

Project

#59
#144

Modules

Documents

ci - cargo test should be run with --workspace

Otherwise inner crates aren't covered.

Write down the first version CONTRIBUTING.md

In order not to disturb your prototyping and rapid development, I'm glad to write this file and propose a PR, with your help in collecting the necessary information.

The draft looks like this:

How to Contribute

I'm really glad you're reading this, because we need volunteer developers to help this project come to fruition.

If you haven't already, come find us on Gitter. We want you working on things you're excited about.

You are welcome to review our design or participate in discussions about the roadmap!

Get Started

We develop Engula with the stable Rust toolchain.

You can get started with Engula in three steps:

Set up the environment with rustup.
Build Engula via cargo build.
Run the example via cargo run --example hello.

Report an Issue

If you think you have found an issue in Engula, you can report it to the issue tracker.

Before filing an issue report, check whether the problem has already been reported. You can use the search bar to search existing issues. This doesn't always work, and sometimes it's hard to know what to search for, so consider this extra credit. We won't mind if you accidentally file a duplicate report. Don't blame yourself if your issue is closed as a duplicate.

If the problem you're reporting is not already in the issue tracker, you can open a GitHub issue with your GitHub account.

Submitting a Pull Request

Please send a GitHub Pull Request to Engula with a clear list of what you've done (read more about pull requests). When you send a pull request, we look forward to an expressive description, clear commit messages, and more test coverage if it is a code contribution.

Before submitting the pull request, please make sure all tests pass locally:

cargo build --release
cargo test
cargo clippy -- -D warnings
cargo fmt --all -- --check

Thank you for your participation!

The questions are:

Do we protect main and only modify it by PR?
What merge strategy do we use: specifically, merge with commit, rebase and merge, or squash and merge? I highly recommend the latter two, since merge with commit makes history hard to read. However, I'm not deeply involved in your project, so it's your choice.
Shall we adopt a code of conduct? If so, I'd suggest Contributor Covenant Code of Conduct and the project should provide a contact method.
Any other concern on the draft above?

A grpc journal implementation

We can wrap a gRPC Journal similar to the gRPC Storage. I think it is not very hard, so maybe someone can give it a try :)

Port mini-redis to store data in the hash engine

From Zulip:

Hmm, maybe we can let mini-redis run on our hash engine. I think the job should not be hard since mini-redis stores everything in a HashMap. It will be an interesting way to demonstrate the usage of our hash engine in v0.2.
We can consider working on that once we get #143 landed

This is not required for v0.2. But if anyone is interested in this job, we can add it to the list 🤓

Make a simple release process and release v0.2

This should be the last closed issue before releasing v0.2.

Add more crate-level documents

Removing the if: steps.cache.outputs.cache-hit != 'true' lines should work. @Xuanwo we may later figure out why the current settings create false positives.

Originally posted by @tisonkun in #185 (comment)

Roadmap v0.2 - Kernel

We will provide an abstraction and three implementations:
