varchar-io / nebula
A distributed block-based data storage and compute engine
Home Page: https://nebula.bz
License: Apache License 2.0
Right now, we're using UTC time throughout the whole stack, from Nebula Engine to Nebula UI.
However, it's not intuitive for the UI to show UTC instead of local time: if I'm in PST, I don't want to use my brain to translate timestamps back and forth, which is very inconvenient. This translation should be done in a single place serving the UI, and ideally the UI offers an option letting the user choose UTC or local time.
Work can be started from https://github.com/varchar-io/nebula/blob/master/src/service/http/nebula/_/time.js
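A minimal sketch of what a single translation point in time.js could look like (the function name and option flag are hypothetical, not existing code):

```javascript
// Sketch: one place in the UI translating engine (UTC) timestamps for display.
// `useLocalTime` stands in for the proposed user option (UTC vs. local).
function formatTime(utcSeconds, useLocalTime) {
  const d = new Date(utcSeconds * 1000); // engine timestamps assumed in seconds
  return useLocalTime
    ? d.toLocaleString() // browser renders in the user's local time zone
    : d.toISOString();   // keep the UTC rendering we have today
}
```

`toLocaleString()` picks up the browser's time zone automatically, so no per-zone logic is needed on our side.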
In databases, there are fixed-length string types like char(3) and variable-length string types like varchar(30).
In many scenarios, a fixed-length string type brings a big reduction in both storage and computation.
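To illustrate the computation side (my illustration, not Nebula's actual storage layout): a fixed-width column lets a reader compute a value's position directly, while a variable-width column needs an extra offsets index:

```javascript
// Fixed-width char(3)-style column: value i starts at i * width,
// so no per-row offset index is stored or consulted.
function fixedSlice(data, width, i) {
  return data.slice(i * width, (i + 1) * width);
}

// Variable-width varchar-style column: a separate offsets array is
// required to locate each value.
function varSlice(data, offsets, i) {
  return data.slice(offsets[i], offsets[i + 1]);
}
```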
Some query results aren't a good fit for a graph visual, such as an error message and its count; in this case, the user may want to hide the graph and display the result as a table only.
We should provide an option letting the user hide the graph, or a table-only display option.
To support embedding the Nebula UI into other surfaces (dashboards, embedded environments), introduce a state property indicating whether the current rendering is a visualization-only display.
Currently an Instant UDF (a JavaScript function/lambda) can be used to define a new column, so we can use it in the select clause.
However, we should support it in the where clause as well, so the pseudo code below could execute successfully.
const x = () => nebula.column("age") % 3;
nebula.apply("x", nebula.Type.INT, x);
nebula.select("type", count("id")).where(eq("x", 1));
Replace all occurrences of
Copyright 2017-present Shawn Cao
in file headers with
Copyright 2017-present varchar.io
Set up a GitHub Action to build Nebula nightly.
Right now, making the Nebula build work on a fresh (Ubuntu) machine is not EASY!
@samprasyork probably has strong feelings about it. :)
This issue tracks making the build experience easy for future developers on this project. GitHub Actions supports CMake-based projects; by enabling it, I guess we will fix most of the build issues.
The Nebula query engine should not care about what display type the front end uses.
The only thing it cares about is the query fields.
And the client should be able to handle whatever the final query result is.
A test case using constant value eval as an example:
{
  int8_t value = 90;
  auto c = nebula::surface::eval::constant(value);
  // note: evaluating with int32_t while the constant was created as int8_t
  auto ev = c->eval<int32_t>(ctx);
  LOG(INFO) << "Origin=" << value << ", Eval=" << ev.value();
}
This is debatable, since type enforcement is a good thing, but the issue is that if we get it wrong as in the example above, it fails at runtime rather than compile time. So I feel we should either support a compile-time check or add a compatible cast at runtime.
Currently we back up the internal metadata DB (LevelDB) when it is 'dirty' and the backup interval has been exceeded.
However, the process is not atomic and may leave a broken version in the backup.
I think the fix could be simply adjusting the current design to ensure that any version left in the backup media is valid.
I'm also open to a different design, such as operating the meta DB completely separately, which definitely requires more diligent work.
Though the MetaDB only powers Nebula short links today, it is becoming pivotal and will be more and more important as we leverage it as the source for future data set integrity as well as for internally load-balancing data sets across different Nebula nodes. Would love to see deeper thoughts around this issue.
Steps:
./run.sh
In the test set, run hist(value) only; the value distribution doesn't meet the expectation. I think the value column should be evenly distributed, as it's randomly generated in the range [-128, 128] -
nebula/src/surface/MockSurface.cpp
Line 44 in c8cdd6b
so I believe every bucket should have a similar number of values in the histogram chart.
Also, the query (value > 90) should produce non-zero buckets for values greater than 90; some bug out there! :)
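A quick sketch of the expectation being described (assuming 10 equal-width buckets, which may differ from the engine's actual bucketing): uniformly random values should fill every bucket in similar numbers:

```javascript
// Bucket values into `buckets` equal-width bins over [min, max).
function histogram(values, min, max, buckets) {
  const counts = new Array(buckets).fill(0);
  const width = (max - min) / buckets;
  for (const v of values) {
    counts[Math.min(buckets - 1, Math.floor((v - min) / width))]++;
  }
  return counts;
}

// Mimic the mock column: uniform random integers in [-128, 128].
const N = 100000;
const values = Array.from({ length: N },
  () => Math.floor(Math.random() * 257) - 128);
const counts = histogram(values, -128, 129, 10);
// every bucket should end up close to N / 10
```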
This is a topic that we don't yet have a post or doc page about, and I think it matters most to users who want to adopt Nebula.
Similar to https://nebula.bz/sdk.html, let's add a doc page for "ingestion" and link it from the menu bar on the home page at https://nebula.bz. We can also link this ingestion page from the repo's README.md, as it's something new users care about.
Better discovery for users writing JavaScript UDFs to do data analysis.
Keep the code logic in sync with the user interface.
Basically "select hist("col") from ..."
Some error logs captured when a Nebula node dies due to a hist() call in production:
I0128 19:12:14.229609 20261 Dsl.cpp:354] Nodes to execute the query: 1
I0128 19:12:14.230307 20261 BlockManager.cpp:147] Fetch blcoks 1009 / 1009 for table cdn_requests in window [1611838876, 1611860964].
I0128 19:12:14.230335 20261 NodeExecutor.cpp:84] Processing total blocks: 1009
F0128 19:12:14.232291 36092 Histogram-inl.h:34] Check failed: bucketSize_ > ValueType(0) (0 vs. 0)
*** Check failure stack trace: ***
*** Aborted at 1611861134 (Unix time, try 'date -d @1611861134') ***
*** Signal 6 (SIGABRT) (0x3e800000006) received by PID 6 (pthread TID 0x7fe8c5ec2700) (linux TID 36092) (maybe from PID 6, UID 1000) (code: -6), stack trace: ***
(error retrieving stack trace)
A column property can have a time pattern, which could be used to parse the column value into a timestamp.
Example:
columnx:
  timestamp: "%Y-%m-%d %H:%M:%S"
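As an illustration of the pattern above (not the engine's actual parser, and handling only this one pattern):

```javascript
// Sketch: parse a "%Y-%m-%d %H:%M:%S" value into a unix timestamp (UTC).
// A real implementation would translate arbitrary strptime-style patterns.
function parseTimestamp(s) {
  const m = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$/.exec(s);
  if (!m) throw new Error(`value does not match pattern: ${s}`);
  const [, Y, M, D, h, min, sec] = m.map(Number);
  return Date.UTC(Y, M - 1, D, h, min, sec) / 1000; // months are 0-based
}
```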
(Experimental & direction exploration)
Investigate Facebook Prophet.
A timeline could be defined, and an anomaly detection module could wrap around it; this could be saved as an item, and when things go wild, it fires an event on a dashboard.
Today the Nebula UI requires the user to choose which visualization to use. This is unnecessary and not good UX.
Make the change as below:
Based on the fields (with/without aggregations), the Nebula UI will always display a table for the result data (truncating super-long string columns), and above it, the user can choose to display any visual they want.
Timeline queries need special treatment to always respect the window column as the x-axis.
If the data source is a file system such as S3, then based on the file extension we should be able to decompress files transparently before ingesting them, when they are compressed files such as .gz, .zst, .lz4, etc. (CSV/TSV files are much smaller in compressed form.) Basically, we can decompress the file in place after downloading it from the object store.
Enable -flto and/or related link-time optimization for the Nebula build. Currently it seems not to be working.
I found that the same view has an incorrect height on my laptop while it shows fine on a Dell monitor.
Also, the main color for the icicle/flame chart is too dark (dark red), which is not good for this visual; it needs some adjustment.
The goal here is to support reading from an hourly partitioned S3 directory.
We need to improve genSpecs4Roll to add hourly partitioned specs instead of daily ones. Currently the macro is hard-coded as date (daily), hence some work is needed to read the macro and decide how genSpecs4Roll does its listing accordingly.
Often there is a commit flag (_SUCCESS) when an hourly partition finishes. fs->list() should allow customizing its behavior, letting the user decide whether they want a fresh yet incomplete dataset or a complete dataset with a bit more latency.
Detailed discussion in #34.
Lots of unknowns for this item.
Regarding performance metrics: data storage size/cost, query latency, the query set to run, etc.
Paged Slice can effectively reduce memory waste by organizing internal compressed blocks for the whole data block.
However, concurrent reads of the list will require lots of locking and isolation for thread safety.
Not a high priority, but an interesting problem to tackle.
It's painful to make the C++ build pass today; the major problems are dependencies and linker issues.
Target platform: Linux Ubuntu 18.
Though we may not be able to make the build fully automated, we can at least have a clear guide that is known to work.
At the same time, we should maintain the macOS build as well.
YAML is a powerful way to express configurations; it's easy for people to understand and change. At the same time, remembering all the different configurations and concepts can impose a high tax once we start supporting functions and preprocessing, indexing, or consistent hashing (a possible concept for expanding/shrinking the storage cluster). This may lead to inventing a new set of configurations and concepts that only experts can remember.
Moreover, an OLAP system works as part of the big data ecosystem; being able to transform and pre-process at ingestion time will give Nebula an edge over other OLAP engines in user adoption.
Here is an inspiring example not yet supported by Nebula.
A user has a Hive table and a Kafka stream ingesting into Nebula. The Hive table has hourly partitions keeping the last 60 days of moving averages of business spend per account; the Kafka stream contains business transactions in each account's foreign currency. The user wants to investigate account spending status in near real time in the home currency (e.g. USD).
The complexity of this use case is threefold.
If the user writes an RDBMS query, it would look like:
create view nebula.transaction_analytic as (
  select accountid, avg(spend), transactionid, TO_USD(transaction_amount)
  from hive right join kafka on hive.account = kafka.account
  where <all configs on hive, kafka>
)
Alternatively, we can support a two-statement flow like the following.
DDL:
// mapping of the hive table synced to a nebula table
create table hive.account (
  accountid bigint PRIMARY KEY,
  spend double,
  dt varchar(20) UNIQUE NOT NULL
) with ();
create table kafka.transaction (
  transactionid bigint PRIMARY KEY,
  accountid bigint not null,
  transaction_amount double,
  _time timestamp
) with ();
create table transaction_analytic (
  accountid bigint PRIMARY KEY,
  avg_transaction double,
  transaction_amount_in_usd double,
  _time timestamp
) with ();
DML:
insert into transaction_analytic
select accountid, avg(spend), transactionid, TO_USD(transaction_amount)
from hive.account right join kafka.transaction on account.accountid = transaction.accountid;
Right now, the manual steps mentioned in dev.md are for Ubuntu 18.
Let's at least use build.sh to automate this, supporting Ubuntu 18 only.
Currently, Nebula defines a file system interface in its storage component. Under this interface there are a few implementations; two are the most used so far. Now we're asking for a new implementation to allow Nebula to read and write data with GCS, which will be needed for Nebula deployments on Google Cloud Platform.
Sometimes, users have defined metrics on the query result itself. For example:
No matter what, we will pass the raw JSON blob to the user and let them transform it using their own transform lambda; the new schema will be discovered from the transformed result.
The feedback from Chris makes sense: we should write docs from the user's perspective instead of the developer's perspective.
For dev notes, we should use more .md files in the source code, leaving the docs for the users who will ultimately adopt the project in their work.
The client will make a continuous query template and materialize it with a new interval start and end for each new data point.
Push the new data point into the timeline buckets and let the visual rendering part decide how to consume the growing queue.
If we cap the queue size, old data points will be pushed out as new data points are appended.
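The capped queue could be as simple as the sketch below (names are mine, not existing code):

```javascript
// Fixed-capacity timeline buffer: appending beyond the cap evicts the
// oldest data point, so the visual always renders the latest window.
class TimelineBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.points = [];
  }
  push(point) {
    this.points.push(point);
    if (this.points.length > this.capacity) this.points.shift();
  }
}
```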
This is a tiny task, a good first issue for any newcomer.
Right now, the cardinality-estimate function is only attached to NUMBER-typed columns when populating the UI; just allow it to populate for any type, since cardinality is not limited to numbers.
Right now, the Nebula client (JavaScript) SDK doesn't support filters yet, basically a where-clause function.
This issue will track support for that.
Bubble chart support is useful for some use cases. Bubble charts are popular; would love to see them become available in the Nebula UI.
Today, for a real-time streaming data source such as Kafka, Nebula creates each data block based on an offset start and end, seals the block when all its records have arrived, then puts the sealed block into the queryable block pool. Queries scan the blocks in the pool.
For busy streams, this may not be an issue, as the system will place data blocks very often. But a slow stream may wait a few minutes to generate a sizable batch; our current solution is to decrease the batch size for that stream, which ends up with more blocks to manage and does not adapt to stream speed, since traffic speeds up and slows down in different periods.
This issue asks for an improvement making ingestion adaptive to stream speed while still maintaining an ideal block size. Options are:
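One common approach (my assumption, not a decided design) is sealing on whichever comes first, a size threshold or an age threshold, so slow streams still produce blocks promptly while busy streams keep the ideal block size:

```javascript
// Seal a block when it reaches the target row count OR when it has been
// open longer than maxAgeMs, whichever happens first.
function shouldSeal(block, targetRows, maxAgeMs, nowMs) {
  return block.rows >= targetRows || nowMs - block.openedAtMs >= maxAgeMs;
}
```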
This is an enhancement based on the user's intention, not a priority.
When there is no filter, or the filter has nothing to do with the column the histogram runs on, we use min/max from metadata. But when the user puts a predicate like value > 90 and runs hist("value"), we should be able to update the range to [90, max] and then run 10 buckets over it. This scenario is called a zoom-in histogram, which helps the user keep zooming into smaller data ranges at different granularity.
@shuoshang1990 feel free to take a look, but you don't have to take it; it's essentially a new feature on top of the current hist.
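The range adjustment described above amounts to intersecting the metadata range with the predicate before bucketing; a sketch assuming 10 equal-width buckets:

```javascript
// Derive bucket boundaries for a zoom-in histogram: start from the
// column's metadata [min, max], tighten the low end with a predicate
// like value > 90, then split the range into n equal-width buckets.
function zoomBuckets(metaMin, metaMax, predicateMin, n) {
  const lo = Math.max(metaMin, predicateMin);
  const width = (metaMax - lo) / n;
  return Array.from({ length: n + 1 }, (_, i) => lo + i * width);
}
```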
Some use cases need sampling support during data ingestion for super-heavy data sources, so users can get insights without scanning the full data.
A few initial thoughts:
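As one initial option (my suggestion, not a committed design): reservoir sampling keeps a fixed-size uniform sample of a stream in a single pass, so it fits an ingestion path that never sees the full data twice:

```javascript
// Reservoir sampling (Algorithm R): after i rows have been seen, each
// row has an equal k / i chance of being in the reservoir.
function reservoirSample(rows, k) {
  const reservoir = [];
  let seen = 0;
  for (const row of rows) {
    seen += 1;
    if (reservoir.length < k) {
      reservoir.push(row);
    } else {
      const j = Math.floor(Math.random() * seen);
      if (j < k) reservoir[j] = row; // replace with decreasing probability
    }
  }
  return reservoir;
}
```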
One of the design goals for Nebula is elasticity, meaning Nebula nodes should be able to join and leave flexibly in reaction to workload changes. Kubernetes is a great environment to exercise this, and it will help move the Nebula design toward its ideal state.
As a first step, I would love to see somebody give it a try and tell us what the gaps are.
Treemerge is the core algorithm that merges call stacks into a weighted tree, displayed as an icicle or flame graph in the frontend.
However, we have observed large performance degradation; some investigation is needed to understand where optimizations could be implemented.
A few initial leads:
Metadata needs to track whether duplicate blocks exist for the same spec.
The Nebula server keeps generating new data specs, especially in the "swap" scenario; how do we guarantee a block belongs to the current spec rather than the last spec, even though they may have the same spec definition?
Revisit the structure of the metadata layers:
Tables are identified by name; there are no duplicate table names in Nebula.
Specs are identified by a signature, which is timestamped.
Blocks are identified by a signature.
Currently we support rolling specs with time MACROs (date, hour, min, second) by specifying a MACRO pattern in the time spec. I think we should decouple these for better flexibility.
For example, the spec below should be legit:
test-table:
  ...
  source: s3://xxx/{date}/{hour}/
  time:
    type: column
    column: col2
    pattern: UNIXTIME_MS
This spec basically asks us to scan S3 file paths with the supported macros in them, while the time actually comes from an existing column. My understanding is that we don't support MACRO parsing unless it is specified in the time spec.
cc @chenqin
A placeholder to discuss splitting Nebula internal files into hourly bucketed S3 columnar files.
Often when a user reads an S3 directory into Nebula, certain transforms need to be applied to the data schema.
The goal is to allow users to define their own ingestion UDFs, applied to each row read from S3.
Some JSON data has structure, similar to a Thrift data object, which uses field mapping to ingest data.
For JSON, an easier way is to flatten the structure into field names.
For example, the schema "ROW<a.b:int, a.c:string>" can ingest JSON data like:
{
  a: {
    b: 230,
    c: "xyz"
  }
}
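The flattening described above could look like this sketch (ingestion would then map the dotted names to schema fields):

```javascript
// Flatten nested objects into dotted field names, so {a:{b:230,c:"xyz"}}
// becomes {"a.b": 230, "a.c": "xyz"}, matching "ROW<a.b:int, a.c:string>".
function flatten(obj, prefix = '', out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const name = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      flatten(value, name, out);
    } else {
      out[name] = value; // leaf: scalar, null, or array kept as-is
    }
  }
  return out;
}
```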
An alternative approach is to fix all issues with nested types in Nebula, including Map, List, and Struct, which haven't been tested at all, so I assume there will be tons of work in this space.
We should be able to support queries that show a histogram for each key in the query.
For example, if a user queries tag, hist(value), we should display 4 charts in the UI, each showing a histogram for one key (a, b, c, d in this example).
As a reference, we have similar handling for the TREEMERGE function in the flame/icicle view.
The Nebula timeline requires a time column to be present, but in the ephemeral case a time column may be represented by some custom column; it could be named anything (or even generated by an instant UDF).
For this type of query result, what if the user wants to visualize it as a timeline? Yes, we should add this kind of pure client-side visual support as a parallel feature in addition to the existing timeline query.
The schema of a table (data set) can be updated. When a user adds/removes fields in an existing data set, Nebula does not update its schema; the reason is that Nebula keeps the schema immutable from when it first sees the table.
The current workaround is to add the data set under a new name and delete the old one, which keeps each schema immutable.
A better schema-compatibility fix is expected.
This type of support is good for UI flexibility, but may not be good for the consistency expected from the API. We should think a bit more carefully about this issue before implementing the support.
Write up a technical white paper for Nebula.
This is an end-to-end feature request to show data in a diff view, including:
Table
Timeline
Bar / line
Flame (🔥)
I think there are two major comparison types we should support, and this will simplify how diff is defined and supported in Nebula.
Diff on two different time ranges. In this mode, we allow the user to input two different time ranges while the rest of the data/metrics stay the same. In some visualizations, we may want to introduce T1 vs. T2 as dimensions for the diff view.
Diff on two different filter sets (a time range is essentially one specific filter). Here, we generate two tags T1 and T2 to represent the two different filters, and the result is presented in two groups for comparison.