varchar-io / nebula
A distributed block-based data storage and compute engine
Home Page: https://nebula.bz
License: Apache License 2.0
Right now, we're using UTC time throughout the whole stack, from Nebula Engine to Nebula UI.
However, it's not intuitive for the UI to show UTC instead of local time: if I'm in PST, I don't want to use my brain to translate timestamps back and forth, which is very inconvenient. This translation should be done in a single place serving the UI, and ideally the UI offers an option letting the user choose UTC or local time.
Work can be started from https://github.com/varchar-io/nebula/blob/master/src/service/http/nebula/_/time.js
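A minimal sketch of what a single translation point in time.js could look like (the function name and option flag are hypothetical, not existing code):

```javascript
// Sketch: one place in the UI translating engine (UTC) timestamps for display.
// `useLocalTime` stands in for the proposed user option (UTC vs. local).
function formatTime(utcSeconds, useLocalTime) {
  const d = new Date(utcSeconds * 1000); // engine timestamps assumed in seconds
  return useLocalTime
    ? d.toLocaleString() // browser renders in the user's local time zone
    : d.toISOString();   // keep the UTC rendering we have today
}
```

`toLocaleString()` picks up the browser's time zone automatically, so no per-zone logic is needed on our side.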
In databases, there are fixed-length string types like char(3) and variable-length string types like varchar(30).
In many scenarios, a fixed-length string type brings a big reduction in both storage and computation.
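To illustrate the computation side (my illustration, not Nebula's actual storage layout): a fixed-width column lets a reader compute a value's position directly, while a variable-width column needs an extra offsets index:

```javascript
// Fixed-width char(3)-style column: value i starts at i * width,
// so no per-row offset index is stored or consulted.
function fixedSlice(data, width, i) {
  return data.slice(i * width, (i + 1) * width);
}

// Variable-width varchar-style column: a separate offsets array is
// required to locate each value.
function varSlice(data, offsets, i) {
  return data.slice(offsets[i], offsets[i + 1]);
}
```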
Some query results aren't a good fit for a graph visual, such as an error message and its count; in this case, the user may want to hide the graph and display the result as a table only.
We should provide an option letting the user hide the graph, or a table-only display option.
To support embedding the Nebula UI into other surfaces (dashboards, embedded environments), introduce a state property indicating whether the current rendering is a visualization-only display.
Currently an Instant UDF (a JavaScript function/lambda) can be used to define a new column, so we can use it in the select clause.
However, we should support it in the where clause as well, so the pseudo code below could execute successfully.
const x = () => nebula.column("age") % 3;
nebula.apply("x", nebula.Type.INT, x);
nebula.select("type", count("id")).where(eq("x", 1));
Replace all occurrences of
Copyright 2017-present Shawn Cao
in file headers with
Copyright 2017-present varchar.io
Set up a GitHub Action to build Nebula nightly.
Right now, making the Nebula build work on a fresh (Ubuntu) machine is not EASY!
@samprasyork probably has strong feelings about it. :)
This issue tracks making the build experience easy for future developers on this project. GitHub Actions supports CMake-based projects; by enabling it, I guess we will fix most of the build issues.
The Nebula query engine should not care about what display type the front end uses.
The only thing it cares about is the query fields.
And the client should be able to handle whatever the final query result is.
A test case using constant value eval as an example:
{
  int8_t value = 90;
  auto c = nebula::surface::eval::constant(value);
  // note: evaluating with int32_t while the constant was created as int8_t
  auto ev = c->eval<int32_t>(ctx);
  LOG(INFO) << "Origin=" << value << ", Eval=" << ev.value();
}
This is debatable, since type enforcement is a good thing, but the issue is that if we get it wrong as in the example above, it fails at runtime rather than compile time. So I feel we should either support a compile-time check or add a compatible cast at runtime.
Currently we back up the internal metadata DB (LevelDB) when it is 'dirty' and the backup interval has been exceeded.
However, the process is not atomic and may leave a broken version in the backup.
I think the fix could be simply adjusting the current design to ensure that any version left in the backup media is valid.
I'm also open to a different design, such as operating the meta DB completely separately, which definitely requires more diligent work.
Though the MetaDB only powers Nebula short links today, it is becoming pivotal and will be more and more important as we leverage it as the source for future data set integrity as well as for internally load-balancing data sets across different Nebula nodes. Would love to see deeper thoughts around this issue.
Steps:
./run.sh
In the test set, run hist(value) only; the value distribution doesn't meet the expectation. I think the value column should be evenly distributed, as it's randomly generated in the range [-128, 128] -
nebula/src/surface/MockSurface.cpp
Line 44 in c8cdd6b
so I believe every bucket should have a similar number of values in the histogram chart.
Also, the query (value > 90) should produce non-zero buckets for values greater than 90; some bug out there! :)
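A quick sketch of the expectation being described (assuming 10 equal-width buckets, which may differ from the engine's actual bucketing): uniformly random values should fill every bucket in similar numbers:

```javascript
// Bucket values into `buckets` equal-width bins over [min, max).
function histogram(values, min, max, buckets) {
  const counts = new Array(buckets).fill(0);
  const width = (max - min) / buckets;
  for (const v of values) {
    counts[Math.min(buckets - 1, Math.floor((v - min) / width))]++;
  }
  return counts;
}

// Mimic the mock column: uniform random integers in [-128, 128].
const N = 100000;
const values = Array.from({ length: N },
  () => Math.floor(Math.random() * 257) - 128);
const counts = histogram(values, -128, 129, 10);
// every bucket should end up close to N / 10
```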
This is a topic that we don't yet have a post or doc page about, and I think it matters most to users who want to adopt Nebula.
Similar to https://nebula.bz/sdk.html, let's add a doc page for "ingestion" and link it from the menu bar on the home page at https://nebula.bz. We can also link this ingestion page from the repo's README.md, as it's something new users care about.
Better discovery for users writing JavaScript UDFs to do data analysis.
Keep the code logic in sync with the user interface.
Basically "select hist("col") from ..."
Some error logs captured when a Nebula node dies due to a hist() call in production:
I0128 19:12:14.229609 20261 Dsl.cpp:354] Nodes to execute the query: 1
I0128 19:12:14.230307 20261 BlockManager.cpp:147] Fetch blcoks 1009 / 1009 for table cdn_requests in window [1611838876, 1611860964].
I0128 19:12:14.230335 20261 NodeExecutor.cpp:84] Processing total blocks: 1009
F0128 19:12:14.232291 36092 Histogram-inl.h:34] Check failed: bucketSize_ > ValueType(0) (0 vs. 0)
*** Check failure stack trace: ***
*** Aborted at 1611861134 (Unix time, try 'date -d @1611861134') ***
*** Signal 6 (SIGABRT) (0x3e800000006) received by PID 6 (pthread TID 0x7fe8c5ec2700) (linux TID 36092) (maybe from PID 6, UID 1000) (code: -6), stack trace: ***
(error retrieving stack trace)
A column property can have a time pattern, which could be used to parse the column value into a timestamp.
Example:
columnx:
  timestamp: "%Y-%m-%d %H:%M:%S"
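As an illustration of the pattern above (not the engine's actual parser, and handling only this one pattern):

```javascript
// Sketch: parse a "%Y-%m-%d %H:%M:%S" value into a unix timestamp (UTC).
// A real implementation would translate arbitrary strptime-style patterns.
function parseTimestamp(s) {
  const m = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$/.exec(s);
  if (!m) throw new Error(`value does not match pattern: ${s}`);
  const [, Y, M, D, h, min, sec] = m.map(Number);
  return Date.UTC(Y, M - 1, D, h, min, sec) / 1000; // months are 0-based
}
```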
(Experimental & direction exploration)
Investigate Facebook Prophet.
A timeline could be defined, and an anomaly detection module could wrap around it; this could be saved as an item, and when things go wild, it fires an event on a dashboard.
Today the Nebula UI requires the user to choose which visualization to use. This is unnecessary and not good UX.
Make the change as below:
Based on the fields (with/without aggregations), the Nebula UI will always display a table for the result data (truncating super-long string columns), and above it, the user can choose to display any visual they want.
Timeline queries need special treatment to always respect the window column as the x-axis.
If the data source is a file system such as S3, then based on the file extension we should be able to decompress files transparently before ingesting them, when they are compressed files such as .gz, .zst, .lz4, etc. (CSV/TSV files are much smaller in compressed form.) Basically, we can decompress the file in place after downloading it from the object store.
Enable -flto and/or related link-time optimization for the Nebula build. Currently it seems not to be working.
I found that the same view has an incorrect height on my laptop while it shows fine on a Dell monitor.
Also, the main color for the icicle/flame chart is too dark (dark red), which is not good for this visual; it needs some adjustment.
The goal here is to support reading from an hourly partitioned S3 directory.
We need to improve genSpecs4Roll to add hourly partitioned specs instead of daily ones. Currently the macro is hard-coded as date (daily), hence some work is needed to read the macro and decide how genSpecs4Roll does its listing accordingly.
Often there is a commit flag (_SUCCESS) when an hourly partition finishes. fs->list() should allow customizing its behavior, letting the user decide whether they want a fresh yet incomplete dataset or a complete dataset with a bit more latency.
Detailed discussion in #34.
Lots of unknowns for this item.
Regarding performance metrics: data storage size/cost, query latency, the query set to run, etc.
Paged Slice can effectively reduce memory waste by organizing internal compressed blocks for the whole data block.
However, concurrent reads of the list will require lots of locking and isolation for thread safety.
Not a high priority, but an interesting problem to tackle.
It's painful to make the C++ build pass today; the major problems are dependencies and linker issues.
Target platform: Linux Ubuntu 18.
Though we may not be able to make the build fully automated, we can at least have a clear guide that is known to work.
At the same time, we should maintain the macOS build as well.
YAML is a powerful way to express configurations; it's easy for people to understand and change. At the same time, remembering all the different configurations and concepts can impose a high tax once we start supporting functions and preprocessing, indexing, or consistent hashing (a possible concept for expanding/shrinking the storage cluster). This may lead to inventing a new set of configurations and concepts that only experts can remember.
Moreover, an OLAP system works as part of the big data ecosystem; being able to transform and pre-process at ingestion time will give Nebula an edge over other OLAP engines in user adoption.
Here is an inspiring example not yet supported by Nebula.
A user has a Hive table and a Kafka stream ingesting into Nebula. The Hive table has hourly partitions keeping the last 60 days of moving averages of business spend per account; the Kafka stream contains business transactions in each account's foreign currency. The user wants to investigate account spending status in near real time in the home currency (e.g. USD).
The complexity of this use case is threefold.
If the user writes an RDBMS query, it would look like:
create view nebula.transaction_analytic as (
  select accountid, avg(spend), transactionid, TO_USD(transaction_amount)
  from hive right join kafka on hive.account = kafka.account
  where <all configs on hive, kafka>
)
Alternatively, we can support a two-statement flow like the following.
DDL:
// mapping of the hive table synced to a nebula table
create table hive.account (
  accountid bigint PRIMARY KEY,
  spend double,
  dt varchar(20) UNIQUE NOT NULL
) with ();
create table kafka.transaction (
  transactionid bigint PRIMARY KEY,
  accountid bigint not null,
  transaction_amount double,
  _time timestamp
) with ();
create table transaction_analytic (
  accountid bigint PRIMARY KEY,
  avg_transaction double,
  transaction_amount_in_usd double,
  _time timestamp
) with ();
DML:
insert into transaction_analytic
select accountid, avg(spend), transactionid, TO_USD(transaction_amount)
from hive.account right join kafka.transaction on account.accountid = transaction.accountid;
Right now, the manual steps mentioned in dev.md are for Ubuntu 18.
Let's at least use build.sh to automate this, supporting Ubuntu 18 only.
Currently, Nebula defines a file system interface in its storage component. Under this interface there are a few implementations; two are the most used so far. Now we're asking for a new implementation to allow Nebula to read and write data with GCS, which will be needed for Nebula deployments on Google Cloud Platform.
Sometimes, users have defined metrics on the query result itself. For example:
No matter what, we will pass the raw JSON blob to the user and let them transform it using their own transform lambda; the new schema will be discovered from the transformed result.
The feedback from Chris makes sense: we should write docs from the user's perspective instead of the developer's perspective.
For dev notes, we should use more .md files in the source code, leaving the docs for the users who will ultimately adopt the project in their work.
The client will make a continuous query template and materialize it with a new interval start and end for each new data point.
Push the new data point into the timeline buckets and let the visual rendering part decide how to consume the growing queue.
If we cap the queue size, old data points will be pushed out as new data points are appended.
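The capped queue could be as simple as the sketch below (names are mine, not existing code):

```javascript
// Fixed-capacity timeline buffer: appending beyond the cap evicts the
// oldest data point, so the visual always renders the latest window.
class TimelineBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.points = [];
  }
  push(point) {
    this.points.push(point);
    if (this.points.length > this.capacity) this.points.shift();
  }
}
```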
This is a tiny task, a good first issue for any newcomer.
Right now, the cardinality-estimate function is only attached to NUMBER-typed columns when populating the UI; just allow it to populate for any type, since cardinality is not limited to numbers.
Right now, the Nebula client (JavaScript) SDK doesn't support filters yet, basically a where-clause function.
This issue will track support for that.
Bubble chart support is useful for some use cases. Bubble charts are popular; would love to see them become available in the Nebula UI.
Today, for a real-time streaming data source such as Kafka, Nebula creates each data block based on an offset start and end, seals the block when all its records have arrived, then puts the sealed block into the queryable block pool. Queries scan the blocks in the pool.
For busy streams, this may not be an issue, as the system will place data blocks very often. But a slow stream may wait a few minutes to generate a sizable batch; our current solution is to decrease the batch size for that stream, which ends up with more blocks to manage and does not adapt to stream speed, since traffic speeds up and slows down in different periods.
This issue asks for an improvement making ingestion adaptive to stream speed while still maintaining an ideal block size. Options are:
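One common approach (my assumption, not a decided design) is sealing on whichever comes first, a size threshold or an age threshold, so slow streams still produce blocks promptly while busy streams keep the ideal block size:

```javascript
// Seal a block when it reaches the target row count OR when it has been
// open longer than maxAgeMs, whichever happens first.
function shouldSeal(block, targetRows, maxAgeMs, nowMs) {
  return block.rows >= targetRows || nowMs - block.openedAtMs >= maxAgeMs;
}
```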
This is an enhancement based on the user's intention, not a priority.
When there is no filter, or the filter has nothing to do with the column the histogram runs on, we use min/max from metadata. But when the user puts a predicate like value > 90 and runs hist("value"), we should be able to update the range to [90, max] and then run 10 buckets over it. This scenario is called a zoom-in histogram, which helps the user keep zooming into smaller data ranges at different granularity.
@shuoshang1990 feel free to take a look, but you don't have to take it; it's essentially a new feature on top of the current hist.
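The range adjustment described above amounts to intersecting the metadata range with the predicate before bucketing; a sketch assuming 10 equal-width buckets:

```javascript
// Derive bucket boundaries for a zoom-in histogram: start from the
// column's metadata [min, max], tighten the low end with a predicate
// like value > 90, then split the range into n equal-width buckets.
function zoomBuckets(metaMin, metaMax, predicateMin, n) {
  const lo = Math.max(metaMin, predicateMin);
  const width = (metaMax - lo) / n;
  return Array.from({ length: n + 1 }, (_, i) => lo + i * width);
}
```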
Some use cases need sampling support during data ingestion for super-heavy data sources, so users can get insights without scanning the full data.
A few initial thoughts:
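As one initial option (my suggestion, not a committed design): reservoir sampling keeps a fixed-size uniform sample of a stream in a single pass, so it fits an ingestion path that never sees the full data twice:

```javascript
// Reservoir sampling (Algorithm R): after i rows have been seen, each
// row has an equal k / i chance of being in the reservoir.
function reservoirSample(rows, k) {
  const reservoir = [];
  let seen = 0;
  for (const row of rows) {
    seen += 1;
    if (reservoir.length < k) {
      reservoir.push(row);
    } else {
      const j = Math.floor(Math.random() * seen);
      if (j < k) reservoir[j] = row; // replace with decreasing probability
    }
  }
  return reservoir;
}
```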
One of the design goals for Nebula is elasticity, meaning Nebula nodes should be able to join and leave flexibly in reaction to workload changes. Kubernetes is a great environment to exercise this, and it will help move the Nebula design toward its ideal state.
As a first step, I would love to see somebody give it a try and tell us what the gaps are.
Treemerge is the core algorithm that merges call stacks into a weighted tree, displayed as an icicle or flame graph in the frontend.
However, we have observed large performance degradation; some investigation is needed to understand where optimizations could be implemented.
A few initial leads:
Metadata needs to track whether duplicate blocks exist for the same spec.
The Nebula server keeps generating new data specs, especially in the "swap" scenario; how do we guarantee a block belongs to the current spec rather than the last spec, even though they may have the same spec definition?
Revisit the structure of the metadata layers:
Tables are identified by name; there are no duplicate table names in Nebula.
Specs are identified by a signature, which is timestamped.
Blocks are identified by a signature.
Currently we support rolling specs with time MACROs (date, hour, min, second) by specifying a MACRO pattern in the time spec. I think we should decouple these for better flexibility.
For example, the spec below should be legit:
test-table:
  ...
  source: s3://xxx/{date}/{hour}/
  time:
    type: column
    column: col2
    pattern: UNIXTIME_MS
This spec basically asks us to scan S3 file paths with the supported macros in them, while the time actually comes from an existing column. My understanding is that we don't support MACRO parsing unless it is specified in the time spec.
cc @chenqin
A placeholder to discuss splitting Nebula internal files into hourly bucketed S3 columnar files.
Often when a user reads an S3 directory into Nebula, certain transforms need to be applied to the data schema.
The goal is to allow users to define their own ingestion UDFs, applied to each row read from S3.
Some JSON data has structure, similar to a Thrift data object, which uses field mapping to ingest data.
For JSON, an easier way is to flatten the structure into field names.
For example, the schema "ROW<a.b:int, a.c:string>" can ingest JSON data like:
{
  a: {
    b: 230,
    c: "xyz"
  }
}
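The flattening described above could look like this sketch (ingestion would then map the dotted names to schema fields):

```javascript
// Flatten nested objects into dotted field names, so {a:{b:230,c:"xyz"}}
// becomes {"a.b": 230, "a.c": "xyz"}, matching "ROW<a.b:int, a.c:string>".
function flatten(obj, prefix = '', out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const name = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      flatten(value, name, out);
    } else {
      out[name] = value; // leaf: scalar, null, or array kept as-is
    }
  }
  return out;
}
```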
An alternative approach is to fix all issues with nested types in Nebula, including Map, List, and Struct, which haven't been tested at all, so I assume there will be tons of work in this space.
We should be able to support queries that show a histogram for each key in the query.
For example, if a user queries tag, hist(value), we should display 4 charts in the UI, each showing a histogram for one key (a, b, c, d in this example).
As a reference, we have similar handling for the TREEMERGE function in the flame/icicle view.
The Nebula timeline requires a time column to be present, but in the ephemeral case a time column may be represented by some custom column; it could be named anything (or even generated by an instant UDF).
For this type of query result, what if the user wants to visualize it as a timeline? Yes, we should add this kind of pure client-side visual support as a parallel feature in addition to the existing timeline query.
The schema of a table (data set) can be updated. When a user adds/removes fields in an existing data set, Nebula does not update its schema; the reason is that Nebula keeps the schema immutable from when it first sees the table.
The current workaround is to add the data set under a new name and delete the old one, which keeps each schema immutable.
A better schema-compatibility fix is expected.
This type of support is good for UI flexibility, but may not be good for the consistency expected from the API. We should think a bit more carefully about this issue before implementing the support.
Write up a technical white paper for Nebula.
This is an end-to-end feature request to show data in a diff view, including:
Table
Timeline
Bar / line
Flame (🔥)
I think there are two major comparison types we should support, and this will simplify how diff is defined and supported in Nebula.
Diff on two different time ranges. In this mode, we allow the user to input two different time ranges while the rest of the data/metrics stay the same. In some visualizations, we may want to introduce T1 vs. T2 as dimensions for the diff view.
Diff on two different filter sets (a time range is essentially one specific filter). Here, we generate two tags T1 and T2 to represent the two different filters, and the result is presented in two groups for comparison.