kuzudb / kuzu
Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
Home Page: https://kuzudb.com/
License: MIT License
Consider the following query:
MATCH (a)-[e1]->(b)
RETURN e1.time;
Currently, we consider two possible QVOs (query vertex orderings), [a, b] and [b, a], and therefore generate two plans:
P1: Scan(a) → Extend(b) → ScanEdgeProperty(e1.time) → Project([e1.time])
OR
P2: Scan(b) → Extend(a) → ScanEdgeProperty(e1.time) → Project([e1.time])
The extension to (b) comes from the fact that we blindly cover all of the query edges and vertices, regardless of what the query actually requests. The two plans should instead be:
P1: Scan(a) → ScanEdgeProperty(e1.time) → Project([e1.time])
OR
P2: Scan(b) → ScanEdgeProperty(e1.time) → Project([e1.time])
In other cases, where we don't actually need any properties at all, we should simply get the size from the Lists header and therefore avoid unnecessary pin/unpin. The query:
MATCH (a)-[e1]->(b)
RETURN count(*);
Should generate:
P1: Scan(a) → GetListSize(e1) → GROUP BY (COUNT(*))
P2: Scan(b) → GetListSize(e1) → GROUP BY (COUNT(*))
We need to be more aware of which properties are actually needed. The ID is just another property that we join on.
Add support for SEMI, OUTER, and ANTI join types in the hash join operator.
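As a point of reference for the semantics involved, here is a minimal, hypothetical sketch (none of these names are Kuzu's actual classes) of how SEMI, ANTI, and LEFT OUTER probe logic differ over a pre-built set of build-side join keys; unmatched LEFT OUTER tuples are modeled with a bool matched flag standing in for build-side NULLs:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

enum class JoinType { INNER, SEMI, ANTI, LEFT_OUTER };

// Returns the probe-side keys that survive the join, paired with a flag
// recording whether they matched a build-side tuple.
std::vector<std::pair<uint64_t, bool>> probeJoin(
    const std::unordered_set<uint64_t>& buildKeys,
    const std::vector<uint64_t>& probeKeys, JoinType type) {
    std::vector<std::pair<uint64_t, bool>> out;
    for (auto key : probeKeys) {
        bool matched = buildKeys.count(key) > 0;
        switch (type) {
        case JoinType::SEMI:       // emit the probe tuple only if matched
            if (matched) out.emplace_back(key, true);
            break;
        case JoinType::ANTI:       // emit the probe tuple only if NOT matched
            if (!matched) out.emplace_back(key, false);
            break;
        case JoinType::LEFT_OUTER: // emit every probe tuple; flag records the match
            out.emplace_back(key, matched);
            break;
        case JoinType::INNER:
            if (matched) out.emplace_back(key, true);
            break;
        }
    }
    return out;
}
```

In the real operator the same decision would be made per probe tuple against the hash table chains, and LEFT OUTER would additionally emit NULL payload columns for unmatched tuples.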
The current enumerator uses shared_ptr for LogicalOperator because multiple operators may share the same prev operator.
Consider the example (a)->(b)->(c): when enumerating plans of size 2, E(a) and E(c) will both point to S(b).
The right thing to do is to use raw pointers during enumeration. When the enumerator outputs a plan, we copy it and use unique_ptr.
This does not degrade performance, because we will eventually replace the enumerator with an optimizer, which outputs only a single plan. The enumerator is for internal use, so copying multiple plans is fine.
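The proposed ownership scheme can be sketched as follows (illustrative names only, assuming C++17; these are not Kuzu's actual classes): raw, non-owning child pointers during enumeration so subtrees can be shared, and a deep copy into unique_ptr ownership when a plan is output:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Enumeration-time operator: the child is a raw, non-owning pointer, so
// multiple parents (e.g. E(a) and E(c)) can share one subtree (S(b)).
struct LogicalOp {
    std::string name;
    LogicalOp* prev = nullptr;
};

// Output-time operator: exclusive ownership via unique_ptr.
struct OwnedOp {
    std::string name;
    std::unique_ptr<OwnedOp> prev;
};

// Deep-copy an enumeration-time chain into an owned plan.
std::unique_ptr<OwnedOp> copyPlan(const LogicalOp* op) {
    if (!op) return nullptr;
    auto owned = std::make_unique<OwnedOp>();
    owned->name = op->name;
    owned->prev = copyPlan(op->prev);
    return owned;
}
```

The copy happens only once per output plan, which is cheap relative to enumeration itself, and avoids shared_ptr reference-counting overhead in the hot enumeration loop.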
Update mapToPhysical for ColumnExtends to use dataChunkPos from boundVarName instead of numChunks - 1
The function appendDataChunks() is a bit lengthy. It should be split into smaller pieces to make it more readable.
ValueVectors are currently constructed inside our operators and are passed an elementSize. Instead, Columns and Lists should store both elementSize and dataType, and the dataType should be passed to the ValueVector constructor.
Currently we enumerate a HashJoin when the left and right plans each have at least 2 Extends. This works if the two Extends are many-to-many. In the case of a many-to-one (column) Extend, we should not enumerate a HashJoin (the plan still works with a HashJoin, but it can always be replaced with an Extend).
The right thing to do is to enumerate a HashJoin only when the left and right plans each have at least 2 many-to-many Extends.
Currently, the Sink operator only keeps track of the number of result tuples in the final output. But comparing only the number of tuples is sometimes not enough for proper tests.
To better support tests (and maybe debugging), we need to add an iterator for DataChunks, which can output its content one tuple (a vector of values) at a time. Then we can directly compare the contents of the result DataChunks with expected results, or print out result tuples.
The returned values for one tuple are copied into a Tuple object, which holds a vector of Literal and provides a toString() function.
Usage example:
auto vectorTypes = dataChunks.getVectorTypes();
Tuple tuple(vectorTypes);
DataChunksIterator dataChunksIterator(dataChunks);
while (dataChunksIterator.hasNextTuple()) {
dataChunksIterator.getNextTuple(tuple);
cout << tuple.toString() << endl;
}
...
dataChunksIterator.setCurrentDataChunks(newDataChunks);
...
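For concreteness, the iterator could be sketched along these lines (a minimal stand-in: DataChunks is simplified here to a set of equal-length int64 columns, and Tuple holds raw values instead of Kuzu's Literal):

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

using Column = std::vector<int64_t>; // stand-in for one ValueVector

struct Tuple {
    std::vector<int64_t> values;
    // Render the tuple as pipe-separated values for test comparison/printing.
    std::string toString() const {
        std::ostringstream os;
        for (size_t i = 0; i < values.size(); ++i) {
            if (i > 0) os << "|";
            os << values[i];
        }
        return os.str();
    }
};

class DataChunksIterator {
public:
    explicit DataChunksIterator(const std::vector<Column>& columns)
        : columns(&columns) {}
    bool hasNextTuple() const { return pos < (*columns)[0].size(); }
    // Copy the values at the current position, one per column, into tuple.
    void getNextTuple(Tuple& tuple) {
        tuple.values.clear();
        for (const auto& col : *columns) tuple.values.push_back(col[pos]);
        ++pos;
    }
    void setCurrentDataChunks(const std::vector<Column>& newColumns) {
        columns = &newColumns;
        pos = 0;
    }
private:
    const std::vector<Column>* columns;
    size_t pos = 0;
};
```

The real version would additionally have to handle flat vs. unflat chunks and heterogeneous vector types, which is what the vectorTypes argument to Tuple in the usage example above is for.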
ColumnExtend and ScanProperty operators should be applied before the list gets flattened, to avoid extra I/O.
During the hash table build phase, there are two options for computing hashes:
1. Compute each hash once and store it in htBlocks as a field of the tuple.
2. Recompute hashes on demand, so we don't need to store hashes as a field of tuples in htBlocks.
Currently, we take the latter approach. It is unclear whether the first approach would bring more benefit; it might be worth doing microbenchmarks.
The execution of a serialized LogicalPlan in the server is no longer supported; the interface will be fixed once we add an enumerator.
Just to keep our thoughts a bit organized, I am listing the set of features that need to be implemented before we can show the system to someone:
Phase 1 Features
Currently, the testing framework creates a temp directory to write out loaded graph binary data. It should be cleaned up after all tests are done.
Suppose
unique_ptr<T> functionB();

void functionA() {
    setT(move(functionB()));
}
The move() disables copy elision. Understand the consequences of this and whether it is the right thing to do.
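The effect can be observed directly with a type that counts its move constructions. This is an illustrative demo (names are hypothetical, assuming C++17, where a prvalue initializes a function parameter with guaranteed copy elision, while wrapping it in std::move forces an extra move construction):

```cpp
#include <cassert>
#include <utility>

// Counts move constructions; copying is implicitly disabled by declaring
// the move constructor.
struct T {
    static int moves;
    T() = default;
    T(T&&) noexcept { ++moves; }
};
int T::moves = 0;

T make() { return T{}; }   // returns a prvalue
void sink(T t) { (void)t; }

// Returns {moves without std::move, moves with std::move}.
std::pair<int, int> countMoves() {
    T::moves = 0;
    sink(make());            // prvalue initializes the parameter directly
    int elided = T::moves;
    T::moves = 0;
    sink(std::move(make())); // xvalue forces a move construction
    return {elided, T::moves};
}
```

For unique_ptr the extra move is cheap, so the main cost here is noise rather than performance; but dropping the redundant move() is still the cleaner choice.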
This issue lists Cypher features we skipped during the parser implementation.
Graph Pattern
[e1:knows|:likes]
(a:Person:Student)
Expression
[0, 1, 2, 3]
[0.. 3]
x IN [0, 1, 2, 3]
CASE ... WHEN ... THEN ... ELSE ... END
$param
[x IN range(0, 10) WHERE x % 2 = 0 | x^3 ]
MATCH (a:Person {name: 'Keanu Reeves'}) RETURN [(a)-->(b) WHERE b:Movie | b.released] AS years
WHERE NOT (r:Recipe)-[:INCLUDES]->(excluded:Ingredient)
Consider the following example: MATCH (a:Person)-[r:KNOWS]->(a:Person). The current enumerator generates the plan Scan(Person)-Extend(Knows edge to Person label) without understanding that fromNode and toNode should be the same node. A filter on nodeID should be appended.
a.isProperty needs to be rewritten as a.isProperty == TRUE.
Currently this is not applied as a filter.
Update the test under queries/filtered/paths.test from:
-NAME OneHopKnowsFilteredTest3
-QUERY MATCH (a:person)-[e1:knows]->(b:person) WHERE a.isStudent == TRUE
---- 6
to:
-NAME OneHopKnowsFilteredTest3
-QUERY MATCH (a:person)-[e1:knows]->(b:person) WHERE a.isStudent
---- 6
And ensure it passes.
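The rewrite itself could be sketched as follows (a hypothetical stand-in expression type, not Kuzu's actual binder classes): a bare BOOL property reference used as a predicate is normalized into an explicit equality with TRUE, which can then be applied as a filter:

```cpp
#include <cassert>
#include <memory>
#include <string>

enum class ExprType { PROPERTY, EQUALS, LITERAL_TRUE };
enum class DataType { BOOL, OTHER };

struct Expr {
    ExprType type;
    DataType dataType = DataType::OTHER;
    std::string name;
    std::shared_ptr<Expr> left, right;
};

// Rewrite a bare boolean property expression `a.prop` into `a.prop = TRUE`;
// any other expression is returned unchanged.
std::shared_ptr<Expr> rewriteBoolPropertyAsFilter(std::shared_ptr<Expr> expr) {
    if (expr->type == ExprType::PROPERTY && expr->dataType == DataType::BOOL) {
        auto eq = std::make_shared<Expr>();
        eq->type = ExprType::EQUALS;
        eq->dataType = DataType::BOOL;
        eq->left = expr;
        eq->right = std::make_shared<Expr>();
        eq->right->type = ExprType::LITERAL_TRUE;
        eq->right->dataType = DataType::BOOL;
        return eq;
    }
    return expr;
}
```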
Provide cleaner APIs for appendOperator(Plan&), which ideally takes only a plan as a parameter; the plan contains all necessary information.
MATCH (n) is currently interpreted as n having label -1 (match everything). The right thing to do is to replace -1 with a vector of all node labels in the system.
For now, in order to call an operation, we first ensure that both of the operation's operands are not NULL.
The goal of this issue is to introduce a simple non-parallel hash table for future integration with hash join. A parallel version will be extended from this implementation.
The storage of a simple hash table is divided into two parts: tuple blocks (htBlocks) and a directory (htDirectory).
Tuple blocks hold all materialized tuples, and the directory contains slots with pointers. Inside tuple blocks, each tuple reserves a pointer space pointing to the previous tuple within the same hash slot; in this way, we implement chain-based collision handling.
For each tuple, the storage layout is: |key|payloads|prev|.
Inside the tuple payloads, we might have variable-length items, whose content will be stored separately in the overflowBlocks.
Hashing (based on the join key) can happen on either a flat vector or a list. For the join key, we only consider equality of nodeIDs for now (no joins on property values, and no multiple join conditions).
Query example for join on a flat vector: (a)->(b)->(c)->(d)->(e) WHERE a.id=1 AND e.id=1. This is a bidirectional query; the hash join happens on c.id, and the build side is the subquery (a)->(b)->(c), which starts from SCAN(a), then EXTEND(b) and EXTEND(c). Thus, c might be a list chunk.
Query example for join on a list: (a)->(b)->(c), (b)->(d)->(e). The hash join happens on b.id, and the build side is the subquery (a)->(b)->(c), which starts from SCAN(a), then EXTEND(b). Thus, b might be a flat chunk.
During materialization, there are two cases for a list chunk:
Currently, the hash table only supports hash on a single NodeID field, and we assume this is the main usage scenario for hash join, too.
HashTable(MemoryManager& memoryManager, vector<PayloadInfo>& payloadInfos);
// add a DataChunks into the hash table
void addDataChunks(
DataChunk& keyDataChunk, uint64_t keyVectorIdx, DataChunks& payloadDataChunks);
// initialize and construct the hash table: allocate the HT directory
void buildHTDirectory();
Usage example:
...
while (right->hasNextMorsel()) {
// fetch all data chunks from the right side, and add to ht
right->getNextTuples();
hashTable->addDataChunks(*rightKeyDataChunk, 0, buildDataChunks);
}
hashTable->buildHTDirectory();
...
Large intermediate memory allocations and de-allocations should be managed by a unified memory manager, which is separate from our buffer manager for now.
The memory manager can limit memory usage under a specified maxMemory size.
If the allocation is beyond the limit, the memory manager should perform one of the following:
Currently, we add a simple memory manager that is only responsible for allocating memory blocks (releasing is done by the requester by destructing the block handle). If an allocation exceeds the max memory size limit, an exception is thrown.
DO NOT USE THIS MEMORY MANAGER FOR OTHER OPERATORS FOR NOW
A more advanced one will be integrated later and described in a separate issue.
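The simple memory manager described above can be sketched in miniature as follows (illustrative names, not Kuzu's actual classes): allocation checks the running total against maxMemory and throws on overflow, and the block handle releases its memory back to the manager on destruction:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>

class MiniMemoryManager {
public:
    explicit MiniMemoryManager(uint64_t maxMemory) : maxMemory(maxMemory) {}

    // Releasing is done by the requester: destroying the handle returns the
    // block's size to the manager's budget.
    struct BlockHandle {
        std::unique_ptr<uint8_t[]> data;
        uint64_t size;
        MiniMemoryManager* manager;
        ~BlockHandle() { manager->used -= size; }
    };

    std::unique_ptr<BlockHandle> allocate(uint64_t size) {
        if (used + size > maxMemory) {
            throw std::runtime_error("memory limit exceeded");
        }
        used += size;
        return std::unique_ptr<BlockHandle>(
            new BlockHandle{std::make_unique<uint8_t[]>(size), size, this});
    }

    uint64_t usedMemory() const { return used; }

private:
    uint64_t maxMemory;
    uint64_t used = 0;
};
```

A more advanced version would react to hitting the limit by spilling or evicting instead of unconditionally throwing, per the options listed above.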
Currently, in types.h, we have system-level types such as gf_string_t, nodeID_t, etc., and we also have constants such as MAX_VECTOR_SIZE, NUM_BYTES_PER_NODE_ID, etc.
It is clearer to separate the constants from the types into a dedicated constants.h file as more system-level constants become available.