cwida / duckdb-pgq Goto Github PK
View Code? Open in Web Editor NEWThis project forked from duckdb/duckdb
DuckDB is an in-process SQL OLAP Database Management System
Home Page: http://www.duckdb.org
License: MIT License
This project forked from duckdb/duckdb
DuckDB is an in-process SQL OLAP Database Management System
Home Page: http://www.duckdb.org
License: MIT License
Should the following query be accepted?
CREATE PROPERTY GRAPH pg
VERTEX TABLES (
Student PROPERTIES ( * ) LABEL Person,
School as school_alias PROPERTIES ( * ) LABEL School IN School_kind (Hogeschool, University)
)
EDGE TABLES (
know SOURCE KEY ( src ) REFERENCES Student ( id )
DESTINATION KEY ( dst ) REFERENCES Student ( id )
PROPERTIES ( createDate ) LABEL Knows
)
On the following query:
CREATE PROPERTY GRAPH pg
VERTEX TABLES (
Student PROPERTIES ( id, name ) LABEL Person
)
EDGE TABLES (
know SOURCE KEY ( src ) REFERENCES Student ( id )
DESTINATION KEY ( dst ) REFERENCES Student ( id )
PROPERTIES ( createDate ) LABEL Knows
)
The following error is thrown:
Parser Error: syntax error at or near "name"
LINE 3: Student PROPERTIES ( id, name ) LABEL Person
CREATE TABLE Student(id BIGINT, name VARCHAR);
CREATE TABLE know(src BIGINT, dst BIGINT, createDate BIGINT);
INSERT INTO Student VALUES (0, 'Daniel'), (1, 'Tavneet'), (2, 'Gabor'), (3, 'Peter');
INSERT INTO know VALUES (0,1, 10), (0,2, 11), (0,3, 12), (1,2, 14), (1,3, 15), (2,3, 16);
CREATE PROPERTY GRAPH pg
VERTEX TABLES (
Student PROPERTIES ( id, name ) LABEL Person
)
EDGE TABLES (
know SOURCE KEY ( src ) REFERENCES Student ( id )
DESTINATION KEY ( dst ) REFERENCES Student ( id )
PROPERTIES ( createDate ) LABEL Knows
)
macOs 13 - Apple M1 Pro
6.0.1
CLI
Daniel ten Wolde
Centrum Wiskunde & Informatica
master
branch?We should write what the extension does, how the new syntax looks like, etc.
Perhaps also a logo
Add in check to not run path length functions when the CSR has not been initialized
(Currently creates a segfault, see test_path_length.test)
This was forgotten when migrating to the updated DuckDB repo
keywords such as GROUPS and PATH can now not be used outside of PGQ queries, but this should be possible.
See graphs here https://github.com/neo4j/neo4j/blob/5.1/community/community-it/algo-it/src/test/java/org/neo4j/kernel/impl/traversal/TestBidirectionalTraversal.java
(Also use the nice ascii art to avoid having to draw the graph a million times)
SQL/PGQ allows for every column mentioned in the PROPERTIES (list of columns) to have an alias. This is currently not supported
Adding support for this requires that the PropertyGraphTable->column_names will be slightly different.
For a given case Student PROPERTIES (id AS id_alias)
, the column id should exist in the table Student
Labels are not yet case insensitive, though they should be to avoid duplicate entries
There are currently some test cases in test/sql/function/sql-pgq that are outdated and should be updated. This issue should go through these test cases and make sure to update them
see title
Run the test/sql/sqlpgq/delete_csr.test
file. Make sure to add the query I
or statement ok
before the query as it will crash in the current state
macOs 13 - Apple M1 Pro
latest
CLI
Daniel ten Wolde
CWI
master
branch?Currently returns an empty string. Not sure what should exactly be in this.
Currently it is not possible to use both a basic pattern as well as a path-finding operation in the queries. We should be able to connect them (this will likely be one of the last things before we can formulate all SNB BI and interactive queries)
FROM GRAPH_TABLE (pg
MATCH (a:Person)-[k:Knows]->*(b:Person)-[s:StudyAt]->(u:University)
COLUMNS (*)
This would be effectively the same as create, but instead of throwing an error when the property graph already exists, it would drop the previous property graph and replace it with the new one
Should throw a catalog/Binder exception
Right now the translated query contains a LIMIT between lower and upper. However this is only enforced after the path-finding UDFs are finished. It would potentially save time if the limits are enforced in the UDF, to ensure the minimum required iterations. However, the added complexity might slow down the UDFs too much. To be investigated
Would require changes to the matchref_bind.cpp
and the path-finding UDFs
What should be the result (or equivalent query of) the following query:
FROM GRAPH_TABLE (pg
MATCH SHORTEST 5 (a:Person)-[k:Knows]->*(b:Person), (b:Person)-[k2:Knows]->(c:Person)
COLUMNS (*)
SELECT *
FROM pg
(SELECT *
FROM person a, knows k, person b, knows k2, person c
WHERE ...
LIMIT 5
)
I'm afraid that the LIMIT 5
will come too early and perhaps select the first 5 results of the entire join, while those might not be the 5 shortest who are then joined on the (c:Person)
Various things from the parser are not yet supported in the transformer and binder. In particular, keywords to define what properties are in the pg table
bool IterativeLengthFunctionData::Equals(const FunctionData &other_p) const {
// TODO: Change this to check if both are on same CSR
return true;
}
After CSR creation, run connected component analysis. An optimization would be to look at whether the source and destination are in the same connected component (store this in the CSR class). If they are not, you don't need to run the search since there will be no path found.
When we want to attach just a weight array to an already an existing array, there should perhaps be a new UDF to attach just a new array for the weights. Currently you would have to recreate the entire CSR
The query for path-finding should be something like:
WITH cte1 AS (
SELECT CREATE_CSR_EDGE(
0,
(SELECT count(a.id) FROM person a),
CAST (
(SELECT sum(CREATE_CSR_VERTEX(
0,
(SELECT count(a.id) FROM person a),
sub.dense_id,
sub.cnt)
)
FROM (
SELECT a.rowid as dense_id, count(k.person1id) as cnt
FROM person a
LEFT JOIN person_knows_person k ON k.person1id = a.id
GROUP BY a.rowid) sub
)
AS BIGINT),
a.rowid,
c.rowid,
k.rowid) as temp
FROM person_knows_person k
JOIN person a on a.id = k.person1id
JOIN person c on c.id = k.person2id
)
SELECT distinct a.name AS a_name, b.name AS b_name,
FROM (SELECT count(temp) * 0 AS temp FROM cte1) x, (SELECT a.id, a.rowid FROM person a WHERE a.id = 332) a, person b
where (x.temp + iterativelength(0, (SELECT count(c.id) FROM person c), a.rowid, b.rowid)) NOT NULL
Instead of throwing an error when a property graph doesn't exist, we could just ignore it and continue
See branch debug_generating_path_segfault
Test debug_generating_path.test with threads not limited, the result will be a segmentation fault seemingly because started_searches
goes out of range. Setting the number of threads to 1 will not cause this to happen.
Similar to cheapest path, template the number of lanes such that the minimal required of lanes is used depending on the number of source-destination pairs
Allow one of the columns in the COLUMNS() list to be ELEMENT_ID(), this will return the rowid for that vertex or edge. So it will probably have to be a.element_id() for instance(?)
Do many terminate early and wait for a lower number of lanes to finish?
The keyword DEFAULT LABEL is currently not supported.
Additionally, the list following LABEL discriminator (labelList) can currently contain duplicate labels. This should not be allowed since all labels should be unique.
Related to #12
Once a lane is finished it currently sits idle until all lanes have been completed. It could be an improvement to reuse lanes as soon as they are finished
Need to keep track of which iteration every lane is in. Can be done using a std::vector<int16_t> iteration(LANE_LIMIT, 0);
that increments every iteration. Once a destination has been found, the iteration for that lane is reset to 0 and is refilled with the next source-destination pair.
When registering a new UDF, make sure to add it to the list of files in the sqlpgq_config.py @vlowingkloude
Also now you have to make a pull request and (if correct) I have to approve the merge :)
Can be done by using hooks (figure out how)
This should be used when a weight is defined inside the edge
(a:Person)-[k:Knows COST <some expression>]->*(b:Person)
Loading the snb.duckdb dataset followed by these queries
CREATE PROPERTY GRAPH pg
VERTEX TABLES(
Person PROPERTIES(id,firstName) LABEL Person,
Tag PROPERTIES(id,name) LABEL Tag,
University PROPERTIES(id,name) LABEL University)
EDGE TABLES(
Person_knows_Person
SOURCE KEY(person1id) REFERENCES Person(id)
DESTINATION KEY(person2id) REFERENCES Person(id)
PROPERTIES(creationDate,person1id,person2id)
LABEL know,
Person_hasInterest_Tag
SOURCE KEY(personid) REFERENCES Person(id)
DESTINATION KEY(tagid) REFERENCES Tag(id)
PROPERTIES(personid,tagid)
LABEL hasInterest,
Person_studyAt_University
SOURCE KEY(personid) REFERENCES Person(id)
DESTINATION KEY(universityid) REFERENCES University(id)
PROPERTIES(personid,universityid,classyear)
LABEL studyAt)
SELECT study.university_name, study.id FROM GRAPH_TABLE (pg,
MATCH (:Person WHERE a.name='Bob')-[s:studyAt]->
(u:University)
COLUMNS (u.name as university_name, a.id as pid)) study
Apparently not necessary to have a comma according to the official spec ¯_(ツ)_/¯
see title
SELECT *
FROM GRAPH_TABLE(pg
MATCH (p:Person)-[w:worksAt]->(u:university)
COLUMNS (*)
) result
macOs 13 - Apple M1 Pro
latest
cli
Daniel ten Wolde
CWI
master
branch?see title
Set pragma enable_verification
and create a CSR
macOs 13 - Apple M1 Pro
latest
CLI
Daniel ten Wolde
CWI
master
branch?Inheritance will work as follows:
A vertex table can be defined as:
Organisation LABEL Organisation_kind IN (Company, University)
This means that there is a column Organisation_kind that is a BIGINT in which every row that is also part of the table company the value is 1 (0001). The row that is also a University has value (0010).
If the user then does a check on an inherited label, we need to add a clause to the where that is:
WHERE organisation_kind & <index of that label>
When true, it means that the row is part of the table . An element can only be part of one other table
SQL/PGQ marked cost as a language opportunity
(a:Person)-[k:Knows COST b_expr DefaultOptional)]->(b:Person)
This will be part of a subpath
SIGKILL on 16 threads bidirectional-new implementation with graphalytics graph500-22 dataset
The following query does not work:
import pandas as pd
import duckdb
df = pd.DataFrame([1,2,3,4,5])
df2 = pd.DataFrame([[0,1], [1,2], [2,3]])
con = duckdb.connect('')
con.execute('load sqlpgq')
con.execute("""CREATE PROPERTY GRAPH pg VERTEX TABLES ( df LABEL person ) EDGE TABLES ( df2 SOURCE KEY (i) REFERENCES df (j) DESTINATION KEY (i) REFERENCES df (j) LABEL knows);""")
We could add a feature that allows the creation of property graphs from a TableFunction, but we need to add support for the table replacement scan
See above
macOs 13 - Apple M1 Pro
latest
python
Daniel ten Wolde
CWI
master
branch?What should happen in the following case:
CREATE TABLE person(id BIGINT, name VARCHAR);
CREATE TABLE know(src BIGINT, dst BIGINT, createDate BIGINT);
CREATE TABLE employer(id BIGINT);
CREATE PROPERTY GRAPH pg
VERTEX TABLES (
person PROPERTIES ( id, name ) LABEL Person,
)
EDGE TABLES (
employs SOURCE KEY ( src ) REFERENCES employer ( id )
DESTINATION KEY ( dst ) REFERENCES Student ( id )
PROPERTIES ( createDate ) LABEL Employs
)
In this case the table employer
is not defined as a vertex table. Therefore this should not be valid?
Reading the docs the definition of an edge table includes:
The name of the destination vertex table.
[p.12]
But in this case, the table referenced is not a vertex table as defined on page 11.
Old bidirectional implementation is very slow on graph500-22 dataset
This query will return a list of rowids alternating between the vertex rowid and the edge rowid between a source and destination
We will only support (ANY) SHORTEST path for now, there is also opportunity for ALL
For the various algorithms, experiment with the number of lanes to use
8, 16, 32, 64, 128, 256, 512, 1024, 2048
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.