Git Product home page Git Product logo

duckdb-pgq's People

Contributors

azimafroozeh avatar bleskes avatar carlopi avatar chenxi8611 avatar diegomestre2 avatar douenergy avatar dtenwolde avatar hannes avatar hawkfish avatar hmb1 avatar jens-h avatar jwills avatar krlmlr avatar kryonix avatar lnkuiper avatar lokax avatar mause avatar maxxen avatar mytherin avatar papparapa avatar pdet avatar rj-atw avatar rsund avatar samansmink avatar stephaniewang526 avatar taniabogatsch avatar tdoehmen avatar tiagokepe avatar tishj avatar tmonster avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Forkers

zahidabasher

duckdb-pgq's Issues

Edge cases in create property graph

Should the following query be accepted?

CREATE PROPERTY GRAPH pg
VERTEX TABLES (
    Student PROPERTIES ( * ) LABEL Person,
    School as school_alias  PROPERTIES ( * ) LABEL School IN School_kind (Hogeschool, University)
    )
EDGE TABLES (
    know    SOURCE KEY ( src ) REFERENCES Student ( id )
            DESTINATION KEY ( dst ) REFERENCES Student ( id )
            PROPERTIES ( createDate ) LABEL Knows
    )

Implement shortest path

  • #7
  • Move to a different file
  • Base on the new iterative length code
  • Verify with Neo4j
  • Make output list of altering vertex id, edge id

Parser create property graph multiple properties

What happens?

On the following query:

CREATE PROPERTY GRAPH pg
VERTEX TABLES (
    Student PROPERTIES ( id, name ) LABEL Person
    )
EDGE TABLES (
    know    SOURCE KEY ( src ) REFERENCES Student ( id )
            DESTINATION KEY ( dst ) REFERENCES Student ( id )
            PROPERTIES ( createDate ) LABEL Knows
    )

The following error is thrown:

Parser Error: syntax error at or near "name"
LINE 3:     Student PROPERTIES ( id, name ) LABEL Person

To Reproduce

CREATE TABLE Student(id BIGINT, name VARCHAR);

CREATE TABLE know(src BIGINT, dst BIGINT, createDate BIGINT);

INSERT INTO Student VALUES (0, 'Daniel'), (1, 'Tavneet'), (2, 'Gabor'), (3, 'Peter');

INSERT INTO know VALUES (0,1, 10), (0,2, 11), (0,3, 12), (1,2, 14), (1,3, 15), (2,3, 16);

CREATE PROPERTY GRAPH pg
VERTEX TABLES (
Student PROPERTIES ( id, name ) LABEL Person
)
EDGE TABLES (
know SOURCE KEY ( src ) REFERENCES Student ( id )
DESTINATION KEY ( dst ) REFERENCES Student ( id )
PROPERTIES ( createDate ) LABEL Knows
)

OS:

macOs 13 - Apple M1 Pro

DuckDB Version:

6.0.1

DuckDB Client:

CLI

Full Name:

Daniel ten Wolde

Affiliation:

Centrum Wiskunde & Informatica

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Updating the README.md

We should write what the extension does, how the new syntax looks like, etc.
Perhaps also a logo

Add error handling when CSR has not been created

Add in check to not run path length functions when the CSR has not been initialized
(Currently creates a segfault, see test_path_length.test)

This was forgotten when migrating to the updated DuckDB repo

Column aliases

SQL/PGQ allows for every column mentioned in the PROPERTIES (list of columns) to have an alias. This is currently not supported

Adding support for this requires that the PropertyGraphTable->column_names will be slightly different.

For a given case Student PROPERTIES (id AS id_alias), the column id should exist in the table Student

Clean up tests

There are currently some test cases in test/sql/function/sql-pgq that are outdated and should be updated. This issue should go through these test cases and make sure to update them

Create CSR leads to segfault in delete_csr.test

What happens?

see title

To Reproduce

Run the test/sql/sqlpgq/delete_csr.test file. Make sure to add the query I or statement ok before the query as it will crash in the current state

OS:

macOs 13 - Apple M1 Pro

DuckDB Version:

latest

DuckDB Client:

CLI

Full Name:

Daniel ten Wolde

Affiliation:

CWI

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Connect basic pattern and path-finding

Currently it is not possible to use both a basic pattern as well as a path-finding operation in the queries. We should be able to connect them (this will likely be one of the last things before we can formulate all SNB BI and interactive queries)

FROM GRAPH_TABLE (pg
MATCH (a:Person)-[k:Knows]->*(b:Person)-[s:StudyAt]->(u:University)
COLUMNS (*)

Support CREATE OR REPLACE PROPERTY GRAPH

This would be effectively the same as create, but instead of throwing an error when the property graph already exists, it would drop the previous property graph and replace it with the new one

Create a test case that creates a property graph on non-existing tables

Should throw a catalog/Binder exception

  • Table does not exist
  • Column of a table does not exist
  • FK of a table does not exist
  • Primary keys are not unique
  • PK of referenced source table does not exist
  • PK of referenced destination table does not exist
  • FK of source of edge table does not exist
  • Multiple PKs are referenced from source/destination table (How would this work?)

Try pushing the limits of path queries into the UDFs

Right now the translated query contains a LIMIT between lower and upper. However this is only enforced after the path-finding UDFs are finished. It would potentially save time if the limits are enforced in the UDF, to ensure the minimum required iterations. However, the added complexity might slow down the UDFs too much. To be investigated

Would require changes to the matchref_bind.cpp and the path-finding UDFs

Using topk with multiple patterns

What should be the result (or equivalent query of) the following query:

FROM GRAPH_TABLE (pg
MATCH SHORTEST 5 (a:Person)-[k:Knows]->*(b:Person), (b:Person)-[k2:Knows]->(c:Person)
COLUMNS (*)
SELECT * 
FROM pg
	(SELECT * 
	FROM person a, knows k, person b, knows k2, person c
	WHERE ...
	LIMIT 5
	)

I'm afraid that the LIMIT 5 will come too early and perhaps select the first 5 results of the entire join, while those might not be the 5 shortest who are then joined on the (c:Person)

Support the EXCEPT list for a property graph table

Various things from the parser are not yet supported in the transformer and binder. In particular, keywords to define what properties are in the pg table

  • NO PROPERTIES
  • PROPERTIES ALL COLUMNS
  • PROPERTIES ALL COLUMNS EXCEPT (list of columns)
  • Optional keyword ARE should also be checked if parsed correctly

Improve Equals function

bool IterativeLengthFunctionData::Equals(const FunctionData &other_p) const {
	// TODO: Change this to check if both are on same CSR
	return true;
}

Connected component before searching

After CSR creation, run connected component analysis. An optimization would be to look at whether the source and destination are in the same connected component (store this in the CSR class). If they are not, you don't need to run the search since there will be no path found.

Add columns to CSR

When we want to attach just a weight array to an already an existing array, there should perhaps be a new UDF to attach just a new array for the weights. Currently you would have to recreate the entire CSR

Sync to the latest DuckDB version

  • CSR
  • Shortest path length
  • Shortest path
  • Cheapest path
  • Script to create python package
  • Test cases
  • SNB SF 1 person and person_knows_person data sets
  • Pragma delete CSR

Reduce the number of pairs to compute path length for

The query for path-finding should be something like:

WITH cte1 AS (
    SELECT  CREATE_CSR_EDGE(
            0,
            (SELECT count(a.id) FROM person a),
            CAST (
                (SELECT sum(CREATE_CSR_VERTEX(
                            0,
                            (SELECT count(a.id) FROM person a),
                            sub.dense_id,
                            sub.cnt)
                            )
                FROM (
                    SELECT a.rowid as dense_id, count(k.person1id) as cnt
                    FROM person a
                    LEFT JOIN person_knows_person k ON k.person1id = a.id
                    GROUP BY a.rowid) sub
                )
            AS BIGINT),
            a.rowid,
            c.rowid,
            k.rowid) as temp
    FROM person_knows_person k
    JOIN person a on a.id = k.person1id
    JOIN person c on c.id = k.person2id
)
SELECT distinct a.name AS a_name, b.name AS b_name,
FROM (SELECT count(temp) * 0 AS temp FROM cte1) x, (SELECT a.id, a.rowid FROM person a WHERE a.id = 332) a, person b
    where (x.temp + iterativelength(0, (SELECT count(c.id) FROM person c), a.rowid, b.rowid)) NOT NULL

Multithreaded iterativelength getting a segfault

See branch debug_generating_path_segfault
Test debug_generating_path.test with threads not limited, the result will be a segmentation fault seemingly because started_searches goes out of range. Setting the number of threads to 1 will not cause this to happen.

Return ELEMENT_ID()

Allow one of the columns in the COLUMNS() list to be ELEMENT_ID(), this will return the rowid for that vertex or edge. So it will probably have to be a.element_id() for instance(?)

DEFAULT LABEL

The keyword DEFAULT LABEL is currently not supported.
Additionally, the list following LABEL discriminator (labelList) can currently contain duplicate labels. This should not be allowed since all labels should be unique.

Implement reusing lanes once they are finished

Related to #12

Once a lane is finished it currently sits idle until all lanes have been completed. It could be an improvement to reuse lanes as soon as they are finished
Need to keep track of which iteration every lane is in. Can be done using a std::vector<int16_t> iteration(LANE_LIMIT, 0); that increments every iteration. Once a destination has been found, the iteration for that lane is reset to 0 and is refilled with the next source-destination pair.

Implement CHEAPEST

This should be used when a weight is defined inside the edge

(a:Person)-[k:Knows COST <some expression>]->*(b:Person)

Unaliased pattern leads to segfault

Loading the snb.duckdb dataset followed by these queries

CREATE PROPERTY GRAPH pg 
VERTEX TABLES(
 Person     PROPERTIES(id,firstName) LABEL Person,
 Tag        PROPERTIES(id,name)      LABEL Tag,
 University PROPERTIES(id,name)      LABEL University)
EDGE TABLES(
 Person_knows_Person  
  SOURCE KEY(person1id) REFERENCES Person(id) 
  DESTINATION KEY(person2id) REFERENCES Person(id)
  PROPERTIES(creationDate,person1id,person2id)
  LABEL know, 
 Person_hasInterest_Tag 
  SOURCE KEY(personid) REFERENCES Person(id)
  DESTINATION KEY(tagid) REFERENCES Tag(id)
  PROPERTIES(personid,tagid) 
  LABEL hasInterest,
 Person_studyAt_University 
  SOURCE KEY(personid) REFERENCES Person(id)
  DESTINATION KEY(universityid) REFERENCES University(id)
  PROPERTIES(personid,universityid,classyear)
  LABEL studyAt)
SELECT study.university_name, study.id FROM GRAPH_TABLE (pg, 
  MATCH (:Person WHERE a.name='Bob')-[s:studyAt]->
        (u:University)
  COLUMNS (u.name as university_name, a.id as pid)) study

Crash when * in column clause

What happens?

see title

To Reproduce

SELECT *
FROM GRAPH_TABLE(pg
    MATCH (p:Person)-[w:worksAt]->(u:university)
    COLUMNS (*)
    ) result

OS:

macOs 13 - Apple M1 Pro

DuckDB Version:

latest

DuckDB Client:

cli

Full Name:

Daniel ten Wolde

Affiliation:

CWI

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

CSR Creation seg faults when running with pragma enable_verification

What happens?

see title

To Reproduce

Set pragma enable_verification and create a CSR

OS:

macOs 13 - Apple M1 Pro

DuckDB Version:

latest

DuckDB Client:

CLI

Full Name:

Daniel ten Wolde

Affiliation:

CWI

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Support inheritance

Inheritance will work as follows:
A vertex table can be defined as:
Organisation LABEL Organisation_kind IN (Company, University)
This means that there is a column Organisation_kind that is a BIGINT in which every row that is also part of the table company the value is 1 (0001). The row that is also a University has value (0010).
If the user then does a check on an inherited label, we need to add a clause to the where that is:
WHERE organisation_kind & <index of that label>

When true, it means that the row is part of the table . An element can only be part of one other table

Cost expression in subpath

SQL/PGQ marked cost as a language opportunity
(a:Person)-[k:Knows COST b_expr DefaultOptional)]->(b:Person)
This will be part of a subpath

Replacement scan for creating and querying SQL/PGQ graphs

What happens?

The following query does not work:

import pandas as pd
import duckdb
df = pd.DataFrame([1,2,3,4,5])
df2 = pd.DataFrame([[0,1], [1,2], [2,3]])
con = duckdb.connect('')
con.execute('load sqlpgq')
con.execute("""CREATE PROPERTY GRAPH pg VERTEX TABLES ( df LABEL person ) EDGE TABLES ( df2 SOURCE KEY (i) REFERENCES df (j) DESTINATION KEY (i) REFERENCES df (j) LABEL knows);""")

We could add a feature that allows the creation of property graphs from a TableFunction, but we need to add support for the table replacement scan

To Reproduce

See above

OS:

macOs 13 - Apple M1 Pro

DuckDB Version:

latest

DuckDB Client:

python

Full Name:

Daniel ten Wolde

Affiliation:

CWI

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Edge table references a table that is not defined as a vertex table

What should happen in the following case:

CREATE TABLE person(id BIGINT, name VARCHAR);
CREATE TABLE know(src BIGINT, dst BIGINT, createDate BIGINT);
CREATE TABLE employer(id BIGINT);


CREATE PROPERTY GRAPH pg
VERTEX TABLES (
    person PROPERTIES ( id, name ) LABEL Person,
    )
EDGE TABLES (
    employs    SOURCE KEY ( src ) REFERENCES employer ( id )
            DESTINATION KEY ( dst ) REFERENCES Student ( id )
            PROPERTIES ( createDate ) LABEL Employs
    )

In this case the table employer is not defined as a vertex table. Therefore this should not be valid?
Reading the docs the definition of an edge table includes:
The name of the destination vertex table. [p.12]

But in this case, the table referenced is not a vertex table as defined on page 11.

Create support for SHORTEST path

This query will return a list of rowids alternating between the vertex rowid and the edge rowid between a source and destination
We will only support (ANY) SHORTEST path for now, there is also opportunity for ALL

Optimal number of lanes

For the various algorithms, experiment with the number of lanes to use
8, 16, 32, 64, 128, 256, 512, 1024, 2048

  • Shortest path
  • Bidirectional path length
  • Shortest path length
  • Cheapest path length

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.