
PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

Home Page: https://prql-lang.org

License: Apache License 2.0

Rust 83.72% JavaScript 4.24% HTML 2.18% CSS 2.80% Python 0.38% Shell 0.32% Java 0.99% Dockerfile 0.07% Elixir 0.82% C 0.56% PHP 1.03% C# 1.07% Makefile 0.09% C++ 0.43% Handlebars 0.02% Zig 0.38% TypeScript 0.69% Nix 0.21%
data pipeline sql

prql's Introduction

PRQL

Website Playground Language Docs Discord


Pipelined Relational Query Language, pronounced "Prequel".

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement. Like SQL, it's readable, explicit and declarative. Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it compiles to SQL.

PRQL can be as simple as:

from tracks
filter artist == "Bob Marley"                 # Each line transforms the previous result
aggregate {                                   # `aggregate` reduces each column to a value
  plays    = sum plays,
  longest  = max length,
  shortest = min length,                      # Trailing commas are allowed
}
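
This compiles to roughly the following SQL (a sketch; the exact output depends on the compiler version and target dialect):

SELECT
  SUM(plays) AS plays,
  MAX(length) AS longest,
  MIN(length) AS shortest
FROM tracks
WHERE artist = 'Bob Marley'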

Here's a fuller example of the language:

from employees
filter start_date > @2021-01-01               # Clear date syntax
derive {                                      # `derive` adds columns / variables
  gross_salary = salary + (tax ?? 0),         # Terse coalesce
  gross_cost = gross_salary + benefits_cost,  # Variables can use other variables
}
filter gross_cost > 0
group {title, country} (                      # `group` runs a pipeline over each group
  aggregate {                                 # `aggregate` reduces each group to a value
    average gross_salary,
    sum_gross_cost = sum gross_cost,          # `=` sets a column name
  }
)
filter sum_gross_cost > 100_000               # `filter` replaces both of SQL's `WHERE` & `HAVING`
derive id = f"{title}_{country}"              # F-strings like Python
derive country_code = s"LEFT(country, 2)"     # S-strings allow using SQL as an escape hatch
sort {sum_gross_cost, -country}               # `-country` means descending order
take 1..20                                    # Range expressions (also valid here as `take 20`)
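
A hedged sketch of roughly the SQL this compiles to (the real output differs in details such as CTE and column naming):

WITH table_0 AS (
  SELECT
    title,
    country,
    AVG(salary + COALESCE(tax, 0)) AS _expr_0,  -- the unnamed `average gross_salary`
    SUM(salary + COALESCE(tax, 0) + benefits_cost) AS sum_gross_cost
  FROM employees
  WHERE start_date > DATE '2021-01-01'
    AND salary + COALESCE(tax, 0) + benefits_cost > 0
  GROUP BY title, country
)
SELECT
  title,
  country,
  _expr_0,
  sum_gross_cost,
  CONCAT(title, '_', country) AS id,
  LEFT(country, 2) AS country_code
FROM table_0
WHERE sum_gross_cost > 100000
ORDER BY sum_gross_cost, country DESC
LIMIT 20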

For more on the language, more examples & comparisons with SQL, visit prql-lang.org. To experiment with PRQL in the browser, check out PRQL Playground.

Current Status - April 2024

PRQL is being actively developed by a growing community. It's ready to use by the intrepid, either with our supported integrations, or within your own tools, using one of our supported language bindings.

PRQL still has some bugs and some missing features, and is probably only ready to be rolled out to non-technical teams for fairly simple queries.

In particular, we're working on a new resolver, which will let us squash many bugs and simplify our code a lot. It'll also let us scale the language without scaling the complexity of the compiler.

While we work on that, we're also focusing on:

  • Ensuring our supported features feel extremely robust, and resolving any priority bugs. As more folks have started using PRQL, we've had more bug reports — good news, but it also gives us more to work on.
  • Filling remaining feature gaps, so that PRQL can be used for almost all standard SQL queries.
  • Expanding our set of supported features — we've recently added experimental support for modules / multi-file projects, and for auto-formatting.

We're also spending time thinking about:

  • Making it really easy to start using PRQL. We're doing that by building integrations with tools that folks already use; for example our VS Code extension & Jupyter integration. If there are tools you're familiar with that you think would be open to integrating with PRQL, please let us know in an issue.
  • Whether all our initial decisions were correct — for example how we handle window functions outside of a window transform.
  • Making it easier to contribute to the compiler. We have a wide group of contributors to the project, but contributions to the compiler itself are quite concentrated. We're keen to expand this; see #1840 for feedback, and some suggestions for starter issues are below.

Get involved

To stay in touch with PRQL:

  • Follow us on Twitter
  • Join us on Discord
  • Star this repo
  • Contribute — join us in building PRQL, through writing code (send us your use-cases!), or inspiring others to use it.
  • See the development documentation for PRQL. It's easy to get started — the project can be built in a couple of commands, and we're a really friendly community!
  • For those who might be interested in contributing to the code now, check out issues with the good first issue label. Always feel free to ask questions or open a draft PR.

Explore

Repo organization

This repo is composed of:

  • prqlc — the compiler, written in Rust, whose main role is to compile PRQL into SQL. It also contains the CLI and bindings for various languages.
  • web — our web content: the Book, Website, and Playground.

It also contains our testing / CI infrastructure and development tools. Check out our development docs for more details.

Contributors

Many thanks to those who've made our progress possible:


prql's People

Contributors

aaronmoat, aljazerzen, arrizalamin, bcho, blurrechdev, dependabot[bot], eitsupi, github-actions[bot], iafisher, jelenkee, k-nut, kasvith, kiibo382, kwigley, marinpostma, matthias-q, max-sixty, mklopets, nkicg6, not-my-profile, philpep, pre-commit-ci[bot], prettywood, qharlie, richb-hanover, rog0d, shuozeli, snth, vanillajonathan, vthriller


prql's Issues

Formal grammar spec for prql

A formal grammar specification would allow others to easily create parsers. Do you have a preference, PEG or BNF?

About Handling of NULL

If this has already been discussed somewhere and a conclusion has been reached, please close.

When checking whether a column is null, which of these do you use?

  • IS NULL
  • = null
  • Other
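
For context, standard SQL treats the two very differently; a small sketch (employees/tax are just illustrative names):

-- `= NULL` uses three-valued logic and never matches, even for rows where the value is NULL
SELECT * FROM employees WHERE tax = NULL;   -- always returns zero rows

-- `IS NULL` is the predicate that actually tests for missing values
SELECT * FROM employees WHERE tax IS NULL;  -- returns the rows where tax is missing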

Add subcommand to CLI

Currently the prql binary has no subcommands — it always writes SQL (or AST depending on the arguments). This limits us from doing other things in the future, like querying a DB, or opening a console like PyPrql does.

Let's add a subcommand like write-sql, even if it's initially the only one. (Open to other names, though having all subcommands be verbs is nice for consistency.)

Inline filters

Something that's come up a couple of times (discussion of the ternary operator in https://github.com/max-sixty/prql/issues/11, the Malloy impl in https://github.com/max-sixty/prql/issues/71) is the concept of inline filters:

Taken from Malloy's "filtering measures", note the COUNT( CASE....

SELECT 
   base.state as state,
   COUNT( CASE WHEN base.fac_type='AIRPORT' THEN 1 END) as airports,
   COUNT( CASE WHEN base.fac_type='HELIPORT' THEN 1 END) as heliports,
   COUNT( 1) as total
FROM `malloy-data.faa.airports` as base
GROUP BY 1
ORDER BY 2 desc

This allows for aggregating over a subset of rows for only that field.

Here's a truncated example from our readme:

func if_valid x = is_valid_price ? x : null

from prices
derive [
  return_total:      prices_adj   | ret | if_valid
  return_usd:        prices_usd   | ret | if_valid
]

With an inline filter, this could be:

from prices
derive [
  return_total:      prices_adj   | ret | filter is_valid_price
  return_usd:        prices_usd   | ret | filter is_valid_price
]

Or with an expression rather than a column:

from prices
derive [
  return_total:      prices_adj   | ret | filter price > 1
  return_usd:        prices_usd   | ret | filter price > 1
]

We'd need to think through what restrictions we'd need on the filter — is any expression allowed?
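
For reference, when compiling such an inline filter to SQL, the target could be either a CASE expression (as in the Malloy output above) or, on engines that support it, the standard aggregate FILTER clause; a schematic sketch (column names illustrative, and the ret step is omitted):

SELECT
  SUM(prices_adj) FILTER (WHERE is_valid_price) AS return_total,
  SUM(prices_usd) FILTER (WHERE is_valid_price) AS return_usd
FROM prices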

Handle simple inline pipelines

Currently the (x | lag_day) in func ret x = x / (x | lag_day) - 1 + dividend_return doesn't work. There are some notes on this in prql.pest and elsewhere. I don't think there's much ambiguity here on the intended result. Some things we need to do:

  • Parse these into an InlinePipeline struct
  • Change the Materialize functions to handle them the same as functions.

Issues like https://github.com/max-sixty/prql/issues/9 can wait, I think; this issue is just about making them equivalent to functions.

recursive sql / prql

Hello, I'm very interested in how you would approach recursion in PRQL; for instance, how would you rewrite this simple SQL recursion:

with numbers as (
  select 1 as n
  union all
  select n + 1 from numbers where n <= 10
)
select * from numbers

Love the idea of PRQL, best of luck!

Syntax for adding columns

tfehring made a good observation:

  1. Did you consider using a keyword like let for column declarations, e.g. let gross_salary = salary + payroll_tax instead of just gross_salary = salary + payroll_tax? It's nice to be able to scan for keywords along the left side of the window, even if it's a bit more verbose.

Currently the syntax is like:

from employees
gross_salary = salary + payroll_tax            # This _adds_ a column / variable.

It's not great. tfehring's keyword scanning point is good.

As well as the let option, Kusto uses extend; dplyr uses mutate; pandas uses assign. I would find it easier to read if we kept the = — i.e. let gross_salary = salary + payroll_tax rather than let gross_salary (salary + payroll_tax) — even if it would be an exception to other functions.

Add CLI

It would be good to have a CLI which can take PRQL and do things like:

  • Produce SQL, using writer.rs (most important, it'd be fine if this is all it does tbh)
  • Produce an AST, using parser.rs
  • Produce a parse tree, using ast.rs

This will provide the basis of the tool for producing SQL, and provide a simple entry point for people to see intermediate data.

If anyone wants to take this on, please feel free to start with a very small PR.

Materialize functions within functions

Currently:

func lag_day x = s"lag_day_todo({x})"
func ret x = x / (lag_day x) - 1 + dividend_return

from a
select (ret b)

materializes the select as:

- Select:
    - Items:
        - Terms:
            - Ident: b
            - Raw: /
            - Items:
                - Terms:
                    - Ident: lag_day      # <- Doesn't run the function
                    - Ident: b
            - Raw: "-"
            - Raw: "1"
            - Raw: +
            - Ident: dividend_return

We should fix!

Keywords and Builtin functions for PRQL

Just trying to capture this somewhere, so far I have:

Builtins

  • from
  • join (side)
  • select
  • aggregate (by)
  • sort (order)
  • take
  • derive
  • func
  • table
  • value
  • let

Aggregate functions

  • count
  • sum
  • min
  • max
  • average
  • stddev
  • variance

Operators

  • [ ], lists
  • ( ), code body
  • x:y, passing named argument x with value y
  • |, pipes results to the next function

What's missing?

Translate Joins to SQL

This will likely require a few PRs:

  • Parse into a newly created Joins struct
  • Materialize (easy, nothing special here)
  • Translate into SQL (could be a decent effort, e.g. will require thinking about when we need to start a new CTE)
  • Decide how to handle default values / corner cases / etc

Cool things in Malloy

I've been spending some time looking at Malloy, a language from Looker that looks promising. I've been a fan of @lloydtabb's for a while, and briefly chatted to him about Malloy a couple of weeks ago.

As ever with these "better SQL" languages — our fates are positively correlated. So I hope it's collegial to explore the things Malloy does particularly well.

Without a full review of the differences — Malloy seems more focused on a broader vision of data modeling, with structure on top of existing tables — like explores — blocks that combine tables — and dimensions & measures (these are things I always liked in Looker). The language integrates an approach for defining charts. There's less emphasis on pipelines — ordering isn't semantically important — though there is a way to pipe queries into other queries. It's more typing to get started.

Some things that look compelling at first glance:

  • Inline filters:
    ny_flights is count() { where: origin.state = 'NY' }
    compiles to
    COUNT( CASE WHEN origin_0.state='NY' THEN 1 END)  as ny_flights
    I'm not sure how general these can be; presumably they can't be full where clauses like Malloy's normal where clauses.
  • Small one — or syntax: where: carrier: 'UA' | 'AA'
  • Nesting — similar to the discussion in our readme, this is executed quite well, with a nest block giving an array. So:
    query: airports -> {
      group_by: state
      aggregate: airport_count
      nest: by_facility is {
        group_by: fac_type
        aggregate: airport_count
      }
    }
    ...produces a nested array (SQL is mangled so showing the result)
    [
      {
        "state": "TX",
        "airport_count": 1845,
        "by_facility": [
          {
            "fac_type": "AIRPORT",
            "airport_count": 1389
          },
          {
            "fac_type": "HELIPORT",
            "airport_count": 435
          },
          ...
  • Joins —
    https://looker-open-source.github.io/malloy/documentation/language/join.html
    I rather like that it limits SQL's complexity — most joins in practice are left joins using primary keys. So this is nice:
    explore: flights is table('malloy-data.faa.flights') {
      join_one: origin_airport is airports with origin
      join_one: destination_airport is airports with destination
      join_one: aircraft with tail_num
    }
    This works because the primary key has been defined above (similar to what we discussed on Discord around defining relationships separately, reducing verbosity for inline joins):
    explore: airports is table('malloy-data.faa.airports') {
      primary_key: code
    }
    It is possible to do less standard joins as an escape hatch.
  • Timestamps — small one — they prepend timestamps with @, like { where: dep_time: @2003 to @2005 }, and have attribute access with functions (which seem to be both prefix and suffix?) dep_time.month / day_of_year(some_date)

That's all for now. I think it'd be good to keep in touch with the Malloy team and would welcome any cross-pollination.

Datetimes and timedeltas

Postgres has a lot of types, with arithmetic behavior that makes sense. Should PRQL support a way to express datetimes and timedeltas?

Postgres example:

SELECT * FROM table
  WHERE date > now() - 2 * INTERVAL '1 year';

Check if table exists, UPSERT, REPLACE

Rationale

A large part of working with SQL databases is the maintenance of statistics. E.g. you run a predictive engine (ML or simple regressions) and some variables are current stats, like "the number of orders in the last hour". These tables are then queried often by predictive services or libraries.

Currently, we chose to do this with normal tables, and have to truncate or recreate them. Other ways of doing this with SQL would be VIEW or MATERIALIZED VIEW. But their main drawback is that they must be in the same DB as the source table, and block maintenance on it: ALTER TABLE becomes impossible (or requires recreation of the views).

That's why many use some Python code to extract data, transform it, and load it into another DB. (An important requirement is to have just one such Python script per feature, that does both initialization and updates. Because any business will generate many features, and doubling them and scattering the logic creates a lot of extra pain.)

Such operations require several kinds of actions:

  1. UPSERT

Example 1: re-calculate total orders per city/client every day and store in a separate table (for performance reasons).
Example 2: maintain log-like table: select new orders from source table, but only a set of fields, transform, insert into another DB.

Without upsert, you have to run a program to do maintenance and call SQL scripts.

  1. check if the target stats table exists

  2. create if it doesn't (because we want to do both creation and update in one code unit)

  3. check if the stat is recent, or requires an update (this can be necessary if we had a large system outage and aren't sure our logs are correct)

  4. if this is on another DB, you also have to create temporary tables and insert the aggregated data into them

  5. after all that, run two SQL commands:

    insert into city_stat
    select o.city_id, count(*) orders, sum(o.cost) cost, current_time as update_timestamp
    from orders o left join city_stat c on o.city_id=c.id
    where c.id is null
    group by o.city_id;

    update city_stat c
    set orders = s.orders, cost = s.cost, update_timestamp = current_time
    from (
      select o.city_id, count(*) as orders, sum(o.cost) as cost
      from orders o
      group by o.city_id
    ) s
    where c.id = s.city_id;
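
For comparison, engines with a native upsert can collapse the pair above into one statement; a hedged sketch in PostgreSQL's ON CONFLICT syntax, assuming city_stat has a unique key on id:

    insert into city_stat (id, orders, cost, update_timestamp)
    select o.city_id, count(*), sum(o.cost), current_timestamp
    from orders o
    group by o.city_id
    on conflict (id) do update
      set orders = excluded.orders,
          cost = excluded.cost,
          update_timestamp = excluded.update_timestamp;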

Upserts in noSQL databases make this quite trivial:

upsert db/table
from <select query>

Note 1: it should probably support table recreation.

Note 2: it should also support increments to existing and to non-existing records. E.g.

upsert city_stats
set orders = (orders or 0) + i where city_id = j
from (some input data here ....)
  2. REPLACE table data

Sometimes you need to keep just current stats, e.g. orders within last 1 hour per district/neighborhood.

The pandas package recognized this need and made it possible with the to_sql function:

 city_orders_stat.to_sql('city_stat', sql_alchemy_connection_obj, if_exists='replace')

in this case the table will be dropped and recreated.

In SQL, it must be broken down into several stages (though see the sketch after this list for engines with a shortcut):

  1. test if target table exists (again, this should be done, because we want to keep minimum number of scripts per feature)
  2. drop or truncate
  3. recreate if necessary
  4. create as select or select if in another db
  5. insert if in another db
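
On engines that support CREATE OR REPLACE TABLE ... AS (e.g. BigQuery, Snowflake, DuckDB), several of these stages collapse into a single statement; a hedged sketch with illustrative column names:

-- drops and recreates the table atomically from the query result (exact interval syntax varies by dialect)
create or replace table city_stat as
select city_id, count(*) as orders, sum(cost) as cost
from orders
where order_time > current_timestamp - interval '1 hour'
group by city_id;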

Thoughts on dbt

Hey Maximilian!

I found this yesterday on HN and was impressed - really love how it looks.
I see that the intent of PRQL is to replace SQL for analytical purposes. That's really ambitious! However, if you really want to become the de facto tool for analytics, you not only have to replace SQL but also dbt.

Do you have any thoughts on this?

Off the top of my head, there is some stuff that PRQL should be at least as good as dbt at, in order for it to become feasible for data folks:

  • Built-in testing
  • Documentation
  • Modularisation & Model Lineage (i.e. being able to refer model A in model B, and create a DAG that showcases that relation)

At the same time, there are some advantages prql could offer against dbt:

  • (Presumably) faster compilation times, due to it being written in Rust.
  • Column-level lineage: This might be naive, but if you were able to identify code that called a source table, and made it impossible for the user to perform a ‘select * from source’, you would be able to track the origin of that column from their child models.
  • Built-in formatter, a la gofmt. If you did this, you could make some behaviour that can lead to bugs impossible, as sqlfluff tries to do now (though it should be much easier due to only building it against one language). For example, when joining two or more tables, not explicitly adding the parent table of a given column. Mind you, if you could assert that, you would be able to make sure that column-level lineage is right, opening the door to other features (such as documentation inheritance: explain your column once and let your child models use that same definition).
  • No jinja.

I believe the exercise of thinking about what a possible killer app prql could offer against dbt is worth doing. Right now I haven’t thought of any, but I will follow up if I do.

Hope to hear from you soon!

Implement named args

Named args are parsed in function calls, but not definitions. They're also not materialized. The by argument to aggregate is handled specifically.

If anyone wants to work on the parser — probably the best part of the codebase and well-documented — parsing named args would be a great contribution (also in #26)

Using sqlparser-rs for SQL AST

We're getting to the point where we're converting PRQL AST to SQL AST, and then we'll need to convert the SQL AST to text.

sqlparser-rs is predominantly for parsing SQL — i.e. going the other direction.

But it has:

It looks excellent quality, something I think we'd love to build on.

This would be "huge if true" — so adding this issue to socialize the idea. I'll optimistically tag @alamb as they've been the main contributor to the repo very recently, in case they have a view.

Transpiling from SQL to PRQL

This is a longer-term idea — but now we're using sqlparser to write SQL, we could also use it to read SQL and produce PRQL.

We'd need to write:

  1. a fairly simple compiler from sqlparser AST to PRQL AST
  2. Display implementations which convert PRQL AST to PRQL code

This would be useful for people transitioning from SQL. At least initially, the main users are likely to be engineers who are frustrated by SQL because they've written a lot of it.

We wouldn't need to support all of SQL for it to be useful — there's a fairly small subset that's used a lot — and possibly a larger subset which we could implement with s-strings.

We also wouldn't need to automatically do things like variable "creation" (e.g. replacing salary + payroll_tax + benefits_cost with gross_cost), which would be much more difficult (but great if we could!).

`count` vs `count *`

In the example functions, there are two functions with the same name:

func count = s"COUNT(*)"
func count column = s"COUNT({column})"

We need to decide how to handle these.

Options:

  • Don't allow this — count * would be required rather than just count
    • This would be tempting to use in the short term, but has the disadvantage that we'd need to special-case the *, since it's currently an operator (i.e. foo * 3). I don't think it's a CS PhD to implement, but it's the sort of special-casing that causes issues later, both in the code and our ability to provide good error messages.
    • We could find a different symbol than *, but that's also unfamiliar.
  • Multiple dispatch — i.e. the compiler allows for multiple functions with the same name as long as they have different signatures, and it selects the function that matches the signature. In this case the number of positional args defines the signature. So the function definitions above would be correct.
  • Use a named argument — so func count column="*" = s"COUNT({column})". But then people have to write count column:salary, which isn't friendly — and either we do it everywhere, or it's unfamiliar in the count case (there's no sum *).
  • Allow optional positional args, and find a different way of specifying named args. As a reminder, this is our current function specification, and it has the advantage that it's extremely simple — only two types of arguments, with named arguments defined by having a default. This is Python's approach.

JSON queries

Believe it or not, many use SQL tables as a document store, dumping massive amounts of JSON into a column and querying these.
Each SQL engine manipulates the JSON objects differently:

Postgres provides various homegrown infix operators to filter and query the JSON objects, while MySQL and Sqlite simply provide a bunch of functions that take the json blob as input.
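
For example, the same lookup in the two styles (a sketch; payload is an assumed JSON column):

-- PostgreSQL: infix operators (`->` returns JSON, `->>` returns text)
SELECT payload -> 'user' ->> 'name' FROM events;

-- MySQL / SQLite: function style with a JSON path
SELECT json_extract(payload, '$.user.name') FROM events;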

  • Should such a syntax be based on
    • postgres-like infix operators
    • functions like sqlite/mysql
    • C/python/javascript/rust inspired hashmap operators (a[b]), optionally with null-conditional and coalescing operators like in c# (a?[b ?? 0])
    • Or instead be a DSL inspired by jq
  • Should PRQL even try to address JSON?

An exclude clause that would select all columns except what's specified

BigQuery has a SELECT * EXCEPT, but otherwise I don't think it's a widely supported feature.

But it would be a neat feature when and if it's more supported.

One method might be to select into a temp table and then alter that table; it would take some exploring to see how widely supported that is.
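
For reference, the BigQuery form looks roughly like this (column names illustrative):

SELECT * EXCEPT (salary, tax)
FROM employees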

An implementation of a similar idea

This is an exciting project to stumble across. I had very similar frustrations with SQL last year so I hacked on a prototype until I was happy. My implementation looks a lot like your spec. There are differences - certainly in syntax - but the differences are shallow.

My project has been my daily SQL driver for at least half a year now and works well - so I can validate your proposal makes sense. Being able to iterate faster on queries has completely changed my workflow.

I just uploaded the project so you can take a look: https://github.com/gregsexton/gql

I don't have an agenda here. Just thought you might like the anecdotal project validation. Maybe my project can give you some inspiration (maybe of what you don't want!). Feel free to grab from the implementation if it ramps you up faster. It's written in lisp though, which might be a turn off.

Let me know if you want to bounce ideas around. I tried a few different things before settling on what I have.

let vs set in new columns

justinpombrio makes a good point:

Another suggestion around let: consider splitting it into two operations, for creating a new column and for modifying an existing one. E.g. called let and set. Those are in effect pretty different operations: you need to know which one is happening to know how many columns the table will have, and renaming a table column can with your current system change which operation is happening.

Splitting them into separate operations would make things easier on the reader: they can tell what's happening without having to know all the column names of the table. And it shouldn't really be harder for the writer, who ought to already know which they're doing.

I think this is a good idea! dplyr has something similar with mutate & transmute.

It can mostly be enforced by PRQL. There's a case where we transpile to:

select *, x+1 as x_plus_one

...where we don't know whether or not we're overwriting an existing column. But it's a minority of cases, and the contract could stand within PRQL.

let & set seem reasonable but also open to other syntax.

Subscribe for updates

I'm hoping to bring prql from a proposal to reality. If you'd like updates in your GH notifications, subscribe to this issue by hitting the button on the right.


Named args vs assignments

One issue https://github.com/max-sixty/prql/pull/52 raises is how assignments are parsed; specifically:

derive gross_salary: salary + payroll_tax,

Should parse into:

  • func: derive
  • assign:
    • lvalue: gross_salary
    • rvalue: salary + payroll_tax

But named args' syntax conflicts with this. For example:

aggregate by:[title] [sum salary]

should parse into:

  • func: aggregate
  • assign:
    • lvalue: by
    • rvalue: [title]
  • arg: [sum salary]

...and not

  • func: aggregate
  • assign:
    • lvalue: by
    • rvalue:[title] [sum salary]

...which is what the assign implies.

Options:

  • change syntax so only one of these uses :
    • Though trying to avoid using = so we can keep the SQL (and human!) use of this
  • force the assign to always take one arg, so gross_salary: salary + payroll_tax needs to be gross_salary: (salary + payroll_tax)
  • restore the rules around whitespace, such that passing named args disallows whitespace (i.e. by:[title] and not by: [title]). But this makes semantics very dependent on small changes in whitespace, which could get confusing. An autoformatter isn't going to be able to deal with this


Argument substitution

Many SQL engines allow you to substitute in variables in a safe and escaped way.

Example taken from here:

values = await conn.fetch(
    'SELECT * FROM mytable WHERE id = $1',
    10,
)

If the end goal is to generate SQL, then these substitution markers should also be supported by PRQL.

  • Should these be positional only?
  • Should PRQL support key=value substitution?

Table vs Value types

@qorrect raises an interesting point on a Discord thread.

Something I find frustrating about SQL is that a query result can have three types:

  • a table
  • a column
  • a value — one column, one row

Some queries implicitly require one type. For example, in:

SELECT name
FROM facts
WHERE population = (SELECT MIN(population) FROM facts);

...the subquery SELECT MIN(population) FROM facts has to be a Value. But it's difficult to know that before running the query.

Here it's a full aggregation, so it is possible to statically check that; but it could be select population where country = "USA", which is ambiguous. And I'm not aware of live feedback on SQL tools to show that error, even in the unambiguous case.
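
A sketch of that ambiguous case; most databases only catch it at run time, and only if the subquery happens to return more than one row:

SELECT name
FROM facts
WHERE population = (SELECT population FROM facts WHERE country = 'USA');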

Here's the PRQL of that query, separated into a table for effect:

table min_pop = (
  facts
  aggregate [min population]
)

from facts
select name
filter population = min_pop

We could make this be statically checked, by defining a value type:

-table min_pop = (
+value min_pop = (
   facts
   aggregate [min population]
 )

...and then having aggregate [min population, max population] would fail at "compile time".

Ambiguous subqueries such as filter country = "USA" could require a take 1 to guarantee it's a single value.

Syntax comparison to QUEL

Before prql gets implemented, I'd like to see some comparison of the proposed syntax with QUEL.

https://en.wikipedia.org/wiki/QUEL_query_languages

QUEL is a more readable but fully composable alternative to SQL. It was created by a mathematician and fully implemented in POSTGRES 4.2 (yeah, POSTGRES got the frontend thrown away later and exchanged for SQL due to market pressure).

Btw, I'd strongly recommend everyone read the paper What Goes Around Comes Around from M. Stonebraker and co.

It's a summary of 35 years of data model proposals (and thus what query languages are designed around), grouped into 9 different eras. The outcome of the paper is a list of lessons learned:

Lesson 1: Physical and logical data independence are highly desirable
Lesson 2: Tree structured data models are very restrictive
Lesson 3: It is a challenge to provide sophisticated logical reorganizations of tree structured data
Lesson 4: A record-at-a-time user interface forces the programmer to do manual query optimization, and this is often hard.
Lesson 5: Directed graphs are more flexible than hierarchies but more complex
Lesson 6: Loading and recovering directed graphs is more complex than hierarchies
Lesson 7: Set-a-time languages are good, regardless of the data model, since they offer much improved physical data independence.
Lesson 8: Logical data independence is easier with a simple data model than with a complex one.
Lesson 9: Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology.
Lesson 10: Query optimizers can beat all but the best record-at-a-time DBMS application programmers.
Lesson 11: Functional dependencies are too difficult for mere mortals to understand. Another reason for KISS (Keep It Simple Stupid).
Lesson 12: Unless there is a big performance or functionality advantage, new constructs will go nowhere.
Lesson 13: Packages will not sell to users unless they are in “major pain”
Lesson 14: Persistent languages will go nowhere without the support of the programming language community.
(yes, there is a numbering mistake in the paper)
Lesson 14: The major benefits of OR is two-fold: putting code in the data base (and thereby bluring the distinction between code and data) and a general purpose extension mechanism that allows OR DBMSs to quickly respond to market requirements.
Lesson 15: Widespread adoption of new technology requires either standards and/or an elephant pushing hard.
Lesson 16: Schema-later is probably a niche market
Lesson 17: XQuery is pretty much OR SQL with a different syntax
Lesson 18: XML will not solve the semantic heterogeneity either inside or outside the enterprise.

Create table

Is PRQL strictly a query language, or should it also support defining and modifying the database schema/tables?

Potentially using Chumsky

There are great readability advantages to having a parser grammar, and we're still iterating between the language design and the parser, so we're not going to rewrite it at this stage.

But this project looks really impressive: https://github.com/zesterer/chumsky, from @zesterer (who seems to be actively supporting & developing it, unlike pest & nom)

We're already hitting the limits of pest — e.g. pest-parser/pest#271 has been frustrating, and my current work refactoring how functions are handled is not efficient — so if we find the existing error messages aren't good enough, it's worth considering.

Removing implicitly named identifiers

Some good feedback from ajkjk:

"sort sum_gross_cost # Uses the auto-generated column name." ... seems like a huge landmine. Languages really should not have any implicit way of constructing identifiers (among other reasons it is not easily greppable).

You might consider using a syntax like sum:gross_cost which can function as a sort parameter and an aggregation, but is actually recognizable as an object instead of having an implicit transformation going on in the background. Like this:

filter gross_cost > 0
aggregate by:[title, country] [
  sum:gross_cost,
]
sort sum:gross_cost

I think this is right — while it's helpful to implicitly generate column names, using them as identifiers should, I think, require being explicit.

A few possible approaches:

  • The proposal above, which uses sum:gross_cost. I had planned for : to be used for passing values to named arguments in functions — see the func example in the readme. I'm not precious about that syntax, but it should be unambiguous.

  • Mandate that the column in sort should be assigned a name, like:

    filter gross_cost > 0
    aggregate by:[title, country] [
      sum salary,
      let sum_gross_cost = sum gross_cost,
    ]
    sort sum_gross_cost

    One downside of this is the incongruence between sum salary and let sum_gross_cost = sum gross_cost in the aggregate list. But we need some way to assign names in general, so we'll have to handle that regardless.

  • Allow for expressions in the sort, like:

    filter gross_cost > 0
    aggregate by:[title, country] [
      sum salary
      sum gross_cost,
    ]
    sort (sum gross_cost)

    The issue here is that the columns we're aggregating by are implicit, based on the last aggregate expression, which isn't great.

Atm I'd lean towards "Mandate that the column in sort should be assigned a name" but as ever open to feedback

Starting on the parser

A couple of people reached out asking whether they could help writing the parser, and potentially there are others who would be interested in contributing.

My plan was to use nom, because I'd heard good things, had a good first experience, and it's easy to adapt it if we decide to make it more hand-written or use a PEG grammar.

My initial plan was to try and get some of the initial simple examples parsing. This will be mostly activation energy — it's not the most difficult language to parse. So if we can parse this into an AST, that would be a great success:

from employees
filter country = "USA"                           # Each line transforms the previous result.
derive [                                         # This adds columns / variables.
  gross_salary: salary + payroll_tax,
  gross_cost:   gross_salary + benefits_cost     # Variables can use other variables.
]           
filter gross_cost > 0
aggregate by:[title, country] [                  # `by` are the columns to group by.
    average salary,                              # These are aggregation calcs run on each group.
    sum     salary,
    average gross_salary,
    sum     gross_salary,
    average gross_cost,
    sum_gross_cost: sum gross_cost,
    count,
]
sort sum_gross_cost
filter count > 200
take 20

If anyone wants to have a go, I would suggest:

  • Take one function (e.g. derive)
  • Put a PR in after you have a few lines of code — then people won't do the same thing (though I'm not sure if that will actually be a problem!)
  • Don't worry about making it complete (e.g. it'd be fine to start on derive only for anonymous columns), add comments where stuff is missing
  • Do add a few tests. You can see the very early functions I added have tests

If parts are unclear, please feel free to reply here.

Filter by unselected columns

Is it possible to write a query that filters rows by unselected columns? I think if every line is a transformation of the previous line, it is not possible.

Test the documented examples

We have some documented examples — some of which currently work and some of which don't.

It would be great to test our documented examples. The current format for docs seems good; I'd also be open to a different format, but there should be no Rust in these examples — they're for users rather than developers of PRQL.

(TBC, we already have an order of magnitude more tests, mostly using insta — not all tests need to be docs, but all docs should be tested.)

Assuming we're happy with the format, a really useful tool would:

  • Parse the .md files
  • Extract the PRQL
  • Run the PRQL
  • Compare to SQL
  • Add something like a # Not yet implemented comment if it raises an error
  • Allow for copying & pasting the result if it works but the output isn't exactly the same

Community — ideas?

Would folks use a Slack / Discord / etc?

Both to aid contributions — e.g. folks who are keen to contribute but would need some help from others on getting up to speed on writing rust code for the parser — and as a community space for how to more generally make SQL less painful.

Function syntax and the pipe operator

I have thoughts about this example function and its usage:

func if_valid x = is_valid_price ? x : null
let return_total = prices_adj   | ret | if_valid

The use of is_valid_price bugs me a lot as it's not actually in scope of the function. What would happen if it was called on a table that didn't have that column?

I'd suggest having if_valid take in is_valid_price as an argument, then calling it like this:

let return_total = prices_adj   | ret | if_valid(is_valid_price) # or maybe if_valid is_valid_price

There is a bit of a precedent for this with languages like F# and their pipeline operator, where x | foo(a, b, c) is interpreted as foo(x, a, b, c):

let squareAndAddOdd values =
    values
    |> List.filter (fun x -> x % 2 <> 0)
    |> List.map (fun x -> x * x + 1)

In fact, I think there's a fair bit of syntax similarity between this project's proposed syntax and ML languages, so they might be worth taking a look at for syntax ideas.

In this proposal, | or newlines act as a kind of implied infix "pipeline" operator - I think this should be made into an explicit operator, especially given this example:

func ret x = x / (x | lag_day) - 1 + dividend_return 

Though I think this specific scenario would be better off written in a more traditional way:

func ret x = x / lag_day(x) - 1 + dividend_return  # or maybe (lag_day x)

I suspect there may be some weird syntax scenarios to handle, however, especially with expressions. For example, how would something like this get interpreted:

where a > b
      and c < d

You'd want this to be interpreted as where a > b and c < d and not as where a > b | and c < d, since although and c < d may be a valid function call, it would make expression precedence parsing really weird and that statement might also be possible to interpret as applying and c < d to where a > b.

Perhaps having a newline followed by an infix operator can prevent the implied | from getting inserted?

Maybe expressions should not implicitly insert | on new lines, but then how would you tell if an expression ended?

Actually, how do you tell apart where (a | foo b) and (where a) | (foo b)?

Finally, some small thoughts:

  • That ternary operator should be replaced with a proper if ... then ... else if ... then ... else ...; this would be easier to read and is the precedent set by SQL and F#, along with also being used in Rust, the language the current parser is written in
  • Keep or and and instead of using || and && - with | being as important as it is, || will become an absolutely deadly footgun
  • I understand why func was chosen as the keyword, but maybe function should be accepted as well since most other keywords appear to not be abbreviations
  • That raw SQL syntax seems more like some sort of a macro syntax and I'm not sure if it's actually necessary? Even if it is, it really shouldn't use double quotes since that would be very weird.
    • Honestly, given the many SQL varieties out there I'd just give up and have both single and double quotes be used purely for strings. For escaping identifiers and column names, maybe backticks (used in MySQL) could work?
    • For raw SQL, maybe func upper col = sql UPPER(col) could work?

Add CI

We should add CI to run the tests on cargo. This can start out very simple.

Quotes & literals

Single vs. double quotes for literals (literals like the USA in country = "USA")

  • While it's generally better to only have one way of doing things, being flexible on this dimension is friendlier (like Python is).
  • SQL uses single-quotes for string literals. But some engines use double-quoted strings in some specific circumstances.

Passing through SQL, and f-strings

  • It would be useful to be able to pass through SQL, because there'll be lots we haven't implemented in PRQL when we start. For example, UPPER could be defined as:
    func upper col = f"UPPER({col})"
  1. We could use double quotes for this
  2. We could also use backticks, but they make pasting code snippets difficult given markdown
  3. We could use f-strings, which would be required even if there were no expression — basically any f-string gets interpolated and passed through as SQL
    • If this is confusing because it's similar-but-not-the-same as Python — because they don't end up as literal strings — we could call them s-strings, etc.

One potential synthesis of these:

  • Both double quotes and single quotes are allowed for literals
  • We use f-strings for SQL (or s-strings)
  • To represent double quotes, you need to combine these; e.g. f' "column name with spaces" ' would compile to "column name with spaces".

Refresh the roadmap

We've got a lot done — write down the gaps between now and v0.1, and refresh the longer-term roadmap where necessary.

Compile with wasm

It would be awesome to have a site like Pest's where someone can type in PRQL and see the SQL.

That'll require compiling to WebAssembly. Naively doing this fails, because WebAssembly doesn't have access to file APIs.

If we could make a feature that successfully compiled the parts we need to run in a browser (which notably don't require the file APIs), that would be awesome.

Boolean operators

Some discussion on discord about these.

Two forms:

  • In SQL expressions — filter salary > 0 and title = "Architect" — these can compile straight to SQL; there's no need for prql to do anything beyond forward them through.
  • In PRQL code, where we have to evaluate the expression in PRQL to decide what SQL to produce:
    func foo x y = (x or y) ? salary : total_comp  # (assuming we have a ternary operator, separate discussion!)
    
    from employees
    derive [
      salary: foo 1 0
      also_salary: foo 0 1
      total_comp: foo 0 0
    ]

For the latter, the default would be to have and and or as infix operators, like they are in most languages.

But my sense is that having these in PRQL is not that important, and we probably don't need them in the first version. If there are example queries that do require them, feel free to post them here.

CC @qorrect

Test the CLI

It would be great to add tests on the CLI. These could start out very simple indeed.

Window functions over groups

I've worked with SQL a lot, and now for 5 years with Pandas, and sometimes both drive me absolutely mad. In some aspects Pandas in Python is nicer, but things like window functions are absolutely horrible.

Example:

orders
------
cell   (a cell on a map)
hour   (datetime but rounded to hours)
amount

with some calculations we should produce this:

orders
------
cell
hour_index (unix timestamp // 3600)
hour
day_of_year
day_of_week
week_of_year
year

And then for each row, calculate

  1. A = orders for the same day of week and same hour for the previous 10 weeks.
  2. B = calculate mean orders per whole day for the same days of week for the previous 10 weeks.
  3. C = calculate total orders yesterday.
  4. A * C / B

I've done only part 1 in Pandas, and it's awful.

dc = data.copy()
data['mean_orders'] = 0
for i in tqdm(range(1, 11)):
    dc.week += 1
    data = data.merge(dc[['cell', 'week', 'weekday', 'orders']], on=['cell', 'weekday', 'week'], suffixes=('', '_other'))
    data['mean_orders'] += data['orders_other'] / 10
    data.drop('orders_other', inplace=True, axis=1)

I can think of a PostgreSQL equivalent (though the window function names and params might not be exactly these):

select cell, first(orders),
    sum(orders, 10) over (partition by cell, day_of_week, hour order by week) / 10 mean_same_hour_over_week,
    sum(orders, 168) over (partition by cell order by hour) / 7 mean_per_day_over_week,
    sum(b.orders)
from orders a inner join orders y on y.hour_index between (a.hour_index - a.hour) and (a.hour_index - a.hour - 24)
group by cell, orders;

...and then put it into a CTE and do the multiplication & division.

Syntax for aggregation with multiple arguments

What is the syntax for aggregations which take multiple arguments, such as percentile, corr, max_by, etc.? All examples only contain a single argument. I was thinking to delimit with spaces:

fn(a, b, c) -> fn a b c
aggregate by:[title, country] [
    min salary,
    pct salary 0.25,
    pct salary 0.50,
    pct salary 0.75,
    max salary
    ...
]

Don't hang on no CLI input

One nit I have with tools which take some stdin input is when they wait indefinitely on no input. It's unclear whether something is happening or not.

Unfortunately we do that currently — run ./target/debug/prql. Ideally we should just return the --help.
