
sqlglot's Introduction

SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine. It can be used to format SQL or translate between 21 different dialects like DuckDB, Presto / Trino, Spark / Databricks, Snowflake, and BigQuery. It aims to read a wide variety of SQL inputs and output syntactically and semantically correct SQL in the targeted dialects.

It is a very comprehensive generic SQL parser with a robust test suite. It is also quite performant, while being written purely in Python.

You can easily customize the parser, analyze queries, traverse expression trees, and programmatically build SQL.

Syntax errors are highlighted and dialect incompatibilities can warn or raise depending on configurations. However, SQLGlot does not aim to be a SQL validator, so it may fail to detect certain syntax errors.

Learn more about SQLGlot in the API documentation and the expression tree primer.

Contributions are very welcome in SQLGlot; read the contribution guide to get started!

Install

From PyPI:

pip3 install "sqlglot[rs]"

# Without Rust tokenizer (slower):
# pip3 install sqlglot

Or with a local checkout:

make install

Requirements for development (optional):

make install-dev

Versioning

Given a version number MAJOR.MINOR.PATCH, SQLGlot uses the following versioning strategy:

  • The PATCH version is incremented when there are backwards-compatible fixes or feature additions.
  • The MINOR version is incremented when there are backwards-incompatible fixes or feature additions.
  • The MAJOR version is incremented when there are significant backwards-incompatible fixes or feature additions.

Get in Touch

We'd love to hear from you. Join our community Slack channel!

FAQ

I tried to parse SQL that should be valid but it failed, why did that happen?

  • Most of the time, issues like this occur because the "source" dialect is omitted during parsing. For example, this is how to correctly parse a SQL query written in Spark SQL: parse_one(sql, dialect="spark") (alternatively: read="spark"). If no dialect is specified, parse_one will attempt to parse the query according to the "SQLGlot dialect", which is designed to be a superset of all supported dialects. If you tried specifying the dialect and it still doesn't work, please file an issue.
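
    As a quick sketch (the query here is illustrative):

    ```python
    from sqlglot import parse_one

    # Naming the source dialect ensures Spark-specific syntax, such as
    # backtick-delimited identifiers, is tokenized and parsed as intended.
    expression = parse_one("SELECT `col` FROM `db`.`tbl`", dialect="spark")
    print(type(expression).__name__)  # the root of the tree is a Select expression
    ```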

I tried to output SQL but it's not in the correct dialect!

  • Like parsing, generating SQL also requires the target dialect to be specified, otherwise the SQLGlot dialect will be used by default. For example, to transpile a query from Spark SQL to DuckDB, do parse_one(sql, dialect="spark").sql(dialect="duckdb") (alternatively: transpile(sql, read="spark", write="duckdb")).
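
    A minimal sketch (identifier quoting differs between the two dialects, so the effect of the target dialect is easy to see):

    ```python
    from sqlglot import parse_one

    # dialect= on parse_one selects the source dialect; dialect= on .sql()
    # selects the target. Spark's backticks become DuckDB's double quotes.
    print(parse_one("SELECT `a` FROM `t`", dialect="spark").sql(dialect="duckdb"))
    ```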

I tried to parse invalid SQL and it worked, even though it should raise an error! Why didn't it validate my SQL?

  • SQLGlot does not aim to be a SQL validator - it is designed to be very forgiving. This makes the codebase more comprehensive and also gives more flexibility to its users, e.g. by allowing them to include trailing commas in their projection lists.
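
    A small sketch of this forgiving behavior (the query is illustrative):

    ```python
    from sqlglot import parse_one

    # A trailing comma in the projection list is not valid ANSI SQL, but
    # SQLGlot accepts it and generates clean SQL back out.
    print(parse_one("SELECT a, b, FROM t").sql())
    ```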

Examples

Formatting and Transpiling

Easily translate from one dialect to another. For example, date/time functions vary between dialects and can be hard to deal with:

import sqlglot
sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="hive")[0]
'SELECT FROM_UNIXTIME(1618088028295 / POW(10, 3))'

SQLGlot can even translate custom time formats:

import sqlglot
sqlglot.transpile("SELECT STRFTIME(x, '%y-%-m-%S')", read="duckdb", write="hive")[0]
"SELECT DATE_FORMAT(x, 'yy-M-ss')"

Identifier delimiters and data types can be translated as well:

import sqlglot

# Spark SQL requires backticks (`) for delimited identifiers and uses `FLOAT` over `REAL`
sql = """WITH baz AS (SELECT a, c FROM foo WHERE a = 1) SELECT f.a, b.b, baz.c, CAST("b"."a" AS REAL) d FROM foo f JOIN bar b ON f.a = b.a LEFT JOIN baz ON f.a = baz.a"""

# Translates the query into Spark SQL, formats it, and delimits all of its identifiers
print(sqlglot.transpile(sql, write="spark", identify=True, pretty=True)[0])
WITH `baz` AS (
  SELECT
    `a`,
    `c`
  FROM `foo`
  WHERE
    `a` = 1
)
SELECT
  `f`.`a`,
  `b`.`b`,
  `baz`.`c`,
  CAST(`b`.`a` AS FLOAT) AS `d`
FROM `foo` AS `f`
JOIN `bar` AS `b`
  ON `f`.`a` = `b`.`a`
LEFT JOIN `baz`
  ON `f`.`a` = `baz`.`a`

Comments are also preserved on a best-effort basis:

sql = """
/* multi
   line
   comment
*/
SELECT
  tbl.cola /* comment 1 */ + tbl.colb /* comment 2 */,
  CAST(x AS SIGNED), # comment 3
  y               -- comment 4
FROM
  bar /* comment 5 */,
  tbl #          comment 6
"""

# Note: MySQL-specific comments (`#`) are converted into standard syntax
print(sqlglot.transpile(sql, read='mysql', pretty=True)[0])
/* multi
   line
   comment
*/
SELECT
  tbl.cola /* comment 1 */ + tbl.colb /* comment 2 */,
  CAST(x AS INT), /* comment 3 */
  y /* comment 4 */
FROM bar /* comment 5 */, tbl /*          comment 6 */

Metadata

You can explore SQL with expression helpers to do things like find columns and tables in a query:

from sqlglot import parse_one, exp

# print all column references (a and b)
for column in parse_one("SELECT a, b + 1 AS c FROM d").find_all(exp.Column):
    print(column.alias_or_name)

# find all projections in select statements (a and c)
for select in parse_one("SELECT a, b + 1 AS c FROM d").find_all(exp.Select):
    for projection in select.expressions:
        print(projection.alias_or_name)

# find all tables (x, y, z)
for table in parse_one("SELECT * FROM x JOIN y JOIN z").find_all(exp.Table):
    print(table.name)

Read the ast primer to learn more about SQLGlot's internals.

Parser Errors

When the parser detects an error in the syntax, it raises a ParseError:

import sqlglot
sqlglot.transpile("SELECT foo FROM (SELECT baz FROM t")
sqlglot.errors.ParseError: Expecting ). Line 1, Col: 34.
  SELECT foo FROM (SELECT baz FROM t
                                   ~

Structured syntax errors are accessible for programmatic use:

import sqlglot
try:
    sqlglot.transpile("SELECT foo FROM (SELECT baz FROM t")
except sqlglot.errors.ParseError as e:
    print(e.errors)
[{
  'description': 'Expecting )',
  'line': 1,
  'col': 34,
  'start_context': 'SELECT foo FROM (SELECT baz FROM ',
  'highlight': 't',
  'end_context': '',
  'into_expression': None
}]

Unsupported Errors

It may not be possible to translate some queries between certain dialects. For these cases, SQLGlot emits a warning and proceeds to do a best-effort translation by default:

import sqlglot
sqlglot.transpile("SELECT APPROX_DISTINCT(a, 0.1) FROM foo", read="presto", write="hive")
APPROX_COUNT_DISTINCT does not support accuracy
'SELECT APPROX_COUNT_DISTINCT(a) FROM foo'

This behavior can be changed by setting the unsupported_level attribute. For example, we can set it to either RAISE or IMMEDIATE to ensure an exception is raised instead:

import sqlglot
sqlglot.transpile("SELECT APPROX_DISTINCT(a, 0.1) FROM foo", read="presto", write="hive", unsupported_level=sqlglot.ErrorLevel.RAISE)
sqlglot.errors.UnsupportedError: APPROX_COUNT_DISTINCT does not support accuracy

Build and Modify SQL

SQLGlot supports incrementally building SQL expressions:

from sqlglot import select, condition

where = condition("x=1").and_("y=1")
select("*").from_("y").where(where).sql()
'SELECT * FROM y WHERE x = 1 AND y = 1'

It's possible to modify a parsed tree:

from sqlglot import parse_one
parse_one("SELECT x FROM y").from_("z").sql()
'SELECT x FROM z'

Parsed expressions can also be transformed recursively by applying a mapping function to each node in the tree:

from sqlglot import exp, parse_one

expression_tree = parse_one("SELECT a FROM x")

def transformer(node):
    if isinstance(node, exp.Column) and node.name == "a":
        return parse_one("FUN(a)")
    return node

transformed_tree = expression_tree.transform(transformer)
transformed_tree.sql()
'SELECT FUN(a) FROM x'

SQL Optimizer

SQLGlot can rewrite queries into an "optimized" form. It applies a variety of techniques to create a new, canonical AST. This AST can be used to standardize queries or to provide the foundation for implementing an actual engine. For example:

import sqlglot
from sqlglot.optimizer import optimize

print(
    optimize(
        sqlglot.parse_one("""
            SELECT A OR (B OR (C AND D))
            FROM x
            WHERE Z = date '2021-01-01' + INTERVAL '1' month OR 1 = 0
        """),
        schema={"x": {"A": "INT", "B": "INT", "C": "INT", "D": "INT", "Z": "STRING"}}
    ).sql(pretty=True)
)
SELECT
  (
    "x"."a" <> 0 OR "x"."b" <> 0 OR "x"."c" <> 0
  )
  AND (
    "x"."a" <> 0 OR "x"."b" <> 0 OR "x"."d" <> 0
  ) AS "_col_0"
FROM "x" AS "x"
WHERE
  CAST("x"."z" AS DATE) = CAST('2021-02-01' AS DATE)

AST Introspection

You can see the AST version of the parsed SQL by calling repr:

from sqlglot import parse_one
print(repr(parse_one("SELECT a + 1 AS z")))
Select(
  expressions=[
    Alias(
      this=Add(
        this=Column(
          this=Identifier(this=a, quoted=False)),
        expression=Literal(this=1, is_string=False)),
      alias=Identifier(this=z, quoted=False))])

AST Diff

SQLGlot can calculate the semantic difference between two expressions and output the changes as a sequence of actions needed to transform a source expression into a target one:

from sqlglot import diff, parse_one
diff(parse_one("SELECT a + b, c, d"), parse_one("SELECT c, a - b, d"))
[
  Remove(expression=Add(
    this=Column(
      this=Identifier(this=a, quoted=False)),
    expression=Column(
      this=Identifier(this=b, quoted=False)))),
  Insert(expression=Sub(
    this=Column(
      this=Identifier(this=a, quoted=False)),
    expression=Column(
      this=Identifier(this=b, quoted=False)))),
  Keep(
    source=Column(this=Identifier(this=a, quoted=False)),
    target=Column(this=Identifier(this=a, quoted=False))),
  ...
]

See also: Semantic Diff for SQL.

Custom Dialects

Dialects can be added by subclassing Dialect:

from sqlglot import exp
from sqlglot.dialects.dialect import Dialect
from sqlglot.generator import Generator
from sqlglot.tokens import Tokenizer, TokenType


class Custom(Dialect):
    class Tokenizer(Tokenizer):
        QUOTES = ["'", '"']
        IDENTIFIERS = ["`"]

        KEYWORDS = {
            **Tokenizer.KEYWORDS,
            "INT64": TokenType.BIGINT,
            "FLOAT64": TokenType.DOUBLE,
        }

    class Generator(Generator):
        TRANSFORMS = {exp.Array: lambda self, e: f"[{self.expressions(e)}]"}

        TYPE_MAPPING = {
            exp.DataType.Type.TINYINT: "INT64",
            exp.DataType.Type.SMALLINT: "INT64",
            exp.DataType.Type.INT: "INT64",
            exp.DataType.Type.BIGINT: "INT64",
            exp.DataType.Type.DECIMAL: "NUMERIC",
            exp.DataType.Type.FLOAT: "FLOAT64",
            exp.DataType.Type.DOUBLE: "FLOAT64",
            exp.DataType.Type.BOOLEAN: "BOOL",
            exp.DataType.Type.TEXT: "STRING",
        }

print(Dialect["custom"])
<class '__main__.Custom'>

SQL Execution

SQLGlot is able to interpret SQL queries, where tables are represented as Python dictionaries. The engine is not intended to be fast, but it can be useful for unit testing and for running SQL natively across Python objects. Additionally, the foundation can easily be integrated with fast compute kernels, such as Arrow and Pandas.

The example below showcases the execution of a query that involves aggregations and joins:

from sqlglot.executor import execute

tables = {
    "sushi": [
        {"id": 1, "price": 1.0},
        {"id": 2, "price": 2.0},
        {"id": 3, "price": 3.0},
    ],
    "order_items": [
        {"sushi_id": 1, "order_id": 1},
        {"sushi_id": 1, "order_id": 1},
        {"sushi_id": 2, "order_id": 1},
        {"sushi_id": 3, "order_id": 2},
    ],
    "orders": [
        {"id": 1, "user_id": 1},
        {"id": 2, "user_id": 2},
    ],
}

execute(
    """
    SELECT
      o.user_id,
      SUM(s.price) AS price
    FROM orders o
    JOIN order_items i
      ON o.id = i.order_id
    JOIN sushi s
      ON i.sushi_id = s.id
    GROUP BY o.user_id
    """,
    tables=tables
)
user_id price
      1   4.0
      2   3.0

See also: Writing a Python SQL engine from scratch.

Documentation

SQLGlot uses pdoc to serve its API documentation.

A hosted version is on the SQLGlot website, or you can build locally with:

make docs-serve

Run Tests and Lint

make style  # Only linter checks
make unit   # Only unit tests (or unit-rs, to use the Rust tokenizer)
make test   # Unit and integration tests (or test-rs, to use the Rust tokenizer)
make check  # Full test suite & linter checks

Benchmarks

Benchmarks were run on Python 3.10.12; timings are in seconds, with the slowdown relative to sqlglot in parentheses.

Query   sqlglot          sqlglotrs        sqlfluff          sqltree          sqlparse         moz_sql_parser   sqloxide
tpch    0.00944 (1.0)    0.00590 (0.625)  0.32116 (33.98)   0.00693 (0.734)  0.02858 (3.025)  0.03337 (3.532)  0.00073 (0.077)
short   0.00065 (1.0)    0.00044 (0.687)  0.03511 (53.82)   0.00049 (0.759)  0.00163 (2.506)  0.00234 (3.601)  0.00005 (0.073)
long    0.00889 (1.0)    0.00572 (0.643)  0.36982 (41.56)   0.00614 (0.690)  0.02530 (2.844)  0.02931 (3.294)  0.00059 (0.066)
crazy   0.02918 (1.0)    0.01991 (0.682)  1.88695 (64.66)   0.02003 (0.686)  7.46894 (255.9)  0.64994 (22.27)  0.00327 (0.112)

Optional Dependencies

SQLGlot uses dateutil to simplify literal timedelta expressions. The optimizer will not simplify expressions like the following if the module cannot be found:

x + interval '1' month


sqlglot's Issues

Union does not support CTE

Thanks for sharing this great parser.
I would like to report a bug when using Presto SQL.
The query below should work in Presto, but the parser produces an error (tested on sqlglot==2.2.7):

WITH temp_1 as (SELECT 1 FROM test2.table2),
    temp_2 as (SELECT 2 FROM test2.table3)

    SELECT * FROM temp_1 
    UNION ALL
    SELECT * FROM temp_2    

error

sqlglot.errors.ParseError: union does not support CTE. Line 7, Col: 19.
    temp_2 as (select 2 from test2.table3)

    select * from temp_1 
    union all
    select * from temp_2   

What about a different parsing policy for different dialects?

Hi Toby,

I'm using sqlglot to parse the SQL syntax from Hive 3.0. There are some new keywords and new syntax, I have to change the lexer and parser somehow, but those keywords and syntax are not supported in MySQL or PostgreSQL at all.

So, I got this idea, what about moving the tasks of parsing those uncommon and special syntax into those specific dialects? And we may need to change the interface a little, for example, the users may use sqlglot.parse(code, dialect='hive-3') to indicate which dialect to use. Of course, if the user doesn't specify any dialect, we will keep using the existing logic 😄

I would like to do this refactoring for sqlglot if you don't mind, or if you think this change violates the philosophy of sqlglot, I may create a new project based on sqlglot.

Please let me know if you think it's OK, thanks.

Cannot parse CTE Update

It seems CTE parsing is hard-coded to expect a following SELECT statement, so parsing this SQL query fails:

WITH baz as (SELECT 1 as col1) UPDATE some_table SET cid = baz.col1 from baz;

Note: this is a valid Postgres query.

Error:

Traceback (most recent call last):
  ...
  File "sqlglot\parser.py", line 263, in check_errors
    raise error
sqlglot.errors.ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.CTE'>. Line 3, Col: 1.
  WITH baz as
  (SELECT 1 as col1)
UPDATE some_table
SET cid = baz.col1
from baz;

I have tried to fix it myself by changing

def _parse_cte(self):
    ...
    return self.expression(
        exp.CTE,
        this=self._parse_select(),
        expressions=expressions,
        recursive=recursive,
    )

to

def _parse_cte(self):
    ...
    return self.expression(
        exp.CTE,
        this=self._parse_statement(), # parse based on keyword ?!
        expressions=expressions,
        recursive=recursive,
    )

But that only resulted in

  File "sqlglot\parser.py", line 263, in check_errors
    raise error
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 5, Col: 1.

Presto LIKE expression with backslash escape

It seems a backslash in the ESCAPE clause causes an error.

Presto SQL

SELECT TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE,
  TYPE_NAME, COLUMN_SIZE, BUFFER_LENGTH, DECIMAL_DIGITS, NUM_PREC_RADIX,
  NULLABLE, REMARKS, COLUMN_DEF, SQL_DATA_TYPE, SQL_DATETIME_SUB,
  CHAR_OCTET_LENGTH, ORDINAL_POSITION, IS_NULLABLE,
  SCOPE_CATALOG, SCOPE_SCHEMA, SCOPE_TABLE,
  SOURCE_DATA_TYPE, IS_AUTOINCREMENT, IS_GENERATEDCOLUMN
FROM system.jdbc.columns
WHERE TABLE_CAT = 'hive' AND TABLE_SCHEM LIKE 'stage' ESCAPE '\' AND TABLE_NAME LIKE '%' ESCAPE '\' AND COLUMN_NAME LIKE '%' ESCAPE '\'
ORDER BY TABLE_CAT, TABLE_SCHEM, TABLE_NAME, ORDINAL_POSITION

EXCEPTION

sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 8, Col: 55.
  , IS_GENERATEDCOLUMN
FROM system.jdbc.columns
WHERE TABLE_CAT = 'hive' AND TABLE_SCHEM LIKE 'stage' ESCAPE '' AND TABLE_NAME LIKE '%' ESCAPE '' AND COLUMN_NAME LIKE '%' ESCAPE ''

SQLite3 `BEGIN IMMEDIATE TRANSACTION` is rejected

$ sqlite3
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> BEGIN IMMEDIATE TRANSACTION;
sqlite> COMMIT;
sqlite> ^D

$ python -c 'from sqlglot import parse; parse("BEGIN IMMEDIATE TRANSACTION", read="sqlite")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/__init__.py", line 36, in parse
    return dialect.parse(sql, **opts)
  File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/dialects.py", line 60, in parse
    return self.parser(**opts).parse(self.tokenizer().tokenize(sql), sql)
  File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/parser.py", line 257, in parse
    return self._parse(
  File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/parser.py", line 308, in _parse
    self.check_errors()
  File "/nix/store/88wlq1s1ks201j3sxvdhf4rb3qmfsi9d-python3-3.9.6-env/lib/python3.9/site-packages/sqlglot/parser.py", line 315, in check_errors
    raise error
sqlglot.errors.ParseError: Invalid expression / Unexpected token. Line 1, Col: 17.
  BEGIN IMMEDIATE TRANSACTION

feature request - add support for TBLPROPERTIES when creating a new table

It would be great to add support for TBLPROPERTIES in CREATE statements (Hive doc, Presto doc). This is useful for specifying things like table retention periods.

For simplicity, it could use an interface like Airflow's Hive query builder, where a dict of key/value pairs is just unrolled into the relevant syntax.

Seems like this would require changes to expressions.Create, Rewriter.ctas(), Generator.create_sql(), etc., and then the language-specific implementations. For Hive there's an explicit TBLPROPERTIES keyword in the CREATE statement; for Presto the properties are just added into the WITH clause.

dialects don't like optional parts in table and view creation

Quite a lot of the views/tables that I write use the optional parts
E.g.

Presto
CREATE OR REPLACE VIEW {name} AS

message = 'Expected TABLE or View'
token = <Token token_type: TokenType.OR, text: OR, line: 1, col: 7, arg_key: None>

MySQL
CREATE TABLE IF NOT EXISTS {name} (

)

message = 'Invalid expression / Unexpected token'
token = <Token token_type: TokenType.L_PAREN, text: (, line: 1, col: 0, arg_key: None>

I believe both Presto and MySQL support these optional parts in the SQL syntax for tables and views.
Many thanks in advance!

Are we really parsing CREATE statement the right way?

Hi Toby,

I'm trying to parse "CREATE OR REPLACE TEMPORARY VIEW ...", but I noticed that in generator.py, you wrote CREATE{temporary}{replace}. I think that's a typo because then the generated SQL will be "CREATE TEMPORARY OR REPLACE ...".

And in parser.py, you wrote

def _parse_create(self):
    temporary = self._match(TokenType.TEMPORARY)
    replace = self._match(TokenType.OR) and self._match(TokenType.REPLACE)

I guess it should be

def _parse_create(self):
    replace = self._match(TokenType.OR) and self._match(TokenType.REPLACE)
    temporary = self._match(TokenType.TEMPORARY)

Please let me know if I understand your code correctly 😄

Docs and Support

I have been looking all over trying to find docs, and reading the docstrings, but is there a place to ask for support on this? I'm trying to utilize the .find method for an expression and can't get it to work.

I am not putting in the right expression type to find, but I can't seem to find out what the options are. Where am I not looking?

Sample:

test = "Select *, a.ID from `table_name` as b"
parsed_output = sqlglot.parse(test, read='mysql')

print(type(parsed_output))
print(type(parsed_output[0]))
print(parsed_output[0].find(('Select',)))

Error:

File ~\AppData\Roaming\Python\Python38\site-packages\sqlglot\expressions.py:119, in Expression.find_all(self, *expression_types)
    108 """
    109 Returns a generator object which visits all nodes in this tree and only
    110 yields those that match at least one of the specified expression types.
   (...)
    116     the generator object.
    117 """
    118 for expression, _, _ in self.walk():
--> 119     if isinstance(expression, expression_types):
    120         yield expression

TypeError: isinstance() arg 2 must be a type or tuple of types

Parsing error in CTAS statement with parenthesis and with clause

Following sql works in presto but it produces parse error

   CREATE TABLE  schema1.table1 AS
   (
        WITH base AS (
            SELECT *
                    ,row_number() over(partition by col1 order by col2 asc) AS test
            FROM test.table2
        )
        SELECT * FROM base
    )

ERROR:

sqlglot.errors.ParseError: Expecting ). Line 4, Col: 9.
  
   CREATE TABLE  schema1.table1 AS
   (
        WITH base AS (
            SELECT *
                    ,row_number() over(partition by col1 order by co

If you remove the parentheses, it works:

   CREATE TABLE  schema1.table1 AS
   --(
        WITH base AS (
            SELECT *
                    ,row_number() over(partition by col1 order by col2 asc) AS test
            FROM test.table2
        )
        SELECT * FROM base
    --)

Ability to parse TIMESTAMP WITH TIME ZONE for Trino/Presto

Trino and Presto both support TIMESTAMP WITH TIME ZONE types, as shown in the docs here.

trino> SELECT CAST('2021-01-01 18:00:00' AS TIMESTAMP(3) WITH TIME ZONE);
                  _col0
------------------------------------------
 2021-01-01 18:00:00.000 America/New_York

I tried the following to parse the expression:

>>> p = Presto()
>>> p.parse("SELECT CAST('2021-01-01 18:00:00' AS TIMESTAMP(3) WITH TIME ZONE)")
Traceback (most recent call last):
...
sqlglot.errors.ParseError: Expected ). Line 1, Col: 51.
SELECT CAST('2021-01-01 18:00:00' AS TIMESTAMP(3) WITH TIME ZONE)

(WITH is underlined above)

sqlglot doesn't parse UPDATE statement?

Hi Toby,

As far as I know, sqlglot doesn't support parsing UPDATE statements yet; do you have plans to support it soon?
Or, if you don't have such a plan, I would like to send an MR to support this 😄

Error on parsing query from Postgres example

On following query:

WITH regional_sales AS (
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
), top_regions AS (
    SELECT region
    FROM regional_sales
    WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
SELECT region,
       product,
       SUM(quantity) AS product_units,
       SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;

execution fails on following:

~/.local/lib/python3.9/site-packages/sqlglot/parser.py in _parse_primary(self)
650
651             if not self._match(TokenType.R_PAREN):
--> 652                 self.raise_error('Expecting )', paren)
653             return exp.Paren(this=this)
654

~/.local/lib/python3.9/site-packages/sqlglot/parser.py in raise_error(self, message, token)
202         )
203         if self.error_level == ErrorLevel.RAISE:
--> 204             raise self.error
205         if self.error_level == ErrorLevel.WARN:
206             logging.error(self.error)

ParseError: Expecting ). Line 7, Col: 24.
n
FROM regional_sales
WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)

CROSS JOIN UNNEST not supported going from Presto --> Hive

CROSS JOIN UNNEST is currently not supported when translating from Presto to Hive, e.g.:

sqlglot.transpile("""
SELECT
    col1
   , col2
   , item
FROM
    my_db.my_table
    CROSS JOIN UNNEST(items) t (item)
""", read='presto', write='hive')

returns the query unmodified, but it should be

SELECT
    col1
   , col2
   , item
FROM
    my_db.my_table
    LATERAL VIEW EXPLODE(items) t AS item

It would be great to add support for this expression!

Unable to Parse VALUES clause with named table alias (Presto -> Spark)

Getting an error trying to parse the following query

q = """
SELECT zoo, animal
FROM (VALUES
        ('OaklandZoo', ARRAY['dog', 'cat', 'tiger']),
        ('SanFranciscoZoo', ARRAY['dog', 'cat'])
) AS t(zoo,animals)
"""

sqlglot.transpile(q, 'presto', 'spark')
ParseError: Invalid expression / Unexpected token. Line 5, Col: 7.
  'OaklandZoo', ARRAY['dog', 'cat', 'tiger']),
        ('SanFranciscoZoo', ARRAY['dog', 'cat'])
) AS t(zoo,animals)

Token List

[<Token token_type: TokenType.SELECT, text: SELECT, line: 1, col: 1>,
...,
 <Token token_type: TokenType.FROM, text: FROM, line: 2, col: 1>,
 <Token token_type: TokenType.L_PAREN, text: (, line: 2, col: 6>,
 <Token token_type: TokenType.VALUES, text: VALUES, line: 2, col: 7>,
 <Token token_type: TokenType.L_PAREN, text: (, line: 3, col: 9>,
 ...,
 <Token token_type: TokenType.R_PAREN, text: ), line: 5, col: 1>,
 <Token token_type: TokenType.ALIAS, text: AS, line: 5, col: 3>,
 <Token token_type: TokenType.VAR, text: t, line: 5, col: 6>,
 **<Token token_type: TokenType.L_PAREN, text: (, line: 5, col: 7>,**
 <Token token_type: TokenType.VAR, text: zoo, line: 5, col: 8>,
 <Token token_type: TokenType.COMMA, text: ,, line: 5, col: 11>,
 <Token token_type: TokenType.VAR, text: animals, line: 5, col: 12>,
 <Token token_type: TokenType.R_PAREN, text: ), line: 5, col: 19>]

Parse Postgres style "::" casts

Hey, amazing work on this library so far.
Are there plans to support "::" style type casts, as seen in Postgres & Redshift?
https://www.postgresqltutorial.com/postgresql-cast/

Here are some example scenarios I've tested in the 1.2.0 release.

import sqlglot
sqlglot.transpile("select foo::INTEGER from bar", read='postgres')

results in:
['SELECT CAST(foo AS ) AS INT FROM bar']

where I would expect:
['SELECT CAST(foo AS INT) FROM bar']

Another edge-case is combining the cast with an alias (which is what sqlglot currently interprets the :: as):

import sqlglot
sqlglot.transpile("select foo::INTEGER as baz from bar", read='postgres')

results in:

ParseError: Invalid expression / Unexpected token. Line 0, Col: 20.
select foo::INTEGER as baz from bar

where I would expect:
['SELECT CAST(foo AS INT) AS baz FROM bar']

Thanks!

Adding Trino parser

Trino and Presto are now different and are likely to diverge in subsequent development: https://trino.io/blog/2020/12/27/announcing-trino.html. Is it possible to add Trino as a Dialect? Since Trino's development cycle appears to be much faster than Facebook's (Trino is hosted by a startup called Starburst), I'm thinking we should either copy the parser rules from Presto to Trino and carry forward with Trino, or inherit from Presto. I think with the way the Dialect classes are designed, it makes more sense to do a copy.

Rewrite queries to remove nested queries in WHERE clause

It would be great if SQLGlot can "flatten" a nested SQL query such that the only nested subqueries are operands of the FROM clause and are independent from each other. This mostly involves taking care of subqueries inside the WHERE clause.

There are a couple cases here. The first case is this subquery is independent of the outer query. In this case this subquery should just be evaluated as a separate SQL query and its result can be used in the original query.

The second case is a bit more complicated, and occurs when the subquery depends on the outer query. For example in TPCH 2 and 4: https://github.com/marsupialtail/nest-query-rewrites. In this case we need to also reflect the dependency between the outer query and the nested query in the rewrite.

I think taking care of the first case will go a long way already.

Parsing incomplete SQL statements

Hey @tobymao! This is excellent work and I'm pretty sure it will help lots of people in their project.
sqlglot will for sure help us in https://github.com/MariaDB/mariadb_kernel

We are trying to build an autocompletion and introspection feature (as part of a GSoC project), and the ability to get the parsed expression and the type of each token in the expression is exactly what we were looking for (and would eventually have implemented ourselves if no reasonable solution had been found in the wild).

My question for you is: is it very complicated to make sqlglot able to parse and generate a partial expression tree for incomplete SQL statements?
Right now, if you execute parse("select a from t1 where"), you'd get an exception. If you execute parse("select a from"), you would get a partial expression tree, which is fantastic.
Is sqlglot designed to deal with incomplete statements, or does the latter statement above work just by accident?

Thanks again for this great project!

Adding a new formatter

Hello

Thank you for the very nice tool! I want to use it to format Presto SQL in the style below, which differs from the regular style.

Would it be possible for me to send you a style guide so you could add this format?

WITH
    SOME_CTE AS (
        SELECT DISTINCT
          id          AS my_id,
          otherField1 AS my_other_field_1,
          otherField2 AS my_other_field_2
        FROM
            prod.my_other_data
        WHERE
                id IN (1, 2, 3)
            OR
                (
                        otherField1 = 'value1'
                    AND
                        otherField2 = 'value2'
                )
    ),
    
    FINAL AS (
        SELECT
            id_field,
            field_1                    AS detailed_field_1,
            field_2                    AS detailed_field_2,
            detailed_field_3,
            
            CASE
                WHEN
                    cancellation_date IS NULL AND expiration_date IS NOT NULL
                THEN
                    expiration_date
                
                WHEN
                    cancellation_date IS NULL
                THEN
                    start_date + 7
                
                ELSE
                    cancellation_date
            END                        AS cancellation_date,
            
            rank() OVER (
                PARTITION BY id_field
                ORDER BY cancellation_date
            )                          AS previous_detailed_field_3,
            SUM(field_4)               AS field_4_sum,
            MAX(field_5)               AS field_5_max
        FROM
                (
                    SELECT
                        *
                    FROM
                        prod.my_data
                    WHERE
                        filter = 'my_filter'
                ) AS A
            LEFT JOIN
                SOME_CTE
            ON
                A.id_field = SOME_CTE.my_id
        WHERE
            reduce(
                field_1,
                0,
                (x, s) -> s + x,
                s -> s
            ) > 100
        GROUP BY
            1, 
            2, 
            3, 
            4, 
            5
        HAVING
            COUNT(*) > 5
        ORDER BY
            4 DESC
)

SELECT
    *
FROM
    FINAL

Support catalog for Trino SQL

At the moment we get an unexpected token error when we write a query with a fully qualified name like catalog.schema.table.

Spark's explode should be somehow transformed with duckdb's unnest

The following query

select sub.year, sub.genre, count(id)
from (
         select m.movieId                                                    as id,
                int(regexp_extract(m.title, '\\s*\\((\\d{4})\\)\\s*"?$', 1)) as year,
                explode(split(m.genres, '\\|'))                              as genre
         from movies m
     ) sub
where genre <> '(no genres listed)'
  and year IS NOT NULL
group by sub.year, sub.genre
order by sub.year DESC, sub.genre

is translated incorrectly from Spark to DuckDB.

Arguments to lambda functions wrapped in brackets even with only one argument

sql = """
transform(array('a','b','c'), X -> upper(X))
"""

import sqlglot
sqlglot.transpile(sql)
> ["TRANSFORM(ARRAY('a', 'b', 'c'), (X) -> UPPER(X))"]

I discovered this as it was breaking some pyspark queries. Minimal pyspark example below:

from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
from pyspark.sql import Row
data_list = [
    {"array_col": ['a', 'b']},
]

df = spark.createDataFrame(Row(**x) for x in data_list)
df.createOrReplaceTempView("df")

import sqlglot 
sql = "select transform(array_col, x -> x) as transformed from df"
spark.sql(sql).toPandas()

sql = sqlglot.transpile(sql, read='spark', write='spark')[0]  # transpile returns a list
spark.sql(sql).toPandas()

Will submit a PR to:

  • distinguish between x -> f(x) and (x,y,z) -> f(x,y,z) syntax
  • add a new test
  • update identity.sql accordingly

parse create mysql string error

raw = '''
CREATE TABLE `t_customer_account` (
  `id` int,
  `customer_id` int,
  `bank` varchar(100),
  `account_no` varchar(100)
);
'''
parsed = sqlglot.parse(raw, sqlglot.dialects.MySQL)

I got this error:

ParseError: Expected ). Line 2, Col: 7.

CREATE TABLE `t_customer_account` (
  `id` int,
  `customer_id` int,
  `bank` varchar(100),
  `a

sqlglot version: v1.3.2

Parser fails on recursive CTEs

On the following query

WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;

the parser fails with the following error:

~/.local/lib/python3.9/site-packages/sqlglot/parser.py in _parse_cte(self)
354
355             if not self._match(TokenType.ALIAS):
--> 356                 self.raise_error('Expected AS after WITH')
357
358             expressions.append(self._parse_table(alias=alias))

~/.local/lib/python3.9/site-packages/sqlglot/parser.py in raise_error(self, message, token)
202         )
203         if self.error_level == ErrorLevel.RAISE:
--> 204             raise self.error
205         if self.error_level == ErrorLevel.WARN:
206             logging.error(self.error)

ParseError: Expected AS after WITH. Line 0, Col: 15.
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1

add sqloxide benchmarks

From the README:

It is currently the fastest Python SQL parser.

I have Python bindings to sqlparser-rs; would you be open to adding these to the benchmark suite? I went ahead and added your library to my benchmarks, and it indeed compares favorably to the other pure-Python implementations.

Not to take away from your awesome project, it does a lot of other cool things that simple bindings cannot!

Transpiling / Parsing INSERT INTO with columns

INSERT INTO table2 (column1, column2, column3) SELECT column1, column2, column3 FROM table1 WHERE condition

transpiles to

INSERT INTO TABLE2(column1, column2, column3) SELECT column1, column2, column3 FROM table1 WHERE condition

Note the uppercase TABLE2 and the "missing" space.

I would expect the transpiled SQL to look the same as the initial string.


Comparing to INSERT INTO w/o columns:

INSERT INTO table2 SELECT column1, column2, column3 FROM table1 WHERE condition

which parses "table2" as an IDENTIFIER:

(INSERT this:
  (TABLE this:
    (IDENTIFIER this: table2, quoted: False)), expression:
  (SELECT expressions:
  ...

but with columns it parses "table2" as ANONYMOUS:

(INSERT this:
  (TABLE this:
    (ANONYMOUS this: table2, expressions:
      (COLUMN this:
        (IDENTIFIER this: column1, quoted: False)),
      (COLUMN this:
        (IDENTIFIER this: column2, quoted: False)),
      (COLUMN this:
        (IDENTIFIER this: column3, quoted: False)))), expression:
  (SELECT expressions:
  ...

Presto CTAS with bucketing option

Hi Toby, I found a bug. Please review and fix it.

PRESTO SQL

CREATE TABLE schema2.table2 
WITH (
    bucketed_by  = Array['Key'],
    bucket_count = 64
)
AS
SELECT * FROM schema2.table2

sqlglot.errors.ParseError: Required keyword: 'value' missing for <class 'sqlglot.expressions.Property'>. Line 3, Col: 20.

tested on sqlglot==2.3.2

Support all CASE statement syntax variations.

Currently, transpile maps the given input below as follows:

Versions v1.16.0 and v1.22.0 tested.

sqlglot.transpile(
    ...,
    read="trino",
    write="trino",
    identity=False,
    pretty=True,
)

input

CASE col
    WHEN val_a THEN mappend_val_a
    ELSE mapped_val_general
  END

output
IF(val_a, ...

IF(val_a, mappend_val_a, mapped_val_general)

Clearly, the comparison condition is not being included, only the value. It should be:

output
IF(col = val_a, ...

IF(col = val_a, mappend_val_a, mapped_val_general)

Parse Time Formats

Presto and Hive use different time format specifiers; these are not transpiled when converting between the two.

subqueries not supported in where clause

parser.py:_parse_range throws the error message 'Expected ) after IN'.

Minimal reproducible example:

from sqlglot import transpile

test_sql = "SELECT column FROM db.tbl WHERE column IN (SELECT column FROM otherdb.othertbl)"
transpile(test_sql, read="spark")

sqlglot.errors.ParseError: Expected ) after IN. Line 0, Col: 50.

Parentheses at union

Hi

Nice lib, I found some minor issues:

This is ok:

sql_str = """
select A from dbo.tablename
union all
select B from dbo.tablename
"""
sqlglot.parse(sql_str)

This is not ok:

sql_str = """
(select A from dbo.tablename)
union all
(select B from dbo.tablename)
"""
sqlglot.parse(sql_str)

ParseError: Invalid expression / Unexpected token. Line 3, Col: 1.

> working on a solution, should have something by tonight, i need to also handle complex types like map<> and struct


it has been solved, but another SQL statement still fails:

CREATE TABLE `t_customer_account` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `customer_id` int(11) DEFAULT NULL COMMENT '客户id',
  `bank` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT '行别',
  `account_no` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT '账号',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin COMMENT='客户账户表';

error info

ERROR:root:Expected ). Line 2, Col: 10.

CREATE TABLE `t_customer_account` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `customer_id` int(1
ERROR:root:Invalid expression / Unexpected token. Line 2, Col: 10.

CREATE TABLE `t_customer_account` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `customer_id` int(1
[(CREATE this: 
   (TABLE this: 
     (SCHEMA this: `t_customer_account`, expressions: 
       (COLUMNDEF this: 
         (COLUMN this: `id`, db: , table: ), kind: int, comment: )), db: ), kind: table, expression: , exists: , file_format: , temporary: )]

Originally posted by @exitNA in #20 (comment)

Stale Syntax Tree Transformation example

This is the Syntax Tree Transformation example in the README:

import sqlglot
import sqlglot.expressions as exp

expression_tree = sqlglot.parse_one("SELECT a FROM x")

def transformer(node):
    if isinstance(node, exp.Column) and node.args["this"].text == "a":
        return sqlglot.parse_one("FUN(a)")
    return node

transformed_tree = expression_tree.transform(transformer)
transformed_tree.sql()

When I run it, the transformed_tree isn't changed, the SQL is still SELECT a FROM x.

I played with it a bit and found that it worked after changing .text to .args["this"].

bug - switching between 0 and 1 based indexes

Hive uses 0-based indexing, while Presto uses 1-based indexing; however, in the following example sqlglot doesn't properly translate between the two:

print(sqlglot.transpile("""
SELECT
    SPLIT(str_col_with_space, ' ')[0]
FROM
    my_db.my_table
""", read='hive', write='presto', pretty=True)[0])

returns the query unchanged, but it should be

SELECT
    SPLIT(str_col_with_space, ' ')[1]
FROM
    my_db.my_table

and conversely

print(sqlglot.transpile("""
SELECT
    SPLIT(str_col_with_space, ' ')[1]
FROM
    my_db.my_table
""", read='presto', write='hive', pretty=True)[0])

returns the query unchanged, but it should be

SELECT
    SPLIT(str_col_with_space, ' ')[0]
FROM
    my_db.my_table

This is particularly dangerous when going from presto --> hive, as the wrongly translated code will always still be syntactically correct.

Add support for `PARTITIONED BY`

Currently this raises ParseError: Invalid expression / Unexpected token.

Code to reproduce:

sql = '''create table test (
a int,
b string
) partitioned by (dt string)
stored as parquet'''

sqlglot.transpile(sql, read='hive')

Error message:

ParseError: Invalid expression / Unexpected token. Line 4, Col: 3.
  create table test (
a int,
b string
) partitioned by (dt string)
stored as parquet

EXISTS clause is not supported

I get the following error:

# version: sqlglot 1.26.1

from sqlglot import parse_one
parse_one("SELECT * FROM demo where exists (select * from demo)")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/__init__.py", line 44, in parse_one
    return parse(code, read=read, **opts)[0]
  File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/__init__.py", line 27, in parse
    return dialect.parse(code, **opts)
  File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/dialects.py", line 62, in parse
    return self.parser(**opts).parse(self.tokenizer().tokenize(code), code)
  File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/parser.py", line 255, in parse
    self.check_errors()
  File "/home/sasano8/projects/sqlsuper/.venv/lib/python3.8/site-packages/sqlglot/parser.py", line 263, in check_errors
    raise error
sqlglot.errors.ParseError: Expecting ). Line 1, Col: 34.
  SELECT * FROM demo where exists (select * from demo)

Create table - parser errors

I've run into some create table statements that throw parser errors:

PostgreSQL

  • create table t1(a int primary key)
  • create table t1(s bytea)
  • create table t1(n uuid)
  • create table t1(n int generated always as identity primary key)

Oracle

  • create table t1(n number)
  • create table t1(s long raw)
  • create table t1(s long varchar)
  • create table t1(c1 NUMBER GENERATED by default on null as IDENTITY)

The error is basically always Error: Expecting )

Support for Spark SQL literals with postfix to indicate datatype (e.g. `2Y` or `123.08BD`)

sqlglot does not yet support Spark SQL queries that include literals that are specified with a postfix character to indicate the datatype.

I'd be open to trying to write a PR to implement this functionality. If you'd like me to attempt this, it'd be useful to have some high level pointers about how you think it would be best to implement. (I've not written a tokeniser or parser before, but stepping through your code, I think I have a reasonable understanding of how it works).

Spark - syntax and example of problem

See here for the Spark language reference. For example a postfix of 'Y' is a shorthand way of indicating a TINYINT literal, so you can write 2Y rather than cast(2 as tinyint)

import sqlglot
sql = '2Y'
sqlglot.parse_one(sql, read="spark").sql(dialect="spark")

Results in:

'2 as Y'

When it should result in either:

  • CAST(2 AS TINYINT) or
  • CAST(2 AS BYTE) or
  • 2Y

To give an example of how this can affect the result:


from pyspark.context import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext.getOrCreate()
spark_session = SparkSession(sc)
sql = "select 2Y"
print(spark_session.sql(sql).schema)
sql_parsed = sqlglot.parse_one(sql, read="spark").sql(dialect="spark")
print(spark_session.sql(sql_parsed).schema)

Results in:

StructType(List(StructField(2,ByteType,false)))
StructType(List(StructField(Y,IntegerType,false)))

Where this appears in the Spark codebase

I've also looked in the Spark codebase. I believe this may be the code to parse the literal syntax, and the syntax is also referenced here, as well as in various parts of the test suite like here.

Literals in other SQL flavours

I've done some research to understand how widespread the postfix syntax is, and to get a sense of what approach may be best for tokenising/parsing these queries. I recognise some of these flavours are not supported by sqlglot. The intention is to provide some examples that may help decide the most sensible approach to tokenising/parsing.

DuckDB

I don't think DuckDB supports postfixes. In fact, it parses them in a surprising way:

DuckDB evaluates syntax like:
select 0.1d
to the table:

d
0.1

i.e. the result is the same as select 0.1 as d

Note that the same query in Postgres gives the same result (a column named d with value 0.1), whereas MySQL, MS SQL Server, and SQLite all give a syntax error.

Hive

In Hive, a postfix can be used, much like Spark, documented here

Postgres

In postgres, the :: operator is a 'PostgreSQL-style typecast'. This means that you can write things like select 123::smallint.

More info in the docs.

Oracle

In Oracle it is possible to use postfixes, see here

SQLite

Postfixes are not allowed:

import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("select 2Y")

p.s.

Thanks so much for the library. I'm using it as part of splink, a FOSS record linkage tool. At the moment Splink uses Spark as a backend, but I'd like it to be able to support both DuckDB and Spark - sqlglot looks pretty ideal for this purpose!!

WITHIN GROUP and Filter Not Supported

Hi Toby Mao,

In the latest sqlglot release (2.1.1), the Oracle SQL analytic function LISTAGG is not supported.

For example, with the following SQL query, which works perfectly when executed in Oracle SQL Developer:

SQL_Query_ = "SELECT CLUSTERNAME, DATETYPE, MIN(CLUSTERDATE) CLUSTERDATE, LISTAGG(APPLICATION, '|') WITHIN GROUP (ORDER BY CLUSTERDATE) AS APPLICATION FROM TYPOLOGY_CLUSTERDATE_T_L WHERE DATETYPE = 'ACCEPTANCE DATE' GROUP BY CLUSTERNAME, DATETYPE"

The result of sqlglot parsing is:

sqlglot.parse_one(SQL_Query_)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\__init__.py", line 44, in parse_one
    return parse(sql, read=read, **opts)[0]
  File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\__init__.py", line 27, in parse
    return dialect.parse(sql, **opts)
  File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\dialects.py", line 60, in parse
    return self.parser(**opts).parse(self.tokenizer().tokenize(sql), sql)
  File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\parser.py", line 258, in parse
    return self._parse(
  File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\parser.py", line 309, in _parse
    self.check_errors()
  File "C:\Users\femoreira\AppData\Local\Programs\Python\Python310\lib\site-packages\sqlglot\parser.py", line 316, in check_errors
    raise error
sqlglot.errors.ParseError: Expecting BY. Line 1, Col: 100.
SELECT CLUSTERNAME, DATETYPE, MIN(CLUSTERDATE) CLUSTERDATE, LISTAGG(APPLICATION, '|') WITHIN GROUP (ORDER BY CLUSTERDATE) AS APPLICATION FROM TYPOLOGY_CLUSTERDATE_T_L WHERE DATETYPE = 'ACCEPTANCE DATE

It seems that after the keywords WITHIN GROUP, sqlglot expects the keyword BY, which is not applicable in this case.


Could you solve this in the earliest possible release?

Thanks in advance.

Best Regards,
F. Moreira

Parse error with 'extract' function (Postgresql)

Hello,

This code:

import sqlglot
sqlglot.transpile("select extract(month from '2021-01-31'::timestamp without time zone)", read='postgres', pretty=True)

raises an error:
ParseError: Expected ). Line 0, Col: 26.
select LPAD(extract(month from '2021-01-31'::timestamp without time zone)::text,

The query works without any issue in Postgresql.

Here is the complete traceback:
tb_sqlglot.txt

Thanks.

Best,
