xnuinside / simple-ddl-parser

Simple DDL Parser to parse SQL (HQL, TSQL, AWS Redshift, BigQuery, Snowflake and other dialects) ddl files to json/python dict with full information about columns: types, defaults, primary keys, etc. & table properties, types, domains, etc.

License: MIT License

Python 99.98% Shell 0.02%
sql ddl parser columns types ddl-parser sql-parser postgresql hive mysql

simple-ddl-parser's Introduction

Simple DDL Parser


Built with ply (lex & yacc in Python). A lot of samples are in 'tests/'.

Is it Stable?

Yes, the library already has about 9000+ downloads per day - https://pypistats.org/packages/simple-ddl-parser.

As the maintainer, I guarantee that backward-incompatible changes will not be made in patch or minor versions. But note that output keywords can sometimes change in a minor version when previously wrong behaviour is fixed.

Updates in version 1.x

The full list of updates can be found in the Changelog below (at the end of the README).

Version 1.0.0 was released due to significant changes in the output structure and a stricter approach regarding the scope of the produced output. Now you must provide the argument 'output_mode=name_of_your_dialect' if you wish to see arguments/properties specific to a particular dialect.

How does it work?

Parser supports:

  • SQL
  • HQL (Hive)
  • MSSQL dialect
  • Oracle dialect
  • MySQL dialect
  • PostgreSQL dialect
  • BigQuery
  • Redshift
  • Snowflake
  • SparkSQL
  • IBM DB2 dialect

You can check the dialect sections after the Supported Statements section for more information about which statements from each dialect are already supported by the parser. If you need more statements or new dialects - feel free to open an issue.

Feel free to open an issue with a DDL sample.

Pay attention that I am adding functional tests for all supported statements, so if you see that your statement fails and you can't find it in the tests, it is 99.9% likely that I simply didn't have a sample with such a SQL statement - so feel free to open an issue and I will add support for it.

If you need some statement that is not supported by the parser yet: please provide a DDL example & information about which SQL dialect or DB it is.

The types used in your DB do not matter, so the parser should work successfully with any DDL for a SQL DB. The parser is NOT case sensitive; it does not expect all queries to be in upper case or lower case. So you can write statements like this:

    Alter Table Persons ADD CONSTRAINT CHK_PersonAge CHECK (Age>=18 AND City='Sandnes');

It will be parsed as is without errors.

If you have samples that cause an error - please open an issue (but don't forget to add the DDL example) and I will be glad to fix it.

You can find a lot of statements and their output results in the tests on GitHub - https://github.com/xnuinside/simple-ddl-parser/tree/main/tests .

How to install

    pip install simple-ddl-parser

How to use

Extract additional information from HQL (& other dialects)

In some dialects like HQL there is a lot of additional information about a table, for example whether it is an external table, STORED AS, location, etc. These properties will always be empty in a 'classic' SQL DB like PostgreSQL or MySQL, and this is the reason why this information is 'hidden' by default. Also, some fields are hidden in HQL because they simply do not exist in Hive, for example 'deferrable_initially'. To get these 'hql'-specific details about the table in the output, please use the 'output_mode' argument of the run() method.

example:

    ddl = """
    CREATE TABLE IF NOT EXISTS default.salesorderdetail(
        SalesOrderID int,
        ProductID int,
        OrderQty int,
        LineTotal decimal
        )
    PARTITIONED BY (batch_id int, batch_id2 string, batch_32 some_type)
    LOCATION 's3://datalake/table_name/v1'
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
    STORED AS TEXTFILE
    """

    result = DDLParser(ddl).run(output_mode="hql")
    print(result)

And you will get output with additional keys 'stored_as', 'location', 'external', etc.

    # additional keys examples
  {
    ...,
    'location': "'s3://datalake/table_name/v1'",
    'map_keys_terminated_by': "'\\003'",
    'partitioned_by': [{'name': 'batch_id', 'size': None, 'type': 'int'},
                        {'name': 'batch_id2', 'size': None, 'type': 'string'},
                        {'name': 'batch_32', 'size': None, 'type': 'some_type'}],
    'primary_key': [],
    'row_format': 'DELIMITED',
    'schema': 'default',
    'stored_as': 'TEXTFILE',
    ... 
  }

If you run the parser from the command line, add the flag '-o=hql' or '--output-mode=hql' to get the same result.

Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

From python code

    from simple_ddl_parser import DDLParser


    parse_results = DDLParser("""create table dev.data_sync_history(
        data_sync_id bigint not null,
        sync_count bigint not null,
        sync_mark timestamp  not  null,
        sync_start timestamp  not null,
        sync_end timestamp  not null,
        message varchar(2000) null,
        primary key (data_sync_id, sync_start)
    ); """).run()

    print(parse_results) 

To parse from file

    
    from simple_ddl_parser import parse_from_file

    result = parse_from_file('tests/sql/test_one_statement.sql')
    print(result)

From command line

simple-ddl-parser is installed into the environment as the command sdp

    sdp path_to_ddl_file

    # for example:

    sdp tests/sql/test_two_tables.sql
    

You will see the output in the schemas folder in a file named test_two_tables_schema.json

If you also want output in the console - use the -v flag for verbose mode.

    
    sdp tests/sql/test_two_tables.sql -v
    

If you don't want to dump the schema to a file and just want to print the result to the console, use the --no-dump flag:

    
    sdp tests/sql/test_two_tables.sql --no-dump
    

You can provide a target path where you want to dump the result with the argument -t, --target:

    
    sdp tests/sql/test_two_tables.sql -t dump_results/
    

Get Output in JSON

If you want to get JSON output to stdout, you can use the argument json_dump=True in the method .run() for this

    from simple_ddl_parser import DDLParser


    parse_results = DDLParser("""create table dev.data_sync_history(
        data_sync_id bigint not null,
        sync_count bigint not null,
    ); """).run(json_dump=True)

    print(parse_results) 

Output will be:

[{"columns": [{"name": "data_sync_id", "type": "bigint", "size": null, "references": null, "unique": false, "nullable": false, "default": null, "check": null}, {"name": "sync_count", "type": "bigint", "size": null, "references": null, "unique": false, "nullable": false, "default": null, "check": null}], "primary_key": [], "alter": {}, "checks": [], "index": [], "partitioned_by": [], "tablespace": null, "schema": "dev", "table_name": "data_sync_history"}]

More details

The DDLParser(ddl).run() method contains several arguments that impact the output result. As you saw above, there is the argument output_mode that allows you to set a dialect and get more fields in the output relative to the chosen dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

The .run() method also has the argument group_by_type (by default: False). By default the output of the parser looks like a list of dicts where each dict == one entity from the DDL (table, sequence, type, etc.). To understand what the current entity is, you need to check the dict: if 'table_name' is in the dict - this is a table, if 'type_name' - this is a type, etc.
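
For illustration, here is a minimal sketch (not part of the original docs) of telling entities apart in this flat output, using the key names shown in the examples in this README:

    from simple_ddl_parser import DDLParser

    ddl = """
    CREATE TABLE dev.orders (order_id bigint not null);
    CREATE SEQUENCE dev.order_ids INCREMENT 1 START 1;
    """

    for entity in DDLParser(ddl).run():
        # each dict is one parsed entity; its keys tell you what it is
        if "table_name" in entity:
            print("table:", entity["table_name"])
        elif "sequence_name" in entity:
            print("sequence:", entity["sequence_name"])
        elif "type_name" in entity:
            print("type:", entity["type_name"])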

To make this a little easier you can set group_by_type=True and you will get the output already grouped by entity type, like:

    { 
        'tables': [all_parsed_tables],
        'sequences': [all_parsed_sequences],
        'types': [all_parsed_types],
        'domains': [all_parsed_domains],
        ...
    }

For example:

    ddl = """
    CREATE TYPE "schema--notification"."ContentType" AS
        ENUM ('TEXT','MARKDOWN','HTML');
        CREATE TABLE "schema--notification"."notification" (
            content_type "schema--notification"."ContentType"
        );
    CREATE SEQUENCE dev.incremental_ids
        INCREMENT 10
        START 0
        MINVALUE 0
        MAXVALUE 9223372036854775807
        CACHE 1;
    """

    result = DDLParser(ddl).run(group_by_type=True)

    # result will be:

    {'sequences': [{'cache': 1,
                    'increment': 10,
                    'maxvalue': 9223372036854775807,
                    'minvalue': 0,
                    'schema': 'dev',
                    'sequence_name': 'incremental_ids',
                    'start': 0}],
    'tables': [{'alter': {},
                'checks': [],
                'columns': [{'check': None,
                            'default': None,
                            'name': 'content_type',
                            'nullable': True,
                            'references': None,
                            'size': None,
                            'type': '"schema--notification"."ContentType"',
                            'unique': False}],
                'index': [],
                'partitioned_by': [],
                'primary_key': [],
                'schema': '"schema--notification"',
                'table_name': '"notification"'}],
    'types': [{'base_type': 'ENUM',
                'properties': {'values': ["'TEXT'", "'MARKDOWN'", "'HTML'"]},
                'schema': '"schema--notification"',
                'type_name': '"ContentType"'}]}

ALTER statements

Right now, support is added only for ALTER statements with FOREIGN KEY.

For example, if in your ddl after table definitions (create table statements) you have ALTER table statements like this:

ALTER TABLE "material_attachments" ADD FOREIGN KEY ("material_id", "material_title") REFERENCES "materials" ("id", "title");

These statements will be parsed and the information about them put inside the 'alter' key in the table's dict. For examples, please check the alter statement tests - tests/test_alter_statements.py
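
For illustration, a minimal sketch (the ALTER statement is the one from above; the exact inner structure of the 'alter' entry is shown in the tests, so only the key itself is assumed here):

    from simple_ddl_parser import DDLParser

    ddl = """
    CREATE TABLE "materials" ("id" int PRIMARY KEY, "title" varchar(100));
    CREATE TABLE "material_attachments" ("material_id" int, "material_title" varchar(100));
    ALTER TABLE "material_attachments" ADD FOREIGN KEY ("material_id", "material_title") REFERENCES "materials" ("id", "title");
    """

    for table in DDLParser(ddl).run():
        # tables that received an ALTER have a non-empty 'alter' dict
        if table.get("alter"):
            print(table["table_name"], table["alter"])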

More examples & tests

You can find in tests/ folder.

Dump result in json

To dump the result in json, use the argument .run(dump=True)

You can also provide a path where you want to store the schema dumps with the argument .run(dump_path='folder_that_use_for_dumps/')
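
A minimal sketch combining both arguments (it is not stated here whether dump_path alone implies dumping, so dump=True is passed explicitly; the folder name is just an example):

    from simple_ddl_parser import DDLParser

    ddl = "create table dev.data_sync_history (data_sync_id bigint not null);"

    # dump the parsed schema as JSON into the default folder ...
    DDLParser(ddl).run(dump=True)

    # ... or into a folder of your choice
    DDLParser(ddl).run(dump=True, dump_path="dump_results/")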

Raise error if DDL cannot be parsed by Parser

By default the parser does not raise an error if some statement cannot be parsed - it just skips it & produces empty output.

To change this behavior you can pass the 'silent=False' argument to the main parser class, like:

DDLParser(.., silent=False)
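
For illustration, a minimal sketch of both behaviours (the deliberately broken DDL and the broad except are only for demonstration; the concrete exception class is not documented here):

    from simple_ddl_parser import DDLParser

    broken_ddl = "CREATE TABLE ("  # deliberately malformed

    # default: the statement is silently skipped and the output is empty
    print(DDLParser(broken_ddl).run())

    # silent=False: the parser raises instead of skipping
    try:
        DDLParser(broken_ddl, silent=False).run()
    except Exception as exc:
        print("parsing failed:", exc)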

Normalize names

Use the DDLParser(.., normalize_names=True) flag to change the output of the parser: if the flag is True (default 'False') then all identifiers will be returned without '[', '"' and other delimiters that are used in different SQL dialects to separate custom names from reserved words & statements. For example, if the flag is set to 'True' and you pass this input:

CREATE TABLE [dbo].[TO_Requests]( [Request_ID] [int] IDENTITY(1,1) NOT NULL, [user_id] [int]

In output you will have names like 'dbo' and 'TO_Requests', not '[dbo]' and '[TO_Requests]'.
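
A minimal runnable sketch of the example above (the closing parenthesis and semicolon are added here only to make the snippet complete DDL):

    from simple_ddl_parser import DDLParser

    ddl = """
    CREATE TABLE [dbo].[TO_Requests](
        [Request_ID] [int] IDENTITY(1,1) NOT NULL,
        [user_id] [int]
    );
    """

    result = DDLParser(ddl, normalize_names=True).run()
    # expect 'dbo' and 'TO_Requests' instead of '[dbo]' and '[TO_Requests]'
    print(result[0]["schema"], result[0]["table_name"])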

Supported Statements

  • CREATE [OR REPLACE] TABLE [ IF NOT EXISTS ] + columns definition, columns attributes: column name + type + type size(for example, varchar(255)), UNIQUE, PRIMARY KEY, DEFAULT, CHECK, NULL/NOT NULL, REFERENCES, ON DELETE, ON UPDATE, NOT DEFERRABLE, DEFERRABLE INITIALLY, GENERATED ALWAYS, STORED, COLLATE

  • STATEMENTS: PRIMARY KEY, CHECK, FOREIGN KEY in table definitions (in create table();)

  • ALTER TABLE STATEMENTS: ADD CHECK (with CONSTRAINT), ADD FOREIGN KEY (with CONSTRAINT), ADD UNIQUE, ADD DEFAULT FOR, ALTER TABLE ONLY, ALTER TABLE IF EXISTS; ALTER .. PRIMARY KEY; ALTER .. USING INDEX TABLESPACE; ALTER .. ADD; ALTER .. MODIFY; ALTER .. ALTER COLUMN; etc

  • PARTITION BY statement

  • CREATE SEQUENCE with words: INCREMENT [BY], START [WITH], MINVALUE, MAXVALUE, CACHE

  • CREATE TYPE statement: AS TABLE, AS ENUM, AS OBJECT, INTERNALLENGTH, INPUT, OUTPUT

  • LIKE statement (in this and only this case the 'like' keyword is added to the output with information about the table we did LIKE from - 'like': {'schema': None, 'table_name': 'Old_Users'}; see the example after this list).

  • TABLESPACE statement

  • COMMENT ON statement

  • CREATE SCHEMA [IF NOT EXISTS] ... [AUTHORIZATION] ...

  • CREATE DOMAIN [AS]

  • CREATE [SMALLFILE | BIGFILE] [TEMPORARY] TABLESPACE statement

  • CREATE DATABASE + Properties parsing
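
As referenced in the LIKE bullet above, a minimal sketch of the 'like' output key (the table names and the expected value are taken from that bullet; treat this as an illustration, not a guaranteed output shape):

    from simple_ddl_parser import DDLParser

    ddl = "CREATE TABLE New_Users (LIKE Old_Users);"

    result = DDLParser(ddl).run()
    # expected, per the bullet above: {'schema': None, 'table_name': 'Old_Users'}
    print(result[0].get("like"))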

SparkSQL Dialect statements

  • USING

HQL Dialect statements

  • PARTITIONED BY statement
  • ROW FORMAT, ROW FORMAT SERDE
  • WITH SERDEPROPERTIES ("input.regex" = "..some regex..")
  • STORED AS (AVRO, PARQUET, etc), STORED AS INPUTFORMAT, OUTPUTFORMAT
  • COMMENT
  • LOCATION
  • FIELDS TERMINATED BY, LINES TERMINATED BY, COLLECTION ITEMS TERMINATED BY, MAP KEYS TERMINATED BY
  • TBLPROPERTIES ('parquet.compression'='SNAPPY' & etc.)
  • SKEWED BY
  • CLUSTERED BY

MySQL

  • ON UPDATE in column without reference

MSSQL

  • CONSTRAINT [CLUSTERED]... PRIMARY KEY
  • CONSTRAINT ... WITH statement
  • PERIOD FOR SYSTEM_TIME in CREATE TABLE statement
  • ON [PRIMARY] after CREATE TABLE statement (sample in test files test_mssql_specific.py)
  • WITH statement for TABLE properties
  • TEXTIMAGE_ON statement
  • DEFAULT NEXT VALUE FOR in COLUMN DEFAULT

MSSQL / MySQL / Oracle

  • type IDENTITY statement
  • FOREIGN KEY REFERENCES statement
  • 'max' specifier in column size
  • CONSTRAINT ... UNIQUE, CONSTRAINT ... CHECK, CONSTRAINT ... FOREIGN KEY, CONSTRAINT ... PRIMARY KEY
  • CREATE CLUSTERED INDEX
  • CREATE TABLE (...) ORGANIZATION INDEX

Oracle

  • ENCRYPT column property [+ NO SALT, SALT, USING]
  • STORAGE column property

PostgreSQL

AWS Redshift Dialect statements

  • ENCODE column property

  • SORTKEY, DISTSTYLE, DISTKEY, ENCODE table properties

  • CREATE TEMP / TEMPORARY TABLE

  • syntax like with LIKE statement:

create temp table tempevent(like event);

Snowflake Dialect statements

  • CREATE .. CLONE statements for table, database and schema
  • CREATE TABLE [or REPLACE] [ TRANSIENT | TEMPORARY ] .. CLUSTER BY ..
  • CONSTRAINT .. [NOT] ENFORCED
  • COMMENT = in CREATE TABLE & CREATE SCHEMA statements
  • WITH MASKING POLICY
  • WITH TAG, including multiple tags in the same statement.
  • DATA_RETENTION_TIME_IN_DAYS
  • MAX_DATA_EXTENSION_TIME_IN_DAYS
  • CHANGE_TRACKING

BigQuery

  • OPTION in CREATE SCHEMA statement
  • OPTION in CREATE TABLE statement
  • OPTION in column definition statement

Parser settings

Logging

  1. Logging to file

To get logging output to a file, you should provide the 'log_file' argument to the Parser with a path or file name:

    DDLParser(ddl, log_file='parser221.log').run(group_by_type=True)
  2. Logging level

To set the logging level, you should provide the argument 'log_level':

    import logging

    DDLParser(ddl, log_level=logging.INFO).run(group_by_type=True)

Thanks for involving & contributions

The biggest 'Thanks' ever goes to these contributors to the parser: https://github.com/dmaresma https://github.com/cfhowes https://github.com/swiatek25 https://github.com/slurpyb https://github.com/PBalsdon

Big thanks for the involvement & contributions with test cases, DDL samples & opened issues go to:

for help with debugging & testing support for BigQuery dialect DDLs:

Changelog

v1.1.0

Improvements

MySQL:

  1. Added support for INDEX statement inside table definition
  2. Added support for MySQL INVISIBLE/VISIBLE statement - #243

Snowflake:

  1. Added support for cluster by statement before columns definition - #234

v1.0.4

Improvements

  1. Support functions with schema prefix in DEFAULT and CHECK statements. #240

Fixes

  1. Fix for REFERENCES NOT NULL - #239
  2. Fix for the Snowflake stage name location format bug - #241

v1.0.3

Improvements

  1. Fixed bug with CREATE OR REPLACE SCHEMA.
  2. Added support for creating empty tables without columns: CREATE TABLE tablename (); (valid syntax in SQL)

Snowflake

  1. Fixed a bug with a Snowflake stage fileformat option value equal to a single string, such as FIELD_OPTIONALLY_ENCLOSED_BY = '\"', FIELD_DELIMITER = '|'
  2. Improved parsing of Snowflake fileformat key=value pairs into a dict type.

v1.0.2

Improvements

  1. Fixed a bug that placed the first table property value in the 'authorization' key. Now the real property name is used.
  2. Fixed a typo in the Databricks dialect.
  3. Improved support for equals symbols within the COMMENT statement.
  4. Turned regexps into functions.

MySQL Improvements

  1. The UNSIGNED property after int is now parsed correctly

Snowflake

  1. Snowflake TAG now available on SCHEMA definitions.

v1.0.1

Minor Fixes

  1. When using normalize_names=True do not remove [] from types like decimal(21)[].
  2. When using normalize_names=True ensure that "complex"."type" style names convert to complex.type.

v1.0.0 Important changes were made to the output structure that can, in theory, break existing code.

Important changes

  1. Important change:

all custom table properties that are defined after the column definitions in a 'CREATE TABLE' statement and are relevant to only one dialect (only for SparkSQL, or HQL, etc.), for example like here: https://github.com/xnuinside/simple-ddl-parser/blob/main/tests/dialects/test_snowflake.py#L767 or https://github.com/xnuinside/simple-ddl-parser/blob/main/tests/dialects/test_spark_sql.py#L133, are now saved in the property table_properties as a dict. Previously they were placed on the same level of the table output as columns, alter, etc. Now they are grouped and moved to the key table_properties.

  2. The formatting of the parser result is now represented by 2 classes - Output & TableData - which makes it more strict and readable.

  3. The output mode now functions more strictly. If you want to obtain output fields specific to a certain dialect, use output_mode='snowflake' for Snowflake or output_mode='hql' for HQL, etc. Previously, some keys appeared in the result without being filtered by dialect. For example, if 'CLUSTER BY' was in the DDL, it would show up in the 'cluster_by' field regardless of the output mode. However, now all fields that only work in certain dialects and are not part of the basic SQL notation will only be shown if you choose the correct output_mode.

New Dialects support

  1. Added new dialects as possible output_modes:
  • Databricks SQL as 'databricks',
  • Vertica as 'vertica',
  • SQLite as 'sqlite',
  • PostgreSQL as 'postgres'

The full list of supported dialects can be found in the dict supported_dialects:

from simple_ddl_parser import supported_dialects

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see the dialect that you want to use - open an issue with a description and links to the database docs, or use one of the existing dialects.

Snowflake updates:

  1. For some reason, the 'CLONE' statement in Snowflake was parsed into the 'like' key in the output. Now it has been changed to 'clone' - the inner structure of the output stays the same as previously.

MySQL updates:

  1. The ENGINE statement is now parsed correctly. Previously, the output was always '='.

BigQuery updates:

  1. The word 'schema' is completely removed from the output. 'dataset' is used instead of schema in the BigQuery dialect.

v0.32.1

Minor Fixes

  1. Removed debug print

v0.32.0

Improvements

  1. Added support for several ALTER statements (ADD, DROP, RENAME, etc) - #215. In the 'alter' output several keys were added:
    1. 'dropped_columns' - to store information about columns that were in the table but were later dropped by alter
    2. 'renamed_columns' - to store information about columns that were renamed
    3. 'modified_columns' - to track alter column changes for defaults, data type, etc. The argument stores previous column states.

Fixes

  1. Include source column names in FOREIGN KEY references. Fix for: #196
  2. An ALTER statement will now be parsed correctly even if names & schemas are written differently in the create table statement and the alter statement. For example, if in create table you use quotes like "schema_name"."table_name", but in the alter it was schema_name.table_name - previously it didn't work, but now the parser understands that it is the same table.

v0.31.3

Improvements

Snowflake update:

  1. Added support for Snowflake Virtual Column definition in External Column AS () statement - #218
  2. Enforced support for Snowflake _FILE_FORMAT options in the External Column DDL statement - #221

Others

  1. Support for KEY statement in CREATE TABLE statements. KEY statements will now create INDEX entries in the DDL parser.

v0.31.2

Improvements

Snowflake update:

  1. Added support for Snowflake AUTOINCREMENT | IDENTITY column definitions with optional parameter ORDER|NOORDER statement - #213

Common

  1. Added param 'encoding' to parse_from_file function - #142. Default encoding is utf-8.

v0.31.1

Improvements

Snowflake update:

  1. Support multiple tag definitions in a single WITH TAG statement.
  2. Added support for Snowflake double single quotes - #208

v0.31.0

Fixes:

  1. Move inline flag in regexp (issue with python 3.11) - #200 Fix for: #199

Improvements:

  1. Added Snowflake Table DDL support of WITH MASKING POLICY column definition - #201

Updates:

  1. All deps updated to the latest versions.

v0.30.0

Fixes:

  1. IDENTITY now parsed normally as a separate column property. Issue: #184

New Features:

  1. IN TABLESPACE IBM DB2 statement now is parsed into 'tablespace' key. Issue: #194. INDEX IN also parsed to 'index_in' key. Added support for ORGANIZE BY statement

  2. Added support for PostgreSQL INHERITS statement. Issue: #191

v0.29.1

Important updates:

  1. Python 3.6 is deprecated in tests and by default; try to move to Python 3.7, but better to 3.8, because 3.7 will be deprecated in 2023.

Fixes

  1. Fix for #177

Improvements

  1. Added support for Oracle 2 component size for types, like '30 CHAR'. From #176

v0.29.0

Fixes

  1. The AUTOINCREMENT statement is now parsed correctly, the same way as AUTO_INCREMENT, and shows up in the output as the 'autoincrement' property of the column. Fix for: #170
  2. Fixed the issue "TypeError: argument of type 'NoneType' is not iterable" on some foreign keys #148

New Features

  1. Support for non-numeric column type parameters #171. They show up in the column attribute 'type_parameters'.

v0.28.1 Improvements:

  1. Lines starting with an INSERT INTO statement are now successfully ignored by the parser (so you can keep them in the DDL - they will just be skipped)

Fixes:

  1. Important fix for multiline comments

v0.28.0

Important Changes (Pay attention):

  1. Because AUTO_INCREMENT is now parsed as a separate column property, the previous output has changed. Previously it was parsed as part of the type, like 'INT AUTO_INCREMENT'. Now the type will be only 'INT', but in the column properties you will see 'autoincrement': True.

Amazing innovation:

  1. It is weird to write this in a Changelog, but only in version 0.28.0 did I recognize that floats were not supported by the parser, & it was fixed. Thanks for the sample in the issue: #163

Improvements: MariaDB:

  1. Added support for MariaDB AUTO_INCREMENT (from the ddl here - #144). If a column is auto-incremented, it is indicated as 'autoincrement': True in the column definition

Common:

  1. Added parsing for multiline comments in DDL with /* */ syntax.
  2. Comments from the DDL are now all placed under the 'comments' key if you use the group_by_type= arg in the parser.
  3. Added the argument 'parser_settings={}' (dict type) to the method parse_from_file() - this way you can pass any arguments that you want to DDLParser (& that are supported by it). So, if you want to set log_level=logging.WARNING for the parser - just use it as: parse_from_file('path_to_file', parser_settings={'log_level': logging.WARNING}). For issue: #160

v0.27.0

Fixes:

  1. Fixed parsing CHECKS with IN statement - #150
  2. @# symbols added to ID token - (partially) #146

Improvements:

  1. Added support for '*' in size column (ORACLE dialect) - #151
  2. Added the arg 'debug' to the parser; it works the same way as 'silent' - to get clearer error output.

New features:

  1. Added support for ORACLE 'ORGANIZATION INDEX'
  2. Added support for SparkSQL Partition by with procedure call - #154
  3. Added support for DEFAULT CHARSET statement MySQL - #153

v0.26.5

Fixes:

  1. Parsetab is now included in builds.
  2. Added an additional argument log_file='path_to_file' to enable logging to a file with the provided name.

v0.26.4

  1. Bugfix for (support CREATE OR REPLACE with additional keys like transient/temporary): #133

v0.26.3

Improvements:

  1. Added support for OR REPLACE in CREATE TABLE: #131
  2. Added support for AUTO INCREMENT in column:#130

v0.26.2

Fixes:

  1. Fixed a huge bug with incorrect parsing of lines containing 'USE' & 'GO' strings.
  2. Fixed parsing of CREATE SCHEMA for Snowflake & Oracle DDLs

Improvements:

  1. Added COMMENT statement for CREATE TABLE ddl (for SNOWFLAKE dialect support)
  2. Added COMMENT statement for CREATE SCHEMA ddl (for SNOWFLAKE dialect support)

v0.26.1

Fixes:

  1. Support multiple SERDEPROPERTIES - #126
  2. Fix for issue with LOCATION and TBLPROPERTIES clauses in CREATE TABLE LIKE - #125
  3. LOCATION now works correctly with double quote strings

v0.26.0 Improvements:

  1. Added more explicit debug message on Statement errors - #116
  2. Added support for "USING INDEX TABLESPACE" statement in ALTER - #119
  3. Added support for IN statements in CHECKS - #121

New features:

  1. Support SparkSQL USING - #117. Updates initiated by ticket #120:
  2. In the Parser you can use the argument json_dump=True in the method .run() if you want to get the result in JSON format.
  • README updated

Fixes:

  1. Added support for PARTITION BY one column without type
  2. Alter table add constraint PRIMARY KEY - #119
  3. Fix for parsing the SET statement - #122
  4. Fix for disappeared columns without properties - #123

v0.25.0

Fixes:

  1. Fix for issue with 'at time zone' #112

New features:

  1. Added flag to raise errors if parser cannot parse statement DDLParser(.., silent=False) - #109
  2. Added flag to DDLParser(.., normalize_names=True) that change output of parser: if flag is True (default 'False') then all identifiers will be returned without '[', '"' and other delimiters that used in different SQL dialects to separate custom names from reserved words & statements. For example, if flag set 'True' and you pass this input:

CREATE TABLE [dbo].[TO_Requests]( [Request_ID] [int] IDENTITY(1,1) NOT NULL, [user_id] [int]

In output you will have names like 'dbo' and 'TO_Requests', not '[dbo]' and '[TO_Requests]'.

v0.24.2

Fixes:

  1. Fix for the issue: #108 (reserved words can be used as table name after '.')

v0.24.1

Fixes:

HQL:

  1. fields_terminated_by now parses ',' as "','", not as '' as it did previously

Common:

  1. Added the 'if_not_exists' field to the output to make it possible to re-create the DDL 1-to-1 from the metadata.

v0.24.0

Fixes:

HQL:

  1. More than 2 tblproperties are now parsed correctly #104

Common:

  1. 'set' in lower case is now also parsed correctly.
  2. Names like 'schema', 'database', 'table' can now be used as names in CREATE DATABASE | SCHEMA | TABLESPACE | DOMAIN | TYPE statements and after INDEX and CONSTRAINT.
  3. Creation of empty tables is also parsed correctly (like CREATE Table table;).

New Statements Support:

HQL:

  1. Added support for CLUSTERED BY - #103
  2. Added support for INTO ... BUCKETS
  3. CREATE REMOTE DATABASE | SCHEMA

v0.23.0

Big refactoring: less code complexity & increased code coverage. Radon added to pre-commit hooks.

Fixes:

  1. Fix for issue with ALTER UNIQUE - #101

New Features

  1. SQL comment strings from the DDL are now parsed to the "comments" key in the output.

PostgreSQL:

  1. Added support for ALTER TABLE ONLY | ALTER TABLE IF EXISTS

v0.22.5

Fixes:

  1. Fix for issue with '<' - #89

v0.22.4

Fixes:

BigQuery:

  1. Fixed issue with parsing schemas with project in name.
  2. Added support for multiple OPTION() statements

v0.22.3

Fixes:

BigQuery:

  1. A CREATE TABLE statement with a 'project_id' in a format like project.dataset.table_name is now parsed correctly; 'project' is added to the output. Also added support for the project.dataset.name format in CREATE SCHEMA and ALTER statements

v0.22.2

Fixes:

  1. Fix for the issue: #94 (column name starts with CREATE)

v0.22.1

New Features:

BigQuery:

  1. Added support for OPTION for full CREATE TABLE statement & column definition

Improvements:

  1. CLUSTER BY can be used without ()

simple-ddl-parser's People

Contributors

burakuyar, cfhowes, dependabot[bot], dmaresma, gliptak, kliushnichenko, mkaraev, nurih, pbalsdon, slurpyb, xnuinside


simple-ddl-parser's Issues

Multi partition is not handled

Describe the bug
Multi partition is not handled

To Reproduce
ddl = """
CREATE EXTERNAL TABLE test (
test STRING NULL COMMENT 'xxxx',
)
PARTITIONED BY (snapshot STRING, cluster STRING)
STORED AS PARQUET LOCATION 'hdfs://xxxx';
"""
from simple_ddl_parser import DDLParser
parse_results = DDLParser(ddl).run(output_mode="hql")

Expected behavior
non empty json response

MSSQL SEQUENCE syntax correction

Describe the bug
On MSSQL , SEQUENCE definition has the optional | INCREMENT BY and START WITH

Traceback (most recent call last):
File "/home/fred/tmp/testeparse.py", line 16, in
result = DDLParser(s0).run()
File "/home/fred/.local/lib/python3.9/site-packages/simple_ddl_parser/parser.py", line 217, in run
tables = self.parse_data()
File "/home/fred/.local/lib/python3.9/site-packages/simple_ddl_parser/parser.py", line 166, in parse_data
self.parse_statement(tables, statement)
File "/home/fred/.local/lib/python3.9/site-packages/simple_ddl_parser/parser.py", line 173, in parse_statement
_parse_result = yacc.parse(statement)
File "/usr/lib/python3/dist-packages/ply/yacc.py", line 333, in parse
return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
File "/usr/lib/python3/dist-packages/ply/yacc.py", line 1120, in parseopt_notrack
p.callable(pslice)
File "/home/fred/.local/lib/python3.9/site-packages/simple_ddl_parser/dialects/sql.py", line 730, in p_expression_seq
p[0].update({p[2].lower(): int(p_list[-1])})
ValueError: invalid literal for int() with base 10: 'WITH'

To Reproduce

from simple_ddl_parser import DDLParser

CREATE SEQUENCE [dbo].[sqCdSLIPEvt]
AS [bigint]
START WITH 1
INCREMENT BY 1
MINVALUE 1
CACHE
GO

ddlx = '''
CREATE SEQUENCE [dbo].[sqCdSLIPEvt]
AS [bigint]
START WITH 1
INCREMENT BY 1
MINVALUE 1
CACHE
GO
'''
result = DDLParser(ddlx).run()
print( result )

Expected behavior
Interpretation as stated by docs

Desktop (please complete the following information):

  • OS: Debian 10, python 3.9,
  • Version
  • /home/fred/.local/lib/python3.9/site-packages/simple_ddl_parser/parsetab.py
    Proceed (y/n)? y
    Successfully uninstalled simple-ddl-parser-0.21.0
    fred@alpha:~/Documentos/GoiasFomento/QtPy/src/SQLs$ pip3 install simple-ddl-parser
    Collecting simple-ddl-parser
    Using cached simple_ddl_parser-0.21.0-py3-none-any.whl (32 kB)
    Requirement already satisfied: ply<4.0,>=3.11 in /usr/lib/python3/dist-packages (from simple-ddl-parser) (3.11)
    Installing collected packages: simple-ddl-parser
    Successfully installed simple-ddl-parser-0.21.0

Additional context
If "BY" and "WITH" are omitted , it works.,

changing dialects/sql.py from ::
def p_expression_seq(self, p: List) -> None:
"""expr : seq_name
| expr INCREMENT ID
| expr START ID
| expr MINVALUE ID
| expr MAXVALUE ID
| expr CACHE ID
"""
to::
def p_expression_seq(self, p: List) -> None:
"""expr : seq_name
| expr INCREMENT BY ID
| expr START WITH ID
| expr MINVALUE ID
| expr MAXVALUE ID
| expr CACHE ID
"""

Seemed to solve the problem.

Don't know if that's the right place.

As said before, great gig.
Thanks for the work.

Have a nice day

Failure to parse output from SSMS

Describe the bug
Here's an example of (silently) failing to parse DDL generated by MS SQL Server Management Studio. Other DDL output from SSMS parses just fine. So I'm not sure where this one is failing.

To Reproduce
Steps to reproduce the behavior:

sdp aud.tcal_tgt.Table.sql

Expected behavior
Out in schemas/aud_schema.json that contains the table info.

Additional context
Try it with this:

USE [sasgolddevdb]
GO
/****** Object:  Table [aud].[tcal_tgt]    Script Date: 11/11/2021 11:18:41 AM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [aud].[tcal_tgt](
	[TCAL_SID] [decimal](30, 0) NOT NULL,
	[TERM_YR] [varchar](4) NULL,
	[TERM_CD] [varchar](4) NULL,
	[CAL_CD] [varchar](2) NULL,
	[LMOD_USER] [varchar](4) NULL,
	[LMOD_DT] [datetime2](7) NULL,
	[TERM_BEGIN_DT] [datetime2](7) NULL,
	[TERM_END_DT] [datetime2](7) NULL,
	[EFF_EXP_DT] [datetime2](7) NULL,
	[EFF_EXP_FLG] [varchar](1) NULL,
	[FEED_OPR] [varchar](1) NULL,
	[REC_CREATED_DT] [datetime2](7) NULL,
	[REC_CREATED_BY] [varchar](25) NULL,
	[REC_MODIFIED_DT] [datetime2](7) NULL,
	[REC_MODIFIED_BY] [varchar](25) NULL,
	[REC_DELETED_DT] [datetime2](7) NULL,
	[REC_DELETED_BY] [varchar](25) NULL,
	[SYSTEM_OPERATION] [varchar](1) NULL,
PRIMARY KEY CLUSTERED 
(
	[TCAL_SID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY]

GO

Incorrect JSON wrapper break JQ

Describe the bug

The commonly used Linux command line tool jq doesn't work with the JSON prepared by sdp, because jq expects the wrapper to be a double quote.
Example of real GCP BigQuery DDL:

DDL:

$ cat sample_ddl.sql

CREATE  TABLE `dataset.table`
(
    Updated_Date           DATE   NOT NULL OPTIONS (description =Description1),
    Lead_Store_Id          INT64  NOT NULL OPTIONS (description =Description2),
)
PARTITION BY LeadCreatedDate
CLUSTER BY Lead_Store_Id

OUTPUT:

$ sdp -o bigquery --no-dump sample_ddl.sql
Start parsing file sample_ddl.sql

File with result was saved to >> schemas folder
[{'alter': {},
  'checks': [],
  'cluster_by': ['Lead_Store_Id'],
  'columns': [{'check': None,
               'default': None,
               'name': 'Updated_Date',
               'nullable': False,
               'options': [{'description': '=Description1'}],
               'references': None,
               'size': None,
               'type': 'DATE',
               'unique': False},
              {'check': None,
               'default': None,
               'name': 'Lead_Store_Id',
               'nullable': False,
               'options': [{'description': '=Description2'}],
               'references': None,
               'size': None,
               'type': 'INT64',
               'unique': False}],
  'constraints': {'checks': None, 'references': None, 'uniques': None},
  'index': [],
  'partition_by': {'columns': ['LeadCreatedDate'], 'type': None},
  'partitioned_by': [],
  'primary_key': [],
  'schema': '`dataset',
  'table_name': 'table`',
  'tablespace': None}]

JQ run:

$ sdp -o bigquery --no-dump sample_ddl.sql |  egrep -v 'Start parsing file |File with result was saved to ' | jq .
parse error: Invalid numeric literal at line 2, column 10

Followed JSON is OK:

[{"alter": {},
  "checks": [],
  "cluster_by": ["Lead_Store_Id"],
  "columns": [{"check": "None",
               "default": "None",
               "name": "Updated_Date",
               "nullable": "False",
               "options": [{"description": "=Description1"}],
               "references": "None",
               "size": "None",
               "type": "DATE",
               "unique": "False"},
              {"check": "None",
               "default": "None",
               "name": "Lead_Store_Id",
               "nullable": "False",
               "options": [{"description": "=Description2"}],
               "references": "None",
               "size": "None",
               "type": "INT64",
               "unique": "False"}],
  "dataset": "`dataset",
  "index": [],
  "partition_by": {"columns": ["LeadCreatedDate"], "type": "None"},
  "partitioned_by": [],
  "primary_key": [],
  "table_name": "table`",
  "tablespace": "None"}]

Here are a few issues:

  1. JQ expects that all values are wrapped (None, False, True - as well)

  2. JQ expects that the wrapper will be a double quote, not a single quote.
    The JSON validator https://jsonformatter.curiousconcept.com/ says that points 1 and 2 are RFC violations.

  3. Incorrect parsing if the table name is wrapped in a back quote: "CREATE TABLE `dataset.table`". As you can see from the output, the parser kept the back quote at the beginning (but sometimes at the end) of the table name:

 'schema': '`dataset',

getting an empty output for CREATE SCHEMA and OPTIONS

Hi @xnuinside
I was trying to parse this ddl but getting an empty [ ] output

from simple_ddl_parser import parse_from_file
import json
import os

filename = r'example.ddl'
result = parse_from_file(filename)
print(result)
output
[]

json_data = json.dumps(result)
print(json_data)
output
[]

Not supporting CREATE SCHEMA IF NOT EXISTS and it is expecting only one OPTIONS

CREATE SCHEMA IF NOT EXISTS name-name
OPTIONS (
location="path"
);
CREATE TABLE name.hub.REF_CALENDAR (
calendar_dt DATE,
calendar_dt_id INT,
fiscal_half_year_day_no INT,
fiscal_half_year_end_dt DATE,
fiscal_half_year_month_no INT,
fiscal_half_year_nm STRING,
fiscal_half_year_no INT,
fiscal_half_year_reporting_week_no INT,
fiscal_half_year_start_dt DATE,
fiscal_half_year_week_no INT,
fiscal_year_day_no INT,
fiscal_year_end_dt DATE,
fiscal_year_long_nm STRING,
fiscal_year_month_no INT,
fiscal_year_nm STRING,
fiscal_year_no INT,
fiscal_year_quarter_no INT,
fiscal_year_reporting_week_no INT,
fiscal_year_start_dt DATE,
fiscal_year_week_no INT,
month_day_no INT,
month_end_dt DATE,
month_id STRING,
month_nm STRING,
month_reporting_week_no INT,
month_start_dt DATE,
month_week_no INT,
quarter_day_no INT,
quarter_end_dt DATE,
quarter_month_no INT,
quarter_reporting_week_no INT,
quarter_start_dt DATE,
quarter_week_no INT,
reporting_week_day_no INT,
reporting_week_end_dt DATE,
week_day_nm STRING,
week_day_no INT,
week_day_short_nm STRING,
weekend_ind INT,
weekend_ind_desc STRING,
year_day_no INT,
year_month_no INT,
year_no INT,
year_quarter_no STRING,
year_reporting_week_no INT,
year_week_no INT
)
OPTIONS (
description="Calendar table records reference list of calendar dates and related attributes used for reporting."
);

Adding Support for Check Constraint

Hi @xnuinside

I was wondering if you plan on adding support for check constraints. Below are some examples for the check.

CREATE TABLE employees (
id SERIAL PRIMARY KEY,
first_name VARCHAR (50),
last_name VARCHAR (50),
birth_date DATE CHECK (birth_date > '1900-01-01'),
joined_date DATE CHECK (joined_date > birth_date),
salary numeric CHECK(salary > 0)
);
CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int,
City varchar(255),
CONSTRAINT CHK_Person CHECK (Age>=18 AND City='Sandnes')
);

ALTER TABLE Persons
ADD CHECK (Age>=18);

Postgres DEFAULT statement with "at time zone" causes entire CREATE statement to be omitted from results

Description

When trying out simple-ddl-parser on some existing DDL, I attempted to parse 4 CREATE statements but I would only get 3 results. No exception was produced. By process of elimination, I found that this line was the source of the issue:

created_timestamp TIMESTAMPTZ NOT NULL DEFAULT (now() at time zone 'utc'),

However. with the following, the CREATE statement is parsed as expected:

created_timestamp TIMESTAMPTZ NOT NULL DEFAULT (now()),

Which suggests that the issue is around the "at time zone 'utc'" portion of the DEFAULT clause.

To Reproduce
Parse a CREATE statement including a TIMESTAMPTZ column with a DEFAULT clause including at time zone 'utc'. NOTE: the specific time zone used does not appear to matter.

Expected behavior
The CREATE statement should be parsed and included in parsing results. Although this statement is valid Postgres DDL, I would expect an exception when a statement is not parseable, or at least something other than just ignoring the CREATE statement. I did not see anything in the documentation about error handling; I apologize if I have overlooked that.

Example code

from simple_ddl_parser import DDLParser

parser = DDLParser("""
CREATE TABLE example (
  primary_key            SERIAL PRIMARY KEY,
  created_timestamp      TIMESTAMPTZ NOT NULL DEFAULT (now() at time zone 'utc')
);
""")

result = parser.run()
print(len(result))

for r in result:
    print(r["table_name"])

Expected output: 1 followed by the table name: 'example'
Actual output: 0

Desktop (please complete the following information):

  • OS: Windows10 (Gitbash/MINGW64)
  • Version 0.24.2

Debug option

Is your feature request related to a problem? Please describe.
I tried to use the library to parse HQL DDLs, but for some of them I got the error below

---------------------------------------------------------------------------
DDLParserError                            Traceback (most recent call last)
/var/folders/gv/rh_2w83x16s3t1bkll6gd9ym0000gq/T/ipykernel_7778/1646707155.py in <module>
      2     print("="*40, tbl_name)
      3     if tbl_file["parse"] != "PARSE" or tbl_name not in {"silver.hvc_general", "silver.hvc_relog", "silver.hvc_telemetry"}: continue
----> 4     contents = parse_from_file(tbl_file["path"], output_mode="hql")
      5 
      6 

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/simple_ddl_parser/ddl_parser.py in parse_from_file(file_path, **kwargs)
    203     """get useful data from ddl"""
    204     with open(file_path, "r") as df:
--> 205         return DDLParser(df.read()).run(file_path=file_path, **kwargs)

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/simple_ddl_parser/parser.py in run(self, dump, dump_path, file_path, output_mode, group_by_type, json_dump)
    270             Dict == one entity from ddl - one table or sequence or type.
    271         """
--> 272         self.tables = self.parse_data()
    273         self.tables = result_format(self.tables, output_mode, group_by_type)
    274         if dump:

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/simple_ddl_parser/parser.py in parse_data(self)
    184 
    185         for num, self.line in enumerate(lines):
--> 186             self.process_line(num != len(lines) - 1)
    187         if self.comments:
    188             self.tables.append({"comments": self.comments})

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/simple_ddl_parser/parser.py in process_line(self, last_line)
    216         self.set_default_flags_in_lexer()
    217 
--> 218         self.process_statement()
    219 
    220     def process_statement(self):

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/simple_ddl_parser/parser.py in process_statement(self)
    220     def process_statement(self):
    221         if not self.set_line and self.statement:
--> 222             self.parse_statement()
    223         if self.new_statement:
    224             self.statement = self.line

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/simple_ddl_parser/parser.py in parse_statement(self)
    227 
    228     def parse_statement(self) -> None:
--> 229         _parse_result = yacc.parse(self.statement)
    230         if _parse_result:
    231             self.tables.append(_parse_result)

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/ply/yacc.py in parse(self, input, lexer, debug, tracking, tokenfunc)
    331             return self.parseopt(input, lexer, debug, tracking, tokenfunc)
    332         else:
--> 333             return self.parseopt_notrack(input, lexer, debug, tracking, tokenfunc)
    334 
    335 

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/ply/yacc.py in parseopt_notrack(self, input, lexer, debug, tracking, tokenfunc)
   1061                 if not lookahead:
   1062                     if not lookaheadstack:
-> 1063                         lookahead = get_token()     # Get the next token
   1064                     else:
   1065                         lookahead = lookaheadstack.pop()

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/ply/lex.py in token(self)
    384                     tok.lexpos = lexpos
    385                     self.lexpos = lexpos
--> 386                     newtok = self.lexerrorf(tok)
    387                     if lexpos == self.lexpos:
    388                         # Error method didn't change text position at all. This is an error.

~/.pyenv/versions/3.7.10/envs/lab/lib/python3.7/site-packages/simple_ddl_parser/ddl_parser.py in t_error(self, t)
    193 
    194     def t_error(self, t):
--> 195         raise DDLParserError("Unknown symbol %r" % (t.value[0],))
    196 
    197     def p_error(self, p):

DDLParserError: Unknown symbol "'"

It was hard to find the problem in a big DDL. After finding another shorter example, I could isolate the cause and figured out that the following comment was the issue.

column_name STRING COMMENT 'yada yada yada don’t bla bla bla',   -- the problem was the single stylized quote (&rsquo; in HTML) in "don't".

Hence, some debugging parameters to see which error the lexer/yacc got might be helpful. For example, I can see that ply lets you do that.

Describe the solution you'd like
parse_from_file(tbl_file["path"], output_mode="hql", debug=True)

Describe alternatives you've considered
Show the character it caused the problem

Additional context
Add any other context or screenshots about the feature request here.

import error - version 0.11

hi @xnuinside
Hope you are doing well

I am getting the following error while importing simple_ddl_parser. Could you please check?

Python 3.7.7 (default, Mar 10 2020, 15:43:33) 
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from simple_ddl_parser import DDLParser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Dev/venv/lib/python3.7/site-packages/simple_ddl_parser/__init__.py", line 1, in <module>
    from simple_ddl_parser.ddl_parser import DDLParser, parse_from_file
  File "/Users/Dev/venv/lib/python3.7/site-packages/simple_ddl_parser/ddl_parser.py", line 3, in <module>
    from simple_ddl_parser.parser import Parser
  File "/Users/Dev/venv/lib/python3.7/site-packages/simple_ddl_parser/parser.py", line 3, in <module>
    from typing import Dict, List, Optional, Tuple, Literal
ImportError: cannot import name 'Literal' from 'typing' (/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/typing.py)

support for Snowflake Terraform provider column(?) format generation

Is your feature request related to a problem? Please describe.

Many DB tools generate DDL (with various extensions) and looking for conversion into Snowflake Terraform format

Describe the solution you'd like

Support https://registry.terraform.io/providers/chanzuckerberg/snowflake/latest/docs/resources/table#nested-schema-for-column

Example syntax:

  column {
    name     = "data"
    type     = "text"
    nullable = false
  }

or maybe support for snowflake_table although snowflake_table would likely refer to other components/variables in the TF files

Additional context

Snowflake-Labs/terraform-provider-snowflake#685

Parse Hive External Table DDL with Partitions

I have the below Hive DDL which I need to parse, but it is not working. @xnuinside - could you please help

CREATE EXTERNAL TABLE IF NOT EXISTS database.table_name
(
    day_long_nm     string,
    calendar_dt     date,
    source_batch_id string,
    field_qty       decimal(10, 0),
    field_bool      boolean,
    field_float     float,
    create_tmst     timestamp,
    field_double    double,
    field_long      bigint
) PARTITIONED BY (batch_id int) STORED AS PARQUET LOCATION 's3://datalake/table_name/v1'

Fails to parse if DDL lacks semicolon or GO between statements.

Describe the bug
Fails to parse some ddl if there's no semicolon or GO in between statements.

This silently fails to produce output:

DROP TABLE IF EXISTS sample
CREATE TABLE sample
(
     sid BIGINT NOT NULL,
     foo CHAR(5),
     CONSTRAINT sample_key PRIMARY KEY NONCLUSTERED (sid)
)

While this succeeds (adds a semicolon).

DROP TABLE IF EXISTS sample;
CREATE TABLE sample
(
     sid BIGINT NOT NULL,
     foo CHAR(5),
     CONSTRAINT sample_key PRIMARY KEY NONCLUSTERED (sid)
)

To Reproduce

sdp sample_ddl.sql

Expected behavior
Parser output or an error message or non-zero exit code.

Partitions are not handled

Describe the bug
Partition definition is not handled by the parser, 'parse_from_file' Returns an empty list

To Reproduce
Steps to reproduce the behavior:

Add this code to a simple create table query:

PARTITION BY col(id)
(
PARTITION a VALUES('12') WITH (appendonly='false'),
PARTITION b VALUES('9') WITH (appendonly='false')
);

Expected behavior
Return an non empty result or If there is a problem with partition,skip partition result instead of giving no result at all.

Escaped single quote in COMMENT is not handled properly

Describe the bug
Escaped single quote in COMMENT is not handled properly

To Reproduce
ddl = """
CREATE EXTERNAL TABLE test (
job_id STRING COMMENT 'test's'
)
STORED AS PARQUET LOCATION 'hdfs://test'
"""
from simple_ddl_parser import DDLParser
parse_results = DDLParser(ddl).run(output_mode="hql")

Expected behavior
Non empty json should be returned

'key' as a column name

The table will be skipped if it has a column named 'key'

CREATE TABLE foo
(
    entity_id        UUID PRIMARY KEY DEFAULT getId(),
    key              VARCHAR(256)
);

Output []

Upper case "CREATE" in column name ends table parse

Describe the bug
It looks like you are seeing CREATE and breaking the table DDL parse...

To Reproduce

t1.sql:

DROP TABLE IF EXISTS demo;
--
CREATE TABLE demo
(
     foo                             char(1),
     create_date                     DATETIME2,
     created_BY                      VARCHAR (20)
)
--

t2.sql:

DROP TABLE IF EXISTS demo;
--
CREATE TABLE demo
(
     foo                             char(1),
     CREATE_date                     DATETIME2,
     created_by                      VARCHAR (20),
)
--
sdp -v t1.sql
$ sdp -v t1.sql
Start parsing file t1.sql 

File with result was saved to >> schemas folder
[{'alter': {},
  'checks': [],
  'columns': [{'check': None,
               'default': None,
               'name': 'foo',
               'nullable': True,
               'references': None,
               'size': 1,
               'type': 'char',
               'unique': False},
              {'check': None,
               'default': None,
               'name': 'create_date',
               'nullable': True,
               'references': None,
               'size': None,
               'type': 'DATETIME2',
               'unique': False},
              {'check': None,
               'default': None,
               'name': 'created_BY',
               'nullable': True,
               'references': None,
               'size': 20,
               'type': 'VARCHAR',
               'unique': False}],
  'index': [],
  'partitioned_by': [],
  'primary_key': [],
  'schema': None,
  'table_name': 'demo',
  'tablespace': None}]

vs:

$ sdp -v t2.sql
Start parsing file t2.sql 

File with result was saved to >> schemas folder
[{'alter': {},
  'checks': [],
  'columns': [{'check': None,
               'default': None,
               'name': 'foo',
               'nullable': True,
               'references': None,
               'size': 1,
               'type': 'char',
               'unique': False}],
  'index': [],
  'partitioned_by': [],
  'primary_key': [],
  'schema': None,
  'table_name': 'demo',
  'tablespace': None}]

Expected behavior
A capitalized substring field name that is CREATE shouldn't break the parser.

Support CHECK constraint with procedure call in it

Example

CREATE TABLE foo
(
    entity_id        UUID PRIMARY KEY DEFAULT getId()
    name             VARCHAR(64),
    CONSTRAINT my_constraint  CHECK(my_function(name) IS TRUE)
);

Check constraint works well if there is no procedure call in it (my_function).
With procedure call it leads to an empty result []

When DDL has TBLPROPERTIES, result is empty

Describe the bug
When DDL has TBLPROPERTIES, result is empty

To Reproduce
ddl = """
CREATE EXTERNAL TABLE test (
job_id STRING COMMENT 'test'
)
STORED AS PARQUET LOCATION 'hdfs://test'
TBLPROPERTIES (
'parquet.compression'='SNAPPY')
"""
from simple_ddl_parser import DDLParser
parse_results = DDLParser(ddl).run(output_mode="hql")

Expected behavior
non empty result

Less output with 'on update' string

Describe the bug
The sdp command generates less output than normal when an input file contains the string 'on update'

To Reproduce

  1. install from pip
$ pip3 install simple-ddl-parser
Defaulting to user installation because normal site-packages is not writeable
Collecting simple-ddl-parser
  Downloading simple_ddl_parser-0.19.9-py3-none-any.whl (29 kB)
Collecting ply<4.0,>=3.11
  Downloading ply-3.11-py2.py3-none-any.whl (49 kB)
     |████████████████████████████████| 49 kB 9.2 MB/s
Installing collected packages: ply, simple-ddl-parser
Successfully installed ply-3.11 simple-ddl-parser-0.19.9
  2. Run sdp command
$ cat <<'EOF' > test.sql
create table test(
  `id` bigint not null,
  `updated_at` timestamp(3) not null default current_timestamp(3) on update current_timestamp(3),
  primary key (id)
);
EOF
$ ~/.local/bin/sdp test.sql  -v
Start parsing file schema/test.sql

File with result was saved to >> schemas folder
[{'primary_key': ['id']}]

Expected behavior
Output json like following with some on update key...

[{'alter': {},
  'checks': [],
  'columns': [{'check': None,
               'default': None,
               'name': '`id`',
               'nullable': False,
               'references': None,
               'size': None,
               'type': 'bigint',
               'unique': False},
              {'check': None,
               'default': 'current_timestamp(3)',
               'name': '`updated_at`',
               'nullable': False,
               'references': None,
               'size': 3,
               'type': 'timestamp',
               'unique': False}],
  'index': [],
  'partitioned_by': [],
  'primary_key': ['id'],
  'schema': None,
  'table_name': 'test',
  'tablespace': None}]

Desktop (please complete the following information):

  • OS: Ubuntu 20.04 LTS

keep "<" and ">" special char in columns names when properly quoted

First of all, wanna compliment you for this helpful tool!

I came into a (possibly?) minor bug when trying to parse an hql file containing column names with special characters "<" or ">" properly quoted like:
something<2% float
I found out this is due to "parser.pre_process_line", which inserts spaces around "<" and ">" if "(" is not in the line.
A possible solution is not to insert spaces if "<" ">" are in between quotes.

DDL statement without "NULL" or "not NULL" is not supported

Describe the bug
DDL statement without "NULL" or "not NULL" is not supported

To Reproduce
The following failed. (empty response)
ddl = """
CREATE EXTERNAL TABLE test (
job_id STRING COMMENT 'test'
)
STORED AS PARQUET LOCATION 'hdfs://test'
"""
from simple_ddl_parser import DDLParser
parse_results = DDLParser(ddl).run(output_mode="hql")

But the following passed.
ddl = """
CREATE EXTERNAL TABLE test (
job_id STRING NULL COMMENT 'test'
)
STORED AS PARQUET LOCATION 'hdfs://test'
"""
from simple_ddl_parser import DDLParser
parse_results = DDLParser(ddl).run(output_mode="hql")

Expected behavior
Both cases above should return the same result.

MSSQL/TSQL Support

Hi @xnuinside

I hope you are doing well. I saw that you have made some major enhancements to this project (FYI, this is still amazing). I wanted to provide this comprehensive query for when you plan to add MS/T-SQL support to the parser. You will notice that Postgres and MS SQL have similar syntaxes. The DDL below includes almost every T-SQL data type. I also included the max sizes for each of the data types. Please feel free to let me know if you have any issues with the query.

CREATE TABLE Persons (
    PersonID int PRIMARY KEY,
    LastName varchar(255),
    FirstName varchar(255),
    Address varchar(255),
    City varchar(255)
);


CREATE TABLE sqlserverlist (

id INT IDENTITY (1,1) PRIMARY KEY, -- NOTE THE IDENTITY (1,1) IS SIMILAR TO serial in postgres - Format for IDENTITY [ (seed , increment) ]
company_id BIGINT ,
primary_id INT FOREIGN KEY REFERENCES Persons(PersonID), -- ADD THIS COLUMN FOR THE FOREIGN KEY
age TINYINT NULL UNIQUE,
days_active SMALLINT NOT NULL,
user_origin_of_birth char(255),
user_account VARCHAR(8000) NOT NULL,
user_first_name VARCHAR(max) NOT NULL,
user_last_name 	VARBINARY(8000) NOT NULL,
user_street NCHAR(400) NULL,
user_city NVARCHAR(4000),
about_user NTEXT NULL,
user_description TEXT,
starting_funds FLOAT(53) NULL,
extra_funds REAL,
current_funds DECIMAL (38,20),
ending_funds SMALLMONEY NOT NULL,
birth_date DATE NOT NULL,
time_of_birth TIME(7),
enrollment_date SMALLDATETIME,
delete_date DATETIME NULL,
create_date DATETIME2(7) NOT NULL,
user_time_zone DATETIMEOFFSET(7),
oder_date date DEFAULT GETDATE(), -- added to demonstrate SQL Server defaults
country varchar(255) DEFAULT 'Sandnes', -- added to demonstrate SQL Server defaults
active bit NULL,
home_size GEOMETRY, -- Sql Server Defaults to Null
user_photo IMAGE, -- Sql Server Defaults to Null
--UNIQUE (id),
CONSTRAINT UC_sqlserverlist_last_name UNIQUE (company_id,user_last_name),
CONSTRAINT CHK_Person_Age_under CHECK (days_active<=18 AND user_city='New York'),
FOREIGN KEY (id) REFERENCES Persons(PersonID),
CONSTRAINT FK_Person_Age_under  FOREIGN KEY (id)REFERENCES Persons(PersonID)
)

-- UNIQUE CONSTRAINTS
ALTER TABLE sqlserverlist ADD UNIQUE (birth_date);
ALTER TABLE sqlserverlist ADD CONSTRAINT UC_Person_ening_funds UNIQUE (current_funds,create_date);
--CHECK CONSTRAINTS
ALTER TABLE sqlserverlist ADD CHECK (starting_funds>=0);
ALTER TABLE sqlserverlist ADD CONSTRAINT CHK_Person_Age CHECK (age >=18 AND user_city='Sandnes');
-- FOREIGN KEY Constraints
ALTER TABLE sqlserverlist ADD FOREIGN KEY (id) REFERENCES Persons(PersonID);
ALTER TABLE sqlserverlist ADD CONSTRAINT FK_PersonOrder4 FOREIGN KEY (id) REFERENCES Persons(PersonID);
-- INDEXES 
CREATE INDEX i1 ON sqlserverlist (extra_funds);
--CREATE CLUSTERED INDEX i2 ON sqlserverlist  (delete_date,create_date); --- this line is commented out because TSQL doesn't allow two indexes on the table; added as an example
--CREATE UNIQUE INDEX i1 ON t1 (delete_date DESC, create_date ASC, ending_funds  DESC); --- this line is commented out because TSQL doesn't allow two indexes on the table; added as an example

-- DEFAULTS
ALTER TABLE sqlserverlist ADD CONSTRAINT df_user_street DEFAULT '1 WAY STREET' FOR user_street;
SELECT * FROM #sqlserverlist

Fails to parse output from pg_dump?

I'm not sure if I'm using this wrong, but DDLParser() fails to parse the attached PostgreSQL schema output from pg_dump

schema.txt

Traceback:

~/.cache/pypoetry/virtualenvs/capgras-r404Yw5Z-py3.8/lib/python3.8/site-packages/simple_ddl_parser/parser.py in run(self, dump, dump_path, file_path, output_mode, group_by_type, json_dump)
    228             Dict == one entity from ddl - one table or sequence or type.
    229         """
--> 230         tables = self.parse_data()
    231         tables = result_format(tables, output_mode, group_by_type)
    232         if dump:

~/.cache/pypoetry/virtualenvs/capgras-r404Yw5Z-py3.8/lib/python3.8/site-packages/simple_ddl_parser/parser.py in parse_data(self)
    167                 if final_line or new_statement:
    168                     # end of sql operation, remove ; from end of line
--> 169                     statement = statement[:-1]
    170                 elif num != len(lines) - 1 and not skip:
    171                     # continue combine lines in one massive

TypeError: 'NoneType' object is not subscriptable

Please feel free to close this issue if you don't intend to support this use-case.

Enum handling

Postgres dialect allows to define enumerations as a type, for example:

CREATE TYPE my_status AS enum (
    'NEW',
    'IN_PROGRESS',
    'FINISHED'
);

CREATE TABLE foo
(
    entity_id        UUID PRIMARY KEY DEFAULT getId(),
    status           my_status
);

Right now the parser fails in this case with:
ValueError: dictionary update sequence element #0 has length 1; 2 is required

It would be nice if the parser could simply skip enum definitions, or support enum parsing as well if you think it makes sense.
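A sketch of how the failure reportedly shows up, assuming both statements above are passed to the parser together:

from simple_ddl_parser import DDLParser

ddl = """
CREATE TYPE my_status AS enum ('NEW', 'IN_PROGRESS', 'FINISHED');

CREATE TABLE foo
(
    entity_id        UUID PRIMARY KEY DEFAULT getId(),
    status           my_status
);
"""
# reported error:
# ValueError: dictionary update sequence element #0 has length 1; 2 is required
DDLParser(ddl).run()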

One more example of enum definition for testing:

DO
$$
    BEGIN
        IF NOT EXISTS(SELECT 1
                      FROM pg_type
                      WHERE typname = 'my_status')
        THEN
            CREATE TYPE my_status AS enum (
                'NEW',
                'IN_PROGRESS',
                'FINISHED'
                );
        END IF;
    END
$$;

Typo in Partition By parsing

Describe the bug

Based on the documentation (and looking at the source code and tests), the keyword matched for partitioning is "PARTITIONED BY".
But the actual SQL syntax is "PARTITION BY".

This results in the parsing not working.

Could you please fix this issue?

CREATE TABLE table_name ( table_id INT64, a_column STRING, another_column BOOLEAN, a_partion_column DATE ) PARTITION BY a_partion_column
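A minimal sketch of running the statement above through the parser (the expected key name follows the parser outputs shown elsewhere in this document):

from simple_ddl_parser import DDLParser

ddl = "CREATE TABLE table_name ( table_id INT64, a_column STRING, another_column BOOLEAN, a_partion_column DATE ) PARTITION BY a_partion_column"
result = DDLParser(ddl).run()
# expected: a 'partitioned_by' entry listing a_partion_column
# reported: the PARTITION BY clause is not recognized, so the partitioning info is missing
print(result)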

Column name USER_COMMENT causes column parsing to end

Describe the bug

There are 2 issues with parsing the enclosed create table statement in Snowflake.

  1. The 'OR REPLACE' keywords cause the output to be blank.
  2. The column with name 'USER_COMMENT' causes the column parsing to end so the output only includes the first column 'ASIN'.

Also a feature request: Support for constraint and table comment parsing

To Reproduce

create or replace TABLE ASIN.EXCLUSION (
	ASIN VARCHAR(10) NOT NULL,
	USER_COMMENT VARCHAR(100),
	PROCESS_SQN NUMBER(10,0) NOT NULL,
	constraint PK_EXCLUSION primary key (ASIN)
) COMMENT='ASINs to be excluded from the ASIN List File'
;

create TABLE ASIN.EXCLUSION (
	ASIN VARCHAR(10) NOT NULL,
	USER_COMMENT VARCHAR(100),
	PROCESS_SQN NUMBER(10,0) NOT NULL,
	constraint PK_EXCLUSION primary key (ASIN)
) COMMENT='ASINs to be excluded from the ASIN List File'
;

Support REFERENCES `ON` construction

Adding ON DELETE CASCADE or a similar ON construction after REFERENCES ... leads to an empty parse result.
It would be nice to support ON also.
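A minimal repro sketch (the table and column names are made up for illustration):

from simple_ddl_parser import DDLParser

ddl = """
CREATE TABLE child (
    parent_id INT REFERENCES parent (id) ON DELETE CASCADE
);
"""
print(DDLParser(ddl).run())  # reportedly returns [] once the ON clause is present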

Thanks! Cool parser BTW :)

Feature: AWS Redshift Support

Would it be possible to add support for AWS Redshift? It's based on Postgres 8.0 but, being a distributed database with columnar storage, has added some syntax to control how rows are distributed and compressed. You can find a BNF-esque grammar here.

Currently, this doesn't detect the DDL at all:

>>> results = DDLParser("""
...     CREATE TABLE foo.bar(
...         asdf INTEGER ENCODE ZSTD NOT NULL,
...         qwerty VARCHAR(255) ENCODE LZO
...     )
...     SORTSTYLE COMPOUND
...    SORTKEY (qwerty)
...     DISTSTYLE EVEN
...     ;
... """).run()
>>> results
[]

It'd be great if we could get the information about encoding, sortkey, distkey, etc. with a "redshift" output mode.

Support Spark SQL (or HQL `USING` clause)

Is your feature request related to a problem? Please describe.
Spark SQL DDL is very similar to HQL, with a notable difference being that Spark SQL supports a USING clause in CREATE TABLE statements (see the DDL docs here).

The USING clause is notably used for table definitions by many storage formats that provide APIs for Spark - for example, Apache Iceberg and Delta Lake.

Describe the solution you'd like
Support Spark SQL DDL parsing, likely as a simple extension of HQL, with support for the USING clause

Describe alternatives you've considered
Potentially just support a USING clause in the HQL dialect parser
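For illustration, a typical Spark SQL statement of this form, run through the HQL mode as the alternative above suggests (the table name and location are made up):

from simple_ddl_parser import DDLParser

ddl = """
CREATE TABLE prod.db.events (
    event_id BIGINT,
    event_ts TIMESTAMP
)
USING iceberg
LOCATION 's3://bucket/events'
"""
# the USING clause is the part that currently needs support
print(DDLParser(ddl).run(output_mode="hql"))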

Quoted Identifiers in Column Names

Hi! If I use quoted identifiers in the names of table columns, then the parser returns [].

CREATE TABLE IF NOT EXISTS shema.table
(
    field_1        BIGINT,
    `partition`   STRING,
);
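A minimal sketch of the parser call for the statement above (the HQL output mode is an assumption here, since the example uses backtick quoting):

from simple_ddl_parser import DDLParser

ddl = """
CREATE TABLE IF NOT EXISTS shema.table
(
    field_1        BIGINT,
    `partition`   STRING,
);
"""
print(DDLParser(ddl).run(output_mode="hql"))  # reportedly returns []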

Unexpected logs

Hi, with the new version 0.19.8 I get unexpected logs when calling DDLParser(...).run(...). For example:

WARNING: Token 'NEWLINE' defined, but not used
WARNING: There is 1 unused token
CREATE CREATE
TABLE TABLE
IF IF
NOT NOT
EXISTS EXISTS
shema ID
table ID
field_1 ID
BIGINT ID
, COMMA
`partition` ID
STRING ID
) RP

Default value with expression

Describe the bug

For an example like the one below, where the default value contains an expression, the table is missing from the result:

CREATE TABLE foo
(
    bar_timestamp  bigint DEFAULT (extract(epoch from now()) * 1000)
);

*Postgresql dialect

CHECK with IN clause does not parse

Using the IN keyword in a CHECK causes parsing to fail. For example:

`col_name` varchar(5) CHECK( `col_name` = 'year' OR `col_name` = 'month' ),

The above works, but with an IN clause it doesn't:

`col_name` varchar(5) CHECK( `col_name` IN ('year', 'month') ),

With silent=False we get an error like:

  File ".../lib/python3.9/site-packages/simple_ddl_parser/ddl_parser.py", line 199, in p_error
    raise DDLParserError(f"Unknown statement at {p}")
simple_ddl_parser.ddl_parser.DDLParserError: Unknown statement at LexToken(STRING,"'year'",1,544)

Using OR statements becomes unwieldy when there are more than a handful of options in the enum, so it would be nice to support this. Thanks.

The References field is null when adding a schema

HI @xnuinside

First of all, I just wanted to tell you how amazing this parser is. I was testing the parser and noticed that the references field in the JSON output returns a value of "none" when a schema name is added to the table. For example, if I add the following values after REFERENCES, the "none" value appears:

other_schema.other_table
"other_schema".other_table

However, if I just add the table name other_table, the references field does produce output. Below is an example of the code that will reproduce the issue.

from simple_ddl_parser import DDLParser


parse_results = DDLParser("""

 create table prod.super_table
(
    data_sync_id bigint not null default 0,
    id_ref_from_another_table int REFERENCES other_schema.other_table (id), --- this will produce a null output
    primary key (data_sync_id, sync_start)
);


""").run()

print(parse_results) 

Table name which is also a "reserved word" fails to parse

Consider:

create table index (col1:int);

create table foo.[index] (col1:int);

create table foo.index (col1:int);

The first 2 lines are understood as intending a table named "index".
The third line classifies the word "index" as the token 'INDEX', probably here.

Though not a good or recommended practice, table names are allowed to be arbitrary strings, even reserved words.

If strict adherence to quoted/delimited usage of keyword-as-identifier DDL is desired, then the first line (bare "index") should have failed.
If relaxed adherence is intended, then all 3 should have succeeded.

LOCATION and TBLPROPERTIES clauses in CREATE TABLE LIKE statement

Hi! If I create a table using the LIKE statement and add LOCATION and/or TBLPROPERTIES clauses to it, then the parser returns [].

stmt = 'CREATE EXTERNAL TABLE IF NOT EXISTS schema.specific_table LIKE schema.table_template LOCATION "/path/to/table" TBLPROPERTIES ("external.table.purge" = "true")'
DDLParser(stmt).run(output_mode="hql")

Primary key with default value

Describe the bug

If the primary key is defined with a DEFAULT value, the parser won't treat it as a primary key in the result.

CREATE TABLE foo
(
    entity_id        UUID PRIMARY KEY DEFAULT getId()
);
  • getId() is a function that generates a UUID
  • Postgresql dialect

In result: 'primary_key': []

'STORED AS INPUTFORMAT', 'OUTPUTFORMAT', and 'ROW FORMAT SERDE' are not supported

Describe the bug
'STORED AS INPUTFORMAT' and 'OUTPUTFORMAT' are not supported

To Reproduce
ddl = """
CREATE EXTERNAL TABLE test (
test STRING NULL COMMENT 'xxxx',
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://xxxx'
"""

Expected behavior
Non-empty response.

Some columns from Postgres DDL are not parsed

Describe the bug
Some columns with type character varying are not parsed when the NOT NULL constraint is absent for the column. Please see the 3 example inputs and outputs below.
The CREATE TABLE statements were generated by the pg_dump command in Postgres.

To Reproduce
run result = parse_from_file(path_to_ddl_sql) for the inputs below:

  • Input-1:
CREATE TABLE myset.mytable (
    id_a character varying,
    id_b character varying,
    id_c character varying
);
  • Output-1: There were 3 columns in the input, but only 1 column was parsed.
[{'alter': {},
  'checks': [],
  'columns': [{'check': None,
               'default': None,
               'name': 'id_a',
               'nullable': True,
               'references': None,
               'size': None,
               'type': 'character varying',
               'unique': False}],
  'id_b': 'varying',
  'id_c': 'varying',
  'index': [],
  'partitioned_by': [],
  'primary_key': [],
  'schema': 'myset',
  'table_name': 'mytable',
  'tablespace': None}]
  • Input-2: Added NOT NULL constraint to id_b column.
CREATE TABLE myset.mytable (
    id_a character varying,
    id_b character varying NOT NULL,
    id_c character varying
);
  • Output-2: There were 3 columns in the input, but only 2 columns were parsed.
[{'alter': {},
  'checks': [],
  'columns': [{'check': None,
               'default': None,
               'name': 'id_a',
               'nullable': True,
               'references': None,
               'size': None,
               'type': 'character varying',
               'unique': False},
              {'check': None,
               'default': None,
               'name': 'id_b',
               'nullable': False,
               'references': None,
               'size': None,
               'type': 'character varying',
               'unique': False}],
  'id_c': 'varying',
  'index': [],
  'partitioned_by': [],
  'primary_key': [],
  'schema': 'myset',
  'table_name': 'mytable',
  'tablespace': None}]
  • Input-3: Added NOT NULL constraint to id_c column, too.
CREATE TABLE myset.mytable (
    id_a character varying,
    id_b character varying NOT NULL,
    id_c character varying NOT NULL
);
  • Output-3: All 3 columns in the input are parsed.
[{'alter': {},
  'checks': [],
  'columns': [{'check': None,
               'default': None,
               'name': 'id_a',
               'nullable': True,
               'references': None,
               'size': None,
               'type': 'character varying',
               'unique': False},
              {'check': None,
               'default': None,
               'name': 'id_b',
               'nullable': False,
               'references': None,
               'size': None,
               'type': 'character varying',
               'unique': False},
              {'check': None,
               'default': None,
               'name': 'id_c',
               'nullable': False,
               'references': None,
               'size': None,
               'type': 'character varying',
               'unique': False}],
  'index': [],
  'partitioned_by': [],
  'primary_key': [],
  'schema': 'myset',
  'table_name': 'mytable',
  'tablespace': None}]

Expected behavior
All 3 columns from Input-1 should be parsed into Output-1.

Desktop (please complete the following information):

  • OS: MacOS Monterey 12.2
  • Python 3.8.9 | 3.8.12
  • Package Version: 0.25.0

Cannot use special character like # or | on COMMENT

Describe the bug
Cannot use special character like # or | on COMMENT

To Reproduce
Try to parse

ddl = """
CREATE EXTERNAL TABLE test (
job_id STRING NULL COMMENT 'test#'
)
STORED AS PARQUET LOCATION 'hdfs://test'
"""
from simple_ddl_parser import DDLParser
parse_results = DDLParser(ddl).run(output_mode="hql")

Expected behavior
Should get a JSON result, but got an empty one.

ALTER constraint UNIQUE should also be shown on the column

Issue was reported here: #99

This is a big help. It parses without throwing an exception now! I think there's still room for improvement in understanding the meaning of the Postgres schema. For example, the DDL contains

ALTER TABLE ONLY public.accounts ADD CONSTRAINT accounts_username_key UNIQUE (username);

But in the parsed output, we see:

{'name': 'username',
'type': 'character varying',
'size': 50,
'references': None,
'unique': False, <======= should be True?
'nullable': False,
'default': None,
'check': None},
But the bug I reported is fixed, so closing this issue.

Question mark in COMMENT is not handled

Describe the bug
Question mark in COMMENT is not handled

To Reproduce
ddl = """
CREATE EXTERNAL TABLE test (
job_id STRING COMMENT 'test?'
)
STORED AS PARQUET LOCATION 'hdfs://test'
"""
from simple_ddl_parser import DDLParser
parse_results = DDLParser(ddl).run(output_mode="hql")

Expected behavior
Response should not be empty.

Tab between column name and column type

If the column name is separated from the column type only by a tab, then the parser returns [].

CREATE TABLE IF NOT EXISTS schema.table
(
    field_1\tSTRING,
    field_2\tTIMESTAMP
);

If the first column name is separated from the column type by a tab and some spaces (before or after the tab), but the second column name is separated from the DECIMAL type only by a tab, then an error is returned.

CREATE TABLE IF NOT EXISTS schema.table
(
    field_1\t STRING,
    field_2\tDECIMAL(38,10)
);

If I replace all tabs with a space using replace('\t', ' '), then I get the correct result, as in the sketch below.
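A sketch of that workaround (the DDL string mirrors the first example above):

from simple_ddl_parser import DDLParser

ddl = "CREATE TABLE IF NOT EXISTS schema.table (\n    field_1\tSTRING,\n    field_2\tTIMESTAMP\n);"
# replacing tabs with spaces before parsing reportedly gives the expected result
print(DDLParser(ddl.replace('\t', ' ')).run())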

Multiple SERDEPROPERTIES fails

Describe the bug
Entering multiple key/value pairs in SERDEPROPERTIES causes parsing to fail and return empty

To Reproduce
This works:

ddl="CREATE TABLE `x` (`a` STRING) ROW FORMAT SERDE 'my_serde' WITH SERDEPROPERTIES ( 'key1'='value1' )"
DDLParser(ddl,normalize_names=True).run(output_mode="hql")

This doesn't:

ddl="CREATE TABLE `x` (`a` STRING) ROW FORMAT SERDE 'my_serde' WITH SERDEPROPERTIES ( 'key1'='value1', 'key2'='value2' )"
DDLParser(ddl,normalize_names=True).run(output_mode="hql")

Expected behavior
Multiple key/value pairs in SERDEPROPERTIES should appear in the row format properties dict.

Parse adding a primary key as an alter table statement

I'm trying to parse a sizeable DDL script with the primary key detached in separate alter table statements

alter table MY_TABLE
   add constraint PK_MY_TABLE primary key (DATETIME, REGIONID)
      using index tablespace SOME_TABLE_SPACE;

I notice it picks up the additional indexes, but at the moment simple-ddl-parser does not pick up the primary key when it's added in this way.
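A minimal sketch of running the statement above on its own:

from simple_ddl_parser import DDLParser

ddl = """
alter table MY_TABLE
   add constraint PK_MY_TABLE primary key (DATETIME, REGIONID)
      using index tablespace SOME_TABLE_SPACE;
"""
result = DDLParser(ddl).run()
# reported: the primary key added via ALTER TABLE is not reflected in the output
print(result)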

Unsupported "DEFAULT NEXT VALUE FOR" & ON <DESTINATIONCLAUSE>

Unsupported parsing of NEXT VALUE FOR and ON in CREATE TABLE

When CREATE TABLE uses DEFAULT NEXT VALUE FOR, or the script uses storage options (e.g. ON [PRIMARY]), the parser does not output any results.

To Reproduce

from simple_ddl_parser import DDLParser

ddl_Orig = '''
CREATE TABLE [dbo].[SLIPEVENTO] (
[cdSLIP] [bigint] NOT NULL
DEFAULT NEXT VALUE FOR [dbo].[sqCdSLIPEvt] ,
-- Reference to the origin of the event
[hashOrigem] VARBINARY NOT NULL,
-- HASH from the table the entry comes from
-- SIC, DARE21, SF, Manual, ...
[tbOrigem] [int] NOT NULL,
[tbVrtDstDflt] [int] NULL,
CONSTRAINT [pk_slipsEvt] PRIMARY KEY CLUSTERED ( [cdSLIP] ),
-- reference key for multiple distributed entries
-- ??? Identifies a source document via HASH in tbOrigem
CONSTRAINT [uk_slipsEvtHASHdocs] UNIQUE([hashOrigem],[tbOrigem])
-- The FK must come via a trigger, since the origins are different.
--
) ON [PRIMARY]
GO

'''

ddl1 = '''
CREATE TABLE [dbo].[SLIPEVENTO] (
[cdSLIP] [bigint] NOT NULL,
-- Reference to the origin of the event
[hashOrigem] VARBINARY NOT NULL,
-- HASH from the table the entry comes from
-- SIC, DARE21, SF, Manual, ...
[tbOrigem] [int] NOT NULL,
[tbVrtDstDflt] [int] NULL,
CONSTRAINT [pk_slipsEvt] PRIMARY KEY CLUSTERED ( [cdSLIP] ),
-- reference key for multiple distributed entries
-- ??? Identifies a source document via HASH in tbOrigem
CONSTRAINT [uk_slipsEvtHASHdocs] UNIQUE([hashOrigem],[tbOrigem])
-- The FK must come via a trigger, since the origins are different.
--
) ON [PRIMARY]
'''

ddl2 = '''
CREATE TABLE [dbo].[SLIPEVENTO] (
[cdSLIP] [bigint] NOT NULL,
-- Reference to the origin of the event
[hashOrigem] VARBINARY NOT NULL,
-- HASH from the table the entry comes from
-- SIC, DARE21, SF, Manual, ...
[tbOrigem] [int] NOT NULL,
[tbVrtDstDflt] [int] NULL,
CONSTRAINT [pk_slipsEvt] PRIMARY KEY CLUSTERED ( [cdSLIP] ),
-- reference key for multiple distributed entries
-- ??? Identifies a source document via HASH in tbOrigem
CONSTRAINT [uk_slipsEvtHASHdocs] UNIQUE([hashOrigem],[tbOrigem])
-- The FK must come via a trigger, since the origins are different.
--
);
'''

result = DDLParser(ddl_Orig).run(dump=True)
print(result)

result = DDLParser(ddl1).run(dump=True)
print(result)

result = DDLParser(ddl2).run(dump=True)
print(result)

Expected behavior
Interpretation as proposed in the docs.

Desktop (please complete the following information):

  • SQL Dialect - MSSQL - Autogenerated scripts - V12
  • OS: Debian - Python3.9
  • Idle3 ; Eric6, QtCreator (all of them )
  • Version
  • fred@alpha:~/Documentos/GoiasFomento/QtPy/src$ pip3 install simple-ddl-parser
    Collecting simple-ddl-parser
    Downloading simple_ddl_parser-0.20.0-py3-none-any.whl (31 kB)
    Requirement already satisfied: ply<4.0,>=3.11 in /usr/lib/python3/dist-packages (from simple-ddl-parser) (3.11)
    Installing collected packages: simple-ddl-parser
    Successfully installed simple-ddl-parser-0.20.0

Additional context

CREATE TYPE does not parse either.

DROP TYPE dbo.T_LCT_SLIPS ;
GO
CREATE TYPE dbo.T_LCT_SLIPS AS TABLE (
hashKY varbinary(48),
numContratoGF bigint,
FATORGERADOR date ,
dtREFSIC date,
cdTbOrigem int,
cdTbVrtDstDflt int,
Sq int,
vl DECIMAL(32,0),
descr varchar(254),
tpEvento varchar(254),
tpLcto char
);
GO
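A sketch of running a trimmed-down version of the type above through the parser:

from simple_ddl_parser import DDLParser

ddl = """
CREATE TYPE dbo.T_LCT_SLIPS AS TABLE (
    hashKY varbinary(48),
    numContratoGF bigint,
    descr varchar(254)
);
"""
print(DDLParser(ddl).run())  # reportedly CREATE TYPE ... AS TABLE is not parsed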

Semicolon in COMMENT is not handled

Describe the bug
A semicolon inside a COMMENT is not handled.

To Reproduce
ddl = """
CREATE EXTERNAL TABLE test (
test STRING NULL COMMENT 'test;',
)
COMMENT 'test'
"""
from simple_ddl_parser import DDLParser
parse_results = DDLParser(ddl).run(output_mode="hql")

Expected behavior
Non-empty result
