
libpg_query's Introduction

libpg_query

C library for accessing the PostgreSQL parser outside of the server.

This library uses the actual PostgreSQL server source to parse SQL queries and return the internal PostgreSQL parse tree.

Note that this is mostly intended as a base library for pg_query (Ruby), pg_query.go (Go), pgsql-parser (Node), psqlparse (Python) and pglast (Python 3).

You can find further background on why a query's parse tree is useful here: https://pganalyze.com/blog/parse-postgresql-queries-in-ruby.html

Installation

git clone -b 16-latest https://github.com/pganalyze/libpg_query
cd libpg_query
make

Because parts of PostgreSQL are compiled, running make will take a while; expect up to 3 minutes.

For a production build, it's best to use a specific git tag (see CHANGELOG).

When compiling on Windows with Visual Studio, instead use nmake with the Makefile.msvc:

nmake /F Makefile.msvc

Usage: Parsing a query

A full example that parses a query looks like this:

#include <pg_query.h>
#include <stdio.h>

int main() {
  PgQueryParseResult result;

  result = pg_query_parse("SELECT 1");

  printf("%s\n", result.parse_tree);

  pg_query_free_parse_result(result);
}

Compile it like this:

cc -Ilibpg_query -Llibpg_query example.c -lpg_query

This will output the parse tree (whitespace adjusted here for better readability):

{
    "version": 160001,
    "stmts": [
        {
            "stmt": {
                "SelectStmt": {
                    "targetList": [
                        {
                            "ResTarget": {
                                "val": {
                                    "A_Const": {
                                        "val": {
                                            "Integer": {
                                                "ival": 1
                                            }
                                        },
                                        "location": 7
                                    }
                                },
                                "location": 7
                            }
                        }
                    ],
                    "limitOption": "LIMIT_OPTION_DEFAULT",
                    "op": "SETOP_NONE"
                }
            }
        }
    ]
}
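
In practice you should check the error field before using the parse tree, since it is set when the query fails to parse. A minimal sketch of this, using the same error fields as the PL/pgSQL example further below:

#include <pg_query.h>
#include <stdio.h>

int main() {
  PgQueryParseResult result = pg_query_parse("SELECT * FROM (");

  if (result.error) {
    /* result.error carries message, cursorpos, and other context fields */
    printf("parse error: %s (at position %d)\n",
           result.error->message, result.error->cursorpos);
  } else {
    printf("%s\n", result.parse_tree);
  }

  pg_query_free_parse_result(result);
  return 0;
}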

Usage: Scanning a query into its tokens using the PostgreSQL scanner/lexer

pg_query also exposes the underlying scanner of Postgres, which runs as the very first step of the parsing process. It can be useful on its own, e.g. for syntax highlighting, where one is mostly concerned with distinguishing keywords from identifiers and other parts of the query:

#include <stdio.h>

#include <pg_query.h>
#include "protobuf/pg_query.pb-c.h"

int main() {
  PgQueryScanResult result;
  PgQuery__ScanResult *scan_result;
  PgQuery__ScanToken *scan_token;
  const ProtobufCEnumValue *token_kind;
  const ProtobufCEnumValue *keyword_kind;
  const char *input = "SELECT update AS left /* comment */ FROM between";

  result = pg_query_scan(input);
  scan_result = pg_query__scan_result__unpack(NULL, result.pbuf.len, (void *) result.pbuf.data);

  printf("  version: %d, tokens: %ld, size: %d\n", scan_result->version, scan_result->n_tokens, result.pbuf.len);
  for (size_t j = 0; j < scan_result->n_tokens; j++) {
    scan_token = scan_result->tokens[j];
    token_kind = protobuf_c_enum_descriptor_get_value(&pg_query__token__descriptor, scan_token->token);
    keyword_kind = protobuf_c_enum_descriptor_get_value(&pg_query__keyword_kind__descriptor, scan_token->keyword_kind);
    printf("  \"%.*s\" = [ %d, %d, %s, %s ]\n", scan_token->end - scan_token->start, &(input[scan_token->start]), scan_token->start, scan_token->end, token_kind->name, keyword_kind->name);
  }

  pg_query__scan_result__free_unpacked(scan_result, NULL);
  pg_query_free_scan_result(result);

  return 0;
}

This will output the following:

  version: 160001, tokens: 7, size: 77
  "SELECT" = [ 0, 6, SELECT, RESERVED_KEYWORD ]
  "update" = [ 7, 13, UPDATE, UNRESERVED_KEYWORD ]
  "AS" = [ 14, 16, AS, RESERVED_KEYWORD ]
  "left" = [ 17, 21, LEFT, TYPE_FUNC_NAME_KEYWORD ]
  "/* comment */" = [ 22, 35, C_COMMENT, NO_KEYWORD ]
  "FROM" = [ 36, 40, FROM, RESERVED_KEYWORD ]
  "between" = [ 41, 48, BETWEEN, COL_NAME_KEYWORD ]

Each element in the token list has the following fields:

  1. Start location in the source string
  2. End location in the source string
  3. Token value - see Token type in protobuf/pg_query.proto
  4. Keyword type - see KeywordKind type in protobuf/pg_query.proto. Possible values:
     • NO_KEYWORD: Not a keyword
     • UNRESERVED_KEYWORD: Unreserved keyword (available for use as any kind of unescaped name)
     • COL_NAME_KEYWORD: Unreserved keyword (can be an unescaped column/table/etc. name, but cannot be an unescaped function or type name)
     • TYPE_FUNC_NAME_KEYWORD: Reserved keyword (can be an unescaped function or type name, but cannot be an unescaped column/table/etc. name)
     • RESERVED_KEYWORD: Reserved keyword (cannot be an unescaped column/table/variable/type/function name)

Note that whitespace does not show up as tokens.
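
To compile the scanner example, link against the library as before; the generated protobuf code and the vendored protobuf-c runtime are part of libpg_query.a, so no extra protobuf library should be needed (the exact flags below, including -pthread, are an assumption based on the repository layout):

cc -Ilibpg_query -Llibpg_query scan_example.c -lpg_query -pthread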

Usage: Fingerprinting a query

Fingerprinting allows you to identify similar queries that differ only in the specific object being queried for (e.g. different object IDs in the WHERE clause) or in their formatting.

Example:

#include <pg_query.h>
#include <stdio.h>

int main() {
  PgQueryFingerprintResult result;

  result = pg_query_fingerprint("SELECT 1");

  printf("%s\n", result.fingerprint_str);

  pg_query_free_fingerprint_result(result);
}

This will output:

50fde20626009aba

See https://github.com/pganalyze/libpg_query/wiki/Fingerprinting for the full fingerprinting rules.
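
Since fingerprinting normalizes away constant values and formatting (per the rules linked above), two structurally identical queries should yield the same fingerprint even if their constants and whitespace differ. A small sketch illustrating this, assuming that normalization:

#include <pg_query.h>
#include <stdio.h>
#include <string.h>

int main() {
  PgQueryFingerprintResult a = pg_query_fingerprint("SELECT 1");
  PgQueryFingerprintResult b = pg_query_fingerprint("select   2");

  /* Expected to print "equal: yes", since constant values are ignored */
  printf("a: %s\nb: %s\nequal: %s\n",
         a.fingerprint_str, b.fingerprint_str,
         strcmp(a.fingerprint_str, b.fingerprint_str) == 0 ? "yes" : "no");

  pg_query_free_fingerprint_result(a);
  pg_query_free_fingerprint_result(b);
  return 0;
}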

Usage: Parsing a PL/pgSQL function

A full example that parses a PL/pgSQL function looks like this:

#include <pg_query.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  PgQueryPlpgsqlParseResult result;

  result = pg_query_parse_plpgsql(" \
  CREATE OR REPLACE FUNCTION cs_fmt_browser_version(v_name varchar, \
                                                  v_version varchar) \
RETURNS varchar AS $$ \
BEGIN \
    IF v_version IS NULL THEN \
        RETURN v_name; \
    END IF; \
    RETURN v_name || '/' || v_version; \
END; \
$$ LANGUAGE plpgsql;");

  if (result.error) {
    printf("error: %s at %d\n", result.error->message, result.error->cursorpos);
  } else {
    printf("%s\n", result.plpgsql_funcs);
  }

  pg_query_free_plpgsql_parse_result(result);

  return 0;
}

This will output:

[
{"PLpgSQL_function":{"datums":[{"PLpgSQL_var":{"refname":"found","datatype":{"PLpgSQL_type":{"typname":"UNKNOWN"}}}}],"action":{"PLpgSQL_stmt_block":{"lineno":1,"body":[{"PLpgSQL_stmt_if":{"lineno":1,"cond":{"PLpgSQL_expr":{"query":"SELECT v_version IS NULL"}},"then_body":[{"PLpgSQL_stmt_return":{"lineno":1,"expr":{"PLpgSQL_expr":{"query":"SELECT v_name"}}}}]}},{"PLpgSQL_stmt_return":{"lineno":1,"expr":{"PLpgSQL_expr":{"query":"SELECT v_name || '/' || v_version"}}}}]}}}}
]

Versions

For stability, it is recommended that you use individual tagged git versions; see the CHANGELOG.

Each major version is maintained in a dedicated git branch. Only the latest Postgres stable release receives active updates.

PostgreSQL Major Version | Branch     | Status
16                       | 16-latest  | Active development
15                       | 15-latest  | Critical fixes only
14                       | 14-latest  | Critical fixes only
13                       | 13-latest  | Critical fixes only
12                       | (n/a)      | Not supported
11                       | (n/a)      | Not supported
10                       | 10-latest  | No longer supported
9.6                      | (n/a)      | Not supported
9.5                      | 9.5-latest | No longer supported
9.4                      | 9.4-latest | No longer supported

Resources

pg_query wrappers in other languages:

Products, tools and libraries built on pg_query:

Please feel free to open a PR to add yours! :)

Authors

License

PostgreSQL server source code, used under the PostgreSQL license.
Portions Copyright (c) 1996-2023, The PostgreSQL Global Development Group
Portions Copyright (c) 1994, The Regents of the University of California

All other parts are licensed under the 3-clause BSD license, see LICENSE file for details.
Copyright (c) 2015, Lukas Fittl [email protected]
Copyright (c) 2016-2023, Duboce Labs, Inc. (pganalyze) [email protected]


libpg_query's Issues

Stack overflow when parsing specific query

In the process of attempting to port https://github.com/zhm/pg-query-parser to the newer AST, I encountered an error parsing this statement when running the postgres regression tests:

CREATE FOREIGN TABLE ft1 () SERVER no_server

I also checked the latest Ruby parser, just to see if it was an issue with the Node module, and it fails there too.

irb(main):002:0> PgQuery.parse 'CREATE FOREIGN TABLE ft1 () SERVER no_server'
SystemStackError: stack level too deep
    from /usr/local/lib/ruby/gems/2.2.0/gems/pg_query-0.9.2/lib/pg_query/parse.rb:5:in `_raw_parse'
    from /usr/local/lib/ruby/gems/2.2.0/gems/pg_query-0.9.2/lib/pg_query/parse.rb:5:in `parse'
    from (irb):2
    from /usr/local/bin/irb:11:in `<main>'

Note that this isn't a statement I care about; it's just in the Postgres regression folder.

https://github.com/postgres/postgres/blob/master/src/test/regress/sql/foreign_data.sql#L269

JSON format documentation

Is there any place that summarizes the shape of the JSON returned by the parser? It seems crucial to me, but it's proving very hard to find.

I'm working with pg-query-parser in TypeScript, and I would like to create type definitions for the result of the parse function to be able to work with the library safely.

structs/enums questions for new PG 13 upgrade

@lfittl so happy to see the PG 13 launch! congrats 🎉

I'm starting to rebuild the Node deparser, leveraging the enums and structs JSON files. I'm hoping to automate some parts using these files and wanted to make sure my assumptions are correct.

  1. I noticed that sometimes there are (pointers?) to objects and other times the objects themselves. Are they potentially arrays of objects when there is a "*", or can I treat both Thing and Thing* as a single nested Node (without the wrapper)?

  2. I imagine Node is the typical Node object, which we can parse as before, and List is an array of Nodes.

Here is an example I pulled out demonstrating where both Expr and Expr* are used:


        {
          name: 'xpr',
          c_type: 'Expr',
          comment: null
        },

...

        {
          name: 'aggdistinct',
          c_type: 'List*',
          comment: '/* DISTINCT (list of SortGroupClause) */'
        },
        {
          name: 'aggfilter',
          c_type: 'Expr*',
          comment: '/* FILTER expression, if any */'
        },
        {
          name: 'aggstar',
          c_type: 'bool',
          comment: "/* true if argument list was really '*' */"
        },

...

        {
          name: 'exprnulls',
          c_type: 'bool*',
          comment: null
        },


potential memory leak?

Hi, I'm new to libpg_query and found a potential memory leak in the library. To reproduce, run:

git clone -b 10-latest git://github.com/lfittl/libpg_query
cd libpg_query
make
cd example
valgrind --tool=memcheck --leak-check=yes --show-reachable=yes ./simple

output:

==14875== HEAP SUMMARY:
==14875==     in use at exit: 16,601 bytes in 3 blocks
==14875==   total heap usage: 25 allocs, 22 frees, 77,160 bytes allocated
==14875== 
==14875== 217 bytes in 1 blocks are still reachable in loss record 1 of 3
==14875==    at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14875==    by 0x10E7AF: MemoryContextCreate (src_backend_utils_mmgr_mcxt.c:624)
==14875==    by 0x13E717: AllocSetContextCreate (src_backend_utils_mmgr_aset.c:375)
==14875==    by 0x10E39B: MemoryContextInit (src_backend_utils_mmgr_mcxt.c:133)
==14875==    by 0x11182A: pg_query_init (pg_query.c:15)
==14875==    by 0x111848: pg_query_enter_memory_context (pg_query.c:23)
==14875==    by 0x10D9FF: pg_query_parse (pg_query_parse.c:92)
==14875==    by 0x10D740: main (simple.c:26)
==14875== 
==14875== 8,192 bytes in 1 blocks are still reachable in loss record 2 of 3
==14875==    at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14875==    by 0x13E367: AllocSetAlloc (src_backend_utils_mmgr_aset.c:791)
==14875==    by 0x10E5EB: MemoryContextAlloc (src_backend_utils_mmgr_mcxt.c:682)
==14875==    by 0x10E6ED: MemoryContextCreate (src_backend_utils_mmgr_mcxt.c:618)
==14875==    by 0x13E624: AllocSetContextCreate (src_backend_utils_mmgr_aset.c:375)
==14875==    by 0x10E3E3: MemoryContextInit (src_backend_utils_mmgr_mcxt.c:156)
==14875==    by 0x11182A: pg_query_init (pg_query.c:15)
==14875==    by 0x111848: pg_query_enter_memory_context (pg_query.c:23)
==14875==    by 0x10D9FF: pg_query_parse (pg_query_parse.c:92)
==14875==    by 0x10D740: main (simple.c:26)
==14875== 
==14875== 8,192 bytes in 1 blocks are still reachable in loss record 3 of 3
==14875==    at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==14875==    by 0x13E697: AllocSetContextCreate (src_backend_utils_mmgr_aset.c:418)
==14875==    by 0x10E3E3: MemoryContextInit (src_backend_utils_mmgr_mcxt.c:156)
==14875==    by 0x11182A: pg_query_init (pg_query.c:15)
==14875==    by 0x111848: pg_query_enter_memory_context (pg_query.c:23)
==14875==    by 0x10D9FF: pg_query_parse (pg_query_parse.c:92)
==14875==    by 0x10D740: main (simple.c:26)
==14875== 
==14875== LEAK SUMMARY:
==14875==    definitely lost: 0 bytes in 0 blocks
==14875==    indirectly lost: 0 bytes in 0 blocks
==14875==      possibly lost: 0 bytes in 0 blocks
==14875==    still reachable: 16,601 bytes in 3 blocks
==14875==         suppressed: 0 bytes in 0 blocks
==14875== 
==14875== For counts of detected and suppressed errors, rerun with: -v

Is this really a memory leak? Thank you!

[BUG]: invalid plpgsql function parsing

I found a strange bug: assigning results to OUT arguments throws a strange error.

When I call the ParsePlPgSql method with this SQL query:

CREATE FUNCTION public.somefunc(OUT _result uuid[])
    RETURNS uuid[]
    LANGUAGE 'plpgsql'

AS $BODY$
DECLARE
	active_on_to_date uuid[];
BEGIN
_result := ARRAY( SELECT some_id FROM some_table);
END;
$BODY$;

I get this odd issue:

"_result" is not a known variable

I used the official Go wrapper v1.0.2, but it looks like this issue originates in this base library.

ARM header file(s) are missing

port/atomics/arch-arm.h is missing from src/postgres/include. As a result, this library can't be compiled on ARM as it is and requires manual patching.

This issue is more pronounced when installing the pg_query gem.

Expose frame options

The frameOptions field shows up in a couple different structs. It's currently defined as an int in the struct_defs.json file, but it's really a set of bitmasks. It'd be nice if these were exposed in some form or other. I'm not sure exactly where the data should go, though. Maybe this needs a new file in srcdata for such things?

Windows Support

Issue to track support of Windows - currently it is not supported, in part due to build process, in part because the code extraction from Postgres intentionally ignores Windows code.

enum_defs.json is missing values for enums with explicit values

The example I saw is for the FunctionParameterMode enum, which is defined in src/postgres/include/nodes/parsenodes.h this way:

typedef enum FunctionParameterMode
{
	/* the assigned enum values appear in pg_proc, don't change 'em! */
	FUNC_PARAM_IN = 'i',		/* input only */
	FUNC_PARAM_OUT = 'o',		/* output only */
	FUNC_PARAM_INOUT = 'b',		/* both */
	FUNC_PARAM_VARIADIC = 'v',	/* variadic (always input) */
	FUNC_PARAM_TABLE = 't'		/* table function output column */
} FunctionParameterMode;

The JSON output for parsing will be something like "mode": 105, where 105 is the ASCII value of 'i'. Anything that attempts to turn this JSON back into a language's representation (Rust in my case) will blow up here, because it expects these values to be 0-4.
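
One workaround on the consumer side is to decode the value as a character code rather than an ordinal index. A sketch under that assumption (decode_param_mode is a hypothetical helper; the mapping is copied from the enum definition above):

#include <stdio.h>

/* Hypothetical helper: maps the ASCII-valued FunctionParameterMode from
   the JSON output (e.g. "mode": 105) back to a symbolic name. */
static const char *decode_param_mode(int mode) {
  switch (mode) {
    case 'i': return "FUNC_PARAM_IN";
    case 'o': return "FUNC_PARAM_OUT";
    case 'b': return "FUNC_PARAM_INOUT";
    case 'v': return "FUNC_PARAM_VARIADIC";
    case 't': return "FUNC_PARAM_TABLE";
    default:  return "unknown";
  }
}

int main() {
  printf("mode %d -> %s\n", 105, decode_param_mode(105)); /* 105 == 'i' */
  return 0;
}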

"Segmentation fault (core dumped)"

I'm trying to compile a .c file with a long, complex SQL statement (PostgreSQL query) as follows:

#include <pg_query.h>
#include <stdio.h>

int main() {
  PgQueryParseResult result;

  result = pg_query_parse("with	'log dir(theta|alpha)' as (select dirichlet(theta,alpha,'topic') from topic ), 'log p(phi|beta)' as (with 'log dir(phi|beta)' as (select dirichlet(phi,beta,'word join topic_word using (word_id)') from word join topic_word using (word_id) ) select sum('log dir(phi|beta)') as 'log p(phi|beta)' from 'log dir(phi|beta)' ), 'log p(z|theta)' as (with 'log cat(z_m|theta)' as (select categorical(document_id,theta,'document join topic using (topic_id)') from document join topic using (topic_id) ) select sum('log cat(z_m|theta)') as 'log p(z|theta)' from 'log cat(z_m|theta)' ), 'log p(w|phi)' as (with 'log cat(w_n|phi_z_doc_n)' as (select categorical(word_instance_id,phi,'word_instance join document using(document_id) join topic_word using (topic_id,word_id)') from word_instance join document using(document_id) join topic_word using (topic_id,word_id) ) select sum('log cat(w_n|phi_z_doc_n)') as 'log p(w|phi)' from 'log cat(w_n|phi_z_doc_n)' ) select 'log dir(theta|alpha)' + 'log p(phi|beta)' + 'log p(z|theta)' + 'log p(w|phi)' as 'log p(w,z,phi,theta|alpha,beta)' from 'log dir(theta|alpha)', 'log p(phi|beta)', 'log p(z|theta)', 'log p(w|phi)' ");

  printf('%s\n', result.parse_tree);

  pg_query_free_parse_result(result);
}

I replaced all inner double quotation marks (") within the statement with single ones (') as it wouldn't be compilable otherwise.
Then I get warnings like this:

osboxes@osboxes:~/libpg_query/examples$ cc -I.. -L.. long_statement.c -lpg_query
long_statement.c: In function ‘main’:
long_statement.c:9:10: warning: multi-character character constant [-Wmultichar]
printf('%s\n', result.parse_tree);
^~~~~~
long_statement.c:9:10: warning: passing argument 1 of ‘printf’ makes pointer from integer without a cast [-Wint-conversion]
In file included from long_statement.c:2:0:
/usr/include/stdio.h:318:12: note: expected ‘const char * restrict’ but argument is of type ‘int’
extern int printf (const char *__restrict __format, ...);
^~~~~~

... and when I run the compiled program, there's an error:

osboxes@osboxes:~/libpg_query/examples$ ./a.out
Segmentation fault (core dumped)

Is there anything I'm doing wrong?

determining enum constants mapping

When I upgraded to the latest version many of the objtype numbers changed on nodes.

I'm trying to find a programmatic way to determine the mapping so I can go back and forth between representations. Currently it seems that some of the objtype numbers come from the ObjectType enum in the original source, but some do not.

For example, take these two statements as an example:

COMMENT ON CAST (text AS int4) IS 'Allow casts from text to int4';
COMMENT ON TABLE my_schema.my_table IS 'Employee Information';

the CommentStmt empirically yields an OBJECT_CAST since the node has objtype=5, which is correct! The other node yields OBJECT_STATISTIC_EXT because objtype=37; however, we know it is actually supposed to be OBJECT_TABLE.

If you look at the source below, that's not exactly the order of the enums. Any idea why this is the case? I'm not sure whether the node type would change the objtype, as I imagine they are shared throughout the source.

inside of src/include/nodes/parsenodes.h
https://doxygen.postgresql.org/parsenodes_8h.html#a842c5e2e69277690b064bf363c017980

{
     OBJECT_ACCESS_METHOD,
     OBJECT_AGGREGATE,
     OBJECT_AMOP,
     OBJECT_AMPROC,
     OBJECT_ATTRIBUTE,           /* type's attribute, when distinct from column */
     OBJECT_CAST,
     OBJECT_COLUMN,
     OBJECT_COLLATION,
     OBJECT_CONVERSION,
     OBJECT_DATABASE,
     OBJECT_DEFAULT,
     OBJECT_DEFACL,
     OBJECT_DOMAIN,
     OBJECT_DOMCONSTRAINT,
     OBJECT_EVENT_TRIGGER,
     OBJECT_EXTENSION,
     OBJECT_FDW,
     OBJECT_FOREIGN_SERVER,
     OBJECT_FOREIGN_TABLE,
     OBJECT_FUNCTION,
     OBJECT_INDEX,
     OBJECT_LANGUAGE,
     OBJECT_LARGEOBJECT,
     OBJECT_MATVIEW,
     OBJECT_OPCLASS,
     OBJECT_OPERATOR,
     OBJECT_OPFAMILY,
     OBJECT_POLICY,
     OBJECT_PROCEDURE,
     OBJECT_PUBLICATION,
     OBJECT_PUBLICATION_REL,
     OBJECT_ROLE,
     OBJECT_ROUTINE,
     OBJECT_RULE,
     OBJECT_SCHEMA,
     OBJECT_SEQUENCE,
     OBJECT_SUBSCRIPTION,
     OBJECT_STATISTIC_EXT,
     OBJECT_TABCONSTRAINT,
     OBJECT_TABLE,
     OBJECT_TABLESPACE,
     OBJECT_TRANSFORM,
     OBJECT_TRIGGER,
     OBJECT_TSCONFIGURATION,
     OBJECT_TSDICTIONARY,
     OBJECT_TSPARSER,
     OBJECT_TSTEMPLATE,
     OBJECT_TYPE,
     OBJECT_USER_MAPPING,
     OBJECT_VIEW
 } ObjectType;

Is this where the objtypes are even coming from?

Any help is appreciated! At this point I'm considering manually writing test cases for each objtype and discovering them all empirically.

Build issue on Alpine with 13-latest-develop branch

tl;dr: Alpine Linux uses musl libc instead of glibc and has a package for libexecinfo. -lexecinfo needs to be appended to LDFLAGS for the project to compile/link cleanly.

When I do a fresh checkout of libpg_query against the 13-latest-develop branch and attempt to compile the project, I get the following error when linking:

ar: creating libpg_query.a
cc -I. -g -o examples/simple -g examples/simple.c libpg_query.a -pthread
/usr/lib/gcc/x86_64-alpine-linux-musl/10.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: libpg_query.a(src_backend_utils_error_elog.o): in function `set_backtrace':
/pgparse/libpg_query/src/postgres/src_backend_utils_error_elog.c:829: undefined reference to `backtrace'
/usr/lib/gcc/x86_64-alpine-linux-musl/10.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: /pgparse/libpg_query/src/postgres/src_backend_utils_error_elog.c:830: undefined reference to `backtrace_symbols'
/usr/lib/gcc/x86_64-alpine-linux-musl/10.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: libpg_query.a(src_backend_utils_error_assert.o): in function `ExceptionalCondition':
/pgparse/libpg_query/src/postgres/src_backend_utils_error_assert.c:59: undefined reference to `backtrace'
/usr/lib/gcc/x86_64-alpine-linux-musl/10.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: /pgparse/libpg_query/src/postgres/src_backend_utils_error_assert.c:60: undefined reference to `backtrace_symbols_fd'
collect2: error: ld returned 1 exit status
make: *** [Makefile:171: examples/simple] Error 1

execinfo is obviously being imported in Postgres, so I checked configure in the Postgres repo and found https://github.com/postgres/postgres/blob/master/configure#L11877

By setting TEST_LDFLAGS=-lexecinfo in my shell, it compiles cleanly and passes all tests.

I'm going to try to find a workaround in my build process, but ideally libpg_query would be smart enough to handle this as well.

If you're looking for a PR, and want to point me in the right direction for how you'd like it addressed, I'm happy to see if I can figure it out.

Escape JSON strings in error message

PgQueryPlpgsqlParseResult.error and PgQueryParseResult.error contain unescaped quotation marks in the message string. This makes it impossible to parse the result as JSON.

Sample:
{ "error" : {"cursorpos" : 0 , "filename" : "pl_gram.y", "funcname" : "plpgsql_yyparse", "lineno" : 716, "message" : "variable "contract_case_split_supplier_contracts.caller_id" does not exist"}}

Correct version:
{ "error" : {"cursorpos" : 0 , "filename" : "pl_gram.y", "funcname" : "plpgsql_yyparse", "lineno" : 716, "message" : "variable "contract_case_split_supplier_contracts.caller_id" does not exist"}}

Use of libpg_query on a slightly modified postgres.

Hi Lukas,
Thanks for this awesome work, it's really impressive.

I'm trying to compile libpg_query against a slightly modified Postgres (a couple of keywords and some syntax; it works as expected in a normal Postgres build). I'm trying to understand your process for extracting the .c and .h files by following the Makefile, but I'm pretty sure I'm missing something.

Can you give a two-line description of the process? I'm aware I might need to add more files to the blacklist and mock a couple of things, but I want to be sure I'm following the same process you did.

Thanks @lfittl !!

Thread safety?

Is this library intended to be thread safe? I seem to be able to segfault it when running these tests in parallel, but not when limiting them to a single thread: https://github.com/sfackler/rust-pg_query/blob/2169d608e0f4df4f17a7798f662393f9dd3f9e66/pg_query/src/lib.rs#L68

pg_query ❯ env RUST_TEST_THREADS=1 cargo test
     Running target/debug/pg_query-773a16cd18fc4144

running 2 tests
test test::err ... ok
test test::ok ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured

   Doc-tests pg_query

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

pg_query ❯ env RUST_TEST_THREADS=2 cargo test
     Running target/debug/pg_query-773a16cd18fc4144

running 2 tests
WARNING:  01000: problem in alloc set pg_query_parse: detected write past chunk end in block 0x7f9568219000, chunk 0x7f9568219190
LOCATION:  AllocSetCheck, aset.c:1354
WARNING:  01000: problem in alloc set pg_query_parse: bad single-chunk 0x7f95682191b0 in block 0x7f9568219000
LOCATION:  AllocSetCheck, aset.c:1337
WARNING:  01000: problem in alloc set pg_query_parse: bogus aset link in block 0x7f9568219000, chunk 0x7f95682191b0
LOCATION:  AllocSetCheck, aset.c:1346
WARNING:  01000: problem in alloc set pg_query_parse: detected write past chunk end in block 0x7f9568219000, chunk 0x7f95682191b0
LOCATION:  AllocSetCheck, aset.c:1354
WARNING:  01000: problem in alloc set pg_query_parse: found inconsistent memory block 0x7f9568219000
LOCATION:  AllocSetCheck, aset.c:1364
error: Process didn't exit successfully: `/home/sfackler/rust/rust-pg_query/pg_query/target/debug/pg_query-773a16cd18fc4144` (signal: 11, SIGSEGV: invalid memory reference)

To learn more, run the command again with --verbose.

PLPGSQL - Using record fields

Hi,

I can't assign values to record fields:

CREATE FUNCTION test(str character varying) RETURNS integer
LANGUAGE plpgsql
AS $$
DECLARE
  v3 RECORD;
  v4 integer;
BEGIN
  select 1 as c1, 2 as c2 into v3;
  v3.c1 := 4;
END;$$;

Error:

ERROR: "v3.c1" is not a known variable
CONTEXT: compilation of PL/pgSQL function "test" near line 7
LOCATION: cword_is_not_variable, pl_gram.y:2534

But PostgreSQL accepts this syntax.

Great library by the way, keep up the good work.

Rafael.

Questions on the new (upcoming?) v13 support

Hi Lukas, first of all, thanks for your effort to maintain this library!

I spent a few hours understanding what's new in the 13-latest-develop branch, with the short-term goal of adapting my pglast wrapper.

The following changes obviously have a considerable impact:

  1. the new top level result (introduced by commit 53ae093)
  2. omission of attributes with value of zero, in particular for enums (commit f36f9ab)

For the first, while I can see the advantage of getting rid of the mostly useless RawStmt container, I wonder what the added value is of reporting the underlying PG version in the parse result: why can't it simply be a constant #define exposed by the library, as before?

The second is more worrisome: I'm quite sure the PG dev team won't change enum definitions lightly, but IMHO introducing such implicitness just to reduce the output size is not a good move; is there any other rationale behind this change?

What is missing to be able to fully parse PL/pgSQL functions?

Hello,

First of all thanks a lot for this amazing project. :-)

I really need a fully functional parser for PostgreSQL and PL/pgSQL for my PhD.
Can you summarise what is missing and perhaps give me pointers on how to do it?

I may be able to help finish the parser, but I can't promise; it depends on the effort required.

I have examples of PL/pgSQL code from a real database that the parser cannot handle yet.

Thanks in advance,

Julien

Expose the PG_VERSION

It would be nice if the actual PG_VERSION the library was built against could be queried programmatically through a function, or, even simpler, via a constant string.
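
Recent releases do expose this: pg_query.h defines version macros that can be printed directly. A minimal sketch, assuming a current header (older branches such as 10-latest may not define these):

#include <pg_query.h>
#include <stdio.h>

int main() {
  /* Assumed to be defined by recent pg_query.h headers */
  printf("PG_VERSION: %s\n", PG_VERSION);
  printf("PG_MAJORVERSION: %s\n", PG_MAJORVERSION);
  printf("PG_VERSION_NUM: %d\n", PG_VERSION_NUM);
  return 0;
}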

Problems with linking of headers/libraries?

Hello Lukas!
I installed libpg_query just as described in Readme/Wiki.
Now I'm facing a problem compiling the examples because of undefined references to some functions, despite strictly following your instructions about compiling.

dmitry@dmitry-Aspire-V3-571G ~/libpg_query/examples $ cc -I../ -L../ -lpg_query simple.c

/tmp/ccx5VZ4i.o: In function `main': simple.c:(.text+0x29): undefined reference to `pg_query_parse'
simple.c:(.text+0x7f): undefined reference to `pg_query_free_parse_result'
collect2: error: ld returned 1 exit status

Why do I get this linking error?
Thank you very much in advance for your answer.

postgresql 9.6

Hi!
Sorry for my english!! :(

I have no experience creating branches or sharing code.

For my project, I need to use your library (pg_query) with PostgreSQL 9.6.3, so I made the following changes:

  1. In "Makefile", I did:
  • PG_VERSION = 9.6.3
  • Comments the next lines (I did it manually)
    #cd $(PGDIR); patch -p1 &lt; $(root_dir)/patches/01_parse_replacement_char.patch
    #cd $(PGDIR); patch -p1 &lt; $(root_dir)/patches/02_normalize_alter_role_password.patch

Then I compiled with the "make" command and fixed some errors such as:
src/postgres/src_backend_utils_misc_guc.c:1334:4: error: 'max_function_args' undeclared here (not in a function) &max_function_args, ^~~~~~~~~~~~~~~~~
src/postgres/src_backend_utils_misc_guc.c:174:13: warning: 'assign_tcp_keepalives_idle' used but never defined static void assign_tcp_keepalives_idle(int newval, void *extra);
etc...

Finally, I ran the tests... but they display the following error:

Makefile:106: recipe for target 'test' failed
make: *** [test] Segmentation fault (core dumped)

When I run the programs in the examples directory (simple, simple_plpgsql, etc.), there are no errors; that is, they execute correctly.

Can you help me?

Thank you in advance

Install fails on CentOS 6.6

Server: CentOS release 6.6 (Final)

[deploy@hamming tmp]$ git clone -b 9.5-latest git://github.com/lfittl/libpg_query
[deploy@hamming current]$ cd /tmp/
[deploy@hamming libpg_query]$ make
compiling src/pg_query.c
In file included from ./src/postgres/include/postgres.h:47,
                 from src/pg_query_internal.h:4,
                 from src/pg_query.c:2:
./src/postgres/include/c.h:298: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘int128’
./src/postgres/include/c.h:299: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘uint128’
make: *** [src/pg_query.o] Error 1

PgQueryError not zero'd on allocation

I think the new funcname field is never being filled in, and since the struct isn't zero'd it contains garbage. I was seeing segfaults when trying to use the new version, which turned out to be me forgetting to update the header file, but then I was still getting them when adding support for the new field in the errors.

PL/pgSQL parser does not recognize procedure parameters in DECLARE statement

This is a valid PL/pgSQL script, but pg_query_parse_plpgsql returns an error.

Input:
CREATE OR REPLACE FUNCTION cs_fmt_browser_version(v_name varchar, v_version varchar)
RETURNS varchar AS $$
DECLARE
_a int;
_v_name_alias ALIAS FOR $0;
BEGIN
IF v_version IS NULL THEN
RETURN v_name;
END IF;

RETURN v_name || '/' || v_version;
END; $$ LANGUAGE plpgsql;

Error output:
variable "$0" does not exist

It seems the statement parser doesn't have access to the procedure's parameter list.

Support configuring the parser w/ relevant server settings?

My understanding is that some PG server options can change how parsing works (e.g., allow_nulls and standard_conforming_strings). Does libpg_query support configuring the parser with these settings, to match the server? Can/should it?

Memory leak in pg_query_parse.c:52

Hi, I compiled 'parse_plpgsql.c' and tested it with the file "plpgsql_samples.sql". ASan and Valgrind find a memory leak at pg_query_parse.c:52:

==136926==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 1 byte(s) in 1 object(s) allocated from:
    #0 0x7fd98c8b32f0 in __strdup (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x752f0)
    #1 0x414020 in pg_query_raw_parse src/pg_query_parse.c:52

SUMMARY: AddressSanitizer: 1 byte(s) leaked in 1 allocation(s).
==137686== Memcheck, a memory error detector
==137686== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==137686== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==137686== Command: ./parse_plpgsql plpgsql_samples.sql
==137686== 
==137686== 
==137686== HEAP SUMMARY:
==137686==     in use at exit: 345,690 bytes in 45 blocks
==137686==   total heap usage: 198 allocs, 153 frees, 1,276,459 bytes allocated
==137686== 
==137686== 1 bytes in 1 blocks are definitely lost in loss record 1 of 31
==137686==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==137686==    by 0x4EBFB49: strdup (strdup.c:42)
==137686==    by 0x414020: pg_query_raw_parse (pg_query_parse.c:52)
==137686==    by 0x4022BC: pg_query_parse_plpgsql (pg_query_parse_plpgsql.c:396)
==137686==    by 0x401785: main (parse_plpgsql.c:44)
==137686== 
==137686== LEAK SUMMARY:
==137686==    definitely lost: 1 bytes in 1 blocks
==137686==    indirectly lost: 0 bytes in 0 blocks
==137686==      possibly lost: 0 bytes in 0 blocks
==137686==    still reachable: 345,689 bytes in 44 blocks
==137686==         suppressed: 0 bytes in 0 blocks
==137686== Reachable blocks (those to which a pointer was found) are not shown.
==137686== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==137686== 
==137686== For counts of detected and suppressed errors, rerun with: -v
==137686== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 1)

After debugging with gdb, it seems 'result.stderr_buffer' isn't freed correctly when parsing PL/pgSQL (it looks like pg_query_free_parse_result (pg_query_parse.c:116) is never called).

Issues with latest 10-1.0.5

Hi, I spent some time upgrading v1 of my pglast from 1.0.2 to the just-released 1.0.5, and I'm noticing some strange things that turned lots of tests red.

For example, the statement

CREATE TABLE a(t text collate "C")

produced the following JSON tree:

[
  {
    "RawStmt": {
      "stmt": {
        "CreateStmt": {
          "oncommit": 0,
          "relation": {
            "RangeVar": {
              "inh": true,
              "location": 13,
              "relname": "a",
              "relpersistence": "p"
            }
          },
          "tableElts": [
            {
              "ColumnDef": {
                "collClause": {
                  "CollateClause": {
                    "collname": [
                      {
                        "String": {
                          "str": "C"
                        }
                      }
                    ],
                    "location": 22
                  }
                },
                "colname": "t",
                "is_local": true,
                "location": 15,
                "typeName": {
                  "TypeName": {
                    "location": 17,
                    "names": [
                      {
                        "String": {
                          "str": "text"
                        }
                      }
                    ],
                    "typemod": -1
                  }
                }
              }
            }
          ]
        }
      }
    }
  }
]

while today's version emits:

[
  {
    "RawStmt": {
      "stmt": {
        "CreateStmt": {
          "oncommit": 0,
          "relation": {
            "RangeVar": {
              "inh": true,
              "location": 13,
              "relname": "a",
              "relpersistence": "p"
            }
          },
          "tableElts": [
            {
              "ColumnDef": {
                "collOid": 34931104,
                "colname": "t",
                "is_local": true,
                "typeName": {
                  "TypeName": {
                    "location": 17,
                    "names": [
                      {
                        "String": {
                          "str": "text"
                        }
                      }
                    ],
                    "typemod": -1
                  }
                }
              }
            }
          ]
        }
      }
    }
  }
]

As you can see, there's that strange collOid item that replaced the CollateClause node.

Another thing that surprised me is that the following statement:

CREATE TABLE measurement_y2016m07
    PARTITION OF measurement (
    unitsales DEFAULT 0
) FOR VALUES FROM ('2016-07-01') TO ('2016-08-01') TABLESPACE olddata

the resulting JSON trees are very similar, but with the following peculiarity:

$ diff -u /tmp/old /tmp/new
--- /tmp/old	2021-02-19 00:05:10.546223430 +0100
+++ /tmp/new	2021-02-19 00:04:57.694362528 +0100
@@ -68,7 +68,7 @@
             {
               "ColumnDef": {
                 "colname": "unitsales",
-                "constraints": [
+                "fdwoptions": [
                   {
                     "Constraint": {
                       "contype": 2,

Isn't it a bit strange to see that fdwoptions in a statement that is not related to foreign data wrappers?

Are these differences expected?

SubLink node doesn't emit operName field

In the V2 output, operName is missing from SubLink nodes. This is required to differentiate IN from ANY in sublinks with subLinkType of 2. I'm not sure what's happening because I do see operName defined in the data files.

(Yet another corner case from the postgres test files :)

Test query:

select sum(unique1) FILTER (WHERE unique1 IN (SELECT unique1 FROM onek where unique1 < 100)) FROM tenk1

V2 Output:

https://gist.github.com/zhm/29ae8af205f135596fd024172121f220

V1 Output:

https://gist.github.com/zhm/c6c23e9889011d617eb0a4163a888779

Question: json to sql?

Hi,

I apologize if this is the wrong place for the question. Is there a way to turn the JSON back into SQL?

Thanks

store procedure with plperl

Hi!

For my project I need to parse stored procedures written in plperl; is that possible? Can you guide me on how to add this feature?

thank you in advance!

Doubt about parameter placeholders

Hi,

By pure chance I fed a statement like SELECT foo FROM bar WHERE id = ? to my pg_query prettifier, and was surprised to see that libpg_query parsed it without complaint, emitting a ParamRef without the usual number slot.

Is this intentional or not? In other words, should I release the trivial tweak I made to pg_query to handle the case? 😃

Comment token for lexical analysis

We are running lexical analysis on PostgreSQL queries, using the expose-scanner commit (66470f2) on top of your 10-latest branch. The only thing we are missing is tokens for comments:

SELECT * FROM x /* foo */;

currently results in tokens

  tokens: 5, size: 45
  [ 0, 6, 597, 4 ]
  [ 7, 8, 42, 0 ]
  [ 9, 13, 417, 4 ]
  [ 14, 15, 258, 0 ]
  [ 25, 26, 59, 0 ]

For our use case of lexical analysis we require a token for the comments.

We are aware that this is not required for the parser. Anyway, any idea how we can get these tokens from the scanner with minor code changes (e.g. in src/postgres/src_backend_parser_scan.c)?

Thumbs up for the great work!

Best, Patrick

deparsing

I know a few libraries based on this one are, I believe, manually writing deparsing code while leveraging the native parsing: https://github.com/lfittl/pg_query and https://github.com/zhm/pg-query-parser, to name a few.

Is there a way to currently deparse with pg_query that I'm unaware of?

I know Postgres has a lot of this internally (https://doxygen.postgresql.org/deparse_8c_source.html), and I'm wondering whether exposing it would benefit the community in general.

@lfittl thoughts or tips?

Preserve comments

Is there any way to preserve comments in the output of the query parser? It seems that comments aren't part of the PG parse tree, but it would be nice to be able to preserve their content/location somehow.

Build fails on 32 bit system

qnet@devel-o9:~/tmp/libpg_query$ CC=/usr/bin/gcc-4.9 make
compiling src/pg_query.c
In file included from ./src/postgres/include/c.h:53:0,
                 from ./src/postgres/include/postgres.h:47,
                 from src/pg_query_internal.h:4,
                 from src/pg_query.c:2:
./src/postgres/include/pg_config.h:725:24: error: ‘__int128’ is not supported for this target
 #define PG_INT128_TYPE __int128
                        ^
./src/postgres/include/c.h:298:9: note: in expansion of macro ‘PG_INT128_TYPE’
 typedef PG_INT128_TYPE int128;
         ^
In file included from ./src/postgres/include/postgres.h:47:0,
                 from src/pg_query_internal.h:4,
                 from src/pg_query.c:2:
./src/postgres/include/c.h:298:24: warning: type defaults to ‘int’ in declaration of ‘int128’ [-Wimplicit-int]
 typedef PG_INT128_TYPE int128;
                        ^
In file included from ./src/postgres/include/c.h:53:0,
                 from ./src/postgres/include/postgres.h:47,
                 from src/pg_query_internal.h:4,
                 from src/pg_query.c:2:
./src/postgres/include/pg_config.h:725:24: error: ‘__int128’ is not supported for this target
 #define PG_INT128_TYPE __int128
                        ^
./src/postgres/include/c.h:299:18: note: in expansion of macro ‘PG_INT128_TYPE’
 typedef unsigned PG_INT128_TYPE uint128;
                  ^
Makefile:73: recipe for target 'src/pg_query.o' failed
make: *** [src/pg_query.o] Error 1

Build with os x target.

Hi,
I get a warning when running make with the default config.

libpg_query.a(pg_query_json_plpgsql.o)) was built for newer OSX version (10.12) than being linked (10.11)

How can I specify the OS X target when building the source code?
Awesome work and many thanks 👍

any plans to upgrade to 11 or 12?

Hello @lfittl !

Hope you're doing well, it's been a while! I've been building a parser that relies on this library, and I've started to notice some differences now that I'm using 11.2. Do you think you'll be adding support for newer versions?

For context, here's the old thread we were on... zhm/node-pg-query-native#9. My plan is to keep upgrading with each release pg_query makes :)

Wishlist: Indicate optional vs required fields

I have no idea if this is possible, but it'd be really nice if the JSON data had an indication of whether a given struct field is required or optional. It's not really possible to intuit this simply from the C type.

I'm using these JSON defs to generate Rust code, with one struct for each struct that the parser can parse. Since I know some fields are optional, but not which ones, I'm wrapping every type in an Option<>. This makes working with the data tedious, since you constantly have to either check whether the option is_some() or just go for it and call unwrap(), possibly panicking. Only using Option<> where needed would greatly improve the API.

Line/Column numbers

Does postgres return line/column numbers in its parse tree? If it does, is it possible to put them in the JSON output?

I would like to use node-pg-query-native to split a complex schema creation file into its individual commands, but I can't just use query.split(/;/g) because the file contains plpgsql functions, and pg-query-parser can't deparse some of the node types that I'm using (e.g., DROP statements). If the libpg_query output included line/column numbers, I could use this parser to determine where commands start and end, then use that information to split the original queries out of the original SQL file.

Missing ppc atomics

Compiling libpg_query on powerpc64 platform results in:

./src/postgres/include/port/atomics.h:74:10: fatal error: 'port/atomics/arch-ppc.h' file not found
#include "port/atomics/arch-ppc.h"
         ^~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

change the content of data in a node

Excuse me,
how do I change the content of a field in a node?
For example:
relname in RangeVar is t_test
I would like to change it to t_test_1
Please also tell me how to handle the memory, if convenient.

Case sensitive names?

I am trying to make a tool that would automatically quote all identifiers to make them case sensitive. I thought of parsing and deparsing (using this Node module), but I am realizing that all identifiers in the parsed tree are already lowercased. Is there some way to avoid that, i.e. to get a parse tree with the original string values?
