zombodb's Issues

Create .rpm and .deb and .tgz packages

ZomboDB needs .rpm, .deb, and .tgz packages, rather than forcing users to install from source.

Initially the plan is to support CentOS and Ubuntu on 64-bit Intel platforms only.

Alter table re-creates ES index...but zdb_estimate_count does not work until you re-index

Altering a field's type on a table that has an ES index triggers a 'delete index' then 'create index' sequence according to the ES logs. Afterwards you can still search some fields, but zdb_estimate_count returns '0' until you re-index.

steps to re-create:

create table public.mam_test_reindex(pk_id serial8,my_text text[], constraint idx_mam_test_reindex_pkey primary key (pk_id));

CREATE INDEX es_mam_test_reindex ON public.mam_test_reindex USING zombodb (zdb('mam_test_reindex', ctid), zdb(mam_test_reindex.*)) WITH (url='http://localhost:9200/', replicas=1, shards=5);

insert into public.mam_test_reindex(my_text) values(array['bob']);
insert into public.mam_test_reindex(my_text) values(array['mark']);
my_db=# select * from public.mam_test_reindex t where zdb(t) ==>'my_text: "bob"';
 pk_id | my_text
-------+---------
     1 | {bob}
(1 row)

Time: 6.387 ms
my_db=# select * from zdb_estimate_count('mam_test_reindex','my_text: "bob"');
 zdb_estimate_count
--------------------
                  1
(1 row)


alter table public.mam_test_reindex alter column my_text type text using my_text[1];

The ES master logs imply the index was re-created after the ALTER TABLE command:

[2015-10-12 12:39:34,291][INFO ][cluster.metadata         ] [my_db_esmaster0] [my_db.public.mam_test_reindex.es_mam_test_reindex] deleting index
[2015-10-12 12:39:34,368][INFO ][cluster.metadata         ] [my_db_esmaster0] [my_db.public.mam_test_reindex.es_mam_test_reindex] creating index, cause [api], templates [], shards [55]/[0], mappings [xact, data]
[2015-10-12 12:39:34,582][INFO ][cluster.metadata         ] [my_db_esmaster0] updating number_of_replicas to [1] for indices [my_db.public.mam_test_reindex.es_mam_test_reindex]
[2015-10-12 12:39:34,979][INFO ][cluster.metadata         ] [my_db_esmaster0] [my_db.public.mam_test_reindex.es_mam_test_reindex] update_mapping [data]

zdb_estimate_count no longer returns a result, but searching does:

my_db=# select * from public.mam_test_reindex t where zdb(t) ==>'my_text: "bob"';
 pk_id | my_text
-------+---------
     1 | bob
(1 row)

Time: 10.784 ms
my_db=# select * from zdb_estimate_count('mam_test_reindex','my_text: "bob"');
 zdb_estimate_count
--------------------
                  0
(1 row)

my_db=# reindex index es_mam_test_reindex;

From the ES master log:

[2015-10-12 12:50:29,668][INFO ][cluster.metadata         ] [acscv6esmaster0] [my_db.public.mam_test_reindex.es_mam_test_reindex] deleting index
[2015-10-12 12:50:29,762][INFO ][cluster.metadata         ] [acscv6esmaster0] [my_db.public.mam_test_reindex.es_mam_test_reindex] creating index, cause [api], templates [], shards [55]/[0], mappings [xact, data]
[2015-10-12 12:50:29,977][INFO ][cluster.metadata         ] [acscv6esmaster0] updating number_of_replicas to [1] for indices [my_db.public.mam_test_reindex.es_mam_test_reindex]

Now we have results again:

my_db=# select * from public.mam_test_reindex t where zdb(t) ==>'my_text: "bob"';
 pk_id | my_text
-------+---------
     1 | bob
(1 row)

Time: 6.542 ms
my_db=# select * from zdb_estimate_count('mam_test_reindex','my_text: "bob"');
 zdb_estimate_count
--------------------
                  1
(1 row)

Allow for ES-side limit of size and params

Perhaps this is already solved in your API but I wasn't able to find it. If so, apologies!

Let's say we had a lot of songs.


SELECT * FROM songs WHERE zdb(song) ==> 'release_year>1980';

Appending "LIMIT 20" onto this will not prevent ES from getting and sending back all songs that match this constraint (is this correct?). If that's a lot of songs, that's a lot of serialization, bandwidth, etc from what I can see.

Also, have you considered a way to limit the number of fields in the ES query? What if only the band_id was of interest?

Multi-level aggregates with fields in different indexes returns bogus results

When using zdb_arbitrary_aggregate(), if you specify, for example, a #tally() against a field in one index that contains another #tally() against a field in a linked index, the resulting JSON contains all the terms for the outer #tally() but no sub-buckets for the nested #tally(). This is because ES cannot build aggregates across indexes.

The way to fix this would be to run the first-level aggregate and then iteratively run the second-level aggregate for each term found by the first-level.

In my view this is more work than will be useful for ZDB, so instead, I'm going to teach ZDB about this situation and have it throw an exception.

This will solve the TODO at QueryRewriter.java:305

New release packages don't necessarily work

The zombodb.so library in the Ubuntu:Precise .deb (and .tgz) fails to load, complaining about a missing MD5_init symbol.

To fix this, don't compile our statically-linked version of libcurl with SSL support. Additionally, ZomboDB needs to be linked against librt due to libcurl's use of (at least) clock_gettime.

I'm just going to go ahead and do the same for all the packages so that they're consistent.

VACUUM needs to batch requests to _bulk

The VACUUM hook needs to batch its requests to the _bulk endpoint. Otherwise, when VACUUM has millions of dead rows to delete, Postgres can fail with:

vacuumdb: vacuuming of database "xxxx" failed: ERROR: out of memory
DETAIL: Cannot enlarge string buffer containing 1073727206 bytes by 16384 more bytes.

PostgreSQL extension - Undefined symbol: curl_multi_wait on RHEL/CentOS 6.x

Hi Eric,

Installing the ZomboDB PostgreSQL extension on RHEL/CentOS 6.x results in the following error message on session start:

ERROR:  could not load library "/usr/pgsql-9.3/lib/plugins/zombodb.so": /usr/pgsql-9.3/lib/plugins/zombodb.so: undefined symbol: curl_multi_wait

It appears to be caused by the version of libcurl on this platform:

Installed Packages
libcurl.x86_64                  7.19.7-40.el6_6.4
libcurl-devel.x86_64            7.19.7-40.el6_6.4

According to this, Red Hat and CentOS backport bugfixes into the 7.19 release but apparently not features (curl_multi_wait appears to have been introduced in 7.28 according to the author; it also seems that 7.19 is the latest package available for 6.x, and 7.23 is the latest for 7.x, neither of which would include it).

Unescape handling broken

ZomboDB's character un-escape routine is bugged such that it converts \\\\Foobar into \\\Foobar when it should be \\Foobar.

Additionally the zdb_tally() function should just flat out avoid doing a regex filter on the term if the incoming regex contains a backslash.

Need Postgres-compatible 'make installcheck' test suite

Find a decent-sized set of public/free relational and structured data that also contains large textual content to replicate what is currently an internal-only test suite.

Maybe US census data? Stackoverflow data dumps? IMDB? Not exactly sure yet.

Record count estimation and aggregations

In a project I work on, we have a couple of requirements that are difficult (impossible?) to implement in a performant manner using just PostgreSQL, namely:

  • counting the number of records in a table (possibly filtered with a WHERE clause)
  • providing counts of occurrences of particular strings across a whole table.

In my experimentation with elasticsearch, it seems to be well suited to exactly this kind of thing (where exact counts are not needed, just near enough).

Hence, I'm particularly interested in this project, and in particular in the README's feature list mentioning record count estimation and access to Elasticsearch's full set of aggregations.

I can't find anywhere in the documentation that shows how this can work. Is that coming in https://github.com/zombodb/zombodb/blob/master/SQL-API.md perhaps?
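For reference, a minimal usage sketch pieced together from the zdb_estimate_count() and zdb_tally() calls that appear in other issues in this list (the table name and query here are hypothetical, and the exact signatures should be confirmed against SQL-API.md):

-- estimated number of rows matching a full-text query
SELECT * FROM zdb_estimate_count('songs', 'release_year > 1980');

-- per-term occurrence counts for a column, across the whole table
SELECT * FROM zdb_tally('songs', 'genre', '^.*', '', 5000, 'count');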

zdb_determine_index can't find relations when schemas are not in the search path

When running a query like:

SELECT * FROM zdb_tally('foo.bar_view', 'customer', '0', '^.*', '', 2147483647, 'count'::zdb_tally_order); 

the foo schema must be in the search path, even though it's qualified when calling into the zdb_tally function. The error is:

ERROR:  relation "XYZ" does not exist
CONTEXT:  PL/pgSQL function zdb_determine_index(regclass) line 20 at assignment
PL/pgSQL function zdb_tally(regclass,text,boolean,text,text,bigint,zdb_tally_order) line 11 at assignment 
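Until that's fixed, a workaround sketch (assuming the foo schema from the example above) is to put the schema on the session's search_path before calling the function:

SET search_path = foo, public;
SELECT * FROM zdb_tally('foo.bar_view', 'customer', '0', '^.*', '', 2147483647, 'count'::zdb_tally_order);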

ZomboDB needs to support Sequential Scans, Filter, and ReCheck conditions

Until now, ZDB has only been able to work with Postgres query plans that project Index Scans (and Bitmap Index Scans w/ caveats). This needs to change so that complex "relational" queries can be used.

The seqscan branch gets this process started. It also slightly changes the syntax used for CREATE INDEX and for SELECT.

As such, when query plans like:

EXPLAIN SELECT * FROM products WHERE zdb('products', products.ctid) ==> 'sports or box';
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Seq Scan on products  (cost=0.00..1.15 rows=1 width=153)
   Filter: (zdb('products'::regclass, ctid) ==> 'sports or box'::text)
(2 rows)

are generated, things actually work!

So far, it's working really well. The documentation has been updated to the new syntax, but no "what's changed" document has been written yet.

There's a list of things that need to happen first:

  • Fix segfault cases when the ==> operator is improperly used
  • Remove the "w/ caveats" bit around Bitmap Index Scans by copying the TIDBitmap structure from Postgres and hacking it to flat out ignore "work_mem" settings
  • Correctly implement the operator selectivity function (zdbsel) by using zdb_estimate_count()
  • Figure out what a good costing value is for the zdb_tid_query_func() method so that PG doesn't necessarily prefer sequential scans
  • Document "what's changed" regarding SQL-level syntax
  • More SQL-level unit tests

My goal is to have all of the above wrapped up and merged into 'develop' by Monday September 21.

ZomboDB requests keep-alive connections via libcurl when it shouldn't

The non-multi API codepath requests that libcurl reuse connections when it shouldn't. The connections are always closed/freed by ZomboDB when the transaction is finished, so keep-alives are pointless. This can cause an nginx reverse proxy to complain:

2015/08/11 20:44:10 [info] 20459#0: *49426 client 127.0.0.1 closed keepalive connection

Difficulty with aggregations

I created the 'books' table from the tutorial and loaded a couple rows. The SELECT works great!

I'm really interested in the aggregations but can't get either of the agg-related functions to work. This might be operator error - if so, perhaps the API docs can match the story started in the tutorial?


SELECT * FROM zdb_arbitrary_aggregate('books', '{
 "aggregations": {
   "my_agg": {
     "terms": {
       "field": "title"
     }
   }
 }
}', '');

yields the following error in ES:


ERROR:  rc=500; QueryRewriteException[com.tcdi.zombodb.query_parser.ParseException: Encountered "  ": "" at line 2, column 16.
Was expecting one of:
     ...
    "and" ...
    "or" ...
    "," ...
    "&" ...
    "!" ...
     ...
Tried with field "content" to no avail.

No joy with the significant terms method either:

SELECT * FROM zdb_significant_terms('books', 'content', '', '', 5000);

yields the following error in ES:

org.elasticsearch.ElasticsearchIllegalStateException: Field data loading is forbidden on content

Allow for different tokenization in text fields

If this is already possible, my apologies - couldn't find it in the docs.

The 'exact' ES analyzer (with the keyword tokenizer) is applied to 'text' Postgres fields. This makes getting the most common individual words from 'text' columns, for example, unattainable via zdb_tally().

Wondering what you think about this - thanks for your continued attention (and feel free to ignore me if I'm taking up too much of your time)!

`count_of_table()` function improperly quotes table names

The count_of_table() function uses %I to quote the table_name argument when in fact it should use %s because the implicit conversion from regclass to text will do the right thing.

This can cause the zdb_index_stats view to fail with ERROR: relation "schema.table" does not exist if the table isn't in the search path, or is obscured by another table with the same name in a schema that appears earlier in the search path.
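A minimal sketch of the difference, assuming a count_of_table()-like helper (the real function's body may differ). With %s, the regclass value renders its own correctly-quoted, schema-qualified name; %I re-quotes the whole string as a single identifier:

CREATE OR REPLACE FUNCTION count_of_table(table_name regclass) RETURNS bigint AS $$
DECLARE
    cnt bigint;
BEGIN
    -- %s: a regclass renders as, e.g., schema.table (quoted only where needed)
    -- %I would instead produce "schema.table" as one quoted identifier, which fails
    EXECUTE format('SELECT count(*) FROM %s', table_name) INTO cnt;
    RETURN cnt;
END;
$$ LANGUAGE plpgsql;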

Search JSON field with multiple criteria on same JSON element.

No results are returned when searching a JSON field using multiple criteria on the same JSON element.

steps to re-create:

create table public.mam_test_json(pk_id serial8, my_json json);
CREATE INDEX es_mam_test_json ON public.mam_test_json USING zombodb (zdb(mam_test_json.*)) WITH (url='http://###.##.###.##:####/', replicas=1, shards=5);

insert into public.mam_test_json(my_json) values('[{"sub_id":"1","sub_state":"NC","sub_status":"A"},{"sub_id":"2","sub_state":"SC","sub_status":"I"}]');
insert into public.mam_test_json(my_json) values('[{"sub_id":"1","sub_state":"NC","sub_status":"A"}]');
  • Below is our attempt to find a record using multiple pieces of criteria on records in a JSON field.
  • We need to identify a record that has both NC and SC in the JSON (my_json.sub_state) field, but there can also be additional criteria on other JSON elements.

These both return 0 results:

select * from mam_test_json j where zdb(j)==>'(my_json.sub_state:"SC" AND my_json.sub_status:"I") AND (my_json.sub_state:"NC" AND my_json.sub_status:"A")';
select * from mam_test_json j where zdb(j)==>'my_json.sub_state:"SC" AND my_json.sub_state:"NC"';

This is what the data looks like:

select * from mam_test_json j where zdb(j)==>'(my_json.sub_state:"NC" AND my_json.sub_status:"A")';
 pk_id |                                               my_json
-------+-----------------------------------------------------------------------------------------------------
     1 | [{"sub_id":"1","sub_state":"NC","sub_status":"A"},{"sub_id":"2","sub_state":"SC","sub_status":"I"}]
     2 | [{"sub_id":"1","sub_state":"NC","sub_status":"A"}]
(2 rows)

Preferred method for range aggregations?

Let's say there exists a 100M row table and we'd like to run a range aggregation over part of it. The integer values are 1-50. Of the 100M total rows, 6M of them match the query label_id:123. The bucket size we want is 5 (10 buckets of 5). So ultimately we'd like:

bucket,count
6-10,350000
31-35,200000
1-5,150000
.......

I can't locate an API hook for the range aggregation so the best solution I could find is to run 10 zdb_tally() requests like:


SELECT * FROM zdb_tally('songs', 'integer_value', '^.*', 'label_id:123 integer_value:1 /TO/ 5', 100, 'count');

and sum up the counts to complete the 1-5 bucket. Not an awesome solution since that's 10 3-second requests.

Any suggestions?

If this is too custom of a use case but technically feasible, perhaps this is something I can code and contribute.

Thanks, Ryan
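One untested idea, sketched purely as an assumption: if zdb_arbitrary_aggregate() (shown in another issue in this list) passes its JSON through to Elasticsearch unmodified, an ES "range" aggregation might cover all ten buckets in a single request:

SELECT * FROM zdb_arbitrary_aggregate('songs', '{
  "aggregations": {
    "value_ranges": {
      "range": {
        "field": "integer_value",
        "ranges": [
          {"from": 1,  "to": 6},
          {"from": 6,  "to": 11},
          {"from": 11, "to": 16}
        ]
      }
    }
  }
}', 'label_id:123');

(ES range buckets treat "to" as exclusive, so {"from": 1, "to": 6} covers the 1-5 bucket; the remaining ranges would follow the same pattern.)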

zdb_suggest_terms() throws a NullPointerException

java.lang.NullPointerException
    at com.tcdi.zombodb.query_parser.QueryRewriter.getAggregateFieldName(QueryRewriter.java:261)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:325)
    at com.tcdi.zombodb.query_parser.QueryRewriter.rewriteSuggestions(QueryRewriter.java:245)
    at com.tcdi.zombodb.postgres.PostgresAggregationAction.handleRequest(PostgresAggregationAction.java:62)
    at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:53)

Enable fielddata for columns of type 'phrase' and 'phrase_array'

Coming out of #31, I think it makes sense to enable fielddata for columns of type 'phrase' and 'phrase_array'.

Doing so will enable zdb_tally() and zdb_significant_terms() to work with such fields, which could be extremely useful for building word-clouds (or whatever) over columns that contain "short text".

The downside will be that ES will keep a fielddata cache for these fields, and if the data is dense (and if there's a lot of it), this could have negative impacts to ES' memory management. But ya gotta pay to play.

Query expansion (#expand) fails to include original records with NULL value

When expanding results on a field, the results should return all rows from the original query plus any rows that have the same expansion field value. There's a bug where rows from the original query that have a NULL value in the expansion field are omitted from the results.

Using psql and the contrib_regression database, this re-creates the problem:

# select parent_id, count(*) from so_posts where zdb('so_posts', ctid) ==> '#expand<parent_id=<this.index>parent_id>(beer)' group by parent_id order by parent_id desc;
 parent_id | count 
-----------+-------
    270821 |     2
    267981 |     8
    263389 |     1
    258311 |     5
    256833 |     8
...

Note that no rows with a NULL parent_id are returned, when in fact there should be 21:

# select parent_id, count(*) from so_posts where zdb('so_posts', ctid) ==> 'beer' group by parent_id;
 parent_id | count 
-----------+-------
           |    21
     58768 |     1
     52816 |     1
     91683 |     1
    165543 |     1
    166959 |     1
...

(relates to internal issue 3282)

Support Postgres 'uuid' type

ZDB needs to support Postgres 'uuid' type.

And more generally, ZDB shouldn't ERROR when it encounters a type it doesn't natively know about -- it should instead just assume it's an "exact" text field (i.e., unanalyzed in Elasticsearch) and emit a WARNING.

Support for Postgres 9.4?

Hi,

Is there a plan to support Postgres 9.4?
(if so, is there an intention to support its JSONB datatype?)

Many thanks

Tim

When creating a new Index, ZDB needs to wait for ES health status of "yellow" before returning control

The travis-ci tests occasionally fail the Postgres regression test for issue #58. After quite a bit of debugging, the failure isn't related to the changes introduced in issue #58; instead, the timing of the test is such that sometimes we call ZDB's _pgcount endpoint before the newly created ES index has finished migrating all its shards to the STARTED status.

After some head scratching, chatting with @nz (thanks!), and documentation reading, it looks like ZDB needs to call the /_cluster/health/<index_name> endpoint with ?wait_for_status=yellow before returning control back to Postgres. This should ensure that (at least) all the primary shards of a newly created index are actually available for use.

"binary" data can cause indexing to fail

a text string like:

""________________________________________________________________________
_________________8Qlฯ‡.
pรนvwSยฝWDโˆ‚ฯฝff0รญศŽยพ4rร•ศoUโŠƒdฤ–L4Bรณ ยงร‘รพรฏHsbรฝRฦฏr48ฯ–GฮฑFhยฎฤ–0t4B 3โ‚ฌ5eSYร—1oศบร‹ร”
ยขรŠV82yร‘ศˆTรณ2jNฮฑwยฅรŒGรซTKXSฮกFฮฆI GEQรำฆnuร’0N7hUS 383รˆT8794ำ‰โ‡5รŠรศ„ฯ†Fร”ยถ
jrร™รB5qPKษ†C9ยค5Sรณ4C9TQรขฮ’e รŸฯ‰1iDร–z3รŸล˜Pcยฒ.ว™ร”ร—โŠ‡ร€Gรดv5CS5ยซยกB!

can cause indexing to fail because when ZDB (in Postgres) converts the input string to lowercase, the resulting string has a different length than it did during input, causing the resulting JSON string to be malformed.
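A small illustration of the underlying problem (assuming a UTF-8 database; exact results depend on the server's locale): for some characters, lower() changes both the character count and the byte length of a string, which is the kind of drift that can break the generated JSON.

SELECT 'İ'                       AS original,
       lower('İ')                AS lowered,
       octet_length('İ')         AS original_bytes,
       octet_length(lower('İ'))  AS lowered_bytes;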

Certain phrases with wildcards can cause zdb_highlight() to fail

Using the contrib_regression database as an example, this will cause zdb_highlight() to throw an error about not understanding the ASTNotNull node:

SELECT * FROM zdb_highlight(
   'so_posts', 
   '( title:"* non * programmers" )', 
   'id IN (1,4,9)', 
   '{"title"}'::TEXT[]) 
ORDER BY "primaryKey", "fieldName", "arrayIndex", "position";

VACUUM takes a long time against large tables with thousands of dead rows

ZDB's zdbbulkdelete() function tracks an array of ItemPointers to delete using an O(n^2) algorithm, where N is the number of entries in the index. :(

This needs to be resolved. An autovacuum process can take hours to complete against a large table that's had a lot of churn since the last vacuum.

Backtrace is:

Program received signal SIGINT, Interrupt.
0x00007fad31a4dfcf in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007fad31a4dfcf in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fab1dc17408 in zdbbulkdelete (fcinfo=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/string3.h:52
#2  0x000000000072824c in FunctionCall4Coll (flinfo=<optimized out>, collation=<optimized out>, arg1=<optimized out>, arg2=<optimized out>, arg3=<optimized out>, arg4=<optimized out>) at fmgr.c:1379
#3  0x000000000048bd03 in index_bulk_delete (info=0x7ffeed0dde80, stats=0x0, callback=0x57cc80 <lazy_tid_reaped>, callback_state=0x1e62cf0) at indexam.c:688
#4  0x000000000057cbf0 in lazy_vacuum_index (indrel=0x7fab1de5e4a8, stats=0x1e62eb0, vacrelstats=0x1e62cf0) at vacuumlazy.c:1313
#5  0x000000000057dbfd in lazy_scan_heap (scan_all=0 '\000', nindexes=3, Irel=<optimized out>, vacrelstats=<optimized out>, onerel=<optimized out>) at vacuumlazy.c:1069
#6  lazy_vacuum_rel (onerel=<optimized out>, vacstmt=<optimized out>, bstrategy=<optimized out>) at vacuumlazy.c:236
#7  0x000000000057b5d6 in vacuum_rel (relid=16785, vacstmt=0x1e613e0, do_toast=1 '\001', for_wraparound=0 '\000') at vacuum.c:1205
#8  0x000000000057c1dc in vacuum (vacstmt=0x1e613e0, relid=<optimized out>, do_toast=1 '\001', bstrategy=<optimized out>, for_wraparound=0 '\000', isTopLevel=<optimized out>) at vacuum.c:234
#9  0x000000000065ae9f in standard_ProcessUtility (parsetree=0x1e613e0, queryString=0x1e609b0 "VACUUM VERBOSE cv_data ;", context=<optimized out>, params=0x0, dest=<optimized out>, completionTag=0x7ffeed0de840 "") at utility.c:639
#10 0x00000000006586a7 in PortalRunUtility (portal=0x1e649d0, utilityStmt=0x1e613e0, isTopLevel=1 '\001', dest=0x1e61740, completionTag=0x7ffeed0de840 "") at pquery.c:1187
#11 0x00000000006592a5 in PortalRunMulti (portal=0x1e649d0, isTopLevel=1 '\001', dest=0x1e61740, altdest=0x1e61740, completionTag=0x7ffeed0de840 "") at pquery.c:1318
#12 0x0000000000659de3 in PortalRun (portal=0x1e649d0, count=9223372036854775807, isTopLevel=1 '\001', dest=0x1e61740, altdest=0x1e61740, completionTag=0x7ffeed0de840 "") at pquery.c:816
#13 0x000000000065641e in exec_simple_query (query_string=0x1e609b0 "VACUUM VERBOSE cv_data ;") at postgres.c:1048
#14 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x1d440f8 "bamtest", username=<optimized out>) at postgres.c:3997
#15 0x0000000000615ba1 in BackendRun (port=0x1d68040) at postmaster.c:3996
#16 BackendStartup (port=0x1d68040) at postmaster.c:3685
#17 ServerLoop () at postmaster.c:1586
#18 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1253
#19 0x000000000045b280 in main (argc=3, argv=0x1d433d0) at main.c:206
Continuing.

Program received signal SIGINT, Interrupt.
0x00007fad31a4dfcf in ?? () from /lib/x86_64-linux-gnu/libc.so.6
Detaching from program: /tcdi/pgsql/bin/postgres, process 25421

Need Documentation

  • Document a README with an overview of what this thing is
  • Document installation process for both PG & ES
  • Document SQL API
  • Document query language
  • Document CREATE/ALTER/DROP INDEX semantics and possible configuration options.
    • including default values for 'shards', 'replicas', 'preference', and 'options'
  • Document Shadow Indexes
  • Document caveats, gotchas, and known edge-cases
  • Document technical details around ES index structure, tie-ins with Postgres' various commit and executor hooks, etc.
  • Document Developer guide (i.e., how to change ZomboDB proper)
  • Tutorial for how to index a table and search it

Support Scoring

ZomboDB needs to support scoring and it needs to expose the per-document score at the SQL level.

For performance reasons, it might be necessary to selectively enable/disable scoring on either/both the index level and the query level. Right now the binary representation of a row (for transmission from ES to PG) is just six bytes, but it'll need to be 10 bytes with a float4 score, so being able to disable this either wholesale or per-query could be useful, not to mention eliminating the need for ES to generate the scores at all.

ES' SCAN+SCROLL API doesn't return documents in score order, so returning documents to PG in score order would require sorting (potentially millions of) results in advance. Furthermore, PG doesn't guarantee result order without an ORDER BY clause, and while ZDB could control the order for Index Scan plans, it can't for Bitmap Index Scan and Sequential Scan plans, so I'm not sure there's much value in ZDB sorting by score.

Currently ZomboDB uses Filters (not Queries) to generate ES QueryDSL. Searching with Filters doesn't support scoring, so QueryRewriter.java will need to be changed to use Queries instead. This is advantageous on its own because ES 2.0 has completely done away with Filters and ZDB will need to work with ES 2.0 eventually.

Postgres 9.4 support?

Thanks for opensourcing this! From the description and the HN thread, it looks really promising and I'd love to be able to try it out.

Only problem is, all my data is in Postgres 9.4, while zombo is a 9.3 plugin.

Any chance of >= 9.4 coming along soon?

Provide a "WITH" query connector which will perform an AND comparison within a subfield of a table of type JSON

Given the following cartoon characters table, with very loose JSON and SQL notation,

Id  Name    Details
1   Fred    [ { State: MN, Day: 2 }, { State: TX, Day: 15 }, { State: NC, Day: 25 } ]
2   Barney  [ { State: NC, Day: 2 }, { State: SC, Day: 5 }, { State: AL, Day: 25 } ]
3   Wilma   [ { State: TN, Day: 7 } ]

To find the characters visiting NC on Day 2, we need to be able to query the values of NC and Day 2 in the same JSON field. This results in Barney being returned.

SELECT name from characters where details.state = 'NC' WITH details.day = 2

To find the characters visiting NC on Day 2 anywhere in the JSON field, the AND connector is used. The result is that both Fred and Barney meet the query criteria, because the AND applies to any occurrence of the values.

SELECT name from characters where details.state = 'NC' AND details.day = 2

To find the characters visiting NC on Day 2 AND SC on Day 5, two "WITH" criteria are combined with an AND connector. This results in only Barney being returned.

SELECT name from characters where 
    (details.state = 'NC' WITH details.day = 2) AND (details.state='SC' WITH details.day = 5)
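As a sketch of what this might look like in ZomboDB's query language (this reflects the proposal above, not an implemented feature, and the field names mirror the cartoon table):

SELECT name
  FROM characters c
 WHERE zdb(c) ==> '(details.state:"NC" WITH details.day:2) AND (details.state:"SC" WITH details.day:5)';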

Only support Elasticsearch 1.7.1+

Due to a pretty serious bug in Elasticsearch 1.5, 1.6, and 1.7.0 that was only fixed in the 1.6 and 1.7 branches, I've decided ZomboDB should only support v1.7.1+.

Thanks to @johnrballard for PR #26 to make this happen. The work has already been merged into the develop branch and will be merged into master and a new ZDB release cut (v2.1.38) later this week.

NullPointerException in QueryRewriter.java

An index option configuration that links multiple indexes:

alter index idxso_posts set (options='user_data:(owner_user_id=<so_users.idxso_users>id), comment_data:(id=<so_comments.idxso_comments>post_id)');

can cause an NPE in the Elasticsearch plugin when both named indexes are used in a single query, such as:

select count(*) from so_posts where zdb(so_posts) ==> 'user_data.display_name:j* and comment_data.user_display_name:q*';

The full stack is:

java.lang.RuntimeException: #options(user_data:(owner_user_id=<so_users.idxso_users>id), comment_data:(id=<so_comments.idxso_comments>post_id)) (((_xmin = 6510594 AND _cmin < 0 AND (_xmax = 0 OR (_xmax = 6510594 AND _cmax >= 0))) OR (_xmin_is_committed = true AND (_xmax = 0 OR (_xmax = 6510594 AND _cmax >= 0) OR (_xmax <> 6510594 AND _xmax_is_committed = false))))) AND (#child<data>((user_data.display_name:j* and comment_data.user_display_name:q*)))
    at com.tcdi.zombodb.postgres.PostgresTIDResponseAction.buildJsonQueryFromRequestContent(PostgresTIDResponseAction.java:136)
    at com.tcdi.zombodb.postgres.PostgresTIDResponseAction.handleRequest(PostgresTIDResponseAction.java:83)
    at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:53)
    at org.elasticsearch.rest.RestController.executeHandler(RestController.java:225)
    at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:170)
    at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:121)
    at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:83)
    at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:329)
    at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:63)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.http.netty.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:60)
    at org.elasticsearch.common.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.handler.codec.http.HttpContentEncoder.messageReceived(HttpContentEncoder.java:82)
    at org.elasticsearch.common.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:145)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
    at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
    at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at com.tcdi.zombodb.query_parser.QueryRewriter.expand(QueryRewriter.java:1403)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:543)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:439)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:426)
    at com.tcdi.zombodb.query_parser.QueryRewriter.loadFielddata(QueryRewriter.java:1253)
    at com.tcdi.zombodb.query_parser.QueryRewriter.expand(QueryRewriter.java:1397)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:613)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build0(QueryRewriter.java:480)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:445)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:426)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:499)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:437)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:426)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:574)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:431)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:426)
    at com.tcdi.zombodb.query_parser.QueryRewriter.rewriteQuery(QueryRewriter.java:206)
    at com.tcdi.zombodb.postgres.PostgresTIDResponseAction.buildJsonQueryFromRequestContent(PostgresTIDResponseAction.java:127)
    ... 49 more

Aggregate Corruption When Using pgsql Function With EXCEPTION handling

Running an UPDATE statement that calls a pgsql utility function with EXCEPTION handling fails and corrupts the index.

-- DROP TABLE tas_update_fail;

CREATE TABLE tas_update_fail( pk_id serial8 ,start_date_text TEXT ,end_date_text TEXT ,duration text ,CONSTRAINT tas_update_fail_pkey PRIMARY KEY (pk_id) );

INSERT INTO tas_update_fail(start_date_text, end_date_text) VALUES('1/1/1999', '12/31/1999');
INSERT INTO tas_update_fail(start_date_text, end_date_text) VALUES('1/1/1999', '2/3/1999');
INSERT INTO tas_update_fail(start_date_text, end_date_text) VALUES('12/1/1999', '12/31/1999');
INSERT INTO tas_update_fail(start_date_text, end_date_text) VALUES('2/5/2015', '12/31/2016');
INSERT INTO tas_update_fail(start_date_text, end_date_text) VALUES('1/1/1999', 'UNKNOWN');

-- Function: isdate(text)
-- DROP FUNCTION isdate(text);

CREATE OR REPLACE FUNCTION isdate(text) RETURNS integer AS $BODY$
begin
  if ($1 is null) then
    return 0;
  end if;
  perform $1::date;
  return 1;
exception when others then
  return 0;
end;
$BODY$ LANGUAGE plpgsql VOLATILE COST 100;

SELECT * ,isdate(start_date_text) ,isdate(end_date_text) FROM tas_update_fail;

CREATE INDEX es_idx_tas_update_fail ON tas_update_fail USING zombodb (zdb('tas_update_fail', ctid), zdb(tas_update_fail.*)) WITH (url='http://localhost:9200/', shards=2, replicas=1);

SELECT * FROM zdb_tally('tas_update_fail', 'end_date_text', '0', '^.*', '', 5000, 'term'::zdb_tally_order);

UPDATE tas_update_fail SET duration = CASE WHEN isdate(end_date_text) = 1 AND isdate(start_date_text) = 1 THEN (end_date_text::date - start_date_text::date)::text ELSE NULL END;

The UPDATE statement fails with the below error:

ERROR: Error updating xact data: {"took":10,"errors":true,"items":[{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-1","_version":2,"status":200}},{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-6","status":404,"error":"DocumentMissingException[[my_db.public.tas_update_fail.es_idx_tas_update_fail][-1] [xact][0-6]: document missing]"}},{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-2","_version":2,"status":200}},{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-7","status":404,"error":"DocumentMissingException[[my_db.public.tas_update_fail.es_idx_tas_update_fail][-1] [xact][0-7]: document missing]"}},{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-3","_version":2,"status":200}},{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-8","status":404,"error":"DocumentMissingException[[my_db.public.tas_update_fail.es_idx_tas_update_fail][-1] [xact][0-8]: document missing]"}},{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-4","_version":2,"status":200}},{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-9","status":404,"error":"DocumentMissingException[[my_db.public.tas_update_fail.es_idx_tas_update_fail][-1] [xact][0-9]: document missing]"}},{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-5","_version":2,"status":200}},{"update":{"_index":"my_db.public.tas_update_fail.es_idx_tas_update_fail","_type":"xact","_id":"0-10","status":404,"error":"DocumentMissingException[[my_db.public.tas_update_fail.es_idx_tas_update_fail][-1] [xact][0-10]: document missing]"}}]}

SQL state: XX000

This will now return zero results:

SELECT * FROM zdb_tally('tas_update_fail', 'end_date_text', '0', '^.*', '', 5000, 'term'::zdb_tally_order);

Running a REINDEX will get things back in shape again:
REINDEX TABLE tas_update_fail;

Ability to search multiple tables/indexes at the same time

ZomboDB needs the ability to concurrently search multiple tables and return the top 10 (sorted by score) matching documents.

I've already done most of the work for this in the develop branch as part of these commits (and PR #70):

3b3daa5
637d0e7
b3c7f6c
a738f88
ec64934

I've been developing this without a specific plan but instead trying to solve a customer problem as I go. b3c7f6c#diff-f56f2f6a78d9a93078b53082e28f5960 describes what the SQL API looks like, but basically it's a new function called zdb_multi_search(table_names regclass[], identifiers text[], query text) or zdb_multi_search(table_names regclass[], identifiers text[], query text[]).

So far this seems to be working out well, but I think it might also need to support aggregating terms across all the tables in order to provide summary information about all the matching documents. However, I'm not yet sure what this would look like.
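For reference, a usage sketch based only on the signature described above (the shape of the result set isn't described in this issue, so SELECT * is an assumption):

SELECT *
  FROM zdb_multi_search(ARRAY['public.books', 'public.so_posts']::regclass[],
                        ARRAY['books', 'posts'],
                        'beer or wine');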

Need syntax support for ES' "bool query"

(this comes via @j-weber8597)

ZomboDB needs query syntax support for Elasticsearch's "bool query", such that a query can describe what "must", "should", and "not" exist.

My thought for the syntax would be:

#bool(  
           #must(<full text query here>)  
           #should(<full text query here>)  
           #not(<full text query here>) 
)

And the <full text query here> bits can be any valid query formulation (including more #bool()s), but ZomboDB would ignore the operators between individual tokens. So if the user typed food and water into the #should() clause, it would be understood as two individual terms: "food" and "water".

This would allow one to write a query like:

#bool(   #must(beer, wine "cheese sticks")
             #should(food and bar)
             #not(happy-hour)
)

and it would be turned into an ES "bool query" similar to (pseudo-json):

bool: {
   must: { term:beer, term:wine, phrase:"cheese sticks" }
   should: { term:food, term:bar }
   not: { phrase:happy-hour }
}
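At the SQL level, the proposed syntax would presumably be used like any other full-text query; a sketch only, reusing the products table from other issues here:

SELECT *
  FROM products
 WHERE zdb('products', products.ctid) ==> '#bool( #must(beer, wine "cheese sticks")
                                                  #should(food and bar)
                                                  #not(happy-hour) )';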

Allow for custom, per-column analysis/tokenization/mapping

(this comes as a result of issue #31)

I've been thinking a lot about how to expose the ability to define custom analysis, tokenization, and even field-level ES mapping configurations per column. I think the right way to do this is through an expanded version of how phrase, phrase_array, and fulltext already work: through Postgres DOMAINs.

Consider the current phrase domain type. It's simply defined as CREATE DOMAIN phrase AS text; and ZomboDB hardcodes the analyzer (and the analyzer definition) that ES should use.

This might look kinda hacky, but what if it was defined as:

CREATE DOMAIN phrase AS text
   CHECK(mapping('{
                "analyzer": { "name": "phrase", "tokenizer": "standard", "filter": null, "char_filter": null },
                "mapping": { "type": "string", "analyzer": "phrase", "field_data": {"format":"paged_bytes"} } 
          }')
    );

The CHECK constraint here uses a new function called mapping() that would always return true so as to not interfere with other constraints you might actually need on the domain.

The function would take one argument of type json, and be of a format (yet-to-be-actually-determined) that would allow the specification of an analyzer, filters, and character filters, along with the ES field mapping information. If you study the ES documentation around mappings, you'll be able to see the power we could actually expose here -- ZomboDB would be able to support the full gamut of what ES does, per column.

Postgres doesn't really have any ability to set arbitrary column metadata. It does have COMMENT ON <column>, but many people use that for other things (such as column labels) and I believe comments are limited to 8k, which might not be enough bytes to express a complex field mapping configuration.

A DOMAIN would allow ZomboDB, at the time of CREATE INDEX (or REINDEX) to dynamically inspect the column type, walk the list of CHECK constraints, find the right one, parse out the JSON specification, and do the RightThing(tm).

Using a DOMAIN with a CHECK constraint, while kinda unusual, will actually expose enough information at the right places. The other benefit, perhaps, is that using domains makes specialized types (think 'email' or 'last_comma_first' or 'key_value_pair') available through Postgres' metadata system (ie various information_schema views). That means client interfaces like JDBC (the one I'm most familiar with) will actually see the column is of type 'email', which can provide a lot more context to the client about how that field needs to be handled.

DOMAINs can be altered allowing their constraints dropped and new constraints added. Doing so would be supported but would necessitate a REINDEX of all tables that use that domain. This is because Elasticsearch requires that mapping changes, especially those that set analyzers, be immutable.

Postgres won't let you drop a DOMAIN that is in use without also specifying CASCADE, which would then drop any columns that use the DOMAIN. This is a benefit in my mind as it would require extra effort, after an explicit error message, before you screwed up your data in an irrecoverable way.

I've done some quick prototyping of this approach and it's extremely straightforward. At the end of the day, it's probably just a few hundred lines of code.

I'd definitely appreciate thoughts or suggestions or other ideas.

zdb_determine_index(<view>) is bugged

Looks like zdb_determine_index(<view>) is bugged when the view joins a few tables and Postgres wants to project index scans for each table involved. Apparently it's unable to determine which index to use and simply returns NULL. It should actually be able to figure out which index to use, or raise an error if it can't.

Re-creation steps to follow...

Provide support for custom-defined "field lists"

ZomboDB needs the ability to define a "meta field" that is composed of a list of other fields (all of the same underlying Postgres data type).

This would allow a query to reference this "meta field" and have it automatically expanded into an OR'd set of individual criteria.

This is a query-time only feature and won't index the fields together.

The field lists will be defined per index using a new WITH(...) parameter.
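A purely hypothetical sketch of what that might look like (neither the parameter name nor the value format is specified in this issue):

-- hypothetical parameter name and value format
ALTER INDEX idxso_posts SET (field_lists='contact_info=[phone, email, fax]');

-- a query against the meta field, such as:
--   contact_info:"555-1212"
-- would then expand at query time to:
--   (phone:"555-1212" OR email:"555-1212" OR fax:"555-1212")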

Add shard_size parameter to zdb_tally()

There have been a couple of references to this, but I was wondering if you'd consider adding this option. I was experiencing 13-second zdb_tally() queries when trying to get the top 200 words out of 11M phrases. Changing the TermsBuilder in QueryRewriter.java to


            TermsBuilder tb = terms(agg.getFieldname())
                    .field(fieldname)
                    .size(agg.getMaxTerms())
                    .shardSize(1000) // this was 0
                    .order(stringToTermsOrder(agg.getSortOrder()));

gave me valid results in 1s.

What do you think?

Mark automatically created triggers as `tgisinternal`

Because Postgres doesn't properly escape string values for index WITH options, ZomboDB indexes cannot be restored from a pg_dump dump.

However, the triggers that ZomboDB creates during CREATE INDEX do restore, but are not actually usable because the underlying index fails to create.

This leads to headaches when manually re-creating the ZomboDB indexes because the old trigger definitions have to be dropped.

So... the automatically created triggers should be marked as tgisinternal = true so that pg_dump won't bother dumping them at all.

This should also be future-proof for whenever Postgres fixes their string WITH options quoting deficiency.
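At the catalog level, the flag in question is pg_trigger.tgisinternal; a rough sketch of the intended effect (the trigger-name pattern is hypothetical, and how ZomboDB would actually set the flag during CREATE INDEX is not specified here):

-- internal triggers are skipped by pg_dump
UPDATE pg_trigger
   SET tgisinternal = true
 WHERE tgname LIKE 'zdb_%'        -- hypothetical naming pattern
   AND NOT tgisinternal;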

max_terms ignored in zdb_tally()

I loaded in a single row (the 3rd one, i.e. 'Telephone') from the products table example, then ran a zdb_tally() query:


test=# SELECT * FROM zdb_tally('products', 'keywords', '^.*', '', 2, 'count');
         term          | count 
-----------------------+-------
 ALEXANDER GRAHAM BELL |     1
 COMMUNICATION         |     1
 PRIMITIVE             |     1
(3 rows)

I was expecting to get 2 rows back, not 3.

Certain wildcarded, complex terms contained within phrases can cause ES plugin to StackOverflow

contrib_regression=# select * from so_posts where zdb(so_posts) ==> 'body:"bob.dole*"';
ERROR:  rc=500; StackOverflowError[null]

And the stacktrace is:

java.lang.StackOverflowError
    at org.elasticsearch.common.compress.BufferRecycler.instance(BufferRecycler.java:40)
    at org.elasticsearch.common.compress.lzf.LZFCompressedStreamInput.<init>(LZFCompressedStreamInput.java:43)
    at org.elasticsearch.common.compress.lzf.LZFCompressor.streamInput(LZFCompressor.java:121)
    at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:114)
    at org.elasticsearch.cluster.metadata.MappingMetaData.sourceAsMap(MappingMetaData.java:423)
    at org.elasticsearch.cluster.metadata.MappingMetaData.getSourceAsMap(MappingMetaData.java:435)
    at com.tcdi.zombodb.query_parser.IndexMetadata.<init>(IndexMetadata.java:84)
    at com.tcdi.zombodb.query_parser.IndexMetadataManager.getMetadata(IndexMetadataManager.java:152)
    at com.tcdi.zombodb.query_parser.IndexMetadataManager.findField0(IndexMetadataManager.java:241)
    at com.tcdi.zombodb.query_parser.IndexMetadataManager.findField(IndexMetadataManager.java:225)
    at com.tcdi.zombodb.query_parser.IndexMetadataManager.getMetadataForField(IndexMetadataManager.java:143)
    at com.tcdi.zombodb.query_parser.QueryRewriter.needsConversionToPhrase(QueryRewriter.java:621)
    at com.tcdi.zombodb.query_parser.QueryRewriter.access$200(QueryRewriter.java:44)
    at com.tcdi.zombodb.query_parser.QueryRewriter$9.b(QueryRewriter.java:833)
    at com.tcdi.zombodb.query_parser.QueryRewriter.buildStandard0(QueryRewriter.java:1146)
    at com.tcdi.zombodb.query_parser.QueryRewriter.buildStandard(QueryRewriter.java:1139)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:830)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build0(QueryRewriter.java:466)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:445)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:426)
    at com.tcdi.zombodb.query_parser.QueryRewriter.buildSpanOrFilter(QueryRewriter.java:1081)
    at com.tcdi.zombodb.query_parser.QueryRewriter.access$300(QueryRewriter.java:44)
    at com.tcdi.zombodb.query_parser.QueryRewriter$9.b(QueryRewriter.java:846)
    at com.tcdi.zombodb.query_parser.QueryRewriter.buildStandard0(QueryRewriter.java:1146)
    at com.tcdi.zombodb.query_parser.QueryRewriter.buildStandard(QueryRewriter.java:1139)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:830)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build0(QueryRewriter.java:466)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:445)
    at com.tcdi.zombodb.query_parser.QueryRewriter.build(QueryRewriter.java:426)
    at com.tcdi.zombodb.query_parser.QueryRewriter.buildSpanOrFilter(QueryRewriter.java:1081)
    at com.tcdi.zombodb.query_parser.QueryRewriter.access$300(QueryRewriter.java:44)
        <pattern repeats for a long time>
