
pg_shard's Introduction

Deprecation Notice

The new open-source Citus extension for PostgreSQL natively integrates pg_shard and provides a superset of its functionality. We will not be making any further pg_shard releases, and the project is entering end-of-life status.

We have migration steps to help existing customers perform a drop-in replacement. Contact us for more information, or feel free to stop by our Slack channel to have questions answered in real time.


pg_shard


pg_shard is a sharding extension for PostgreSQL. It shards and replicates your PostgreSQL tables for horizontal scale and high availability. The extension also seamlessly distributes your SQL statements, without requiring any changes to your application. Join the mailing list to stay on top of the latest developments.

As a standalone extension, pg_shard addresses many NoSQL use cases. It also enables real-time analytics, and has an easy upgrade path to CitusDB for complex analytical workloads (distributed joins). Further, the extension provides access to standard SQL tools and powerful PostgreSQL features, such as a diverse set of index types and semi-structured data types.

This README serves as a quick start guide. We address architectural questions on sharding, shard rebalancing, failure handling, and distributed consistency mechanisms on our webpage. Also, we're actively working on improving pg_shard, and welcome any questions or feedback on our mailing list.

Building

pg_shard runs on Linux and OS X. The extension works with PostgreSQL 9.3.4+, PostgreSQL 9.4.0+, and CitusDB 3.2+. Building pg_shard requires GCC 4.6 or greater (or a similarly recent version of Clang).

Once you have PostgreSQL or CitusDB installed, you're ready to build pg_shard. For this, you will need the directory containing pg_config on your PATH when running make; this is typically your PostgreSQL installation's bin/ directory. For example:

# Path when PostgreSQL is compiled from source
PATH=/usr/local/pgsql/bin/:$PATH make
sudo PATH=/usr/local/pgsql/bin/:$PATH make install

# Path when CitusDB package is installed
PATH=/opt/citusdb/4.0/bin/:$PATH make
sudo PATH=/opt/citusdb/4.0/bin/:$PATH make install

pg_shard also includes regression tests. To verify your installation, start your PostgreSQL instance with the shared_preload_libraries setting mentioned below, and run make installcheck.

Note: If you'd like to build against CitusDB, please contact us at engage @ citusdata.com.

Upgrading from Previous Versions

To upgrade an existing installation, simply:

  1. Build and install the latest pg_shard release (see the Building section)
  2. Restart your PostgreSQL server
  3. Run ALTER EXTENSION pg_shard UPDATE; on the PostgreSQL server

Note that taking advantage of the new repair functionality requires that you also install pg_shard on all your worker nodes.
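
To confirm the update took effect, you can check the installed extension version from the standard PostgreSQL catalogs (a quick sanity check, not a required step):

-- Check which pg_shard version is currently installed on the master
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_shard';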

Setup

pg_shard uses a master node to store shard metadata. In the simple setup, this node also acts as the interface for all queries to the cluster. As a user, you can pick any one of your PostgreSQL nodes as the master, and the other nodes in the cluster will then be your workers.

An easy way to get started is by running your master and worker instances on the same machine. In that case, each instance will be one PostgreSQL database that runs on a different port. You can simply use localhost as the worker node's name in this setup.

Alternatively, you could start up one PostgreSQL database per machine; this is more applicable for production workloads. If you do this, you'll need to configure your PostgreSQL instances so that they can talk to each other. For that, you'll need to update the listen_addresses setting in your postgresql.conf file, and change access control settings in pg_hba.conf.

Whatever you decide, the master must be able to connect to the workers over TCP without any interactive authentication. In addition, a database using the same name as the master's database must already exist on all worker nodes.
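
For example, if the master's database is named app_db (a hypothetical name), you would create a database with that exact name on every worker before distributing any tables:

-- Run once on each worker node; the name must match the master's database
CREATE DATABASE app_db;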

Once you decide on your cluster setup, you will need to make two changes on the master node. First, you will need to add pg_shard to shared_preload_libraries in your postgresql.conf:

shared_preload_libraries = 'pg_shard'    # (change requires restart)

Second, the master node in pg_shard reads worker host information from a file called pg_worker_list.conf in the data directory. You need to add the hostname and port number of each worker node in your cluster to this file. For example, to add two worker nodes running on the default PostgreSQL port:

$ emacs -nw $PGDATA/pg_worker_list.conf

# hostname port-number
worker-101  5432
worker-102  5432

Then, you can save these settings and restart the master node.

Table Sharding

Now, let's log into the master node and create the extension:

CREATE EXTENSION pg_shard;

At this point you're ready to distribute a table. To let pg_shard know the structure of your table, define its schema as you would do with a normal table:

CREATE TABLE customer_reviews
(
    customer_id TEXT NOT NULL,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
);

This table will not be used to store any data on the master but serves as a prototype of what a customer_reviews table should look like on the worker nodes. After you're happy with your schema, and have created the desired indexes on your table, tell pg_shard to distribute the table:

-- Specify the table to distribute and the column to distribute it on
SELECT master_create_distributed_table(table_name := 'customer_reviews',
                                       partition_column := 'customer_id');

This function informs pg_shard that the table customer_reviews should be hash partitioned on the customer_id column. Now, let's create shards for this table on the worker nodes:

-- Specify the table name, total shard count and replication factor
SELECT master_create_worker_shards(table_name := 'customer_reviews',
                                   shard_count := 16,
                                   replication_factor := 2);

This function creates a total of 16 shards. Each shard owns a portion of a hash token space, and gets replicated on 2 worker nodes. The shard replicas created on the worker nodes have the same table schema, index, and constraint definitions as the table on the master node. Once all replicas are created, this function saves all distributed metadata on the master node.

Usage

Once you have created your shards, you can start issuing queries against the cluster. Currently, UPDATE and DELETE commands require the partition column in the WHERE clause.

INSERT INTO customer_reviews (customer_id, review_rating) VALUES ('HN802', 5);
INSERT INTO customer_reviews VALUES
  ('HN802', '2004-01-01', 1, 10, 4, 'B00007B5DN',
   'Tug of War', 133191, 'Music', 'Indie Music', 'Pop', '{}');
INSERT INTO customer_reviews (customer_id, review_rating) VALUES ('FA2K1', 10);

SELECT avg(review_rating) FROM customer_reviews WHERE customer_id = 'HN802';
SELECT count(*) FROM customer_reviews;

UPDATE customer_reviews SET review_votes = 10 WHERE customer_id = 'HN802';
DELETE FROM customer_reviews WHERE customer_id = 'FA2K1';

Loading Data from a File

A script named copy_to_distributed_table is provided to facilitate loading many rows of data from a file, similar to the functionality provided by PostgreSQL's COPY command. It will be installed into the scripts directory for your PostgreSQL installation (you can find this by running pg_config --bindir).

As an example, the invocation below would copy rows into the users table from a CSV-like file using pipe characters as a delimiter and the word NULL to signify a null value. The file contains a header line, which will be skipped.

copy_to_distributed_table -CH -d '|' -n NULL input.csv users

Call the script with the -h flag for more usage information.

Maximizing Throughput

Because INSERT commands must wait for a response before returning, loading rows through a single connection can never fully exploit the write capacity of your pg_shard cluster. To achieve the highest insert rates, it is necessary to load data in a concurrent fashion. The difference, even in a small cluster, will be significant.

For example, we could split an input file (input.csv) into chunks and load each chunk using a separate instance of copy_to_distributed_table (requires GNU split):

mkdir chunks
split -nl/64 input.csv chunks/
find chunks -type f | xargs -t -n1 -P64 -I% copy_to_distributed_table -C % users

Note that the above example loads the contents of input.csv using 64 processes. The optimal value will vary depending on factors such as cluster size and hardware.

This advice applies similarly to application design: if you have workers loading data into a pg_shard cluster, experiment to determine the number of workers that maximizes cluster utilization.

Repairing Shards

If for whatever reason a shard placement fails to be updated during a modification command, it will be marked as inactive. The master_copy_shard_placement function can be called to repair an inactive shard placement using data from a healthy placement. In order for this function to operate, pg_shard must be installed on all worker nodes and not just the master node. The shard will be protected from any concurrent modifications during the repair.

SELECT master_copy_shard_placement(shard_id := 12345,
                                   source_node_name := 'good_host',
                                   source_node_port := 5432,
                                   target_node_name := 'bad_host',
                                   target_node_port := 5432);

Usage with CitusDB

When installed within CitusDB, pg_shard will use the distribution metadata catalogs provided by CitusDB. No special syncing step is necessary: your pg_shard-distributed tables will be visible to CitusDB and vice versa. Just ensure the pg_shard.use_citusdb_select_logic config variable is turned on (the default when running within CitusDB) and you'll be good to go!
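
As a quick check, you can inspect or set the variable with standard SHOW/SET commands on the CitusDB master (a sketch; the setting is only meaningful when running inside CitusDB):

-- Inspect the current value of the setting
SHOW pg_shard.use_citusdb_select_logic;

-- Enable it for the current session if it has been turned off
SET pg_shard.use_citusdb_select_logic TO on;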

Look Under the Hood

When you distribute a table and create shards for it, pg_shard saves related metadata on the master node. You can probe into this metadata by logging into the master and running the following:

SELECT * FROM pgs_distribution_metadata.partition;
SELECT * FROM pgs_distribution_metadata.shard;
SELECT * FROM pgs_distribution_metadata.shard_placement;

The partition metadata table indicates to pg_shard which PostgreSQL tables are distributed and how. The shard metadata table then maps a distributed table to its logical shards, and associates each shard with a portion of a hash token space spanning roughly -2 billion to +2 billion. Lastly, the shard_placement table maintains each shard's location information, that is, the worker node name and port for that shard. As an example, if you're using a replication factor of 2, then each shard will have two shard placements.
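
For example, you can combine the shard and placement tables to see which workers hold each shard. The column names below are assumptions based on the tables' purposes; verify them with \d pgs_distribution_metadata.* on your installation:

-- List every shard with the workers that hold a replica of it
-- (column names assumed; check against your installation)
SELECT s.id AS shard_id, p.node_name, p.node_port
FROM pgs_distribution_metadata.shard s
JOIN pgs_distribution_metadata.shard_placement p ON p.shard_id = s.id
ORDER BY s.id, p.node_name;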

Each shard placement in pg_shard corresponds to one PostgreSQL table on a worker node. You can probe into these tables by connecting to any one of the workers, and running standard PostgreSQL commands:

psql -d postgres -h worker-101 -p 5432
postgres=# \d

Limitations

pg_shard is intentionally limited in scope during its first release, but is fully functional within that scope. We classify pg_shard's current limitations into two groups. In one group, we have features that we don't intend to support in the medium term due to architectural decisions we made:

  • Transactional semantics for queries that span across multiple shards — For example, you're a financial institution and you sharded your data based on customer_id. You'd now like to withdraw money from one customer's account and debit it to another one's account, in a single transaction block.
  • Unique constraints on columns other than the partition key, or foreign key constraints.
  • Distributed JOINs are also unsupported in pg_shard. If you'd like to run complex analytic queries, please consider upgrading to CitusDB.

Another group of limitations are shorter-term but we're calling them out here to be clear about unsupported features:

  • Table alterations are not supported: customers who do need table alterations accomplish them by using a script that propagates such changes to all worker nodes.
  • DROP TABLE does not have any special semantics when used on a distributed table. An upcoming release will add a shard cleanup command to aid in removing shard objects from worker nodes.
  • Queries such as INSERT INTO foo SELECT bar, baz FROM qux are not supported.

Besides these limitations, we have a list of features that we're looking to add. Instead of prioritizing this list ourselves, we decided to keep an open discussion on GitHub issues and hear what you have to say. So, if you have a favorite feature missing from pg_shard, please do get in touch!

License

Copyright © 2012–2015 Citus Data, Inc.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

See the LICENSE file for full details.

pg_shard's People

Contributors

adunstan, alvherre, anarazel, begriffs, bmomjian, gsstark, hlinnaka, itgacky, jasonmp85, jberkus, jconway, kgrittn, marcocitus, mhagander, nmisch, onderkalaci, petere, robertmhaas, samay-sharma, scrappy, simonat2ndquadrant, sumedhpathak, tglsfdc

pg_shard's Issues

Shards backed by foreign tables not marked as such

From @samay-sharma on November 21, 2014 15:12

Currently, the storage type is always set as ‘t’ in pgs_distribution_metadata.shard. We also have a SHARD_STORAGE_FOREIGN in distribution_metadata.h. I think we should use that for foreign tables.

Also, this does not seem to hamper anything. I tried with cstore_fdw and INSERT commands for foreign tables are routed to the correct worker shards. The worker node commands fail as INSERT is not yet implemented for cstore_fdw but pg_shard logic on the master seems to work correctly.

Plan bidirectional metadata sync

Citus and pg_shard should be able to sync distribution metadata back and forth without any user intervention. Determine what approaches could meet this need.

Write blog series

I think a series of blog posts to get readers' imaginations going could help pg_shard adoption greatly. I'm talking a real-life app with no big hacks (i.e. possibly using ActiveRecord or similar). It'd have a set of traditional tables and one big table that has NoSQL access patterns. We'd show how pg_shard can be used to transparently access the big table within an otherwise traditional application. Any caveats specific to the environment could be addressed.

I think three to five posts would be best, and the code itself should be in a GitHub repository for perusal by interested parties.

Optimize operator expression construction

MakeOpExpression is used to build:

  • Equality comparisons against a hash value which replace other equality constraints in SELECT, UPDATE, and DELETE queries
  • Synthetic clauses used as the hash value of the partition key of an INSERTed row
  • "Between" comparisons which represent the range of hash values contained within a given shard

@onderkalaci has extensively benchmarked performance characteristics of pg_shard and determined that avoiding this call to GetDefaultOpClass results in a huge performance boost, especially on multi-socket machines (probably due to lock contention).

We previously replaced all calls to GetDefaultOpClass with equivalent calls to lookup_type_cache, but the type cache does not contain the >= and <= operators needed for the "between" comparison in the above list. At the time we had already realized substantial improvements in other places, so we left this final call site, but now it is apparent we should remove it as well.

I had previously tried building an equivalent OR clause using the greater (or less) than operator and an equality clause (thus enabling the use of simple typecache lookups), but that has performance implications if the number of shards is large.

So the straightforward option here is to just cache the >= and <= operators ourselves (or cache the entire expression node returned by BuildBaseConstraint and clone it when needed). The need for equality operators can be directly fulfilled by lookup_type_cache responses and we can get rid of all remaining calls to GetDefaultOpClass (and stop doing duplicate work looking up operators for int32 during every query).

Audit error codes and messages

PostgreSQL has a pretty strict style guide for these things and someone needs to do a quick sweep to make sure we're compliant.

Worker commands omit target relation's schema

From @samay-sharma on November 21, 2014 15:23

Steps to reproduce:

# On master and worker nodes:
CREATE SCHEMA test_schema;
# CREATE SCHEMA

# On master node:
CREATE TABLE test_schema.customer_reviews
(
    customer_id INTEGER,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
);
# CREATE TABLE

SELECT master_create_distributed_table('test_schema.customer_reviews', 'customer_id');
#  master_create_distributed_table
# ---------------------------------
#  
# (1 row)

INSERT INTO test_schema.customer_reviews (customer_id, review_rating) VALUES (4687, 5);
# WARNING:  Bad result from ip-10-181-241-37.ec2.internal:5432
# DETAIL:  Remote message: relation "customer_reviews_86" does not exist
# WARNING:  Bad result from ip-10-102-169-241.ec2.internal:5432
# DETAIL:  Remote message: relation "customer_reviews_86" does not exist
# ERROR:  could not modify any active placements

Worker node log:

# LOG:  statement: INSERT INTO customer_reviews_86 (customer_id, review_rating) VALUES (4687, 5)
# ERROR:  relation "customer_reviews_86" does not exist at character 13
# STATEMENT:  INSERT INTO customer_reviews_86 (customer_id, review_rating) VALUES (4687, 5)

First-class COPY support

This ticket is to track full support for the COPY command. Unlike the trigger implementation in #61, this would mean supporting a bulk method for data ingestion. Issues like consistency and isolation will show up, as well as failure modes.
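
For context, the end goal is for an ordinary COPY invocation like the following (hypothetical file path, using the README's customer_reviews table) to work directly against a distributed table:

-- Desired behavior once first-class COPY support lands; not supported today
COPY customer_reviews FROM '/tmp/customer_reviews.csv' WITH (FORMAT csv, HEADER);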

Temporary tables used by multi-shard SELECTs not dropped

I was a bit startled to see this on the master:

ERROR:  could not extend file "base/16385/t3_17723": No space left on device
HINT:  Check free disk space.

When I checked, there were literally hundreds of temp tables on the master, using up 100% of disk space. I was doing a \watch 10 on an aggregate query across shards, and apparently the temp tables pg_shard needs to resolve this query do not get dropped automatically:

select count(*) from members;
# count
# ---------
#1929821
# (1 row)

\dt
#                     List of relations
# Schema   |            Name            | Type  |  Owner
# -----------+----------------------------+-------+---------
# pg_temp_2 | pg_shard_temp_table_3648_0 | table | psocial
# public    | members                    | table | psocial
# public    | members_local              | table | psocial
# (3 rows)

Temp tables used by pg_shard should be ON COMMIT DROP at the least; ideally they should be dropped as soon as the query is over. I'll see about documenting the master's use of temp tables somewhat.
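
For reference, this is the behavior being asked for, expressed as plain SQL (illustrative only; the real fix belongs in pg_shard's executor code):

-- A table created with ON COMMIT DROP disappears automatically when the
-- enclosing transaction commits, instead of lingering on the master
BEGIN;
CREATE TEMPORARY TABLE pg_shard_temp_example (count bigint) ON COMMIT DROP;
COMMIT;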

Create databases/schemas during shard creation

Currently the user has to manually create databases on each worker to match the master. Since we require that the database names match anyway, and there are no real options here, we should just have master_create_worker_shards create the databases on workers if they don't already exist. Probably we should do this by connecting to the "postgres" database first, which would then require some extra error handling.

Prohibit non-IMMUTABLE functions in multi-shard SELECT queries

Though all (non-side-effecting) functions are safe to execute in a single-shard SELECT query, the same is not true for multi-shard SELECTs. In particular, a function such as random() or now() would return a different value for each shard and so the query results would not represent a consistent view of the world.
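
As an illustration (using the README's customer_reviews table), a query like the following would have now() evaluated separately on every shard, so each shard could filter against a slightly different timestamp:

-- Each worker evaluates now() at its own clock when the deparsed query is
-- shipped to it, so the shards may not agree on the cutoff value
SELECT count(*)
FROM customer_reviews
WHERE review_date > now() - interval '30 days';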

This may overlap somewhat with #47, but that issue is more concerned with single-shard ramifications. We've noticed this bug affects multi-shard queries as well, so they'll need similar treatment.

Avoid copy-pasting ruleutils files

Because ruleutils.c's get_shard_query_def does not have external linkage, pg_shard copy-pastes ruleutils.c from the appropriate PostgreSQL release—with minimal modifications—to enable deparsing a given query back into one with shard identifiers.

These modifications are not likely to be accepted upstream as-is—they reference a shardId field we've added to deparse_context—but PostgreSQL may be open to exposing get_shard_query_def itself for external use. We'd need to:

  • Devise a method that permits us to use get_shard_query_def without further modifications
  • Produce a patch for PostgreSQL to expose get_shard_query_def

Expose remote error fields to client

When I attempt an INSERT that violates a primary key, the message seen by the client is a generic WARNING about a bad result from the worker node rather than the more meaningful message originally produced by the remote error:

INSERT INTO members VALUES (1, 2, 3);

INSERT INTO members VALUES (1, 2, 3);

# WARNING:  Bad result from shard1.demo:5432
# DETAIL:  Remote message: duplicate key value violates unique constraint "members_id_key_10007"
# WARNING:  Bad result from shard8.demo:5432
# DETAIL:  Remote message: duplicate key value violates unique constraint "members_id_key_10007"
# ERROR:  could not modify any active placements

This is a problem because:

  • many languages don't capture the detail, just the error message
  • the error message is cryptic in the extreme

Queries with HAVING can return incorrect results

Assume a simple table distributed along an id column:

CREATE TABLE employees (
    id integer,
    name text
);

SELECT master_create_distributed_table('employees', 'id');
SELECT master_create_worker_shards('employees', 2, 1);

Using HAVING clauses in SELECT queries works correctly so long as columns referenced in the HAVING clause are in a WHERE, ORDER BY, or GROUP BY clause:

SELECT substr(name, 1, 1) AS init, 
       sum(id)            AS id_total 
FROM   employees 
GROUP  BY init 
HAVING SUM(id) > 200000;
# ┌──────┬──────────┐
# │ init │ id_total │
# ├──────┼──────────┤
# │ H    │   319464 │
# │ M    │  1253983 │
# │ J    │   844768 │
# │ V    │   280801 │
# │ R    │   610906 │
# │ L    │  1166332 │
# │ D    │   829092 │
# │ T    │   700698 │
# │ I    │   201319 │
# │ F    │   284724 │
# │ N    │   363474 │
# │ G    │   401067 │
# │ E    │   697631 │
# │ B    │   526513 │
# │ C    │  1148864 │
# │ S    │  1010327 │
# │ P    │   244061 │
# │ K    │   748454 │
# │ A    │  1079310 │
# └──────┴──────────┘
# (19 rows)

But if columns in the HAVING clause do not appear elsewhere, the query returns incorrect results:

SELECT substr(name, 1, 1) AS init 
FROM   employees 
GROUP  BY init 
HAVING SUM(id) > 200000; 
# ┌──────┐
# │ init │
# ├──────┤
# └──────┘
# (0 rows)

This is probably related to #33, since it appears pg_shard currently only considers quals and projections in deciding which columns to retrieve from the remote. The HAVING clause is evaluated locally, so if its columns are not retrieved it cannot be evaluated correctly.

We should either reject the use of HAVING or make sure it has the columns it needs.

VOLATILE functions accepted in non-partition key qual. clauses

This is pretty problematic. pg_shard only checks for constant expressions in clauses involving the partition key, meaning anything goes in other clauses. Take this table:

CREATE TABLE limit_orders (
  id bigint PRIMARY KEY,
  symbol text NOT NULL,
  bidder_id bigint NOT NULL,
  placed_at timestamp NOT NULL,
  kind order_side NOT NULL,
  limit_price decimal NOT NULL DEFAULT 0.00 CHECK (limit_price >= 0.00)
);

SELECT master_create_distributed_table('limit_orders', 'id');

Trying to INSERT into this table using non-constant expressions will fail:

INSERT INTO limit_orders VALUES (2036, 'GOOG', 5634, now(), 'buy', random());
# ERROR:  cannot plan sharded modification containing values which are not constants or constant expressions

But pg_shard happily accepts a random() appearing in a DELETE command's clauses:

DELETE FROM limit_orders WHERE id = 246 AND bidder_id = (random() * 1000);

I believe the root cause here is the fact that UPDATE and DELETE share code with SELECT. It's probably OK that SELECT accepts random() in its WHERE clause: for SELECT queries, we need only execute on a single replica, so the fact that calls to random() on two replicas will return different values is irrelevant.

But for DELETE and UPDATE this is an issue: replicas will diverge. We should either always require constant expressions or work towards evaluating function calls on the master before executing remote portions of the plan.

Clarify lack of DROP TABLE support

Though it seems there is code to handle this, testing proves otherwise:

postgres=# CREATE TABLE nums(a int);
CREATE TABLE

postgres=# \d
       List of relations
 Schema | Name | Type  | Owner 
--------+------+-------+-------
 public | nums | table | amit

postgres=# SELECT master_create_distributed_table('nums', 'a');
 master_create_distributed_table 
---------------------------------

(1 row)

postgres=# SELECT master_create_worker_shards('nums', 2, 1);
 master_create_worker_shards 
-----------------------------

(1 row)

postgres=# \d
          List of relations
 Schema |    Name    | Type  | Owner 
--------+------------+-------+-------
 public | nums       | table | amit
 public | nums_10000 | table | amit

postgres=# DROP TABLE nums;
DROP TABLE

postgres=# \d
          List of relations
 Schema |    Name    | Type  | Owner 
--------+------------+-------+-------
 public | nums_10000 | table | amit
(1 row)

Did I miss a step or something?

Misleading errors in queries with subselects

From @samay-sharma on November 21, 2014 14:56

I am reporting this mainly because of the query it sends to the worker node. I don't think this is a high priority but we should show a clearer error message on the master.

Steps to reproduce:

SELECT count(*) FROM lineitem WHERE l_orderkey = (SELECT min(l_orderkey) FROM lineitem);
# WARNING:  Bad result from ip-10-102-169-241.ec2.internal:5432
# DETAIL:  Remote message: there is no parameter $0
# WARNING:  Bad result from ip-10-181-241-37.ec2.internal:5432
# DETAIL:  Remote message: there is no parameter $0
# ERROR:  could not receive query results

Worker node log:

# LOG:  statement: SELECT l_orderkey FROM ONLY lineitem_32 WHERE (l_orderkey = $0)
# ERROR:  there is no parameter $0 at character 61
# STATEMENT:  SELECT l_orderkey FROM ONLY lineitem_32 WHERE (l_orderkey = $0)

The error message while having such a WHERE condition in an UPDATE statement is also misleading.

# Showing min value
SELECT min(l_orderkey) FROM lineitem;
#  min 
# -----
#   1
# (1 row)

# This works as expected
UPDATE lineitem SET l_linenumber = 2 WHERE l_orderkey = 1;
# UPDATE 1

# This statement should touch only one shard, so the error message is a bit misleading.
UPDATE lineitem SET l_linenumber = 2 WHERE l_orderkey = (SELECT min(l_orderkey) FROM lineitem);
# ERROR:  cannot modify multiple shards during a single query

CLUSTER command cannot be extended

ExtendDDLCommand is capable of extending index and table names in an ALTER TABLE table_name CLUSTER ON index_name statement, but when the equivalent CLUSTER table_name USING index_name is provided, DDL extension fails.

Within pg_shard, everything provided to ExtendDDLCommand comes from TableDDLCommandList, which may never produce a CLUSTER statement (it appears to favor ALTER TABLE), but as extend_ddl_commands.c appears to contain some support for this feature, we should investigate to either remove the support or flesh it out.

I first noticed this behavior during testing. I thought I removed my test before merging, but apparently not. I expected those lines to produce a CLUSTER statement with shard IDs in the table and index names, but instead we get the error about node 714 being unsupported.

Add more unit tests

Implementing a true multi-node regression test will require more than pg_regress can provide, but we can get most of the way there with carefully isolated unit tests.

Support STABLE functions in INSERT statements

We took a conservative approach in function evaluation for INSERT: we evaluate immutable expressions (such as 2 + 2 or char_length('postgres')) during planning but reject any other expressions.

This means certain idioms (like using now() to timestamp a row) are unavailable when using pg_shard. The next logical step would be to coerce STABLE functions into constants during INSERT: evaluating them on the master would ensure all replicas receive an identical value.
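
As a concrete sketch (using the README's customer_reviews table), this is the kind of statement that would start working: now() would be evaluated once on the master and the resulting constant shipped to the placements.

-- Currently rejected because now() is STABLE rather than IMMUTABLE; with
-- master-side evaluation every replica would receive the same value
INSERT INTO customer_reviews (customer_id, review_date)
VALUES ('HN802', now()::date);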

Misleading error for incompletely created tables

From @samay-sharma on November 19, 2014 15:38

I get this error when I run a select query after running master_create_distributed_table() but before running master_create_worker_shards(). The same query works after running master_create_worker_shards().

Steps to reproduce:

CREATE TABLE customer_reviews
(
    customer_id INTEGER,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
);
# CREATE TABLE

SELECT master_create_distributed_table('customer_reviews', 'customer_id');
#  master_create_distributed_table
# ---------------------------------
#  
# (1 row)

select count(*) from customer_reviews ;
# ERROR:  cannot execute select over multiple shards

I had expected to get 0 as the result. Even if we error out, the error message does not seem very apt.

Multi-shard SELECT queries pull unnecessary columns

pg_shard is pulling all data from any column named in the WHERE clause across the wire, even if that column is not in the target list. If those columns are large, this makes queries which aren't using the partition key very slow due to data transfer time, and results in huge temporary tables on the master.

select count(*) from members where profile @> '{ "groups" : [{ "name" : "SFPUG" } ] }';
#  count
# --------
#  183891
# (1 row)
#
# Time: 20229.489 ms

select id from members where profile @> '{ "groups" : [ {"name" : "SFPUG" } ] }';
# Time: 20315.859 ms

# select count(*) from members_10008 where profile @> '{"groups" : [ { "name" : "SFPUG" } ] }';
#  count
# -------
#   5689
# (1 row)
#
# Time: 30.092 ms

Even if we did all 32 shards serially, that's still only 950ms. So the 20s response time suggests that pg_shard is pulling all of the data from all shards even though we're not looking at the tuples. In fact:

select count(*) from members;
#   count
# ---------
#  7879817
# (1 row)
#
# Time: 9181.476 ms

But this appears to be because qual pushdown is working strangely:

select id from members where email like 'j%' limit 1000;
# Time: 8607.005 ms

select id from members where email like 'jst%' limit 1000;
# Time: 2651.042 ms

So I did some digging, and it turns out that the problem is that pg_shard returns too much data, regardless of what's in the SELECT list. When the records are JSON blobs or very wide rows, this is a major issue:

select sum(1) from members where profile @> '{ "groups" : [ { "name" : "SFPUG" } ] }';
#   sum
# --------
#  183891
# (1 row)

Notice that there are no columns in the projection list. What should get returned to the master is just a column of 183,000 integers. But:

select pg_size_pretty(pg_total_relation_size('pg_shard_temp_table_7220_1'));
#  pg_size_pretty
# ----------------
#  276 MB
# (1 row)

select * from pg_shard_temp_table_7220_1 limit 1;
# id      |
# email   |
# profile | {
#         |     "address": null,
#         |     "country": "Indonesia",
#         |     "created": "2012-03-11T20:00:32Z",
#         |     "email": "[email protected]",
#         |     "first_name": "Amanda",
#         |     (omitted for brevity)
#         |     "quote": null
#         | }

select sum(1) from members where email like 'js%';
# -[ RECORD 1 ]
# sum | 76993

select * from pg_shard_temp_table_7220_2 limit 5;
# -[ RECORD 1 ]-------------------
# id      |
# email   | [email protected]
# profile |
# -[ RECORD 2 ]-------------------
# id      |
# email   | [email protected]
# profile |

… if any column is used in the WHERE clause, pg_shard pulls that column across the wire even if it's not in the projection list at all.

Queries with CTEs produce misleading error

This is a pretty simple bug: queries with a CTE generate an error message about JOINs (Joins are currently unsupported) rather than one about CTEs. Check out this test output to see what I mean.

I'd expect the error message to say something about CTEs being unsupported.

Support in-tree builds

Since the PostgreSQL codebase has such a large library of functions (handling everything from file access to memory allocation) many static analyzers would require substantial "model files" to be able to meaningfully analyze pg_shard and other PostgreSQL plugins.

I've discovered that by building extensions "in-tree" the analyzer can do its thing without any intervention from us: since it has all the source it can reason fully about most paths.

My plan is to create 9.3- and 9.4-based branches in our PostgreSQL fork and add pg_shard and cstore_fdw to the contrib directory as git submodules. By performing a giant build of PostgreSQL and our extensions, we can help analyzers (such as Coverity Scan) do their best.

Run an outreach campaign

Identify individuals or groups for whom we believe pg_shard could be a win and reach out to find whether they are interested in collaboration.

Warn when distributing unenforceable constraints

Certain constraints are unenforceable in a distributed setting: UNIQUE indexes foremost among them. pg_shard will happily propagate a UNIQUE index with nary a peep.

The extension should either warn when propagating these constraints or refuse to distribute tables altogether which contain such constraints.
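
A short illustration with a hypothetical table: the UNIQUE constraint below can only be checked within each individual shard, yet pg_shard propagates it without comment.

-- "email" uniqueness is enforced per shard, not across the distributed table,
-- because rows with the same email can hash to different shards
CREATE TABLE accounts (
    account_id bigint,
    email      text UNIQUE
);
SELECT master_create_distributed_table(table_name := 'accounts',
                                       partition_column := 'account_id');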

Unable to restore from backup made with pg_dump

Though pg_shard correctly marks its metadata tables so they are included in the output of pg_dump, these dumps are unusable because the relation_id column of pg_shard's metadata tables is of type oid. Because there is no guarantee a given relation will have identical oids, the oids in the metadata tables will not match any relations, rendering the dump unusable.

I found an old thread that insinuates pg_dump will produce a textual representation of tables that works across a restore boundary if the column is of type regclass instead of oid. As far as lower-level code is concerned, these two types are identical (so we will not need to change any implementation). By switching to regclass we can give users the ability to rebuild a master in an identical fashion to restoring a normal PostgreSQL instance from backup.
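
The difference is easy to see from psql: regclass values print (and dump) as relation names, while oids are just numbers that will not line up after a restore.

-- The oid is installation-specific, but the regclass form survives dump/restore
SELECT 'customer_reviews'::regclass::oid AS as_oid,
       'customer_reviews'::regclass      AS as_regclass;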

Detect unsupported partition key types during dist. table creation

If a type lacks a hash function, we should not allow tables to be distributed using it as the type of their partition key. Right now, doing so results in the cryptic error "cache lookup failed for function 0" during INSERT commands.

master_create_distributed_table should detect these types, refuse to create a distributed table, and print a helpful message.
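
As one hypothetical example: composite types have no default hash operator class in the PostgreSQL releases pg_shard targets, so a table like the following should be rejected up front with a clear message.

-- Distributing on a composite-typed column should fail cleanly at creation
-- time rather than with "cache lookup failed for function 0" at INSERT time
CREATE TYPE coordinates AS (lat double precision, lon double precision);
CREATE TABLE sensor_readings (coords coordinates, reading double precision);
SELECT master_create_distributed_table(table_name := 'sensor_readings',
                                       partition_column := 'coords');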

Setting use_citusdb_select_logic crashes without Citus

From @samay-sharma on November 24, 2014 15:11

While using pg_shard with postgres, if I run SET pg_shard.use_citusdb_select_logic TO true and then run a select query, pg_shard crashes.

I think it is possible that a user assumes they can use the CitusDB select logic just by turning this config on (without installing CitusDB). I think we have an assert to ensure that this situation does not happen, but should we just replace that assert with an error which explicitly says "you need to install CitusDB in order to use this feature" in order to handle such cases?

Prevent deletion of metadata when still in use

Running DROP EXTENSION will remove pg_shard's metadata tables, including information about the placement and health of shards. This behavior should only be possible with the CASCADE modifier: otherwise the extension should refuse to be dropped until all distributed tables have been removed.

Perhaps we can look into the dependency tracking available in PostgreSQL's catalog tables to satisfy this need.
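
A sketch of the desired behavior in SQL terms:

-- Should fail while distributed tables (and their metadata) still exist
DROP EXTENSION pg_shard;

-- Should remain available as an explicit opt-in to remove everything
DROP EXTENSION pg_shard CASCADE;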

Add package to Linux package managers

Though #67 will take care of serious PostgreSQL users, many will prefer to install pg_shard using OS package management tools (apt, etc.). Many PostgreSQL add-ons are included in Ubuntu's built-in repositories (Slony-I shows up as postgresql-9.3-slony1-2, for instance). If we could get pg_shard into some of these repositories it could make building cloud instances easier (with CloudInit scripts or similar).

At any rate, users might appreciate an install that doesn't require a compilation step (even if it is hidden within PGXN client or pex).

Creation should error when provided with bad table name

From @samay-sharma on November 20, 2014 15:52

Currently, master_create_distributed_table() works successfully if I pass an index name as the table name and the column on which the index is as the partition key. It also puts an entry into the partition metadata table. master_create_worker_shards() errors out saying that the relation is not a table when an index name is passed to it, which is how I feel master_create_distributed_table() should behave as well.

Again not something we'd expect users to do, but IMO it's better if the UDF guards itself against invalid arguments, especially since it is directly exposed to the end users.
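
A hypothetical reproduction of the asymmetry described above:

-- Pass an index name instead of a table name
CREATE TABLE items (item_id integer);
CREATE INDEX items_item_id_idx ON items (item_id);

-- Reported behavior: succeeds and records partition metadata for an index
SELECT master_create_distributed_table('items_item_id_idx', 'item_id');

-- Correctly errors out because the relation is not a table
SELECT master_create_worker_shards('items_item_id_idx', 16, 2);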

Fail fast for certain DDL commands

#1 and #6 discuss issues around dropping shards (or attempting to use DROP TABLE), but there are larger questions around other DDL commands.

  • Discuss whether DDL commands against distributed tables should be allowed at all
  • List any advanced use cases for permitting certain commands
  • Determine feasibility of intercepting certain commands
  • Choose mechanism of interception (utility hook or event trigger)
  • Add code to warn or error when users issue DDL commands against distributed tables

Removing existing shards should be easier

We need a function for dropping all shards of a particular table. Otherwise the process of dismantling a sharded table in order to recreate it can be quite complex.

Improve error handling during failed modification

pg_shard's modification logic assumes that any total failure is due to something transient that a retry might overcome. In many cases, an INSERT or UPDATE could fail due to a constraint check, which is not something that a simple retry will overcome without something else changing.

See #31 for an example of what I mean. In its example, the client sees:

# WARNING:  Bad result from shard1.demo:5432
# DETAIL:  Remote message: duplicate key value violates unique constraint "members_id_key_10007"
# WARNING:  Bad result from shard8.demo:5432
# DETAIL:  Remote message: duplicate key value violates unique constraint "members_id_key_10007"
# ERROR:  could not modify any active placements

A well-written application might want to handle the uniqueness violation in a special fashion, but all pg_shard gives it is a generic error about not being able to modify any placements.

We probably want to try a modification on a placement, then:

  • If the error is in the class of things we think a user cares about (constraints, etc.), we fail-fast and throw them the error
  • If the error is network related or otherwise "transient", we continue with the remaining shards. If any modification completes, we mark the transient-failure shard as bad

At a higher level, we need to handle modification outcomes in a ternary fashion:

  • Total Success — the modification completed successfully
  • Application Failure — the modification returned successfully, but the remote DB raised an error
  • Infrastructure Failure — the modification didn't even complete, or did so with a network error

Only the third case is deserving of a "could not modify placement" error. In the second we can fail-fast and tell the user what happened.

Citus treats pg_shard tables as local when missing metadata

From @sumedhpathak on November 20, 2014 0:27

If someone doesn't copy over the metadata from pg_shard to CitusDB, but sets use_citusdb_select_logic to true, in CitusDB we treat all pg_shard tables as local tables and read the empty table on the master node, returning 0 results.

This is minor, but can be a usability issue with CitusDB/pg_shard integration.

Generalize COPY-to-INSERT trigger

I've written a COPY trigger used by some of our customers which turns a COPY command into many INSERT commands using a temporary table with the same schema as a sharded table.

Unfortunately it's hardcoded to a specific schema. The first step towards users being able to COPY to a sharded table is to generalize this script for any table.

Consider using binary protocol

We have the option of using the Binary protocol between the master and the worker nodes using PQexecParams. We currently use PQexec, which defaults to TEXT. We thus pay the cost of converting the results from text back to the internal Postgres format, which has performance implications.

We should develop a prototype to test performance of the binary protocol and switch to it if that is worthwhile.

Permit use by non-superusers

Currently pg_shard requires superuser access for most activities, and all shards are created as owned by the superuser with superuser-only permissions. In order to make pg_shard work with DBAAS, we'll need to decouple superuser access requirements. It should require superuser to create the extension, but superuser access should not be required to create the shards, provided that the user has create table access on each shard database.
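
A sketch of the intended privilege model on a worker database (hypothetical role and database names):

-- A regular role with CREATE rights on the worker database; superuser would
-- still be needed for CREATE EXTENSION, but not for creating or using shards
CREATE ROLE app_owner LOGIN;
GRANT CREATE ON DATABASE app_db TO app_owner;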

PRIMARY KEY modifier ignored during DDL extension

Let's say I have a DDL command similar to the following:

CREATE TABLE employees
(
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    id BIGINT PRIMARY KEY,
    salary NUMERIC DEFAULT 0.00,
    start_date timestamp without TIME zone,
    resume text,
    CONSTRAINT sal_check CHECK (salary >= 0.00)
);

When passed to ExtendDDLCommand with a shard ID of 12345, we get:

CREATE TABLE employees
(
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    id BIGINT,
    salary NUMERIC DEFAULT 0.00,
    start_date timestamp without TIME zone,
    resume text,
    CONSTRAINT sal_check CHECK (salary >= 0.00)
);

While propagating the UNIQUE property of a PRIMARY KEY might be questionable in a sharded environment, the NOT NULL property isn't as ambiguous: we should definitely be propagating it.

Note that—as in #42—this is a theoretical problem. ExtendDDLCommand only technically needs to support the subset of DDL commands produced by TableDDLCommandList. I so far cannot get the latter to produce a CREATE TABLE statement with a PRIMARY KEY clause (it usually turns those into a CREATE TABLE statement followed by an ALTER TABLE statement), but if this ever changes we'll have a problem.

Support having multiple head nodes

The missing feature for pg_shard as an OLTP big-data tool is the ability to have multiple head nodes so that you can support thousands of connections. This shouldn't be too hard to solve, and may actually be solvable with existing PostgreSQL 9.4 tools.

  1. replicate all of the metadata tables only: these don't get updated that often, so BDR ought to be adequate for it.
  2. streaming replication option. This one is hampered by (a) needing to figure out which node is the master, and (b) pg_shard's use of temporary tables.
  3. FDW option: have metadata be an FDW back to the master node. Likely to add latency and complexity. Also fails to provide us with easy failover from the master node.

Support sequences

pg_shard needs to support sequences and SERIAL/BIGSERIAL. The simplest way to do these for now is to have the sequence live on the master. Later we'll need to do something more complicated, but 9.5 will have machinery for that, I think.
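
A sketch of what users would like to be able to write once sequences are supported (not possible today):

-- In the simple design, the sequence backing event_id would live on the master
CREATE TABLE events (
    event_id bigserial,
    payload  text
);
SELECT master_create_distributed_table(table_name := 'events',
                                       partition_column := 'event_id');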
