pgslice's Introduction

pgslice

Postgres partitioning as easy as pie. Works great for both new and existing tables, with zero downtime and minimal app changes. No need to install anything on your database server. Archive older data on a rolling basis to keep your database size under control.

🍊 Battle-tested at Instacart

Install

pgslice is a command line tool. To install, run:

gem install pgslice

This will give you the pgslice command. You can also install it with Homebrew or Docker. If installation fails, you may need to install dependencies.

Steps

  1. Ensure the table you want to partition has been created. We’ll refer to this as <table>.

  2. Specify your database credentials:

export PGSLICE_URL=postgres://localhost/myapp_development

  3. Create an intermediate table:

pgslice prep <table> <column> <period>

The column should be a timestamp, timestamptz, or date column, and the period can be day, month, or year.

This creates a partitioned table named <table>_intermediate using range partitioning.

  4. Add partitions to the intermediate table:

pgslice add_partitions <table> --intermediate --past 3 --future 3

Use the --past and --future options to control the number of partitions.

  5. Optional, for tables with data - fill the partitions in batches with data from the original table:

pgslice fill <table>

Use the --batch-size and --sleep options to control the speed (defaults to 10000 and 0 respectively).
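
For example, to use smaller batches with a short pause between them (values are illustrative):

pgslice fill <table> --batch-size 5000 --sleep 0.1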

To sync data across different databases, check out pgsync.

  6. Analyze tables:

pgslice analyze <table>

  7. Swap the intermediate table with the original table:

pgslice swap <table>

The original table is renamed <table>_retired and the intermediate table is renamed <table>.

  8. Fill the rest (rows inserted between the first fill and the swap):

pgslice fill <table> --swapped

  9. Back up the retired table with a tool like pg_dump and drop it:

pg_dump -c -Fc -t <table>_retired $PGSLICE_URL > <table>_retired.dump
psql -c "DROP TABLE <table>_retired" $PGSLICE_URL

Sample Output

pgslice prints the SQL commands that were executed on the server. To print without executing, use the --dry-run option.

pgslice prep visits created_at month
BEGIN;

CREATE TABLE "public"."visits_intermediate" (LIKE "public"."visits" INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING STORAGE INCLUDING COMMENTS) PARTITION BY RANGE ("created_at");

CREATE INDEX ON "public"."visits_intermediate" USING btree ("created_at");

COMMENT ON TABLE "public"."visits_intermediate" is 'column:created_at,period:month,cast:date,version:3';

COMMIT;
pgslice add_partitions visits --intermediate --past 1 --future 1
BEGIN;

CREATE TABLE "public"."visits_202208" PARTITION OF "public"."visits_intermediate" FOR VALUES FROM ('2022-08-01') TO ('2022-09-01');

ALTER TABLE "public"."visits_202208" ADD PRIMARY KEY ("id");

CREATE TABLE "public"."visits_202209" PARTITION OF "public"."visits_intermediate" FOR VALUES FROM ('2022-09-01') TO ('2022-10-01');

ALTER TABLE "public"."visits_202209" ADD PRIMARY KEY ("id");

CREATE TABLE "public"."visits_202210" PARTITION OF "public"."visits_intermediate" FOR VALUES FROM ('2022-10-01') TO ('2022-11-01');

ALTER TABLE "public"."visits_202210" ADD PRIMARY KEY ("id");

COMMIT;
pgslice fill visits
/* 1 of 3 */
INSERT INTO "public"."visits_intermediate" ("id", "user_id", "ip", "created_at")
    SELECT "id", "user_id", "ip", "created_at" FROM "public"."visits"
    WHERE "id" > 0 AND "id" <= 10000 AND "created_at" >= '2022-08-01'::date AND "created_at" < '2022-11-01'::date

/* 2 of 3 */
INSERT INTO "public"."visits_intermediate" ("id", "user_id", "ip", "created_at")
    SELECT "id", "user_id", "ip", "created_at" FROM "public"."visits"
    WHERE "id" > 10000 AND "id" <= 20000 AND "created_at" >= '2022-08-01'::date AND "created_at" < '2022-11-01'::date

/* 3 of 3 */
INSERT INTO "public"."visits_intermediate" ("id", "user_id", "ip", "created_at")
    SELECT "id", "user_id", "ip", "created_at" FROM "public"."visits"
    WHERE "id" > 20000 AND "id" <= 30000 AND "created_at" >= '2022-08-01'::date AND "created_at" < '2022-11-01'::date
pgslice analyze visits
ANALYZE VERBOSE "public"."visits_202208";

ANALYZE VERBOSE "public"."visits_202209";

ANALYZE VERBOSE "public"."visits_202210";

ANALYZE VERBOSE "public"."visits_intermediate";
pgslice swap visits
BEGIN;

SET LOCAL lock_timeout = '5s';

ALTER TABLE "public"."visits" RENAME TO "visits_retired";

ALTER TABLE "public"."visits_intermediate" RENAME TO "visits";

ALTER SEQUENCE "public"."visits_id_seq" OWNED BY "public"."visits"."id";

COMMIT;

Adding Partitions

To add partitions, use:

pgslice add_partitions <table> --future 3

Add this as a cron job to create a new partition each day, month, or year.

# day
0 0 * * * pgslice add_partitions <table> --future 3 --url ...

# month
0 0 1 * * pgslice add_partitions <table> --future 3 --url ...

# year
0 0 1 1 * pgslice add_partitions <table> --future 3 --url ...

Add a monitor to ensure partitions are being created.

SELECT 1 FROM
    pg_catalog.pg_class c
INNER JOIN
    pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE
    c.relkind = 'r' AND
    n.nspname = 'public' AND
    c.relname = '<table>_' || to_char(NOW() + INTERVAL '3 days', 'YYYYMMDD')
    -- for months, use to_char(NOW() + INTERVAL '3 months', 'YYYYMM')
    -- for years, use to_char(NOW() + INTERVAL '3 years', 'YYYY')

Archiving Partitions

Back up and drop older partitions each day, month, or year.

pg_dump -c -Fc -t <table>_202209 $PGSLICE_URL > <table>_202209.dump
psql -c "DROP TABLE <table>_202209" $PGSLICE_URL

If you use Amazon S3 for backups, s3cmd is a nice tool.

s3cmd put <table>_202209.dump s3://<s3-bucket>/<table>_202209.dump

Schema Updates

Once a table is partitioned, make schema updates on the master table only (not partitions). This includes adding, removing, and modifying columns, as well as adding and removing indexes and foreign keys.
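
For example, on Postgres 11+ a column or index added to the parent propagates to all partitions; a sketch using the visits example (the referrer column is hypothetical):

ALTER TABLE "public"."visits" ADD COLUMN "referrer" text;

CREATE INDEX ON "public"."visits" ("user_id");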

Additional Commands

To undo prep (which will delete partitions), use:

pgslice unprep <table>

To undo swap, use:

pgslice unswap <table>

Additional Options

Set the tablespace when adding partitions:

pgslice add_partitions <table> --tablespace fastspace

App Considerations

This setup allows you to read and write using the original table name, with no knowledge that it's partitioned. However, there are a few things to be aware of.

Reads

When possible, queries should include the column you partition on to limit the number of partitions the database needs to check. For instance, if you partition on created_at, try to include it in queries:

SELECT * FROM
    visits
WHERE
    user_id = 123 AND
    -- for performance
    created_at >= '2022-09-01' AND created_at < '2022-09-02'

For this to be effective, ensure constraint_exclusion is set to partition (the default value) or on.

SHOW constraint_exclusion;
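
To verify partitions are being pruned, you can inspect the query plan; a sketch (partition names follow the example above):

EXPLAIN SELECT * FROM visits WHERE created_at >= '2022-09-01' AND created_at < '2022-09-02';
-- the plan should scan only visits_202209, not every partition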

Frameworks

Rails

Specify the primary key for partitioned models to ensure it’s returned.

class Visit < ApplicationRecord
  self.primary_key = "id"
end

Other Frameworks

Please submit a PR if additional configuration is needed.

One Off Tasks

You can also use pgslice to reduce the size of a table without partitioning by creating a new table, filling it with a subset of records, and swapping it in.

pgslice prep <table> --no-partition
pgslice fill <table> --where "id > 1000" # use any conditions
pgslice swap <table>

Triggers

Triggers aren’t copied from the original table. You can set up triggers on the intermediate table if needed. Note that Postgres doesn’t support BEFORE / FOR EACH ROW triggers on partitioned tables.

Data Protection

Always make sure your connection is secure when connecting to a database over a network you don’t fully trust. Your best option is to connect over SSH or a VPN. Another option is to use sslmode=verify-full. If you don’t do this, your database credentials can be compromised.
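
For example, a connection URL with full certificate verification (host and credentials hypothetical):

export PGSLICE_URL=postgres://user:password@dbhost:5432/myapp_production?sslmode=verify-full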

Additional Installation Methods

Homebrew

With Homebrew, you can use:

brew install ankane/brew/pgslice

Docker

Get the Docker image with:

docker pull ankane/pgslice
alias pgslice="docker run --rm -e PGSLICE_URL ankane/pgslice"

This will give you the pgslice command.

Dependencies

If installation fails, your system may be missing Ruby or libpq.

On Mac, run:

brew install libpq

On Ubuntu, run:

sudo apt-get install ruby-dev libpq-dev build-essential

Upgrading

Run:

gem install pgslice

To use master, run:

gem install specific_install
gem specific_install https://github.com/ankane/pgslice.git

Related Projects

Also check out:

  • Dexter - The automatic indexer for Postgres
  • PgHero - A performance dashboard for Postgres
  • pgsync - Sync Postgres data to your local machine

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

  • Report bugs
  • Fix bugs and submit pull requests
  • Write, clarify, or fix documentation
  • Suggest or add new features

To get started with development:

git clone https://github.com/ankane/pgslice.git
cd pgslice
bundle install
createdb pgslice_test
bundle exec rake test

To test against different versions of Postgres with Docker, use:

docker run -p=8000:5432 postgres:14
TZ=Etc/UTC PGSLICE_URL=postgres://postgres@localhost:8000/postgres bundle exec rake

On Mac, you must use Docker Desktop for the port mapping to localhost to work.

pgslice's Issues

support non-numeric id in pgslice fill

First of all, I love the project and have been using this for quite some time. For me, the most common use case is to partition a table with a numeric id. But this time I want to partition a table that has a varchar type id.

pgslice fill table_name --batch-size 20000
INSERT INTO "public"."table_name_intermediate" ("id",.......) SELECT "id", ....... WHERE "id" > 11674306 AND "id" <= 11694306 AND "start_time" >= ........................................
PG::UndefinedFunction: ERROR: operator does not exist: character varying > integer
LINE 4: WHERE "id_" > 11674306 AND "id_" <= 11694306 AND "start_...

Add automated build in hub.docker.com

It would be so much better to use an image from Docker Hub: docker run ankane/pgslice
instead of building it every time with git clone ... && docker build -t pgslice . && docker run pgslice.

Foreign keys are not preserved

Foreign keys (and any other constraints) that are present on the original table should be propagated to the intermediate table and partitions, the same way that indexes are propagated.
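
In the meantime, a foreign key can be recreated by hand on the intermediate table; a sketch using the README's visits example (constraint name and users table hypothetical; foreign keys on partitioned tables require Postgres 11+):

ALTER TABLE visits_intermediate ADD CONSTRAINT visits_user_id_fkey FOREIGN KEY (user_id) REFERENCES users (id);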

When the primary key is a composite key, the partition table has only one column as its primary key.

  1. Steps to reproduce:
    Parent table definition:
    request
    (
    message_id character varying(42) COLLATE pg_catalog."default" NOT NULL,
    created_date timestamp without time zone,
    state character varying(18) COLLATE pg_catalog."default",
    err_msg character varying(1000) COLLATE pg_catalog."default",
    err_code character varying(3) COLLATE pg_catalog."default",
    campaign character varying(64) COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT request_campaign_id PRIMARY KEY (message_id, campaign)
    )
    WITH (
    OIDS = FALSE
    )
    TABLESPACE pg_default;

  2. pgslice prep hotwire.request created_date month

  3. pgslice add_partitions hotwire.request --intermediate --past 60 --future 60

  4. pgslice fill public.request --batch-size 50000
Insert failed due to duplicate entry.

  5. Child table definition:
    CREATE TABLE hotwire.request_201304
    (
    message_id ,
    created_date ,
    state ,
    err_msg ,
    err_code ,
    campaign ,
    CONSTRAINT request_201304_pkey PRIMARY KEY (message_id), ->> this is the issue.
    CONSTRAINT request_201304_created_date_check CHECK (created_date >= '2013-04-01'::date AND created_date < '2013-05-01'::date)
    )
    INHERITS (hotwire.request_intermediate)
    WITH (
    OIDS = FALSE
    )
    TABLESPACE pg_default;

Thanks
Kamal

Table not found when table is present in the schema

Steps to reproduce:
1) Created the request table in the database BIDA.
2) Imported data into the table.
3) Set the connection URL as below:
set PGSLICE_URL=postgres://:@localhost/BIDA
4) Ran pgslice to create the prep:
c:\Program Files\PostgreSQL\10\bin>pgslice prep "hotwire"."request" create_date month --dry-run
Table not found: hotwire.request
c:\Program Files\PostgreSQL\10\bin>pgslice prep request create_date month
Table not found: request
5) BIDA=# select count(*) from hotwire.request;
  count
---------
 6710826
(1 row)

Support for timestamp and timestamptz column types?

Is there any interest in supporting alternative column types?

I love the project but have timestamptz fields I need to partition on. Right now I'm just wrapping the --dry-run output in a sed script, but supporting this wouldn't be a large change. I think it needs perhaps one option and a little additional logic:

  • --column-type (defaults: date; add timestamp, timestamptz)
  • Detect column type in sql_date and provide appropriate format and ::type

Please let me know your thoughts on reviewing/accepting a contribution with this functionality.

Thanks,

Mike

Week slice option

          Hey @pohodnya, thanks for the PR! The code looks super clean. From my understanding, Postgres can handle thousands of partitions, so I'm not sure an additional period is needed.

Originally posted by @ankane in #36 (comment)

Nevertheless, while PG is able to handle thousands of partitions, there may be issues if you have fewer CPU cores than partitions.
For example, there may be an issue if you need to request 2 weeks of data and the server has, let's say, only 4 CPU cores.
It is a request to 2 tables vs a request to 14 tables. I believe that one core can process only one table request at a time.
Considering that index processing time is not linear, I suppose that a request to 2 larger tables will be faster than a request to 14 small tables. Therefore the week option may be very helpful.

Just saying.

README typo

In the last step README says:

psql -c "DROP <table>_retired" $PGSLICE_URL

I believe that correct syntax would be DROP TABLE <table>_retired.

PS. Thank you for this amazing gem, you saved me a lot of time 🙌

Add support to keep both tables synced

It could be great to (optionally) keep both tables synced, like:

  • When adding/editing/deleting rows from the new table, then replicate the action in the old table
  • Ability to stop/cancel old table syncing (stop doing the above point)

This way, if something goes wrong after the new table has been the official one for some time, rolling back to the old table still gives us the updated data.

Partitioning behaviour

If we have already swapped the table for partitioning, I guess new writes will automatically go to only the child tables, but in my case, new writes go to both the parent and the child table.
Am I missing something?

Postgres 11 default partition and other improvements

With Postgres 11 on the horizon, I was wondering if there were plans to incorporate some of the improvements being made to partitions? One of the big ones that I'm looking forward to is the ability to set a default partition, which is great for cases like a huge spike in activity where you only want to start partitioning from that point in time. I think it would be great if there was a flag in the prep step to allow for specifying that I want a catchall, and then afterwards partition based on a certain time period.
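
For reference, a sketch of the Postgres 11 catchall syntax, using the README's visits example (rows outside every defined range would land here):

CREATE TABLE visits_default PARTITION OF visits DEFAULT;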

Another big change is how indexes and foreign keys will be inherited, which I believe will mean that the add_partitions step will not need to set them up.

No support for UUID primary keys

bundle exec pgslice fill events --batch-size 1000
bundler: failed to load command: pgslice (/Users/russ/.rbenv/versions/2.5.0/bin/pgslice)
PG::UndefinedFunction: ERROR:  function max(uuid) does not exist
LINE 1: SELECT MAX("id") FROM "public"."events"
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

  /Users/russ/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/pgslice-0.4.1/lib/pgslice.rb:481:in `exec_params'
  /Users/russ/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/pgslice-0.4.1/lib/pgslice.rb:481:in `execute'
  /Users/russ/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/pgslice-0.4.1/lib/pgslice.rb:568:in `max_id'
  /Users/russ/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/pgslice-0.4.1/lib/pgslice.rb:296:in `fill'
  /Users/russ/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/pgslice-0.4.1/lib/pgslice.rb:33:in `perform'
  /Users/russ/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/pgslice-0.4.1/exe/pgslice:5:in `<top (required)>'
  /Users/russ/.rbenv/versions/2.5.0/bin/pgslice:23:in `load'
  /Users/russ/.rbenv/versions/2.5.0/bin/pgslice:23:in `<top (required)>'

PG::Connection injection support

First of all, thank you for building such a helpful tool!

I'm curious if you know of a way that I could initialize a Pgslice instance with an already existing PG::Connection object. I feel like I'm missing something obvious but if not, would you have a preference on how this functionality is added? Assuming you're on board with the idea, of course.

For some context, I have a database in Amazon RDS that I'm going to maintain via a Sidekiq worker (cron is not a viable solution here) and I'd much rather let Pgslice access my PG::Connection from ActiveRecord instead of shelling out and back, it feels very hacky.

Consider ending support for PostgreSQL <= 10

Hi @ankane

From https://www.postgresql.org/support/versioning/, PostgreSQL version 10 is listed as no longer supported. As of this writing (February 2023), the current release of PostgreSQL is version 15, released in Fall 2022.

AWS RDS may also forcibly upgrade customers still running version 10 or earlier. AWS has encouraged customers to upgrade to versions newer than 10 on this page:
https://aws.amazon.com/blogs/database/upgrade-amazon-aurora-postgresql-and-amazon-rds-for-postgresql-version-10/

In order to simplify the pgslice code, since declarative partitioning was added in PostgreSQL 10, if pgslice sunsetted support for trigger-based partitioning for versions < PostgreSQL 10, then code in pgslice that supports those branches could be removed.

For example, the first conditional branch of this method checks for declarative partitioning, and then supports the older style of partitioning in the else portion of the branch.
https://github.com/ankane/pgslice/blob/master/lib/pgslice/cli/add_partitions.rb#L65

If this is something you're open to in a new release of pgslice, I'd be happy to take it on. Although it wouldn't add new functionality, it would simplify the code as a breaking change (possibly a new major version?).

I'm putting pgslice to use at work and spelunking in the code here and there. We only use declarative partitioning. Thanks.

Index names are not preserved

Ideally, the end state of the setup process should result in a parent table that's indistinguishable from the original table other than the presence of child tables and the INSERT trigger. But the way it's currently implemented, the parent table has a primary key and indexes with names based on the intermediate table name. For instance, with a table called search_results with an index on user_id, we end up with:

Indexes:
    "search_results_intermediate_pkey" PRIMARY KEY, btree (id)
    "search_results_intermediate_user_id_idx" btree (user_id)

This could be addressed by doing the following:

  1. Have pgslice prep rename the indexes (including the primary key index) on the original table to be based on the retired table name, e.g. search_results_retired_pkey, and create indexes on the intermediate table using the original names of the indexes.
  2. Have pgslice unprep reverse this process (after dropping the intermediate table, rename the indexes on the original table back to their original names).

The only tricky part here is that the unprep operation needs to be able to determine the mapping between index names in the two tables. It could do this based on the index definition, but that might be ambiguous (you could have multiple identical indexes with different names, for instance). Alternatively, we could assume that the order of indexes in pg_indexes is stable, but that might be an incorrect assumption. The worst case scenario, then, is that after rolling back with pgslice unprep, the original table has some indexes named differently than their original names.

Alternatively, we could introduce a new command, e.g. pgslice rename_indexes, to be run after step 8 (right before dropping the retired table), which attempts to rename the indexes on the parent table based on the index names in the retired table. This suffers from the same issue of potential ambiguity, but at least this way, we can inform the user about the ambiguity and let them resolve it on their own.
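
For illustration, the manual rename today would look something like this, using the index names from the example above:

ALTER INDEX search_results_intermediate_pkey RENAME TO search_results_pkey;
ALTER INDEX search_results_intermediate_user_id_idx RENAME TO search_results_user_id_idx;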

Let me know what you think -- happy to pitch in with a PR.

Also when #8 is addressed, we should make sure that foreign key names are also preserved.

Primary key getting added when it exists after getting created from the original table

We have a table:

=> \d activities
                                Partitioned table "public.activities"
     Column     |            Type             | Collation | Nullable |                Default                 
----------------+-----------------------------+-----------+----------+----------------------------------------
 id             | bigint                      |           | not null | nextval('activities_id_seq'::regclass)
 trackable_id   | integer                     |           |          | 
 trackable_type | character varying(255)      |           |          | 
 owner_id       | integer                     |           |          | 
 owner_type     | character varying(255)      |           |          | 
 key            | character varying(255)      |           |          | 
 parameters     | text                        |           |          | 
 recipient_id   | integer                     |           |          | 
 recipient_type | character varying(255)      |           |          | 
 created_at     | timestamp without time zone |           | not null | CURRENT_TIMESTAMP
 updated_at     | timestamp without time zone |           |          | CURRENT_TIMESTAMP
 account_id     | integer                     |           |          | 
Partition key: RANGE (created_at)
Indexes:
    "activities_offline_pkey" PRIMARY KEY, btree (id, created_at)

And are running pgslice add_partitions public.activities --past 0 --future 3. The output is as follows:

BEGIN;

CREATE TABLE "public"."activities_202311" PARTITION OF "public"."activities" FOR VALUES FROM ('2023-11-01') TO ('2023-12-01');

ALTER TABLE "public"."activities_202311" ADD PRIMARY KEY ("id", "created_at");

and it then fails with ERROR: multiple primary keys for table "activities_202311" are not allowed.

When we looked into it, the table creation succeeds, but the new partition retains the original table's composite primary key, so when the second command that adds the primary key is executed, we run into the aforementioned error.

RFC: warn on prep if table has 'BEFORE / FOR EACH' triggers

We just used pgslice to partition our tables (pg 11, on heroku) and are very happy.

I was surprised to notice that our triggers weren't firing. I checked and saw that three triggered functions which had been on the target table were 'orphaned'.

I was able to restore two of them as-is; both were 'AFTER INSERT, FOR EACH'.

The third one errored when I tried to apply it:

CREATE TRIGGER trigger_update_device_location_events_event_geometry
  BEFORE INSERT OR UPDATE
  ON public.device_location_events
  FOR EACH ROW
  WHEN (((NEW.event_latitude IS NOT NULL) AND (NEW.event_longitude IS NOT NULL)))
  EXECUTE PROCEDURE public.update_device_location_event_event_geometry();

with

"#<ActiveRecord::StatementInvalid: PG::WrongObjectType: ERROR:  \"device_location_events\" is a partitioned table\nDETAIL:  Partitioned tables cannot have BEFORE / FOR EACH ROW triggers.\n>"

The trigger function itself is basically event_geometry = public.ST_PointFromText('POINT(' || event_longitude::text || ' ' || event_latitude::text || ')', 4326)

We haven't yet figured out how to restore the trigger in a way that works for the partitioned tables, whether we want to add it to each partitioned table or if we should write it to work slightly differently. I'll update on that point once we do.

Fortunately, for internal reasons, the impact of this is low, but would be good to discover it before going to prod.

Missing Updates while filling?

I presented our pgslice based solution today and an astute listener correctly pointed out we might be missing updates to rows that were updated after they were filled, but before the fill was done. 😢

For our biggest table the duration of the "fill" was very long, 8 hours, so I was thinking I could check our updated_at timestamp column since the fill start time and try to identify the id of any rows that have been updated. To make that fast, possibly a partial index that only includes rows where updated_at is set, and possibly based on a certain time.

Then with those id values, I could do an UPDATE statement. I'm only seeing INSERT statements in pgslice. So this updates scenario is not something that's currently supported, correct?
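
A rough sketch of what I have in mind, reusing the README's visits example (the updated_at column and fill start time are hypothetical):

CREATE INDEX CONCURRENTLY ON visits (id) WHERE updated_at >= '2022-09-01 00:00';

UPDATE visits_intermediate i
SET user_id = v.user_id, ip = v.ip
FROM visits v
WHERE i.id = v.id AND v.updated_at >= '2022-09-01 00:00';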

A possible extension builds on a concept that already exists in pgslice: --swapped, which fills more inserts post-"swap" with missing changes from the retired table. That feels similar to catching missed updates. Maybe --swapped could include updates, or a separate option --with-updates could be added.

Thoughts?

Foreign keys referencing partitioned table not copied nor dealt with in `swap`

  • Postgres 12.12
  • Rails 5.2.8.1

Ran through the steps outlined in readme to partition by day, created 30 past and 5 future partitions. Ran db:structure:dump and then db:test:prepare to get schema loaded in test env, then ran specs and found a bunch of failures like this:

ActiveRecord::InvalidForeignKey:
  PG::ForeignKeyViolation: ERROR:  insert or update on table "user_events" violates foreign key constraint "fk_rails_377b6c2879"
  DETAIL:  Key (event_id)=(1) is not present in table "events_retired".
  : INSERT INTO "user_events" ("user_id", "event_id", "viewed_at", "created_at", "updated_at") VALUES ($1, $2, $3, $4, $5) RETURNING "id"

"OK" I figured, "I just need to drop the retired table." So I did that, but then got one failing spec:

Failure/Error:
  expect{
    subject.event.destroy
  }.to raise_error(ActiveRecord::InvalidForeignKey)

  expected ActiveRecord::InvalidForeignKey but nothing was raised

So this leaves me with a few questions:

  1. How to actually operationalize this? Seems like we can't guarantee zero downtime unless we drop foreign keys to this table first and re-create them later? (I guess I could do this)
    • should swap be updated to remove fks to the retired table?
  2. How to deal with schema in specs? I guess I could shell out to pgslice in the spec helper suite setup, but that feels kinda gross
  3. More generally: developers are often very hung up on structure.sql exactly matching production schema. It's a bit gross to have 30+ partitions in there, but also it will be out of date after one day. What do you recommend here?

Add an option to pass the offset for the date in the add_partitions method

today = round_date(DateTime.now.new_offset(0).to_date, period)

I came across an issue when trying to add the future partitions a few days after I made the initial partition. And it started throwing the error: relation "public.table_name_<today's_date>" does not exist (PG::UndefinedTable). I fixed it by changing the hardcoded offset. But it would be great if you could provide an option to pass the offset or date itself. Thanks for writing this awesome library BTW!!!

Problem running pgslice in docker

Hi

When testing pgslice in Docker (with the Dockerfile provided by the project, also recreated myself to containerize pgslice), there is the following exception:

/usr/local/bundle/gems/pgslice-0.6.0/lib/pgslice/helpers.rb:51:in `rescue in connection': uninitialized constant PgSlice::Helpers::URI (NameError)

    rescue URI::InvalidURIError
              ^^^^^^^^^^^^^^^^^
        from /usr/local/bundle/gems/pgslice-0.6.0/lib/pgslice/helpers.rb:27:in `connection'
        from /usr/local/bundle/gems/pgslice-0.6.0/lib/pgslice/helpers.rb:56:in `schema'
        from /usr/local/bundle/gems/pgslice-0.6.0/lib/pgslice/helpers.rb:176:in `create_table'
        from /usr/local/bundle/gems/pgslice-0.6.0/lib/pgslice/cli/prep.rb:8:in `prep'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/command.rb:27:in `run'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/invocation.rb:127:in `invoke_command'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor.rb:392:in `dispatch'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/base.rb:485:in `start'
        from /usr/local/bundle/gems/pgslice-0.6.0/exe/pgslice:7:in `<top (required)>'
        from /usr/local/bundle/bin/pgslice:25:in `load'
        from /usr/local/bundle/bin/pgslice:25:in `<main>'
/usr/local/bundle/gems/pgslice-0.6.0/lib/pgslice/helpers.rb:32:in `connection': uninitialized constant PgSlice::Helpers::URI (NameError)

        uri = URI.parse(url)
              ^^^
        from /usr/local/bundle/gems/pgslice-0.6.0/lib/pgslice/helpers.rb:56:in `schema'
        from /usr/local/bundle/gems/pgslice-0.6.0/lib/pgslice/helpers.rb:176:in `create_table'
        from /usr/local/bundle/gems/pgslice-0.6.0/lib/pgslice/cli/prep.rb:8:in `prep'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/command.rb:27:in `run'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/invocation.rb:127:in `invoke_command'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor.rb:392:in `dispatch'
        from /usr/local/bundle/gems/thor-1.2.1/lib/thor/base.rb:485:in `start'
        from /usr/local/bundle/gems/pgslice-0.6.0/exe/pgslice:7:in `<top (required)>'
        from /usr/local/bundle/bin/pgslice:25:in `load'
        from /usr/local/bundle/bin/pgslice:25:in `<main>'

Docker test workflow:
in repo dir:

docker buildx build . --no-cache -t test
docker run --rm -it test prep some dummy args --url whatever

I believe it might be due to https://bugs.ruby-lang.org/issues/18147. The fix seems to be requiring "uri" explicitly.
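
A sketch of the fix (placement assumed):

# at the top of lib/pgslice/helpers.rb
require "uri"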

Upgrading PG 10 to 11+ with partitioned tables from pgslice < 0.4.5 causes duplicate indexes

Scenario: Do a full pgslice partitioning while running PG 10 then upgrade to PG 11+. Every time add_new_partitions is called it will add duplicate indexes.

After looking in the code it looks like the behavior is controlled by a comment on the parent table. Unless I missed something I don't believe the version is checked after tables have been partitioned. This means that when the comment on the table is read it is either missing the version or has the wrong one and thus pgslice creates the partitions as duplicates.

I believe the fix would be to check the version before adding partitions and update the comment if it has changed which should cause pgslice's behavior to adjust accordingly.

OTOH it's not hard to rectify manually if you just update the comment yourself. I was able to fix it by running the following (the fix was to add ,version:3), in case other users have run into this:

COMMENT ON TABLE public.test_execution_logs IS 'column:created_at,period:day,cast:date,version:3';

Which fixed the behavior for me.

`pg` gem `1.3.x` connection string error

After updating the pg gem to the 1.3.x series, pgslice errors and exits on what (I guess?) is a connection string issue.

Example command using pg versions 1.3.0 to 1.3.3:

$ PGSLICE_URL="postgresql://localhost/my_db" bundle exec pgslice prep my_table_name created_at month
missing key/value separator "=" in URI query parameter: ""

With pg version 1.2.3 (most recent of 1.2.x series) pgslice is able to connect and run:

$ PGSLICE_URL="postgresql://localhost/my_db" bundle exec pgslice prep my_table_name created_at month

BEGIN;

CREATE TABLE "public"."my_table_name" (LIKE "public"."my_table_name" INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING STORAGE INCLUDING COMMENTS) PARTITION BY RANGE ("created_at");

etc, etc

Rails: How to perform SQL Migrations without excessive diff

While PgSlice is a Ruby gem not specific to Rails, it has some Rails info and I’m personally using it only with Rails and Migrations.
https://github.com/ankane/pgslice#rails

I've created a partitioned table and partitions, using the --dry-run functionality of PgSlice to print the SQL statements. I’ve pasted the SQL statements into new Migration files.

I'm using the execute() method within Rails migrations, supplying SQL as heredoc text. When I migrate, the dumped schema in db/structure.sql includes the expected partitioned tables and partitions, but unfortunately it also touches everything else, causing a large diff of unrelated schema objects that are not actually changing. The diff is so large it's not viewable on GitHub either, although it can be viewed locally with git.

What I’m trying to do is have the changes to db/structure.sql only reflect what’s changed. I don't see any useful options to the Schema Dumper in Rails for the SQL format.

I'm wondering if PgSlice has any recommendations on how to introduce new partitions in Rails Migrations, since Rails does not support PostgreSQL partitions with create_table to my knowledge, in a way that limits the schema dump so it doesn't cause an excessive diff.

Thanks.

bad URI(is not URI?)

I'm setting PGSLICE_URL as
postgres://user:password@localhost:5432/database

But when running command
pgslice prep table_name column_name period
i got error:

 bad URI(is not URI?): postgres://user:password@localhost:5432/dbcounter2 (URI::InvalidURIError)

How can I fix it?

Auto-copying foreign key constraint triggers

Hi, thanks for the great tool. I was struggling with having a foreign key constraint on the partitioned tables. I think you know exactly what I mean, so I won't explain all the constraints of that mechanism here to keep it short, but I can if needed.
So, I came up with having a trigger on each partition, which looks at the related table with the key and checks whether such a record exists. The only thing is that I need to remember to apply that trigger to each new partition after creating it. It's not that hard, but it's something that might be automated with pgslice. But I'm wondering whether it's really a good practice to have such triggers...
Anyway, as I see it, it might be a config file or a parameter, for a start, with the names of triggers that must be copied from the master table to a partition during pgslice add_partitions. What do you think about this?
Here is an example of the trigger I want to be applied to the new partitions:

CREATE OR REPLACE FUNCTION foreign_key_constraint_trigger() RETURNS TRIGGER AS $$
	BEGIN
		PERFORM 1 FROM page_view WHERE id=NEW."impressionId" LIMIT 1;
		IF NOT FOUND THEN
      	RAISE EXCEPTION 'not found page_view record with id=%', NEW."impressionId"; 
			RETURN NULL;
		END IF;
		RETURN NEW;
	END;
$$ LANGUAGE plpgsql;

And it doesn't work if I put it just on the master table; it doesn't react to inserts.
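
For now I apply it to each partition by hand, something like this (partition and trigger names hypothetical):

CREATE TRIGGER check_impression_id
    BEFORE INSERT OR UPDATE ON impressions_202209
    FOR EACH ROW EXECUTE PROCEDURE foreign_key_constraint_trigger();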

Support for Partitioning By Date and Hour

I have a use case where millions of rows can be added to a table on the same day, and it seems to make sense to partition the tables by date and hour. Is this something that could be added?

Could not translate host name

Hello, I am getting this error when trying to connect to a postgres instance that is in a docker container. I have exposed this postgres instance on port 5092. I am able to connect to it in any other manner (psql, nodejs app outside of container, inside container, etc). Only pgslice doesn't seem to be working.

pgslice prep reservations created_at month --url=postgres://postgres@localhost:5092/my_app_db

I get this error:

could not translate host name "postgres://postgres@localhost:5092/my_app_db?" to address: nodename nor servname provided, or not known

Any ideas as to why this is happening?

Using a view + trigger table swap

First of all, thanks for this gem, it is very nearly exactly what I was looking for.

I'm curious if you have any interest in supporting an approach like the one outlined here: https://www.2ndquadrant.com/en/blog/partitioning-a-large-table-without-a-long-running-lock/

Basically, instead of the current approach (IIUC) of:

  1. create partitioned table
  2. fill
  3. swap
  4. fill again for rows that were inserted into original table between completing step 2 and starting step 3
  5. cleanup

it would:

  1. create partitioned table
  2. create a view that unions the old, unpartitioned table with the new, partitioned one. Rename the old table and use its name for the view, so that clients can continue to read (and write) unaware of the change. The view has a trigger such that modifications get written to the new partitioned table and removed from the old one.
  3. begin backfill
  4. cleanup (which includes deleting the view and renaming the newly partitioned table to the old name)

This seems to me to have two main advantages:

  1. Data is not duplicated during the fill portion, since it is deleted from the old table. I think this is particularly useful given that if you need to partition an existing table, it's probably pretty big, and you'd probably prefer not to double your disk usage.
  2. There is no period of time where data is missing from the new, partitioned table, as alluded to in step 4 above (I may be misunderstanding the steps here; perhaps this tool avoids that somehow). Again, I think this is particularly useful given that large tables where partitioning is desirable are more likely to have high write volume, and if client apps cannot find rows that should be there, even for a brief period of time, that is more likely to be problematic.
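
A minimal sketch of the view-plus-trigger idea, reusing the README's visits example (all names hypothetical; inserts only, and updates/deletes would need similar triggers):

BEGIN;

ALTER TABLE visits RENAME TO visits_old;

CREATE VIEW visits AS
    SELECT * FROM visits_old
    UNION ALL
    SELECT * FROM visits_parted;

CREATE FUNCTION visits_view_insert() RETURNS trigger AS $$
BEGIN
    -- write new rows to the partitioned table only
    INSERT INTO visits_parted VALUES (NEW.*);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER visits_view_insert INSTEAD OF INSERT ON visits
    FOR EACH ROW EXECUTE PROCEDURE visits_view_insert();

COMMIT;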
