Comments (28)

chobostar commented on September 28, 2024

@pyhalov here is PR: #261

dvarrazzo commented on September 28, 2024

Is this a regression w.r.t. pg_reorg?

armanddp commented on September 28, 2024

I've never used pg_reorg; this is pg_repack 1.1.8 from the repo. (Sorry, I know this is probably not really an issue per se, but I don't have access to the mailing list.)

armanddp commented on September 28, 2024

Just secondary to this as it's happening to me at the moment. Can I safely DROP EXTENSION pg_repack CASCADE while it's running to shut it down?

schmiddy commented on September 28, 2024

That is strange. Generally we expect the repack temp table to be able to keep up with even heavy traffic on the main table, if only because we have a compact heap and indexes, and any modifications to the main table have to write to the log_ table as well via our trigger.

Could we see the output with --echo --elevel=DEBUG turned on? If you have the 'ts' utility as well to timestamp the logs, that would be helpful too, e.g.
pg_repack ... --echo --elevel=DEBUG dbname 2>&1 | ts

schmiddy commented on September 28, 2024

If you want to kill the current job, I recommend killing the client first -- either ctrl-c, or if that doesn't work a pg_cancel_backend() or pg_terminate_backend() of the pg_repack connection should cause it to bail out. Then you can safely DROP EXTENSION pg_repack CASCADE;
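A rough sketch of that sequence from psql (12345 is a placeholder, not a PID from this thread; substitute the PID of the pg_repack client connection, which you can look up in pg_stat_activity):

    -- Ask the pg_repack backend to cancel; fall back to terminate if it ignores the cancel.
    SELECT pg_cancel_backend(12345);
    -- SELECT pg_terminate_backend(12345);

    -- Once the client has exited, drop the extension and its leftover objects:
    DROP EXTENSION pg_repack CASCADE;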

armanddp commented on September 28, 2024

I'll restart it shortly with logging enabled. Going to kill the client now.

armanddp commented on September 28, 2024

Must say, I love the library so far. It feels really solid (except now for the hiccup with catch-up). This cancel and drop was seamless; I usually have cold shivers when I need to do something like that.

I just want to see what our traffic looks like then I'll run it with the log level up.

armanddp commented on September 28, 2024

I've tried the latest version again and see similar behavior. It's doing the catch-up as intended, but it's just too slow to catch up with the live data. Unfortunately I don't have ts installed, but each repack_apply loop takes around 10 seconds.

x #select now(), min(id),max(id),max(id)-min(id) as left from repack.log_18369;
              now              |  min  |   max    |   left
-------------------------------+-------+----------+----------
 2013-05-29 14:06:25.218692+00 | 15031 | 10457455 | 10442424
(1 row)

x # select now(), min(id),max(id),max(id)-min(id) as left from repack.log_18369;
              now              |  min  |   max    |   left
-------------------------------+-------+----------+----------
 2013-05-29 14:10:33.162414+00 | 25031 | 10651793 | 10626762

Below is the log:

LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
LOG: (param:0) = SELECT * FROM repack.log_18369 ORDER BY id LIMIT $1
LOG: (param:1) = INSERT INTO repack.table_18369 VALUES ($1.*)
LOG: (param:2) = DELETE FROM repack.table_18369 WHERE (id) = ($1.id)
LOG: (param:3) = UPDATE repack.table_18369 SET (id, type, recipient, sender, reference, data, message_id, created_at, updated_at, state, client_ref, meta_data, broadcast_id, app_version) = ($2.id, $2.type, $2.recipient, $2.sender, $2.reference, $2.data, $2.message_id, $2.created_at, $2.updated_at, $2.state, $2.client_ref, $2.meta_data, $2.broadcast_id, $2.app_version) WHERE (id) = ($1.id)
LOG: (param:4) = DELETE FROM repack.log_18369 WHERE id <= $1
LOG: (param:5) = 1000
LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
LOG: (param:0) = SELECT * FROM repack.log_18369 ORDER BY id LIMIT $1
LOG: (param:1) = INSERT INTO repack.table_18369 VALUES ($1.*)
LOG: (param:2) = DELETE FROM repack.table_18369 WHERE (id) = ($1.id)
LOG: (param:3) = UPDATE repack.table_18369 SET (id, type, recipient, sender, reference, data, message_id, created_at, updated_at, state, client_ref, meta_data, broadcast_id, app_version) = ($2.id, $2.type, $2.recipient, $2.sender, $2.reference, $2.data, $2.message_id, $2.created_at, $2.updated_at, $2.state, $2.client_ref, $2.meta_data, $2.broadcast_id, $2.app_version) WHERE (id) = ($1.id)
LOG: (param:4) = DELETE FROM repack.log_18369 WHERE id <= $1
LOG: (param:5) = 1000
LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
LOG: (param:0) = SELECT * FROM repack.log_18369 ORDER BY id LIMIT $1
LOG: (param:1) = INSERT INTO repack.table_18369 VALUES ($1.*)
LOG: (param:2) = DELETE FROM repack.table_18369 WHERE (id) = ($1.id)
LOG: (param:3) = UPDATE repack.table_18369 SET (id, type, recipient, sender, reference, data, message_id, created_at, updated_at, state, client_ref, meta_data, broadcast_id, app_version) = ($2.id, $2.type, $2.recipient, $2.sender, $2.reference, $2.data, $2.message_id, $2.created_at, $2.updated_at, $2.state, $2.client_ref, $2.meta_data, $2.broadcast_id, $2.app_version) WHERE (id) = ($1.id)
LOG: (param:4) = DELETE FROM repack.log_18369 WHERE id <= $1
LOG: (param:5) = 1000

schmiddy commented on September 28, 2024

Hi armanddp,

Thanks for checking with the latest version and providing logs. Sorry I haven't had a chance to investigate this problem more yet. If you could provide us a few more details, that might help us narrow down what's going on.

First, some information which could hopefully be gathered unintrusively:

  • The relevant disk sizes of the table in question, as reported by pg_relation_size(), pg_indexes_size(), and pg_total_relation_size()
  • table schema with indexes, as reported by psql's \d+
  • rough number of rows in table (could use estimate from pg_class.reltuples if SELECT COUNT(*) takes too long)
  • The rows from this bloat report query with a "tablename" matching your table.
  • Do you know offhand the approximate write traffic to the table, i.e. number of rows changed per minute?
  • Is this database server under heavy I/O or CPU load, e.g. as reported by iostat and top?

The above details would be a great starting point.
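A sketch of how the size and row-count numbers above can be gathered from psql; 'your_table' is a placeholder for the real table name, and pg_size_pretty() is only added here for readability:

    -- Heap, index, and total on-disk sizes.
    SELECT pg_size_pretty(pg_relation_size('your_table'))       AS heap_size,
           pg_size_pretty(pg_indexes_size('your_table'))        AS index_size,
           pg_size_pretty(pg_total_relation_size('your_table')) AS total_size;

    -- Approximate row count from planner statistics (no full scan needed).
    SELECT reltuples::bigint AS approx_rows FROM pg_class WHERE relname = 'your_table';

    -- Schema with indexes:
    \d+ your_table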

If you are able to spend a bit more time trying pg_repack on the problem table again, a timestamped pg_repack output log would be pretty helpful. You can grab the 'ts' utility either from installing the 'moreutils' package, or by grabbing just 'ts' from https://raw.github.com/madx/moreutils/master/ts. Then, pipe the verbose output from pg_repack through that script, so we can see exactly how long the steps are taking. (The beginning of the log will be interesting too, to see how long the initial steps are taking.) If some of the repack_apply() steps are taking more than a few seconds, you could check pg_stat_activity to see if anything unusual is going on in the database. (If you are trying pg_repack from git, please grab the latest git master, as I recently fixed a bug which could cause pg_repack to bail out during the swap step.)
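As a rough sketch for that pg_stat_activity check (assuming PostgreSQL 9.2+ column names; older versions use procpid and current_query instead), something like this shows what else is running and for how long:

    SELECT pid, state, now() - xact_start AS xact_age, query
      FROM pg_stat_activity
     WHERE pid <> pg_backend_pid()
     ORDER BY xact_start;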

Sorry for asking for so many details, but we would like to understand what's going on, and hopefully some of the above details would give us a starting point. I think that repack_apply() could be optimized to better handle large write traffic to the main table, but we should know more about the problem before fooling with optimizations.

Thanks
Josh

armanddp commented on September 28, 2024

Thanks Josh, I'm busy running it to get a timestamped output log. I'll share the results and the rest of the requested information when it's done (for better or worse).

Thanks!

Armand

stefank2 commented on September 28, 2024

Hmm. Mine cycled 15000 times through repack_apply on a table with 67000 live tuples, which took about 4 hours. It normally has no trouble with that table and finishes it in a couple of minutes. I don't think the table is really being updated very quickly; I think perhaps a few hundred rows were added, deleted, or updated in that time. There is another table that more commonly takes hours to finish; it is not being updated very quickly either. Strange.
...
Oct 04 12:57:57 LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
Oct 04 12:57:57 LOG: (param:0) = SELECT * FROM repack.log_16798 ORDER BY id LIMIT $1
Oct 04 12:57:57 LOG: (param:1) = INSERT INTO repack.table_16798 VALUES ($1.*)
Oct 04 12:57:57 LOG: (param:2) = DELETE FROM repack.table_16798 WHERE (id) = ($1.id)
Oct 04 12:57:57 LOG: (param:3) = UPDATE repack.table_16798 SET (id, targcartid, targcartoffset, refcartid, refcartoffset, targbusid, refbusid, checksum, created, examined, used, fullness, targcartoffend, covdensity) = ($2.id, $2.targcartid, $2.targcartoffset, $2.refcartid, $2.refcartoffset, $2.targbusid, $2.refbusid, $2.checksum, $2.created, $2.examined, $2.used, $2.fullness, $2.targcartoffend, $2.covdensity) WHERE (id) = ($1.id)
Oct 04 12:57:57 LOG: (param:4) = DELETE FROM repack.log_16798 WHERE id <= $1
Oct 04 12:57:57 LOG: (param:5) = 1000
Oct 04 12:57:57 LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'virtualxid' AND pid <> pg_backend_pid() AND virtualtransaction = ANY($1)
Oct 04 12:57:57 LOG: (param:0) = {196/567622,147/3147735,52/19165090,116/1035212,38/17448680,14/17306840,2/15447834}
Oct 04 12:57:58 LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
Oct 04 12:57:58 LOG: (param:0) = SELECT * FROM repack.log_16798 ORDER BY id LIMIT $1
Oct 04 12:57:58 LOG: (param:1) = INSERT INTO repack.table_16798 VALUES ($1.*)
Oct 04 12:57:58 LOG: (param:2) = DELETE FROM repack.table_16798 WHERE (id) = ($1.id)
Oct 04 12:57:58 LOG: (param:3) = UPDATE repack.table_16798 SET (id, targcartid, targcartoffset, refcartid, refcartoffset, targbusid, refbusid, checksum, created, examined, used, fullness, targcartoffend, covdensity) = ($2.id, $2.targcartid, $2.targcartoffset, $2.refcartid, $2.refcartoffset, $2.targbusid, $2.refbusid, $2.checksum, $2.created, $2.examined, $2.used, $2.fullness, $2.targcartoffend, $2.covdensity) WHERE (id) = ($1.id)
Oct 04 12:57:58 LOG: (param:4) = DELETE FROM repack.log_16798 WHERE id <= $1
Oct 04 12:57:58 LOG: (param:5) = 1000
Oct 04 12:57:58 LOG: (query) SELECT pid FROM pg_locks WHERE locktype = 'virtualxid' AND pid <> pg_backend_pid() AND virtualtransaction = ANY($1)
Oct 04 12:57:58 LOG: (param:0) = {196/567622,147/3147735,52/19165090,116/1035212,38/17448680,14/17306840,2/15447834}
Oct 04 12:57:59 LOG: (query) SELECT repack.repack_apply($1, $2, $3, $4, $5, $6)
...

schmiddy commented on September 28, 2024

If you are seeing the SQL_XID_ALIVE query repeated every second, pg_repack must be waiting for an old transaction to finish before it trusts that it is safe to perform its final table swapping. To find the PIDs of those old transactions hanging out, you can run that same query:

SELECT pid FROM pg_locks WHERE locktype = 'virtualxid' AND pid <> pg_backend_pid() AND virtualtransaction = ANY($1)

substituting in the arguments listed by pg_repack as "param:0" in the subsequent LOG message: in your case, that would have been '{196/567622,147/3147735,52/19165090,116/1035212,38/17448680,14/17306840,2/15447834}'.
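For example, with that array pasted in, the substituted query would look roughly like:

    SELECT pid FROM pg_locks
     WHERE locktype = 'virtualxid'
       AND pid <> pg_backend_pid()
       AND virtualtransaction = ANY(
           '{196/567622,147/3147735,52/19165090,116/1035212,38/17448680,14/17306840,2/15447834}'::text[]);

The PIDs it returns can then be looked up in pg_stat_activity to see what those sessions are doing.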

There is a NOTICE printed by pg_repack when it is waiting:

Waiting for %d transactions to finish. First PID: %s

but that message only gets printed upon a change in the number of backends it is waiting on (i.e. the first time we notice that we're waiting for some number of backends, and any time that number of backends changes). IIRC the rationale for not printing that NOTICE in every loop was to avoid logspam, though maybe we should revisit that decision since it's far from apparent that pg_repack is stuck waiting with the current state of affairs.

vincentdephily commented on September 28, 2024

I'm having the same issue right now. I was seeing the same 'SELECT pid FROM pg_locks' query earlier, but not anymore, so I can't run that query manually to see what's blocking.

Running pg_repack 1.2.1 on PG 9.1.13.

The table in question is used as a queue: a trigger inserts rows, then the queue-processing connection sets 'processing=true' on some rows, and then deletes them. There's around 10-15K inserts per minute.

There was an issue with the queue processing which let the table balloon to 4G (7G with indexes), but it has caught up now and the number of live rows stays under 10K. The server's I/O is very loaded, but keeping up. Even though queue processing has caught up, the table and its indexes are still growing abnormally, currently up to 12G.

   Column    |            Type             |       Modifiers
-------------+-----------------------------+------------------------
 id          | bigint                      | not null
 i1          | integer                     |
 i2          | integer                     |
 i3          | integer                     |
 i4          | integer                     |
 d1          | timestamp without time zone | not null
 d2          | timestamp without time zone | not null
 t1          | bytea                       |
 b1          | boolean                     | not null default false
Indexes:
    "foo_pkey" PRIMARY KEY, btree (id)
    "index_96155" UNIQUE, btree (id)
    "index_1480860" btree (b1)
    "foo_b1_idx" btree (b1)
Triggers:
    foo_insert_trigger AFTER INSERT ON foo FOR EACH STATEMENT EXECUTE PROCEDURE foo_insert_notify()
Triggers firing always:
    z_repack_trigger BEFORE INSERT OR DELETE OR UPDATE ON tracking_push FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger('INSERT INTO repack.log_96075(pk, row) VALUES( CASE WHEN $1 IS NULL THEN NULL ELSE (ROW($1.id)::repack.pk_96075) END, $2)')

schmiddy commented on September 28, 2024

If pg_repack is not waiting on old transactions to die off, another possibility is that a steady stream of new rows recorded by the trigger is causing repack_apply to always have just a few rows left to process, so this logic:

    /*                                                                      
     * 4. Apply log to temp table until no tuples are left in the log       
     * and all of the old transactions are finished.                        
     */

never actually finishes -- I've seen this on a large busy table before. IMO we should fix pg_repack so that if apply_log() indicates that it has processed just a few remaining rows in the log table, we take the exclusive lock and finish up, instead of waiting for apply_log() to get all the way down to 0, which may never actually happen.

In the meantime, to find out if that's what's biting you, check what SELECT COUNT(*) FROM repack.log_xxx says while pg_repack is spinning through that loop performing apply_log(). If you see that number steadily decreasing from a huge count down to zero, everything is fine. If you see that count bouncing around with numbers well under 1000, then you are experiencing this bug. If so, I suggest temporarily halting writes to your main table for a few seconds, enough for apply_log() to chew through the log entries and return 0.
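A minimal way to watch that count, using the log table OID from the logs above as a stand-in for your own repack.log_<oid> table:

    -- Run this repeatedly (e.g. every few seconds) while pg_repack loops.
    SELECT now(), count(*) FROM repack.log_16798;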

vincentdephily commented on September 28, 2024

Yes, I had looked at this; the row count on the repack table bounces around a lot, which is expected behaviour for the source table.

I tried stopping the queue processing to lower activity on the table (so that it only gets inserts, no updates or deletes), but that wasn't enough for pg_repack to notice that it was a good time to take the lock.

Stopping inserts on that table is a tough one; it'd mean a service interruption which I'd rather avoid.

schmiddy commented on September 28, 2024

I think you should be able to grab a LOCK TABLE original_table IN SHARE MODE; while pg_repack is finishing up, to stop updates to the original table for a second or two.
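A sketch of that workaround, with original_table standing in for the table being repacked; the point is only to pause writes for a few seconds so apply_log() can drain the log:

    BEGIN;
    -- SHARE MODE blocks INSERT/UPDATE/DELETE on the table but still allows reads.
    LOCK TABLE original_table IN SHARE MODE;
    -- Hold the lock only briefly, then release it so pg_repack can take
    -- its own exclusive lock and perform the swap.
    COMMIT;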

vincentdephily commented on September 28, 2024

Glancing at the code, it seems like just raising the first "if (num > 0)" threshold would be enough? apply_log() is being run again after we get the exclusive lock, so it looks safe?

vincentdephily commented on September 28, 2024

Locking the table did the trick, although I had to lock the table that fed into the queue table rather than the queue table itself (which is the one I wanted to repack), because otherwise pg_repack saw a running transaction and wouldn't continue, and releasing the lock got me back to the original race-condition situation. So "nice trick, not universally usable" :)

BTW, even though there were < 10k rows in repack.table_xxxx, there were many more (up to 300k) in repack.log_xxxx. It took a while to process them. Just mentioning this in case it explains the scenario better.

schmiddy commented on September 28, 2024

Glancing at the code, it seems like just raising the first "if (num > 0)" threshold would be enough?

Yeah, I think bumping that up to, say, 50 would be an improvement over what we have now. Or maybe we'd want something just a bit more sophisticated, like check for num <= 50 and also that num was small in the last loop iteration as well.

BTW, even though there were < 10k rows in repack.table_xxxx, there were many more (up to 300k) in repack.log_xxxx. It took a while to process them. Just mentioning this in case it explains the scenario better.

Oh, so maybe you weren't falling into the case of being stuck in a loop where apply_log() always has a small handful of rows to chew through -- perhaps your write volume to the original table was simply high enough that pg_repack was having difficulty keeping up.

vincentdephily commented on September 28, 2024

Sorry, I mustn't have been level-headed yesterday; I thought you suggested checking the size of repack.table_xxxx instead of repack.log_xxxx. Makes perfect sense today :) My problem might well have been an I/O bandwidth one; I need to free up resources. The log was big, but it wasn't growing unboundedly as far as I could see.

Not waiting until the log is exactly empty still sounds like a good idea. I suggest making the threshold configurable from the command line.

ktm-rice commented on September 28, 2024

Hi, I am having a similar problem with pg_repack. The table is very busy, and pg_repack never catches up with the main table, so the log table grows without bound. Is there any way to make the apply_log function faster? Regards, Ken

gdeglin commented on September 28, 2024

Ran into this myself recently and had to briefly stop my application to get pg_repack to finish packing a couple of tables. Seems like it was close enough that the downtime was under 10 seconds each time, but it'd be great to have a fix for this.

drakkan commented on September 28, 2024

Hi, can you please verify if your apps keep connections "idle in transaction"? If so I think pg_repack cannot complete.

ktm-rice commented on September 28, 2024

On Tue, Sep 08, 2015 at 01:57:31PM -0700, drakkan wrote:

Hi, can you please verify if your apps keep connections "idle in transaction"? If so I think pg_repack cannot complete.

Hi,

No, they are not "idle in transaction", just "idle", waiting for a new transaction. The problem was that the single apply_log process just could not empty the log quickly enough to gain on all of the new updates. I ran pg_repack over the weekend during a less active period and it completed successfully.

Regards,
Ken

gdeglin commented on September 28, 2024

@drakkan Not idle in transaction here either. We're experiencing the same symptoms as @ktm-rice.

pyhalov commented on September 28, 2024

Seeing the same here: due to high I/O load, pg_repack can't process the queue. It had gone from the initial ~500k records down to ~1k records, but pg_repack didn't get the chance to switch tables, and now the queue is 1000k records long. Can we at least have configurable APPLY_COUNT and MIN_TUPLES_BEFORE_SWITCH, so that they can be adjusted for big/busy tables?

carl-reverb commented on September 28, 2024

We're facing an issue with our large table 'entries', and an apparently interminable repack on RDS Aurora PostgreSQL 13.13 with pg_repack 1.4.7.

This table is 1.5 TB in size with 1.2 billion rows, and it receives 400k inserts, 80k updates, and 80k deletes per minute.
The inserts are large bulk-inserts with hundreds of values per statement.

Now pg_repack's log table has over 3 billion rows in it and according to pganalyze it's inserting about 400k rows per minute and deleting about 60k rows per minute. There's no sign of this trend reversing, so I believe I have no choice but to declare bankruptcy and drop the repacking effort.

Unless there's something I'm overlooking that will let this run faster? I already scaled our instance class up to db.r5.12xlarge, which can sustain over 1 GB/s of I/O bandwidth, and neither storage bandwidth nor CPU appears to be constraining the rate of work.
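For what it's worth, a cheap way to double-check those log-table rates without pganalyze is to sample the cumulative counters in the standard pg_stat_user_tables view a minute or so apart and diff them:

    SELECT relname, n_tup_ins, n_tup_del, n_live_tup
      FROM pg_stat_user_tables
     WHERE schemaname = 'repack';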
