Hey there! We are moving towards SolidQueue for our apps, we started

Hey <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url=

Hey! I will try this out soon! is it via Docker's <code

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Thanks a lot, <a class="user-mention notranslate" data-hovercard-type="user" data-hove

`PG::ConnectionBad` error when restarting container running `solid_queue:start`,about rails/solid_queue

Comments (15)

rosa commented on August 10, 2024 1

Hey @InteNs, thanks for the detailed report. I'll look into this!

from solid_queue.

InteNs commented on August 10, 2024 1

oh I've been testing a combination of SIGKILL and SIGTERM, they both sometimes trigger the exception and sometimes not...

manually killing worker and dispatcher processes work fine and the supervisor starts new ones like it's supposed to.
manually killing the supervisor (both term and kill sometimes trigger the error and sometimes not.

the manual killing was done in a shell inside the container

however in our production setup the restarting is managed by docker so the 10 second (now 20) wait before the final sigkill

Looking forward to try those changes!

from solid_queue.

InteNs commented on August 10, 2024 1

Hey! I will try this out soon!

is it via Docker's CMD?

yup!

Is it just bin/rails solid_queue:start?

yup!

from solid_queue.

InteNs commented on August 10, 2024 1

@rosa We've been running #269 on one of our apps for the entire weekend and haven't seen the error yet!
It has definitely been rebooted a few times and just now manually too.
very smooth now!

from solid_queue.

rosa commented on August 10, 2024 1

Ohh, great news! We've been running that branch also for quite a few days in production for HEY and everything has worked well, so I'm going to merge that now and ship a new version. I'll keep this one open in case you run into this error again.

Thanks a lot again! 🙏

from solid_queue.

rosa commented on August 10, 2024

@InteNs, would you happen to have some more logs right before the restart? That is, a few more lines before this:

gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | 2024-07-24 12:48:42.267770 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (625.4ms)  supervisor_pid: 1, supervised_pids: [28, 31]
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3    | bin/rails aborted!

from solid_queue.

InteNs commented on August 10, 2024

hmm those are all the (info level) logs there are, above the blank line was the container start at 03:00 AM, and below the blank line the restart, nothing was logged in between.
I will set the log level to debug and trigger a restart, I will add those logs here in a minute!

is it possible forked processes are not logging to the same logger? I know for puma we have a on_worker_boot callback to reopen the logs:

on_worker_boot do
  # Re-open appenders after forking the process
  SemanticLogger.reopen
end

I've already set SemanticLogger as the logger for solidqueue: SolidQueue.logger = SemanticLogger[SolidQueue]

from solid_queue.

InteNs commented on August 10, 2024

@rosa okay so here are the debug logs, i could't get the exception to trigger this time, weird...

Makes it seem like it's maybe something that only happens after it's running a while? or something to do with forks?

what is weird about these logs is that after running for 5 minutes the supervisor gets killed and replaced by another process or something? This might be related to an idle database connection timeout maybe? (If i leave a rails console open for 5 minutes i need to reload! in order to get a new db connection)

debug logs

gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:47.278559 I [T5600] Rails -- Loaded .env
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.882022 D [T5600] (0.824ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>7, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.883978 D [T5600] (2.831ms) ActiveRecord -- SolidQueue::Process Create -- {:sql=>"INSERT INTO \"solid_queue_processes\" (\"kind\", \"last_heartbeat_at\", \"supervisor_id\", \"pid\", \"hostname\", \"metadata\", \"created_at\") VALUES ($1, $2, $3, $4, $5, $6, $7) RETURNING \"id\"", :binds=>{:kind=>"Supervisor", :last_heartbeat_at=>"2024-07-24 20:45:54.686520", :supervisor_id=>nil, :pid=>1, :hostname=>"staging-docker-worker-3:gate_worker_1", :metadata=>nil, :created_at=>"2024-07-24 20:45:54.879627"}, :allocations=>140, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.968576 D [T5600] (76.7ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>8, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.968846 I [T5600] SolidQueue -- SolidQueue-0.3.4 Register Supervisor (282.3ms)  pid: 1, hostname: "staging-docker-worker-3:gate_worker_1"
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:54.977131 W [T5600] SolidQueue -- SolidQueue-0.3.4 Tried to replace forked process but it had already died (0.0ms)  pid: 12, status: 0, pid_from_status: 12, signaled: false, stopsig: nil, termsig: nil
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:55.768863 D [Tio-worker-1] (199.6ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"last_heartbeat_at\" <= $1 ORDER BY \"solid_queue_processes\".\"id\" ASC LIMIT $2 FOR UPDATE SKIP LOCKED", :binds=>{:last_heartbeat_at=>"2024-07-24 20:40:55.471181", :limit=>50}, :allocations=>2601, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:55.769637 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Prune dead processes (593.5ms)  size: 0
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:56.275987 D [Tio-worker-1] (2.610ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>8, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:56.278357 D [Tio-worker-1] (1.557ms) ActiveRecord -- SolidQueue::ClaimedExecution Load -- {:sql=>"SELECT \"solid_queue_claimed_executions\".* FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>42, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:45:56.380089 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Release claimed jobs (203.2ms)  size: 0
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:46:54.973618 D [Tio-worker-1] (0.950ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:46:54.974810 D [Tio-worker-1] (2.207ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:46:54.971098", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:46:54.976455 D [Tio-worker-1] (1.387ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:47:54.979258 D [Tio-worker-1] (0.968ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:47:54.980239 D [Tio-worker-1] (2.028ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:47:54.977515", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:47:54.982088 D [Tio-worker-1] (1.617ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:48:54.989100 D [Tio-worker-1] (1.302ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:48:54.991324 D [Tio-worker-1] (3.472ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:48:54.985011", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:48:54.993745 D [Tio-worker-1] (2.011ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:49:55.003006 D [Tio-worker-1] (1.238ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:49:55.005151 D [Tio-worker-1] (3.673ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:49:54.998115", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:49:55.007899 D [Tio-worker-1] (2.102ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}


gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:55.015550 D [Tio-worker-1] (1.146ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:55.017647 D [Tio-worker-1] (3.607ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:50:55.011764", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:55.020281 D [Tio-worker-1] (2.034ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.387335 D [Tio-worker-1] (2.713ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"last_heartbeat_at\" <= $1 ORDER BY \"solid_queue_processes\".\"id\" ASC LIMIT $2 FOR UPDATE SKIP LOCKED", :binds=>{:last_heartbeat_at=>"2024-07-24 20:45:56.381719", :limit=>50}, :allocations=>9, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.473909 D [Tio-worker-1] (1.148ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>251}, :allocations=>8, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.477218 D [Tio-worker-1] (1.161ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>44, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.479746 D [Tio-worker-1] (0.609ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"supervisor_id\" = $1", :binds=>{:supervisor_id=>251}, :allocations=>7, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.481572 D [Tio-worker-1] (1.059ms) ActiveRecord -- SolidQueue::Process Destroy -- {:sql=>"DELETE FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"id\" = $1", :binds=>{:id=>251}, :allocations=>46, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.483263 D [Tio-worker-1] (0.831ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>251}, :allocations=>46, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.484836 D [Tio-worker-1] (0.616ms) ActiveRecord -- SolidQueue::ClaimedExecution Load -- {:sql=>"SELECT \"solid_queue_claimed_executions\".* FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>251}, :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.485096 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Release claimed jobs (3.0ms)  size: 0
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.486440 D [Tio-worker-1] (1.221ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>67, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.489142 I [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Deregister Supervisor (12.3ms)  process_id: 251, pid: 1, hostname: "staging-docker-worker-3:gate_worker_1", last_heartbeat_at: Wed, 24 Jul 2024 22:41:33.118215000 CEST +02:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 251, kind: "Supervisor", last_heartbeat_at: "2024-07-24 22:41:33.118215000 +0200", supervisor_id: nil, pid: 1, hostname: "staging-docker-worker-3:gate_worker_1", metadata: {}, created_at: "2024-07-24 12:49:27.575597000 +0200">, :pruned=>true, :claimed_size=>0}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.489220 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Prune dead processes (107.8ms)  size: 1
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.491929 D [Tio-worker-1] (1.414ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>85, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.493298 D [Tio-worker-1] (0.686ms) ActiveRecord -- SolidQueue::ClaimedExecution Load -- {:sql=>"SELECT \"solid_queue_claimed_executions\".* FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>42, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.493751 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Release claimed jobs (3.9ms)  size: 0

I'm gonna keep restarting to see if I can get the exception to trigger again.

edit:

Managed to trigger it again by killing the container:

gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:39:23.094078 D [Tio-worker-1] (1.658ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:23.103831 D [Tio-worker-1] (1.196ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:23.108470 D [Tio-worker-1] (5.944ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 21:40:23.101274", :id=>269}, :allocations=>83, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:23.110399 D [Tio-worker-1] (1.582ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:50.970872 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (412.3ms)  supervisor_pid: 1, supervised_pids: [124, 136]
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | bin/rails aborted!
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:51.171604 D [T5600] (81.7ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>269}, :allocations=>123, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | ActiveRecord::ConnectionFailed: PQconsumeInput() server closed the connection unexpectedly (ActiveRecord::ConnectionFailed)
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 	This probably means the server terminated abnormally
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 	before or while processing the request.
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | Caused by:
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly (PG::ConnectionBad)
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 	This probably means the server terminated abnormally
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 	before or while processing the request.
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | Tasks: TOP => solid_queue:start
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | (See full trace by running task with --trace)

Weird how it's fetching claimed executions after saying that the supervisor was terminated gracefully right?

Also that debug entry with the query happening while the exception is being logged suggests it maybe comes from another process?

Might this be because the supervisor is actually pid 1? I will try to run this container with --init and see if that does anything

edit:
It seems adding init: true to our compose file got rid of the warning we were getting on startup every time:
W [T5600] SolidQueue -- SolidQueue-0.3.4 Tried to replace forked process but it had already died (0.0ms)

Will need to wait and see if it also gets rid of the exception

from solid_queue.

rosa commented on August 10, 2024

Thanks a lot, @InteNs, that's very helpful 🙏

Weird how it's fetching claimed executions after saying that the supervisor was terminated gracefully right?

I think that is the query to calculate how many claimed executions processes have when they're terminated for logging purposes, after completing the shutdown 🤔

what is weird about these logs is that after running for 5 minutes the supervisor gets killed and replaced by another process or something?

I think I know what's going on. Looks like the supervisor that's apparently getting killed there is not the current supervisor. It's the old one, and only its record in the DB, that's staying around from a previous run, and that gets cleaned up when its heartbeat expires 🤔 Looking at the IDs in the queries:

gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:49:55.005151 D [Tio-worker-1] (3.673ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>
{:last_heartbeat_at=>"2024-07-24 20:49:54.998115", :id=>260}, :allocations=>83, :cached=>nil}

This is the heartbeat for process with ID 260, which is the current supervisor when these logs are happening. Then, we see the following two queries:

gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.479746 D [Tio-worker-1] (0.609ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"supervisor_id\" = $1", 
:binds=>{:supervisor_id=>251}, :allocations=>7, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3    | 2024-07-24 22:50:56.481572 D [Tio-worker-1] (1.059ms) ActiveRecord -- SolidQueue::Process Destroy -- {:sql=>"DELETE FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"id\" = $1", 
:binds=>{:id=>251}, :allocations=>46, :cached=>nil}

There we have ID 251 for the process that's being deleted, and supervisor ID 251 for the SELECT query that tries to fetch its supervised processes to deregister them as well (looks like there aren't any, otherwise we'd see the DELETE queries for them as well).

This tells us that the supervisor from the pervious run failed to be correctly deregistered. I suspect this must be because of the error you're getting precisely while logging around deregistering 🤔

gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:50.970872 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (412.3ms)  supervisor_pid: 1, supervised_pids: [124, 136]
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | bin/rails aborted!
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | 2024-07-24 23:40:51.171604 D [T5600] (81.7ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>269}, :allocations=>123, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    | ActiveRecord::ConnectionFailed: PQconsumeInput() server closed the connection unexpectedly (ActiveRecord::ConnectionFailed)
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |  This probably means the server terminated abnormally
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |  before or while processing the request.
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3    |

So it looks like the connection is closed before the deregistering takes place and thus the error and the supervisor hanging around... I'll see how to improve this, it's actually quite tight to do the deregistration after the shutdown if the shutdown happens using up all the shutdown timeout.

May I ask how the restart of your containers happen? Do the process running get send a TERM signal and then gets some time to wait before being killed? That might be playing a role here too, if the time there is shorter than SolidQueue.shutdown_timeout.

from solid_queue.

InteNs commented on August 10, 2024

May I ask how the restart of your containers happen?

sure! we use the default docker swarm settings, so that would be SIGTERM followed by a SIGKILL after 10 seconds.
SolidQueue has a shutdown_timout of 5 seconds so I would assume there is plenty of time?

I will increase both solid queue and docker timeout to see if it helps.

Also can confirm that adding init: true (thus not running the supervisor as pid 1) did not solve the exception, only the warning at startup.

from solid_queue.

InteNs commented on August 10, 2024

@rosa can the cause possibly be that when the supervisor kills its forked processes, those processes terminate the database socket connection that is shared between the different processes? or does each fork get a seperate database connection?

https://stackoverflow.com/questions/13089875/fork-ruby-activerecord-and-file-descriptors-on-fork
https://stackoverflow.com/questions/27892975/what-causes-activerecord-breaking-postgres-connection-after-forking

i'm getting the same error when my console is idle for too long and postgres kills the connection

from solid_queue.

rosa commented on August 10, 2024

sure! we use the default docker swarm settings, so that would be SIGTERM followed by a SIGKILL after 10 seconds.
SolidQueue has a shutdown_timout of 5 seconds so I would assume there is plenty of time?

Yeah, definitely, that should work fine. In my tests of this I used 2 seconds as shutdown_timeout and that was fine, so it seems this should work. Perhaps something takes longer here...

those processes terminate the database socket connection that is shared between the different processes? or does each fork get a seperate database connection?

https://stackoverflow.com/questions/13089875/fork-ruby-activerecord-and-file-descriptors-on-fork
https://stackoverflow.com/questions/27892975/what-causes-activerecord-breaking-postgres-connection-after-forking

Ohhh, good question! I think from Rails 5.2+, connections across forks are seamlessly handled by Rails, so it shouldn't be the case. I'm going to add some tests for this, in any case, where forked processes get terminated independently from the supervisor, to confirm this works fine.

I imagine this is not the case, but is there anything else external you might be running that could be closing all PostgreSQL connections from your job containers when there's a restart?

I'll keep digging and in any case, going to see if I can improve the shutdown + deregistering process.

from solid_queue.

rosa commented on August 10, 2024

Ok, I've got some changes ready around shutdown + deregistering, that will go out as part of a larger change. However, I realised something... in this comment you mentioned:

Managed to trigger it again by killing the container:

How is that "killing" done? From the logs, it seems a SIGTERM was sent, but was also a SIGKILL sent right after, without waiting any time? Or did this follow the "SIGTERM followed by a SIGKILL after 10 seconds" setup?

from solid_queue.

rosa commented on August 10, 2024

@InteNs thanks so much for doing these tests! I should have them ready early next week.

from solid_queue.

rosa commented on August 10, 2024

Hey @InteNs, I opened #269, which includes a bunch of changes on running the supervisor, if you want to try them. The main change is unrelated to this, but I refactored the supervisor's cycle quite a bit. There's a breaking change that might not affect you, depending on how you're running Solid Queue. It's described here.

This also reminded me that I wanted to ask how you were running Solid Queue: is it via Docker's CMD? And if so, what's the command? Is it just bin/rails solid_queue:start? Wondering if there's something there I'm missing.

from solid_queue.

`PG::ConnectionBad` error when restarting container running `solid_queue:start` about solid_queue HOT 15 OPEN

Comments (15)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent