Comments (15)
Hey @InteNs, thanks for the detailed report. I'll look into this!
from solid_queue.
oh I've been testing a combination of SIGKILL and SIGTERM, they both sometimes trigger the exception and sometimes not...
manually killing worker and dispatcher processes works fine and the supervisor starts new ones like it's supposed to.
manually killing the supervisor (both term and kill) sometimes triggers the error and sometimes not.
the manual killing was done in a shell inside the container, however in our production setup the restarting is managed by docker, so there's a 10 second (now 20) wait before the final SIGKILL.
Looking forward to trying those changes!
Hey! I will try this out soon!
is it via Docker's CMD?
yup!
Is it just bin/rails solid_queue:start?
yup!
@rosa We've been running #269 on one of our apps for the entire weekend and haven't seen the error yet!
It has definitely been rebooted a few times and just now manually too.
very smooth now!
Ohh, great news! We've been running that branch also for quite a few days in production for HEY and everything has worked well, so I'm going to merge that now and ship a new version. I'll keep this one open in case you run into this error again.
Thanks a lot again! 🙏
@InteNs, would you happen to have some more logs right before the restart? That is, a few more lines before this:
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3 | 2024-07-24 12:48:42.267770 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (625.4ms) supervisor_pid: 1, supervised_pids: [28, 31]
gate_worker.1.bxrxeomly8pa8gvvvdxot19pe@staging-docker-worker-3 | bin/rails aborted!
hmm, those are all the (info-level) logs there are: above the blank line was the container start at 03:00 AM, and below it the restart; nothing was logged in between.
I will set the log level to debug and trigger a restart, and I'll add those logs here in a minute!
is it possible forked processes are not logging to the same logger? I know for puma we have an on_worker_boot callback to reopen the logs:
on_worker_boot do
  # Re-open appenders after forking the process
  SemanticLogger.reopen
end
I've already set SemanticLogger as the logger for Solid Queue: SolidQueue.logger = SemanticLogger[SolidQueue]
@rosa okay so here are the debug logs, I couldn't get the exception to trigger this time, weird...
Makes it seem like it's maybe something that only happens after it's been running a while? or something to do with forks?
what is weird about these logs is that after running for 5 minutes the supervisor gets killed and replaced by another process or something? This might be related to an idle database connection timeout maybe? (If I leave a rails console open for 5 minutes I need to reload! in order to get a new db connection)
debug logs
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:47.278559 I [T5600] Rails -- Loaded .env
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:54.882022 D [T5600] (0.824ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>7, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:54.883978 D [T5600] (2.831ms) ActiveRecord -- SolidQueue::Process Create -- {:sql=>"INSERT INTO \"solid_queue_processes\" (\"kind\", \"last_heartbeat_at\", \"supervisor_id\", \"pid\", \"hostname\", \"metadata\", \"created_at\") VALUES ($1, $2, $3, $4, $5, $6, $7) RETURNING \"id\"", :binds=>{:kind=>"Supervisor", :last_heartbeat_at=>"2024-07-24 20:45:54.686520", :supervisor_id=>nil, :pid=>1, :hostname=>"staging-docker-worker-3:gate_worker_1", :metadata=>nil, :created_at=>"2024-07-24 20:45:54.879627"}, :allocations=>140, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:54.968576 D [T5600] (76.7ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>8, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:54.968846 I [T5600] SolidQueue -- SolidQueue-0.3.4 Register Supervisor (282.3ms) pid: 1, hostname: "staging-docker-worker-3:gate_worker_1"
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:54.977131 W [T5600] SolidQueue -- SolidQueue-0.3.4 Tried to replace forked process but it had already died (0.0ms) pid: 12, status: 0, pid_from_status: 12, signaled: false, stopsig: nil, termsig: nil
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:55.768863 D [Tio-worker-1] (199.6ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"last_heartbeat_at\" <= $1 ORDER BY \"solid_queue_processes\".\"id\" ASC LIMIT $2 FOR UPDATE SKIP LOCKED", :binds=>{:last_heartbeat_at=>"2024-07-24 20:40:55.471181", :limit=>50}, :allocations=>2601, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:55.769637 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Prune dead processes (593.5ms) size: 0
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:56.275987 D [Tio-worker-1] (2.610ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>8, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:56.278357 D [Tio-worker-1] (1.557ms) ActiveRecord -- SolidQueue::ClaimedExecution Load -- {:sql=>"SELECT \"solid_queue_claimed_executions\".* FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>42, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:45:56.380089 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Release claimed jobs (203.2ms) size: 0
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:46:54.973618 D [Tio-worker-1] (0.950ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:46:54.974810 D [Tio-worker-1] (2.207ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:46:54.971098", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:46:54.976455 D [Tio-worker-1] (1.387ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:47:54.979258 D [Tio-worker-1] (0.968ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:47:54.980239 D [Tio-worker-1] (2.028ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:47:54.977515", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:47:54.982088 D [Tio-worker-1] (1.617ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:48:54.989100 D [Tio-worker-1] (1.302ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:48:54.991324 D [Tio-worker-1] (3.472ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:48:54.985011", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:48:54.993745 D [Tio-worker-1] (2.011ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:49:55.003006 D [Tio-worker-1] (1.238ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:49:55.005151 D [Tio-worker-1] (3.673ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:49:54.998115", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:49:55.007899 D [Tio-worker-1] (2.102ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:55.015550 D [Tio-worker-1] (1.146ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:55.017647 D [Tio-worker-1] (3.607ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 20:50:55.011764", :id=>260}, :allocations=>83, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:55.020281 D [Tio-worker-1] (2.034ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.387335 D [Tio-worker-1] (2.713ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"last_heartbeat_at\" <= $1 ORDER BY \"solid_queue_processes\".\"id\" ASC LIMIT $2 FOR UPDATE SKIP LOCKED", :binds=>{:last_heartbeat_at=>"2024-07-24 20:45:56.381719", :limit=>50}, :allocations=>9, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.473909 D [Tio-worker-1] (1.148ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>251}, :allocations=>8, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.477218 D [Tio-worker-1] (1.161ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>44, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.479746 D [Tio-worker-1] (0.609ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"supervisor_id\" = $1", :binds=>{:supervisor_id=>251}, :allocations=>7, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.481572 D [Tio-worker-1] (1.059ms) ActiveRecord -- SolidQueue::Process Destroy -- {:sql=>"DELETE FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"id\" = $1", :binds=>{:id=>251}, :allocations=>46, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.483263 D [Tio-worker-1] (0.831ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>251}, :allocations=>46, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.484836 D [Tio-worker-1] (0.616ms) ActiveRecord -- SolidQueue::ClaimedExecution Load -- {:sql=>"SELECT \"solid_queue_claimed_executions\".* FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>251}, :allocations=>6, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.485096 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Release claimed jobs (3.0ms) size: 0
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.486440 D [Tio-worker-1] (1.221ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>67, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.489142 I [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Deregister Supervisor (12.3ms) process_id: 251, pid: 1, hostname: "staging-docker-worker-3:gate_worker_1", last_heartbeat_at: Wed, 24 Jul 2024 22:41:33.118215000 CEST +02:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 251, kind: "Supervisor", last_heartbeat_at: "2024-07-24 22:41:33.118215000 +0200", supervisor_id: nil, pid: 1, hostname: "staging-docker-worker-3:gate_worker_1", metadata: {}, created_at: "2024-07-24 12:49:27.575597000 +0200">, :pruned=>true, :claimed_size=>0}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.489220 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Prune dead processes (107.8ms) size: 1
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.491929 D [Tio-worker-1] (1.414ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>85, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.493298 D [Tio-worker-1] (0.686ms) ActiveRecord -- SolidQueue::ClaimedExecution Load -- {:sql=>"SELECT \"solid_queue_claimed_executions\".* FROM \"solid_queue_claimed_executions\" LEFT OUTER JOIN \"solid_queue_processes\" ON \"solid_queue_processes\".\"id\" = \"solid_queue_claimed_executions\".\"process_id\" WHERE \"solid_queue_processes\".\"id\" IS NULL", :allocations=>42, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.493751 D [Tio-worker-1] SolidQueue -- SolidQueue-0.3.4 Release claimed jobs (3.9ms) size: 0
I'm gonna keep restarting to see if I can get the exception to trigger again.
edit:
Managed to trigger it again by killing the container:
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | 2024-07-24 23:39:23.094078 D [Tio-worker-1] (1.658ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | 2024-07-24 23:40:23.103831 D [Tio-worker-1] (1.196ms) ActiveRecord -- TRANSACTION -- {:sql=>"BEGIN", :allocations=>6, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | 2024-07-24 23:40:23.108470 D [Tio-worker-1] (5.944ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>{:last_heartbeat_at=>"2024-07-24 21:40:23.101274", :id=>269}, :allocations=>83, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | 2024-07-24 23:40:23.110399 D [Tio-worker-1] (1.582ms) ActiveRecord -- TRANSACTION -- {:sql=>"COMMIT", :allocations=>48, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | 2024-07-24 23:40:50.970872 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (412.3ms) supervisor_pid: 1, supervised_pids: [124, 136]
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | bin/rails aborted!
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | 2024-07-24 23:40:51.171604 D [T5600] (81.7ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>269}, :allocations=>123, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | ActiveRecord::ConnectionFailed: PQconsumeInput() server closed the connection unexpectedly (ActiveRecord::ConnectionFailed)
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | This probably means the server terminated abnormally
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | before or while processing the request.
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | Caused by:
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | PG::ConnectionBad: PQconsumeInput() server closed the connection unexpectedly (PG::ConnectionBad)
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | This probably means the server terminated abnormally
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | before or while processing the request.
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | Tasks: TOP => solid_queue:start
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | (See full trace by running task with --trace)
Weird how it's fetching claimed executions after saying that the supervisor was terminated gracefully, right?
Also that debug entry with the query happening while the exception is being logged suggests it maybe comes from another process?
Might this be because the supervisor is actually pid 1? I will try to run this container with --init and see if that does anything
edit:
It seems adding init: true to our compose file got rid of the warning we were getting on startup every time:
W [T5600] SolidQueue -- SolidQueue-0.3.4 Tried to replace forked process but it had already died (0.0ms)
Will need to wait and see if it also gets rid of the exception
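For reference, the Compose settings being discussed would look roughly like this (both keys are real Compose options; the service name and the 20s value are just placeholders standing in for our setup):

```yaml
services:
  gate_worker:              # hypothetical service name
    init: true              # run a tiny init as PID 1, so the supervisor isn't PID 1
    stop_grace_period: 20s  # time docker waits between SIGTERM and the final SIGKILL
```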
Thanks a lot, @InteNs, that's very helpful 🙏
Weird how it's fetching claimed executions after saying that the supervisor was terminated gracefully, right?
I think that is the query that calculates, for logging purposes, how many claimed executions processes have when they're terminated, after completing the shutdown 🤔
what is weird about these logs is that after running for 5 minutes the supervisor gets killed and replaced by another process or something?
I think I know what's going on. Looks like the supervisor that's apparently getting killed there is not the current supervisor. It's the old one, and only its record in the DB, that's staying around from a previous run, and that gets cleaned up when its heartbeat expires 🤔 Looking at the IDs in the queries:
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:49:55.005151 D [Tio-worker-1] (3.673ms) ActiveRecord -- SolidQueue::Process Update -- {:sql=>"UPDATE \"solid_queue_processes\" SET \"last_heartbeat_at\" = $1 WHERE \"solid_queue_processes\".\"id\" = $2", :binds=>
{:last_heartbeat_at=>"2024-07-24 20:49:54.998115", :id=>260}, :allocations=>83, :cached=>nil}
This is the heartbeat for process with ID 260, which is the current supervisor when these logs are happening. Then, we see the following two queries:
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.479746 D [Tio-worker-1] (0.609ms) ActiveRecord -- SolidQueue::Process Load -- {:sql=>"SELECT \"solid_queue_processes\".* FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"supervisor_id\" = $1",
:binds=>{:supervisor_id=>251}, :allocations=>7, :cached=>nil}
gate_worker.1.nh2vfzy39024ri9u9pgjhyu6r@staging-docker-worker-3 | 2024-07-24 22:50:56.481572 D [Tio-worker-1] (1.059ms) ActiveRecord -- SolidQueue::Process Destroy -- {:sql=>"DELETE FROM \"solid_queue_processes\" WHERE \"solid_queue_processes\".\"id\" = $1",
:binds=>{:id=>251}, :allocations=>46, :cached=>nil}
There we have ID 251 for the process that's being deleted, and supervisor ID 251 for the SELECT query that tries to fetch its supervised processes to deregister them as well (looks like there aren't any, otherwise we'd see the DELETE queries for them as well).
This tells us that the supervisor from the previous run failed to be correctly deregistered. I suspect this must be because of the error you're getting precisely while logging around deregistering 🤔
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | 2024-07-24 23:40:50.970872 I [T5600] SolidQueue -- SolidQueue-0.3.4 Supervisor terminated gracefully (412.3ms) supervisor_pid: 1, supervised_pids: [124, 136]
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | bin/rails aborted!
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | 2024-07-24 23:40:51.171604 D [T5600] (81.7ms) ActiveRecord -- SolidQueue::ClaimedExecution Count -- {:sql=>"SELECT COUNT(*) FROM \"solid_queue_claimed_executions\" WHERE \"solid_queue_claimed_executions\".\"process_id\" = $1", :binds=>{:process_id=>269}, :allocations=>123, :cached=>nil}
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | ActiveRecord::ConnectionFailed: PQconsumeInput() server closed the connection unexpectedly (ActiveRecord::ConnectionFailed)
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | This probably means the server terminated abnormally
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 | before or while processing the request.
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 |
gate_worker.1.99rt9ry9fomtn5s1l62jcmqrc@staging-docker-worker-3 |
So it looks like the connection is closed before the deregistering takes place, hence the error and the supervisor record hanging around... I'll see how to improve this; it's actually quite tight to do the deregistration after the shutdown if the shutdown uses up the whole shutdown timeout.
May I ask how the restart of your containers happens? Does the running process get sent a TERM signal and then some time to wait before being killed? That might be playing a role here too, if that time is shorter than SolidQueue.shutdown_timeout.
May I ask how the restart of your containers happens?
sure! we use the default docker swarm settings, so that would be SIGTERM followed by a SIGKILL after 10 seconds.
SolidQueue has a shutdown_timeout of 5 seconds, so I would assume there is plenty of time?
I will increase both the solid queue and docker timeouts to see if it helps.
Also can confirm that adding init: true (thus not running the supervisor as pid 1) did not solve the exception, only the warning at startup.
@rosa can the cause possibly be that when the supervisor kills its forked processes, those processes terminate the database socket connection that is shared between the different processes? or does each fork get a separate database connection?
https://stackoverflow.com/questions/13089875/fork-ruby-activerecord-and-file-descriptors-on-fork
https://stackoverflow.com/questions/27892975/what-causes-activerecord-breaking-postgres-connection-after-forking
I'm getting the same error when my console is idle for too long and postgres kills the connection
sure! we use the default docker swarm settings, so that would be SIGTERM followed by a SIGKILL after 10 seconds.
SolidQueue has a shutdown_timeout of 5 seconds, so I would assume there is plenty of time?
Yeah, definitely, that should work fine. In my tests of this I used 2 seconds as shutdown_timeout and that was fine, so it seems this should work. Perhaps something takes longer here...
those processes terminate the database socket connection that is shared between the different processes? or does each fork get a separate database connection?
https://stackoverflow.com/questions/13089875/fork-ruby-activerecord-and-file-descriptors-on-fork
https://stackoverflow.com/questions/27892975/what-causes-activerecord-breaking-postgres-connection-after-forking
Ohhh, good question! I think from Rails 5.2+, connections across forks are seamlessly handled by Rails, so it shouldn't be the case. I'm going to add some tests for this, in any case, where forked processes get terminated independently from the supervisor, to confirm this works fine.
I imagine this is not the case, but is there anything else external you might be running that could be closing all PostgreSQL connections from your job containers when there's a restart?
I'll keep digging and in any case, going to see if I can improve the shutdown + deregistering process.
Ok, I've got some changes ready around shutdown + deregistering, that will go out as part of a larger change. However, I realised something... in this comment you mentioned:
Managed to trigger it again by killing the container:
How is that "killing" done? From the logs, it seems a SIGTERM was sent, but was a SIGKILL also sent right after, without any wait? Or did this follow the "SIGTERM followed by a SIGKILL after 10 seconds" setup?
@InteNs thanks so much for doing these tests! I should have them ready early next week.
Hey @InteNs, I opened #269, which includes a bunch of changes on running the supervisor, if you want to try them. The main change is unrelated to this, but I refactored the supervisor's cycle quite a bit. There's a breaking change that might not affect you, depending on how you're running Solid Queue. It's described here.
This also reminded me that I wanted to ask how you were running Solid Queue: is it via Docker's CMD? And if so, what's the command? Is it just bin/rails solid_queue:start? Wondering if there's something there I'm missing.