Comments (37)
We are also seeing this sometimes, after killing the worker in dev. Something with the transaction seems to not be working as expected..
from solid_queue.
@rosa You're a trooper. Thank you. I'll try it out and let you know.
from solid_queue.
We are seeing this same behavior pretty consistently - Jobs will get stuck in processing that are assigned to a worker that no longer exists, completely restarting does not correct the job to worker assignments.
from solid_queue.
Hey, just a mention that we see similar issue on production k8s. The jobs where in "in_progress" and assigned to the worker that didn't exist. Manual release
worked in this case.
solid_queue (0.3.1)
ruby 3.3.1
from solid_queue.
When developing, this might be caused by workers dying when your mac hibernates and resumes. I noticed when coming back to my mac this morning, the workers tab had a strange state of 1 / 0
in the count. My guess is that the worker was killed but since it happen when the mac was hibernated, it did not run any lifecycle events allowing solid queue to correct the issue. Restarting solid_queue prior to queuing any new jobs corrects the issue.
Again, just an idea to what is possibly happening.
from solid_queue.
Thanks @wflanagan!
when ANY worker restarts, does it clean up any claimed executions that are now missing and outside of the heartbeat window? Is there some sort of assumption that a worker has some sort of static logic, and the "rebooted" worker would pick up the same work?
No, not really. When a worker is deleted from the DB, it releases its claimed executions. A rebooted worker will have a different ID and will know nothing about other executions. The piece I'm missing here, basically, is how a worker record can get deleted without releasing its executions as well, and what I haven't been able to reproduce in any way. The Process
model in Solid Queue (which represents workers and dispatchers) has this callback:
after_destroy -> { claimed_executions.release_all }
So, both releasing claimed executions and deleting the worker should happen within a transaction, which is why I haven't been able to reproduce what's happening here: a worker (process) getting deleted, leaving all executions behind.
A way this could happen is that the worker gets deregistered from the DB while some executions are still being claimed, in a way that claimed_executions
in the callback is empty because the claiming is happening in a parallel transaction that hasn't been committed yet. But this would mean the record in the DB is being deleted before the process is actually shut down, which can only happen if you delete these manually somehow, because we'd only deregister the process (delete the DB record) if:
- The heartbeat hasn't been sent for over 5 minutes, which means claiming executions is impossible, or
- Termination is requested, in which case the worker first stops picking new jobs, finishes what it's currently running, and then deregister the process record.
In both cases executions can't be claimed while the process DB record is being deleted.
I'm going to make some changes regardless to see if they help with this scenario, but it's tricky without being able to reproduce it or having an explanation why this is happening. If someone could find some logs for the missing workers being deleted, that'd shed some light for sure.
from solid_queue.
Hey @wflanagan, we already have support for this situation that you describe:
If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.
When you start a new supervisor, or simply periodically, the supervisor checks for processes that have died and deregister them. Deregistering a process releases its claimed executions. Also, claimed executions have their own release
method. Yours doesn't work because the job is already in the system and you can't just enqueue it again.
You'd need to investigate why workers are getting deleted from the solid_queue_processes
table without releasing their claimed executions. If done through the regular flow, that happens in a transaction, so there's no way they can leave the executions behind. Something else might be going on there.
from solid_queue.
@rosa Hey thanks for the response. So, we likely don't get a SigINT or other signal before it just terminates. This is because the node or the process has violated some constraint and it will just kill the process and respawn it. The link you point to seems to look for a clean termination. When you're running on, for example, Spot instances on AWS, you don't get any sort of clean signal. The process is just gone.
Are you saying it cleans up in this situation when there's not an opportunity for a graceful shutdown?
from solid_queue.
Are you saying it cleans up in this situation when there's not an opportunity for a graceful shutdown?
Yes, exactly that, that's the exact situation I designed for, and this is what I copied from the link I provided, where both graceful and not clean termination are described, also in my comment:
If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.
When you start a new supervisor, or simply periodically, the supervisor checks for processes that have died and deregister them. Deregistering a process releases its claimed executions. Also, claimed executions have their own release
method. Yours doesn't work because the job is already in the system and you can't just enqueue it again.
You'd need to investigate why workers are getting deleted from the solid_queue_processes
table without releasing their claimed executions. If done through the regular flow, that happens in a transaction, so there's no way they can leave the executions behind. Something else might be going on there.
from solid_queue.
Ok. Maybe it was the version. I had "running jobs" for 18 days that were connected to workers that were no longer present. That's what got me going on this. How often are they "reassigned"?
Other than my code to query to find the working/missing ones, we're running a VERY stock SolidQueue setup. 1 queue, expecting it to "just work." I'll rewatch is all and see if it resolves.
from solid_queue.
Ok. Maybe it was the version.
This has been like this from the first version 😕
I had "running jobs" for 18 days that were connected to workers that were no longer present.
This is what seems wrong, and that I think happened because of something external. For Solid Queue, it's not possible to delete workers without deleting their assigned jobs because that's done in a transaction, via an after_destroy
callback. If workers are deleted via destroy
, no jobs would be left assigned. My guess is that you might have deleted the workers manually, leaving the claimed executions orphaned 🤔
from solid_queue.
So, I'll throw this out here. I don't know what it would be. But, the fact is that we get stranded jobs. We're running a pretty stock Postgresql.
And when i click one of these workers, I get the following in the UI.
from solid_queue.
@wflanagan, so, for Solid Queue, it's not possible to delete workers without deleting their assigned jobs because that's done in a transaction, via an after_destroy
callback. If workers are deleted via destroy
, no jobs would be left assigned. Is it possible for you to investigate what happened with worker with ID 594396? I imagine not, since it's been quite a while.
Also, if you want to clear things up and wait until this happens again so you have logs and can investigate, you can do that in your Rails console as:
SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release)
It's going to take a bit because you have so many, but it should release all those jobs.
Alternatively, if you have more recent orphaned jobs like this, you might be able to find logs for those 🤔
from solid_queue.
from solid_queue.
Any cloud infrastructure could/would have something like this happen. It's a pretty typical scenario.
Yes, it can also happen in our case, and why we handle this via heartbeats and having the supervisor clean up both when it starts and periodically.
But, you're saying this is supposed to be cleaned up?
Yes.
If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.
The processes are being deleted, at least, because you don't have them, so I'd like to know how processes are being deleted for you without their claimed executions not being released as well. The process + its claimed executions get deleted together in a DB transaction, so something must be interfering with that.
from solid_queue.
First, let me say i appreciate your efforts on this. I know it can be thankless. I used to work on the Cequel gem.
The processes are being deleted, at least, because you don't have them, so I'd like to know how processes are being deleted for you without their claimed executions not being released as well. The process + its claimed executions get deleted together in a DB transaction, so something must be interfering with that.
I don't quite understand what you're asking. We have a node running in our infrastructure that is running a worker process. That worker process is killed via the equivalent of kill -9 ${PID}
. It's immediately terminated. Whatever is set in the DB table is still present.
What I understand is that the process 'record' and the execution are together in a transaction. But, the process that is executing that transaction just disappears. I am not a DB expert. But, what would roll this record back? There's nothing to do it. The process that was running it was killed.
The entire process in the worker that runs is run with bundle exec solidqueue:start
. This entire process gets killed. So, wouldn't that kill the 5 threads and the dispatcher?
FYI, we aren't specifically deleting any jobs or any records in the other tables.
One other thought. We are going through PgBouncer. But, don't know why that would be involved.
from solid_queue.
Thank you!
The entire process in the worker that runs is run with bundle exec solidqueue:start. This entire process gets killed. So, wouldn't that kill the 5 threads and the dispatcher?
Yes, that's right. The process would be killed, and the DB records that correspond to these processes, that is, the rows in the solid_queue_processes
table would remain. Then, when you start the supervisor again via solid_queue:start
, this code would run, which in turn, runs this, and then this. Ultimately, that would run:
solid_queue/app/models/solid_queue/process.rb
Lines 22 to 27 in 108aea9
And
So, whenever a process record gets destroyed (independently of when the actual, real process was killed), its associated claimed executions would get deleted too, as part of the same DB transaction. In your case, you have process records that have been somehow deleted but you still have claimed executions associated to these records. I'm not sure how this can happen, and have never seen it happen, so this is what I'm trying to figure out and wondering if you had any logs or any way to find out how the missing process records were deleted in the first place.
from solid_queue.
We have PG logs, which COULD be helpful or not. Tracking down the specific logs might be challenging though.
I'll see if I can dig something up.
from solid_queue.
I believe this is basically what we are experiencing as well @arjun810
from solid_queue.
We've never encountered this, and I haven't been able to reproduce it, so I'm a bit lost on what might be happening. If anyone could grab DB logs from the moment the worker gets deleted, that'd be really useful to find and fix the root cause.
from solid_queue.
Thanks a lot @dan003400! That's indeed likely, but in that case, the worker record in the solid_queue_processes
table should have remained as well, not just the claimed jobs 😕
I'm trying to reproduce this by doing the following, using this job. First, I enqueue 10 of these, so that they take 1 minute when being performed:
>> 10.times { StoreResultJob.perform_later(42, pause: 60.seconds) }
Then I start Solid Queue with this config:
workers:
- queues: background
threads: 3
- queues: default
threads: 5
4 bin/rails solid_queue:start
SolidQueue-0.3.1 Register Supervisor (36.0ms) pid: 63970, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Prune dead processes (19.8ms) size: 0
SolidQueue-0.3.1 Register Dispatcher (28.7ms) pid: 63985, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Dispatcher (29.7ms) pid: 63985, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil
SolidQueue-0.3.1 Register Worker (28.6ms) pid: 63986, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (28.7ms) pid: 63987, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.3ms) pid: 63986, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Started Worker (29.4ms) pid: 63987, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Unblock jobs (2.0ms) limit: 500, size: 0
And now I kill the workers with a SIGKILL
, so there's no chance to do any cleanup, and they're restarted by the supervisor:
$ kill -9 63986
$ kill -9 63987
SolidQueue-0.3.1 Replaced terminated Worker (5.6ms) pid: 63986, status: "no exit status set", pid_from_status: 63986, signaled: true, stopsig: nil, termsig: 9, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (58.1ms) pid: 66314, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (60.2ms) pid: 66314, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Replaced terminated Worker (1.1ms) pid: 63987, status: "no exit status set", pid_from_status: 63987, signaled: true, stopsig: nil, termsig: 9, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (39.1ms) pid: 66653, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (39.9ms) pid: 66653, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
Let's check claimed jobs and registered workers:
>> SolidQueue::ClaimedExecution.count
SolidQueue::ClaimedExecution Count (4.1ms) SELECT COUNT(*) FROM `solid_queue_claimed_executions`
=> 4
>> SolidQueue::ClaimedExecution.all.all? { |c| c.process.present? }
SolidQueue::ClaimedExecution Load (2.3ms) SELECT `solid_queue_claimed_executions`.* FROM `solid_queue_claimed_executions`
SolidQueue::Process Load (1.4ms) SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
SolidQueue::Process Load (1.4ms) SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
SolidQueue::Process Load (0.6ms) SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
SolidQueue::Process Load (0.5ms) SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 9 LIMIT 1
=> true
As expected, all claimed executions have a registered process. Note that 3 of them are still claimed by one of the killed workers (process ID 7):
=>
#<SolidQueue::Process:0x000000011ea90640
id: 7,
kind: "Worker",
last_heartbeat_at: Wed, 29 May 2024 19:04:05.633951000 UTC +00:00,
supervisor_id: 5,
pid: 63986,
hostname: "Rosas-MacBook-Air.local",
metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3},
created_at: Wed, 29 May 2024 19:03:05.603323000 UTC +00:00>
Now, let's kill the supervisor by sending a SIGKILL
signal:
$ kill -9 63970
Workers and dispatcher realise the supervisor is gone, so they shut down orderly and clean up:
SolidQueue-0.3.1 Deregister Worker (10.3ms) process_id: 9, pid: 66314, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:07:26.976856000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 9, kind: "Worker", last_heartbeat_at: "2024-05-29 19:07:26.976856000 +0000", supervisor_id: 5, pid: 66314, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3}, created_at: "2024-05-29 19:04:26.869299000 +0000">, :pruned=>false, :claimed_size=>0}
SolidQueue-0.3.1 Shut down Worker (22.3ms) pid: 66314, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Release claimed jobs (3.3ms) size: 0
rosas-air 3.3.1 ~/Work/basecamp/solid_queue/test/dummy main *= SolidQueue-0.3.1 Deregister Worker (9.1ms) process_id: 10, pid: 66653, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:07:35.982878000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 10, kind: "Worker", last_heartbeat_at: "2024-05-29 19:07:35.982878000 +0000", supervisor_id: 5, pid: 66653, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"default", "thread_pool_size"=>5}, created_at: "2024-05-29 19:04:35.894628000 +0000">, :pruned=>false, :claimed_size=>0}
SolidQueue-0.3.1 Shut down Worker (25.3ms) pid: 66653, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Release claimed jobs (9.0ms) size: 0
SolidQueue-0.3.1 Deregister Dispatcher (25.4ms) process_id: 6, pid: 63985, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:08:05.709912000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 6, kind: "Dispatcher", last_heartbeat_at: "2024-05-29 19:08:05.709912000 +0000", supervisor_id: 5, pid: 63985, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>1, "batch_size"=>500, "concurrency_maintenance_interval"=>600}, created_at: "2024-05-29 19:03:05.602736000 +0000">, :pruned=>false, :claimed_size=>0}
SolidQueue-0.3.1 Shut down Dispatcher (49.1ms) pid: 63985, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil
We still have the 3 claimed executions by the worker that I killed individually:
>> SolidQueue::ClaimedExecution.count
SolidQueue::ClaimedExecution Count (2.9ms) SELECT COUNT(*) FROM `solid_queue_claimed_executions`
=> 3
But the worker record is still there as expected:
>> SolidQueue::ClaimedExecution.first.process
SolidQueue::ClaimedExecution Load (3.6ms) SELECT `solid_queue_claimed_executions`.* FROM `solid_queue_claimed_executions` ORDER BY `solid_queue_claimed_executions`.`id` ASC LIMIT 1
SolidQueue::Process Load (1.3ms) SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
=>
#<SolidQueue::Process:0x000000011ea9e240
id: 7,
kind: "Worker",
last_heartbeat_at: Wed, 29 May 2024 19:04:05.633951000 UTC +00:00,
supervisor_id: 5,
pid: 63986,
hostname: "Rosas-MacBook-Air.local",
metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3},
created_at: Wed, 29 May 2024 19:03:05.603323000 UTC +00:00>
Now, if I start the supervisor again, one of the first thing it does is to run the process prune, releasing claimed executions and deregistering the process with ID 7, whose heartbeat has expired already:
$ bin/rails solid_queue:start
SolidQueue-0.3.1 Register Supervisor (39.7ms) pid: 78841, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (28.8ms) pid: 78857, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.5ms) pid: 78857, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Register Worker (29.0ms) pid: 78858, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Dispatcher (29.8ms) pid: 78856, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.9ms) pid: 78858, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Started Dispatcher (31.0ms) pid: 78856, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil
SolidQueue-0.3.1 Unblock jobs (3.0ms) limit: 500, size: 0
SolidQueue-0.3.1 Release claimed job (18.7ms) job_id: 4, process_id: 7
SolidQueue-0.3.1 Release claimed job (15.0ms) job_id: 5, process_id: 7
SolidQueue-0.3.1 Release claimed job (6.6ms) job_id: 6, process_id: 7
SolidQueue-0.3.1 Release claimed jobs (63.9ms) size: 3
SolidQueue-0.3.1 Deregister Worker (70.2ms) process_id: 7, pid: 63986, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:04:05.633951000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 7, kind: "Worker", last_heartbeat_at: "2024-05-29 19:04:05.633951000 +0000", supervisor_id: 5, pid: 63986, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3}, created_at: "2024-05-29 19:03:05.603323000 +0000">, :pruned=>true, :claimed_size=>3}
SolidQueue-0.3.1 Release claimed jobs (1.5ms) size: 0
SolidQueue-0.3.1 Deregister Worker (5.4ms) process_id: 8, pid: 63987, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:04:05.634434000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 8, kind: "Worker", last_heartbeat_at: "2024-05-29 19:04:05.634434000 +0000", supervisor_id: 5, pid: 63987, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"default", "thread_pool_size"=>5}, created_at: "2024-05-29 19:03:05.602737000 +0000">, :pruned=>true, :claimed_size=>0}
SolidQueue-0.3.1 Prune dead processes (111.3ms) size: 2
Relevant lines:
SolidQueue-0.3.1 Release claimed job (18.7ms) job_id: 4, process_id: 7
SolidQueue-0.3.1 Release claimed job (15.0ms) job_id: 5, process_id: 7
SolidQueue-0.3.1 Release claimed job (6.6ms) job_id: 6, process_id: 7
SolidQueue-0.3.1 Release claimed jobs (63.9ms) size: 3
SolidQueue-0.3.1 Prune dead processes (111.3ms) size: 2
I can't manage to leave solid_queue_claimed_executions
without a corresponding solid_queue_processes
record associated 😕
from solid_queue.
When coming back to my mac today after the mac was hibernated all night.
There is a job stuck in progress assigned to Worker 304
, when you click on the lists of Workers, the only worker present is Worker 339 PID: 10361
- restarting solid queue does not correct this issue and the job will never get assigned the new worker 339 - it will stay in progress for ever with its worker 304 assignment.
The only way I have been able to correct this in the past is by deleting the rows in solid_queue_claimed_executions
table.
from solid_queue.
@dan003400, no clue/logs about how worker 304 got deleted? 😕 Is it possible you have that in your log/development.log
file? You could search for DELETE
queries to the solid_queue_processes
table, to see if you spot when the worker got deleted.
from solid_queue.
When developing, this might be caused by workers dying when your mac hibernates and resumes. I noticed when coming back to my mac this morning, the workers tab had a strange state of
1 / 0
in the count. My guess is that the worker was killed but since it happen when the mac was hibernated, it did not run any lifecycle events allowing solid queue to correct the issue. Restarting solid_queue prior to queuing any new jobs corrects the issue.Again, just an idea to what is possibly happening.
This is running in production on Kubernetes. So, it's not a machine hibernating. Good thought though.
from solid_queue.
@rosa I am going to private message you a video in our process. I haven't done anything to clear them up yet, as I am hoping i can get some feedback and advice that can help us troubleshoot this case.
from solid_queue.
@rosa related to this, i'm wondering, when ANY worker restarts, does it clean up any claimed executions that are now missing and outside of the heartbeat window? Is there some sort of assumption that a worker has some sort of static logic, and the "rebooted" worker would pick up the same work?
Just brainstorming why it might be happening.
Because, if any worker cleans up any old execution records, then this shouldn't be happening.
from solid_queue.
I think #244 should help with this.
from solid_queue.
@rosa I pointed at this repo and updated production. Two things.
- The existing claimed executions that were previously there are still present, unfortunately. So, at least in terms of cleaning up, that didn't seem to work.
- I tried the command
SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release)
per your statement back in April. This actually throws an error which I'll list below.
SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release)
irb(main):002>
(irb):1:in `<main>': undefined method `id' for nil:NilClass (NoMethodError)
SolidQueue.instrument(:release_claimed, job_id: job.id, process_id: process_id) do
So, It seems that the job.id isn't found. So, it can't release the process.
from solid_queue.
I did more research. So, what's happening in my system is this.
The worker process is actually still there, but it's "idling." In kubernetes, the pod was present. I went into the pod and found there was no work being done, that it was idle.
When i killed the pod, it immediately released all the in process jobs that were outstanding. So, this works. I guess the question becomes, and maybe it's related to my own infrastructure, is why the process is still present and waiting, showing no activity, but the job doesn't perceive itself as completed?
I'm still working on it, but i wanted to give you the update.
from solid_queue.
So, It seems that the job.id isn't found. So, it can't release the process.
Hmmm... this doesn't seem possible, because there's a foreign key:
add_foreign_key "solid_queue_claimed_executions", "solid_queue_jobs", column: "job_id", on_delete: :cascade
so you can't have a row in solid_queue_claimed_executions
table that points to a missing record in solid_queue_jobs
. I assume that this is the reason that the new branch's cleanup didn't work, it must have failed in the same way as you running this manually, but how you got to this state seems impossible from the DB point of view. Unless you've perhaps skipped the foreign keys?
from solid_queue.
The foreign key is in the schema.
from solid_queue.
@wflanagan, and does this return any records for you? 😕
SolidQueue::ClaimedExecution.where.missing(:job)
from solid_queue.
@rosa No. Empty array.
from solid_queue.
Hi,
I have a similar problem to the one described above.
My worker gets shut down but the job gets stuck "In Progress".
This happens when I run a job that runs for a while (1 minute++) and it does a lot of sql and api calls.
This is the terminal output when it shuts down:
10:20:57 web.1 | SolidQueue-0.6.0 Error deregistering Worker (102.1ms) process_id: 370, pid: 86548, hostname: "MacBook-Pro.local", name: "worker-65fdf60c40030f3e9ac0", last_heartbeat_at: "2024-09-02T08:19:55Z", claimed_size: nil, pruned: false, error: "ActiveRecord::ConnectionTimeoutError could not obtain a connection from the pool within 0.100 seconds (waited 0.101 seconds); all pooled connections were in use" 10:20:57 web.1 | SolidQueue-0.6.0 Shutdown Worker (5104.3ms) pid: 86548, hostname: "MacBook-Pro.local", process_id: 370, name: "worker-65fdf60c40030f3e9ac0", polling_interval: 0.1, queues: "*", thread_pool_size: 1 10:20:57 web.1 | [ActiveJob] [FeedbackImports::NeedsFeedbackJob] [78cca2cc-3484-4ca8-a622-7ebdb46e2011] Performed FeedbackImports::NeedsFeedbackJob (Job ID: 78cca2cc-3484-4ca8-a622-7ebdb46e2011) from SolidQueue(default) in 12434.0ms 10:20:57 web.1 | /Users/.../.rvm/gems/ruby-3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:128:in
block in wait_poll': could not obtain a connection from the pool within 0.100 seconds (waited 0.101 seconds); all pooled connections were in use (ActiveRecord::ConnectionTimeoutError)`
SolidQueue::ClaimedExecution.where.missing(:job)
Does not return anything.
Worker with id '370' not found
If I restart my application the job gets picked up again, and runs til finished or shutdown again.
I have tried tweaking Thread and polling intervals for the solid_queue setup as well as thread count in my database.yml file without any noticeable effect .
Hope this somewhat helps with the problem over. I can provide more information if needed.
from solid_queue.
@espenbff, thanks for writing this up 🙏 what's the size of your database connection pool? You'll find this in your database.yml
configuration, as the pool
option.
from solid_queue.
It is set to 5 pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>.
I tried one run with value 20 as well but I got the same error.
default: &default
adapter: postgresql
encoding: unicode
pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 20 } %>
development:
<<: *default
database: name_development
I see I overwrite it with 5 anyway. Will do a new test with 20.
from solid_queue.
Got it! Yes, it'll depend on how you have your workers configured as well (the number of threads you're using).
from solid_queue.
Related Issues (20)
- Support for instrumenting polling? HOT 3
- unable to retry on SolidQueue::ProcessExitError HOT 2
- SolidQueue::ProcessExitError: HOT 3
- Existing migration not updated by bin/rails generate solid_queue:install HOT 5
- New queue_schema.rb file hardcodes Rails version HOT 6
- SQLite queue database corruption HOT 69
- `ActiveRecord::MismatchedForeignKey` in new migration file HOT 3
- Upgrading from 0.7 to 0.8 HOT 1
- Same database for queue HOT 3
- Running `rails db:migrate` clears out `queue_schema.rb`. HOT 32
- Need advice on Solid Queue's memory usage HOT 2
- No connection pool for 'SolidQueue::Record' found HOT 4
- Best way to play around with Solid Queue in development? HOT 1
- SolidQueue::Processes::ProcessExitError HOT 9
- Rethinking concurrency_limits#duration HOT 3
- The `queue` database is not configured for the `production` environment. HOT 2
- Memory Constrained environments HOT 10
- Single database schema maintenance story HOT 2
- Question: How to set process_alive_threshold or a global job timeout on a worker queue type? HOT 2
- Upgrade instructions unclear / Single database discouraged? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from solid_queue.