Git Product home page Git Product logo

Comments (37)

arjun810 avatar arjun810 commented on September 14, 2024 3

We are also seeing this sometimes, after killing the worker in dev. Something with the transaction seems to not be working as expected..

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024 2

@rosa You're a trooper. Thank you. I'll try it out and let you know.

from solid_queue.

danwetherald avatar danwetherald commented on September 14, 2024 1

We are seeing this same behavior pretty consistently - Jobs will get stuck in processing that are assigned to a worker that no longer exists, completely restarting does not correct the job to worker assignments.

from solid_queue.

morgoth avatar morgoth commented on September 14, 2024 1

Hey, just a mention that we see similar issue on production k8s. The jobs where in "in_progress" and assigned to the worker that didn't exist. Manual release worked in this case.
solid_queue (0.3.1)
ruby 3.3.1

from solid_queue.

danwetherald avatar danwetherald commented on September 14, 2024 1

When developing, this might be caused by workers dying when your mac hibernates and resumes. I noticed when coming back to my mac this morning, the workers tab had a strange state of 1 / 0 in the count. My guess is that the worker was killed but since it happen when the mac was hibernated, it did not run any lifecycle events allowing solid queue to correct the issue. Restarting solid_queue prior to queuing any new jobs corrects the issue.

Again, just an idea to what is possibly happening.

from solid_queue.

rosa avatar rosa commented on September 14, 2024 1

Thanks @wflanagan!

when ANY worker restarts, does it clean up any claimed executions that are now missing and outside of the heartbeat window? Is there some sort of assumption that a worker has some sort of static logic, and the "rebooted" worker would pick up the same work?

No, not really. When a worker is deleted from the DB, it releases its claimed executions. A rebooted worker will have a different ID and will know nothing about other executions. The piece I'm missing here, basically, is how a worker record can get deleted without releasing its executions as well, and what I haven't been able to reproduce in any way. The Process model in Solid Queue (which represents workers and dispatchers) has this callback:

after_destroy -> { claimed_executions.release_all }

So, both releasing claimed executions and deleting the worker should happen within a transaction, which is why I haven't been able to reproduce what's happening here: a worker (process) getting deleted, leaving all executions behind.

A way this could happen is that the worker gets deregistered from the DB while some executions are still being claimed, in a way that claimed_executions in the callback is empty because the claiming is happening in a parallel transaction that hasn't been committed yet. But this would mean the record in the DB is being deleted before the process is actually shut down, which can only happen if you delete these manually somehow, because we'd only deregister the process (delete the DB record) if:

  • The heartbeat hasn't been sent for over 5 minutes, which means claiming executions is impossible, or
  • Termination is requested, in which case the worker first stops picking new jobs, finishes what it's currently running, and then deregister the process record.
    In both cases executions can't be claimed while the process DB record is being deleted.

I'm going to make some changes regardless to see if they help with this scenario, but it's tricky without being able to reproduce it or having an explanation why this is happening. If someone could find some logs for the missing workers being deleted, that'd shed some light for sure.

from solid_queue.

rosa avatar rosa commented on September 14, 2024

Hey @wflanagan, we already have support for this situation that you describe:

If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.

When you start a new supervisor, or simply periodically, the supervisor checks for processes that have died and deregister them. Deregistering a process releases its claimed executions. Also, claimed executions have their own release method. Yours doesn't work because the job is already in the system and you can't just enqueue it again.

You'd need to investigate why workers are getting deleted from the solid_queue_processes table without releasing their claimed executions. If done through the regular flow, that happens in a transaction, so there's no way they can leave the executions behind. Something else might be going on there.

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

@rosa Hey thanks for the response. So, we likely don't get a SigINT or other signal before it just terminates. This is because the node or the process has violated some constraint and it will just kill the process and respawn it. The link you point to seems to look for a clean termination. When you're running on, for example, Spot instances on AWS, you don't get any sort of clean signal. The process is just gone.

Are you saying it cleans up in this situation when there's not an opportunity for a graceful shutdown?

from solid_queue.

rosa avatar rosa commented on September 14, 2024

Are you saying it cleans up in this situation when there's not an opportunity for a graceful shutdown?

Yes, exactly that, that's the exact situation I designed for, and this is what I copied from the link I provided, where both graceful and not clean termination are described, also in my comment:

If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.

When you start a new supervisor, or simply periodically, the supervisor checks for processes that have died and deregister them. Deregistering a process releases its claimed executions. Also, claimed executions have their own release method. Yours doesn't work because the job is already in the system and you can't just enqueue it again.

You'd need to investigate why workers are getting deleted from the solid_queue_processes table without releasing their claimed executions. If done through the regular flow, that happens in a transaction, so there's no way they can leave the executions behind. Something else might be going on there.

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

Ok. Maybe it was the version. I had "running jobs" for 18 days that were connected to workers that were no longer present. That's what got me going on this. How often are they "reassigned"?

Other than my code to query to find the working/missing ones, we're running a VERY stock SolidQueue setup. 1 queue, expecting it to "just work." I'll rewatch is all and see if it resolves.

from solid_queue.

rosa avatar rosa commented on September 14, 2024

Ok. Maybe it was the version.

This has been like this from the first version 😕

I had "running jobs" for 18 days that were connected to workers that were no longer present.

This is what seems wrong, and that I think happened because of something external. For Solid Queue, it's not possible to delete workers without deleting their assigned jobs because that's done in a transaction, via an after_destroy callback. If workers are deleted via destroy, no jobs would be left assigned. My guess is that you might have deleted the workers manually, leaving the claimed executions orphaned 🤔

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

So, I'll throw this out here. I don't know what it would be. But, the fact is that we get stranded jobs. We're running a pretty stock Postgresql.

CleanShot 2024-04-05 at 12 27 03@2x

And when i click one of these workers, I get the following in the UI.

CleanShot 2024-04-05 at 12 30 34@2x

from solid_queue.

rosa avatar rosa commented on September 14, 2024

@wflanagan, so, for Solid Queue, it's not possible to delete workers without deleting their assigned jobs because that's done in a transaction, via an after_destroy callback. If workers are deleted via destroy, no jobs would be left assigned. Is it possible for you to investigate what happened with worker with ID 594396? I imagine not, since it's been quite a while.

Also, if you want to clear things up and wait until this happens again so you have logs and can investigate, you can do that in your Rails console as:

SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release)

It's going to take a bit because you have so many, but it should release all those jobs.

Alternatively, if you have more recent orphaned jobs like this, you might be able to find logs for those 🤔

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

from solid_queue.

rosa avatar rosa commented on September 14, 2024

Any cloud infrastructure could/would have something like this happen. It's a pretty typical scenario.

Yes, it can also happen in our case, and why we handle this via heartbeats and having the supervisor clean up both when it starts and periodically.

But, you're saying this is supposed to be cleaned up?

Yes.

If processes have no chance of cleaning up before exiting (e.g. if someone pulls a cable somewhere), in-flight jobs might remain claimed by the processes executing them. Processes send heartbeats, and the supervisor checks and prunes processes with expired heartbeats, which will release any claimed jobs back to their queues. You can configure both the frequency of heartbeats and the threshold to consider a process dead. See the section below for this.

The processes are being deleted, at least, because you don't have them, so I'd like to know how processes are being deleted for you without their claimed executions not being released as well. The process + its claimed executions get deleted together in a DB transaction, so something must be interfering with that.

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

First, let me say i appreciate your efforts on this. I know it can be thankless. I used to work on the Cequel gem.

The processes are being deleted, at least, because you don't have them, so I'd like to know how processes are being deleted for you without their claimed executions not being released as well. The process + its claimed executions get deleted together in a DB transaction, so something must be interfering with that.

I don't quite understand what you're asking. We have a node running in our infrastructure that is running a worker process. That worker process is killed via the equivalent of kill -9 ${PID}. It's immediately terminated. Whatever is set in the DB table is still present.

What I understand is that the process 'record' and the execution are together in a transaction. But, the process that is executing that transaction just disappears. I am not a DB expert. But, what would roll this record back? There's nothing to do it. The process that was running it was killed.

The entire process in the worker that runs is run with bundle exec solidqueue:start. This entire process gets killed. So, wouldn't that kill the 5 threads and the dispatcher?

FYI, we aren't specifically deleting any jobs or any records in the other tables.

One other thought. We are going through PgBouncer. But, don't know why that would be involved.

from solid_queue.

rosa avatar rosa commented on September 14, 2024

Thank you!

The entire process in the worker that runs is run with bundle exec solidqueue:start. This entire process gets killed. So, wouldn't that kill the 5 threads and the dispatcher?

Yes, that's right. The process would be killed, and the DB records that correspond to these processes, that is, the rows in the solid_queue_processes table would remain. Then, when you start the supervisor again via solid_queue:start, this code would run, which in turn, runs this, and then this. Ultimately, that would run:

def deregister
destroy!
rescue Exception
SolidQueue.logger.error("[SolidQueue] Error deregistering process #{id} - #{metadata}")
raise
end

And
after_destroy -> { claimed_executions.release_all }

So, whenever a process record gets destroyed (independently of when the actual, real process was killed), its associated claimed executions would get deleted too, as part of the same DB transaction. In your case, you have process records that have been somehow deleted but you still have claimed executions associated to these records. I'm not sure how this can happen, and have never seen it happen, so this is what I'm trying to figure out and wondering if you had any logs or any way to find out how the missing process records were deleted in the first place.

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

We have PG logs, which COULD be helpful or not. Tracking down the specific logs might be challenging though.
I'll see if I can dig something up.

from solid_queue.

danwetherald avatar danwetherald commented on September 14, 2024

I believe this is basically what we are experiencing as well @arjun810

from solid_queue.

rosa avatar rosa commented on September 14, 2024

We've never encountered this, and I haven't been able to reproduce it, so I'm a bit lost on what might be happening. If anyone could grab DB logs from the moment the worker gets deleted, that'd be really useful to find and fix the root cause.

from solid_queue.

rosa avatar rosa commented on September 14, 2024

Thanks a lot @dan003400! That's indeed likely, but in that case, the worker record in the solid_queue_processes table should have remained as well, not just the claimed jobs 😕

I'm trying to reproduce this by doing the following, using this job. First, I enqueue 10 of these, so that they take 1 minute when being performed:

>> 10.times { StoreResultJob.perform_later(42, pause: 60.seconds) }

Then I start Solid Queue with this config:

workers:
  - queues: background
    threads: 3
  - queues: default
    threads: 5
4 bin/rails solid_queue:start
SolidQueue-0.3.1 Register Supervisor (36.0ms)  pid: 63970, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Prune dead processes (19.8ms)  size: 0
SolidQueue-0.3.1 Register Dispatcher (28.7ms)  pid: 63985, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Dispatcher (29.7ms)  pid: 63985, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil
SolidQueue-0.3.1 Register Worker (28.6ms)  pid: 63986, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (28.7ms)  pid: 63987, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.3ms)  pid: 63986, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Started Worker (29.4ms)  pid: 63987, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Unblock jobs (2.0ms)  limit: 500, size: 0

And now I kill the workers with a SIGKILL, so there's no chance to do any cleanup, and they're restarted by the supervisor:

$ kill -9 63986
$ kill -9 63987
SolidQueue-0.3.1 Replaced terminated Worker (5.6ms)  pid: 63986, status: "no exit status set", pid_from_status: 63986, signaled: true, stopsig: nil, termsig: 9, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (58.1ms)  pid: 66314, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (60.2ms)  pid: 66314, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Replaced terminated Worker (1.1ms)  pid: 63987, status: "no exit status set", pid_from_status: 63987, signaled: true, stopsig: nil, termsig: 9, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (39.1ms)  pid: 66653, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (39.9ms)  pid: 66653, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5

Let's check claimed jobs and registered workers:

>> SolidQueue::ClaimedExecution.count
  SolidQueue::ClaimedExecution Count (4.1ms)  SELECT COUNT(*) FROM `solid_queue_claimed_executions`
=> 4

>> SolidQueue::ClaimedExecution.all.all? { |c| c.process.present? }
  SolidQueue::ClaimedExecution Load (2.3ms)  SELECT `solid_queue_claimed_executions`.* FROM `solid_queue_claimed_executions`
  SolidQueue::Process Load (1.4ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
  SolidQueue::Process Load (1.4ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
  SolidQueue::Process Load (0.6ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
  SolidQueue::Process Load (0.5ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 9 LIMIT 1
=> true

As expected, all claimed executions have a registered process. Note that 3 of them are still claimed by one of the killed workers (process ID 7):

=>
#<SolidQueue::Process:0x000000011ea90640
 id: 7,
 kind: "Worker",
 last_heartbeat_at: Wed, 29 May 2024 19:04:05.633951000 UTC +00:00,
 supervisor_id: 5,
 pid: 63986,
 hostname: "Rosas-MacBook-Air.local",
 metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3},
 created_at: Wed, 29 May 2024 19:03:05.603323000 UTC +00:00>

Now, let's kill the supervisor by sending a SIGKILL signal:

$ kill -9 63970

Workers and dispatcher realise the supervisor is gone, so they shut down orderly and clean up:

SolidQueue-0.3.1 Deregister Worker (10.3ms)  process_id: 9, pid: 66314, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:07:26.976856000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 9, kind: "Worker", last_heartbeat_at: "2024-05-29 19:07:26.976856000 +0000", supervisor_id: 5, pid: 66314, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3}, created_at: "2024-05-29 19:04:26.869299000 +0000">, :pruned=>false, :claimed_size=>0}
SolidQueue-0.3.1 Shut down Worker (22.3ms)  pid: 66314, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Release claimed jobs (3.3ms)  size: 0
rosas-air 3.3.1 ~/Work/basecamp/solid_queue/test/dummy main *= SolidQueue-0.3.1 Deregister Worker (9.1ms)  process_id: 10, pid: 66653, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:07:35.982878000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 10, kind: "Worker", last_heartbeat_at: "2024-05-29 19:07:35.982878000 +0000", supervisor_id: 5, pid: 66653, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"default", "thread_pool_size"=>5}, created_at: "2024-05-29 19:04:35.894628000 +0000">, :pruned=>false, :claimed_size=>0}
SolidQueue-0.3.1 Shut down Worker (25.3ms)  pid: 66653, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Release claimed jobs (9.0ms)  size: 0
SolidQueue-0.3.1 Deregister Dispatcher (25.4ms)  process_id: 6, pid: 63985, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:08:05.709912000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 6, kind: "Dispatcher", last_heartbeat_at: "2024-05-29 19:08:05.709912000 +0000", supervisor_id: 5, pid: 63985, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>1, "batch_size"=>500, "concurrency_maintenance_interval"=>600}, created_at: "2024-05-29 19:03:05.602736000 +0000">, :pruned=>false, :claimed_size=>0}
SolidQueue-0.3.1 Shut down Dispatcher (49.1ms)  pid: 63985, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil

We still have the 3 claimed executions by the worker that I killed individually:

>> SolidQueue::ClaimedExecution.count
  SolidQueue::ClaimedExecution Count (2.9ms)  SELECT COUNT(*) FROM `solid_queue_claimed_executions`
=> 3

But the worker record is still there as expected:

>> SolidQueue::ClaimedExecution.first.process
  SolidQueue::ClaimedExecution Load (3.6ms)  SELECT `solid_queue_claimed_executions`.* FROM `solid_queue_claimed_executions` ORDER BY `solid_queue_claimed_executions`.`id` ASC LIMIT 1
  SolidQueue::Process Load (1.3ms)  SELECT `solid_queue_processes`.* FROM `solid_queue_processes` WHERE `solid_queue_processes`.`id` = 7 LIMIT 1
=>
#<SolidQueue::Process:0x000000011ea9e240
 id: 7,
 kind: "Worker",
 last_heartbeat_at: Wed, 29 May 2024 19:04:05.633951000 UTC +00:00,
 supervisor_id: 5,
 pid: 63986,
 hostname: "Rosas-MacBook-Air.local",
 metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3},
 created_at: Wed, 29 May 2024 19:03:05.603323000 UTC +00:00>

Now, if I start the supervisor again, one of the first thing it does is to run the process prune, releasing claimed executions and deregistering the process with ID 7, whose heartbeat has expired already:

$ bin/rails solid_queue:start
SolidQueue-0.3.1 Register Supervisor (39.7ms)  pid: 78841, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (28.8ms)  pid: 78857, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.5ms)  pid: 78857, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Register Worker (29.0ms)  pid: 78858, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Dispatcher (29.8ms)  pid: 78856, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.9ms)  pid: 78858, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Started Dispatcher (31.0ms)  pid: 78856, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil
SolidQueue-0.3.1 Unblock jobs (3.0ms)  limit: 500, size: 0
SolidQueue-0.3.1 Release claimed job (18.7ms)  job_id: 4, process_id: 7
SolidQueue-0.3.1 Release claimed job (15.0ms)  job_id: 5, process_id: 7
SolidQueue-0.3.1 Release claimed job (6.6ms)  job_id: 6, process_id: 7
SolidQueue-0.3.1 Release claimed jobs (63.9ms)  size: 3
SolidQueue-0.3.1 Deregister Worker (70.2ms)  process_id: 7, pid: 63986, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:04:05.633951000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 7, kind: "Worker", last_heartbeat_at: "2024-05-29 19:04:05.633951000 +0000", supervisor_id: 5, pid: 63986, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"background", "thread_pool_size"=>3}, created_at: "2024-05-29 19:03:05.603323000 +0000">, :pruned=>true, :claimed_size=>3}
SolidQueue-0.3.1 Release claimed jobs (1.5ms)  size: 0
SolidQueue-0.3.1 Deregister Worker (5.4ms)  process_id: 8, pid: 63987, hostname: "Rosas-MacBook-Air.local", last_heartbeat_at: Wed, 29 May 2024 19:04:05.634434000 UTC +00:00, claimed_size: 0, pruned: {:process=>#<SolidQueue::Process id: 8, kind: "Worker", last_heartbeat_at: "2024-05-29 19:04:05.634434000 +0000", supervisor_id: 5, pid: 63987, hostname: "Rosas-MacBook-Air.local", metadata: {"polling_interval"=>0.1, "queues"=>"default", "thread_pool_size"=>5}, created_at: "2024-05-29 19:03:05.602737000 +0000">, :pruned=>true, :claimed_size=>0}
SolidQueue-0.3.1 Prune dead processes (111.3ms)  size: 2

Relevant lines:

SolidQueue-0.3.1 Release claimed job (18.7ms)  job_id: 4, process_id: 7
SolidQueue-0.3.1 Release claimed job (15.0ms)  job_id: 5, process_id: 7
SolidQueue-0.3.1 Release claimed job (6.6ms)  job_id: 6, process_id: 7
SolidQueue-0.3.1 Release claimed jobs (63.9ms)  size: 3
SolidQueue-0.3.1 Prune dead processes (111.3ms)  size: 2

I can't manage to leave solid_queue_claimed_executions without a corresponding solid_queue_processes record associated 😕

from solid_queue.

danwetherald avatar danwetherald commented on September 14, 2024

When coming back to my mac today after the mac was hibernated all night.

There is a job stuck in progress assigned to Worker 304, when you click on the lists of Workers, the only worker present is Worker 339 PID: 10361 - restarting solid queue does not correct this issue and the job will never get assigned the new worker 339 - it will stay in progress for ever with its worker 304 assignment.

The only way I have been able to correct this in the past is by deleting the rows in solid_queue_claimed_executions table.

image

image

image

image

from solid_queue.

rosa avatar rosa commented on September 14, 2024

@dan003400, no clue/logs about how worker 304 got deleted? 😕 Is it possible you have that in your log/development.log file? You could search for DELETE queries to the solid_queue_processes table, to see if you spot when the worker got deleted.

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

When developing, this might be caused by workers dying when your mac hibernates and resumes. I noticed when coming back to my mac this morning, the workers tab had a strange state of 1 / 0 in the count. My guess is that the worker was killed but since it happen when the mac was hibernated, it did not run any lifecycle events allowing solid queue to correct the issue. Restarting solid_queue prior to queuing any new jobs corrects the issue.

Again, just an idea to what is possibly happening.

This is running in production on Kubernetes. So, it's not a machine hibernating. Good thought though.

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

@rosa I am going to private message you a video in our process. I haven't done anything to clear them up yet, as I am hoping i can get some feedback and advice that can help us troubleshoot this case.

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

@rosa related to this, i'm wondering, when ANY worker restarts, does it clean up any claimed executions that are now missing and outside of the heartbeat window? Is there some sort of assumption that a worker has some sort of static logic, and the "rebooted" worker would pick up the same work?

Just brainstorming why it might be happening.

Because, if any worker cleans up any old execution records, then this shouldn't be happening.

from solid_queue.

rosa avatar rosa commented on September 14, 2024

I think #244 should help with this.

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

@rosa I pointed at this repo and updated production. Two things.

  1. The existing claimed executions that were previously there are still present, unfortunately. So, at least in terms of cleaning up, that didn't seem to work.
  2. I tried the command SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release) per your statement back in April. This actually throws an error which I'll list below.
SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release)
irb(main):002> 
(irb):1:in `<main>': undefined method `id' for nil:NilClass (NoMethodError)

    SolidQueue.instrument(:release_claimed, job_id: job.id, process_id: process_id) do

So, It seems that the job.id isn't found. So, it can't release the process.

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

I did more research. So, what's happening in my system is this.

The worker process is actually still there, but it's "idling." In kubernetes, the pod was present. I went into the pod and found there was no work being done, that it was idle.

When i killed the pod, it immediately released all the in process jobs that were outstanding. So, this works. I guess the question becomes, and maybe it's related to my own infrastructure, is why the process is still present and waiting, showing no activity, but the job doesn't perceive itself as completed?

I'm still working on it, but i wanted to give you the update.

from solid_queue.

rosa avatar rosa commented on September 14, 2024

So, It seems that the job.id isn't found. So, it can't release the process.

Hmmm... this doesn't seem possible, because there's a foreign key:

add_foreign_key "solid_queue_claimed_executions", "solid_queue_jobs", column: "job_id", on_delete: :cascade

so you can't have a row in solid_queue_claimed_executions table that points to a missing record in solid_queue_jobs. I assume that this is the reason that the new branch's cleanup didn't work, it must have failed in the same way as you running this manually, but how you got to this state seems impossible from the DB point of view. Unless you've perhaps skipped the foreign keys?

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

https://cln.sh/TyXXZ0b9

The foreign key is in the schema.

from solid_queue.

rosa avatar rosa commented on September 14, 2024

@wflanagan, and does this return any records for you? 😕

SolidQueue::ClaimedExecution.where.missing(:job)

from solid_queue.

wflanagan avatar wflanagan commented on September 14, 2024

@rosa No. Empty array.

from solid_queue.

espenbff avatar espenbff commented on September 14, 2024

Hi,
I have a similar problem to the one described above.
My worker gets shut down but the job gets stuck "In Progress".
This happens when I run a job that runs for a while (1 minute++) and it does a lot of sql and api calls.

This is the terminal output when it shuts down:
10:20:57 web.1 | SolidQueue-0.6.0 Error deregistering Worker (102.1ms) process_id: 370, pid: 86548, hostname: "MacBook-Pro.local", name: "worker-65fdf60c40030f3e9ac0", last_heartbeat_at: "2024-09-02T08:19:55Z", claimed_size: nil, pruned: false, error: "ActiveRecord::ConnectionTimeoutError could not obtain a connection from the pool within 0.100 seconds (waited 0.101 seconds); all pooled connections were in use" 10:20:57 web.1 | SolidQueue-0.6.0 Shutdown Worker (5104.3ms) pid: 86548, hostname: "MacBook-Pro.local", process_id: 370, name: "worker-65fdf60c40030f3e9ac0", polling_interval: 0.1, queues: "*", thread_pool_size: 1 10:20:57 web.1 | [ActiveJob] [FeedbackImports::NeedsFeedbackJob] [78cca2cc-3484-4ca8-a622-7ebdb46e2011] Performed FeedbackImports::NeedsFeedbackJob (Job ID: 78cca2cc-3484-4ca8-a622-7ebdb46e2011) from SolidQueue(default) in 12434.0ms 10:20:57 web.1 | /Users/.../.rvm/gems/ruby-3.3.0/gems/activerecord-7.1.3.4/lib/active_record/connection_adapters/abstract/connection_pool/queue.rb:128:in block in wait_poll': could not obtain a connection from the pool within 0.100 seconds (waited 0.101 seconds); all pooled connections were in use (ActiveRecord::ConnectionTimeoutError)`

SolidQueue::ClaimedExecution.where.missing(:job) Does not return anything.

image

Worker with id '370' not found

If I restart my application the job gets picked up again, and runs til finished or shutdown again.
I have tried tweaking Thread and polling intervals for the solid_queue setup as well as thread count in my database.yml file without any noticeable effect .

Hope this somewhat helps with the problem over. I can provide more information if needed.

from solid_queue.

rosa avatar rosa commented on September 14, 2024

@espenbff, thanks for writing this up 🙏 what's the size of your database connection pool? You'll find this in your database.yml configuration, as the pool option.

from solid_queue.

espenbff avatar espenbff commented on September 14, 2024

It is set to 5 pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>.
I tried one run with value 20 as well but I got the same error.

default: &default
  adapter: postgresql
  encoding: unicode
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 20 } %>

development:
  <<: *default
  database: name_development

I see I overwrite it with 5 anyway. Will do a new test with 20.

from solid_queue.

rosa avatar rosa commented on September 14, 2024

Got it! Yes, it'll depend on how you have your workers configured as well (the number of threads you're using).

from solid_queue.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.