
Comments (5)

ThisIsMissEm commented on July 28, 2024

Hi! I'm kind of new to this repo (but not to Faktory), and I have plenty of Node.js experience. In essence, if you used a controller/worker setup where the controller pulls jobs and workers execute them as child processes or V8 isolates, you'd be discarding many of the guarantees Faktory gives you about a job's state: a child process could die without the controller (master) being aware, or the controller could die and leave orphaned worker processes. In both scenarios you'd end up with job states that are potentially incorrect.

If you have a long-running computation, try to split it up by bucketing it into smaller units of work. Say you have a big for loop to execute; you could instead bucket its input into an array of smaller chunks, e.g.,

[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ] => [[ 1, 2, 3 ],  [ 4, 5, 6 ], [ 7, 8, 9 ], [ 10 ]]

Then you await a process.nextTick for each bucket/batch of the computation; that lets the Node.js main thread unblock and respond to the heartbeat. There are also Node.js modules that help with this, e.g., async.
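Here's a minimal sketch of that bucketing idea (the chunk size and the doExpensiveWork helper are made-up placeholders, not part of this library). It yields with setImmediate so the event loop can reach its I/O phase and answer the Faktory heartbeat; a bare process.nextTick would run before pending I/O callbacks:

function chunk(items, size) {
  const buckets = [];
  for (let i = 0; i < items.length; i += size) {
    buckets.push(items.slice(i, i + size));
  }
  return buckets;
}

async function processAll(items) {
  for (const bucket of chunk(items, 3)) {
    for (const item of bucket) {
      doExpensiveWork(item); // hypothetical synchronous, CPU-bound step
    }
    // Yield to the event loop before the next bucket so the worker
    // can respond to Faktory's heartbeat and other I/O.
    await new Promise((resolve) => setImmediate(resolve));
  }
}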


jbielick commented on July 28, 2024

Hey! Thanks for bringing up this topic.

I see no reason you couldn't use pm2 today. In practice, I would rely on something like upstart, systemd, nomad, kubernetes, etc. to run a group of processes and supervise them; those are all process supervisors that handle restarting, scheduling, scaling, and so on. As someone who works on the devops/infrastructure side of things most often, I'm not a fan of worker client libraries doing their own clustering/forking/supervision. That kind of tooling exists in many, many forms and is often much more stable and inspectable when it lives outside the client worker library itself.

Each worker process you start with .work in this library essentially works the way @ThisIsMissEm is describing: we pull jobs with one "thread" and work on jobs with other "threads". I put thread in quotes because they're just function calls scheduled on the Node.js event loop.
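For reference, here's a minimal sketch of what starting such a worker process looks like (the job type, handler body, and concurrency value are illustrative; check this library's README for the exact options):

const faktory = require('faktory-worker');

// The handler runs as an async function call on the event loop,
// not on a separate OS thread.
faktory.register('ResizeImage', async ({ id, size }) => {
  // ... job code ...
});

// Pull and process jobs; concurrency controls how many jobs may be
// in flight in this one process at a time.
faktory.work({ concurrency: 5 }).catch((err) => {
  console.error('worker failed to start', err);
  process.exit(1);
});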

In terms of what's needed to use any of those supervision/clustering pieces of software, I think health checks are one of the common denominators. That might be one of the deficiencies of this client library today: there's no way to health-check a running worker. While running, the worker sends a heartbeat to the server, and when the server stops hearing from the worker it evicts it; I'm not certain the process shuts down gracefully in that case. If it did shut down gracefully, your process supervisor would bring it back up, and that simple "keep it running" supervision will get you most of the reliability you could ask for. I'm sure pm2 does something like that (and I believe it can also export upstart/systemd configs for you if you want).
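As an illustration only (nothing like this ships with the library today), a liveness check could be as small as an HTTP endpoint inside the worker process that the supervisor or orchestrator polls; if the event loop is wedged, the request times out and the process gets restarted:

const http = require('http');

// Hypothetical health endpoint; the port is arbitrary.
http.createServer((req, res) => {
  if (req.url === '/health') {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('ok');
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(process.env.HEALTH_PORT || 9000);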

As for running multiple processes, the part I don't like is monitoring the processes (alive or not, starting more, etc.) and propagating signals to them. Since you can start multiple processes with systemd (example below), and things like pm2, kubernetes, nomad, or any other orchestrator let you scale your process definition to n, I think most of what's needed is already out there. When you're using orchestrators and supervisors like those, processes that do their own forking, clustering, etc. are often harder to manage. That's my experience anyway.

Here's an example of what a systemd config would look like for multiple processes:

# sidekiqs.service
[Unit]
Description=Sidekiq Group

[Service]
Type=oneshot

ExecStart=/bin/systemctl --no-block start sidekiq@1.service
ExecStart=/bin/systemctl --no-block start sidekiq@2.service
ExecStart=/bin/systemctl --no-block start sidekiq@3.service

ExecStop=/bin/systemctl stop sidekiq@*

RemainAfterExit=true

[Install]
WantedBy=multi-user.target

# sidekiq@.service
[Unit]
Description=Sidekiq Worker

[Service]
Type=simple
Restart=always
WorkingDirectory=...
User={{ user }}
TimeoutStopSec=30

ExecStart=/usr/local/bin/bundle exec sidekiq -c 12

SyslogIdentifier=sidekiq

[Install]
WantedBy=multi-user.target

You'd start the group with systemctl start sidekiqs and stop / restart it with systemctl stop sidekiqs and systemctl restart sidekiqs. It works really well. pm2 looks like it has something similar, and I remember having a fairly good experience with it at one point.
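For comparison, a rough sketch of a pm2 equivalent (the file and values are illustrative; see the pm2 docs for the exact option names):

// ecosystem.config.js (hypothetical)
module.exports = {
  apps: [
    {
      name: 'faktory-worker',
      script: './worker.js', // the script that calls .work()
      instances: 3,          // three separate worker processes
      exec_mode: 'fork',     // plain processes, no Node cluster module
      autorestart: true,     // bring a dead process back up
    },
  ],
};

You'd bring the group up with pm2 start ecosystem.config.js and stop it with pm2 stop faktory-worker, analogous to the systemd group above.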


kigster commented on July 28, 2024

Hi! I'm kind of new to this repo (but not to Faktory), and I have plenty of Node.js experience. In essence, if you used a controller/worker setup where the controller pulls jobs and workers execute them as child processes or V8 isolates, you'd be discarding many of the guarantees Faktory gives you about a job's state: a child process could die without the controller (master) being aware, or the controller could die and leave orphaned worker processes. In both scenarios you'd end up with job states that are potentially incorrect.

If you have a long-running computation, try to split it up by bucketing it into smaller units of work. Say you have a big for loop to execute; you could instead bucket its input into an array of smaller chunks, e.g.,

[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ] => [[ 1, 2, 3 ],  [ 4, 5, 6 ], [ 7, 8, 9 ], [ 10 ]]

Then you await a process.nextTick for each bucket/batch of the computation; that lets the Node.js main thread unblock and respond to the heartbeat. There are also Node.js modules that help with this, e.g., async.

Thank you, that's very helpful.

So, in a nutshell, your advice is: keep it single-process so as not to invalidate Faktory's integrity guarantees, but fix your Node code so the CPU-blocking loops yield control back to the event loop.

Did I parse that correctly?


kigster commented on July 28, 2024

Hey! Thanks for bringing up this topic.

Thank you so much for providing the systemd configuration for running workers. I love systems like this and have had a lot of success running large distributed applications where service state was managed by systemd, machine provisioning and configuration handled by Chef, and routing done by a Chef-managed haproxy connecting each virtual host to its stateful resources. That system was... shrinkable or expandable... 99% automated, CI-validated, and generally indestructible. But then it went out of fashion.

Is there a way to preempt a rogue job that attempts to steal all the CPU without calling tick() or otherwise being a cooperative concurrent citizen? How can we prioritize the health of the worker over the success of the job()? Could we add a requirement that jobs yield at least twice per health-check period?


jbielick commented on July 28, 2024

@kigster

The challenge in isolating a "rogue job" is that the job itself is just an event on the event loop. Processing the job is a callback. There's no concurrency primitive in Node.js that I can point to and say "this is the worker / thread that's blocking the event loop". To that end, there's no easy way of "killing" or terminating a single job—the only thing I could kill or terminate is the node.js process itself.

I think the real issue lies in the job code. The job code is deterministic, right? So if it encounters an expensive operation that blocks the event loop, it will do so again the next time the job runs. There's no quarantine in which Faktory could put this job based on its rowdiness. I think this means the job is a poison pill, and whatever worker picks it up would constantly be killing something in order to self-preserve. Since there's no concurrency primitive to monitor for "rowdiness" and kill, we're really just talking about terminating the whole worker and expecting a process supervisor to bring it back up, right?

The most I could suggest is this:

A. Run your faktory workers with a concurrency of 1, meaning only one job is in progress at a time. You'll have low throughput, but one job per process. If you detect that the event loop is blocked, kill the process. How do you detect that the event loop is blocked? You could try this library: https://github.com/tj/node-blocked, but you'll have to pick an arbitrary threshold and hope you're not killing jobs that normally block the event loop for an acceptable amount of time. (A hand-rolled sketch of that detection follows after option B.)

B. Run your faktory workers with a concurrency of N, kill the process when you detect the event loop is blocked, and allow the jobs that were killed in progress, alongside the job that was blocking the event loop, to retry (lots of collateral damage and ungraceful exits there).
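As an illustration only, here's roughly what that detection could look like without the node-blocked dependency (the interval and the 2000 ms threshold are arbitrary placeholders you'd tune for your slowest "normal" job):

// Detect a blocked event loop by measuring how late a periodic timer
// fires. On a long block, exit so the process supervisor (systemd,
// pm2, kubernetes, ...) restarts the worker.
const INTERVAL_MS = 100;
const THRESHOLD_MS = 2000;

let last = Date.now();
setInterval(() => {
  const now = Date.now();
  const lag = now - last - INTERVAL_MS;
  if (lag > THRESHOLD_MS) {
    console.error(`event loop was blocked for ~${lag}ms, exiting`);
    process.exit(1); // ungraceful: Faktory will retry the in-progress jobs
  }
  last = now;
}, INTERVAL_MS);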

I know that's probably not a wonderful answer, but in some sense blocking the event loop is a bit like an infinite loop or a synchronous operation with horrible time complexity (say, O(n^2)). From that perspective, this library probably can't offer much protection against it, can it?

