jfromaniello / master-process

reload node.js apps with no downtime

JavaScript 100.00%

master-process's Introduction

The purpose of this module is to reload a node.js application with no downtime by using the cluster capabilities.

Compatibility

Node 6.x
Node 8.x
Node 10.x
Node 12.x
Node 14.x

Installation

npm i master-process --save

Recommended usage

Use this code at the very beginning of your node.js application:

const cluster = require('cluster');

if (cluster.isMaster &&                 // if is a master
    process.env.NODE_ENV !== 'test') {  // not in test mode

  require('master-process').init();
  return;
}

How it works

The master-process module uses the cluster module to run the user application in cluster mode. There are two types of processes involved in a Node cluster:

master process,
worker processes.

The worker processes are used to run your application. All worker processes in a cluster will serve requests on a single server port or UNIX domain socket (see cluster documentation for how this is achieved).

The master process handles forking the required number of workers as well as:

handling the SIGHUP signal on the master process to reload the cluster (new workers are created and old workers are destroyed once the new ones are ready to service requests). Use this signal to tell the master process that you have updated the application and it should reload it.
handling the SIGTERM signal to cleanly shut down all workers and exit the cluster.

Number of workers

The number of workers can be controlled with the WORKERS environment variable. The default is 1.

WORKERS=MAX sets the number of workers equal to the number of cores (as returned by os.cpus().length).
WORKERS=AUTO sets the number of workers equal to the number of cores minus 1 (or a single worker on a single-core machine).
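
The rules above can be sketched as a small function. This is only an illustration: resolveWorkerCount is a hypothetical name, not part of the module, and the fallback for invalid values is an assumption.

```javascript
const os = require('os');

// Hypothetical helper: maps the WORKERS value to a worker count,
// following the rules described in this section.
function resolveWorkerCount(value, cores = os.cpus().length) {
  if (value === undefined || value === '') return 1; // documented default
  if (value === 'MAX') return cores;
  if (value === 'AUTO') return Math.max(1, cores - 1);
  const count = Number.parseInt(value, 10);
  // Assumption: anything that is not a positive integer falls back to the default.
  return Number.isInteger(count) && count > 0 ? count : 1;
}

console.log(resolveWorkerCount(process.env.WORKERS));
```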

Application Crashes

If a worker exits unexpectedly, master-process will attempt to replace it with a new worker. Similarly if the worker crashes or is killed by the operating system it will also be replaced.

To avoid excessive resource usage when newly started workers keep crashing, the WORKER_THROTTLE environment variable throttles how often a given worker is restarted:

if a worker has been running for less than WORKER_THROTTLE when it crashes there will be a delay before a replacement worker is created.
if a worker has been running for longer than WORKER_THROTTLE then the replacement worker is started immediately.

The default value is WORKER_THROTTLE=1s.
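
A minimal sketch of the throttle decision follows. The README does not say how long the delay is, so this sketch assumes the replacement waits for the remainder of the throttle window; restartDelayMs is a hypothetical name.

```javascript
// Hypothetical helper: how long to wait before forking a replacement worker.
// Assumption: the delay is whatever remains of the throttle window. The text
// above only states that "there will be a delay".
function restartDelayMs(uptimeMs, throttleMs = 1000) {
  if (uptimeMs >= throttleMs) return 0; // ran long enough: restart immediately
  return throttleMs - uptimeMs;
}

console.log(restartDelayMs(250));  // 750
console.log(restartDelayMs(5000)); // 0
```

A master could pass the result to setTimeout before forking the replacement.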

Updating master-process

If the master process detects that the version of the master-process module has changed it will quit with exit code 1. The service manager should take care of restarting the application.

CPU and Memory monitoring

By default the master process monitors the behavior of each worker. If a worker is using too many resources, the master starts a new worker to replace it. Here are the environment variables that control process monitoring, with their respective defaults:

MEM_MONITOR_FAILURES=10
CPU_MONITOR_FAILURE=10
MAX_MEMORY_ALLOWED_MB=1200
MAX_CPU_ALLOWED=95
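
The README does not spell out how the failure counts are applied. The sketch below assumes one plausible reading: a worker is replaced after a number of consecutive samples above the limit. It is not the module's implementation, and createThresholdMonitor is a hypothetical name.

```javascript
// Hypothetical sketch of a threshold monitor, not the module's implementation.
// Assumption: a worker is replaced after `maxFailures` consecutive samples
// above `limit`; one sample at or below the limit resets the count.
function createThresholdMonitor(limit, maxFailures) {
  let failures = 0;
  return function check(sample) {
    failures = sample > limit ? failures + 1 : 0;
    return failures >= maxFailures; // true means "replace the worker"
  };
}

// Defaults taken from the list above: 1200 MB and 10 failures.
const memoryCheck = createThresholdMonitor(1200, 10);
const rssMb = process.memoryUsage().rss / (1024 * 1024);
console.log(memoryCheck(rssMb));
```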

SIGUSR2

I use this special signal to profile the underlying application (check v8profiler). The master process pauses or resumes the CPU/memory monitoring and passes the signal on to the worker.

Unix sockets

If process.env.PORT starts with a / (slash), master-process will assume you are going to listen on a UNIX socket and will take care of a few things:

cleaning up the socket file on startup if it already exists; otherwise the worker would fail with EADDRINUSE.
cleaning the socket on exit.
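
The socket handling described above could look roughly like this. The helper names are hypothetical and the module's actual code may differ.

```javascript
const fs = require('fs');

// True when PORT names a UNIX socket path rather than a TCP port.
function isUnixSocket(port) {
  return typeof port === 'string' && port.startsWith('/');
}

// Removes a leftover socket file. Returns false when there was nothing
// to remove; any other error is rethrown.
function removeStaleSocket(socketPath) {
  try {
    fs.unlinkSync(socketPath);
    return true;
  } catch (err) {
    if (err.code === 'ENOENT') return false;
    throw err;
  }
}

console.log(isUnixSocket('/var/run/app.sock')); // true
console.log(isUnixSocket('3000'));              // false
```

A master would call removeStaleSocket(process.env.PORT) once before forking workers and again on exit, but only when isUnixSocket(process.env.PORT) is true.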

Debug

Use DEBUG=master-process to debug this module.

Exposed env variables

Every worker receives these additional environment variables:

PPID: The parent process id.
RELOAD_INDEX: The number of times that the process has been reloaded with the SIGHUP signal.
WORKER_INDEX: The index of the worker, useful when using more than one worker with WORKERS=AUTO or WORKERS=X.
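
A worker might read these variables as follows. Environment variables are always strings, so the sketch converts them to numbers; readWorkerEnv is a hypothetical helper, and treating the values as integers is an assumption.

```javascript
// Reads the variables that master-process exposes to each worker.
// Assumption: the values are integer strings; missing values become null.
function readWorkerEnv(env = process.env) {
  const toInt = (value) => (value === undefined ? null : Number.parseInt(value, 10));
  return {
    parentPid: toInt(env.PPID),
    reloadIndex: toInt(env.RELOAD_INDEX),
    workerIndex: toInt(env.WORKER_INDEX),
  };
}

console.log(readWorkerEnv());
```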

Similar projects

cluster-master.

License

MIT 2015 - Jose F. Romaniello

master-process's Issues

MAX_KILL_TIMEOUT unit is not explicit, which is error-prone

The MAX_KILL_TIMEOUT environment variable does not specify the time unit in its name, which ends up being error-prone because users may assume it is in seconds when it is really in milliseconds.

We already use ms elsewhere in the codebase to support including the unit in configuration values (e.g. 30 seconds). Let's do the same for MAX_KILL_TIMEOUT. This is backwards-compatible because a value with no unit is assumed to be in milliseconds.
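
To illustrate the proposal, here is a dependency-free sketch. It uses a small hand-rolled parser as a stand-in for the ms package and supports only a bare number or a short ms/s/m/h suffix; parseTimeout is a hypothetical name.

```javascript
// Hand-rolled stand-in for the `ms` package so the sketch has no dependencies.
// Accepts a bare number (milliseconds) or a number with an ms/s/m/h suffix.
const UNIT_TO_MS = { ms: 1, s: 1000, m: 60 * 1000, h: 60 * 60 * 1000 };

function parseTimeout(value) {
  const match = /^(\d+(?:\.\d+)?)\s*(ms|s|m|h)?$/.exec(String(value).trim());
  if (!match) throw new Error(`Invalid timeout: ${value}`);
  return Number(match[1]) * UNIT_TO_MS[match[2] || 'ms'];
}

console.log(parseTimeout('30s'));  // 30000
console.log(parseTimeout('5000')); // 5000 (no unit: milliseconds)
```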

Require cache is not passed to cluster workers

Description

The cluster module creates completely separate Node processes for each child process.

This causes objects in the require cache to be reloaded by each child process which can lead to strange behavior for services that use the NodeJS require cache as a form of singleton management.

Reproduction

cache-test/
  index.js  # The main function
  mymod.js  # A test module "singleton"
  package.json # For dependencies

package.json

{
  "name": "cache-test",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "dependencies": {
    "master-process": "^3.1.2",
    "uuid": "^3.3.3"
  }
}

index.js

const cluster = require('cluster');
const mymod = require('./mymod');

if (cluster.isMaster) {  // master: start the cluster and stop here

  require('master-process').init();
  return;
}

console.log('id:', mymod.id);

mymod.js

const uuid = require('uuid/v4')

module.exports.id = uuid();

Running the test

$ WORKERS=2 node index.js
id: 83c2998b-b180-4421-84c6-01f82d061baa
id: 8eb223fe-96c0-4f6b-b648-76fd423c4940

cluster crashes when trying to kill a process that does not exist

There are some race conditions around killing processes that can cause master-process to call process.kill on a PID that is no longer running. This will then crash the cluster due to an uncaught exception in the cluster master.

Here is an example I encountered in local testing. In this case I triggered a memory leak in one of the workers that eventually caused it to be killed with FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory, but not before master-process flagged it as using too much memory and spawned a replacement worker process.

The memory monitoring then tries to kill a process that is not around any more.

Error: kill ESRCH
    at Object.exports._errnoException (util.js:1020:11)
    at process.kill (internal/process.js:191:18)
    at /app/node_modules/master-process/lib/monitor.js:57:15
    at Worker.<anonymous> (/app/node_modules/master-process/index.js:76:7)
    at Worker.g (events.js:292:16)
    at emitOne (events.js:96:13)
    at Worker.emit (events.js:188:7)
    at listening (cluster.js:521:12)
    at Worker.onmessage (cluster.js:452:7)
    at ChildProcess.<anonymous> (cluster.js:766:8)

All calls to process.kill should be within a try/catch block to avoid this situation.
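
A minimal sketch of such a guard follows. safeKill is a hypothetical helper, not an existing function in the module. It swallows only ESRCH and rethrows everything else, which is a design choice of this sketch.

```javascript
// Sends a signal, treating "no such process" (ESRCH) as a normal outcome.
// The `kill` parameter exists only so the sketch can be tested with a fake.
function safeKill(pid, signal = 'SIGTERM', kill = (p, s) => process.kill(p, s)) {
  try {
    kill(pid, signal);
    return true;
  } catch (err) {
    if (err.code === 'ESRCH') return false; // the process is already gone
    throw err; // other errors, such as EPERM, are still surfaced
  }
}

// Signal 0 only checks that the process exists; our own PID always does.
console.log(safeKill(process.pid, 0)); // true
```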

Permissions on UNIX domain socket wrong after cluster gets down to 0 workers

Since #8 the cluster will stay up even if the workers die/exit and it will attempt to replace them. This leads to a new situation whereby all workers in the cluster have exited, temporarily leaving the cluster with no workers.

When binding to a UNIX domain socket, if the workers have taken care to call server.close then Node will unlink the underlying UNIX domain sockets as described in 'Identifying paths for IPC connections':

If the UNIX domain socket (that is visible as a file system path) is created and used in conjunction with one of Node.js' API abstractions such as net.createServer(), it will be unlinked as part of server.close()

https://nodejs.org/api/net.html#net_identifying_paths_for_ipc_connections

When the first worker binds to the socket again, the file will be created but its permissions will not be set to 644 as they are when the master-process cluster is first started. Depending on the umask of the process, this can make the socket inaccessible to any clients wishing to connect.

workers are killed before they have been replaced

When a new worker comes online it will kill all existing workers of the previous "generation", according to RELOAD_INDEX. This causes a problem during cluster reload since workers may get killed before their replacement has come online. Here is a log of a cluster reload that shows this happening.

 1	  master-process starting a new worker +7s
 2	  master-process starting a new worker +3ms
 3	  master-process starting a new worker +3ms
 4	  master-process PID/37675: worker is listening +110ms
 5	  master-process PID/37626: killing old worker  +0ms
 6	  master-process PID/37627: killing old worker  +0ms
 7	  master-process PID/37628: killing old worker  +0ms
 8	  master-process PID/37626: terminated worker has exited +6ms
 9	  master-process PID/37627: terminated worker has exited +1ms
10	  master-process PID/37628: terminated worker has exited +0ms
11	  master-process PID/37676: worker is listening +5s
12	  master-process PID/37677: worker is listening +5s
13	  master-process PID/37626: monitor started +3s
14	  master-process PID/37627: monitor started +1ms
15	  master-process PID/37628: monitor started +1ms
16	  master-process PID/37626: monitor stopped - the process is dead +2s
17	  master-process PID/37627: monitor stopped - the process is dead +0ms
18	  master-process PID/37628: monitor stopped - the process is dead +1ms

On lines 5-10 we see the entire previous generation being killed even though only one new worker is listening at this point. The 2nd and 3rd workers only start listening 5s and 10s later, respectively (lines 11-12), so the cluster is running with a reduced worker pool during this time.

A worker should only stop the worker that it replaces.
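
One way to express this is to pair each new worker with exactly one old worker up front. The sketch below is only an illustration of that bookkeeping; createReloadPlan is a hypothetical name, and old workers without a paired replacement are left for the caller to handle.

```javascript
// Hypothetical bookkeeping for a rolling reload: each new worker is paired
// with exactly one old worker, which is stopped only when its own
// replacement reports that it is listening.
function createReloadPlan(oldWorkerIds, newWorkerIds) {
  const replaces = new Map();
  newWorkerIds.forEach((newId, i) => {
    if (i < oldWorkerIds.length) replaces.set(newId, oldWorkerIds[i]);
  });
  return {
    // Returns the id of the old worker to stop, or undefined if none is paired.
    onListening(newId) {
      const oldId = replaces.get(newId);
      replaces.delete(newId);
      return oldId;
    },
  };
}

// PIDs taken from the log above.
const plan = createReloadPlan([37626, 37627, 37628], [37675, 37676, 37677]);
console.log(plan.onListening(37675)); // 37626
console.log(plan.onListening(37676)); // 37627
```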