Towards 100% Uptime with Node.js

On GitHub: sandinmyjoints/towards-100-pct-uptime


A guy is standing on the corner of the street chain smoking cigarettes, one after another. A woman walking by notices him and says, "Hey, don't you know that those things can kill you? I mean, didn't you see the giant warning on the box?!" "That's OK," the guy says, while he puffs away. "I'm a computer programmer." "So? What's that got to do with anything?" "We don't care about warnings. We only care about errors." (From http://stackoverflow.com/a/235307/599258) We indeed care greatly about errors. Though probably best not to ignore warnings entirely. :)

9M uniques / month.

75K+ users, some are paid subscribers.

About me: I'm a Software Engineer, one of three, at Curiosity Media. We have two main properties: SpanishDict and Fluencia. SpanishDict is a traditional web site, with page reloads. Fluencia is a single page web app with AJAX calls to a REST API. We want both to run all the time, every day. Both run Node.js on the backend.

( We | you | users )

hate downtime.

Downtime is bad for all sorts of reasons. Users go away. You might get paged in the middle of the night. If you know that deploying code can cause a bad experience for users who are online, or cause system errors or corrupted data, you won't deploy as much.

Important, but

out of scope:

  • Redundant infrastructure.
  • Backups.
  • Disaster recovery.
Lots of things can cause downtime: the database, the network, and more. Many factors go into preventing it entirely, and some of them are out of scope for this presentation.

In scope:

  • Application errors.
  • Deploys.
  • Node.js stuff:
    • Domains.
    • Cluster.
    • Express.
Imperfect engineers (e.g., me) cause application errors. Deploys are a necessary evil. So we'll focus on what we can do with Node to keep these from causing downtime. Without further ado, here are the...

Keys to 100% uptime.

1. Sensibly handle

uncaught exceptions.

2. Use domains

to catch and contain errors.

3. Manage processes

with cluster.

4. Gracefully terminate connections.

We're going to visit each of these in detail.

1. Sensibly handle uncaught exceptions.

Uncaught exceptions happen when:

  • An exception is thrown but not caught.
  • An error event is emitted but nothing is listening for it.

From node/lib/events.js:

EventEmitter.prototype.emit = function(type) {
  // If there is no 'error' event listener then throw.
  if (type === 'error') {
      ...
    } else if (er instanceof Error) {
      throw er; // Unhandled 'error' event
    } else {
      ...
If we emit an event of type 'error' and nothing is listening for it, throw it! There's no try/catch around this. So if you're not listening for it, what will an uncaught thrown error do?

An uncaught exception

crashes the process.

That's it, Node is dead, not running anymore.
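
As a minimal illustration (not from the talk), this is enough to kill a process: the emitter has no 'error' listener, so emit throws, and nothing catches it.

var EventEmitter = require('events').EventEmitter;

var ee = new EventEmitter();
ee.emit('error', new Error('boom')); // thrown, uncaught, process exits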

If the process is a server:

x 100s??

An even scarier type of downtime than the error screen we saw before: for any given client, a single response fails. For a server, this could be happening to hundreds or thousands of clients if the uncaught exception is handled poorly and the process crashes hard. The question is: how do we recover from this and carry on as well as possible?

It starts with...

Domains.

2. Use domains to catch and contain errors.

try/catch doesn't do async.

try {
  var f = function() {
    throw new Error("uh-oh");
  };
  setTimeout(f, 100);
} catch (ex) {
  console.log("try / catch won't catch", ex);
}

Domains are a bit like

try/catch for async.

var d = require('domain').create();

d.on('error', function (err) {
  console.log("domain caught", err);
});

var f = d.bind(function() {
  throw new Error("uh-oh");
});

setTimeout(f, 100);
Try / catch won't help. Wrap async operations in a domain, and the domain will catch thrown exceptions and error events.

The active domain is

domain.active.

var domain = require('domain');
var d = domain.create();
console.log(domain.active); // <-- null

var f = d.bind(function() {
  console.log(domain.active === d);              // <-- true
  console.log(process.domain === domain.active); // <-- true
  throw new Error("uh-oh");
});

setTimeout(f, 100);
Current domain is domain.active and also process.domain. This is important because...

New EventEmitters bind

to the active domain.

EventEmitter.prototype.emit = function(type) {
  if (type === 'error') {
    if (this.domain) {  // This is important!
      ...
      this.domain.emit('error', er);
    } else if ...
We don't just use timers: most I/O in Node happens through EventEmitters. If a domain is active when an EE is created, the EE associates itself with that domain. What does that mean? It's like magically adding error listeners to a bunch of EEs. If an EE has an associated domain, an error is emitted on the domain instead of thrown. This can prevent a whole bunch of uncaught exceptions, and thus save countless server processes. So what do you do once your domain has caught an error?
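
To make that concrete, here is a small sketch (not from the talk): the EventEmitter below is created inside a domain's run, so even though nothing listens for its 'error' event, the domain catches it instead of the process crashing.

var domain = require('domain');
var EventEmitter = require('events').EventEmitter;

var d = domain.create();
d.on('error', function(err) {
  console.log('domain caught:', err.message);
});

d.run(function() {
  var ee = new EventEmitter();      // created while d is active, so ee.domain === d
  setTimeout(function() {
    ee.emit('error', new Error('nobody is listening, but the domain is'));
  }, 100);
});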

Log the error.

Helpful additional fields:

  • error.domain
  • error.domainEmitter
  • error.domainBound
  • error.domainThrown
You probably want to log the error. Errors caught by domains carry extra fields that provide context, which may be useful for tracing errors and debugging.
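
Here's a rough sketch of what such logging might look like; the fields are only present when a domain supplied them.

var domain = require('domain');
var d = domain.create();

d.on('error', function(err) {
  console.error('message:      ', err.message);
  console.error('domain:       ', err.domain === d);  // the domain that caught it
  console.error('domainEmitter:', err.domainEmitter); // the EE that emitted it, if any
  console.error('domainBound:  ', err.domainBound);   // the callback bound via bind/intercept, if any
  console.error('domainThrown: ', err.domainThrown);  // true if thrown rather than emitted
});

d.run(function() {
  setTimeout(function() {
    throw new Error('uh-oh');
  }, 100);
});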

Then it's up to you.

  • Ignore.
  • Retry.
  • Abort (e.g., return 500).
  • Throw (becomes an uncaught exception).
It's up to you, depending on context: what kind of error it is, what emitted it, what your application is doing. There's no general answer. Domains are a tool, not an answer.

Do I have to create a new domain

every time I do an async operation?

Like every time I handle a request/response cycle? Not necessarily. You can group related operations into one domain. For example...

Use middleware.

More convenient.

In Express, this might look like:

var domainWrapper = function(req, res, next) {
  var reqDomain = domain.create();
  reqDomain.add(req);
  reqDomain.add(res);

  reqDomain.once('error', function(err) {
    res.send(500); // or next(err);
  });

  reqDomain.run(next);
};

Based on https://github.com/brianc/node-domain-middleware https://github.com/mathrawka/express-domain-errors

Let's step through this. req and res are both EEs. They were created before this domain existed, so they must be explicitly added to it. We add an error handler; on error, we'll just return a 500. Alternatively, you could trigger your error-handling middleware. Then we run the rest of the request/response stack in the context of the domain, so when new EEs are created, they add themselves to the active domain. When any of those EEs emits an error, or an error is thrown, it propagates to this domain.
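
For completeness, here's a sketch of how that wrapper might be wired up in Express 3 (the error-handling middleware here is just a placeholder for your own):

var express = require('express');
var app = express();

app.use(domainWrapper);   // wrap each request/response cycle in its own domain
app.use(app.router);      // routes now run in the context of that domain
app.use(function(err, req, res, next) {
  res.send(500);          // next(err) from the domain's error handler lands here
});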

Domain methods.

  • add: bind an EE to the domain.
  • run: run a function in context of domain.
  • bind: bind one function.
  • intercept: like bind but handles 1st arg err.
  • dispose: cancels IO and timers.
dispose: if there was no error, there's no need to dispose. But once an error has occurred, more errors could follow. Do you want your error handler triggered on all of them?

  • If there's no I/O or timers, there's probably no need to dispose.
  • The intention of calling dispose is generally to prevent cascading errors when a critical part of the domain's context is found to be in an error state.
  • It tries to clean up I/O associated with the domain: streams are aborted, ended, closed, and/or destroyed, and timers are cleared.
  • Any error events raised as a result of this are ignored; Node tries really hard to close everything down in this context.
  • Use of dispose is context-dependent. Investigate its effects and decide whether you need it.
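
To make bind and intercept concrete, here's a small sketch (not from the talk); the file name is arbitrary. With bind you still inspect err yourself; intercept routes a non-null first argument straight to the domain's 'error' handler.

var fs = require('fs');
var domain = require('domain');

var d = domain.create();
d.on('error', function(err) {
  console.error('domain caught:', err.message);
});

// bind: wraps the callback so anything thrown inside it goes to the domain.
fs.readFile('config.json', d.bind(function(err, data) {
  if (err) throw err;       // thrown inside a bound function -> domain's 'error' handler
  console.log(data.length);
}));

// intercept: like bind, but the first (err) argument is also sent to the domain.
fs.readFile('config.json', d.intercept(function(data) {
  console.log(data.length); // only runs when there was no error
}));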

Domains

are great

until they're not.

For example,

node-mongodb-native does not

play well with the active domain.

console.log(domain.active); // a domain
AppModel.findOne(function(err, doc) {
  console.log(domain.active); // undefined
  next();
});

See https://github.com/LearnBoost/mongoose/pull/1337

This is the library that Mongoose is built on. Any errors thrown in this callback will not go to the domain, because there effectively isn't one. Yikes. Why this happens is outside the scope of this talk, and I don't have a good handle on it yet, but I'm trying to learn more.

Fix with explicit binding.

console.log(domain.active); // a domain
AppModel.findOne(domain.active.bind(function(err, doc) {
  console.log(domain.active); // still a domain
  next();
}));
Of course, if you have a lot of DB operations, this could get tedious and error-prone, because you might miss one...

What other operations don't play

well with domain.active?

Good question!

Package authors could note this.

If you find one, let package author know.

It is probably feasible for domains to work. I'm opening a ticket with node-mongodb-native to find out more about this particular case.

Can 100% uptime be achieved

just by using domains?

No.

Not if only one instance of your app

is running.

When that instance is down or restarting, perhaps due to a re-thrown error, an uncaught exception, or a deploy/upgrade, it's unavailable. The time between when the process dies and when its successor comes up? That's downtime. This brings us to #3...

3. Manage processes

with cluster.

Cluster module.

Node = one thread per process.

Most machines have multiple CPUs.

One process per CPU = cluster.

master / workers

  • 1 master process forks n workers.
  • Master and workers communicate state via IPC.
  • When workers want to listen to a socket, master registers them for it.
  • Each new connection to socket is handed off to a worker.
  • No shared application state between workers.
IPC is inter-process communication: messages passed between processes.
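
Here's a minimal sketch of that split using only the core cluster module (the port is arbitrary):

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();  // one worker per CPU
  }
  cluster.on('exit', function(worker) {
    console.log('worker ' + worker.process.pid + ' died, forking a replacement');
    cluster.fork();
  });
} else {
  // Workers share the listening socket; the master hands each connection to a worker.
  http.createServer(function(req, res) {
    res.end('handled by pid ' + process.pid + '\n');
  }).listen(8000);
}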

What about when a worker

isn't working anymore?

Some coordination is needed.

Worker tells cluster master it's done accepting new connections.

Cluster master forks replacement.

Worker dies.


Another use case for cluster:

Deployment.

  • Want to replace all existing servers.

  • Something must manage that = cluster master process.

A deploy is a bit like a deliberately induced error across all your workers, except that you need to start the new workers from a different codebase.

Zero downtime deployment.

  • When master starts, give it a symlink to worker code.

  • After deploy new code, update symlink.

  • Send signal to master: fork new workers!

  • Master tells old workers to shut down, forks new workers from new code.

  • Master process never stops running.

The master process never stops, so the socket stays open and never refuses connections. A symlink is a "symbolic link": a pointer to a directory.

Signals.

A way to communicate with running processes.

SIGHUP: reload workers (some tools use SIGUSR2 instead).

$ kill -s HUP <pid>
$ service <node-service-name> reload
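
In the master, handling the signal might look like this rough sketch; reloadWorkers is a hypothetical stand-in for whatever your process manager actually provides.

process.on('SIGHUP', function() {
  console.log('got SIGHUP, reloading workers from the current symlink target');
  reloadWorkers(); // hypothetical: fork new workers, gracefully retire old ones
});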

Process management options.

You can write your own process management code with cluster, and it's educational. Getting the behavior correct for all worker states is great fun. Or if you want to simplify your life, there are packages out there that will do it for you.

Forever

github.com/nodejitsu/forever

  • Has been around...forever.
  • No cluster awareness — used on a single process.
  • Simply restarts the process when it dies.
  • More comparable to Upstart or Monit.

Naught

github.com/superjoe30/naught

  • Newer.
  • Cluster aware.
  • Zero downtime errors and deploys.
  • Runs as daemon.
  • Handles log compression, rotation.

Recluster

github.com/doxout/recluster

  • Newer.
  • Cluster aware.
  • Zero downtime errors and deploys.
  • Does not run as daemon.
  • Log agnostic.
  • Simple, relatively easy to reason about.

We went with recluster.

Happy so far.

Below is a very simplified example of what master.js might look like. Cluster emits a variety of events, such as 'listening' and 'exit', and you would want to log those. The opts include the number of workers and a timeout: how long, in seconds, to let old workers live after they stop accepting new connections. If the timeout is zero, workers are killed instantly, without a chance to cleanly close their existing connections.
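
A sketch of such a master.js, roughly following recluster's README (the worker path and option values are assumptions; check the recluster 0.3.x docs for the exact options):

var recluster = require('recluster');
var path = require('path');

var cluster = recluster(path.join(__dirname, 'lib', 'server.js'), {
  workers: 2,   // number of worker processes
  timeout: 30   // seconds old workers may live after they stop accepting connections
});

cluster.run();

process.on('SIGUSR2', function() {
  console.log('Got SIGUSR2, reloading cluster...');
  cluster.reload(); // fork workers from the (re-pointed) symlink, retire the old ones
});

console.log('Spawned cluster, kill -s SIGUSR2', process.pid, 'to reload');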

I have been talking about

starting / stopping workers

as if it's atomic.

It's not.

4. Gracefully terminate connections

when needed.

Don't call process.exit too soon!

Give it a grace period to clean up.

process.exit is how you shut down a Node process. But when you want to shut down a server, you don't want to call process.exit right away! That is what leads to the scenario we saw before, where hundreds of in-flight requests all failed.

Need to clean up:

  • In-flight requests.
  • HTTP keep-alive (open TCP) connections.

Revisiting our middleware from earlier:

var domainWrapper = function(afterErrorHook) {
  return function(req, res, next) {
    var reqDomain = domain.create();
    reqDomain.add(req);
    reqDomain.add(res);

    reqDomain.once('error', function(err) {
      next(err);
      if(afterErrorHook) afterErrorHook(err);  // Hook.
    });
    reqDomain.run(next);
  };
};
Add after-error hook for cleanup. What do we put into the after-error hook?

1. Call server.close.

var afterErrorHook = function(err) {
  server.close(); // <-- ensure no new connections
}
Node's server class has a method close that stops the server from accepting new connections. Call it to ensure that this worker handles no more work.

2. Shut down keep-alive connections.

var afterErrorHook = function(err) {
  app.set("isShuttingDown", true); // <-- set state
  server.close();
}

var shutdownMiddle = function(req, res, next) {
  if (app.get("isShuttingDown")) {  // <-- check state
    req.connection.setTimeout(1);   // <-- kill keep-alive
  }
  next();
};

Idea from https://github.com/mathrawka/express-graceful-exit

HTTP defaults to keep-alive, which keeps the underlying TCP connection open. We want to close those TCP connections for our dying worker. So we set global app state saying that we are shutting down, and for every connection we see, we set the socket timeout to a minimal value, so that as soon as there is any activity on that particular connection, it closes right away. This decreases the number of open connections over time.

3. Then call process.exit

in server.close callback.

var afterErrorHook = function(err) {
  app.set("isShuttingDown", true);
  server.close(function() {
    process.exit(1);  // <-- all clear to exit
  });
}
server.close is actually pretty graceful by default: it only calls back once all existing connections are closed. So we put the call to process.exit inside its callback.

Set a timer.

If the timeout period expires and the server is still around, call process.exit.

Now it's a hard shutdown, but time is up and the worker just has to go. Graceful shutdown is all about best effort. If the server is in a bad state (e.g., the DB is disconnected), bad things might still happen to the in-flight requests we are trying to finish out cleanly. But they would have happened anyway with an immediate hard shutdown, so we might as well try. Most likely, if the shutdown is due to an application error, the other requests will be fine.
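
A sketch of the full hook with the timer added (30 seconds is an assumption; tune it for your app):

var afterErrorHook = function(err) {
  app.set("isShuttingDown", true);
  server.close(function() {
    process.exit(1);        // clean exit: all connections closed
  });

  setTimeout(function() {
    process.exit(1);        // hard exit: grace period expired
  }, 30 * 1000).unref();    // unref so this timer alone doesn't keep the process alive
};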

Summing up:

Our ideal server.

On startup:

  • Cluster master comes up (for example, via Upstart).
  • Cluster master forks workers from symlink.
  • Each worker's server starts accepting connections.

On deploy:

  • Point symlink to new version.
  • Send signal to cluster master.
  • Master tells existing workers to stop accepting new connections.
  • Master forks new workers from new code.
  • Existing workers shut down gracefully.
The master never stops. There are always workers accepting new connections. Old workers close out existing connections before dying.

On error:

  • Server catches it via domain.
  • Next action depends on you: retry? abort? rethrow? etc.
Again, there's no catch-all action here: it depends on your app and on what error you've got. Use contextual domains to isolate specific operations or groups of operations, so you have a better sense of what kinds of errors a particular domain will be handling.

On uncaught exception:

  • ??
// The infamous "uncaughtException" event!
process.on('uncaughtException', function(err) {
  // ??
});

Back to where we started:

1. Sensibly handle uncaught exceptions.

We have minimized these by using domains.

But they can still happen.

Node docs say not to keep running.

An unhandled exception means your application — and by extension node.js itself — is in an undefined state. Blindly resuming means anything could happen. You have been warned.

http://nodejs.org/api/process.html#process_event_uncaughtexception This makes sense. By definition, you don't know what's going on, so there's no sure way to recover. This comes from Node not separating your application from the server. It doesn't run in a container like mod_php or mod_wsgi with Apache.

What to do?

First, log the error so you know what happened.

Then, you've got to

kill the process.


It's not so bad. We can now do so

with minimal trouble.

On uncaught exception:

  • Log error.
  • Server stops accepting new connections.
  • Worker tells cluster master it's done.
  • Master forks a replacement worker.
  • Worker exits gracefully when all connections are closed, or after timeout.
Similar to what we have seen for deploy, except reversed: the worker tells the master it is going down.
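
A sketch of an uncaughtException handler along those lines (it assumes server is in scope and substitutes console for your logger of choice):

var cluster = require('cluster');

process.on('uncaughtException', function(err) {
  console.error('uncaught exception:', err.stack || err);

  try {
    server.close();                                   // stop accepting new connections
    if (cluster.worker) cluster.worker.disconnect();  // tell the master this worker is done
  } catch (closeErr) {
    console.error('error during shutdown:', closeErr);
  }

  setTimeout(function() {
    process.exit(1);        // grace period expired; hard exit
  }, 30 * 1000).unref();
});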

What about the request

that killed the worker?

How does the dying worker

gracefully respond to it?

Good question!

People are also under the illusion that it is possible to trace back [an uncaught] exception to the http request that caused it...

-felixge, https://github.com/joyent/node/issues/2582 Felix Geisendorfer, Node community member who originally added the uncaughtException handler, and has also asked for it to be removed (unsuccessfully)!

This is too bad, because you

always want to return a response,

even on error.

Keeping a client hanging can come back to bite you: 1) the user agent appears to hang, and 2) it may resend the bad request once the connection closes, triggering another exception! This retry behavior is in the HTTP spec. I've seen it happen. It's not pretty; it can crash multiple workers. This presentation was originally titled "I Have Much to Learn About Node.js" because of my surprise at what happened due to this particular behavior, combined with some less robust error handling we were doing at the time.

This is Towards 100% Uptime because these approaches don't guarantee a response for every request.

But we can get very close.

Fortunately, given what we've seen,

uncaughts shouldn't happen often.

And when they do, only one

connection will be left hanging.

Must restart cluster master when:

  • You upgrade Node.
  • The cluster master's code changes.

During timeout periods, might have:

  • More workers than CPUs.
  • Workers running different versions (old/new).

Should be brief. Probably preferable to downtime.

Tip:

Be able to produce errors on demand

on your dev and staging servers.

(Disable this in production.)

This is really helpful for debugging and testing. Maybe have several: one each for sync errors, async errors, DB errors, etc.
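
For example, a hypothetical pair of debug-only routes (the paths are made up; keep them out of production):

if (process.env.NODE_ENV !== 'production') {
  app.get('/debug/error-sync', function(req, res) {
    throw new Error('synchronous test error');
  });
  app.get('/debug/error-async', function(req, res) {
    setTimeout(function() {
      throw new Error('asynchronous test error');
    }, 10);
  });
}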

Tip:

Keep cluster master simple.

It needs to run for a long time without being updated.

Things change.

I've been talking about:

{
  "node": "~0.10.20",
  "express": "~3.4.0",
  "connect": "~2.9.0",
  "mongoose": "~3.6.18",
  "recluster": "=0.3.4"
}

The Future:

Node 0.11 / 0.12

For example, the cluster module has some changes.

Cluster is experimental.

Domains are unstable.

These terms are defined in the Node docs. The volcano image isn't because you're going to get burned by Node: the big island of Hawaii is mostly stable (thousands of people live there), but it is also still being created, and parts of it are unstable. Like Node. Best approached with caution and respect.

Good reading:

If you thought this was interesting,

We're hiring.

careers.fluencia.com

Thanks!