This is unfortunate and we’re stuck with it forever

I. Some weeks ago, at NodeConf Argentina, Mathias Bynens gave a presentation about RegExp and Unicode in JavaScript. At some point he showed a series of examples that yielded counter-intuitive results and said: “This is unfortunate and we’re stuck with it forever”. The phrase immediately resounded in my head, because to me it was a perfect definition of JavaScript, or at least the current state of JavaScript.
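The examples were in this vein (my reconstruction of the kind of thing the talk showed, not Mathias' exact slides):

// without the 'u' flag, '.' matches a single UTF-16 code unit,
// so an astral symbol like '💩' (two code units) doesn't match:
/^.$/.test('💩');  // false
/^.$/u.test('💩'); // true

// and naive string reversal tears surrogate pairs apart:
'💩'.split('').reverse().join(''); // mangled garbage, not '💩'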

II. Douglas Crockford published JavaScript: The Good Parts about eight years ago. It’s a masterpiece to me, with a lot of takeaways: JavaScript is loaded with crap (the bad and awful parts); there’s a great functional language somewhere in there, striving to get out; sometimes, a lot of the time, less is more. The book itself is the perfect example: 100 pages long, mostly filled with snippets and examples, and still it had a clearer purpose and communicated it more effectively than most computer books. We can (and we should) make a better language out of JavaScript by subsetting it.

III. JavaScript and its ecosystem have been evolving constantly over the last few years; I imagine it has mutated faster than any other programming language before. And a lot of the changes are genuinely good additions that make our lives easier and more pleasant. But is JavaScript as a whole getting any better?

IV. Unlike any other language, JavaScript runs in browsers. And we can’t control the runtime of the browsers. And we want older browsers to support our new code, and new browsers to support our old code. We can’t break backwards compatibility. We can add (some) stuff to the language but we can’t take stuff out. All the bad and awful parts are still there. JavaScript is like a train wreck with a nose job.

V. I always wonder what it’s like to learn JavaScript in 2016, for new programmers and for programmers coming from other languages. Do you tell them about var? Or just stick to let and const? And which one of those is preferred? And why isn’t the preferred one just the default? Why do we always have to prepend some operator to define a variable? Yes, we can instruct a linter to forbid this or that keyword (especially this!) but it should be the compiler/interpreter doing it. And we can’t make the interpreter assume a const declaration by default.
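To make that concrete, here’s the textbook gotcha that makes var the non-preferred keyword (a stock example, not tied to any particular tutorial):

// var is function-scoped and hoisted, so every callback shares one i:
for (var i = 0; i < 3; i++) {
  setTimeout(function () { console.log(i); }); // prints 3, 3, 3
}

// let creates a fresh, block-scoped binding per iteration:
for (let j = 0; j < 3; j++) {
  setTimeout(function () { console.log(j); }); // prints 0, 1, 2
}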

VI. Is there really no way to break backwards compatibility? Can’t we just stick a flag in there somewhere? <script language="javascript-good">? 'use non-crappy'? I don’t mind adding 'use strict' to every file, and I honestly forget what it does. Can’t the browsers manage multiple versions of JavaScript? Wouldn’t it be worth their while, especially now that this uneven language has crawled its way to the server and the desktop?

VII. I know there must be excellent reasons why we can’t break backwards compatibility in JavaScript, or why it would just be too expensive to do so. But I can’t help my mind, my syntax-oriented-programmer-with-an-inclination-to-a-less-is-more-kind-of-thinking-type-of-mind, I can’t keep it from imagining how I would go about subsetting the language. How I would design my JavaScript--. I can’t stop myself from outlining a spec of that imaginary language in my head. I even came up with a name for it, which, unsurprisingly, has already been taken.


Real-world RPC with RabbitMQ and Node.JS

tl;dr: use the direct reply-to feature to implement RPC in RabbitMQ.

I’m currently working on a platform that relies heavily on RPC over RabbitMQ to move incoming requests through a chain of Node.JS worker processes. The high level setup for RPC is well described in RabbitMQ’s documentation; let’s steal their diagram:

[Diagram from the RabbitMQ tutorial: an RPC client sending to rpc_queue with replyTo and correlationId properties, and a server publishing the response to the reply queue]
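For context, the server side of that diagram looks roughly like this (a minimal sketch along the lines of the RabbitMQ tutorial, not our actual worker code; handleRequest stands in for the application logic):

const amqp = require('amqplib');

amqp.connect('amqp://localhost')
  .then((conn) => conn.createChannel())
  .then((channel) => channel.assertQueue('rpc_queue', {durable: false})
    .then(() => channel.consume('rpc_queue', (msg) => {
      // handleRequest stands in for the application-specific work
      const response = handleRequest(msg.content);

      // reply to the queue the client named, echoing its correlationId
      channel.sendToQueue(msg.properties.replyTo, new Buffer(response), {
        correlationId: msg.properties.correlationId
      });
      channel.ack(msg);
    })));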

We grew our RPC client code based on the JavaScript tutorial, using the amqp.node module. The first —admittedly naive— implementation just created a new connection, channel and queue per request and killed the connection after getting a reply:

const amqp = require('amqplib');
const uuid = require('uuid');

const sendRPCMessage = (settings, message, rpcQueue) =>
  amqp.connect(settings.url, settings.socketOptions)
    .then((conn) => conn.createChannel())
    .then((channel) => channel.assertQueue('', settings.queueOptions)
      .then((replyQueue) => new Promise((resolve, reject) => {
        const correlationId = uuid.v4();
        const msgProperties = {
          correlationId,
          replyTo: replyQueue.queue
        };

        function consumeAndReply (msg) {
          if (!msg) return reject(new Error('consumer cancelled by rabbitmq'));

          if (msg.properties.correlationId === correlationId) {
            resolve(msg.content);
          }
        }

        channel.consume(replyQueue.queue, consumeAndReply, {noAck: true})
        .then(() => channel.sendToQueue(rpcQueue, new Buffer(message), msgProperties));
      })));

That got us a long way during development but obviously failed to perform under non-trivial loads. More shocking was that it got dramatically worse when running on a RabbitMQ cluster.

So we needed to refactor our client code. The problem is that most examples show how to send one-off RPC messages, but aren’t that clear on how the approach should be used at scale in a long-lived process. We obviously needed to reuse the connection, but what about the channel? Should I create a new callback queue per incoming request, or a single one per client?

Using a single reply-to queue per client

Based on the tutorial, I understood that the sensible approach was to reuse the queue from which the client consumed the RPC replies:

In the method presented above we suggest creating a callback queue for every RPC request. That’s pretty inefficient, but fortunately there is a better way – let’s create a single callback queue per client. That raises a new issue, having received a response in that queue it’s not clear to which request the response belongs. That’s when the correlation_id property is used.

We were already checking the correlationId, and just needed to create the reply-to queue in advance:

const createClient = (settings) =>
  amqp.connect(settings.url, settings.socketOptions)
    .then((conn) => conn.createChannel())
    .then((channel) => channel.assertQueue('', settings.queueOptions)
      .then((replyQueue) => {
        channel.replyQueue = replyQueue.queue;
        return channel;
      }));

I thought that would be enough to make sure the right consumer got the right message, but in practice I found that each message was always delivered to the first consumer. Therefore, I needed to cancel the consumer after the reply was processed:

const sendRPCMessage = (channel, message, rpcQueue) =>
  new Promise((resolve, reject) => {
    const correlationId = uuid.v4();
    const msgProperties = {
      correlationId,
      replyTo: channel.replyQueue
    };

    function consumeAndReply (msg) {
      if (!msg) return reject(new Error('consumer cancelled by rabbitmq'));

      if (msg.properties.correlationId === correlationId) {
        channel.cancel(correlationId)
          .then(() => resolve(msg.content));
      }
    }

    channel.consume(channel.replyQueue, consumeAndReply, {
      noAck: true,
      // use the correlationId as a consumerTag to cancel the consumer later
      consumerTag: correlationId
    })
    .then(() => channel.sendToQueue(rpcQueue, new Buffer(message), msgProperties));
  });

Enough? Only if the client processed one request at a time. As soon as I added some concurrency I saw that some of the messages were not handled at all. They were picked up by the wrong consumer, which ignored them because of the correlationId check, so they were lost. I needed to do something about unexpected message handling.

Requeuing unexpected messages

I tried using nack when a consumer received a reply to a message with an unexpected correlationId:

function consumeAndReply (msg) {
  if (!msg) return reject(new Error('consumer cancelled by rabbitmq'));

  if (msg.properties.correlationId === correlationId) {
    // for ack/nack to work, the consumer must now be registered without {noAck: true}
    channel.ack(msg);
    channel.cancel(correlationId)
      .then(() => resolve(msg.content));
  } else {
    channel.nack(msg);
  }
}

Now the messages seemed to be handled, eventually. Only they weren’t: when I increased the load I saw message loss again. Further inspection revealed that the consumers were entering a weird infinite loop:

Consumer A gets message B; message B requeued
Consumer B gets message C; message C requeued
Consumer C gets message A; message A requeued

Repeated ad infinitum. The same behaviour was reproduced with every possible combination of nack, reject and sendToQueue to send the message back.

Googling the issue, I read about the possibility of using Dead letter exchanges to handle those cases. But having to manually requeue unexpected messages felt weird enough; introducing a new exchange and queue sounded like a lot of effort to handle what should be a pretty standard use case for RPC. Better to take a step back.

Using a new reply-to queue per request

So I went back to a reply-to queue per request. This was marginally better than our initial approach since now at least we were recycling the connection and the channel. What’s more, that appeared to be the standard way to do RPC in RabbitMQ according to the few spots where I found non-tutorial implementation details, so, as Paul Graham would say, we wouldn’t get in trouble for using it.

And it worked well for us as long as we ran a single RabbitMQ instance. When we moved to a RabbitMQ cluster, the performance was pretty much the same as when we were creating connections like there was no tomorrow.

Using direct reply-to

We were seriously considering dropping the RabbitMQ cluster altogether (which meant turning our broker into a single point of failure), when I came across the link to the direct reply-to documentation. The first interesting thing there was that it confirmed why we were seeing such bad performance when running a RabbitMQ cluster:

The client can declare a single-use queue for each request-response pair. But this is inefficient; even a transient unmirrored queue can be expensive to create and then delete (compared with the cost of sending a message). This is especially true in a cluster as all cluster nodes need to agree that the queue has been created, even if it is unmirrored.

Direct reply-to uses a pseudo-queue instead, avoiding the queue declaration cost. And fortunately it was fairly straightforward to implement:

const createClient = (settings) => amqp.connect(settings.url, settings.socketOptions);

const sendRPCMessage = (client, message, rpcQueue) => client.createChannel()
  .then((channel) => new Promise((resolve, reject) => {
    const replyToQueue = 'amq.rabbitmq.reply-to';
    // give up after 10 seconds, closing the channel
    const timeout = setTimeout(() => channel.close(), 10000);

    const correlationId = uuid.v4();
    const msgProperties = {
      correlationId,
      replyTo: replyToQueue
    };

    function consumeAndReply (msg) {
      if (!msg) return reject(new Error('consumer cancelled by rabbitmq'));

      if (msg.properties.correlationId === correlationId) {
        resolve(msg.content);
        clearTimeout(timeout);
        channel.close();
      }
    }

    channel.consume(replyToQueue, consumeAndReply, {noAck: true})
    .then(() => channel.sendToQueue(rpcQueue, new Buffer(message), msgProperties));
  }));

This worked just as we expected, even in the cluster. As the code shows, though, we were still creating a new channel per request and we needed to handle its closing, even when the response never came. Trying to use a single channel resulted in a “reply consumer already set” error, because the queue was always the same.

Creating so many channels didn’t feel right, so I filed an issue asking for advice in the amqp.node repo. The creator confirmed that it was indeed an anti-pattern and suggested not only using a single channel but also registering a single consumer (i.e. a single callback function to handle all RPC responses). This meant introducing some structure to route responses back to the promise that was expecting them. Using an EventEmitter turned out to be an elegant way to accomplish it:

const amqp = require('amqplib');
const uuid = require('uuid');
const EventEmitter = require('events').EventEmitter;

const REPLY_QUEUE = 'amq.rabbitmq.reply-to';

const createClient = (settings) => amqp.connect(settings.url, settings.socketOptions)
  .then((conn) => conn.createChannel())
  .then((channel) => {
    // create an event emitter where rpc responses will be published by correlationId
    channel.responseEmitter = new EventEmitter();
    channel.responseEmitter.setMaxListeners(0);
    channel.consume(REPLY_QUEUE,
      (msg) => channel.responseEmitter.emit(msg.properties.correlationId, msg.content),
      {noAck: true});

    return channel;
  });

const sendRPCMessage = (channel, message, rpcQueue) => new Promise((resolve) => {
  const correlationId = uuid.v4();
  // listen for the content emitted on the correlationId event
  channel.responseEmitter.once(correlationId, resolve);
  channel.sendToQueue(rpcQueue, new Buffer(message), { correlationId, replyTo: REPLY_QUEUE });
});
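A hypothetical caller would then create the client once at startup and fire requests as needed (the queue name and payload here are made up for the example):

createClient({url: 'amqp://localhost'})
  .then((channel) => sendRPCMessage(channel, JSON.stringify({id: 42}), 'my_rpc_queue'))
  .then((content) => console.log('got reply:', content.toString()));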

Better authentication for socket.io (no query strings!)

Introduction

This post describes an authentication method for socket.io that sends the credentials in a message after connection, rather than including them in the query string as is usually done. Note that the implementation is already packaged in the socketio-auth module, so you should use that instead of the code below.

The reason to use this approach is that putting credentials in a query string is generally a bad security practice (see this, this and this), and though some of the frequent risks may not apply to the socket.io connection request, it should be avoided as there’s no general convention of treating URLs as sensitive information. Ideally such data should travel in a header, but that doesn’t seem to be an option for socket.io, as not all of the transports it supports (WebSocket being one) allow sending headers.

Needless to say, all of this should be done over HTTPS; otherwise no level of security can be expected.

Implementation

In order to authenticate socket.io connections, most tutorials suggest doing something like:

io.set('authorization', function (handshakeData, callback) {
  var token = handshakeData.query.token;
  //will call callback(null, true) if authorized
  checkAuthToken(token, callback);
});

Or, with the middleware syntax introduced in socket.io 1.0:

io.use(function(socket, next) {
  var token = socket.request.query.token;
  checkAuthToken(token, function(err, authorized){
    if (err || !authorized) {
      return next(new Error("not authorized"));
    }
    next();
  });
});

Then the client would connect to the server passing its credentials, which can be an authorization token, a user and password, or whatever value can be used for authentication:

socket = io.connect('http://localhost', {
  query: "token=" + myAuthToken
});

The problem with this approach is that it sends the credentials in a query string, that is, as part of a URL. As mentioned, this is not a good idea, since URLs can be logged and cached and are not generally treated as sensitive information.

My workaround for this was to allow the clients to establish a connection, but force them to send an authentication message before they can actually start emitting and receiving data. Upon connection, the server marks the socket as not authenticated and adds a listener to an ‘authenticate’ event:

var io = require('socket.io').listen(app);

io.on('connection', function(socket){
  socket.auth = false;
  socket.on('authenticate', function(data){
    //check the auth data sent by the client
    checkAuthToken(data.token, function(err, success){
      if (!err && success){
        console.log("Authenticated socket ", socket.id);
        socket.auth = true;
      }
    });
  });

  setTimeout(function(){
    //If the socket didn't authenticate, disconnect it
    if (!socket.auth) {
      console.log("Disconnecting socket ", socket.id);
      socket.disconnect('unauthorized');
    }
  }, 1000);
});

A timeout is added to disconnect the client if it didn’t authenticate within a second. The client will emit its auth data to the ‘authenticate’ event right after connection:

var socket = io.connect('http://localhost');
socket.on('connect', function(){
  socket.emit('authenticate', {token: myAuthToken});
});

An extra step is required to prevent the client from receiving broadcast messages during that window where it’s connected but not authenticated. Doing that required fiddling a bit with the socket.io namespaces code; the socket is removed from the object that tracks the connections to the namespace:

var _ = require('underscore');
var io = require('socket.io').listen(app);

_.each(io.nsps, function(nsp){
  nsp.on('connect', function(socket){
    if (!socket.auth) {
      console.log("removing socket from", nsp.name)
      delete nsp.connected[socket.id];
    }
  });
});

Then, when the client does authenticate, we set it back as connected to those namespaces where it was connected:

socket.on('authenticate', function(data){
  //check the auth data sent by the client
  checkAuthToken(data.token, function(err, success){
    if (!err && success){
      console.log("Authenticated socket ", socket.id);
      socket.auth = true;

      _.each(io.nsps, function(nsp) {
        if(_.findWhere(nsp.sockets, {id: socket.id})) {
          console.log("restoring socket to", nsp.name);
          nsp.connected[socket.id] = socket;
        }
      });

    }
  });
});
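The same flag also works for guarding the server’s own event handlers, so that a connected-but-unauthenticated socket can’t trigger any work during that window (a sketch of the idea, not part of socketio-auth; ‘some-event’ and handleSomeEvent are placeholders):

socket.on('some-event', function(data){
  //ignore events from sockets that haven't authenticated yet
  if (!socket.auth) return;

  handleSomeEvent(socket, data);
});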

The road to Invisible.js

This post will describe the development process of Invisible.js, the isomorphic JavaScript framework that Martín Paulucci and I have been working on for around a year, as our Software Engineering final project at the University of Buenos Aires.

Motivation and Philosophy

We came from different backgrounds; I had been programming in Django for years, working on applications with increasingly complex UIs, moving from spaghetti jQuery to client MVCs such as backbone; Martín was already getting into Node.js development, also using AngularJS after trying other client frameworks. We both regarded the current state of web development, centered on REST servers and MV* clients, as one of unstable equilibrium. Some problems were evident to us: inherent duplication (same models, same validations) and continuous context switches between front end and back end code. The latter was partially solved by Node.js, which lets programmers use the same language on both sides. But we felt there wasn’t enough effort put into taking advantage of the potential of the platform, to erase or at least reduce the gap between client and server in web applications. That was the direction we wanted to take with Invisible.js, acknowledging the limitations of being a couple of developers working in our free time.

With that goal in mind, we started out with some months of research on frameworks and technologies, most of which we weren’t familiar with or hadn’t used yet; then we built a couple of prototypes to test them out. After that, we had a better picture of how to lay out the development of Invisible.js. We weren’t out to build a full stack framework like Derby or Meteor, trying to cover every aspect of web development; rather, we wanted to pull together the awesome modules available in Node.js (express, browserify, socket.io) in order to achieve client/server model reuse as gracefully as possible. In that sense, the nodejitsu blog was a great source of inspiration.

As a side note, the framework is named after Invisible, a progressive rock group led by Luis Alberto Spinetta in the ’70s.

Architecture

Invisible.js stands on a more or less MEAN stack; it’s actually not tied at all to AngularJS, but we usually choose it as our front end framework, as we think it’s the best and it places no constraints on the models it observes, which makes it a good fit for Invisible (as opposed to backbone, for example). As for the database, we appreciate the short distance between a JSON object and a Mongo document, plus it has a nice, flexible Node.js driver; but certainly an interesting branch of development for Invisible.js would be to add support for other data stores.

The main components of Invisible are the models. The developer defines them with their methods and attributes, and registers them in Invisible; this way they are exposed on both client and server, and augmented with methods to handle database access, real-time events and validations. The implementation of those methods changes depending on where the call is made, as the figure shows.

[Figure: the same Invisible model call dispatching to different implementations depending on whether it runs on the client or the server]

Under the hood, the Invisible server, which replaces the express one, exposes a dynamically generated browserify bundle that contains all the registered model definitions for the client. It also exposes the REST controllers that handle the CRUD methods the models call.
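To illustrate the point (a conceptual sketch, not Invisible.js’s actual API; db and httpPost are hypothetical stand-ins), an isomorphic model can expose one method whose implementation is picked at runtime:

// hypothetical isomorphic model: calling code looks identical on both sides
function Person(name) {
  this.name = name;
}

Person.prototype.save = function (callback) {
  if (typeof window === 'undefined') {
    // server: talk to the database directly (db stands in for a Mongo driver)
    db.collection('people').insert(this, callback);
  } else {
    // browser: hit the generated REST controller (httpPost stands in for an HTTP helper)
    httpPost('/invisible/person', this, callback);
  }
};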

Further development

We’re very pleased with the result of our work; most of the things we’ve tried worked out, and we went further than we expected. Indeed, we feel that Invisible.js not only solves its initial goal of exposing reusable models, but also that it’s simple to use and gives a lot of non-trivial stuff out of the box, with a few lines of code.

Nevertheless, we are very aware that it’s still a toy; a fun one, which we’d like to keep working on, but a toy so far. As the Russians fairly point out, Invisible.js exposes the database to any JavaScript client (namely, the browser console), without any restriction whatsoever. Thus, the main thing we’ll be working on in the short term is providing means for authentication and authorization: establishing the identity of the clients and restricting the segments of data they can access, both in the REST API and the socket.io events. We’ve already started studying the topic and made some implementation attempts.

Apart from security, we still have to see how well the framework scales, both in resources and code base, as it’s used in medium and large applications. We hope other developers will find it interesting enough to give it a try and start collaborating, so it does turn into something more than a toy.

A Node.js primer

I finally took the time to start fiddling with Node.js, and as I expected from such a young and dynamic technology, I ran into some gotchas and configuration headaches, so I’ll put down some notes here that might be helpful for other people getting started with Node.

First off, a good resource to get familiar with Node and its philosophy: The Node Beginner Book. It’s a long tutorial that guides you through a very simple web application, explaining some of Node’s basic concepts and JavaScript programming techniques on the way. This book was completely available at www.nodebeginner.org up until a couple of months ago, with a little disclaimer in the middle for those interested in buying it. Now only the first half is online, but the author points out in the comments that the full book is still available on github.

As I moved along the tutorial, I ran into the first problem: it uses the formidable node module to handle the file uploads, which is not compatible with the most recent versions of Node (the current one is 0.10.5). Looking into this I found out a couple of interesting facts:

  • The versioning scheme of Node states that odd versions are unstable and even versions are stable.
  • Since 0.10 is a very recent version, it’s recommended for those starting out to stick with 0.8 (the previous stable version).

So I needed to install version 0.8.something.

At this point I started to feel uncomfortable messing around with different versions on a global Node installation. Nor did I like having to sudo every time I needed to install a new module. There was some misleading advice around the web suggesting a chown on your /usr/local folder as a way to avoid this, which didn’t look all that good. Coming from Python and virtualenv, I like to handle my installations locally. This is the simplest way I’ve found to do it.

There are several modules that allow handling multiple Node versions, the most popular being nvm and n. I found n was difficult to configure to work with the local installation, so I switched to nvm instead. The code needed to install it and switch to 0.8 was something like:

wget -qO- https://raw.github.com/creationix/nvm/master/install.sh | sh
echo '[[ -s ~/.nvm/nvm.sh ]] && . ~/.nvm/nvm.sh' >> ~/.bashrc
nvm install 0.8
nvm alias default 0.8