I was hired in December 2014 as the sixth engineer at Shyp. Shyp runs Node.js on the server. It’s been a pretty frustrating journey, and I wanted to share some of the experiences we’ve had. There are no hot takes about the learning curve, or "why are there so many frameworks" in this post.
Initially we were running on Node 0.10.30. Timers would consistently fire 50 milliseconds early – that is, if you called setTimeout with a duration of 200 milliseconds, the timer would fire after 150 milliseconds. I asked about this in the #Node.js Freenode channel, and was told it was my fault for using an "ancient" version of Node (it was 18 months old at that time), and that I should get on a "long term stable" version. This made timeout-based tests difficult to write – every test had to set timeouts longer than 50ms.
I wrote a metrics library that published to Librato. I expect a background metric publishing daemon to silently swallow/log errors. One day Librato had an outage and returned a 502 to all requests; unbeknownst to me, the underlying librato library we were using was also an EventEmitter, which crashes the process if its errors go unhandled. This caused about 30 minutes of downtime in our API.
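For anyone who hasn't hit this yet, here is a minimal sketch of the failure mode using only Node's built-in events module. FakeMetricsClient is a hypothetical stand-in, not the librato library's actual API; the point is only that emitting 'error' on an EventEmitter with no 'error' listener throws, and an uncaught throw takes the process down.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for a metrics client that reports failures
// by emitting 'error' instead of passing them to a callback.
class FakeMetricsClient extends EventEmitter {
  publish() {
    // Simulate the upstream service answering with a 502.
    this.emit('error', new Error('upstream returned 502'));
  }
}

// No 'error' listener: emit('error') throws synchronously.
// Outside a try/catch this would crash the process.
const unguarded = new FakeMetricsClient();
let threw = false;
try {
  unguarded.publish();
} catch (err) {
  threw = true;
}

// With a listener attached, the error is delivered to the listener instead.
const guarded = new FakeMetricsClient();
const seen = [];
guarded.on('error', (err) => seen.push(err.message));
guarded.publish();

console.log({ threw, seen });
```

Attaching an 'error' listener to every emitter you get back from a third-party library is the defensive move, assuming you know the object is an emitter in the first place.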
The advice on the Node.js website is to crash the process if there’s an uncaught exception. Our application is about 60 models, 30 controllers; it doesn’t seem particularly large. It consistently takes 40 seconds to boot our server in production; the majority of this time is spent requiring files. Obviously we try not to crash, and fix crashes when we see them, but we can’t take 40 seconds of downtime because someone sent an input we weren’t expecting. I asked about this on the Node.js mailing list but got mostly peanuts. Recently the Go core team sped up build times by 2x in a month.
We discovered our framework was sleeping for 50ms on POST and PUT requests for no reason. I’ve previously written about the problems with Sails, so I am going to omit most of that here.
We upgraded to Node 4 (after spending a day dealing with a nasty TLS problem) and observed our servers were consuming 20-30% more RAM, with no accompanying speed, CPU utilization, or throughput increase. We were left wondering what benefit we got from upgrading.
It took about two months to figure out how to generate an npm shrinkwrap file that produced reliable changes when we tried to update it. Often attempting to modify/update it would change the "from" field in every single dependency.
The sinon library appears to be one of the most popular ways to stub the system time. The default method for stubbing (useFakeTimers) leads many other libraries to hang inexplicably. I noticed this after spending 90 minutes wondering why stubbing the system time caused CSV writing to fail. The only ways to debug this are 1) to add console.log statements at deeper and deeper levels, since you can’t ever hit ctrl+C and print a stack trace, or 2) take a core dump.
The library we used to send messages to Slack crashed our server if/when Slack returned HTML instead of JSON.
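I don't know exactly how that library parsed responses, but the general shape of the problem is easy to sketch: JSON.parse throws a SyntaxError on an HTML body, and if nothing catches it the process dies. The helper below, parseJsonBody, is a hypothetical name for illustration; it returns a result object instead of throwing.

```javascript
// Hypothetical helper: parse a response body without ever throwing.
function parseJsonBody(body) {
  try {
    return { ok: true, value: JSON.parse(body) };
  } catch (err) {
    // JSON.parse throws a SyntaxError on anything that isn't valid JSON,
    // such as an HTML error page from a proxy.
    return { ok: false, error: err };
  }
}

const good = parseJsonBody('{"ok":true}');
const bad = parseJsonBody('<html><body>502 Bad Gateway</body></html>');

console.log(good.ok, bad.ok);
```

The caller then has to check the ok field, which is the whole point: a bad upstream response becomes a value to handle rather than an exception to crash on.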
Our Redis provider changed something – we don’t know what, since the version number stayed the same – in their Redis instance, which causes the Redis connection library we use to crash the process with an "Unhandled event" message. We have no idea why this happens, but it’s tied to the number of connections we have open – we’ve had to keep a close eye on the number of concurrent Redis connections, and eventually just phased out Redis altogether.
Our database framework doesn’t support transactions, so we had to write our own transaction library.
The underlying database driver doesn’t have a way to time out/cancel connection acquisition, so callers will wait forever for a connection to become available. I wrote a patch for this; it hasn’t gotten any closer to merging in 10 weeks, so I published a new NPM package with the patch, and submitted that upstream.
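To be clear about what "time out connection acquisition" means, here is a rough sketch of the idea. This is not the patch mentioned above, and withTimeout and acquireNever are hypothetical names; it assumes the acquisition is exposed as a promise. It also only stops the caller from waiting: the underlying request for a connection is not cancelled, which is why a real fix has to live in the driver.

```javascript
// Hypothetical wrapper: reject if `promise` has not settled within `ms`.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`timed out after ${ms}ms`);
      err.code = 'ETIMEDOUT';
      reject(err);
    }, ms);
  });
  // Whichever settles first wins; always clear the timer afterwards
  // so it does not keep the event loop alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stand-in for a pool whose connections are all checked out:
// the returned promise never settles.
function acquireNever() {
  return new Promise(() => {});
}

withTimeout(acquireNever(), 50).catch((err) => console.log(err.code));
```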
I wrote a job queue in Go. I couldn’t benchmark it using our existing Node server as the downstream server, because the Node server was too slow – a basic Express app would take 50ms on average to process incoming requests and respond with a 202, with 100 concurrent in-flight requests. I had to write a separate downstream server for benchmarking.
Last week I noticed our integration servers were frequently running out of memory. Normally they use about 700MB of memory, but we saw they would accumulate 500MB of swap in about 20 seconds. I think the app served about thirty requests total in that time frame. There’s nothing unusual about the requests, amount of traffic being sent over HTTP or to the database during this time frame; the amount of data was in kilobytes. We don’t have tools like pprof. We can’t take a heap dump, because the box is out of memory by that point.
You could argue – and certainly you could say this about Sails – that we chose bad libraries, and we should have picked better. But part of the problem with the Node ecosystem is it’s hard to distinguish good libraries from bad.
I’ve also listed problems – pound your head, hours or days of frustrating debugging – with at least six different libraries above. At what point do you move beyond "this one library is bad" and start worrying about the bigger picture? Do the libraries have problems because the authors don’t know better, or because the language and standard library make it difficult to write good libraries? I don’t think it helps.
You could argue that we are a fast growing company and we’d have these problems with any language or framework we chose. Maybe? I really doubt it. The job queue is 6000 lines of Go code written from scratch. There was one really bad problem in about two months of development, and a) the race detector caught it before production, b) I got pointed to the right answer in IRC pretty quickly.
You could also argue we are bad engineers and better engineers wouldn’t have these problems. I can’t rule this out, but I certainly don’t think this is the case.
I’ve learned a lot about debugging Node applications, making builds faster, finding and fixing intermittent test failures and more. I am available to speak or consult at your company or event, about what I have learned. I am also interested in figuring out your solutions to these problems – I will buy you a coffee or a beer and chat about this at any time.