The Heroku mess

On February 12, 2013, a startup called RapGenius accused the Heroku cloud host of knowingly stifling the performance of the RapGenius Rails app, which forced RapGenius to run extra instances and pay correspondingly higher service charges.

RapGenius' blog post is impressive in that it describes the issues clearly yet provides sufficient detail. It is really worth a read, which is why I'm going to neither reproduce nor summarize it here.

I believe none of the parties involved (Heroku, RapGenius and the monitoring outfit New Relic) come out looking good:

  • Heroku failed to announce a substantial platform change to customers. The company provides Platform as a Service (several platforms, actually). A platform is a system of components working together to provide essential services for applications. Heroku switched their Rails platform from an optimal combination of components (single-request dynos, intelligent routing) to a substantially sub-optimal one (single-request dynos, random routing). They failed to notify customers of the switch in advance. They either failed to measure the performance impact of the switch or purposefully withheld knowledge of the impact from their customers. They failed to reflect the switch in their documentation.
  • New Relic simply failed to account for all latencies in their monitoring. It appears they measured many internal values within the platform but failed to cross-check them against a black-box view from the outside. It seems quite unprofessional, especially given the pricing of their service.
  • RapGenius put too much trust in an external entity. Instead of an independent monitoring service they used one sold as an add-on to their platform by the platform provider. They took more than two years to detect a major performance degradation in their platform (in fairness, the problem only becomes apparent when the number of dynos reaches a certain threshold).
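To get a feel for why the routing switch matters so much, here is a toy queueing simulation in Python. It is only a sketch under my own assumptions, not a model of Heroku's actual router: `simulate`, its parameters, and the two policy names are hypothetical, arrivals and service times are drawn from exponential distributions, and "intelligent" is taken to mean "send the request to whichever dyno will be free soonest". Each dyno handles one request at a time, as in the single-request setup described above.

```python
import random


def simulate(routing, num_dynos=10, num_requests=5000,
             arrival_rate=8.0, mean_service=1.0, seed=42):
    """Return the mean time a request waits before a dyno starts serving it.

    Toy model (assumption, not Heroku's real router): each dyno serves
    one request at a time, and a request routed to a busy dyno waits
    behind that dyno's backlog.
    """
    rng = random.Random(seed)
    free_at = [0.0] * num_dynos  # time at which each dyno clears its backlog
    now = 0.0
    total_wait = 0.0

    for _ in range(num_requests):
        now += rng.expovariate(arrival_rate)           # next arrival time
        service = rng.expovariate(1.0 / mean_service)  # this request's service time
        # Always draw the random pick so both policies see identical traffic.
        random_pick = rng.randrange(num_dynos)

        if routing == "random":
            dyno = random_pick
        elif routing == "intelligent":
            # Pick the dyno that becomes available earliest.
            dyno = min(range(num_dynos), key=free_at.__getitem__)
        else:
            raise ValueError(f"unknown routing policy: {routing!r}")

        start = max(now, free_at[dyno])
        total_wait += start - now
        free_at[dyno] = start + service

    return total_wait / num_requests


if __name__ == "__main__":
    for policy in ("intelligent", "random"):
        print(f"{policy:12s} mean queue wait: {simulate(policy):.3f}")
```

With identical traffic and the same number of dynos, the random policy leaves requests queued behind busy dynos while others sit idle, so the mean wait comes out markedly higher. The exact numbers depend entirely on the made-up parameters, so treat the output as an illustration of the effect rather than a measurement of anything.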

What's truly puzzling is that none of Heroku's other Rails customers seems to have encountered this issue before RapGenius. One possibility is that RapGenius is the first Heroku Rails app to achieve sufficient scale for the problem to occur. Another is that some customers did indeed catch the problem right at the start, adapted to the new configuration and simply never thought to give a heads-up to the rest of the community.

The moral of the story for me is that it's foolish to use a monitoring service with business ties to the entity being monitored (or, indeed, to outsource monitoring in the first place - it's a critical QA function, after all). An independent monitoring service must still be given enough access to the monitored system to provide the necessary detail. This is not trivial, as monitoring can interfere with the monitored system. Some form of meta-monitoring is perhaps needed. Ah, the ever-fascinating world of IT :-)
