“Big data” is all the rage these days – there are conferences, journals, and a million consultants. Until a few weeks ago, I mocked the term mercilessly. I don’t mock it anymore.

Not a “big” data problem

Facebook has a big data problem. Google has a big data problem. Even MySpace probably has a big data problem. Most businesses, including 37signals, don’t.
Among our “peer group” (SaaS businesses), I’d guess we handle more data than most, but our volume of data is still relatively small: we generate around a terabyte of assorted log data (Rails, Nginx, HAproxy, etc.) every day, plus a few gigabytes of higher-“density” usage and performance data. I’m strictly talking about non-application data here – not the core data our apps use, but all of the tangential data that helps us improve the performance and functionality of our products. We only started putting this data to use in the last couple of years, but it’s been invaluable for informing design decisions, finding performance hot spots, and otherwise improving our applications.
The typical analytical workload with this data is a few gigabytes or tens of gigabytes – sometimes small enough to fit in RAM, sometimes not, but generally within the realm of possibility for tools like MySQL and R. There are some predictable workloads we can optimize for (adding indexes to data stored in MySQL, instrumenting applications so we can work with more condensed data, etc.), but most questions aren’t ones you can plan for particularly well in advance. Querying this data can be slow, but it’s all offline, non-customer-facing work, so latency isn’t hugely important.
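To make that concrete, a typical ad-hoc question looks something like the sketch below – finding the accounts with the slowest average response times from a day of parsed Rails logs in MySQL. The table and column names (rails_requests, account_id, duration_ms, and so on) are illustrative, not our actual schema.

```sql
-- Hypothetical schema: one row per parsed Rails request.
-- An index on (logged_at, account_id) helps when the question is predictable;
-- most ad-hoc questions aren't, so they end up scanning the table.
SELECT account_id,
       COUNT(*)         AS request_count,
       AVG(duration_ms) AS avg_duration_ms
FROM   rails_requests
WHERE  logged_at >= '2012-11-01'
AND    logged_at <  '2012-11-02'
GROUP BY account_id
ORDER BY avg_duration_ms DESC
LIMIT  25;
```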
None of this is an insurmountable problem, and it’s all pretty typical of “medium” data – enough data you have to think about the best way to manage and analyze it, but not “big” data like Facebook or Google.

Technology changes everything

The challenges of this medium amount of data are, however, enough that I occasionally wish for a better way to extract information from logs, or for a way to interact more freely with large amounts of data in our internal analytics application.
A few months ago, Taylor spun up a few machines on AWS to try out Hive with some of our log data. While it was kind of exciting to see queries split across machines in a cluster, the performance of simple queries on a moderately sized dataset (a few gigabytes) on those virtualized instances wasn’t particularly impressive – in fact, it was much slower than running MySQL against the same dataset.
A couple of weeks ago, Cloudera released Impala, promising Hive-like functionality at much lower latency. I decided to give Hadoop-based, SQL-like technologies another shot.
We set up a couple of machines in a cluster, pulled together a few sample datasets, and ran some benchmarks comparing Impala, Hive, and MySQL. The results were encouraging for Impala.
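For the curious, “pulling together a few sample datasets” mostly means putting delimited files into HDFS and describing them as tables with Hive DDL; Impala reads the same table definitions through the shared Hive metastore, so the identical queries run from either hive or impala-shell. The sketch below assumes a tab-delimited, pre-parsed HAproxy log – the path and column names are illustrative, not our actual layout.

```sql
-- Hive DDL over a directory of tab-delimited, pre-parsed HAproxy log files in HDFS.
-- Impala picks up this table via the shared Hive metastore.
CREATE EXTERNAL TABLE haproxy_requests (
  logged_at     STRING,
  client_ip     STRING,
  backend       STRING,
  status_code   INT,
  request_time  INT,     -- total request time in milliseconds
  request_path  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/haproxy/';
```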

| Workload | Impala Query Time | Hive Query Time | MySQL Query Time |
|---|---|---|---|
| 5.2 GB HAproxy log – top IPs by request count | 3.1s | 65.4s | 146s |
| 5.2 GB HAproxy log – top IPs by total request time | 3.3s | 65.2s | 164s |
| 800 MB parsed Rails log – slowest accounts | 1.0s | 33.2s | 48.1s |
| 800 MB parsed Rails log – highest database time paths | 1.1s | 33.7s | 49.6s |
| 8 GB pageview table – daily pageviews and unique visitors | 22.4s | 92.2s | 180s |
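Each workload above is essentially a single aggregate query. As a rough illustration, the “top IPs by request count” query against the haproxy_requests table sketched earlier is nothing more exotic than the following, and the same statement runs in Impala, Hive, and (against an equivalent MySQL table) MySQL:

```sql
-- "Top IPs by request count" – same SQL for all three systems.
SELECT client_ip,
       COUNT(*) AS request_count
FROM   haproxy_requests
GROUP BY client_ip
ORDER BY request_count DESC
LIMIT  20;
```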

These aren’t scientific benchmarks by any means (nothing’s been especially tuned or optimized), but I think they’re indicative enough: on real hardware, it’s certainly possible to dramatically improve on tools like MySQL for these sorts of analytical workloads. With larger datasets, you’d expect the performance differential to grow (none of these datasets exceeded the buffer pool size in MySQL).
We’re still in the early stages of actually putting this to use – setting up the data flows, monitoring, reporting, etc. – and there are sure to be many ups and downs as we continue to dig deeper. So far, though, I’m thrilled to have been proven wrong about the utility of these technologies for businesses at sub-Facebook scale, and I’m more than happy to eat crow in this case.