“Big data” is all the rage these days – there are conferences, journals, and a million consultants. Until a few weeks ago, I mocked the term mercilessly. I don’t mock it anymore.
Not a “big” data problem
Facebook has a big data problem. Google has a big data problem. Even MySpace probably has a big data problem. Most businesses, including 37signals, don’t.
I would guess that among our “peer group” (SaaS businesses), we probably handle more data than most, but our volume of data is still relatively small: we generate around a terabyte of assorted log data (Rails, Nginx, HAproxy, etc.) every day, and a few gigabytes of higher “density” usage and performance data. I’m strictly talking about non-application data here – not the core data that our apps use, but all of the tangential data that’s helpful to improve the performance and functionality of our products. We’ve only even attempted to use this data in the last couple of years, but it’s invaluable for informing design decisions, finding performance hot spots, and otherwise improving our applications.
The typical analytical workload with this data is a few gigabytes to tens of gigabytes – sometimes small enough to fit in RAM, sometimes not, but generally within the realm of possibility for tools like MySQL and R. There are some predictable workloads you can optimize for (adding indexes for data stored in MySQL, instrumenting so you can work with more condensed data, etc.), but most queries aren’t ones you can plan for particularly well ahead of time. Querying this data can be slow, but it’s all offline, non-customer-facing work, so latency isn’t hugely important.
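To make the “predictable” side of that concrete: for data that lands in MySQL, optimizing usually just means indexing the columns you filter and group on. Here’s a minimal sketch with a hypothetical parsed-log table – the table and column names are purely illustrative, not our actual schema:

```sql
-- Hypothetical parsed request-log table; table and column names are illustrative only.
CREATE TABLE parsed_requests (
  id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  logged_at   DATETIME        NOT NULL,
  account_id  INT UNSIGNED    NOT NULL,
  path        VARCHAR(255)    NOT NULL,
  duration_ms INT UNSIGNED    NOT NULL,
  db_time_ms  INT UNSIGNED    NOT NULL
);

-- Index the columns that the predictable queries filter and group on.
ALTER TABLE parsed_requests ADD INDEX idx_logged_at (logged_at);
ALTER TABLE parsed_requests ADD INDEX idx_account_id (account_id);
```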
None of this is an insurmountable problem, and it’s all pretty typical of “medium” data – enough data you have to think about the best way to manage and analyze it, but not “big” data like Facebook or Google.
Technology changes everything
The challenges of this medium amount of data are, however, enough that I occasionally wish for a better solution for extracting information from logs or a way to enable more interaction with large amounts of data in our internal analytics application.
A few months ago, Taylor spun up a few machines on AWS to try out Hive with some log data. While it was kind of exciting to see queries running split across machines in a cluster, the performance of simple queries on a moderately sized dataset (a few gigabytes) on these virtualized instances wasn’t particularly impressive – in fact, it was much slower than using MySQL with the same dataset.
A couple of weeks ago, Cloudera released Impala, with the promise of Hive-like functionality at much lower latency. I decided to give Hadoop-based SQL-like technologies another shot.
We set up a couple of machines in a cluster, pulled together a few sample datasets, and ran a few benchmarks comparing Impala, Hive, and MySQL. The results were encouraging for Impala.
| Workload | Impala Query Time | Hive Query Time | MySQL Query Time |
|---|---|---|---|
| 5.2 GB HAproxy log – top IPs by request count | 3.1s | 65.4s | 146s |
| 5.2 GB HAproxy log – top IPs by total request time | 3.3s | 65.2s | 164s |
| 800 MB parsed Rails log – slowest accounts | 1.0s | 33.2s | 48.1s |
| 800 MB parsed Rails log – highest database time paths | 1.1s | 33.7s | 49.6s |
| 8 GB pageview table – daily pageviews and unique visitors | 22.4s | 92.2s | 180s |
These aren’t scientific benchmarks by any means (nothing’s been especially tuned or optimized), but I think they’re indicative enough: on real hardware, it’s certainly possible to dramatically improve on tools like MySQL for these sorts of analytical workloads. With larger datasets, you’d expect the performance differential to grow (none of these datasets exceeded the buffer pool size in MySQL).
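For a sense of what these workloads actually are, each one is a simple aggregation over a flat table. A minimal sketch of the first one (top IPs by request count) might look like the query below – the table and column names are assumptions for illustration, not the queries we actually ran, and essentially the same SQL works on MySQL, Hive, and Impala:

```sql
-- Hypothetical table with one row per HAproxy request; names are illustrative only.
SELECT client_ip,
       COUNT(*)           AS request_count,
       SUM(total_time_ms) AS total_request_time_ms
FROM haproxy_requests
GROUP BY client_ip
ORDER BY request_count DESC
LIMIT 20;
```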
We’re still in the early stages of actually putting this to use – setting up the data flows, monitoring, reporting, etc. – and there are sure to be many ups and downs as we continue to dig deeper. So far, though, I’m thrilled to have been proven wrong about the utility of these technologies for businesses at sub-Facebook scale, and I’m more than happy to eat crow in this case.
Ben Dunlap
on 09 Nov 12
Where do you keep those daily terabytes of log data? Does log data have a pretty short shelf life before getting thrown away?
Josh Turmel
on 09 Nov 12
Have you thought about using Google’s BigQuery? We’ve been really impressed with the performance of the ad-hoc queries we can perform on hundreds of millions (& billions) of rows in just a few seconds.
David Andersen
on 09 Nov 12
I think it’s perfectly fine to still mock ‘big data’ because it’s clearly an over-the-top marketing/consulting blitz that consists of mostly hot air. Like any fad there is some substance to some of it, but only some of it.
Chris Baus
on 09 Nov 12
I like how you mention you don’t have a “big” data problem. We definitely have data problems, but I often joke that our data problems are “medium” data problems. There is big and there is BIG!
Sowell Man
on 09 Nov 12
Stephen Few: “...big data is more hype than substance and it thrives on remaining ill defined.”
Big Data, Big Deal
BillP
on 09 Nov 12
Great post Noah! When giving presentations about Hadoop to database users (who knew nothing about Hadoop and the ecosystem), it was very valuable to deprogram everyone from the marketing buzzword hype machine: Big Data = Hype, Hadoop = Reality.
Hadoop/HBase/Hive/Impala are fantastic tools for large datasets – think terabytes to petabytes, sizes that cause relational databases to choke. It’s just hard to swim against all of the marketing-speak.
Ted Jackson
on 09 Nov 12
Ben is exactly right.
Everyone has a Big Data problem but just doesn’t realize it – it’s how you get meaningful information from the huge amount of OS/database/application log files you’re keeping today.
People mistakenly think Big Data means it has to be customer-oriented data.
There’s huge untapped value in the data found in log files that’s not being utilized today.
Michiel Sikkes
on 10 Nov 12
I am just wondering. Speed is definitely important if you are doing a lot of queries just to see where you can find some interesting data to optimize something in your apps or infrastructure.
But can you name some examples of what makes the effort of optimizing your statistics tools worthwhile – insights you gained from them and were able to put into practice in daily life?
Of course, I can imagine that bringing something down from 2 minutes to a few seconds is at least a lot less frustrating when you are just querying potentially interesting information.
Dimitri
on 12 Nov 12
Any way to get a look at the SQL queries you’ve used for MySQL? (and their EXPLAIN plans)
Thank you!
Rgds, -Dimitri
Dominic
on 12 Nov 12
@Chris – then there’s Dayaaaam!