I analyze data for a living. I occasionally do some other things for Basecamp — help with marketing, pitch in on support, do some “business” things — but at the end of the day, I analyze data. Some of the data is about feature usage, some about application performance or speed, some about our great support team, some about financial matters, and some of it defies categorization; regardless of the type of data, my job is to identify business problems, use data to understand them, and make actionable recommendations.
If you ask almost any data analyst, they’ll tell you that the biggest chunk of their time is spent cleaning and preparing data — getting it into a form that’s usable for reporting or analysis. Ironically, I don’t have any actual data about how much time that process consumes, either personally or for the profession as a whole, but I’d guess that the time spent on acquisition and transformation of data outweighs actual math, statistics, or coming up with recommendations by a factor of five to one or more.
Over time, both personally and as an organization, you get better at capturing and preparing data, and it eats up less and less of your time. I’d characterize my time at Basecamp as four fairly distinct phases of increasing sophistication in how I prepare data for analysis:
- The CSV phase: In my early days at Basecamp, I was just happy to have data at all, and everything was basically a comma-separated values (CSV) file. I used `SELECT … INTO OUTFILE` to get data from MySQL databases, `awk` to get things from log files, and the ‘export’ button in third-party services to get data that I could then analyze.
- The R script phase: After a month or two, I graduated to a set of R scripts to get data directly into my analysis environment of choice. A wrapper function got data from our MySQL databases (a rough sketch of that wrapper follows this list), and I wrote API wrappers to get data from external services. Our first substantial automated reports showed up in this phase, and they were literally R scripts piped to `sendmail` on my laptop.
- The embryonic data warehouse: Eventually, a fledgling “data warehouse” started to take form — a MySQL instance held some data explicitly for analysis, a Hadoop cluster came into the picture to process and store logs, and we added a dashboard application that standardized reporting.
- The 90% data warehouse (today): a centralized data warehouse now holds almost all of our data, every type of data belongs to a documented schema, and Tableau has dramatically changed the way we do reporting. It’s not perfect — a few pieces of data remain scattered and get analyzed only after manual cleaning and processing, but that’s the exception rather than the rule.
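To make the second phase concrete, here’s a minimal sketch of the kind of MySQL wrapper function I’m describing, assuming the DBI and RMySQL packages; the host, database, credentials, and query below are placeholders rather than our actual setup:

```r
# A minimal sketch of the MySQL wrapper from the R-script phase (assumes the
# DBI and RMySQL packages; connection details below are placeholders).
library(DBI)

get_data <- function(query) {
  con <- dbConnect(RMySQL::MySQL(),
                   host     = "db.example.com",   # placeholder host
                   dbname   = "analytics",        # placeholder database
                   user     = Sys.getenv("DB_USER"),
                   password = Sys.getenv("DB_PASS"))
  on.exit(dbDisconnect(con))                      # always clean up the connection
  dbGetQuery(con, query)                          # returns a plain data frame
}

# Usage: pull daily signup counts straight into the analysis session.
signups <- get_data("
  SELECT DATE(created_at) AS day, COUNT(*) AS signups
  FROM users
  GROUP BY 1
")
```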
Over the course of this transformation, the time that I spend preparing data for analysis has fallen dramatically — it used to be that I started any “substantial” analysis with two or three days of getting and cleaning data followed by a day of actually analyzing it; now, I might spend twenty minutes getting vastly greater quantities of already clean data and then a couple of days analyzing it far more deeply.
That evolution didn’t come for free — it took substantial investments of both time and money into our data infrastructure. We’ve built out two physical Hadoop clusters, bought software licenses, and poured hundreds of hours over the last five years into developing the systems that enable reporting and analysis.
I used to struggle with feeling guilty every time I spent time on our data infrastructure. After all, wasn’t my job to analyze data and help the business, not build data infrastructure?
Over time, I’ve come to realize that there’s nothing to feel guilty about. The investments in our infrastructure have paid dividends many times over: in direct time savings (mine and others’), in greater insights for the company, and in empowering others to work with data. In the example analysis I described above, that infrastructure saved a day or two of my time and delivered a better result to the business; I do perhaps thirty or forty such analyses per year, which adds up to somewhere between thirty and eighty days saved annually. That makes a few weeks or even months of total time spent on those investments look like a bargain.
I often hear people argue that “investing in infrastructure” is just code for giving in to “Not Invented Here” syndrome. Our experience has been the opposite: the single biggest-impact infrastructure investment we’ve made was abandoning a custom-developed reporting solution for a piece of commercially developed software. Just like any sort of investment, you can of course spend your resources poorly, but done properly, investing in infrastructure can be one of the highest-return moves you can possibly make.
Seth Godin had an excellent take on the topic recently:
> Here’s something that’s unavoidably true: Investing in infrastructure always pays off. Always. Not just most of the time, but every single time. Sometimes the payoff takes longer than we’d like, sometimes there may be more efficient ways to get the same result, but every time we spend time and money on the four things, we’re surprised at how much of a difference it makes.
I recently wrapped up a fairly large infrastructure project at Basecamp, and my attention is naturally swinging back to the core of what I do: analyzing data. For the first time, however, I’m moving on from an infrastructure project without much guilt about whether it was an investment worth making. Instead, I’m looking forward to reaping the dividends from these investments for years to come.