When I wrote about how I mostly just use arithmetic, a lot of people asked me what skills or tools a data scientist needs, if not fancy algorithms. What is this mythical “basic math” that I mentioned? Here’s my take on the skills that are actually needed for the sort of work that I do at Basecamp: simple analyses focused on solving actual business problems.
The most important skill: being able to understand the business and the problem
I’ll get to the practical skills that you can learn from a textbook in a minute, but first I have to belabor one point: the truly essential skill of a data scientist is the ability to understand the business and the problem, and the intellectual curiosity to want to do so. What are you actually trying to achieve as a business? Who are your customers? When are you selling your product? What are the underlying economics of the business? Are profit margins high or modest? Do you have many small customers or a few large customers? How wide is your product range? Who are you competing with? What challenge is the business facing, and are you trying to solve it outright or to inform a decision about it? What’s the believable range of answers? Who is involved in solving this problem? Can analysis actually make a difference? How much time is worth investing in this problem?
Understanding the data
Before you look at any data or do any math, you need to understand the underlying data sources, structure, and meaning. Even if someone else goes out and gets the data from wherever it’s stored and gives it to you, you still need to understand where it came from and what each part of it means. Data quality varies dramatically across and within organizations; in some cases you’ll have a well-documented data dictionary, and in other cases you’ll have nothing. Regardless, you’ll want to be able to answer the following questions:
- What data do I need to solve the problem?
- Where is that data located? In a relational database? In a log file on disk? In a third party service?
- How comprehensive (time and scope) is the data? Are there gaps in coverage or retention?
- What does each field in the data mean in terms of actual behavior of humans or computers?
- How accurate is each field in the data? Does it come from something that’s directly observed, self-reported, third-party sourced, or imputed?
- How can I use this data in a way that minimizes the risk of violating someone’s privacy?
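As one small illustration of the coverage question above, here is a minimal sketch of checking a daily data set for gaps. The `missing_days` helper and the sample dates are hypothetical, made up for this example; real data would come from wherever your events are stored.

```python
from datetime import date, timedelta


def missing_days(observed_days):
    """Return the calendar days between the first and last observation that have no data."""
    observed = set(observed_days)
    if not observed:
        return []
    day, last = min(observed), max(observed)
    gaps = []
    while day <= last:
        if day not in observed:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical days on which at least one event was logged
events = [date(2016, 3, 1), date(2016, 3, 2), date(2016, 3, 5)]
gaps = missing_days(events)
print(gaps)
```

A gap like this does not say why the data is missing (an outage, a retention policy, a quiet weekend), only that the question is worth asking before you analyze anything.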
SQL skills
For better or worse, most of the data that data scientists need lives in relational databases that quack SQL, whether that’s MySQL, Postgres, Hive, Impala, Redshift, BigQuery, Teradata, Oracle, or something else. Your mission is to free the data from the confines of that relational database without crashing the database instance, pulling more or less data than you need, getting inaccurate data, or waiting a year for a query to finish.
Virtually every query a data scientist writes to get data to analyze to solve business problems will be a SELECT statement. The essential SQL concepts and functions that I find necessary are:
- DESCRIBE and EXPLAIN
- WHERE clauses, including IN (…)
- GROUP BY
- Joins, mostly left and inner
- Using already indexed fields
- LIMIT and OFFSET
- LIKE and REGEXP
- if()
- String manipulation, primarily left() and lower()
- Date manipulation: date_add, datediff, to and from UNIX timestamps, time component extraction
- regexp_extract (if you’re lucky enough to use a database that supports it) or substring_index (if you’re less lucky)
- Subqueries
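To make a few of these concrete, here is a minimal sketch that combines a WHERE … IN filter, a left join, GROUP BY, and LIMIT in one SELECT. It assumes an in-memory SQLite database as a stand-in for whatever database you actually use, and the `accounts` and `invoices` tables are hypothetical. Some items in the list above (DESCRIBE, if(), substring_index) are dialect-specific and are not available in SQLite under those names.

```python
import sqlite3

# In-memory SQLite database with two hypothetical tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, plan TEXT);
    CREATE TABLE invoices (account_id INTEGER, amount REAL);
    INSERT INTO accounts VALUES
        (1, 'basic'), (2, 'plus'), (3, 'plus'), (4, 'free'), (5, 'plus');
    INSERT INTO invoices VALUES (1, 20.0), (2, 50.0), (2, 50.0), (3, 50.0);
""")

# Revenue and account count per paid plan.
# The LEFT JOIN keeps account 5, which has no invoices yet;
# an INNER JOIN would silently drop it from the account count.
query = """
    SELECT a.plan,
           COUNT(DISTINCT a.id)       AS accounts,
           COALESCE(SUM(i.amount), 0) AS revenue
    FROM accounts a
    LEFT JOIN invoices i ON i.account_id = a.id
    WHERE a.plan IN ('basic', 'plus')
    GROUP BY a.plan
    ORDER BY revenue DESC
    LIMIT 10
"""
rows = conn.execute(query).fetchall()
for plan, accounts, revenue in rows:
    print(plan, accounts, revenue)
```

The choice between left and inner joins is often where inaccurate numbers creep in, which is why it is worth checking row counts before and after a join.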
Basic math skills
Once you have some data, you can do some math. The list of math skills and concepts that I consider essential is not a long one:
- Arithmetic (addition, subtraction, multiplication, division)
- Percentages (of total, difference vs. another value)
- Mean and median (and mean vs. median)
- Percentiles
- Histograms and cumulative distribution functions
- An understanding of probability, randomness, and sampling
- Growth rates (simple and compound)
- Power analysis (for proportions and means)
- Significance testing (for proportions and means)
This isn’t a very complicated set of things. It’s not about the math; it’s about the problem you’re solving.
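To show how little machinery most of this takes, here is a rough sketch of several items from the list using only Python’s standard library. The sample numbers and helper names are made up for illustration, the percentile uses the nearest-rank method (one of several common definitions), and the significance test is a pooled two-proportion z-test, which is one standard choice rather than the only one.

```python
import math
import statistics

# Hypothetical order values; one large order skews the distribution
orders = [10, 12, 12, 15, 18, 20, 25, 30, 40, 500]

mean_order = statistics.mean(orders)      # 68.2, pulled up by the outlier
median_order = statistics.median(orders)  # 19.0, closer to a typical order


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def pct_change(old, new):
    """Percentage difference of new vs. old."""
    return (new - old) / old * 100


def compound_growth_rate(start, end, periods):
    """Constant per-period rate that takes start to end over the given number of periods."""
    return (end / start) ** (1 / periods) - 1


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return z, p_value


p90 = percentile(orders, 90)
growth = compound_growth_rate(100, 121, 2)
# Hypothetical experiment: 120 of 1,000 vs. 150 of 1,000 visitors converted
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(mean_order, median_order, p90, growth, z, p_value)
```

The gap between the mean and the median here is the whole point of “mean vs. median”: a single large order moves one far more than the other.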
Slightly more advanced math concepts
On occasion, some more advanced mathematical or SQL concepts or skills are useful for common business problems. A handful of the more common things I use include:
- Analytic functions if supported by your database (lead(), lag(), rank(), etc.)
- Present and future value and discount rates
- Survival analysis
- Linear and logistic regression
- Bag of Words textual representations
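Two of these are simpler than their names suggest. Below is a minimal sketch of present and future value under a constant per-period discount rate, and of a bag-of-words representation as plain word counts. The function names, the tokenizing regular expression, and the sample text are assumptions chosen for illustration; real text analysis usually needs more careful tokenization.

```python
import re
from collections import Counter


def future_value(present, rate, periods):
    """Value after compounding at a constant per-period rate."""
    return present * (1 + rate) ** periods


def present_value(future, rate, periods):
    """Value today of a future amount, discounted at a constant per-period rate."""
    return future / (1 + rate) ** periods


def bag_of_words(text):
    """Lowercase the text and count each word, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# $110 received one year from now, discounted at 10% per year
pv = present_value(110, 0.10, 1)

# Hypothetical support message
bag = bag_of_words("Cancel my account. My account is locked!")
print(pv, bag.most_common(2))
```

Once text is reduced to counts like these, it can feed straight into the arithmetic and regression techniques mentioned earlier.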
There are some problems that require more advanced techniques, and I don’t mean to disparage or dismiss those. If your business can truly benefit from things like deep learning, congratulations! That probably means you’ve solved all the easy problems that your business is facing.