Sharding is a database technique where you break up a big database into many smaller ones. Instead of having 1 million customers on a single big-iron machine, you put 100,000 customers on each of 10 smaller machines.
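To make the mechanics concrete, here's a minimal sketch of what that routing can look like at the application level. The shard list and the pick-by-customer-id rule below are illustrative assumptions, not a description of any particular setup:

# Illustrative only: ten shards, and a simple modulo rule that keeps all of
# one customer's rows together on the same shard.
SHARD_DSNS = ["mysql://db-shard-%02d/app" % i for i in range(10)]

def shard_for_customer(customer_id):
    """Every query for this customer goes to the shard this returns."""
    return SHARD_DSNS[customer_id % len(SHARD_DSNS)]

# e.g. shard_for_customer(123456) -> "mysql://db-shard-06/app"

The important property is that everything belonging to one customer lands on the same machine, so most queries only ever have to touch one shard.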
The general advice on sharding is that you don’t do it until you have to. It’s similar to Martin Fowler’s First Law of Distributed Object Design: Don’t distribute your objects! Sharding is still relatively hard, has relatively poor tool support, and will definitely complicate your setup.
Now I always knew that the inevitable day would come when we would have no choice. We would simply have to shard because there was no more vertical scaling to be done. But that day seems to get pushed further and further into the future.
Bigger caches, more reads
Our read performance is largely taken care of by the fact that you can get machines with 256GB of RAM now. We upgraded the Basecamp database server from 32GB to 128GB of RAM a while back and thought that would be the end of it.
The box was maxed out and going beyond 128GB at the time was stupid expensive. But now there’s 256GB to be had at a reasonable price and I’m starting to think that by the time we reach that, there’ll be reasonably priced 512GB machines.
So as long as Moore’s law can give us capacity jumps like that, we can keep the entire working set in memory and all will be good. And even if we should hit a ceiling there, we can still go to active read slaves before worrying about sharding.
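(For the curious: read slaves are conceptually simple. All writes go to the master, and reads are spread across replicated copies. Here's a rough sketch of that split at the application level; the connection names are made up, and a real setup also has to worry about replication lag.)

import random

MASTER = "mysql://db-master/app"            # all writes go here
READ_SLAVES = ["mysql://db-slave-01/app",   # replicated copies for reads
               "mysql://db-slave-02/app"]

def connection_for(is_write):
    """Send writes to the master, spread reads across the slaves."""
    return MASTER if is_write else random.choice(READ_SLAVES)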
The bigger problem is writes
Traditionally it hasn’t been read performance that caused people to shard anyway. It has been write performance. Our applications are still very heavy on the reads vs writes, so it’s less of a problem than it is for many others.
But with the rise of SSDs, like Fusion-IO’s ioDrive that can do 120K IOPS, it seems that we’re going to be saved by the progress of technology once again by the time we need it.
Punt on sharding
So where does that leave sharding? For us, we’re in the same position we’ve been in for the past few years. We just don’t need to pay the complexity tax yet, so we don’t. That’s not to say that sharding doesn’t have benefits beyond simply allowing that which otherwise couldn’t be done, but the trade-off is not yet good enough.
One point of real pain we’ve suffered, though, is that migrating a database schema in MySQL on a huge table takes forever and a day. That’s a very real problem if you want to avoid an enterprisey schema full of kludges put in place to avoid adding, renaming, or dropping columns on big tables. Or avoid long scheduled maintenance windows.
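One workaround people use is some variant of copy-and-swap: build the new schema next to the old table, backfill it in batches, then swap the names. Sketched below against a made-up comments table and edited_at column, and assuming a MySQLdb-style cursor, it looks roughly like this. It works, but it’s exactly the kind of kludge I’d rather not need:

def migrate_comments(cursor, batch_size=10000):
    """Copy-and-swap sketch: add a hypothetical edited_at column to a huge
    comments table without one giant, table-locking ALTER. Writes that land
    during the backfill still have to be reconciled (triggers or a short
    write lock); that part is omitted here."""
    cursor.execute("CREATE TABLE comments_new LIKE comments")
    cursor.execute("ALTER TABLE comments_new ADD COLUMN edited_at DATETIME NULL")

    last_id = 0
    while True:
        # Backfill in small, id-ordered batches so we never hold long locks.
        cursor.execute(
            "INSERT INTO comments_new (id, body, created_at) "
            "SELECT id, body, created_at FROM comments "
            "WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size))
        if cursor.rowcount == 0:
            break
        cursor.execute("SELECT MAX(id) FROM comments_new")
        last_id = cursor.fetchone()[0]

    # Atomic rename: readers see the old table one moment, the new one the next.
    cursor.execute("RENAME TABLE comments TO comments_old, comments_new TO comments")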
I really hope that the clever chaps at MySQL come up with something more reasonable for that problem, though. I’m told that PostgreSQL is a lot more accommodating in this regard, so hopefully competition will lift all boats here.
Don’t try to preempt tomorrow
I guess the conclusion is that there’s no use in preempting the technological progress of tomorrow. Machines will get faster and cheaper all the time, but you’ll still only have the same limited programming resources that you had yesterday.
If you can spend them on adding stuff that users care about instead of prematurely optimizing for the future, you stand a better chance of being in business when that tomorrow finally rolls around.
Nalin Mittal
on 07 Jan 09Great article. Thanks Dave!
Kurt
on 07 Jan 09Well-written article, clearly arguing against premature optimization and for avoiding complexity where possible.
It would be awesome if you provided some links on the different topics (e.g. more depth on why sharding works, why read-only slaves work, why writes are slow).
Jonathan Persson
on 07 Jan 09Great post! :)
Martijn Lafeber
on 07 Jan 09Moore already predicted the end of his law, so don’t count on that forever :)
Paul Leader
on 07 Jan 09Excellent article as always.
As Knuth/Hoare put it: “Premature optimisation is the root of all evil”.
It’s funny how so much of what’s been said in the dim and distant past (for the software industry anyway) continues to hold true. It’s also depressing how few people are actually aware of the wisdom that is already out there. I’ve seen (and been responsible for) so many examples of premature optimisation, and excessive complexity over the years that I sometimes wonder if it is a fundamental personality trait of software developers to make things more complex than they need be.
Paul
Greg
on 07 Jan 09Have you done any investigation on what it would take to change over to PostgreSQL and whether it would be a net benefit? I’m confused by the perspective of “we hear PostgreSQL does this better, but we’ll stick with MySQL rather than trying it.” I don’t mean to make this a religious war, I just see it as due diligence to investigate it as a potential low-risk, high-gain, and maybe low-effort optimization.
Daniel
on 07 Jan 09Good read! Mange tak!
Like Kurt, I’m curious about the performance difference when writing to a database. It’s pure intellectual curiosity, as I don’t have a hand in any database that’s large enough to have any noticeable performance issues of any kind. I tried googling a bit, but didn’t find an explanation.
I’m guessing it’s because the database has to check that the input matches the column definitions before writing, check indexes for key duplication, coupled with the general data shuffling (I’m guessing that databases do fragmentation within their data files) that occurs if a variable-width column like “text” suddenly holds more or less data than before…. am I in the ballpark here?
Please, correct me if I’m wrong. As I said, I have no real stake in database performance myself; hence the willingness to guess.
amix
on 07 Jan 09Sharding is not that hard to do if you bake it into the system from the start, but it’s really hard to bake it in afterwards since things like SQL joins are a big no-no in a sharded system.
Regarding Basecamp, it’s pretty impressive that you run it on one database server. The major problem, like you have noted, is that MySQL is really bad at doing any schema changes to really big tables. Scaling upwards and cheap schema changes are really big pluses for a sharded setup.
Personally, I would recommend not building sharding in from the start, but don’t ignore it either: leave the door open by using as few joins as possible.
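As a rough illustration (table and column names made up), resolving an association with two primary-key lookups instead of a SQL join is the kind of thing that leaves that door open, since the two tables can later live on different shards:

def project_with_account(project_id, projects_cur, accounts_cur):
    """Two cheap lookups instead of one JOIN; each cursor could point at a
    different shard later without this code changing shape."""
    projects_cur.execute(
        "SELECT id, account_id, name FROM projects WHERE id = %s", (project_id,))
    project = projects_cur.fetchone()       # (id, account_id, name)
    accounts_cur.execute(
        "SELECT id, name FROM accounts WHERE id = %s", (project[1],))
    account = accounts_cur.fetchone()       # (id, name)
    return project, account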
Dee Zsombor
on 07 Jan 09All objects in Basecamp are derived from the subdomain name, so if you were to set up a second cluster you could have all newly created accounts on it whilst keeping the old ones on the current cluster. A load balancer would direct requests to the right cluster depending on the subdomain.
So while technology moves along quite nicely, I think sites that keep the subdomain-as-account model, with strict separation between account boundaries, are one step ahead as far as scaling goes.
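In rough form the lookup is as simple as this (the cluster names and the mapping store are made up for illustration):

CLUSTERS = {"cluster-1": "mysql://db-cluster-1/app",
            "cluster-2": "mysql://db-cluster-2/app"}
ACCOUNT_TO_CLUSTER = {}        # subdomain -> cluster name; a tiny lookup table
NEWEST_CLUSTER = "cluster-2"   # where newly created accounts land

def cluster_for(subdomain):
    """Existing accounts stay where they are; new ones go to the newest cluster."""
    name = ACCOUNT_TO_CLUSTER.setdefault(subdomain, NEWEST_CLUSTER)
    return CLUSTERS[name]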
Dan Creswell
on 07 Jan 09As always, premature optimization is evil.
That said, it’s very difficult to actually measure the point at which the transition to e.g. sharding should be made.
At least from a cost perspective, you’re currently saving on engineering time but paying to replace/upgrade/deploy new hardware instead. That repeated replacement is a cumulative cost. Further, should there be a day when you do need to switch, you might well have to buy more hardware, spend a whole load of time overhauling your code, and then roll it out (and if you stave it off long enough you might have a lot of code to fix; the change might have been less costly with a smaller codebase).
So the question is, over the long haul, do you actually make a saving? Well, you might, but unless you track the costs closely and measure the cost of the switch (should you have to do it) you can’t be certain. Agreed, then, that you’re increasing your chances of still being in business, but will you have done enough business to offset your costs and give yourself the headroom to move to sharding, or will you have to fold? Who knows, that’s all hindsight….
....though we do know that Amazon, for example, had to borrow a huge amount of money to sort its architecture, and they probably had an easier time getting the loan given they could say “we’re quite big, you know, with a lot of customers and proven revenue.”
Interesting topic and I enjoyed the post.
bvs
on 07 Jan 09There are a couple of informative RFCs and another by Oracle on sharding and partitioning (none of whose links I can find right now). However, http://technoroy.blogspot.com/2008/07/shard-database-design.html is a decent start.
Will post with links later.
Al Chou
on 07 Jan 09According to Desmond Morris’s The Human Sexes, propensity toward complex planning is a trait of human males in general. He uses footage of middle-aged men operating their model railroad setups as illustration.
Now, I personally have never understood the fascination with model railroads, but in some things I do show that male propensity, such as contemplating using CouchDB for a pet app I’ve had in mind for years, and not just because the app truly is document-oriented. It’s a good thing MacPorts hasn’t been able to build a fully functional Erlang environment on my machine just yet—it’s a brake on such silliness. :)
Not being really at all knowledgeable about relational databases, I too am enjoying the post and the comment thread.
Jijnes Patel
on 07 Jan 09There are a few good approaches at Max Indelicato’s blog. One of them is a technique I’ve used in the past, where a master index is maintained to find the proper shard for a given tuple. This way, expanding the cluster does not require rebalancing the data. The index remains a point of failure, but it has worked well overall.
We were also fortunate that our data set was already in a position we could shard it. If we had to search across it, it would have complicated our sharding solution.
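In simplified form the master-index approach looks something like this (the names and the round-robin assignment are just illustrative):

class ShardDirectory(object):
    """Master index: remembers which shard each account lives on, so adding a
    new shard never forces existing data to move."""

    def __init__(self, shard_dsns):
        self.shards = list(shard_dsns)
        self.index = {}              # account_id -> position in self.shards
        self.next_assignment = 0

    def shard_for(self, account_id):
        if account_id not in self.index:
            # New accounts are spread round-robin; the directory itself is the
            # single source of truth (and the point of failure mentioned above).
            self.index[account_id] = self.next_assignment % len(self.shards)
            self.next_assignment += 1
        return self.shards[self.index[account_id]]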
jlarson
on 07 Jan 09This is one of my new favorite quotes “Machines will get faster and cheaper all the time, but you’ll still only have the same limited programming resources that you had yesterday.”
Dave Giunta
on 07 Jan 09Great article, David…
I really appreciate the recent spate of highly technical, look-behind-the-curtain style articles coming from you guys lately.
Thanks a bunch.
Daniel
on 07 Jan 09Dave Giunta:
I second that!
DHH
on 07 Jan 09I appreciate the appreciation ;). I’ll try to put more of these thoughts to blog as we talk about them internally.
A good story on sharding and read/write performance of dbs is the original LiveJournal scaling presentation.
Dee, the problem with that model is that it doesn’t balance very well. You’ll typically need significantly more application servers since you can’t balance across the entire inventory of servers. So if you have a spike on cluster #1, it doesn’t help that you have plenty of available CPU on cluster #2.
Dan, the Basecamp code base isn’t really growing rapidly any more. The cost to switch to sharding is more or less fixed. It may indeed be decreasing over time as others do more pioneering work and come up with easier ways to solve the problem.
Also, many smaller machines may not be significantly cheaper to operate for us than a single large one. There’s more maintenance in 10 machines than in 1. There’s more maintenance in complexity.
MI
on 07 Jan 09We have folks who are quite familiar with PostgreSQL, and it definitely has advantages in some areas. That said, it has a gigantic disadvantage as far as replication is concerned since it doesn’t have any baked in support for it out of the box. I know that there are third party options available that address this, but I haven’t taken the time to investigate them deeply. Regardless, the idea of migrating 140+GB of data from MySQL to PostgreSQL does not excite me. :)
Absolutely, the fact that we have a very clean separation (i.e. accounts/subdomains) makes the approach that we would take, if we decided to implement sharding, pretty obvious. That said, there are a variety of edge cases that have to be taken into account and it’s not as straightforward as it might appear on the surface.
Actually, it’s just an incremental cost. We lease all of our hardware through Rackspace Platform Hosting and they make it very easy for us to upgrade to new hardware when we need to without having big capital expenses.
matt
on 07 Jan 09Why is it everyone gets the quote wrong?
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”
When you start getting into performance improvements there are usually several things that stick out like a sore thumb which anybody can find with a small bit of digging (use of bind variables in oracle for example) versus things that are small in size but large in use. Those are MUCH harder to track down; however, they can have an enormous impact. Seemingly small changes in things like string concatenation over the entirety of a code base can make large performance improvements.
“Also, many smaller machines may not be significantly cheaper to operate for us than a single large one. There’s more maintenance in 10 machines than in 1. There’s more maintenance in complexity.” In the short term sure. In the longer term, if you move towards automating your operations, you typically gain several things: 1) redundancy 2) ease of scaling 3) more confidence in your setup 4) simpler roll forward / roll back of code / schema changes
With an automated operation system in place you don’t think in terms of individual machines for the most part. The complexity lies in how many environments / parts of the system there are.
Mike Fisher
on 07 Jan 09Nice article. A boss of mine used to tell me to “fix it a day before it breaks”, because like you said we all have a limited amount of engineers. The obvious problem is knowing when that day is. If you are growing very fast or in an unpredictable manner, then you need a larger margin of error around that date.
Most companies are hoping to grow faster than technology can keep up and therefore need to determine other ways to scale. One could argue that is a problem with their business model, but that’s a different article. For those companies that need to shard or want other ideas on how to split databases here is an article on the subject
http://akfpartners.com/techblog/2008/05/22/splitting-databases-for-scale/
Jonathan LaCour
on 07 Jan 09While I understand your basic points, David, I don’t entirely agree with you. At some fundamental level, your argument is that sharding is complex, so you are relying on solving the problem with hardware as long as Moore’s Law keeps bailing you out. This may be true in the case of software like Basecamp, but it’s most definitely not the case for all software. Sharding is a completely valid, and often required, solution for certain classes of applications.
The real issue at hand is that the tools you use don’t make it easy to shard. Sharding is a complex problem to an extent, but it’s a completely solvable one if your tools make it possible. ActiveRecord doesn’t currently do this, but could learn some lessons from alternative ORMs like Python’s SQLAlchemy and Storm, which both provide efficient, easy-to-use, and powerful built-in sharding mechanisms.
Hardware may be bailing you out now, but it can’t bail out everyone, and it’s only masking the real problem: your tools aren’t good enough yet! Note, I am not saying that the tools aren’t good, just that I believe sharding is an important thing they are missing.
Robert Treat
on 07 Jan 09Nice post! Too many people get caught up in the latest trendy buzz about cloud computing and sharding and all of that; but the simple reality is that if you can scale your database vertically, that is likely the right thing to do, for many of the reasons you have stated (Moore’s law, simpler management, less code complexity); especially when those techniques can take you into the top-1000-website range (which is beyond what many people will need). It’s one of the reasons we usually end up running Postgres for our clients. (We also manage some large MySQL and Oracle instances, but MySQL doesn’t scale as well vertically, and Oracle is cost-prohibitive for many people.)
That said, I’ve never understood the concern about “3rd-party options” that people have with Postgres replication. If you’re running open source, it’s likely your whole infrastructure is filled with 3rd-party solutions, like using Nagios for monitoring (which I would favor over MySQL’s monitoring system), or using Maatkit to ease administration.
I find this especially amusing coming from the 37signals guys, who specialize in developing first rate 3rd-party solutions to solve common business problems. Don’t be afraid; there are good, open source solutions available with companies that provide support for those solutions if needed.
On a side note, converting a 100+GB system from MySQL to Postgres isn’t that hard, but I would agree that there aren’t good solutions for migrating 100+GB of data from MySQL to Postgres with minimal downtime. It’s certainly best done when you have plenty of headroom on your hardware to absorb the extra load you’ll need.
Espen Antonsen
on 07 Jan 09I am surprised to see Basecamp running on just one database server. When I worked at 24SevenOffice and we started facing performance issues due to the physical limits of the server, again and again after numerous upgrades, we decided to spread out our databases both vertically and horizontally. We could have kept adding RAM, but we did not want to end up one day with too many clients accessing one maxed-out box. A few minor code rewrites were required, but little had to be done to abstract the data access code to handle different databases for different clients, different databases for different objects (actually just e-mail; everything else was stored in one database), and a bunch of cheap read slaves. I am not going to start a religious war here, but there is one other major difference between our setup and the setup at 37Signals: Microsoft SQL Server. Replication, data access between servers from all code aspects, migration, and other potential issues gave us no problems. I would seriously consider PostgreSQL or Oracle in your case. I have recently worked briefly with MySQL and, after dealing with MSSQL for seven years, I am not at all impressed with MySQL.
JF
on 07 Jan 09Robert: Mark didn’t say we’re concerned about 3rd party options, only that he hasn’t had a chance to investigate them deeply yet. We use lots of 3rd party add-ons and extra tools that aren’t part of the core tech we’re using.
Mitch
on 07 Jan 09Great Post!
Getting Real talks about delayed scaling as well. You guys at 37Signals have changed my ideas on scalability, and it has already affected a few applications that I have written.
Thank you for sharing your insights!
James
on 07 Jan 09I agree with the critiques of your approach insomuch as you should have said “For our application, sharding is unnecessary because of Moore’s law”. I doubt you have the same constraints as Flickr or other advocates of sharding.
For read-heavy databases that can fit in RAM (nowadays up to 256GB), sharding does seem unnecessary. For write-heavy apps (Facebook, Flickr, etc.) I bet sharding is a much better approach.
My informed but untested opinion is: 1) keep a simple database schema and try to keep queries simple (avoid joins), 2) cache as much as possible on as many levels as possible, 3) shard if necessary. Keeping that path in mind can lead to architectural decisions that prepare you for the future.
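For point 2, the pattern I have in mind is plain cache-aside, roughly like this (assuming a memcached-style client; the table and key names are made up):

def find_project(project_id, cache, cursor, ttl=300):
    """Cache-aside: check the cache, fall back to the database on a miss,
    and write the result back with a short expiry."""
    key = "project:%d" % project_id
    project = cache.get(key)
    if project is None:
        cursor.execute("SELECT id, name FROM projects WHERE id = %s", (project_id,))
        project = cursor.fetchone()
        cache.set(key, project, ttl)
    return project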
Nathan LeMesurier
on 07 Jan 09Dan: Yes, you save on engineering time but spend more on hardware and that’s probably your best bet.
Hardware is Cheap, Programmers are Expensive
DHH
on 07 Jan 09I absolutely agree that if you’re Yahoo or Flickr or LiveJournal, you’ll need to shard sooner rather than later. But bear in mind that neither Flickr nor LiveJournal started life as a fully sharded application. When did they start sharding? When they needed to!
That’s generally a good approach to all scaling techniques. Do them when you need to. That doesn’t have to mean “the day everything goes up in smoke”. It can well just mean “we’re growing the db set at 2GB/month and our queries/sec with X/month, this means that in 4 months we’ll be out of headroom on the current setup—let’s shard!”.
But thinking that you need to shard ahead of time to Play With The Big Boys is just stupid. Chances are you won’t build an application that’s that hot. And if you should be so lucky, you can deal with the problem then. Just like Flickr and LiveJournal and every big site under the sun did.
Alan Rimm-Kaufman
on 07 Jan 09Yep, and that is for the “easy” case.
When you don’t have the clean split - when you’re dealing with an app where all the tuples/objects/thingies need to interact - then it is a huge mess.
We were lucky to fall into the “clean split” case, and our IT guys are wizards, and that made sharding here reasonable.
Due to growth, we just deployed our third shard, and the process was amazingly calm and non-eventful.
rkgblog
David Andersen
on 07 Jan 09Enjoyed this and all the comments. Quite educational.
StartBreakingFree.com
on 07 Jan 09Have you guys heard of SHARTING? Might be a good option to look into!
http://www.urbandictionary.com/define.php?term=shart
Haha, seriously though I thought that was what the article was about when I saw the title and couldn’t figure out why Mr. Moore would be involved.
roger wilco
on 07 Jan 09Man, you guys were this close to writing an entire blog post without telling us what to do.
MI
on 07 Jan 09Robert: I didn’t say that I was concerned that the replication options were third party, just that it puts Pg at a disadvantage relative to MySQL as far as replication is concerned since there’s not even a basic story for replication out of the box. It seems like the PostgreSQL developers recognized this some time ago and plan to bake in at least basic replication capabilities.
That said, I agree that PostgreSQL is a fantastic piece of software and I’ve wished for some of its features on a number of occasions, but none of those have been compelling enough for us to seriously consider a move.
Silly Goose
on 07 Jan 09Sorry, when I saw the term “sharded” I couldn’t help but think of Along Came Polly.
http://www.urbandictionary.com/define.php?term=sharded
Sam
on 07 Jan 09I’d be interested to hear how you actually go about migrating your MySQL schema given that it takes “forever and a day”, even on a small table. Do you kludge your schema or do you do something clever to get around it?
Jason Watkins
on 07 Jan 09Dee, DHH: from what I understand, MySpace partitions its user database by auto-increment key, 1 million users per server, so perhaps the load-balancing issues of that approach aren’t as big a problem in practice.
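In other words, something like fixed-size ranges on the auto-increment id (purely an illustration of the scheme, not their actual code):

USERS_PER_SHARD = 1000000

def shard_index(user_id):
    return user_id // USERS_PER_SHARD       # ids below 1,000,000 land on shard 0

def shard_host(user_id):
    # Hypothetical host naming; the point is the shard falls straight out of the id.
    return "user-db-%03d.example.internal" % shard_index(user_id)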
Even if you’re moving faster than Moore and reach the limits of scaling a single machine, there are options. I’ve never used them, but there are shared-nothing clustered database products on the market (e.g. Teradata). Hopefully someday MySQL Cluster or a similar open source product will have the same maturity.
I also think the point about tooling is well taken. Google App Engine’s data APIs are easy to use and more transparent to scaling than SQL. Choosing the right abstraction can be a powerful way to gain both programmer productivity and hardware efficiency.
Xaprb
on 08 Jan 09A lot of things scale non-linearly inside of MySQL. I don’t have any experience running MySQL on 512GB of memory.
Mikael Ronstrom
on 08 Jan 09We have a prototype of Parallel ALTER TABLE working in our labs. We demonstrated it at the last MySQL developers meeting, where for MyISAM, ARCHIVE and HEAP we were able to fully utilize 16 cores to ALTER a 10-million-row table in 2.5 seconds. In many cases the storage engine limits the scalability.
However when we introduce such a feature it will be fairly straightforward to pinpoint how to fix this particular scalability issue.
This feature mainly works when partitioned tables are involved; for non-partitioned tables we can split the work into a read thread and a write thread. To do more than that would require changing the handler API to accommodate multithreaded changes.
It’s very likely that we’ll look at how we can do a similar parallelisation of other variants of ALTER TABLE, LOAD DATA and various other load functionalities.
Brandon Teo
on 08 Jan 09Hey David, Any thoughts or plans on implementing some sort of BigTable mini-system for your database capacity needs?
Michal Frackowiak
on 08 Jan 09The problem with RDBMSes and sharding is that relational databases were never designed for it. By using sharding, replication and read load balancing, you naturally have to sacrifice some of the benefits of an RDBMS and take on extra complexity.
Transactions – no way you can use them consistently in a sharded environment. Foreign keys between shards? No-go. Thus you also sacrifice the enforced (and immediate) consistency that comes with a well-designed relational database. In the end you need to add extra layers (in the app or a db proxy?) to handle sharding, and the system becomes quite complex… Not to mention the scenario where the application was not designed for sharding – scaling such systems can be really painful.
Sure, there are several ways you can shard your data to keep some (most) of the benefits of an RDBMS, but I would still call this kind of design “hacking”.
On the other hand, systems like CouchDB become an interesting option. It is much simpler in implementation, less complex, but very powerful. Come on – on very large, high-traffic systems the role of the DB is very often reduced to “keep the data” (i.e. less logic, less enforced consistency…), while more specialized tasks are moved to external resources (maps, filters etc.).
I dare say that high-volume and high-traffic projects would benefit much more from scalable solutions than from hacked, partitioned and clustered RDBMSes. Compare Amazon’s SimpleDB (configure it in minutes, scales to zillions, but with still rather limited query options and other restrictions) to a custom MySQL or PostgreSQL replicated/sharded cluster that takes a handful of engineers to set up and maintain?
Many projects need “database” storage that is easy to configure and easy to scale.
Recently we have been also looking at CouchDB. This is not relational, not transactional, but a document-oriented database. I think the project is worth watching in 2009 and IMHO could become a huge success.
francisco
on 08 Jan 09fair point here: http://jeremy.zawodny.com/blog/archives/010841.html
Matt
on 08 Jan 09We sharded WordPress.com from the beginning and I’m thankful for that every day. A big advantage for us is that not all users are equal so we don’t have to scale up the entire system in a monolithic fashion.
DHH
on 08 Jan 09Matt, what do you mean that you don’t have to scale the system in a monolithic fashion? When you shard, you typically create inefficiencies across the load patterns. One shard might be at 100% capacity while another is only at 50%. So you have cluster #1 slowing down and unable to use the resources from cluster #2 to speed things up.
(BTW, sharding for Wordpress.com is probably very reasonable given your load—although I would think it would almost all be page cached anyway?)
Wei Hu
on 08 Jan 09Sharding is one way to overcome the scalability limitations of a single database. Interestingly, sharding in turn imposes additional demands on a database.
The following presentation, given at the 2008 Oracle Open World Conference, discusses these challenges and how advanced database technology can address them.
http://www.oracle.com/technology/deploy/availability/pdf/oracle-openworld-2008/300461.pdf
dbjock
on 09 Jan 09As someone that has worked with many different databases for many years, this type of discussion really baffles me.
I congratulate you on overcoming the “sharding is the only way to design a solution” religion and looking at this from a practical point of view and deciding based on facts when and where it makes sense for your application.
On the other hand I am amazed at how people are willing to put up with limitations in MySQL that other products have overcome many years (decades?) ago.
When you say “One point of real pain we’ve suffered, though, is that migrating a database schema in MySQL on a huge table takes forever and a day. That’s a very real problem if you want to avoid an enterprisey schema full of kludges put in place to avoid adding, renaming, or dropping columns on big tables. Or avoid long scheduled maintenance windows.”
The commercial databases have solved this problem for many years. Oracle has had the ability to add a column online for more than 20 years. That’s right 20 years! This problem was solved in the era of Windows 95! Microsoft and IBM also have this capability.
Even worse, your pragmatic approach of running your database on a single system is going to run out of steam soon with MySQL as Intel and AMD add more cores and threads to their processors. MySQL has well-known scalability limitations and won’t scale well on the 2- and 4-socket boxes that are coming out in 2009. The Intel 4-socket box with 8 cores per socket and 2 threads per core will look like a 64-way SMP. This is well beyond the capabilities of MySQL to scale.
The only negative I see in these posts vs. commercial databases is that they are prohibitively expensive. Again, this is mostly religion. For a single database running on a single system the costs are not huge.
So, I congratulate you on taking a pragmatic approach to determining whether sharding is right for your application. I would suggest you also take a pragmatic approach to choosing the software you base your solution on.