We launched the new Basecamp on March 6. Since then we’ve deployed 891 new versions with all sorts of new features, bug fixes, and tweaks. Through all of that we’ve had just six incidents of either scheduled or unscheduled downtime for a total of 19 minutes offline.
Today, that means we’ve been available 99.99% of the time since launch. That’s worth celebrating! Our fantastic operations team, consisting of Anton, Eron, John, Matt, Will, and Taylor, has worked tirelessly to eliminate interruptions, and they deserve our applause.
Since we count “scheduled” downtime the same as “unscheduled” (have you ever met a customer who cared about the difference?), that has meant making good progress on stuff like database migrations.
In the past, when we focused mainly on unscheduled downtime as a measure of success, we wouldn’t think too much of taking a 30-minute window to push a major new feature. Not so these days. Thanks to Percona’s pt-online-schema-change, we’re able to migrate the database much more easily, without any downtime or master-slave swappero.
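For readers who haven’t used the tool, here is a minimal sketch of what an invocation looks like, written as a small Python helper that only builds the command line and does not run it. The database, table, and column names are hypothetical and this is not our actual migration setup; the `--alter`, `--dry-run`, and `--execute` options and the `D=...,t=...` target are the ones described in Percona’s documentation.

```python
import shlex


def build_osc_command(database, table, alter, execute=False):
    """Build the argv for pt-online-schema-change without running it.

    The tool copies the table in the background, applies the ALTER to the
    copy, and swaps the tables at the end, so the original stays available.
    """
    return [
        "pt-online-schema-change",
        "--alter", alter,
        f"D={database},t={table}",  # DSN: which database and table to change
        # --dry-run only creates and alters the new table, without copying
        # rows or swapping; --execute performs the real change.
        "--execute" if execute else "--dry-run",
    ]


# Hypothetical names, for illustration only.
cmd = build_osc_command(
    "example_app", "projects", "ADD COLUMN archived_at DATETIME"
)
print(shlex.join(cmd))
```

Starting with `--dry-run` and switching to `--execute` only once the plan looks right is a cautious default for a sketch like this.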
So three cheers to the four 9’s! Our next target is five 9’s, but that allows for only about 5 minutes of downtime in a whole year, so we have our work cut out for us.
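To make those budgets concrete, here is a small sketch of the arithmetic. It assumes a 365-day year; the function name is just for illustration.

```python
# Downtime budget implied by an availability target over a given period.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed while still meeting the target."""
    return period_minutes * (100 - availability_percent) / 100


for label, target in [("four 9's", 99.99), ("five 9's", 99.999)]:
    budget = downtime_budget_minutes(target)
    print(f"{label} ({target}%): {budget:.2f} minutes per year")
```

Four 9’s works out to roughly 52.56 minutes a year and five 9’s to roughly 5.26, which is why each extra nine is so much harder than the last.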
You can follow along and see how we’re doing on basecamp.com/uptime.
Henrik N
on 17 Jul 12
Even if that tool lets you migrate the database table itself without locking issues, how do you handle the Rails end of renaming/removing columns, with ActiveRecord’s column cache? Do you first deploy a patch to ignore it or something else?
Marcus Swope
on 17 Jul 12
I used to work for a company whose deployments would take between 3-4 hours each. I’m not sure that it would even be possible to deploy that much in that little time, even if we just deployed 24-7! Kudos!
z
on 17 Jul 12
Can you list the features added since launch please?
DHH
on 17 Jul 12
Z: You can see a partial list of new features and tweaks on http://basecamp.com/changes.
Gerard
on 17 Jul 12
Congratulations. That is a serious accomplishment and is worth at least some feathers up some …. well let’s just leave it at compliments :)
Michael
on 17 Jul 12
How’s the contest with Github et al going?
Don S.
on 17 Jul 12
How is the Ops team configured? You have 6 (great) team members to manage…
Network Servers / Clusters etc. DBA Support or?
I ask simply for reference for what I would like to do with our team. I think we need more team members !
Montigny
on 17 Jul 12
99.99% is impressive, especially since you are counting the scheduled maintenance. Going for even more availability seems like too much to me; you’re doing over-quality :)
I’d be glad if I could do the same thing in our production. Unfortunately, everything depends on the application architecture. If your Basecamp is fully scalable, it doesn’t matter if you have crappy storage and a weak DB server: you can put in hundreds of cheap servers and rely on probabilities. That’s a great idea, but you have to start early in application design.
Taylor
on 17 Jul 12
@Don S.
We are distributed throughout the US, Canada, Europe + Russia. I’m not sure exactly what your question is but I’d be happy to give more information if you can clarify what you’d like to know.
@Montigny
We are excited about our recent success. You might find it interesting to know that most of Basecamp is deployed on “standard” hardware with a single relational database pair backing the application. We’ve worked hard to make things as fault tolerant and easy to maintain as is humanly possible in this updated version, but there is still work to be done. We still battle with storage and other infrastructure even though this is a new product. There are always tradeoffs between ease of implementation and long-term reliability and operability.
David E.
on 18 Jul 12
Congrats. Very impressive. Could you share a little more on how you manage to do that? Thanks.
Michael Warkentin
on 18 Jul 12
Have you guys run into any issues with using pt-online-schema-change? We’re thinking about trying it out at Wave Accounting.
Rick
on 19 Jul 12
37signals
This is yet another time you’ve lied to your customers. You did not in fact have 99.99% uptime.
The new Basecamp has been live for 4 months.
99.99% uptime for 1 year translates to only 52.56 minutes of downtime for the year or 17.56 minutes for 4 months. You had 19 minutes of downtime for that period.
http://en.m.wikipedia.org/wiki/High_availability#section_2
I know people will say, well – we only missed it by a minute or so.
Problem is, when it comes to uptime – a minute or so is HUGE.
NL
on 19 Jul 12
@Rick: I appreciate you keeping us honest. This is certainly a case where precision is called for.
David published this post at 15:00 UTC on the 17th of this month. We launched the new Basecamp at around 13:00 UTC on March 6th, or a little over 133 days before (191,640 minutes, to be precise). 99.99% uptime for that period works out to 19.164 minutes.
We track downtime in 1 minute increments that are always rounded up (so 61 seconds of downtime ends up counting as two minutes). By that measure, we’re at 19 minutes total since launch, so we were just over the threshold when this post was published—99.99009% uptime since launch.
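Here is a rough sketch of that calculation for anyone who wants to check it. It takes the “12” in the dates as 2012 and uses the approximate launch and publication times given above; the function and variable names are only for illustration.

```python
import math
from datetime import datetime, timezone

# Approximate timestamps from the comment above (year assumed to be 2012).
launch = datetime(2012, 3, 6, 13, 0, tzinfo=timezone.utc)
published = datetime(2012, 7, 17, 15, 0, tzinfo=timezone.utc)

elapsed_minutes = (published - launch).total_seconds() / 60
budget_minutes = elapsed_minutes * 0.0001  # 0.01% of the period
downtime_minutes = 19  # total recorded since launch
uptime_percent = 100 * (1 - downtime_minutes / elapsed_minutes)


def counted_minutes(outage_seconds):
    """Round an outage up to whole minutes, as described above."""
    return math.ceil(outage_seconds / 60)


print(f"Elapsed: {elapsed_minutes:,.0f} minutes")
print(f"99.99% budget: {budget_minutes:.3f} minutes")
print(f"Uptime: {uptime_percent:.5f}%")
print(f"A 61-second outage counts as {counted_minutes(61)} minutes")
```

Rounding every outage up to the next whole minute means the recorded total can only overstate the real downtime, never understate it.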
Don S.
on 19 Jul 12
Taylor,
Thank you for your reply. I will be more specific.
I am trying to gauge how many team members and what roles they should have in a (kind of) similar situation. 24×7 365 SaaS application, 50,000 + users and growing. Similar set up as you, except there are three locations that all are same in different countries, but due to international law, the information they house cannot leave country. So we have to maintain multiple locations.
So with your six support team members you cover one DC, and I was wondering what each member’s role was. Are all of them “System Administrators,” or do you have specialties, i.e. Network, Server, DB, General Support? As far as where they are located, I understand and agree that is not as important. We want great team members. They don’t have to be neighbors :-)
I hope that clarifies my question. We are growing fast, and need to staff out, hoping for some insights from the success you have had.
Thanks!
Julian H
on 20 Jul 12
Well done on the 4×9’s. What’s the uptime on the New Basecamp mobile site? I still can’t find it.
Amaury Bouchard
on 22 Jul 12
I would like to know something: Are the 6 members of your IT team on flexible work time and four-day weeks during summer? Do they have “on-call time” during evenings, weekends, or holidays?
Sairam
on 24 Jul 12
“Thanks to Percona’s pt-online-schema-change, we’re able to migrate the database much easier without any downtime or master-slave swappero.”
Dear Team,
Can you elaborate on the usage of "pt-online-schema-change"? What are the merits and demerits when it comes to the technical aspects of "pt-online-schema-change"?