Ma Bell engineered their phone system to have 99.999% reliability. Just 5 minutes of downtime per year. We’re pretty far off that for most internet services.
Sometimes that’s acceptable. Twitter was fail-whaling for months on end, and that hardly seemed to put a dent in their growth. But if Gmail is down for even 5 minutes, I start getting sweaty palms. The same is true for many customers of our applications.
These days most savvy companies have gotten pretty good about keeping a status page updated during outages, but it’s much harder to get a sense of how they’re doing over the long run. The Amazon Web Services Health Dashboard only lets you look at a week at a time. It’s the same thing with the Google Apps Status Dashboard.
Zooming in like that is a great way to make things look peachy most of the time, but to anyone looking to make a decision about the service, it’s a lie by omission.
Since I would love to be able to evaluate other services by their long-term uptime record, I thought it only fair that we allow others to do the same with us. So starting today we’re producing uptime records going back 12 months for our four major applications:
- Basecamp: 99.93% or about six hours of downtime.
- Highrise: 99.95% or about four hours of downtime.
- Campfire: 99.95% or about four hours of downtime.
- Backpack: 99.98% or just under two hours of downtime.
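To put those percentages in perspective, here is a minimal back-of-the-envelope conversion from an uptime percentage to hours of downtime over a year. This is just arithmetic for the reader's benefit, not the actual monitoring or reporting code behind the numbers above.

```python
# Sketch: converting an uptime percentage into yearly downtime hours.
# The percentages are the ones published above; the rest is arithmetic.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_hours(uptime_percent):
    """Hours of downtime implied by an uptime percentage over one year."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR

for app, uptime in [("Basecamp", 99.93), ("Highrise", 99.95),
                    ("Campfire", 99.95), ("Backpack", 99.98)]:
    print(f"{app}: {downtime_hours(uptime):.1f} hours of downtime")
# Basecamp: 6.1, Highrise/Campfire: 4.4, Backpack: 1.8 -- which lines up
# with the figures in the list above.
```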
Note that we’re not juking the stats here by omitting “scheduled” downtime. If you’re a customer and you need a file on Basecamp, do you really care whether we told you that we were going to be offline a couple of days in advance? No you don’t.
While we, and everyone else, strive to be online 100%, we’re still pretty proud of our uptime record. We hope that this level of transparency will force us to do even better in 2012. If we could hit just 4 nines for a start, I’d be really happy.
I hope this encourages others to present their long-term uptime record in an easily digestible format.
John Saddington
on 02 Jan 12
oh yeah. this is dope. great transparency…!
Bram Jetten
on 02 Jan 12
Very well said. You’re doing a pretty damn good job!
DHH
on 02 Jan 12
For anyone who wants to set up their own uptime report, we’ve been happy with http://pingdom.com/ for tracking the outages.
Joshua Warchol
on 02 Jan 12
Ma Bell did better than five 9’s. Is that a typo? 99.999 is 8 hours per year.
Matthias
on 02 Jan 12
This is how we do it in an even more detailed way:
http://blog.desk-net.com/2012/01/01/desk-net_uptime_downtime_201/
We are a small company. Our customers are currently only located in the Central European Time zone.
Therefore we distinguish between core usage time (Sun – Fri, 9am – 7pm – most of our customers are newspapers working a lot on Sundays) and non-core usage time.
Joshua Warchol
on 02 Jan 12
Please ignore me. I need less math and more coffee this morning.
DHH
on 02 Jan 12
Matthias, we have customers from, I believe, 100+ countries. It’s always core usage time somewhere. So we don’t allow ourselves the opt-out to distinguish.
Andy
on 02 Jan 12
I saw “juking the stats” and thought “please let this be a link to The Wire.”
Thanks for coming through.
Shining up shit and calling it gold.
mike
on 02 Jan 12
99.77%... I need to improve that; it’s 20.5 hours (mostly because of infrastructure transfers)! :( Congrats on yours, though!
Brian
on 02 Jan 12
When I created our status page, I tried to remove as much as possible. I left it at the 12-month percentage and a link to the official status feed on Twitter. (So if there is an outage, they can see what it is right away.)
Macminicolo.net/status
Matthias
on 02 Jan 12
@DHH Sure, we will need to stop doing that once we grow our user base. It is kind of a luxury (and makes life for our IT easier).
Garry
on 02 Jan 12
Those are some impressive stats! I really wonder what methodology Facebook adopts to manage almost 100% uptime. What do they do differently?
Salim Virani
on 02 Jan 12
Nice to see you taking real responsibility for your customers. I hope you lead the way for other companies to follow.
Managing downtime isn’t just about the numbers, though. Failures happen, and how those are handled really makes a difference. And of course, 37signals is transparent here too!
Are you going to go long-term with that too? 37signals Customer Support Happiness Report is just for the last 100 ratings. http://smiley.37signals.com/
DHH
on 02 Jan 12
Salim, that’s a great idea. We’ll post a 2011 summary for Smiley tomorrow.
marcgg
on 02 Jan 12
Great initiative in posting this! I love the approach you’ve taken of counting scheduled maintenance.
It would be interesting to see how this downtime is distributed. A single four-hour downtime a year is maybe more troublesome than 240 one-minute outages.
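[Editor's note: one way to surface the distribution marcgg is asking about is to summarize an incident log rather than a single percentage. The sketch below assumes a hypothetical list of outage durations; it is not 37signals data or tooling.]

```python
# Sketch: summarizing how downtime is distributed across incidents.
# The durations below are made-up examples, not 37signals numbers.

incident_minutes = [1, 1, 3, 45, 120, 2, 8]  # hypothetical outage durations

total = sum(incident_minutes)
summary = {
    "incidents": len(incident_minutes),
    "total_minutes": total,
    "longest_minutes": max(incident_minutes),
    "mean_minutes": round(total / len(incident_minutes), 1),
}
print(summary)
# Two services with identical total downtime can look very different here:
# one long outage vs. many one-minute blips.
```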
Fred
on 02 Jan 12
Wonder what Tumblr clocks in at. Hasn’t hurt their growth either, but I won’t use it for client work.
Prateek Dayal
on 02 Jan 12
DHH: Curious about your setup. How do you measure uptime? Do you use HTTP monitoring in Pingdom, or send back a custom XML with status/response time to Pingdom?
How do you do HTTP monitoring for login-protected pages (the pages in the app that actually get hit, vs. the login page, etc.)?
sverre nokleby
on 02 Jan 12
The new Pingdom status panel is pretty good; here are the last months for tagdef.com.
Too bad they don’t include a total uptime for the period in this view.
Linus Ericsson
on 02 Jan 12
Great job with the uptime! Transparency on these matters is of great importance (and someone will publish the numbers anyway).
How about spending next year on making the design of this blog a bit more clear? Everything looks like google ads, except for the pictures, which are the true ads. Very confusing.
BR / Gmail-user.
Suraj
on 02 Jan 12
Yea – transparency wins over falsified truth.
mike
on 02 Jan 12
For anyone interested in Pingdom: they have a 70% discount on yearly plans until tomorrow.
John
on 02 Jan 12
Pingdom is a good service, but we recently switched to Panopta and have found it to be more accurate and have more features.
mike
on 02 Jan 12
Another very good one, IMHO, is the one provided with newrelic. And everything tied up with pagerduty, just to be woken up in the middle of the night :)
Jonathan
on 02 Jan 12
We tell our customers 99.9%, period, no exceptions. We don’t qualify that 99.9% with exceptions for planned maintenance or any other weasel language that gives us an out. We set a realistic expectation from day one that we know we can achieve without breaking the bank, and we usually exceed it. In 2011 we were at 99.97%, and with the exception of one 6-minute outage caused by human error, all downtime was scheduled at least 10 days in advance and communicated as such. The longest planned outage was 15 minutes. All others were squeezed into the (roughly) 90 seconds of downtime our SLA allows per day. We have found that even the most demanding customers greatly prefer multiple brief outages (even over the course of a few days) to one long outage. As we improve our infrastructure and processes, we will probably add another 9 to the SLA for future contracts.
Martin May
on 02 Jan 12
David, thanks for the initiative. Following your lead, we (Forkly) have published our stats for 2011 as well:
http://blog.forkly.com/post/15190891337/lets-get-honest-about-uptime-forklys-take
While we don’t have the fancy detailed pages yet, we’ll look into that for 2012.
Sol Irvine
on 02 Jan 12
Jonathan,
99.9% of what, and measured at what frequency, and across what demarc points? I’m being a bit facetious, but this stuff matters a lot. You’re clearly way ahead of the pack in terms of transparency and good will, but you can’t just say “99.9% uptime”.
There are four crucial elements of any uptime metric: (1) the demarc points of the network components being measured, (2) the unit of measurement (i.e., the frequency of polling), (3) the period being measured (i.e., how many units per year, month, day, hour), and (4) the remedies for failures.
Most vendors game all four to their advantage. If you’re seeking to go the opposite direction from your competition, go all the way!
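[Editor's note: Sol's four elements are concrete enough to write down as a checklist. A minimal sketch follows; the field names and example values are illustrative, not from any standard or vendor contract.]

```python
# Sketch: capturing the four elements of an uptime metric in one place,
# so an SLA can't quietly omit any of them. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class UptimeMetricDefinition:
    demarc_points: str        # (1) which network components are measured
    polling_interval_s: int   # (2) unit of measurement: how often we poll
    reporting_period: str     # (3) e.g. "monthly", "quarterly", "yearly"
    failure_remedies: str     # (4) what the customer gets when it's missed

sla = UptimeMetricDefinition(
    demarc_points="load balancer to application response",
    polling_interval_s=60,
    reporting_period="quarterly",
    failure_remedies="service credits per contract",
)
```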
Jonathan
on 02 Jan 12
We keep it simple. We define “available” as: the web applications and services we provide are all 100% functional, as measured by a battery of internal and external tests. Any test that does not require privileged access is either accessible, controllable, or repeatable by customers if they so desire. For external testing, we rely largely on Pingdom to handle network availability (though we do leverage it for application availability monitoring as well), plus a series of automated tests run from our DR location (with and without privilege) that continuously exercise functions inside the application and that, should they fail, would indicate a problem not immediately visible to a limited GET or POST test at Pingdom. The tests from the DR are on a continuous loop. Pingdom checks are at a 1-minute interval. The 99.9% figure is officially quarterly, though we provide monthly and yearly summaries. I can’t discuss SLA penalties publicly.
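[Editor's note: a rough sketch of the "continuous loop" idea Jonathan describes: run a battery of application-level checks repeatedly, record each result, and derive the quarterly availability figure from the samples. The check itself is a placeholder; this is not his actual setup.]

```python
# Sketch: a continuous-loop availability check with a placeholder test.
import time

def check_application() -> bool:
    # Placeholder: in practice this would log in, exercise key features
    # (with and without privileged access), and verify the responses.
    return True

def run_loop(samples, interval_s=60, iterations=10):
    """Collect (timestamp, passed) samples at a fixed interval."""
    for _ in range(iterations):
        samples.append((time.time(), check_application()))
        time.sleep(interval_s)

def availability(samples):
    """Fraction of samples in which every check passed."""
    if not samples:
        return 1.0
    return sum(1 for _, ok in samples if ok) / len(samples)

# At a 99.9% target, the error budget is about 0.001 * 86,400 ≈ 86 seconds
# of failed samples per day, close to the "roughly 90 seconds" mentioned
# earlier in the thread.
```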
Jason
on 02 Jan 12
Great post, and it’s great to see someone taking a visible stand for demonstrating their stats. As a monitoring company (Panopta), we always encourage our customers to take a visible, transparent approach – there’s always some risk that you’ll have problems, but those will be visible regardless. Better to be open about how you’re performing, plus that visibility gives you additional encouragement to take the right steps to minimize your downtime.
One question: are you just monitoring the public site or login page for each application, or do you do deeper checks to ensure the actual application is functioning correctly? Oftentimes the entry page is mostly static and isn’t the best indicator of whether your full application is functioning correctly.
NL
on 02 Jan 12
@Jason and @Prateek—our test logs in, causes some data to be fetched from the database, and renders a page, which we then check against what it should return.
We haven’t (to date) had any false positives or false negatives (when it alerts, the site is really down, and if it doesn’t alert, the site is really up).
This obviously isn’t a replacement for functional or integration tests to ensure that a commit doesn’t cause a piece of the app to stop working, but it does test the full infrastructure stack to make sure that it’s performing the way we expect it to.
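[Editor's note: a minimal sketch of the kind of check NL describes: log in, force a database-backed page render, and compare the result against expected content. The URLs, form fields, and marker string are invented for illustration, not 37signals' actual endpoints.]

```python
# Sketch: an end-to-end "log in and render" availability check.
# URLs, form fields, and the marker string are hypothetical.
import requests

def site_is_up():
    session = requests.Session()
    login = session.post("https://example.com/session",
                         data={"username": "monitor", "password": "secret"},
                         timeout=10)
    if login.status_code != 200:
        return False
    page = session.get("https://example.com/projects", timeout=10)
    # The page is rendered from live database data, so the marker only
    # appears if the full stack (web, app, database) is working.
    return page.status_code == 200 and "Monitoring project" in page.text

if __name__ == "__main__":
    print("up" if site_is_up() else "down")
```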
Scott Windsor
on 02 Jan 12
I love the format and the approach. I’ve been a long-time fan of Pingdom as well. Did you make the uptime pages a plugin? I’d love to use the same on our site (although with a different format/branding, of course).
Aaron Suggs
on 03 Jan 12
Great idea!
Curious if 37s has many partial outages and if/how you would report those.
E.g., search or file uploads are unavailable, but the rest of the site works. Or a single app server has an undetected problem that causes it to pass the health check but return 500 errors for most other URLs: 1/nth of requests fail.
Most web apps are engineered to fail in many small, isolated ways, avoiding giant outages. But reporting these partial outages is tricky.
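[Editor's note: one common way to report the partial outages Aaron describes is request-weighted availability rather than a binary up/down percentage. A sketch under that assumption; the request counts are invented.]

```python
# Sketch: request-weighted availability, which captures partial outages
# (e.g. one bad app server failing 1/Nth of requests). Figures are
# hypothetical, not from 37signals.

total_requests = 10_000_000
failed_requests = 12_500   # e.g. one unhealthy server behind the balancer

request_availability = 1 - failed_requests / total_requests
print(f"{request_availability:.4%}")   # 99.8750%

# A binary up/down probe from outside might have reported 100% for the
# same period, since most probes would hit healthy servers.
```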
Chris Sparno
on 03 Jan 12
David, I have always appreciated your company’s transparency in tweeting outages and what caused them. Unfortunately, there are some SaaS companies that see this transparency as a weakness and don’t share this type of info.
Anonymous Coward
on 04 Jan 12
Great spam considering the title of this post ;-)
Meaning of uptime
on 05 Jan 12
Uptime stats are rarely fully accurate; only local monitoring can really be. I do agree that Pingdom offers a very good service, but that doesn’t mean it’s fully accurate. I’m sure about this because I once got a downtime alert while I was working on the server: no high server load, no high latency, no services down. After receiving that alert I did a ping and HTTP check from 3 different locations and there was no problem. And if you set your check interval too high, the server could be down between two checkpoints without it being recorded in the stats. So was the real uptime really 100% between checks, or not?
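[Editor's note: the point about check intervals can be made concrete: with a polling monitor, an outage that starts and ends between two checks is never recorded at all. A sketch with made-up interval and outage times.]

```python
# Sketch: why the polling interval bounds what an external monitor can see.
# All times are hypothetical, in seconds.

check_interval = 60                       # Pingdom-style 1-minute checks
checks = range(0, 600, check_interval)    # check timestamps over 10 minutes

outage = (125, 170)                       # a 45-second outage between checks

detected = any(outage[0] <= t < outage[1] for t in checks)
print("outage detected" if detected else "outage missed")   # outage missed
```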
Paul Turner
on 06 Jan 12
What a great discussion. We launched our Trust site in April 2011. We are testing every 8s. We are running tests that emulate (using Python and Selenium) the things that our users do. We publish the response times of these tests in pretty much real time. And we are differentiating across the range of user interactions we offer across our app. With plenty of redundancy and hot failover within our data centre and between data centres, we are truly aiming for five 9s. On the 30th of December 2011 we passed 2M public tests of the service. By the end of 2012 it will be 10M such tests. The “agents” we use for the tests are geographically dispersed. There is lots more to do and I’d welcome any direct feedback.
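[Editor's note: since Paul mentions Python and Selenium, here is a minimal sketch of that style of synthetic test: drive a real browser through a user-facing step and record how long it took. The URL, element name, and browser choice are placeholders, not his actual setup.]

```python
# Sketch: a Selenium-style synthetic check that times a page interaction.
# URL and element name are placeholders.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def timed_login_page_check(url="https://example.com/login"):
    driver = webdriver.Chrome()
    try:
        start = time.time()
        driver.get(url)
        # Wait until the login form is actually usable, not just requested.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.NAME, "username"))
        )
        return time.time() - start   # seconds, publishable as a response time
    finally:
        driver.quit()

if __name__ == "__main__":
    print(f"login page rendered in {timed_login_page_check():.2f}s")
```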
Kyla Cromer
on 07 Jan 12
Twitter is still fail whaling….