As we announced at the beginning of the month, we’re always on a mission to improve our uptime. Inaccessible apps cause a lot of frustration, and users don’t care whether an outage was scheduled or not.
While publishing our own uptime has been a great step toward getting everyone in the company focused on improving it, we also wanted to compare ourselves to others in the industry. So since December 16, we’ve been tracking five other applications through Pingdom to compare and contrast.
The goal is the least amount of downtime. Here are the results for the period December 16 to January 31:
- Github, down for 6 minutes
- Freshbooks, down for 14 minutes
- Basecamp, down for 16 minutes
- Campaign Monitor, down for 21 minutes
- Shopify, down for 1 hour and 53 minutes
- Assistly (now Desk), down for 6 hours and 46 minutes
Congratulations to Github for the number one spot on the list. We are definitely going to be gunning for them! We’ll publish another edition of this list in a month or so.
Jesse Newland
on 31 Jan 12
The ops team is focusing on availability at GitHub this year. 6 minutes over that period is more downtime than I’m personally comfortable with.
Bring it on :)
GB
on 31 Jan 12
@37signals
I don’t believe it.
So, in 6 weeks you guys were down for 16 minutes. If you extrapolated that, Basecamp would be down approximately 2 hours over a year? No way.
I can say with confidence that last year, Basecamp was down for way more than 2 hours.
A simple Twitter search will prove it
DHH
on 31 Jan 12
GB, we’ve published our entire uptime history for the last twelve months for Basecamp on http://basecamphq.com/uptime. As mentioned in the post linked in the very first sentence, we were down about 6 hours in 2011. We certainly aim to decrease that big time in 2012.
GB
on 31 Jan 12
@DHH
All I’m saying is this blog post is deceiving, given that, as you said, you already posted your uptimes (downtimes) for all of last year.
When people read a headline that says “Benchmarking Basecamp’s uptime” ... then only see 6 minutes, my immediate reaction was WHAT, NOT TRUE?
Then I had to notice you were only reporting for a 6-week period. Seems a bit like you cherry-picked a period of time when you had below-average downtime (an extrapolated 2 hours of downtime vs. an actual 6 HOURS of yearly downtime).
DHH
on 31 Jan 12
GB, the reason we’re only reporting 6 weeks is because that’s all the data we have for the benchmarked services. If you read the article, you’ll see that we set up the benchmarks on December 16. So while it would be great to compare a whole year’s worth of data, it just wasn’t possible. When it is possible we will.
I didn’t think it was a very long article? Your first comment missed the first paragraph and your second comment missed the second paragraph. There’s only two more paragraphs to go, so please take a swing at them :)
Adam
on 31 Jan 12
@GB I think it’s a bit silly to just read a number and make an assumption of the time period it is for.
They made no claims that this was an average month or representative of past or future performance. You made a HUGE jump in extrapolating yearly performance based on a small sample size.
@DHH It’s great when any company embraces more transparency and attempts to improve their service.
FWIW, over at ArgyleSocial.com we managed 8 minutes of downtime (according to Pingdom) over that period. Looks like we’re holding up pretty well, though perhaps with a bit less load than you guys ;)
DHH
on 31 Jan 12
Adam, that’s awesome! 8 minutes of downtime over 45 days is 99.99% uptime. You have reason to be proud of that.
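DHH’s arithmetic here is easy to verify. A minimal sketch of the uptime calculation (the 45-day window is taken from the December 16 to January 31 period in the post):

```python
# Uptime as a percentage of a monitoring window. The window length
# (45 days) corresponds to the Dec 16 - Jan 31 period in the post.

def uptime_percent(downtime_minutes: float, window_days: float) -> float:
    """Return uptime as a percentage of the monitoring window."""
    total_minutes = window_days * 24 * 60
    return 100.0 * (1 - downtime_minutes / total_minutes)

print(round(uptime_percent(8, 45), 2))   # ArgyleSocial's 8 minutes -> 99.99
print(round(uptime_percent(16, 45), 2))  # Basecamp's 16 minutes   -> 99.98
```

Eight minutes over 45 days (64,800 minutes) does indeed round to 99.99% uptime.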
Christian
on 31 Jan 12
Just signed up for Pingdom and they sent me my password via email in clear text. Great!
Ryan
on 31 Jan 12
As much as I love GitHub, the web frontend going down is usually a minor annoyance, while problems on the git backend can be a major disruption to workflow. I’m guessing these numbers only include the web frontend?
NL
on 31 Jan 12
@Ryan – correct, for Github we’re only monitoring the web site. In each case, our check does something that should exercise the web part of the app in a way that’s similar to how a user would use it. We put the same care into finding a good check for the benchmarked apps as we do for our own.
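The kind of check described above goes beyond a bare ping: it fetches a page and verifies it contains expected content, so an app serving error pages still counts as down. A minimal sketch of that idea (the URL and expected string are placeholders, not the actual checks 37signals runs):

```python
# A Pingdom-style content check: fetch a URL and verify both the
# status code and some expected text, rather than just pinging the
# host. Any network error, timeout, or HTTP error counts as down.
import urllib.request

def check(url: str, expected_text: str, timeout: float = 10.0) -> bool:
    """Return True if the page loads with HTTP 200 and contains the text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and expected_text in body
    except OSError:
        # URLError/HTTPError are OSError subclasses: treat all as down.
        return False
```

For example, a Basecamp-style check might load a known project page and look for a string that only renders when the app is actually working.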
Michael
on 31 Jan 12
This is an awesome way to use Pingdom. I never thought of doing this until now!
GB is an idiot. Read much?
Sairam
on 31 Jan 12
Heroku was down today as well: https://status.heroku.com/
George
on 31 Jan 12
I bet these numbers don’t take into account the downtime of Pingdom itself ;)
Rich Lafferty
on 31 Jan 12
I feel like I’m being watched! :-)
14 minutes for FreshBooks in that period sounds about right to me, for whatever that’s worth. I’m curious, though, do you have Pingdom monitoring our website, or our app? They’re hosted separately.
-Rich, FB IT PHB
Will Jessop
on 31 Jan 12
@Rich Lafferty: “On Freshbooks a known invoice” (see http://news.ycombinator.com/item?id=3533412)
Andrew
on 31 Jan 12
Pingdom’s monitors only check every 60 seconds. If you want real downtime stats, sign up for the Verelo beta at www.verelo.com and get monitoring that checks as frequently as every 5 seconds.
You can’t claim your site was only down for 6 minutes if you only check it once every 60 seconds.
Verelo is going to monitor the sites mentioned above for the next 30 days and provide comparative results the next time this blog is updated.
Michael
on 31 Jan 12
Andrew, perhaps you could tell us what difference you’re expecting.
David Andersen
on 31 Jan 12
@GB -
Let me help you.
“Then I had to notice you were only reporting for a 6 week period.”
should be
“Then I actually pulled my head out of my *ss, read the literal text, used my brain and realized there’s no problem here.”
David Andersen
on 31 Jan 12
Andrew, how often do you think a given site goes down and then back up within a 60-second window? For it to happen a statistically significant number of times would be odd.
L Roa
on 31 Jan 12
This is great (and hilarious). One of the things I found out when moving to Silicon Valley (after working as an Architect at Bell Labs) was the complete lack of understanding of what it takes to have 5 or 6 nines of uptime by the majority of the designers in the Valley. Of course Marketing would claim “very high availability and fault tolerance,” but when actually measured the results were far, far from the benchmark.
Experience can’t be improvised.
It’ll take some time to get there, but I finally see a new web 2.0 outfit make a serious attempt at it.
It’ll take some time and gaining some experience, but you are on the right path (real world measurements). Good luck and best wishes!
PingOfDeath
on 31 Jan 12
@Andrew: How can you claim Verelo’s measurement will be accurate if it only samples every 5 seconds?
I wrote my own tool which continuously hammers a server in order to accurately measure uptime. I found out that most servers out there crash all the time… Strangely enough, usually soon after my tool starts sampling, though.
David Andersen
on 31 Jan 12
@PingofDeath
Exactly.
My motto is: if you’re not sampling sub-millisecond, you’re probably lying about your uptime.
bruno
on 31 Jan 12
Hey, how can I get the Defensive Design for the Web ebook for free? Thanks
ShirleyYouJest
on 31 Jan 12
Selective transparency is actually called “lying with statistics”. Another classic play from the Apple marketing playbook. When’s the last time 37s blogged about downtime again?
Btw, my app has been up for 10 minutes and I have 100.0% uptime. Alright!
Andrew
on 31 Jan 12
@David Andersen
Based on what we’ve seen, it’s actually very common for a site to go up and down inside the 1-minute range. Monitoring every 60 seconds is pretty decent, but if you’re very concerned, or seeing some unusual downtime events that only happen on and off (as in, not consistently enough within a 1-minute period to trigger an alert), you’re likely to pick them up once you start monitoring below the 1-minute mark.
@PingOfDeath
Good question, and that’s exactly how we feel about it too :-) As far as we know we’re the only provider offering a very reasonably priced 5-second monitor; you’ll find the industry standard is around 60 seconds to 5 minutes. We are considering 1-second checks but are first addressing some required verification tasks to ensure we don’t “DoS” someone’s site by attempting to monitor it.
Your own script to “hammer” it isn’t a bad idea, but you are probably not monitoring from all around the world. Verelo is monitoring from a lot of locations, and we’re constantly adding more to ensure all sides of the equation are taken into consideration, i.e. caching, network connectivity between geographical regions, and distance.
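Andrew’s point about sampling intervals can be illustrated with a toy model (this is not Verelo’s or Pingdom’s actual method, just a sketch of the sampling effect): an outage shorter than the polling interval can fall entirely between two probes and never be recorded.

```python
# Toy model: a poller only knows the state of the site at probe
# instants, so each probe landing inside an outage is counted as one
# full interval of downtime, and outages between probes are invisible.

def detected_downtime(outage_start: int, outage_end: int,
                      interval: int, window: int = 3600) -> int:
    """Seconds of downtime a poller with the given interval would report."""
    hits = sum(1 for t in range(0, window, interval)
               if outage_start <= t < outage_end)
    return hits * interval

# A 40-second outage from t=10s to t=50s:
print(detected_downtime(10, 50, interval=60))  # 60 s probes report 0
print(detected_downtime(10, 50, interval=5))   # 5 s probes report 40
```

This is the crux of the thread: 60-second polling systematically misses the brief flapping Andrew describes, while finer intervals resolve it, at the cost of hammering the target harder.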
PingOfDeath
on 31 Jan 12
@Andrew: “but you are probably not monitoring from all around the world”
Oh but I am! A Russian friend of mine graciously lent (well leased, actually) me some sort of distributed network of computers he somehow set up around the world (I think he calls that a botnet). I can tell you the target servers are probed from all around, so to speak.
Brian
on 31 Jan 12
Will 37signals be sticking with Assistly (desk.com) now that Salesforce owns it? Seems like, little by little, they are moving into your space with these smaller acquisitions.
Matt Carey
on 31 Jan 12
Basecamp is down again. It goes down for a few minutes every day at the moment, which must add up to more than 6 minutes…
Paul D
on 31 Jan 12
Oi… proof that no good deed (post) goes unflamed. Can someone shoot me when I need to discriminate my app’s uptime at 5-second resolution?
David Andersen
on 01 Feb 12
All right Andrew. What are typical situations where a site goes down and (I assume) automatically restarts in less than 60 seconds? I’m genuinely curious.
Andrew
on 01 Feb 12
@David
In most cases we’ve seen sub-minute issues such as poorly executed deploys, timeouts, and slow MySQL queries (such as a missing index) result in pages displaying unexpected content.
In one very rare case we noticed a bad Puppet script at a company we were monitoring was rebooting Apache on all their servers once every 30 minutes (it was missing a clause which would first check if a host file already existed), which resulted in the load balancer kicking the servers out of service (and there being none in service!).
Appreciate the question. I think the types of issues sub-minute monitors pick up vs. 60 seconds and above are fairly different, and in general we’re just not used to finding them this way. It’s a pretty cheap and easy way to find some very unexpected results; we’ve personally discovered a lot in past systems we’ve worked on through this product.
James
on 01 Feb 12
@Christian
Sending your password in clear text doesn’t necessarily mean they store it in clear text, though it really makes users worry. They might create a random password, send it to you via email, then store only the hashed password for authentication.
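The pattern James describes can be sketched briefly. This is a minimal illustration using only the standard library, not Pingdom’s actual implementation; a production system would typically use a dedicated scheme like bcrypt or argon2.

```python
# Generate a random password to email once, but persist only a
# salted PBKDF2 hash, so the plaintext never reaches the database.
import hashlib
import os
import secrets

def provision_password() -> tuple[str, bytes, bytes]:
    """Return (plaintext to email once, salt, hash to store)."""
    plaintext = secrets.token_urlsafe(12)          # random initial password
    salt = os.urandom(16)                          # per-user random salt
    digest = hashlib.pbkdf2_hmac("sha256", plaintext.encode(),
                                 salt, 100_000)    # slow, salted hash
    return plaintext, salt, digest

def verify(attempt: str, salt: bytes, stored: bytes) -> bool:
    """Check a login attempt against the stored salt and hash."""
    candidate = hashlib.pbkdf2_hmac("sha256", attempt.encode(),
                                    salt, 100_000)
    return secrets.compare_digest(candidate, stored)
```

So the email proves nothing either way about storage; only the provider can say whether the plaintext survives past that first message.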
GeeIWonder
on 01 Feb 12
Interesting stuff, admirable goals. Not all downtime is equal, so some metrics on actual impact would be merited.
The Basecamp numbers are not monitored via Pingdom, correct? I think I’d want to emphasize which numbers were derived how. There’s a granularity issue that should be significant on these sorts of time scales. Just ask your cellphone service provider.
GeeIWonder
on 01 Feb 12A companion number that is therefore useful would be to compare # of outage events. This is probably readily available to you.
DHH
on 01 Feb 12
GW, yes, the Basecamp numbers are measured through Pingdom as well. Same methodology as we measure the benchmarks with.
The 16 minutes of Basecamp downtime was 1×10 minute outage, 1×3 minutes, and 3×1 minutes, I believe.
GeeIWonder
on 01 Feb 12
In which case I don’t see a big deal about increasing temporal resolution. Apples to apples is fine.