You’re reading Signal v. Noise, a publication about the web by Basecamp since 1999.

Signal v. Noise: Sysadmin

Our Most Recent Posts on Sysadmin

Anton Koldaev joins 37signals as newest Sysop

Taylor wrote this · 22 comments

Today, we are announcing Anton Koldaev as the newest member of our operations team! Anton hails from Moscow and recently worked at the first Russian-based cloud hosting company.

In David’s recent post on remote working, there were a number of comments about investing in more junior or less experienced individuals. Anton, who is 24, is a great example of that, and he quickly won us over during his 30-day trial as a junior member of our team.

How we hired Anton

Anton saw our job board post via an old colleague’s tweet and responded with a well-written email. We reviewed his resume and gave him a traditional operations task to complete: deploy Redmine to EC2 using Chef, and document the process.
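For those curious what a trial task like that boils down to, here’s a rough sketch of a minimal Chef recipe for the job. The package list, paths, and attribute names are illustrative only, not Anton’s actual submission:

```ruby
# Hypothetical sketch: package names, paths, and attributes are illustrative.
%w[git ruby rubygems mysql-server libmysqlclient-dev].each do |pkg|
  package pkg
end

git "/srv/redmine" do
  repository "https://github.com/redmine/redmine.git"
  revision   "master"
  action     :sync
end

template "/srv/redmine/config/database.yml" do
  source "database.yml.erb"                              # shipped with the cookbook
  variables(password: node["redmine"]["db_password"])
end

execute "bundle install --without development test && bundle exec rake db:migrate" do
  cwd         "/srv/redmine"
  environment("RAILS_ENV" => "production")
end
```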

Not only did Anton do an excellent job on the operational side of things, he was the only applicant to use the Redmine instance itself to document his work. He was also the only applicant to take a page from tally and build a little Campfire integration for code pushes, deployment notifications, etc.
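A Campfire integration like that is only a handful of lines against the Campfire API. Here’s a rough sketch of the shape it takes; the account subdomain, room ID, and token are placeholders, and this is not Anton’s actual code:

```ruby
# Hypothetical sketch: post a deploy notification to a Campfire room.
# ACCOUNT, ROOM_ID, and the token are placeholders, not real credentials.
require "net/http"
require "json"

def speak(message)
  uri = URI("https://ACCOUNT.campfirenow.com/room/ROOM_ID/speak.json")
  req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  req.basic_auth(ENV["CAMPFIRE_TOKEN"], "X")   # Campfire uses token-as-username auth
  req.body = { message: { body: message } }.to_json
  Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
end

speak("Deployed basecamp to production (rev abc123)")
```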

Impressed with Anton’s work on the trial project, each member of the team had a chat with him on Skype. (We also talked to his references, who all had great things to say.) As a team we agreed that Anton would make a good contractor, and offered him a 30-day trial.

Anton started with documentation, and worked his way through projects such as upgrading our Chef server, deploying new hardware for Basecamp, and rebuilding our staging database servers. Near the end of his trial, Anton was able to get a visa to visit Will in the UK for a week of co-working and in-depth learning about our operations.

Welcome Anton!

Let's get honest about uptime

David wrote this · 36 comments

Ma Bell engineered their phone system to have 99.999% reliability. Just 5 minutes of downtime per year. We’re pretty far off that for most internet services.
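If you want to translate nines into downtime yourself, the arithmetic fits in a few lines of Ruby:

```ruby
# Convert an uptime percentage into allowed downtime per year.
HOURS_PER_YEAR = 24 * 365.25

def downtime_hours(uptime_percent)
  HOURS_PER_YEAR * (100 - uptime_percent) / 100.0
end

downtime_hours(99.999) * 60  # => ~5.3 minutes a year, the "five nines" Ma Bell target
downtime_hours(99.99)  * 60  # => ~53 minutes a year, "four nines"
downtime_hours(99.9)         # => ~8.8 hours a year, "three nines"
```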

Sometimes that’s acceptable. Twitter was fail whaling for months on end, and that hardly seemed to put a dent in their growth. But if Gmail is down for even 5 minutes, I start getting sweaty palms. The same is true for many customers of our applications.

These days most savvy companies have gotten pretty good about keeping a status page updated during outages, but it’s much harder to get a sense of how they’re doing over the long run. The Amazon Web Services Health Dashboard only lets you look at a week at a time. It’s the same with the Google Apps Status Dashboard.

Zooming in like that is a great way to make things look peachy most of the time, but to anyone looking to make a decision about the service, it’s a lie by omission.

Since I would love to be able to evaluate other services by their long-term uptime record, I thought it only fair that we allow others to do the same with us. So starting today we’re producing uptime records going back 12 months for our four major applications:

  • Basecamp: 99.93% or about six hours of downtime.
  • Highrise: 99.95% or about four hours of downtime.
  • Campfire: 99.95% or about four hours of downtime.
  • Backpack: 99.98% or just under two hours of downtime.

Note that we’re not juking the stats by omitting “scheduled” downtime. If you’re a customer and you need a file from Basecamp, do you really care whether we told you a couple of days in advance that we were going to be offline? No, you don’t.

While we, and everyone else, strive to be online 100% of the time, we’re still pretty proud of our uptime record. We hope that this level of transparency will force us to do even better in 2012. If we could hit just 4 nines for a start, I’d be really happy.

I hope this encourages others to present their long-term uptime record in an easily digestible format.

Behind the Scenes: Internet Connectivity

Taylor wrote this · 23 comments

Last year, we suffered a number of service outages due to network problems upstream. In the past 9 months we have diligently worked to install service from additional providers and expand both our redundancy and capacity. This week we turned up our third Internet provider, accomplishing our goals of circuit diversity, latency reduction and increased network capacity.

We now have service from Server Central / nLayer Networks, Internap, and Level 3 Communications. Our total network capacity is in excess of 1.5 gigabits per second, while our mean customer-facing bandwidth utilization is between 500 megabits and 1 gigabit. In addition, we’ve deployed two Cisco ASR 1001 routers which aggregate our circuits and allow us to announce our /24 netblock (our own IP address space) via each provider.
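For the curious, announcing a /24 to each upstream boils down to a short BGP stanza per provider. Here is a simplified, hypothetical IOS-style sketch with placeholder ASNs, addresses, and prefix; it is not our actual configuration:

```
! Hypothetical sketch: placeholder ASN, prefix, and neighbor address.
! Each upstream provider gets an equivalent neighbor stanza.
router bgp 64512
 network 203.0.113.0 mask 255.255.255.0           ! our /24 netblock
 neighbor 198.51.100.1 remote-as 65001            ! upstream provider A
 neighbor 198.51.100.1 description provider-a
 neighbor 198.51.100.1 prefix-list OUR-NETBLOCK out
!
ip prefix-list OUR-NETBLOCK seq 5 permit 203.0.113.0/24
ip route 203.0.113.0 255.255.255.0 Null0          ! gives the network statement a matching route
```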

Keeping Basecamp, Highrise, Backpack, and Campfire available to you at all times is our top priority, and we’re always looking for ways to increase redundancy and service performance. This setup has prevented at least 4 significant upstream network issues from becoming customer-impacting… which we can all agree is great!

Looking for two more people to join our operations team

Taylor wrote this · Discuss

We are looking for two more people to join our operations team. We would prefer individuals interested in both application development and systems engineering. We’ve got hundreds of servers running Ubuntu, 400+ terabytes of Isilon storage, and we’re building out a second site. We use lots of “bare metal” in addition to VMware virtualization and some of the Amazon and Rackspace Cloud services.

Come work with us at 37signals and do the best work of your career.

Want more information? Check out this post on the Job Board.

Eron Nicholson joins 37signals as Sysop

Taylor wrote this · 7 comments

Yesterday Eron Nicholson joined John, Will, and me on our operations team!

Recently Eron’s work included building and maintaining several multi-data center installations for a green energy startup. In addition to wearing the hardware hat, Eron was responsible for networking, monitoring and systems administration. As if those duties weren’t enough, Eron also developed internal tools and did other software design (including embedded systems).

We are really looking forward to Eron’s work on our monitoring and high availability systems.

Welcome Eron!

PS Hat tip to Nic at New Relic for introducing us to Eron.

PPS Eron hails from North Carolina and enjoys racing … of a different variety (24 Hours of Lemons).

Looking for one more person to join our operations team

Taylor wrote this · Discuss

We’re looking for another person to join our operations team. We would prefer someone in the United States who is interested in both application development and systems engineering.

You will automate sysadmin tasks with Chef, troubleshoot application performance issues with New Relic, and create new tools to make things run more smoothly and efficiently. You’ll also be expected to participate in an on-call rotation once you’re fully up to speed (traditionally 1 to 2 months).
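To give a taste of the Chef side of the job, here’s a tiny, hypothetical example of the kind of resource you’d write day to day; the cookbook layout and attribute names are invented for illustration:

```ruby
# Hypothetical example of day-to-day Chef work: keep ntp installed,
# configured, and running on every box. Attribute names are made up.
package "ntp"

template "/etc/ntp.conf" do
  source   "ntp.conf.erb"                        # template shipped in the cookbook
  variables(servers: node["ntp"]["servers"])
  notifies :restart, "service[ntp]"
end

service "ntp" do
  action [:enable, :start]
end
```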

Want more information? Check out this post on the Job Board.

Will Jessop joins 37signals as Sysop

Taylor wrote this · 17 comments

We’re excited to announce another new addition to our team: Will Jessop, who pings from Manchester in the UK, is now the second EU engineer on our Operations team.

Will comes from Engine Yard, where he provided scaling and support expertise for Rails sites such as Seeking Alpha, Groupon, and Kgb. Recently Will worked on a senior team tasked with live-migrating more than half of Engine Yard’s private cloud customers to a new cloud platform. It’s going to be great to have another member of the team with such a deep understanding of application development and systems implementation practices.

With the addition of Will, we’ll have a lot more ability to conduct off-hours maintenance and scalability testing. He will also be helping us improve deployment practices, standardize our applications on the latest Ruby, and improve performance of our applications in Europe and the rest of the non-US world.

Welcome aboard, Will!

Nuts & Bolts: Potpourri

Mark Imbriaco wrote this · 7 comments

As my final installment in the Nuts & Bolts series, I want to hit a few of the questions that were sent in that I didn’t get to earlier in the week. I hope you’ve enjoyed reading these as much as I’ve enjoyed writing them.

What colocation provider did you choose, and why?

After an exhaustive (and exhausting!) selection process, we chose ServerCentral to host our infrastructure. They have an awesome facility with some of the most thoughtful and redundant datacenter design I’ve ever seen. On top of the top-notch facility, they have a great network via their sister company nLayer.

Finding a partner who could manage the hardware for us without us having to be onsite was a big deal too. The quality of “remote hands” support from datacenter to datacenter is, well, let’s just call it inconsistent and be generous. ServerCentral has a great reputation with its customers in that regard, and we’ve found their support to be excellent. They manage all of the physical installations, hardware troubleshooting, and maintenance for us.


They do a mean cabling job too.

Nuts & Bolts: Storage

Mark Imbriaco wrote this · 27 comments

Next up in the Nuts & Bolts series, I want to cover storage. There were a number of questions about our storage infrastructure after my new datacenter post, asking about the Isilon storage cluster pictured there.

To set the stage, I’ll share some file statistics from Basecamp. On an average weekday, there are around 100,000 files uploaded to Basecamp, with an average file size that is currently 2MB, for a total of about 200GB per day of uploaded content. And that’s just Basecamp! We have a number of other apps that handle tens of thousands of uploaded files per day as well. Based on that, you’d expect we’d need to handle maybe 60TB of uploaded files over the next 12 months, but those numbers don’t take into account the acceleration in the amount of data uploaded. Just since January we’ve seen the average uploaded file size increase from 1.88MB to 2MB, and our overall storage consumption rate has increased by 50% with no signs of slowing down.
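To put those numbers together, here’s the back-of-the-envelope version: Basecamp only, weekdays only, and before the growth curve kicks in.

```ruby
# Back-of-the-envelope version of the figures above.
files_per_weekday = 100_000
avg_file_mb       = 2.0

daily_gb  = files_per_weekday * avg_file_mb / 1024   # ~195 GB per weekday
yearly_tb = daily_gb * 260 / 1024                    # ~50 TB across ~260 weekdays

# Add weekend traffic and the other apps and you land in the "maybe 60TB"
# ballpark, before accounting for the accelerating upload rate.
```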

When I sat down to begin planning our move from Rackspace to our new environment, I looked at a variety of options. Our previous environment consisted of a mix of MogileFS and Amazon S3. When a customer uploaded a file to one of our applications, we would store it in our local MogileFS cluster, where it was immediately available for download. Asynchronously, we would upload the file to S3, and after around 20 minutes we would begin serving it directly from S3. Staging files in MogileFS was necessary to account for the eventually consistent nature of S3.
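In code, that workflow was a synchronous local write plus an asynchronous S3 copy. A rough sketch of the shape, with hypothetical class, job, and column names standing in for our real code:

```ruby
require "active_support/time"   # for 20.minutes.ago; we're inside a Rails app anyway

# Rough sketch of the shape of the old flow. MogileStore, S3CopyJob, and the
# s3_copied_at column are hypothetical stand-ins, not our real code.
class Upload
  attr_accessor :key, :s3_copied_at

  def store(io)
    MogileStore.write(key, io)   # 1. synchronous local write, downloadable right away
    S3CopyJob.enqueue(key)       # 2. asynchronous copy to S3 in the background
  end

  def serve_from
    # 3. flip to serving from S3 only after the copy has had ~20 minutes to
    #    settle, to sidestep S3's eventual consistency
    s3_copied_at && s3_copied_at < 20.minutes.ago ? :s3 : :mogilefs
  end
end
```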

While we’ve been generally happy with that configuration, I thought that we could save money over the long term by moving our data out of S3 and onto local storage. S3 is a phenomenal product, and it allows you to expand storage without having to worry much about capacity planning or redundancy, but it is priced at a comparative premium. With that premise in mind I crunched some numbers and was even more convinced that we could save money on our storage needs without sacrificing reliability, while also reducing the complexity of our file workflow.
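The shape of that comparison looks something like the sketch below. Every price and figure in it is deliberately made up for illustration; it is not our real spreadsheet and not S3’s actual pricing.

```ruby
# Deliberately made-up prices: the shape of the comparison, not real numbers.
tb_stored            = 100.0
s3_price_per_gb_mo   = 0.10                       # hypothetical $/GB-month
nas_purchase         = 250_000.0                  # hypothetical, amortized over 3 years
nas_monthly_overhead = 2_000.0                    # hypothetical power, space, support

s3_monthly  = tb_stored * 1024 * s3_price_per_gb_mo     # => 10_240.0
nas_monthly = nas_purchase / 36 + nas_monthly_overhead  # => ~8_944.0
```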

The main contenders for our new storage platform were an expanded MogileFS cluster or a commercial NAS. We knew that we did not want to juggle LUNs or a layer like GFS to manage our storage, so we were able to eliminate traditional SAN storage as a contender fairly early on. We have had generally good luck with MogileFS, but have had some ongoing issues with memory growth on some of our nodes and at least a couple of storage-related outages over the past couple of years. And while the user community around MogileFS is great, the lack of commercial support options rears its head when you have an outage.

After weighing all of the options, we decided to purchase a commercial solution, and we settled on Isilon as the vendor for our storage platform. Protecting our customers’ data is our most important job, and we wanted a system that we could be confident in over the long term. We initially purchased a 4-node cluster of their 36NL nodes, each with a raw capacity of 36TB. The usable capacity of our current cluster with the redundancy level we have set is 108TB. We’ve already ordered another node to expand our usable space to 144TB, in order to keep pace with the storage growth that took place between the time we planned the move and when we implemented it.
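The raw-versus-usable arithmetic works out like this. The (n - 1) / n pattern below is simply inferred from our own figures; it is not a description of how OneFS protection works internally.

```ruby
# Raw-vs-usable sketch based on the figures quoted above.
def usable_tb(nodes, raw_per_node_tb = 36)
  raw = nodes * raw_per_node_tb
  raw * (nodes - 1) / nodes      # roughly one node's worth of capacity goes to protection
end

usable_tb(4)  # => 108, our current 4-node cluster
usable_tb(5)  # => 144, after the node we just ordered arrives
```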

The architecture of the Isilon system is very interesting. The individual nodes interconnect with one another over an InfiniBand network (SDR or 10 Gbps right now) to form a cluster. With the consistency level we chose, each block of data that is written to the cluster is stored on a minimum of two nodes in the cluster. This means that we’re able to lose an entire node without affecting the operation of our systems. In addition, the nodes cooperate with one another to present the pooled storage to our clients as a single very large filesystem over NFS. Isilon also has all the features like snapshots, replication, quotas, and so on that you would expect from a commercial NAS vendor. These weren’t absolute requirements, but they certainly make management simpler for us and are a welcome addition to the toolbox.

As we grow, it’s very simple to expand the capacity of the cluster. You just rack up another node, connect it to the InfiniBand backend network and to the network your NFS clients are connected to and push a button. The node configures itself into the existing cluster, its internal storage is added to the global OneFS filesystem, its onboard memory is added to the globally coherent cache, and its CPU is available to help process I/O operations. All in about a minute. It’s pretty awesome stuff, and we had fun testing these features in our datacenter when we were deploying it.

For now, we continue to use Amazon S3 as a backup, but within the next several months we intend to replace it with a second Isilon cluster in a secondary datacenter, kept in sync via replication.