As my final installment in the Nuts & Bolts series, I want to hit a few of the questions that were sent in that I didn’t get a chance to get to earlier in the week. I hope you’ve enjoyed reading these as much as I’ve enjoyed writing them.

What colocation provider did you choose, and why?

After an exhaustive (and exhausting!) selection process, we chose ServerCentral to host our infrastructure. They have an awesome facility with some of the most thoughtful and redundant datacenter design I’ve ever seen. On top of the top-notch facility, they have a great network via their sister company, nLayer.

Finding a partner who could manage the hardware for us without us having to be onsite was a big deal for us too. The quality of “remote hands” support from datacenter to datacenter is, well, let’s just call it inconsistent and be generous. ServerCentral has a great reputation with its customers in that regard, and we’ve found their support to be excellent. They manage all of the physical installations, hardware troubleshooting, and maintenance for us.

They do a mean cabling job too.

How do you bootstrap new hardware when it is installed?

We have a PXE installation server that handles installation of bare metal machines using the Ubuntu preseed unattended installation mechanism. All we do is add a small snippet of data with things like the MAC address of the primary network interface and a hostname to our Chef configuration management system and it generates the required configuration on our installation server. Our installations are extremely bare bones with just enough operating system to run our Chef configuration management recipes for final configuration.
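To make the idea concrete, here is a minimal Python sketch of turning a MAC address and hostname into a per-host PXE boot entry. It is an illustration under assumptions, not the Chef recipe described above: it assumes the installation server runs PXELINUX, which looks up per-host files named `01-` plus the dash-separated lowercase MAC, and the kernel path, initrd path, and preseed URL are hypothetical placeholders.

```python
def pxe_config_name(mac: str) -> str:
    """Return the per-host file name PXELINUX looks for under pxelinux.cfg/."""
    octets = mac.replace("-", ":").lower().split(":")
    if len(octets) != 6 or any(len(o) != 2 for o in octets):
        raise ValueError(f"not a MAC address: {mac!r}")
    int("".join(octets), 16)  # raises ValueError on non-hex characters
    # "01" is the ARP hardware type for Ethernet.
    return "01-" + "-".join(octets)


def pxe_config_body(hostname: str, preseed_url: str) -> str:
    """Build a boot entry that starts an unattended, preseeded install.

    The kernel and initrd paths are assumptions about how the TFTP root
    is laid out; adjust them to match the netboot image in use.
    """
    append = (
        "initrd=ubuntu-installer/amd64/initrd.gz "
        "auto=true priority=critical "
        f"url={preseed_url} hostname={hostname}"
    )
    return "\n".join([
        "DEFAULT install",
        "LABEL install",
        "  KERNEL ubuntu-installer/amd64/linux",
        f"  APPEND {append}",
        "",
    ])


if __name__ == "__main__":
    # Hypothetical host data, the kind of small snippet described above.
    host = {"hostname": "app-01", "mac": "00:1A:2B:3C:4D:5E"}
    print(pxe_config_name(host["mac"]))
    print(pxe_config_body(host["hostname"], "http://install.example.com/preseed.cfg"))
```

Keeping the per-host input down to a MAC address and a hostname is what makes this kind of generation easy to automate: everything else lives in one shared preseed file.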

The typical workflow is that hardware arrives at ServerCentral and is installed by their technicians. The technicians cable it per our specifications, configure the DRAC remote access cards for us, and provide us with the MAC address of the primary interface. From there we can configure our installation server, power the machine on, and within about 5 minutes have a machine that is ready to go. It works great.

How did you do such a large migration from Rackspace to your own colo? I’m assuming you had something like VMWare Motion to move some of it without downtime/interruption?

Actually, it was pretty straightforward, at least in the broad strokes. There was a lot of work involved, but the general process goes something like this:

  1. Set up a database server at the new facility, restore a recent backup to it, and connect it to the production server over a VPN for replication. We also set up the old production server to warm the cache of the new server continually so that it’s ready as soon as we flip the switch.
  2. Set up the production web/application/proxy tier at the new facility. The vast majority of this is boilerplate that is already set up in our configuration management system, so it’s largely a matter of adding a configuration entry and running Chef.
  3. Test!
  4. Change DNS to point to the new site, and set up the old one to proxy to the new one to catch any DNS stragglers.

There are a lot of details about timing of the few dozen steps involved in the final switchover, but we were able to reduce that to a checklist and repeat it successfully. Campfire, which was the last application to move, was only down for about 25 minutes during the migration. This is one situation where having all of our data on S3 was admittedly a huge benefit, since we didn’t have to worry about immediately replicating a huge amount of data (outside of the database) to the new site.
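The idea of reducing a switchover to a repeatable checklist can be sketched in a few lines of Python. This is a hypothetical illustration, not the checklist that was actually used: the step names are invented, and each step is just a function that returns True on success. The runner stops at the first failure so that nobody moves on to the DNS change while replication, say, is still broken.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus a function that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_checklist(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that fails.

    Returns the names of the steps that completed successfully.
    """
    done: List[str] = []
    for name, action in steps:
        if not action():
            break
        done.append(name)
    return done


if __name__ == "__main__":
    # Placeholder actions; real ones would check replication, run tests, etc.
    steps: List[Step] = [
        ("replica caught up", lambda: True),
        ("web tier passes smoke tests", lambda: True),
        ("DNS changed", lambda: True),
        ("old site proxying to new site", lambda: True),
    ]
    for name in run_checklist(steps):
        print("ok:", name)
```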

Why did you choose the 2U Dell R710 servers instead of a 1U server, since it looks like you’re only using 2 drives in most of them?

First of all, you guys looked pretty closely at the pictures.

There were three main reasons why we stuck with the 2U Dell R710s:

  1. Flexibility. What is an application server today with a small amount of disk requirement may need to be repurposed down the road into a role that requires more capacity.
  2. Consistency. We use the Dell R710 chassis for absolutely everything in our infrastructure: database servers, application servers, proxy servers, everything. This makes configuration and spares much easier to manage since they’re all very similarly configured, with perhaps some changes in the memory, hard drive, and CPU configurations.
  3. They’re small enough. The limiting factor in most modern datacenters is much more likely to be power than it is to be physical space and the cost structure reflects that. The 2U form factor strikes a nice balance between ease of maintenance and space efficiency. There wouldn’t be a significant change in our hosting bill by going with 1U devices.

How do you manage redundancy and fault-tolerance?

Absolutely nothing goes into production unless it has a minimum of one other system to fail over to. This philosophy carries through from our network, to our databases, to our application servers, and so on.

To give just one example, let me talk about how our servers are connected to the network. At the top of each of our cabinets is a pair of 48-port Cisco 3750G switches in a stacked configuration. We run three main networks (VLANs) in our environment: a general purpose network, a dedicated storage network, and a network for our remote management systems.

On each server, we have 5 network interfaces in use. One interface is for the remote access card in the server, and the other 4 serve the general purpose and storage networks, two apiece. For each of those two networks we run one cable to each switch and bond the two ports into a single logical interface using 802.3ad link aggregation. This configuration ensures that we can lose any cable, network port, or switch without losing connectivity to a server.
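The claim that any single cable, port, or switch can fail without cutting off a server is easy to check mechanically. The Python sketch below is a simplified model with assumed names (`eth0`, `cable-1`, `switch-a`, and so on are hypothetical): each bonded network has two links, a link works only if none of its parts has failed, and a network stays up as long as one link works. It treats the server-side port as the "port" and ignores switch-side ports for brevity.

```python
from itertools import chain

# Each link is (server_port, cable, switch). All names are hypothetical.
BONDS = {
    "general": [("eth0", "cable-1", "switch-a"), ("eth1", "cable-2", "switch-b")],
    "storage": [("eth2", "cable-3", "switch-a"), ("eth3", "cable-4", "switch-b")],
}


def reachable(bonds, failed):
    """Map each network to True if at least one of its links has no failed part."""
    return {
        net: any(failed.isdisjoint(link) for link in links)
        for net, links in bonds.items()
    }


def single_points_of_failure(bonds):
    """Return every component whose failure alone takes some network down."""
    components = set(chain.from_iterable(chain.from_iterable(bonds.values())))
    return sorted(
        c for c in components if not all(reachable(bonds, {c}).values())
    )


if __name__ == "__main__":
    print(single_points_of_failure(BONDS))

    # Counter-example: both links of a bond land on the same switch.
    bad = {"general": [("eth0", "cable-1", "switch-a"), ("eth1", "cable-2", "switch-a")]}
    print(single_points_of_failure(bad))
```

The counter-example shows why the two cables of each bond go to different switches: plug both into the same one and that switch becomes a single point of failure.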

This same kind of thought process was repeated throughout our entire environment and played a big role in the reason we chose the Isilon storage systems. I won’t say that we don’t have any single points of failure, but we try to eliminate them as much as we possibly can. We make sure that we know about the ones we can’t eliminate and have plans to respond to failures.

By the way, the photo above and the rest of the photos in the series were taken by John Williams, one of the great sysadmins on my team. Who knew he was multitalented?!