Last year, we suffered a number of service outages due to network problems upstream. In the past 9 months we have diligently worked to install service from additional providers and expand both our redundancy and capacity. This week we turned up our third Internet provider, accomplishing our goals of circuit diversity, latency reduction and increased network capacity.
We now have service from Server Central / Nlayer Networks, Internap and Level 3 Communications. Our total network capacity is in excess of 1.5 gigabits per second, while our mean customer-facing bandwidth utilization is between 500 megabits and 1 gigabit per second. In addition, we’ve deployed two Cisco ASR 1001 routers which aggregate our circuits and allow us to announce our /24 netblock (our own IP address space) via each provider.
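If you’re curious how widely a prefix like this is seen from the outside, a public looking glass is an easy check. Below is a minimal sketch against RIPEstat’s routing-status endpoint; the prefix is a documentation placeholder rather than our actual netblock, and the script simply prints the raw response instead of assuming a particular schema.

    import json
    import urllib.request

    # Placeholder prefix (TEST-NET-1), standing in for a real /24.
    PREFIX = "192.0.2.0/24"
    URL = "https://stat.ripe.net/data/routing-status/data.json?resource=" + PREFIX

    # Ask RIPEstat how the prefix currently looks in the global BGP table and
    # print the raw payload rather than assuming a particular response layout.
    with urllib.request.urlopen(URL) as resp:
        payload = json.load(resp)
    print(json.dumps(payload.get("data", payload), indent=2))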
Keeping Basecamp, Highrise, Backpack, and Campfire available to you at all times is our top priority, and we’re always looking for ways to increase redundancy and service performance. This setup has already prevented at least 4 significant upstream network issues from impacting customers… which we can all agree is great!
Peter Cooper
on 21 Sep 11
What sort of connectivity do you have between the office and the datacenter? That can sometimes be neglected in companies, I’ve found, but can be of significance to developers and DBAs (unless all work and experiments occur on servers at the DC, of course!)
Taylor
on 21 Sep 11
Hey Peter,
Good point! We currently have a cable Internet circuit which has been pretty unreliable, causing lots of frustration for those working out of our Chicago office. In the coming weeks we’ll have a 20M dedicated Ethernet circuit online, which should bring us back to the happy days. We looked at getting a point-to-point circuit directly into the DC, but the cost and install times didn’t match our needs.
Jason Abate
on 21 Sep 11
@Taylor – who’d you go with for your office connectivity? We’re in a similar situation in Chicago, looking for something more reliable than Comcast business service.
Mirek Burn
on 21 Sep 11
How about internal link utilization? Do you have any load balancers (hw/sw)? Do you have any plans to spread your DC to other continents?
mirek from ONE small THING
Gary Bury
on 21 Sep 11
I’m no techie but I always struggle with the concept of Redundancy. It means “not needed”, “surplus to requirements”, “unused”.
I can see why you’d want masses of excess capacity, but redundancy, surely that’s a waste of resources?
Taylor
on 21 Sep 11
@Jason,
We selected US Signal. Unfortunately I cannot recommend them for service. They are weeks and weeks and weeks behind on delivering our circuit and they haven’t delivered on their promise of superior customer service. I’d recommend you contact Chris @ Avant ([email protected]) if you want help finding the best provider in your area of the city.
Taylor
on 21 Sep 11
@Mirek,
We run bonded 1G links internally. We are moving to 10G shortly. We use Fortigate appliances for load balancing. (The Fortigates are probably the most “finicky” part of our stack.) We also use Haproxy + Nginx on some Dell boxes for load balancing and SSL termination. We are working on building out a second site in the US (likely Atlanta). After we’ve finished the second US site we’ll work on stuff overseas as it makes sense.
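If the load balancing part is unfamiliar, here is a toy sketch of the idea, nothing like our actual Fortigate or Haproxy configs, just round-robin selection with a per-backend health flag:

    import itertools

    # Purely illustrative round-robin balancer with a health flag per backend.
    # Real load balancers (Fortigate, Haproxy, Nginx) do far more than this.
    class RoundRobinBalancer:
        def __init__(self, backends):
            self.health = {b: True for b in backends}
            self._ring = itertools.cycle(backends)

        def mark(self, backend, healthy):
            self.health[backend] = healthy

        def pick(self):
            # Walk the ring until a healthy backend turns up.
            for _ in range(len(self.health)):
                backend = next(self._ring)
                if self.health[backend]:
                    return backend
            raise RuntimeError("no healthy backends")

    lb = RoundRobinBalancer(["app1:80", "app2:80", "app3:80"])
    lb.mark("app2:80", False)               # pretend a health check failed
    print([lb.pick() for _ in range(4)])    # app2:80 is skipped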
Taylor
on 21 Sep 11
@Gary,
Previously we were “single homed” to one provider, although we had physically redundant links. Now we have multiple physical links, taking divergent physical paths, to multiple providers. So there’s physical and logical redundancy there.
As for the excess bandwidth, each link can handle 1G of traffic. We make a utilization commitment with each provider, which they then enable for that link. For instance, some links are committed at the full 1G, and others at 500M.
We have an excess of capacity because we need to be able to carry our full traffic on a secondary link. We also need to have an excess because we get traffic spurts and because the providers are notoriously slow to move when you need to increase your commit.
What it comes down to is we build in N+1 to the point where it meets our business objectives and cost objectives. At this point we feel like 3 providers with separate circuits is the right balance.
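To put rough numbers on the N+1 idea (the figures here are purely illustrative, not our actual commits):

    # Back-of-the-envelope N+1 check with made-up numbers: can the remaining
    # links still carry peak traffic if the biggest single link goes down?
    link_commit_mbps = [1000, 1000, 500]   # illustrative per-link commits
    peak_traffic_mbps = 800                # assumed peak customer traffic

    total = sum(link_commit_mbps)
    after_failure = total - max(link_commit_mbps)

    print("total capacity:", total, "Mbps")
    print("after losing the largest link:", after_failure, "Mbps")
    print("survives a single-link failure at peak:", after_failure >= peak_traffic_mbps)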
Hope this helps!
greg
on 21 Sep 11
This seems like a strange thing to worry about this way. Do you run your own data center, or does your DC just not handle upstream connections for you?
If you’re running your own, how do you justify the costs? In my experience, DCs only really work on a fairly large scale, because the fire suppression, security, monitoring, multiple UPSes, multiple generators, multiple A/C’s, and multiple multi-homed upstream connections are very expensive, and there’s a minimum cost to do it really well, regardless of how many servers. Even the mid-size datacenters these days have upwards of 30 Gbps capacity, so your 1.5, while I’m sure impressive and adequate for your operation, seems underwhelming if it’s for an entire DC.
I’ve only dealt with 3 different DCs, but they always handle the upstream connectivity. They give me a port on their network and some public IPs, and they own the routing infrastructure and upstream peering that goes to anywhere between 4 and 8 upstream providers. In many years of doing it, I’ve had temporary partial routing problems (which are fixed pretty quickly) but never a complete outage. More importantly, it’s their problem to fix routing and bandwidth problems, and I can concentrate on my core business.
How are you guys different, and why?
Taylor
on 21 Sep 11
@Greg,
We colocate with Server Central in the DF-CH1 facility. We trusted them to provide connectivity through their managed network which is ultimately a part of the Nlayer network (which they also own). Despite doing their best, we were affected by a number of scheduled events gone awry and unscheduled events that brought us to a point of needing additional connectivity. So in effect, we got pushed into it. (To be very clear, we have an excellent working relationship with Server Central and they’ve gone out of their way to help us despite these issues.)
Eventually I (we) decided we were big enough that it made sense to have physical and logical redundancy and complete control over our Internet presence. (It’s obvious that if our sites are offline because the Internet is unable to reach us, we are out of business.) So we turned up a circuit with Internap, which gave us access to their mix in Chicago and the awesome management they provide. In addition we turned up a circuit with Level 3 that goes to a separate pop in Chicago via a separate path. As you probably know, Level 3 is one of the best providers in the US, if not world, so that gives us an even better mix. Finally we moved (nearly entirely) off our existing IP space and on to our allocation from ARIN so we have full portability.
In terms of maintenance, it’s been very low to date. To give you an idea how stable things have been, I turned up the Level 3 circuit during business hours with full confidence everything would be fine. We’ve conducted power supply swaps and other maintenance events with zero down time as well. I’ll try to update in another 6 months if things change in this regard.
Nate Rosenberg
on 21 Sep 11
Taylor,
Thank you for opening the kimono.
You may have already thought of this, but something else to consider: router failover with Cisco’s Interchassis High Availability (IHA) feature. Multi-homing will help if you lose connectivity to Nlayer, InterNAP, or Level3; IHA will help if one of the ASRs fails.
Using the IHA feature, you can have your two Cisco ASR 1001s act as automatic backups for each other. Here’s Cisco’s documentation on Interchassis High Availability.
Also, as you plan your second data center, you’ll want to look at HSRP for failover.
If I can help more, let me know. I work at XO Communications. Like Level3, we have a great national IP backbone and have fiber at Server Central and many other data centers. I used to use Basecamp when I had my own website design company in college and would love to give something back to you guys.
Taylor
on 21 Sep 11
@Nate,
Thanks for chiming in!
We use VRRP to float an address between the devices. So far that’s worked very well.
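If it helps to picture it, the election boils down to something like this toy sketch, with made-up names and priorities rather than our actual setup:

    from dataclasses import dataclass

    # Toy illustration of the VRRP idea: routers share one virtual address and
    # the live router with the highest priority answers for it.
    @dataclass
    class Router:
        name: str
        priority: int          # 1-254 in real VRRP; higher wins
        alive: bool = True

    def current_master(routers):
        live = [r for r in routers if r.alive]
        return max(live, key=lambda r: r.priority) if live else None

    routers = [Router("asr-1", 200), Router("asr-2", 100)]
    print(current_master(routers).name)    # asr-1 holds the virtual address

    routers[0].alive = False               # simulate asr-1 going away
    print(current_master(routers).name)    # asr-2 takes over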
I didn’t know XO was in the building… There’s not a lot of carrier choice at this point, and getting the last mile stuff worked out was a challenge to say the least.
Ironically XO is the hold up with our office Internet circuit. Maybe you can help move that along? :) (taylor @ our domain if you want to email me.)
Glad to hear you’ve been a Basecamp customer!
Nate Rosenberg
on 21 Sep 11
@Taylor, just emailed you.
Nate Rosenberg
on 21 Sep 11
@Taylor,
VRRP is standards-based while HSRP is Cisco proprietary. Both are protocols for gateway redundancy, so you’re all set!
Tarakit
on 22 Sep 11
Network problems are always an unresolved issue, followed by hackers who never give up until they bog down your hosts. I’ve been knocked offline many times, along with my marketing partners, due to network and connectivity problems, and I still haven’t found a solution that is reliable enough.
AC
on 22 Sep 11
@37signals
This post seems extremely odd to me, especially given that your whole business is providing outsourced services (SaaS).
Internet connectivity, and redundancy, is the responsibility of your data center.
Why don’t you just re-locate to a data center that already provides redundant Internet circuits? (i.e. “outsource” this task to the proper owner, your data center)
Tim
on 22 Sep 11
As a Telco Cap Planner in Aus, I find this interesting (it’s exactly the type of thing I manage day to day).
The concept of redundancy is interesting. I find a better word to use is resiliency or resilience.
Re: the comment by AC: I second having control over your infrastructure, if you can staff the support needed and can afford the capex/opex from Cisco. It can get expensive with redundant power, UPS (support, capex), and fire suppression.
As someone who runs a few little online businesses, I find it a source of frustration when the hosts lose power/comms in a DC they also have no control over.
Paul Montwill
on 22 Sep 11
What are the top reasons behind upstreams?
Michael
on 22 Sep 11
Taylor, I don’t understand how that is enough bandwidth for you. People must be downloading thousands of files and pages across your apps simultaneously. Surely many more gigabits are required. What am I missing?
Taylor
on 22 Sep 11
@Paul
Can you clarify your question?
Taylor
on 22 Sep 11
@Michael,
I think you have underestimated how much 1 gigabit of bandwidth is. Further, you have to remember our customers are all over the world. There’s an ebb and flow to the traffic as different parts of the world are awake and asleep.
People are downloading thousands of files and pages; that is correct. It doesn’t require more than a couple hundred megabits, though. As a matter of fact we pushed 6 million files up to S3 a Saturday or two ago. Even that didn’t use too much bandwidth with almost 200 uploads running concurrently.
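To give a rough feel for the math (the file size and duration below are assumptions for illustration, not measured values):

    # Rough feel for how little bandwidth a big batch job needs. The average
    # file size and duration are assumptions for illustration, not measured.
    files = 6_000_000
    avg_file_kb = 100          # assumed average file size
    hours = 24                 # assumed time to push the whole batch

    total_bits = files * avg_file_kb * 1024 * 8
    avg_mbps = total_bits / (hours * 3600) / 1_000_000

    print("average throughput: about %.0f Mbps" % avg_mbps)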
carlivar
on 23 Sep 11
Surprised you went with Cisco routers. 37signals seems to match Juniper better in spirit.
Chris W.
on 27 Sep 11
Don’t sweat it guys, you are the best. I don’t think I could function without Basecamp.