Co-authors: Leigh Madock, Andrew Stracner, and Tim Crofts
If you follow this blog regularly, you’ll know that scaling up our infrastructure is always part of our design work, and simplification is one way we address the challenge. For a long time, we used a pair of lightweight load balancers for every “core,” or approximately a room’s worth of compute, to serve virtual IP addresses (VIPs) for critical infrastructure services (think: DNS, logging). They were resilient and standard, but as we moved toward an entirely spine-and-leaf topology, the layer 2 requirements for load balancer HA and VIP direct server return (DSR) didn’t conform to our new designs. We had to find a new way to support load balancing for these services that was resilient and scalable, both technically and economically.
A couple of years ago, we experimented with anycast for other internal uses, and as we will detail below, we were able to leverage the concept to solve our new load balancing challenge.
For many, the first and obvious option would be to centralize infrastructure services and beefier load balancers in a designated core. We considered this first, even extending the idea to two silos—a dedicated set of heavy-duty servers and switches—each with an LB pair for ultra-resiliency. Some services work best for us in a DSR load-balanced configuration, and we wanted to preserve that. This meant all the servers behind these VIPs would have to share this same silo and VLAN to satisfy the DSR requirement. While centralization served current requirements on the whiteboard, it didn’t have the elasticity we were looking for between our largest data centers and our smaller regionalized POPs.
We also considered using a routing protocol between switches and load balancers to maintain HA. BGP would let the LBs announce the VIP pool and SNAT pool prefixes, and a failure of a load balancer would remove a next-hop candidate from the switch. We didn’t go down this route (no pun intended) because of difficult code dependencies on the load balancers and vendor lock-in.
Anycast, the way we implemented it inside the datacenter, has a number of characteristics that make it very effective and scalable for us. First, as noted above, we use a DSR configuration for Syslog. Because traffic is drawn to a server from an upstream switch via routing announcements, no source NAT occurs; the server sees the original client source IP, which it can log or respond to directly. Next, since every switch in our datacenters that serves as a default gateway for servers is BGP capable, and the open-source BGP route server we chose can run on any of our hosts, we could use this method anywhere. Finally, in our recent datacenter network designs where BGP is used among all network elements in the datacenter, the anycast prefix is available to all locations within the datacenter.
The failure of a single host or switch doesn’t have an impact on service availability to more than the “closest” clients; even then, the interruption only lasts until convergence, which occurs within milliseconds. Since the route-server is free, we can turn up these services in large quantities, making the blast radius of a failure exceedingly small.
Each server running an infrastructure service such as DNS has the same IP address (per service) configured on its loopback interface. For example, the DNS “VIP” in this situation is 10.255.0.53. The host runs BGP to its upstream default gateway, a layer 3 switch, and announces 10.255.0.53/32. That layer 3 switch shares the route with its peers in the datacenter via BGP. The server runs a local health-check script against itself, and if the check fails, the route is withdrawn. At least four servers of this type are set up per service in the datacenter, so a single server failure simply means that client DNS traffic follows the shortest remaining “route” to DNS.
We chose the BIRD route-server because of its strength as a BGP route server. It is popular for BGP on many of the world’s Internet exchanges, has a large development community, is easily supported on our Linux-based servers, and has hooks available that make local scripting simple.
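To make the setup above concrete, here is a minimal sketch of what the server-side BIRD configuration could look like. This assumes BIRD 2.x syntax, that the VIP has already been assigned to the loopback (e.g., `ip addr add 10.255.0.53/32 dev lo`), and it uses hypothetical values throughout: the switch address 192.0.2.1, the private ASNs, the router ID, and the protocol names are all illustrative, not LinkedIn’s actual configuration.

```
# /etc/bird.conf -- illustrative sketch, BIRD 2.x syntax
router id 10.1.2.3;              # the server's unique unicast address (hypothetical)

protocol device { }

# Source the anycast host route from the loopback interface
protocol static anycast_routes {
  ipv4;
  route 10.255.0.53/32 via "lo";
}

# Announce only the anycast route to the upstream layer 3 switch
protocol bgp tor {
  local as 65001;                # server-side private ASN (hypothetical)
  neighbor 192.0.2.1 as 65000;   # upstream switch (hypothetical)
  password "shared-key";
  ipv4 {
    import none;                 # we only announce; the default route comes from DHCP/static
    export where proto = "anycast_routes";
  };
}
```

Disabling the `anycast_routes` protocol (for example, with `birdc disable anycast_routes`) removes the static route and causes BGP to withdraw the /32, which is the hook the health-check mechanism relies on.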
Quagga and ExaBGP were also considered. Quagga’s strength is in turning the server into a router, which wasn’t a fit for our requirements. ExaBGP looked promising, but there were concerns about the project’s maturity.
Service self-health checks were a key concern when we first approached the anycast option. Consider this: a host has a BGP session up to its layer 3 switch to advertise the /32 or /128 prefix for DNS. If the DNS service on the server fails, a traditional load balancer could detect this and mark that one member down. But with the BGP route sourced locally, the risk is that the route stays active, effectively black-holing traffic destined for the DNS server. So, in this case, the burden of health checks falls on the server itself. A benefit, however, is that you can write health checks as complex as any you could build on the best load balancers out there.
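The health-check loop described above can be sketched as follows. This is a minimal illustration, not LinkedIn’s actual script: it assumes the hypothetical BIRD protocol name `anycast_routes`, that `dig` and `birdc` are on the PATH, and that `example.com` stands in for whatever record the operator chooses to probe. The decision logic is separated from the probe so it can be reasoned about (and tested) in isolation.

```python
#!/usr/bin/env python3
"""Local anycast health check -- an illustrative sketch.

If the local DNS service stops answering, the script withdraws the
anycast /32 by disabling a BIRD protocol; when service recovers, it
re-enables it. All names here (protocol name, probe record) are
hypothetical assumptions, not the production values.
"""
import subprocess

BIRD_PROTOCOL = "anycast_routes"   # hypothetical BIRD static-protocol name


def dns_is_healthy(timeout: int = 2) -> bool:
    """Probe the local resolver; any failure or timeout counts as unhealthy."""
    try:
        result = subprocess.run(
            ["dig", f"+time={timeout}", "+tries=1",
             "@127.0.0.1", "example.com", "A"],
            capture_output=True, timeout=timeout + 1,
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def next_action(healthy: bool, announced: bool) -> str:
    """Decide what to do with the anycast route given current state."""
    if healthy and not announced:
        return "announce"   # e.g., birdc enable anycast_routes
    if not healthy and announced:
        return "withdraw"   # e.g., birdc disable anycast_routes
    return "noop"


def apply_action(action: str) -> None:
    """Flip the BIRD protocol so the /32 is announced or withdrawn."""
    if action == "announce":
        subprocess.run(["birdc", "enable", BIRD_PROTOCOL], check=False)
    elif action == "withdraw":
        subprocess.run(["birdc", "disable", BIRD_PROTOCOL], check=False)
```

A cron job or systemd timer would call `apply_action(next_action(dns_is_healthy(), currently_announced))` every few seconds; since the route is withdrawn at the source, a failed DNS daemon never black-holes traffic.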
The layer 3 switch that keeps a BGP session with the “real” server can be any switch that runs BGP and supports retaining multiple paths to a prefix. The prefix, of course, is the anycast address per service. The BGP configuration on the switch allows any downstream server with the right key to become a neighbor, but only /32 and /128 prefixes from within designated ranges are accepted. Because of the health checks on the server side, we don’t monitor or alert on changes in downstream BGP neighbors from the switches. If a cabinet has a host announcing an anycast address to its switch, the rest of the cabinet’s servers will use that local service. Since each layer 3 switch shares the BGP advertisement with upstream core and fabric switches, a cabinet that is not sourcing an anycast address will still have routes to the cabinets that are. This way, no matter which cabinets or how many are offering the anycast services, none will be more than three hops away for most services, and traffic flows will benefit from the Equal-Cost Multi-Path (ECMP) nature of the spine-and-leaf network.
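The switch-side acceptance policy described above (host routes only, and only from designated ranges) can be illustrated as a route filter. Exact syntax varies by switch vendor, so this is shown in BIRD filter syntax purely for readability; the 10.255.0.0/24 anycast range and the filter name are hypothetical.

```
# Illustrative import policy for the switch side of the session.
# Accept only /32 host routes carved from the designated anycast block;
# reject everything else a server might try to announce.
filter anycast_in {
  if net ~ [ 10.255.0.0/24{32,32} ] then accept;
  reject;
}
```

Constraining the accepted prefixes this way means a misconfigured server can, at worst, announce a service VIP it shouldn’t, but it can never hijack ordinary unicast routing in the fabric.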
The simplicity and efficiency of this anycast solution in our datacenter network fabric was extended to our remote Edge POP sites as well. The benefits of anycast apply there in two ways. First, the configuration between servers and switches in the Edge POP is exactly the same as in the datacenters—right down to the anycast VIP addresses used! This leverages the configuration and operational streamlining that anycast allows.
Second, by having anycast routing across the backbone, we can deploy one single server for DNS and Syslog in the remote site without compromising reliability. The layer 3 switch in the remote Edge POP site will learn two paths to the DNS anycast VIP—one through the local DNS server, and the other over the backbone. The local path will be preferred until it fails, in which case the remote site simply uses the path via the backbone to send traffic to the next closest site. This architecture helps us reduce the server footprint in our Edge POPs and enables rapid deployment of services from templates.
Anycast has proven itself as a sleek and effective solution across our datacenter and backbone networks. It has fulfilled our original goal of removing the sticky layer 2 requirement per core and has enabled DSR, even across layer 3 boundaries. Nearly as significant: internally, anycast has eliminated much of our need for traditional load balancing appliances and enabled rapid service turn-ups by cutting out deployment complexity and process. We used the working group model to organize network and systems engineers around the solution. The resulting replicable design solved our immediate problems and also contributes to Project Altair. Because of the success of implementing anycast for DNS and Syslog, other teams at LinkedIn are looking to the same technology for additional services.