Load Balancer Redundancy And Failover Mechanisms
You may have tried resolving the public DNS records of a public-facing cloud load balancer (e.g., an AWS ALB), or whatever sits behind its CNAME, only to find multiple instances answering. It is no surprise that cloud load balancers come with their own redundancy. So what mechanisms do they use to maintain a consistent system state (the health of backend servers, session data, and other configuration settings)?
Cloud load balancers utilize a combination of distributed consensus algorithms and heartbeating mechanisms to maintain system state and ensure high availability. When a primary load balancer node fails, the redundant system must agree on the new “leader” or active node without causing a split-brain scenario (where two nodes believe they are in charge).
Distributed Consensus Algorithms
To maintain a consistent state across redundant nodes, cloud systems often rely on protocols like Paxos or Raft. These algorithms ensure that all participating nodes in a cluster agree on a single version of the truth, such as which backend servers are healthy or which node is currently primary.
- Quorum-Based Voting: For a state change to be committed, a majority (quorum) of nodes must acknowledge the update.
- Log Replication: The state is maintained as a series of replicated logs. Every redundant node receives the same sequence of instructions, ensuring they reach the identical internal state.
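The two ideas above can be sketched together with a toy in-memory cluster. The `Node` class, the entry format, and the unanimous-ack run are illustrative assumptions, not any vendor's implementation:

```python
# Sketch: quorum-based commit plus log replication across a toy cluster.
# All names here are illustrative, not a real load balancer's API.

def has_quorum(acks: int, cluster_size: int) -> bool:
    """A state change commits only when a strict majority acknowledges it."""
    return acks > cluster_size // 2

class Node:
    """Each node replays the same replicated log, so all converge on one state."""
    def __init__(self):
        self.state = {}

    def apply(self, entry):
        key, value = entry
        self.state[key] = value

# Replicate the same sequence of health updates to three nodes.
log = [("backend-1", "healthy"), ("backend-2", "unhealthy")]
cluster = [Node() for _ in range(3)]
for entry in log:
    acks = len(cluster)  # in this toy run, every node acknowledges
    if has_quorum(acks, len(cluster)):
        for node in cluster:
            node.apply(entry)

# Every replica now holds an identical view of backend health.
assert all(n.state == cluster[0].state for n in cluster)
```

Note that a quorum is a strict majority: a 5-node cluster can commit with 3 acknowledgments and therefore tolerate 2 failed nodes.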
Health Checking and Heartbeating
Redundancy is physically managed through constant communication between nodes, typically referred to as a heartbeat.
- Failure Detection: If the secondary node stops receiving heartbeats from the primary node for longer than a defined timeout, it initiates a “leader election” to take over traffic.
- State Sharing: Beyond just “liveness,” these heartbeats often carry state information (session tables, connection mappings) so that the secondary node can resume traffic seamlessly without dropping existing user connections.
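A minimal sketch of this from the standby's point of view, assuming heartbeats piggyback session state; the timeout value and field names are illustrative:

```python
import time

class HeartbeatMonitor:
    """Runs on the standby: tracks primary liveness and keeps a warm
    copy of shared state for seamless takeover."""
    def __init__(self, timeout_s: float = 3.0):
        self.timeout_s = timeout_s
        self.last_seen = time.monotonic()
        self.shared_state = {}           # session tables carried on heartbeats

    def on_heartbeat(self, state: dict) -> None:
        self.last_seen = time.monotonic()
        self.shared_state.update(state)  # stay ready to resume connections

    def primary_alive(self, now: float) -> bool:
        # Missing heartbeats beyond the timeout would trigger an election.
        return (now - self.last_seen) < self.timeout_s

monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.on_heartbeat({"sess-42": "10.0.0.5"})
```

In a real deployment the timeout is tuned against network jitter: too short and transient packet loss triggers spurious failovers, too long and clients see an outage before the standby reacts.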
In many implementations, the “consistency” of the system state is also anchored by:
- Virtual IP (VIP) Failover: Using gratuitous ARP (Address Resolution Protocol) announcements or BGP (Border Gateway Protocol) route updates, the system can move the public-facing IP address from a failed node to a standby node almost instantaneously.
- Distributed Key-Value Stores: Many modern cloud balancers offload their state to highly available, externalized databases (like etcd or DynamoDB) to ensure that the configuration remains consistent even if multiple balancer nodes restart simultaneously.
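One way such an external store anchors consistency is by arbitrating leadership through an atomic compare-and-set. A minimal sketch, where the in-memory `KVStore` is a stand-in for a real client such as etcd's, and all names are assumptions:

```python
class KVStore:
    """Stand-in for an external HA store (e.g. etcd) offering atomic
    compare-and-set; a real client would talk to a remote cluster."""
    def __init__(self):
        self._data = {}

    def compare_and_set(self, key, expected, new) -> bool:
        # Atomic in a real store; only one concurrent caller can win.
        if self._data.get(key) == expected:
            self._data[key] = new
            return True
        return False

def try_acquire_leadership(store: KVStore, node_id: str) -> bool:
    # Succeeds only if no node currently holds the leader key.
    return store.compare_and_set("leader", None, node_id)

store = KVStore()
```

Because the winner is recorded outside the balancer nodes themselves, even a simultaneous restart of every node leaves an unambiguous record of who leads; real systems additionally attach a lease/TTL so a dead leader's claim expires.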
Active-Active vs. Active-Passive Redundancy
To ensure high availability and fault tolerance, load balancers should be designed and deployed with redundancy in mind. This means having multiple instances of load balancers that can take over if one fails. Redundancy can be achieved through several failover strategies:
- Active-passive configuration: In this setup, one load balancer (the active instance) handles all incoming traffic while the other (the passive instance) remains on standby. If the active load balancer fails, the passive instance takes over and starts processing requests. This configuration provides a simple and reliable failover mechanism but does not utilize the resources of the passive instance during normal operation.
- Active-active configuration: In this setup, multiple load balancer instances actively process incoming traffic simultaneously. Traffic is distributed among the instances using methods such as DNS load balancing or an additional load balancer layer. If one instance fails, the others continue to process traffic with minimal disruption. This configuration provides better resource utilization and increased fault tolerance compared to the active-passive setup.
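The active-passive promotion logic can be reduced to a small state machine. This is a toy model under assumed node names, not a production failover controller:

```python
class ActivePassivePair:
    """Toy active-passive pair: when the active instance fails,
    the standby promotes itself and takes over traffic."""
    def __init__(self):
        self.roles = {"lb-a": "active", "lb-b": "passive"}

    def active_node(self):
        return next((n for n, r in self.roles.items() if r == "active"), None)

    def handle_failure(self, failed: str) -> None:
        if self.roles.get(failed) != "active":
            return                            # a passive failure needs no promotion
        self.roles[failed] = "failed"
        for node, role in self.roles.items():
            if role == "passive":
                self.roles[node] = "active"   # standby takes over the VIP/traffic
                break

pair = ActivePassivePair()
pair.handle_failure("lb-a")
```

An active-active cluster replaces the single promotion step with redistribution: the remaining healthy nodes simply absorb the failed node's share of traffic.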
The mechanism for consistency varies slightly depending on the architecture:
| Mechanism | Active-Passive | Active-Active |
|---|---|---|
| Consistency Tool | Periodic state synchronization | Real-time distributed state locking |
| Failover Method | Standby node promotes itself | Traffic is redistributed among remaining nodes via BGP |
| Complexity | Lower; state is copied | Higher; requires a “Global Server Load Balancing” (GSLB) layer |




