Whitepaper
Resilient Multi-Site Infrastructure Design
Designing infrastructure that survives site loss: failure domains, active-active vs active-passive, replication and RPO/RTO math, and failover testing.
Abstract
Multi-site resilience is routinely purchased and rarely achieved. Organizations operate second sites, replication licenses, and DNS failover services, yet still take extended outages when a site fails, because the design was assembled from products rather than derived from failure domains and recovery objectives. This paper presents a design method for multi-site infrastructure: enumerate failure domains honestly, including the shared dependencies that quietly correlate them; select active-active or active-passive topology per workload based on data semantics rather than preference; derive replication mode from recovery point objectives using explicit distance, latency, and loss arithmetic; engineer failover routing with DNS and anycast while respecting their convergence characteristics; and institutionalize failover testing as a scheduled operating practice. The recurring argument is that resilience is a property of practiced systems, not of purchased redundancy.
Redundancy is bought; resilience is designed
Most organizations that operate two sites do not have multi-site resilience.
They have two sites.
The distinction shows up at the worst possible moment: the primary fails, and the secondary — replicated, licensed, audited — cannot actually carry production because capacity was never sized for it, a credential store lived only at the primary, or nobody had performed the promotion sequence outside a wiki page.
I have run infrastructure across owned datacenters, colos, and cloud regions for two decades, including twelve years of hosting operations where a site problem was my revenue problem, and the pattern is consistent: resilience failures are design and rehearsal failures, almost never product failures. This paper walks through the design decisions in the order they should be made.
Step 1 — Define failure domains honestly
A failure domain is the largest thing that fails as a unit. The design question is not “do we have two of everything” but “what single events take out multiple things at once.” Enumerate them explicitly:
- Physical: rack, row, room, building, campus, metro region. Two “independent” sites on the same flood plain or the same utility feed share a domain.
- Provider: a cloud region is one domain; so, more subtly, is a cloud provider’s control plane, a DNS provider, a certificate authority, an identity provider. If both sites authenticate against one IdP hosted at site A, site B is decorative.
- Logical: one Kubernetes control plane spanning sites, one SAN replicating corruption faithfully in both directions, one CI/CD system that deploys the same bad config everywhere in ninety seconds. Replication is not protection against logical failure; it is efficient distribution of it. Protection against logical failure is what backups are for, and a multi-site design never retires the backup tier.
- Human and organizational: one on-call engineer who knows the failover procedure is a failure domain with a vacation schedule. So is one change advisory board that must convene before anyone is allowed to fail over.
The deliverable from this step is a table: each dependency, its failure domain, and which sites share it. Every shared row is either accepted in writing or engineered away.
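The table is small enough to keep as data and check mechanically. The following is a minimal sketch, assuming a hypothetical inventory in which each dependency records the failure domain that hosts it and the sites that cannot run without it; every name and entry is illustrative rather than drawn from a real environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    failure_domain: str            # where the dependency actually lives
    used_by: frozenset             # sites that cannot run without it
    accepted_in_writing: bool = False


def shared_rows(deps):
    """Dependencies relied on by more than one site that nobody has signed off."""
    return [d for d in deps if len(d.used_by) > 1 and not d.accepted_in_writing]


# Hypothetical inventory for two sites, "A" and "B".
inventory = [
    Dependency("identity provider", "site A", frozenset({"A", "B"})),
    Dependency("DNS provider", "external provider", frozenset({"A", "B"}),
               accepted_in_writing=True),
    Dependency("utility feed A", "site A", frozenset({"A"})),
    Dependency("CI/CD system", "site A", frozenset({"A", "B"})),
]

open_rows = shared_rows(inventory)
for d in open_rows:
    print(f"{d.name}: hosted in {d.failure_domain}, shared by {sorted(d.used_by)}")
```

Rows that come back from `shared_rows` are the ones still waiting for either a signature or an engineering change.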
Step 2 — Choose the topology per workload
Active-active versus active-passive is not an organizational identity; it is a per-workload decision driven by data semantics.
| Property | Active-active | Active-passive |
|---|---|---|
| Capacity honesty | High — both sites carry load daily | Low — secondary rots unless tested |
| Failover complexity | Low for stateless; already serving | High — promotion sequence must run |
| Data layer requirement | Multi-site writes or partitionable data | Single-writer with replication |
| Consistency hazards | Write conflicts, split-brain | Replication lag at failover |
| Cost | ~2× capacity actually used | Standby capacity often undersized |
The pragmatic pattern:
- Stateless tiers active-active, always. Web, API, and compute tiers behind global load balancing at both sites, each sized to carry the full load at your defined degraded service level. Daily production traffic at both sites is what keeps capacity claims honest.
- Stateful tiers active-passive by default. Run active-active data only where the technology genuinely supports it — quorum-based systems (etcd, Cassandra-class stores, modern distributed SQL) across three failure domains, or data that shards cleanly by geography. A conventional relational database stretched across two sites with bidirectional replication is an outage with a delivery date.
- Two sites cannot arbitrate a partition. Quorum systems need a third location, even if it is only a lightweight witness. Without one, every inter-site link flap is a potential split-brain event.
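The arithmetic behind the third-location rule is ordinary majority voting. A minimal sketch, assuming one vote per member and a strict-majority quorum (the site names and vote weights are hypothetical):

```python
def majority(total_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def survivors_have_quorum(votes_per_site: dict, lost_site: str) -> bool:
    """True if the members outside lost_site can still form a majority."""
    total = sum(votes_per_site.values())
    remaining = total - votes_per_site[lost_site]
    return remaining >= majority(total)


two_sites = {"A": 1, "B": 1}
with_witness = {"A": 1, "B": 1, "witness": 1}
lopsided = {"A": 2, "B": 1}   # weighting votes only moves the problem

for layout in (two_sites, with_witness, lopsided):
    print({site: survivors_have_quorum(layout, site) for site in layout})
```

With two equal sites, neither side of a partition holds a majority, so the system either stops or is configured to guess. With a witness, losing any single location still leaves two of three votes. Giving one site extra weight helps only when the lighter site is the one that fails.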
Step 3 — Do the replication math
Recovery objectives are commitments, and replication mode either supports the commitment or it does not. The arithmetic is short enough that there is no excuse for skipping it.
RPO (recovery point objective) — data you can afford to lose:
- Synchronous replication: RPO = 0. Cost: every write waits one inter-site round trip. At roughly 1 ms RTT per 100 km of fiber path (in practice, after equipment and routing, often worse), a 100 km separation adds ~1–2 ms to every committed transaction; at 1,500 km it adds ~15–30 ms, which most OLTP workloads cannot absorb. Synchronous is therefore a metro-distance tool.
- Asynchronous replication: RPO = replication lag under load, not the brochure number. Measure lag at your peak write rate — month-end batch, Black Friday — because that is exactly when a failure loses the most. If measured lag peaks at 45 seconds and the business signed off on a 5-second RPO, the design is out of compliance regardless of vendor claims.
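Both bullets reduce to a few lines of arithmetic. The sketch below uses the rule of thumb from the text (about 1 ms of round trip per 100 km of fiber path); the overhead factor of 2 and the peak write rate of 200 per second are assumptions chosen for illustration, and the model assumes one inter-site round trip per commit.

```python
RTT_MS_PER_100_KM = 1.0   # rule of thumb from the text
OVERHEAD_FACTOR = 2.0     # assumption: equipment and routing can roughly double it


def sync_write_penalty_ms(fiber_path_km: float) -> tuple:
    """(best case, pessimistic case) latency added to each synchronous commit."""
    best = fiber_path_km / 100.0 * RTT_MS_PER_100_KM
    return best, best * OVERHEAD_FACTOR


def async_rpo_compliant(measured_peak_lag_s: float, committed_rpo_s: float) -> bool:
    """Asynchronous RPO is the lag measured at peak write rate, not the nominal figure."""
    return measured_peak_lag_s <= committed_rpo_s


def writes_at_risk(measured_peak_lag_s: float, peak_writes_per_s: float) -> float:
    """Approximate number of committed writes lost in a hard failover at peak."""
    return measured_peak_lag_s * peak_writes_per_s


print(sync_write_penalty_ms(100))      # metro separation
print(sync_write_penalty_ms(1500))     # regional separation
print(async_rpo_compliant(45, 5))      # the 45 s lag vs 5 s RPO case from the text
print(writes_at_risk(45, 200))         # hypothetical 200 writes/s at peak
```

Expressing lag as writes at risk is often what makes the gap legible to the business: 45 seconds sounds small until it is stated as a count of lost transactions.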
RTO (recovery time objective) — time until service resumes. It is a sum, and every term must be measured, not estimated:
RTO = detection + decision + promotion + reconfiguration + routing convergence + validation
Detection and human decision routinely dominate. A database that promotes in 90 seconds sits inside an RTO of an hour when detection takes ten minutes and the decision to fail over requires waking three people who have never rehearsed making it. If the business RTO is minutes, the decision must be automated, and automated failover demands quorum and fencing discipline, or it becomes the outage. The tradeoff is honest: fast RTO requires either standing capacity plus automation (expensive, complex) or accepting that “the failover decision” is a practiced human act with a defined time budget.
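A worked instance of the sum makes the point about dominance concrete. In this sketch the ten-minute detection and 90-second promotion come from the example above; the other four values are hypothetical placeholders standing in for measurements.

```python
# All values in seconds. Detection and promotion follow the example in the text;
# the remaining terms are assumed figures, not measurements.
measured_s = {
    "detection": 600,
    "decision": 2400,
    "promotion": 90,
    "reconfiguration": 180,
    "routing_convergence": 300,
    "validation": 300,
}

rto_s = sum(measured_s.values())
largest_term = max(measured_s, key=measured_s.get)
human_share = (measured_s["detection"] + measured_s["decision"]) / rto_s

print(f"RTO: {rto_s} s ({rto_s / 60:.1f} min)")
print(f"largest term: {largest_term}")
print(f"detection + decision share: {human_share:.0%}")
```

Under these assumed numbers the database promotion is about 2% of the total, which is why tuning the promotion alone rarely moves a missed RTO.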
A useful forcing exercise: write RPO and RTO per service tier into a table with the measured (not designed) values beside them, and take the red rows to the business for either budget or a relaxed objective. Silence on this table is a decision too — the wrong one.
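The forcing exercise can be expressed as a small check. This is a sketch under assumptions: the tier names and all figures are hypothetical, and a value that was never measured is treated as a miss, on the reasoning that an unmeasured objective is an unmet one until shown otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TierRow:
    tier: str
    committed_rpo_s: float
    measured_rpo_s: Optional[float]   # None means nobody has measured it
    committed_rto_s: float
    measured_rto_s: Optional[float]


def misses(measured: Optional[float], committed: float) -> bool:
    """A missing measurement counts as a miss."""
    return measured is None or measured > committed


def red_rows(rows):
    """Tiers whose measured RPO or RTO fails the committed value."""
    return [
        r.tier for r in rows
        if misses(r.measured_rpo_s, r.committed_rpo_s)
        or misses(r.measured_rto_s, r.committed_rto_s)
    ]


# Hypothetical service tiers and figures.
rows = [
    TierRow("payments", 5, 45, 900, 3870),
    TierRow("catalog", 300, 60, 3600, 1200),
    TierRow("reporting", 3600, None, 14400, None),
]

red = red_rows(rows)
print(red)
```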
Step 4 — Failover routing: DNS and anycast
Traffic has to follow the failover, and both mainstream mechanisms have sharp edges.
DNS-based failover is the workhorse: health-checked records steer clients to the surviving site. Its physics are governed by TTL and by resolvers that disrespect TTL. Keep TTLs at 60–120 seconds on failover records as the standing configuration; pre-lowering TTL “when you expect trouble” assumes trouble sends a calendar invite. Expect a convergence tail of minutes as long-tail resolvers and OS caches drain, and design applications to retry across it. The full pattern set, including split-horizon traps, is in DNS architecture for resilience. Above all, the DNS service itself must not live inside either site’s failure domain.
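A rough lower bound on DNS convergence, and a retry schedule long enough to span it, can be sketched as follows. The health-check interval, failure threshold, and backoff parameters are assumptions for illustration. The bound covers only resolvers that honor TTL; the long tail described above sits on top of it.

```python
def well_behaved_convergence_s(check_interval_s: float,
                               failures_to_trip: int,
                               ttl_s: float) -> float:
    """Time for a TTL-honoring resolver to move: detection plus one full TTL."""
    return check_interval_s * failures_to_trip + ttl_s


def retry_delays_s(window_s: float, base_s: float = 1.0, cap_s: float = 30.0) -> list:
    """Capped exponential backoff delays whose total covers window_s.

    Each retry should resolve the name again rather than reuse a cached address.
    """
    delays, elapsed, delay = [], 0.0, base_s
    while elapsed < window_s:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * 2, cap_s)
    return delays


# Hypothetical settings: 10 s checks, 3 consecutive failures to trip, 60 s TTL.
window = well_behaved_convergence_s(10, 3, 60)
schedule = retry_delays_s(window)
print(window, schedule, sum(schedule))
```

A client that gives up after three quick retries will report an outage for the whole convergence window even when the failover itself worked.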
Anycast with BGP converges in seconds and needs no client cooperation: both sites announce the same prefix, and withdrawing the announcement drains a site at routing speed. The costs are provider-independent address space and the operational maturity to run BGP deliberately, plus a design constraint: anycast suits stateless, short-lived connections best, because a route shift mid-session moves the client to a site that has never heard of its TCP connection. The mature pattern layers them: anycast (or a global load balancer that amounts to anycast someone else operates) for the edge, DNS failover for everything that cannot anycast, and global traffic manager (GTM) health checks that test real user transactions rather than ping.
One rule spans both mechanisms: failover routing must fail toward the healthy site, not toward whoever answers health checks fastest. Health checks that test only “port 443 open” will cheerfully steer the world toward a site whose application tier is up and whose database is a read-only promotion candidate.
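The difference between a shallow and a deep health check can be shown with a small model. This is a sketch, not a real probe: the three probe fields and the site names are hypothetical, and a production check would populate them by exercising a real user transaction end to end.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    port_open: bool           # TCP 443 accepts connections
    app_responds: bool        # application tier returns a valid response
    db_accepts_writes: bool   # database behind it is a writable primary


def shallow_healthy(p: ProbeResult) -> bool:
    return p.port_open


def deep_healthy(p: ProbeResult) -> bool:
    return p.port_open and p.app_responds and p.db_accepts_writes


def eligible_sites(probes: dict, check) -> list:
    """Sites that routing may steer traffic to under the given check."""
    return sorted(site for site, p in probes.items() if check(p))


# Site B is up at the application tier, but its database is still read-only.
probes = {
    "A": ProbeResult(True, True, True),
    "B": ProbeResult(True, True, False),
}

print(eligible_sites(probes, shallow_healthy))
print(eligible_sites(probes, deep_healthy))
```

The shallow check admits both sites; only the deep check keeps traffic away from the site that cannot yet accept writes.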
Step 5 — The testing discipline
Every mechanism above decays silently. Capacity drifts as the primary grows and the secondary does not. Promotion runbooks reference retired hostnames. The new microservice hardcodes the primary’s IP. Testing is the only force that opposes this rot, and it has to be scheduled, not aspirational:
- Quarterly component failovers. Promote the database. Cut DNS over. Withdraw one site’s anycast announcement for an hour. Each in a maintenance window, each with a timed runbook, each ending with measured RPO/RTO recorded next to the committed numbers.
- Annual full site failover. Run production from the secondary for at least a full business cycle — long enough for the batch jobs, the certificate renewals, and the third-party allowlists to reveal themselves. The first one will be ugly; that ugliness is the finding.
- Fail back deliberately. Failback is a second failover and fails for its own reasons — resynchronization direction, conflict handling, and the temptation to rush it. Rehearse it with the same rigor.
- Write down what broke. Each test feeds the failure-domain table from Step 1 and the RPO/RTO table from Step 3. A test that produces no findings and no updated documents was a demo, not a test.
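Keeping the schedule honest is easier when overdue tests are computed rather than remembered. A minimal sketch, assuming a quarterly cadence of 92 days and an annual cadence of 366 days; the test names and dates are hypothetical, and a test that has never been run counts as overdue.

```python
from datetime import date, timedelta

# Assumed cadences: quarterly for component tests, annual for a full site failover.
CADENCE = {
    "component": timedelta(days=92),
    "full_site": timedelta(days=366),
}


def overdue_tests(last_run: dict, today: date) -> list:
    """Names of tests never run, or last run longer ago than their cadence allows."""
    late = []
    for name, (kind, ran_on) in last_run.items():
        if ran_on is None or today - ran_on > CADENCE[kind]:
            late.append(name)
    return sorted(late)


# Hypothetical test history.
history = {
    "database promotion": ("component", date(2024, 1, 15)),
    "DNS cutover": ("component", date(2024, 5, 2)),
    "anycast withdrawal": ("component", None),
    "full site failover": ("full_site", date(2023, 3, 10)),
}

late = overdue_tests(history, today=date(2024, 6, 1))
print(late)
```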
The uncomfortable summary: an organization’s real disaster recovery capability is the last failover it actually performed, under conditions it did not fully control, measured honestly. Everything else — the second site, the replication licenses, the laminated runbook — is potential energy. Design the failure domains, do the arithmetic, wire the routing, and then practice until the failover is dull enough that nobody’s heart rate changes.
Boring is the goal state. It always was — the sites and the software will keep changing, but the principle that resilience lives in rehearsal rather than in inventory has outlasted every platform I have run it on.
Frequently asked questions
- Should I run active-active or active-passive?
  - Decide per workload, not per organization. Stateless tiers should almost always run active-active because it makes capacity honest and failover trivial. Stateful tiers earn active-active only if the data layer genuinely supports multi-site writes; otherwise active-passive with well-rehearsed promotion is more reliable than a half-true active-active design.
- What is the practical difference between synchronous and asynchronous replication?
  - Synchronous replication holds every write until the remote site acknowledges it, giving zero data loss (RPO zero) but adding round-trip latency to each transaction, which practically limits site separation to metro distances. Asynchronous replication acknowledges locally and ships changes behind, tolerating any distance but losing the replication lag (seconds to minutes of data) in a hard failover.
- Why did my failover fail even though replication was healthy?
  - Because data replication is only one dependency of a working failover. The common misses are undersized capacity at the surviving site, credentials or certificates absent from the secondary, hardcoded IPs and unreplicated DNS state, and shared dependencies like an identity provider or license server that lived only at the failed site. Only a full failover test finds these.
- How often should we actually test site failover?
  - Component failovers (a database promotion, a DNS cutover): quarterly. A full site failover, running production from the secondary for a meaningful period: at least annually. If the business refuses to authorize a real failover test, it has decided the failover does not work; the test you cannot run in daylight will not run at 3 a.m. either.