PAVEL GLUKHIKH


DNS architecture for resilience

DNS architecture that fails gracefully: authoritative/recursive separation, anycast, TTL strategy, DNSSEC tradeoffs, and the failure patterns behind outages.


Executive summary

DNS architecture is the design of the resolution path (authoritative servers, recursive resolvers, caches, and the records themselves) so that name resolution keeps working when infrastructure around it fails. “It’s always DNS” is a joke with an engineering explanation: DNS sits upstream of every other dependency, fails in cache-delayed and partial ways, and is usually the least-redundant critical system in the building. This article covers separating authoritative from recursive service, when anycast and secondary providers earn their cost, TTLs as a deliberate risk dial, and what DNSSEC actually buys and costs.

Why it’s always DNS

“It’s always DNS” is funny because it keeps being true, and it keeps being true for structural reasons, not because DNS is badly built. Three properties conspire:

  1. DNS is upstream of everything. Every connection starts with a lookup, so a DNS fault presents as a fault in whatever was being looked up — the app is down, email is broken, the VPN won’t connect. You investigate the symptom’s layer first and find DNS last.
  2. Caching makes failures partial and time-shifted. When authoritative service breaks, nothing happens immediately. Resolvers keep serving cached answers until TTLs expire, so the outage arrives in waves, per record, per resolver — the least diagnosable failure shape there is.
  3. It is critical infrastructure run as a side duty. Organizations that maintain redundant everything routinely run DNS on a domain controller pair nobody has patched deliberately, with zone data whose only backup is the servers themselves.

DNS is not fragile. It is neglected.

The fix is to treat it as a tier-one system with an actual architecture. That architecture has four decisions in it.

Decision one: separate authoritative from recursive

Authoritative servers hold your zones and answer the internet’s questions about your names. Recursive resolvers chase referrals across the internet on behalf of your users and cache the results. These are different jobs with opposite exposure: authoritative must be reachable by the world; recursive must never be.

Run them on separate software instances, separate addresses, and ideally separate networks. NIST SP 800-81r2 has recommended this split for years, and the reasons have not aged: an open recursive resolver is a DDoS amplification tool and a cache-poisoning target, and a combined server blurs the answer to “who should be able to query this?” — the question every DNS firewall rule hangs on.
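
The “who should be able to query this?” question can be written down as a tiny policy function. This is a minimal sketch, not a firewall implementation: the role names, the `may_query` helper, and the internal address ranges are all assumptions made for illustration.

```python
import ipaddress

# Hypothetical internal ranges; substitute your own address plan.
INTERNAL_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def may_query(role: str, client_ip: str) -> bool:
    """Return True if a server in this role should answer this client."""
    addr = ipaddress.ip_address(client_ip)
    internal = any(addr in net for net in INTERNAL_NETS)
    if role == "authoritative":
        # Authoritative service must be reachable by the world.
        return True
    if role == "recursive":
        # Recursive service answers internal clients only.
        return internal
    raise ValueError(f"unknown role: {role}")
```

A combined server has no single answer for a given client address, which is exactly the ambiguity the split removes.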

The same logic extends inside the enterprise. Keep a hidden primary — the server where zone edits happen — unreachable from the internet entirely, and publish through secondaries that transfer from it (TSIG-signed, NOTIFY enabled). Your editable copy of the zone and your attack surface should never be the same box.

Decision two: diversity and anycast for the authoritative set

RFC 2182 is old and still the best short document on this: secondaries must be diverse in every dimension that fails together. Two authoritative servers in the same rack share a power failure. Two in the same AS share a routing incident. Two running the same software share every CVE.
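
One way to audit that rule is to list every dimension two servers have in common. The sketch below assumes a hand-maintained inventory; the field names, server names, and AS numbers are hypothetical.

```python
from collections import defaultdict

# Dimensions that can fail together (an assumed, non-exhaustive list).
DIMENSIONS = ("site", "asn", "software")


def shared_failure_domains(servers):
    """Map (dimension, value) to the servers sharing it, when 2+ share it."""
    groups = defaultdict(list)
    for server in servers:
        for dim in DIMENSIONS:
            groups[(dim, server[dim])].append(server["name"])
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical NS set: two servers share an AS and a software stack.
ns_set = [
    {"name": "ns1", "site": "dc-east", "asn": 64500, "software": "bind"},
    {"name": "ns2", "site": "dc-west", "asn": 64500, "software": "bind"},
    {"name": "ns3", "site": "cloud-a", "asn": 64501, "software": "nsd"},
]
overlaps = shared_failure_domains(ns_set)
```

Here `overlaps` flags the shared AS and the shared software for ns1 and ns2, while ns3 shares nothing with either.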

Practical tiers, in ascending order of resilience:

  • Minimum: two authoritative servers in different sites and different ASes. This is what RFC 2182 asks for in spirit; many enterprises still fail it.
  • Better: primary DNS provider plus an independent secondary provider transferring the same zones. Provider diversity is what saves you when a major DNS operator has a bad day — and the large DDoS events of the past decade taught exactly this lesson to everyone who had all their NS records in one provider.
  • The anycast question: anycast — one prefix announced from many sites, with BGP steering each client to the nearest instance — is how every serious DNS platform achieves both performance and DDoS absorption. Buy it, don’t build it: below ISP scale, running your own anycast authoritative fleet is an expensive hobby. As an operator I run anycast for services where it pays; for a typical enterprise’s zones, a provider’s anycast network at commodity prices is the right answer.

Internally, anycast recursive resolvers (the same resolver IP announced from each site via OSPF/BGP with health-checked routes) give every branch a local resolver with automatic failover — one of the few places where DIY anycast is genuinely easy.
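
The “health-checked routes” part is the piece that makes this safe: a site should withdraw the resolver prefix when its local resolver stops answering, and re-announce only once it is stable again. A minimal sketch of that decision logic follows. The thresholds are assumptions, and the actual probe and the route injection (normally done by a routing daemon) are out of scope.

```python
def route_decisions(probe_results, fail_after=3, recover_after=2):
    """Yield True (announce) or False (withdraw) after each probe result.

    probe_results: iterable of booleans, True when the local resolver
    answered the health-check query.
    """
    announced = True
    fails = 0
    oks = 0
    for ok in probe_results:
        if ok:
            oks += 1
            fails = 0
        else:
            fails += 1
            oks = 0
        if announced and fails >= fail_after:
            announced = False   # withdraw: clients fail over to another site
        elif not announced and oks >= recover_after:
            announced = True    # re-announce only after sustained recovery
        yield announced
```

Requiring several consecutive failures before withdrawing, and several successes before re-announcing, keeps a single lost probe from flapping the route.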

Decision three: TTLs are a risk dial, not a default

TTL sets how long the world may cache an answer, which makes it a dial between agility and outage-resistance:

TTL | You gain | You pay
Long (1–24 h) | Caches ride out authoritative outages; low query load | Mistakes and stale records persist for hours; slow failover
Short (30–300 s) | Fast failover and changes | Higher query volume; an authoritative outage bites within minutes

The mature pattern is TTL by record role: long on stable records (NS, MX, infrastructure A records), short only on records that participate in failover or frequent change. Before any planned migration, lower the relevant TTL at least one full period of the old TTL in advance, so that every cache has already dropped its long-lived copy when you make the change. Then make the change, verify, and restore the original TTL.
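
The timing for a planned migration is simple arithmetic, sketched below. The function name and the example date are hypothetical, and the result assumes caches honor the TTLs they were given.

```python
from datetime import datetime, timedelta


def migration_schedule(change_at, old_ttl, short_ttl):
    """Return the key deadlines around a planned record change.

    old_ttl and short_ttl are in seconds.
    """
    return {
        # Latest moment to publish the short TTL so no cache still
        # holds a copy with the old, long TTL at change time.
        "lower_ttl_by": change_at - timedelta(seconds=old_ttl),
        "change_at": change_at,
        # After this, TTL-honoring caches no longer serve the old answer.
        "old_answer_expired_by": change_at + timedelta(seconds=short_ttl),
    }


# Hypothetical change window: 24 h old TTL, 300 s temporary TTL.
plan = migration_schedule(datetime(2025, 6, 10, 22, 0), 86400, 300)
```

With a 24-hour TTL, the short TTL has to be live a full day before the window, which is the step most often skipped.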

Two caches people forget: negative caching (RFC 2308) means a lookup that returned NXDOMAIN stays failed for the SOA-derived negative TTL, so “we created the record but it still doesn’t resolve” is often a cached NXDOMAIN, not a propagation delay. And resolvers, browsers, and application runtimes each keep their own caches with their own rules; the TTL you set is a request, not a guarantee, and Java applications whose JVM is configured to cache lookups indefinitely, ignoring TTL entirely, are a recurring war story for a reason.
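
The “SOA-derived negative TTL” is, per RFC 2308, the smaller of the SOA record’s own TTL and its MINIMUM field. A minimal sketch of the resulting arithmetic follows; the helper names are hypothetical, and real resolvers may additionally cap negative TTLs with their own limits.

```python
def negative_cache_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """Negative-caching TTL in seconds, per RFC 2308."""
    return min(soa_ttl, soa_minimum)


def nxdomain_expires_at(cached_at: int, soa_ttl: int, soa_minimum: int) -> int:
    """Epoch second at which one resolver's cached NXDOMAIN expires."""
    return cached_at + negative_cache_ttl(soa_ttl, soa_minimum)
```

If someone queried the name shortly before the record was created, that resolver keeps answering NXDOMAIN until this time, regardless of the TTL on the new record.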

Decision four: DNSSEC with eyes open

DNSSEC (RFC 4033 and friends) signs zone data so validating resolvers can detect forged answers. It is the only real defense against off-path cache poisoning, and if your names anchor anything high-value — payment flows, software update endpoints, DANE, or CAA records you actually rely on — signing is worth it. It is also load-bearing for ACME DNS-01 integrity, which matters if you issue wildcard certificates via DNS-01.

The costs are equally concrete:

  • Operational fragility. Signatures expire. A zone whose re-signing automation quietly stops becomes bogus, and validating resolvers drop it — a self-inflicted outage invisible from any non-validating vantage point. This exact failure has taken large, sophisticated domains offline.
  • Key ceremony. KSK rollovers must be coordinated with the DS record at your registrar; algorithm rollovers even more so.
  • Bigger responses, which resurrect UDP fragmentation and amplification concerns your provider needs to have engineered around.

My rule: sign if, and only if, signing is automated end-to-end — either by a DNS provider that manages keys and rollovers, or by tooling (BIND’s dnssec-policy, Knot, PowerDNS) you monitor for signature expiry the same way you monitor certificate expiry. Manual DNSSEC is an outage on a timer. And deploy validation on your recursives first; it is the cheap half of the benefit.
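
Monitoring signature expiry can look much like a certificate-expiry check. The sketch below assumes you have already fetched an RRSIG and extracted its expiration field in the `YYYYMMDDHHmmSS` presentation form; the status labels and the seven-day warning threshold are assumptions.

```python
from datetime import datetime, timezone


def rrsig_seconds_left(expiration: str, now: datetime) -> float:
    """Seconds until an RRSIG expires; expiration is YYYYMMDDHHmmSS in UTC."""
    expires = datetime.strptime(expiration, "%Y%m%d%H%M%S")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).total_seconds()


def expiry_status(expiration: str, now: datetime, warn_days: int = 7) -> str:
    """Classify a signature as 'ok', 'warn', or 'expired'."""
    left = rrsig_seconds_left(expiration, now)
    if left <= 0:
        return "expired"   # validating resolvers will reject this data
    if left < warn_days * 86400:
        return "warn"      # re-signing automation may have stalled
    return "ok"
```

The warning threshold should be shorter than your normal re-signing interval leaves on a fresh signature; otherwise the check warns permanently.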

The war patterns, so you recognize them early

Every experienced operator’s DNS incidents rhyme. The recurring shapes:

  • The expiry class: domain registration lapses, DNSSEC signatures expire, a secondary silently stops transferring and serves stale data until its SOA expiry hits. All preventable with monitoring that checks expiry dates, serial agreement across the NS set, and validation status.
  • The hidden-dependency class: the resolvers your servers use live on a VM platform whose management plane needs DNS; a ransomware event encrypts the domain controllers that were also the only internal resolvers. Map what resolves your critical names and break the loops.
  • The cache-delay class: a bad change looks fine (old answers still cached), then fails in waves as TTLs expire — hours after the change window closed and the person who made it went home. Correlate by change time plus TTL, not by symptom time.
  • The split-horizon class: internal and external views of the same zone drift, and a name works in the office but not from the VPN. If you run split-horizon DNS, the views need a diff in CI, not an annual audit.
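
For the cache-delay class, “change time plus TTL” defines a window in which failures from a bad change start to surface. A rough sketch of that correlation follows, assuming a change log with epoch timestamps and the TTL each record had before the change. The tuple layout and names are hypothetical.

```python
def suspect_changes(changes, symptom_first_seen):
    """Return names of changes whose failure window covers the symptom.

    changes: iterable of (name, changed_at, previous_ttl), with times in
    epoch seconds. Cached old answers expire somewhere between changed_at
    and changed_at + previous_ttl, so first symptoms land in that window.
    """
    return [
        name
        for name, changed_at, previous_ttl in changes
        if changed_at <= symptom_first_seen <= changed_at + previous_ttl
    ]


# Hypothetical change log: two edits made at the same time.
change_log = [("www", 1000, 3600), ("mail", 1000, 300)]
suspects = suspect_changes(change_log, 2500)
```

A symptom first seen at 2500 points at the `www` change: the `mail` record's old answers had all expired by 1300, so a failure there would have shown up earlier.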

Monitoring that would catch all four costs almost nothing: query each NS directly (bypassing caches) for a canary record, compare serials, check DNSSEC chain validity, and alert on registration and DS mismatches. When a lookup fails anyway, work the resolution path hop by hop — the same discipline as any fault, covered in network troubleshooting methodology.
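
The serial comparison is the simplest of these checks. The sketch below assumes the SOA serials were already collected by querying each NS directly; the server names and serial values are hypothetical, and plain `max` ignores serial-number wraparound (RFC 1982), which a production check would need to handle.

```python
def lagging_servers(serials):
    """Return {server: serial} for servers behind the newest serial seen.

    serials: dict mapping NS name to the SOA serial it returned when
    queried directly, bypassing caches.
    """
    newest = max(serials.values())
    return {ns: serial for ns, serial in serials.items() if serial != newest}


# Hypothetical observation: ns3 has silently stopped transferring.
observed = {"ns1": 2025061001, "ns2": 2025061001, "ns3": 2025060903}
lagging = lagging_servers(observed)
```

Alert only when a server stays behind for longer than a normal transfer takes, since a brief lag right after an edit is expected.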

DNS stops being “always DNS” the day it has the same redundancy, monitoring, and change discipline as the systems that depend on it. Which is to say: it earns the joke only when it is run as an afterthought.

Frequently asked questions

Why should authoritative and recursive DNS be separated?
They have opposite trust models and failure profiles. Authoritative servers answer the world about your zones; recursives answer your users about the world and must never be open to the internet. Combining them couples their failures, complicates caching behavior, and historically enabled cache-poisoning and open-resolver amplification attacks. Separate software, addresses, and ideally networks.
What is anycast DNS and do I need it?
Anycast announces the same IP prefix from many locations, so BGP routes each client to the nearest instance and a site failure reroutes traffic automatically. For authoritative DNS facing the internet, you get it by using providers who run anycast — building your own rarely makes sense below ISP scale. Internally, anycast for recursive resolvers is achievable and worthwhile in multi-site enterprises.
What TTL should I use on DNS records?
Match TTL to how fast you need to move. Stable infrastructure records tolerate 1–24 hours; records involved in failover need 60–300 seconds. Long TTLs mean caches ride out authoritative outages but pin mistakes in place; short TTLs enable fast changes but raise query load and shrink your outage buffer. Lower TTLs ahead of planned changes, then raise them back.
Is DNSSEC worth deploying?
Sign your zones if spoofed answers for your names would cause real damage — it is the only mechanism that lets validating resolvers detect forged responses. But go in clear-eyed: signing mistakes and expired signatures make your domain vanish for validating resolvers, and operational failures from DNSSEC are today more common than the attacks it prevents. Automate signing or use a provider that does.
