PAVEL GLUKHIKH

Networking

Enterprise network design fundamentals that age well

Enterprise network design from the failure modes up: hierarchy, L2/L3 boundaries, HSRP/VRRP redundancy, and capacity planning that survives growth.


Executive summary

Enterprise network design is the discipline of arranging switching, routing, and redundancy so a network stays operable and diagnosable as it grows. The core decisions are few: how many hierarchy tiers you need, where Layer 2 ends and Layer 3 begins, which redundancy model protects each tier, and how much capacity headroom you buy. Each choice purchases a specific set of failure modes. This article walks through those decisions the way I make them in practice — starting from what breaks, not from what the reference slide shows.

Design for the failure you can afford

Every network design decision buys you a failure mode.

A stacked switch pair buys you a single control plane that can take both “redundant” units down with one bad firmware image. A stretched VLAN buys you a spanning-tree domain that can melt two buildings at once. A single-tier flat network buys you a broadcast domain where one misbehaving NIC becomes everyone’s problem.

Good enterprise network design is not about eliminating failures — it is about choosing which failures you are willing to have, making them small, and making them diagnosable at 3 a.m. That framing, more than any reference architecture, is what I carry from years of running networks in hosting, a petrochemical plant, and the connectivity business I operate today.

Hierarchy: three tiers, two tiers, and when each is right

The classic model has three layers, each with one job:

  • Access — where hosts plug in. Port security, PoE, edge QoS marking.
  • Distribution — aggregates access closets, terminates VLANs, applies policy, summarizes routes.
  • Core — moves packets between distribution blocks as fast as possible. No policy, no ACL sprawl, nothing that slows convergence or troubleshooting.

The value is containment. You can rewire an access closet without touching core routing. You can replace the core without re-addressing a single host. When a layer has one job, the failure of that layer has one shape.

Most enterprises under a few thousand users do not need three physical tiers. A collapsed core — one redundant switch pair doing distribution and core duty — is the right answer for a single building or small campus. You graduate to a dedicated core when you have three or more distribution blocks and the collapsed pair becomes either a port-count problem or a fault domain you can no longer stomach. In data centers, the same instinct now expresses itself as spine-leaf: every leaf is two hops from every other leaf, capacity scales by adding spines, and there is no spanning tree in the fabric because every link is routed (or VXLAN-overlaid per RFC 7348).

The anti-pattern I see most is the “grown, not designed” network: a daisy chain of switches added one at a time, where the hierarchy exists only in the Visio file. You find it by looking at spanning tree — if the root bridge is a closet switch someone installed in 2016, the topology is telling you the truth and the diagram is not.

The L2/L3 boundary is the most important line you’ll draw

Where Layer 2 ends and Layer 3 begins determines your blast radius. A VLAN is a failure domain: broadcast storms, unknown-unicast flooding, spanning-tree reconvergence, and MAC table exhaustion all travel to the VLAN’s edge and stop there. So the design question is simply: how far do you want those problems to travel?

My default is routed access — the L3 boundary at the access switch, every uplink a routed point-to-point link, one VLAN never leaving one closet. Spanning tree still runs, but it protects a single switch’s ports instead of a building. Convergence becomes a routing protocol problem (ECMP over two uplinks, sub-second failover with tuned timers) instead of an STP problem.
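
To make the routed-access layout concrete, here is a minimal addressing sketch in Python. It is an illustration, not a prescribed plan: the closet names, the 10.20.0.0/16 and 10.255.0.0/24 blocks, the /24 per closet, and the helper name `plan_routed_access` are all assumptions of mine, and it presumes the platform supports /31 point-to-point links (RFC 3021).

```python
import ipaddress


def plan_routed_access(closets, user_block, p2p_block, uplinks_per_closet=2):
    """Give each closet its own user subnet and one /31 per routed uplink.

    Hypothetical helper: block sizes and names are illustrative only.
    """
    # Each closet gets a dedicated /24, so no user VLAN spans two closets.
    user_subnets = ipaddress.ip_network(user_block).subnets(new_prefix=24)
    # Each routed uplink gets its own /31 point-to-point subnet.
    p2p_links = ipaddress.ip_network(p2p_block).subnets(new_prefix=31)

    plan = {}
    for closet in closets:
        plan[closet] = {
            "user_subnet": next(user_subnets),
            "uplinks": [next(p2p_links) for _ in range(uplinks_per_closet)],
        }
    return plan


plan = plan_routed_access(
    ["closet-a", "closet-b"], "10.20.0.0/16", "10.255.0.0/24"
)
for closet, entry in plan.items():
    uplinks = ", ".join(str(link) for link in entry["uplinks"])
    print(f"{closet}: users {entry['user_subnet']}, uplinks {uplinks}")
```

Because every closet's user space comes out of one contiguous block, the distribution layer can advertise a single summary for it, which is part of what keeps the failure local.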

You give up VLAN mobility. Anything that demands the same subnet in two closets — some legacy clustering, certain Wi-Fi designs that anchor roaming at the controller anyway, flat industrial segments that cannot be re-addressed — forces L2 up to distribution. That is a legitimate trade, but make it consciously and per-VLAN, not as a default posture. In the plant networks I supported, the OT side was flat by necessity; the answer was to keep that flatness contained and firewalled, which is a segmentation problem as much as a design one.

Where L2 does span, run it defensively: Rapid-PVST or MST with the root bridge pinned at distribution, root guard on downlinks, BPDU guard on every access port, storm control, and no VLAN 1 for anything.

Redundancy models and what each one actually protects

First-hop redundancy is the piece everyone gets wrong by default. Hosts point at one gateway IP; HSRP (Cisco) or VRRP (RFC 5798, everyone else) float that IP between two routers. The protocol matters less than the discipline around it:

  • Set priorities explicitly and document which box is primary. Two routers at default priority means the tiebreak is an interface address, and your traffic pattern is an accident.
  • Enable preemption with a delay (60–120 seconds), so a rebooting primary does not reclaim the VIP before its routing table converges. HSRP does not preempt by default; VRRP does. Audits of gateway pairs turn up this mismatch constantly.
  • Track upstream interfaces or routes, so the gateway fails over when the path behind it dies, not just when the box does.
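
The three checks above can be sketched as a small audit over a gateway pair. This is a hedged illustration: the `GatewayConfig` fields and the `audit_pair` helper are hypothetical names of mine, not any vendor's data model, and the settings would have to be pulled from your own config source. The one fact it leans on is that both HSRP and VRRP default to priority 100.

```python
from dataclasses import dataclass, field

DEFAULT_PRIORITY = 100  # default priority in both HSRP and VRRP


@dataclass
class GatewayConfig:
    """Hypothetical summary of one router's FHRP settings for a group."""
    name: str
    priority: int = DEFAULT_PRIORITY
    preempt: bool = False
    preempt_delay_s: int = 0
    tracked_objects: list = field(default_factory=list)


def audit_pair(a, b, min_delay_s=60):
    """Return a list of findings for one FHRP pair; empty means clean."""
    findings = []
    if a.priority == b.priority:
        findings.append(
            "equal priorities: the interface address decides the active router"
        )
    else:
        primary = a if a.priority > b.priority else b
        if not primary.preempt:
            findings.append(f"{primary.name}: intended primary does not preempt")
        elif primary.preempt_delay_s < min_delay_s:
            findings.append(
                f"{primary.name}: preempt delay {primary.preempt_delay_s}s "
                f"is under {min_delay_s}s"
            )
    for gw in (a, b):
        if not gw.tracked_objects:
            findings.append(f"{gw.name}: no upstream interface or route tracked")
    return findings


untouched = audit_pair(GatewayConfig("dist-1"), GatewayConfig("dist-2"))
tuned = audit_pair(
    GatewayConfig("dist-1", priority=110, preempt=True, preempt_delay_s=90,
                  tracked_objects=["uplink-to-core"]),
    GatewayConfig("dist-2", priority=100, tracked_objects=["uplink-to-core"]),
)
print(untouched)
print(tuned)
```

A pair left at defaults produces three findings here; the tuned pair produces none.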

Above the first hop, the real choice is control-plane sharing vs. control-plane independence:

| Model | Protects against | Fails when |
| --- | --- | --- |
| Switch stack / chassis with dual supervisors | Hardware failure of one unit | Software bug or bad upgrade takes the shared control plane down |
| MLAG / vPC (two boxes, shared LAG state) | One chassis failure, with all links active | Peer-link or state-sync bugs cause split-brain; upgrades need care |
| Two independent L3 switches + FHRP + ECMP | Hardware and most software failures independently | Nothing exotic; you pay in config duplication and slightly harder L2 |

Stacks are operationally lovely and a single fault domain. I use them at the access layer, where losing a closet is survivable, and avoid them at distribution and core, where I want two genuinely independent control planes. Every network engineer eventually watches a “redundant” stack reboot as one unit during an upgrade; it is cheaper to learn that from someone else.

Capacity planning: ratios first, then measurements

Start with oversubscription ratios, because they make bad designs visible before any traffic flows. Classic guidance (around 20:1 access-to-distribution and 4:1 distribution-to-core) is a sanity check, not a law. What matters is knowing your ratio and deciding it on purpose. A 48-port gigabit access switch with a single 10G uplink is 4.8:1 and fine for offices; the same switch aggregating a VDI farm or an imaging modality is not.
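
The ratio itself is simple arithmetic: total host-facing bandwidth divided by total uplink bandwidth. A minimal Python sketch follows; the function name and parameters are mine, and the second call (adding a second 10G uplink) is an illustrative variation rather than a recommendation.

```python
def oversubscription_ratio(downlink_ports, downlink_gbps, uplink_ports, uplink_gbps):
    """Total downstream bandwidth divided by total upstream bandwidth."""
    downstream = downlink_ports * downlink_gbps
    upstream = uplink_ports * uplink_gbps
    return downstream / upstream


# The example from the text: 48 x 1G access ports behind one 10G uplink.
single_uplink = oversubscription_ratio(48, 1, 1, 10)
# Illustrative variation: the same switch with two 10G uplinks.
dual_uplink = oversubscription_ratio(48, 1, 2, 10)

print(f"single uplink: {single_uplink:.1f}:1")
print(f"dual uplink:   {dual_uplink:.1f}:1")
```

Note that the dual-uplink figure only holds while both uplinks are up; with one failed, the switch is back at the single-uplink ratio.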

Then measure, because ratios lie about bursts:

  • Poll interface utilization at short intervals for anything that matters. Five-minute averages hide the microbursts that cause drops on a link that “never goes above 40%.” Output-queue drop counters tell the truth.
  • Watch flow data, not just totals — knowing a link is at 70% matters less than knowing whether that 70% is backups that could move to 2 a.m. I covered the observability side of this in observability stack design.
  • Plan headroom for the failure case, not the sunny day. Two distribution uplinks at 60% each are one failure away from a single link at 120%. My rule: any redundant pair should carry full load on one member at under ~80% utilization, or it is redundant in name only.
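
The failure-case rule in the last bullet reduces to a one-line check. This sketch assumes equal-capacity members and that a failed member's load lands evenly on the survivors; both are simplifications, and the function names are hypothetical.

```python
def post_failure_utilization(member_utilizations):
    """Utilization of each surviving link after any one member fails.

    Assumes equal-capacity links and an even spread of the displaced load.
    """
    if len(member_utilizations) < 2:
        raise ValueError("a redundant group needs at least two members")
    survivors = len(member_utilizations) - 1
    return sum(member_utilizations) / survivors


def is_redundant_in_fact(member_utilizations, ceiling=0.80):
    """True if the group still fits under the ceiling with one member down."""
    return post_failure_utilization(member_utilizations) < ceiling


# The example from the text: two uplinks at 60% each.
print(post_failure_utilization([0.60, 0.60]))  # survivor would need 120%
print(is_redundant_in_fact([0.60, 0.60]))
# A pair that passes the rule: combined load stays under the ceiling.
print(is_redundant_in_fact([0.35, 0.40]))
```

For a two-member pair this means each link should sit under roughly 40% on a normal day, which is lower than most utilization dashboards are tuned to flag.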

Budget capacity in step-function terms. Networks do not grow linearly; they grow when a floor gets densified, a camera project lands, or someone deploys a new backup product. The design review question for every new uplink is “what is the next thing that doubles this?”

Decisions to write down before you build

The design is not finished until these are recorded — they are the questions the next engineer (possibly you, in two years) will need answered:

  • Which tier owns the L3 boundary for each VLAN, which VLANs are deliberately stretched, and why?
  • FHRP protocol, priorities, preempt timers, and tracked objects per pair.
  • The oversubscription ratio at each layer, and the trigger that forces an uplink upgrade.
  • Which redundancy is control-plane-shared (stacks, MLAG) versus independent, and what the upgrade procedure is for each.
  • What breaks first when a distribution switch dies — and whether anyone has actually tested it by pulling the power.

That last one is the honest measure of a design. A network you have never failed on purpose has a redundancy model you have never verified. Keep the answers in a living document, not a diagram from the install project — I cover how in network documentation that works.

The hardware in these designs turns over every five to seven years, and each refresh arrives with new fabric names and new marketing. The parts that survive the refresh cycle are the ones this article is actually about: where the failure domains end, which control planes are genuinely independent, and whether anyone has watched the failover happen. Get those right and the network does what good infrastructure should do — it disappears.

Frequently asked questions

What is hierarchical network design?
Hierarchical design divides a network into layers with distinct jobs — classically access (host connectivity), distribution (aggregation and policy), and core (fast transport). The point is fault and change containment: each layer can be modified, scaled, or replaced without redesigning the others, and failures stay local to a layer instead of propagating.
Do I still need a core layer in a smaller enterprise?
Usually not as separate hardware. A collapsed core — distribution and core functions on one redundant switch pair — serves most single-building and small campus networks well. You add a dedicated core when multiple distribution blocks need any-to-any transport and the collapsed pair becomes a port-count or fault-domain bottleneck.
Should the Layer 2/Layer 3 boundary be at access or distribution?
Push routing as close to the access layer as your applications allow. Routed access shrinks spanning-tree domains to a single switch and makes failures local and fast to converge. Keep L2 spanning to distribution only where you genuinely need VLANs across closets — legacy clustering, some Wi-Fi roaming designs, or flat OT segments.
What is the difference between HSRP and VRRP?
Both provide a redundant default gateway by sharing a virtual IP between routers. HSRP is Cisco proprietary; VRRP is the open standard (RFC 5798) with near-identical behavior. Feature differences are minor — VRRP can use the real interface address as the virtual IP and preempts by default. Pick whichever your fleet supports and standardize the timers and priorities.
