PAVEL GLUKHIKH


Production Kubernetes Architecture: What Actually Changes

Production Kubernetes architecture decisions that separate lab clusters from real platforms: control plane topology, node pools, ingress, storage, upgrades.


Executive summary

Production Kubernetes architecture is the set of design decisions — control plane topology, etcd placement, node pool layout, ingress, storage, and upgrade strategy — that determine whether a cluster survives real failure, real load, and real change. A lab cluster and a production cluster are the same components arranged around different assumptions: the lab assumes things mostly work; production assumes everything eventually fails. This article walks through each decision, names the failure it exists to prevent, and gives the defaults I reach for when nobody has a stronger opinion.

The difference between a cluster and a platform

Anyone can stand up Kubernetes in an afternoon. kubeadm init, join two workers, deploy an app, done. I have written guides that take people exactly that far, and it is genuinely the right way to learn.

Anyone can deploy software. Keeping it healthy for five years is engineering.

That is the whole distance between a lab cluster and a production platform. They are built from the same components, arranged around different assumptions. The lab assumes things mostly work. Production assumes everything eventually fails: nodes, disks, certificates, upgrades, and the person running the upgrade. After enough years carrying the pager for infrastructure other people depended on, I stopped asking whether a component would fail and started asking what happens to the platform when it does.

Every decision below exists to survive a specific failure. Once you can name the failure, the decision stops being cargo cult and starts being engineering.

Control plane topology: three nodes, and why not two

The control plane decision is really an etcd decision. etcd is a Raft-based store that needs a majority of members alive to accept writes, and the math has no sympathy for intuition:

etcd members    Quorum    Failures tolerated
1               1         0
2               2         0
3               2         1
5               3         2

Two control plane nodes is the classic trap. You pay for redundancy and receive none, because losing either node drops you below quorum. Three is the production floor. Five makes sense only for large clusters, or when a maintenance window on one member plus an unplanned failure of another is a scenario you are required to survive.
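
The table follows from one line of arithmetic: quorum is a strict majority, and tolerance is whatever is left over. A minimal sketch of that calculation (the function names are illustrative, not part of any etcd tooling):

```python
def quorum(members: int) -> int:
    """Smallest strict majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still alive."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {failures_tolerated(n)}")
```

Running it for four members shows the same tolerance as three, which is why even-sized clusters add cost without adding survivability.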

The decisions that matter more than people expect:

  • Stacked vs. external etcd. Stacked (etcd on the control plane nodes) is the right default: fewer machines, simpler operations. Go external when the API server and etcd start competing for disk, which announces itself as rising etcd_disk_wal_fsync_duration_seconds latencies and leader elections under load.
  • Give etcd fast disks. etcd fsyncs its write-ahead log on every commit. Put it on spinning disks or busy shared storage and the entire cluster turns sluggish in ways that look like API server problems. Local NVMe or the cloud equivalent. There is no exception worth making here.
  • Spread control plane nodes across failure domains — racks on-prem, zones in cloud. Three control plane nodes in one chassis is one power supply away from being one node.
  • Back up etcd on a schedule and test the restore. A snapshot you have never restored is a hypothesis, not a backup. I keep the exact procedure in a separate note on etcd backup and restore because it is the one runbook you want verbatim at 3 a.m.
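
The failure-domain point above can be checked mechanically. The sketch below assumes you already have a mapping from control plane node to failure domain (in cloud that often comes from the topology.kubernetes.io/zone node label; the node and zone names here are hypothetical) and asks whether losing the most crowded domain still leaves an etcd majority:

```python
from collections import Counter


def survives_domain_loss(member_domains: dict[str, str]) -> bool:
    """True if losing any single failure domain still leaves a majority."""
    members = len(member_domains)
    if members == 0:
        return False
    quorum = members // 2 + 1
    # Worst case: the domain hosting the most members goes away.
    largest_domain = max(Counter(member_domains.values()).values())
    return members - largest_domain >= quorum


# Hypothetical layouts for a three-member control plane.
spread = {"cp-1": "zone-a", "cp-2": "zone-b", "cp-3": "zone-c"}
stacked_rack = {"cp-1": "rack-1", "cp-2": "rack-1", "cp-3": "rack-2"}

print(survives_domain_loss(spread))
print(survives_domain_loss(stacked_rack))
```

Two members in one rack fail the check even though the cluster has three nodes, because that rack alone holds a majority.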

Front the API servers with a load balancer — a cloud LB, an HAProxy pair, or kube-vip. Every kubeconfig points at that VIP, never at a node. Otherwise your first control plane node becomes unremovable by tradition, and infrastructure you cannot replace is infrastructure you no longer control.

Node pools: separate by failure behavior, not by team

In a lab, every workload lands on the same workers and nobody notices. In production I split node pools by how workloads fail and what they are allowed to touch:

  • System pool — CoreDNS, ingress controllers, monitoring agents, cert-manager. Tainted, so application workloads cannot starve the components the cluster needs in order to schedule anything at all.
  • General application pool — stateless services, the default landing zone. Sized for churn; this is where cluster autoscaling operates.
  • Stateful pool — databases and queues, if you run them in-cluster. Bigger disks, no aggressive bin-packing, no spot or preemptible instances.
  • Special pools as justified — GPU nodes, Windows nodes, or a pool in a separate network zone for workloads with different exposure. That pattern pairs with real network segmentation; it does not replace it.
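
How a taint keeps application workloads off the system pool is easier to see as code. This is a simplified model of the documented toleration matching rules, not the scheduler's implementation, and the taint key and value used here (dedicated=system) are hypothetical:

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified toleration match: effect, then key, then operator and value."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect tolerates every effect
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        return key == "" or key == taint["key"]  # empty key matches any taint
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations: list[dict], node_taints: list[dict]) -> bool:
    """A pod fits only if every hard taint on the node is tolerated."""
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in hard)


# Hypothetical taint on the system pool.
system_pool = [{"key": "dedicated", "value": "system", "effect": "NoSchedule"}]
coredns = [{"key": "dedicated", "operator": "Equal", "value": "system", "effect": "NoSchedule"}]
app_pod = []  # no tolerations at all

print(can_schedule(coredns, system_pool))
print(can_schedule(app_pod, system_pool))
```

A toleration only permits placement on the tainted pool. Pulling system components onto it still takes node affinity or a node selector.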

Enforce placement with taints, tolerations, and node affinity, and set resource requests on everything.

The scheduler can only pack what it can measure.

A production cluster where half the pods carry no requests is a cluster where the OOM killer does the capacity planning — and the OOM killer has never read your SLOs.
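
Finding the pods the scheduler cannot measure is a small audit. The sketch below assumes pod manifests have already been loaded into plain dictionaries (for example from exported YAML); the function name and the sample pod are illustrative:

```python
def containers_without_requests(pod: dict) -> list[str]:
    """Names of containers with no effective CPU or memory request."""
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    missing = []
    for container in containers:
        resources = container.get("resources") or {}
        # When only a limit is set, Kubernetes uses it as the request.
        effective = {**(resources.get("limits") or {}), **(resources.get("requests") or {})}
        if "cpu" not in effective or "memory" not in effective:
            missing.append(container["name"])
    return missing


# Hypothetical pod: one sized container, one unmeasured sidecar.
pod = {
    "spec": {
        "containers": [
            {"name": "api", "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
            {"name": "log-shipper"},
        ]
    }
}
print(containers_without_requests(pod))
```

Run against every workload in a namespace, the output is the list of containers whose capacity planning is currently delegated to the OOM killer.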

Ingress: one well-operated front door

Answer first: run one ingress controller class you know deeply, behind an external load balancer, with TLS terminated at the controller and certificates automated. Which controller — ingress-nginx, HAProxy, an Envoy-based gateway — matters far less than operating it well. Boring and mastered beats sophisticated and half-understood, at the front door more than anywhere else.

The production concerns labs skip:

  • Certificate automation. cert-manager with ACME, DNS-01 challenges for wildcards. Manual certificate renewal is an outage with a due date.
  • Real client IPs. Decide once where the client address is established — proxy protocol, or X-Forwarded-For from the LB — and configure it, or every security investigation downstream opens with “the source IP is the load balancer.”
  • Capacity isolation. If one noisy tenant can exhaust ingress controller connections for everyone, you have built a shared fate you did not intend. Separate ingress classes for external and internal traffic is a cheap split with a real payoff.
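
For the client IP decision, one common approach when relying on X-Forwarded-For is to walk the header from right to left and stop at the first address that is not one of your own proxies, since anything further left can be forged by the client. A minimal sketch of that approach, assuming the load balancer appends to the header and that the trusted proxy ranges are known (the addresses below are documentation examples):

```python
import ipaddress


def client_ip(peer: str, x_forwarded_for: str, trusted_proxies: list[str]) -> str:
    """Rightmost address in the chain that is not a trusted proxy."""
    trusted = [ipaddress.ip_network(cidr) for cidr in trusted_proxies]

    def is_trusted(address: str) -> bool:
        ip = ipaddress.ip_address(address)
        return any(ip in network for network in trusted)

    if not is_trusted(peer):
        return peer  # direct connection: ignore the header entirely

    candidate = peer
    hops = [hop.strip() for hop in x_forwarded_for.split(",") if hop.strip()]
    for hop in reversed(hops):
        try:
            if not is_trusted(hop):
                return hop
        except ValueError:
            return candidate  # unparsable entry: stop trusting the chain
        candidate = hop
    return candidate


# The client tried to prepend a fake address; the LB appended the real one.
print(client_ip("10.0.0.5", "1.2.3.4, 203.0.113.7", ["10.0.0.0/8"]))
```

Taking the leftmost entry instead would return the forged 1.2.3.4, which is the mistake this rule exists to prevent.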

Storage: decide what you refuse to run

Storage is where I see the most production incidents per unit of architecture diagram. And the core decision is not which CSI driver to pick. It is which stateful workloads you are willing to run in-cluster at all. Managed database services, or a dedicated database tier outside the cluster, is a legitimate production architecture — for a small platform team it is usually the correct one, because every stateful system you pull into the cluster brings its failure modes in with it.

If you do run stateful workloads:

  • Use a CSI driver with volume expansion and snapshot support, and test both operations before you need them — not during the incident that needs them.
  • Understand the failure mode of your storage class. Replicated software-defined storage (Ceph/Longhorn-class) survives node loss but adds network and rebuild behavior you must now operate. Local PVs are fast and simple but pin pods to nodes, which is fine for databases that replicate at the application layer and wrong for ones that do not.
  • Backups are a separate system. Snapshots living on the same storage cluster protect against nothing that takes out the storage cluster. The reasoning is the same as for any infrastructure — see backup and recovery security.

Upgrade strategy: the part that separates platforms from pets

Kubernetes ships three minor releases a year and supports roughly the latest three. Skipping upgrades is not a strategy. It is a deferred migration with compounding interest, and the invoice always arrives at a worse time than the upgrade would have.

The production pattern:

  1. Read the deprecation notes first. Most “the upgrade broke us” incidents are removed APIs that workloads still used. Scan manifests for deprecated API versions before the control plane moves.
  2. Snapshot etcd. Non-negotiable. It takes a minute.
  3. Upgrade the control plane one minor version at a time. API servers in an HA control plane tolerate only one minor version of skew between them, and kubeadm does not support skipping minor versions; never skip.
  4. Replace nodes, don’t patch them. Drain, delete, join a freshly imaged node on the new version. Immutable nodes turn every upgrade into a rolling replacement you have already rehearsed every time you scaled.
  5. PodDisruptionBudgets before draining. Without PDBs, a drain will happily evict both replicas of a two-replica service. With badly set PDBs (maxUnavailable: 0), the drain hangs forever. Both are findable in a lab before they are findable in production.
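
Steps 1 and 3 are the easiest to automate. The sketch below plans the minor-version hops and flags manifests that use an API removed at or before the target version. The removal table is a small, deliberately partial excerpt of well-known removals, and the function names and sample manifests are illustrative; a real scan would need the full upstream deprecation guide or a dedicated tool:

```python
# Partial table: (apiVersion, kind) -> first release where it no longer exists.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def upgrade_path(current: tuple[int, int], target: tuple[int, int]) -> list[str]:
    """Every minor version to pass through, one hop at a time."""
    if current[0] != target[0] or target[1] < current[1]:
        raise ValueError("target must be a same-major version at or above current")
    return [f"{current[0]}.{minor}" for minor in range(current[1] + 1, target[1] + 1)]


def removed_apis(manifests: list[dict], target: tuple[int, int]) -> list[str]:
    """Manifests whose apiVersion is gone at or before the target version."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed = REMOVED_IN.get(key)
        if removed is not None and removed <= target:
            name = (manifest.get("metadata") or {}).get("name", "<unnamed>")
            findings.append(f"{key[1]}/{name} uses {key[0]}, removed in {removed[0]}.{removed[1]}")
    return findings


# Hypothetical manifests loaded from a repository.
manifests = [
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget", "metadata": {"name": "web"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
]
print(upgrade_path((1, 23), (1, 25)))
print(removed_apis(manifests, (1, 25)))
```

Scanning against the final target before the first hop gives workload owners the whole list at once instead of one surprise per upgrade.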

Run the same upgrade in a staging cluster first — not a smaller cluster with different components, the same manifests at smaller scale. That is the strongest argument for keeping a real lab environment; I have written about how I structure mine in building a production-grade lab.
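
One check that fits into that staging rehearsal is classifying each workload by how a drain will treat it. This is a minimal sketch that assumes all replicas are healthy and that the budget uses integer values (percentages are not handled); the function names and labels are illustrative:

```python
from typing import Optional


def allowed_disruptions(replicas: int, pdb: dict) -> int:
    """Voluntary evictions the budget permits when every replica is healthy."""
    if "maxUnavailable" in pdb:
        return min(pdb["maxUnavailable"], replicas)
    if "minAvailable" in pdb:
        return max(replicas - pdb["minAvailable"], 0)
    raise ValueError("a PDB needs minAvailable or maxUnavailable")


def drain_risk(replicas: int, pdb: Optional[dict]) -> str:
    """Classify a workload before any node is drained."""
    if pdb is None:
        return "unprotected"  # a drain may evict every replica at once
    if allowed_disruptions(replicas, pdb) == 0:
        return "blocks drain"  # eviction is refused, so the drain never finishes
    return "ok"


print(drain_risk(2, None))
print(drain_risk(2, {"maxUnavailable": 0}))
print(drain_risk(3, {"minAvailable": 2}))
```

A budget of minAvailable equal to the replica count fails the same way as maxUnavailable: 0, which is the variant that tends to slip through review.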

Decisions to write down

Record these while the context is fresh, because the next engineer will otherwise re-derive them from folklore:

  • Control plane topology and the failure it tolerates, with the etcd restore procedure linked, tested, and dated.
  • The node pool map: taints, who is allowed to schedule where, and why.
  • The ingress path from client to pod, including where TLS terminates and where the real client IP is established.
  • The list of stateful workloads you deliberately keep out of the cluster.
  • Upgrade cadence and the API deprecation review step — owned by a person, not a team.

None of this is exotic, and that is the point.

Production Kubernetes architecture is mostly boring decisions made before the incident instead of during it. The orchestrator will keep changing; whatever replaces it will change too. The discipline of naming your failure modes and deciding in advance is what actually keeps platforms alive, and it transfers to every system you will ever run.

Frequently asked questions

How many control plane nodes does a production Kubernetes cluster need?
Three. etcd accepts writes only while a majority of members are alive, and the arithmetic is unforgiving: three members tolerate one failure, five tolerate two. One control plane node is a lab. Two is worse than one — double the cost, zero added tolerance, because losing either member breaks quorum. Five is justified only for very large clusters or strict compliance requirements.
Should etcd run stacked on the control plane nodes or on dedicated nodes?
Stacked etcd, co-located with the API server, is the right default for most clusters and is what kubeadm builds. Go external when API server load and etcd disk latency start interfering with each other — typically at hundreds of nodes — or when your compliance model requires separating the data store. Be honest about the price: external etcd doubles the machines you patch, monitor, and back up.
What is the safest way to upgrade a production Kubernetes cluster?
One minor version at a time, control plane first, then nodes, never skipping. Read the API deprecation notes before anything else, because removed APIs break workloads more often than the upgrade mechanics do. Snapshot etcd, confirm PodDisruptionBudgets exist, then drain and replace nodes with freshly imaged ones instead of patching in place. Rehearse the whole sequence in a staging cluster first.
Do I need a service mesh in production Kubernetes?
Not by default. A mesh buys mTLS between services, retries, and traffic shifting, and charges you a proxy in every data path plus a new control plane to operate — a cost that shows up at upgrade time and during incidents. If the actual requirement is encryption in transit and traffic control, CNI-level encryption and NetworkPolicy cover most of it. Adopt a mesh when a concrete requirement demands one.
