Reference Architecture
Production Kubernetes Platform
Reference architecture for production Kubernetes: HA control plane, ingress, GitOps delivery, observability, backup, and multi-environment promotion.
Design summary
A production Kubernetes platform is the set of components around the cluster — HA control plane, ingress, GitOps delivery, observability, backup, and environment promotion — that turns a container orchestrator into something a business can run workloads on. This reference architecture describes a three-node control plane with stacked etcd, an ingress tier behind external load balancing, Argo CD-style GitOps as the only deployment path, and Velero-class backup with tested restores. It covers what each layer is for, the tradeoffs baked into the design, and how it scales from one cluster to a fleet.
Component stack
- Kubernetes (kubeadm, RKE2, or managed control plane)
- etcd (stacked, 3 nodes) with scheduled snapshots
- External L4 load balancer (HAProxy/keepalived or cloud LB)
- ingress-nginx or Traefik + cert-manager
- Argo CD (GitOps controller, app-of-apps)
- Prometheus + Alertmanager + Grafana, Loki for logs
- Velero + object storage (S3-compatible) for backup
- Longhorn/Ceph or cloud CSI for persistent storage
- External Secrets Operator + Vault-class secret store
Purpose and requirements
Kubernetes out of the box is an orchestrator, not a platform. This architecture is the surrounding machinery — delivery, ingress, observability, backup, and promotion — that I consider the minimum for running workloads a business depends on. It is the same shape I run in my own lab and the shape I have written about in my Kubernetes guides; the lab version just has fewer nodes and cheaper storage.
Requirements:
- Survive one node failure anywhere — control plane, worker, or ingress — without human intervention.
- Git is the only deployment path. Cluster state must be reviewable, revertable, and rebuildable from the repo.
- Restorable, not just backed up. etcd, API objects, and persistent data recoverable within a defined RTO, proven by drills.
- Promotion, not snowflakes. The same artifact moves dev → staging → prod by pull request.
Every component below earns its place against one of those four lines. Nothing else made the cut, because everything in the platform is something someone has to patch, monitor, and explain during an incident.
Topology
                  Users / external traffic
                              |
                   +---------------------+
                   | External L4 LB      |  (HAProxy+keepalived
                   | VIP: api + ingress  |   or cloud LB)
                   +----+-----------+----+
                        |           |
              api :6443 |           | :443 ingress
     +------------------+--+   +----+-----------------+
     | CONTROL PLANE x3    |   | INGRESS NODES x2+    |
     | apiserver           |   | ingress-nginx        |
     | etcd (stacked)      |   | cert-manager certs   |
     | scheduler/ctrl-mgr  |   +----+-----------------+
     +---------------------+        |
                                    |
     +------------------------------+-----------------+
     |             WORKER POOL (n nodes)              |
     | app namespaces    | platform namespace         |
     | (NetworkPolicy)   | argocd, monitoring, velero |
     +---------------------+--------------------------+
                           |
           +---------------+----------------+
           |                                |
    +-----------------+               +--------------------+
    | CSI storage     |               | S3-compatible      |
    | (Longhorn/Ceph) |               | object store:      |
    +-----------------+               | Velero backups,    |
                                      | etcd snapshots,    |
    Git repo ──► Argo CD ──► cluster  | Loki chunks        |
    (single source of truth)          +--------------------+
Component roles
Control plane (3 nodes, stacked etcd). Three kubeadm- or RKE2-provisioned nodes running apiserver, scheduler, controller-manager, and etcd co-located. Stacked etcd is simpler than an external etcd cluster and adequate until you are operating many large clusters. The non-negotiables: odd node count for quorum, low-latency disks for etcd (fsync latency is the silent killer), and scheduled etcdctl snapshot save runs with the snapshots shipped off-cluster; the procedure in my etcd backup note is the one I use.
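The odd-node-count rule follows from majority arithmetic. A minimal sketch of that arithmetic, with function names chosen here for illustration only:

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} members: quorum {quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Going from three members to four raises the quorum without raising the number of tolerated failures, which is why even counts add cost but no resilience.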
External load balancer. A VIP in front of the apiservers (port 6443) and a second in front of the ingress nodes. On-prem this is HAProxy plus keepalived on a small VM pair; in cloud it is the provider’s L4 LB. The apiserver VIP is what makes control plane HA real — without it, “HA” means “clients pinned to whichever node they were configured with.”
Ingress tier. Two or more nodes labeled and tainted for ingress, running ingress-nginx or Traefik, with cert-manager handling certificates via ACME DNS-01 (see my wildcard certificate note for why DNS-01 beats HTTP-01 for internal services). Dedicated ingress nodes keep noisy application workloads from starving the traffic path and give you a stable set of IPs for upstream firewall rules.
GitOps delivery (Argo CD). One Argo CD instance per cluster, app-of-apps pattern, sync waves for ordering. Git holds two kinds of state: platform (ingress, monitoring, Velero — the things in this document) and applications. Humans get read-only kubeconfigs by default; the controller holds the write credentials. Drift detection stays on and pages nobody, but shows up red on a dashboard someone looks at daily.
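As a rough illustration of app-of-apps with sync waves, the sketch below builds child Application manifests as Python dictionaries and orders them by wave. The repository URL, paths, application names, and wave numbers are assumptions made for this example, not values from the design. Sync policy is deliberately left out. Argo CD does the real ordering itself; the sync_order helper only mirrors it so the intent is visible.

```python
import json

# Hypothetical repository URL; substitute the real platform repo.
REPO_URL = "https://git.example.com/platform/cluster-config.git"
WAVE_KEY = "argocd.argoproj.io/sync-wave"


def application(name: str, path: str, wave: int) -> dict:
    """Build one child Application for an app-of-apps parent."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {
            "name": name,
            "namespace": "argocd",
            # Annotation values must be strings; lower waves are synced first.
            "annotations": {WAVE_KEY: str(wave)},
        },
        "spec": {
            "project": "default",
            "source": {"repoURL": REPO_URL, "path": path, "targetRevision": "main"},
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": name,
            },
        },
    }


def sync_order(apps: list[dict]) -> list[str]:
    """Return application names sorted by sync wave, then by name."""
    def key(app: dict) -> tuple[int, str]:
        meta = app["metadata"]
        return int(meta["annotations"][WAVE_KEY]), meta["name"]

    return [app["metadata"]["name"] for app in sorted(apps, key=key)]


# Illustrative children: platform pieces first, applications last.
CHILDREN = [
    application("team-apps", "apps", 2),
    application("monitoring", "platform/monitoring", 1),
    application("velero", "platform/velero", 1),
    application("ingress-nginx", "platform/ingress-nginx", 0),
]

if __name__ == "__main__":
    print(sync_order(CHILDREN))
    print(json.dumps(CHILDREN[-1], indent=2))  # kubectl accepts JSON manifests
```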
Observability. Prometheus with Alertmanager and Grafana for metrics and alerting, Loki for logs, all deployed by the same GitOps repo. Retention in-cluster is short (7–15 days); anything longer belongs in object storage or a central stack — the full design is its own entry in this library (observability platform). Watch Prometheus cardinality from day one; it is the platform component most likely to eat its node.
Backup (Velero-class). Velero with the CSI snapshot integration, backing up to S3-compatible object storage that lives outside the cluster’s failure domain. Schedules: full cluster daily, critical namespaces every 6 hours. Database workloads additionally run native dumps (pg_dump, xtrabackup) via hooks, because a crash-consistent volume snapshot of a busy database is a recovery gamble.
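The two schedules above could be expressed as Velero Schedule objects along these lines. This is a sketch under assumptions: the 02:00 start time, the 30-day TTL, and the namespace names payments and orders are placeholders invented for the example.

```python
import json
from typing import Optional


def velero_schedule(name: str, cron: str,
                    namespaces: Optional[list[str]] = None,
                    ttl: str = "720h0m0s") -> dict:
    """Build a Velero Schedule manifest; ttl is an assumed 30-day retention."""
    template: dict = {"ttl": ttl, "snapshotVolumes": True}
    if namespaces:
        # Without includedNamespaces, Velero backs up every namespace.
        template["includedNamespaces"] = list(namespaces)
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {"schedule": cron, "template": template},
    }


SCHEDULES = [
    # Full cluster, once a day (02:00 is an assumed quiet hour).
    velero_schedule("full-cluster-daily", "0 2 * * *"),
    # Critical namespaces every six hours; the names are hypothetical.
    velero_schedule("critical-6h", "0 */6 * * *", ["payments", "orders"]),
]

if __name__ == "__main__":
    for schedule in SCHEDULES:
        print(json.dumps(schedule, indent=2))
```

Database dump hooks are not shown; they are configured separately from the schedule timing.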
Secrets. External Secrets Operator pulling from a Vault-class store, so Git carries references, never values. Sealed Secrets is an acceptable simpler alternative when you do not have a secret manager to point at.
Security model
- RBAC: developers get namespace-scoped roles in non-prod, read-only in prod. Cluster-admin is break-glass, audited, and boring to request on purpose.
- NetworkPolicy: default-deny per application namespace, with explicit allows for ingress-tier and monitoring scrapes. Platform namespaces are locked to their function.
- Pod Security Admission at baseline cluster-wide, restricted for application namespaces; exceptions are documented in Git next to the manifest that needs them.
- Supply chain: images come from a private registry mirror; admission policy (Kyverno-class) blocks :latest and unsigned images in prod.
- API surface: apiserver reachable only from admin networks and CI; no public 6443. Audit logging shipped off-cluster with everything else.
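The NetworkPolicy item can be sketched as one default-deny policy per application namespace plus one explicit allow per permitted source. The namespace names shop, ingress-nginx, and monitoring are assumptions for illustration. The sketch denies ingress only; adding Egress to policyTypes would also block DNS until an explicit egress allow exists.

```python
import json


def default_deny(namespace: str) -> dict:
    """Deny all inbound traffic to every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        # An empty podSelector matches all pods in the namespace.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }


def allow_from_namespace(namespace: str, source_namespace: str) -> dict:
    """Allow inbound traffic from every pod in one source namespace."""
    selector = {
        "namespaceSelector": {
            # Kubernetes sets this label on each namespace automatically.
            "matchLabels": {"kubernetes.io/metadata.name": source_namespace}
        }
    }
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            "name": f"allow-from-{source_namespace}",
            "namespace": namespace,
        },
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [selector]}],
        },
    }


# Hypothetical application namespace and its two permitted sources.
POLICIES = [
    default_deny("shop"),
    allow_from_namespace("shop", "ingress-nginx"),
    allow_from_namespace("shop", "monitoring"),
]

if __name__ == "__main__":
    for policy in POLICIES:
        print(json.dumps(policy, indent=2))
```

A production policy would usually narrow the allows further, for example to the scrape port only for monitoring.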
Tradeoffs
| Decision | What you gain | What it costs |
|---|---|---|
| Stacked etcd vs external | Fewer nodes, simpler ops | Control plane node loss takes an etcd member with it |
| Dedicated ingress nodes | Stable traffic path, clean firewall story | 2+ nodes that run “nothing but” ingress |
| Argo CD per cluster vs centralized | Cluster is self-contained, no cross-cluster blast radius | Fleet-wide view requires extra tooling |
| Longhorn/replicated CSI vs SAN | No storage array dependency, easy in lab | Replication traffic on the node network; slower than real arrays |
| PR-based promotion vs auto-promote | Human gate before prod | Slower cadence; the gate can become a rubber stamp |
| One cluster per env tier | Hard isolation between prod and everything else | Two of everything in this document |
Scaling and variations
Lab scale: collapse to three nodes total (control-plane taint removed so the same nodes also run workloads), keep every other component identical. The value of a lab is practicing the same GitOps, backup, and restore motions as production, a point I make at length in building a production-grade lab.
Managed control plane (EKS/AKS/GKE): the control plane and etcd rows of this design become the provider’s problem; everything from ingress down is unchanged. Do not mistake a managed control plane for a managed platform — GitOps, backup, and observability remain yours.
Multi-environment promotion: one Git repo, one overlay per environment (Kustomize or Helm values). CI builds and pushes an image digest; promotion is a PR that bumps the digest in the next environment’s overlay. Prod overlays require review from the platform team. No environment ever points at a mutable tag.
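The promotion rule can be sketched as a function that copies a digest-pinned image reference from one environment's overlay to the next and refuses anything that is not pinned. The overlay structure (a dictionary of environment to application to image) and the registry names are assumptions for illustration; in practice the change would be a pull request against Kustomize or Helm values files.

```python
import re

# An image reference counts as pinned only if it ends in a full sha256 digest.
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_pinned(image: str) -> bool:
    """True when the reference is immutable (digest), not a mutable tag."""
    return DIGEST_REF.match(image) is not None


def promote(overlays: dict, source_env: str, target_env: str, app: str) -> dict:
    """Return new overlays with target_env running source_env's image for app."""
    image = overlays[source_env][app]
    if not is_pinned(image):
        raise ValueError(f"{app} in {source_env} is not pinned by digest: {image}")
    # Copy so the caller's overlays stay untouched, like a branch for the PR.
    updated = {env: dict(images) for env, images in overlays.items()}
    updated[target_env][app] = image
    return updated


# Hypothetical digests and registry, for illustration only.
OLD = "registry.example.com/shop/api@sha256:" + "a" * 64
NEW = "registry.example.com/shop/api@sha256:" + "b" * 64
OVERLAYS = {"dev": {"api": NEW}, "staging": {"api": OLD}, "prod": {"api": OLD}}

if __name__ == "__main__":
    after = promote(OVERLAYS, "dev", "staging", "api")
    print(after["staging"]["api"] == NEW)           # staging now matches dev
    print(is_pinned("registry.example.com/shop/api:latest"))
```

A check like is_pinned fits in CI on the config repo, so a mutable tag fails review before it reaches any overlay.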
Fleet scale (5+ clusters): move to a hub-and-spoke Argo CD or per-cluster instances registered to a central UI, push Prometheus data to a central long-term store, and template cluster creation itself (Cluster API or Terraform) so a cluster becomes cattle too.
Operations notes
- Restore drills quarterly: one etcd restore to a scratch cluster, one Velero namespace restore, one full database recovery. Time them; the times are your real RTO.
- Upgrades: control plane first, one node at a time, then workers in small batches behind PodDisruptionBudgets. Stay within the three most recent upstream minor releases (n-2) and upgrade one minor version at a time; the version skew policy is not a suggestion.
- Capacity: alert at 70% sustained on node CPU/memory and at PV fill rates, not just absolutes. The platform namespaces get requests/limits like any tenant — monitoring OOMKilled by an app rollout is an avoidable irony.
- Runbooks: apiserver VIP failover, etcd member replacement, certificate expiry (check kubeadm certs check-expiration before it checks you), and full-cluster rebuild from Git plus backups. That last one is the platform’s final exam: if you cannot rebuild from the repo and the object store, the cluster is a pet with extra YAML.
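The capacity note distinguishes a sustained threshold from a momentary spike, and a fill rate from an absolute level. A small sketch of both checks, with the sample values and the linear-growth assumption chosen here for illustration:

```python
from typing import Optional


def sustained_above(samples: list[float], threshold: float = 0.70) -> bool:
    """True only if every sample in the window is at or above the threshold."""
    return bool(samples) and all(s >= threshold for s in samples)


def hours_until_full(used_then: float, used_now: float,
                     capacity: float, hours_between: float) -> Optional[float]:
    """Project time to a full volume, assuming growth stays linear."""
    if hours_between <= 0:
        raise ValueError("hours_between must be positive")
    growth_per_hour = (used_now - used_then) / hours_between
    if growth_per_hour <= 0:
        return None  # flat or shrinking usage never fills the volume
    return (capacity - used_now) / growth_per_hour


if __name__ == "__main__":
    # A spike to 90% with a dip in between is not sustained load.
    print(sustained_above([0.72, 0.75, 0.71]), sustained_above([0.90, 0.40, 0.90]))
    # 40 GiB six hours ago, 52 GiB now, on a 100 GiB volume.
    print(hours_until_full(40, 52, 100, 6))
```

In a Prometheus setup the same ideas are usually written as alert rules with a duration clause and a linear prediction over the volume's free-space metric; the Python version only shows the arithmetic.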
None of this is exotic, and that is deliberate. Kubernetes releases three times a year; the components in this document will all be replaced eventually. What persists is the operating discipline — Git as the single source of truth, restores proven by drills, promotion by review. A platform built on those habits survives its own technology choices. One built on heroics does not.
Frequently asked questions
- How many control plane nodes does production Kubernetes need?
- Three. etcd requires a quorum of floor(n/2)+1 members, so three nodes tolerate one failure; a two-node control plane is strictly worse than one because its quorum is two, so losing either node breaks quorum and there are now two nodes that can fail. Five nodes buy tolerance of two failures at the cost of higher write latency, and are rarely justified below very large or regulated environments.
- Should deployments go through GitOps only, with no kubectl apply?
- Yes, for everything beyond break-glass. When Git is the single source of truth, the cluster state is reviewable, revertable, and rebuildable. The moment humans apply manifests directly, drift begins and the Git history stops being trustworthy. Keep a documented break-glass procedure, and make the GitOps controller flag drift loudly.
- Does Velero back up everything I need?
- Velero captures Kubernetes API objects and, via CSI snapshots or file-system backup, persistent volumes. It does not give you application-consistent database backups by itself — pair it with backup hooks or database-native tooling for stateful stores. And a backup you have not restored in a drill is a hypothesis, not a backup.
- One big cluster or many small ones?
- Fewer, larger clusters are cheaper to operate; more, smaller clusters give harder isolation and smaller blast radius. The practical middle for most teams: one cluster per environment tier (prod, non-prod), with namespaces plus NetworkPolicy and quotas as the intra-cluster boundary, splitting further only for compliance or noisy-neighbor reasons.