Kubernetes Troubleshooting: A Method That Always Finds It
A systematic Kubernetes troubleshooting method — cluster, node, workload, network, storage — with the kubectl commands in order and common failure signatures.
Executive summary
Kubernetes troubleshooting works best as layered elimination: verify the cluster, then the node, then the workload, then the network, then storage, in that order. Most engineers debug Kubernetes by jumping straight to the pod that paged them — which works right up until the real cause sits two layers down. This article gives the diagnostic sequence I have refined across years of supporting clusters, the exact kubectl commands at each layer, and the failure signatures — CrashLoopBackOff, Pending, ImagePullBackOff, NotReady — mapped to their usual root causes.
Debug the platform before the pod
The most common Kubernetes troubleshooting mistake is starting where the alert pointed.
The alert tells you where it hurts. It rarely tells you what is broken.
A pod restarting, a service timing out, a deployment stuck — these are frequently symptoms of a problem one or two layers below, and an hour spent reading application logs is an hour wasted if the actual issue is a node out of disk or an API server gasping on slow etcd. I learned this the ordinary way: years of cluster support, plenty of them on call, watching smart engineers chase the loudest symptom while the quiet cause sat in the event stream the whole time.
The method that works is the same one I use for network troubleshooting: layered elimination. Verify each layer in order (cluster, node, workload, network, storage) and move up to a layer only once the ones beneath it are proven healthy. It feels slower. It is faster, because you never solve the wrong problem.
Layer 1: Cluster
Answer one question in under a minute: is the platform itself healthy?
kubectl get nodes
kubectl get pods -n kube-system
kubectl get events -A --sort-by=.lastTimestamp | tail -30
kubectl api-resources --request-timeout=5s >/dev/null && echo "API OK"
What you are looking for:
- Any node NotReady or Unknown — stop. Go to Layer 2 for that node before anything else.
- kube-system pods restarting — CoreDNS, CNI agents, kube-proxy. A crashlooping CNI pod explains almost any networking symptom above it.
- The event stream is the cluster narrating its own problems: evictions, failed scheduling, image pull errors, OOM kills, all timestamped. I read events before logs. Every time.
- Slow API responses point at the control plane — often etcd disk latency. If the API server is timing out, nothing above it is debuggable; fix that first. And if etcd itself is the casualty, you want a tested restore procedure, not an improvised one.
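The first two checks in the list above are easy to script, so nobody has to eyeball a long table mid-incident. A minimal sketch, assuming the default kubectl column layout; the function names unhealthy_nodes and restarting_system_pods are mine, not part of kubectl:

```shell
# Layer 1 triage helpers (illustrative names, not kubectl built-ins).

# Print nodes whose STATUS does not start with "Ready".
# Cordoned nodes show "Ready,SchedulingDisabled" and are not flagged.
unhealthy_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ {print $1, $2}'
}

# Print kube-system pods that have restarted at least once.
# Columns with --no-headers: NAME READY STATUS RESTARTS AGE.
restarting_system_pods() {
  kubectl get pods -n kube-system --no-headers |
    awk '$4 + 0 > 0 {print $1, $3, "restarts=" $4}'
}

# Usage:
#   unhealthy_nodes
#   restarting_system_pods
```

Empty output from both is the answer you want. Anything printed by the first is a reason to go to Layer 2 before touching a workload.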
Layer 2: Node
If a node is NotReady or suspect:
kubectl describe node <node> | sed -n '/Conditions:/,/Addresses:/p'
kubectl describe node <node> | sed -n '/Allocated resources:/,$p'
kubectl top node
The Conditions block hands you the signature directly:
| Condition | Usual cause |
|---|---|
| MemoryPressure | Workloads without limits; kubelet is about to evict |
| DiskPressure | Image bloat or logs filling the node filesystem |
| PIDPressure | A fork-bombing workload |
| Ready=Unknown | Kubelet stopped reporting — node down, kubelet dead, or network partition |
Ready=Unknown deserves emphasis: it means the control plane cannot hear
the kubelet at all, so the fix is on the node or the network path to it, not
in Kubernetes objects. SSH in — or use your out-of-band access — and work
the basics: systemctl status kubelet, journalctl -u kubelet --since -1h,
disk with df -h, and whether the container runtime is alive with
crictl ps or systemctl status containerd.
Two node-level classics worth naming because they burn everyone once:
clock skew breaking TLS and token validation, and DiskPressure from
container images, where crictl rmi --prune buys you room while you fix
the garbage collection settings that let it happen.
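When the describe output is too long to scan under pressure, the same Conditions data can be filtered down to only the entries that are in a bad state. A sketch, assuming the standard node condition types; node_bad_conditions is a hypothetical helper name:

```shell
# Print only the node conditions that indicate trouble:
#   Ready when it is anything other than True,
#   every other condition (MemoryPressure, DiskPressure, ...) when it is True.
node_bad_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
    awk -F= '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 == "True")'
}

# Usage:
#   node_bad_conditions <node>
```

No output means every condition is where it should be, and the node is probably not your problem.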
Layer 3: Workload
Nodes healthy? Now, and only now, the pod. The sequence:
kubectl get pods -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl get deploy,rs -n <ns>
describe first, logs second. The Events section of describe answers “why
won’t it start,” while logs answer “why does it die” — different questions,
and mixing them up costs time. The status column is a signature table:
| Signature | Meaning | First check |
|---|---|---|
| Pending | Scheduler can’t place it | describe → Events names the failed predicate: resources, taints, affinity, unbound PVC |
| ImagePullBackOff | Can’t fetch the image | Image name/tag typo, missing imagePullSecret, registry auth or network |
| CrashLoopBackOff | Starts, then exits | logs --previous and the exit code in describe |
| CreateContainerConfigError | Bad config reference | Missing ConfigMap/Secret named in the spec |
| OOMKilled (exit 137) | Exceeded memory limit | Raise the limit or fix the leak; check kubectl top pod |
| Running but not Ready | Failing readiness probe | Probe endpoint, port, and timing in describe |
Exit codes carry most of the CrashLoopBackOff information: 1 is the
application failing on its own, 137 is SIGKILL (an OOM kill, or a forced
kill after the container ignored SIGTERM past its grace period, as happens
after a failed liveness probe), 143 is SIGTERM (something asked it to
stop). A pod that crashes too fast to exec into can still be inspected
with an ephemeral container:
kubectl debug <pod> -n <ns> -it --image=busybox --target=<container>
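The exit-code reading above can be pulled straight from the pod status instead of being picked out of describe. A sketch: last_exit_codes and explain_exit_code are hypothetical helpers, and the fallback branch relies on the usual convention that a code above 128 means 128 plus the signal number.

```shell
# Print "<container>=<exit code>" for each container's previous run.
# The code is empty if that container has not terminated yet.
last_exit_codes() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}={.lastState.terminated.exitCode}{"\n"}{end}'
}

# Translate an exit code into the first thing to check.
explain_exit_code() {
  case "$1" in
    1)   echo "application error: read kubectl logs --previous" ;;
    137) echo "SIGKILL (128+9): OOM kill or forced kill after the grace period" ;;
    143) echo "SIGTERM (128+15): something asked the container to stop" ;;
    *)
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application-defined exit code: check the application's own docs"
      fi ;;
  esac
}

# Usage:
#   last_exit_codes <pod> <ns>
#   explain_exit_code 137
```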
One habit prevents a whole class of confusion: check whether the
deployment is the problem rather than the pod.
kubectl rollout status deploy/<name> and kubectl rollout history tell
you whether you are staring at a stuck rollout, and kubectl rollout undo
is the fastest mitigation in Kubernetes when the last change is the suspect.
The last change is usually the suspect.
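The rollout check fits in one small function. A sketch, with rollout_check as a hypothetical name and a 30-second timeout chosen arbitrarily; it deliberately prints the undo command rather than running it, so the rollback stays a human decision:

```shell
# Report whether a deployment's rollout has completed; if not,
# show its revision history and the command that would roll it back.
rollout_check() {
  deploy="$1"
  ns="$2"
  if kubectl rollout status "deploy/$deploy" -n "$ns" --timeout=30s; then
    echo "rollout complete"
  else
    echo "rollout not complete within 30s; recent revisions:"
    kubectl rollout history "deploy/$deploy" -n "$ns"
    echo "to roll back: kubectl rollout undo deploy/$deploy -n $ns"
  fi
}

# Usage:
#   rollout_check <name> <ns>
```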
Layer 4: Network
The workload runs but nothing can reach it. Debug the path in the order traffic flows: DNS → Service → Endpoints → NetworkPolicy → Ingress.
kubectl run dbg --rm -it --image=nicolaka/netshoot -- bash
# inside the pod:
nslookup <service>.<namespace>.svc.cluster.local
curl -sv http://<service>.<namespace>:<port>/healthz
Then from outside the debug pod:
kubectl get svc <service> -n <ns>
kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<service>
kubectl get networkpolicy -n <ns>
kubectl describe ingress <name> -n <ns>
The single highest-yield check is the endpoints: a Service with no ready
endpoints in its EndpointSlices has nowhere to send traffic. Either the
labels do not match (compare spec.selector on the Service against the pod
labels; the slice will be empty) or the pods exist but fail readiness, in
which case the slice lists them with ready: false. A label typo accounts
for a remarkable share of “the service is down” tickets, and it will never
stop being a label typo just because the ticket sounds dramatic.
NetworkPolicy failures are silent drops by design — correct security posture, miserable debugging. If traffic vanishes with populated endpoints and working DNS, list the policies in both the client and server namespaces before blaming the CNI. The rule is the same as in any segmented network: find the enforcement point before questioning the transport.
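The selector-versus-labels comparison is the check most worth automating, because it is tedious by eye. A sketch, assuming the Service uses a plain equality selector; service_backends is a hypothetical helper that reads the selector with a go-template and feeds it back to kubectl as a label query:

```shell
# Show which pods a Service's selector actually matches.
service_backends() {
  svc="$1"
  ns="$2"
  # Render spec.selector as "k1=v1,k2=v2," then drop the trailing comma.
  sel="$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')"
  sel="${sel%,}"
  if [ -z "$sel" ]; then
    echo "service $svc has no selector"
    return 1
  fi
  echo "selector: $sel"
  kubectl get pods -n "$ns" -l "$sel" -o wide
}

# Usage:
#   service_backends <service> <ns>
```

If the pod list comes back empty, the labels are the problem. If pods are listed but not Ready, go back to the readiness probe in Layer 3.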
Layer 5: Storage
Storage problems announce themselves as pods stuck in ContainerCreating
or Pending with volume events:
kubectl get pvc -n <ns>
kubectl describe pvc <claim> -n <ns>
kubectl get pv
kubectl get events -n <ns> --field-selector reason=FailedAttachVolume
Signatures:
- PVC Pending: no StorageClass, no capacity, or a provisioner that is not running. describe pvc says which.
- FailedAttachVolume / FailedMount: commonly a volume still attached to a previous node after an unclean node loss. Cloud volumes and some SDS systems need minutes, or a nudge, to detach.
- Read-only filesystem mid-flight: the storage backend had an event and the kernel remounted read-only. That is a storage-system investigation, not a Kubernetes one, and pretending otherwise wastes the evening.
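The first two signatures can be collected for a namespace in one pass. A sketch, assuming the default kubectl get pvc column order (NAME, STATUS, ...); storage_triage is a hypothetical helper name:

```shell
# List unbound claims and volume attach/mount failures for one namespace.
storage_triage() {
  ns="$1"
  echo "== PVCs not Bound =="
  kubectl get pvc -n "$ns" --no-headers | awk '$2 != "Bound" {print $1, $2}'
  for reason in FailedAttachVolume FailedMount; do
    echo "== events: $reason =="
    kubectl get events -n "$ns" --field-selector "reason=$reason"
  done
}

# Usage:
#   storage_triage <ns>
```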
Make the method a reflex
Write the five layers into your runbook with the commands verbatim, in order. During an incident nobody composes field selectors from memory.
The value of a method is that it removes decisions at exactly the moment your judgment is at its worst.
Two closing habits from years of doing this. First, always establish what
changed — kubectl rollout history, recent node operations, certificate
expiry dates — because Kubernetes clusters rarely break spontaneously; they
break because something moved. Second, after every incident, ask which layer
the alert should have pointed at, and fix the alert. That loop — incident
to better signal — is the whole argument for investing in a real
observability stack, and it
is how troubleshooting time falls from hours to minutes over the life of a
platform. Tools will change; the discipline of eliminating layers in order
is the part that keeps working.
Frequently asked questions
- What is the first command to run when troubleshooting Kubernetes?
- kubectl get nodes, then kubectl get events -A --sort-by=.lastTimestamp. The first tells you whether the platform itself is healthy. The second is the cluster narrating its own problems in chronological order — evictions, failed scheduling, OOM kills, all timestamped. Starting at the pod that paged you skips the two layers most likely to hold the actual cause.
- How do I fix CrashLoopBackOff in Kubernetes?
- CrashLoopBackOff means the container starts and then exits, and Kubernetes is restarting it with increasing delay. Run kubectl logs with --previous to read the crashed instance's output, and kubectl describe pod for the exit code: 1 is an application error, 137 is OOMKilled or a failed liveness probe, 143 is SIGTERM. The fix lives in the application or its configuration far more often than in Kubernetes.
- Why is my Kubernetes pod stuck in Pending?
- Pending means the scheduler cannot place the pod anywhere. kubectl describe pod shows the reason under Events: insufficient CPU or memory on every node, an unsatisfiable node selector or affinity rule, a taint without a matching toleration, or an unbound PersistentVolumeClaim. The describe output names the exact predicate that failed on each node. Read it before guessing — it is almost always specific enough to act on.
- How do I debug Kubernetes DNS resolution failures?
- Test from inside a pod, not from a node — the node resolves through a different path. Run a debug pod and query the service name with nslookup or dig against the cluster DNS service IP. If that fails, check that CoreDNS pods are running and read their logs. If cluster DNS works but one name fails, confirm the service exists and its endpoints are populated with kubectl get endpointslices.
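The DNS answer above can be run as a single throwaway pod. A sketch under two assumptions that may not hold on every cluster: the cluster DNS Service is named kube-dns in kube-system (the usual name even when CoreDNS serves it), and the nicolaka/netshoot image is pullable. dns_check is a hypothetical helper name.

```shell
# Resolve a name from inside the cluster, querying the cluster DNS Service directly.
dns_check() {
  name="${1:-kubernetes.default.svc.cluster.local}"
  # Assumption: the cluster DNS Service is kube-system/kube-dns.
  dns_ip="$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')"
  echo "cluster DNS: ${dns_ip:-not found}"
  # The server argument is passed only if a DNS Service IP was found.
  kubectl run dns-check --rm -i --restart=Never --image=nicolaka/netshoot -- \
    nslookup "$name" ${dns_ip:+"$dns_ip"}
}

# Usage:
#   dns_check <service>.<namespace>.svc.cluster.local
```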