What is Kubernetes?
Kubernetes (K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. Originally designed at Google and open-sourced in 2014, it abstracts the underlying infrastructure and lets teams declare the desired state of their applications. Kubernetes then continuously reconciles the actual state of the cluster with that desired state. It has become the dominant container orchestration platform for production workloads.
Core concepts
Kubernetes is built around a small set of primitives. Understanding them is a prerequisite for operating it effectively.
Pod: the smallest deployable unit in Kubernetes. A pod contains one or more containers that share a network namespace and storage volumes. All containers in a pod run on the same node and communicate over localhost.
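Deployment: a controller that manages a set of identical pods. Deployments handle rolling updates, rollbacks, and desired replica counts. When you change the container image, the Deployment controller replaces pods incrementally rather than all at once. By default, at most 25% of the desired pods may be unavailable and at most 25% extra pods may be created during the rollout, which minimizes downtime.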
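As a rough illustration of how a rolling update is bounded, the sketch below computes the pod-count limits implied by a Deployment's maxSurge and maxUnavailable settings when both are percentages. It assumes the documented defaults of 25% each, with maxSurge rounded up and maxUnavailable rounded down. The function name rollout_bounds is hypothetical and not part of any Kubernetes client.

```python
def rollout_bounds(replicas: int, max_surge_pct: int = 25, max_unavailable_pct: int = 25) -> dict:
    """Return the pod-count limits a rolling update stays within.

    Percentages are resolved against the desired replica count:
    maxSurge rounds up, maxUnavailable rounds down.
    """
    # Integer arithmetic avoids floating-point rounding surprises.
    surge = -(-replicas * max_surge_pct // 100)          # ceiling division
    unavailable = replicas * max_unavailable_pct // 100  # floor division
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(10))
```

With 10 replicas and the defaults, the rollout may run up to 13 pods in total and keeps at least 8 available, so several pods can be replaced in parallel.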
Service: a stable network endpoint that routes traffic to a set of pods, even as individual pods are created and replaced. Services abstract away ephemeral pod IP addresses and provide load balancing across replicas.
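A Service usually finds its pods through a label selector, and only pods that pass their readiness checks receive traffic. The sketch below models that selection in plain Python. The Pod class and select_endpoints function are hypothetical stand-ins for illustration, not the Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    ip: str
    labels: dict = field(default_factory=dict)
    ready: bool = True


def select_endpoints(selector: dict, pods: list) -> list:
    """Return IPs of ready pods whose labels include every selector pair."""
    return [
        pod.ip
        for pod in pods
        if pod.ready and all(pod.labels.get(k) == v for k, v in selector.items())
    ]


pods = [
    Pod("web-1", "10.0.0.4", {"app": "web"}),
    Pod("web-2", "10.0.0.7", {"app": "web"}, ready=False),  # failing readiness
    Pod("db-1", "10.0.0.9", {"app": "db"}),
]
print(select_endpoints({"app": "web"}, pods))
```

Because the endpoint list is recomputed from labels and readiness, clients keep using the same Service address while the pod IPs behind it change.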
Node: a worker machine (VM or physical server) that runs pods. Each node runs the kubelet agent, a container runtime (such as containerd), and typically kube-proxy, which programs the node's network rules so that traffic sent to a Service reaches its pods.
Namespace: a logical partition within a cluster for isolating resources. Namespaces scope names, resource quotas, and access controls. They are commonly used to separate environments (staging, production) or teams within a single cluster.
Control plane: the set of components that manage cluster state. The API server is the central endpoint for all operations. The scheduler assigns pods to nodes. The controller manager runs control loops that reconcile desired state with actual state. etcd is the distributed key-value store that holds all cluster state.
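The control loop idea can be sketched in a few lines. The example below is a deliberately simplified model of a single reconciliation pass for a replica count; the reconcile function and the pod names are hypothetical, and real controllers watch the API server rather than receiving lists as arguments.

```python
def reconcile(desired_replicas: int, running_pods: list) -> list:
    """Compare desired and actual state once and return the corrective actions."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return ["create"] * diff
    if diff < 0:
        # Remove the surplus pods from the end of the list.
        return [f"delete {name}" for name in running_pods[diff:]]
    return []


print(reconcile(3, ["web-a", "web-b"]))           # one pod missing
print(reconcile(1, ["web-a", "web-b", "web-c"]))  # two pods too many
```

The loop never records how the cluster drifted. It only compares the two states and acts on the difference, which is why the same logic handles crashes, scale-ups, and scale-downs.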
Why teams use Kubernetes
Kubernetes solves the operational complexity of running containers at scale. It handles:
- Scheduling containers onto nodes based on available resources and placement constraints
- Restarting failed containers automatically
- Distributing load across replicas of a service
- Rolling out new versions of an application incrementally, with little or no downtime
- Scaling workloads up and down based on CPU, memory, or custom metrics
Before Kubernetes, these concerns were solved manually or with bespoke tooling. Kubernetes provides a standardized API and control loop model that works consistently across cloud providers, on-premises hardware, and hybrid environments.
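To make the scheduling item above more concrete, the sketch below filters nodes by whether their unreserved CPU and memory cover a pod's requests. It is a minimal model under stated assumptions: the Node class and feasible_nodes function are hypothetical, and the real scheduler also applies placement constraints such as taints and affinity and then scores the remaining nodes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millicores: int
    free_memory_mib: int


def feasible_nodes(cpu_request: int, memory_request: int, nodes: list) -> list:
    """Return names of nodes with enough unreserved CPU and memory for the requests."""
    return [
        node.name
        for node in nodes
        if node.free_cpu_millicores >= cpu_request
        and node.free_memory_mib >= memory_request
    ]


nodes = [Node("node-a", 500, 1024), Node("node-b", 2000, 4096)]
print(feasible_nodes(1000, 2048, nodes))  # only node-b has room
print(feasible_nodes(4000, 512, nodes))   # no node fits
```

An empty result corresponds to a pod that cannot be placed on any node and therefore stays Pending.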
The operational complexity it introduces
Kubernetes solves deployment problems but introduces its own operational surface. Running Kubernetes in production requires expertise that goes beyond writing application code.
Common operational challenges include:
- Resource sizing: setting CPU and memory requests and limits correctly requires measured baselines, not guesswork. A memory limit set too low causes OOM kills; a CPU limit set too low causes throttling.
- Networking: pod-to-pod communication, DNS resolution, ingress routing, and network policies each have their own failure modes.
- Stateful workloads: StatefulSets, persistent volumes, and storage class configuration are significantly more complex than stateless Deployments.
- Cluster upgrades: Kubernetes releases a new minor version approximately every four months. Each upgrade requires compatibility checks across addons, controllers, and API versions.
- Observability: the volume of metrics, logs, and events generated by a Kubernetes cluster requires intentional instrumentation and filtering to be actionable.
Most production teams need dedicated platform engineering or SRE capacity to manage Kubernetes reliably.
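For the resource sizing item above, it can help to see how Kubernetes quantity strings map to plain numbers before comparing them with measured usage. The sketch below parses a subset of the notation ("500m" CPU, "128Mi" or "1G" memory) and checks a memory limit against an observed peak. All function names are hypothetical, and the parser covers only the suffixes listed in the code.

```python
# Binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T) memory suffixes.
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '500m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' or '1G' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix means plain bytes


def limit_too_low(memory_limit: str, observed_peak_bytes: int) -> bool:
    """True when the measured peak already exceeds the configured limit."""
    return observed_peak_bytes > parse_memory(memory_limit)


print(parse_cpu("500m"), parse_memory("128Mi"))
print(limit_too_low("128Mi", 200 * 2**20))
```

A check like this only flags limits that are already below the measured peak. Choosing how much headroom to leave above that peak remains a judgment call based on the workload.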
Common Kubernetes failure modes
These are the failure modes that SRE and platform teams encounter most frequently in production Kubernetes clusters:
- OOMKilled (exit code 137): the Linux kernel terminated a container for exceeding its memory limit
- Pod stuck in Pending: the scheduler cannot place the pod on any available node
- Readiness and liveness probe failures: misconfigured health checks that cause unnecessary restarts or traffic routing to unhealthy pods
- 502 and 503 errors: upstream failures at the ingress or service layer
- CrashLoopBackOff: Kubernetes backing off on restart attempts after repeated container failures
Each failure mode has distinct signals in kubectl describe, pod events, and container logs. Diagnosing them accurately requires correlating state across the control plane, the node, and the application.
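Two of these signals can be decoded mechanically. The sketch below interprets container exit codes (codes above 128 conventionally mean the process was killed by signal number code minus 128, so 137 is SIGKILL) and lists the restart delays behind CrashLoopBackOff. The delay values assume the documented kubelet default of 10 seconds doubling up to a 5-minute cap, which may differ across versions or configurations. Both function names are hypothetical.

```python
# Only the two signals most often seen in pod terminations are named here.
_SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Explain a container exit code in plain words."""
    if code > 128:
        number = code - 128
        return f"killed by {_SIGNAL_NAMES.get(number, f'signal {number}')}"
    if code == 0:
        return "exited normally"
    return f"application error (exit {code})"


def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list:
    """Seconds waited before each restart: doubling from base, capped at cap."""
    return [min(base * 2**i, cap) for i in range(restarts)]


print(describe_exit_code(137))
print(crashloop_delays(7))
```

Exit code 137 alone only says the container received SIGKILL. It is the OOMKilled reason in the pod status that ties the kill to the memory limit.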
Kubernetes and AI SRE
As AI agents take on operational tasks in Kubernetes clusters (applying manifests, scaling workloads, remediating incidents), the surface for automated failure grows. An agent that misidentifies root cause and applies the wrong fix can escalate an incident rather than resolve it. Runtime governance and blast-radius bounds on agent actions become a reliability primitive, not an optional layer.
Automated root-cause analysis on Kubernetes failure scenarios is one of the core evaluation dimensions in the AI SRE Benchmark, where NOFire AI reached 89% Top-1 root-cause accuracy (RCAEval, N=735, ACM 2025) compared to a 17-42% state-of-the-art range.
Frequently asked questions
- What is the difference between Docker and Kubernetes?
  Docker builds and runs containers on a single machine. Kubernetes orchestrates containers across many machines, handling scheduling, networking, scaling, and failure recovery.
- Is Kubernetes hard to operate?
  Kubernetes reduces application deployment complexity but introduces its own operational surface. Most teams need dedicated platform or SRE expertise to run it reliably in production.
- What is kubectl?
  The command-line tool for interacting with a Kubernetes cluster. It talks to the Kubernetes API server to query state, apply configuration, and run commands inside pods.