Kubernetes readiness and liveness probe failures

Kubernetes readiness and liveness probe failures cause two distinct problems: a failing liveness probe triggers a container restart, while a failing readiness probe removes the pod from the Service endpoints so it stops receiving traffic. Liveness failures are a common root cause of CrashLoopBackOff restarts; readiness failures are a common root cause of 502/503 errors and traffic drops during rolling deploys.

Liveness vs readiness vs startup probes

These three probe types answer different questions and have different consequences when they fail.

Liveness probe: "Is this container alive?" If it fails, Kubernetes kills and restarts the container. Use it to detect deadlocks or unrecoverable application states where the process is still running but cannot make progress.

Readiness probe: "Is this container ready to serve traffic?" If it fails, the pod is removed from the Service's endpoint list and traffic stops routing to it. Use it for startup completion checks, dependency health checks, and graceful overload handling.

Startup probe: "Has this container started yet?" It delays liveness and readiness checks until startup completes. Use it for slow-starting applications where a liveness probe would kill the container before it finishes initializing.

The startup probe is the right tool when your application has a variable or long startup time. Setting a high failureThreshold on the startup probe gives the container time to start without inflating initialDelaySeconds on the liveness probe.

How to diagnose in 60 seconds

Start with kubectl describe pod:

kubectl describe pod <pod-name> -n <namespace>

Look at two sections of the output:

  1. Events at the bottom: look for Unhealthy, Killing, and BackOff reasons.
  2. Container state: check the restart count and Last State termination reason.

Then pull recent cluster events sorted by time:

kubectl get events --sort-by='.lastTimestamp' -n <namespace>

Common event messages and what they mean:

  • Liveness probe failed: HTTP probe failed with statuscode: 404 — the probe path does not exist on this container. The application may have changed its health endpoint.
  • Readiness probe failed: connection refused — the container is not yet listening on the probe port. initialDelaySeconds may be too short.
  • Unhealthy with an increasing restart count — the liveness probe is failing after startup, likely due to a timeout, misconfigured path, or CPU throttling.

Root causes

1. Wrong probe path or port

The most common cause. The application moved its health endpoint from /healthz to /health but the probe spec was not updated. To verify what the container actually exposes:

kubectl exec -it <pod-name> -n <namespace> -- curl localhost:8080/health

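Replace 8080 and /health with the port and path your application uses; this check assumes curl is installed in the container image. If it returns connection refused or a status code outside the 200-399 range, the probe path or port in the spec is wrong. Note that the kubelet sends HTTP probes to the pod IP, not to localhost, so an application bound only to 127.0.0.1 passes this check but still fails the probe.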
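One way to reduce drift between the container's port and the probe's port is to name the container port and reference that name from the probe. The fragment below is a minimal sketch: the container name, the image placeholder, the port name http, and the port number 8080 are illustrative assumptions, not values taken from any real manifest.

```yaml
# Sketch only: names and numbers are placeholders.
containers:
  - name: app                    # hypothetical container name
    image: <your-image>
    ports:
      - name: http               # named port, referenced by the probe below
        containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health            # must match the endpoint the app actually serves
        port: http               # resolves to the containerPort named "http"
```

With a named port, changing containerPort requires a single edit. The path still has to be kept in sync by hand.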

2. Probe timeout too aggressive

timeoutSeconds defaults to 1 second. A health endpoint that queries a database connection to verify readiness may take 2-3 seconds under normal load. The probe fails even though the application is healthy. You will see Readiness probe failed: context deadline exceeded or similar.
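For reference, the fragment below writes out the timing fields of a probe with the Kubernetes defaults made explicit, so it should behave the same as a probe that omits them. The /ready path and port 8080 are placeholders.

```yaml
readinessProbe:
  httpGet:
    path: /ready             # placeholder path
    port: 8080               # placeholder port
  initialDelaySeconds: 0     # default: start probing as soon as the container starts
  periodSeconds: 10          # default: probe every 10 seconds
  timeoutSeconds: 1          # default: a response slower than 1s counts as a failure
  successThreshold: 1        # default: one success marks the probe as passing
  failureThreshold: 3        # default: three consecutive failures mark it as failed
```

Seeing the defaults side by side makes it easier to spot that a 1-second timeout is the field most likely to be too tight for an endpoint that checks dependencies.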

3. initialDelaySeconds too short

The container is probed before it finishes initializing. The liveness probe fails, Kubernetes kills the container, and the cycle repeats. This is one of the primary causes of CrashLoopBackOff on first deploy or after a pod reschedule.

This is especially common with:

  • Java/JVM applications (30-90 second startup times are normal)
  • Applications that run database migrations on startup
  • Services that wait for a sidecar to be ready before accepting connections

4. CPU throttling causing probe timeouts

If a container is at or near its CPU limit, the health endpoint may not respond within timeoutSeconds. The probe fails even though the application is functioning. The failure appears intermittent and correlates with high-traffic periods.

Check CPU usage (kubectl top requires metrics-server to be installed in the cluster):

kubectl top pod <pod-name> --containers -n <namespace>

If the container is consistently at its CPU limit, the probe timeouts are a symptom. The underlying cause is a CPU limit that is too low for the workload.

5. Dependency failure surfaced through readiness probe

The readiness probe checks a downstream dependency (database, cache, message broker) that is temporarily unavailable. The pod becomes unready and is removed from load balancing. This is the correct behavior when the dependency is genuinely down, but it can cause cascading traffic removal if many pods share the same failing dependency.

Fixes

Fix a wrong path or port

Update the probe spec in your Deployment or StatefulSet manifest:

livenessProbe:
  httpGet:
    path: /health      # update to match what your app actually exposes
    port: 8080
  initialDelaySeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3

Apply the change:

kubectl apply -f <your-deployment.yaml>

Fix timeout issues

Increase timeoutSeconds to give the health endpoint time to respond:

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  timeoutSeconds: 5
  failureThreshold: 3
  periodSeconds: 10

With failureThreshold: 3 and periodSeconds: 10, the probe must fail three consecutive times before the pod is marked unready, which gives roughly 30 seconds of grace for transient slow responses. This is appropriate for health endpoints that check downstream dependencies.

Fix initialDelaySeconds

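Base initialDelaySeconds on your observed worst-case cold start time, not a guess. To measure it, find when the container started and compare that timestamp with the moment your application logs that it is ready to serve:

kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Started"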

For Java applications, set initialDelaySeconds to at least 30-60 seconds. For a more robust approach, use a startup probe with a high failureThreshold instead of inflating initialDelaySeconds:

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30    # 30 * 10s = 5 minutes maximum startup window
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  timeoutSeconds: 5
  failureThreshold: 3
  periodSeconds: 10

The liveness probe only activates after the startup probe succeeds. The container has up to 5 minutes to start; if the startup probe has not succeeded by then, Kubernetes kills the container and restarts it according to the pod's restartPolicy.

Fix CPU throttling

If kubectl top pod --containers shows a container consistently at its CPU limit, you have two options:

  1. Increase the CPU limit in the container's resource spec.
  2. Reduce probe frequency: increase periodSeconds so probes fire less often. This lowers the load the probes add, but it also slows failure detection and does not fix an undersized limit.

For the second option:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 30     # probe every 30s instead of the default 10s
  timeoutSeconds: 5
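For the first option, the CPU limit lives under resources in the container spec. The fragment below is a sketch; the container name and every value in it are hypothetical and should be derived from observed usage rather than copied.

```yaml
containers:
  - name: app               # hypothetical container name
    image: <your-image>
    resources:
      requests:
        cpu: 500m           # illustrative value
        memory: 512Mi       # illustrative value
      limits:
        cpu: "1"            # raised CPU limit; illustrative value
        memory: 512Mi       # illustrative value
```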

Prevention

Base initialDelaySeconds on measured startup time. The single most common mistake is setting initialDelaySeconds: 10 on an application that takes 45 seconds to start. Measure the actual startup time during development and set the value accordingly.

Separate liveness and readiness probes. They answer different questions and should have different thresholds. A readiness probe can be more sensitive (fail on a slow dependency response) because failing it only stops traffic and does not restart the container. A liveness probe should be conservative (fail only on true deadlocks) because failure triggers a restart.
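A sketch of what this separation could look like follows. It assumes the application exposes two endpoints: /health, which only confirms the process can respond, and /ready, which also checks dependencies. Both paths, the port, and all thresholds are illustrative.

```yaml
livenessProbe:
  httpGet:
    path: /health          # assumed: cheap check, no dependency calls
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6      # conservative: about 60s of consecutive failures before a restart
readinessProbe:
  httpGet:
    path: /ready           # assumed: also verifies downstream dependencies
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 2      # sensitive: about 10s of consecutive failures before traffic stops
```

Keeping dependency checks out of the liveness endpoint means a database outage takes pods out of rotation without also restarting them.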

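Test probes before you deploy. Run the container locally with the probe port published, then curl the probe endpoint (the example assumes the application listens on port 8080):

docker run --rm -d -p 8080:8080 --name probe-test <your-image>
curl -s -o /dev/null -w "%{http_code}\n" localhost:8080/health
docker stop probe-test

If it does not return a status code in the 200-399 range once the application has finished starting, the HTTP probe will fail in the cluster.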

Alert on restart count, not just CrashLoopBackOff. CrashLoopBackOff means Kubernetes has already backed off after repeated failures. A restart count above 3 in a rolling window is an early signal worth alerting on:

- alert: PodRestartingFrequently
  expr: |
    increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 5m
  labels:
    severity: warning
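The rule above is a fragment. A Prometheus rule file wraps rules in a groups list, and the sketch below shows one possible layout. It assumes kube-state-metrics is being scraped, since that is what exports kube_pod_container_status_restarts_total; the group name and the annotation text are made up for illustration.

```yaml
groups:
  - name: pod-restarts            # hypothetical group name
    rules:
      - alert: PodRestartingFrequently
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          # Illustrative text; pod and container are labels on the metric.
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted more than 3 times in the last hour"
```

If you run the Prometheus Operator, the same groups list goes under spec in a PrometheusRule resource.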

Frequently asked questions

What is the difference between a liveness probe and a readiness probe?
Liveness: if it fails, Kubernetes kills and restarts the container. Readiness: if it fails, Kubernetes stops sending traffic to the pod. A pod can be alive but not ready.

Why does my pod keep restarting with a liveness probe failure?
The most likely causes are a probe path that changed after a deploy, an initialDelaySeconds value that is too short for the container's startup time, or the container being CPU-throttled so the health endpoint cannot respond in time. Check kubectl describe pod for the exact failure message.

Can a readiness probe failure cause 502 errors?
Yes. When all pods in a Service fail their readiness probe, the Service has zero endpoints. Requests reach the load balancer but are not forwarded, returning 502 or 503.