Debugging 502 and 503 errors in Kubernetes

HTTP 502 (Bad Gateway) and 503 (Service Unavailable) in Kubernetes indicate that a request reached a proxy or load balancer but could not be forwarded to a healthy upstream. The cause is almost always one of four things: unhealthy pods failing readiness probes, connection pool exhaustion, a rollout in progress, or a service configuration mismatch. A handful of kubectl commands is usually enough to tell which one you are dealing with.

502 vs 503: what each code actually means

The two codes are easy to conflate but they point to different failure modes.

502 Bad Gateway means the proxy (your Ingress controller, an API gateway, or a sidecar) successfully connected to an upstream pod but received an invalid or empty response. The pod is reachable but not behaving correctly. Common triggers: the pod is still starting, it is in a crash loop, or it returned a raw TCP reset instead of a valid HTTP response.

503 Service Unavailable means the proxy found no healthy backend to forward the request to. There are zero ready endpoints. The pod may not exist yet, all pods may be failing their readiness checks, or the Service selector does not match any running pod.

The distinction matters because it changes where you look first. A 502 sends you to the pod logs. A 503 sends you to endpoints and selectors.

Diagnose quickly

Start with endpoints. This single command tells you whether Kubernetes has any healthy pods registered for the Service:

kubectl get endpoints <service-name>

If the ENDPOINTS column is empty or shows <none>, you have a 503 scenario. Every request will fail until at least one pod passes its readiness probe.

Next, confirm the Service is targeting the right pods:

kubectl describe service <service-name>

Look at the Selector field. It must exactly match the labels on your pods.

Fix: failing readiness probes

If endpoints are empty, inspect the pod:

kubectl describe pod <pod-name>

Scroll to the Conditions section and look for Ready: False. Then check the Events section for readiness probe failure messages.

Two issues account for the majority of cases:

  1. The probe path changed after a deploy. If your readiness probe checks /healthz but a new version moved that endpoint to /health, every pod will fail its check and the Service will have zero ready endpoints.
  2. Startup takes longer than initialDelaySeconds. Kubernetes waits initialDelaySeconds after the container starts before it runs the first probe. If your application takes 30 seconds to initialize and initialDelaySeconds is 10, the probe fails until initialization finishes, and the pod stays out of the Service's endpoints in the meantime. Increase initialDelaySeconds or add a startupProbe to gate readiness probing until the application is fully initialized.

A readiness probe with a longer initial delay looks like this:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3
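
The startupProbe option mentioned above can be sketched as follows. This is a minimal fragment, assuming the same /health path and port 8080 as the readiness probe; the thresholds are illustrative values, not recommendations. While a startup probe is configured and has not yet succeeded, Kubernetes does not run the readiness or liveness probes for that container.

```yaml
# Container-level fragment: place under spec.containers[] in the pod template.
# Path and port mirror the readiness probe above; thresholds are illustrative.
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 12   # allows up to 12 x 5s = 60s for the application to start
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```

If the startup probe never succeeds within failureThreshold x periodSeconds, the container is killed and handled according to the pod's restartPolicy, so size the window to cover the slowest startup you expect.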

Fix: connection pool exhaustion

Symptoms: intermittent 502 or 503 errors that appear under load but not at low traffic. The pod logs may show no errors at all because the failure happens at the Ingress layer before the request reaches the pod.

Check your Ingress controller's upstream connection settings. For ingress-nginx, the relevant setting is upstream-keepalive-connections, which is a key in the controller's ConfigMap rather than a per-Ingress annotation. For Envoy-based controllers, look at the max_connections and max_pending_requests circuit-breaker thresholds in the cluster configuration.
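
For ingress-nginx, upstream keepalive is tuned through the controller's ConfigMap. The fragment below is a sketch only: the ConfigMap name and namespace are assumptions based on a default installation and may differ in your cluster, and the values are illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # assumption: name used by a default install
  namespace: ingress-nginx         # assumption: namespace used by a default install
data:
  # Idle keepalive connections to upstreams kept open per worker process.
  upstream-keepalive-connections: "512"
  # Seconds an idle keepalive connection to an upstream stays open.
  upstream-keepalive-timeout: "60"
```

ConfigMap values must be strings, hence the quotes. Check which ConfigMap your controller actually reads (its --configmap argument) before editing.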

Increasing limits is only part of the fix. Also enable connection draining so that connections are not dropped abruptly when a pod scales down or is replaced.

Fix: rolling deploy causing 502

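During a rolling update, Kubernetes terminates old pods while new ones come up. Removing a terminating pod from the Service's endpoints and from the load balancer happens asynchronously, so if an old pod shuts down before the load balancer has stopped routing to it, or before its in-flight requests finish, those requests receive a 502.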

The fix is a preStop lifecycle hook:

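# Container-level field (spec.containers[].lifecycle)
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]
# Pod-level field (spec.terminationGracePeriodSeconds)
terminationGracePeriodSeconds: 30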

The sleep gives the load balancer time to stop routing new requests to the pod before it shuts down. Set terminationGracePeriodSeconds higher than the sleep duration so Kubernetes does not kill the pod before the sleep completes.

For nginx-based pods, send a graceful shutdown signal instead of a raw sleep:

lifecycle:
  preStop:
    exec:
      command: ["/usr/sbin/nginx", "-s", "quit"]

Fix: service selector mismatch

A selector mismatch is one of the most common 503 root causes after a rename or a label change in a Deployment manifest.

List your pods with their labels:

kubectl get pods --show-labels

Then compare against the Service selector:

kubectl get service <service-name> -o jsonpath='{.spec.selector}'

Every key-value pair in the selector must appear in the pod's labels. A single typo or a mismatched label value (app: my-service vs app: my-service-v2) will result in zero endpoints.
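
As an illustration of a matching pair, the sketch below shows a Service and a Deployment whose labels line up. The names, ports, and image are hypothetical; only the relationship between spec.selector on the Service and template.metadata.labels on the Deployment matters.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-service        # must appear on every pod the Service should route to
  ports:
    - port: 80
      targetPort: 8080     # hypothetical container port
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service    # the pod label the Service selector matches
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.0.0   # hypothetical image
          ports:
            - containerPort: 8080
```

Pods may carry extra labels beyond those in the selector; the Service only requires that every selector pair is present.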

Prevention

Catching these issues before they reach production is straightforward:

  • Always configure readiness probes. A pod without a readiness probe is considered ready as soon as its containers start, even if the application has not finished initializing. Traffic then reaches containers that cannot answer yet, which is a common cause of 502 and 503 errors during deploys.
  • Set minReadySeconds on your Deployment. A new pod must then stay ready for this minimum time before the Deployment counts it as available and continues the rollout. This keeps a pod that becomes ready and then quickly fails from replacing a healthy one, and gives dependent systems time to register the new endpoint before more old pods are removed.
  • Use PodDisruptionBudgets. A PDB prevents Kubernetes from taking down too many pods at once during voluntary disruptions (node drains, cluster upgrades). Without one, a drain operation can temporarily leave your Service with zero ready endpoints.

A minimal PodDisruptionBudget for pods labeled app: my-service:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-service
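
The minReadySeconds setting from the Prevention list can be sketched as a Deployment fragment. The replica count, image, and the 10-second value are illustrative assumptions; choose a value that reflects how long your load balancer takes to pick up new endpoints.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  minReadySeconds: 10      # a new pod must stay ready for 10s before it counts as available
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.0.0   # hypothetical image
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
```

minReadySeconds only has an effect when readiness is meaningful, so it is paired here with a readiness probe.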

For systematic root-cause analysis across 502/503 and other Kubernetes failure patterns, see the AI SRE Benchmark to understand how automated tools perform against these scenarios at scale.

Frequently asked questions

Why do 502 errors appear during a rolling deploy?
Kubernetes sends traffic to new pods as soon as they pass readiness, but the load balancer may still be routing requests to old pods that are already shutting down. A preStop lifecycle hook with a short sleep keeps the old pod serving until routing has moved away and in-flight requests have drained.

What is the difference between a liveness probe and a readiness probe?
A liveness probe answers the question: should Kubernetes restart this container? A readiness probe answers: should Kubernetes send traffic to this container? A pod can be live but not ready.

Can an Ingress misconfiguration cause 503 errors?
Yes. If the Ingress backend service name or port does not match an existing Service, all requests return 503 because there are no valid endpoints to forward to.
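
As a sketch of what "matching" means here, the Ingress below points at a Service named my-service on port 80. The host, ingress class, and port number are hypothetical; the two commented fields are the ones that must agree with the Service.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
spec:
  ingressClassName: nginx              # assumption: an ingress-nginx controller is installed
  rules:
    - host: my-service.example.com     # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service       # must equal the Service's metadata.name in the same namespace
                port:
                  number: 80           # must equal a port listed in the Service's spec.ports
```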