
The 5xx family: 500 / 502 / 503 / 504 / 520 and what each one is actually telling you

Every 5xx code is a server's way of saying 'something went wrong on our end.' But they are not interchangeable. 502 means upstream; 504 means upstream-but-slow; 503 means deliberately unavailable; 520 means Cloudflare-doesn't-know. Knowing which one you got cuts the bug-hunt by more than half.

StatusDetector · May 11, 2026 · 12 min read

"500 Internal Server Error" is the generic shrug. "502 Bad Gateway" tells you something specific: a proxy in front of the application got a bad answer from the application. "503 Service Unavailable" is the application saying I know I'm not ready and I'm refusing you on purpose. "504 Gateway Timeout" is the proxy reporting that the application took too long. "520" is one of Cloudflare's invented codes for when its own error-handling can't fit the actual problem into a standard status code.

These five carry most of the production debugging signal. Telling them apart cuts down the bug-hunt enormously.

500 Internal Server Error

The application hit an unexpected exception: either its own error handler caught it and returned a 500, or nothing caught it and the runtime did. Whatever was supposed to handle the request didn't. The status code says nothing useful beyond "the error came from inside the application."

What it usually means

  1. Uncaught exception in application code. Null pointer, missing config, DB connection failure that wasn't gracefully retried. The fix is in your application logs, not the response.
  2. Misconfiguration. Application started but a critical environment variable is wrong. Database URL pointing at a stale endpoint, secret rotated and not redeployed.
  3. Resource exhaustion that returns 500 instead of 503. Some frameworks return 500 when they run out of memory because they can't get far enough into the request-handling pipeline to return the more-correct 503.

Diagnostic

Look at the response body — many frameworks include a useful error message in development that's stripped in production. Look at the response headers — X-Request-Id or similar will let you find the exact log line. Look at the application logs around the request's timestamp.

The browser-side Network tab tells you almost nothing for 500s. The server-side logs are where the answer lives.
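
A minimal first pass, assuming a hypothetical URL, a hypothetical log path, and that your stack emits an X-Request-Id header (substitute whatever correlation header you actually use):

Terminal
$ curl -sD - -o /dev/null https://app.example.com/broken-endpoint | grep -i 'x-request-id'
$ grep '<request id from the response>' /var/log/app/production.log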

502 Bad Gateway

A proxy (your CDN, nginx, AWS ALB, Cloudflare) made a request to your application's upstream server and got back something that wasn't a valid HTTP response. The proxy is reporting that it couldn't talk to the thing behind it.

What it usually means

  1. Upstream is down or unreachable. Most common case. Application server crashed; container is in CrashLoopBackOff; the proxy can't open a TCP connection.
  2. Upstream returned malformed HTTP. The application accepted the connection but wrote bytes the proxy couldn't parse. Less common; usually a bug in custom HTTP code or a protocol mismatch (e.g. the proxy sending HTTP/2 to an upstream that only speaks HTTP/1.1).
  3. Connection reset mid-response. Application closed the socket before sending all the body the headers promised.
  4. SSL termination mismatch. Proxy is trying to talk HTTPS to an upstream that only speaks HTTP, or vice versa.

Diagnostic

If the proxy is yours (you run nginx), check its error log — nginx logs the exact upstream error. If the proxy is managed (Cloudflare, AWS, Vercel), the platform's analytics will distinguish "upstream unreachable" from "upstream returned bad response."

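One quick check is to ask the upstream directly and compare its answer with the proxy's, assuming a hypothetical origin address and port (substitute your own):

Terminal
$ curl -v http://203.0.113.10:8080/healthz
$ curl -vk https://203.0.113.10:8080/healthz

Trying both plain HTTP and HTTPS against the same backend also surfaces the SSL-termination mismatch from the list above.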

If curl to the upstream works but the proxy says 502, the proxy's view of the upstream differs from yours — usually a routing or SSL issue.

503 Service Unavailable

The application is alive and conscious but is deliberately refusing the request. Either it's in maintenance mode, it's protecting itself from overload, or a circuit breaker upstream of it has tripped.

What it usually means

  1. Maintenance mode. Operators turned the application "off" intentionally. There's often a Retry-After header indicating when it'll be back.
  2. Rate limiting / overload protection. Application is up but refusing new requests to protect itself. Common during traffic spikes.
  3. Health-check failure. The load balancer has marked the instance unhealthy but hasn't finished draining it from the upstream pool, so some requests still hit it briefly.
  4. Circuit breaker tripped. A dependency (DB, cache, downstream service) is down; the application is refusing requests because it knows they'll fail.

Diagnostic

Check the Retry-After header — its presence and value tell you whether this is an intentional "back at X" or a bare 503 with no useful guidance.

If your app sends a custom body on 503, read it. Most observability platforms track 503s separately from 500s precisely because a 503 is an intentional refusal rather than a crash.

504 Gateway Timeout

A proxy made a request to your upstream and the upstream didn't respond within the proxy's timeout window. This is different from 502, where the proxy got no usable response at all: a 504 means the connection was made and the upstream simply didn't answer fast enough.

What it usually means

  1. Slow database query. Most common case in API backends. A query that used to be fast is now slow because an index dropped, the table grew, or a different query is locking the table.
  2. External dependency latency. Your app calls a third-party API that's slow today. Your timeouts and theirs are different; theirs eventually responds but yours fired first.
  3. Long-running synchronous work. PDF generation, video transcoding, large file uploads — anything where the "request" is actually 30+ seconds of synchronous server work. Should be async, but isn't yet.
  4. Idle-connection timeout on a kept-alive socket. The application is fine, but the specific TCP connection the proxy kept open has been idle long enough that the kernel reaped it.

Diagnostic

Check what the proxy's timeout is set to. Common defaults: AWS ALB 60 seconds, Cloudflare 100 seconds, nginx 60 seconds. Check whether your application is taking that long.
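
curl's timing variables show how close a request gets to that window; the URL here is hypothetical:

Terminal
$ curl -s -o /dev/null -w 'status=%{http_code} total=%{time_total}s\n' https://app.example.com/slow-report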

For databases: SELECT * FROM pg_stat_activity WHERE state != 'idle' (PostgreSQL) shows what's currently running.
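
Expanded into a runnable one-liner, assuming PostgreSQL and psql access to the affected database (connection flags omitted):

Terminal
$ psql -c "SELECT pid, now() - query_start AS runtime, state, left(query, 60) AS query FROM pg_stat_activity WHERE state != 'idle' ORDER BY runtime DESC;"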

Cloudflare-specific 5xx (520–527)

Cloudflare invented its own codes for situations the standard 5xx codes don't describe well. They tell you which side of Cloudflare's edge the problem is on.

The common ones: 520 Web Server Returns an Unknown Error (the origin sent back something empty, malformed, or otherwise uninterpretable), 521 Web Server Is Down (the origin refused the connection), 522 Connection Timed Out (Cloudflare couldn't establish a TCP connection to the origin in time), and 524 A Timeout Occurred (the connection succeeded but the origin didn't finish an HTTP response within Cloudflare's window). A few less-common ones: 523 Origin Is Unreachable (routing issue between Cloudflare and origin), 525 SSL Handshake Failed (origin's cert is wrong from Cloudflare's view), 526 Invalid SSL Certificate (origin's cert chain broken), 527 Railgun Listener to Origin Error (Railgun service issue — increasingly rare since Railgun was deprecated).
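
For 525 and 526 in particular, inspecting the origin's certificate directly (bypassing Cloudflare) usually settles it; the IP below is a placeholder for your origin server:

Terminal
$ openssl s_client -connect 203.0.113.10:443 -servername example.com </dev/null 2>/dev/null | openssl x509 -noout -issuer -dates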

The Cloudflare-specific codes are documented in their troubleshooting guide. They're frequent enough in real ops work to be worth memorising.

The decision tree

When a 5xx code appears, ask:

  1. Where is the code from? Your own app server (visible in your access logs), your CDN, or both? The Server: and Via: response headers usually answer this.
  2. Is the upstream reachable? Bypass the CDN and curl the origin directly (a sketch follows this list). If it works direct but fails through the CDN, the CDN-to-origin layer is the bug.
  3. Is the error consistent or intermittent? Consistent → likely a deployment or configuration issue. Intermittent → load, network, or dependency.
  4. What's the timing? 504 with the proxy's timeout value (e.g. exactly 60s) → upstream is slow, not absent. 5xx with sub-second response time → the application itself returned the error quickly.
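
A sketch of steps 1 and 2, assuming a hypothetical hostname and origin IP:

Terminal
$ curl -sD - -o /dev/null https://www.example.com/ | grep -iE '^(server|via|x-served-by):'
$ curl -sv -o /dev/null https://www.example.com/ --resolve www.example.com:443:203.0.113.10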

The Retry-After convention

Both 503 and 429 (Too Many Requests) can carry a Retry-After header. It can be either a number of seconds or an HTTP-date. Respecting it on the client side is a significant improvement to system behaviour during incidents — clients that hammer through a 503 with no backoff are part of the problem.

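To see whether a given 503 (or 429) carries the header, dump the response headers; the URL is hypothetical:

Terminal
$ curl -sD - -o /dev/null https://api.example.com/v1/orders | grep -iE '^(HTTP|retry-after)'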

If you're building a service, send it on 503 and 429 even with a generic default. If you're consuming a service, respect it.

The single best signal during an incident

When you're investigating a live 5xx incident: filter your access logs by status code first. Don't grep for keywords. The mix of codes tells you immediately whether the issue is one specific failure mode or a cascade.

  • All 502s and nothing else → upstream is down.
  • All 504s and nothing else → upstream is slow.
  • All 503s and nothing else → application is intentionally refusing.
  • Mix of 500s and 502s → application is crashing inconsistently.
  • Mix of 5xx codes across many services → infrastructure problem (the platform, not your app).

This sounds obvious. In practice, the first thing engineers do during an incident is read individual log lines, which buries the signal in noise.
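
A first-pass count, assuming an nginx-style combined access log where the status code is the ninth whitespace-separated field (adjust the field and path for your format):

Terminal
$ awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head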

Frequently asked

Why does my CDN return 5xx for some users and 200 for others on the same URL?

Most likely: the request is going through different CDN edge nodes that have different views of the origin. One edge has a healthy connection cached; another tried recently and failed and is in a "cool-down" period. Less likely but possible: geo-targeted DNS is pointing users at different origins, only one of which is healthy.

My application returns 500 but I never see it in logs. What's happening?

Common causes: (1) logs are buffered and the process crashed before flushing; (2) the request is failing in the framework / runtime before your code runs (auth middleware, body parser); (3) logs are going somewhere you're not looking — many platforms split request logs from application logs and the answer is in the other one.

Should I always retry on a 5xx?

On 502 / 503 / 504 with a Retry-After header — yes, after waiting the indicated time. On a bare 5xx — once, with exponential backoff. Not three times in a tight loop, which is the default behaviour of many HTTP libraries. Idempotency matters: only retry operations that are safe to repeat (GET, idempotent PUTs). POST retries can double-charge customers.
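
With curl as the client, a reasonably polite retry looks like the sketch below; recent curl releases treat 429 and most 5xx responses as transient when --retry is set, back off between attempts, and honour a Retry-After header (check your version's man page before relying on this):

Terminal
$ curl --retry 3 --retry-max-time 120 https://api.example.com/v1/orders/42

The example is a GET, which is safe to retry; the same flags on a POST carry the double-charge risk described above.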

How do I distinguish a real outage from a degraded state?

The mix of codes again. A real outage: most requests return the same code (typically 502, 503, or 504). A degraded state: some succeed, some don't, error rate is elevated above baseline but not 100%. Watch for the rate of 5xx in your monitoring; absolute numbers are less informative than the percentage of total traffic.

Tools that help

  • Website Down Checker — probes a URL, surfaces the exact response code, and identifies the responding server (origin vs CDN). Useful first-pass diagnostic when a user reports "it's broken."
  • curl -v — the verbose flag dumps the full request and response, including which proxy hops are in the chain. Most incident debugging starts with curl -v against the affected URL.
  • Browser error code reference — for the cases where the issue is in the browser-server connection, not the server itself.
  • HTTP status code reference — every status code with plain-English meaning, including the 4xx and 5xx ranges.

The frame

5xx codes are a server's way of telling you which part of the server failed. The codes are precise enough that the right code, on its own, narrows the cause to a handful of candidates. The wrong instinct is to treat all 5xx as "the site is broken." The right instinct is to read the code first, then the body, then the logs — in that order.

StatusDetector

We check whether a website, app, API, or domain is working, broken, expired, parked, or permanently shut down. Free, no signup — run a check or open the shutdown radar.