What `429 Too Many Requests` actually means, and how rate limits really work
Most rate-limit explanations stop at 'you went too fast.' That's true but not useful. Real rate limiters work in five different ways — token bucket, leaky bucket, fixed window, sliding window, concurrency cap — and the headers tell you which one you're hitting, when to retry, and how much budget you have left.
When an API returns 429 Too Many Requests, the client's first instinct is often to assume "I went too fast." That's directionally right, but to actually do something useful about it you need to know which kind of rate limit you tripped. There are five common designs in production today and they behave very differently. The right response — sleep, retry, change concurrency, change request pattern — depends entirely on the design.
This post walks the five common rate-limit designs, the standard headers they communicate through, and what to do when you're on the receiving end.
The five common rate-limit algorithms
Token bucket (most common)
The server maintains a "bucket" of N tokens per client. Each request consumes one token. Tokens refill at a fixed rate (e.g. 10 per second). When the bucket is empty, requests are rejected with 429.
Behaviour you'll see. Bursts up to N work fine. Sustained traffic above the refill rate gets rejected. After a 429, waiting briefly refills enough tokens to continue.
Examples. Stripe API, AWS APIs, GitHub API.
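A server-side token bucket is only a little state per client. A minimal sketch in Python (illustrative, not any particular provider's implementation):

```python
import time

class TokenBucket:
    """Bucket of `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # empty bucket: the caller would respond 429
```

Bursts up to `capacity` pass immediately; sustained traffic above `rate` drains the bucket and starts getting `False`.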
Leaky bucket
Inverted token bucket. Requests enter a queue of fixed depth. The server processes requests at a fixed rate, "leaking" them out. Excess requests over the queue depth get rejected.
Behaviour you'll see. Smooths bursts at the cost of latency. Less common as an outright rejection mechanism; more common as a request-shaping layer.
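For contrast, a leaky bucket in the same spirit: a fixed-depth queue plus a fixed-rate drain (a sketch; the timer that calls `leak()` is left out):

```python
import collections

class LeakyBucket:
    """Fixed-depth queue drained at a constant rate by an external timer."""

    def __init__(self, depth: int):
        self.depth = depth
        self.queue = collections.deque()

    def submit(self, request) -> bool:
        if len(self.queue) >= self.depth:
            return False  # queue full: reject (429) or shed the request
        self.queue.append(request)
        return True

    def leak(self):
        """Call on a fixed-rate timer; returns the next request to serve, if any."""
        return self.queue.popleft() if self.queue else None
```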
Fixed window
The server counts requests in non-overlapping time windows (e.g. 100 per minute, reset at :00 of each minute). When the counter hits the limit, all further requests in the window get 429.
Behaviour you'll see. Easy to exploit: you can make 100 requests in the last second of one window and 100 in the first second of the next, doubling the apparent rate in two seconds. Cheap to implement, which is why it's still common despite the burst problem.
Examples. Many internal APIs, simple Redis-backed rate limiters.
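The classic Redis-backed version is one counter per client per window (a sketch assuming `redis-py`; the key scheme is illustrative, and production code would wrap the INCR/EXPIRE pair in a pipeline or Lua script to avoid the race between them):

```python
import time

import redis  # assumes redis-py and a reachable Redis instance

r = redis.Redis()

def allow(client_id: str, limit: int = 100, window_secs: int = 60) -> bool:
    # One counter per client per fixed window; the key dies with the window.
    key = f"rl:{client_id}:{int(time.time() // window_secs)}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, window_secs)
    return count <= limit
```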
Sliding window
Like fixed window, but the window slides continuously: the count is weighted by where in the window each request fell. Smoother than fixed window, and it resists the burst-edge exploit.
Behaviour you'll see. Predictable. The reset time is meaningful — once it ticks past, you have full budget again.
Examples. Cloudflare rate limiting rules, most modern API gateways.
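The usual approximation keeps two fixed-window counters and weights the previous one by how much of it still overlaps the sliding window (a sketch; the variable names are mine):

```python
def sliding_count(prev_count: int, curr_count: int,
                  window: float, elapsed_in_curr: float) -> float:
    """Estimate requests in the last `window` seconds from two fixed-window
    counters, weighting the previous window by its remaining overlap."""
    overlap = 1.0 - (elapsed_in_curr / window)
    return prev_count * overlap + curr_count
```

With a 60-second window, 100 requests in the previous window, 30 in the current one, and 15 seconds elapsed, the estimate is 100 * 0.75 + 30 = 105.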
Concurrency cap
Not about rate but about simultaneous requests. The server allows N concurrent requests per client. The (N+1)th gets 429 until one of the existing N completes.
Behaviour you'll see. Bursts of cheap requests work fine. A single slow request can block other requests. Some APIs use this for expensive endpoints (large file uploads, AI inference) while using token-bucket on cheap ones.
Examples. AI inference APIs (OpenAI, Anthropic) often have both concurrency and token-bucket limits.
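Server-side, a concurrency cap is essentially a semaphore per client (a sketch; a real server releases the slot when the request completes):

```python
import threading

class ConcurrencyCap:
    """Allow at most `n` simultaneous in-flight requests."""

    def __init__(self, n: int):
        self.sem = threading.Semaphore(n)

    def try_acquire(self) -> bool:
        # Non-blocking: if all n slots are occupied, reject with 429.
        return self.sem.acquire(blocking=False)

    def release(self):
        self.sem.release()
```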
The signalling headers (read these first)
When a server returns 429 (or even just might return 429 — many APIs include rate-limit headers on every response), four headers are the de facto standard:
- `Retry-After`: how long to wait before retrying, as seconds or an HTTP date.
- `X-RateLimit-Limit`: the total request budget for the current window.
- `X-RateLimit-Remaining`: how much of that budget is left.
- `X-RateLimit-Reset`: when the budget refills.
There is no IETF standard for the X-RateLimit-* headers. The IETF draft RateLimit header fields for HTTP (draft-ietf-httpapi-ratelimit-headers) is an attempt to standardise them, but it isn't yet ubiquitous. In practice, every major API uses some variant of the four headers above, but the units (seconds-from-now vs Unix timestamp vs ISO date) differ.
The right way to retry
Linear retries — wait 1s, try again, wait 1s, try again — are wrong. Two reasons:
- Synchronised retries make outages worse. Every client retrying at the same interval after a service degradation creates a herd. The service comes back, gets crushed by the herd, fails again. Add a random component (jitter).
- Exponential backoff is dramatically more efficient. Doubling the interval on each retry means that after a minute you've tried 6 times instead of 60, and during a real outage the extra 54 attempts add load without meaningfully improving your odds of success.
The canonical pattern:
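A sketch in Python with `requests` (treat it as the shape, not a drop-in client):

```python
import random
import time

import requests

def get_with_backoff(url: str, max_attempts: int = 6,
                     base: float = 1.0, cap: float = 60.0) -> requests.Response:
    """GET with exponential backoff, full jitter, and Retry-After support."""
    for attempt in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None:
            # Respect the server's explicit instruction (assumes the seconds
            # form; some servers send an HTTP date instead).
            delay = float(retry_after)
        else:
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```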
> The key points: respect Retry-After when present; use full jitter (random in [0, delay]) rather than delay + jitter, which still synchronises clients; cap the maximum delay so you don't end up waiting for an hour.
Five real-world rate-limit profiles
GitHub API
Token bucket. 5,000 authenticated requests per hour for the REST API; secondary rate limits for some endpoints. Headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset (Unix timestamp), X-RateLimit-Used.
On 429, GitHub also sends Retry-After for the secondary rate limit. Primary limit hits return a 403 with a "rate limit exceeded" message, not 429.
Stripe API
Token bucket. 100 read operations per second; 100 write operations per second (separate buckets). Headers: standard Retry-After on 429.
Stripe is unusual in that they document the rate limits but generally encourage you to retry — their infrastructure is designed to absorb retries gracefully.
OpenAI / Anthropic API
Multiple limits stacked: requests per minute (RPM), tokens per minute (TPM), requests per day, plus concurrency. Hitting any one returns 429.
Headers: x-ratelimit-limit-requests, x-ratelimit-limit-tokens, x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ratelimit-reset-requests, x-ratelimit-reset-tokens.
The presence of two separate budgets means you can be under the RPM limit but over the TPM limit, or vice versa. Watch both.
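A sketch of watching both budgets (header names as above; treating a missing header as remaining budget is a simplification):

```python
def should_pause(headers: dict) -> bool:
    # Two independent budgets: exhausting either one returns 429.
    remaining_requests = int(headers.get("x-ratelimit-remaining-requests", 1))
    remaining_tokens = int(headers.get("x-ratelimit-remaining-tokens", 1))
    return remaining_requests == 0 or remaining_tokens == 0
```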
AWS APIs
Token bucket per API per account per region. Limits are not always documented per-endpoint. AWS encourages exponential backoff with jitter and provides reference implementations in every SDK.
The AWS SDKs (boto3, AWS SDK for JavaScript, etc.) have rate-limit-aware retry built in. If you're writing custom HTTP code against AWS APIs, you're likely re-implementing what the SDK already does correctly.
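In boto3, for example, retry behaviour is a config knob rather than custom code (the "adaptive" mode also rate-limits client-side based on observed throttling):

```python
import boto3
from botocore.config import Config

# Built-in retry modes already implement exponential backoff with jitter.
config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
s3 = boto3.client("s3", config=config)
```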
Cloudflare-protected sites
Sliding window. Limits vary per zone / per rule. Cloudflare returns 429 (or, depending on configuration, 403 or a managed challenge page).
Importantly, Cloudflare rate-limit responses don't always include Retry-After. Inspect the body — Cloudflare's response page tells you what kind of limit was hit, even when headers are sparse.
Anti-patterns
A few things to stop doing:
"I'll just retry forever"
The number of retries needs a ceiling. After N attempts, surface the failure to the caller. Burning compute on retries against a service that's been down for an hour helps no one and creates load that delays the service's own recovery.
"I'll spread my requests evenly"
A common over-correction: the developer measures the limit (e.g. 100 RPM) and decides to send exactly one request every 600ms. This works fine until a single request is slow, the next one starts before the previous finishes, and concurrency overlaps push you into a 429. Use the headers (X-RateLimit-Remaining) to pace; don't try to be clever with timing.
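A sketch of header-driven pacing (assumes the X-RateLimit-* convention with a seconds-from-now reset; check your API's units):

```python
import time

def pace(headers: dict) -> None:
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset = float(headers.get("X-RateLimit-Reset", 0))
    if remaining == 0 and reset > 0:
        # Budget exhausted: sleep until the stated refill, not a guessed interval.
        time.sleep(reset)
```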
"I'll spin up more clients to bypass the limit"
Limits are usually per-account or per-API-key, not per-IP. Spinning up more workers with the same key doesn't help — and risks the account being suspended for "abuse" if the API operator notices.
"Caching is a rate-limit solution"
It can be, but only if the responses are cacheable for long enough to matter. A GET /api/users/me request is per-user; caching it across users doesn't help. Cache only what's actually shareable.
When the 429 isn't actually a rate limit
A few situations get classified as 429 but aren't really rate-limited in the traditional sense:
- Quota exceeded. You've used your monthly allowance, not your per-minute one. `Retry-After` may not be meaningful; the next reset is the start of the next billing period.
- Account suspended. Some APIs return 429 (rather than 401/403) when an account is in a suspended state. The "limit" never resets in the usual sense.
- Cloudflare bot challenge. Cloudflare may return 429 as part of a bot-detection flow rather than a strict rate-limit rule. Solving the challenge clears it.
Always read the response body. The error code is the headline; the body is the actual story.
The headers-first diagnostic
When you suspect you're being rate-limited, the workflow is:
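A minimal probe (Python with `requests`; the URL is a placeholder):

```python
import requests

resp = requests.get("https://api.example.com/v1/ping")  # any cheap endpoint
print(resp.status_code)
for name, value in resp.headers.items():
    # Surface anything that looks like a rate-limit signal.
    if "ratelimit" in name.lower() or name.lower() == "retry-after":
        print(f"{name}: {value}")
```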
If you see rate-limit headers, you have a budget number and a reset time. The math from there is straightforward.
If you don't see any rate-limit headers, run a small burst — say 10 requests — and watch for 429s in the responses. If you get a 429 within the first 10, you're hitting a low limit; if you don't, the limit (if any) is higher than 10 per burst.
Frequently asked
Why don't I see `Retry-After` on a 429?
Some APIs don't send it. Cloudflare often doesn't on its default rate-limit responses. Some custom limiters don't implement it. When it's missing, fall back to your own exponential-backoff schedule starting at e.g. 1 second, doubling each retry, capped at 60 seconds, with jitter.
My code respects the headers but I'm still getting 429. Why?
Three common reasons. (1) You're hitting a different limit than the one the headers describe. AI APIs often have separate RPM and TPM budgets — the headers describe one and you're exhausting the other. (2) Your retry logic is racing other workers also retrying. Add jitter. (3) The 429 is coming from an upstream layer (Cloudflare) that doesn't honour your application-layer rate-limit math.
Can I ask for a higher rate limit?
Almost always yes, if you ask the right person and give them a reason. Most API providers will increase limits for legitimate use cases — they don't want their best customers to be the most rate-limited. The trick is to ask before you hit the limit, with a concrete use case ("we expect to send X requests/minute during our Y workflow") rather than after, as a complaint.
If a service is being DDoSed, will I see 429 or 503?
Generally 503 — the service is refusing requests because of overall load, not because you specifically are over your quota. A well-configured CDN will return 429 to clients identified as the abuse source and 503 to the rest of the population. If you see 429 from a service known to be under attack, you may be incidentally in the same fingerprint bucket as attackers (same ASN, similar user-agent). Vary your headers or wait.
Tools that help
- Website Down Checker — probes a URL, surfaces all response headers including rate-limit ones. Useful for confirming what an API actually sends back.
- Status Meaning Decoder — explains every status code in plain English, with the appropriate "what to do" for each.
- HTTP status reference — full 4xx and 5xx catalogue with the specific semantics of each code.
The frame
429 is a precise message: I know who you are, I know what you've asked for, and I'm refusing because of usage rules I can articulate. That's vastly more information than a generic "server error." Read the headers, respect Retry-After, retry with exponential backoff and jitter, and almost everything else falls into place.
The rate limiter is on your side — it's protecting the service from itself, including from you. Treating its signals as instructions rather than obstacles is the difference between resilient integrations and brittle ones.