Why your status page shows green when the service is broken
You're staring at 503 errors. You open the vendor's status page. All systems operational. You aren't crazy, and the page isn't lying: it's lagging, and the lag is structural. Here's the anatomy of the gap, and what to look at instead.
The pattern repeats weekly on every developer Slack. Someone shares an error screenshot. Someone else asks "is it down?" Someone clicks the vendor's official status page. "All systems operational." The thread devolves into "the status page is lying."
The page isn't lying. It's behind. Understanding the structural reasons why is the difference between trusting it and rolling your own monitoring.
Why the page lags
There are three structural reasons a status page can be green while real users are seeing failures. Most outages involve at least two of them.
1. Manual updates are slow
The vendor on-call rotation has a playbook. The playbook says: investigate, mitigate, communicate. Communication is third in the list because it's the least urgent — fixing the problem matters more than telling the world about it.
In practice, the gap between "this is a real incident" and "the status page reflects it" runs somewhere between five minutes and an hour. The faster end belongs to well-staffed vendors with a dedicated comms on-call; the slower end to small companies where the on-call engineer is also expected to write the public update.
While that gap is open, you see errors and the page says green. That isn't dishonesty; it's prioritisation. The on-call engineer is doing the right thing.
2. Automated alarms have thresholds
Some status pages are auto-driven. A monitoring system (Prometheus, Datadog, etc.) watches health metrics and flips component status when they breach a threshold. Sounds great — until you read the threshold.
Typical alarm rules look like "5xx rate above 1% for 5 minutes." That means:
- A 0.9% error rate that's been running all day → no alarm, page green.
- A 50% error rate that lasted four minutes → no alarm, page green.
- A 100% error rate confined to one region carrying 0.5% of traffic → globally still well below 1%, page green.
The thresholds exist for a good reason — flapping alarms erode trust and drown the on-call engineer. But they create blind spots. If your traffic hits a failing edge POP, you see 100% errors and the page tells you everything's fine.
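To make those blind spots concrete, here's a rough sketch of how a window rule like that evaluates. It's a simplification, not any vendor's actual config; systems like Prometheus and Datadog express these rules declaratively, but the windowing behaves roughly this way:

```python
# Rough model of a "5xx rate above 1% for 5 minutes" alarm rule.
# One sample per minute; the alarm fires only if the rate stays
# above the threshold for `window` consecutive samples.

def alarm_fires(error_rates, threshold=0.01, window=5):
    consecutive = 0
    for rate in error_rates:
        consecutive = consecutive + 1 if rate > threshold else 0
        if consecutive >= window:
            return True
    return False

# 0.9% errors all day: never crosses the threshold -> page stays green.
print(alarm_fires([0.009] * 1440))                   # False

# 50% errors for four minutes: one sample short of the window -> green.
print(alarm_fires([0.0, 0.5, 0.5, 0.5, 0.5, 0.0]))   # False

# Region carrying 0.5% of traffic fully down: global rate 0.5% -> green.
print(alarm_fires([0.005] * 60))                     # False
```

All three runs print False, which is exactly the point: each failure is real, and none of them moves the page.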
3. The page can't see you
The vendor's view of "the service" is the aggregate signal: response codes their load balancers see, error counts their probes generate, percentiles from their telemetry. That aggregate hides per-customer pain.
Examples of failures that don't show up in vendor-side telemetry:
- Account-specific bugs. A schema migration broke records that belong to one customer tenant; everyone else is fine.
- Feature flags rolled out badly. A flag is on for 10% of users and breaks for half of them; vendor sees a 5% degradation and may or may not page.
- Geographic CDN failures. One edge POP is broken; the vendor's primary monitoring runs from a different region.
- Client-side errors. The API works fine when the vendor's monitoring calls it; their JavaScript SDK has a bug that only manifests in browsers.
- Authentication / authorisation edge cases. Your specific OAuth scope hits a bug; the vendor's smoke tests use a different scope.
In every case the failure is real but the vendor's metric is green.
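To see the aggregation problem in miniature, here's a toy example: a hypothetical request log where one tenant (called "acme" here, purely for illustration) is completely broken, yet the global error rate sits far below a 1% alarm threshold.

```python
from collections import defaultdict

# Hypothetical request log: (tenant, had_error). One tenant, "acme",
# is completely broken; every other request succeeds.
requests = [("acme", True)] * 200 + [("other", False)] * 99_800

by_tenant = defaultdict(lambda: [0, 0])   # tenant -> [errors, total]
for tenant, had_error in requests:
    by_tenant[tenant][0] += had_error
    by_tenant[tenant][1] += 1

errors = sum(e for e, _ in by_tenant.values())
total = sum(t for _, t in by_tenant.values())
print(f"global error rate: {errors / total:.2%}")   # 0.20% -> no alarm
for tenant, (e, t) in by_tenant.items():
    print(f"{tenant}: {e / t:.2%}")                 # acme: 100.00%
```

The global figure is the one the vendor's alarms watch; the per-tenant slice is the one you live in.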
What to look at instead
Three signals, ranked by usefulness when the status page says green and you suspect it's wrong:
- Independent probes. A check against the service from outside the vendor's network, free of the vendor's thresholds and aggregation.
- User-report volume. Whether other people are reporting the same failure right now.
- The vendor's page itself. Least useful while it's green, but authoritative once it flips.
The combination matters more than any single source. When third-party probes show failures, user-report volume spikes, and the vendor page is still green — the vendor is almost certainly behind the curve. Wait twenty minutes; the page usually updates.
When third-party probes are clean, user-report volume is flat, and the vendor page is green — the problem is local to you or your specific setup. Time to debug your network, your client code, or your auth.
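Those two paragraphs are really a small decision table. A minimal sketch, with hypothetical boolean inputs standing in for real probe results, report counts, and the vendor's feed:

```python
def diagnose(probe_failing: bool, reports_spiking: bool, vendor_green: bool) -> str:
    """Cross-reference the three signals described above."""
    if not vendor_green:
        return "acknowledged incident: trust the vendor's timeline"
    if probe_failing and reports_spiking:
        return "vendor is behind the curve: wait ~20 minutes for the page"
    if not probe_failing and not reports_spiking:
        return "problem is local: debug your network, client code, or auth"
    return "mixed signals: possible partial or regional outage; keep watching"

print(diagnose(probe_failing=True, reports_spiking=True, vendor_green=True))
# -> vendor is behind the curve: wait ~20 minutes for the page
```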
How we cross-reference
Every service page on StatusDetector pulls all three signals into one view:
- The vendor's current status indicator (from the official feed).
- Our own HTTP/DNS probe against the service's primary URL.
- User-submitted reports in the last 30 minutes.
When the three agree, we surface a single confidence-weighted summary. When they disagree, we say so — explicitly — and let the reader decide. The post Status page indicators decoded walks through how to read the vendor's indicator; the disagreement case is where the rest of the dashboard earns its keep.
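The probe in the second bullet boils down to two questions: does the name resolve, and what status code comes back? Here's a single-vantage-point sketch using only the standard library. It's an illustration, not our production probe, and the URL is a placeholder:

```python
import socket
import urllib.request
import urllib.error

def probe(url: str, host: str, timeout: float = 5.0) -> dict:
    """One-shot DNS + HTTP probe from a single vantage point."""
    result = {}
    try:
        result["dns"] = socket.gethostbyname(host)       # does the name resolve?
    except socket.gaierror as exc:
        result["dns_error"] = str(exc)
        return result                                    # no point probing HTTP
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            result["http"] = resp.status                 # e.g. 200
    except urllib.error.HTTPError as exc:
        result["http"] = exc.code                        # e.g. 503
    except (urllib.error.URLError, socket.timeout) as exc:
        result["http_error"] = str(exc)
    return result

print(probe("https://example.com/", "example.com"))
```

One vantage point can't distinguish a broken edge POP from a broken service, which is why a probe fleet worth trusting runs from more than one region.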
What "the status page is lying" actually means
Most of the time, the page isn't lying — it's slow. The information will be correct in 15-45 minutes; you just want to know now.
Occasionally the page is manipulated. Some vendors leave incidents unacknowledged because they're tracked under an SLA contract and acknowledging them publicly commits them to credits. Others fold a clearly degraded service into a "scheduled maintenance" window after the fact. These cases are rare but real, and they explain why some teams have stopped trusting vendor status pages entirely and run their own external monitoring.
The honest summary: a vendor status page is the floor of how bad things are, not the ceiling. If it says "critical", things are at least that bad. If it says "none", things might still be wrong: the vendor either hasn't noticed or hasn't told you yet.
Frequently asked
If the vendor's page is unreliable, why does StatusDetector still show it?
Because it's the authoritative voice on what the vendor admits. When it lights up, you have a concrete reference to point at when you escalate. We surface it alongside our own data so you can compare — never as the only source.
How fast do status pages typically update?
For incidents affecting all users: usually within 15 minutes. For partial outages: 30-60 minutes. For account-specific or feature-flag bugs: often never — the vendor handles them as support tickets, not public incidents.
What's the single most useful action when the status page disagrees with what I'm seeing?
Run the Website Down Checker against the affected endpoint. It probes from our infrastructure, so if it agrees with you and the vendor page is green, you have an objective third-party signal — useful for support tickets and for ruling out local issues.