StatusDetector
Notebook

How to build a simple dependency status dashboard for your team

Every team eventually needs an internal page that shows which of their third-party dependencies are healthy right now. Here's the minimum viable version — what to monitor, what to skip, and where the simple approach breaks down.

StatusDetector · May 13, 2026 · 14 min read

Sometime around year three of a SaaS company's life, the same conversation happens. An outage hits Stripe, or AWS, or SendGrid, and customer-facing engineers spend the first 20 minutes of the incident figuring out which of their dependencies is broken. By the time they've identified the upstream, the incident is half over and they could have been writing the customer comms.

The solution is unglamorous: an internal dashboard that shows the current status of every third-party service the team depends on. Not a fancy monitoring stack — just a single page that, during an outage, answers the question "is it us or is it them?" in five seconds. The first version of this dashboard is the highest-leverage page a small engineering team can build, and it's almost always built badly the first time.

This post is the field guide for building it well. The aim is a dashboard that earns its place: useful during the rare minutes it matters, ignorable the rest of the time. Anything more than that is over-engineering.

The minimum viable scope

The temptation when building a dependency dashboard is to track everything. Resist it. The dashboard's value comes from the short list — the dozen or so entries you actually depend on — not a hundred entries that are vaguely interesting.

Start by listing the third-party services that, if they went down for an hour, would impact your customers. Almost every SaaS team's list looks similar to this:

  • Cloud provider — usually one specific region. (AWS us-east-1, GCP us-central1, Azure East US.)
  • CDN / edge layer — Cloudflare, Fastly, AWS CloudFront, Vercel.
  • DNS — Cloudflare DNS, Route 53, NS1.
  • Payment processor — Stripe, Adyen, PayPal, Square.
  • Email delivery — SendGrid, Postmark, Mailgun, AWS SES.
  • Identity / SSO provider — Auth0, Okta, AWS Cognito, Clerk.
  • Database (managed) — Aurora, PlanetScale, Supabase, Neon.
  • Customer support tooling — Intercom, Zendesk, Front.
  • Monitoring / observability — Datadog, New Relic, Honeycomb.
  • Critical webhooks / integrations — Slack, Twilio, OpenAI, Anthropic.

That's twelve entries. For most teams, the right list is between five and fifteen. If you're tempted to add a twentieth, ask whether you can name the specific customer impact of that service being down. If the answer takes more than ten seconds, leave it off.

The data sources

For each entry, the dashboard should pull from one or both of:

  1. The vendor's published status feed. Most modern SaaS vendors publish a machine-readable status feed via Statuspage.io (the dominant tool). The format is well-documented: a GET to https://<status-page-host>/api/v2/status.json returns the current page-level indicator (none / minor / major / critical / maintenance) and the timestamp. See what 'degraded performance' means on a status page for the vocabulary.

  2. Your own HTTP probe of the endpoint you actually use. The vendor's aggregate feed describes their global health; your probe describes the path your code takes through their infrastructure. The two often disagree, and when they do, your probe is more relevant to your specific customer impact.
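The vendor feed is easy to consume. A minimal sketch of parsing it — the field names follow the documented Statuspage v2 format, but the sample payload here is illustrative, not a real response:

```javascript
// Minimal parser for a Statuspage-style /api/v2/status.json payload.
// Field names follow the documented v2 format; the sample below is illustrative.
function parseStatusFeed(payload) {
  return {
    indicator: payload.status.indicator,     // none | minor | major | critical | maintenance
    description: payload.status.description, // human-readable summary
    updatedAt: payload.page.updated_at,      // when the vendor last updated the page
  };
}

// Example payload shaped like a v2 response:
const sample = {
  page: { name: 'Stripe', updated_at: '2026-05-13T09:00:00Z' },
  status: { indicator: 'none', description: 'All Systems Operational' },
};
parseStatusFeed(sample); // → { indicator: 'none', ... }
```

Storing just these three fields per dependency is enough for the dashboard; everything else in the payload is detail you can click through to the vendor's page for.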

For each dependency, write down both URLs:

Dependency | Vendor status feed | Endpoint we probe
Stripe | https://status.stripe.com/api/v2/status.json | https://api.stripe.com/v1/charges (rejected 401 = healthy)
Cloudflare | https://www.cloudflarestatus.com/api/v2/status.json | A known URL behind your CF account
AWS us-east-1 | https://status.aws.amazon.com/... (RSS) | EC2 metadata endpoint via a probe in us-east-1
SendGrid | https://status.sendgrid.com/api/v2/status.json | https://api.sendgrid.com/v3/mail/send (rejected 401 = healthy)

The "rejected 401 = healthy" pattern is worth noting. Many APIs return 401 to unauthenticated probes — that response confirms the endpoint is alive and answering, without needing real credentials in your probe code. A 500 or a connection failure on the same endpoint is a real signal that something's wrong.
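That classification can be made explicit in code. A small sketch — `classifyProbe` and its three verdicts are illustrative names, not part of any library:

```javascript
// Hypothetical helper: turn a raw probe result into a health verdict.
// `okCodes` lists the statuses we treat as healthy (e.g. 401 from an auth-gated API).
function classifyProbe(result, okCodes = [200]) {
  if (result.error) return 'down';                    // timeout or connection failure
  if (okCodes.includes(result.status)) return 'healthy';
  if (result.status >= 500) return 'down';            // server-side failure
  return 'degraded';                                  // answering, but not as expected
}

classifyProbe({ status: 401 }, [401]); // → 'healthy'
classifyProbe({ status: 500 }, [401]); // → 'down'
classifyProbe({ error: 'timeout' });   // → 'down'
```

The per-dependency `okCodes` list is the important part: 401 is healthy for Stripe's charges endpoint, but a 401 from a URL that should be publicly reachable would be a problem.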

The shape of the dashboard itself

Three columns, one row per dependency, sorted by criticality:

Dependency | Vendor signal | Our probe | Last updated
🟢 AWS us-east-1 | operational | 200 (147 ms) | 30s ago
🟠 Cloudflare | minor — TR region | 200 (89 ms) | 30s ago
🟢 Stripe | operational | 401 (215 ms) | 30s ago
🔴 SendGrid | major outage | 500 | 30s ago
🟢 Auth0 | operational | 200 (310 ms) | 30s ago

That's the entire dashboard. A single table, refreshing every 30 seconds, that any team member can scan in five seconds. The colour coding is the headline; the detail columns are for confirming.

A few specific design choices that matter:

  • Sort by criticality, not alphabetically. During an outage you want the most-likely-relevant dependencies at the top.
  • Show the "last updated" column. A dashboard that's silently broken for three hours is worse than no dashboard. Stale data is a signal.
  • Combine vendor + your probe. Show both. When they disagree, the disagreement itself is interesting information.
  • Link each row to the vendor's status page. When something turns red, the next click is to read the incident detail. Make that one click.
  • Don't show uptime percentages. They're decoration. The question is "is it healthy right now?" — not "what was last month's number?"
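The "combine vendor + your probe" rule can be encoded directly in how a row gets its colour. A sketch — `rowColour` is a hypothetical helper, and the precedence is an assumption that matches the table above:

```javascript
// Hypothetical helper: derive a row colour from both signals.
// Our own probe takes precedence, because it reflects the path our code takes.
function rowColour(vendorIndicator, probeHealthy) {
  if (!probeHealthy) return 'red';                 // our path is broken, whatever the vendor says
  if (vendorIndicator !== 'none') return 'orange'; // vendor reports trouble, our path still works
  return 'green';                                  // both signals clean
}

rowColour('none', true);   // → 'green'  (Stripe row above)
rowColour('minor', true);  // → 'orange' (Cloudflare row above)
rowColour('major', false); // → 'red'    (SendGrid row above)
```

The orange state is deliberately broad: any vendor-reported trouble that your probe hasn't confirmed is "worth a glance", not "worth a page".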

The implementation, in 50 lines

You can build the first version in an afternoon. The shape (in pseudocode):

// Backend cron task, every 60 seconds
const dependencies = [
  { name: 'Stripe',     status_feed: 'https://status.stripe.com/api/v2/status.json',
    probe_url: 'https://api.stripe.com/v1/charges',    probe_ok_codes: [401] },
  { name: 'SendGrid',   status_feed: 'https://status.sendgrid.com/api/v2/status.json',
    probe_url: 'https://api.sendgrid.com/v3/mail/send', probe_ok_codes: [401] },
  // ... and so on
];

for (const dep of dependencies) {
  // The vendor's aggregate signal (Statuspage v2 format)
  const vendor = await fetch(dep.status_feed).then(r => r.json());

  // Our own probe, with a 5s timeout and a latency measurement
  const start = Date.now();
  const probe = await fetch(dep.probe_url, { signal: AbortSignal.timeout(5000) })
                      .then(r => ({ status: r.status, ms: Date.now() - start }))
                      .catch(e => ({ error: e.message }));

  await db.upsert('dependency_status', { ...dep, vendor, probe, checked_at: new Date() });
}

The frontend is a single page that queries dependency_status ordered by criticality and renders the table.

The whole thing is a few hundred lines of code, including authentication and a basic UI. The boring stuff (a cron runner, a small database table, a Next.js or simple Flask page) takes longer than the actual probe logic.
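One detail worth handling in that frontend: the "last updated" column should turn into an explicit stale state when the cron task silently stops running. A minimal check — `isStale` and the three-interval threshold are assumptions, not a prescribed API:

```javascript
// Hypothetical helper: a row is stale if the checker hasn't written it
// for a few refresh intervals (here, 3 × 60 seconds).
function isStale(checkedAt, nowMs, maxAgeMs = 3 * 60 * 1000) {
  return nowMs - new Date(checkedAt).getTime() > maxAgeMs;
}

isStale('2026-05-13T09:00:00Z', Date.parse('2026-05-13T09:01:00Z')); // → false
isStale('2026-05-13T09:00:00Z', Date.parse('2026-05-13T09:10:00Z')); // → true
```

Render stale rows in grey, not green — a green row backed by three-hour-old data is exactly the silent failure the "last updated" column exists to catch.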

What to avoid

The mistakes that turn a useful dashboard into a useless one:

Don't alert on every status-page change

Status pages flap. A vendor will sometimes flip a component to minor for five minutes and then back. If your alerting setup pages an engineer every time this happens, you'll get alert fatigue fast and start ignoring the alerts — including the ones that matter.

The right approach for alerting (if you do it at all): page only when the vendor indicator is major or critical and your own probe agrees. The agreement is the noise filter. A vendor major with your probe still returning 200 is probably a partial outage that doesn't affect your code; no need to wake anyone up.
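Written as a predicate — `shouldPage` is a hypothetical name; the rule is the one above:

```javascript
// Page only when the vendor reports major/critical AND our own probe is failing.
function shouldPage(vendorIndicator, probeHealthy) {
  const vendorBad = vendorIndicator === 'major' || vendorIndicator === 'critical';
  return vendorBad && !probeHealthy;
}

shouldPage('major', false); // → true:  both signals agree, wake someone up
shouldPage('major', true);  // → false: partial outage not hitting our path
shouldPage('minor', false); // → false: worth a look on the dashboard, not a page
```

Everything that fails this predicate still shows on the dashboard; the predicate only gates who gets woken up.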

Don't build an uptime calculator

The temptation: "let's calculate our dependencies' uptime percentages and show 99.94% for Stripe last month." Two problems. First, uptime percentages need long observation windows and careful blackout handling to be meaningful. Second, no engineering decision is made on the basis of "Stripe was 99.94% last month." The decision-relevant question is binary: is it up right now? Build for that question and stop.

Don't probe at high frequency

A probe every 30 seconds is plenty. A probe every second is wasteful, sometimes against the vendor's terms of service, and rarely changes the outcome. The vendor's status page updates on a 30s-to-15min cadence; running your probe faster than that just generates noise.

Don't include services you don't actually use

If your team doesn't use Twilio, don't put Twilio on the dashboard. The dashboard's job is to identify which of your dependencies has broken. Adding services you don't depend on dilutes the signal.

Don't put it behind a login

The dashboard's value comes from being looked at quickly during an incident. Putting it behind authentication adds five seconds of friction and means people won't actually use it during the moments that matter. If you're worried about exposing your dependency list, a single shared password or an IP-restricted internal-only URL is the right level of security.

What to add later (only if needed)

If after six months of using the basic dashboard you're routinely missing things, the second version usually adds:

  1. Multi-region probes. Your customers in the EU see a different version of your dependencies' health than your customers in the US. If you have significant traffic from multiple regions, probing from each one catches geographic POP-level failures the global vendor feed misses. See why a website can be down in one country.

  2. Endpoint-specific probes. Instead of one probe per vendor, probe each of the specific endpoints your code calls. A POST /v1/payment_intents probe to Stripe is more informative than a generic /healthz probe, because it exercises the path your real traffic takes.

  3. Auth-aware probes. Some endpoints return errors at the API layer (a 400 or 422 from a malformed request) that don't fail your TCP/TLS probe. Sending a known-bad authenticated request and confirming the expected error is a richer signal than just confirming the endpoint answered.

  4. Synthetic transaction monitoring. Run a full end-to-end flow through each dependency every few minutes. "Can we create a charge? Can we send a test email to a sink address? Can we deliver a webhook?" This catches issues that single-request probes miss — like rate-limit errors that only show up after the third call.

  5. Status feed integration into your incident response. When an incident opens, the on-call automation should pre-populate the incident channel with the current state of your dependency dashboard. The 30 seconds saved looking it up manually compounds across every incident.
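Item 4 above can be sketched as a tiny runner that executes named steps in order and stops at the first failure. `runSynthetic` and the step shape are illustrative; real steps would call your vendors' APIs with test credentials:

```javascript
// Hypothetical synthetic-transaction runner: steps are [name, asyncFn] pairs,
// run in order; the first throw ends the transaction and names the failed step.
async function runSynthetic(steps) {
  for (const [name, step] of steps) {
    try {
      await step(); // e.g. "create a test charge", "send email to a sink address"
    } catch (e) {
      return { ok: false, failedStep: name, error: e.message };
    }
  }
  return { ok: true };
}

// Usage sketch with stub steps:
runSynthetic([
  ['create charge', async () => { /* call the payment API here */ }],
  ['send email',    async () => { throw new Error('rate limited'); }],
]).then(console.log); // → { ok: false, failedStep: 'send email', error: 'rate limited' }
```

Because the result names the failed step, a red synthetic row on the dashboard already tells you which part of the flow broke — exactly the kind of signal a single-request probe can't give.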

But — and this is important — wait until you need each of these before adding it. Most teams don't need anything beyond the basic version. Premature complexity in a dashboard is the same anti-pattern as premature complexity anywhere else.

When to upgrade to a hosted product

The basic dashboard is enough for almost everyone. The cases where it's worth paying for a hosted product (Datadog Synthetics, Atlassian Statuspage with monitoring, Pingdom, Better Stack):

  • You have customers who pay for uptime guarantees and need a way to prove your dependency-level reliability.
  • You operate a public-facing status page that your customers consume, and you need it to integrate with the rest of your incident-response tooling.
  • Your dependency list has grown past 30–50 entries (rare but real for very large infrastructure-heavy products).
  • You need synthetic transaction monitoring at a frequency that's painful to maintain in-house (every minute, from 12 global regions).

For most teams, the in-house version stays good enough indefinitely. The hosted products are excellent but expensive — and the simple dashboard you built in an afternoon, plus the Stack Dashboard tool for an off-the-shelf view, covers the vast majority of cases.

A note on the customer-facing version

This post is about the internal dependency dashboard — for your engineers, during incidents. The external version, the one you publish to customers, is a different artefact with different requirements. It needs to be honest, fast to update, and resilient to the very incidents it's reporting on (don't host it on the same infrastructure as your product). See why official status pages sometimes lag for the customer-facing perspective.

The internal dashboard's purpose is to answer "is it us or them?" in five seconds. The external dashboard's purpose is to communicate "we know" to thousands of customers at once. They're built for different audiences and should not be the same page.

Frequently asked

Should the dashboard show our own services' health too?

Yes — but treat them as a separate section. The questions you're asking are different: "is our app's checkout endpoint working?" is an internal monitoring question that needs your own infrastructure to answer; "is Stripe's API working?" is a third-party-status question that needs Stripe's signals. Mixing them in one table works, but make sure it's visually clear which rows are first-party and which are third-party.

Vendor status feeds lag real outages — how do we deal with that?

Cross-reference. The internal dashboard's value is partly in showing the disagreement between the vendor feed (slow, authoritative) and your own probe (fast, specific). When your probe shows red and the vendor feed shows green, you've found a window where the vendor hasn't caught up yet. Trust your probe in that window.

How long should we keep historical data?

Long enough to answer post-mortem questions ("what was Stripe's status during our incident on May 12?"), short enough to not become a data-warehouse problem. 90 days is a reasonable starting point. If you genuinely need year-over-year uptime trending, that's a separate problem and probably means you've outgrown the simple dashboard.

Should the dashboard be public?

Probably not for the internal version — your dependency list is competitive intelligence. The customer-facing status page is a different artefact; publish what you'd publicly commit to monitoring.
