
How CDN, DNS, and cloud outages affect apps you use every day

The apps you use every day run on a smaller pool of shared infrastructure than you'd think. When one of those pieces fails — a single CDN, a single DNS provider, a single AWS region — half the internet seems to break at once. Here's why, with the canonical examples and what you can do about it.

StatusDetector · May 13, 2026 · 16 min read

You're trying to use Slack at work. It's broken. So is your bank's website, and Reddit, and Discord, and the Hacker News thread complaining about how everything is broken. The internet feels like it's having a bad day. Twenty minutes later, everything works again, and you don't think about it until it happens the next time.

What you've just experienced isn't "the internet" being broken — it's the small handful of shared infrastructure providers that sit underneath almost every consumer app. Cloudflare, Fastly, Akamai, AWS Route 53, AWS us-east-1, the major DNS resolvers (Google, Cloudflare, Quad9), the major TLS certificate authorities. Each of these supports thousands of apps and services. When one of them has a bad fifteen minutes, thousands of unrelated apps fail simultaneously, which is why it feels like the entire internet broke.

This post is a tour of the three biggest categories of shared-infrastructure failure — CDN, DNS, and cloud-region outages — with the canonical examples that defined each category and a clear picture of what's actually happening when your app says "something went wrong."

The shared-stack reality

Behind almost every consumer-facing web app sits a layered stack. From bottom to top:

  1. A cloud provider's compute and storage — usually AWS, Microsoft Azure, or Google Cloud. The actual servers running the application live here.
  2. A DNS provider — translates the domain name to an IP address. Often AWS Route 53, Cloudflare DNS, Google Cloud DNS, or a specialist like NS1 or Dyn.
  3. A CDN / edge layer — accelerates static content, terminates TLS, and provides DDoS protection. Cloudflare, Fastly, Akamai, AWS CloudFront, Vercel Edge.
  4. The application itself — the part the company actually built.

When you ask "is the app down?" you're asking about a stack of four layers, and a failure at any one of them produces the same surface symptom: the page doesn't load. The vendor's status page usually only reports failures in the application layer. Outages in the lower three layers often go unmarked on the application's own status page, because the application is technically still healthy — the infrastructure it depends on is just unreachable.
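
You can walk these four layers yourself from a terminal. Below is a minimal sketch in Python (standard library only; the target domain is a placeholder) that probes the layers in order: if the DNS step fails, suspect layer 2; if DNS works but the TCP connection fails, suspect layers 1 or 3; if the connection works but the response is a 5xx, the CDN or the application itself is the likely culprit.

    import socket
    import urllib.error
    import urllib.request

    domain = "example.com"  # placeholder target

    # Layer 2: does the name resolve at all?
    try:
        ip = socket.gethostbyname(domain)
        print(f"DNS ok: {domain} -> {ip}")
    except socket.gaierror as err:
        raise SystemExit(f"DNS failed: {err} (suspect the DNS layer)")

    # Layers 1/3: does anything answer on port 443?
    try:
        with socket.create_connection((ip, 443), timeout=5):
            print("TCP ok: something is listening on 443")
    except OSError as err:
        raise SystemExit(f"TCP failed: {err} (suspect hosting or the edge)")

    # Layers 3/4: what does the edge actually say?
    try:
        resp = urllib.request.urlopen(f"https://{domain}/", timeout=10)
        print(f"HTTP ok: status {resp.status}")
    except urllib.error.HTTPError as err:
        # A 5xx here is often generated by the CDN, not the application;
        # the branding on the error page usually tells you which.
        print(f"HTTP error: status {err.code} (check the page for a CDN name)")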

The lower layers are also where consolidation has happened most. There are dozens of public CDNs, but four of them carry most of the traffic. There are hundreds of DNS providers, but five of them resolve most of the queries. There are thousands of cloud providers, but three of them (AWS, Azure, GCP) host the majority of consumer-facing services. When one of the top players has a bad afternoon, the effect is felt everywhere.

The CDN cascade

A CDN — content delivery network — sits between the public internet and the application server. Its job is to cache static content close to users (so the New York Times homepage loads in 50ms from a server two hops away from you, rather than 500ms from a server in Virginia), and to absorb DDoS traffic before it reaches the origin server.

Cloudflare, Fastly, Akamai, and Amazon CloudFront together carry roughly 70-80% of CDN traffic in the consumer-facing web. Any one of these going down means thousands of unrelated sites suddenly serve errors from the same place at the same time.

The canonical examples

  • Cloudflare, June 21 2022. A change to the edge network's routing configuration took down 19 of Cloudflare's busiest data centres for roughly 75 minutes. The affected sites included Discord, Shopify, Fitbit, Peloton, and a long list of others, all returning Cloudflare-branded 5xx errors. The post-mortem is the clearest published description of how a CDN-level outage cascades.
  • Fastly, June 8 2021. A bug in a software deployment, triggered by a single customer pushing a specific configuration, took Fastly's edge platform offline for approximately 50 minutes. Reddit, Amazon, the New York Times, the UK government's gov.uk site, Twitch, GitHub Pages, and the Financial Times all served errors. The post-mortem is similarly detailed.
  • Cloudflare DNS, October 4 2023. A bug in Cloudflare's 1.1.1.1 public resolver made the resolver intermittently unavailable for several hours. Sites that depended on Cloudflare for both DNS and CDN saw their users unable to resolve the domain at all, and anyone using 1.1.1.1 as their DNS resolver saw a long list of unrelated sites become unreachable.

The user-perspective fingerprint

A CDN outage looks like this from a user's perspective:

  • Multiple unrelated sites are simultaneously broken.
  • The error pages mention the CDN by name (Cloudflare, Fastly), or include a CDN-specific identifier (a Cloudflare Ray ID, a Fastly POP code).
  • The affected sites recover at roughly the same time — the CDN fixes one upstream issue, every downstream site comes back.
  • Each affected site's own status page may say "operational" because the application itself is fine.

If you suspect a CDN issue, the test is to check whether multiple unrelated services are broken simultaneously. If only one is, the issue is application-side; if many are, look upstream. See how to check if Cloudflare is down for the step-by-step version.
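
If you want to see which CDN fronts a particular site, the response headers usually say. Here's a small sketch in Python, standard library only; the header names reflect commonly observed CDN behaviour (Cloudflare stamps a cf-ray header, Fastly's POP codes show up in x-served-by), not a guaranteed contract.

    import urllib.request

    def cdn_fingerprint(url: str) -> None:
        """Print the response headers that commonly identify a CDN."""
        # Some origins reject HEAD; switch method to "GET" if this 4xxes.
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            for name in ("server", "cf-ray", "x-served-by", "via", "x-cache"):
                if resp.headers.get(name):
                    print(f"{name}: {resp.headers[name]}")

    # 'server: cloudflare' plus a cf-ray value means Cloudflare;
    # an x-served-by like 'cache-lhr...' points at a Fastly POP.
    cdn_fingerprint("https://example.com/")  # placeholder URL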

The DNS provider cascade

DNS is the layer most people don't think about. When you type slack.com into a browser, your computer asks a DNS resolver to translate that name to an IP address. The translation is fast and cached, so it's invisible. But every web request starts with a DNS lookup, and if the DNS infrastructure fails, every request fails before it begins.

The DNS layer has two interesting failure modes: failures of the authoritative DNS (the DNS server that owns the answers for a domain), and failures of the resolver DNS (the DNS server you use to ask for answers).

Authoritative DNS outages

When a site's authoritative DNS provider has an outage, the site's domain name stops resolving worldwide. No browser anywhere can find the IP address. The site is effectively invisible — even though the application servers are still running, no one can reach them.

  • Dyn DNS, October 21 2016. A coordinated DDoS attack against Dyn (now owned by Oracle), the DNS provider for Twitter, Spotify, Netflix, GitHub, Reddit, Airbnb, Etsy, and many more, made all of those services unreachable for several hours along the US east coast and parts of Europe. The attack used the Mirai IoT botnet. This incident is the canonical example of a DNS-as-single-point-of-failure outage.
  • Amazon Route 53, multiple smaller incidents. Route 53 is the most-used DNS provider for cloud-hosted apps; the rare hour it's degraded affects the long list of services that didn't set up failover DNS to a second provider.

The fingerprint of an authoritative DNS outage:

  • The domain name doesn't resolve from any resolver, anywhere. dig/nslookup returns NXDOMAIN, SERVFAIL, or just hangs.
  • Sites that share the same DNS provider all fail simultaneously.
  • The application's status page is unreachable (because the application's own domain isn't resolving either) — so you can't even check.
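
Those failure shapes are easy to tell apart in code. A sketch in Python using the third-party dnspython package (the domain is a placeholder; the exception-to-symptom mapping follows dnspython's documented behaviour):

    import dns.exception
    import dns.resolver  # third-party: pip install dnspython

    def classify(domain: str) -> str:
        """Map a lookup result onto the fingerprint above."""
        try:
            dns.resolver.resolve(domain, "A", lifetime=5)
            return "resolves fine"
        except dns.resolver.NXDOMAIN:
            return "NXDOMAIN: no such name, according to the answers available"
        except dns.resolver.NoNameservers:
            return "SERVFAIL-ish: no nameserver could produce a usable answer"
        except dns.exception.Timeout:
            return "hang: nothing answered within the timeout"

    print(classify("example.com"))  # placeholder domain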

See DNS propagation isn't a thing for the model of how DNS actually works under the hood.

Resolver outages

When the DNS resolver you're using has an outage, every site you try to visit fails to resolve — from your end. The sites are fine; the rest of the world can reach them; you can't, because your resolver isn't answering.

  • The 1.1.1.1 outages, periodic. Cloudflare's public resolver has had several brief outages since launch in 2018. Each time, the millions of users who set their DNS to 1.1.1.1 lost the ability to resolve any domain for the duration.
  • ISP resolver issues are common but small. Every consumer ISP runs DNS resolvers for its subscribers, and those resolvers occasionally have bad days. The symptom is universal "sites won't load" for users of one ISP while users of other ISPs are unaffected.

The fix for resolver outages is fast: switch to a different resolver. The most common DNS resolvers are:

  • 1.1.1.1 (Cloudflare) — /dns/cloudflare
  • 8.8.8.8 (Google) — /dns/google
  • 9.9.9.9 (Quad9) — privacy-focused, blocks known malware domains
  • 208.67.222.222 (OpenDNS / Cisco) — filtering-focused

If sites suddenly stop resolving for you but a friend on a different ISP can still reach them, change your resolver in your OS settings. The change takes effect in seconds.
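
Here's that test as a sketch in Python, using the third-party dnspython package for the direct queries (the domain is a placeholder). Ask your system's configured resolver first, then the public resolvers above: if only the first fails, switching resolvers will fix it; if everything fails, the problem is authoritative-side and no resolver change will help.

    import socket

    import dns.resolver  # third-party: pip install dnspython

    PUBLIC = {"1.1.1.1": "Cloudflare", "8.8.8.8": "Google", "9.9.9.9": "Quad9"}

    def whose_fault(domain: str) -> None:
        # 1. Whatever resolver your OS is currently configured to use.
        try:
            socket.gethostbyname(domain)
            print("system resolver: ok")
        except socket.gaierror:
            print("system resolver: FAILED")
        # 2. Each public resolver, asked directly.
        for ip, name in PUBLIC.items():
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = [ip]
            try:
                r.resolve(domain, "A", lifetime=5)
                print(f"{name} ({ip}): ok")
            except Exception as exc:
                print(f"{name} ({ip}): FAILED ({type(exc).__name__})")

    whose_fault("example.com")  # placeholder domain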

The cloud-region cascade

The most consolidated layer. AWS, Azure, and GCP together host the majority of internet-facing services. Each offers dozens of regions on paper, but traffic concentrates in a handful of them, and most companies don't deploy redundantly across regions, because cross-region redundancy is expensive and complicated.

When a major cloud region has a bad afternoon, the long list of services hosted there has the same bad afternoon. AWS's us-east-1 region (Northern Virginia) is the most-used single cloud region in the world, and it's had at least four high-impact outages in recent years:

  • June 2023 — us-east-1 network issues affected Lambda and other services for several hours; downstream casualties included Snapchat and many smaller services.
  • April 2023 — DynamoDB capacity issues caused intermittent failures for hours.
  • December 2021 — IAM and EC2 control plane issues in us-east-1 took down a long list of consumer apps, including Disney+, Slack, Coinbase, Robinhood, and Amazon's own retail site. The outage lasted approximately 7 hours from initial symptoms to full recovery.
  • November 2020 — Kinesis Data Streams cascading failure affected a wide range of AWS services and their customers.

The fingerprint of a cloud-region outage:

  • Multiple unrelated services, all using the same cloud region, fail simultaneously.
  • Each affected app's status page may say "operational" because the app's monitoring also runs in the affected region and can't report.
  • The cloud provider's own status page is usually the most reliable signal — AWS publishes incident status at status.aws.amazon.com, Azure at status.azure.com, GCP at status.cloud.google.com. Our /aws page mirrors Amazon's public RSS feed in real time.
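
If you'd rather poll a feed yourself, here's a minimal sketch with Python's standard library. The feed URL is an assumption based on the RSS feed AWS has historically published from its status page; verify the current address on the status page itself.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed URL: check the provider's status page for the current feed.
    FEED = "https://status.aws.amazon.com/rss/all.rss"

    with urllib.request.urlopen(FEED, timeout=10) as resp:
        tree = ET.parse(resp)

    # RSS 2.0 layout: each <item> is one incident update.
    for item in list(tree.iterfind(".//item"))[:5]:
        print(item.findtext("pubDate"), "-", item.findtext("title"))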

The hidden dependency map

Here is the surprising observation: most consumer apps share a small number of infrastructure dependencies, and they overlap heavily. A typical SaaS app's stack might look like:

  • Hosted on AWS (compute, storage, database)
  • Cached / fronted by Cloudflare (CDN, DDoS protection, edge logic)
  • Resolved by AWS Route 53 (authoritative DNS)
  • Sends transactional email through AWS SES or SendGrid
  • Logs to Datadog
  • Sends customer notifications via Twilio
  • Authenticates with Auth0 or Okta

The same shape — with the same five or six providers in the same five or six slots — applies to thousands of unrelated SaaS products. So when one of those providers has an outage, all of those products see the same failure at the same time.
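
You can make the overlap concrete with a toy model. The apps and stacks below are invented for illustration; the point is the shape, not the specific entries.

    # Hypothetical dependency map: app -> the providers it relies on.
    STACKS = {
        "chat-app":    {"AWS us-east-1", "Cloudflare", "Route 53", "Twilio"},
        "banking-app": {"AWS us-east-1", "Akamai", "Route 53", "Auth0"},
        "forum":       {"GCP", "Cloudflare", "Route 53", "SendGrid"},
        "game-store":  {"AWS us-east-1", "Cloudflare", "NS1", "Datadog"},
    }

    def blast_radius(provider: str) -> list[str]:
        """Every app that breaks when this one provider has a bad afternoon."""
        return [app for app, deps in STACKS.items() if provider in deps]

    print(blast_radius("Cloudflare"))  # ['chat-app', 'forum', 'game-store']
    print(blast_radius("Route 53"))   # ['chat-app', 'banking-app', 'forum']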

You can see this in real time during an incident. The Hacker News front page during a Cloudflare incident fills with "X is down" threads for unrelated services, all of which turn out to share the same upstream. The thing that looks like the entire internet failing is usually one provider's bad afternoon spreading across thousands of unrelated apps.

What you can do as a user

Most of the time, the answer is: wait. You can't fix a Cloudflare incident. You can't fix an AWS region. You can't fix a DNS provider's DDoS recovery. The outages typically last 15-90 minutes for the smaller infrastructure incidents and 2-8 hours for the bigger ones; the right move is usually to do something else for an hour and try again.

A few small things that sometimes help:

  • Switch DNS resolver. If your problem is at the resolver layer, swapping from your ISP's resolver to 1.1.1.1 or 8.8.8.8 takes 30 seconds and sometimes resolves regional resolver issues.
  • Try a VPN. If the affected layer is geographically scoped (a CDN POP, a regional cloud failure, a country-specific DNS issue), a VPN exit elsewhere will sometimes bypass the affected piece of infrastructure entirely. See why a website can be down in one country.
  • Check the radar. The Shutdown Radar flags simultaneous outages across the catalogue. If a half-dozen unrelated services show degraded status at the same time, you're looking at an upstream incident rather than a coincidence — and there's nothing useful to do about it except wait.

What developers can do (a quick aside)

For developers reading this who might be deploying their own services, the architecture lessons from a decade of cascade outages are well-established:

  1. DNS failover to a second provider. If your authoritative DNS goes down, having NS records for a second provider lets resolvers fall back automatically. Few teams do this; the ones who do are protected from Dyn-style outages (see the sketch after this list).
  2. Multi-region deployment. Even active-passive cross-region deployment is cheaper than the cost of an hour-long outage. Most apps don't bother because the engineering cost is real and the outages are rare.
  3. CDN-independent fallback. A static error page hosted on a different CDN (or on the origin directly) gives users a meaningful "we are aware" message rather than a generic CDN error.
  4. Visible status pages that don't depend on the affected infrastructure. A status page hosted on the same infrastructure as the service is useless during an outage. Hosting it on a completely independent platform (Atlassian Statuspage, an external static-site host) is the boring-but-correct choice.
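
For lesson 1, you can smell-test your own domain's exposure in a few lines. A sketch with the third-party dnspython package; the provider grouping (last two DNS labels) is deliberately crude, so treat it as a hint rather than an audit.

    import dns.resolver  # third-party: pip install dnspython

    def ns_providers(domain: str) -> set[str]:
        """Group a domain's NS records under a crude provider key."""
        answers = dns.resolver.resolve(domain, "NS", lifetime=5)
        providers = set()
        for record in answers:
            # e.g. 'ns-123.awsdns-45.org.' -> 'awsdns-45.org'
            labels = str(record.target).rstrip(".").split(".")
            providers.add(".".join(labels[-2:]))
        return providers

    providers = ns_providers("example.com")  # placeholder domain
    print(providers)
    if len(providers) < 2:
        print("single DNS provider: a Dyn-style outage takes the domain fully offline")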

The teams that have these things in place are the ones whose customers don't notice when AWS us-east-1 has a bad afternoon.

Frequently asked

If the same providers underpin everything, why don't outages happen more often?

The infrastructure is genuinely well-engineered. Cloudflare, AWS, and the major DNS providers run their networks to a higher uptime standard than almost any individual customer could afford. The outages that happen are dramatic precisely because they're rare — total annual downtime for these providers is typically measured in tens of minutes, not hours. The cost-benefit is in their favour even when the worst-case event happens.

Why do my coworkers' apps fail and mine works during these outages?

Often a routing-level coincidence: your packets get routed to a different CDN POP or a different upstream than theirs, and only one of those routes is broken. See why a website can be down in one country but working elsewhere for the full mechanism. The shorter version: the internet's path-routing is heterogeneous enough that two users in the same office can have meaningfully different views of an upstream incident.

Should I avoid services hosted on AWS to reduce my exposure?

Probably not. The 'don't use AWS' alternative is usually some combination of (1) services that use AWS anyway through a layer of indirection, (2) smaller cloud providers with worse uptime, (3) self-hosting at a cost most consumers won't pay. The pragmatic answer is to accept that the small handful of large providers will occasionally fail, and to have a backup plan for the hour you need a service that's unavailable.

Is multi-cloud always the right answer for developers?

No. Genuine multi-cloud — where your service runs identically on at least two providers and can fail over between them — is hard to do well and expensive to maintain. The cost is usually higher than the cost of the outages it protects against, unless you're a platform whose own customers depend on high availability. For most applications, "deploy to two regions of one cloud provider, with a separate DNS provider as a backup" is the right level of investment.
