
The Cloud Goes Down. AI Goes With It.


On October 20, 2025, a large AWS outage disrupted chat, analytics, internal automations and many customer-facing apps for hours. The outage showed how a regional cloud failure can cascade through products that use cloud-hosted models, data stores, CDNs and APIs.


This article explains how modern AI stacks rely on third-party providers, what goes wrong when those providers fail, and what you can do about it. I use recent, real incidents as examples: the October AWS outage, the November Cloudflare outage, the November DoorDash data breach, and historical CDN failures such as Fastly's 2021 incident. Each example highlights a distinct failure mode: availability, performance, security, and configuration risk. (thousandeyes.com)

How modern AI is wired to third parties

Most AI products are not single, self-contained systems. They are pipelines of services that often live outside your control:

  • Model hosting and inference APIs (managed cloud services or third-party model providers).
  • Object and blob storage for datasets and model checkpoints.
  • CDNs and edge networks that serve web UIs and cached model responses.
  • Identity and auth providers, payment processors, monitoring and observability tools.
  • Third-party data sources and enrichment APIs used during training or inference.

Using these services speeds development and reduces cost. It also outsources failure modes. When a provider fails, the parts of your AI that depend on it can fail too. (TechHQ)

Real cases that show different failure modes

Availability: AWS outage, October 20, 2025

A regional AWS failure left many services slow or unreachable. Companies reported degraded or missing features for hours, including messaging, auth flows and analytics pipelines that depend on AWS regions and services. The post-incident analysis showed how single-region dependencies and cascading retries amplified the impact. For AI features, such as chatbots that call hosted inference APIs or recommender systems that load model checkpoints from S3, the result can be silent failures or highly visible user-facing errors. (thousandeyes.com)

Edge/CDN: Cloudflare outage, November 18, 2025

Cloudflare's disruption caused widespread “internal server error” messages across sites and apps. Because many AI front ends and API gateways sit behind CDNs, a CDN failure translates to inability to reach model endpoints or to deliver front-end assets used to render interactive AI features. The Cloudflare postmortem pointed to a configuration and resource handling issue that propagated across its control plane. (The Guardian)

Security / Data exposure: DoorDash breach, November 17, 2025

A social-engineering-driven breach at DoorDash exposed user contact details. Data exposures at third-party vendors are a direct risk for AI products that collect and store user data for training or personalization. Even if your models are internal, vendor breaches can leak training data, user prompts, or metadata that downstream AI services rely on. (TechCrunch)

Configuration / software bug: Fastly CDN outage, June 8, 2021 (historical)

Fastly's outage was triggered when a valid customer configuration change activated a latent software bug. It shows how a compact error in a widely used provider can produce outsized global effects. For AI systems, bugs like this can sever access to model UIs, telemetry, or content used to ground model responses. (fastly.com)

Why AI products are uniquely vulnerable

AI features amplify the usual cloud risks in three ways:

  1. Real-time dependence. Many AI experiences require low-latency, always-on inference. If an inference API or auth provider fails, the feature becomes unusable in seconds.
  2. Data sensitivity. Training datasets, prompt logs, or personalization tokens often contain private data. A vendor breach can leak high-value, privacy-sensitive information used by your models.
  3. Hidden coupling. Modern apps chain dozens of providers. Failure in a small component (CDN, monitoring agent, identity) can cascade into the AI layer via timeouts, retries and degraded routing. Postmortems repeatedly show unexpected coupling.
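The cascading-retry pattern described above can be damped with bounded retries and a circuit breaker, which fails fast instead of hammering a dependency that is already down. A minimal sketch (the class name and thresholds are illustrative, not from any specific library):

```python
import time

class CircuitBreaker:
    """Stops calling a failing dependency after repeated errors,
    so client retries cannot pile onto a provider during an outage."""

    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds before trying again
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While open, fail fast instead of contacting the dependency.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency assumed down")
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The key design choice is the fast failure while the circuit is open: it converts a slow, retry-amplified outage into an immediate, handleable error your fallback logic can act on.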

Concrete risk types

  • Availability / outage risk - e.g., AWS region failover causing inference API timeouts.
  • Security / data breach risk - e.g., vendor social engineering exposing user contact data used for personalization.
  • Performance / latency and cost spikes - failovers and cross-region routing can increase egress and compute costs unpredictably.
  • Configuration and operational risk - a single config change at a provider can break global traffic flows.
  • Compliance and residency risk - cloud misconfigurations can move or expose data to unintended jurisdictions.

Real product impacts and quick scenarios

  • A customer support assistant calls a hosted LLM. During an outage the assistant returns errors or times out. Users lose trust.
  • A recommender system tries to load a model checkpoint from a failing region. It falls back to a stale model or serves default content. Conversion drops.
  • A compliance team discovers PII in a third-party analytics export after a vendor breach. Legal exposure follows.

Practical mitigation: An actionable checklist

These are pragmatic steps teams can implement quickly and iteratively.

Architecture & runtime

  • Design for multi-region and multi-provider failover. Run critical services across regions and, where cost-effective, across at least two providers. That reduces single-point risk but adds complexity. (thousandeyes.com)
  • Use local fallbacks for core flows. Ship a small on-device or local model for essential paths (auth prompts, basic suggestions) so the product remains usable when external APIs fail.
  • Cache aggressively and serve graceful degradation. Cache frequent responses, and present reduced-functionality UI when live inference isn't available.
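The caching and fallback bullets above can be combined into one call path: try the hosted model, fall back to a recent cached answer, then to a small local model. A hedged sketch, where `remote_infer` and `local_infer` are illustrative stand-ins rather than real APIs:

```python
import time

_cache = {}      # prompt -> (answer, timestamp); in production use a real cache
CACHE_TTL = 300  # serve cached answers up to 5 minutes old

def answer(prompt, remote_infer, local_infer):
    """Return (text, mode) where mode records which path served the request."""
    try:
        result = remote_infer(prompt)
        _cache[prompt] = (result, time.monotonic())
        return result, "live"
    except Exception:
        cached = _cache.get(prompt)
        if cached and time.monotonic() - cached[1] < CACHE_TTL:
            return cached[0], "cached"          # stale but still useful
        return local_infer(prompt), "degraded"  # small on-device model
```

Surfacing the mode to the UI lets you show a reduced-functionality banner instead of an error page when the provider is down.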

Data & security

  • Minimize vendor exposure of sensitive data. Use tokenization and envelope encryption so the vendor cannot read raw PII or sensitive prompts. Rotate keys and audit access. (TechCrunch)
  • Treat vendor breach scenarios in threat models. Include third-party compromises as core scenarios in threat and incident models.
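Tokenization, as mentioned above, means sensitive values never leave your boundary in readable form: the vendor sees opaque tokens while the mapping stays in your own store. A minimal illustrative sketch (not a full PII solution; class and field names are made up):

```python
import secrets

class Tokenizer:
    """Maps sensitive values to opaque tokens; the vault stays in-house."""

    def __init__(self):
        self._vault = {}  # token -> original value

    def tokenize(self, value):
        token = "tok_" + secrets.token_hex(8)
        self._vault[token] = value
        return token

    def detokenize(self, token):
        return self._vault[token]

def redact_record(record, sensitive_fields, tokenizer):
    """Return a copy of the record that is safe to ship to a vendor."""
    safe = dict(record)
    for field in sensitive_fields:
        if field in safe:
            safe[field] = tokenizer.tokenize(safe[field])
    return safe
```

If the vendor is breached, the attacker gets tokens, not emails or prompts; only a compromise of your own vault exposes the raw values.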

Contracts & ops

  • Negotiate meaningful SLAs and audit rights. Don't accept opaque incident response clauses. Include forensic access, notification timelines, and penalties.
  • Maintain exportable backups of critical assets. Regularly test restorations from cold backups, including model checkpoints and data exports.
  • Run chaos engineering that includes third-party failures. Practice cutting off a provider during game days. That reduces surprise during real incidents.
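A game day that "cuts off a provider" can be as simple as routing outbound calls through a wrapper that a drill can force to fail, then checking that fallbacks actually engage. A sketch under made-up names (this is one way to structure such a drill, not a standard tool):

```python
FAILED_PROVIDERS = set()  # providers "cut off" during the current drill

def chaos_call(provider, fn, fallback):
    """Route a dependency call, using the fallback if the provider is
    cut off by a drill or actually failing."""
    if provider in FAILED_PROVIDERS:
        return fallback()
    try:
        return fn()
    except Exception:
        return fallback()

def run_drill(provider, scenario):
    """Cut a provider, run the scenario, then always restore it."""
    FAILED_PROVIDERS.add(provider)
    try:
        return scenario()
    finally:
        FAILED_PROVIDERS.discard(provider)
```

Running the same scenario with and without the drill active tells you whether the fallback path works before a real outage forces the question.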

What procurement and security should ask vendors

A short due-diligence checklist:

  • Do you run multi-region and multi-AZ redundancy?
  • Can we export our data and models on demand? How fast?
  • Do you have breach notification SLAs and forensic support?
  • Show your last five incident postmortems (redacted).
  • What is your third-party dependency map?

Balance speed and resilience

Third-party services are tools. Use them to move fast. But assume they will fail. Measure which AI features are critical to your product and invest in resilience for those first. For less critical features, accept some cloud dependence while monitoring risk metrics. The question is not “cloud or not?” but “which parts of our UX need local, tested fallbacks?” (TechHQ)

Closing: Three concrete next steps for your team

  1. Run a dependency map for any AI feature used in production. Mark services by criticality and sensitivity.
  2. Add one fallback: a cached response, a local model, or a degraded UX path for your highest-impact feature.
  3. Schedule a third-party failure drill (simulate cutting your cloud provider or CDN) and update runbooks based on the results.
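Step 1, the dependency map, can start as a simple structured list that ranks providers by criticality and data sensitivity; the providers and ratings below are made-up examples:

```python
from dataclasses import dataclass

@dataclass
class Dependency:
    provider: str
    role: str
    criticality: str       # "critical" | "degraded-ok" | "optional"
    data_sensitivity: str  # "pii" | "internal" | "public"
    has_fallback: bool = False

def needs_fallback(deps):
    """Critical dependencies without a tested fallback come first."""
    return [d for d in deps
            if d.criticality == "critical" and not d.has_fallback]
```

Even this small table makes step 2 concrete: the first entry `needs_fallback` returns is where your first cached response or local model should go.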
Eliot Knepper

Co-Founder

I never really understood data - turns out, most people don't. So we built a company that translates data into insights you can actually use to grow.