Cloud Outages Are Predictable—Stop Treating Them Like Rare Weather

ChatGPT · 2026-07-04T01:32:45-0400

Cloud outages over the past year have shown that failures at Google Cloud, AWS, and Microsoft Azure can disrupt businesses, consumer apps, developer workflows, and identity-dependent services far beyond the original cloud platform. The lesson is no longer that hyperscalers sometimes fail; everyone in IT already knows that. The more uncomfortable conclusion, argued this week by InfoWorld and reinforced by public incident reports from the providers themselves, is that too many organizations still treat cloud outages as rare weather events rather than predictable operating conditions. The next outage should not be wasted as another round of status-page watching and executive frustration.

The Cloud Reliability Story Has Outgrown Its Marketing

For more than a decade, the cloud was sold as a reliability upgrade. Move out of the fragile server closet, stop nursing aging SANs, stop betting the company on one overworked admin and a blinking rack in a branch office. In many respects, that pitch was true: hyperscale platforms brought professionalized operations, elastic capacity, geographic reach, and engineering depth that most organizations could never build alone.
But the bargain was never “no failures.” It was “different failures.” The industry is now living inside that difference.
InfoWorld’s David Linthicum frames the past year’s outage pattern as a warning that cloud disruptions have become a normal part of business life, not an exception that can be waved away after a postmortem. That is the right starting point, but the sharper issue is architectural: the cloud did not eliminate single points of failure so much as move them into places customers cannot see, cannot patch, and often cannot meaningfully route around during the event.
That does not make AWS, Azure, or Google Cloud unreliable in the simplistic sense. It makes them systemic. When a hyperscaler’s control plane, identity dependency, global routing layer, or dominant region stumbles, the blast radius is not bounded by one customer’s subscription. It reaches SaaS platforms, payment flows, security tooling, collaboration suites, mobile apps, developer pipelines, smart-home devices, and the support systems companies rely on to explain the outage to their own customers.

June 2025 Was the Internet Remembering Its Dependency Graph

The June 12, 2025 Google Cloud outage was one of those incidents that made the abstract dependency map visible. As TechCrunch, the Associated Press, TechTarget, and others reported at the time, the disruption rippled through Google services and third-party platforms, with Spotify, Cloudflare-connected services, and other applications feeling the effects. Google later attributed the incident to a problem involving Service Control, a critical enforcement layer used by Google Cloud APIs.
That detail matters because it moves the story away from the cartoon version of cloud outages. This was not simply “a data center went dark.” It was a failure in a shared internal service that many other services rely on to decide whether requests should be accepted, authorized, metered, or rejected. In modern cloud architecture, those shared services are the hidden load-bearing walls.
The lesson for customers is brutal. You may have built your application across zones. You may have replicated data. You may even have rehearsed a regional failover. But if the affected dependency is global, administrative, or identity-adjacent, your beautifully distributed application can still find itself waiting on one provider’s internal machinery.
Cloudflare’s involvement made the episode especially instructive. Cloudflare is itself a resilience provider for much of the web, yet parts of its service chain were reportedly affected because of a third-party dependency. That is the modern cloud economy in miniature: companies buy resilience from vendors that buy resilience from other vendors that ultimately depend on the same small set of hyperscale foundations.

US-EAST-1 Is Still the Region That Explains the Internet

AWS’s October 20, 2025 outage was another reminder that some regions are more than regions. Amazon’s own updates said the event involved increased error rates for AWS services in US-EAST-1, with downstream effects across Amazon properties, AWS Support operations, and customer workloads. Reporting from TechTarget, Axios, CRN, and others tied the incident to internal systems associated with network load balancer health monitoring, DNS resolution, DynamoDB, and related service impairments.
US-EAST-1 has long been the cloud industry’s favorite cautionary tale because it is both heavily used and deeply entangled with AWS service history. Many services launched there first. Many customers still treat it as the default. Many SaaS companies, even those that market global resilience, keep critical pieces of their control plane, authentication path, or operational tooling in or near that region.
The outage therefore became less a story about one AWS failure than a story about concentration risk. If your company did not directly host production workloads in US-EAST-1, you might still have depended on a vendor that did. If your vendor did not host its main application there, it might still have used a queue, database, telemetry platform, support system, or deployment workflow that did. The dependency tree is no longer obvious from the invoice.
This is why “we are multi-AZ” is not the same as “we are resilient.” Availability zones protect against a subset of physical and localized failures. They do not automatically protect against service-level bugs, provider control-plane failures, DNS incidents, IAM problems, quota systems, regional service dependencies, or operational tools that become unavailable just when engineers need them most.
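The point about dependencies that no longer show up on the invoice can be sketched as a small graph walk. This is an illustration under stated assumptions, not a real inventory: the service names and the `DEPENDS_ON` map are hypothetical, and a real map would have to be assembled from vendor disclosures and architecture reviews.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed items.
DEPENDS_ON = {
    "checkout-app": ["payments-saas", "auth-vendor"],
    "payments-saas": ["queue-service"],
    "auth-vendor": ["aws:us-west-2"],
    "queue-service": ["aws:us-east-1"],
}


def transitive_dependencies(root, graph):
    """Return everything reachable from root, whether direct or indirect."""
    seen = set()
    queue = deque(graph.get(root, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


def path_to(root, target, graph):
    """Return one shortest dependency chain from root to target, or None."""
    queue = deque([[root]])
    visited = {root}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


deps = transitive_dependencies("checkout-app", DEPENDS_ON)
print("aws:us-east-1" in deps)  # True
print(path_to("checkout-app", "aws:us-east-1", DEPENDS_ON))
# ['checkout-app', 'payments-saas', 'queue-service', 'aws:us-east-1']
```

In this made-up example the application has no direct footprint in US-EAST-1, yet it reaches the region through a vendor’s vendor. The chain itself is the useful output, because it names which contract to ask about.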

Azure’s October Outage Put the Control Plane in the Dock

Nine days after the AWS event, Microsoft Azure suffered a widespread outage on October 29, 2025 tied to Azure Front Door and Azure CDN. Microsoft’s Azure status history described connection timeout errors and DNS resolution issues affecting customers and Microsoft services that used those platforms. Windows Central and the Associated Press reported broad effects across Microsoft’s own ecosystem, including Microsoft 365-related services, Teams-adjacent experiences, Xbox, and other downstream services.
The important phrase in Microsoft’s account was not “Azure Front Door.” It was “configuration change.” The cloud is full of configuration changes, and most of them are routine until one is not. At hyperscale, configuration is not a clerical act; it is software deployment by another name.
Azure Front Door is a global application delivery and content acceleration layer. It sits close to the entry point for many services, which means trouble there feels immediate and widespread. If the front door cannot route, resolve, or serve traffic correctly, the health of the rooms behind it becomes somewhat academic.
For WindowsForum readers, the Azure incident also had a familiar sting. Microsoft’s cloud is no longer just Azure for developers. It is the substrate under Microsoft 365, Entra, Defender, Purview, Windows cloud management, Copilot services, Xbox infrastructure, and a growing list of administrative experiences. When Azure has a serious global issue, the boundaries between Microsoft’s consumer, enterprise, and developer services blur fast.

The February Azure Failure Was a Warning About Managed Abstractions

InfoWorld also points to Microsoft Azure’s February 2–3, 2026 incident, which reportedly lasted more than 10 hours and involved a misconfiguration affecting Microsoft-managed storage accounts. The practical result, according to InfoWorld’s summary of the incident, was cascading trouble across virtual machine operations and managed identities.
That is a different class of pain from a public website timing out. Virtual machine operations affect the ability to deploy, scale, recover, or change infrastructure. Managed identities affect the ability for workloads to authenticate to other services without embedded secrets. When those layers fail, customers are not merely serving errors to users; they may be unable to execute the very recovery procedures they wrote for emergencies.
This is where cloud abstractions become double-edged. Managed identity is usually better than scattering credentials across application code. Managed storage is usually better than running your own fragile storage service. Managed control planes are usually better than artisanal infrastructure glue.
But “managed” also means “outside your direct control.” If the provider’s managed layer is impaired, customers can be left with fewer levers than they expected. In the old world, you could sometimes limp along with ugly manual workarounds. In the new world, the correct manual workaround may require an API that is down, an identity token that cannot be issued, or a portal that will not load.

The May 2026 AWS Event Made Physical Reality Intrude Again

The reported May 2026 AWS outage in US-EAST-1, tied by InfoWorld to a thermal event and power loss at a Virginia data center, is a useful counterweight to all the talk about control planes and software bugs. The cloud may feel abstract, but it remains a physical business of power, cooling, fiber, generators, batteries, switchgear, and buildings filled with machines.
That physical layer is often invisible to customers until it is not. Cloud providers do heroic engineering to hide hardware failure, and most individual hardware failures vanish without customer-visible impact. But at sufficient scale, a cooling or power event can still degrade core services such as EC2 and EBS, especially when it occurs in a region that carries enormous customer and service load.
The danger is not that hyperscalers are bad at physical operations. They are almost certainly better at them than most enterprises. The danger is that customers use that truth as permission to stop asking what happens when the physical abstraction leaks.
A serious power or thermal incident also has a different recovery profile from a bad configuration push. Hardware and facility events can create capacity constraints, delayed instance launches, storage impairments, and contention as many customers restart or fail over workloads at the same time. Even after the headline service status improves, customers may face degraded performance, stuck operations, or delayed backlogs.

The Bill for Downtime Is Not Just Lost Transactions

InfoWorld emphasizes the financial impact of cloud outages, and it is easy to see why. For businesses processing large transaction volumes, a two-hour outage can mean millions of dollars in lost revenue. For smaller companies, the dollar figure may be less dramatic but the proportional damage can be worse: missed orders, broken onboarding, failed renewals, customer support overload, and lost trust.
The harder costs are the ones that do not fit neatly into an incident spreadsheet. Engineers are pulled off roadmap work. Sales teams pause demos. Executives demand live updates without enough technical context. Customer success teams burn credibility. Security teams worry whether the confusion is masking an attack. Finance teams argue later about credits that rarely match the business impact.
Cloud service credits are especially poor comfort. They are designed around provider service-level agreements, not around your lost revenue, regulatory exposure, brand damage, or overtime bill. A credit on next month’s invoice may be contractually appropriate and economically irrelevant.
There is also a morale cost. Repeated cloud outages teach technical teams that the most sophisticated architecture can still be humbled by a provider dependency they did not choose directly. If leadership responds by demanding “make sure this never happens again” without funding resilience work, the organization converts an outage into cynicism.
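The gap between a service credit and the business loss is easy to put in numbers. The sketch below uses entirely hypothetical figures: the monthly bill, the credit percentage, and the hourly revenue are assumptions chosen for illustration and are not taken from any provider’s SLA.

```python
def outage_gap(monthly_bill, credit_pct, revenue_per_hour, outage_hours):
    """Compare an SLA-style credit with revenue lost during an outage.

    All inputs are illustrative assumptions, not real SLA terms.
    """
    credit = monthly_bill * credit_pct / 100
    lost = revenue_per_hour * outage_hours
    ratio = (credit / lost) if lost else 0.0
    return credit, lost, ratio


# Hypothetical company: $50,000 monthly cloud bill, 10% credit,
# $500,000 of revenue per hour, two-hour outage.
credit, lost, ratio = outage_gap(50_000, 10, 500_000, 2)
print(f"credit: ${credit:,.0f}")        # credit: $5,000
print(f"lost revenue: ${lost:,.0f}")    # lost revenue: $1,000,000
print(f"credit covers {ratio:.1%}")     # credit covers 0.5%
```

Under these assumptions the credit covers half a percent of the lost revenue, before counting support load, overtime, or reputational damage.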

The Multicloud Fantasy Still Needs Engineering Discipline

Every major outage produces a predictable chorus: go multicloud. The instinct is understandable. If AWS fails, run on Azure. If Azure fails, run on Google Cloud. If Google Cloud fails, shift somewhere else. The idea sounds obvious in the boardroom and turns complicated the moment an engineer asks which database, identity system, message bus, secrets store, observability stack, network model, and deployment pipeline will be portable enough to make that possible.
Multicloud can reduce certain concentration risks, but it is not a magic amulet. Running one application actively across multiple clouds requires hard choices about data consistency, latency, service parity, cost, security policy, staff expertise, and operational complexity. The architecture that survives a hyperscaler outage may be far more expensive and slower to ship than the architecture it replaces.
The more realistic path for many organizations is selective diversification. Not every workload deserves active-active multicloud. Not every internal dashboard needs cross-provider failover. But revenue-critical paths, customer authentication, incident communications, payment processing, and operational recovery tooling deserve a more serious design review than they often receive.
This is where IT leaders need to stop treating resilience as a generic percentage target. “Five nines” is not a strategy. A strategy says which business functions must survive which provider failures, for how long, at what degraded level, with which manual procedures, and at what cost.
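A quick calculation shows why a percentage alone says so little. The sketch below converts an availability target into an annual downtime budget; it assumes a 365-day year and a single yearly measurement window, which is a simplification, since provider SLAs are often measured per service and per month.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct):
    """Annual downtime budget, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for label, pct in [("three nines", 99.9),
                   ("four nines", 99.99),
                   ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {allowed_downtime_minutes(pct):.1f} minutes/year")
# three nines (99.9%): 525.6 minutes/year
# four nines (99.99%): 52.6 minutes/year
# five nines (99.999%): 5.3 minutes/year
```

By this arithmetic, a single incident lasting more than 10 hours, like the reported February 2026 Azure event, would exceed a three-nines annual budget on its own. The number still does not say which business functions were down during those minutes, which is the question a strategy has to answer.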

The Real Test Is Whether Recovery Depends on the Broken Thing

The best outage plans begin with an unpleasant question: do we need the failed provider to recover from the provider’s failure? If the answer is yes, the plan is weaker than it looks.
This shows up everywhere. Runbooks live in a SaaS documentation platform that depends on the affected cloud. Pager escalation relies on an identity provider that cannot issue tokens. Status updates require access to a dashboard behind the same SSO path that is failing. Backups are stored in the same cloud account, region, or identity boundary as production. Deployment pipelines need the very APIs that are returning errors.
A good incident plan separates the control path from the failure domain. Engineers need out-of-band communications. Executives need a preapproved customer messaging process. Support teams need static fallback pages. Administrators need break-glass credentials that are tested, logged, and governed. Recovery data needs to exist somewhere the incident cannot trivially corrupt or isolate.
None of this is glamorous. It does not demo well. It often loses budget fights to AI pilots, feature launches, and platform migrations. But when a cloud outage hits, the boring preparations become the only things that matter.
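One way to test whether recovery depends on the broken thing is to tag each recovery asset with the failure domains it relies on and compare those tags with production. The sketch below is a minimal illustration: the asset names, the domain labels such as `cloud-a` and `sso-primary`, and the `shared_fate` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryAsset:
    name: str
    failure_domains: frozenset  # providers, regions, identity systems it needs


# Hypothetical failure domains that production depends on.
PRODUCTION_DOMAINS = frozenset({"cloud-a", "cloud-a:region-1", "sso-primary"})

# Hypothetical recovery assets and what each one needs in order to work.
ASSETS = [
    RecoveryAsset("runbooks (SaaS wiki)", frozenset({"cloud-a", "sso-primary"})),
    RecoveryAsset("pager escalation", frozenset({"sso-primary"})),
    RecoveryAsset("backups", frozenset({"cloud-b:region-9"})),
    RecoveryAsset("static status page", frozenset({"independent-cdn"})),
]


def shared_fate(assets, production_domains):
    """Map each asset to the failure domains it shares with production."""
    report = {}
    for asset in assets:
        overlap = asset.failure_domains & production_domains
        if overlap:
            report[asset.name] = sorted(overlap)
    return report


for name, domains in shared_fate(ASSETS, PRODUCTION_DOMAINS).items():
    print(f"{name}: shares {', '.join(domains)}")
# runbooks (SaaS wiki): shares cloud-a, sso-primary
# pager escalation: shares sso-primary
```

Anything that appears in the report is a recovery tool that may fail together with the system it is supposed to recover. The tagging is the hard part in practice; the comparison is trivial once the tags are honest.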

Status Pages Are Not Your Monitoring Strategy

One of the recurring complaints during major cloud incidents is that provider status pages lag reality. Customers see errors before the official dashboard changes color. Forum threads, Reddit posts, vendor Slack channels, and Downdetector reports often light up while the official status page still says “investigating,” describes only a narrow scope, or shows nothing at all.
This is not always negligence. Providers need to verify scope, avoid false statements, and coordinate across internal teams. But from the customer’s perspective, the result is the same: the official source of truth may be at its least useful during the first phase of the incident.
Organizations need external observation that reflects user experience, not just provider declarations. Synthetic monitoring from multiple networks, independent DNS checks, transaction probes, endpoint health tests, and third-party telemetry can establish whether the business is functioning while the provider is still assembling its incident narrative.
The same principle applies after the incident. A provider postmortem is necessary but not sufficient. Your organization needs its own postmortem that asks how the provider failure interacted with your architecture, vendors, processes, contracts, and communications. Otherwise, the cloud provider learns more from the outage than you do.
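The external-observation idea can be sketched as a small transaction probe. This is a minimal sketch using only the Python standard library, not a replacement for a monitoring product: the latency threshold and the `probe` and `summarize` helpers are assumptions, and a real deployment would run the probe from several networks outside the primary cloud.

```python
import time
import urllib.request


def http_fetch(url, timeout=5.0):
    """Return the HTTP status code for url; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def probe(url, fetch=http_fetch, max_latency_s=2.0):
    """Check one endpoint the way a user would experience it.

    `fetch` is injectable so the probe can be exercised without a network.
    """
    start = time.monotonic()
    try:
        status = fetch(url)
    except Exception as exc:  # timeouts, DNS failures, HTTP 4xx/5xx, etc.
        return {"url": url, "ok": False, "error": type(exc).__name__}
    latency = time.monotonic() - start
    ok = 200 <= status < 400 and latency <= max_latency_s
    return {"url": url, "ok": ok, "status": status,
            "latency_s": round(latency, 3)}


def summarize(results):
    """Collapse probe results into a single healthy/unhealthy verdict."""
    failed = [r["url"] for r in results if not r["ok"]]
    return {"healthy": not failed, "failed": failed}
```

A scheduler would call `probe` against a handful of business-critical URLs, such as a login page or a checkout health endpoint, and alert on `summarize(...)["healthy"]` turning false, regardless of what the provider’s status page says at that moment.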

Windows Shops Have a Special Dependency Problem

For Microsoft-heavy environments, the cloud resilience conversation is particularly awkward. Azure, Microsoft 365, Entra ID, Intune, Defender, Purview, Windows 365, Azure Virtual Desktop, and related management services are increasingly intertwined. That integration is often a strength until the failure domain becomes too broad.
A company may think of itself as “not really on Azure” because its main application runs elsewhere. But if it uses Entra ID for authentication, Microsoft 365 for communications, Intune for device management, Defender for security operations, and Azure-hosted services for line-of-business workflows, an Azure control-plane or identity-adjacent incident can still become a business outage.
This does not mean Windows shops should flee Microsoft’s cloud. It means they should map Microsoft dependencies honestly. Which users can still work if Entra has trouble? Which administrators can still access critical systems if conditional access, MFA, or the portal is impaired? Which security alerts remain visible if Microsoft’s own cloud telemetry path is degraded? Which communications channel survives if Teams and Exchange Online are both affected?
The old disaster recovery binder assumed the office burned down. The modern version must assume the identity fabric, collaboration system, cloud console, and endpoint management plane may all become unreliable at the same time.

Vendor Accountability Must Move Beyond Apology Theater

Hyperscalers have become better at post-incident communication, but the market still accepts too much vagueness. “Configuration change,” “increased error rates,” “latency,” and “service degradation” may be accurate, yet they often obscure the operational questions customers need answered. Was the change globally staged or locally contained? Were guardrails bypassed? Did blast-radius controls fail? Were internal dependencies insufficiently isolated? Why did rollback take as long as it did?
Customers rarely have enough leverage to demand perfect transparency from a hyperscaler. But large enterprises, regulated industries, public-sector buyers, and vendor-management teams can ask better questions before the next renewal. They can require clearer resilience documentation, region-dependency disclosures, data export paths, recovery-time evidence, and incident communication commitments.
The goal is not to shame providers for failing. Complex systems fail. The goal is to make failure less mysterious and less contagious.
There is a difference between accepting that outages happen and accepting that customers must be surprised every time. The former is maturity. The latter is learned helplessness.

The Next Outage Should Find Your Organization Less Surprised

The practical response to this outage pattern is not panic, and it is not a theatrical migration to three clouds by next quarter. It is a disciplined effort to decide which failures the business can absorb and which ones it cannot. That work belongs jointly to infrastructure teams, application owners, security leaders, finance, legal, customer support, and executives.
The most useful output is not a 90-page resilience strategy that nobody reads. It is a short list of business services, ranked by consequence, with explicit failure assumptions and tested recovery paths. The cloud provider’s SLA should be an input, not the plan.

Organizations should identify which revenue, identity, communications, and recovery systems depend on the same hyperscaler or region.
Teams should test whether runbooks, break-glass accounts, backups, monitoring, and customer-status pages still work when the primary cloud control plane is impaired.
Architects should distinguish between zone redundancy, regional resilience, and provider independence, because each protects against different failure modes.
Executives should fund resilience according to business impact rather than treating outage preparation as an infrastructure team preference.
Vendor reviews should include dependency transparency, incident history, recovery evidence, and contractual communication expectations, not just uptime claims.
Every major outage should trigger an internal postmortem focused on the customer’s own architecture and operating model, not merely the provider’s root cause.
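The short list of business services described above can be kept as plain structured data rather than a long document. The sketch below is one possible shape, with hypothetical service names, a made-up 1-to-3 consequence scale, and placeholder dates; the field names are assumptions, not a standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BusinessService:
    name: str
    consequence: int             # 1 = highest business impact (assumed scale)
    failure_assumption: str      # which provider failure the plan covers
    recovery_path: str           # what the team does when that failure occurs
    last_tested: Optional[str]   # ISO date of the last recovery test, or None


# Hypothetical register entries for illustration only.
REGISTER = [
    BusinessService("internal dashboard", 3, "single-region outage",
                    "wait for provider recovery", None),
    BusinessService("customer login", 1, "identity provider unavailable",
                    "fail over to secondary identity path", "2026-03-01"),
    BusinessService("checkout", 2, "primary region control plane impaired",
                    "serve from standby region", None),
]


def review(register):
    """Rank services by consequence and list those with untested recovery."""
    ranked = sorted(register, key=lambda s: s.consequence)
    untested = [s.name for s in ranked if s.last_tested is None]
    return [s.name for s in ranked], untested


ranked, untested = review(REGISTER)
print(ranked)    # ['customer login', 'checkout', 'internal dashboard']
print(untested)  # ['checkout', 'internal dashboard']
```

The value of a register like this is that the untested list is short enough to argue about in a budget meeting.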

The cloud did not fail its promise so much as expose the fine print: hyperscale infrastructure can be more reliable than what came before while still creating shared failure modes too large for any one customer to ignore. The organizations that use the next outage well will not be the ones that shout loudest on social media or collect the most service credits. They will be the ones that turn a few hours of disruption into a clearer map of their dependencies, a funded resilience backlog, and a recovery plan that does not depend on the broken system fixing itself first.

References

Primary source: InfoWorld
Published: 2026-07-03T09:30:29.922178

Don’t waste your next cloud outage | InfoWorld

A cloud outage is a bigger problem if you don’t learn from it. Increasing cloud failures are exposing critical weaknesses in enterprise architecture. Here are 3 things to do about it.

www.infoworld.com
Related coverage: techtarget.com

AWS cloud outage reveals vendor concentration risk | TechTarget

AWS’s Oct. 20 outage exposed global cloud dependencies. Learn what caused it and how IT leaders can strengthen resilience and continuity.

Related coverage: tomsguide.com

How the AWS outage happened — and why it's broke the internet | Tom's Guide

A massive AWS outage took down parts of the internet today — from Alexa and Snapchat to Fortnite and banking apps. Here’s what really happened, and why one small glitch caused such big chaos.

Related coverage: crn.com

The 10 Biggest Cloud Outages Of 2025: AWS, Google And Microsoft

Tech cloud outages in 2025 caused from cybersecurity attacks, software issues and IT errors came via AWS, Microsoft Azure, Google Cloud, Cloudflare, Salesforce and Ingram Micro.

Related coverage: postmortems.app

Cloud postmortems | postmortems.app

Outages of, or caused by, public cloud providers (AWS, GCP, Azure, etc.) and their managed services.

Related coverage: thousandeyes.com

https://www.thousandeyes.com/blog/google-cloud-outage-analysis-june-12-2025