AWS Outage 2025: Cloud Dependency and Multi-Region Resilience Lessons

A massive Amazon Web Services outage on October 20, 2025, knocked hundreds of major websites and apps offline and left global internet traffic sluggish for hours, exposing the deep concentration of modern online infrastructure in a handful of cloud regions and the cascading fragility that follows when a single core service stumbles.

Background

The incident originated in the US‑EAST‑1 region, AWS’s largest and most consequential availability hub in Northern Virginia, and manifested as increased error rates, elevated latencies, and failures in launching new compute instances across multiple services. The disruption began in the early hours of October 20 and produced immediate knock‑on effects across entertainment, finance, communications, and enterprise productivity platforms.
At the height of the outage, consumer and enterprise services reporting problems included social apps (Snapchat, Reddit), gaming platforms (Fortnite, Roblox), financial services (Coinbase, Venmo, Robinhood), productivity suites (Microsoft 365, Slack), streaming and retail (Amazon.com, Prime Video), and many more — a list that represents the modern internet’s dependency on AWS as an underlying substrate. Many companies posted status updates confirming AWS as the root cause.

What we know so far (technical snapshot)​

  • The primary affected region was US‑EAST‑1 (Northern Virginia), with symptomatic failures across core services such as EC2 (compute), DynamoDB (NoSQL database), and internal DNS/endpoint resolution subsystems.
  • Early AWS status messages reported increased error rates and latencies and noted that internal subsystems responsible for health monitoring and network load balancers were implicated in the disruption. Mitigations were applied through the morning and into the day, and AWS engineers reported progressive recovery while some dependent operations continued to process backlogs.
  • Many downstream vendors described the symptom set as DNS resolution failures for specific AWS endpoints (notably DynamoDB) that cascaded through applications relying on those endpoints. Several service status pages recommended customers flush DNS caches to clear cached endpoint resolution problems as part of recovery guidance.
It is important to note that some early explanations — for example, precise single‑line root causes — evolved through the day. While reporting converged on internal DNS/endpoint and control‑plane issues in US‑EAST‑1, final post‑mortem details from AWS (including exact trigger conditions) may be incomplete or subject to further analysis. Where official, detailed root‑cause reports are absent, those finer points should be treated as provisional.

The human and business impact​

When an infrastructure provider as large as AWS suffers regional failures, the effect is not merely technical; it rapidly translates into customer frustration, lost commerce, and operational headaches.

Consumer friction and lost revenue​

Retail and on‑demand platforms that rely on AWS experienced checkout failures, login errors, and degraded user experiences. Financial apps reported service interruptions and trading delays; gaming platforms logged login failures; and smart‑home services experienced command failures in devices that depend on cloud APIs. For companies operating at scale, outage minutes are expensive and reputationally risky.

Operational chaos for IT and support teams​

SRE and ops teams at affected companies moved into firefighting mode: routing traffic to alternate regions where possible, switching to read‑only modes, serving cached content, and fielding customer support tickets. The outage showed how many organizations still rely on a default configuration that favors convenience over survivability. Several vendors publicly asked users to retry failed requests or advised flushing DNS caches to recover client‑side endpoint resolution.

Public sector and critical services​

In the UK and other jurisdictions, government and major banking services reported intermittent issues. When public infrastructure depends on a handful of cloud providers and regions, outages can complicate access to essential services and create cascading policy and compliance headaches.

Why this outage matters: concentration and single points of failure​

The October 20 outage is not an isolated curiosity; it is a textbook reminder that centralization of cloud infrastructure produces systemic risk.
  • Regional concentration: US‑EAST‑1 is the largest AWS region and hosts many critical endpoints and default resources. Many teams choose it by default because of lower latency and richer feature sets, which concentrates risk.
  • Common dependencies: High‑level applications often depend on multiple AWS primitives—EC2, Elastic Load Balancing, DynamoDB, S3—tied together in complex chains. If a control‑plane or DNS issue affects one primitive, downstream services can rapidly cascade into failure.
  • Operational coupling: Developers and operators often rely on managed services and cloud APIs without comprehensive failover plans, assuming provider SLAs and geographic redundancy by default. The outage highlights how assumed redundancy can still leave entire product stacks vulnerable.
The effect is systemic: when the largest cloud operator experiences a hard failure in a major region, the ripple spreads across industries. This outage therefore renews debates about concentration risk, mandatory resilience requirements, and the economics of multi‑cloud strategies.

How the cascade happened (a simplified SRE view)​

For technical teams, the outage provides a real‑world case study in dependency graphs and failure modes. A simplified sequence that matches public reporting is:
  • An internal control‑plane or endpoint resolution problem emerged in US‑EAST‑1 and affected specific managed services (reports indicated DynamoDB endpoints and/or DNS resolution as prominent symptoms).
  • Services that rely on those endpoints began returning errors or timing out. Because many applications expect successful API calls, those failures propagated to authentication, session creation, and application logic.
  • Client‑side components and caches served stale or failed responses while backlogs accumulated in message queues and event streams, producing prolonged recovery even after the immediate DNS/endpoint problem was mitigated.
  • Attempts to launch replacement compute resources (EC2 instances) for recovery were partially throttled or failed due to ongoing control‑plane constraints, slowing restoration.
This pattern — a control‑plane or DNS fault that prevents recovery actions — is especially toxic because it prevents the platform from self‑healing quickly. The result: longer mean time to recover (MTTR) and a larger blast radius.
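To make the dependency-graph framing concrete, the sketch below is a toy Python model (the service names are hypothetical and do not reflect AWS's actual topology) that marks every downstream service as impacted once a shared primitive fails, which is roughly the fan-out pattern described above.

```python
from collections import deque

# Hypothetical dependency graph: service -> services that depend on it.
# These names are illustrative only, not AWS's real internal topology.
DEPENDENTS = {
    "dynamodb-endpoint": ["auth-service", "session-service"],
    "auth-service": ["checkout", "admin-console"],
    "session-service": ["checkout", "mobile-api"],
    "checkout": ["storefront"],
    "mobile-api": [],
    "admin-console": [],
    "storefront": [],
}

def blast_radius(failed_root: str) -> set[str]:
    """Return every service transitively impacted by a single root failure."""
    impacted = {failed_root}
    queue = deque([failed_root])
    while queue:
        current = queue.popleft()
        for dependent in DEPENDENTS.get(current, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

if __name__ == "__main__":
    # A fault in one shared primitive degrades most of the stack.
    print(sorted(blast_radius("dynamodb-endpoint")))
```

Running the same traversal over a real dependency map helps identify which flows need an independent fallback before an incident, not during one.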

What AWS publicly said (and what remains tentative)​

AWS status updates made during the incident described increased error rates and mitigation actions applied across multiple Availability Zones in US‑EAST‑1, with progress toward recovery noted during the morning and into the afternoon. Engineers observed “early signs of recovery” and continued to process backlogs of queued requests. Several downstream providers echoed that the primary problem involved endpoint/DNS resolution for services such as DynamoDB and recommended standard client‑side mitigations like DNS cache flushes.
Caution: At the time of initial reporting AWS had not published a detailed post‑mortem that attributes the incident to a single coding error, configuration change, or hardware failure. Coverage and vendor timelines converged on DNS/control‑plane symptoms, but final, authoritative root‑cause analysis (with concrete trigger events and code/automation details) was not yet available in initial incident pages. Treat any specific trigger explanations that lack AWS’s formal post‑mortem as provisional or unverified.

Security considerations and opportunistic scams​

Major outages create fertile ground for opportunistic cybercriminal activity. Past incidents show spikes in phishing, credential‑harvesting pages, and social engineering aimed at confused users. During this outage, security firms warned of potential phishing campaigns spoofing outage notifications and fake support pages offering “status updates” or urging password resets.
Operational security teams should consider:
  • Flagging unusual support traffic and phishing attempts tied to outage narratives.
  • Enforcing multi‑factor authentication (MFA) and monitoring for anomalous login patterns.
  • Communicating clear, authoritative outage notices to customers and employees to prevent them from following fraudulent instructions sent via email, SMS, or social channels.

Lessons for engineering teams: practical resilience checklist​

This outage is a strong prompt for practical, actionable resilience planning. The following checklist prioritizes high‑value, implementable controls:
  • Multi‑region design: Architect critical services to operate across at least two geographically distinct regions and avoid single‑region defaults where possible.
  • Multi‑cloud or multi‑edge: For extremely critical paths (auth, payments), evaluate multi‑cloud redundancy or the use of independent CDN and edge compute platforms to reduce single‑vendor risk.
  • DNS and caching strategy: Lower DNS TTLs for dynamic endpoints where failover is necessary, and implement robust client‑side retry logic and exponential backoff (a minimal sketch follows this checklist). Ensure DNS resolvers and caching behavior are well understood.
  • Circuit breakers and graceful degradation: Implement circuit breakers, feature flags, and read‑only modes so apps can continue core functionality even when backend services fail.
  • Chaos engineering and tabletop runbooks: Regularly run failure injection and full‑system recovery drills. Runbooks should include explicit steps for when core cloud control planes fail.
  • Observability and alerting: Ensure end‑to‑end tracing and clear SLO/SLA dashboards so degradations are visible from user impact down to infrastructure components.
  • Contractual and cloud cost planning: Understand vendor SLAs, credits, and contractual remedies, and budget for the extra cost of active redundancy where needed.
Adopting this checklist won’t eliminate outages, but it will reduce blast radius and shorten recovery time.
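As a concrete illustration of the retry and circuit-breaker items above, here is a minimal, framework-free Python sketch; the thresholds, timeouts, and the callable passed in are illustrative assumptions rather than recommended values.

```python
import random
import time

class CircuitBreaker:
    """Open the circuit after repeated failures so callers fail fast
    instead of piling retries onto a struggling dependency."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe once the cool-down window has elapsed.
        return time.monotonic() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def call_with_backoff(call, breaker: CircuitBreaker, max_attempts: int = 4):
    """Retry `call` with capped exponential backoff and full jitter,
    failing fast whenever the circuit is open."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: failing fast")
        try:
            result = call()
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random slice of a capped backoff window.
            time.sleep(random.uniform(0, min(8.0, 0.5 * 2 ** attempt)))
```

Equivalent behaviour is usually available from SDKs, service meshes, or resilience libraries; the point is that retries must be bounded and jittered so clients do not amplify an upstream fault.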

Recommendations for administrators (step‑by‑step)​

  • Step 1: Confirm scope. Use independent monitoring and your own synthetic tests (see the probe sketch after these steps) to determine which services and endpoints are affected, rather than relying purely on external status pages.
  • Step 2: Switch to alternate regions or endpoints if they exist and are healthy. Validate cross‑region replication before switching production traffic.
  • Step 3: Activate degraded modes (read‑only, cached content) to preserve availability for essential user flows.
  • Step 4: Communicate proactively with customers; provide timelines, safe workarounds, and clear expectations. Public silence breeds speculation.
  • Step 5: After stabilization, start post‑incident analysis focused on root cause, detection gaps, and action items to prevent recurrence. Include postmortem timelines, concrete remediation owners, and measurable targets.
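A minimal version of the Step 1 probe could look like the following Python sketch; the endpoint names and URLs are placeholders for your own health-check routes.

```python
import concurrent.futures
import urllib.request

# Hypothetical endpoints to probe; substitute your own health-check URLs.
ENDPOINTS = {
    "checkout-api": "https://checkout.example.com/healthz",
    "auth": "https://auth.example.com/healthz",
    "static-assets": "https://cdn.example.com/ping",
}

def probe(name: str, url: str, timeout: float = 5.0) -> tuple[str, bool, str]:
    """Return (name, healthy?, detail) for one endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return name, resp.status == 200, f"HTTP {resp.status}"
    except Exception as exc:  # DNS failures, timeouts, TLS errors, 5xx, etc.
        return name, False, type(exc).__name__

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        results = pool.map(lambda item: probe(*item), ENDPOINTS.items())
    for name, healthy, detail in results:
        print(f"{name:15s} {'OK ' if healthy else 'FAIL'} {detail}")
```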

Recommendations for everyday users​

  • Expect intermittent access to apps that rely on cloud backends; retry failed actions rather than repeatedly refreshing.
  • If authentication codes, banking apps, or critical services are affected, avoid clicking on emails or links promising “immediate resolution” — verify via official status pages or vendor social accounts.
  • For smart‑home users: a temporary inability to reach cloud services does not always mean device failure — local device functionality may continue to operate. Wait for official vendor updates before resetting devices.

The business and regulatory implications​

This outage renews scrutiny on market concentration and systemic risk. Regulators and large enterprise customers increasingly question whether a small set of cloud providers should hold such disproportionate control over digital infrastructure. Topics likely to resurface include:
  • Mandatory resilience standards for critical services that cannot tolerate single‑provider failure.
  • Disclosure requirements for cloud dependence in regulated sectors (finance, health, government).
  • Insurance and contractual obligations around cascading outages and the economic damages they cause.
Large outages also feed debate about whether economies of scale in cloud lead to unhealthy centralization and whether incentivizing diverse providers would lower systemic risk. Expect increased conversation among enterprise boards, auditors, and regulators about these topics in the coming months.

What this means for AWS and the cloud industry​

AWS remains the dominant cloud provider by market share and revenue, and outages of this scale are rare relative to the sheer volume of operations the platform handles daily. That said, high‑visibility incidents erode customer confidence and invite competitive and regulatory pressures.
Two important dynamics to watch:
  • Engineering transparency: Customers and regulators will push for more detailed post‑mortems, timelines, and corrective actions to avoid repeat occurrences.
  • Customer behavior: Some organizations will double down on multi‑region and multi‑cloud strategies, while others will accept risk and focus resources on faster recovery and better monitoring. Both decisions have costs and tradeoffs.

Strengths and shortcomings in the response​

The response contained visible strengths: AWS engineers applied mitigations, status pages were updated throughout the outage, and many services recovered within hours. Several downstream vendors followed good practices by pushing graceful degradation and clear customer communications.
However, shortcomings remain notable:
  • The blast radius was large because of concentration in a single region and common endpoint dependencies.
  • Recovery was slowed by backlogs and throttling of recovery‑critical operations (e.g., launching new compute instances), illustrating how control‑plane constraints can impede remediation.
  • The absence (at early stages) of a detailed, definitive AWS public post‑mortem left customers and reporters relying on partial technical descriptions and vendor status pages. Until a full root‑cause report is published, some operational questions remain open.

Longer‑term risk outlook​

Cloud providers will invest more in reliability engineering and automation, but as the scale of cloud grows, so does the potential for novel failure modes. Key risk vectors to monitor:
  • Control‑plane complexity: As cloud services evolve, interdependencies between management layers increase the chance that control‑plane faults prevent recovery actions.
  • Default convenience: Many development and deployment templates default to a single region for simplicity, which concentrates risk. Education and tooling must make multi‑region the easier default for critical systems.
  • Supply‑chain and third‑party dependencies: SaaS providers that embed numerous third‑party services can inherit risks from multiple vendors simultaneously, amplifying outage impact.
Organizations will need to maintain active resilience engineering programs and to revisit assumptions about whether provider SLAs and architectural patterns match their downtime tolerance.

Closing analysis​

The October 20 outage is a stark reminder: the modern internet is fantastically capable, but still fragile when core infrastructure fails. The event should not be read as proof that cloud is flawed; rather, it is evidence that dependency management, resilient design, and operational preparedness must be first‑class disciplines for any organization that relies on third‑party cloud platforms.
For engineers and executives, the takeaways are concrete: treat default regions and managed services as design choices with explicit risk tradeoffs; invest in redundancy where the business cannot tolerate failure; and maintain real, practiced recovery playbooks that assume the unthinkable — that a major cloud region will be unreachable.
For users, the outage reinforces a simple truth: many of the apps you rely on are built on common foundations, and momentary global fragility can follow from a local failure. Patience, cautious verification of official communications, and the expectation that services will restore gradually — sometimes after backlogs are cleared — are the healthy responses.
The internet will recover, AWS will publish a post‑incident analysis in time, and engineers across the industry will once again iterate on defensive architectures. The practical work, however, is in the months after the outage: turning lessons into durable operational changes so that the next significant cloud failure has a smaller blast radius and a shorter recovery.

Source: TechRadar Amazon outage: Every website knocked offline by the huge AWS outage
 

The internet hiccupped in a way that is no longer tolerable as a mere inconvenience: a major Amazon Web Services (AWS) outage on October 20, 2025, exposed how concentrated cloud dependencies, brittle control‑plane primitives and optimistic architecture defaults can turn a single regional fault into hours of global disruption.

Background

Cloud computing is the backbone of modern software delivery: companies rent compute, storage and managed services from hyperscalers rather than owning and operating their own data centres. That architecture has enabled rapid innovation and huge cost efficiencies, but it also concentrates critical functionality in a handful of providers and in a few hot‑spots inside their infrastructures. The October 20 incident centered on AWS’s US‑EAST‑1 region (Northern Virginia), a long‑standing hub for many global control‑plane services and high‑volume managed primitives such as Amazon DynamoDB.
AWS publicly described the proximate trigger as DNS resolution failures for DynamoDB regional API endpoints in US‑EAST‑1, a symptom that cascaded into elevated error rates, throttles and impaired internal subsystems that slowed recovery even after the initial DNS issue was mitigated. The company published a timeline showing the DNS symptom was identified early in the event and that mitigations were applied while teams worked through backlogs and dependent impairments.

What happened (concise technical timeline)​

The visible timeline​

  • Around 3:11 AM Eastern Time on October 20, monitoring and customer reports spiked with timeouts and elevated error rates across services that use AWS’s US‑EAST‑1 region.
  • AWS’s status updates identified DNS resolution anomalies for the DynamoDB API as a potential root cause and began parallel mitigation efforts shortly thereafter.
  • Engineers applied mitigations that produced early signs of recovery within hours, but EC2 instance‑launch throttles and downstream message backlogs extended the tail of the outage for some customers well into the day.

The technical anatomy (why DNS + DynamoDB cascaded)​

DynamoDB is often used for small, high‑frequency control data: session tokens, feature flags, device state, throttles and other “tiny but vital” state pieces. When DNS resolution for a managed API fails, clients simply can’t reach the service—even if the underlying compute is healthy. Client SDKs and application code typically include aggressive retry logic; when many clients retry and internal control‑plane components also depend on the same endpoint, the resulting retry storms and cascading latencies amplify the failure. That precise interplay is what turned an apparently narrow name‑resolution problem into a multi‑hour, multi‑sector disruption.
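For teams using the AWS SDK, one way to blunt that amplification is to bound SDK retries and treat an unreachable endpoint as a degradation signal rather than something to hammer. The sketch below assumes boto3 is installed; the table and key names are hypothetical.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, EndpointConnectionError

# Cap SDK retries so a DNS/endpoint failure does not turn every request into
# several more; "standard" retry mode already applies backoff between tries.
dynamodb = boto3.client(
    "dynamodb",
    region_name="us-east-1",
    config=Config(
        retries={"max_attempts": 2, "mode": "standard"},
        connect_timeout=3,
        read_timeout=3,
    ),
)

def read_session(session_id: str):
    """Fetch a session record, degrading cleanly if the endpoint is unreachable.
    The table name and key schema here are hypothetical."""
    try:
        resp = dynamodb.get_item(
            TableName="sessions",
            Key={"session_id": {"S": session_id}},
        )
        return resp.get("Item")
    except EndpointConnectionError:
        # DNS/endpoint resolution failed: return a degraded result instead of
        # retrying in a tight loop and contributing to a retry storm.
        return None
    except ClientError:
        return None
```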

Who and what was affected​

The outage hit a broad cross‑section of consumer apps, enterprise platforms and even AWS’s own services: social networks, fintech apps, gaming back ends, smart‑home systems and government portals reported failures or degraded performance. High‑visibility platforms named in reporting included Snapchat, Reddit, Fortnite, Ring/Alexa, Venmo, Coinbase and a wide set of SaaS vendors and internal Amazon properties. Many of these services run critical control flows that touched DynamoDB or US‑EAST‑1 control‑plane features.
Financial software providers and banks—where a small state change can be required to complete a transaction—saw user‑facing failures that translated quickly into operational headaches. The incident also interrupted some vendor support channels that themselves run on AWS, complicating customer outreach during remediation. Reports and outage trackers registered tens of thousands of user incidents within minutes.

Why this outage matters: concentration, control‑plane fragility, and vendor lock‑in​

1) Market concentration creates systemic exposure​

The cloud infrastructure market is top‑heavy. Independent analysts estimate AWS accounted for roughly 30% of global cloud infrastructure spend in Q2 2025, with Microsoft Azure and Google Cloud making up most of the remainder. That market concentration means outages in a major region can have outsized, cross‑industry effects. When a single provider hosts the control planes and managed services that orchestrate millions of applications, failures are less likely to remain isolated.

2) Control‑plane primitives are now single points of failure​

Modern cloud platforms expose highly useful managed primitives—global identity services, managed NoSQL databases, serverless functions and global table replication. Teams build the convenience of these services into authentication flows, provisioning pipelines and runtime paths, often without the fallback modes needed for resilience. A fault in a control‑plane primitive (DNS, identity, or a managed database API) can therefore break both customer workloads and provider recovery mechanisms. The AWS October 20 incident is a textbook example.

3) Vendor lock‑in raises the cost of escape​

Switching providers is expensive and technically complex. Architectures that depend on provider‑specific primitives (for example, DynamoDB’s feature set or AWS‑specific SDK behaviors) create real migration friction. That, combined with data egress fees and re‑engineering costs, means customers are often effectively “locked in” and must absorb the risk of provider outages rather than moving away. The business calculus that pushed many companies to adopt hyperscale clouds—speed, scale and predictable pricing—now carries a systemic risk premium.

How organisations should rethink resilience (practical engineering guidance)​

The outage is a forceful reminder that resilience must be engineered deliberately. The following practical steps reduce exposure to similar events.

Multi‑region and multi‑cloud for critical paths​

  • Identify the small set of control‑plane services that must survive an outage (authentication, payment authorization, identity management).
  • For those flows, implement active multi‑region patterns or run parallel providers so that a regional API failure does not stop core business functions. This can include multi‑region DynamoDB global tables, cross‑region leader election and geo‑distributed caches (a minimal failover sketch follows below).
Multi‑cloud has operational complexity and cost, but it is the most effective way to remove a single vendor’s control‑plane as the only escape hatch for critical operations.
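As a sketch of what an active multi-region read path can look like, the following assumes a DynamoDB global table replicated to two regions; the table name, key schema and region pair are illustrative, and writes need additional care because global tables resolve conflicts with a last-writer-wins policy.

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Assumes a global table named "feature-flags" replicated to both regions;
# the table name, key schema and regions are illustrative.
REGIONS = ["us-east-1", "us-west-2"]
TABLE = "feature-flags"

clients = {region: boto3.client("dynamodb", region_name=region) for region in REGIONS}

def read_flag(flag_name: str):
    """Read from the primary region first, then fail over to the replica."""
    last_error = None
    for region in REGIONS:
        try:
            resp = clients[region].get_item(
                TableName=TABLE,
                Key={"name": {"S": flag_name}},
            )
            return resp.get("Item")
        except (BotoCoreError, ClientError) as exc:
            last_error = exc  # endpoint unreachable or throttled: try the next region
    raise last_error
```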

Harden DNS and discovery​

  • Use resilient DNS configurations and multiple authoritative DNS providers.
  • Add client‑side caching with sensible TTLs and fallback IP addresses or alternate endpoints (a resolver fallback sketch follows this list).
  • Build SDKs that fail fast with controlled backoff and circuit breakers to avoid retry storms. Treat DNS as a first‑class failure mode.
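The resolver sketch below illustrates the fallback idea in plain Python; the hostname, fallback addresses and TTL are assumptions, and pinned addresses must be refreshed out of band and handled carefully with TLS (certificate and SNI validation still apply).

```python
import socket
import time

# Hypothetical fallback map: last-known-good addresses recorded out of band.
FALLBACK_ADDRESSES = {
    "api.example.com": ["203.0.113.10", "203.0.113.11"],
}

_cache: dict[str, tuple[list[str], float]] = {}
CACHE_TTL = 300  # seconds: keep recent resolutions usable if DNS degrades

def resolve(hostname: str) -> list[str]:
    """Resolve a hostname, falling back to a recent cache entry and then to
    statically configured addresses if live DNS resolution fails."""
    now = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, 443, type=socket.SOCK_STREAM)
        addresses = sorted({info[4][0] for info in infos})
        _cache[hostname] = (addresses, now)
        return addresses
    except socket.gaierror:
        cached = _cache.get(hostname)
        if cached and now - cached[1] < CACHE_TTL:
            return cached[0]
        if hostname in FALLBACK_ADDRESSES:
            return FALLBACK_ADDRESSES[hostname]
        raise
```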

Design graceful degradation​

  • Define a minimum viable experience: what must remain available when downstream APIs fail?
  • Implement read‑only modes, cached responses, local queues and offline workflows so that at least essential functionality remains usable during outages (a minimal stale‑cache fallback sketch follows).
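A minimal stale-cache fallback, assuming an in-memory cache and a tolerance for serving data up to 15 minutes old, could look like this; the cache backend, staleness window and the profile loader are placeholders.

```python
import functools
import time

# Last known-good responses; in production this might be Redis or a local snapshot.
_last_good: dict[str, tuple[object, float]] = {}
STALE_LIMIT = 900  # seconds of staleness we are willing to serve when degraded

def degrade_to_cache(key_fn):
    """On backend failure, serve the last known-good value, flagged as stale."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            key = key_fn(*args, **kwargs)
            try:
                value = func(*args, **kwargs)
                _last_good[key] = (value, time.monotonic())
                return {"data": value, "stale": False}
            except Exception:
                cached = _last_good.get(key)
                if cached and time.monotonic() - cached[1] < STALE_LIMIT:
                    return {"data": cached[0], "stale": True}
                raise  # nothing cached recently enough: surface the failure
        return inner
    return wrap

@degrade_to_cache(key_fn=lambda user_id: f"profile:{user_id}")
def load_profile(user_id: str):
    """Placeholder for a real backend read (e.g., a managed database call)."""
    raise ConnectionError("backend unreachable")  # simulates the outage path
```

A successful call seeds the cache, so during an outage recently active users keep a read-only view while brand-new lookups fail fast.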

Chaos engineering and runbooks​

  • Regularly exercise catastrophe scenarios—control‑plane failures, DNS anomalies, cross‑region partitions (a failure‑injection test sketch follows this list).
  • Validate runbooks in non‑production and run live failover drills to ensure teams can enact fallbacks under stress. Real outages reveal runbook gaps quickly; table‑top exercises do not.
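A small failure-injection test, written here against pytest conventions (the hostname and the fallback behaviour are illustrative), shows the kind of check worth running routinely.

```python
import socket

def fetch_config(resolver=socket.getaddrinfo):
    """Hypothetical code path that depends on DNS resolution of a config host."""
    try:
        resolver("config.example.com", 443)
        return {"source": "remote"}
    except socket.gaierror:
        return {"source": "local-fallback"}  # the degraded behaviour we expect

def test_survives_dns_failure():
    """Failure injection: simulate an unresolvable endpoint and assert the
    code degrades to its local fallback instead of crashing."""
    def broken_resolver(*args, **kwargs):
        raise socket.gaierror("simulated DNS outage")

    assert fetch_config(resolver=broken_resolver) == {"source": "local-fallback"}
```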

Vendor governance and procurement changes​

  • Demand better telemetry and a timeline of remediations from providers as contract obligations.
  • Include outage clauses, forensic commitments and service credits that reflect systemic dependencies, not just per‑minute availability. Regulators and large enterprise buyers will increasingly treat cloud providers as critical third parties.

Edge computing and decentralisation: realistic options and limits​

Edge computing—processing and storage closer to users or on-prem devices—reduces latency and can move some state off hyperscaler control planes. Combined with multi‑cloud, edge architectures can improve resilience and data sovereignty. But edge and decentralisation are not panaceas: they introduce operational cost, complexity and consistency challenges, especially for stateful systems and transactional workloads.
  • Benefits: reduced blast radius, improved regulatory control for sensitive data, faster local responses.
  • Trade‑offs: higher operational overhead, complex data consistency, and the need for reliable orchestration across many nodes.
Edge plus multi‑cloud is the practical middle path: keep critical control flows in places you can restart or patch quickly while leveraging hyperscalers for scale‑heavy, non‑critical workloads.

The policy and market response that will likely follow​

Large, visible outages attract regulatory interest; financial services and public‑sector systems are particularly sensitive to third‑party risks. Expect near‑term activity across three fronts:
  • Procurement and compliance teams will demand more resilient SLAs and post‑incident forensic reports from cloud vendors.
  • Regulators may accelerate frameworks for “critical third‑party” oversight of hyperscalers where public services depend on commercial infrastructure.
  • Customers—especially large enterprises—will reassess where to place mission‑critical control planes and may invest in vendor diversification strategies even at higher cost. Market research shows AWS still leads the infrastructure market by a wide margin, meaning these changes will be gradual rather than sudden.

Strengths in the response — and real gaps​

The incident also shows what hyperscalers do well. AWS mobilised engineering resources quickly, published status updates and executed staged mitigations that restored broad service availability within hours. Those capabilities—massive operations teams, telemetry systems and runbooks—are part of why customers rely on hyperscalers in the first place.
At the same time, gaps remain:
  • Opaque post‑incident detail: customers and regulators will demand richer, faster post‑mortems that go beyond “DNS was involved” to explain causal chains, configuration changes, and specific mitigations.
  • Control‑plane coupling: recovery was impeded because some internal AWS subsystems that support remediation depended on the same primitives that were failing (a classic circular dependency). That structural fragility requires design fixes.
  • Communications tempo: while public status updates were provided, community telemetry and third‑party probes often surfaced actionable details faster than official channels—an uncomfortable signal for customers who need timely, authoritative information.

A short playbook for Windows admins, SREs and IT leaders​

  • Map dependencies: Identify which systems talk to DynamoDB or other single‑region control planes and classify them by business impact.
  • Add out‑of‑band admin paths: Ensure identity providers, password vaults and emergency admin tools are accessible even if core cloud APIs are impaired.
  • Cache aggressively on the client and server where consistency requirements permit, and apply read‑only fallbacks for non‑critical flows.
  • Monitor multiple sources: combine provider status pages with independent probes and public outage trackers so detection does not depend solely on the vendor (a combined‑probe sketch follows this list).
  • Practice the plan: run chaos engineering tests to validate your multi‑region failovers and escalation channels.
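The "multiple sources" item can be as simple as the sketch below, which compares your own externally hosted probe with a provider status fetch; both URLs are placeholders, and real tooling would parse the status feed rather than only checking that it loads.

```python
import urllib.request

# Hypothetical URLs: substitute your provider's status feed and your own
# externally hosted synthetic check.
PROVIDER_STATUS_URL = "https://status.example-cloud.com/feed.json"
OWN_PROBE_URL = "https://healthcheck.example.com/app"

def fetch_ok(url: str, timeout: float = 5.0) -> bool:
    """True if the URL responds with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def assess() -> str:
    provider_ok = fetch_ok(PROVIDER_STATUS_URL)  # feed reachable (content not parsed here)
    app_ok = fetch_ok(OWN_PROBE_URL)
    if app_ok:
        return "app healthy (regardless of provider reporting)"
    if provider_ok:
        return "app failing while provider feed is reachable: investigate your own stack"
    return "app failing and provider feed degraded: likely upstream incident"

if __name__ == "__main__":
    print(assess())
```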

Bigger questions: who should bear the cost of resilience?​

The outage reignites a policy debate: should society treat hyperscale cloud as private infrastructure with public responsibilities? When critical public services rely on privately‑owned cloud regions, outages can have consequences that go beyond commercial inconvenience. That tension will shape policy discussions about mandatory reporting, resilience testing and possibly incentives for regional diversification or local cloud options. Markets will react too—customers who can afford stronger resilience will pay for it, while smaller players will remain exposed. The resulting stratification is a commercial reality that will influence cloud adoption patterns going forward.

Cross‑checking the claims (what’s verified, what remains provisional)​

  • Verified: AWS acknowledged the outage and documented DNS resolution problems affecting DynamoDB in US‑EAST‑1; the company reported mitigations and staged recovery actions. Public status updates and AWS’s own communications confirm those points.
  • Verified: Major consumer and enterprise services reported user‑facing failures correlated with the AWS event; independent reporters (Reuters, The Verge, Wired) documented the same set of impacted platforms.
  • Cross‑referenced market context: AWS’s market share and the dominance of the top three providers are supported by independent analyst data and reporting, establishing why a regional failure has large systemic effects.
  • Provisional / Unverified: Some narratives about the exact internal chain of causal events—specific configuration changes or human errors that triggered DNS failure—must await AWS’s formal, detailed post‑incident report. Until that post‑mortem is released, deeper causal assertions should be treated as hypotheses.

What this means for the future of “the cloud”​

The October 20 outage will not (and should not) reverse cloud adoption. Hyperscalers provide indispensable scale, rapid innovation and economic efficiency that many organisations can’t replicate on their own. But the event will change behaviour and expectations: resilience engineering will no longer be a niche discipline for large enterprises; it will be a board‑level concern for every business that runs important digital services. Procurement will change, architectures will become more defensive, and regulators will press for more visibility into critical infrastructure dependencies.
Concretely, expect:
  • More multi‑region and multi‑cloud planning for essential control flows.
  • Greater emphasis on edge and on‑prem options for regulated workloads and data‑sovereign applications.
  • Stronger vendor obligations in contracts and a wave of updated procurement practices in regulated industries.

Conclusion​

The AWS outage on October 20, 2025, was a blunt demonstration of a well‑known trade‑off: cloud hyperscalers deliver extraordinary capability at the cost of concentrated systemic fragility. The proximate symptom—DNS resolution problems for DynamoDB endpoints in US‑EAST‑1—was simple to state, but its effects were complex and widespread because of how modern applications weave managed primitives into critical paths. The right response is not to abandon the cloud but to design, test and govern cloud reliance as a first‑class strategic concern. Organisations that act quickly—identifying critical control planes, hardening DNS and discovery, investing in multi‑region fallbacks, and practising failure scenarios—will convert this painful lesson into enduring resilience.
The outage should force a practical reckoning: convenience must be balanced with contingency, and scale must be matched by accountable, tested resilience. The internet’s plumbing has always been vulnerable; professionalising the discipline of resilience across engineering, procurement and policy is the necessary work now before the next “bad day.”

Source: Down To Earth An Amazon outage has rattled the internet. A computer scientist explains why the ‘cloud’ needs to change
 
