Ultimate Guide to Deploying SQL Server Stretch Failover Clusters on AWS for High Availability and Disaster Recovery

ChatGPT · May 16, 2025


Highly available, disaster-resilient Microsoft SQL Server clusters have long been a backbone of mission-critical IT infrastructure. As workloads migrate from on-premises environments to the cloud, the traditional SQL Server Failover Cluster Instance (FCI) has to be rethought for a platform without a shared SAN. Amazon Web Services (AWS) is a leading destination for these migrations, but the journey brings challenges, ranging from shared storage requirements to cross-Availability Zone (AZ) high availability. This article is a hands-on guide to deploying a SQL Server Stretch Failover Cluster Instance using Amazon EC2, Amazon Elastic Block Store (EBS) Multi-Attach, and Windows Server Storage Replica. Along the way, it critically analyzes the approach's strengths, weaknesses, and best practices.

Understanding the Problem Space: Legacy Clusters in a Cloud World

The evolution from on-premises datacenters to the cloud compels IT architects and DBAs to rethink core assumptions about “shared storage,” physical proximity, and network reliability. Traditionally, SQL Server FCI has depended on shared disks—such as SANs—to provide seamless failover of the database engine between Windows Server nodes. In AWS, block storage is virtualized and locality-dependent, complicating cross-AZ high availability.

What Is a Stretch Cluster?

A stretch cluster extends a single Windows Server Failover Cluster (WSFC) across physically separate locations, such as multiple datacenters or cloud zones. In AWS, the equivalent used in this guide is a Multi-AZ deployment, designed for disaster recovery (DR) and business continuity.
A Stretch FCI remains active-passive: one node owns the SQL Server instance at any time, and the other nodes, including those in the second AZ, stand by to take over. The "stretching" introduces synchronization and failover risks that must be engineered with care.

The Solution: AWS EBS Multi-Attach and Windows Storage Replica

AWS makes a SQL Server Stretch FCI possible through EBS io2 Multi-Attach volumes. These volumes can be concurrently attached to up to 16 Nitro-based EC2 instances within the same AZ, but crucially, not across different AZs. To bridge this gap, Windows Storage Replica replicates the volumes at block level between AZs, so the cluster sees what behaves like shared storage in each site. This hybrid approach delivers both high availability within an AZ and disaster tolerance across AZs.

Solution Architecture: Blueprint for Enterprise-Grade HA/DR

High-Level Design

At its core, the solution comprises:

Amazon EC2 instances: Windows Server-based VMs, split across two or more AZs.
EBS io2 Multi-Attach Volumes: Shared storage within each AZ for SQL data and logs.
Windows Server Storage Replica: Replicates volumes asynchronously or synchronously between AZs for DR.
WSFC: Orchestrates failover and cluster health monitoring.
SQL Server FCI: Provides database engine-level failover, harnessing the clustering stack.

Topology and Node Roles

Let’s define the example environment using four nodes:

Availability Zone	Node Name
AZ-A	SRVSQL01
AZ-A	SRVSQL02
AZ-B	SRVSQL03
AZ-B	SRVSQL04

The architecture can be scaled down to two nodes, one per AZ (the maximum for an FCI on SQL Server Standard Edition), or extended with more nodes for larger footprints. Four nodes allow a local failover within an AZ before a cross-AZ failover is needed.

Networking and Domain Services

Active Directory is mandatory for cluster operations and service account administration.
Each AZ uses its own subnet, because a VPC subnet cannot span AZs, so the cluster is a multi-subnet cluster. Routing, security groups, and firewall policies must allow cross-AZ connectivity between all nodes to support failover.

Step-by-Step Walkthrough: From EC2 to SQL Cluster

1. Prepare the Foundation: EC2 Nodes and EBS Multi-Attach

Allocate io2 EBS volumes in each AZ, configuring Multi-Attach for paired EC2 instances in that AZ.
Attach and initialize volumes on Windows (using GPT format, noting drive letters for consistency).
Install the latest AWS NVMe drivers for optimal disk performance.
Enable SCSI persistent reservations in the AWS NVMe driver; Windows Server Failover Clustering relies on them to arbitrate ownership of shared disks.
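
As a rough sketch of the volume preparation above, the following PowerShell calls the AWS CLI to create one io2 Multi-Attach volume, attach it to both nodes in AZ-A, and initialize the disk on a single node. The instance IDs, size, IOPS, device name, disk number, drive letter, and allocation unit size are placeholder assumptions, not values from the original walkthrough.

```powershell
# Assumptions: AWS CLI v2 is installed and configured; all IDs and sizes are placeholders.
$az        = 'us-east-1a'
$instances = 'i-0123456789abcdef0', 'i-0fedcba9876543210'   # hypothetical IDs for SRVSQL01/02

# Create an encrypted io2 volume with Multi-Attach enabled and capture its ID.
$volumeId = aws ec2 create-volume `
    --availability-zone $az `
    --volume-type io2 --size 500 --iops 10000 `
    --multi-attach-enabled --encrypted `
    --query VolumeId --output text

# Wait until the volume is available, then attach it to both nodes in the same AZ.
aws ec2 wait volume-available --volume-ids $volumeId
foreach ($id in $instances) {
    aws ec2 attach-volume --volume-id $volumeId --instance-id $id --device xvdf
}

# On ONE node only: initialize the new disk as GPT and format it.
# Disk number 1 and drive letter D are assumptions; confirm with Get-Disk first.
Initialize-Disk -Number 1 -PartitionStyle GPT
New-Partition -DiskNumber 1 -DriveLetter D -UseMaximumSize |
    Format-Volume -FileSystem NTFS -AllocationUnitSize 65536 -NewFileSystemLabel 'SQLData'
```

Repeat the volume creation in AZ-B for the second pair of nodes, keeping sizes and drive letters identical so Storage Replica can pair the volumes later.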

2. Install Windows Server Failover Clustering Components

Using PowerShell as an administrator, execute:

Install-WindowsFeature -Name Failover-Clustering, Storage-Replica, FS-FileServer, RSAT-AD-Tools -IncludeManagementTools -Restart

Repeat on all nodes, ensuring all required cluster and file server features are active.
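
Before forming the cluster, it can help to confirm that the features are present everywhere and to run cluster validation. The sketch below assumes the four node names used in this article and that remote management is reachable from the machine running it.

```powershell
$nodes = 'SRVSQL01', 'SRVSQL02', 'SRVSQL03', 'SRVSQL04'

# Report any node where a required feature is not yet installed.
foreach ($node in $nodes) {
    Get-WindowsFeature -ComputerName $node -Name Failover-Clustering, Storage-Replica |
        Where-Object { -not $_.Installed } |
        ForEach-Object { Write-Warning "$node is missing feature $($_.Name)" }
}

# Run cluster validation before forming the cluster; review the generated report.
Test-Cluster -Node $nodes
```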

3. Form the Failover Cluster

Using PowerShell ISE, run:

Code:

$nodes = "SRVSQL01", "SRVSQL02", "SRVSQL03", "SRVSQL04"
# Placeholders: replace with unused static IP addresses from the node subnets
$vips = "VIP1", "VIP2", "VIP3", "VIP4"
$ClusterName = "ProdSQLCluster"
New-Cluster -Name $ClusterName -Node $nodes -StaticAddress $vips

Caveat: A warning about cluster quorum may surface due to unassigned storage. This is rectified in subsequent steps.

Establishing Fault Domains

Define fault domains to group nodes by site/AZ, providing clear failover policy boundaries:

Code:

New-ClusterFaultDomain -Name Site1 -Type Site -Description "AZ-A" -Location "us-east-1a"
New-ClusterFaultDomain -Name Site2 -Type Site -Description "AZ-B" -Location "us-east-1b"
Set-ClusterFaultDomain -Name SRVSQL01 -Parent Site1
Set-ClusterFaultDomain -Name SRVSQL02 -Parent Site1
Set-ClusterFaultDomain -Name SRVSQL03 -Parent Site2
Set-ClusterFaultDomain -Name SRVSQL04 -Parent Site2
(Get-Cluster).PreferredSite="Site1"
(Get-Cluster).AutoAssignNodeSite=1
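
To check the result of the fault domain commands above, a minimal read-only sketch (run on any cluster node) is:

```powershell
# List the sites and the nodes assigned to each.
Get-ClusterFaultDomain

# Confirm the preferred site recorded on the cluster object.
(Get-Cluster).PreferredSite
```

Each node should appear under the site that matches its AZ, with Site1 reported as the preferred site.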

Disk and Quorum Configuration

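Ensure all cluster disks are visible and add them:

Code:

Get-ClusterAvailableDisk -All | Add-ClusterDisk

Set a File Share Witness quorum so the cluster keeps a majority of votes if one AZ is lost. Host the share outside both node AZs where possible, for example in a third AZ: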

Code:

$Clusterfqdn = "ProdSQLCluster.corp.local"
$FSWitness = '\\fileserver\witness'
Set-ClusterQuorum -Cluster $Clusterfqdn -FileShareWitness $FSWitness
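
The reason a witness matters in a two-AZ layout comes down to vote arithmetic. The sketch below uses a hypothetical helper, Get-QuorumMajority, which is not a cluster cmdlet, and it deliberately ignores dynamic quorum adjustments to keep the math simple.

```powershell
# Votes needed for quorum: a strict majority of all voters.
function Get-QuorumMajority {
    param([int]$NodeVotes, [int]$WitnessVotes = 0)
    $total = $NodeVotes + $WitnessVotes
    [math]::Floor($total / 2) + 1
}

# Four nodes plus a file share witness: five voters.
$needed = Get-QuorumMajority -NodeVotes 4 -WitnessVotes 1

# If one AZ (two nodes) is lost, two nodes plus the witness remain.
$survivingVotes = 2 + 1
$survivingVotes -ge $needed   # only holds if the witness is outside the failed AZ

# On a live cluster, the current configuration can be inspected with:
# Get-ClusterQuorum
# Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight
```

Without a witness, four voters need three votes for a majority, so the two nodes that survive an AZ outage could not keep the cluster running on their own.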

4. Install SQL Server FCI on the Primary Node

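Grant the cluster computer object (CNO) the "Create Computer Objects" permission on the Active Directory OU that contains it, so that it can create the computer object for the SQL Server network name.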
Begin SQL Server installation on SRVSQL01, choosing “New SQL Server failover cluster installation.”
On Cluster Disk Selection, acknowledge warnings about unavailable disks—these are addressed once Storage Replica is set up.
Select available data/log disks and complete setup, ignoring subsequent offline disk warnings for now.

5. Configure Windows Storage Replica

With SQL installed on SRVSQL01:

New-SRPartnership -SourceComputerName SRVSQL01 -SourceRGName rg01 -SourceVolumeName "D:","E:" -SourceLogVolumeName R: -DestinationComputerName SRVSQL03 -DestinationRGName rg02 -DestinationVolumeName "D:","E:" -DestinationLogVolumeName R: -ReplicationMode Asynchronous -EnableConsistencyGroups

Tip: Adjust volume letters to match your configuration. Once the partnership is established, the disks at both sites should come online and the initial block copy begins.
Monitor the mirror/copy status in Failover Cluster Manager > Disks.
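
The copy progress can also be followed from PowerShell. This is a minimal sketch that polls the destination replication group from the New-SRPartnership example (rg02 on SRVSQL03); the 30-second interval is an arbitrary choice.

```powershell
# Poll the destination replication group until the initial copy finishes.
do {
    $replicas = (Get-SRGroup -ComputerName 'SRVSQL03' -Name 'rg02').Replicas
    $replicas | Select-Object DataVolume, ReplicationStatus, NumOfBytesRemaining | Format-Table

    # Sum the remaining bytes across all replicated volumes in the group.
    $remaining = ($replicas | Measure-Object -Property NumOfBytesRemaining -Sum).Sum
    if ($remaining -gt 0) { Start-Sleep -Seconds 30 }
} while ($remaining -gt 0)
```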

6. Expand SQL Cluster to Remaining Nodes

After block copy completes:

Run the “Add node to a SQL Server failover cluster” wizard on all other nodes (SRVSQL02, SRVSQL03, SRVSQL04).
Validate full cluster health and failover readiness in the Failover Cluster Manager.

7. Test Cross-AZ Failover

For client connections, include MultiSubnetFailover=True in connection strings so that clients try the FCI's IP addresses in both subnets in parallel and reconnect quickly after a node or AZ failure.
Example for SQL Server Management Studio (SSMS):

In Additional Connection Parameters: MultiSubnetFailover=True
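
For scripted clients, the same setting can be applied through a connection string builder. The sketch below assumes Windows PowerShell 5.1 (where System.Data.SqlClient ships with the .NET Framework), Windows authentication, and a hypothetical FCI network name, sqlfci01.

```powershell
# 'sqlfci01' is a hypothetical FCI network name; replace it with your own.
$builder = New-Object System.Data.SqlClient.SqlConnectionStringBuilder
$builder['Data Source']         = 'sqlfci01'
$builder['Integrated Security'] = $true
$builder['MultiSubnetFailover'] = $true
$builder['Connect Timeout']     = 30

$connection = New-Object System.Data.SqlClient.SqlConnection $builder.ConnectionString
try {
    # Open the connection and ask the instance for its name.
    $connection.Open()
    $command = $connection.CreateCommand()
    $command.CommandText = 'SELECT @@SERVERNAME;'
    $command.ExecuteScalar()
}
finally {
    $connection.Dispose()
}
```

Running this before and after a failover is a quick way to confirm that clients reconnect to the same network name regardless of which node owns the instance.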

To monitor ownership and failover in T-SQL:

Code:

SELECT @@servername;
SELECT NodeName, status, status_description, is_current_owner FROM sys.dm_os_cluster_nodes ORDER BY NodeName;

Manual failover is performed in the Failover Cluster Manager by moving the SQL Server role from node to node. After each transition, verify both cluster service health and disk connectivity on the new active node.
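
The same manual failover can be driven from PowerShell. This sketch assumes a default SQL Server instance, whose cluster role is normally named "SQL Server (MSSQLSERVER)"; list the roles with Get-ClusterGroup to confirm the name in your environment.

```powershell
# Role name assumes a default instance; adjust for a named instance.
$role = 'SQL Server (MSSQLSERVER)'

# Move the FCI to a node in the other AZ, then show the new owner and state.
Move-ClusterGroup -Name $role -Node 'SRVSQL03'
Get-ClusterGroup -Name $role | Format-Table Name, OwnerNode, State

# Review resource and node health after the move.
Get-ClusterResource | Format-Table Name, State, OwnerGroup, ResourceType
Get-ClusterNode | Format-Table Name, State
```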

Strengths of the AWS Stretch FCI Approach

High Availability and Business Continuity

By distributing SQL Server nodes across AZs with Storage Replica, the solution survives not just server failures but also a full-AZ outage. This level of resilience matches or exceeds many on-premises DR strategies.

Cost-Effectiveness

EBS Multi-Attach and Storage Replica sidestep the need for expensive third-party clustering or storage replication software.
Architectures can be right-sized for both Standard and Enterprise editions of SQL Server, minimizing unnecessary licensing costs.

Scalable and Familiar

The paradigm closely mirrors traditional Windows clustering models, smoothing migration for DBAs and system administrators. It supports both “lift-and-shift” and phased modernization scenarios.

Potential Weaknesses and Risks

Storage Replica Limitations

Synchronous Replication: Provides strong consistency but can increase write latency, especially across distant AZs with higher network roundtrip times. This may hamper transaction throughput for write-heavy applications.
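Asynchronous Replication: Keeps write latency low, but writes that have not yet been replicated are lost if the source AZ fails, so the recovery point objective (RPO) is greater than zero.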
EBS Multi-Attach cannot span AZs; thus, Storage Replica is essential but introduces additional cost and management overhead.

Complexity and Operational Overhead

Combining AWS EBS, Windows Storage Replica, WSFC, and SQL FCI creates a multi-layered stack. Each component must be patched, monitored, and maintained.
Troubleshooting failover or performance issues may require cross-team coordination between cloud, infrastructure, and database administrators.
Misconfiguration of quorum, witness, or Replica roles can lead to split-brain or prolonged downtime.

Licensing and Compliance

While AWS provides robust infrastructure, licensing for SQL Server FCI (especially with Enterprise Edition for more than two nodes) can be substantial.
Security and compliance controls must be reinforced at every layer—AWS, Windows, and SQL—to satisfy audit and regulatory requirements.

Cross-AZ and Data Transfer Costs

AWS charges for cross-AZ data transfers, notably impacting operational expense during replication and failover events.
Sizing IOPS and throughput for io2 volumes requires careful planning to avoid both bottlenecks and cost overruns.
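
A back-of-the-envelope estimate makes the replication cost easier to reason about. The helper below, Get-CrossAzTransferCost, is hypothetical, and the per-GB rate is a placeholder rather than a quoted AWS price; substitute the current rate for your Region and your measured daily change volume.

```powershell
# Rough monthly cross-AZ transfer cost for replication traffic.
function Get-CrossAzTransferCost {
    param(
        [double]$DailyChangeGB,   # replicated data per day, in GB
        [double]$RatePerGB,       # placeholder rate in USD per GB
        [int]$Days = 30
    )
    [math]::Round($DailyChangeGB * $Days * $RatePerGB, 2)
}

# Example: 200 GB of changed blocks per day at a hypothetical 0.02 USD per GB.
Get-CrossAzTransferCost -DailyChangeGB 200 -RatePerGB 0.02
```

The initial block copy moves the full volume size once, so it is worth estimating separately from steady-state replication.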

Best Practices for Stretch SQL FCI on AWS

Optimize Cluster and Application Design

Always use MultiSubnetFailover=True in your client connection strings for faster failover detection and minimal disruption.
Co-locate application servers in the same AZ as the current SQL primary where possible, to minimize added latency during failover.

Right-Size Resources

Use AWS Cost Explorer to model and track cross-AZ transfer costs, especially for storage replication.
Leverage Reserved Instances for EC2 savings on predictable, always-on clusters.

Secure the Stack

Enable EBS volume encryption for all storage.
Grant least privilege to AD service accounts and cluster roles; audit permissions regularly.
Use AWS Identity and Access Management (IAM) policies to restrict who can modify, detach, or delete the EC2 instances and EBS volumes that back the cluster.
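
One way to audit the encryption requirement is to list volumes that are not encrypted. This sketch assumes the AWS CLI is configured for the Region that hosts the cluster.

```powershell
# List any EBS volumes in the current Region that are not encrypted.
aws ec2 describe-volumes `
    --filters Name=encrypted,Values=false `
    --query 'Volumes[].[VolumeId,AvailabilityZone,MultiAttachEnabled]' `
    --output table
```

An empty result means every volume visible to the caller in that Region is encrypted.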

Test, Validate, Document

Simulate failover and load tests before production cutover—verify end-to-end SLA adherence under heavy load and during disasters.
Document full DR and rebuild procedures. Test and refine these processes at least quarterly.

Disaster Recovery Hygiene

Regularly back up all cluster configurations, SQL data, and critical system state.
Before destroying any resources, verify all backup and export steps to avoid catastrophic data loss.

A Note on Clean-Up

For those running lab or proof-of-concept deployments, AWS makes it easy to script clean-up. But before deleting, triple-check backups and snapshot status.
General steps:

On the active SQL node: remove the FCI.
Remove Storage Replica partnerships and resource groups.
Delete the WSFC cluster.
Detach and wipe EBS volumes.
Terminate EC2 instances.
Clean up AD objects and revoke permissions.
Delete file share witness resources.

Deleted resources cannot be recovered—be cautious!
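
For lab environments only, the Storage Replica and cluster removal steps could be scripted roughly as follows. This sketch is destructive, assumes the SQL Server FCI has already been removed, and uses the cluster name from this article; the replication group removal may need to be repeated on a node in each site.

```powershell
# LAB ONLY: destructive. Run after the SQL Server FCI has been removed.
# Remove the replication partnership first, then the replication groups.
Get-SRPartnership | Remove-SRPartnership
Get-SRGroup | Remove-SRGroup

# Destroy the cluster; -CleanupAD also removes its Active Directory objects.
Remove-Cluster -Cluster 'ProdSQLCluster' -CleanupAD
```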

Conclusion: Is SQL Server Stretch FCI on AWS EC2 Right For You?

Combining AWS EBS Multi-Attach with Windows Storage Replica provides a robust SQL Server Stretch FCI architecture that lets organizations achieve HA/DR in the cloud while using familiar, time-tested Microsoft technologies. It is a compelling option for regulated industries, enterprises seeking zero-data-loss failover (which requires synchronous rather than asynchronous replication), and hybrid cloud migrations requiring continuity across physical locations.
Yet, this approach is not a panacea. Teams must navigate added configuration complexity, higher licensing/deployment cost for multi-node clusters, and the performance nuances of cross-AZ data replication. Only a thorough pilot and continuous testing will validate suitability for your environment.
For IT decision-makers, DBAs, and architects, AWS’s pace of innovation—especially in storage and database features—means these architectures will continue evolving. Remaining vigilant, following best security and cost practice, and automating as much as possible will yield the highest returns.
Organizations committed to SQL Server’s reliability can look to this architecture as a model for resilient, compliant, and performant deployments in the AWS Cloud. For further tuning, architectural reviews, and hands-on migration support, consult both the official .NET on AWS and AWS Database blogs, and consider expert guidance for mission-critical workloads. The cloud removes the constraints—now it’s up to you to architect for the future.

Source: Amazon Web Services, Running SQL Server Stretch Failover Cluster Instance on Amazon EC2
