Three AZ SQL Server FCI on EC2 with Storage Spaces Direct (S2D)

This article walks through a practical, production-grade design for running a three-node SQL Server Failover Cluster Instance (FCI) across three AWS Availability Zones using Storage Spaces Direct (S2D) built on locally attached Amazon EBS volumes. The pattern brings true multi-AZ cluster resilience to Windows Server clusters on EC2 while avoiding the two-AZ or single-AZ constraints of some managed options. It covers the architecture, prerequisites, step-by-step deployment guidance (PowerShell-centric), and a hard-nosed analysis of operational tradeoffs, licensing, performance tuning, and failure modes, so Windows administrators and DBAs can decide whether this approach fits their mission-critical SQL Server workloads.

[Figure: Windows Server with Storage Spaces Direct spanning AZ A, AZ B, and AZ C, connected to the AWS cloud.]

Background / Overview

Microsoft’s Storage Spaces Direct (S2D) is a software‑defined storage subsystem built into Windows Server Datacenter that pools direct‑attached storage from cluster nodes, implements replication/striping/tiering, and relies on the SMB3 stack (SMB Direct / RDMA) to synchronize data between nodes. S2D can operate in “guest cluster” scenarios — that is, inside VMs — making it possible to aggregate local virtual disks (EBS volumes) into a shared, highly available storage pool for SQL Server FCIs. Microsoft’s S2D documentation provides the official deployment and configuration steps for enabling S2D on Windows Server and for creating resilient virtual disks (volumes).
AWS provides two commonly referenced shared-storage options for Windows workloads in the cloud: Amazon FSx for Windows File Server (a managed Windows SMB file server) and EBS Multi-Attach (which lets some EBS volume types be attached concurrently to multiple instances in the same AZ). FSx Multi-AZ is designed as a two-AZ active/standby file system, and EBS Multi-Attach is restricted to instances in the same Availability Zone, so neither technology alone provides native three-AZ shared storage for a Windows FCI. S2D across EC2 instances with per-node EBS volumes can span three AZs because the replication happens at the cluster/software layer rather than at the block-device layer.
At the SQL Server licensing level, Standard Edition is limited to two‑node FCIs while Enterprise Edition supports multi‑node clusters (Enterprise allows nodes up to the operating‑system maximum in practice). For any design that uses three or more nodes for a single SQL Server FCI, SQL Server Enterprise Edition (or equivalent licensing) is required — this is an important TCO and compliance constraint.

Why this approach (summary)​

  • True three‑AZ resilience: S2D replicates data across cluster nodes at the software layer, so you can host cluster nodes in separate AZs and have mirrored copies of data across AZ boundaries.
  • No dependency on two‑AZ FSx constraints or single‑AZ EBS multi‑attach: FSx Multi‑AZ is architected as active/standby across two AZs and EBS Multi‑Attach is limited to a single AZ; the S2D method moves replication into Windows Server and uses local EBS volumes to achieve three‑AZ distribution.
  • Windows-native replication and failover: Because the solution uses Windows Failover Clustering and S2D, SQL Server sees familiar shared storage semantics (CSV/REFS volumes) and fails over at the instance level using WSFC resource management.
  • Performance and flexibility: S2D supports tiering, caching, and RDMA‑accelerated SMB Direct for low latency inter‑node replication — allowing NVMe or SSD backed instances to achieve strong throughput when properly networked.

Architecture & design considerations​

High‑level topology​

  • Three EC2 Windows Server 2022 (or later) instances provisioned in three distinct AWS Availability Zones (each instance in its own subnet).
  • Each EC2 instance is joined to your Active Directory domain (self‑managed or AWS Managed Microsoft AD) and has the Failover Clustering role installed.
  • Each node receives multiple additional EBS volumes (raw, unformatted) that S2D claims into the pool — for example, two or more non‑root GP3/IO2 volumes per node. These volumes remain in each node’s AZ but S2D replicates data across AZs at the software layer.
  • Storage Spaces Direct (Enable‑ClusterStorageSpacesDirect) is enabled on the WSFC cluster. S2D builds a storage pool, configures tiers/caches, and creates CSVFS_REFS virtual disks for SQL Data, Log, and System databases.
  • SQL Server Failover Cluster Instance (FCI) is installed as a clustered resource. SQL Server Network Name and a node‑specific secondary IP are configured for each node to support multi‑subnet cluster networking.
  • Use a witness (cloud witness or file share witness) for quorum control appropriate to a three‑node multi‑AZ cluster.

Network topology and RDMA​

  • S2D strongly benefits from high‑bandwidth, low‑latency intra‑cluster networking. Microsoft recommends enabling SMB Direct (RDMA) where possible; RDMA reduces CPU overhead and latency for S2D’s inter‑node traffic. In cloud VMs this means choosing instance types that support SR‑IOV and enhanced networking and using instance NICs or accelerated networking options that support RDMA or low‑latency connectivity.
  • Use dedicated cluster networks (separate subnets) for storage traffic, and enable SMB Multichannel to leverage multiple NICs and paths.
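As a quick check of the networking points above, the following commands (run on each node) report whether the NICs expose RDMA capability and whether SMB Multichannel is in use; this is a verification sketch only, and the output will vary by instance type and driver:
  • Get-NetAdapterRdma   # lists RDMA-capable adapters and whether RDMA is enabled
  • Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable, RssCapable
  • Get-SmbClientConfiguration | Select-Object EnableMultiChannel   # SMB Multichannel is enabled by default
  • Get-SmbMultichannelConnection   # shows active multichannel paths once SMB traffic is flowing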

Storage and volume design​

  • Each EC2 node should have a number of raw, dedicated EBS volumes (identical sizes and performance characteristics recommended). Common patterns use multiple GP3 or io2 volumes per node.
  • Run Enable‑ClusterStorageSpacesDirect from within the cluster to create the pool and virtual disks. S2D will configure performance and capacity tiers automatically if different drive types exist.
  • Format cluster volumes as CSVFS_REFS for clustered virtual disk use and create folders for SQLData, SQLLog, and SystemDB under C:\ClusterStorage.

Quorum and cluster witness​

  • For three‑node clusters the typical quorum model is Node Majority. To tolerate AZ failures and maintain quorum, use a file‑share witness or cloud witness that is resilient and reachable from all AZs.
  • Consider the semantics of failover when an entire AZ becomes unreachable: plan the witness location and quorum expectation accordingly.
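For example, a file share witness hosted on a share reachable from all three AZs can be configured with Set-ClusterQuorum; the cluster name matches the walkthrough below, and the share path is a placeholder for whatever highly available share you use:
  • Set-ClusterQuorum -Cluster "S2D-Cluster" -FileShareWitness "\\witness.example.com\ClusterWitness"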

Prerequisites (concise checklist)​

  • AWS account, VPC with at least three subnets in different Availability Zones.
  • Three EC2 Windows Server 2022 (Datacenter) instances, in separate AZs, domain‑joined to your Active Directory.
  • Elevated AD rights to create computer objects and join machines to domain.
  • Additional secondary IP addresses for each instance’s network interface — at least two secondary IPs per instance for cluster and SQL virtual IPs.
  • At least two raw, non‑root EBS volumes per node (identical size). Many customers use two 100GB GP3 volumes for testing; production sizing depends on IOPS and capacity needs.
  • PowerShell 5.0 or later on each node; Failover Clustering and RSAT Clustering PowerShell features available and installed.
  • SQL Server media (SQL Server 2022 or supported version). Note: Developer Edition is for test only; Enterprise Edition is required for production clusters with three or more nodes.
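A quick per-node sanity check of these prerequisites might look like the following sketch (standard Windows cmdlets; adjust to your own build standards):
  • $PSVersionTable.PSVersion   # expect 5.0 or later
  • (Get-CimInstance Win32_ComputerSystem).PartOfDomain   # expect True once the node is domain-joined
  • Get-WindowsFeature Failover-Clustering, RSAT-Clustering-PowerShell   # install state of the clustering features
  • Get-Disk | Where-Object PartitionStyle -eq 'RAW'   # the unformatted EBS volumes S2D will claim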

Deployment walkthrough (step‑by‑step)​

The following is a practical, repeatable sequence that maps closely to the commands you’ll run. Replace placeholders (node names, IPs, volume sizes) with values from your environment.

1. Provision EC2 instances and EBS volumes​

  • Launch three EC2 Windows Server 2022 instances in AZ A, AZ B, AZ C.
  • Attach two or more additional EBS volumes (raw, no partitions) to each instance. Use gp3 for cost/IOPS flexibility, or io2 where you need provisioned IOPS; EBS Multi-Attach is not required for this S2D design. Ensure the volumes are visible as RAW in Disk Management (a scripted provisioning example follows).
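If you prefer to script the volume provisioning, the AWS Tools for PowerShell are one option; the sketch below assumes the AWS.Tools.EC2 module is installed and authenticated, and the AZ, size, instance ID, and device name are placeholders for your own values:
  • Import-Module AWS.Tools.EC2
  • $vol = New-EC2Volume -AvailabilityZone "us-east-1a" -Size 100 -VolumeType gp3
  • Add-EC2Volume -VolumeId $vol.VolumeId -InstanceId "i-0123456789abcdef0" -Device "xvdf"   # wait for the volume to reach the available state before attaching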

2. Domain join and network prep​

  • Join all instances to the AD domain using a domain account with rights to join machines.
  • Configure a management/bastion host for secure RDP/SSM access. Harden inbound rules: only allow required management IPs; open cluster ports and Windows RPC, SMB, LDAP as necessary for domain and cluster operations.
  • Add required secondary IPs to each NIC — you will use one for the cluster name and one for the SQL virtual network name per node.
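The secondary IPs can also be assigned from the AWS side with the AWS Tools for PowerShell; a minimal sketch in which the ENI ID is a placeholder and the addresses are the cluster/SQL IPs planned for that node:
  • Register-EC2PrivateIpAddress -NetworkInterfaceId "eni-0123456789abcdef0" -PrivateIpAddress "172.31.4.51","172.31.4.52"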

3. Install Failover Clustering features (PowerShell)​

Run on one privileged admin host or iterate remotely — install the Failover Clustering feature and management tools:
  • Example (when installing remotely with -Restart, list the remote nodes first and the local node last so a local reboot does not interrupt the remaining installs):
  • Save the list of nodes:
  • $servers = "node1a","node1b","node1c"
  • Install the clustering features:
  • $servers | ForEach-Object { Install-WindowsFeature -ComputerName $_ -Name "RSAT-Clustering-PowerShell","Failover-Clustering" -IncludeManagementTools -Restart }
Verify installation and reboot as required. The features list for your images should include RSAT‑Clustering‑PowerShell and Failover‑Clustering.

4. Validate cluster configuration​

From one node, run the validation tests including S2D‑specific tests:
  • Test-Cluster -Node "node1a","node1b","node1c" -Include "Storage Spaces Direct","Inventory","Network","System Configuration"
Review warnings vs errors. Warnings can often be accepted if you understand their root cause; abort for any errors.

5. Create the WSFC (multi‑subnet)​

Create the cluster and supply static addresses for each AZ (use the secondary IPs created earlier):
  • $WSFCClusterName = "S2D-Cluster"
  • $ClusterNodes = ("node1a","node1b","node1c")
  • $ClusterIPs = ("172.31.4.51","172.31.5.51","172.31.8.51")
  • New-Cluster -Name $WSFCClusterName -Node $ClusterNodes -StaticAddress $ClusterIPs
Verify Get-ClusterNode shows all nodes online.
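For example, a quick post-creation check (standard FailoverClusters cmdlets, nothing environment-specific):
  • Get-ClusterNode | Format-Table Name, State   # all three nodes should report Up
  • Get-ClusterNetwork | Format-Table Name, Role, Address   # expect one cluster network per AZ subnet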

6. Configure fault domains (Windows Server 2022+)​

Assign fault domain (site) names so the cluster records its topology (in this walkthrough all three nodes are grouped under a single site; adapt the fault domain layout to your own design):
  • New-ClusterFaultDomain -Name "Site1" -FaultDomainType Site
  • Set-ClusterFaultDomain -Name "node1a" -Parent "Site1"
  • Set-ClusterFaultDomain -Name "node1b" -Parent "Site1"
  • Set-ClusterFaultDomain -Name "node1c" -Parent "Site1"
  • Update-ClusterFunctionalLevel
This helps the cluster make informed placement and resiliency decisions.

7. Enable Storage Spaces Direct​

From any cluster node run:
  • Enable-ClusterStorageSpacesDirect
This cmdlet automatically creates the storage pool, configures cache and tiers (if mixed media exists), and claims disks. Use Failover Cluster Manager to inspect the new pool and virtual disks.
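The same inspection can be done from PowerShell; a quick sketch using standard cluster and Storage cmdlets:
  • Get-ClusterStorageSpacesDirect   # overall S2D state and cache configuration
  • Get-StoragePool | Where-Object IsPrimordial -eq $false | Format-Table FriendlyName, HealthStatus, Size, AllocatedSize
  • Get-PhysicalDisk | Format-Table FriendlyName, MediaType, Size, HealthStatus   # disks claimed into the pool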

8. Create CSV volumes for SQL​

Create the volumes you’ll use for SystemDB, Data, and Log:
  • $ClusterName = "S2D-Cluster"
  • $StoragePoolName = (Get-StoragePool -CimSession $ClusterName | Where-Object IsPrimordial -eq $false).FriendlyName
  • New-Volume -CimSession $ClusterName -StoragePoolFriendlyName $StoragePoolName -FriendlyName "SQLSystemDB" -FileSystem CSVFS_REFS -Size 25GB
  • New-Volume -CimSession $ClusterName -StoragePoolFriendlyName $StoragePoolName -FriendlyName "SQLData" -FileSystem CSVFS_REFS -Size 75GB
  • New-Volume -CimSession $ClusterName -StoragePoolFriendlyName $StoragePoolName -FriendlyName "SQLLog" -FileSystem CSVFS_REFS -Size 25GB
Create folder structure in each CSV:
  • New-Item -Path "C:\ClusterStorage\SQLData\MSSQL\" -ItemType Directory
  • New-Item -Path "C:\ClusterStorage\SQLLog\MSSQL\" -ItemType Directory
  • New-Item -Path "C:\ClusterStorage\SQLSystemDB\MSSQL\" -ItemType Directory
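Before moving on, confirm the new volumes surfaced as healthy Cluster Shared Volumes; a quick check using the $ClusterName variable defined above:
  • Get-ClusterSharedVolume | Format-Table Name, State   # each CSV should be Online
  • Get-VirtualDisk -CimSession $ClusterName | Format-Table FriendlyName, ResiliencySettingName, HealthStatus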

9. Install SQL Server FCI​

  • On Node 1 run SQL Server Setup → New SQL Server failover cluster installation.
  • Configure the SQL Server Network Name (virtual network name), select the cluster disks (SQLData, SQLLog, SQLSystemDB), and specify a SQL virtual IP address on each cluster network (one of the secondary IPs reserved on each node).
  • IMPORTANT: Configure TempDB to reside on local instance storage (NVMe instance store volumes where available, or dedicated per-node EBS volumes) rather than on the S2D CSVs. Microsoft and AWS guidance recommend placing TempDB on local, low-latency storage to reduce contention and latency for TempDB-heavy workloads; make sure the same TempDB path exists on every node so the instance can fail over.
  • Complete installation and verify the SQL Server resource is listed in Failover Cluster Manager.
  • On Node 2/Node 3 run SQL Server Setup → Add node to a SQL Server failover cluster and follow the wizard to join.

10. Test failover & validation​

  • Create a test database, failover the SQL Server FCI resource to each node, and confirm the database comes online and mounts successfully.
  • Validate S2D resiliency by simulating disk/node failure scenarios in a controlled maintenance window and confirming automatic recovery and data integrity.
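Failovers can also be driven from PowerShell for repeatable testing; in this sketch the role name is a placeholder for the SQL Server role shown in your cluster (for a default instance it is typically "SQL Server (MSSQLSERVER)"):
  • Get-ClusterGroup   # note the exact name of the SQL Server role
  • Move-ClusterGroup -Name "SQL Server (MSSQLSERVER)" -Node "node1b"   # fail the FCI over to another node
  • Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)" | Format-Table Name, OwnerNode, State   # confirm the new owner and Online state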

Operational best practices​

Networking and RDMA​

  • Use instance types and AMIs that support enhanced networking and SR‑IOV; configure accelerated networking where supported.
  • If possible, enable RDMA/SMB Direct on the storage network to reduce CPU overhead and latency for S2D replication; Microsoft recommends this for production S2D clusters.

TempDB placement & sizing​

  • Put TempDB on local NVMe instance store (ephemeral) volumes where possible. AWS testing demonstrates meaningful TempDB performance gains when TempDB uses local instance storage compared with EBS-backed TempDB. Always benchmark with a representative workload; a relocation sketch follows.
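If you relocate TempDB after installation, the change is made with ALTER DATABASE and takes effect at the next restart of the SQL Server resource; this sketch uses Invoke-Sqlcmd from the SqlServer module, and the instance name and local path are placeholders (the path must exist on every node or failover will fail):
  • Import-Module SqlServer
  • Invoke-Sqlcmd -ServerInstance "SQLFCI" -Query "ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'D:\TempDB\tempdb.mdf');"
  • Invoke-Sqlcmd -ServerInstance "SQLFCI" -Query "ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'D:\TempDB\templog.ldf');"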

Monitoring & alerting​

  • Monitor cluster health (Get‑Cluster, Get‑ClusterNode), S2D pool health, and virtual disk performance metrics.
  • Instrument SMB/SMB Direct counters, CSV latency, and per‑node CPU to detect storage network bottlenecks.
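A minimal health-check sketch that can be scheduled or fed into your monitoring tooling (standard cluster and Storage cmdlets, no custom modules assumed):
  • Get-ClusterNode | Format-Table Name, State
  • Get-StoragePool | Where-Object IsPrimordial -eq $false | Format-Table FriendlyName, HealthStatus
  • Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus
  • Get-StorageSubSystem Cluster* | Get-StorageHealthReport   # aggregate S2D health metrics from the Health Service
  • Get-SmbMultichannelConnection   # confirm storage traffic is using the intended NICs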

Backups​

  • Continue standard SQL Server backup practices (full/diff/log) to off‑cluster durable storage such as S3 (via backup agents) or third‑party backup appliances. Do not rely on cluster replication alone for backups.

Patching & maintenance​

  • Test Windows and SQL patching in a staging cluster. S2D clusters require careful rolling upgrades and reboots — follow Microsoft guidance for node upgrade ordering.
  • For cluster functional level changes or OS upgrades, consult Microsoft documentation and test the full upgrade path first.

Cost & licensing implications​

  • SQL Server licensing: Running a three‑node FCI requires SQL Server Enterprise licensing for production. Factor Enterprise license cost into the total cost of ownership.
  • EC2 instance types & EBS costs: Choose instance SKUs sized for RAM/CPU needed by SQL Server plus network capability for RDMA. EBS GP3 provides flexible baseline IOPS and throughput at lower cost than io2, but heavy OLTP workloads may justify io2 or io2 Block Express pricing.
  • Networking and accelerated features: RDMA / accelerated networking capable instances can carry price premiums; include them in cost estimates.

Strengths, risks, and limitations (critical analysis)​

Strengths​

  • True multi‑AZ cluster resilience: Because S2D replicates storage across nodes, the cluster can survive a full AZ outage and failover to nodes in other AZs without data loss (subject to quorum/witness design).
  • Windows‑native stack and tooling: Uses WSFC, S2D, and SQL Server’s FCI semantics — DBAs and Windows admins remain in familiar operational territory.
  • Performance options: With NVMe and RDMA enabled, S2D can deliver high IOPS and throughput that meet demanding SQL Server workloads.

Risks and operational caveats​

  • Complex network & instance selection: S2D’s performance depends heavily on low‑latency, high‑bandwidth networking (RDMA). Cloud VPC network variability and instance NIC capability become critical design points.
  • Support matrix & production validation: While S2D is supported on Windows Server and guest clusters, running S2D on public cloud VMs is a scenario that demands validation with Microsoft and AWS support for your particular combination of instance types, EBS volume types, and OS builds. Perform test failovers, performance benchmarks, and confirm supportability of your chosen configuration with Microsoft and AWS if you plan production use.
  • Licensing cost: Enterprise Edition licensing for multi‑node FCIs can be expensive compared to other HA patterns (Always On Availability Groups, managed databases). Include license cost in TCO and compare with AWS managed options.
  • Operational complexity: Storage Spaces Direct requires using the Failover Cluster Manager and S2D tools for storage management; day‑two operations require S2D knowledge and runbooks.

Caveats & unverifiable claims​

  • Statements that S2D on EBS is categorically “more resilient” than FSx or any managed service should be treated as conditional and workload‑dependent. FSx Multi‑AZ provides synchronous replication between two AZs with a managed SLA and may be a simpler production choice for many customers. Any claim of superior resilience, recovery RTO/RPO, or performance should be validated with your workload tests and failover drills. If a vendor or article makes strong resilience claims without public test data, flag those claims as requiring independent validation.

Troubleshooting common issues​

  • If Enable‑ClusterStorageSpacesDirect fails to claim disks, ensure disks are RAW and uninitialized (no partitions), same size, and accessible on each node.
  • If CSV latency spikes on failover, check SMB network health and NIC settings; enable SMB Multichannel and SMB Direct where available.
  • If SQL Server setup cannot see cluster disks, confirm virtual disks are online in Failover Cluster Manager and that CSV pathways are healthy.
  • If quorum flaps occur during AZ disruptions, reevaluate witness placement and network reachability.
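When the cause is not obvious, the cluster log and the storage state are usually the fastest places to look; for example (the log destination folder is a placeholder):
  • Get-ClusterLog -UseLocalTime -Destination "C:\Temp"   # collects Cluster.log from each node for offline review
  • Get-PhysicalDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus, CanPool
  • Get-StorageJob   # shows repair/rebuild jobs running after a disk or node failure
  • Test-Cluster -Node "node1a","node1b","node1c" -Include "Storage Spaces Direct"   # re-run targeted validation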

Test scenarios & validation plan​

  • Functional sanity tests
  • Create test DB, failover FCI between nodes, confirm DB mount and application connectivity.
  • AZ failover tests
  • Simulate full AZ outage by quarantining or stopping network to one zone; confirm cluster recovers and SQL owner moves to surviving node.
  • Disk failure tests
  • Remove an EBS volume from a node (in a controlled way) and validate resiliency and rebuilding of S2D mirrors.
  • Load/performance tests
  • Run representative OLTP workload and measure latency/throughput with and without SMB Direct/RDMA enabled.
  • Backup & restore exercise
  • Execute full backup to durable external store, then test database restore to a new cluster to validate recovery processes.

Cleanup and decommission steps​

When you finish testing or need to dismantle the cluster:
  • Stop SQL Server cluster group and cluster resources.
  • Disable Storage Spaces Direct with Disable‑ClusterStorageSpacesDirect.
  • Remove cluster roles and then the cluster with Remove‑Cluster -Force -CleanupAD.
  • Terminate EC2 instances and delete EBS volumes to avoid ongoing charges.
A scripted cleanup approach reduces human error and leftover resources.
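A minimal teardown sketch following the sequence above (run from a cluster node; the SQL role name is a placeholder, and the EC2/EBS cleanup is then done from the AWS console or CLI):
  • Stop-ClusterGroup -Name "SQL Server (MSSQLSERVER)"
  • Disable-ClusterStorageSpacesDirect
  • Remove-Cluster -Force -CleanupAD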

Final verdict and recommendation​

Deploying a SQL Server FCI across three AWS Availability Zones using Storage Spaces Direct on EC2 is a valid, powerful architecture that provides true multi‑AZ resilience and keeps storage replication inside the Windows Server stack. It is best considered when:
  • Your organization requires instance‑level failover with locality in separate AZs, and
  • You have SQL Server Enterprise licensing and Windows/SQL administrative expertise, and
  • You can commit to designing and operating high‑bandwidth, low‑latency networking (RDMA where possible) and comprehensive testing.
For teams that prefer a managed SMB storage solution with simpler operations and are satisfied with two‑AZ resilience, Amazon FSx for Windows File Server Multi‑AZ is a strong alternative. For teams constrained by Standard Edition licensing or wanting single‑AZ shared volumes, EBS Multi‑Attach (within the same AZ) or managed database services may be more cost‑efficient. Wherever you land, validate assumptions with representative load tests and failover drills, and confirm supportability with vendors for your exact OS, instance type, and EBS volume choices.

This guide combines the Windows Server S2D technical model and AWS service behaviors so administrators can evaluate, design, and implement a three‑AZ SQL Server FCI on EC2 with clarity about tradeoffs, operational requirements, and the tests required before production adoption.

Source: Amazon Web Services (AWS), "How to deploy a SQL Server Failover Cluster Instance across three Availability Zones using Storage Spaces Direct"
 
