Hybrid Cloud DR for SAP ERP: Rapid Azure‑Based Recovery with On‑Prem Core

Kyndryl’s engagement with a global steel producer shows a pragmatic path to fast, reliable disaster recovery for mission‑critical SAP ERP by combining selective cloud migration with on‑premises continuity. The company moved a set of x86 workloads to Microsoft Azure for agility and scale while keeping SAP ERP in its own data centers, then layered on a cloud‑based disaster‑recovery option to remove the single‑site dependency and enable rapid recovery when every minute of downtime matters.

Background​

The steel producer’s original IT strategy relied on self‑operated data centers for its core systems. To improve flexibility and reduce operational overhead, the company migrated many x86 workloads to Microsoft Azure, but retained its mission‑critical SAP ERP (the digital core that runs finance, logistics and production) on premises. That left a gap: a modern cloud platform for scale and resilience, yet a single‑site SAP deployment vulnerable to site‑level outages. Kyndryl’s role was to design and deliver a DR solution that removes the company’s dependence on its own physical infrastructure while keeping day‑to‑day SAP operations local.
This hybrid approach — migrate non‑ERP workloads to Azure and use cloud resources for DR of on‑prem SAP — is increasingly common in heavy‑industry and manufacturing environments where regulatory, latency or integration constraints make full SAP rehosting impractical in the short term. Microsoft and ecosystem partners routinely recommend hybrid DR patterns that pair database‑native replication for the DB layer with Azure Site Recovery (ASR) for application‑layer replication and orchestration.

Why a hybrid DR approach makes sense for SAP ERP​

  • Mission risk vs. migration risk: Rehosting or refactoring SAP (especially SAP HANA or customized NetWeaver landscapes) can take months and carries operational risk. A hybrid DR approach reduces migration risk while improving resilience.
  • Cost‑effectiveness: Running a cold or warm DR footprint in Azure — with capacity reserved or dynamically sized — can be far cheaper than maintaining a parallel physical DR site.
  • Leverage native DB replication: SAP HANA, Oracle, SQL Server and other databases provide replication technologies designed for transactional consistency; combining these with cloud orchestration yields pragmatic RPO/RTO tradeoffs.
  • Operational continuity: Teams keep their proven on‑prem processes but gain the option to fail over to Azure for recovery without a full cloud migration.
These are not theoretical benefits — Microsoft partner stories show real reductions in RTO and RPO when database replication and Azure Site Recovery are combined and automated. For example, a documented SAP on Azure engagement reduced RTO to around four hours and brought RPOs into the single‑digit minute range through HANA system replication plus ASR for application servers.

Technical patterns Kyndryl and partners typically use​

1) Database layer: native replication (preferred)​

For SAP HANA and most other enterprise DBMSs, the recommended approach is to use database‑native replication (HANA System Replication, Oracle Data Guard, SQL Server Always On availability groups, etc.) for the DB layer, because it preserves transactional consistency and gives the best control over RPO. Replication can be synchronous (targeting zero data loss) between low‑latency sites or zones, or asynchronous across regions to limit the latency impact on production.
  • Benefits:
      • Predictable, low‑RPO replication.
      • Avoids database consistency issues that can occur with block‑level replication or VM replication alone.
  • Tradeoffs:
      • Synchronous replication is latency‑sensitive; cross‑region synchronous replication may not be feasible.
      • Requires matching compute/storage sizing on DR side (or orchestration to upscale on failover).
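
The sync‑versus‑async decision described above can be written down as a simple rule. The following is a minimal sketch, not vendor guidance: the `ReplicationLink` type, the `choose_mode` function and the 1 ms latency budget are all hypothetical, and the budget in particular should be replaced with a figure agreed for your own landscape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationLink:
    """One primary-to-DR database link (hypothetical model)."""
    name: str
    round_trip_ms: float     # measured network round trip between the sites
    rpo_target_seconds: int  # business RPO for the system on this link


def choose_mode(link: ReplicationLink, sync_budget_ms: float = 1.0) -> str:
    """Suggest a replication mode for a link.

    sync_budget_ms is an assumed placeholder, not a vendor limit: the highest
    round trip at which the commit delay of synchronous replication is
    considered acceptable for production.
    """
    if link.round_trip_ms <= sync_budget_ms:
        return "sync"
    if link.rpo_target_seconds == 0:
        # Zero data loss is demanded but the link is too slow for sync:
        # either the requirement or the topology has to change.
        return "review"
    return "async"


# Illustrative links with made-up latency figures.
links = [
    ReplicationLink("zone-pair", round_trip_ms=0.6, rpo_target_seconds=0),
    ReplicationLink("cross-region", round_trip_ms=28.0, rpo_target_seconds=300),
    ReplicationLink("cross-region-zero-rpo", round_trip_ms=28.0, rpo_target_seconds=0),
]
for link in links:
    print(f"{link.name}: {choose_mode(link)}")
```

The "review" outcome is the useful one in practice: it flags systems whose stated RPO cannot be met by the planned topology before anyone discovers that during a drill.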

2) Application layer: Azure Site Recovery for VMs and orchestration​

Use Azure Site Recovery (ASR) to replicate SAP application VMs (ASCS, app servers, web dispatchers, etc.) and to orchestrate failover steps and networking changes in the DR region. ASR replicates VM disks and allows non‑disruptive tests and scripted failover sequences. ASR is not a substitute for database replication — it complements it.
  • Key caveat: ASR does not replicate all storage types (NFS layers such as /sapmnt sometimes need alternate approaches). For clustered NFS or DRBD‑backed storage, specialized replication or snapshot strategies are required.

3) Shared file systems and SAPMNT: Azure NetApp Files or Azure Files​

SAP shared directories (/usr/sap, /sapmnt, transport directories) are critical. Recommended Azure storage options include Azure NetApp Files (ANF) for NFS workloads, or Azure Files or Azure shared disks for Windows deployments. ANF supports snapshots and cross‑region replication, which are useful to accelerate DR and reduce restore time.

4) Networking: ExpressRoute/private connectivity and DNS orchestration​

Enterprise DR requires private, high‑capacity connectivity (for example ExpressRoute with private peering) and a plan to repoint DNS and service endpoints during failover. Network latency measurements must be part of the design, especially if synchronous DB replication is considered.
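
When latency feeds a design decision, a single average hides the spikes that matter for synchronous replication. The sketch below summarises a set of round‑trip samples with a nearest‑rank percentile; the `percentile` helper and the sample values are illustrative assumptions, and the samples would come from whatever measurement tool you use.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile of latency samples, for 0 < pct <= 100."""
    if not samples_ms:
        raise ValueError("no latency samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Nearest-rank method: the smallest value with at least pct% of the
    # samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical round-trip times in milliseconds from one test run.
samples = [1.8, 2.1, 1.9, 2.0, 9.5, 2.2, 1.7, 2.0, 2.3, 2.1]
print("median:", percentile(samples, 50), "ms")
print("p95:   ", percentile(samples, 95), "ms")
```

Here the median looks comfortable while the 95th percentile is dominated by one slow sample, which is exactly the kind of gap worth investigating before committing to a replication mode.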

5) Capacity strategy: reserve vs. on‑demand​

Two common models:
  • Reserve the DR capacity (via Reserved Instances or capacity reservations) for immediate failover at higher cost but lower risk.
  • Use smaller “skeleton” VMs, replicate data, then upscale during failover (cheaper but slower). Microsoft guidance discusses tradeoffs and cautions that Reserved Instances reduce cost and improve priority for capacity but don’t absolutely guarantee instantaneous capacity in every scenario.
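
The choice between the two models can be framed as "cheapest option that still fits the RTO". This is a rough sketch under stated assumptions: the `DrCapacityOption` type, the cost figures and the provisioning times are invented for illustration, and the model deliberately ignores the risk that on‑demand capacity is not available at failover time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrCapacityOption:
    name: str
    monthly_cost: float          # steady-state cost of keeping the option ready
    provisioning_minutes: float  # extra time at failover to reach full size


def cheapest_option_within_rto(
    options: list[DrCapacityOption],
    other_recovery_minutes: float,
    rto_minutes: float,
) -> Optional[DrCapacityOption]:
    """Return the lowest-cost option whose total recovery time fits the RTO."""
    feasible = [
        o for o in options
        if other_recovery_minutes + o.provisioning_minutes <= rto_minutes
    ]
    return min(feasible, key=lambda o: o.monthly_cost, default=None)


# Made-up figures for the two models discussed above.
options = [
    DrCapacityOption("reserved full-size", monthly_cost=40_000, provisioning_minutes=0),
    DrCapacityOption("skeleton, upscale on failover", monthly_cost=9_000, provisioning_minutes=90),
]
# Assumed: 150 minutes of recovery work that is the same for both options.
for rto in (180, 240):
    choice = cheapest_option_within_rto(options, other_recovery_minutes=150, rto_minutes=rto)
    print(rto, "->", choice.name if choice else "no option meets the RTO")
```

With these numbers a 180‑minute RTO forces the reserved footprint, while a 240‑minute RTO allows the cheaper skeleton. The provisioning time used for the skeleton should come from a measured drill, not an estimate.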

What Kyndryl likely delivered — and what’s verifiable vs. what needs caution​

Based on the steel‑producer summary, Kyndryl’s solution very likely combined these elements: database‑aware replication for SAP DBs, ASR for application VM replication and orchestration, ANF or equivalent for shared file systems, and ExpressRoute or secure connectivity to Azure. Those patterns align with Microsoft’s recommended practices and with multiple field engagements where RTOs were brought down to hours and RPOs to minutes.
Caution: the case text provided does not include exact RTO/RPO targets, specific SKUs, or whether SAP HANA system replication (HSR) was used versus storage‑snapshot replication. Those implementation details are material to recovery behavior and must be validated in contract runbooks and DR‑test results. Any claim about exact timings should be treated as conditional until proven by a full DR exercise. If you intend to treat “fast recovery” as a contractual guarantee, require measured DR drills and documented RTO/RPO validation.

Step‑by‑step roadmap for implementing cloud DR for on‑prem SAP (practical playbook)​

  • Define business requirements
      • Quantify RPO and RTO per SAP system, transaction class and business process. Map the financial and operational impact of each minute of downtime. Those numbers drive the architecture.
  • Inventory and dependency mapping
      • Identify DB flavors (HANA, Oracle, MSSQL), shared file systems, integrations, custom transports and dependent non‑SAP systems. Use discovery tools to capture full topologies.
  • Choose DB replication strategy
      • For SAP HANA, prefer HANA System Replication (HSR) for DB consistency. For Oracle use Data Guard. For SQL Server use Always On availability groups. Design sync vs async replication based on measured latency and RPO tolerance.
  • Design application‑layer replication and orchestration
      • Use Azure Site Recovery for application VM replication and test orchestration. Build runbooks for failover steps (DNS switch, bring up iSCSI/SBD fencing, rebuild Pacemaker clusters if needed). ASR accelerates recovery but requires manual cluster reconfiguration for many Linux HA setups.
  • Plan shared storage and backups
      • Use Azure NetApp Files or Azure Files for SAP shared storage; implement scheduled snapshots and cross‑region replication for fast restore. For very large DBs, snapshot‑based workflows dramatically reduce RTO versus streaming restores.
  • Network and connectivity
      • Provision ExpressRoute or equivalent private connectivity; test latency with SAP niping or equivalent tools. Publish network failover patterns and ensure firewall/security rules are available in the DR region.
  • Capacity and cost plan
      • Decide reserved vs on‑demand DR compute and pre‑allocate storage. Model TCO including egress/cross‑region replication fees, managed service costs and DR test overhead. Microsoft guidance highlights the tradeoffs between ongoing capacity cost and recovery speed.
  • Automation, runbooks and testing
      • Automate as many failover steps as possible through Azure Automation, ARM templates or Terraform, and documented runbooks. Execute non‑disruptive DR tests frequently and measure RTO/RPO; only validated runs should feed vendor SLAs.
  • Governance and SLAs
      • Define responsibilities across customer, Kyndryl and Microsoft (or other vendors). Map support contacts, escalation paths and who owns each remediation step. Avoid “vendor ping‑pong” by defining clear runbooks and SLAs.
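
The runbook idea in the roadmap above can be sketched as an ordered list of steps that are timed individually and stop at the first failure. This is a skeleton under stated assumptions, not a real orchestration tool: the step names are placeholders, and the lambdas stand in for calls to your own automation.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float


def run_runbook(steps: list[tuple[str, Callable[[], bool]]]) -> list[StepResult]:
    """Run steps in order, timing each one, and stop at the first failure."""
    results = []
    for name, action in steps:
        started = time.monotonic()
        try:
            ok = bool(action())
        except Exception:
            ok = False  # a raised error counts as a failed step
        results.append(StepResult(name, ok, time.monotonic() - started))
        if not ok:
            break  # later steps depend on earlier ones; hand over to a human
    return results


# Placeholder actions: real ones would call your own automation.
steps = [
    ("take over database on DR side", lambda: True),
    ("fail over application VMs", lambda: True),
    ("switch DNS to DR endpoints", lambda: False),  # simulated failure
    ("run application smoke tests", lambda: True),
]
for r in run_runbook(steps):
    print(f"{'OK  ' if r.ok else 'FAIL'} {r.name} ({r.seconds:.3f}s)")
```

Recording a duration per step gives each drill a breakdown of where the recovery time actually goes, which is more useful for tuning than a single end‑to‑end number.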

Testing: the non‑negotiable step​

DR plans without validated drills are promises without proof. Non‑disruptive DR testing — where failover steps are rehearsed without impacting production — is critical and typically requires isolated VNets, temporary compute, and snapshot‑driven validation. Microsoft customer stories show that using ASR plus database replication can reduce RTO to a few hours when the runbooks and automation are mature; those numbers are achievable but entirely dependent on repeatable, practiced DR runs.
Key DR test checklist:
  • Execute full failover and failback cycles.
  • Validate application behavior, transaction integrity, and integrations.
  • Measure total elapsed RTO and the final data gap (RPO).
  • Test certificate, PKI and identity flows (Microsoft Entra ID / SAP logins).
  • Confirm cost implications of failover (auto‑scale costs, data egress).
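
Measuring RTO and RPO in a drill comes down to three timestamps. The sketch below shows the arithmetic; the `DrillTimeline` type, the timestamps and the four‑hour and five‑minute targets are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillTimeline:
    outage_declared: datetime     # moment the primary site is declared down
    last_replicated_tx: datetime  # newest transaction present on the DR side
    service_restored: datetime    # moment users can work on the DR system


def measured_rto(t: DrillTimeline) -> timedelta:
    """Elapsed time from the outage declaration to restored service."""
    return t.service_restored - t.outage_declared


def measured_rpo(t: DrillTimeline) -> timedelta:
    """Window of data lost: everything after the last replicated transaction."""
    return max(t.outage_declared - t.last_replicated_tx, timedelta(0))


# Invented drill timestamps and targets.
drill = DrillTimeline(
    outage_declared=datetime(2025, 3, 1, 10, 0),
    last_replicated_tx=datetime(2025, 3, 1, 9, 57),
    service_restored=datetime(2025, 3, 1, 13, 30),
)
rto_target, rpo_target = timedelta(hours=4), timedelta(minutes=5)
print("RTO", measured_rto(drill), "met" if measured_rto(drill) <= rto_target else "missed")
print("RPO", measured_rpo(drill), "met" if measured_rpo(drill) <= rpo_target else "missed")
```

Agreeing in advance on what counts as "outage declared" and "service restored" matters more than the calculation itself, since those definitions decide whether a drill passes.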

Strengths of the cloud DR pattern Kyndryl used​

  • Rapid recovery potential: Combining DB native replication with ASR enables sub‑hour or multi‑hour RTOs depending on chosen tradeoffs. Real implementations have recorded RTO reductions into the single‑digit hours.
  • Cost flexibility: Teams can choose a reserved warm footprint or an on‑demand skeleton, trading cost for recovery speed.
  • Modern tooling: Azure snapshots, ANF replication and ASR orchestration reduce manual restore complexity for large DBs where a streaming restore would otherwise take hours.
  • Operational continuity for business: Keeps core operations in a familiar on‑premises environment while removing hard dependencies on a single physical site.

Risks and mitigations​

  • Latency and synchronous replication: Synchronous DB replication across distant regions adds the inter‑site round trip to every commit, so a zero‑RPO target is not achievable there without unacceptable production latency. Mitigation: use synchronous replication within or between nearby availability zones and asynchronous replication across regions; measure latency with SAP niping.
  • NFS and cluster fencing incompatibilities: Some NFS/DRBD setups don’t work with ASR. Mitigation: design file system DR with ANF snapshots or vendor‑supported clustering tools and include those in runbooks.
  • Complex failover steps for Pacemaker clusters: Pacemaker configurations often need manual reconfiguration after failover. Mitigation: document the exact reconfiguration steps and automate where possible; include the SBD fencing approach as part of the design.
  • Vendor coordination and “vendor ping‑pong”: Multiple vendors (customer, Kyndryl, Microsoft, DB vendor) can create ambiguity. Mitigation: explicit SLAs and ownership matrices in the Service Agreement; run multi‑vendor DR exercises.
  • Cost surprises: Cross‑region replication, egress charges and reserved capacity can add up. Mitigation: detailed TCO modelling and a proof‑of‑value pilot that measures real egress and scaling behavior.

Commercial and operational governance — negotiation checklist​

  • Insist on measured, repeatable DR test reports that validate RTO/RPO before final acceptance.
  • Contractually require documented runbooks, playbooks and escalation paths.
  • Confirm who manages periodic DR drills and who pays for test infrastructure.
  • Require a clear exit and portability plan for data if vendor relationships change.
  • Verify backup immutability and retention settings for ransomware resilience.
These governance items have operational impact and are frequently the difference between a technically sound design and a truly operational DR capability.

Practical checklist for WindowsForum readers (engineers and architects)​

  • Capture current RTO/RPO requirements for every SAP system and process.
  • Confirm DB type and choose native DB replication where possible (HSR, Data Guard, Always On).
  • Inventory shared filesystems and evaluate ANF/snapshots for cross‑region replication.
  • Measure network latency between primary and candidate DR regions with SAP tools.
  • Build DR runbooks that include DNS cutover, internal load balancer creation and cluster reconfiguration steps.
  • Run non‑disruptive DR tests and keep the results as part of contractual acceptance.

Conclusion​

Kyndryl’s solution for the steel producer exemplifies how industry‑grade hybrid designs can reconcile the operational demands of on‑prem SAP with the resiliency and scale of cloud platforms. By combining database‑native replication (for consistent RPOs), Azure Site Recovery (for orchestration and app VM replication), and modern storage primitives like Azure NetApp Files and snapshots (for fast restores), organizations can eliminate a hard dependency on a single physical site and achieve rapid, verifiable recovery.
This hybrid path offers a pragmatic balance: it reduces risk and cost compared with maintaining a second physical DR site, while avoiding the immediate complexity of a full SAP migration. The only way to turn potential into a guarantee is disciplined execution: precise RTO/RPO definition, rigorous DR drills, clearly assigned vendor responsibilities, and careful capacity planning. When those elements are in place, recovery that once took days can become predictable and measured in hours, and that is the fundamental operational win Kyndryl delivered to the steel producer.

Source: Kyndryl, “Enabling rapid disaster recovery of critical ERP systems with cloud migration”