SAP high availability architecture: patterns that hold under real production pressure

Summary

The most dangerous SAP high availability configuration is one that has never been tested. It looks correct in the architecture diagram, it passed the go-live review, and everyone on the team is reasonably confident it would work if something went wrong. But it has never actually failed over, so nobody knows for certain whether the secondary system picks up in 90 seconds or 25 minutes, or whether the Enqueue Replication Server reconnects cleanly after a primary node failure, or whether the monitoring team gets an alert before users start calling.

This article covers the architectural patterns that actually hold up when production systems fail, not the ones that look good in vendor presentations. It assumes you know what SAP HANA System Replication is. The goal is to cover the specific decisions and misconfigurations that determine whether your HA setup survives contact with a real outage.

What SAP high availability actually needs to protect against

The difference between HA and DR, and why conflating them creates gaps

High availability and disaster recovery are different problems with different solutions, and the SAP teams that handle both well treat them as separate design concerns from the start. HA is about keeping service running when a single component fails: a database node, an application server, a network link. The recovery target is minutes. The geographic scope is typically within a single data center or between two sites with low-latency connectivity.

Disaster recovery is about restoring service after a site-level event: a data center power failure, a major hardware incident, a regional network outage. The recovery target is hours. The geographic scope requires physical distance, which typically means asynchronous replication because synchronous replication across long distances introduces write latency that production HANA systems cannot absorb.

The gap that costs organizations the most is assuming that one replication link does both jobs. A HANA System Replication setup in synchronous mode between two racks in the same data center provides HA but not DR. A setup in asynchronous mode between two data centers 500 km apart provides DR but accepts potential data loss. A multi-tier replication setup with a synchronous secondary nearby and an asynchronous tertiary at distance provides both, but it is a more complex configuration to operate and test. Knowing which scenario you are designing for before you configure replication prevents the expensive discovery that your DR plan assumed capabilities your HA configuration does not provide.
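The three scenarios above can be written down as a small classification rule. This is a simplified sketch, not a sizing tool: the `ReplicationTier` type and the `coverage` function are hypothetical names, and the rule assumes that "HA" means a synchronous target at the local site and "DR" means any target at a geographically separate site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationTier:
    mode: str          # "SYNC", "SYNCMEM" or "ASYNC"
    remote_site: bool  # True if the target sits at a geographically separate site


def coverage(tiers: list[ReplicationTier]) -> set[str]:
    """Return which of {"HA", "DR"} a replication chain provides (simplified rule)."""
    provided = set()
    for tier in tiers:
        # Assumption: only a synchronous, local target counts as HA.
        if tier.mode in ("SYNC", "SYNCMEM") and not tier.remote_site:
            provided.add("HA")
        # Assumption: any target at a separate site counts as DR.
        if tier.remote_site:
            provided.add("DR")
    return provided


same_dc = [ReplicationTier("SYNC", remote_site=False)]
distant = [ReplicationTier("ASYNC", remote_site=True)]
multi_tier = same_dc + distant
```

Running `coverage` on each list makes the gap explicit: the first chain has no DR, the second has no HA, and only the multi-tier chain returns both.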

Failure modes that standard HA configurations miss

A database node failover is the failure mode every SAP HA configuration is designed to handle. It is also among the rarest failure modes in a well-maintained environment. The failures that actually cause downtime in production are typically more mundane and more varied: network interface saturation between the application and database tiers, storage I/O degradation affecting HANA log write performance, a background job monopolizing work processes across all application server instances, or a configuration change that was not tested in non-production first.

Standard HA architecture does not protect against any of these. A two-node HANA cluster with Pacemaker handles database node failure. It does not help when the application server layer is the problem. It does not help when all application servers are reachable but the HANA primary is so busy with a rogue analytical query that dialog transactions are timing out.

This matters for architecture decisions because teams that design purely for node failure end up with a robust response to one failure mode and no coverage for the others. A complete HA approach considers the application tier, the integration layer, and the monitoring infrastructure alongside the database cluster.

The SAP HANA layer: system replication done right

Synchronous vs asynchronous replication: the choice that fixes your RPO

SAP HANA System Replication operates in three modes. The choice between them is not a preference. It is a decision that directly determines your Recovery Point Objective and has a measurable impact on primary system write performance.

| Mode | RPO | Performance impact | Typical use case |
|---|---|---|---|
| SYNC | Zero data loss (RPO = 0) | Higher latency on writes. Network round-trip on every commit. | HA within same data center or low-latency network. |
| SYNCMEM | Near-zero (data in memory on secondary) | Lower than SYNC. Secondary acknowledges once data is in memory, before disk write. | HA where some performance recovery justifies minimal theoretical data loss risk. |
| ASYNC | Data loss possible (depends on lag at failure time) | Minimal impact on primary performance. | DR replication over WAN. Not appropriate for primary HA. |

SYNC mode is the correct choice for primary HA within a data center or between sites connected by a network with round-trip latency under 1ms. The performance impact is real but bounded, and it is the only mode that guarantees zero data loss on failover. Running SYNC over a high-latency WAN link introduces write latency that accumulates across every committed transaction. In a busy production system, this becomes visible to users.
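A rough back-of-the-envelope calculation shows how that latency accumulates. The sketch below assumes each commit waits for exactly one network round trip and that commits are issued one after another; it ignores the secondary's disk write time and any parallelism, so treat the numbers as an order-of-magnitude illustration only. The function name and the workload figures are hypothetical.

```python
def added_commit_latency_ms(commits: int, round_trip_ms: float) -> float:
    """Extra time spent waiting on the secondary for a run of sequential commits."""
    return commits * round_trip_ms


# A hypothetical batch step that issues 10,000 sequential commits.
same_site_ms = added_commit_latency_ms(10_000, 0.5)   # 0.5 ms round trip
wan_link_ms = added_commit_latency_ms(10_000, 20.0)   # 20 ms round trip

print(f"same site: {same_site_ms / 1000:.0f} s extra")
print(f"WAN link:  {wan_link_ms / 1000:.0f} s extra")
```

Under these assumptions the same step gains 5 seconds on a low-latency link and 200 seconds over the WAN link, which is the kind of difference users notice.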

SYNCMEM is a reasonable compromise when SYNC performance impact is measurable and the theoretical risk of data still in memory on the secondary (not yet written to disk) is acceptable. That data is lost only if the secondary also fails before persisting what it has acknowledged, at the same time as the primary fails. For most production environments, this risk is very small, and SYNCMEM performs noticeably better than SYNC under high write load.

ASYNC for primary HA is an architectural mistake that is surprisingly common. It is easy to see why it happens: the configuration looks identical, the secondary is online, and in a test environment with low write volume the replication lag is negligible. Under production load, ASYNC lag can grow to seconds or minutes during write-intensive periods. A failover during a lag window means committed transactions on the primary that never reached the secondary are permanently lost.
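To put a number on that exposure, multiply the worst observed replication lag by the commit rate at that time. The sketch below is illustrative only: the function name, the lag samples, and the commit rate are made up, and real lag figures would come from your own replication monitoring.

```python
def worst_case_exposure(lag_samples_s: list[float], commits_per_s: float) -> tuple[float, float]:
    """Return (worst lag in seconds, committed transactions at risk at that lag)."""
    worst_lag = max(lag_samples_s)
    return worst_lag, worst_lag * commits_per_s


# Hypothetical lag samples: negligible most of the day, large during a write-heavy window.
samples = [0.1, 0.2, 45.0, 3.0]
worst_lag, at_risk = worst_case_exposure(samples, commits_per_s=200)
print(f"worst lag {worst_lag} s -> about {at_risk:.0f} commits lost on failover")
```

The quiet-period samples are what a test environment shows; the 45-second sample is what determines the actual data loss if the failover happens at the wrong moment.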

Common mistake: Using ASYNC mode for an HA secondary in the same data center is a documented antipattern. The performance savings are real but small. The data loss risk is also small but not zero. For a production financial or logistics system, that trade-off is not worth making.

HANA system replication with Pacemaker: configuration details that matter

Pacemaker is the standard cluster resource manager for SAP HANA HA on Linux. It handles node failure detection, failover sequencing, and fencing (STONITH: Shoot The Other Node In The Head). Each of those three functions has configuration details that determine whether a failover completes cleanly or hangs.

Node failure detection speed is controlled by the cluster's timeout settings, chiefly the Corosync token timeout for node membership and the monitor interval and timeout on each Pacemaker resource. The defaults are conservative, which means a failed node may not be detected for 30-60 seconds under default settings. For environments with aggressive RTO targets, tuning these values downward is necessary, but it requires careful testing. Overly aggressive detection thresholds cause false positives: Pacemaker declares a node failed during a brief network hiccup and initiates a failover that was not needed.
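One way to reason about that trade-off before touching the cluster is to replay known transient interruptions against candidate timeouts. The sketch below is a simplified model with hypothetical data: it treats any interruption that lasts at least as long as the detection timeout as a trigger for an unnecessary failover.

```python
def false_failovers(hiccup_durations_s: list[float], detection_timeout_s: float) -> int:
    """Count transient interruptions that would outlast the detection timeout."""
    return sum(1 for duration in hiccup_durations_s if duration >= detection_timeout_s)


# Hypothetical durations (seconds) of network interruptions that resolved on their own.
hiccups = [2, 4, 7, 12, 35]

for timeout in (30, 10, 5):
    print(f"timeout {timeout:>2} s -> {false_failovers(hiccups, timeout)} unnecessary failovers")
```

With this sample data, lowering the timeout from 30 to 5 seconds turns one unnecessary failover into three, which is the cost to weigh against the faster detection.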

STONITH configuration is the piece most frequently treated as a checkbox during setup. The purpose of fencing is to guarantee that a node Pacemaker believes is failed cannot continue writing to shared resources before the secondary takes over, which could otherwise cause split-brain data corruption. In a virtual machine environment, STONITH is typically implemented via the hypervisor’s power management API. In a bare-metal environment, it requires a dedicated out-of-band management network and hardware. Environments that disable STONITH because it is difficult to configure are accepting split-brain risk in exchange for simpler setup. This is not a trade-off that should be made for production HA.

Note: SAP Note 1984787 covers the installation requirements for SUSE Linux Enterprise Server 12, and SAP Note 2578899 covers them for SUSE Linux Enterprise Server 15. Red Hat Enterprise Linux has its own equivalent installation notes. The notes for your distribution and release are required reading before production deployment, not optional references.

Multi-tier replication: when you need both HA and DR

HANA System Replication supports chaining: a primary replicates synchronously to a local secondary (HA), and that secondary replicates asynchronously to a remote tertiary (DR). This configuration provides near-zero RPO for local failures and a DR target that accepts some data loss but preserves geographic separation.

The operational complexity of multi-tier replication is higher than single-tier. After a primary failure and HA failover, the old secondary is now the new primary, and the tertiary needs to be repointed to replicate from it. This repointing is not automatic by default. It requires either manual intervention or a custom automation script as part of the failover runbook. Teams that deploy multi-tier replication without documenting and testing the DR repointing procedure have a configuration that works for the first failure mode but leaves them manually patching together a DR chain while dealing with an active incident.
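A runbook step for the repointing can be as small as a helper that assembles the registration command for an operator to review. The sketch below only builds the command string and does not execute anything. It uses the `hdbnsutil -sr_register` options for remote host, instance, replication mode, operation mode, and site name; the host, instance number, and site name are hypothetical, and the exact options and prerequisites (such as stopping the tertiary first) should be checked against the SAP documentation for your HANA revision.

```python
import shlex


def build_register_command(
    new_primary_host: str,
    instance_number: str,
    site_name: str,
    replication_mode: str = "async",
    operation_mode: str = "logreplay",
) -> str:
    """Build (but do not run) the command that re-registers the tertiary."""
    args = [
        "hdbnsutil",
        "-sr_register",
        f"--remoteHost={new_primary_host}",
        f"--remoteInstance={instance_number}",
        f"--replicationMode={replication_mode}",
        f"--operationMode={operation_mode}",
        f"--name={site_name}",
    ]
    return shlex.join(args)


# Hypothetical values: "hana-b" is the former secondary that is now primary.
command = build_register_command("hana-b", "00", "SITE_C")
print(command)
```

Keeping the command generation in the runbook, with the values filled in from the post-failover topology, removes one source of typing errors during an active incident.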

Application server high availability: the layer that gets less attention

The Enqueue Replication Server: the most commonly misconfigured SAP HA component

The SAP Enqueue Server handles lock management for the ABAP application. It is a single process. In a standard installation without HA configuration, the Enqueue Server runs on one application server instance, and if that instance fails, all locks are lost. Users receive a session error, ongoing transactions cannot be committed, and the system effectively needs a restart to clear the lock table.

The Enqueue Replication Server (ERS) solves this by maintaining a replicated copy of the lock table on a second instance. When the primary Enqueue Server fails, the ERS promotes itself, and the lock table is preserved. The failover is transparent to users in most cases.

The configuration gap that appears repeatedly in production environments is ERS setup that is technically present but not correctly integrated with Pacemaker. The ERS instance exists and replicates the lock table, but Pacemaker does not know about it, so a node failure that takes down the primary Enqueue Server does not trigger an automatic ERS promotion. The lock table is replicated but not activated. The result is the same downtime as if ERS had not been configured at all.

Verifying correct ERS-Pacemaker integration should be part of every HA configuration review. The test is straightforward: simulate a failure of the node running the primary Enqueue Server and confirm that the ERS promotes cleanly and users with active sessions can continue their transactions. If that test has not been run, the ERS configuration is unverified.
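Before running the failover test, a quick static check can catch the case where the ERS was never registered with the cluster at all. The sketch below assumes you have already exported the cluster's resources into a list of dictionaries; that inventory format, the resource IDs, and the function name are hypothetical. It relies on the `SAPInstance` resource agent and its `IS_ERS` parameter, and it only confirms registration, not that constraints are correct, so it does not replace the test described above.

```python
SAP_INSTANCE_AGENT = "ocf:heartbeat:SAPInstance"


def ers_is_cluster_managed(resources: list[dict]) -> bool:
    """True if the cluster manages both an enqueue instance and an ERS instance."""
    sap_instances = [r for r in resources if r.get("agent") == SAP_INSTANCE_AGENT]

    def is_ers(resource: dict) -> bool:
        return str(resource.get("params", {}).get("IS_ERS", "false")).lower() == "true"

    has_ers = any(is_ers(r) for r in sap_instances)
    has_enqueue = any(not is_ers(r) for r in sap_instances)
    return has_ers and has_enqueue


# Hypothetical inventory exported from the cluster configuration.
inventory = [
    {"id": "rsc_sap_S4H_ASCS00", "agent": SAP_INSTANCE_AGENT,
     "params": {"InstanceName": "S4H_ASCS00_sapascs"}},
    {"id": "rsc_sap_S4H_ERS10", "agent": SAP_INSTANCE_AGENT,
     "params": {"InstanceName": "S4H_ERS10_sapers", "IS_ERS": "true"}},
]
print(ers_is_cluster_managed(inventory))
```

If the ERS entry is missing from the inventory, the function returns `False`, which corresponds to the "replicated but not activated" situation.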

In practice: In environments where the Enqueue Server runs on the same node as the HANA primary, a single node failure takes down both the database and lock management simultaneously. Placing the Enqueue Server and the HANA primary on separate physical or virtual hosts means a database node failure does not also take the lock table down with it.

Distributing application server instances without creating new failure modes

Running multiple SAP dialog instances across several application servers is the standard approach to application-layer availability. If one AS instance fails, users reconnect through the message server to another available instance. This works reliably for stateless operations.

The failure mode that multiple AS instances do not protect against is work process pool saturation across all instances simultaneously. If a runaway batch job or a poorly optimized report monopolizes background or dialog work processes on every available application server, the system is effectively unavailable even though no instance has failed. HA architecture at the infrastructure level does not address this.
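Because no instance has failed, this condition has to be detected from work process utilization rather than from instance status. The sketch below is a minimal illustration with hypothetical instance names and a hypothetical 90% threshold: it flags the system as effectively unavailable only when every instance is at or above the threshold.

```python
def all_instances_saturated(instances: dict[str, tuple[int, int]], threshold: float = 0.9) -> bool:
    """instances maps name -> (busy, total) work processes; True if all are saturated."""
    if not instances:
        return False
    return all(
        total > 0 and busy / total >= threshold
        for busy, total in instances.values()
    )


# Hypothetical snapshots of dialog work process usage per application server.
runaway_report = {"app1": (20, 20), "app2": (19, 20), "app3": (20, 20)}
one_busy_server = {"app1": (20, 20), "app2": (5, 20), "app3": (6, 20)}

print(all_instances_saturated(runaway_report))   # every instance is saturated
print(all_instances_saturated(one_busy_server))  # load balancing can still route around app1
```

The second snapshot is the case multiple instances are designed for; the first is the case they cannot help with.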

Work process distribution also requires that the message server, which handles load balancing between application server instances, is itself protected. In most production deployments the message server runs as a component of the primary application server instance. A failure of that instance takes down both the dialog processes and the routing layer. Running the message server on a dedicated instance with its own failover configuration adds complexity but removes a meaningful single point of failure in high-demand environments.

High availability in cloud and hybrid environments

What changes when SAP runs on a hyperscaler?

The underlying HA concepts for SAP on AWS, Azure, or GCP are the same as on-premise: HANA System Replication, Pacemaker, ERS, and multiple AS instances. What changes is the infrastructure layer those components run on, and that layer introduces both new capabilities and new constraints.

Availability Zones on hyperscalers provide physical separation within a region, and placing the primary and secondary HANA nodes in separate AZs protects against a zone-level failure. The practical constraint is network latency. AZ-to-AZ latency within a single region typically sits between 1ms and 2ms, which is within the range where HANA SYNC replication is viable but closer to the threshold where write-heavy workloads start showing measurable impact. Testing replication performance under production-representative write load before committing to a cross-AZ SYNC configuration is a necessary step, not an optional one.

Hyperscalers also handle STONITH differently from on-premise environments. The hardware-based IPMI/BMC fencing that is standard on bare metal is not available in a virtual machine context. Cloud-native STONITH agents use the hyperscaler’s instance management APIs to power off a target node. These agents are well-tested and reliable, but they require the appropriate IAM permissions to be configured, and those permissions need to be verified before a real failover scenario, not during one.

RISE with SAP and the monitoring boundary

RISE with SAP moves infrastructure management and base platform operations to SAP and the underlying hyperscaler. For HA architecture specifically, this means SAP manages the HANA cluster configuration, the Pacemaker setup, and the infrastructure-level failover for the S/4HANA core. The customer does not configure or maintain those components directly.

What the customer retains in a RISE environment is responsibility for application-level availability: the extensions built on SAP BTP, the integrations with non-SAP systems, the background job scheduling that determines whether critical processes run within their windows, and critically, the monitoring layer that provides visibility into whether the system is actually healthy from a business perspective.

SAP’s RISE SLA covers infrastructure availability. It does not cover the health of custom ABAP code, the success rate of interface processing, or whether scheduled jobs completed within their business deadlines. An organization running on RISE that has no independent monitoring layer is dependent entirely on SAP’s support portal to discover when something is wrong at the application layer. That is a meaningful operational gap, regardless of what the infrastructure SLA says.

Testing your HA configuration: the step most teams skip

An untested failover is an unknown failover

HA configurations degrade over time without anyone noticing. A cluster resource that was correctly registered with Pacemaker at go-live gets orphaned when an OS patch changes a service name. A Pacemaker timeout that was calibrated for the original hardware profile no longer fits after a VM resize. An ERS instance that was verified after initial setup was never re-verified after a major SAP upgrade.
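Drift of this kind is easier to catch if the settings verified at go-live are stored as a baseline and compared against the current state on a schedule. The sketch below is a generic dictionary comparison; the setting names and values are hypothetical placeholders for whatever your cluster review actually records.

```python
def config_drift(baseline: dict, current: dict) -> dict:
    """Return {setting: (baseline_value, current_value)} for every difference."""
    drift = {}
    for key in baseline.keys() | current.keys():
        before, after = baseline.get(key), current.get(key)
        if before != after:
            drift[key] = (before, after)
    return drift


# Hypothetical snapshot taken at go-live versus one taken after patching and a VM resize.
go_live = {"ers_resource_registered": True, "monitor_timeout_s": 60, "vm_size": "M64"}
today = {"ers_resource_registered": False, "monitor_timeout_s": 60, "vm_size": "M128"}

for setting, (before, after) in sorted(config_drift(go_live, today).items()):
    print(f"{setting}: {before} -> {after}")
```

A non-empty result does not prove the failover is broken, but it identifies exactly what changed since the configuration was last verified.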

The only way to know whether a failover will work is to run one. Not a simulated failover in a sandbox environment with a fraction of the data. An actual failover of the production cluster, in a maintenance window, with the operations team watching the monitoring dashboards and timing each phase of the recovery sequence.

Most organizations do not do this annually. Many have never done it at all. The argument against testing is risk: a failover test could itself cause an unplanned outage if something goes wrong. This argument is real. But the counterargument is that discovering a broken failover configuration during an actual production incident, without a maintenance window, under pressure, with users affected, is a much worse outcome than discovering it during a controlled test when you have time to fix it.

What a realistic HA test covers

A useful HA test is not just verifying that the secondary HANA node picks up after a primary shutdown. It tests the full recovery sequence and measures each phase.

  • Time from primary failure to Pacemaker detection: a typical target is under 30 seconds, although conservative default timeouts can stretch this to 30-60 seconds.
  • Time from detection to STONITH completion: confirms the fencing mechanism is working.
  • Time from fencing to secondary promotion: the HANA takeover phase, typically 30-90 seconds depending on delta log volume.
  • Time from secondary promotion to first successful user login: includes application server reconnection and message server registration.
  • ERS behavior: confirm that users with active sessions at the time of failure can continue their transactions without a session error after reconnection.
  • Alert delivery: confirm that the monitoring system detected the failure and routed the correct alert to the correct team within the expected timeframe.
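The timed phases in the list above can be recorded as timestamps during the test and turned into durations afterwards. The sketch below is one way to do that; the event names and the sample timestamps are hypothetical, and the ERS check is left out because it is a pass or fail observation rather than a duration.

```python
from datetime import datetime


def phase_durations(events: dict[str, datetime]) -> dict[str, float]:
    """Seconds spent in each recovery phase, plus alert delay and total outage."""
    def gap(start: str, end: str) -> float:
        return (events[end] - events[start]).total_seconds()

    return {
        "detection": gap("failure", "detected"),
        "fencing": gap("detected", "fenced"),
        "takeover": gap("fenced", "promoted"),
        "user_login": gap("promoted", "first_login"),
        "alert_delay": gap("failure", "alert_delivered"),
        "total_outage": gap("failure", "first_login"),
    }


def at(clock: str) -> datetime:
    """Helper for the sample data: a time of day on a hypothetical test date."""
    return datetime.fromisoformat(f"2024-01-15T{clock}")


# Hypothetical timestamps noted by the operations team during a test.
observed = {
    "failure": at("03:00:00"),
    "detected": at("03:00:25"),
    "fenced": at("03:00:40"),
    "promoted": at("03:01:45"),
    "first_login": at("03:03:10"),
    "alert_delivered": at("03:00:50"),
}

for phase, seconds in phase_durations(observed).items():
    print(f"{phase:<13}{seconds:>6.0f} s")
```

Keeping these figures from one test to the next also shows whether any phase is getting slower over time.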

That last point is worth emphasis. An HA test that does not include monitoring verification is incomplete. The cluster might recover in 90 seconds, but if the operations team found out about the failure by reading an email 20 minutes later, the effective MTTR for that incident includes the detection gap. Monitoring is not separate from HA architecture. It is the layer that determines whether the architecture performs as designed when something actually goes wrong.

Monitoring as a functional component of SAP high availability

HA architecture documents typically end at the infrastructure layer: cluster configuration, replication mode, fencing setup. The monitoring layer is usually addressed separately, in a different section of the architecture documentation, by a different team.

This separation creates a practical problem. During a real failover, the monitoring system is the primary source of truth for what is happening and how fast. If it shows a HANA takeover in progress, the operations team knows not to restart services manually. If it shows the ERS promotion failing, they know where to focus. If it shows the alert was triggered 40 seconds after failure detection, they know the escalation chain is working. Without that real-time view, the response is slower and the risk of a manual intervention making things worse is higher.

The monitoring layer also catches the failure modes that HA architecture does not handle: the degraded performance that precedes a node failure, the work process saturation that makes the system effectively unavailable without any infrastructure event, the background job that is running at 200% of its normal duration and is about to miss a deadline that will require manual remediation. None of those trigger a Pacemaker event. All of them affect business continuity.
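The background job example is simple to express as a rule: compare each running job's elapsed time with its usual duration and flag anything at or beyond a chosen ratio. The sketch below uses hypothetical job names and baselines, and the 2.0 ratio corresponds to the 200% figure mentioned above.

```python
def overrunning_jobs(
    running_min: dict[str, float],
    baseline_min: dict[str, float],
    ratio: float = 2.0,
) -> list[str]:
    """Names of running jobs whose elapsed time is at least `ratio` times their baseline."""
    return sorted(
        job
        for job, elapsed in running_min.items()
        if baseline_min.get(job, 0) > 0 and elapsed / baseline_min[job] >= ratio
    )


# Hypothetical elapsed and typical durations, in minutes.
running = {"BILLING_RUN": 90, "MRP_NIGHTLY": 40}
typical = {"BILLING_RUN": 45, "MRP_NIGHTLY": 35}

print(overrunning_jobs(running, typical))
```

Jobs without a recorded baseline are skipped here rather than flagged; whether that is the right default depends on how complete your baseline data is.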

A well-designed SAP HA architecture specifies what the monitoring layer needs to observe, at what frequency, with what alert routing. It does not leave monitoring as an afterthought that gets configured by whoever is available after go-live. The two designs are not separate: one protects the infrastructure, the other provides the visibility that makes the infrastructure protection actionable.

The patterns that actually hold

SAP high availability is not a configuration you set once and forget. It is a system of interdependent components, each of which can drift from its intended state over time, and each of which needs to be periodically verified to confirm it still does what the architecture document says it does.

The patterns that hold under real production pressure have a few things in common. They separate HA from DR by design, not by assumption. They protect the Enqueue layer with the same rigor as the HANA layer, because a lock table loss is as disruptive as a database node failure. They test failover on a schedule rather than waiting for an incident to discover gaps. And they treat monitoring not as an add-on but as a structural requirement: the layer that converts a theoretical HA capability into a practical one.

The configurations that fail under pressure are the ones where each individual component was correctly installed but the interactions between components were never validated as a system. The HANA cluster is ready. The ERS is configured. The monitoring is running. But the ERS was never registered with Pacemaker, the monitoring threshold for HANA takeover was set too low and fires false positives, and the failover test was scheduled twice and cancelled both times because production was too busy.

That is not an HA configuration. It is a set of HA components that have not been assembled into a working system. The difference becomes apparent at 03:00 on a Monday when the primary node goes down.

Redpeaks monitors SAP HANA System Replication status, Pacemaker cluster health, application server availability, and ERS state in real time, with alerting that covers the transition phases of a failover, not just the steady-state metrics.

