SAP HANA Database Monitoring: the metrics that predict problems before they happen

Summary

Log volume full. The message appears in the system log at 14:22 on a Wednesday. By 14:23, the HANA database has performed an emergency stop. No prior alert, no gradual degradation, no warning visible to users before the sessions dropped. The log volume had been growing steadily for three weeks. Nobody was watching it.

That specific scenario accounts for a category of HANA outages that are entirely preventable and almost embarrassingly simple to catch. The log volume fills, the database stops. Set one alert at 70% utilization and you never see this failure mode in production.

But log volume is only one entry on a longer list of metrics where the gap between current state and production incident is both measurable and actionable. This article covers the HANA-specific monitoring signals that carry genuine predictive value: not generic database performance metrics rebranded for HANA, but the readings that reflect how HANA actually manages memory, persistence, and concurrency. For each one, the goal is to explain not just what to measure, but why the measurement matters and what it tells you about conditions that have not yet become incidents.

Why HANA monitoring is different from generic database monitoring

Memory is not an abstraction in HANA

In a traditional row-store database, memory is a performance layer. Data lives on disk. Queries read from disk into a buffer cache, and when memory fills up, the database pages out to disk. Performance degrades gracefully. The system stays up.

SAP HANA does not work this way. Column store data lives in memory. It is loaded at startup or on first access and is meant to stay there. When memory fills up, HANA does not page out to disk in the way a conventional database would. It uses an internal unload mechanism to evict cold column table data from memory, which works within defined bounds. But if the total memory demand from active queries, row store, code heap, and column store combined exceeds the configured allocation limit, HANA triggers an out-of-memory event and stops.

This distinction changes what monitoring needs to do. You are not watching memory to optimize performance. You are watching memory to prevent a hard stop. The threshold where a metric becomes critical is different from any other database platform, and the relationship between memory pressure and system stability is more direct and less forgiving.

The metrics that look normal until they don’t

Several HANA metrics behave stably across a wide range and then change character abruptly near their limits. Log volume utilization is flat for weeks, then suddenly critical. Delta merge backlog grows slowly for months without visible performance impact, then query plans start degrading. Code heap grows incrementally with each new deployment and never shrinks, until the cumulative total crosses a threshold that pushes total memory allocation past a safe ceiling.

This pattern, slow drift followed by a sharp consequence, is what makes HANA monitoring specifically about trend detection rather than threshold alerting alone. A metric at 65% that has grown from 40% in three months is a different situation from a metric that has been stable at 65% for two years. Both read the same at the moment of measurement. Only one represents an approaching problem.

Any monitoring setup for HANA that only checks current values against static thresholds is incomplete. The trend dimension, how fast a metric is moving and toward what boundary, is often the more useful signal.
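The trend dimension can be made concrete with a small projection. The sketch below fits a straight line through utilization samples and estimates how many days remain before a boundary is reached. The function name, the input shape, and the linear-growth assumption are illustrative and not part of any HANA tooling; real metrics may grow in steps rather than linearly.

```python
from datetime import datetime, timedelta

def days_until_threshold(samples, threshold_pct):
    """Project when a utilization metric reaches threshold_pct.

    samples: list of (datetime, percent) pairs, oldest first (hypothetical shape).
    Returns days counted from the last sample, 0.0 if the last sample is
    already at or above the threshold, or None if the fitted trend is flat
    or decreasing.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a trend")
    t0 = samples[0][0]
    xs = [(ts - t0).total_seconds() / 86400.0 for ts, _ in samples]  # days
    ys = [pct for _, pct in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("samples must span more than one point in time")
    # Ordinary least-squares slope, in percentage points per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if ys[-1] >= threshold_pct:
        return 0.0
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold_pct - intercept) / slope
    return max(crossing_day - xs[-1], 0.0)
```

Fed the two situations described above, a metric that moved from 40% to 65% in 90 days and one that stayed at 65%, the function returns a finite number of days for the first and None for the second, even though both read 65% today.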

Memory metrics: where most HANA incidents start

The difference between used memory, allocated memory, and the limit

HANA exposes several memory figures that are easy to conflate. Understanding what each one represents determines which one you should be alerting on.

The memory allocation limit is the ceiling HANA has been configured to use. It is set via the global.ini parameter global_allocation_limit and is typically sized at 90-95% of available physical RAM to leave headroom for the operating system. This is the hard boundary. Cross it, and HANA stops.

Allocated memory is how much of that limit HANA has currently claimed from the operating system. It grows as HANA loads data and shrinks only partially when data is unloaded. Allocated memory staying close to the limit after data unloads is a sign that memory fragmentation is building up.

Used memory is the subset of allocated memory that is actively holding data or being used by running processes. This is what fluctuates with query activity and connection load. A spike in used memory during a large analytical query is normal. Used memory that does not return to baseline after the query completes indicates a leak or accumulation pattern worth investigating.

The metric to alert on is used memory as a percentage of the allocation limit, not as a percentage of physical RAM. The allocation limit is the actual operational ceiling. Alert at 85%. Escalate at 90%. Above 90%, a single large unoptimized query can push the system past the limit.
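As a minimal sketch of that rule, the function below computes used memory as a percentage of the allocation limit and maps it to the 85% and 90% tiers. The function name and the byte-valued inputs are assumptions; how the two figures are read from HANA is left out on purpose.

```python
def memory_severity(used_bytes, allocation_limit_bytes,
                    warn_pct=85.0, critical_pct=90.0):
    """Classify used memory against the allocation limit, not physical RAM.

    Returns (level, percent_of_limit) where level is "ok", "warning",
    or "critical". Thresholds default to the 85% / 90% tiers from the text.
    """
    if allocation_limit_bytes <= 0:
        raise ValueError("allocation limit must be positive")
    pct = 100.0 * used_bytes / allocation_limit_bytes
    if pct >= critical_pct:
        level = "critical"
    elif pct >= warn_pct:
        level = "warning"
    else:
        level = "ok"
    return level, round(pct, 1)
```

On a host with 1 TiB of RAM and a 920 GiB allocation limit, 800 GiB of used memory is about 78% of physical RAM but about 87% of the limit, which is already in the warning tier.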

Watch out: The M_MEMORY_OVERVIEW view provides a consolidated memory picture but does not break down consumption by component. For root cause analysis during a memory pressure event, M_HEAP_MEMORY (code heap), M_RS_MEMORY (row store), and the column store unload history view M_CS_UNLOADS are necessary to understand what is consuming space.

Column store and row store: monitoring them separately

Column store memory is dynamic. HANA loads and unloads column table data based on access patterns and available memory, managed by the auto-unload mechanism. When memory pressure builds, HANA automatically evicts cold column store partitions. This is by design and does not indicate a problem. The problem occurs when the working dataset, the data actively needed by production queries, is larger than available memory, forcing HANA to load and unload the same pages repeatedly. This shows up as elevated disk I/O on the data volume and increased query latency on specific tables, not as an error message.

Row store memory behaves differently and is monitored for a different reason. Row store tables, which include HANA’s internal catalog tables and any application tables configured with ROW storage, do not benefit from the auto-unload mechanism. Memory consumed by row store data stays allocated even after rows are deleted. The space is marked as free internally but not returned to the system. Over time, in environments with high row store churn, row store size grows until explicit reorganization is run via HANA Studio or the HANA SQL command ALTER TABLE … RECLAIM DATA SPACE.

Row store memory that grows steadily in a system with stable data volumes is a sign that row store reorganization has not been run in a long time. Left unaddressed, it contributes to overall memory pressure without any corresponding growth in actual data. Monitoring row store used versus row store allocated, and alerting when the gap between them grows large, identifies this condition before it becomes a memory headroom problem.
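The used-versus-allocated comparison can be sketched as follows. The 30% free-space ratio used as the default trigger is an assumption chosen for illustration, not a documented SAP threshold, and the function name is hypothetical.

```python
def row_store_gap(allocated_bytes, used_bytes, max_free_ratio=0.30):
    """Measure how much allocated row store space holds no live data.

    max_free_ratio is an illustrative default: above it, the system is
    flagged as a candidate for row store reorganization.
    """
    if allocated_bytes <= 0:
        raise ValueError("allocated size must be positive")
    if used_bytes > allocated_bytes:
        raise ValueError("used size cannot exceed allocated size")
    free_bytes = allocated_bytes - used_bytes
    free_ratio = free_bytes / allocated_bytes
    return {
        "free_bytes": free_bytes,
        "free_ratio": free_ratio,
        "reorg_candidate": free_ratio > max_free_ratio,
    }
```

Tracking free_ratio over time is more informative than a single reading: a ratio that keeps rising while used_bytes stays flat matches the pattern described above.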

Code heap: the metric that accumulates silently

Code heap is the memory HANA uses for compiled code objects, including ABAP stored procedures, scripted calculation views, and XS application code. It grows as new objects are loaded or recompiled and does not release automatically. Code heap is not bounded by the column store unload mechanism. It counts toward the total allocation limit.

In development-active systems where stored procedures or calculation views are frequently deployed and updated, code heap grows with each deployment cycle. Old compiled versions are not immediately purged. A system that has been running for two or three years with regular ABAP stored procedure development can accumulate several gigabytes of code heap that serves no active purpose.

Code heap above 8-10 GB in a production system deserves investigation. The remediation is typically a service restart on the index server, which releases compiled objects no longer in use. In a well-maintained system, code heap should be reviewed as part of regular health checks, particularly after major release deployments.

Storage metrics: the ones that cause hard stops

Log volume: the metric that ends databases without warning

SAP HANA uses a redo log to ensure transactional durability. Every committed transaction writes to the log volume before the acknowledgment is sent to the application. This log must be backed up regularly to a backup medium, at which point the backed-up segments become eligible for overwrite. If log backups stop running, whether due to a configuration problem, a backup medium issue, or deliberate disabling, the log volume fills continuously with no release mechanism.

When the log volume reaches 100% utilization, HANA performs an immediate emergency stop. Not a graceful shutdown. Not a write suspension. A stop. The log files cannot be extended. No new transactions can be committed. The system is offline until log space is freed, which requires either restoring from backup or manually deleting log segments, neither of which should be a first response in a production environment.

The alert threshold for log volume utilization is 70%. Not 80%, not 90%. At 70%, there is time to investigate whether log backups are running correctly, whether the backup medium has sufficient space, and whether the log backup interval is appropriate for the current transaction volume. At 90%, you are in remediation mode. Above 95%, you are in incident mode.
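Those tiers translate directly into a small classifier. This is a sketch with a hypothetical function name; the exact handling of boundary values (for example, whether 90% itself counts as remediation) is an assumption.

```python
def log_volume_state(used_pct):
    """Map log volume utilization to the response tiers described above."""
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError("utilization must be between 0 and 100")
    if used_pct > 95.0:
        return "incident"      # a hard stop is imminent
    if used_pct >= 90.0:
        return "remediation"   # free space now, investigate afterwards
    if used_pct >= 70.0:
        return "investigate"   # check log backups and backup medium
    return "ok"
```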

In practice: Log backup frequency determines how quickly the log volume fills under normal write load. A system with high transaction volume and log backups configured for every 15 minutes will fill the log volume much more slowly than one where log backups run hourly. If log volume utilization is consistently high even when backups are running correctly, the backup interval may need to be reduced or the log volume may need to be sized larger.
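The relationship between write rate, backup interval, and log volume size is simple arithmetic. The sketch below assumes a steady log write rate and assumes that backed-up segments become reusable immediately after each backup; both are simplifications, and the function names are illustrative.

```python
def peak_log_usage_pct(write_mb_per_min, backup_interval_min, log_volume_mb):
    """Estimate log volume utilization reached just before each log backup."""
    if log_volume_mb <= 0:
        raise ValueError("log volume size must be positive")
    return 100.0 * write_mb_per_min * backup_interval_min / log_volume_mb

def max_backup_interval_min(write_mb_per_min, log_volume_mb, target_pct=70.0):
    """Longest backup interval that keeps the estimated peak below target_pct."""
    if write_mb_per_min <= 0:
        raise ValueError("write rate must be positive")
    return target_pct * log_volume_mb / (100.0 * write_mb_per_min)
```

With a hypothetical 200 MB/min write rate and a 20,000 MB log volume, hourly log backups give an estimated peak of 60%, while 15-minute backups give 15%. The second function answers the sizing question in reverse: at that write rate, an interval above 70 minutes would push the estimated peak past the 70% alert line.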

Data volume growth and savepoint behavior

The HANA data volume holds the persistent column store data, the row store, and the undo information needed for crash recovery; the redo log lives on the separate log volume. Its growth rate under normal operations is predictable and tied to data volume changes in the database. A sudden acceleration in data volume growth indicates either a bulk data load, an archiving configuration that has changed, or a runaway process creating large temporary structures that are not being cleaned up.

Savepoints are the mechanism by which HANA periodically flushes dirty pages from memory to the data volume. They run automatically at configurable intervals (default: 5 minutes) and during specific events like log backups. Savepoint duration is a metric that most monitoring configurations do not track, but it carries meaningful predictive value.

A savepoint that normally completes in 15 seconds and starts taking 3 minutes is telling you something: either the delta between the last savepoint and the current one is much larger than normal (indicating high write volume or a large bulk operation), or the I/O path to the data volume is saturated. Both conditions tend to precede broader performance issues. Elevated savepoint duration is often visible in HANA monitoring before users report slowness, which makes it a useful early indicator for I/O-related degradation.
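One way to detect that kind of shift is to compare the latest savepoint against the median of recent history. The factor of 3 and the minimum of 20 samples below are assumptions for illustration; the function name is hypothetical.

```python
from statistics import median

def savepoint_is_slow(history_seconds, latest_seconds,
                      factor=3.0, min_samples=20):
    """Flag a savepoint that takes much longer than the recent norm.

    Returns None when there is too little history to define a baseline,
    otherwise True or False.
    """
    if len(history_seconds) < min_samples:
        return None
    baseline = median(history_seconds)
    return latest_seconds > factor * baseline
```

The median is used instead of the mean so that a handful of slow savepoints during a past bulk load do not inflate the baseline.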

Backup monitoring: the metric that matters exactly when you least want to discover it

The relevance of backup monitoring becomes apparent at the moment of a database failure, when the recovery timeline depends entirely on the age of the last successful backup. Discovering at that point that the last successful data backup was four days ago, because the backup job had been silently failing since then, is a situation that backup monitoring exists to prevent.

Two backup metrics warrant continuous monitoring. The age of the last successful complete data backup should never exceed 24 hours in a production environment. The integrity of the log backup chain should be verified continuously: a gap in the log backup sequence means that point-in-time recovery to any time after that gap is impossible, even if subsequent log backups are present and valid.
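Both checks can be sketched from backup completion timestamps alone. Note the limits of this sketch: it detects time gaps between log backups, which is a proxy for chain integrity and not a true continuity check on log positions. The function name, input shapes, and the slack factor of 2 are assumptions.

```python
from datetime import datetime, timedelta

def backup_findings(now, last_data_backup, log_backup_times, log_interval,
                    max_data_age=timedelta(hours=24), slack=2.0):
    """Report a stale data backup and time gaps between log backups.

    last_data_backup: completion time of the last successful data backup,
                      or None if there is none.
    log_backup_times: completion times of successful log backups.
    log_interval:     configured log backup interval as a timedelta.
    slack:            tolerated multiple of the interval before a gap counts.
    """
    stale = last_data_backup is None or (now - last_data_backup) > max_data_age
    allowed = log_interval * slack
    # Append "now" so that log backups that have stopped show up as a gap.
    points = sorted(log_backup_times) + [now]
    gaps = [(a, b) for a, b in zip(points, points[1:]) if b - a > allowed]
    return {"data_backup_stale": stale, "log_gaps": gaps}
```

Each reported gap is a pair of timestamps bounding a window that should be checked against the backup catalog before assuming point-in-time recovery is possible across it.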

Both metrics are available in the M_BACKUP_CATALOG view. Both are straightforward to alert on. And both are systematically undermonitored because backup failures tend to be treated as infrastructure issues rather than database availability issues. A failed backup does not affect current system performance. The consequences appear only after the next failure, at recovery time.

Performance metrics: SQL, delta merge, and lock contention

Long-running statements and what they signal beyond their own execution

An expensive SQL statement in HANA does more than consume resources during its own execution. A query that runs a full column store scan on a multi-billion row table while holding shared locks blocks parallel access to those structures. It delays delta merge operations that need to process the same data. It may trigger the auto-unload of other column store partitions to free memory, causing I/O when those partitions are reloaded for the next query that needs them.

The M_EXPENSIVE_STATEMENTS view records statements that exceeded a configurable duration or resource threshold. By default, this threshold is set too high for most production environments to catch the queries that matter. Setting the expensive statement threshold to 30 seconds in OLTP environments and 5 minutes in analytics environments provides a record of statements worth investigating, without the overhead of logging every query.

A monitoring alert that fires when any statement has been running for more than 5 minutes in production gives the operations team time to investigate whether the statement is expected (a large batch process) or unexpected (a user accidentally running an unoptimized report against a production system). The distinction determines the response, but either way the team should know about it while it is still running.
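A minimal sketch of that alert, assuming the monitoring layer already has a list of currently running statements with their start times. The dictionary keys ("id", "started") are hypothetical and do not correspond to specific HANA view columns.

```python
from datetime import datetime, timedelta

def long_running(statements, now, limit=timedelta(minutes=5)):
    """Return statements running longer than limit, longest first.

    statements: list of dicts with a "started" datetime (hypothetical shape).
    """
    over = [(now - s["started"], s) for s in statements
            if now - s["started"] > limit]
    over.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in over]
```

Sorting by runtime puts the statement most likely to be causing side effects at the top of the alert, which helps with the expected-versus-unexpected decision.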

Delta merge backlog: the slow degradation nobody notices in time

HANA’s column store uses a write-optimized delta store for new row inserts before merging them into the main compressed store. The delta merge process runs automatically based on configurable triggers: when delta size reaches a threshold, on a schedule, or manually. Reads on column store tables query both the main store and the delta store, which is less optimized for reads than the main store.

When delta merges fall behind, whether because merge operations are being cancelled by competing resource demands, because the merge threshold is set too high, or because the auto-merge configuration has been inadvertently changed, the delta store grows. Query performance on the affected tables degrades gradually because a growing share of each table sits in the delta store rather than in the read-optimized main store. There is no error. No alert fires. Execution plans get slightly worse. Users perceive the system as slow in ways they cannot precisely articulate.

Monitoring pending delta merges and merge failure counts via M_DELTA_MERGE_STATISTICS catches this drift before it reaches the point where query plan degradation is visible in response time metrics. A delta merge failure count that is climbing, or a pending merge queue that is consistently above 50 across the system, warrants investigation of what is blocking or delaying the merge process.
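The two conditions, a consistently high pending count and a climbing failure count, can be sketched over a window of periodic samples. The input lists are assumed to be collected by the monitoring layer; their shape and the function name are illustrative.

```python
def delta_merge_findings(pending_samples, failure_totals, pending_limit=50):
    """Evaluate delta merge health over a window of samples, oldest first.

    pending_samples: system-wide pending merge counts per sample.
    failure_totals:  cumulative merge failure counts per sample.
    """
    # "Consistently above" means every sample in the window exceeds the limit.
    backlog = bool(pending_samples) and all(
        p > pending_limit for p in pending_samples)
    climbing = len(failure_totals) >= 2 and failure_totals[-1] > failure_totals[0]
    return {"backlog_sustained": backlog, "failures_climbing": climbing}
```

Requiring every sample in the window to exceed the limit avoids alerting on the short spikes that follow a bulk load and clear on their own.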

Lock waits and blocked transactions

HANA’s multiversion concurrency control architecture is designed to minimize lock contention compared to traditional row-store databases. Reads generally do not block writes, and writes do not block reads. The lock contention that does occur in HANA tends to be concentrated in specific patterns: concurrent modifications to the same row in an OLTP workload, table-level locks held by DDL operations, or record locks held by long-running transactions that have not committed.

The M_BLOCKED_TRANSACTIONS view shows transactions currently waiting on locks. A lock wait that resolves in under a second is normal. A lock wait that persists for 30 seconds or more in a production OLTP system indicates a blocking pattern that will worsen under increased concurrent load. The view shows both the blocked transaction and the blocking transaction, which is the piece the operations team needs to decide whether to wait for the blocker to complete or to intervene.
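Grouping long waits by their blocker turns a list of symptoms into a short list of causes. The sketch below assumes each wait has already been reduced to three fields; the key names are hypothetical and not actual view columns.

```python
from collections import defaultdict

def blockers_over_limit(waits, limit_seconds=30):
    """Group blocked transactions by blocker for waits at or above the limit.

    waits: list of dicts with "blocked_id", "blocking_id", and
           "wait_seconds" keys (hypothetical shape).
    Returns {blocking_id: [blocked_id, ...]}.
    """
    by_blocker = defaultdict(list)
    for wait in waits:
        if wait["wait_seconds"] >= limit_seconds:
            by_blocker[wait["blocking_id"]].append(wait["blocked_id"])
    return dict(by_blocker)
```

A single blocker with many entries points at one uncommitted or long-running transaction; many blockers with one entry each points at a hotspot in the table design.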

Persistent lock waits in production are almost always a symptom of a broader issue: a transaction that was started but not committed (open transaction left by an application error), a batch process that holds locks across large record sets without intermediate commits, or a table design that creates update hotspots. Monitoring lock wait frequency and duration surfaces these patterns before they result in user-visible errors.

Replication status and service health

System replication lag as an early warning signal

In environments running SAP HANA System Replication for high availability, replication lag is a metric that matters both operationally and architecturally. Under normal conditions with SYNC or SYNCMEM replication mode, lag should be near zero. The secondary is continuously receiving and applying log entries from the primary. Any sustained lag indicates that the secondary is falling behind.

Replication lag can build for several reasons: network bandwidth saturation between primary and secondary, I/O bottleneck on the secondary’s storage, or a spike in primary write volume that temporarily exceeds the secondary’s processing capacity. Each cause has a different remediation.

The monitoring value of replication lag goes beyond ensuring the secondary is healthy. A secondary that is consistently running 5-10 seconds behind the primary under normal load will be further behind during a high-write event, which is exactly when a failover is most likely to occur due to resource exhaustion on the primary. Lag at the time of failover is the practical RPO for SYNCMEM and ASYNC configurations. Monitoring lag trend provides advance warning that the de facto RPO is diverging from the design assumption.
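To separate sustained lag from a momentary spike, an alert can require that the most recent samples all exceed the limit. The 10-second limit follows the reference threshold used in this article; the window of five samples and the function name are assumptions.

```python
def lag_is_sustained(lag_seconds, limit=10.0, window=5):
    """Return True when the last `window` lag samples all exceed `limit`.

    lag_seconds: replication lag samples in seconds, oldest first.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    recent = lag_seconds[-window:]
    return len(recent) == window and all(v > limit for v in recent)
```

The same sample series can be fed to a trend projection to see whether the baseline lag under normal load is drifting upward from week to week.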

Service availability beyond “is HANA up”

A HANA system that responds to a connectivity check is not necessarily a fully healthy system. HANA is composed of multiple services that run as separate processes: the index server (column store and SQL processing), the name server (topology and routing), the statistics server (monitoring and alerting internally), the preprocessor server (text search), and in some configurations the XS engine for HTTP-based applications.

Each of these services can fail or degrade independently while the system as a whole remains reachable. An index server that has restarted due to a memory event will show the system as available to a ping check while the column store data is being reloaded from disk, a process that can take minutes to hours depending on the data volume involved. During that reload, query performance may be severely degraded without any availability alert firing.

Service-level monitoring via M_SERVICES tracks the status, uptime, and memory consumption of each HANA service individually. Alerting on index server restarts, even when the service recovers automatically, creates a record of instability events that would otherwise be invisible. Two or three index server restarts in a month, each resolving automatically, is a pattern that warrants investigation before it produces an unrecoverable stop.
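Counting restarts over a rolling window is enough to surface that pattern. The sketch assumes the monitoring layer records each observed index server start time; planned restarts would need to be excluded by the caller. Names and the 30-day default are illustrative.

```python
from datetime import datetime, timedelta

def restarts_in_window(start_times, now, window=timedelta(days=30)):
    """Return observed service start times that fall inside the window."""
    return [t for t in sorted(start_times)
            if timedelta(0) <= now - t <= window]

def is_unstable(start_times, now, limit=2, window=timedelta(days=30)):
    """Flag a service with `limit` or more restarts inside the window."""
    return len(restarts_in_window(start_times, now, window)) >= limit
```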

HANA monitoring metrics reference

The table below consolidates the key metrics discussed in this article, with recommended alert thresholds, the HANA system view that provides the data, and the operational reason each metric matters. Thresholds are starting points. The correct values for your environment depend on your system profile, data volume, and SLA targets.

| Metric | Alert threshold | View / source | Why it matters |
| --- | --- | --- | --- |
| Used memory / allocation limit | > 85% | M_MEMORY_OVERVIEW | Primary OOM risk indicator. Sustained above 90% means an unconstrained query can trigger an emergency stop. |
| Row store memory used | > 70% of row store size | M_RS_MEMORY | Row store does not release memory automatically on row deletion. Growth here is permanent until explicitly reorganized. |
| Code heap used | > 8 GB (absolute) | M_HEAP_MEMORY | Code heap growth is a sign of memory leaks in ABAP stored procedures or XS applications. Does not shrink without a service restart. |
| Log volume used | > 70% | M_DISK_USAGE | Log volume exhaustion causes a full database stop with no graceful shutdown. No user-visible warning before the stop occurs. |
| Data volume used | > 80% | M_DISK_USAGE | Tracks total data persistence growth. Combine with monthly growth rate to project when volume will need expansion. |
| Last successful data backup | > 24 hours | M_BACKUP_CATALOG | A HANA system without a recent backup has undefined RPO. Backup failure is often silent without explicit monitoring. |
| Log backup gap | Any gap > backup interval | M_BACKUP_CATALOG | Gaps in log backup chain break point-in-time recovery. Every gap is a potential data loss window. |
| Long-running statements | Active > 5 minutes (prod) | M_EXPENSIVE_STATEMENTS | A single expensive statement can monopolize I/O, block other queries, and cause a cascade of wait conditions across the system. |
| Pending delta merges | > 50 pending across all tables | M_DELTA_MERGE_STATISTICS | Delta backlog degrades read performance progressively. At high counts, query plans switch to suboptimal paths without any error. |
| Lock wait count | Any lock wait > 30 seconds | M_BLOCKED_TRANSACTIONS | Persistent lock waits in production OLTP indicate contention patterns that worsen under peak load. |
| Thread blocking / stuck threads | Any thread in status Semaphore Wait > 5 min | M_SERVICE_THREADS | Stuck threads consume work capacity without doing work. At sufficient count they starve legitimate requests. |
| System replication lag (SYNCMEM/ASYNC) | > 10 seconds sustained | M_SERVICE_REPLICATION | Replication lag during a primary failure in SYNCMEM or ASYNC mode means the secondary is behind. Takeover precision depends on lag at failure time. |
| Secondary connection status | Any disconnection | M_SERVICE_REPLICATION | A disconnected secondary provides no HA coverage. Reconnection may require manual intervention depending on the disconnect reason. |
| Savepoint duration | > 5 minutes | M_SAVEPOINTS | Long savepoints indicate I/O saturation or lock contention at the persistence layer. They precede broader performance issues. |
| Savepoint write size growth | Month-on-month trend | M_SAVEPOINTS | Growing savepoint write size indicates that the working dataset is expanding. Relevant for capacity planning. |

Building a monitoring baseline for HANA: what good coverage looks like

The metrics in the table above are not equally urgent to implement. In environments with no current HANA monitoring or with monitoring that only checks availability, a practical prioritization is memory allocation limit (the hardest stop), log volume utilization (the most common cause of unplanned outages), and last successful backup age (the metric that determines recovery capability). Those three, with alerts set and routed correctly, eliminate the most common category of HANA incidents.

From that baseline, delta merge monitoring and savepoint duration add the trend visibility that catches slow-moving performance degradation before it becomes user-visible. Long-running statement monitoring adds the query-level signal that explains why memory pressure spikes at specific times of day.

The full set of metrics in the reference table covers a production HANA environment comprehensively, but it represents several weeks of configuration and threshold tuning work to implement correctly. The value of doing that work is proportional to the cost of the incidents it prevents. For a production S/4HANA system running financial operations, payroll, or logistics execution, the answer to whether that work is worth doing is not complicated.

Two practical notes on implementation. First, several of these metrics require access to HANA system views that need specific HANA privileges, separate from standard SAP application authorization. Monitoring user setup is a prerequisite that gets underestimated in project timelines. Second, baselines for metrics like savepoint duration and delta merge frequency require at least two to four weeks of data collection before alerting thresholds are meaningful. Starting monitoring before go-live, even in non-production, produces baseline data that makes production alerting immediately useful rather than requiring a recalibration period after launch.

The predictive value is in the trend, not the threshold

HANA performance does not usually fail suddenly. It accumulates. Code heap builds across deployment cycles. Delta merges fall behind gradually. Log volume grows as backup windows extend. Row store space fills as reorganization gets deprioritized. Each metric, monitored individually, looks manageable until it reaches its critical point. Monitored together as a trend, they tell a coherent story about a system moving toward a constraint.

The monitoring setups that catch HANA problems before users report them share a characteristic: they treat trend as a first-class signal alongside current state. A metric at 70% that has grown from 40% in six weeks deserves the same attention as a metric at 85% that has been stable for months. The stable metric is a known condition. The trending one is an approaching event.

The metrics covered in this article do not require instrumentation beyond what HANA already exposes through its system views. The data is there. The question is whether the monitoring layer is configured to read it, correlate it, and surface it to the right people at a time when action is still preventive rather than reactive.

Redpeaks connects to SAP HANA via native SQL APIs without agents or transports, collecting memory, storage, performance, and replication metrics in real time. Alerts are baseline-driven, not default thresholds. 
