Using Self-healing for Ops Monitoring Readiness and Site Reliability for SAP
Today we conclude our 4-part blog series on self-healing as the solution for ensuring site reliability in modern-day enterprises with ultra-dynamic workloads. In the previous blogs of the series, we saw the issues that typically plague such environments and how AIOps and self-healing tools help resolve them, particularly how Heal helps by monitoring the Four Golden Signals for SRE. We also focused on E-Commerce and how SRE for this domain can benefit from self-healing in general and Heal in particular.
Our final blog of the series today will focus on Operational Monitoring readiness for SAP – the various shortcomings in the current SAP monitoring landscape, the advantages that AIOps tools and modern SAP monitoring tools like IT-Conductor provide, and the incremental benefits that Heal brings to the table. We conclude with some use cases for self-healing in SAP environments, including proactive, autonomous and projected healing scenarios.
Issues in SAP Monitoring
Native monitoring of SAP environments is tightly coupled with the SAP Solution Manager and its capabilities. The Solution Manager (SolMan for short) generates alerts across metrics based on static thresholds. While it does provide 360-degree monitoring, sifting through the overwhelming amount of data to arrive at a root cause and reduce MTTR (Mean Time to Resolution) is exhausting. Static thresholds can also generate alerts that are irrelevant to the issue at hand, which then have to be analysed, so administrators are always in tactical firefighting mode.
Operations teams are hard-pressed to find out whether an error originated on the client side, on the server side, in the ECC (ERP Central Component) or in the network. Without a tier-wise breakdown of response times for critical processes, a root cause lying outside the SAP ECC infrastructure itself may not be isolated; analysts might waste valuable time looking into ECC processes and health when the actual root cause is a bad network or end-user latency. As a result, remediation also becomes a trial-and-error process.
The current SAP monitoring landscape looks like this:
Fig 1: Current SAP Monitoring Landscape.
System and business process monitoring (including interface monitoring) in SAP Solution Manager uses the Computing Center Management System (CCMS) architecture. This means that alerts that occur in the local CCMS are passed to SAP Solution Manager via RFC connections between SAP Solution Manager and the relevant managed systems. Alerts can be handled centrally without having to switch to the local CCMS of the managed systems. Alternatively, you can identify the alerts from multiple systems of a solution landscape in a graphical overview in SAP Solution Manager. All alerts emerging from multiple heterogeneous systems in the landscape are dependent on static thresholds.
DBA Cockpit is a platform-independent tool that you can use to monitor and administer your database. It provides a graphical user interface (GUI) for all actions and covers many aspects of handling a database system landscape. The DBA Cockpit is part of SAP NetWeaver ABAP systems and integrated into SAP Solution Manager. You can run the DBA Cockpit as part of your system administration activities in SAP Solution Manager. The DBA Cockpit is optimized for handling administration and monitoring the databases of your entire system landscape from a central system. You can administer and monitor remote databases from the DBA Cockpit using remote database connections.
Other than native monitoring tools that are packaged with SAP ABAP/HANA, there are specialized solutions from IBM Tivoli, OzSoft, BMC ProactiveNet and IT-Conductor, which provide an enhanced level of process automation, more control over thresholds (including dynamic and standard deviation thresholds), and event-based recovery actions.
In the last rung, we have the APM tools with SAP Connectors, like AppDynamics and Dynatrace, or AIOps tools specifically targeting SAP monitoring and automation – players like Avantra, FixStream and MicroFocus.
How do AIOps Tools help?
In the previous section, we examined some pain points that native SAP monitoring tools suffer from:
- High number of data points being ingested
- Alert storms caused by static thresholds
- Lack of accurate root cause suggestions and no automated remedial workflows
AIOps tools for SAP environments address these three main pain points.
Addressing Alert Fatigue
The first key pain point that an AIOps solution addresses is alert fatigue resulting from the volume and variety of alerts generated by the Solution Manager and a heterogeneous set of legacy tools to monitor a growing landscape of systems and solution stack layers. The sheer volume of alerts means that some alerts get totally ignored – and those could be the ones instrumental in determining root cause and minimizing MTTR. AIOps tools use AI/ML techniques to generate behaviour profiles, dynamic thresholds and event correlations to reduce alert storms and extend the Solution Manager’s capabilities.
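To make the difference between static and dynamic thresholds concrete, here is a minimal sketch. It is not how any particular AIOps product computes its baselines; it simply assumes a rolling window with a "mean plus k standard deviations" limit, and the CPU samples and function names are hypothetical.

```python
from statistics import mean, stdev


def static_threshold_alerts(values, limit):
    """Indices of samples above a fixed limit."""
    return [i for i, v in enumerate(values) if v > limit]


def dynamic_threshold_alerts(values, window=5, k=3.0):
    """Indices of samples above mean + k * stdev of the preceding window."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        upper = mean(history) + k * stdev(history)
        if values[i] > upper:
            alerts.append(i)
    return alerts


# Hypothetical CPU utilisation samples (%): a batch window that runs
# hot but steady, with one genuine spike at index 8.
cpu = [82, 84, 83, 85, 84, 83, 85, 84, 99, 84]

static_alerts = static_threshold_alerts(cpu, limit=80)
dynamic_alerts = dynamic_threshold_alerts(cpu, window=5, k=3.0)

print("static :", static_alerts)   # every sample breaches the fixed 80% limit
print("dynamic:", dynamic_alerts)  # only the spike stands out from recent behaviour
```

With these samples the static limit fires on all ten readings, while the rolling limit flags only the spike. That is the alert-storm reduction described above, in miniature.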
Expediting Root Cause Analysis
The second key pain point with SAP landscapes is making sense of all the data that SAP system logs and legacy heterogeneous monitoring tools collect to isolate the root cause of an error. IT teams need to work with people of diverse skill sets including administrators, solution architects, developers, solution stack component experts, infrastructure services specialists, business process/functional experts, and support process stakeholders. AIOps tools expedite root cause analysis by providing incident contextualization based on log analytics and machine learning algorithms that analyse event correlation, incident co-occurrence and time series data.
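As an illustration of event correlation and incident co-occurrence, the sketch below groups alerts that arrive close together in time and counts which sources tend to fire together. The 60-second window, the source names and the timestamps are assumptions made for this example, not details of any specific tool.

```python
from collections import Counter
from itertools import combinations


def group_by_window(alerts, window_seconds=60):
    """Group alerts so each one is within window_seconds of the previous alert in its group."""
    groups = []
    for ts, source in sorted(alerts):
        if groups and ts - groups[-1][-1][0] <= window_seconds:
            groups[-1].append((ts, source))
        else:
            groups.append([(ts, source)])
    return groups


def co_occurrence(groups):
    """Count how often each pair of alert sources appears in the same group."""
    counts = Counter()
    for group in groups:
        sources = sorted({source for _, source in group})
        counts.update(combinations(sources, 2))
    return counts


# Hypothetical alerts as (seconds since start, source).
alerts = [
    (0, "network"), (20, "app_server"), (45, "dialog_response"),
    (600, "database"),
    (1200, "network"), (1230, "dialog_response"),
]

groups = group_by_window(alerts)
pairs = co_occurrence(groups)

print(len(groups), "incident groups")
print(pairs.most_common(1))
```

Here six raw alerts collapse into three incident groups, and the pair that co-occurs most often (network with dialog response time) is a natural first candidate when looking for a root cause outside the ECC.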
Eliminating Trial-and-error Remediation
AIOps solutions for SAP environments can link probable root causes to the SAP knowledge base of suggested best practices and remediation actions. This focuses the remediation process on a narrower set of options, potentially leading to faster resolution.
Smart Automation for Ops Monitoring Readiness
To know if everything is ‘okay’ with the SAP environment, here are some questions Operations teams need to know the answers to:
- Is my environment available?
- Are my background jobs finishing correctly and on time?
- Am I able to arrive at root causes for my alerts/incidents?
- Am I receiving too many alerts I do not know what to do with?
- Have I planned my environment well enough to scale?
- Can I predict a job’s failure ahead of time?
What is required to address the above concerns is smart automation along the following dimensions:
Fig 2: Dimensions of Smart Automation Required for Ops Readiness.
As we have elaborated in our previous blogs, Heal provides features that map to each of the capabilities listed above:
- Availability and Performance monitoring are achieved through our own native agents or connectors to other APM and AIOps tools. Full stack visibility is also possible thanks to connectors for cloud and container environments.
- Our MLE intelligently stitches together an event correlation for root cause suggestions superimposed on application topology so that service dependencies and workload-behaviour correlations can be examined.
- At the time of an early warning signal or problem, the entire snapshot of the system is captured in the form of DB query deep-dive data, code snapshots, log snippets and forensic / diagnostic data that all aid in manual remediation and analysis of the root cause of the issue.
- Proactive healing is carried out by examining the contextual data captured as mentioned above. Autonomous healing can also be implemented via action scripts that trigger on a certain set of conditions prevailing in the system at the time of the issue. For instance, a CPU-hogging background process can be deferred if its priority flag is set to false.
- Service level management is carried out by detailed dashboards giving an insight into service performance, metric graphs with forensic data as applicable, workload and transaction traces, query deep-dive and code snapshots.
- Service Impact Awareness is done at two levels – hotspot analysis to determine the metric breaches across services in response to current workloads, as well as what-if analysis on capacity forecasting for projected workloads. Both these scenarios rely on the workload-behaviour correlation performed by our Machine Learning Engine.
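The autonomous healing example in the list above (deferring a CPU-hogging background process whose priority flag is false) boils down to a condition check. The sketch below shows only that decision logic; the class, the 50% CPU limit and the job names are hypothetical, and it is not Heal's actual action script format.

```python
from dataclasses import dataclass


@dataclass
class BackgroundProcess:
    name: str
    cpu_percent: float
    high_priority: bool


def select_processes_to_defer(processes, cpu_limit=50.0):
    """Names of processes that exceed cpu_limit and are not flagged high priority."""
    return [
        p.name
        for p in processes
        if p.cpu_percent > cpu_limit and not p.high_priority
    ]


# Hypothetical snapshot of background jobs at the time of the signal.
processes = [
    BackgroundProcess("Z_REPORT_ARCHIVE", 72.0, False),  # hot and low priority
    BackgroundProcess("Z_PAYROLL_RUN", 65.0, True),      # hot but must not be deferred
    BackgroundProcess("Z_LOG_CLEANUP", 8.0, False),      # low priority but harmless
]

to_defer = select_processes_to_defer(processes)
print(to_defer)
```

Only the archive job qualifies: the payroll run is protected by its priority flag, and the cleanup job is not consuming enough CPU to matter.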
Self-Healing – the “Extra” Element for Ops Readiness and SRE
In the previous section, we visited some critical questions SREs need to ask themselves while preparing their SAP systems for Ops Monitoring Readiness. AIOps tools provide most of the features required for smart automation; however, they fall short in two areas that matter for keeping a complex SAP environment always available:
- Proactive signalling and autonomous healing i.e. automated recovery (to answer the question: Can I predict a job’s failure ahead of time, and do something to prevent it?)
- Projective planning for workload impact awareness at a service level (to answer the question: Have I planned my environment well enough to scale?)
Heal brings both these capabilities to the table – let’s see how.
Proactive Signalling and Autonomous Healing
Heal’s Machine Learning Engine (MLE) uses our patented workload-behaviour correlation technique to establish dynamic baselines for behaviour in response to system load and raise meaningful signals if workload (dialog and background/batch processes) or behaviour violate these baselines. You can read more about this in our blog: How Heal’s Machine Learning Engine Supports Preventive Self-Healing.
Through its Ingestion APIs, Heal can also ingest alerts from SolMan or existing monitoring agents like CCMS, or derive anomalies on dynamic thresholds based on system KPIs fed into MLE. Heal also automatically derives correlations in behaviour anomalies across tiers to come up with accurate root cause suggestions. Since Heal’s MLE can also work directly with behaviour KPIs ingested from the SAP environments through Heal’s own agents, it can raise early warning signals before an incident resulting in service unavailability or transaction slowness occurs, without depending on the post-facto alerts raised by SolMan or CCMS on static thresholds. Proactive or autonomous healing actions can then be initiated to pre-empt a system outage. Workload can be shaped/throttled to pre-empt resource bottlenecks and effect app-aware scaling.
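The idea of judging behaviour relative to workload, rather than against a fixed limit, can be sketched with a simple linear fit. This is only an illustration under stated assumptions: Heal's patented technique is not described in this post, so the least-squares model, the 3-sigma rule and the sample data below are ours.

```python
from statistics import mean, stdev


def fit_baseline(workload, behaviour):
    """Least-squares line behaviour ~ slope * workload + intercept, plus residual spread."""
    w_mean, b_mean = mean(workload), mean(behaviour)
    sxx = sum((w - w_mean) ** 2 for w in workload)
    sxy = sum((w - w_mean) * (b - b_mean) for w, b in zip(workload, behaviour))
    slope = sxy / sxx
    intercept = b_mean - slope * w_mean
    residuals = [b - (slope * w + intercept) for w, b in zip(workload, behaviour)]
    return slope, intercept, stdev(residuals)


def is_anomalous(workload, behaviour, baseline, k=3.0):
    """True if behaviour deviates from what the workload explains by more than k spreads."""
    slope, intercept, spread = baseline
    expected = slope * workload + intercept
    return abs(behaviour - expected) > k * spread


# Hypothetical history: dialog steps per minute vs. CPU utilisation (%).
steps_per_minute = [100, 200, 300, 400, 500]
cpu_percent = [21, 29, 41, 49, 60]
baseline = fit_baseline(steps_per_minute, cpu_percent)

# 55% CPU is expected at 450 steps/min but unexplained at 200 steps/min.
print(is_anomalous(450, 55, baseline))
print(is_anomalous(200, 55, baseline))
```

Neither reading would trip a static 80% CPU threshold, yet the second one signals that something other than workload is consuming CPU. This is the kind of early warning that does not depend on post-facto static alerts.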
Projective Planning for SAP Environment to Scale
The workload-behaviour correlation that MLE performs is also used for projected healing, that is, to apply application-centric workload projections and forecast infrastructure and configuration requirements, so that services are neither over-provisioned nor under-provisioned. The performance, stability and availability of the entire SAP stack are baselined, so the success of an S/4HANA and HANA migration effort from an operational standpoint can be ascertained. Resource usage based on seasonality in workload and spikes in load during the day/week/month/year is also baselined via AI/ML techniques using a variety of models. This helps identify hotspots, issue scale warnings and plan capacity. You can read more about this in our blog: Proactive and Projected App-aware Scaling.
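A what-if analysis of this kind can be pictured as applying a projected workload to per-resource models and classifying the result. The sketch below assumes simple linear models (slope, intercept) per resource and hypothetical 80% and 30% cut-offs; real capacity models, including seasonality, would be considerably richer.

```python
def what_if(models, projected_workload, high=80.0, low=30.0):
    """Classify each resource for a projected workload using linear (slope, intercept) models."""
    report = {}
    for resource, (slope, intercept) in models.items():
        projected = slope * projected_workload + intercept
        if projected >= high:
            status = "hotspot"
        elif projected <= low:
            status = "over-provisioned"
        else:
            status = "ok"
        report[resource] = (round(projected, 1), status)
    return report


# Hypothetical models: utilisation (%) as a function of dialog steps per minute.
models = {
    "app_server_cpu": (0.098, 10.6),
    "db_cpu": (0.05, 15.0),
    "batch_host_cpu": (0.01, 5.0),
}

report = what_if(models, projected_workload=800)
for resource, (projected, status) in report.items():
    print(f"{resource}: {projected}% -> {status}")
```

For a projected 800 steps per minute, the application server shows up as a choke point while the batch host looks wasteful, which mirrors the two outcomes (scale warnings and resource wastage) that projected healing is meant to surface.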
How Heal for SAP Works
The solution architecture for Heal in SAP environments looks like this:
Fig 3: Heal for SAP – Solution Architecture and data flows.
The Heal SAP agent collects various metrics from the environment, including:
- Discovery & Configuration: Landscape discovery information and configuration details for services
- Workloads: User workload, Datacenter processes, end user transactions, batch jobs (background workload)
- Component Metrics: Relevant performance metrics from SAP application servers, databases and hosts (using Heal Component Agent)
- Logs
The information is collected by invoking specific T-Codes over RFC (Remote Function Calls). The data bus processes the data in real time and populates our ExOps data repository, from where the Machine Learning Engine retrieves the required performance data in batch/non-real-time mode for baselining and creating models, and in real-time mode for live/streaming analytics. The insights derived via streaming analytics, mainly the workload-behaviour correlation and anomaly signals, are persisted back to our data store for reference. These are then processed in real time by an Action Trigger to issue notifications, trigger forensic data collection and healing actions, and populate real-time signal dashboards. A Query API allows retrieval of historical data to view application trends, generate scheduled/on-demand reports and populate the user interface.
Sample Use Cases for Heal in SAP SRE
The table below summarizes some use cases and scenarios where Heal helped mitigate downtime and outages in SAP environments through its event correlation and healing capabilities.
Example of Autonomous Healing – Application-aware Scaling
Fig 4: Illustration of scenario no. 3 in the above table; workload surge is identified and application of ABL results in early warning signal of a potential outage due to metric breach (in this case, CPU). Workload is throttled by deferring low-priority transactions.
Example of Projected Healing – Capacity Forecasting via What-if Analysis
Fig 5: Illustration of scenario no. 6 in the above table; MLE applies ABL to identify choke points when specific workloads increase. Projected healing is used to perform what-if analysis and showcase performance choke points as well as resource wastage, if any.
Conclusion
Our blog today concludes our series on Site Reliability Engineering and operational monitoring readiness for different environments, and how self-healing helps your enterprise prepare to meet operational challenges. Although our blog today mentions the Heal SAP agent, you can even try out the Heal Agentless Edition with our ingestion API to stream your existing monitoring data into Heal’s MLE for deriving insights. In the future, we will be coming out with more blogs on industry-specific monitoring and operational issues, and how self-healing is a solution for all your SRE needs. Meanwhile, do keep tuning in to our website for a new blog every week, and do get in touch with us to schedule a demo of Heal for you to know more about the power of self-healing!