How Ops Teams Can Use Heal’s Configuration Drift Tracking and Log Parsing Capabilities to Troubleshoot

May 27, 2020

A Case for Situational Awareness

Our blogs so far have highlighted the predictive and preventive capabilities of Heal’s MLE (Machine Learning Engine) and the edge that self-healing gives you over other AIOps tools in mitigating outages and minimizing downtime. However, there are scenarios where autonomous self-healing is not possible because the available information is insufficient, or because a previously unseen scenario occurs for which no action trigger has been set up. In such situations, it falls to Operations teams to find an accurate root cause for the early warning signal or issue being raised, so that appropriate healing measures can be initiated and system metrics brought back to normal as quickly as possible. In today’s increasingly complex enterprise environments, this is akin to finding a needle in a haystack.

Heal’s Situational Awareness capability helps your Operations teams drill down into issues and possible root causes faster. We have already seen how MLE’s patented Workload-Behaviour technique raises early warning signals predictively and preventively, even before transaction SLAs have been affected, and correlates events intelligently so that service dependencies and root causes can be isolated. We also saw how relevant forensic and diagnostic data can help remediate early warnings faster and more accurately via proactive self-healing. Today, we focus on the third dimension of root cause analysis – situational awareness.

What is Situational Awareness?
Our previous blog series on Site Reliability Engineering specifically mentioned the integral capabilities an AIOps/Self-healing tool must have to bring about much-needed automation in operations.

Fig 1: Dimensions of Smart Automation required for Ops readiness.

The highlighted point is the one we will focus on today – how a time-synchronized troubleshooting context can aid in the automation of issue resolution. Once an early warning signal is raised, or a problem causing a breach in any transaction SLA is detected, the next step is to present relevant root cause scenarios along with the additional data required to perform RCA (root cause analysis). This is achieved by collecting contextual data at the time of the anomaly: forensic data capturing the state of the processes and queries running on the system at that moment, database query statistics and code-level deep dives, snippets from log files pertaining to the error being analysed, and any configuration changes in the system or the components running on it in the run-up to the issue. This situational awareness makes Heal’s RCA approach more coherent and complete.

There are 4 aspects of Situational Awareness that Heal encompasses:

  • Configuration Change Tracking
  • Log Scan
  • Database Query Deep-dive
  • Code level instrumentation

In today’s blog, we cover the first two.

Configuration Change Tracking in Heal
Changes in system and service configurations often lead an application to behave differently than expected. For example, an OS patch upgrade may cause an application to run slower under the same load as before, without any changes to the code. There can also be human errors when modifying configuration files; even an erroneously entered whitespace or a typo has been known to cause havoc in enterprise systems. It is often difficult to establish when changes were made and what exactly changed. In Cloud systems, configuration drift occurs whenever someone makes a change to the production environment without recording it and without ensuring complete parity between staging and production. This can result in unanticipated bugs and the need for rapid incident response.

Heal analyses any configuration drift that has occurred in the system in the run-up to the issue. It provides a feature called Config Watch to monitor configuration changes in services. These changes are pertinent for troubleshooting teams to perform a more comprehensive RCA. Heal compares workload signatures temporally to examine the exact workload pattern that caused system metrics to go awry vis-à-vis the previous occurrence of a similar workload signature when the system was behaving normally. If the system is misbehaving now under the same workload pattern it had earlier processed without any issues, it clearly means something in the system has changed since then. This gives us the timeframe within which to compare snapshots of system configuration and highlight changes that may have led to the current issue.
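
To make the idea concrete, here is a minimal sketch (in Python, not Heal’s actual implementation) of how that comparison window could be derived: find the most recent time the same workload signature was processed without an anomaly, and compare the configuration snapshots taken between then and now. The signature labels and history structure are purely illustrative.

```python
from datetime import datetime

# Hypothetical history of (timestamp, workload_signature, anomalous) observations.
history = [
    (datetime(2020, 5, 20, 12, 30), "SIG-42", False),
    (datetime(2020, 5, 23, 12, 30), "SIG-17", False),
    (datetime(2020, 5, 27, 12, 30), "SIG-42", True),   # current anomalous run
]

def drift_window(history, current_time, signature):
    """Return the (start, end) window within which configuration snapshots
    should be compared: from the last time the same workload signature was
    processed without issues, up to the time of the current anomaly."""
    normal_runs = [t for t, sig, anomalous in history
                   if sig == signature and not anomalous and t < current_time]
    if not normal_runs:
        return None  # no healthy baseline exists for this signature
    return max(normal_runs), current_time

print(drift_window(history, datetime(2020, 5, 27, 12, 30), "SIG-42"))
# (datetime.datetime(2020, 5, 20, 12, 30), datetime.datetime(2020, 5, 27, 12, 30))
```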

Config Watch provides a comparison between the previous state of the system and the current state. A snapshot of the system is taken periodically and compared with the current snapshot. The periodicity of snapshot collection is configurable. The different types of Config Watch provided by Heal are:

Property Watch
Property files (.prop, .conf, etc.) contain key-value pairs. Property Watch detects any changes in the property values and flags the time at which the change was detected.
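
As an illustration only (a simplified sketch, not the product’s code), a property-watch pass amounts to parsing two snapshots of a property file into key-value pairs and classifying each key as added, deleted or modified:

```python
def parse_properties(text):
    """Parse simple 'key=value' lines, ignoring blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def diff_properties(old_text, new_text):
    """Classify changes between two snapshots of a property file."""
    old, new = parse_properties(old_text), parse_properties(new_text)
    return {
        "added":    {k: new[k] for k in new.keys() - old.keys()},
        "deleted":  {k: old[k] for k in old.keys() - new.keys()},
        "modified": {k: (old[k], new[k]) for k in old.keys() & new.keys()
                     if old[k] != new[k]},
    }

old_snapshot = "maxConnections=100\nconnectionTimeout=180\n"
new_snapshot = "maxConnections=10\nconnectionTimeout=180\n"
print(diff_properties(old_snapshot, new_snapshot))
# {'added': {}, 'deleted': {}, 'modified': {'maxConnections': ('100', '10')}}
```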

File Watch
File Watch detects whether any monitored file has changed, recording the timestamp of the modified file and the time at which the change was detected. Since hash values of file contents are compared, inadvertent additions like spaces will not be flagged.
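
A file watch can be sketched along the following lines: hash each monitored file and flag any file whose digest differs from the previous snapshot, together with its modification time. This simplified version hashes raw file contents, so it is only an approximation of the behaviour described above.

```python
import hashlib
import os
import time

def snapshot(paths):
    """Record a content hash and modification time for each monitored file."""
    snap = {}
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        snap[path] = {"sha256": digest, "mtime": os.path.getmtime(path)}
    return snap

def diff_files(previous, current, detected_at=None):
    """Flag files whose content hash changed between two snapshots,
    classified as Added, Modified or Deleted."""
    detected_at = detected_at or time.time()
    changes = []
    for path, info in current.items():
        old = previous.get(path)
        if old is None:
            changes.append((path, "Added", info["mtime"], detected_at))
        elif old["sha256"] != info["sha256"]:
            changes.append((path, "Modified", info["mtime"], detected_at))
    for path in previous.keys() - current.keys():
        changes.append((path, "Deleted", None, detected_at))
    return changes
```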

Query Watch
Query Watch monitors configuration values stored in the database. You can obtain configuration values stored as key-value pairs (typically in a CMDB, or Configuration Management Database) by running queries on config tables.
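
A query watch can be approximated by running a configured SQL statement against the config store on each cycle and diffing the returned key-value rows. The sketch below uses an in-memory SQLite table as a stand-in for a real CMDB; the table, column names and query are assumptions for illustration.

```python
import sqlite3

# Stand-in for a CMDB config table; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_config (cfg_key TEXT PRIMARY KEY, cfg_value TEXT)")
conn.executemany("INSERT INTO app_config VALUES (?, ?)",
                 [("maxConnections", "100"), ("connectionTimeout", "180")])

CONFIG_QUERY = "SELECT cfg_key, cfg_value FROM app_config"

def query_snapshot(connection, query=CONFIG_QUERY):
    """Run the configured query and return the config rows as a dict."""
    return dict(connection.execute(query).fetchall())

baseline = query_snapshot(conn)

# ... later, someone changes a value in the config table ...
conn.execute("UPDATE app_config SET cfg_value = '10' WHERE cfg_key = 'maxConnections'")

current = query_snapshot(conn)
modified = {k: (baseline[k], current[k])
            for k in baseline.keys() & current.keys() if baseline[k] != current[k]}
print(modified)  # {'maxConnections': ('100', '10')}
```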

The types of changes detected in each case are: Added, Modified and Deleted.

The following figure shows a service dashboard with Config Watch integrated, where any configuration changes in a service are highlighted in yellow. The Operations teams can drill down into the Configuration Watch KPI category to see the various changes under the three categories of Config Watch mentioned above.

Fig 2a. Service Dashboard for the Booking service showing configuration changes over the selected timeline in one of the service instances.

Fig 2b. Details of configuration changes under Property Watch and File Watch. The page shows a list of properties added/deleted/modified over the selected duration in each monitored file.

Log Scanning in Heal
Many AIOps tools rely on log management to put together a coherent RCA. Log files are indexed and searched by applying Big Data and Data Mining principles, and relevant log patterns with timestamps are stitched together to arrive at a chronology of events that most likely transpired for the issue to occur. However, the onus of configuring these log patterns is on the ITOps teams. Where Heal scores over these tools is in the context – since log snippets are retrieved as forensic data in response to a signal, they are timelier and more pertinent to the issue being analysed. In addition, the snippet(s) to be retrieved depend on the nature of the KPI that is in violation of its static/dynamic threshold. The mapping of a KPI to the corresponding log pattern to be extracted as forensic data is done at configuration time, and additional patterns can be added as and when required. This brings situational awareness to the analysis.
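
Conceptually, the forensic log scan boils down to a per-KPI pattern applied to the relevant log around the time of the signal. The sketch below is illustrative only: the KPI names, regular expressions and log timestamp format are assumptions, not Heal’s actual configuration.

```python
import re
from datetime import datetime, timedelta

# Illustrative mapping of KPIs to the log pattern collected as forensic data.
KPI_LOG_PATTERNS = {
    "CONNECTION_POOL_PERCENTAGE_USED": re.compile(r"ORA-\d+.*connection pool", re.I),
    "CONNECTION_WAITING_THREADS":      re.compile(r"ConnectionWaitTimeoutException|ORA-03136"),
}

def collect_snippets(log_lines, kpi, signal_time, window_minutes=10):
    """Return log lines matching the KPI's pattern within the signal's time window."""
    pattern = KPI_LOG_PATTERNS[kpi]
    start = signal_time - timedelta(minutes=window_minutes)
    snippets = []
    for line in log_lines:
        # Assume each line starts with an ISO-8601 timestamp, e.g. "2020-05-27T12:45:03 ..."
        ts = datetime.fromisoformat(line[:19])
        if start <= ts <= signal_time and pattern.search(line):
            snippets.append(line)
    return snippets
```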

An example of a log pattern tied to a system KPI could be an ORA error on connection pool size when the CONNECTION_POOL_PERCENTAGE_USED KPI on a database server maxes out. Similarly, a signal on the Connection Waiting Thread KPI could trigger collection of the connection wait/timeout messages from the SystemOut logs, as well as ORA-03136 errors from the Oracle listener logs. ORA-03136 means that an authentication was not completed within the specified time. Both error messages taken in context can indicate that the server is heavily loaded or RAM utilization is high, because of which it cannot complete the client logon within the specified timeout.

Fig 3: Log snippets from the Oracle Listener logs showing ORA-12170 errors on the eCommerce DB Service instance, captured as forensic data in response to anomalies in the DB Sessions KPI.

Customer Case Study
Situational Awareness using Heal saved the day at a large pan-African bank with a presence in over 20 countries. Users were unable to log into the Corebanking application. CPU Utilization was repeatedly hitting the 100% ceiling on the e-Banking production server between 1230 and 1330 hours and there was also a drastic increase in the JVM CPU Usage on the WebSphere application server.

Analysis of the SystemOut.log and SystemErr.log snippets collected as part of the JVM CPU Usage KPI forensic action trigger revealed that there were no connections available for the datasource configured in the WebSphere Application Server.

J2CA0045E: Connection not available while invoking method createOrWaitForConnection (ConnectionWaitTimeoutException) for jdbc/<app_name_redacted>

The connection pool failed to create a new connection object on the WAS server and was hence unable to connect to the backend resource.

Background for the Error
The J2CA0045E exception occurs when your JDBC connection pool has reached the maximum number of database connections you allow, and a subsequent database connection request from your application cannot be serviced before the connection timeout value configured on your datasource expires.

When you configure a WebSphere Application Server managed datasource, you set two parameters that directly affect this condition: the total number of connections the datasource is allowed to open (MAX connections), and, when all available connections are already active or in use, how long a new connection request will wait for one to be released and become available (Connection Timeout).

In many cases, the default MAX connections value of 10 is simply too small to keep up with application demands, and the error can be eliminated by increasing the MAX connections allowed. However, you should also be aware of any maximum connection restrictions set at your database server. You do not want your WebSphere Application Server to attempt to exceed the number of connections that your database server will allow. (source: IBM Support)
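
For reference, on a WebSphere managed datasource these pool settings can be inspected and adjusted through the wsadmin scripting client (run with -lang jython). The snippet below is a hedged sketch: the datasource name is hypothetical, and the attribute names should be confirmed against your WAS version’s documentation.

```python
# Run inside wsadmin (-lang jython). 'MyAppDataSource' is a hypothetical name --
# substitute the datasource configured for your application.
dsId   = AdminConfig.getid('/DataSource:MyAppDataSource/')
poolId = AdminConfig.showAttribute(dsId, 'connectionPool')

# Inspect the current pool ceiling and connection timeout.
print(AdminConfig.showAttribute(poolId, 'maxConnections'))     # e.g. '10'
print(AdminConfig.showAttribute(poolId, 'connectionTimeout'))  # e.g. '180'

# Raise the pool ceiling to the intended value and persist the change.
AdminConfig.modify(poolId, [['maxConnections', '100']])
AdminConfig.save()
```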

What Config Watch Revealed
For an average of 100 concurrent users, the MAX connections parameter for the datasource should ideally be set to 100. However, Config Watch revealed that the night before this issue occurred, a script had run in the production environment and erroneously restored the environment to default values. The MAX connections property value in the datasource configuration properties file had been changed from 100 to the default value of 10. This was why connections were unavailable and kept timing out as concurrent users increased during peak load.

Fig 4: Increasing MAX connections back to 100 would allow the datasource to create sufficient connections to service a higher number of concurrent users.

Restoring the MAX connections value to 100 as per the previous configuration resolved the issue.

Conclusion
We have covered some aspects of situational awareness, i.e. the time-synchronized troubleshooting context that makes Heal’s RCA more comprehensive, accurate and pertinent to the issue at hand. In the next blog of this series, we will cover Java code deep-dives and database query statistics as additional aids in troubleshooting and pinpointing root causes. Meanwhile, you can read the documentation on Config Watch here to learn more about the feature. Keep tuning in, and do reach out to us if you want to know more about how Heal can help your enterprise achieve zero downtime!