A Case for Situational Awareness: Part 2
Our last blog discussed Heal’s Situational Awareness capability, specifically how configuration change tracking and log parsing can help Operations teams pinpoint the exact root cause behind an early warning or problem. There are four aspects of situational awareness, as detailed in the figure below.
Fig 1: Dimensions of Situational Awareness provided by Heal.
In today’s blog, we go over the other two aspects of gleaning context from your enterprise data to arrive at an accurate root cause analysis (RCA) faster:
- Code-level snapshots via instrumentation
- Database query-level statistics
Code-level Snapshots via Instrumentation
Services and applications rarely exist in isolation. They typically receive inputs from multiple sources and exit to multiple components – some of them external to the enterprise – such as cloud databases and storage, API gateways and ESBs. In such a scenario, it becomes difficult to pinpoint exactly why a transaction has slowed down: is it an underlying application issue, or one of these exits – a non-responsive API gateway, a faulty storage switch, or a badly written query that takes too long to execute? For instance, teams monitoring Java-based applications often have no visibility into the underlying method- and class-level activity of their application. Without any data on these activities, it is difficult to gain insight into application performance and isolate problems.
Intrusively monitoring an application helps users identify bottlenecks such as slow database queries, calls to other applications, and exceptions thrown. This can be done using an agent that instruments and monitors the underlying code. The agent can track transactions/methods that are either running slowly or throwing exceptions, and capture the stack trace of the offending transaction/method. This data can then be tied to the request on the respective service so that users can drill down into method-level execution statistics to identify problem areas.
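To make this concrete, here is a simplified, hypothetical sketch of the kind of data such an agent gathers around a monitored method – execution time, the call stack when the call is slow, and the exception details when it fails. A production agent such as Heal’s injects equivalent logic via bytecode instrumentation at class-load time rather than an explicit wrapper; the class name and threshold below are purely illustrative.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.function.Supplier;

// Simplified sketch of what an instrumentation agent records around a monitored
// method: execution time, a stack trace when the call is slow, and the exception
// stack trace when it fails. A real agent injects this logic via bytecode
// instrumentation (java.lang.instrument) rather than an explicit wrapper.
public class MethodSnapshotSketch {

    private static final long SLOW_THRESHOLD_MS = 500; // illustrative threshold

    public static <T> T monitor(String methodName, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } catch (RuntimeException e) {
            // Failed call: capture the exception and its stack trace
            System.out.printf("[%s] %s FAILED: %s%n", Instant.now(), methodName, e);
            e.printStackTrace();
            throw e;
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > SLOW_THRESHOLD_MS) {
                // Slow call: capture the current call stack at method exit
                System.out.printf("[%s] %s took %d ms (slow)%n", Instant.now(), methodName, elapsedMs);
                Arrays.stream(Thread.currentThread().getStackTrace())
                      .forEach(frame -> System.out.println("    at " + frame));
            }
        }
    }
}
```

A caller could wrap a suspect method as `MethodSnapshotSketch.monitor("BookingService.confirm", () -> confirmBooking(request))` (hypothetical names); an agent-based approach captures the same data without any code changes.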
Intrusive monitoring is one of the most powerful mechanisms for monitoring applications and is not new to those familiar with APM terminology. Most popular APM tools provide the capability to take code execution snapshots and capture method entries and exits, mapping them to slow transactions.
Java Intrusive Monitoring (JIM) in Heal can be used for a variety of purposes, some of which include:
- Basic workload monitoring
- Performing code deep dive by collecting stack traces
- Helping troubleshooting activities by capturing method arguments, bind variables, etc.
- Request tracing and stitching
- Business transaction monitoring by collecting business values
Heal also goes a step further by gathering these code snapshots in the context of an issue, so that findings from analysing the execution traces can be tied into proactive/autonomous healing measures. When Heal’s machine learning engine (MLE) detects that certain requests are slowing down, based upon dynamic thresholds computed on transaction response times, it triggers a special type of forensic action: taking code snapshots of those requests.
Fig 2: Slowness on requests inbound to a Java service called Booking.
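How might the trigger described above look in practice? The minimal sketch below assumes a simple mean-plus-k-standard-deviations threshold over a sliding window of recent response times; Heal’s MLE computes its thresholds with its own models, and the window size, multiplier and takeCodeSnapshot hook here are hypothetical stand-ins.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a dynamic-threshold check that triggers a forensic code snapshot
// when a request's response time deviates from its recent baseline.
public class SlownessTrigger {

    private final Deque<Long> recent = new ArrayDeque<>(); // sliding window of response times (ms)
    private final int windowSize = 200;                    // hypothetical window
    private final double sigmaMultiplier = 3.0;            // hypothetical sensitivity

    public void onRequestCompleted(String requestId, long responseTimeMs) {
        if (recent.size() >= windowSize && responseTimeMs > dynamicThreshold()) {
            takeCodeSnapshot(requestId, responseTimeMs);   // forensic action
        }
        recent.addLast(responseTimeMs);
        if (recent.size() > windowSize) {
            recent.removeFirst();
        }
    }

    // Threshold = mean + k * standard deviation of the recent window.
    private double dynamicThreshold() {
        double mean = recent.stream().mapToLong(Long::longValue).average().orElse(0);
        double variance = recent.stream()
                                .mapToDouble(t -> (t - mean) * (t - mean))
                                .average().orElse(0);
        return mean + sigmaMultiplier * Math.sqrt(variance);
    }

    private void takeCodeSnapshot(String requestId, long responseTimeMs) {
        // Placeholder: in Heal this would instruct the in-process agent to capture
        // the call stack, method timings and arguments for the offending request.
        System.out.printf("Snapshot triggered for request %s (%d ms)%n", requestId, responseTimeMs);
    }
}
```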
Service Dashboards in Heal have a section under their Workload tab, where slow requests can be drilled down into. Clicking on a specific request from the list takes you to a screen dedicated to this request where code snapshots, if taken, can be viewed.
Fig 3: Request dashboard for the Booking service, showing cluster-level metrics for slow/failed/successful transactions, along with options to drill down into any request. Instance-level dashboards are also available.
Fig 4a shows this request detail page, which displays a list of code snapshots for the request in reverse chronological order based on collection time. Each snapshot corresponds to an instance of the request and shows that instance’s response time, its status, and the Java Virtual Machine (JVM) which processed it. Only snapshots of slow or failed requests are displayed (failed in red, slow in orange), along with the date and timestamp of the snapshot. The first snapshot in the list is selected by default. The call stack of the selected snapshot shows instrumented methods in dark text and non-instrumented methods in gray or light text; the latter indicates that the agent has captured the method in the snapshot but does not monitor it (most likely because the framework it belongs to is not currently supported by Heal). Methods whose parameters have been captured are denoted appropriately, and the response time (time spent in each of the monitored methods) is displayed in milliseconds.
Fig 4a: Request detail page showing code snapshots taken on slow requests. Methods where parameters/arguments have been captured will be denoted via an icon, which, upon clicking, will show all arguments as name-value pairs.
Exceptions are marked in red and can be clicked to see the error in detail as shown in Fig 4b.
Fig 4b: Error details can be viewed by clicking on Exceptions (shown in red).
Exits are marked in blue and can be viewed in detail by clicking on them; this allows the user to view the parameters passed, the time spent in the exit method, and the call type. For instance, database exits will show the query being executed, as shown in Fig 4c.
Fig 4c: Exit method details can be viewed by clicking on highlighted method name in blue.
The call stack can be downloaded in PDF format. Snapshots associated with failed or slow status can be filtered as shown in Fig 4d.
Fig 4d: Filtering of snapshots on various criteria, such as status (failed/slow), response time range, snapshot time range and JVM name is possible.
The code-level deep-dive capability of Heal is a powerful tool in the hands of Operations teams staffed with developers who have knowledge of the underlying Java application code. Even if Ops teams do not have resources with a programming background, the pertinent snapshots can be shared with development teams so that corrective action can be taken at the code level and a patch deployed before issues escalate – an instance of proactive healing with multiple teams working in synchrony.
Database Query-Level Statistics in Heal
More often than not, the root cause of slowness or failures in an enterprise can be attributed to the database layer. As seen in various examples of proactive and autonomous healing as well as situational awareness, a slow-executing query on a database can cascade through the application infrastructure and impact transactions arriving at an entry-level web service. Slow database exits also slow down the business logic in the application server layer, causing hung threads and deadlocks that are notoriously difficult to locate and resolve.
Query-level statistics are exposed in different ways by different database vendors. Oracle provides the Oracle Enterprise Manager (OEM) live dashboard to help system administrators troubleshoot database waits and problematic processes, in addition to AWR (Automatic Workload Repository) reports. MS-SQL provides Dynamic Management Views (DMVs) to enable troubleshooting of issues. These constitute a database query-level deep dive, which gives insight into the queries hogging the most CPU, performing the most reads and writes, taking the longest to execute, and so forth. AWR reports additionally give data on the foreground and background wait events – including log file sync waits and buffer busy waits – consuming the most time on the database server.
Heal uses these reports as well as its own mechanisms for retrieving database performance snapshots via queries on the database session and performance tables, given adequate access. These snapshots are triggered as an additional forensic action in response to anomalies on certain database metrics related to sessions, locks, reads/writes and CPU/memory usage.
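As an illustration of what such a query-based snapshot can look like (not Heal’s exact queries), the sketch below pulls a session snapshot from Oracle’s v$session view and the top resource-consuming statements from v$sql over JDBC. The connection details are placeholders, the views require appropriate privileges and the Oracle JDBC driver on the classpath, and the column set is a small subset of what a full forensic snapshot would gather.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative snapshot of Oracle session and query statistics over JDBC.
// Connection details are placeholders; v$session and v$sql need suitable grants.
public class DbSnapshotSketch {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ECOMDB", "monitor_user", "secret");
             Statement stmt = conn.createStatement()) {

            // Sessions holding on to connections: status plus idle time in seconds
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT sid, username, status, last_call_et, sql_id "
                  + "FROM v$session WHERE type = 'USER' ORDER BY last_call_et DESC")) {
                while (rs.next()) {
                    System.out.printf("sid=%d user=%s status=%s idle=%ds sql_id=%s%n",
                            rs.getInt("sid"), rs.getString("username"),
                            rs.getString("status"), rs.getLong("last_call_et"),
                            rs.getString("sql_id"));
                }
            }

            // Top queries by CPU time, with read and execution counts
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT * FROM (SELECT sql_id, executions, disk_reads, buffer_gets, "
                  + "cpu_time, elapsed_time FROM v$sql ORDER BY cpu_time DESC) "
                  + "WHERE ROWNUM <= 10")) {
                while (rs.next()) {
                    System.out.printf("sql_id=%s exec=%d reads=%d cpu_us=%d%n",
                            rs.getString("sql_id"), rs.getLong("executions"),
                            rs.getLong("disk_reads"), rs.getLong("cpu_time"));
                }
            }
        }
    }
}
```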
Imagine a scenario where Heal has flagged an early warning signal on session-related database metrics, say, maximum and inactive DB sessions. The corresponding forensic data shows us the count of currently active and inactive sessions and the count of users inactive for more than 48 hours, as well as log snippets of connection timeouts seen in the Oracle listener log file and ORA-03136 errors in the Oracle 11g alert.log. However, this information is not sufficient to tell us exactly which query is holding on to a database session longer than expected, causing a pile-up of requests and subsequent connection timeouts.
Fig 5a: Early Warning signal due to anomalies on DB Sessions KPIs, including count of max sessions and inactive sessions.
In such a case, to drill down into the problematic query, we navigate to the Query Deep Dive KPI category. Starting from the time the anomaly occurred, snapshots are taken from the database every 10 minutes to collate data on the query IDs which resulted in the most disk reads/writes, executions, sorts and CPU/memory consumption. This helps operations teams narrow down to those query IDs which are most likely causing connections to be held longer than expected, locks/waits on open database sessions, and the resultant CPU spikes and connection timeouts.
Fig 5b: Query deep dive screen with snapshots collected on the eCom DB server Instance 2 in response to the count of max sessions spiking to over 800 at 2002 hours.
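Conceptually, the deep dive shown in Fig 5b amounts to repeating a snapshot like the one sketched earlier on a fixed interval once an anomaly fires, and retaining the top query IDs per metric. The scheduling sketch below is only illustrative; the 10-minute interval matches the behaviour described above, while collectQuerySnapshot is a hypothetical stand-in for the v$sql/v$session queries shown previously.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of periodic query-deep-dive collection: from the time an anomaly is
// detected, collect query-level statistics every 10 minutes and retain the
// top offenders per metric (disk reads/writes, executions, sorts, CPU/memory).
public class QueryDeepDiveScheduler {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void startOnAnomaly(String dbInstance) {
        scheduler.scheduleAtFixedRate(
                () -> collectQuerySnapshot(dbInstance),
                0, 10, TimeUnit.MINUTES);
    }

    private void collectQuerySnapshot(String dbInstance) {
        // Placeholder: run the v$sql / v$session queries from the previous sketch,
        // rank query IDs by disk reads, executions, sorts and CPU consumption,
        // and store the top N alongside the anomaly for later drill-down.
        System.out.println("Collecting query deep-dive snapshot for " + dbInstance);
    }

    public void stop() {
        scheduler.shutdown();
    }
}
```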
Conclusion
In today’s blog, part 2 of the series on situational awareness, we have covered some more aspects of the time-synchronized forensic data that makes Heal’s root cause analysis more contextual, complete, and timely. Config watch, log pattern search, code deep-dive and database query statistics are powerful allies for Operations teams in troubleshooting and pinpointing root cause. Meanwhile, you can read the documentation on Java Intrusive Monitoring here to learn more about the feature. Do read our earlier blogs on the proactive, autonomous and projected healing capabilities that Heal provides to enable your enterprise to achieve the Holy Grail of zero downtime!