An Early Warning (EW) is a notification generated by HEAL when a service experiences an event that may impact the performance of an application or service.
Early Warning Navigation
1. Go to the Signals Tab. See Signals Navigation.
2. Click the Early Warning ID, or select the link to the Early Warning in an email notification, to display the Early Warning report.
Field | Description |
---|---|
1 – Signal Id | This displays a unique Signal ID. |
2 – Status | This displays the status of a signal: open, closed, or upgraded. Open indicates that the signal persists and is yet to be fixed. Closed indicates that the signal is resolved. Upgraded indicates that the Early Warning has been upgraded to a Problem. |
3 – Severity | This indicates the intensity of the signal. |
4 – Signal Timeframe | This shows the start date and time of an open signal, and the end date and time of a signal that has been closed or upgraded. |
5 – Timeline | The incident timeline provides a chronological breakdown of affected services. It arranges events sequentially across services, with a brief synopsis and timestamp for each service’s initial event. The Machine Learning Engine (MLE) generates this view of all incidents through its ensemble modeling methods. From the onset of an incident, such as a transaction slowdown, MLE builds a related sequence of events and organizes it within the incident timeline. You can focus on the services that belong to the applications explicitly allocated to you. |
6 – Violated KPI | This shows the name of a Key Performance Indicator that has been violated. |
7 – Current KPI Value | This shows the value of the violated KPI. |
8 – Normal Operating Range | This displays the Normal Operating Range (NOR) for the KPI. See Normal Operating Range. |
9 – Anomaly Score | This score measures the magnitude or severity of an event. It ranges between 0 and 1, with higher values indicating more severe anomalies. Anomaly scores are associated with all KPIs and transactions for which HEAL generates events. The anomaly score is displayed only if the Machine Learning Engine (MLE) generates the event. Events generated through the Static Operating Range (SOR) are treated separately and are not assigned an anomaly score. |
10 – Event Expansion/Collapse | This allows users to expand to see all events associated with a service in ascending order by timestamp, and to collapse the events list for a clearer view. |
11 – Root Cause Walk | The Root Cause Walk visually illustrates the possible root cause services contributing to a specific signal. This feature utilizes application topology and the relationships between the KPIs of involved services. See Root Cause Analysis. |
12 – ML Insights | This feature provides a comprehensive analysis of the top ten critical metrics associated with the services included in the Incident Timeline. See ML Insights. |
13 – Solution Recommendation | This feature suggests the top three potential solutions to help identify and address the root cause of a detected problem. See Solution Recommendation. |
14 – Related Signals | This displays a list of signals related to the current one, which can include both Early Warnings and Problems. If an Early Warning has been upgraded to a Problem, the corresponding Problem IDs will be displayed in this list. |
Life Cycle of an Early Warning
The EW lifecycle consists of the following phases:
- Detection: An EW is triggered when HEAL detects an event in a service.
- Aggregation: Events from the same service or services in the same line are aggregated into a single EW.
- Expansion: A separate EW is generated if a service not on the same line also has events.
- Mitigation: When a metric has no events for a specific interval, it is considered to have returned to normal. The EW is then closed.
- Resolution: If transactions in an entry point service in the path of the services in the EW are affected, then the EW is upgraded. HEAL creates a Problem and links the EW to it. The Problem is then resolved when the root cause of the issue is identified and fixed.
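The phases above can be sketched as a small state model. This is an illustrative sketch only, not HEAL's implementation: the class and function names (`EarlyWarning`, `on_event`, `on_quiet_interval`, `on_entry_point_impact`) and the `same_line` callback are hypothetical, and the rule for deciding whether two services are on the same line is assumed to be supplied by the caller.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional, Set


class Status(Enum):
    OPEN = "open"
    CLOSED = "closed"
    UPGRADED = "upgraded"


@dataclass
class EarlyWarning:
    # Hypothetical model of an EW: the services it covers and its status.
    services: Set[str] = field(default_factory=set)
    status: Status = Status.OPEN
    problem_id: Optional[str] = None


def on_event(
    warnings: List[EarlyWarning],
    service: str,
    same_line: Callable[[str, str], bool],
) -> EarlyWarning:
    """Detection, Aggregation, Expansion: attach the event to an open EW
    on the same line, otherwise create a separate EW."""
    for ew in warnings:
        if ew.status is not Status.OPEN:
            continue
        if service in ew.services or any(same_line(service, s) for s in ew.services):
            ew.services.add(service)  # Aggregation
            return ew
    ew = EarlyWarning(services={service})  # Detection or Expansion
    warnings.append(ew)
    return ew


def on_quiet_interval(ew: EarlyWarning) -> None:
    """Mitigation: no events for the configured interval closes an open EW."""
    if ew.status is Status.OPEN:
        ew.status = Status.CLOSED


def on_entry_point_impact(ew: EarlyWarning, problem_id: str) -> None:
    """Resolution: affected entry point transactions upgrade the EW and
    link it to a Problem."""
    if ew.status is Status.OPEN:
        ew.status = Status.UPGRADED
        ew.problem_id = problem_id
```

In this sketch, an EW leaves the open state in exactly one of two ways: it is closed when its metrics go quiet, or it is upgraded and linked to a Problem when entry point transactions are affected.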
For example:
As shown in the screen below, if the Bookings DB service has events, an EW is created.
Three impacted paths are possible:
- Path 1: Bookings DB -> Booking -> Hotels -> Travel Web
- Path 2: Bookings DB -> Booking -> Flights -> Travel Web
- Path 3: Bookings DB -> Booking -> Payments -> Payments Web
The timeline contains the services along these paths where events are identified. A new EW is created if a service that is not on any of these paths has events.
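The three paths in this example can be derived from the application topology by walking from the service with events toward the entry point services. The sketch below is a minimal illustration under stated assumptions: the `IMPACT_EDGES` mapping is written by hand from the paths listed above, the function names are hypothetical, and the `Search` service in the usage is invented for illustration. It does not reflect how HEAL stores or traverses topology.

```python
from typing import Dict, List

# Assumed topology, transcribed from the three paths in the example.
# Each key maps a service to the services that an event in it can impact.
IMPACT_EDGES: Dict[str, List[str]] = {
    "Bookings DB": ["Booking"],
    "Booking": ["Hotels", "Flights", "Payments"],
    "Hotels": ["Travel Web"],
    "Flights": ["Travel Web"],
    "Payments": ["Payments Web"],
}


def impacted_paths(start: str, edges: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate every path from `start` to a service with no outgoing edge
    (treated here as an entry point), using an iterative depth-first walk."""
    paths: List[List[str]] = []
    stack: List[List[str]] = [[start]]
    while stack:
        path = stack.pop()
        next_services = edges.get(path[-1], [])
        if not next_services:
            paths.append(path)  # reached an entry point
            continue
        # Push in reverse so paths come out in the order the edges are listed.
        for service in reversed(next_services):
            if service not in path:  # skip cycles
                stack.append(path + [service])
    return paths


def on_impacted_path(service: str, paths: List[List[str]]) -> bool:
    """True if `service` lies on any of the given impacted paths."""
    return any(service in path for path in paths)


paths = impacted_paths("Bookings DB", IMPACT_EDGES)
for number, path in enumerate(paths, start=1):
    print(f"Path {number}: " + " -> ".join(path))

# "Search" is a hypothetical service that is not on any impacted path,
# so events in it would lead to a separate EW.
print(on_impacted_path("Hotels", paths), on_impacted_path("Search", paths))
```

Events in a service for which `on_impacted_path` returns `True` would belong on the timeline of the existing EW, while events in any other service would start a new one.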