A problem signal is a notification that an unexpected condition has occurred in a system or application.

Problem Navigation

1. Go to the Signals Tab. See Signals Navigation.

2. Click the Problem ID, or follow the link to the Problem in an email notification, to display the Problem report.

Problem Report

| Field | Description |
|---|---|
| 1 – Signal Id | Displays the unique Signal ID. |
| 2 – Status | Displays the status of the signal: open, closed, or upgraded. Open means the signal persists and has not yet been fixed. Closed means the signal is resolved. |
| 3 – Severity | Indicates the intensity of the signal. |
| 4 – Signal Timeframe | Shows the start date and time of an active signal, and the end date and time of a signal that has been closed or upgraded. |
| 5 – Timeline | The incident timeline provides a chronological breakdown of affected services. It arranges events sequentially across services, with a brief synopsis and timestamp for each service's initial event. The Machine Learning Engine (MLE) generates this view of all incidents through its ensemble modeling methods. From the onset of an incident, such as a transaction slowdown, MLE builds a related sequence of events and organizes it in the incident timeline. You can focus on the services that belong to the applications allocated to you. |
| 6 – Violated KPI | Shows the name of the Key Performance Indicator (KPI) that has been violated. |
| 7 – Current KPI Value | Shows the value of the violated KPI. |
| 8 – Normal Operating Range | Displays the Normal Operating Range (NOR) for the KPI. See Normal Operating Range. |
| 9 – Instance | Each event is linked to a specific cluster or instance within the service. Select the instance to open the Instance Details screen for a comprehensive view of that instance. |
| 10 – Anomaly Score | Measures the magnitude or severity of an event. The score ranges from 0 to 1, with higher values indicating more severe anomalies. Anomaly scores apply to all KPIs and transactions for which HEAL generates events, but the score is displayed only if the Machine Learning Engine (MLE) generated the event. Events generated through a Static Operating Range (SOR) are treated separately and are not assigned an anomaly score. |
| 11 – Event Expansion/Collapse | Expands the list to show all events associated with a service in ascending order by timestamp, or collapses the list for a clearer view. |
| 12 – Summary | Displays a summary of the signal, with RCA Walks listed. |
| 13 – ML Insights | Provides an analysis of the top ten critical metrics associated with the services included in the Incident Timeline. See ML Insights. |
| 14 – Solution Recommendation | Suggests the top three potential solutions to help identify and address the root cause of a detected problem. See Solution Recommendation. |
| 15 – Root Cause Walk | Visually illustrates the possible root cause services contributing to a specific signal, using the application topology and the relationships between the KPIs of the involved services. Click the Root Cause Walk to view its details in a new tab. See Root Cause Analysis. |
| 16 – Related Signals | Displays a list of signals related to the current one, which can include both Early Warnings and Problems. If an Early Warning has been upgraded to a Problem, the corresponding Problem IDs are displayed in this list. |
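
The anomaly score rule described for field 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HEAL's implementation: the `Event` class, its field names, and the `"MLE"` / `"SOR"` source labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Hypothetical event record; the field names are assumptions."""
    kpi: str
    source: str                       # assumed labels: "MLE" or "SOR"
    anomaly_score: Optional[float] = None


def displayed_anomaly_score(event: Event) -> Optional[float]:
    """Return the score to show in the report, or None if none applies."""
    # Events generated through a Static Operating Range carry no score.
    if event.source != "MLE":
        return None
    score = event.anomaly_score
    if score is None:
        return None
    # The documented range is 0 to 1; higher values are more severe.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"anomaly score out of range: {score}")
    return score
```

With this sketch, an MLE-generated event with a score of 0.8 returns 0.8, while an SOR-generated event returns `None` regardless of any value stored on it.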

Lead Problem Life Cycle

HEAL generates a Problem when events occur on an entry point service for any of the transaction metrics: Slow, Fail, Timed Out, or Unknown.

  • A Problem can be created if transaction metrics at the entry point service record events, even if other behavior metrics don’t present any events.
  • A Problem might arise from an existing Early Warning. If multiple events occur along the service path where the Early Warning started, and the connected entry point service has transaction events, the Early Warning is considered escalated to a Problem.
  • A separate Problem is created for each entry point service.
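
The creation rule above can be sketched as a small function. This is a minimal illustration, assuming events are available as a mapping from service name to the set of metrics that recorded events. The function name, the data layout, and the `"CPU"` metric in the usage note are hypothetical, and escalation from an Early Warning is not modeled.

```python
from typing import Dict, List, Set

# Transaction metric categories named in the text.
TRANSACTION_METRICS = {"Slow", "Fail", "Timed Out", "Unknown"}


def problems_to_create(
    events_by_service: Dict[str, Set[str]],
    entry_points: Set[str],
) -> List[str]:
    """Return the entry point services that each get their own Problem."""
    return sorted(
        service
        for service, metrics in events_by_service.items()
        # A Problem needs transaction events on an entry point service;
        # events on other behavior metrics are not required.
        if service in entry_points and metrics & TRANSACTION_METRICS
    )
```

For example, with events `{"Travel Web": {"Slow"}, "Booking": {"CPU"}, "Payments Web": {"Fail"}}` and entry points `{"Travel Web", "Payments Web"}`, the function returns one entry for each of the two entry point services and none for Booking.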

A Problem can progress from Default to Severe if it starts with events in the Slow category and then records events in any of the other categories (Fail, Timed Out, or Unknown). Once labeled Severe, the Problem keeps this severity for the rest of its life cycle.
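
The severity progression can be illustrated with a short sketch. It models only the case described here, a Problem that starts with Slow events. The function name and the list-of-categories input are assumptions made for the example, and the text does not say how a Problem that starts in another category is rated, so that case is rejected rather than guessed.

```python
from typing import List

# Categories other than Slow, as listed for transaction metrics.
ESCALATING_CATEGORIES = {"Fail", "Timed Out", "Unknown"}


def severity_after(categories_in_order: List[str]) -> str:
    """Severity of a Problem that began with Slow events."""
    if not categories_in_order or categories_in_order[0] != "Slow":
        raise ValueError("this sketch only covers Problems that start with Slow")
    severity = "Default"
    for category in categories_in_order[1:]:
        if category in ESCALATING_CATEGORIES:
            # Never reset afterwards: Severe is kept for the life cycle.
            severity = "Severe"
    return severity
```

A sequence of `["Slow", "Slow"]` stays at Default, while `["Slow", "Fail", "Slow"]` becomes Severe and remains Severe even though the last event is Slow again.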

Additional events generated on the same service, or on services along the same service path as the initial service, are included in the same Problem.

A Problem is resolved when all metrics recording events return to normal.
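
The resolution rule can be sketched as a small state tracker. This is a minimal illustration, assuming each violation is identified by a (service, metric) pair. The class and method names and the `"CPU"` metric in the usage note are hypothetical, and reopening a closed Problem is not modeled.

```python
from typing import Set, Tuple


class ProblemState:
    """Tracks which metrics are still recording events for one Problem."""

    def __init__(self) -> None:
        self.status = "Open"
        self.violating: Set[Tuple[str, str]] = set()
        self._had_events = False

    def record_event(self, service: str, metric: str) -> None:
        """Register a metric that has started recording events."""
        self.violating.add((service, metric))
        self._had_events = True

    def record_normal(self, service: str, metric: str) -> None:
        """Register that a metric has returned to normal."""
        self.violating.discard((service, metric))
        # Resolved only when every metric that recorded events is normal.
        if self._had_events and not self.violating:
            self.status = "Closed"
```

If events are recorded for ("Travel Web", "Slow") and ("Booking", "CPU"), the Problem stays Open after the first metric returns to normal and becomes Closed only after the second one does.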

Example 1: Suppose Travel Web requests have events.

  • The Early Warning is resolved.
  • A Problem is generated.
  • One Root Cause (RC) walk is available.
  • Path 1: Bookings DB -> Booking -> Hotels: Flights -> Travel Web
  • The resolved Early Warning is marked as a related signal.

Example 1

Example 2: Suppose Payment Web requests have events:

  • A new Problem is generated.
  • One RC walk is available.
  • Path 1: Bookings DB -> Booking -> Payments -> Payments Web
  • The timeline includes services along these paths where events have been detected.
  • The Early Warning and previous Problem are designated as related signals.
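
To make the idea of a walk along a service path concrete, the path in Example 2 can be reproduced with a simple depth-first search. This is only a sketch and not how HEAL computes Root Cause Walks: the `TOPOLOGY` mapping contains just the four services named in the example path, and the edge direction (from a dependency to the service that calls it) is an assumption chosen to match the order in which the path is written.

```python
from typing import Dict, List

# Only the services named in Example 2; the full topology is not given.
TOPOLOGY: Dict[str, List[str]] = {
    "Bookings DB": ["Booking"],
    "Booking": ["Payments"],
    "Payments": ["Payments Web"],
    "Payments Web": [],
}


def walks(topology: Dict[str, List[str]], start: str, entry_point: str) -> List[List[str]]:
    """Enumerate simple paths from start to entry_point, depth-first."""
    paths: List[List[str]] = []

    def visit(node: str, path: List[str]) -> None:
        if node == entry_point:
            paths.append(path)
            return
        for nxt in topology.get(node, []):
            if nxt not in path:          # avoid revisiting a service
                visit(nxt, path + [nxt])

    visit(start, [start])
    return paths


for i, path in enumerate(walks(TOPOLOGY, "Bookings DB", "Payments Web"), 1):
    print(f"Path {i}: " + " -> ".join(path))
# prints: Path 1: Bookings DB -> Booking -> Payments -> Payments Web
```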

Example 2
