
Introduction

This section explains how to monitor your Kubernetes services in HEAL. HEAL is your central hub for monitoring, analyzing, and troubleshooting your Kubernetes environment. The sections below cover key features such as the service dependency map, transaction lists, and alert indicators, so that you can make full use of HEAL’s predictive analytics and proactive healing capabilities for your Kubernetes clusters.

HEAL includes pre-built support for monitoring Kubernetes-based services and can monitor hybrid applications. In a single application, some services may run in a monolith or on-prem environment, while others may be Kubernetes-based. HEAL discovers Kubernetes-based services automatically and adds them to the Service Dependency Map (SDM) in the application topology.

 

Service Details and Health Summary

Service Details View

The ‘Service Details’ view provides a focused look at an individual service within your Kubernetes cluster. Select the service you want to inspect from the dropdown menu, which is often labeled with the cluster or service name, such as ‘OpenShift-Cluster-Service’.

Health Summary Dashboard

Below the service selection, the ‘Health Summary’ shows a visual representation of the selected service’s health over time. HEAL condenses complex analytics into a color-coded view of metrics such as CPU and memory utilization.

 

HEAL displays time series data for “Workload” and “Behavior”.

  • HEAL monitors the requests serviced by a Kubernetes replica set. It provides out-of-the-box support for monitoring requests for any type of container through its sidecar-based approach.
  • HEAL monitors performance metrics of containers across the entire replica set. It provides out-of-the-box support for monitoring Java, Golang, Python, and nginx containers.
  • HEAL displays an integrated service dashboard that gives an overview of the health of the entire replica set.
  • HEAL baselines clustered metrics at the replica set level and generates ML-based events at the replica set level.

Kubernetes Service Dashboard

In the context of HEAL, a service is a logical cluster of load-balanced components. In Kubernetes, it resembles a replica set or a microservice cluster. Click a Kubernetes service on the SDM screen to open the Service Details screen for that service. This screen provides a quick overview of the service’s health at that time. It displays the cluster-level view of the service, Workload KPIs, Behavior KPIs, active signals, and events on the service.

Events

Kubernetes automatically adjusts the replica set size based on certain performance metrics. HEAL generates ML-based events only at the cluster (replica set) level.

The HEAL application generates events based on thresholds.

Normal Operating Range (NOR) – Dynamic Range

  • The application generates the NOR as a time series of high and low values, using proprietary ML models that include load-behavior correlations.
  • The MLE generates these values and keeps correcting them at scheduled times.
  • Multiple ML methods are used to discover dynamic thresholds, which are saved as time series value ranges.
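
HEAL’s ML models are proprietary, so the following is only an illustrative sketch of what a dynamic range stored as a time series of low and high values could look like. It swaps in a simple rolling mean plus or minus `k` standard deviations as a stand-in. The function name `rolling_nor`, the `window` size, and `k` are assumptions for illustration, not HEAL parameters.

```python
from statistics import mean, pstdev

def rolling_nor(values, window=5, k=2.0):
    """Return one (low, high) bound per input point.

    Each bound is derived from the `window` points that precede the
    current point. Points without enough history get None.
    This is a stand-in technique, not HEAL's actual model.
    """
    bounds = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]
        if len(history) < window:
            bounds.append(None)
            continue
        centre = mean(history)
        spread = pstdev(history)
        bounds.append((centre - k * spread, centre + k * spread))
    return bounds

# Hypothetical CPU utilization samples, one per interval.
cpu = [40, 42, 41, 43, 39, 41, 80]
nor = rolling_nor(cpu)
```

In this sketch the last sample (80) falls outside the range computed from the preceding five samples, which is the kind of breach the event rules in the next section act on.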

Generating Events

The application generates an event when the NOR is breached.

Before generating an event, the application applies persistence and suppression rules.

  • An event is generated only after the persistence and suppression rules have been applied.
  • NOR has its own persistence and suppression rules. These rules are system-wide global configurations for NOR.
  • The rules can be altered only via back-end configs for NOR. This is not advisable.
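
A minimal sketch of how persistence and suppression could gate event generation is shown here. The exact semantics in HEAL are not documented in this section, so the sketch assumes that persistence means "the breach must last for N consecutive intervals" and that suppression means "no further event for M intervals after one is raised". The function name and both parameters are hypothetical.

```python
def generate_events(values, nor, persistence=3, suppression=2):
    """Return the interval indexes at which an event would be raised.

    values: metric samples, one per interval.
    nor:    (low, high) bounds, one per interval.
    persistence / suppression: assumed rule parameters (hypothetical).
    """
    events = []
    breach_streak = 0   # consecutive intervals outside the NOR
    quiet_left = 0      # intervals during which events stay suppressed
    for i, (value, (low, high)) in enumerate(zip(values, nor)):
        breached = value < low or value > high
        breach_streak = breach_streak + 1 if breached else 0
        if quiet_left > 0:
            quiet_left -= 1
            continue
        if breach_streak >= persistence:
            events.append(i)
            quiet_left = suppression
    return events

values = [50, 95, 96, 97, 98, 99, 99, 50]
nor = [(30, 70)] * len(values)
events = generate_events(values, nor)
print(events)  # [3, 6]
```

With these assumed rules, the first event is raised only at the third consecutive breach (index 3), and the continuing breach raises the next event only after the two suppressed intervals have passed (index 6).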

Container Request Monitoring

HEAL monitors requests, or workload, for a Kubernetes-based service. A sidecar approach is used to monitor the requests hitting each container of the service. HEAL clusters request-level metrics such as volume, latency, and error rate at the replica set level and then baselines them for ML. HEAL displays a Requests dashboard for each service. To view it, click Workload on the Service Details screen of the service.
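
To illustrate what clustering request-level metrics at the replica set level can mean, the sketch below combines per-container samples into a single record. It assumes volume is summed, latency is averaged weighted by volume, and error rate is total errors divided by total volume. The `ContainerSample` type, its field names, and the aggregation choices are assumptions, not HEAL’s documented behavior.

```python
from dataclasses import dataclass

@dataclass
class ContainerSample:
    container: str          # hypothetical container or pod identifier
    volume: int             # requests seen in the interval
    avg_latency_ms: float   # mean latency of those requests
    errors: int             # failed requests in the interval

def cluster_samples(samples):
    """Combine per-container samples into one replica-set-level record."""
    total_volume = sum(s.volume for s in samples)
    if total_volume == 0:
        return {"volume": 0, "avg_latency_ms": 0.0, "error_rate": 0.0}
    # Weight each container's latency by how many requests it served.
    weighted_latency = sum(s.avg_latency_ms * s.volume for s in samples)
    total_errors = sum(s.errors for s in samples)
    return {
        "volume": total_volume,
        "avg_latency_ms": weighted_latency / total_volume,
        "error_rate": total_errors / total_volume,
    }

samples = [
    ContainerSample("pod-a", volume=100, avg_latency_ms=20.0, errors=1),
    ContainerSample("pod-b", volume=300, avg_latency_ms=40.0, errors=7),
]
replica_set = cluster_samples(samples)
```

Weighting latency by volume keeps a lightly loaded container from skewing the replica set average.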

 

1. Volume displays the overall count of inbound requests at a service, along with the volume trend. This section also displays the arrival rate of requests per minute.
2. This section displays the total number and percentage of slow requests, and the Top 5 slow transactions. A transaction may lead to multiple requests internally. Successful transactions that violate the configured response time threshold are considered Slow.
3. This section displays the total number and percentage of failed requests. Requests with an HTTP status code in the 4xx to 5xx range, and transactions with any error or exception, are considered Failed.
4. This section displays the total number and percentage of transactions for which the response was received after the expected response time. Requests are considered Timed out when the request session times out or the request or response is incomplete.
5. This section displays the total number and percentage of transactions whose status is unknown. Requests are considered Unknown when the request has been initiated but the connection is closed before the response is received.
6. This section displays the performance of the transactions, broken up by time segment using the same criteria as the overall performance numbers.
7. This section displays the Top 5 failed requests, which helps identify the requests that are performing badly. Requests are listed in descending order of the percentage of requests that failed. The percentage next to each request indicates the share of that particular request that failed.
8. This section displays the Top 5 slow requests, which helps identify the requests that are performing badly. Requests are listed in descending order of the percentage of requests that slowed down.
9. This section displays the Request List. By default, it lists all monitored requests in descending order of the percentage of failed requests. For each request, the following details are displayed:
  • Requests: The name of the request.
  • Volume: The count of inbound requests of this type.
  • Success: The total number of successful requests of this type.
  • Success(%): The percentage of successful requests of this type.
  • Slow: The count of slow requests of this type.
  • Slow(%): The percentage of slow requests of this type.
  • Failure: The count of failed requests of this type.
  • Failure(%): The percentage of failed requests of this type.
10. You can search for a specific request by its full or partial request name. The list then shows only the matching requests.
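
The definitions above can be sketched as a small classifier plus a Request List builder. This is an illustration only: the record fields (`status`, `response_ms`, `timed_out`, `connection_closed`, `exception`), the 500 ms threshold, and the choice to count slow requests separately from successful ones are all assumptions, not HEAL’s data model.

```python
from collections import Counter, defaultdict

def classify(request, slow_threshold_ms=500):
    """Return one of: success, slow, failed, timed_out, unknown."""
    if request.get("connection_closed"):
        return "unknown"      # connection closed before a response arrived
    if request.get("timed_out"):
        return "timed_out"    # session timed out or exchange incomplete
    status = request.get("status", 200)
    if request.get("exception") or 400 <= status <= 599:
        return "failed"       # HTTP 4xx/5xx, or any error or exception
    if request.get("response_ms", 0) > slow_threshold_ms:
        return "slow"         # successful, but over the threshold
    return "success"

def request_list(requests, slow_threshold_ms=500):
    """Build one row per request name, highest failure percentage first."""
    counts = defaultdict(Counter)
    for request in requests:
        counts[request["name"]][classify(request, slow_threshold_ms)] += 1
    rows = []
    for name, c in counts.items():
        volume = sum(c.values())
        rows.append({
            "request": name,
            "volume": volume,
            "success": c["success"],
            "success_pct": 100.0 * c["success"] / volume,
            "slow": c["slow"],
            "slow_pct": 100.0 * c["slow"] / volume,
            "failure": c["failed"],
            "failure_pct": 100.0 * c["failed"] / volume,
        })
    rows.sort(key=lambda row: row["failure_pct"], reverse=True)
    return rows

def search(rows, text):
    """Keep rows whose request name contains the (partial) search text."""
    needle = text.lower()
    return [row for row in rows if needle in row["request"].lower()]

requests = [
    {"name": "/login", "status": 200, "response_ms": 120},
    {"name": "/login", "status": 500, "response_ms": 90},
    {"name": "/cart", "status": 200, "response_ms": 900},
    {"name": "/cart", "status": 200, "response_ms": 100},
    {"name": "/cart", "status": 404, "response_ms": 30},
    {"name": "/cart", "status": 200, "response_ms": 80},
]
rows = request_list(requests)
```

Here `/login` sorts first because half of its requests failed, and `search(rows, "car")` returns only the `/cart` row.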

Container Performance Monitoring

HEAL monitors performance metrics of the containers inside the Pods of a replica set. Prometheus is required for this feature to work. HEAL collects performance metrics for each Pod instance and for each container inside the Pod. It then clusters these metrics at the replica set level and baselines them for ML.
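
As a rough sketch of the three levels involved (container, Pod instance, replica set), the example below rolls container samples up to each Pod and then to the replica set. It assumes a Pod’s value is the sum of its containers and the replica set value is the mean across Pods. The function name, the sample layout, and both aggregation rules are assumptions, and the sample values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def roll_up(samples):
    """Roll container samples up to Pod and replica set level.

    samples: list of (pod, container, value) tuples for one interval.
    Returns (per_container, per_pod, replica_set_value).
    """
    per_container = {(pod, container): value for pod, container, value in samples}
    per_pod = defaultdict(float)
    for pod, _container, value in samples:
        per_pod[pod] += value           # Pod value = sum of its containers
    replica_set_value = mean(per_pod.values()) if per_pod else 0.0
    return per_container, dict(per_pod), replica_set_value

# Hypothetical CPU usage in millicores for two Pods of one replica set.
samples = [
    ("web-1", "app", 200.0),
    ("web-1", "sidecar", 50.0),
    ("web-2", "app", 300.0),
    ("web-2", "sidecar", 50.0),
]
per_container, per_pod, replica_set_value = roll_up(samples)
```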

Behavior KPIs are Key Performance Indicators that describe how the monitored instances behave or function. A Performance Metric KPI represents the performance information of component and host instances. Its values can be numerical values of different types.

KPIs for Pods

KPI Data at Cluster Level

Click on a KPI category in the overview heat map to bring up the metrics related to that KPI category.

 

1. If a KPI category has one or more events, the Events option is selected by default. When there is no event, the ALL option is selected by default. KPIs with events are shown first, followed by KPIs without any event.
2. All service behavior metrics are available here, arranged by category.
3. Use the search option to search for categories. The KPIs under the matching category are shown after the search. A non-group KPI that is not part of any category is placed under the category “Others”.
4. Events are highlighted as red dots. Hovering over an event shows its details.
5. The total number of Metric pods displayed on the screen as part of this category.
6. Search for KPIs within a category. Only the matching KPIs are shown after the search.
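
The default selection, ordering, and search behavior described above can be sketched as follows. The dictionary layout for a KPI (`name`, `category`, `events`) and the function names are hypothetical, and the sketch assumes that the Events view lists only KPIs that have events.

```python
def default_view(kpis):
    """Events is the default when any KPI has an event, otherwise ALL."""
    return "Events" if any(k["events"] > 0 for k in kpis) else "ALL"

def category_of(kpi):
    """KPIs without a category are placed under "Others"."""
    return kpi.get("category") or "Others"

def visible_kpis(kpis, view=None, search=""):
    """Return the KPIs to display for a view and an optional search text."""
    view = view or default_view(kpis)
    needle = search.lower()
    rows = [k for k in kpis if needle in k["name"].lower()]
    if view == "Events":
        rows = [k for k in rows if k["events"] > 0]
    # Stable sort: KPIs with events first, original order otherwise.
    return sorted(rows, key=lambda k: k["events"] == 0)

kpis = [
    {"name": "CPU Utilization", "category": "CPU", "events": 0},
    {"name": "Memory Used", "category": "Memory", "events": 2},
    {"name": "Open File Handles", "events": 0},
]
all_names = [k["name"] for k in visible_kpis(kpis, view="ALL")]
```

With this sample data the default view is Events, and the ALL view lists “Memory Used” ahead of the two KPIs that have no events.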

KPI Data at Instance Level

Select a Pod instance from the drop-down list to view its metrics.

 

KPIs for Containers

KPI Data at Cluster Level

A Pod can have multiple containers. Select a container to view its metrics.

KPI Data at Instance Level

 
