Proactive and Projected App-aware Scaling

by Vamsi Vedula | Apr 1, 2020

Using Hotspots and Forecasting to Affect Proactive and Projected Healing

Enterprises need to stay a step ahead of the curve by anticipating business needs and scaling pre-emptively to meet them. Anomalies due to high workloads, unexpected configuration changes in underlying components and sudden spikes in hardware KPIs are all major factors behind incidents in your enterprise. As our previous blogs have illustrated, Heal helps you prevent such incidents through autonomous and proactive healing measures built on our patented workload-behaviour correlation techniques. However, there is a third dimension to healing, projected healing, which is used in conjunction with proactive healing to help you plan and scale application infrastructure in response to projected workload trends. Today's blog focuses on these use cases.

The three types of self-healing in an enterprise include: Autonomous, Proactive, Projected

There are two use cases in which proactive and projected healing can be employed to perform app-aware scaling (highlighted in bold in the infographic above):

Analysis of recurrent hotspots across an application; this gives users enough lead time to take preventive actions to avoid outages due to chokepoints.
Performing a what-if analysis based on projected workload growth and capacity forecasting to plan infrastructure better.

Application Trends and Hotspots Analysis
Heal provides you with an integrated dashboard to view the health of all the applications it monitors within your enterprise. For an application of concern, where recurrent signals are being detected, a Trends dashboard gives a quick view of hotspots over the last 3 months.

Fig. 1: Application Health Dashboard with the eCOM application in the “red”. Trend analysis can be performed by clicking on the Trends link on the eCOM application card

Transaction trends are collated for the top 10 transactions by average volume over a week. These are correlated with the KPI-wise occurrence of anomalies. The indicative KPIs in the bottom part of the page are those which most frequently result in anomalies and signals due to “chokepoints”, i.e. where resource contention is the primary concern. In other words, these are the KPIs on which most incidents occur due to a capacity breach: CPU utilization hitting 100%, runtime memory hitting a ceiling, network I/O hitting the bandwidth limit, etc. This is a fixed list which Heal has arrived at by analysing thousands of historical early warnings and problems, along with their associated root cause KPIs.
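To make the collation step concrete, here is a minimal sketch of ranking transactions by average weekly volume and summarising anomalies per KPI as a count and an average anomalous value (the two quantities a bubble chart would plot). The function names, data shapes and sample numbers are all hypothetical and are not taken from Heal itself.

```python
from collections import defaultdict
from statistics import mean

def top_transactions(daily_volumes, n=10):
    """Rank transactions by average volume over the week, highest first."""
    averages = {name: mean(vols) for name, vols in daily_volumes.items()}
    return sorted(averages, key=averages.get, reverse=True)[:n]

def kpi_bubbles(anomaly_events):
    """Map each KPI to (number of anomalies, average anomalous value)."""
    values_by_kpi = defaultdict(list)
    for kpi, value in anomaly_events:
        values_by_kpi[kpi].append(value)
    return {kpi: (len(vals), mean(vals)) for kpi, vals in values_by_kpi.items()}

# Hypothetical sample data: seven daily volumes per transaction,
# and (KPI name, anomalous value) pairs for the same week.
week_volumes = {
    "Login": [120, 130, 125, 140, 150, 90, 80],
    "Search": [300, 310, 290, 305, 320, 200, 180],
    "Checkout": [40, 45, 50, 42, 60, 30, 25],
}
week_anomalies = [("CPU", 97.0), ("CPU", 99.5), ("Memory", 91.0)]

top = top_transactions(week_volumes, n=2)
bubbles = kpi_bubbles(week_anomalies)
print(top, bubbles)
```

Placing the two results side by side for each week is one simple way to see which workload mix coincides with which KPI hotspots.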

Fig. 2: Hotspots in the eCOM application corresponding to week-wise workload trends. The bubble chart indicates the number of anomalies and average anomalous value of the respective KPI

These hotspots are a good indicator of the proactive healing measures that can be taken to prevent capacity breach-related signals and incidents in your application. For instance, in the above image, a peculiar workload mix in Week 3 resulted in a high number of CPU-related anomalies. This could be due to a quarterly or annual event that produces a specific mix of transactions which strains the CPU in the underlying infrastructure. If this is a cloud setup, a scaling script can be scheduled to run on those days as a potential “projected healing action” to provision more CPU on the server instance. This is an example of a proactive approach to scaling: when the problematic workload mix recurs on the eCOM application, an incident due to high CPU can be averted.
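As an illustration of how hotspot weeks might be picked out before scheduling a scaling action, the sketch below flags weeks whose anomaly count for a KPI is well above the typical week. The rule used here (more than three times the median count) and the sample numbers are assumptions made for the example, not Heal's actual detection logic.

```python
from statistics import median

def hotspot_weeks(weekly_counts, factor=3.0):
    """Return the week numbers whose anomaly count exceeds factor x the median.

    The median is floored at 1 so that a mostly quiet KPI does not
    turn every single anomaly into a hotspot.
    """
    baseline = max(median(weekly_counts.values()), 1)
    return [week for week, count in sorted(weekly_counts.items())
            if count > factor * baseline]

# Hypothetical CPU anomaly counts per week for one application.
cpu_anomalies_by_week = {1: 2, 2: 3, 3: 14, 4: 2, 5: 1}

weeks_to_scale = hotspot_weeks(cpu_anomalies_by_week)
print(weeks_to_scale)  # weeks in which a scaling script could be scheduled
```

With these numbers only Week 3 stands out, which matches the kind of recurring event described above.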

Capacity Forecasting – “What-if” Analysis
The second scenario is a “what-if” analysis and a capacity forecast generated based on projected workload. On selecting the workload mix for, say, Week 5, we see the option to generate a Capacity Forecast.

Fig. 3: Option to generate a capacity forecast report on the workload mix for Week 5

The what-if analysis page allows you to tune individual transaction volumes or enter a percentage increase on all transactions across the board, and then observe the effect of doing so on the underlying system components in a heatmap representation. Business users can thus plan for an event like a marketing campaign or an acquisition increasing the underlying user base, which translates into a higher volume of transactions which need to be provisioned for. The KPIs that are included in the final report are only those which give a forecasting accuracy of more than 70% (this number can be tuned in the application configuration).
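The two inputs described above (an across-the-board percentage increase and per-transaction tuning) and the accuracy filter on reported KPIs can be sketched as follows. The function names and sample figures are hypothetical; only the 70% default threshold comes from the text.

```python
def apply_what_if(baseline, overall_pct=0.0, overrides=None):
    """Scale every transaction volume by overall_pct percent,
    then replace any individually tuned transactions with their override."""
    scaled = {name: vol * (1 + overall_pct / 100) for name, vol in baseline.items()}
    scaled.update(overrides or {})
    return scaled

def reportable_kpis(model_accuracy, min_accuracy=0.70):
    """Keep only KPIs whose forecasting accuracy is above the threshold."""
    return sorted(kpi for kpi, acc in model_accuracy.items() if acc > min_accuracy)

# Hypothetical baseline volumes and per-KPI forecasting accuracies.
baseline = {"Login": 100, "HACLI": 80, "Search": 65}
scenario = apply_what_if(baseline, overall_pct=50, overrides={"Login": 300})
kpis = reportable_kpis({"CPU load": 0.91, "Busy workers": 0.74, "Disk IO": 0.62})

print(scenario)
print(kpis)
```

The scenario volumes would then be fed to the forecasting models, and only the KPIs that pass the filter would appear in the final report.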

The KPIs are divided into host and component metrics. Component metrics depend on which tier the component belongs to. For instance, on an application server some metrics which can be forecast include JVM CPU and runtime memory, as well as connection pool size, whereas on the web server, some of the forecast metrics include busy workers and CPU load.

Fig. 4a: Effect of increasing transactions HACLI and Login on the web server KPIs

Fig. 4b: Effect of increasing all transactions by varying percentages on the application server KPIs

There are two possibilities which arise as a result of such a “what-if” analysis:

The user can clearly understand which system components would probably breach capacity in the coming days, were transaction growth rates to follow the projected trends.
The user can also pinpoint overprovisioned servers: those servers which have more than adequate headroom even when transaction volumes are, for instance, doubled or tripled. Cutting back on CPU or memory on these servers will result in cost savings in the long run.
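The two outcomes above can be expressed as a simple classification over forecast values. This is a sketch only: the thresholds (90% for an impending breach, 50% for spare headroom under a doubled workload) and the server figures are assumptions chosen for illustration.

```python
def classify_server(forecast_pct, stress_forecast_pct,
                    breach_at=90.0, spare_below=50.0):
    """Classify one server KPI from two forecasts, both in percent of capacity.

    forecast_pct        -- forecast under the projected workload
    stress_forecast_pct -- forecast under a much heavier workload (e.g. doubled)
    """
    if forecast_pct >= breach_at:
        return "scale up"      # likely to breach capacity on the projected trend
    if stress_forecast_pct < spare_below:
        return "scale down"    # ample headroom even under the heavier workload
    return "ok"

# Hypothetical forecasts: (projected workload, doubled workload).
forecasts = {"db01": (95.0, 140.0), "app01": (15.0, 30.0), "web01": (55.0, 85.0)}
plan = {server: classify_server(*values) for server, values in forecasts.items()}
print(plan)
```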

Fig. 4c: Effect of increasing all transactions by 50% across the board on DB server KPIs. Red cells indicate KPIs that would be close to hitting the ceiling or breaching their limit, were all transactions to increase by 50%

To share the insights from capacity forecasting with decision makers, the dashboard can be exported to a report, based on which appropriate actions (scaling up or scaling down) can be planned and implemented to avoid outages and save costs.

Fig. 5a: Extract from Capacity Forecast report showing an under-provisioned server

Fig. 5b: Extract from Capacity Forecast report showing an over-provisioned server, along with recommendation for cost-cutting

Behind the Scenes
The AI/ML powering capacity forecasting also relies heavily on the workload-behaviour correlation technique. The hypothesis our patent is based on is that workload directly affects the underlying system behaviour; the same idea is extended to build predictive models that extrapolate behavioural metrics from a given workload signature. Some techniques used for this include linear regression and Random Forest (classification) algorithms. The objective is to create a mathematical model that assigns a coefficient to each transaction type and equates the weighted sum to a behavioural metric value, such that modifying the transaction volumes and applying the same coefficients allows us to predict the corresponding metric value.

For example, suppose 100 transactions of type T1, 80 transactions of type T2 and 65 transactions of type T3 typically result in a CPU utilization of 36%. This can be written as a mathematical equation of the type:

100X1 + 80X2 + 65X3 = 36

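Here X1, X2 and X3 are the coefficients for transaction types T1, T2 and T3 respectively. A single observation cannot determine three unknowns, so the coefficients are estimated from many such workload and metric observations. Once the values of X1, X2 and X3 are known, it is possible to compute the CPU utilization when the volumes of T1, T2 and T3 change.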
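The idea can be sketched with an ordinary least-squares fit in plain Python. The observations below are made up for illustration (only the first row, 100/80/65 giving 36%, comes from the example above), and the solver uses the textbook normal equations rather than whatever Heal's models use internally.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x

def fit_coefficients(volumes, metric):
    """Least-squares coefficients via the normal equations (A^T A) x = A^T b."""
    n = len(volumes[0])
    ata = [[sum(r[i] * r[j] for r in volumes) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * m for r, m in zip(volumes, metric)) for i in range(n)]
    return solve_linear_system(ata, atb)

def predict(coefficients, volumes):
    """Apply the fitted coefficients to a new set of transaction volumes."""
    return sum(c * v for c, v in zip(coefficients, volumes))

# Hypothetical observations: volumes of (T1, T2, T3) and the CPU % seen with them.
observed_volumes = [[100, 80, 65], [120, 60, 70], [90, 100, 40], [150, 90, 80]]
observed_cpu = [36.0, 38.0, 31.5, 47.5]

x1, x2, x3 = fit_coefficients(observed_volumes, observed_cpu)
what_if_cpu = predict([x1, x2, x3], [200, 80, 65])  # T1 doubled, T2 and T3 unchanged
print(round(x1, 3), round(x2, 3), round(x3, 3), round(what_if_cpu, 1))
```

Real measurements are noisy, so a fit on production data would not reproduce each observation exactly; the gap between predicted and actual values is what a forecasting-accuracy score would capture.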

Fig. 6: Test results for one of our models on the CPU metric, showing how the actual and predicted CPU differ

For behavioural metrics showing non-linear behaviour, other mathematical models like classification are used.

Conclusion
Projected healing is another dimension of healing which helps you plan for unexpected traffic and scale up pre-emptively. With projected healing and app-aware scaling, we help you project transaction growth trends and view the corresponding infrastructure and system requirements, so that capacity forecasting can be implemented intelligently. Unlike other AIOps tools, our forecasts extend to KPIs other than CPU and memory; they pinpoint not only servers close to a capacity breach, but also those which are overprovisioned and where configurations can be tweaked to effect cost savings.

For more information on our patented AI/ML algorithms learning workload-behaviour correlations and performing ABL (Application Behaviour Learning), read our previous blogs on “How Heal’s Machine Learning Engine Supports Preventive Self-Healing” and “Heal Data Architecture”.