Using Self-Healing to Ensure Site Reliability for E-Commerce Websites
Our blog today is part 3 of a 4-part series on how self-healing is the Holy Grail for ensuring site reliability in modern-day enterprises. Our first blog dealt with the most common problems plaguing site reliability engineers (SREs): ultra-dynamic workloads, landscapes spread across cloud, on-premises and hybrid deployments, and an overload of data points to sift through when triaging issues and narrowing down root causes. Our second blog introduced self-healing as the most effective way to deal with these problems, particularly Heal’s ability to baseline application behaviour against variations in workload, its capability to autonomously and proactively heal signals before they escalate into issues, and its forecasting feature, which helps plan how to scale the enterprise for projected workloads.
Today we focus on some aspects of Reliability Engineering as they apply to E-Commerce websites, and how Heal delivers incremental benefits over AIOps platforms, lending its self-healing capability to address some of the most burning issues faced by e-commerce platforms – including latency, availability, scalability and planning.
The Role of SREs in E-Commerce
Every industry today is dependent on E-Commerce, as it reaches out to customers online, incentivizes repeat business through loyalty programs and evolves constantly to provide better customer experiences. Digital consumers are driving a major transformation across all sectors, particularly in retail and e-commerce, where a robust, error-free digital experience shapes the overall customer experience, which in turn drives the revenue and reputation of the brand. Retail websites, streaming services, online banking and payment services are all examples of the e-commerce wave that consumers are currently riding, and it is imperative that they cater to customers’ demand for speed and availability. A survey carried out by a leading APM vendor revealed that most customers have high expectations of their digital experiences: page load times must be 3 seconds or less, delays can lead to checkout or cart abandonment, and social media may be used to publicize poor user experiences. With the power in the hands of the users, a less-than-optimal experience could damage the brand for good.
SREs for E-Commerce platforms have some pertinent questions they need to look at to ensure an optimal customer experience:
- Availability: Is it up and running?
- Errors: Is it working as expected?
- Latency: Is it working fast enough?
A failure to meet customer expectations on any of the above counts is at best a lost opportunity. At worst, it means revenue loss, potentially irreparable brand damage and a surge of customer complaints on social media that can affect the brand’s standing even among non-active users.
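The three questions above map naturally onto measurable indicators. The following is a minimal sketch, not a description of how Heal computes anything: the `Request` record, the `summarize` function and the choice to treat 5xx responses as unavailability are all assumptions made for illustration. The 3-second latency target is the figure quoted from the survey above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int       # HTTP status code returned to the user (assumed field)
    latency_s: float  # time taken to serve the request, in seconds (assumed field)


def summarize(requests, latency_target_s=3.0):
    """Return availability, error rate and share of requests within the latency target."""
    total = len(requests)
    if total == 0:
        # Nothing observed, so nothing can be said about any indicator.
        return {"availability": None, "error_rate": None, "within_latency_target": None}

    # Assumption: a 5xx response means the site was effectively unavailable.
    server_errors = sum(1 for r in requests if r.status >= 500)
    # Assumption: any 4xx or 5xx response counts as "not working as expected".
    errors = sum(1 for r in requests if r.status >= 400)
    fast = sum(1 for r in requests if r.latency_s <= latency_target_s)

    return {
        "availability": (total - server_errors) / total,
        "error_rate": errors / total,
        "within_latency_target": fast / total,
    }


sample = [Request(200, 0.5), Request(200, 3.5), Request(500, 1.0), Request(404, 0.2)]
report = summarize(sample)
```

In practice each indicator would be tracked per transaction type, since a slow search and a slow checkout carry very different business costs.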
Additional Challenges in E-Commerce Monitoring
Beyond the common issues SREs face while monitoring enterprises, which were covered in part 1 of this series, “Site Reliability Engineering for Dynamic Workloads”, there are problems specific to E-Commerce monitoring that APM and AIOps tools need to address for operational efficiency:
- Dependence on external APIs: Nearly all e-commerce systems talk to third-party systems via APIs and web services to provide search, authentication and payment capabilities. Achieving full stack visibility and monitoring the user journey across all interactions is a challenge in such cases.
- Need to monitor user experience: End user monitoring becomes all-important in an e-commerce deployment because latency may originate in the network or in page rendering rather than in the backend. Identifying or ruling out these causes lets remediation be carried out accurately instead of focusing only on the backend infrastructure as root cause.
- Need to capture business errors: Business errors could have a completely different remediation workflow than technical errors – for instance, a payment failing because of an expired debit card does not involve any healing actions that can be triggered to remediate the error. However, the error still needs to be addressed by the ITSM system in place, and an appropriate error message shown to the user. The transaction would still be marked “successful” from a technical standpoint.
- Need to scale dynamically in response to dynamic workload: E-Commerce setups are a classic example of scenarios where the infrastructure is subjected to varying workloads. A food delivery app would see surges during mealtimes, retail websites would see increased traffic on sale days, and so on. Measures need to be in place to ensure that the infrastructure can meet the demand placed on it by the workload, either through workload optimization or dynamic provisioning of resources on cloud deployments.
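The distinction between business and technical errors in the list above can be made concrete with a small routing sketch. Everything here is hypothetical: the error codes, the `classify` and `route` functions and the action names are illustrative only and do not reflect Heal’s or any ITSM product’s actual interface.

```python
from enum import Enum


class ErrorKind(Enum):
    NONE = "none"
    BUSINESS = "business"
    TECHNICAL = "technical"


# Hypothetical business error codes that an application might return.
BUSINESS_ERROR_CODES = {"CARD_EXPIRED", "INSUFFICIENT_FUNDS", "INVALID_OTP"}


def classify(http_status, error_code=None):
    """Classify a transaction outcome as technical, business or no error."""
    if http_status >= 500:
        # The platform itself failed to process the request.
        return ErrorKind.TECHNICAL
    if error_code in BUSINESS_ERROR_CODES:
        # Technically successful, but the business outcome failed.
        return ErrorKind.BUSINESS
    return ErrorKind.NONE


def route(kind):
    """Pick a follow-up action; the action names are placeholders."""
    if kind is ErrorKind.TECHNICAL:
        return "trigger_healing_action"
    if kind is ErrorKind.BUSINESS:
        # No healing action applies: record it and inform the user.
        return "log_to_itsm_and_show_user_message"
    return "no_action"


expired_card = classify(200, "CARD_EXPIRED")
gateway_down = classify(503)
```

The expired-card payment is classified as a business error even though the HTTP status is 200, which mirrors the point that such a transaction is still “successful” from a technical standpoint.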
Operational Monitoring of E-Commerce Deployments
Our previous blog in this series, “Self-Healing for Site Reliability Engineering”, went over some of the key elements of a solution that addresses the above concerns, specifically a self-healing product like Heal. Such a product introduces automation, eases the burden on SREs of keeping an e-commerce setup always available through autonomous remediation, and helps plan effectively for load surges and future workload projections via what-if analysis and capacity forecasting.
In this section we examine some of the features of Heal that address the operational challenges listed in the previous section:
- Visibility across the full stack: Heal ingests workload and behaviour data including inter-service flow rates and host as well as component metrics on containers and cloud environments by means of its Ingestion API. Full stack visibility ensures that even when there is a dependence on external APIs, the journey steps and exposed workload and behaviour metrics can be captured for more accurate workload-behaviour and event correlation.
- Early Warnings on workload and behaviour: Heal sends early warning signals in response to anomalies in both request and behaviour metrics, whether a slight deterioration in response times or a degradation in CPU metrics caused by a sudden workload surge or a previously unseen transaction mix, even before there is any perceptible effect on the end user experience. These early warnings are accompanied by adequate forensic data and contextual information to enable proactive or autonomous healing that normalizes metrics before an outage affecting customer experience occurs.
- Event correlation and situational awareness for faster root cause analysis.
- User Journey Monitoring: Heal traces a transaction’s path across multiple services and applications and baselines flow rates and corresponding behaviour for each of the transaction steps. By means of a transaction ID or any other unique identifier, it can capture not only technical errors, but also business errors by scanning log files.
Fig 1: User Journey Monitoring to zero in on steps with errors – business or technical. Figure shows steps in a user journey for product checkout on an e-commerce platform. The OTP Submission Step is slow and is related to an ongoing issue.
- End User Monitoring: For e-commerce setups where device latency could also be a major cause of slowness, it is imperative to have end user monitoring (EUM) in place to differentiate between root causes of slowness originating in the infrastructure (hosts or components running in the data center or on cloud), the network or the device itself (page rendering issues, device or browser incompatibility). Heal provides JavaScript plugins and APIs to ingest EUM data for a more holistic approach to arriving at root cause.
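To illustrate the idea behind early warnings on workload and behaviour in the list above, here is a deliberately simple sketch of baselining a behaviour metric against workload. It is not Heal’s ML technique: the bucketing scheme, the mean-plus-k-standard-deviations threshold and the function names are all assumptions chosen to keep the example short.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history, bucket_size=100):
    """Learn (mean, std dev) of response time per workload bucket.

    history: iterable of (requests_per_min, response_ms) pairs.
    """
    buckets = defaultdict(list)
    for load, response_ms in history:
        buckets[load // bucket_size].append(response_ms)
    return {b: (mean(v), pstdev(v)) for b, v in buckets.items()}


def is_early_warning(baseline, load, response_ms, bucket_size=100, k=3.0):
    """Flag a reading that is unusual for the workload it was observed under."""
    stats = baseline.get(load // bucket_size)
    if stats is None:
        # Previously unseen workload level: worth flagging on its own.
        return True
    mu, sigma = stats
    return response_ms > mu + k * sigma


history = [(120, 200), (150, 210), (180, 190), (520, 400), (560, 420)]
baseline = build_baseline(history)

slow_at_low_load = is_early_warning(baseline, 130, 260)
normal_at_high_load = is_early_warning(baseline, 530, 430)
unseen_load = is_early_warning(baseline, 900, 100)
```

The point of conditioning on workload is visible in the example: 430 ms is unremarkable at around 530 requests per minute, while 260 ms at around 130 requests per minute is flagged, even though it is the faster of the two.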
Fig 2: Features of Heal catering to effective operational monitoring of E-Commerce environments.
Table 1: List of Heal Features supporting the peculiar issues faced in the operational monitoring of E-Commerce setups.
In the next section we examine some use cases for Heal drawn from E-Commerce setups where it is currently deployed.
Sample Use Cases for Heal in E-Commerce SRE
Slowness in a service in an e-commerce payments application
An e-commerce application frequently experienced slowdowns in a few specific user transactions. Whenever this happened, the IT Operations team “restored” normalcy by recycling the Java servers. Although this was a quick fix, it did not help SREs narrow down the root cause of the issue.
Enter Heal. Heal built baselines for workload and behaviour, and the resulting model was used to generate anomalies on transaction response times as well as other pertinent service metrics. Whenever the application experienced transaction slowness, related anomalies were grouped into a single signal, with just-in-time forensics and code snapshots collected on the anomalous metrics in the Java services.
The code snapshots revealed that a database exit to the BookingsDB was the root cause of the slowness. Just-in-time forensics showed a spike in metrics like DB locks, execution time and buffer reads indicating that the database was taking too long to execute certain queries. Hotspot analysis and database deep-dive analysis further confirmed the hypothesis that the BookingsDB was the actual root cause service, with the slow query being pinpointed through an analysis of the Oracle DB snapshots done in Heal’s integrated service dashboard.
Fig 3: Integrated root cause analysis done by Heal to troubleshoot slowness on the eCom Application Server.
The outcome of the root cause analysis performed by Heal was a proactive healing action put in place by the database system administrator on the recommendation of the operations team: a query plan change for the slow queries. Deploying the queries with optimized plans ensured the slowness did not recur.
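The grouping of related anomalies into a single signal, mentioned in this use case, can be pictured with a toy example. This sketch groups anomalies purely by time proximity; the tuple layout, the 300-second window and the `group_into_signals` name are assumptions, and a real correlation engine would likely also take service topology and workload context into account.

```python
def group_into_signals(anomalies, window_s=300):
    """Group anomalies into signals when each follows the previous within window_s.

    anomalies: list of (timestamp_s, service, metric) tuples.
    Returns a list of signals, each a time-ordered list of anomalies.
    """
    signals = []
    for anomaly in sorted(anomalies):
        last_signal = signals[-1] if signals else None
        if last_signal and anomaly[0] - last_signal[-1][0] <= window_s:
            # Close enough in time to the previous anomaly: same signal.
            last_signal.append(anomaly)
        else:
            # Too far apart (or first anomaly): start a new signal.
            signals.append([anomaly])
    return signals


anomalies = [
    (60, "BookingsDB", "db_locks"),
    (0, "eCom Application Server", "response_time"),
    (1000, "eCom Application Server", "cpu"),
]
signals = group_into_signals(anomalies)
```

Here the response time and DB lock anomalies land in one signal, giving an SRE a single item to investigate instead of two unrelated alerts.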
Dynamic Workload Optimization – Autonomous Healing
Users of a popular food delivery application were facing slowdowns and time-outs during peak hours. Heal was deployed to learn workload trends and see how autonomous healing could help remedy the situation. It was observed that the latency was due to the nature of the traffic, i.e. a particular mix of workloads was seen to occur during most of the reported periods of slowdown. The app business team was consulted to understand the relative business priority of all transactions seen during these specific slowdowns. It was determined that users were willing to wait 2-3 seconds to see search results but wanted faster response times on cart checkout and payment confirmation. The tracking facility was also found to be a significant burden on underlying resources, so it was decided that a “Please wait” type of message would be displayed to users who wished to see the live location of the delivery partner on a real-time map; the default setting would show only the ETD and delivery partner details, with no map.
Fig 4: Illustration of autonomous healing to effect dynamic workload optimization to handle surge workload and problematic workload mixes.
An autonomous healing script was written to redirect low-priority transactions (e.g. search and view live map) to this “Please Wait” page. Once the script was deployed, whenever Heal raised an Early Warning signal predicting a workload surge, the healing script would be triggered to introduce a slight wait for low-priority transactions. Heal thus made it possible to handle periodic surge workloads without costly overprovisioning of additional capacity.
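The logic of such a healing script can be sketched as follows. This is an illustration under stated assumptions, not the script used in the deployment: the `WorkloadOptimizer` class, its method names and the transaction labels are hypothetical, and how an Early Warning signal is actually delivered to a script is not covered here.

```python
# Hypothetical labels for the low-priority transactions named above.
LOW_PRIORITY = {"search", "view_live_map"}


class WorkloadOptimizer:
    """Diverts low-priority transactions to a wait page while a surge is expected."""

    def __init__(self):
        self.surge_expected = False

    def on_early_warning(self):
        # Called when an early warning predicting a workload surge is received.
        self.surge_expected = True

    def on_recovery(self):
        # Called when the workload has returned to normal.
        self.surge_expected = False

    def route(self, transaction):
        """Return where to send a transaction: 'serve' or 'please_wait'."""
        if self.surge_expected and transaction in LOW_PRIORITY:
            return "please_wait"
        return "serve"


optimizer = WorkloadOptimizer()
before = optimizer.route("search")

optimizer.on_early_warning()
during_low = optimizer.route("search")
during_high = optimizer.route("checkout")

optimizer.on_recovery()
after = optimizer.route("view_live_map")
```

Checkout and payment are never diverted, which reflects the business team’s finding that users tolerate a short wait on search but not on cart checkout or payment confirmation.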
Conclusion
This blog has gone over some scenarios in which Heal enables operational efficiency in e-commerce environments through its monitoring and healing capabilities, thereby ensuring site reliability, improving availability, functionality and uptime, and giving users a better experience. We go over similar scenarios and use cases for SAP environments in our next blog, the last one of this series. Meanwhile, do go over our older blogs on our patented ML techniques for effecting workload-behaviour correlation and providing your enterprise with the power of proactive, autonomous and projected healing. Stay safe!