by Vamsi Vedula | Apr 22, 2020
Site Reliability Engineering is a discipline that incorporates aspects of software engineering and applies them to IT operations problems. Its main goal is to create ultra-scalable and highly reliable software systems. Site Reliability Engineers (SREs) are tasked with ensuring high availability at scale, maintaining a consistent end-user experience and monitoring a truly heterogeneous setup, with some components on premises, some in the cloud and some running in containers. Engineering reliability is even more challenging for enterprises that handle dynamic workloads, such as SAP or E-Commerce setups, where the underlying infrastructure is subject not only to modern deployment complexities but also to the vagaries of inbound workload.
Our blog today focuses on some of the challenges SREs face in engineering reliable systems for such enterprises. It is the first of a four-part series. The series will cover the shift from traditional monitoring and APM to AIOps for implementing site reliability, the key dimensions that enterprises with dynamic workloads must address to provide a consistent user experience (latency, availability, scalability and planning), and how adopting a self-healing methodology helps such enterprises. We round off the series with some use cases for self-healing in E-Commerce and SAP systems.
Ensuring site availability in modern enterprise systems is a dicey proposition. The explosion in demand for digital experiences has increased user traffic manifold, and the operational focus has shifted as more enterprises move their architecture to the cloud. Applications must now be available 24×7, and the customer experience needs to be optimal, with minimal to no latency. To make this possible, the underlying infrastructure, however complex, needs to be monitored continuously.
This is easier said than done. SREs are asked to manage deployment complexity while navigating operational silos that make collaboration, data exchange and problem resolution incredibly challenging. There are hundreds of thousands of data points to consider. Containers and cloud platforms require diligent monitoring of applications, network and infrastructure, with no clear separation between the three, while DevOps introduces a high level of automation and increases the pace of delivery.
This gives rise to 3 problems:
In short, the Holy Grail for any SRE seems to be greater reliability with less manual intervention as a system scales.
Enterprises today handle huge volumes of data coming in from multiple sources: end-user transactions, batch jobs, background processes, external systems invoking them via API gateways, and so on. Such enterprises could have an on-premises, hybrid, cloud, container, microservice-based or even serverless architecture. To keep such an enterprise running like a well-oiled machine, a DevOps solution is usually the first thing that comes to mind. Equally important, however, is site reliability engineering, which plays a key role in maintaining 24×7 uptime and scaling applications.
Some of the questions that SREs for such enterprises need to ask themselves are:
Most APM tools take a one-dimensional approach: workload and system metrics are examined individually, and anomalous patterns are flagged as possible troublemakers by applying static or dynamic thresholds. This is clearly not enough in a system whose behaviour is closely tied to inbound workload. The CPU in a web server might be running at 10% above average, tripping a dynamic threshold rule configured in the APM tool. However, this might simply be because the workload is commensurately higher, in which case it is expected behaviour for the current workload. Raising an alert and making Ops teams waste valuable time investigating an issue that doesn’t exist, or making them sift through thousands of false positives during an outage, is counterproductive.
Effect of dynamic workload on system metrics
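The web-server example above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular APM or AIOps product works. It assumes a roughly linear relationship between requests per second and CPU utilisation, and every name in it (`WorkloadAwareDetector`, `metric_only_alert`, the sample history, the 10-point margin and the `k=3.0` tolerance) is hypothetical and chosen only for the example.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def metric_only_alert(cpu_history, cpu, margin=10.0):
    """One-dimensional rule: alert when CPU is `margin` points above its average."""
    return cpu > mean(cpu_history) + margin


class WorkloadAwareDetector:
    """Flags CPU readings that deviate from what the current workload explains."""

    def __init__(self, workload_history, cpu_history, k=3.0):
        self.slope, self.intercept = fit_line(workload_history, cpu_history)
        residuals = [
            cpu - self.expected(load)
            for load, cpu in zip(workload_history, cpu_history)
        ]
        # Tolerate deviations up to k standard deviations of the historical residuals.
        self.tolerance = k * pstdev(residuals)

    def expected(self, workload):
        return self.slope * workload + self.intercept

    def is_anomalous(self, workload, cpu):
        return abs(cpu - self.expected(workload)) > self.tolerance


# Hypothetical history: requests per second and the CPU % observed at that load.
workload_history = [100, 200, 300, 400, 500]
cpu_history = [21, 29, 41, 49, 60]
detector = WorkloadAwareDetector(workload_history, cpu_history)

# High CPU that the high workload fully explains.
print(metric_only_alert(cpu_history, 62), detector.is_anomalous(520, 62))
# Moderate CPU at a low workload, which the metric-only rule lets through.
print(metric_only_alert(cpu_history, 45), detector.is_anomalous(150, 45))
```

With this sample data, the metric-only rule fires on the first reading even though the CPU level matches the traffic, and stays silent on the second even though the CPU is far higher than the traffic warrants. The workload-aware check does the opposite in both cases. A real system would need a richer model than a straight line, but the principle of judging a metric against the workload that drives it is the same.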
Consider a scenario in an airline’s online booking system, where latency in the mobile banking application of a popular national bank is causing users to abandon the booking service at checkout. This is a business-critical problem that affects not only the airline’s business but also the bank’s. SREs grapple with multiple issues in such a scenario:
Challenges before SREs driving the need for automation in enterprises handling dynamic workloads
Answering these questions calls for a multi-pronged approach:
Our next blog will go over some approaches to introducing automation in order to address the above needs. AIOps tools provide some of the features listed above, but self-healing can give SREs the “something extra” needed to automate remediation. Heal does this through proactive, autonomous and projected healing to keep your enterprise running smoothly, no matter how complex your deployment strategy or how dynamic your traffic. We will cover the Four Golden Signals of API Health and Performance in Cloud as defined by Google’s SRE teams, and how Heal helps you monitor all these and more. So, don’t forget to tune in next week to learn more about how to address the issues we have discussed in today’s blog, and do reach out to us to schedule a demo if you would like to see Heal in action!