Site Reliability Engineering for Dynamic Workloads

Apr 22, 2020

What issues do site reliability engineers face while planning for enterprises with ultra-dynamic workloads?

Site Reliability Engineering is a discipline that incorporates aspects of software engineering and applies them to IT operations problems. Its main goals are to create ultra-scalable and highly reliable software systems. Site Reliability Engineers (SREs) are tasked with ensuring high availability at scale, maintaining a consistent end-user experience and monitoring a truly heterogeneous setup, with some components on-premises, some on the cloud and some running in containers. Engineering reliability is even more challenging for enterprises that handle dynamic workloads, such as SAP or e-commerce setups, where the underlying infrastructure is subject not only to modern-day deployment complexities but also to the vagaries of inbound workload.

Today's blog focuses on some of the challenges SREs face in engineering reliable systems for such enterprises. It is the first of a four-part series that will cover the shift from traditional monitoring and APM to AIOps for implementing site reliability, the key dimensions that must be addressed so that enterprises with dynamic workloads deliver a consistent user experience (latency, availability, scalability and planning), and how adopting a self-healing methodology will help such enterprises. We round off the series with some use cases for self-healing in e-commerce and SAP systems.

Common Issues faced by SREs
Ensuring site availability in modern enterprise systems is a dicey proposition. Not only has the advent of, and subsequent explosion in demand for, digital experiences increased user traffic manifold, but the operational focus has also shifted, with more enterprises moving their architecture to the cloud. Applications must now be available 24×7, and the customer experience needs to be optimal, with minimal to no latency. To facilitate this, the underlying infrastructure, however complex, needs to be monitored continuously.

This is easier said than done. SREs are asked to manage deployment complexity while simultaneously navigating operational silos that make collaboration, data exchange and problem resolution incredibly challenging. There are hundreds of thousands of data points to consider, with containers and cloud platforms requiring diligent monitoring of applications, network and infrastructure without a clear separation between the three, and DevOps introducing a high level of automation and increasing the pace of delivery.

This gives rise to 3 problems:

  • Prioritizing the right data points to collect across operational silos to arrive at an accurate root cause, without getting overwhelmed by alert storms or having to manually sift through data (see the alert-correlation sketch after this list).
  • Automating root cause analysis and proactive mitigation of issues to the highest possible degree, so that the energy of SREs may be better spent on optimizing application code and improving operational efficiency.
  • Planning for the future based on traffic projections, so that the system can scale effectively and process transactions with minimal disruption and an uninterrupted customer experience.
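To make the first two points a little more concrete, here is a minimal sketch of event correlation: raw alerts are grouped by the component that raised them and by time proximity, so that an alert storm collapses into a short list of incidents. The Alert fields, the five-minute window and the grouping key are illustrative assumptions, not a description of any particular tool.

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class Alert:
    ts: float        # epoch seconds at which the alert fired
    component: str   # e.g. "web-01", "payments-db" (hypothetical names)
    signal: str      # e.g. "cpu_high", "latency_slo_breach"

def correlate(alerts, window=300):
    """Collapse an alert storm into incidents.

    Alerts raised by the same component within `window` seconds of each
    other are treated as one incident, so responders see a short list of
    incidents instead of hundreds of individual alerts.
    """
    incidents = []
    ordered = sorted(alerts, key=lambda a: (a.component, a.ts))
    for component, group in groupby(ordered, key=lambda a: a.component):
        current = []
        for alert in group:
            if current and alert.ts - current[-1].ts > window:
                incidents.append((component, current))
                current = []
            current.append(alert)
        if current:
            incidents.append((component, current))
    return incidents
```

Real AIOps tools add topology and causal context on top of this kind of grouping; the point here is simply that correlation turns raw alert volume into a handful of starting points for root cause analysis.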

In short, the Holy Grail for any SRE seems to be greater reliability with less manual intervention as a system scales.

SRE in Enterprises handling Dynamic Workloads
Enterprises today handle huge volumes of data coming in from multiple sources: end-user transactions, batch jobs, background processes, external systems invoking them via API gateways, and so on. Such enterprises could have an on-premises, hybrid, cloud, container, microservice-based or even serverless architecture. To keep such an enterprise running like a well-oiled machine, a DevOps solution is usually the first thought. However, site reliability engineering is equally important, playing a key role in managing 24×7 uptime and application scaling.

Some of the questions that SREs for such enterprises need to ask themselves are:

  • How do I accurately capture workload and system behaviour across disparate silos?
  • How do I know I am getting precise and accurate data points to troubleshoot an issue with? In short, am I certain the system is throwing up no false positives and no false negatives?
  • How do I ensure that background jobs complete on time and do not slow down the system to an extent that end users start facing latency?
  • How do I prepare for an event where my transaction volume may increase ten-fold?
  • How do I plan my scaling, knowing that more and more external applications are going to be invoking my microservices in the days to come? (A minimal capacity-projection sketch follows this list.)
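As a rough illustration of the last two questions, the sketch below projects how many instances would be needed to absorb a traffic surge, given an assumed per-instance throughput and a safety headroom. All of the numbers and the headroom parameter are hypothetical; real planning would start from measured load-test figures.

```python
import math

def required_instances(projected_peak_rps: float,
                       per_instance_rps: float,
                       headroom: float = 0.3) -> int:
    """Instances needed to absorb the projected peak with spare headroom.

    projected_peak_rps : expected peak requests per second (e.g. 10x today)
    per_instance_rps   : sustained throughput one instance handles within SLO
    headroom           : fraction of capacity kept free for surges and failover
    """
    usable = per_instance_rps * (1.0 - headroom)
    return math.ceil(projected_peak_rps / usable)

# Example (hypothetical numbers): today's peak is 400 req/s and a ten-fold
# surge is expected; each instance sustains 250 req/s within SLO.
print(required_instances(projected_peak_rps=10 * 400,
                         per_instance_rps=250))   # -> 23 instances
```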

Most APM tools provide a one-dimensional approach, in which workload and system metrics are examined individually and anomalous patterns are flagged as possible troublemakers by applying static or dynamic thresholds. This is clearly not enough in a system whose behaviour is a close function of the inbound workload. The CPU of a web server might be running 10% above average, tripping a dynamic threshold rule configured in the APM tool. However, it might only be because the workload is commensurately higher, and this is expected behaviour for the current workload. Raising an alert and making Ops teams waste valuable time investigating an issue that doesn't exist, or making them sift through thousands of false positives in the event of an outage, is totally counterproductive.

Fig 1: Effect of dynamic workload on system metrics
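To make the contrast concrete, here is a minimal sketch comparing a naive dynamic-threshold rule with a workload-aware check that looks at CPU per unit of inbound traffic: in the scenario described above, where CPU rises only because traffic has risen, the first rule fires while the second stays quiet. The baseline figures and tolerance values are assumptions chosen purely for illustration.

```python
def naive_alert(cpu_pct: float, baseline_cpu_pct: float, tolerance: float = 0.10) -> bool:
    """Dynamic-threshold style check: alert whenever CPU exceeds baseline + 10%."""
    return cpu_pct > baseline_cpu_pct * (1 + tolerance)

def workload_aware_alert(cpu_pct: float, rps: float,
                         baseline_cpu_pct: float, baseline_rps: float,
                         tolerance: float = 0.15) -> bool:
    """Alert only if CPU per request has drifted from its baseline ratio."""
    cpu_per_req = cpu_pct / max(rps, 1e-9)
    baseline_cpu_per_req = baseline_cpu_pct / max(baseline_rps, 1e-9)
    return cpu_per_req > baseline_cpu_per_req * (1 + tolerance)

# CPU is well above its dynamic threshold, but traffic is up by the same
# proportion: the naive rule fires, the workload-aware rule stays quiet
# because CPU-per-request is unchanged.
print(naive_alert(cpu_pct=69, baseline_cpu_pct=60))                  # True
print(workload_aware_alert(cpu_pct=69, rps=1150,
                           baseline_cpu_pct=60, baseline_rps=1000))  # False
```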

Consider a scenario in an airline's online booking system, where latency in the mobile banking application of a popular national bank is causing users to abandon the booking service at checkout. This is a business-critical problem affecting not only the airline's business but also the bank's. There are multiple issues SREs grapple with in such a scenario:

  • First and foremost, how can they determine the cause of the latency in the banking application? Is it due to higher-than-usual workload arriving at its API gateway? If so, how can SREs remediate it, especially when they have no control over a system in another enterprise?
  • Problems like this might go unnoticed for some time, or there could be an alert storm with no clear event correlations. How do teams make sense of the data they have been bombarded with?
  • Could user journey monitoring have highlighted the issue earlier, since checkout conversion rates would have dropped suddenly and drastically during the problem window? (See the conversion-rate sketch after this list.)
  • Even when a problem is identified, where do teams start looking for the root cause? Could it be a problem on the airline's side, where a new code deployment has caused an issue with checkouts, or is it an issue with the bank's API gateway, or just a bad network? Could it be an auto-scaling issue with the bank's authentication microservice, or was it due to a CPU spike on one of the boxes hosting the service?
  • Once the root cause is addressed, how do they ensure this does not happen again, say, during peak holiday season booking sales?
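On the user-journey question above, here is a minimal sketch of the kind of check involved: track checkout outcomes over a sliding window and flag the funnel when the conversion rate falls far below its recent baseline. The window size, baseline rate and drop threshold are illustrative assumptions.

```python
from collections import deque

class ConversionMonitor:
    """Track checkout conversions over a sliding window and flag sharp drops."""

    def __init__(self, baseline_rate: float, window: int = 500, max_drop: float = 0.5):
        self.baseline = baseline_rate      # e.g. 0.30: 30% of checkout attempts convert
        self.outcomes = deque(maxlen=window)
        self.max_drop = max_drop           # alert once the rate falls below 50% of baseline

    def record(self, converted: bool) -> None:
        self.outcomes.append(1 if converted else 0)

    def degraded(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False                   # not enough recent journeys yet
        current = sum(self.outcomes) / len(self.outcomes)
        return current < self.baseline * self.max_drop

# Example: normally ~30% of checkout attempts convert; a payment-gateway
# latency that pushes conversions towards zero trips this check quickly.
monitor = ConversionMonitor(baseline_rate=0.30)
```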

Fig 2: Challenges before SREs driving the need for automation in enterprises handling dynamic workloads

Dimensions of Ops Monitoring required in Enterprises with Dynamic Workloads
The answers to these questions call for a multi-pronged approach:

  • Full stack visibility and monitoring, including metrics from on-premise components, cloud microservices, containers and the network.
  • Proactive alerting on an issue through the use of AI/ML techniques (a minimal sketch follows this list).
  • Event correlation to reduce alert storms and lead teams towards more accurate root cause analysis.
  • User Journey Analysis to immediately pinpoint journeys where conversion rates have suddenly fallen.
  • Projective planning to ensure the infrastructure can handle surges in workload.
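As one simple way to realise the proactive-alerting item above, the sketch below keeps a rolling statistical baseline for a metric and flags values that drift several standard deviations away from it, often before a fixed threshold would trip. The window size and z-score threshold are assumptions for illustration; commercial AIOps tools use considerably richer models.

```python
import statistics
from collections import deque

class RollingAnomalyDetector:
    """Flag values that deviate sharply from a metric's own recent history."""

    def __init__(self, window: int = 120, z_threshold: float = 3.0):
        self.history = deque(maxlen=window)   # e.g. the last 120 one-minute samples
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to the rolling window."""
        anomalous = False
        if len(self.history) >= 30:           # wait for a minimal baseline
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(value - mean) / stdev > self.z_threshold
        self.history.append(value)
        return anomalous

# Feed it per-minute error rates, queue depths or p99 latencies; it fires
# when a metric drifts well outside its own recent behaviour.
detector = RollingAnomalyDetector()
```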

Our next blog will go over some approaches to introducing automation to address the needs above. AIOps tools provide some of the features listed, but self-healing gives SREs that "something extra" needed to automate remediation. Heal does this through proactive, autonomous and projected healing to keep your enterprise running smoothly, no matter how complex your deployment strategy or how dynamic your traffic. We will also cover the Four Golden Signals of API Health and Performance in Cloud as defined by Google's SRE teams, and how Heal helps you monitor all of these and more. So don't forget to tune in next week to learn more about how to address the issues we have discussed today, and do reach out to us to schedule a demo if you would like to see Heal in action!