The IT Ops Monitoring Checkup for e-Commerce

7 Capabilities That Prepare Your Ops Team for the Unexpected

7 Critical Capabilities

Shopper journeys
Observability
Baselining
Issue detection
Issue remediation
Issue prevention
Capacity forecasting

IT Ops Pressure Points in e-Commerce

In 2020, e-Commerce companies have seen unprecedented demands placed on their technology stacks. This Checkup provides a framework for assessing the seven most critical factors that will affect IT Ops monitoring in this hard-to-predict environment.

Online retail grows exponentially

Online sales in essential categories have gone up as much as 700%, while brands with physical stores are launching a record number of promotions to shift foot traffic online.

Click-and-collect adds new points of failure

Many brands are relatively new to click-and-collect or Buy Online/Pick Up In-Store (BOPIS), and components may not yet be battle-tested. Solving incidents and performance problems can be challenging because of disconnected monitoring across modern e-commerce tech and back-end systems.

Dependencies on third-party systems create cascading effects

Suppliers are also experiencing volatile demand, and any unpredictability that affects their internal systems can carry through the site to affect shoppers.

Supply chains lose predictability

Retailers may need to quickly onboard new suppliers to meet demand shifts and de-risk critical supply chains. Yet today, four out of 10 retailers lack the ability to see inventory as it flows from source to consumer.
SHOPPER JOURNEYS
  • Do you have a holistic view of the shopping experience across application components broken down into a series of user journeys? That is, are you able to trace transactions across separate applications to find any errors that occur?
  • Can you monitor user journeys in real time, not just individual technology components, so that you can measure based on business impact?
  • Are you able to easily search for and view a shopper’s step-by-step interactions with your application if he or she calls in for support?
  • Are you able to monitor partner integrations, including outbound calls made to external services and inbound calls?

ACTION ITEMS

  • Create shopper journeys that identify linkages between technology components and dependencies at each step.
  • Build processes around shopper journeys and establish monitoring not only based on technology components but also on shopper experience.
  • Ensure that all transaction data gets indexed, not just samples.
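
To illustrate the first action item, here is a minimal sketch of rebuilding shopper journeys from indexed transaction events and locating the first failing step in each. The event fields (`txn`, `ts`, `step`, `component`, `ok`) and the component names are assumptions made for this example, not the schema of any particular monitoring product.

```python
from collections import defaultdict

# Hypothetical transaction events; every field name here is an assumption.
events = [
    {"txn": "t1", "ts": 1, "step": "browse", "component": "web", "ok": True},
    {"txn": "t1", "ts": 2, "step": "add_to_cart", "component": "cart-service", "ok": True},
    {"txn": "t1", "ts": 3, "step": "checkout", "component": "payment-gateway", "ok": False},
    {"txn": "t2", "ts": 1, "step": "browse", "component": "web", "ok": True},
    {"txn": "t2", "ts": 2, "step": "add_to_cart", "component": "cart-service", "ok": True},
]


def build_journeys(events):
    """Group events by transaction id, ordered by timestamp."""
    journeys = defaultdict(list)
    for event in sorted(events, key=lambda e: (e["txn"], e["ts"])):
        journeys[event["txn"]].append(event)
    return dict(journeys)


def first_failure(journey):
    """Return (step, component) of the first failed event, or None."""
    for event in journey:
        if not event["ok"]:
            return event["step"], event["component"]
    return None


journeys = build_journeys(events)
failures = {txn: first_failure(journey) for txn, journey in journeys.items()}
print(failures)
```

Because every transaction is indexed rather than sampled, a support agent could look up a single shopper's `txn` and see each step in order.
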
OBSERVABILITY
  • Have your developers adequately instrumented your website/service to reveal its behavior?
  • Does your instrumentation give you warnings before an issue emerges?
  • Have your issue counts dropped over time with an instrumentation-based approach?
  • Have you faced situations where new code ran fine one day but did not run correctly the next day? How do you troubleshoot such situations in production?
  • Can you map each transaction into application components (application servers, web servers, microservices, database)?
  • Can you map each user experience into its infrastructure component (compute, storage, network)?
  • Are you able to catch signals from unusual behavior of systems, workloads on those systems, or both?
  • How many monitoring tools must you use to view a problem from end to end?

ACTION ITEMS

  • Set criteria for minimum allowable instrumentation before code goes into production, tied to your performance and availability goals.
  • Take an instrumentation approach that provides for early warnings, not just indicators that something has already broken.
  • Explore artificial intelligence (AI) capabilities to provide real-time visibility into budding issues.
  • Automate deep discovery and relationship mapping of the user journey to the underlying IT components (application, compute, network, and data).
  • Go beyond monitoring and logging. Capture and process behavioral data from systems and applications.
  • Consolidate monitoring tools to gain a single version of the truth.
BASELINING
  • Are you able to do baselining for ultra-dynamic environments?
  • Does your monitoring rely on baselines from predefined KPIs, or can they be dynamically generated based on workloads?
  • Do you have contextual information about how the system behaves under different workloads to correlate baselines?
  • Does your baselining approach account for user experience as well as the performance of technology components?

ACTION ITEMS

  • Move from static KPIs to performance range targets that change depending on actual workload conditions.
  • Set both automatic and manual performance thresholds.
  • Use machine learning (ML) to build a dynamic baseline of each component’s normal operating condition.
  • Explore AI and ML to gain contextual insights around system behavior and user experience.
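
As a rough illustration of a baseline that moves with recent behavior, the sketch below keeps a rolling window of samples and flags values that fall outside the mean plus or minus a few standard deviations. This is a simple statistical stand-in, not an ML model; the class name, window size, and tolerance are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


class RollingBaseline:
    """Hypothetical dynamic baseline: mean +/- tolerance * stdev of recent samples."""

    def __init__(self, window=20, tolerance=3.0):
        self.samples = deque(maxlen=window)
        self.tolerance = tolerance

    def expected_range(self):
        # At least two samples are needed to estimate a spread.
        if len(self.samples) < 2:
            return None
        mu = mean(self.samples)
        sigma = stdev(self.samples)
        return mu - self.tolerance * sigma, mu + self.tolerance * sigma

    def observe(self, value):
        """Return True if value is outside the current range.

        Normal values are added to the window; anomalous ones are not,
        so a single spike does not widen the baseline.
        """
        rng = self.expected_range()
        anomalous = rng is not None and not (rng[0] <= value <= rng[1])
        if not anomalous:
            self.samples.append(value)
        return anomalous


# Illustrative response times in milliseconds (made-up values).
latencies = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101]
baseline = RollingBaseline()
flags = [baseline.observe(v) for v in latencies]
print(flags)
```

A production approach would likely keep separate baselines per workload level (for example, by hour of day or request volume), which this sketch does not attempt.
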
ISSUE DETECTION
  • What percentage of issues do you find out about via monitoring tool alerts (versus shoppers or intuition)?
  • How many issues can you detect before the system alerts you or shoppers experience a problem?
  • What percentage of issues do you find out about because of spikes in compute costs?
  • What is the false positive rate for your alerts?
  • Does everyone agree on which issues require immediate action?

ACTION ITEMS

  • Evaluate all alerts and understand false positive and false negative rates. False alarms can drain 20-30% of troubleshooting resources.
  • Identify the types of problems that you have only discovered via human intuition and use ML to separate signal from noise.
  • Implement preemptive detection mechanisms to reduce the number of alerts.
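
One way to start evaluating alerts is to compare alert history against confirmed incidents. The sketch below computes the share of alerts that were false alarms and the share of incidents that no alert caught. The record layout and the incident ids are assumptions for illustration; "false alarm share" is used here in the loose sense of this checkup (false alarms divided by all alerts).

```python
def alert_quality(alerts, confirmed_incidents):
    """Summarize alert quality.

    alerts: list of dicts with an "incident" key holding the confirmed
            incident id the alert pointed to, or None for a false alarm.
    confirmed_incidents: set of all confirmed incident ids in the period.
    """
    false_alarms = [a for a in alerts if a["incident"] is None]
    detected = {a["incident"] for a in alerts if a["incident"] is not None}
    missed = confirmed_incidents - detected
    return {
        "false_alarm_share": len(false_alarms) / len(alerts) if alerts else 0.0,
        "missed_incident_share": (
            len(missed) / len(confirmed_incidents) if confirmed_incidents else 0.0
        ),
    }


# Made-up sample data: five alerts, four confirmed incidents.
alerts = [
    {"id": "a1", "incident": "INC-1"},
    {"id": "a2", "incident": None},
    {"id": "a3", "incident": "INC-2"},
    {"id": "a4", "incident": None},
    {"id": "a5", "incident": "INC-1"},
]
incidents = {"INC-1", "INC-2", "INC-3", "INC-4"}

quality = alert_quality(alerts, incidents)
print(quality)
```

The missed incidents (here, the ones found by shoppers or intuition rather than alerts) are the natural candidates for new detection rules.
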
ISSUE REMEDIATION
  • What is your mean time to repair (MTTR)? How has it changed since last month? Last year?
  • What percentage of issues require human involvement?
  • Do you have systems in place for remediation guidance?
  • What AI capabilities do you have to support issue remediation?
  • Are you able to quantify the impact of developing issues, whether they affect checkout or other metrics such as conversion?

ACTION ITEMS

  • Build systems that tolerate some levels of failure until issues can be fixed.
  • Document dependencies between application, infrastructure, and partner components in real time.
  • Develop work processes for operational troubleshooting, and drill your team on how to do them quickly.
  • Use AI to accelerate root cause analysis and remediation.
  • Automate remediation processes as much as possible using your orchestration tools.
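
To track the MTTR question raised in this section, a minimal sketch follows. It assumes each incident record carries `detected` and `resolved` timestamps; those field names and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to repair: average of (resolved - detected), or None if empty."""
    durations = [i["resolved"] - i["detected"] for i in incidents]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Made-up incidents: one repaired in 30 minutes, one in 90 minutes.
sample = [
    {"detected": datetime(2020, 6, 1, 9, 0), "resolved": datetime(2020, 6, 1, 9, 30)},
    {"detected": datetime(2020, 6, 8, 14, 0), "resolved": datetime(2020, 6, 8, 15, 30)},
]

print(mttr(sample))
```

Running the same function over last month's and last year's incidents gives the trend the checklist asks about.
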
ISSUE PREVENTION
  • For what types of issues can you take action quickly enough to prevent a negative user experience?
  • What level of automation do you use for fault prevention?

ACTION ITEMS

  • Identify situations where human decisions have created roadblocks to action.
  • Explore AI to assist IT Ops with preemptive decisioning by surfacing cause-and-effect relationships.
  • Integrate your issue prevention stack with automated remediation to provide for self-healing.
CAPACITY FORECASTING
  • How do you forecast capacity for traffic fluctuations? How accurate are your forecasts?
  • To what extent do your forecasts rely on individual expert judgment? What would happen if that expert were not available?
  • Do over-provisioning and/or cloud elasticity drive up your costs?

ACTION ITEMS

  • Keep accurate operational data that’s granular enough to provide component-by-component utilization data and workload metrics.
  • Look at new ways to automate capacity forecasting based on your actual data, not just synthetic load scenarios.
  • Conduct “what if” analyses to see the impact of growing workloads on each component of your stack.
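
A "what if" analysis can start very simply. The sketch below projects per-component utilization under several workload growth factors and lists the components that would exceed a utilization limit. It assumes utilization scales linearly with workload, which real systems often violate; the component names, current utilization figures, and the 80% limit are made up for the example.

```python
def what_if(utilization, growth_factors, limit=0.8):
    """For each growth factor, list components whose projected utilization exceeds limit.

    Assumes linear scaling of utilization with workload (a simplification).
    """
    results = {}
    for factor in growth_factors:
        projected = {name: u * factor for name, u in utilization.items()}
        results[factor] = sorted(name for name, u in projected.items() if u > limit)
    return results


# Hypothetical current utilization per component (fraction of capacity).
current = {"web": 0.35, "checkout-api": 0.50, "database": 0.60}

report = what_if(current, [1.0, 1.5, 2.0])
for factor, at_risk in report.items():
    print(f"{factor}x workload: at risk = {at_risk}")
```

Feeding this kind of projection with measured utilization rather than synthetic load numbers is what makes the result useful for planning.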

WHAT TO DO NEXT

Leading e-Commerce Ops teams are making strategic investments in AI capabilities to:
  • Create more holistic visibility into the relationship between user experiences and underlying applications, infrastructure, and IT Ops processes
  • Make their decisioning capabilities less reactive and more proactive
  • Significantly increase the number of issues that can be automatically solved before they drive down usage metrics or revenue
By acting now, they can prevent costly outages, efficiently manage development and compute costs, and improve brand experience at a pivotal time.

Contact Appnomic to get a personalized assessment of where AI can improve your IT Ops monitoring stack.