State of IT Operations 2021 and The Journey to Self Healing

Research Report based on survey conducted by Appnomic Systems Inc

State of IT Operations

As enterprises make the shift to multi-cloud-based infrastructure, monitoring tools have been hard-pressed to keep up with increased complexity. High stakes incidents are overwhelming IT Operations teams and downtime has become more expensive than ever.

Traditional APM and legacy tools are a liability for many organizations and are no longer capable of managing the modern enterprise cloud. Companies are stuck firefighting, and troubleshooting has become more complicated and time-consuming.

To resolve this dichotomy, automation in ITOM, preventive healing (i.e. preventing a problem even before it occurs) and effective capacity forecasting are the way forward.

This report outlines the challenges organizations face as they combat the complexities of the enterprise cloud, and why the journey to preventive healing has made it to the top of the agenda for CIOs and IT Operations leaders in 2021.

Inside this research brief

1. High stakes incidents overwhelm IT Operations
2. Existing monitoring toolset is a liability for many organizations
3. Transformation to AI-driven preventive ITOM is a business priority
4. Preventive healing process will bridge Dev, IT and Biz Ops
5. ITOM can evolve to Preventive Healing in 3 steps

High stakes incidents overwhelm IT Operations

The cost to fix incidents is high in terms of time and resource spend.
Business risk is increasing with a higher number of incidents.
Transactional volume is growing across mission-critical applications.

The key findings emerging from the current state of IT Operations at the surveyed organizations revealed some overarching trends. Most significant among them was the number of critical incidents occurring each month and the cost required to fix the incidents given the current tools landscape.

For organizations running mission-critical applications in multi-cloud environments, domain-specific monitoring capabilities and trained Operations personnel who have experience in DevOps and SRE (Site Reliability Engineering) are also required to keep the lights on with minimal or no downtime.

On average, respondents handle 34 million transactions across 71.4 mission-critical applications, with more than 519 business-impacting incidents annually. These incidents can cost an average of 1,774 man-hours per month to fix.

High stakes incidents overwhelm IT Operations (infographic; percentage values not preserved)

  • Respondents with at least 30-45 mission-critical applications
  • Respondents with more than 1 million transactions every month
  • Respondents with more than 1,000 incidents per month on average
  • Respondents with more than 100 high severity incidents in 2019

Existing monitoring toolset is a liability for many organizations

Alert storms, false positives and unnoticed events persist despite use of AIOps tools.
Multiple tools being used for monitoring results in a fractured, siloed view of the enterprise.
Teams still rely on dashboards to troubleshoot versus getting early warnings to act upon.

A single monitoring tool is insufficient to provide visibility and observability across multiple environments. Organizations are struggling to rethink their automation strategies in Hybrid Digital Infrastructure environments. Many organizations use around 10 monitoring tools and are unhappy with their existing operational setup.

Regardless of the overwhelming volume of data being collected, insights are still not being processed effectively. Coupled with that is the fact that teams need to frequently contend with alert storms and lack of coherent event correlations and root cause analysis.
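One common way to tame an alert storm is to group alerts that fire close together in time, so that a burst is handled as one candidate incident rather than as many separate alerts. The sketch below is a minimal illustration of that idea only. The alert tuple layout, the 60-second window and the sample data are assumptions made for this example and are not taken from the survey; real correlation engines typically also draw on topology and dependency information.

```python
def group_alerts(alerts, window_seconds=60):
    """Group alerts that occur within window_seconds of the previous alert.

    alerts: iterable of (timestamp_seconds, source, message) tuples
            (a hypothetical layout chosen for this sketch).
    Returns a list of groups, each a list of alerts in time order.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        # Extend the current group if this alert follows the last one closely.
        if groups and alert[0] - groups[-1][-1][0] <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Hypothetical sample: three alerts from different silos in one burst,
# followed much later by an unrelated alert.
sample_alerts = [
    (0, "database", "slow query"),
    (30, "application", "request timeout"),
    (45, "load balancer", "5xx responses"),
    (500, "storage", "disk usage high"),
]
alert_groups = group_alerts(sample_alerts)
```

With this sample, the first three alerts land in one group and the storage alert forms its own, so an operator would review two candidate incidents instead of four alerts.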

Despite the plethora of AIOps tools in the market, most still automate the resolution of incidents that have already occurred through orchestration workflows and ITSM integration as part of the larger automation strategy. Troubleshooting teams still rely on dashboards to become aware of issues via events, logs or alerts.

An average NOC can receive up to 10,000 alerts per day; 49% of SREs respond to at least one major incident every week.

Existing monitoring toolset is a liability for many organizations (infographic; percentage values not preserved)

  • Respondents who find present systems to be too complex
  • Respondents who felt that dashboards and alerts are insufficient
  • Respondents who feel they are getting too many alerts
  • Respondents who feel there are too many false positives
  • Respondents who feel events are going unnoticed
  • Respondents who feel they have a fractured view of the enterprise

Transformation to AI-driven prevention is a business priority

The main objective of ITOM should be to prevent future incidents.
Advanced analytical capability is a necessity for operations.
AI should also facilitate planning by deriving insights from historical data.

Most surveyed organizations are looking at AI and automation as one of the top asks from monitoring initiatives and capabilities in 2020. Prevention of future incidents, quicker detection, or, at the very least, faster root cause analysis were other primary expectations.

Advanced analytics and decisioning systems will continue to be the focal point of operational capability. What would make them more helpful, however, is time-synchronized contextual data from services spread across multiple environments, including business transactions which travel across organizational silos, and diagnostic data about the current state of the system, coupled with logs, configuration changes and code/query snapshots.

Planning is also a requirement which is key to automation success; with multi-cloud environments it is imperative to identify current and potential hotspots via capacity forecasting, and scale intelligently to mitigate issues and manage costs.
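As a rough illustration of capacity forecasting, the sketch below fits a straight-line trend to historical monthly transaction volumes and estimates how many months remain before a capacity ceiling is reached. The function names, the sample volumes and the capacity figure are hypothetical and are not taken from the survey. A production forecast would typically also account for seasonality and forecast uncertainty.

```python
import math
from statistics import mean


def fit_linear_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def months_until_capacity(values, capacity):
    """Whole months after the last observation until the trend reaches capacity.

    Returns 0 if the trend is already at or above capacity, and None if the
    workload is flat or shrinking (the trend never reaches capacity).
    """
    slope, intercept = fit_linear_trend(values)
    last_x = len(values) - 1
    current = slope * last_x + intercept
    if current >= capacity:
        return 0
    if slope <= 0:
        return None
    crossing_x = (capacity - intercept) / slope
    return math.ceil(crossing_x - last_x)


# Hypothetical monthly volumes (millions of transactions) and a
# hypothetical ceiling of 30 million transactions per month.
monthly_volumes = [10, 12, 14, 16, 18]
months_left = months_until_capacity(monthly_volumes, capacity=30)
```

A result such as `months_left` gives planners a lead time for scaling decisions, which is the point of forecasting on predicted rather than current workload.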

95% of respondents are interested in AI-driven self-healing/autonomous systems, for which automation via prediction, prevention and planning is essential.

Transformation to AI-driven prevention is a business priority (infographic; percentage values not preserved)

  • Respondents who were interested in an AI-driven preventive healing system
  • Respondents who felt that they needed advanced analytics capabilities
  • Respondents who favored prevention of future incidents
  • Respondents who felt capacity forecasting should be done on the basis of future predicted workload

Preventive healing process will bridge Dev, IT and Biz Ops

Fix problems before they occur

Focusing on MTTR and fixing issues faster does not contribute to avoiding outages and improving customer experience. A whole new approach that focuses on prevention is required across DevOps, ITOps, and BizOps silos.

70% of infrastructure problems come from poorly applied or failed changes. DevOps should address change management effectively by enabling frequent deployments to better support customer and employee needs. Tools and practices such as CI/CD enable more frequent deployments, especially in complex cloud and hybrid environments.

Establishing automated workflows for activities like monitoring, alerting, root cause analysis, diagnostics, resource provisioning via planning and auto-remediation is paramount for automation readiness. These capabilities should align with and map to critical business processes, so that customers can set priorities based on business value.

Conclusion

High-severity incidents overwhelm enterprises and environments have become more complex to monitor and manage. IT Operations teams find themselves ill-equipped to troubleshoot with the existing tools, which are more reactive than preventive. The adverse impact of downtime on business and the high costs of troubleshooting mean that incident prevention is the need of the hour.

More and more IT leaders are looking at Automation and AI/ML techniques as the solution to the conundrum. Preventive Healing is the way forward to make sense of the overwhelming amount of data being collected, derive insights and prevent incidents in the enterprise.

Enterprises can evolve from AIOps to Preventive Healing by correlating events across multiple silos, bringing in situational awareness via time-synchronized contextual data, highlighting early warnings with diagnostics and root cause suggestions to effect remediation and self-healing, and planning based on workload trends to scale intelligently.

Your Journey To Self Healing
Begins with HEAL

HEAL is the industry’s first preventive healing software for IT that can help fix IT problems before they happen. By applying a patented ML technique called workload-behaviour correlation, HEAL can highlight anomalies in system behaviour as a function of the current workload, called early warning signals. Such an advanced alerting mechanism, coupled with contextual data on the state of the system at the time of the anomaly, gives you the time to put in place the required measures to prevent the issue.
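The general idea of judging system behaviour relative to the current workload can be illustrated with a simple, generic sketch. This is not HEAL's patented technique, whose details are not described in this report. The sketch only assumes that a metric such as CPU utilisation grows roughly linearly with transaction volume, fits that relationship on historical data, and flags observations whose metric deviates from the workload-based expectation by more than a chosen number of standard deviations. All names, sample data and the threshold are hypothetical.

```python
from statistics import mean, pstdev


def fit_baseline(workloads, metrics):
    """Fit metric = slope * workload + intercept by least squares.

    Returns (slope, intercept, residual_std), where residual_std is the
    standard deviation of the fit errors on the historical data.
    """
    if len(workloads) != len(metrics) or len(workloads) < 2:
        raise ValueError("need two or more paired observations")
    w_bar, m_bar = mean(workloads), mean(metrics)
    sww = sum((w - w_bar) ** 2 for w in workloads)
    if sww == 0:
        raise ValueError("workload must vary in the historical data")
    swm = sum((w - w_bar) * (m - m_bar) for w, m in zip(workloads, metrics))
    slope = swm / sww
    intercept = m_bar - slope * w_bar
    residuals = [m - (slope * w + intercept) for w, m in zip(workloads, metrics)]
    return slope, intercept, pstdev(residuals)


def is_early_warning(baseline, workload, metric, threshold=3.0, min_std=1e-9):
    """True if metric is unusual for the given workload (assumed threshold rule)."""
    slope, intercept, residual_std = baseline
    expected = slope * workload + intercept
    # min_std guards against a zero tolerance when the historical fit is perfect.
    return abs(metric - expected) > threshold * max(residual_std, min_std)


# Hypothetical history: transactions per minute vs. CPU utilisation (%).
history_workload = [100, 200, 300, 400, 500]
history_cpu = [21, 39, 61, 79, 101]
baseline = fit_baseline(history_workload, history_cpu)

normal_at_peak = is_early_warning(baseline, workload=500, metric=101)
suspicious_at_low_load = is_early_warning(baseline, workload=100, metric=60)
```

In this sample, a reading of 101 at peak workload is not flagged because the workload explains it, while a reading of 60 at low workload is flagged even though a static threshold on the metric alone might not fire.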

What makes HEAL Unique

Complete preventive solution for enterprise
Autonomous healing to rectify issues, and enable ticketless monitoring
Entire operating context of user journey, application and system in single pane of glass

Appendix

This survey was conducted by an independent market research organization to understand the state of IT Operations.

  • More than 100 SME IT decision makers were surveyed over a period of 2 months
  • 93% of those surveyed were private companies
  • 84% of those surveyed were mid-sized companies; 75% were Technology companies
  • Companies had revenues ranging from 50 million USD to 300 million USD
  • Categories surveyed:
    • State of IT Operations: present state of ITOM in the organization
    • IT Monitoring Tools: average number of tools, which vendors, top use cases and challenges in using the tools
    • Future State of ITOM: monitoring initiatives and capabilities being implemented in 2021, what AI should do for the organization

Survey Demographics

Industry                                   % of respondents
Technology                                 74%
Banking                                    4%
SaaS                                       7%
Others                                     11%

Department                                 % of respondents
Office CTO and digital                     12%
IT Operations                              62%
Applications (Software)                    10%
Infrastructure (Hardware or cloud)         8%
IT strategy and enterprise architecture    8%

Mission critical applications and transaction volume

Number of mission critical applications    % of respondents
<10                                        1%
10-30                                      21%
30-75                                      42%
75-100                                     25%
>150                                       12%

Transactions per month                     % of respondents
<1M                                        5%
1M-10M                                     37%
10M-50M                                    27%
50M-100M                                   27%
>100M                                      4%

The largest number of respondents (42%) mentioned having between 30 and 75 mission critical applications, with a majority of survey respondents (79%) possessing more than 30 such applications.

95% of survey respondents mentioned that they processed more than 1 million transactions every month, hinting at the inherent complexity in their operations.

Average monthly and high severity business impacting incidents

Average monthly incidents                            % of respondents
<1000                                                8%
1000-4000                                            42%
4000-7000                                            38%
7000-10000                                           7%
>10000                                               5%

High severity business impacting incidents (2019)    % of respondents
<100                                                 9%
100-400                                              29%
400-700                                              33%
700-1000                                             23%
>1000                                                6%

91% of survey respondents mentioned they had more than 100 high severity incidents in the last year.

Cost to fix high severity incidents and IT monitoring tool profile

Cost to fix high severity incidents in man-hours    % of respondents
>5000                                               3%
2500-5000                                           18%
1000-2500                                           40%
500-1000                                            33%
<500                                                7%

Number of IT monitoring tools used                  % of respondents
<3                                                  [illegible]
3-5                                                 35%
5-10                                                53%
>10                                                 10%

A large number of survey respondents (73%) mentioned that they spent between 500 and 2,500 man-hours monthly to fix such incidents.

A large number of survey respondents (88%) mentioned that they used between 3 and 10 IT monitoring tools, with the highest number of respondents (53%) indicating using between 5 and 10 monitoring tools.

Challenges and biggest operational pain points

Challenges                                                  % of respondents
Present systems are too complex                             70%
Dashboards and alerts insufficient                          61%
Too costly                                                  65%
Not efficient                                               34%
Not forward looking, i.e. not able to prevent incidents     35%

Biggest operational pain points                             % of respondents
Too many alerts                                             59%
Too many false positives                                    45%
Alerts going unnoticed                                      45%

Among the other choices, “Events going unnoticed” was chosen by 50% of the public companies and 67% of Telecom companies. It was also chosen by 78% of the Infrastructure and 67% of the IT Strategy and Enterprise Architecture respondents.

Interest levels in AI self-healing

What should AI do for your organization                                          % of respondents
Provide advanced analytics                                                       62%
Predict and prevent problems                                                     49%
Find problems fast                                                               45%
Predict future workload                                                          36%
Predict and fix problems                                                         14%

Interested in AI self-healing systems that can prevent incidents from happening?  % of respondents
Yes                                                                              95%
No                                                                               49%

The responses on interest in AI self-healing emphasize the value and potential of AI self-healing systems.