State-of-itops-2021.pdf | Heal Software Inc

State of IT Operations 2021 and The Journey to Self Healing

Research Report based on survey conducted by Appnomic Systems Inc

State of IT Operations

As enterprises make the shift to multi-cloud-based infrastructure, monitoring tools have been hard-pressed to keep up with increased complexity. High stakes incidents are overwhelming IT Operations teams and downtime has become more expensive than ever.

Traditional APM and legacy tools are a liability for many organizations and are no longer capable of managing the modern enterprise cloud. Companies are stuck firefighting and troubleshooting has become more complicated and time consuming.

To resolve this dichotomy, automation in ITOM, preventive healing (i.e. preventing a problem even before it occurs) and effective capacity forecasting are the way forward.

This report outlines the challenges organizations face as they combat the complexities of the enterprise cloud, and why the journey to preventive healing has made it to the top of the agenda for CIOs and IT Operations leaders in 2021.

Inside this research brief

High stakes incidents overwhelm IT Operations

Existing monitoring toolset is a liability for many organizations

Transformation to AI-driven preventive ITOM is a business priority

Preventive healing process will bridge Dev, IT and Biz Ops

ITOM can evolve to Preventive Healing in 3 steps

High stakes incidents overwhelm IT Operations

The cost to fix incidents is high in terms of time and resource spend

Business risk is increasing with a higher number of incidents.

Transactional volume is growing across mission-critical applications

The key findings emerging from the current state of IT Operations at the surveyed organizations revealed some overarching trends. Most significant among them was the number of critical incidents occurring each month and the cost required to fix the incidents given the current tools landscape.

For organizations running mission-critical applications in multi-cloud environments, domain-specific monitoring capabilities and trained Operations personnel who have experience in DevOps and SRE (Site Reliability Engineering) are also required to keep the lights on with minimal or no downtime.

Respondents report an average of 34 million transactions across an average of 71.4 mission-critical applications, with more than 519 incidents impacting business annually. These incidents can cost on average 1774 man-hours per month to fix.

respondents have at least 30-45 mission critical applications

respondents have more than 1 million transactions every month

respondents have more than 1000 incidents per month on average

respondents have more than 100 high severity incidents in 2019

Existing monitoring toolset is a liability for many organizations

Alert storms, false positives and unnoticed events persist despite use of AIOps tools.

Using multiple tools for monitoring results in a fractured, siloed view of the enterprise.

Teams still rely on dashboards to troubleshoot versus getting early warnings to act upon.

A single monitoring tool is insufficient to provide visibility and observability across multiple environments. Organizations are struggling to rethink their automation strategies in Hybrid Digital Infrastructure environments. Many organizations use around 10 monitoring tools and are unhappy with their existing operational setup.

Regardless of the overwhelming volume of data being collected, insights are still not being processed effectively. Coupled with that is the fact that teams need to frequently contend with alert storms and lack of coherent event correlations and root cause analysis.
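As a rough illustration of what basic event correlation can look like, the sketch below collapses an alert storm into a smaller set of candidate incidents by grouping alerts from the same service that arrive close together in time. The `Alert` fields, the `group_alerts` helper and the 5-minute window are all assumptions made for this example; the report does not describe any specific correlation method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    # Hypothetical alert shape, chosen for illustration only.
    timestamp: float  # seconds since some reference point
    service: str
    message: str


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts per service when each arrives within
    window_seconds of the previous alert in the same group."""
    groups: List[List[Alert]] = []
    latest_group: Dict[str, int] = {}  # service -> index of its most recent group

    for alert in sorted(alerts, key=lambda a: a.timestamp):
        idx = latest_group.get(alert.service)
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alert)
        else:
            groups.append([alert])
            latest_group[alert.service] = len(groups) - 1
    return groups
```

Time-window grouping is only the simplest form of correlation; it reduces alert volume but does not by itself identify a root cause.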

Despite the plethora of AIOps tools in the market, most still automate the resolution of incidents that have already occurred through orchestration workflows and ITSM integration as part of the larger automation strategy. Troubleshooting teams still rely on dashboards to become aware of issues via events, logs or alerts.

An average NOC can receive up to 10,000 alerts per day; 49 percent of SREs respond to at least one major incident every week.

respondents find present systems to be too complex

respondents felt that dashboards and alerts are insufficient

respondents feel they are getting too many alerts

respondents feel there are too many false positives

respondents feel events are going unnoticed

respondents feel they have a fractured view of the enterprise

Transformation to AI-driven prevention is a business priority

The main objective of ITOM should be to prevent future incidents

Advanced analytical capability is a necessity for operations.

AI should also facilitate planning by deriving insights from historical data

Most surveyed organizations are looking at AI and automation as one of the top asks from monitoring initiatives and capabilities in 2020. Prevention of future incidents, quicker detection, or, at the very least, faster root cause analysis were other primary expectations.

Advanced analytics and decisioning systems will continue to be the focal point of operational capability. What would make them more helpful, however, is time-synchronized contextual data from services spread across multiple environments, including business transactions which travel across organizational silos, and diagnostic data about the current state of the system, coupled with logs, configuration changes and code/query snapshots.

Planning is also a requirement which is key to automation success; with multi-cloud environments it is imperative to identify current and potential hotspots via capacity forecasting, and scale intelligently to mitigate issues and manage costs.
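To make the idea of capacity forecasting concrete, here is a minimal sketch that fits a straight-line trend to historical usage and estimates how many periods remain before a capacity limit is reached. The function names, the linear-trend assumption and the sample figures are illustrative assumptions, not a method described in the survey; real workloads usually need seasonality-aware models.

```python
from typing import Optional, Sequence


def linear_forecast(history: Sequence[float], periods_ahead: int) -> float:
    """Least-squares straight-line fit over equally spaced periods,
    extrapolated periods_ahead beyond the last observation."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


def periods_until_capacity(history: Sequence[float], capacity: float,
                           max_periods: int = 36) -> Optional[int]:
    """First future period whose forecast reaches capacity,
    or None if that does not happen within max_periods."""
    for k in range(1, max_periods + 1):
        if linear_forecast(history, k) >= capacity:
            return k
    return None


# Hypothetical monthly peak usage (for example, millions of transactions).
monthly_peak = [10.0, 12.0, 14.0, 16.0]
print(periods_until_capacity(monthly_peak, capacity=25.0))
```

A result such as "capacity reached in N months" is the kind of signal that lets a team scale ahead of demand rather than after an incident.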

95% of respondents are interested in AI-driven self-healing/autonomous systems, for which automation via prediction, prevention and planning is essential.

respondents were interested in AI-driven preventive healing system

respondents felt that they needed advanced analytics capabilities

respondents favored prevention of future incidents

respondents felt capacity forecasting should be done on the basis of future predicted workload

Preventive healing process will bridge Dev, IT and Biz Ops

Fix problems before they occur

Focusing on MTTR and fixing issues faster does not contribute to avoiding outages and improving customer experience. A whole new approach that focuses on prevention is required across DevOps, ITOps, and BizOps silos.

70 percent of infrastructure problems come from poorly applied or failed changes. DevOps should address change management effectively by enabling frequent deployments to better support customer and employee needs. Tools and practices such as CI/CD enable more frequent deployments, especially in complex cloud and hybrid environments.

Establishing automated workflows for activities like monitoring, alerting, root cause analysis, diagnostics, resource provisioning via planning and auto-remediation is paramount for automation readiness. These capabilities should align with and map to critical business processes, so that customers can set priorities based on business value.

Conclusion

High-severity incidents overwhelm enterprises and environments have become more complex to monitor and manage. IT Operations teams find themselves ill-equipped to troubleshoot with the existing tools, which are more reactive than preventive. The adverse impact of downtime on business and the high costs of troubleshooting mean that incident prevention is the need of the hour.

More and more IT leaders are looking at Automation and AI/ML techniques as the solution to the conundrum. Preventive Healing is the way forward to make sense of the overwhelming amount of data being collected, derive insights and prevent incidents in the enterprise.

Enterprises can evolve from AIOps to Preventive Healing by correlating events across multiple silos, bringing in situational awareness via time-synchronized contextual data, highlighting early warnings with diagnostics and root cause suggestions to effect remediation and self-healing, and planning based on workload trends to scale intelligently.

Your Journey To Self Healing Begins with HEAL

HEAL is the industry’s first preventive healing software for IT that can help fix IT problems before they happen. By applying a patented ML technique called workload-behaviour correlation, HEAL can highlight anomalies in system behaviour as a function of the current workload, called early warning signals. Such an advanced alerting mechanism, coupled with contextual data on the state of the system at the time of the anomaly, gives you the time to put in place the measures required to prevent the issue.
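The general idea of judging behaviour relative to workload can be illustrated with a very simple model. The sketch below is not HEAL's patented technique, whose details are not given here; it is a generic stand-in that fits a straight-line baseline of a behaviour metric (say, response time) against workload and flags observations that deviate strongly from what the current workload would predict. All names, the linear model and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple

Baseline = Tuple[float, float, float]  # (slope, intercept, residual spread)


def fit_baseline(workloads: Sequence[float], behaviours: Sequence[float]) -> Baseline:
    """Fit behaviour = intercept + slope * workload by least squares and
    record the spread of residuals seen during normal operation."""
    mx, my = mean(workloads), mean(behaviours)
    sxx = sum((x - mx) ** 2 for x in workloads)
    if sxx == 0:
        raise ValueError("workload must vary to fit a baseline")
    slope = sum((x - mx) * (y - my) for x, y in zip(workloads, behaviours)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(workloads, behaviours)]
    return slope, intercept, pstdev(residuals)


def is_early_warning(workload: float, behaviour: float, baseline: Baseline,
                     threshold: float = 3.0) -> bool:
    """True when behaviour deviates from the workload-conditioned
    expectation by more than threshold residual spreads."""
    slope, intercept, spread = baseline
    expected = intercept + slope * workload
    return abs(behaviour - expected) > threshold * max(spread, 1e-9)


# Hypothetical history: requests per minute vs. response time in ms.
history_load = [100, 200, 300, 400, 500]
history_latency = [52, 98, 151, 199, 250]
baseline = fit_baseline(history_load, history_latency)
```

With this baseline, a 250 ms response time is unremarkable at 500 requests per minute but is flagged at 100 requests per minute, which is the distinction a fixed latency threshold would miss.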

What makes HEAL Unique

Complete preventive solution for enterprise

Autonomous healing to rectify issues, and enable ticketless monitoring

Entire operating context of user journey, application and system in single pane of glass

Appendix

This survey was conducted by an independent market research organization to understand the state of IT Operations.

More than 100 SME IT decision makers were surveyed over a period of 2 months

93% surveyed were private companies

84% surveyed were mid-sized companies; 75% were Technology companies

Companies with revenues ranging from 50 Million USD to 300 Million USD

Categories surveyed:
State of IT Operations: present state of ITOM in the organization
IT Monitoring Tools: average number of tools, which vendors, top use cases and challenges in using the tools
Future State of ITOM: monitoring initiatives and capabilities being implemented in 2021, what AI should do for the organization

Survey Demographics

Industry	% of respondents
Technology	74%
Banking	4%
SaaS	7%
Others	11%

Department	% of respondents
Office CTO and digital	12%
IT Operations	62%
Applications (Software)	10%
Infrastructure (Hardware or cloud)	8%
IT strategy and enterprise architecture	8%

Mission critical applications and transaction volume

Number of mission-critical applications	% of respondents
<10	1%
10-30	21%
30-75	42%
75-100	25%
>150	12%

Transactions per month	% of respondents
<1M	5%
1M-10M	37%
10M-50M	27%
50M-100M	27%
>100M	4%

The largest number of respondents (42%) mentioned having between 30 and 75 mission critical applications, with a majority of survey respondents (79%) possessing more than 30 such applications.

95% of survey respondents mentioned that they did more than 1 million transactions every month, hinting at the inherent complexity in their operations.

Average monthly and high severity business impacting incidents

Average monthly incidents	% of respondents
<1000	8%
1000-4000	42%
4000-7000	38%
7000-10000	7%
>10000	5%

High severity business impacting incidents (2019)	% of respondents
<100	9%
100-400	29%
400-700	33%
700-1000	23%
>1000	6%

91% of survey respondents mentioned they had more than 100 high severity incidents in the last year.

Cost to fix high severity incidents and IT monitoring tool profile

Cost to fix high severity incidents in man-hours	% of respondents
>5000	3%
2500-5000	18%
1000-2500	40%
500-1000	33%
<500	7%

Number of IT monitoring tools used	% of respondents
<3	3-5%
3-5	35%
5-10	53%
>10	10%

A large number of survey respondents (73%) mentioned that they spent between 500 and 2500 man-hours monthly to fix such incidents.

A large number of survey respondents (88%) mentioned that they used between 3 and 10 IT monitoring tools, with the highest number of respondents (53%) indicating that they use between 5 and 10 monitoring tools.
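The two headline figures above are sums over adjacent bins of the appendix tables. A short sketch of that arithmetic follows; the dictionary and function names are made up for this example, and only the bins needed for the two figures are included.

```python
# Shares (% of respondents) copied from the two tables above.
cost_man_hours = {"500-1000": 33, "1000-2500": 40}
tools_used = {"3-5": 35, "5-10": 53}


def combined_share(bins: dict, labels: list) -> int:
    """Add up the respondent share across the selected bins."""
    return sum(bins[label] for label in labels)


print(combined_share(cost_man_hours, ["500-1000", "1000-2500"]))  # 73
print(combined_share(tools_used, ["3-5", "5-10"]))                # 88
```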

Challenges and Biggest operational pain points

Challenges	% of respondents
Present systems are too complex	70%
Dashboards and alerts insufficient	61%
Too costly	65%
Not efficient	34%
Not forward looking, i.e. not able to prevent incidents	35%

Biggest operational pain points	% of respondents
Too many alerts	59%
Too many false positives	45%
Alerts going unnoticed	45%

In other choices, “Events going unnoticed” as an option was chosen by 50% of the public companies, and 67% of Telecom companies. This was also chosen by 78% of the Infrastructure and 67% of the IT Strategy and Enterprise Architecture respondents.

Interest levels in AI self-healing

What should AI do for your organization	% of respondents
Provide advanced analytics	62%
Predict and prevent problems	49%
Find problems fast	45%
Predict future workload	36%
Predict and fix problems	14%

Interested in AI self-healing systems that can prevent incidents from happening?	% of respondents
Yes	95%
No	5%

The responses on AI self-healing emphasize the value and potential of AI self-healing systems.
