State of IT Operations 2021 and The Journey to Self Healing
State of IT Operations
As enterprises make the shift to multi-cloud-based infrastructure, monitoring tools have been hard-pressed to keep up with increased complexity. High-stakes incidents are overwhelming IT Operations teams, and downtime has become more expensive than ever.
Traditional APM and legacy tools are a liability for many organizations and are no longer capable of managing the modern enterprise cloud. Companies are stuck firefighting, and troubleshooting has become more complicated and time-consuming.
To resolve this dichotomy, automation in ITOM, preventive healing (i.e. preventing a problem even before it occurs) and effective capacity forecasting are the way forward.
This report outlines the challenges organizations face as they combat the complexities of the enterprise cloud, and why the journey to preventive healing has made it to the top of the agenda for CIOs and IT Operations leaders in 2021.
Inside this research brief
High stakes incidents overwhelm IT Operations
Transactional volume is growing across mission-critical applications
For organizations running mission-critical applications in multi-cloud environments, domain-specific monitoring capabilities and trained Operations personnel with experience in DevOps and SRE (Site Reliability Engineering) are also required to keep the lights on with minimal or no downtime.
Most respondents have an average of 34 million transactions across an average of 71.4 mission-critical applications, with more than 519 incidents impacting business annually. These incidents can cost an average of 1,774 man-hours per month to fix.
%
respondents have at least 30-45 mission-critical applications
%
respondents have more than 1 million transactions every month
%
respondents have more than 1000 incidents on average monthly
%
respondents have more than 100 high severity incidents in 2019
Existing monitoring toolset is a liability for many organizations
A single monitoring tool is insufficient to provide visibility and observability across multiple environments. Organizations are struggling to rethink their automation strategies in Hybrid Digital Infrastructure environments. Many organizations use around 10 monitoring tools and are unhappy with their existing operational setup.
Despite the overwhelming volume of data being collected, insights are still not being extracted effectively. In addition, teams frequently have to contend with alert storms and a lack of coherent event correlation and root cause analysis.
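To make the idea of event correlation concrete, here is a minimal Python sketch that collapses an alert storm into a handful of correlated groups. It is only an illustration under simple assumptions: the `Alert` fields, the `correlate_alerts` function and the five-minute window are hypothetical choices, not the behaviour of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point (assumed unit)
    service: str      # name of the service that raised the alert
    message: str


def correlate_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts from the same service that arrive close together in time.

    An alert joins the most recent group for its service when it arrives
    within `window_seconds` of that group's latest alert; otherwise it
    starts a new group.
    """
    groups: List[List[Alert]] = []
    latest_group: Dict[str, List[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest_group.get(alert.service)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            latest_group[alert.service] = group
    return groups
```

Grouping by service and time is the simplest possible correlation rule; a real deployment would likely also use topology, transaction paths and configuration changes to decide which alerts belong together.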
Despite the plethora of AIOps tools in the market, most still automate the resolution of incidents that have already occurred through orchestration workflows and ITSM integration as part of the larger automation strategy. Troubleshooting teams still rely on dashboards to become aware of issues via events, logs or alerts.
An average NOC can receive up to 10,000 alerts per day; 49% of SREs respond to at least one major incident every week.
Transformation to AI-driven prevention is a business priority
Existing monitoring toolset is a liability for many organizations
%
respondents find present systems to be too complex
%
respondents felt that dashboards and alerts are insufficient
%
respondents feel they are getting too many alerts
%
respondents feel there are too many false positives
%
respondents feel events are going unnoticed
%
respondents feel they have a fractured view of the enterprise
Advanced analytics and decisioning systems will continue to be the focal point of operational capability. What would make them more helpful, however, is time-synchronized contextual data from services spread across multiple environments, including business transactions which travel across organizational silos, and diagnostic data about the current state of the system, coupled with logs, configuration changes and code/query snapshots.
Planning is also key to automation success: in multi-cloud environments it is imperative to identify current and potential hotspots via capacity forecasting, and to scale intelligently to mitigate issues and manage costs.
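As a rough illustration of capacity forecasting, the Python sketch below fits a straight-line trend to a series of past workload measurements and estimates when the trend would reach a capacity limit. The function names, the linear-trend assumption and the sample figures are all hypothetical; real workloads are often seasonal and would need a richer model.

```python
from typing import List, Optional, Tuple


def fit_linear_trend(values: List[float]) -> Tuple[float, float]:
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(values: List[float], periods_ahead: int) -> List[float]:
    """Extrapolate the fitted trend for the next `periods_ahead` periods."""
    slope, intercept = fit_linear_trend(values)
    last_index = len(values) - 1
    return [intercept + slope * (last_index + k) for k in range(1, periods_ahead + 1)]


def periods_until_capacity(values: List[float], capacity: float,
                           max_horizon: int = 120) -> Optional[int]:
    """First future period in which the trend reaches `capacity`.

    Returns None if that does not happen within `max_horizon` periods.
    """
    for k, predicted in enumerate(forecast(values, max_horizon), start=1):
        if predicted >= capacity:
            return k
    return None
```

For example, with monthly peak loads of 10, 20, 30 and 40 units and a capacity of 70 units, the trend reaches the limit three months out, which leaves time to scale before the hotspot turns into an incident.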
95% of respondents are interested in AI-driven self-healing/autonomous systems, for which automation via prediction, prevention and planning is essential.
Transformation to AI-driven prevention is a business priority
%
respondents were interested in an AI-driven preventive healing system
%
respondents felt that they needed advanced analytics capabilities
%
respondents favored prevention of future incidents
%
respondents felt capacity forecasting should be done on the basis of future predicted workload
Preventive healing process will bridge Dev, IT and Biz Ops
Fix problems before they occur
70% of infrastructure problems come from poorly applied or failed changes. DevOps should address change management effectively by enabling frequent deployments to better support customer and employee needs. Tools and practices such as CI/CD enable more frequent deployments, especially in complex cloud and hybrid environments.
Establishing automated workflows for activities like monitoring, alerting, root cause analysis, diagnostics, resource provisioning via planning and auto-remediation is paramount for automation readiness. These capabilities should align with and map to critical business processes, so that customers can set priorities based on business value.
Conclusion
High-severity incidents overwhelm enterprises and environments have become more complex to monitor and manage. IT Operations teams find themselves ill-equipped to troubleshoot with the existing tools, which are more reactive than preventive. The adverse impact of downtime on business and the high costs of troubleshooting mean that incident prevention is the need of the hour.
More and more IT leaders are looking at Automation and AI/ML techniques as the solution to the conundrum. Preventive Healing is the way forward to make sense of the overwhelming amount of data being collected, derive insights and prevent incidents in the enterprise.
Enterprises can evolve from AIOps to Preventive Healing by correlating events across multiple silos, bringing in situational awareness via time-synchronized contextual data, highlighting early warnings with diagnostics and root cause suggestions to effect remediation and self-healing, and planning based on workload trends to scale intelligently.
Your Journey To Self Healing
Begins with HEAL
HEAL is the industry’s first preventive healing software for IT that can help prevent IT problems before they happen. By applying a patented ML technique called workload-behaviour correlation, HEAL can highlight anomalies in system behaviour as a function of the current workload; these anomalies are called early warning signals. Such an advanced alerting mechanism, coupled with contextual data on the state of the system at the time of the anomaly, gives you the time to put in place the measures required to prevent the issue.
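The general idea of judging system behaviour relative to workload can be sketched in a few lines of Python. To be clear, this is not HEAL's patented technique, whose details are not described here; it is a generic illustration that swaps in an ordinary least-squares baseline, and the function names, the 3-sigma threshold and the sample numbers are all assumptions.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Baseline = Tuple[float, float, float]  # (slope, intercept, residual spread)


def fit_baseline(workload: List[float], metric: List[float]) -> Baseline:
    """Fit metric ~ intercept + slope * workload on historical data.

    Also returns the standard deviation of the residuals, used later as
    the yardstick for what counts as unusual at a given workload.
    """
    mean_w, mean_m = mean(workload), mean(metric)
    sxx = sum((w - mean_w) ** 2 for w in workload)
    if sxx == 0:
        raise ValueError("workload must vary to fit a baseline")
    slope = sum((w - mean_w) * (m - mean_m) for w, m in zip(workload, metric)) / sxx
    intercept = mean_m - slope * mean_w
    residuals = [m - (intercept + slope * w) for w, m in zip(workload, metric)]
    return slope, intercept, pstdev(residuals)


def is_early_warning(baseline: Baseline, workload_now: float, metric_now: float,
                     threshold: float = 3.0) -> bool:
    """Flag a reading that deviates strongly from what the workload predicts."""
    slope, intercept, spread = baseline
    expected = intercept + slope * workload_now
    return abs(metric_now - expected) > threshold * spread
```

With a baseline learned from, say, transactions per minute against CPU utilisation, a CPU reading of 61% at 600 transactions per minute may be perfectly normal, while 60% at 300 transactions per minute would be flagged, because the metric is high for that workload rather than high in absolute terms.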
What makes HEAL Unique
Appendix
This survey was conducted by an independent market research organization to understand the state of IT Operations.
- Categories surveyed:
  - State of IT Operations: present state of ITOM in the organization
  - IT Monitoring Tools: average number of tools, which vendors, top use cases and challenges in using the tools
  - Future State of ITOM: monitoring initiatives and capabilities being implemented in 2021, what AI should do for the organization's capabilities
- 84% surveyed were mid-sized companies; 75% were Technology companies
Survey Demographics
Industry | % of respondents |
---|---|
Technology | 74% |
Banking | 4% |
SaaS | 7% |
Others | 11% |

Department | % of respondents |
---|---|
Office CTO and digital | 12% |
IT Operations | 62% |
Applications (Software) | 10% |
Infrastructure (Hardware or cloud) | 8% |
IT strategy and enterprise architecture | 8% |
Mission critical applications and transaction volume
Number of mission-critical applications | % of respondents |
---|---|
<10 | 1% |
10-30 | 21% |
30-75 | 42% |
75-100 | 25% |
>150 | 12% |

Transactions per month | % of respondents |
---|---|
<1M | 5% |
1M-10M | 37% |
10M-50M | 27% |
50M-100M | 27% |
>100M | 4% |
The largest number of respondents (42%) mentioned having between 30 and 75 mission critical applications, with a majority of survey respondents (79%) possessing more than 30 such applications.
95% of survey respondents mentioned that they did more than 1 million transactions every month, hinting at the inherent complexity in their operations.
Average monthly and high severity business impacting incidents
Average monthly incidents | % of respondents |
---|---|
<1000 | 8% |
1000-4000 | 42% |
4000-7000 | 38% |
7000-10000 | 7% |
>10000 | 5% |

High severity business impacting incidents (2019) | % of respondents |
---|---|
<100 | 9% |
100-400 | 29% |
400-700 | 33% |
700-1000 | 23% |
>1000 | 6% |
Cost to fix high severity incidents and IT monitoring tool profile
Cost to fix high severity incidents in man-hours | % of respondents |
---|---|
>5000 | 3% |
2500-5000 | 18% |
1000-2500 | 40% |
500-1000 | 33% |
<500 | 7% |

Number of IT monitoring tools used | % of respondents |
---|---|
<3 | 3-5% |
3-5 | 35% |
5-10 | 53% |
>10 | 10% |
A large number of survey respondents (88%) mentioned that they used between 3 and 10 IT monitoring tools, with the highest number of respondents (53%) indicating that they use between 5 and 10 monitoring tools.
Challenges and Biggest operational pain points
Challenges | % of respondents |
---|---|
Present systems are too complex | 70% |
Dashboards and alerts insufficient | 61% |
Too costly | 65% |
Not efficient | 34% |
Not forward looking, i.e. not able to prevent incidents | 35% |

Biggest operational pain points | % of respondents |
---|---|
Too many alerts | 59% |
Too many false positives | 45% |
Alerts going unnoticed | 45% |
Interest levels in AI self-healing
What should AI do for your organization | % of respondents |
---|---|
Provide advanced analytics | 62% |
Predict and prevent problems | 49% |
Find problems fast | 45% |
Predict future workload | 36% |
Predict and fix problems | 14% |

Interested in AI self-healing systems that can prevent incidents from happening? | % of respondents |
---|---|
Yes | 95% |
No | 5% |