Source: thefastmode.com
If you spend time digging into the Machine Learning offerings from vendors in the Telecom Service Assurance space today, you will see that many of them are positioning for a wholesale replacement of existing Service Assurance tools, complete with an entirely new user front-end. While this might sound enticing, it does not solve the underlying issue Network Operations Centers face today; instead, it wastes resources.
Alert fatigue is real
When Tier 1 Network Operations Center (“NOC”) and Command Center technicians are asked to identify the biggest issues they face, many people are surprised that features and capabilities are not at the top of the list. Instead, NOC technicians point to alert fatigue: too many events are being ingested and surfaced in fault management and ticket queues for them to have a clear view of what is real and what is not.
Alert fatigue occurs when an NOC technician is exposed to a high volume of frequent events (alerts) and becomes desensitized to them over time. The concept is not new, nor is it specific to Network Operations Centers. It is a psychological phenomenon that plagues operators across many industries, including Service Assurance. It has been the focus of multiple studies, which found that an increase in alerting decreases the likelihood of a technician responding effectively and in a timely manner. For example, a recent hospital study by the ESRI Institute found that excess alerting caused patients and hospital staff to suffer increased anxiety and alarm desensitization, and that this alert fatigue resulted in reduced quality of patient care and missed critical events. In Service Assurance organizations, it has become standard practice to raise alert thresholds or filter queues down to major or critical events in order to minimize alert fatigue.
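As a rough illustration of that practice, the sketch below shows severity-threshold filtering in a few lines of Python. The event fields and severity scale are assumptions for illustration only, not any particular vendor's schema.

```python
# Illustrative sketch: severity-threshold filtering as commonly applied in
# fault management to curb alert fatigue. Fields and severity levels are
# hypothetical, not a specific product's event model.

SEVERITY = {"info": 0, "minor": 1, "major": 2, "critical": 3}

def filter_events(events, min_severity="major"):
    """Keep only events at or above the configured severity threshold."""
    threshold = SEVERITY[min_severity]
    return [e for e in events if SEVERITY[e["severity"]] >= threshold]

events = [
    {"id": 101, "severity": "minor", "msg": "Link utilization above 80%"},
    {"id": 102, "severity": "critical", "msg": "BGP session down"},
    {"id": 103, "severity": "major", "msg": "High packet loss on core router"},
]

# Raising the threshold shrinks the queue, but at the cost of visibility
# into the lower-severity events that were filtered away.
print(filter_events(events, min_severity="major"))
```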
Telecom Service Provider organizations are rightfully looking for ways to address alert fatigue by reducing events and focusing agents on the root cause of the problem. Many are implementing Artificial Intelligence for IT Operations (“AIOps”) projects that take a “throw the baby out with the bathwater” approach to resolving the effects of alert fatigue. These projects focus on replacing much of the existing toolset and delivering a completely new user interface. Doing so, however, only increases the complexity, cost, and time required to deploy machine learning in Network Operations. Most legacy Service Assurance systems have evolved over many years, with significant effort already invested in ingesting and formatting data and integrating it with other systems. Similarly, the work queues have evolved to encode deep knowledge of the event handling and incident management processes, knowledge made even more complex by the numerous mergers and acquisitions that are commonplace in the Telecom industry. Replacing these functions only complicates the adoption of AIOps.
Context counts in reducing alert fatigue
The objective in implementing AIOps is to deliver machine learning context in the most consumable form that is also the least disruptive to current operational processes. The best approach is to introduce a machine learning intelligence layer into the Service Assurance architecture and integrate it with the existing toolset, where it can ingest data and push the output of machine learning back into the systems the agents already use. The underlying value and the improvement in addressing alert fatigue will come from innovation in the machine learning algorithms; anything else is a distraction that only delays the value of AIOps. This is not to say that processes will not change as a result of machine learning, but this approach allows them to change within the existing systems rather than requiring their replacement.
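To make that pattern concrete, here is a minimal Python sketch of an intelligence layer that pulls events from the existing fault-management system, correlates them, and writes its conclusions back into the same queue the agents already work from. The connector interface (fetch_events, attach_annotation) and the simple time-window grouping are illustrative assumptions standing in for real product integrations and the actual machine learning algorithms.

```python
# Minimal sketch of the "intelligence layer" pattern: ingest from the existing
# toolset, correlate, and push results back into the existing work queue.
# The fault_mgr connector methods used here are hypothetical placeholders.

from collections import defaultdict

def correlate(events, window_seconds=300):
    """Group events that likely share a root cause.

    Stand-in heuristic: bucket events by (site, 5-minute window). In a real
    deployment this is where the machine learning correlation model would run.
    """
    groups = defaultdict(list)
    for e in events:
        key = (e["site"], e["timestamp"] // window_seconds)
        groups[key].append(e)
    return list(groups.values())

def enrich_existing_queue(fault_mgr):
    """Annotate the existing fault-management queue instead of replacing it."""
    events = fault_mgr.fetch_events()  # ingest from the current toolset
    for group in correlate(events):
        root = min(group, key=lambda e: e["timestamp"])  # earliest event as candidate cause
        for e in group:
            # Push the ML output into the system the agents already use.
            fault_mgr.attach_annotation(
                e["id"], related_to=root["id"], note="probable common root cause"
            )
```

The key design choice the sketch reflects is that the intelligence layer owns the correlation logic, while presentation and workflow stay in the tools and processes the NOC already runs.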