Traditionally, IT operations has been about fixing what is broken and rolling out new components. Rolling out new components and doing upgrades often causes systems to go down as well. AI (artificial intelligence) can help in both scenarios: it can help avoid downtime proactively and, when downtime does happen, it can help resolve the issues a lot faster. In this article, I discuss both, using IBM Watson AIOps as my reference.
Performance degradation is an indication that IT systems are under strain, and there are many metrics that can be tracked to measure performance. In IBM Watson AIOps, the metrics are observed over time and a normal behaviour pattern is recorded. This pattern is quite granular and will be specific to, for example, midday on a Friday. If the metrics start acting outside of the normal behaviour pattern, the operators are alerted. An anomaly is usually an indicator that something is going to fail. Investigating the situation at this point and fixing the problem can avoid downtime proactively.
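To make the idea of a granular baseline concrete, here is a minimal sketch in Python. It is not how IBM Watson AIOps is implemented; it only illustrates the general technique of learning a "normal" value per time slot (for example, Fridays at midday) and flagging readings that stray too far from it. The function names, the (weekday, hour) slot and the three-standard-deviation threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def build_baseline(samples):
    """Learn a per-slot baseline from historical (timestamp, value) samples.

    The slot is (weekday, hour), an assumed granularity chosen so that
    "midday on a Friday" gets its own notion of normal.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    # stdev needs at least two points, so sparse slots are skipped.
    return {
        slot: (mean(values), stdev(values))
        for slot, values in buckets.items()
        if len(values) >= 2
    }


def is_anomalous(baseline, ts, value, threshold=3.0):
    """Return True if the value is far outside the normal range for its slot."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # no history for this slot, so nothing to compare against
    mu, sigma = baseline[slot]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical CPU readings (percent) for five consecutive Fridays at 12:00.
history = [
    (datetime(2024, 1, 5, 12), 50.0),
    (datetime(2024, 1, 12, 12), 52.0),
    (datetime(2024, 1, 19, 12), 48.0),
    (datetime(2024, 1, 26, 12), 51.0),
    (datetime(2024, 2, 2, 12), 49.0),
]
baseline = build_baseline(history)
alert = is_anomalous(baseline, datetime(2024, 2, 9, 12), 90.0)
```

A reading of 90 on the following Friday at midday is flagged, while a reading of 51 is not. A real system would track many metrics and use a more robust model, but the principle of comparing against a time-specific normal is the same.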
When rolling out new components or configuration, the risk of downtime can be mitigated by looking at what happened when similar work was rolled out previously. IBM Watson AIOps takes as input historical data from service desk programs like ServiceNow and chat streams from programs such as Microsoft Teams or Slack, as well as other sources. From this input it gleans what problems were encountered when changes similar to those proposed were made previously, so that these problems can be avoided and the risk of downtime reduced.
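As a rough illustration of looking up similar past changes, the sketch below compares a proposed change against a small history using word overlap (Jaccard similarity) and returns the problems recorded against the closest matches. This is a deliberately simple stand-in, not the method IBM Watson AIOps uses; the record layout, the function names and the 0.3 similarity cut-off are assumptions.

```python
def tokenize(text):
    """Split a description into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def past_problems_for(change_description, history, min_similarity=0.3):
    """Find problems recorded against similar past changes.

    `history` is a list of (description, problems) pairs, a hypothetical
    layout standing in for service desk and chat records.
    """
    target = tokenize(change_description)
    matches = []
    for description, problems in history:
        score = jaccard(target, tokenize(description))
        if score >= min_similarity and problems:
            matches.append((score, description, problems))
    matches.sort(key=lambda match: match[0], reverse=True)
    return matches


# Hypothetical change history.
history = [
    ("upgrade database driver on payment service", ["connection pool exhausted"]),
    ("rotate tls certificates on gateway", ["clients rejected new cert"]),
    ("upgrade database driver on billing service", []),
]
matches = past_problems_for("upgrade database driver on orders service", history)
```

Here only the payment service upgrade is returned: it is similar to the proposed change and had a problem recorded against it, so the operator knows to watch the connection pool during the rollout.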
Some problems occur regularly and repeatedly. If events and errors are only looked at in the present moment, the fact that an event occurs regularly might be missed. It also means that operators waste time trying to solve the same problem again and again. IBM Watson AIOps identifies these regular, or seasonal, events. They can then be investigated and fixed once, with the regularity itself serving as an input into resolving them.
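One simple way to spot a regular event is to look at the gaps between its occurrences: if the gaps are nearly constant, the event is periodic. The sketch below shows that idea only; it is an assumption about how such detection could work, not a description of the IBM Watson AIOps algorithm, and the function name and tolerance are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def detect_period(timestamps, tolerance=0.1, min_occurrences=4):
    """Return the average gap if the event recurs at a near-constant interval.

    The event counts as regular when the spread of the gaps is small
    relative to their average (coefficient of variation <= tolerance).
    Returns None when there is too little data or no clear rhythm.
    """
    if len(timestamps) < min_occurrences:
        return None
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    average = mean(gaps)
    if average == 0:
        return None
    if pstdev(gaps) / average <= tolerance:
        return timedelta(seconds=average)
    return None


# Hypothetical error that shows up every Monday around 02:00.
start = datetime(2024, 1, 1, 2, 0)
jitter_minutes = [0, 5, -3, 2, 0]
weekly_errors = [start + timedelta(weeks=i, minutes=m)
                 for i, m in enumerate(jitter_minutes)]
period = detect_period(weekly_errors)
```

For the weekly error above, the detected period is about seven days, which points the investigation at whatever else runs on that schedule, such as a weekly batch job or backup.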
Some events are expected to occur with regularity, like generator start-up tests, which run weekly or monthly to ensure the generators are still functioning properly. These events should be noted, but do not require any action. In fact, the opposite is the case: action needs to be taken if they do not occur. IBM Watson AIOps can be set up to suppress the regular events that do occur and only send an alert if an event of this type is not received.
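Alerting on the absence of an event is sometimes called a heartbeat or dead man's switch check. The following sketch shows the logic in its simplest form, assuming a hypothetical `ExpectedEventMonitor` class, a weekly generator test and a one-hour grace period. It does not represent the actual IBM Watson AIOps configuration.

```python
from datetime import datetime, timedelta


class ExpectedEventMonitor:
    """Suppress a routine event when it arrives; alert when it goes missing."""

    def __init__(self, name, period, grace=timedelta(hours=1)):
        self.name = name
        self.period = period  # how often the event should occur
        self.grace = grace    # allowed lateness before alerting
        self.last_seen = None

    def on_event(self, timestamp):
        """Record the routine event. Nothing is returned, so no alert is raised."""
        self.last_seen = timestamp

    def check(self, now):
        """Return an alert message if the event is overdue, otherwise None."""
        if self.last_seen is None:
            return None  # nothing recorded yet, so no deadline to miss
        deadline = self.last_seen + self.period + self.grace
        if now > deadline:
            return f"{self.name} overdue by {now - deadline}"
        return None


# Hypothetical weekly generator start-up test.
monitor = ExpectedEventMonitor("generator start-up test", timedelta(days=7))
monitor.on_event(datetime(2024, 1, 1, 6, 0))
on_time = monitor.check(datetime(2024, 1, 8, 6, 30))  # still within the grace period
overdue = monitor.check(datetime(2024, 1, 8, 8, 0))   # the test has not been seen
```

The first check stays silent because the next test is not yet late. The second check produces an alert because more than a week plus the grace period has passed without the event arriving.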
Of course, there will still be situations that cannot be prevented, and in these situations AI can be used to resolve the issue quickly and reduce the downtime. IBM Watson AIOps uses feeds from a number of different sources of structured and unstructured data. These sources range from chat programs like Slack or Microsoft Teams, to service desk programs, to event management systems. From these sources, IBM Watson AIOps recognises what was done the last time a similar error occurred and suggests ways to solve the current problem. It can do this a lot faster than a human can, and the more sources it uses, the better the suggestions are.
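To show what suggesting fixes from past incidents could look like in principle, the sketch below ranks earlier incidents from several sources by textual similarity to the current error (bag-of-words cosine similarity) and returns their recorded resolutions. The real product works on far richer data and models; the incident fields, source names and function names here are assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors, from 0.0 to 1.0."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def suggest_resolutions(error_text, past_incidents, top_n=2):
    """Return (source, resolution) pairs for the most similar past incidents."""
    query = vectorize(error_text)
    scored = [(cosine(query, vectorize(incident["summary"])), incident)
              for incident in past_incidents]
    scored = [item for item in scored if item[0] > 0]  # drop unrelated incidents
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(incident["source"], incident["resolution"])
            for _, incident in scored[:top_n]]


# Hypothetical records gathered from different tools.
past_incidents = [
    {"source": "service desk", "summary": "disk full on database server",
     "resolution": "purge old archive logs"},
    {"source": "chat", "summary": "login page timeout after deploy",
     "resolution": "roll back the auth service"},
    {"source": "event manager",
     "summary": "database server disk usage above 95 percent",
     "resolution": "extend the data volume"},
]
suggestions = suggest_resolutions("database server disk full", past_incidents)
```

The two disk-related incidents are returned, closest match first, and the unrelated login timeout is dropped. Adding more sources gives the ranking more history to draw on, which is why more feeds tend to produce better suggestions.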
IBM is putting significant effort into Watson AIOps and is bringing out new features every few months. Most of the development is on the containerised environment, but it does still run on traditional environments too. In the next generations of the Watson AIOps solution, IBM is looking to help deliver fully instrumented, self-aware, automated and autonomic IT operations environments. AIOps solutions will not only help resolve issues in a reactive mode, but also help prevent issues from happening in the first place.
At Envisage, we specialise in IBM Watson AIOps and would gladly speak to you about all the features and even go through a demo with you. Please contact me on bsmythe@envisage.email or (084) 424 7448 if you want to know more or would like to see a demo.