In my years of working with ITIL frameworks, I’ve come to appreciate the simplicity and power of the Incident Management Life Cycle. Incidents are inevitable. Even with a robust service history, new and unique challenges emerge. Regardless of their nature, the management process remains consistent. Let me walk you through the key steps of this life cycle and share how they play out in a business scenario.
Step 1: Incident Identification
Incidents don’t just announce themselves. Thus, we need systems to identify them proactively. Identification often starts with triggers. Here are some common examples:
- Monitoring and Event Management: Automated tools, like event management systems, detect anomalies and create incident tickets. For instance, a monitoring tool might poll a server every minute. If the server goes down and misses three consecutive polls, an event is logged, which generates a ticket.
- Phone Calls: Users still pick up the phone to report disruptions. IT staff might also call the service desk when they spot an issue.
- Email or Chat: Many users prefer typing out their problems through email or chat tools, which service desk agents turn into incident tickets.
- Web Interfaces: Self-service portals allow users to log incidents directly. While efficient, this method can increase the chances of misidentified issues.
For example, a retail chain’s point-of-sale systems start failing intermittently. The monitoring system logs an incident, and a cashier’s phone call adds more context. With these triggers identified, the incident management process begins.
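The polling trigger described under Monitoring and Event Management can be sketched in a few lines of Python. This is a minimal illustration, not how any particular monitoring product works: the `PollMonitor` class, its fields, and the host name are hypothetical, and the threshold of three missed responses comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class PollMonitor:
    """Tracks consecutive missed polls for one host (hypothetical helper)."""
    host: str
    threshold: int = 3      # missed responses before an event is raised
    missed: int = 0
    alerted: bool = False   # True once an event was raised for this outage

    def record_poll(self, responded: bool) -> bool:
        """Record one poll result; return True if this poll raises an event."""
        if responded:
            # Any successful response ends the current streak of misses.
            self.missed = 0
            self.alerted = False
            return False
        self.missed += 1
        if self.missed >= self.threshold and not self.alerted:
            # Raise the event only once per outage to avoid duplicate tickets.
            self.alerted = True
            return True
        return False


monitor = PollMonitor(host="pos-gateway-01")  # hypothetical host name
results = [monitor.record_poll(ok) for ok in (True, False, False, False, False)]
print(results)  # [False, False, False, True, False]
```

Only the third consecutive miss raises an event. Later misses in the same outage stay silent, so one outage maps to one ticket.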
Step 2: Incident Logging
Every identified incident must be logged. The details include:
- Incident summary
- Incident description
- Impact
- Urgency
- Priority
- Category
- End user name
- End user team name
- Incident logger name
- Time of logging the incident
- Incident medium (phone/chat/web/email)
- Related CI
- Assigned resolver group
- Assigned engineer
- Status
- Resolution code
- Time of resolution/closure
For instance, when the point-of-sale issue is logged, the system captures that it’s affecting multiple stores. The service desk uses tools like ServiceNow to map the incident to the impacted system and pull relevant knowledge articles for faster resolution.
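One way to picture the logged details is as a single record type. The sketch below assumes a plain Python dataclass. The field names are my own shorthand for the list above and are not the schema of ServiceNow or any other tool. Fields that are only known later in the life cycle default to empty.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

ALLOWED_MEDIA = {"phone", "chat", "web", "email"}


@dataclass
class IncidentRecord:
    # Captured when the incident is logged
    summary: str
    description: str
    impact: str
    urgency: str
    priority: int
    category: str
    end_user: str
    end_user_team: str
    logged_by: str
    medium: str
    related_ci: str
    # Filled in as the incident moves through its life cycle
    resolver_group: Optional[str] = None
    assigned_engineer: Optional[str] = None
    status: str = "New"  # assumed initial status name
    resolution_code: Optional[str] = None
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    resolved_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        # Reject media outside the four listed above.
        if self.medium not in ALLOWED_MEDIA:
            raise ValueError(f"unknown incident medium: {self.medium!r}")


# Illustrative values for the retail example; all names are placeholders.
incident = IncidentRecord(
    summary="Point-of-sale systems failing intermittently",
    description="Failures reported across multiple stores.",
    impact="high", urgency="high", priority=1,
    category="network",
    end_user="Store cashier", end_user_team="Store operations",
    logged_by="Service desk agent", medium="phone",
    related_ci="pos-network",
)
print(incident.status, incident.resolved_at)
```

Keeping the resolution fields optional means the same record can be created at logging time and completed at closure.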
Step 3: Incident Categorization
Categorization ensures the incident goes to the right team. Incorrect categorization, in turn, delays resolution.
For instance, if the point-of-sale issue is categorized under “applications” instead of “network,” it might go to the wrong team. Time is lost before it’s reassigned. Automation tools can help by analyzing keywords to suggest appropriate categories, though manual oversight remains crucial.
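Keyword-based suggestion can be as simple as counting matches per category. The sketch below is a deliberately naive version under stated assumptions: the categories and keyword sets are invented for illustration, and the function only suggests a category, leaving the final call to a person.

```python
import re
from typing import Optional

# Hypothetical keyword sets; a real service desk would maintain its own.
CATEGORY_KEYWORDS = {
    "network": {"router", "switch", "vpn", "latency", "timeout", "connection"},
    "applications": {"crash", "login", "error", "freeze", "update"},
    "hardware": {"printer", "scanner", "screen", "keyboard"},
}


def suggest_category(text: str) -> Optional[str]:
    """Return the category with the most keyword hits, or None if no hits."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        category: sum(word in keywords for word in words)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No hits at all means there is nothing to suggest; a human decides.
    return best if scores[best] > 0 else None


print(suggest_category("POS terminals lose connection, router timeout"))
print(suggest_category("Coffee machine is empty"))
```

Returning `None` when nothing matches keeps the tool from guessing, which is where the manual oversight mentioned above comes in.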
Step 4: Incident Prioritization
Priority is determined by two factors: impact and urgency. Impact is the effect of the incident on the business. Urgency is a measure of how quickly the incident needs to be resolved. A highly urgent incident may require the majority of staff to address it immediately.
An incident with a high impact and a high urgency would be assigned a priority of 1. An incident with a low impact and a low urgency would be assigned a priority of 4 or 5. Other combinations of impact and urgency fall in between.
Consider these examples:
- High Priority: A downed network affecting all stores during peak hours.
- Low Priority: A minor display glitch in a non-critical system.
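The impact and urgency combination is often expressed as a matrix. The sketch below assumes three levels for each factor and a five-point priority scale, which is one common convention rather than a fixed rule. Only the two corner cases (high/high is 1, low/low is 4 or 5) come from the description above. The values in between are my assumption.

```python
# Assumed three-level scale; 1 is the most severe level.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> int:
    """Map impact and urgency to a priority from 1 (highest) to 5 (lowest)."""
    try:
        return LEVELS[impact.lower()] + LEVELS[urgency.lower()] - 1
    except KeyError as exc:
        raise ValueError(f"unknown level: {exc.args[0]!r}") from exc


print(priority("high", "high"))   # 1
print(priority("high", "low"))    # 3
print(priority("low", "low"))     # 5
```

A lookup like this makes prioritization repeatable, so two agents logging the same incident arrive at the same priority.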
Step 5: Diagnosis and Investigation
Initial diagnosis starts with the service desk. They ask questions to narrow down the problem:
- What’s not working?
- When did it start?
- Who is impacted, and are others affected too?
- What is the user expecting?
- What is the sequence of steps that caused the incident?
If unresolved, the incident escalates to specialized teams. For the retail example, the network team might discover a faulty router causing the outages. They’ll dive deeper, checking if recent changes or similar past incidents provide clues.
Step 6: Resolution and Recovery
Resolution focuses on applying the right fix. For widespread issues, the fix must be tested before the incident is declared resolved.
In the retail scenario, replacing the faulty router resolves the issue. The team monitors the network for a week to ensure stability. Regular check-ins with stakeholders keep everyone informed.
Step 7: Incident Closure
After resolution, confirmation with the user ensures the issue is truly fixed. Some organizations prefer auto-closure after a set time if users don’t respond. Surveys follow to gather feedback on:
- Resolution timeliness
- Ease of reporting
- Communication throughout
In our example, the service desk confirms with store managers before closing the ticket. A survey reveals satisfaction with the quick resolution but highlights a need for better initial communication.
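The closure rule, including auto-closure when users don’t respond, can be sketched as a small decision function. The three-day window, the response labels, and the action names are assumptions for illustration. Each organization sets its own policy.

```python
from datetime import datetime, timedelta
from typing import Optional

AUTO_CLOSE_AFTER = timedelta(days=3)  # assumed waiting period


def closure_action(
    resolved_at: datetime, user_response: Optional[str], now: datetime
) -> str:
    """Decide what to do with a resolved ticket (hypothetical labels)."""
    if user_response == "confirmed":
        return "close"
    if user_response == "not_fixed":
        return "reopen"
    # No response yet: close automatically once the window has passed.
    if now - resolved_at >= AUTO_CLOSE_AFTER:
        return "auto_close"
    return "wait"


resolved = datetime(2024, 1, 1, 9, 0)
print(closure_action(resolved, "confirmed", datetime(2024, 1, 1, 12, 0)))
print(closure_action(resolved, None, datetime(2024, 1, 2, 9, 0)))
print(closure_action(resolved, None, datetime(2024, 1, 4, 9, 0)))
```

An explicit user response always takes precedence over the timer, so a ticket the user reports as not fixed is never auto-closed.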
Conclusion
The Incident Management Life Cycle ensures continuity and efficiency. By following these steps, businesses can resolve disruptions swiftly and effectively. Whether it’s automating detection or refining prioritization, there’s always room to optimize. The retail chain’s response to the point-of-sale outage shows how the ITIL framework helps organizations tackle challenges head-on. What’s your approach to handling incidents? Let’s discuss!