Incident Manager
Incident Management/Major Incident Management (primary responsibility)
Identify that an incident has occurred or is occurring and respond accordingly
Utilize KB articles, incident logs, etc. to quickly determine next steps
Engage the appropriate resources needed to contain the incident
Establish a communication bridge for resources to collaborate
Send communications to the appropriate audience at regular intervals during the incident
Log all activities that occur during the incident
Moderate the communication bridge to keep the team focused on containment, minimizing mean time to restore (MTTR)
Escalate to management as needed
Event Management (secondary responsibility)
Use EM tools to improve detection and response times to incidents
Reduce downtime by proactively detecting performance anomalies before they become a widespread system-down incident
Recognize the need for additional alarms or modified alarm thresholds based on past incidents
Problem Management (secondary responsibility)
Detect that a problem exists, i.e., repeat incidents of the same type
Log the problem and assemble a team to work the problem
Log a known error and any workaround
Facilitate the technical team as they resolve the problem
Document the problem resolution and ensure any KB articles are updated
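As a rough illustration of the detection step above, repeat incidents of the same type can be surfaced by counting incident types in the incident log. This is a minimal sketch only: the function name problem_candidates, the min_repeats threshold, and the sample type labels are hypothetical and do not come from any particular ITSM tool.

```python
from collections import Counter


def problem_candidates(incident_types, min_repeats=2):
    """Return incident types seen at least min_repeats times.

    incident_types is an iterable of type labels, one per logged incident.
    The threshold of 2 is an assumed default, not a prescribed value.
    """
    counts = Counter(incident_types)
    return {t: n for t, n in counts.items() if n >= min_repeats}


# Hypothetical incident log, one type label per incident.
log = ["db-failover", "disk-full", "db-failover", "cert-expired", "db-failover"]
candidates = problem_candidates(log)
print(candidates)
```

Any type returned here would be a candidate for logging as a problem and assigning a team; the threshold and grouping key would need to match how incidents are actually categorized.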
Metrics
Maintain key performance metrics
MTTR (mean time to restore)
MTTK (mean time to know)
Incidents with defective or non-existent alarms
Mean Time to Detect/Know
Unplanned Outage count and duration
Planned Outage count and duration
Incident counts by Portfolio
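The time-based metrics above reduce to simple averages over incident records. The sketch below assumes each incident carries three timestamps (when impact started, when it was detected, when service was restored) and measures both MTTK and MTTR from the start of impact; some teams measure MTTR from detection instead. The Incident record, its field names, and the sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    portfolio: str
    started: datetime   # when impact began (assumed to be known)
    detected: datetime  # when the team became aware of it
    restored: datetime  # when service was restored


def mean_delta(deltas):
    """Average a sequence of timedeltas; zero if the sequence is empty."""
    deltas = list(deltas)
    if not deltas:
        return timedelta(0)
    return sum(deltas, timedelta(0)) / len(deltas)


def mttk(incidents):
    """Mean time to know: start of impact to detection."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to restore: start of impact to restoration (assumed)."""
    return mean_delta(i.restored - i.started for i in incidents)


def counts_by_portfolio(incidents):
    """Number of incidents per portfolio."""
    return Counter(i.portfolio for i in incidents)


# Hypothetical sample data.
incidents = [
    Incident("Payments", datetime(2024, 1, 1, 9, 0),
             datetime(2024, 1, 1, 9, 10), datetime(2024, 1, 1, 10, 0)),
    Incident("Payments", datetime(2024, 1, 2, 14, 0),
             datetime(2024, 1, 2, 14, 20), datetime(2024, 1, 2, 14, 30)),
    Incident("Search", datetime(2024, 1, 3, 8, 0),
             datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 8, 30)),
]

print("MTTK:", mttk(incidents))
print("MTTR:", mttr(incidents))
print("By portfolio:", dict(counts_by_portfolio(incidents)))
```

Outage counts and durations, planned or unplanned, could be derived the same way by adding a flag to the record and filtering before averaging.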