Incident Response Manager (contract role)

Salary undisclosed

Are you passionate about compliance and data governance? Do you thrive in a dynamic and fast-paced environment? We are seeking a highly motivated Incident Response Manager to join our Production Engineering Team. In this role, you will oversee incident response, root cause analysis, and uptime tracking for our production data flows. The role requires a highly communicative, technical, and detail-oriented professional who can drive conversations across teams, triage critical production issues, and ensure timely resolutions. The ideal candidate balances being personable with being assertive, ensuring that incidents are thoroughly investigated and corrective actions are taken.

Key Responsibilities

Incident Response & Management

  • Own and manage the incident lifecycle for production data flows, ensuring rapid mitigation of issues.
  • Facilitate real-time incident response, collaborating with development teams to restore functionality quickly.
  • Regularly triage and track P1 (Critical) and P2 (High) incidents via JIRA, ensuring timely resolutions.
  • Use incident response and escalation tools (e.g., PagerDuty, Opsgenie) to track incident statistics such as MTTR (Mean Time to Resolution) and MTTA (Mean Time to Acknowledge).

Root Cause Analysis (RCA) & Post-Incident Reviews

  • Lead and facilitate recurring blameless postmortem/RCA meetings, ensuring proper documentation and follow-up actions.
  • Work with developers to identify and document the true root causes behind production failures.
  • Ensure that RCA findings lead to actionable engineering improvements (e.g., better monitoring, code changes, automation).

Uptime & SLA Tracking

  • Regularly track and report on uptime and performance metrics, ensuring compliance with published SLAs.
  • Partner with engineering teams to proactively prevent incidents by analyzing trends and identifying weak points.

Stakeholder Communication & Collaboration

  • Act as the primary incident point of contact, ensuring clear and structured communication with technical and business stakeholders.
  • Proactively drive postmortems and conversations across teams, ensuring that teams take ownership of action items post-incident.
  • Work closely with DevOps, development teams, and leadership to prevent repeat failures.

Process & Operational Improvements

  • Continuously refine the incident management process, ensuring efficiency and effectiveness.
  • Work with development teams to improve observability, monitoring, and alerting.
  • Advocate for automation and tooling improvements to minimize operational overhead.

Required Skills/Experience

  • Solid understanding of software development, APIs, data flows, and system architectures.
  • Ability to read and interpret logs, dashboards, and monitoring tools to diagnose issues quickly.
  • 3-7 years of experience in incident response management, site reliability engineering (SRE), DevOps, or a related role.
  • Strong background in production support, RCA facilitation, and cross-team collaboration.
  • Outgoing personality with a forceful-yet-personable approach to getting things done.
  • Ability to drive conversations across engineering, product, and leadership teams.
  • Clear and structured communicator—able to escalate and de-escalate incidents effectively.
  • Experience managing incidents using JIRA, PagerDuty, Opsgenie, ServiceNow, or similar tools.
  • Familiarity with observability stacks (e.g., Datadog, Splunk, Prometheus, New Relic, Grafana) is a plus.
  • Excellent ability to track, document, and analyze trends in incident data.
  • Ability to hold teams accountable for action items post-RCA.

About Our Company

Accelerant is a data-driven, technology-fuelled insurance platform that empowers underwriters to serve their insureds more effectively. We are using advanced data intelligence tools to rebuild the way that underwriters share and exchange risk. With a current focus on the small and medium-sized businesses that power our global economy and their niche insurance needs, we align incentives to improve outcomes for everyone. Our full-service risk exchange supports our carefully selected, best-in-class network of underwriting teams. We leverage granular information on each policy to deliver unprecedented insight into insurance pools, and our specialty portfolio is fully diversified with very low catastrophe, aggregation or systemic risk. We are proud to have been awarded an AM Best A- (Excellent) rating. For more information, please visit https://accelerant.ai/.

Enjoy our comprehensive benefits package designed to meet your diverse needs and support your well-being:

Work-life balance: We believe that taking time to rest and recharge makes us all better. That’s why we offer flexible time off and encourage our team to take the time they need to prioritize their health and well-being.

Health and wellness: We offer high-quality health, dental, and other benefits to ensure our team members have access to the care they need.

Remote work: Work where you’re most productive and fulfilled. This position is open to remote candidates across the U.S., Canada, the UK, and Europe who have the flexibility to work with our teams distributed across Europe and North America. Most cross-team collaboration happens in the morning, Eastern Time.

Travel: We value face-to-face connections and believe that in-person interactions can enhance collaboration and build stronger relationships. Travel could be a small part of your role, with opportunities to connect with your team and our members in person.
