
Site Reliability Engineer

Salary undisclosed


Job Description

Description:

CENTEGIX is the industry leader in wearable safety technology for healthcare, education, government, and commercial workplaces, with over 600,000 badges in use. The cloud-based CENTEGIX Safety Platform initiates the fastest response time for emergencies, from the everyday to the extreme. Leaders in over 12,000 locations nationwide trust CENTEGIX's innovative safety solutions to empower and protect people every day.

Purpose

The Site Reliability Engineer (SRE) will take ownership of the Observability component within the Platform Team, responsible for monitoring, logging, and alerting systems that ensure platform reliability and performance. This role is crucial to maintaining the health of our infrastructure, enabling efficient operations, and proactively detecting and resolving issues. The SRE will also collaborate closely with platform engineers and development teams to automate processes and improve system scalability and security.

Position Responsibilities

  • Observability Ownership: Design, implement, and maintain the observability stack, including Grafana, Prometheus, Loki, and Tempo, ensuring end-to-end visibility into system performance and health.
  • Incident Response: Lead incident response for observability-related issues, including participating in the on-call rotation, performing root cause analysis, and driving improvements to prevent recurrence.
  • Monitoring and Alerting: Establish and fine-tune monitoring and alerting systems to ensure platform availability, reliability, and performance. Implement alerting thresholds and escalation protocols for timely issue detection.
  • Kubernetes & Cloud Infrastructure: Maintain and optimize Kubernetes clusters and infrastructure on AWS & GCP, ensuring scalability, security, and efficiency.
  • DevOps and CI/CD: Collaborate with the DevOps and platform teams to improve CI/CD pipelines, automate deployments, and reduce manual operational tasks.
  • Security & Compliance: Work with the security team to ensure that observability practices meet security and compliance requirements, addressing any potential vulnerabilities proactively.
  • Automation & Optimization: Develop and enhance automation around observability, reducing toil and increasing efficiency in issue detection and resolution.
  • Collaboration: Work closely with other teams to ensure observability tools are fully integrated with application and infrastructure monitoring efforts, delivering actionable insights to all stakeholders.
Requirements:

Experience:

  • 3+ years of experience in a Site Reliability Engineer, Observability Engineer, or DevOps role.
  • Extensive experience with cloud platforms, particularly AWS or GCP, and Kubernetes in production environments.
  • Proven expertise in building and managing observability systems, including Grafana, Prometheus, Loki, and Tempo.

Education & Certifications:

  • Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent professional experience.
  • Certifications in AWS or GCP (preferred but not required).

Technical Expertise:

  • Deep understanding of observability practices, monitoring, alerting, and logging tools such as Grafana, Prometheus, Loki, and Tempo.
  • Expertise in Kubernetes and cloud platforms (AWS or GCP).
  • Proficiency in automation tools, scripting (Bash, Python, etc.), and infrastructure as code (Terraform, Ansible).
  • Strong understanding of DevOps principles and best practices.

Skills and Competencies:

  • Strong problem-solving and analytical skills with an ability to lead root cause analysis efforts.
  • Effective communicator capable of working with both technical and non-technical stakeholders.
  • Security-focused mindset, integrating observability with security best practices.
  • Ability to work in a fast-paced, dynamic environment and adapt to evolving priorities.

Additional Requirements:

  • Ability to participate in an on-call rotation to ensure 24/7 availability of the observability platform.
  • Willingness to work flexible hours when necessary to support critical incidents or maintenance tasks.

What's in it for you?

  • Remote-first work environment; we offer workplace flexibility
  • Participation in the company-wide discretionary bonus
  • 15 days of paid time off (prorated)
  • 10 paid holidays
  • Monthly device(s) reimbursement
  • A range of healthcare plans to meet your needs (medical, dental, vision)
  • 401(k) Plan with 4% employer contribution to help you plan for the future
  • Employee Referral Bonus
  • Charitable Program Match

CENTEGIX is an equal opportunity employer and prohibits discrimination and harassment of any kind. We are committed to the principle of equal employment opportunity for all employees and to providing employees with a work environment free of discrimination and harassment. All employment decisions at CENTEGIX are based on business needs, job requirements, and individual qualifications, without regard to race, color, religion, sex (including pregnancy, gender identity, and sexual orientation), national origin, age, disability, genetic information, or any other status protected by the laws or regulations in the locations where we operate.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions, and AI may have been used to create this description. The position description has been reviewed for accuracy, and Dice believes it correctly reflects the job opportunity.