
Director of Site Reliability Engineering (SRE) - to 250k !!! - 100% Remote!!! (SK)

Salary is 220k to 250k

100% Remote position

Seeking a seasoned Director of Site Reliability Engineering (SRE) to lead and grow our SRE team. The ideal candidate will have a proven track record of building SRE environments from scratch and extensive experience with large-scale, distributed systems and high availability. You will drive the adoption of best practices in reliability, performance, and automation across our infrastructure.

Key Responsibilities:
  • Leadership & Strategy: Define and execute the vision and strategy for the SRE team, ensuring alignment with organizational goals.
  • Team Development: Build and mentor a high-performing SRE team, fostering a culture of collaboration, innovation, and continuous improvement.
  • Infrastructure Design: Architect and implement reliable, scalable, and maintainable systems using Docker, Kubernetes, and other container and orchestration technologies.
  • Cloud Management: Oversee our cloud infrastructure on AWS and Google Cloud Platform, ensuring optimal performance, security, and cost management.
  • Automation & Tooling: Lead efforts in automation using tools such as Terraform and Ansible, reducing manual processes and increasing operational efficiency.
  • Incident Management: Establish and refine incident response processes, ensuring swift resolution of service disruptions and learning from incidents to improve system reliability.
  • Collaboration: Work closely with development teams to integrate SRE practices into the software development lifecycle, promoting a culture of shared responsibility for reliability.
  • Monitoring & Metrics: Implement robust monitoring, logging, and alerting systems to ensure system health and performance are continuously tracked and improved.
  • Capacity Planning: Conduct capacity planning and performance tuning for applications and infrastructure, ensuring systems can scale efficiently.
Qualifications:
  • Experience: 8+ years of experience in Site Reliability Engineering, DevOps, or related fields, with a focus on building SRE practices from the ground up.
  • Technical Skills:
    • Proficiency with Docker and Kubernetes for containerization and orchestration.
    • Strong experience with large-scale, distributed systems and high-availability architectures.
    • Expertise in cloud platforms (AWS and Google Cloud Platform) and services.
    • Demonstrated experience with automation tools such as Terraform and Ansible.
  • Leadership Skills: Proven ability to lead and inspire teams, with strong interpersonal and communication skills.
  • Problem-Solving: Excellent analytical and troubleshooting skills, with a proactive approach to identifying and resolving issues.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.