Site Reliability Engineer III

Location: Newport Beach, CA
Salary: $65.00 USD Hourly - $70.00 USD Hourly
Description: Our client is currently seeking a Site Reliability Engineer III for a hybrid role in Newport Beach, CA.

Job Title: Lead Site Reliability Engineer (SRE)

Job Description:

As a Lead Site Reliability Engineer (SRE), you will provide technical leadership, direction, and accountability for platform engineering, system design, and end-to-end implementation. Your goal is to meet and exceed non-functional requirements, including quality, security, reliability, availability, and performance. You will optimize design and engineering for new systems and enhancements, ensuring reliable support for product rollout and operation in production. This role includes oversight of production operations and the development of solutions to optimize system reliability and automation.

Responsibilities:
  • Lead the design, build, and implementation of orchestration and tooling solutions to ensure efficient and defect-free administration tasks.
  • Establish best practices for structuring, automating, building, deploying, and monitoring complex distributed software products and environments.
  • Ensure the reliability and traceability of software releases and of deployments of software and infrastructure changes.
  • Create and maintain platform architecture and design specifications to aid development, testing, and maintenance of software environments.
  • Design and implement monitoring and recovery tools to ensure high availability (HA) and disaster recovery (DR).
  • Develop highly available infrastructure and platform components to support our growing product lines.
  • Implement security engineering best practices across all deployed platforms and environments.
  • Triage alerts, diagnose and resolve critical issues, and manage the implementation of changes.
  • Coordinate, document, and track critical incidents and root cause analysis, ensuring rapid and complete issue resolution.
  • Collaborate with Delivery Engineers and DevExp Engineers to enhance and implement CI/CD orchestration systems, reducing friction for software delivery to production.
  • Lead, mentor, and grow other SRE team members.
  • Promote the DevSecOps culture and SRE mindset, mentoring others on reliability and best practices.
  • Identify and implement opportunities for automation, signal-to-noise reduction, and prevention of recurring issues to increase productivity and reduce service-impacting events.
  • Maintain a strong understanding of IaaS, PaaS, and SaaS offerings, building and maintaining a state-of-the-art, cloud-based environment for large-scale data processing.
  • Design and implement processes, technology, and automation for performance testing.
  • Ensure that implementations and solutions are fully documented and deployed with operationalized processes to support the solution lifecycle.

Qualifications:
  • 10-15 years of experience in infrastructure, system engineering, or software engineering.
  • Advanced knowledge in software engineering, testing automation frameworks, and tools for application and/or any-as-code (infrastructure, configuration, development tools).
  • Expertise in at least three of the following areas: cloud-native and IaaS architecture, design (compliance, security), cloud engineering, and container orchestration solutions.
  • Strong understanding of business technology drivers and their impact on architecture design, performance, and monitoring.
  • Advanced knowledge of observability engineering with hands-on experience implementing and integrating monitoring and observability platforms such as AppDynamics, Dynatrace, Splunk, Grafana Cloud, or cloud-based services in AWS or Azure.
  • Systematic problem-solving approach, strong communication skills, and a sense of ownership and drive.
  • Hands-on experience in designing, analyzing, scaling, and troubleshooting medium to large-scale distributed systems.
  • Proficiency with SRE methodologies and a passion for solving operational problems through automation and software engineering.
  • Ability to communicate technical strategy effectively across the organization.
  • Demonstrated ability to conceptualize, launch, and deliver multiple engineering projects on time and within budget.
  • Ability to troubleshoot complex problems under pressure.

Preferred Qualifications:
  • Subject matter expertise in designing and supporting solutions on one of the major public cloud providers (AWS preferred, but experience with other providers considered).
  • Expertise in microservices lifecycle management (integration, testing, deployment).
  • Strong experience with logging and monitoring tools such as ELK stack, Prometheus, Stackdriver, New Relic, Datadog, Dynatrace, Splunk, and AWS logging and monitoring services.
  • Expert knowledge of release software tooling (e.g., Jenkins, Spinnaker, Harness, Azure DevOps).
  • Expert-level knowledge of containerization technologies, including optimizing Docker images and managing Docker image lifecycles.
  • Advanced experience with Kubernetes or other orchestration solutions.
  • Extensive experience with Linux, Unix, and Windows operating systems.


Contact:

This job and many more are available through The Judge Group. Please apply with us today!
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.