Senior Site Reliability Engineer

Salary undisclosed


Job Title: Senior Site Reliability Engineer (SRE)

Location: Remote

Duration: Long-term contract

Contract: C2C/W2

Job Description:

We are seeking an experienced and highly skilled Senior Site Reliability Engineer (SRE) with a strong background in Azure Kubernetes Service (AKS), Jenkins CI/CD, and observability tools. As a Senior SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our production infrastructure, while implementing best practices in CI/CD and observability.

Key Responsibilities:

Azure Kubernetes Service (AKS) Management:

Design, deploy, and manage AKS clusters, ensuring high availability and scalability.

Optimize AKS infrastructure for performance, security, and cost.

Work with development teams to containerize applications and orchestrate them on AKS.

Implement Kubernetes security best practices, including RBAC, network policies, and namespace isolation.

CI/CD Pipeline Development (Jenkins):

Design and maintain robust CI/CD pipelines using Jenkins to enable fast and reliable deployments.

Collaborate with development teams to integrate code repositories, Docker registries, and test automation tools into Jenkins workflows.

Utilize scripting languages (Groovy, Shell, Python) to automate Jenkins pipelines and improve the deployment process.

Troubleshoot and resolve issues within the CI/CD pipelines, maintaining high standards of performance and reliability.

Observability & Monitoring:

Establish comprehensive monitoring, logging, and alerting strategies for production and staging environments.

Deploy and manage observability tools, such as Prometheus, Grafana, Azure Monitor, ELK, and Datadog.

Implement APM and distributed tracing solutions to identify and resolve performance bottlenecks.

Work closely with engineering teams to define and track SLAs, SLOs, and error budgets.

Incident Management & Troubleshooting:

Proactively identify, diagnose, and resolve infrastructure and application issues in real time.

Develop incident management and escalation procedures to minimize downtime and maintain service quality.

Conduct post-incident reviews, documenting root causes and implementing long-term fixes to improve system resilience.

Automation & Infrastructure as Code (IaC):

Leverage tools like Terraform, Ansible, and Helm to automate infrastructure provisioning, configuration management, and application deployment.

Promote the adoption of Infrastructure as Code (IaC) principles, ensuring consistent and repeatable infrastructure management.

Performance Optimization & Cost Efficiency:

Optimize AKS workloads, pipelines, and observability setups to improve performance and reduce operational costs.

Monitor cloud spending and recommend changes to reduce unnecessary expenditures.

Qualifications:

Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.

13+ years of experience in SRE, DevOps, or cloud infrastructure roles.

Strong expertise with Azure AKS and Kubernetes, including multi-tenant cluster management and security best practices.

Proven experience with Jenkins CI/CD, including scripting in Groovy, Shell, or Python for automated pipeline development.

Extensive knowledge of observability tools (Prometheus, Grafana, ELK, Datadog, Azure Monitor) with experience in configuring APM and tracing.

Proficiency in infrastructure automation tools, such as Terraform, Ansible, and Helm.

Solid understanding of cloud infrastructure, networking, security, and best practices for Azure environments.

Excellent troubleshooting skills, with experience in incident management and root cause analysis.

Strong interpersonal skills to collaborate effectively with cross-functional teams.

Preferred Skills:

Experience with other CI/CD tools, such as GitLab CI/CD or CircleCI.

Familiarity with microservices architecture and service mesh technologies.

Experience with additional observability tools like New Relic or AppDynamics.

Strong understanding of Agile and DevOps methodologies.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.