Senior Site Reliability Engineer
Job Title: Senior Site Reliability Engineer (SRE)
Location: Remote
Duration: Long-term contract
Contract: C2C/W2
Job Description:
We are seeking an experienced and highly skilled Senior Site Reliability Engineer (SRE) with a strong background in Azure Kubernetes Service (AKS), Jenkins CI/CD, and observability tools. As a Senior SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our production infrastructure, while implementing best practices in CI/CD and observability.
Key Responsibilities:
Azure Kubernetes Service (AKS) Management:
Design, deploy, and manage AKS clusters, ensuring high availability and scalability.
Optimize AKS infrastructure for performance, security, and cost.
Work with development teams to containerize applications and orchestrate them on AKS.
Implement Kubernetes security best practices, including RBAC, network policies, and namespace isolation.
CI/CD Pipeline Development (Jenkins):
Design and maintain robust CI/CD pipelines using Jenkins to enable fast and reliable deployments.
Collaborate with development teams to integrate code repositories, Docker registries, and test automation tools into Jenkins workflows.
Utilize scripting languages (Groovy, Shell, Python) to automate Jenkins pipelines and improve the deployment process.
Troubleshoot and resolve issues within the CI/CD pipelines, maintaining high standards of performance and reliability.
Observability & Monitoring:
Establish comprehensive monitoring, logging, and alerting strategies for production and staging environments.
Deploy and manage observability tools, such as Prometheus, Grafana, Azure Monitor, ELK, and Datadog.
Implement APM and distributed tracing solutions to identify and resolve performance bottlenecks.
Work closely with engineering teams to define and track SLAs, SLOs, and error budgets.
Incident Management & Troubleshooting:
Proactively identify, diagnose, and resolve infrastructure and application issues in real time.
Develop incident management and escalation procedures to minimize downtime and maintain service quality.
Conduct post-incident reviews, documenting root causes and implementing long-term fixes to improve system resilience.
Automation & Infrastructure as Code (IaC):
Leverage tools like Terraform, Ansible, and Helm to automate infrastructure provisioning, configuration management, and application deployment.
Promote the adoption of Infrastructure as Code (IaC) principles, ensuring consistent and repeatable infrastructure management.
Performance Optimization & Cost Efficiency:
Optimize AKS workloads, pipelines, and observability setups to improve performance and reduce operational costs.
Monitor cloud spending and recommend changes to reduce unnecessary expenditures.
Qualifications:
Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
13+ years of experience in SRE, DevOps, or cloud infrastructure roles.
Strong expertise with AKS and Kubernetes, including multi-tenant cluster management and security best practices.
Proven experience with Jenkins CI/CD, including scripting in Groovy, Shell, or Python for automated pipeline development.
Extensive knowledge of observability tools (Prometheus, Grafana, ELK, Datadog, Azure Monitor) with experience in configuring APM and tracing.
Proficiency in infrastructure automation tools, such as Terraform, Ansible, and Helm.
Solid understanding of cloud infrastructure, networking, security, and best practices for Azure environments.
Excellent troubleshooting skills, with experience in incident management and root cause analysis.
Strong interpersonal skills to collaborate effectively with cross-functional teams.
Preferred Skills:
Experience with other CI/CD tools, such as GitLab CI/CD or CircleCI.
Familiarity with microservices architecture and service mesh technologies.
Experience with additional observability tools like New Relic or AppDynamics.
Strong understanding of Agile and DevOps methodologies.