Sr. Site Reliability Engineer
Sr. Site Reliability Engineer Lead
Job Summary
The Senior Support Lead in Site Reliability Engineering (SRE) will be responsible for overseeing the support and reliability operations within the organization. This role will focus on ensuring the stability, performance, and efficiency of the systems while leading a team of support engineers to provide exceptional service.
Key Responsibilities
1. Lead and manage a team of support engineers in resolving incidents, requests, and problems to ensure system uptime and reliability.
2. Collaborate with the engineering and development teams to implement efficient and scalable solutions that enhance system performance.
3. Develop and maintain support documentation, standard operating procedures, and best practices for the support team.
4. Identify opportunities for automation and implement tools to streamline support processes.
5. Monitor system performance and provide recommendations for improvements to optimize system reliability.
6. Participate in on-call rotations to address critical incidents and ensure 24/7 system availability.
7. Conduct regular performance evaluations, provide feedback, and mentor team members to promote professional growth.
Skill Requirements
1. In-depth knowledge of Site Reliability Engineering (SRE) principles and best practices.
2. Proficiency in system monitoring, incident management, and performance tuning tools.
3. Strong understanding of cloud services, microservices architecture, and containerization technologies.
4. Excellent problem-solving skills and the ability to troubleshoot complex technical issues.
5. Experience with scripting languages (e.g., Python, Bash) for automation and tool development.
6. Familiarity with Agile methodologies and DevOps practices for continuous integration and delivery.
7. Strong communication and leadership skills to effectively lead a support team and collaborate with cross-functional teams.
8. Ability to work under pressure, prioritize tasks, and manage multiple projects simultaneously.
Required Qualifications
- 10+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
- 10+ years of experience in production support/Site Reliability Engineering teams with a continued focus on improving platform health
- Familiar with Agile or other rapid application development practices
- Hands-on expertise with automated testing, process automation, and building dashboards using APM tools.
- Experience with distributed (multi-tiered) systems, algorithms, relational databases, and NoSQL databases.
- Knowledge of and exposure to caching tools (Redis, Memcached) or messaging tools such as MQ and Kafka.
- Must have working knowledge of APM tools such as Splunk, GCL, ELK, Grafana, Prometheus, etc.
- Able to create dashboards using GCL/Splunk/ELK and set up alerts.
- Working knowledge of CI/CD is a plus, including source control such as Git and continuous integration and release tools such as Jenkins and UCD.
- Ability to work with engineering teams across the ecosystem, such as Security, Networking, and Infrastructure, on challenges that can impact platform health and resiliency.
- Shell scripting and DevOps tools such as Ansible, with good knowledge of YAML for writing playbooks.
- Experience with distributed storage technologies such as NFS, as well as dynamic resource management frameworks such as PCF, Kubernetes/OpenShift, AWS, or Azure.
- Tech stack: Java/J2EE (Spring, Spring Boot), Python, shell scripting, Kafka, Oracle, MongoDB, etc.
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
- Bachelor's degree in computer science, computer science engineering, or related experience required.