Site Reliability Engineer
Salary undisclosed
Responsibilities:
- Monitor and Maintain Systems: Use AWS CloudWatch, Dynatrace, and Quantum Metric to ensure the reliability, performance, and availability of software systems.
- Automate Processes: Develop automated solutions for operational tasks to improve efficiency and reduce manual intervention.
- Incident Management: Proactively manage incidents, ensuring minimal downtime and quick recovery.
- Build Monitoring Systems: Create and maintain effective monitoring systems using AWS CloudWatch and Dynatrace that alert on early symptoms before they escalate into outages.
- Collaborate with Teams: Work closely with development, IT operations, and product teams to ensure smooth system operations and continuous improvement.
- Capacity Planning: Participate in system design consulting, capacity planning, and performance tuning to support future growth.
- Data Analysis: Use Quantum Metric and other tools to analyze system performance and user experience data to drive improvements.
Requirements:
- Experience: Proven work experience as an SRE or in a similar role, with specific experience in AWS CloudWatch, MQM (Message Queue Management), Quantum Metric, and Dynatrace.
- Technical Skills: Proficiency in programming languages such as Python, Java, or C/C++. Experience with distributed storage technologies and dynamic resource management frameworks.
- Problem-Solving: Strong analytical and problem-solving skills.
- Communication: Excellent collaboration and communication skills.
- Education: Bachelor’s degree in Computer Science or a related field.
Preferred Skills:
- Experience with tools like Chef, Ansible, Terraform, GitLab CI/CD, and Kubernetes.
- Familiarity with cloud platforms such as AWS, Azure, or Google Cloud.
- Knowledge of MQM (Message Queue Management) systems and their integration with monitoring tools.