
Site Reliability Engineer
Role: Site Reliability Engineer (SRE)
Location: Austin, TX
Candidates must be local to Austin.
Tools & Technologies Required
o Python, Java, AWS, Kubernetes, Jenkins, Docker, Splunk
Responsibilities
- Design, implement, and maintain highly available and scalable distributed systems.
- Develop automation tools and scripts using Java, Python, or other relevant technologies to improve system reliability and efficiency.
- Monitor, troubleshoot, and resolve production incidents, ensuring system uptime and performance.
- Optimize infrastructure by implementing best practices in observability, logging, and monitoring (Prometheus, Grafana, ELK, etc.).
- Collaborate with development teams to enhance CI/CD pipelines, automate deployments, and improve software delivery processes.
- Ensure security, compliance, and infrastructure best practices across cloud and on-prem environments.
- Conduct root cause analysis (RCA) for incidents and drive long-term improvements.
- Improve system resilience through capacity planning, performance tuning, and failure recovery strategies.
Additional responsibilities
o Ensure all application components run smoothly in the Kubernetes and AWS environments.
o Support the components (patches, upgrades, issues, configurations) on the application platform.
o Manage CI/CD pipelines for the application tools and components.
o Automate tasks to improve efficiency and reduce manual effort.
o Create and publish comprehensive observability dashboards.
o Configure and monitor health checks.
o Provision users.
o Monitor and remediate alerts.
o Alert the application team in the event of any potential issues related to infrastructure or components.
o Create and update runbooks for standardized operations.
o Acquire knowledge of the application platform (architecture, design, usage, typical problems faced by users, and their resolution) to reduce dependency on the application team for resolving support issues.
o Track and report the cost of AWS and other resources weekly.
o Respond to users on application communication channels (Slack and support email group) and provide appropriate solutions.