Site Reliability Engineer

Full Time, onsite
Brain Bolt Consulting
Hybrid, United States of America

Salary undisclosed

Responsibilities:

Analyse, design, program, test, and deploy new user stories and features to production with high quality (security, reliability, operations)
Achieve team commitments (and influence others to do the same) through informal leadership and highly developed communication skills
Oversee design decisions and guide the team to achieve key results for the products assigned to it
Remediate issues using engineering principles and create proactive design solutions for potential failures
Work with a team of site reliability engineers responsible for building a continuous reliability mindset, shepherding problem management, and driving key site reliability engineering practices into the organization
Design and drive monitoring, alerting, and ticket reporting strategies to measure SLA, SLO, MTTI, MTTR, etc., and align with management expectations to minimize production downtime
Guide site reliability automation to help eliminate manual toil and create a self-healing capability
Participate in the selection of appropriate automation tools, and define technology, quality, experience, and implementation standards and practices within own technical domain
Foster a culture of excellence and continuous learning within the chapter. Establish and track appropriate OKRs to ensure outcomes are met.
Create solutions addressing high-impact technology and business priorities
Be competent in multiple contexts, such as programming languages, security, automation, testing, infrastructure, and performance, and serve as the go-to person for many people (inside and outside of the team)
Proactively identify and mitigate issues based on intuition and experience in multiple domains

Must Have Skills:

Experienced with AWS Cloud
Experienced in building and managing OCP clusters and deploying applications into OCP
Experience with SRE design to address reliability and resiliency with five-nines (99.999%) availability
Experience in managing caching solutions like Hazelcast, GemFire, or Terracotta
Experience in setting up and managing Kafka
High level of familiarity with the Linux command line and scripting
Extremely comfortable with production environments, firewalls, and networking
Strong experience in deploying, observing, alerting, logging, and monitoring systems (Splunk, Datadog, AppDynamics, Instana) with a mindset towards predictive analysis
Working knowledge of automation tools such as Ansible, Terraform, or Chef
Experience in performing RCA, Disaster Recovery activities, and Chaos Engineering

Good to have Skills:

Experience working in the payments industry (highly preferred)
Deep knowledge and understanding of emerging trends in the SRE field.
Experience developing in Java (or other similar languages)
Studied architectural patterns at scale, including thoughtfully designed APIs, repeatable delivery pipelines, and efficient computer engineering principles.
Working knowledge of messaging services like RabbitMQ, SQS, Kafka
Strong experience with Continuous Integration and Continuous Delivery models, including Blue/Green and/or Canary release models

Tools & Technologies:

OpenShift Container Platform
Splunk, Datadog, AppDynamics, Instana
Hazelcast
Ansible, Terraform, or Chef
RabbitMQ, SQS, Kafka
Linux VMs, Shell Scripting
AWS Cloud
PostgreSQL Database

