Site Reliability Engineer - Cloud & FedRAMP

Full Time, onsite
Octigo Solutions Inc
Remote, United States of America

Salary undisclosed

Responsibilities:

Design and implement resilient cloud infrastructure, automate operational tasks, and ensure compliance with FedRAMP security standards.

Architect and implement cloud-native infrastructure across AWS, Google Cloud Platform, and Azure.

Develop and maintain Infrastructure-as-Code using Terraform, CloudFormation, or Ansible.

Automate system provisioning, configuration management, and security controls.

Design and implement observability frameworks (monitoring, logging, tracing) for SaaS applications.

Establish 24x7 incident response protocols, efficient alerting, and playbooks.

Identify reliability risks and develop automated solutions to enhance system resilience.

Collaborate with Security & Compliance teams to ensure adherence to FedRAMP security controls.

Implement and maintain strict access control policies to protect sensitive data and systems.

Participate in FedRAMP audits, security assessments, and compliance efforts across cloud environments.

Work with SaaS, Platform Engineering, and DevOps teams to integrate reliability principles.

Lead initiatives to improve observability, security, and automation in a continuous delivery model.

Required Qualifications:

10+ years of experience managing large-scale, multi-tenant cloud environments.

Expertise in Kubernetes, container orchestration, and cloud networking.

Hands-on experience with AWS, Google Cloud Platform, or Azure, including security best practices.

Proficiency in Terraform, CloudFormation, or Ansible for infrastructure automation.

Strong Linux fundamentals and experience troubleshooting production outages.

Experience in incident response, on-call coordination, and system reliability engineering.

Exposure to programming languages like Go, Python, C, or C++ is highly preferred.

Understanding of FedRAMP compliance, security frameworks, and SaaS governance.

Experience in multi-cloud and hybrid environments.

Strong knowledge of SRE principles, SLAs, SLOs, and error budgets.

Experience working in a 24x7 production environment with high availability requirements.

Familiarity with observability tools (Prometheus, Grafana, Splunk, ELK, Datadog).

Hands-on experience securing cloud-native applications and infrastructure.
