Infrastructure and Ops (DevOps and MLOps) Engineer - Remote

Salary undisclosed

Primary Responsibilities:

- Automation & DevOps: Improve existing Infrastructure as Code (IaC) in line with DevOps best practices.

- Systems Monitoring: Develop and maintain monitoring frameworks for the UAIS infrastructure that supports the AI/ML training program.

- Security & Compliance: Collaborate with cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats.

- Capacity Planning & Cost Optimization: Forecast and manage capacity requirements for the AI/ML training environment, while identifying opportunities to reduce costs without compromising performance.

- Continuous Support: Provide ongoing SRE support to thousands of geographically distributed users of the UAIS platform, including responding to tickets, triaging support requests, and liaising with customers.

Required Qualifications:

- Bachelor's degree in computer science, information technology, or a related field.

- 3+ years of infrastructure experience: Proven experience working on large-scale, cloud-based, enterprise-level software platforms, and a deep understanding of multi-cloud architectures (specifically Azure, AWS, and Google Cloud Platform) with hands-on cloud management experience.

- 2+ years of practical experience with Infrastructure-as-Code and CI/CD tools such as Terraform, GitHub Actions, and similar.

- 2+ years of practical experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).

- 2+ years of scripting and automation experience: Advanced proficiency in scripting languages such as Python and Bash to support automation and system integration efforts.

Preferred Qualifications:

- Security & Compliance Knowledge: Strong understanding of security best practices and experience ensuring compliance with relevant regulatory frameworks.

- Machine Learning and LLM Operations: Exposure to modern tools and techniques in the MLOps and LLMOps fields.

- Exposure to AI/ML-specific infrastructure tools (e.g., MLflow, Kubeflow) for managing and deploying models at scale.

- Exposure to a Regulated Industry: Experience working in healthcare or another regulated industry, with a solid understanding of its unique challenges and compliance requirements.

- Ability to work independently, manage multiple projects simultaneously, and adapt to changing priorities in a fast-paced environment.

Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions, and AI may have been used to create this description. The position description has been reviewed for accuracy, and Dice believes it correctly reflects the job opportunity.