Senior Observability Platform Engineer

Full Time, onsite
Madison-Davis, LLC
Hybrid, United States of America

Salary undisclosed

Role:

We are looking for a Senior Observability Engineer to join our team. In this role, you will focus on researching, developing, and implementing monitoring and observability solutions that enable both proactive and reactive analysis. Your work will enhance our ability to identify root causes, analyze capacity trends, escalate issues effectively, and recover swiftly from incidents that may impact our advisors and investors.

Design and Development: Create and maintain observability frameworks utilizing AWS CloudWatch, Dynatrace, ELK, and other essential monitoring tools.
Standardization: Establish and standardize observability practices to ensure thorough visibility into both application and infrastructure health.
Integration: Incorporate OpenTelemetry to improve distributed tracing and overall system observability.
Monitoring as Code: Apply Monitoring as Code principles through infrastructure-as-code tools like Terraform and CloudFormation.
Collaboration: Work closely with SREs, DevOps teams, and Software Engineers to set and uphold best practices in monitoring, logging, and alerting.
Optimization: Enhance performance monitoring, anomaly detection, and automated incident response mechanisms.
Dashboards and Alerts: Develop dashboards, alerts, and reports that provide critical insights into system performance and availability.
Incident Management: Lead investigations into observability-related incidents, conduct root cause analyses, and facilitate post-mortem reviews.
Tool Evaluation: Regularly assess and introduce new observability tools and methodologies to bolster system resilience.
On-call Support: Participate in major incident responses as needed and be part of the on-call rotation for tool support.

Qualifications:

Experience: At least 7 years in observability, monitoring, or site reliability engineering.
AIOps Knowledge: Familiarity with AIOps and predictive monitoring strategies.
DevOps and CI/CD: Understanding of CI/CD pipelines and DevOps methodologies.
Monitoring Expertise: Proficient in troubleshooting and monitoring with Dynatrace and similar APM tools; certifications like Dynatrace Associate or Master are a plus.
Tools Proficiency: Extensive experience with observability and logging tools such as SolarWinds, ELK, and Kibana; strong debugging skills and troubleshooting instincts.
Scripting Skills: Competence in scripting and automation using Python, Bash, or PowerShell.
Infrastructure as Code: Experience with Monitoring as Code (MaC) utilizing Terraform, CloudFormation, or Ansible.
Containerization Knowledge: Strong understanding of Kubernetes, Docker, and microservices architectures.
Cross-Platform Skills: Knowledge of, and certifications in, a range of platforms, including Windows Server, Linux/AIX, Networking, Virtualization, Databases (MSSQL/Oracle), and Cloud Computing (AWS/Azure).
Middleware and Database Experience: Familiarity with middleware services such as F5 and Tibco, database technologies such as MSSQL, Oracle, and MySQL, and caching solutions.
