Monitoring and Logging Specialist - Incident/Observability

Salary undisclosed


Job Title: Monitoring and Logging Specialist - Incident/Observability

100% Remote

Job Description:
As a Monitoring and Logging Specialist, you will be responsible for developing, implementing, and optimizing monitoring and logging solutions across the organization's infrastructure, applications, and cloud services.
You will play a key role in automating alerting and incident response, ensuring compliance with security and audit standards, and training team members on best practices.
Qualifications:
  • Expertise in monitoring and logging tools (e.g., Nagios, Prometheus, Splunk, ELK Stack, Graylog).
  • Experience with cloud platforms and integrating monitoring solutions across on-prem and cloud environments.
  • Strong automation and scripting skills (e.g., Python, Bash, PowerShell) for alerting and incident response.
  • Familiarity with security and compliance standards such as NIST and HIPAA.
  • Excellent troubleshooting, analytical, and incident resolution skills.
  • Strong communication and collaboration abilities for cross-functional teams.
Desired Experience:
  • Experience in monitoring, logging, and incident management.
  • Hands-on experience with automated alerting, predictive analytics, and performance tuning of monitoring solutions.
Key Responsibilities:
  1. Implement and Manage Monitoring Tools:
    • Deploy and maintain monitoring systems (e.g., Nagios, Zabbix, Prometheus) for infrastructure, applications, and network devices.
    • Ensure critical components, including servers, networks, and cloud services, are monitored with real-time alerts and dashboards.
  2. Manage and Optimize Logging Platforms:
    • Implement and manage centralized logging solutions (e.g., Splunk, ELK Stack, Graylog) to capture and store logs from various systems.
    • Ensure logs are properly indexed, stored, and searchable for efficient troubleshooting and analysis.
  3. Develop Automated Alerting and Incident Response:
    • Create and configure automated alerting rules, integrating them with incident management tools (e.g., PagerDuty, ServiceNow).
    • Ensure timely incident notifications and provide relevant logs and metrics for swift troubleshooting.
  4. Ensure Compliance with Logging and Monitoring Best Practices:
    • Enforce security, audit, and compliance standards (e.g., NIST, HIPAA) within monitoring and logging solutions.
    • Regularly audit practices to maintain security and ensure compliance with industry standards.
  5. Drive Proactive Monitoring and Predictive Analytics:
    • Implement proactive monitoring and predictive analytics to identify potential issues and prevent system failures.
    • Reduce downtime by predicting and resolving bottlenecks and performance issues.
  6. Facilitate Cross-Team Collaboration for Incident Resolution:
    • Provide centralized monitoring data to enable effective collaboration between infrastructure, application, and monitoring teams.
    • Shorten Mean Time to Resolution (MTTR) by ensuring quick access to relevant logs and metrics during incidents.
  7. Train and Guide Team on Monitoring Tools:
    • Train team members on the usage of monitoring tools, log analysis, and setting up alerts.
    • Ensure the team can efficiently use and maintain monitoring and logging systems, reducing dependency on specialists for daily operations.
  8. Optimize System Resource Utilization:
    • Regularly review and tune monitoring and logging systems so they do not consume excessive system resources such as CPU and memory.
    • Ensure monitoring systems do not contribute to performance degradation.
  9. Integrate Monitoring Solutions with Cloud Platforms:
    • Set up and maintain monitoring solutions for cloud infrastructure and services (e.g., AWS, Azure), integrating them with on-premises tools.
    • Ensure unified dashboards and alerts across on-premises and cloud infrastructures.
  10. Document Monitoring/Logging Processes and Policies:
    • Create and maintain detailed documentation of monitoring configurations, logging system architectures, and incident response protocols.
    • Ensure smooth onboarding and continuity of operations in case of staff changes or system modifications.