Observability Monitoring Engineer

  • Full Time, onsite
  • TEKsystems c/o Allegis Group
  • On Site, United States of America
Salary undisclosed

**(CANNOT WORK C2C) Must work W2**
Candidates must be willing to work onsite 3 days a week. No exceptions.
Job Description:
We are seeking a highly skilled Observability Monitoring Engineer with expert knowledge in Prometheus, Grafana, or Git. This role involves developing and managing telemetry for large-scale datasets and implementing strategies to enhance AI system reliability and performance, as well as assisting in capacity management.
Key Responsibilities:
  • Develop and manage telemetry systems for large-scale datasets.
  • Implement monitoring and alerting solutions to ensure system reliability.
  • Collect and analyze data to improve AI system performance.
  • Automate processes to enhance efficiency and reduce manual intervention.
  • Manage and maintain Kubernetes clusters and Docker containers.
  • Utilize Prometheus and Grafana for monitoring and visualization.
  • Work with DCGM/DCGM Exporter (Nvidia Stack) for telemetry.
  • Collaborate with data scientists to support AI/ML platforms.
  • Troubleshoot and resolve issues related to telemetry systems.

Primary Skills:
  • Telemetry/Observability, Monitoring and Alerting, Data Collection and Analysis, Automation
  • Prometheus and Grafana
  • JSON/YAML
  • Kubernetes and Docker/Container Technologies
  • DCGM/DCGM Exporter (Nvidia Stack)
  • Solid understanding of telemetry concepts, metrics, logs, and tracing

Benefits:
  • Eligibility requirements apply to some benefits and may depend on your job classification and length of employment.
  • Benefits are subject to change and may be subject to specific elections, plan, or program terms.
  • If eligible, the benefits available for this temporary role may include the following:
  • Medical, dental & vision
  • Critical Illness, Accident, and Hospital
  • 401(k) Retirement Plan - Pre-tax and Roth post-tax contributions available
  • Life Insurance (Voluntary Life & AD&D for the employee and dependents)
  • Short and long-term disability
  • Health Spending Account (HSA)
  • Transportation benefits
  • Employee Assistance Program
  • Time Off/Leave (PTO, Vacation or Sick Leave)

About TEKsystems:

We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.

The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.
Employers have access to artificial intelligence language tools (“AI”) that help generate and enhance job descriptions and AI may have been used to create this description. The position description has been reviewed for accuracy and Dice believes it to correctly reflect the job opportunity.