Full-Time Senior Site Reliability Engineer Observability

2100 NVIDIA USA is hiring a remote, full-time Senior Site Reliability Engineer Observability. The career level for this opening is Experienced, and the role accepts remote applicants based in US, CA, Santa Clara. Read the complete job description before applying.

2100 NVIDIA USA

Job Title

Senior Site Reliability Engineer Observability

Posted

Job Type

Full-Time

Career Level

Experienced

Locations Accepted

US, CA, Santa Clara

Salary

$140,000 - $258,750 per year

Job Details

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline for designing, building, and maintaining large-scale production systems with high efficiency and availability, using software and systems engineering practices.

Responsibilities:

  • Design, implement, and support operational and reliability aspects of a large-scale Observability & Telemetry collection platform, focusing on performance at scale, real-time monitoring, logging, and alerting.
  • Engage in and improve the entire lifecycle of services (from inception and design through deployment, operation, and refinement).
  • Support services before launch through system design consulting; development of software tools, platforms, and frameworks; capacity management; and launch reviews.
  • Maintain services after launch by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through automation and evolve systems to improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Participate in the on-call rotation to support production systems.

Qualifications:

  • BS degree in Computer Science or a related technical field involving coding, or equivalent experience.
  • 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production.
  • 5+ years of experience delivering foundational infrastructure and observability platforms.
  • Experience in Python, Go, Perl, or Ruby.
  • In-depth knowledge of Linux, Networking, and Containers.

Bonus Skills:

  • Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Strong problem-solving and communication skills, and a sense of ownership.
  • Experience debugging, optimizing code, and automating routine tasks.
  • Experience with Kubernetes, OpenStack, and Docker.
  • Experience with Grafana, OpenTelemetry, Prometheus, and similar observability tools.

FAQs

What is the last date for applying to the job?

The deadline to apply for the Full-Time Senior Site Reliability Engineer Observability position at 2100 NVIDIA USA is the 5th of February 2025. We consider jobs older than one month to have expired.

Which countries are accepted for this remote job?

This job accepts applicants based in US, CA, Santa Clara.
