Full-Time Senior Site Reliability Engineer Observability
2100 NVIDIA USA is hiring a remote, full-time Senior Site Reliability Engineer, Observability. The career level for this opening is Experienced, and the role accepts remote applicants based in Santa Clara, CA, US. Read the complete job description before applying.
Job Details
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build, and maintain large-scale production systems with high efficiency and availability using software and systems engineering practices.
Responsibilities:
- Design, implement, and support operational and reliability aspects of a large-scale Observability & Telemetry collection platform, focusing on performance at scale, real-time monitoring, logging, and alerting.
- Engage in and improve the entire lifecycle of services (from inception and design through deployment, operation, and refinement).
- Support services before launch through system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.
- Maintain services after launch by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through automation and evolve systems to improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Participate in on-call rotation to support production systems.
Qualifications:
- BS degree in Computer Science or a related technical field involving coding, or equivalent experience.
- 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production.
- 5+ years of experience delivering foundational infrastructure and observability platforms.
- Experience in Python, Go, Perl, or Ruby.
- In-depth knowledge of Linux, Networking, and Containers.
Bonus Skills:
- Interest in crafting, analyzing, and fixing large-scale distributed systems.
- Strong problem-solving and communication skills, and a sense of ownership.
- Experience debugging and optimizing code, and automating routine tasks.
- Experience with Kubernetes, OpenStack, and Docker.
- Experience with Grafana, OpenTelemetry, Prometheus, and similar observability tools.