Full-Time Senior Site Reliability Engineer ML Platforms

2100 NVIDIA USA is hiring a remote, full-time Senior Site Reliability Engineer ML Platforms. The career level for this opening is Experienced, and the role accepts remote applicants based in US, CA, Santa Clara. Read the complete job description before applying.

This job was posted 5 months ago and is likely no longer active.

2100 NVIDIA USA

Job Title

Senior Site Reliability Engineer ML Platforms

Job Type

Full-Time

Career Level

Experienced

Locations Accepted

US, CA, Santa Clara

Salary

$224,000 - $425,500 per year

Job Details

Are you passionate about building and maintaining large-scale production systems that support advanced data science and machine learning applications? NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for the Data Science & ML Platform(s) team. The role involves:
  • Designing, building, and maintaining services that enable real-time data analytics, streaming, data lakes, observability and ML/AI training and inferencing.
  • Implementing software and systems engineering practices to ensure high efficiency and availability of the platform.
  • Applying SRE principles to improve production systems and optimize service SLOs.
  • Collaborating with our customers to plan and implement changes to the existing system, while monitoring capacity, latency, and performance.
To succeed, you need a strong background in SRE practices, systems, networking, coding, capacity management, cloud operations, continuous delivery and deployment, and open-source cloud-enabling technologies like Kubernetes and OpenStack. A deep understanding of the challenges and standard methodologies of running large-scale distributed systems in production is also necessary, and excellent communication and collaboration skills are essential.
What you'll be doing:
  • Develop software solutions to ensure reliability and operability of large-scale systems supporting mission-critical use cases.
  • Create tools and automation to reduce operational overhead and eliminate manual tasks.
  • Establish frameworks, processes, and standard methodologies to enhance operational maturity and team efficiency and to accelerate innovation.
  • Define meaningful and actionable reliability metrics to track and improve system and service reliability.
  • Oversee capacity and performance management to facilitate infrastructure scaling across public and private clouds globally.
  • Build tools to improve our service observability for faster issue resolution.
What we need to see:
  • Minimum of 10 years of experience in SRE, cloud platforms, or DevOps with large-scale microservices in production environments.
  • Master's or Bachelor's degree in Computer Science, Electrical Engineering, or Computer Engineering, or equivalent experience.
  • Strong understanding of SRE principles, including error budgets, SLOs, and SLAs.
  • Proficiency in incident, change, and problem management processes.
  • Experience with streaming data infrastructure services, such as Kafka and Spark.
  • Expertise in building and operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus).
  • Proficiency in programming languages such as Python, Go, Perl, or Ruby.
  • Hands-on experience with scaling distributed systems in public, private, or hybrid cloud environments.
Ways to stand out:
  • Experience operating large-scale distributed systems with strong SLAs.
  • Excellent coding skills in Python and Go, and extensive experience operating data platforms.
  • Knowledge of CI/CD systems, such as Jenkins and GitHub Actions.
  • Familiarity with Infrastructure as Code (IaC) methodologies and tools.

FAQs

What is the last date for applying to the job?

The deadline to apply for Full-Time Senior Site Reliability Engineer ML Platforms at 2100 NVIDIA USA is July 26, 2025. We consider jobs older than one month to have expired.

Which countries are accepted for this remote job?

This job accepts applicants based in US, CA, Santa Clara.
