Full-Time Senior Site Reliability Engineer ML Platforms

2100 NVIDIA USA is hiring a remote Full-Time Senior Site Reliability Engineer ML Platforms. The career level for this opening is Experienced, and it accepts remote applicants based in US, CA, Santa Clara. Read the complete job description before applying.


2100 NVIDIA USA

Job Title

Senior Site Reliability Engineer ML Platforms

Posted

5 months ago on 26th June 2025

Job Type

Full-Time

Career Level

Experienced

Locations Accepted

US, CA, Santa Clara

Salary

$224,000 - $425,500 per year

Job Details

Are you passionate about building and maintaining large-scale production systems that support advanced data science and machine learning applications? NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for the Data Science & ML Platform(s) team.

Designing, building, and maintaining services that enable real-time data analytics, streaming, data lakes, observability and ML/AI training and inferencing.
Implementing software and systems engineering practices to ensure high efficiency and availability of the platform.
Applying SRE principles to improve production systems and optimize service SLOs.
Collaborating with our customers to plan and implement changes to the existing system, while monitoring capacity, latency, and performance.

To succeed, a strong background in SRE practices, systems, networking, coding, capacity management, cloud operations, continuous delivery and deployment, and open-source cloud enabling technologies like Kubernetes and OpenStack is required. Deep understanding of the challenges and standard methodologies of running large-scale distributed systems in production is also necessary. Excellent communication and collaboration skills are essential.

What you'll be doing:

Develop software solutions to ensure reliability and operability of large-scale systems supporting mission-critical use cases.
Create tools and automation to reduce operational overhead and eliminate manual tasks.
Establish frameworks, processes, and standard methodologies to enhance operational maturity, team efficiency, and accelerate innovation.
Define meaningful and actionable reliability metrics to track and improve system and service reliability.
Oversee capacity and performance management to facilitate infrastructure scaling across public and private clouds globally.
Build tools to improve our service observability for faster issue resolution.

What we need to see:

Minimum of 10 years of experience in SRE, Cloud platforms, or DevOps with large-scale microservices in production environments.
Master's or Bachelor's degree in Computer Science, Electrical Engineering, or Computer Engineering, or equivalent experience.
Strong understanding of SRE principles, including error budgets, SLOs, and SLAs.
Proficiency in incident, change, and problem management processes.
Experience with streaming data infrastructure services, such as Kafka and Spark.
Expertise in building and operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus).
Proficiency in programming languages such as Python, Go, Perl, or Ruby.
Hands-on experience with scaling distributed systems in public, private, or hybrid cloud environments.

Ways to stand out:

Experience operating large-scale distributed systems with strong SLAs.
Excellent coding skills in Python and Go and extensive experience in operating data platforms.
Knowledge of CI/CD systems, such as Jenkins and GitHub Actions.
Familiarity with Infrastructure as Code (IaC) methodologies and tools.

Skills

Capacity Management, Cloud Operations, Kubernetes, Python/Go, SRE Principles

FAQs

What is the last date for applying to the job?

The deadline to apply for Full-Time Senior Site Reliability Engineer ML Platforms at 2100 NVIDIA USA is 26th of July 2025. We consider jobs older than one month to have expired.

Which countries are accepted for this remote job?

This job accepts applicants from US, CA, Santa Clara.

Apply Now

