Full-Time Staff Site Reliability Engineer

Wikimedia Foundation is hiring a remote, full-time Staff Site Reliability Engineer. This is an Expert-level role open to remote applicants based in the Americas, Europe, and Africa. Read the complete job description before applying.

Wikimedia Foundation

Job Title

Staff Site Reliability Engineer

Posted

Job Type

Full-Time

Career Level

Expert

Locations Accepted

Americas, Europe, Africa

Salary

$129,347 - $200,824 per year

Job Details

The Wikimedia Foundation seeks a Staff Site Reliability Engineer (SRE) focused on ML Infrastructure.

You'll join a distributed team (UTC -5 to UTC +3) and report to the Director of Machine Learning.

Responsibilities:

  • Design, develop, maintain, and scale foundational ML infrastructure for ML Engineers & Researchers.
  • Improve reliability, availability, and scalability of ML infrastructure.
  • Collaborate with ML engineers, product teams, researchers, SREs, and the Wikimedia volunteer community.
  • Proactively monitor and optimize system performance, capacity, and security.
  • Provide guidance and documentation on using the ML infrastructure.
  • Mentor team members on infrastructure management and reliability engineering.

Skills & Experience:

  • 7+ years of SRE/DevOps/Infrastructure Engineering experience with production-grade ML systems.
  • Expertise with on-premises ML infrastructure (Kubernetes, Docker, GPU acceleration, distributed training systems).
  • Proficiency with infrastructure automation and configuration management tools (Terraform, Ansible, Helm, Argo CD).
  • Experience implementing observability, monitoring, and logging for ML systems (Prometheus, Grafana, ELK stack).
  • Familiarity with Python-based ML frameworks (PyTorch, TensorFlow, scikit-learn).
  • Strong English communication skills for global team collaboration.

Qualities:

  • Collaborative, proactive, and independently motivated.
  • Experienced with diverse, remote teams.
  • Committed to open-source software and volunteer communities.
  • Systematic thinker focused on operational excellence.

Ideal Candidates Excel in:

  • Scalable ML Infrastructure: Deep understanding of scalable infrastructure design for ML training/inference.
  • Reliability and Operations: Proven track record ensuring reliability of complex, distributed ML systems.
  • Tooling and Automation: Expertise creating robust tooling/automation for ML infrastructure.

FAQs

What is the last date for applying to the job?

The deadline to apply for the Full-Time Staff Site Reliability Engineer role at Wikimedia Foundation is the 21st of April 2025. We consider jobs older than one month to have expired.

Which countries are accepted for this remote job?

This job accepts applicants based in the Americas, Europe, and Africa.
