Full-Time Senior Infrastructure Engineer - AI/ML
OpenTeams is hiring a remote, full-time Senior Infrastructure Engineer - AI/ML. The career level for this opening is Experienced, and applications are accepted from USA-based candidates working remotely. Please read the complete job description before applying.
Job Details
We are seeking a fully remote, experienced Senior Infrastructure Engineer to join our team at OpenTeams.
At OpenTeams, we prioritize cloud-native, reproducible, and observable infrastructure using tools like Terraform, Helm, ArgoCD, and Kubernetes operators.
All of our infrastructure components are designed as reusable, composable building blocks that support AI/ML workflows including model training, inference serving, experiment tracking, and data processing pipelines using tools from the PyData ecosystem.
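To give a flavor of what "reusable, composable building blocks" can mean in practice, here is a minimal, hypothetical sketch (not OpenTeams or Nebari code) using the official Kubernetes Python client: a small, reusable readiness check for a model-serving Deployment, of the kind that might back automation or observability tooling. The names deployment_ready, model-serving, and mlops are illustrative assumptions only.

    # Hypothetical sketch: a small reusable check that a model-serving
    # Deployment has fully rolled out, using the Kubernetes Python client.
    from kubernetes import client, config

    def deployment_ready(name: str, namespace: str) -> bool:
        """Return True when all desired replicas of a Deployment are available."""
        apps = client.AppsV1Api()
        dep = apps.read_namespaced_deployment(name=name, namespace=namespace)
        desired = dep.spec.replicas or 0
        available = dep.status.available_replicas or 0
        return desired > 0 and available == desired

    if __name__ == "__main__":
        # Load credentials from the default kubeconfig (~/.kube/config).
        config.load_kube_config()
        print(deployment_ready("model-serving", "mlops"))

In day-to-day work, checks like this would typically be wrapped into the platform's own tooling or GitOps pipelines rather than run by hand.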
In this position, you'll get to:
- Significantly contribute to the evolution of Nebari and design reusable, modular infrastructure components that can be composed into bespoke Kubernetes-based platforms for sovereign AI deployments
- Develop composable MLOps components and infrastructure patterns supporting model training, serving, monitoring, and CI/CD pipelines that organizations can own and operate
- Design and implement observability, monitoring, and cost optimization strategies for large-scale AI/ML workloads on client-owned Kubernetes infrastructure
- Collaborate with ML engineers to optimize infrastructure for ML model training, quantization and packaging of open-weight LLMs, computer vision workloads, and other AI applications in sovereign environments
- Contribute to open-source MLOps tooling and Kubernetes ecosystem projects that enable data sovereignty
- Work with clients to deploy, configure, and optimize their sovereign AI infrastructure
- Collaborate with a fully remote distributed team using asynchronous communication methods
What We're Looking For:
- 4+ years of hands-on infrastructure/platform/DevOps experience with production systems
- Strong understanding of infrastructure engineering principles: scalability, reliability, observability, and automation
- Solid experience with Kubernetes in production environments, including troubleshooting and optimization
- Proficiency with Infrastructure-as-Code tooling (Terraform, Helm, or similar) for managing complex deployments
- Experience with at least one major cloud platform (AWS, Azure, GCP) including networking, security, and compute services
- Strong programming skills, particularly in Python and/or Go, with ability to write maintainable infrastructure code
- Experience contributing to technical initiatives or mentoring junior team members
- Understanding of CI/CD practices, GitOps workflows, and infrastructure automation principles
Bonus points for experience with:
- MLOps pipelines and ML infrastructure (model training, serving, monitoring)
- Multiple cloud platforms and their AI/ML services
- On-premises deployment and hybrid cloud environments
- ML/AI ecosystem tools (PyTorch, TensorFlow, scikit-learn, etc.)
- Monitoring and observability tools (Prometheus, Grafana, distributed tracing)
- Data sovereignty, privacy, and security requirements for enterprise AI
- GPU infrastructure and model serving frameworks (KServe, vLLM, LLM-D)
- ML workflow orchestration tools (Kubeflow, MLflow, Airflow, Prefect)
- Service mesh technologies (Istio, Linkerd) and advanced Kubernetes networking
- Open-source contributions to Kubernetes, MLOps, or AI infrastructure projects
- Cost optimization and resource management for ML workloads
- Air-gapped or highly secure deployment environments