Full-Time MLOps Engineer
Cognigy is hiring a remote Full-Time MLOps Engineer. The career level for this opening is Experienced, and applications are accepted from candidates based in Germany (on-site in Düsseldorf or remote). Read the complete job description before applying.
Job Details
About Cognigy
Cognigy is transforming the customer service industry with the most advanced AI Agent platform for enterprise contact centers. Its award-winning solution, Cognigy.AI, empowers enterprises to deliver instant, hyper-personalized, multilingual service on any channel. By integrating Generative and Conversational AI to create Agentic AI, Cognigy delivers AI Agents that redefine customer experiences, drive satisfaction, and support contact center employees in real time.
Your new role – MLOps Engineer
Location: On-site in Düsseldorf or remote in Germany
We are looking for a skilled and ambitious MLOps Engineer to join our Engineering team and take ownership of building and operating scalable, secure infrastructure for Large Language Models (LLMs). You will support our Machine Learning, Product, and SRE teams in deploying and maintaining production-grade AI workloads on Kubernetes using cutting-edge technologies like KubeRay.
You’ll help ensure optimal performance, reliability, observability, and cost-efficiency of Cognigy’s AI infrastructure, automating processes and championing modern MLOps best practices.
Your responsibilities will include
- Build & Operate LLM Infrastructure – Design and maintain scalable LLM-serving systems using Kubernetes and KubeRay.
- Automate & Optimize – Automate deployments, rollbacks, and scaling of LLMs while optimizing resource usage and performance.
- Enhance Observability – Ensure robust monitoring, logging, and alerting for LLM operations (Prometheus, Grafana, etc.).
- Support AI Teams – Empower ML and product engineers with self-service pipelines and scalable infrastructure.
- Prioritize Security – Enforce secure deployments, compliance practices, and robust incident response strategies.
- Improve Documentation – Create and maintain technical documentation to streamline knowledge sharing and onboarding.
- Drive Innovation – Evaluate, adopt, and integrate the latest MLOps and LLM-serving technologies.
- Reduce SRE Toil – Eliminate repetitive tasks and improve operational efficiency across the platform.
Requirements
- Hands-on experience running production ML or LLM workloads in Kubernetes
- Familiarity with distributed ML frameworks such as KubeRay, Ray Serve, or similar
- Deep understanding of Kubernetes internals, especially GPU scheduling, autoscaling, and multi-tenant environments
- Proficiency with CI/CD systems for ML models and versioned deployment strategies
- Strong experience with cloud platforms (AWS, GCP, or Azure), networking, and security best practices
- Skilled in monitoring and observability for ML workloads (e.g., Prometheus, Grafana)
- Passion for automation, performance tuning, and cost optimization for LLM workloads
- Clear communicator and proactive team player who thrives in fast-paced, cross-functional environments
- MLOps or DevOps certifications (nice to have)