Full-Time Training Infra Engineer
Cohere is hiring a remote, full-time Training Infra Engineer. This is an expert-level role open to remote applicants worldwide. Read the complete job description before applying.
Cohere
Job Details
Who are we? Our mission is to scale intelligence to serve humanity. We’re training and deploying frontier models for developers and enterprises building AI systems. We believe our work is instrumental to the widespread adoption of AI. We obsess over what we build. Each of us contributes to increasing the capabilities of our models and the value they drive for our customers. We like to work hard and move fast.
Why this role? Contribute to and support model training pipelines, ship state-of-the-art models to production, and bridge the gap between research and production. We have one of the highest ratios of compute to engineers in the world. Everyone will contribute to writing production code and supporting our research effort, depending on interest and needs.
As a Member of Technical Staff, you will:
- Design and write high-performance, scalable software for training.
- Improve our training setup from an infrastructure and codebase performance standpoint.
- Craft and implement tools to speed up training cycles.
- Research, implement, and experiment with ideas on supercompute and data infrastructure.
- Learn from and work with the best researchers in the field.
You may be a good fit if you have:
- Extremely strong software engineering skills.
- Proficiency in Python and related ML frameworks (JAX, PyTorch, XLA/MLIR).
- Experience with distributed training infrastructure (Kubernetes, Slurm) and associated frameworks (Ray).
- Experience using large-scale distributed training strategies.
- Hands-on experience training large models at scale and contributing to training infrastructure tooling.
- Bonus: Publications at top-tier venues (NeurIPS, ICML, etc.).