Full-Time Senior Machine Learning Infra Engineer
Waabi is hiring a remote Full-Time Senior Machine Learning Infra Engineer. The listed career level for this opening is Senior Manager, and remote applicants based in the USA or Canada are accepted. Read the complete job description before applying.
Job Details
Waabi, founded by AI pioneer and visionary Raquel Urtasun, is an AI company building the next generation of self-driving technology. With a world-class team and an innovative approach that unleashes the power of AI to “drive” safely in the real world, Waabi is bringing the promise of self-driving closer to commercialization than ever before. Waabi is backed by best-in-class investors across technology, logistics, and the Canadian innovation ecosystem.
With offices in Toronto, San Francisco, and Dallas, Waabi is growing quickly and looking for diverse, innovative and collaborative candidates who want to impact the world in a positive way. To learn more visit: www.waabi.ai
You will...
- Work alongside a team of multidisciplinary Engineers and Research Scientists using an AI-first approach to enable safe self-driving at scale.
- Collaborate with cross-functional teams across the company to understand growing needs and pain points in cloud usage.
- Propose cloud strategies around compute and data usage for training and simulation workloads.
- Design and implement scalable and resilient cloud infrastructure optimized for long term reliability and adaptability.
- Devise and promote best practices for cloud usage in training and simulation environments, and oversee cloud strategy and usage across the whole company.
Qualifications:
- BS, MS, or PhD in Computer Science or a similar technical field of study, or equivalent practical experience.
- 5+ years of relevant industry experience.
- Experience reading and developing production-quality software.
- Deep understanding of cloud compute and data storage for distributed training and inference workloads.
- Familiarity with the Python, Go, Rust, or C++ ecosystems.
- Experience working with public cloud platforms (AWS preferred).
- Experience with infrastructure as code systems (Terraform preferred).
- Experience in job scheduling and resource allocation.
- Experience with containers and container orchestration (e.g., Docker, ECS, Kubernetes).
- Experience and high level of comfort working with Linux systems.
- Experience with building platform services that enable other teams to do their best work.
- Open-minded and collaborative team player with the willingness to help others.
- Passionate about self-driving technologies, solving hard problems, and creating innovative solutions.
- Experience working in an Agile/Scrum environment.
Bonus/nice to have:
- Experience with on-premises servers, network equipment, and scale-out storage systems.
- Experience with CI/CD pipelines and release management.
- Experience with common ML tools, workflows, and frameworks (e.g., Kubeflow or MLflow).
- Understanding of system performance tuning at the software, hardware, and network levels.
- Good understanding of GPUs and accelerators for ML training and inference use cases.