Full-Time Senior Machine Learning Engineer
Luma AI is hiring a remote, full-time Senior Machine Learning Engineer. The career level for this opening is Experienced, and the role accepts remote applicants based in the USA. Read the complete job description before applying.
Job Details
Luma's mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable, and useful systems, the next step-function change will come from vision. So we are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.
We are looking for engineers with significant experience designing & maintaining highly efficient systems and code that can be optimized to run on multiple hardware platforms, bringing our state-of-the-art models to as many people as possible at the best performance per dollar.
Responsibilities
- Ensure efficient implementation of models & systems with a focus on designing, maintaining, and writing abstractions that scale beyond NVIDIA/CUDA hardware.
- Identify and remedy efficiency bottlenecks (memory, speed, utilization, communication) by profiling and implementing high-performance PyTorch code, deferring to Triton or similar kernel-level languages as necessary.
- Benchmark our products across a variety of hardware & software to help the product team understand the optimal tradeoffs between latency, throughput, and cost at various degrees of parallelism.
- Work together with our partners to help them identify bottlenecks and push forward new iterations of hardware and software.
- Work closely together with the rest of the research team to ensure systems are planned to be as efficient as possible from start to finish and raise potential issues for hardware integration.
Must have experience
- Optimizing for memory, latency, and throughput in PyTorch.
Bonus: experience with
- non-NVIDIA systems
- torch.compile / torch.XLA
- benchmarking and profiling GPU & CPU code in PyTorch for optimal device utilization (examples: torch profiler, memory profilers, trace viewers, custom tooling)
- building tools & abstractions to ensure models run optimally on different hardware and software stacks
- transformer models and attention implementations
- parallel inference, particularly with tensor parallelism and pipeline parallelism
Good to have experience
- high-performance Triton/CUDA and writing custom PyTorch kernels and ops
- writing fused kernels for common hot paths, understanding when to make use of lower level features like tensor cores or warp intrinsics, and understanding where these tools can be most impactful
- writing high-performance parallel C++. Bonus if done within an ML context with PyTorch, such as data loading, data processing, or inference code
- building inference / demo prototype code (incl. Gradio, Docker etc.)