
Staff Reliability Engineer | Systems Core

Luma AI
Full-time
On-site
Palo Alto

The Opportunity

Luma AI operates as a full-stack AI lab, unifying research, product, and engineering to build systems that see and understand the world. Our mission requires infrastructure that is both massive in scale and exceptionally reliable. We offer the rare combination of deep financial resources and a high-agency environment, allowing our core systems team to make architectural decisions that directly impact the future of generative media.

Where You Come In

You will be a foundational member of the team responsible for the operating system of our intelligence. This role moves beyond standard site reliability; you will act as a systems engineer who ensures our training and inference clusters operate at peak performance. You will live in the terminal, troubleshooting the complex interactions between container orchestration and bare-metal hardware.

What You Will Build

Kubernetes from Scratch: Architect and manage robust Kubernetes control planes and node components on bare metal, moving beyond the limitations of managed cloud offerings.

Kernel-Level Optimization: Dive deep into the Linux OS to tune performance, managing resource isolation and debugging complex system calls to squeeze every drop of compute from our fleet.

Resilient Automation: Write high-quality code in Python or Go to automate the lifecycle of thousands of servers, turning manual operations into self-healing software systems.

The Profile We Are Looking For

Linux Mastery: You view the operating system as a tool to be mastered, possessing deep, hands-on fluency with the command line and kernel internals.

Tenacious Troubleshooter: You thrive on solving the hardest problems in the stack, tracking down obscure bugs that manifest only at massive scale.

Builder DNA: You prefer creating your own tooling and automation over integrating off-the-shelf solutions, driven by a desire for precision and efficiency.
