We’re building the first prototype node of a breakthrough compute project and need a hands-on leader to own the software stack and distributed systems architecture. You’ll take responsibility for the full software environment, from OS bring-up and containerized workload integration to orchestration, telemetry, and optimization. You’ll work from first principles to design, implement, and optimize the system for performance, reliability, and scalability, then guide the transition from a single-node prototype to a deployed, managed, production-scale fleet.
Own software architecture, stack selection and system integration from initial prototype to scaled deployment
Bring up, configure, and optimize high-performance compute nodes: OS, drivers, container runtime and workload orchestration
Integrate with third-party orchestration or workload distribution platforms
Benchmark and optimize workload performance, memory usage, network throughput, and cost efficiency
Design and implement management tooling for the prototype node and the fleet: monitoring, telemetry, automation, performance optimization, and fault recovery
Document system architecture, integration specifications, operational procedures, and test results
Partner with cross-functional teams to define and deliver software that meets performance, cost, reliability, and schedule requirements at production scale
10+ years in systems, platform, or distributed software engineering, including shipping production systems
Proven ability to apply first-principles systems thinking to new domains and architectures
Hands-on experience with Linux systems engineering: OS bring-up, driver integration, and system optimization
Strong containerization background (Docker, OCI images, GPU-aware runtimes)
Experience integrating with or building distributed workload orchestration systems
Strong understanding of cloud compute stack architectures, especially for AI/LLM workloads, cloud rendering, and cloud gaming
Skilled in performance analysis, instrumentation, and optimization across compute, memory, and network
Experience working in a start-up environment: comfortable wearing multiple hats, with a proven ability to quickly learn new technologies and disciplines
Development expertise across a variety of platforms, languages, and frameworks (Bash, Python, JavaScript, PyTorch, TensorFlow)
Technical knowledge of high-performance or GPU-accelerated compute systems
Experience working with multi-GPU systems
Experience designing telemetry and over-the-air update systems for remote nodes