Role overview

You will own the GPU infrastructure powering foundation model training for structured data, managing tens of millions in compute spend. As the team moves from Slurm on GCP to multi-provider environments, you'll optimize cluster architecture, scheduling, and distributed training performance. This is a high-leverage role working directly with world-class researchers to build the next generation of AI systems.

Prior Labs GmbH

Prior Labs builds multimodal tabular foundation models (TFMs), starting with TabPFN, designed to understand tables natively and perform statistical reasoning directly from data. The company says its broader vision is to create agentic AI systems that can understand high-level goals, combine tables, language, and images, reason across modalities, integrate domain knowledge, infer causality, and adapt dynamically.

What you will do

Design and evolve multi-cluster GPU infrastructure, moving beyond current Slurm/GCP setups to incorporate multi-provider orchestration and the latest hardware generations.
Drive maximum training efficiency by profiling distributed training runs, debugging systems-level bottlenecks, and optimizing GPU utilization to minimize cost-per-FLOP.
Build the internal developer platform, including experiment tracking, CI pipelines, and model registries, to ensure the research team maintains high iteration speeds.

Who this is a fit for

Possesses 5+ years of experience operating production-scale GPU infrastructure or distributed training systems at a major AI lab, well-funded startup, or HPC environment.
Demonstrates deep expertise in Slurm and systems-level thinking, with a proven ability to profile PyTorch internals and identify hardware-level bottlenecks such as memory bandwidth or communication latency.
Exhibits a track record of managing significant compute budgets and making high-stakes infrastructure calls that measurably improve training performance or cost efficiency.

Why this role is remarkable

Extreme Compute Ownership: You will manage a GPU budget in the tens of millions, making critical architectural decisions on hardware, scheduling, and provider strategy where a single optimization can save six figures.
State-of-the-Art Research: Work at the frontier of AI by building the infrastructure for foundation models specifically designed for structured data, operating in a lean, high-talent environment without corporate overhead.
Architectural Freedom: Lead the evolution from a single Slurm/GCP cluster to a multi-provider, multi-cluster infrastructure, evaluating new hardware generations as they come online to maximize training throughput.

How Jack & Jill work together

I get to know what you’re great at, then find roles you’d never find yourself.
Ok, I'll go first. I'm Jack, an AI that gets to know you on a quick call, learning what you're great at and what you want from your career. Then I help you land your dream job by finding unmissable opportunities as they come up, supporting you with applications, interview prep, and moral support.

I recruit from Jack’s network and make the intro when I spot a great match.
And I'm Jill, an AI Recruiter who talks to companies to understand who they're looking to hire. Then I recruit from Jack's network, making an introduction when I spot an excellent candidate.

Senior ML Infrastructure Engineer at Prior Labs GmbH