You will build the foundation model layer that makes manual labeling optional for surgical video. By training self-supervised video encoders and spatiotemporal transformers on raw 2-6 hour surgical footage, you will enable machines to understand tool-tissue interactions and procedure progression, directly shaping the future of autonomous surgical robotics and AI-driven operative reporting.
Foundation Modelling Engineer at Uncovr
Join an EF-backed Paris startup building the foundation model layer for the next generation of surgical robotics. As a Foundation Modelling Engineer, you will tackle the immense challenge of teaching machines to understand complex 2-6 hour surgical procedures from raw video alone, without relying on brittle, manually labeled steps. Working alongside a world-class team from ETH Zurich and Oxford Dynamics, you'll build self-supervised video encoders and spatiotemporal transformers that transform raw footage into structured intelligence. This is a rare opportunity to join a high-growth AI team of fewer than ten people, where your technical decisions will directly shape the future of robotic autonomy.
About the company
Uncovr
EF-backed Paris startup building the intelligence layer for surgical robotics
What you'll do
- Train self-supervised video encoders using masked video modeling and temporal contrastive learning on laparoscopic and robotic video datasets.
- Build spatiotemporal transformers to track tool-tissue interactions and construct scene graphs that describe surgical actions independent of tool brands.
- Develop production-grade training infrastructure and data pipelines that turn massive hospital video streams into scalable, high-throughput training workflows.
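To make the first bullet concrete, masked video modeling can be sketched in plain PyTorch: hide a large fraction of spatiotemporal patch tokens and train a transformer to reconstruct them. The model, names, and dimensions below are illustrative toy choices, not Uncovr's actual architecture.

```python
import torch
import torch.nn as nn

# Toy sketch of masked video modeling: patches stand in for flattened
# spatiotemporal video tokens; masked tokens are replaced by a learned
# mask embedding and the model is trained to reconstruct them.
class TinyMaskedVideoModel(nn.Module):
    def __init__(self, dim=64, patch_dim=3 * 16 * 16, depth=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decode = nn.Linear(dim, patch_dim)

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim); mask: (B, N) bool, True = hidden
        x = self.embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.decode(self.encoder(x))

B, N, P = 2, 32, 3 * 16 * 16
patches = torch.randn(B, N, P)
mask = torch.rand(B, N) < 0.75            # hide ~75% of the tokens
model = TinyMaskedVideoModel()
recon = model(patches, mask)
loss = ((recon - patches)[mask] ** 2).mean()  # reconstruct masked tokens only
loss.backward()
```

At surgical scale the same idea applies to tube-masked clips of real footage; the reconstruction target and masking ratio are the main design knobs.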
Who you are
- 5-8 years of hands-on experience training large-scale vision or video models, with deep expertise in representation learning and self-supervised training.
- Proficiency in optimizing distributed PyTorch training for long-context temporal modeling and handling multi-hour video sequences.
- A proven builder mindset with the ability to bridge high-level research and production-grade engineering, ideally supported by publications at CVPR, NeurIPS, or ICLR.
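On the long-context point above: a standard memory-saving trick when backpropagating through multi-hour frame sequences is activation (gradient) checkpointing. The sketch below uses `torch.utils.checkpoint.checkpoint_sequential`; batch, sequence, and layer sizes are illustrative only.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative sketch: recompute intermediate activations during the
# backward pass instead of storing them, trading compute for memory so
# long frame-token sequences fit on a single device.
B, T, D = 1, 512, 64                       # toy batch / sequence / embed sizes
frames = torch.randn(B, T, D, requires_grad=True)

blocks = nn.Sequential(*[
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
    for _ in range(4)
])

# Split the stack into 4 segments; each segment's activations are
# rebuilt on the fly when gradients flow back through it.
out = checkpoint_sequential(blocks, 4, frames, use_reentrant=False)
out.mean().backward()
```

In a distributed setup the same blocks would typically be wrapped in DDP or FSDP; checkpointing composes with either.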
Why this role
- Work with a world-class team from ETH Zurich, ESA, and Oxford Dynamics, advised by Prof. Dan Stoyanov, a global leader in surgical computer vision.
- Tackle one of AI's hardest frontiers: training large-scale foundation models on complex, long-context video data where labels are traditionally scarce and brittle.
- Join an early-stage startup as one of the first ten employees, gaining meaningful equity and the autonomy to design technical architectures from scratch.