We're seeking an experienced Platform Site Reliability Engineer to manage and evolve our AI infrastructure platform. You'll ensure 24/7 stability and security across bare-metal, virtualization, and orchestration layers, deploying and optimizing Kubernetes for AI workloads. This role involves significant automation, incident management, and mentoring, contributing to a scalable and efficient AI ecosystem.
This job is no longer actively hiring.
Platform Site Reliability Engineer at an AI infrastructure platform startup
Are you a seasoned Platform Site Reliability Engineer passionate about AI infrastructure? Join a pioneering platform startup revolutionizing how software connects with hardware for the AI era. You'll be instrumental in running and evolving a globally scaled platform, deploying Kubernetes for AI workloads, and ensuring 24/7 stability and security. This is a chance to make a significant impact, drive automation, and mentor others in a fast-paced, innovative environment.
Role overview
About the company
AI infrastructure platform startup
What you will do
- Deploy and manage Kubernetes clusters at scale, supporting AI-centric workloads across diverse infrastructure.
- Optimize Linux system configurations and build automation scripts for platform lifecycle and incident resolution.
- Apply ITSM frameworks, maintain observability with Prometheus/Grafana, and operate services in 24x7 production environments.
Who this is a fit for
- 5+ years of proven experience in globally scaled, performance-intensive SRE environments with 24/7 support.
- 3+ years of experience running, deploying, and optimizing orchestration platforms, with strong Kubernetes expertise.
- Expert-level Linux administration (especially Ubuntu), system tuning, and strong networking fundamentals.
Why this role is remarkable
- Drive the evolution of cutting-edge AI infrastructure, connecting software and hardware for the AI era.
- Work across bare-metal, virtualization, and large-scale Kubernetes deployments supporting critical AI workloads.
- Make a significant impact on 24/7 operations, automation, and mentorship within a growing, well-funded tech company.