Role details
- Location: Paris
- Employment Type: Full-time
- Location Type: Hybrid
- Department: Research team
- Compensation: $100K – $150K • Offers Equity • Relocation package
What you'll do
- Own inference infrastructure end-to-end: optimize latency, throughput, and cost across our model fleet.
- Build and scale model serving with TensorZero, vLLM/SGLang/TensorRT, and Kubernetes.
- Design and maintain vector search pipelines on top of vector stores (see the vector-search sketch after this list).
- Define service-health KPIs and track them against SLAs, drawing on support metrics such as first-contact resolution (FCR) and deflection.
- Turn research into product: take experimental models from the research team, assess what's production-ready, and ship it, from output formatting and sampling parameters to deployment (see the serving sketch after this list).
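For a concrete flavor of that last bullet, here is a minimal sketch of offline generation with vLLM from the stack below. The model name and sampling parameters are illustrative placeholders, not our production configuration.

```python
# Minimal vLLM sketch: load a model and generate with explicit sampling
# parameters. The model name is a placeholder, not what we actually serve.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # placeholder model
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

outputs = llm.generate(["Summarize paged attention in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```

vLLM also ships an OpenAI-compatible server for the deployment side; tuning sampling defaults like these is part of what "shipping a model" means here.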
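And a sketch of the vector-search side. The posting doesn't name a specific vector store, so FAISS stands in here as an assumption; swap in whichever store the pipeline actually uses.

```python
# Toy vector-search pipeline: index normalized embeddings, then query by
# cosine similarity (inner product on unit vectors). Sizes are made up.
import faiss
import numpy as np

dim = 768  # placeholder embedding size
index = faiss.IndexFlatIP(dim)

docs = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(docs)  # unit vectors => inner product == cosine similarity
index.add(docs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=5)  # top-5 nearest documents
print(ids[0], scores[0])
```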
Who you are
- 3+ years shipping high-performance ML systems in production, not just training notebooks
- Deep hands-on experience with inference optimization: you've debugged latency spikes and know the difference between theoretical and real-world throughput
- Comfortable across the stack: from CUDA kernels to Kubernetes manifests to Grafana dashboards
- A big plus: experience with Rust, custom Triton kernels, and performance benchmarking (see the kernel sketch below)
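To calibrate the Triton bonus point, here is a toy elementwise-add kernel in the style of the Triton tutorials; shapes and block size are illustrative only.

```python
# Toy Triton kernel: elementwise add over a 1D tensor, one program per block.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # which block this program owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```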
Tech stack
- TensorZero
- vLLM/SGLang/TensorRT
- Kubernetes
- CUDA
- Rust (bonus)
- Custom Triton kernels (bonus)
- Grafana dashboards
Benefits & perks
- Salary of $100,000 to $150,000 + equity
- 20 days of paid vacation
- Work from Paris (hybrid) + relocation package
- Top-tier private medical insurance in France
- All the hardware, tools, and services you need
- Covered subscriptions for AI agents and IDEs
- Team off-sites twice a year, in the Alps and Saint-Tropez
Ready to apply for this role?
Apply Now →