BentoML Model Serving

SLA: p99 latency < 200 ms per inference · autoscales 0–50 replicas · cost-performance tuning

Category:

Description

Production inference for custom and open-weight models. Models are containerised with BentoML, autoscaled behind Envoy, and tuned for cost and latency against observed metrics. A typical engagement covers design and pilot deployment of a single model, including autoscaler tuning and SLO baselining.
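
As a rough illustration of what SLO baselining and autoscaler tuning involve, the sketch below checks a batch of latency samples against the 200 ms p99 target and clamps a replica estimate to the 0–50 range. It is a minimal sketch in plain Python, not the engagement's actual tooling: the function names, the nearest-rank percentile method, and the concurrency-based sizing rule (`target_per_replica`) are all assumptions made for illustration.

```python
import math

SLO_P99_MS = 200.0                   # p99 target from the SLA line
MIN_REPLICAS, MAX_REPLICAS = 0, 50   # replica range from the SLA line


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies (ms)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # ceil(0.99 * n) using integer arithmetic to avoid float rounding
    rank = (99 * n + 99) // 100
    return ordered[rank - 1]


def meets_slo(latencies_ms, slo_ms=SLO_P99_MS):
    """True when the observed p99 is strictly below the SLO threshold."""
    return p99(latencies_ms) < slo_ms


def desired_replicas(in_flight_requests, target_per_replica):
    """Hypothetical sizing rule: one replica per `target_per_replica`
    concurrent requests, clamped to the allowed replica range."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(in_flight_requests / target_per_replica)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))
```

With no traffic, `desired_replicas(0, 4)` returns 0, matching scale-to-zero. A burst of 1000 in-flight requests at 4 per replica would call for 250 replicas and is capped at 50. In practice the per-replica target would be chosen from the baselined latency curve rather than fixed up front.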