BentoML Model Serving

SLA: p99 latency < 200 ms per inference · autoscales 0–50 replicas · cost-performance tuning

Category: Assignments

Description

Production inference for custom and open-weight models. Each model is containerised with BentoML, autoscaled behind Envoy, and tuned for cost and latency against observed metrics. A typical engagement covers design and a pilot deployment for a single model, including autoscaler tuning and SLO baselining.
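SLO baselining and autoscaler bounds can be checked mechanically against the SLA above. The sketch below is illustrative only. It assumes per-request latencies have already been collected in milliseconds and uses a simple concurrency-based scaling rule. The function names and the `target_per_replica` parameter are hypothetical and are not BentoML or Envoy APIs.

```python
from typing import Sequence

SLO_P99_MS = 200.0  # SLA target: p99 latency < 200 ms per inference
MIN_REPLICAS = 0    # SLA lower bound (scale to zero when idle)
MAX_REPLICAS = 50   # SLA upper bound


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of observed per-request latencies."""
    n = len(latencies_ms)
    if n == 0:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (99 * n + 99) // 100  # integer ceil(0.99 * n), 1-based rank
    return ordered[rank - 1]


def meets_slo(latencies_ms: Sequence[float]) -> bool:
    """True if the observed p99 is strictly below the SLA target."""
    return p99(latencies_ms) < SLO_P99_MS


def desired_replicas(in_flight: int, target_per_replica: int) -> int:
    """Hypothetical policy: one replica per `target_per_replica` in-flight
    requests, clamped to the SLA's 0-50 replica range."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = (in_flight + target_per_replica - 1) // target_per_replica  # ceil
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))
```

For example, 98 requests at 10 ms plus two at 500 ms give a p99 of 500 ms, which fails the 200 ms target. That is why a baseline needs enough samples for the tail to be visible. A production autoscaler would also need smoothing or cooldowns to avoid flapping. This sketch does not include them.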

Related products

Kubernetes Control Planes

SeaweedFS Storage

NATS Messaging Fabric