Platform Engineering · Capability
BentoML Model Serving.
Production inference that scales. We deploy, autoscale, A/B test, and cost-tune model-serving infrastructure with BentoML as the runtime and Kubernetes as the substrate.
Scope
What we do
- Containerise and deploy models with BentoML.
- Configure horizontal autoscaling based on queue depth and latency SLOs.
- Run shadow + canary + full-cutover releases behind Envoy.
- Tune cost-performance (batch size, GPU fractioning, request routing).
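To make the autoscaling bullet concrete, here is a minimal sketch of a replica calculation driven by both queue depth and a latency SLO. It borrows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target)) and takes whichever signal asks for more capacity. The function name, parameters, and default bounds are assumptions for illustration, not BentoML or Kubernetes APIs, and treating p99 latency as proportional to load is a simplification.

```python
import math


def desired_replicas(
    current_replicas: int,
    queue_depth_per_replica: float,
    target_queue_depth: float,
    p99_latency_ms: float,
    slo_p99_ms: float,
    min_replicas: int = 1,   # hypothetical lower bound
    max_replicas: int = 20,  # hypothetical upper bound
) -> int:
    """Return a replica count using an HPA-style proportional rule."""
    # How far each signal is from its target (1.0 means exactly on target).
    queue_ratio = queue_depth_per_replica / target_queue_depth
    latency_ratio = p99_latency_ms / slo_p99_ms

    # Scale on whichever signal is furthest over its target.
    ratio = max(queue_ratio, latency_ratio)
    wanted = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Queue is 1.5x over target, latency is within SLO: scale on the queue.
    print(desired_replicas(4, 12, 8, 300, 400))
```

Taking the maximum of the two ratios means a latency breach can trigger scale-out even when queues look healthy, at the cost of scaling in more slowly.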
Practical
Exercises we run
Small, repeatable drills we use on engagements and teach in workshops. Each has a lab setup, step-by-step outline, and measurable output.
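As an example of a measurable output, the p99 latency and cost-per-request figures our workshop reports could be computed roughly as follows. This is a sketch under stated assumptions: it uses the nearest-rank percentile method, a flat hourly price per replica, and steady throughput. The helper names and inputs are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of latency samples (pct is an integer 1..100)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_request(hourly_cost_per_replica: float, replicas: int, requests_per_second: float) -> float:
    """Serving cost divided by the requests handled in one hour."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    hourly_cost = hourly_cost_per_replica * replicas
    requests_per_hour = requests_per_second * 3600
    return hourly_cost / requests_per_hour


if __name__ == "__main__":
    latencies = [float(ms) for ms in range(1, 101)]  # synthetic samples: 1..100 ms
    print("p99 (ms):", percentile(latencies, 99))
    print("cost per request:", cost_per_request(0.80, 2, 50.0))  # made-up prices
```

Recomputing both numbers after each tuning change (batch size, replica count) gives the cost curve for a deployment.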
References
Three open-source runtimes we reach for
BentoML is our default, but not our only tool. Here's how we decide between it, vLLM, and NVIDIA Triton on real engagements.
| Project | Best for | Trade-offs | When we reach for it |
|---|---|---|---|
| BentoML | Python-native service definition; packages arbitrary models (HF Transformers, PyTorch, sklearn, XGBoost) as containers with a first-class deploy story. | One worker loop per model — you size the box to the model. Less specialised than vLLM for LLM batching. | Default choice for multi-framework platforms. CI/CD friendly, easy autoscale on K8s, clean A/B-test harness. |
| vLLM | LLM inference only. PagedAttention + continuous batching give 2–10× throughput vs naive HF. | LLM-shaped. Not helpful for classification, embeddings, ASR, or vision unless you wrap it yourself. | Any engagement where the primary workload is an open-weight LLM at > 1 req/s. Usually wrapped behind a BentoML shell. |
| NVIDIA Triton | Multi-framework (TensorRT, ONNX, PyTorch, TF, Python) with model repositories, dynamic batching, and GPU-sharing primitives baked in. | Heavier to operate; C++ server + config-driven deployments. Python UX is less friendly than Bento for day-to-day iteration. | GPU-dense deployments where TensorRT kernels or ensemble scheduling pay for themselves; latency-sensitive serving on NVIDIA hardware. |
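The "When we reach for it" column can be read as a rough decision rule. The sketch below encodes it as a small function. The `Workload` fields and the order of the checks are our simplification for illustration, and real engagements weigh more factors than this.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of a serving workload."""
    is_open_weight_llm: bool
    requests_per_second: float
    needs_tensorrt_or_ensembles: bool = False


def pick_runtime(w: Workload) -> str:
    """Pick a starting-point runtime from the table's heuristics."""
    # Open-weight LLM traffic above roughly 1 req/s: vLLM, usually wrapped in BentoML.
    if w.is_open_weight_llm and w.requests_per_second > 1:
        return "vLLM (usually behind a BentoML shell)"
    # GPU-dense serving where TensorRT kernels or ensemble scheduling pay off.
    if w.needs_tensorrt_or_ensembles:
        return "NVIDIA Triton"
    # Multi-framework default.
    return "BentoML"


if __name__ == "__main__":
    print(pick_runtime(Workload(is_open_weight_llm=True, requests_per_second=5)))
    print(pick_runtime(Workload(is_open_weight_llm=False, requests_per_second=0.2)))
```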
Further reading
More on BentoML.
Workshops we teach + field notes we're writing, all linked back to what you just read.
Hands-on: BentoML on RKE2 — 1-day workshop
Containerise a real model, autoscale it behind Envoy, measure p99 latency and cost-per-request. Ship with a runbook.
Scheduling soon →
Serving a Whisper speech-to-text model
Autoscaling, cost-curve, and handover runbook for a production Whisper deployment on a single-node K3s + L4 GPU.
Draft →
A/B testing two inference servers
Shadow → canary → cutover with Envoy `weighted_clusters`, NATS shadow bus, and auto-rollback abort conditions.
Draft →
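A minimal sketch of the canary stage of a shadow → canary → cutover release: one helper builds the `weighted_clusters` fragment of an Envoy route as a plain dictionary, and another decides the next canary weight or aborts to 0% when an abort condition trips. The ramp steps, thresholds, and cluster names are assumptions for illustration. The NATS shadow bus and the mechanism that pushes config to Envoy are out of scope here.

```python
# Hypothetical ramp: percentage of traffic sent to the canary at each step.
CANARY_STEPS = (1, 5, 25, 50, 100)


def weighted_clusters(stable: str, canary: str, canary_pct: int) -> dict:
    """Build the weighted_clusters part of an Envoy route action as a dict."""
    if not 0 <= canary_pct <= 100:
        raise ValueError("canary_pct must be between 0 and 100")
    return {
        "weighted_clusters": {
            "clusters": [
                {"name": stable, "weight": 100 - canary_pct},
                {"name": canary, "weight": canary_pct},
            ]
        }
    }


def next_canary_pct(
    current_pct: int,
    error_rate: float,
    p99_ms: float,
    max_error_rate: float = 0.01,  # assumed abort threshold
    slo_p99_ms: float = 400.0,     # assumed latency SLO
) -> int:
    """Return the next canary weight, or 0 to roll back."""
    # Abort conditions: any breach sends all traffic back to stable.
    if error_rate > max_error_rate or p99_ms > slo_p99_ms:
        return 0
    # Otherwise advance to the next step in the ramp.
    for step in CANARY_STEPS:
        if step > current_pct:
            return step
    return 100  # already at full cutover


if __name__ == "__main__":
    pct = next_canary_pct(current_pct=5, error_rate=0.001, p99_ms=210.0)
    print(weighted_clusters("model-v1", "model-v2", pct))
```

Keeping the rollback decision as a pure function of observed metrics makes the abort conditions easy to unit-test before they guard real traffic.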
Engagement
Hands-on: BentoML on RKE2 — 1-day workshop
Packaged engagement: we scope, build, and hand over with runbooks, against a specific SLA. Add to cart to request delivery; nothing is billed up front.
Neux Ltd
AI Infrastructure · Platform Engineering · London.
Since 2014.
© 2014–2026 Neux Ltd
Registered in England & Wales.