BentoML Model Serving

Platform Engineering · Capability

Production inference that scales. We deploy, autoscale, A/B test, and cost-tune model-serving infrastructure with BentoML as the runtime and Kubernetes as the substrate.

Scope

What we do

  • Containerise and deploy models with BentoML.
  • Configure horizontal autoscaling based on queue depth and latency SLOs.
  • Run shadow + canary + full-cutover releases behind Envoy.
  • Tune cost against performance (batch size, GPU fractioning, request routing).
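
Queue-depth autoscaling usually comes down to a proportional rule like the one the Kubernetes Horizontal Pod Autoscaler applies: desired replicas = ceil(current replicas × observed metric / target metric). The sketch below illustrates that rule with per-replica queue depth as the metric. The function name, the replica bounds, and the 10% tolerance band are assumptions made for illustration; this is not BentoML's autoscaler API and not a configuration taken from an engagement.

```python
import math


def desired_replicas(current_replicas, queue_depth_per_replica, target_queue_depth,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """HPA-style proportional scaling on per-replica queue depth (illustrative)."""
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_queue_depth <= 0:
        raise ValueError("target_queue_depth must be positive")

    # Ratio of observed load to the load we want each replica to carry.
    ratio = queue_depth_per_replica / target_queue_depth

    # Inside the tolerance band, hold steady to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured replica bounds (hypothetical defaults).
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Two replicas, each holding 30 queued requests against a target of 10.
    print(desired_replicas(2, 30, 10))
```

For a bursty pattern, the target depth and the tolerance band are the two values most worth tuning: a low target scales out early at higher cost, and a wide band trades responsiveness for stability.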

Practical

Exercises we run

Small, repeatable drills we use on engagements and teach in workshops. Each has a lab setup, a step-by-step outline, and a measurable output.

  • Serving a Whisper speech-to-text model: containerise, deploy, autoscale; measure p99 latency and cost-per-request under synthetic load.
  • A/B testing two inference servers behind Envoy: shadow traffic to a candidate model, promote through canary, then full cutover with rollback.
  • Queue-depth autoscaling: tune BentoML autoscaler parameters for a bursty request pattern.
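
The two measurable outputs named in the Whisper drill, p99 latency and cost-per-request, can be computed from load-test samples roughly as follows. This is a minimal sketch: it uses the nearest-rank percentile definition (other tools interpolate and may report slightly different values), and the hourly cost and throughput figures are hypothetical inputs, not measured results.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number inputs stay exact.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def cost_per_request(hourly_cost, requests_per_second):
    """Cost of one request at a sustained throughput, in the currency of hourly_cost."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return hourly_cost / (requests_per_second * 3600)


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))        # synthetic latencies: 1..100 ms
    print(percentile(latencies_ms, 99))       # p99 of the synthetic samples
    print(cost_per_request(0.72, 2))          # hypothetical node cost and throughput
```

Sweeping the throughput and instance size through `cost_per_request` gives the points of a cost curve; the p99 at each point shows where the latency SLO stops holding.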

References

Three open-source runtimes we reach for

BentoML is our default, but not our only tool. Here's how we decide between it, vLLM, and NVIDIA Triton on real engagements.

BentoML
  • Best for: Python-native service definition; packages arbitrary models (HF Transformers, PyTorch, sklearn, XGBoost) as containers with a first-class deploy story.
  • Trade-offs: One worker loop per model, so you size the box to the model. Less specialised than vLLM for LLM batching.
  • When we reach for it: Default choice for multi-framework platforms. CI/CD friendly, easy autoscale on K8s, clean A/B-test harness.

vLLM
  • Best for: LLM inference only. PagedAttention + continuous batching give 2–10× throughput vs naive HF.
  • Trade-offs: LLM-shaped. Not helpful for classification, embeddings, ASR, or vision unless you wrap it yourself.
  • When we reach for it: Any engagement where the primary workload is an open-weight LLM at > 1 req/s. Usually wrapped behind a BentoML shell.

NVIDIA Triton
  • Best for: Multi-framework (TensorRT, ONNX, PyTorch, TF, Python) with model repositories, dynamic batching, and GPU-sharing primitives baked in.
  • Trade-offs: Heavier to operate; C++ server + config-driven deployments. Python UX is less friendly than Bento for day-to-day iteration.
  • When we reach for it: GPU-dense deployments where TensorRT kernels or ensemble scheduling pay for themselves; latency-sensitive serving on NVIDIA hardware.

Further reading

More on BentoML.

Workshops we teach + field notes we're writing, all linked back to what you just read.

Workshop

Hands-on: BentoML on RKE2 — 1-day workshop

Containerise a real model, autoscale it behind Envoy, measure p99 latency and cost-per-request. Ship with a runbook.

Scheduling soon.

Field note

Serving a Whisper speech-to-text model

Autoscaling, cost-curve, and handover runbook for a production Whisper deployment on a single-node K3s + L4 GPU.

Draft.

Field note

A/B testing two inference servers

Shadow → canary → cutover with Envoy `weighted_clusters`, NATS shadow bus, and auto-rollback abort conditions.

Draft.
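
The canary stage of a shadow → canary → cutover release can be sketched as a weight schedule plus abort conditions. The example below builds a route fragment in the shape of Envoy's `weighted_clusters` (a `clusters` list of `name` and `weight` entries) and decides the next candidate weight from observed metrics. The cluster names, the step schedule, and the error-rate and p99 thresholds are all hypothetical values chosen for illustration; the NATS shadow bus is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float    # observed p99 latency of the candidate


def weighted_clusters(stable, candidate, candidate_weight):
    """Route fragment splitting traffic between two clusters; weights sum to 100."""
    if not 0 <= candidate_weight <= 100:
        raise ValueError("candidate_weight must be between 0 and 100")
    return {
        "weighted_clusters": {
            "clusters": [
                {"name": stable, "weight": 100 - candidate_weight},
                {"name": candidate, "weight": candidate_weight},
            ]
        }
    }


def next_weight(current, metrics, steps=(5, 25, 50, 100),
                max_error_rate=0.01, max_p99_ms=500.0):
    """Return the next candidate weight, or 0 to roll back when an abort condition trips."""
    if metrics.error_rate > max_error_rate or metrics.p99_latency_ms > max_p99_ms:
        return 0  # abort: send all traffic back to the stable cluster
    for step in steps:
        if step > current:
            return step
    return current  # already at full cutover


if __name__ == "__main__":
    healthy = CanaryMetrics(error_rate=0.001, p99_latency_ms=200.0)
    weight = next_weight(5, healthy)
    print(weighted_clusters("model-stable", "model-candidate", weight))
```

Keeping the abort check ahead of the promotion logic means a breach at any stage, including after full cutover, returns weight 0 and restores the stable cluster.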

Engagement

Hands-on: BentoML on RKE2 — 1-day workshop

Packaged engagement: we scope, build, and hand over with runbooks, against a specific SLA. No price is billed up-front.

Neux Ltd

AI Infrastructure · Platform Engineering · London.
Since 2014.

© 2014–2026 Neux Ltd
Registered in England & Wales.