BentoML Model Serving.

Production inference that scales. We deploy, autoscale, A/B test, and cost-tune model-serving infrastructure with BentoML as the runtime and Kubernetes as the substrate.

Scope

What we do

Containerise and deploy models with BentoML.
Configure horizontal autoscaling based on queue depth and latency SLOs.
Run shadow + canary + full-cutover releases behind Envoy.
Tune cost against performance (batch size, GPU fractioning, request routing).
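As a rough illustration of the batch-size side of that tuning, the sketch below picks the batch size with the highest modelled throughput that still fits a latency SLO. It assumes a simple linear latency model (a fixed overhead plus a per-item cost) and ignores queueing time. The function names, parameters, and numbers are hypothetical and are not part of BentoML.

```python
def batch_latency_ms(batch_size, fixed_ms, per_item_ms):
    """Modelled latency of one batch: fixed overhead plus a linear per-item cost."""
    return fixed_ms + per_item_ms * batch_size


def pick_batch_size(candidates, fixed_ms, per_item_ms, slo_ms):
    """Return (batch_size, throughput_rps) with the best throughput inside the SLO.

    Returns None when no candidate meets the SLO.
    """
    best = None
    for size in sorted(candidates):
        latency = batch_latency_ms(size, fixed_ms, per_item_ms)
        if latency > slo_ms:
            continue  # this batch size would breach the latency SLO
        throughput = size / (latency / 1000)  # requests per second
        if best is None or throughput > best[1]:
            best = (size, throughput)
    return best


# Hypothetical figures: 40 ms fixed overhead, 5 ms per item, 150 ms SLO.
choice = pick_batch_size([1, 2, 4, 8, 16, 32], fixed_ms=40, per_item_ms=5, slo_ms=150)
print(choice)
```

In practice the latency curve would come from measurements under load, not from a formula, but the selection logic stays the same.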

Practical

Exercises we run

Small, repeatable drills we use on engagements and teach in workshops. Each has a lab setup, step-by-step outline, and measurable output.

Serving a Whisper speech-to-text model

Containerise, deploy, autoscale; measure p99 latency and cost-per-request under synthetic load.
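The two measurements in this drill can be computed from raw load-test output along these lines. This is a minimal sketch, assuming a nearest-rank percentile and a flat hourly price per replica. The sample latencies, price, and request counts are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def cost_per_request(hourly_instance_cost, replicas, requests_served, duration_s):
    """Spend over the test window divided by the number of requests served."""
    if requests_served <= 0:
        raise ValueError("requests_served must be positive")
    spend = hourly_instance_cost * replicas * (duration_s / 3600)
    return spend / requests_served


# Hypothetical load-test results.
latencies_ms = [120, 135, 150, 160, 180, 210, 250, 400, 900, 1500]
p99 = percentile(latencies_ms, 99)
cost = cost_per_request(hourly_instance_cost=1.20, replicas=2,
                        requests_served=3600, duration_s=1800)
print(f"p99 = {p99} ms, cost per request = {cost:.6f}")
```

With so few samples the p99 is simply the slowest request, so a real run needs enough requests for the tail to be meaningful.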

A/B testing two inference servers behind Envoy

Shadow traffic to a candidate model, promote through canary, full cutover with rollback.
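The release stages in this drill boil down to two knobs: what share of live traffic the candidate serves, and whether requests are mirrored to it. The sketch below models that decision in plain Python with a stable hash of a request ID. In a real setup Envoy makes this decision through its route configuration (weighted clusters and request mirroring), so the stage table, weights, and function names here are illustrative assumptions only.

```python
import hashlib

# Hypothetical stage table: percentage of live traffic served by the candidate,
# and whether a copy of each request is mirrored to it.
STAGES = {
    "shadow": {"candidate_weight": 0, "mirror": True},
    "canary": {"candidate_weight": 10, "mirror": False},
    "cutover": {"candidate_weight": 100, "mirror": False},
    "rollback": {"candidate_weight": 0, "mirror": False},
}


def bucket(request_id):
    """Map a request ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(request_id, stage):
    """Return (serving backend, mirror flag) for one request at a given stage."""
    cfg = STAGES[stage]
    serving = "candidate" if bucket(request_id) < cfg["candidate_weight"] else "stable"
    return serving, cfg["mirror"]


for stage in ("shadow", "canary", "cutover", "rollback"):
    print(stage, route("req-42", stage))
```

Hashing the request ID keeps routing deterministic, so the same ID always lands on the same backend within a stage, which makes comparisons between the two servers easier to reproduce.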

Queue-depth autoscaling

Tune BentoML autoscaler parameters for a bursty request pattern.
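To make the tuning problem concrete, here is a small simulation of a proportional queue-depth rule with a scale-down stabilisation window, similar in spirit to the Kubernetes Horizontal Pod Autoscaler. It is not BentoML's autoscaler. The parameter names (`target_per_replica`, `window`) and the burst pattern are assumptions chosen for illustration.

```python
import math
from collections import deque


def desired_replicas(queue_depth, target_per_replica, min_replicas=1, max_replicas=10):
    """Replicas needed to keep the queue depth per replica at or below the target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def simulate(queue_depths, target_per_replica, window=3, min_replicas=1, max_replicas=10):
    """Scale up at once; scale down only after `window` readings agree on fewer replicas."""
    recent = deque(maxlen=window)  # the last `window` recommendations
    history = []
    for depth in queue_depths:
        recent.append(desired_replicas(depth, target_per_replica, min_replicas, max_replicas))
        history.append(max(recent))  # highest recent recommendation wins
    return history


# Hypothetical bursty pattern: idle, a sharp spike, then idle again.
bursty = [0, 5, 80, 120, 10, 0, 0, 0, 0]
replicas = simulate(bursty, target_per_replica=20, window=3, max_replicas=5)
print(replicas)  # [1, 1, 4, 5, 5, 5, 1, 1, 1]
```

The window is the main lever for bursty traffic: a longer window holds capacity through gaps between bursts at the price of paying for idle replicas.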

References

Three open-source runtimes we reach for

BentoML is our default, but not our only tool. Here's how we decide between it, vLLM, and NVIDIA Triton on real engagements.

Project	Best for	Trade-offs	When we reach for it
BentoML	Python-native service definition; packages arbitrary models (HF Transformers, PyTorch, sklearn, XGBoost) as containers with a first-class deploy story.	One worker loop per model — you size the box to the model. Less specialised than vLLM for LLM batching.	Default choice for multi-framework platforms. CI/CD friendly, easy autoscale on K8s, clean A/B-test harness.
vLLM	LLM inference only. PagedAttention + continuous batching give 2–10× throughput vs naive HF.	LLM-shaped. Not helpful for classification, embeddings, ASR, or vision unless you wrap it yourself.	Any engagement where the primary workload is an open-weight LLM at > 1 req/s. Usually wrapped behind a BentoML shell.
NVIDIA Triton	Multi-framework (TensorRT, ONNX, PyTorch, TF, Python) with model repositories, dynamic batching, and GPU-sharing primitives baked in.	Heavier to operate; C++ server + config-driven deployments. Python UX is less friendly than Bento for day-to-day iteration.	GPU-dense deployments where TensorRT kernels or ensemble scheduling pay for themselves; latency-sensitive serving on NVIDIA hardware.

Hands-on: BentoML on RKE2 — 1-day workshop

Packaged engagement — we scope, build, and hand over with runbooks, against a specific SLA. Add to cart to request delivery; no price is billed up-front.


Neux Ltd

AI Infrastructure · Platform Engineering · London.
Since 2014.

Contact

Also from Neux

neux.ai — AI consultancy

styk.tv — podcast

Legal
