TTS & Voice Cloning

Platform Engineering · Capability

Cost-governed speech synthesis on Azure A100 serverless. We run five frontier TTS engines plus a voice-cloning workflow through a single queue-driven pipeline, with $0 idle cost between batches and manifest-driven provenance on every render.

Scope

What we do

  • Stand up the on-demand A100 batch pipeline (Azure Container Apps job + storage queue + blob sink + managed identity).
  • Dispatch the five engines we keep current: Parler-TTS (prompt-based), CosyVoice 2 (multilingual cloning), F5-TTS (zero-shot cloning), Dia (multi-speaker dialog), StyleTTS 2 (style-reference).
  • Design the voice-cloning workflow: reference capture, consent tracking, style pool, output receipts.
  • Wire the output store so every render has a deterministic filename, a sidecar receipt with timings, and a per-run manifest + log.
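As a rough illustration of the last two bullets, the sketch below derives a deterministic filename from the render inputs and builds a sidecar receipt that carries consent and reference provenance. The filename layout, the truncated hash, and the field names (`voice_ref_sha256`, `consent_id`, `timings_ms`) are assumptions for illustration, not the pipeline's actual schema.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RenderRequest:
    """Inputs that fully determine a render (hypothetical field names)."""
    engine: str                  # e.g. "f5-tts"
    text: str
    voice_ref_sha256: str = ""   # hash of the reference clip, empty if none
    consent_id: str = ""         # pointer to the speaker's consent record


def render_filename(req: RenderRequest, ext: str = "wav") -> str:
    """Same inputs always yield the same name, so a re-run overwrites instead of duplicating."""
    payload = json.dumps(asdict(req), sort_keys=True, ensure_ascii=False).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"{req.engine}__{digest}.{ext}"


def build_receipt(req: RenderRequest, timings_ms: dict) -> dict:
    """Sidecar receipt: what was rendered, from which reference, under which consent."""
    return {
        "output": render_filename(req),
        "engine": req.engine,
        "text_sha256": hashlib.sha256(req.text.encode("utf-8")).hexdigest(),
        "voice_ref_sha256": req.voice_ref_sha256 or None,
        "consent_id": req.consent_id or None,
        "timings_ms": timings_ms,
    }
```

Hashing the full request rather than the text alone means two renders of the same script with different engines or reference clips never collide, while an identical retry maps onto the same blob name.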

Practical

Exercises we run

Small, repeatable drills we use on engagements and teach in workshops. Each has a lab setup, step-by-step outline, and measurable output.

  • On-demand batch in 90 minutes: Stand up the Azure Container Apps job, queue, blob sink, and managed identity; drain a small batch end-to-end with the stub renderer; then swap in one real engine.
  • Voice clone + consent flow: Capture a short reference from a consenting speaker; produce a clone via CosyVoice 2 or F5-TTS; record the consent and reference provenance in the receipt.
  • Cost-model a 10k-render batch: Project runtime and cost across the five engines using `071.materialize.plan.sh`; tune batch shape so the A100 replica drains continuously with no empty polls.
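The arithmetic behind the cost-modelling drill can be sketched in a few lines. This is a back-of-envelope model, not the logic of `071.materialize.plan.sh` (whose internals are not shown here). The `realtime_factor`, average clip length, hourly rate, and cold-start figure are all placeholders to be replaced with measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineProfile:
    name: str
    realtime_factor: float  # GPU seconds per second of audio; measure this, do not guess


def project_batch(
    renders: int,
    avg_audio_s: float,
    profile: EngineProfile,
    usd_per_gpu_hour: float,
    cold_start_s: float = 0.0,
) -> dict:
    """Project GPU hours and cost for one engine, assuming one replica draining continuously."""
    compute_s = renders * avg_audio_s * profile.realtime_factor + cold_start_s
    gpu_hours = compute_s / 3600.0
    return {
        "engine": profile.name,
        "gpu_hours": gpu_hours,
        "cost_usd": gpu_hours * usd_per_gpu_hour,
    }


if __name__ == "__main__":
    # Placeholder inputs only: substitute measured factors and your actual rate.
    plan = project_batch(10_000, 6.0, EngineProfile("example-engine", 0.5), 4.0, cold_start_s=120.0)
    print(plan)
```

The model assumes the replica never polls an empty queue; any idle polling time would be added to `compute_s` before converting to hours.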

Engine dispatch

Five TTS engines we keep current, and when we reach for each

The pipeline dispatches on a `model` field in the queue message, so the engine is chosen per render, not per cluster. Each engine is a different answer to the same question: how natural, how controllable, how portable.

Parler-TTS
  • Best for: Prompt-based voice control. You describe the voice in natural language (e.g. "british-rp-woman, whispered") and the model produces an appropriate render.
  • Inputs it wants: Text + a voice-description prompt. No reference audio required.
  • When we reach for it: Broad editorial copy where the character matters more than voice identity; batch promotional output with many stylistic variants.

CosyVoice 2
  • Best for: High-fidelity multilingual voice cloning from a short reference clip; strong across English / Chinese / Japanese / French / Spanish.
  • Inputs it wants: Text + a reference audio of the target speaker (5–30 seconds).
  • When we reach for it: Known speaker, multilingual output, where voice identity must stay consistent across long narratives.

F5-TTS
  • Best for: Fast zero-shot voice cloning with strong English prosody. Lower parameter count than CosyVoice 2, roughly 2–4× faster on A100.
  • Inputs it wants: Text + a short reference audio (5–15 seconds). No fine-tuning needed.
  • When we reach for it: Latency-sensitive or high-volume English batches; A/B cloning tests where iteration speed matters.

Dia
  • Best for: Multi-speaker dialog TTS. Renders conversations with distinct speakers, natural turn-taking, and disfluencies if asked for.
  • Inputs it wants: A tagged dialog script with speaker annotations; optional reference clips per speaker.
  • When we reach for it: Podcast-style content, radio-drama-style narration, synthetic dialog for training or demo purposes.

StyleTTS 2
  • Best for: Style-reference transfer. You supply a target voice AND an emotional / prosodic reference; the model transfers the style to your new text.
  • Inputs it wants: Text + a style-reference clip (can be from a different speaker than the voice reference).
  • When we reach for it: Dramatic or theatrical reads where the emotional trajectory has to be controlled independently of the voice timbre.
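To make the per-render dispatch concrete, here is a minimal sketch of a worker that parses a queue message, checks that it carries the inputs the chosen engine wants, and hands it to the matching renderer. Only the `model` field comes from the description above. The engine keys, the other field names, and the renderer callables are hypothetical stand-ins.

```python
import json
from typing import Callable, Dict

# Required message fields per engine, following the engine descriptions above.
# Engine keys and field names are assumptions; only `model` is given by the pipeline.
REQUIRED_FIELDS = {
    "parler-tts": ("text", "voice_description"),
    "cosyvoice2": ("text", "reference_audio"),
    "f5-tts": ("text", "reference_audio"),
    "dia": ("dialog_script",),
    "styletts2": ("text", "style_reference"),
}


def parse_message(raw: str) -> dict:
    """Decode a queue message and reject it early if the engine's inputs are missing."""
    msg = json.loads(raw)
    model = msg.get("model")
    if model not in REQUIRED_FIELDS:
        raise ValueError(f"unknown model: {model!r}")
    missing = [name for name in REQUIRED_FIELDS[model] if not msg.get(name)]
    if missing:
        raise ValueError(f"{model} message is missing: {', '.join(missing)}")
    return msg


def dispatch(raw: str, renderers: Dict[str, Callable[[dict], bytes]]) -> bytes:
    """Route one message to the renderer registered for its `model`."""
    msg = parse_message(raw)
    return renderers[msg["model"]](msg)
```

Validating before dispatch keeps malformed messages from occupying GPU time, and a stub renderer registered under every key is enough to exercise the queue end-to-end before any real engine is wired in.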

Engine choice isn't a one-way door. The pipeline supports mixed-engine batches (the filename schema encodes the engine used), so a single production run can render the same script through multiple engines and pick the best take per section.
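A mixed-engine run might be sketched as follows: fan each script section out to several engines, then keep the highest-scoring take per section. The message fields other than `model`, and the idea of a numeric score (for example a reviewer rating), are assumptions. Engine-specific inputs such as reference clips are omitted for brevity.

```python
import json
from itertools import product


def expand_batch(sections, engines):
    """Build one queue message per (section, engine) pair."""
    return [
        json.dumps({"model": engine, "section": idx, "text": text})
        for (idx, text), engine in product(enumerate(sections), engines)
    ]


def pick_best(takes):
    """takes: iterable of (section, engine, score). Keep the top-scoring engine per section."""
    best = {}
    for section, engine, score in takes:
        if section not in best or score > best[section][1]:
            best[section] = (engine, score)
    return {section: engine for section, (engine, _) in best.items()}
```

Because each message names its own engine, the fan-out needs no change to the worker: the same queue drains all takes, and selection happens afterwards from the receipts.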

Further reading

More on TTS & Voice.

Workshops we teach + field notes we're writing, all linked back to what you just read.

Workshop

Hands-on: TTS pipeline on A100 — 1-day workshop

5-engine dispatch on Azure serverless A100. Voice cloning with consent. Manifest-driven provenance for every render.

Scheduling soon.

Engagement

Hands-on: TTS pipeline on A100 — 1-day workshop

Packaged engagement — we scope, build, and hand over with runbooks, against a specific SLA. Add to cart to request delivery; no price is billed up-front.

Neux Ltd

AI Infrastructure · Platform Engineering · London.
Since 2014.

© 2014–2026 Neux Ltd
Registered in England & Wales.