Platform Engineering · Capability
TTS & Voice Cloning.
Cost-governed speech synthesis on Azure serverless A100 GPUs. We run five frontier TTS engines and a voice-cloning workflow through a single queue-driven pipeline: $0 idle cost between batches, and manifest-driven provenance on every render.
Scope
What we do
- Stand up the on-demand A100 batch pipeline (Azure Container Apps job + storage queue + blob sink + managed identity).
- Dispatch the five engines we keep current: Parler-TTS (prompt-based), CosyVoice 2 (multilingual cloning), F5-TTS (zero-shot cloning), Dia (multi-speaker dialog), StyleTTS 2 (style-reference).
- Design the voice-cloning workflow: reference capture, consent tracking, style pool, output receipts.
- Wire the output store so every render has a deterministic filename, a sidecar receipt with timings, and a per-run manifest + log.
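As a rough illustration of the deterministic filename and sidecar receipt mentioned in the last item, here is a minimal sketch. The `RenderRequest` fields, the filename layout, and the receipt keys are all assumptions made for the example, not the schema the pipeline actually uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RenderRequest:
    # All field names here are hypothetical.
    run_id: str
    engine: str
    text: str
    voice: str


def render_filename(req: RenderRequest, ext: str = "wav") -> str:
    """Derive the name from the request alone: same request in, same name out."""
    payload = json.dumps(asdict(req), sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    # The engine is part of the name, so mixed-engine runs stay distinguishable.
    return f"{req.run_id}_{req.engine}_{req.voice}_{digest}.{ext}"


def sidecar_receipt(req: RenderRequest, filename: str,
                    queued_s: float, started_s: float, finished_s: float) -> dict:
    """Build a JSON-serializable receipt with timings for one render."""
    return {
        "file": filename,
        "engine": req.engine,
        "voice": req.voice,
        "text_sha256": hashlib.sha256(req.text.encode("utf-8")).hexdigest(),
        "queue_wait_s": round(started_s - queued_s, 3),
        "render_s": round(finished_s - started_s, 3),
    }
```

Hashing the whole request rather than using a counter or timestamp means a re-run of the same request overwrites its earlier output instead of producing a second file, which is one way to keep a per-run manifest easy to reconcile.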
Practical
Exercises we run
Small, repeatable drills we use on engagements and teach in workshops. Each has a lab setup, step-by-step outline, and measurable output.
Engine dispatch
Five TTS engines we keep current, and when we reach for each
The pipeline dispatches on a `model` field in the queue message, so the engine is chosen per render, not per cluster. Each engine is a different answer to the same question: how natural, how controllable, how portable.
| Project | Best for | Inputs it wants | When we reach for it |
|---|---|---|---|
| Parler-TTS | Prompt-based voice control — you describe the voice in natural language (e.g. "british-rp-woman, whispered") and the model produces an appropriate render. | Text + a voice-description prompt. No reference audio required. | Broad editorial copy where the character matters more than voice identity; batch promotional output with many stylistic variants. |
| CosyVoice 2 | High-fidelity multilingual voice cloning from a short reference clip; strong across English / Chinese / Japanese / French / Spanish. | Text + a reference audio clip of the target speaker (5–30 seconds). | Known speaker, multilingual output, where voice identity must stay consistent across long narratives. |
| F5-TTS | Fast zero-shot voice cloning with strong English prosody. Lower parameter count than CosyVoice 2, roughly 2–4× faster on A100. | Text + a short reference audio clip (5–15 seconds). No fine-tuning needed. | Latency-sensitive or high-volume English batches; A/B cloning tests where iteration speed matters. |
| Dia | Multi-speaker dialog TTS — renders conversations with distinct speakers, natural turn-taking, disfluencies if asked for. | A tagged dialog script with speaker annotations; optional reference clips per speaker. | Podcast-style content, radio-drama-style narration, synthetic dialog for training or demo purposes. |
| StyleTTS 2 | Style-reference transfer: you supply a target voice and an emotional / prosodic reference; the model transfers the style to your new text. | Text + a voice reference + a style-reference clip (which can come from a different speaker than the voice reference). | Dramatic or theatrical reads where the emotional trajectory has to be controlled independently of the voice timbre. |
Engine choice isn't a one-way door. The pipeline supports mixed-engine batches (the filename schema encodes the engine used), so a single production run can render the same script through multiple engines and pick the best take per section.
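To make the per-render dispatch and mixed-engine batches concrete, here is a small sketch of validating a queue message by its `model` field and fanning one script out to several engines. Only the `model` field comes from the description above; the engine key spellings, the other field names, and the helper functions are hypothetical.

```python
import json

# Required inputs per engine, paraphrased from the engine table.
# Key spellings and field names are assumptions, not the pipeline's schema.
REQUIRED_INPUTS = {
    "parler-tts": {"text", "voice_prompt"},
    "cosyvoice2": {"text", "reference_audio"},
    "f5-tts": {"text", "reference_audio"},
    "dia": {"script"},
    "styletts2": {"text", "reference_audio", "style_reference"},
}


def parse_message(body: str) -> dict:
    """Decode a queue message and check it carries what its engine needs."""
    msg = json.loads(body)
    model = msg.get("model")
    if model not in REQUIRED_INPUTS:
        raise ValueError(f"unknown model: {model!r}")
    missing = REQUIRED_INPUTS[model] - set(msg)
    if missing:
        raise ValueError(f"{model} message is missing: {sorted(missing)}")
    return msg


def fan_out(base: dict, models: list[str]) -> list[str]:
    """Build one queue message per engine for the same script."""
    return [json.dumps({**base, "model": m}, sort_keys=True) for m in models]


# Usage: render the same line through two cloning engines, then compare takes.
messages = fan_out(
    {"text": "Welcome back.", "reference_audio": "refs/narrator.wav"},
    ["cosyvoice2", "f5-tts"],
)
for body in messages:
    print(parse_message(body)["model"])
```

Rejecting a malformed message before it reaches a GPU worker is a design choice of this sketch: a validation failure costs nothing, while a failed render on an A100 does.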
Further reading
More on TTS & Voice.
Workshops we teach and field notes we're writing, all linked back to what you just read.
Hands-on: TTS pipeline on A100 — 1-day workshop
5-engine dispatch on Azure serverless A100. Voice cloning with consent. Manifest-driven provenance for every render.
Scheduling soon.
Engagement
Hands-on: TTS pipeline on A100 — 1-day workshop
Packaged engagement: we scope, build, and hand over with runbooks, against a specific SLA. Add to cart to request delivery; no price is billed up front.
Neux Ltd
AI Infrastructure · Platform Engineering · London.
Since 2014.
© 2014–2026 Neux Ltd
Registered in England & Wales.