The principle
In a traditional voice pipeline, every TTS request is a fresh synthesis call. The same greeting, the same compliance disclosure, the same hold message, generated from scratch every time, billed every time. Output assembly changes this. When audio has been produced before for the same request, it is served instantly, with no upstream model call and no provider billing. When it has not, it is generated, and the result is available for future requests.
How it improves over time
Output assembly is not a static cache. It improves as call volume grows:
- Coverage expands with usage. Every new phrase your agents speak adds to the cache.
- Repetition compounds. Voice agents naturally repeat greetings, confirmations, disclosures, hold messages, and error handling.
- Industry patterns accelerate improvement. Regulated industries see the fastest improvement because their workflows have high repetition.
- Cost decreases structurally. As coverage grows, fewer turns hit the upstream model.
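The hit-or-miss behavior and the way coverage grows with repetition can be sketched in a few lines. This is a minimal illustration, not the actual implementation: `AssemblyCache`, `fake_synthesize`, and the in-memory dictionary are hypothetical stand-ins for the real cache and the upstream TTS model.

```python
from typing import Callable, Dict


class AssemblyCache:
    """Toy in-memory stand-in for output assembly (hypothetical, for illustration)."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize  # upstream TTS call, billed on every use
        self._audio: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def speak(self, text: str) -> bytes:
        if text in self._audio:
            self.hits += 1  # served from cache: no upstream call
            return self._audio[text]
        self.misses += 1  # novel text: synthesize once, then store
        audio = self._synthesize(text)
        self._audio[text] = audio
        return audio


def fake_synthesize(text: str) -> bytes:
    # Placeholder for a real TTS model call.
    return text.encode("utf-8")


cache = AssemblyCache(fake_synthesize)
turns = [
    "Thanks for calling.",
    "This call may be recorded.",
    "Thanks for calling.",
    "This call may be recorded.",
    "Your balance is 42 dollars.",
    "Thanks for calling.",
]
for turn in turns:
    cache.speak(turn)

print(cache.hits, cache.misses)  # prints: 3 3
```

In this run, three of the six turns are served without a synthesis call. The share of cached turns rises as the same phrases recur.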
What this means
| Without output assembly | With output assembly |
|---|---|
| Every TTS request is a fresh synthesis call | Only novel text triggers synthesis |
| Cost is constant per turn | Cost per turn decreases as coverage grows |
| Latency depends on the model every time | Cached responses return from the edge |
| Switching TTS models means starting over | Pronunciation rules carry over across model swaps |
Cache isolation
Every cache entry is scoped and isolated by:
- Region: cache is local to the region where the request executes.
- Customer: your cache is yours. No data is shared across SLNG customers.
- Use case: different agents, voices, and configurations maintain separate cache namespaces.
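One way to picture this scoping is a cache key that includes every isolation dimension, so identical text never collides across regions, customers, or use cases. The sketch below is an assumption about how such a key could be derived. The function `cache_key` and its parameter names are hypothetical and do not describe the platform's internal key format.

```python
import hashlib
import json


def cache_key(region: str, customer_id: str, use_case: str,
              model: str, voice: str, text: str) -> str:
    """Derive a key that differs whenever any scoping field differs (illustrative)."""
    # JSON-encode the fields as a list so that field boundaries are unambiguous.
    payload = json.dumps(
        [region, customer_id, use_case, model, voice, text],
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


greeting = "Thanks for calling."
key_a = cache_key("eu-west", "customer-a", "support-agent", "model-x", "voice-1", greeting)
key_b = cache_key("eu-west", "customer-b", "support-agent", "model-x", "voice-1", greeting)

# Same text, different customer: the entries live under different keys.
print(key_a != key_b)  # prints: True
```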
Pronunciation dictionaries
Before text reaches any TTS model, pronunciation rewrite rules normalize brand names, acronyms, and domain terms, regardless of which model synthesizes the audio. Your rules carry over when you swap TTS models. See Pronunciation Dictionaries for the full API.
BYOK (Bring Your Own Key)
With BYOK, you pass your own provider key on TTS requests. Output assembly still applies: a cache hit never reaches the provider, so you are not billed; a miss uses your key for synthesis and caches the result. Your existing provider contracts and volume discounts stay intact.
Configuration
Output assembly is automatic. Every TTS request goes through it. You shape the behavior through:
- Model and voice selection: determines the cache namespace.
- Pronunciation dictionaries: control text rewriting before synthesis.
- BYOK: use your own provider keys with the same assembly logic.
- Region: determines which edge cache is consulted first.
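To illustrate the text rewriting step mentioned in the list above, here is a minimal sketch of whole-word pronunciation rewriting applied before synthesis. It is not the Pronunciation Dictionaries API: `apply_pronunciations` and the rule format (a plain mapping from term to spoken form) are assumptions made for this example.

```python
import re
from typing import Dict


def apply_pronunciations(text: str, rules: Dict[str, str]) -> str:
    """Replace whole-word, case-sensitive matches of each term with its spoken form."""
    if not rules:
        return text
    # Longest terms first so multi-word entries win over their substrings.
    terms = sorted(rules, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"
    )
    return pattern.sub(lambda m: rules[m.group(1)], text)


rules = {"IRA": "I R A", "SQL": "sequel"}
print(apply_pronunciations("Your IRA statement is ready.", rules))
# prints: Your I R A statement is ready.
```

Because the rewrite happens on text, before any model is involved, the same rules apply unchanged when the TTS model is swapped.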
| Setting | Default | Description |
|---|---|---|
| tts_first_audio_timeout_s | 4.0 | Maximum wait for first audio from the TTS model before failover |
| failure_audio_enabled | true | Play a failure audio clip if all TTS paths fail |
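The interaction between these two settings can be sketched as follows. This is an illustration under stated assumptions, not the platform's failover logic: `first_audio_with_failover`, the ordered list of TTS paths, the `FAILURE_CLIP` placeholder, and returning `None` when failure audio is disabled are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional, Sequence

FAILURE_CLIP = b"<failure-audio>"  # placeholder bytes for the fallback clip


def first_audio_with_failover(
    text: str,
    tts_paths: Sequence[Callable[[str], bytes]],
    tts_first_audio_timeout_s: float = 4.0,
    failure_audio_enabled: bool = True,
) -> Optional[bytes]:
    """Try each TTS path in order; move on if first audio is late or the call fails."""
    executor = ThreadPoolExecutor(max_workers=max(1, len(tts_paths)))
    try:
        for path in tts_paths:
            future = executor.submit(path, text)
            try:
                return future.result(timeout=tts_first_audio_timeout_s)
            except FutureTimeout:
                continue  # first audio did not arrive in time: fail over
            except Exception:
                continue  # this path raised an error: fail over
    finally:
        executor.shutdown(wait=False)  # do not block on a path that is still running
    return FAILURE_CLIP if failure_audio_enabled else None


def slow_path(text: str) -> bytes:
    time.sleep(0.2)  # simulates a model that is late with first audio
    return b"slow"


def fast_path(text: str) -> bytes:
    return b"fast"


def broken_path(text: str) -> bytes:
    raise RuntimeError("provider down")


result = first_audio_with_failover(
    "Hello", [slow_path, fast_path], tts_first_audio_timeout_s=0.05
)
print(result)  # prints: b'fast'
```

With every path failing, the sketch returns the failure clip when `failure_audio_enabled` is true and nothing otherwise.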