TTS on the execution layer is assembly, not generation. When a turn produces text for speech, that text is not blindly forwarded to a TTS model. The layer decides how to produce the audio: it serves from cache when possible and generates only what is genuinely new. The goal is never to generate audio that already exists.

The principle

In a traditional voice pipeline, every TTS request is a fresh synthesis call. The same greeting, the same compliance disclosure, the same hold message, generated from scratch every time, billed every time. Output assembly changes this. When audio has been produced before for the same request, it is served instantly, with no upstream model call and no provider billing. When it has not, it is generated, and the result is available for future requests.
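
The serve-or-synthesize decision described above can be pictured with a minimal Python sketch. The names `OutputAssembler`, `speak`, and `fake_tts` are hypothetical and not part of any SLNG API, and keying the cache on voice plus exact text is an assumption made only for illustration.

```python
import hashlib
from typing import Callable, Dict


class OutputAssembler:
    """Hypothetical illustration of serve-from-cache-or-synthesize."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize        # stand-in for an upstream TTS call
        self._cache: Dict[str, bytes] = {}   # in-memory stand-in for the real cache
        self.upstream_calls = 0              # counts calls that would be billed

    @staticmethod
    def _key(text: str, voice: str) -> str:
        # Assumed key: voice and exact text, separated by a NUL byte.
        return hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()

    def speak(self, text: str, voice: str) -> bytes:
        key = self._key(text, voice)
        audio = self._cache.get(key)
        if audio is None:
            # Cache miss: synthesize once and keep the result for later requests.
            self.upstream_calls += 1
            audio = self._synthesize(text, voice)
            self._cache[key] = audio
        return audio


def fake_tts(text: str, voice: str) -> bytes:
    # Placeholder "audio" so the sketch runs without a real provider.
    return f"{voice}:{text}".encode("utf-8")


assembler = OutputAssembler(fake_tts)
assembler.speak("Thanks for calling.", "voice-a")
assembler.speak("Thanks for calling.", "voice-a")
print(assembler.upstream_calls)  # 1: the second request was served from cache
```

In this sketch the second identical request never reaches `fake_tts`, which mirrors the "no upstream model call and no provider billing" behavior on a hit.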

How it improves over time

Output assembly is not a static cache. It improves as call volume grows:
  • Coverage expands with usage. Every new phrase your agents speak adds to the cache.
  • Repetition compounds. Voice agents naturally repeat greetings, confirmations, disclosures, hold messages, and error handling.
  • Industry patterns accelerate improvement. Regulated industries see the fastest improvement because their workflows have high repetition.
  • Cost decreases structurally. As coverage grows, fewer turns hit the upstream model.
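
To make the compounding effect concrete, here is a small Python sketch that replays a sequence of spoken phrases and tracks the cumulative cache hit rate. The function name `cumulative_hit_rate` and the sample phrases are invented for illustration; real coverage depends on your traffic and on how the cache is actually keyed.

```python
from typing import List, Sequence


def cumulative_hit_rate(turns: Sequence[str]) -> List[float]:
    """Return the running fraction of turns that were already in the cache."""
    seen = set()
    hits = 0
    rates: List[float] = []
    for count, text in enumerate(turns, start=1):
        if text in seen:
            hits += 1       # phrase was spoken before: served from cache
        else:
            seen.add(text)  # novel phrase: synthesized once, then cached
        rates.append(hits / count)
    return rates


# Illustrative traffic: two phrases repeated across five turns.
turns = ["hello", "please hold", "hello", "please hold", "hello"]
rates = cumulative_hit_rate(turns)
print(rates[-1])  # 0.6: three of the five turns needed no synthesis
```

Under the assumption that a cache hit incurs no upstream cost, the share of turns that still reach the model is `1 - rates[-1]`, which shrinks as repetition accumulates.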

What this means

| Without output assembly | With output assembly |
| --- | --- |
| Every TTS request is a fresh synthesis call | Only novel text triggers synthesis |
| Cost is constant per turn | Cost per turn decreases as coverage grows |
| Latency depends on the model every time | Cached responses return from the edge |
| Switching TTS models means starting over | Pronunciation rules carry over across model swaps |

Cache isolation

Every cache entry is scoped and isolated by:
  • Region: cache is local to the region where the request executes.
  • Customer: your cache is yours. No data is shared across SLNG customers.
  • Use case: different agents, voices, and configurations maintain separate cache namespaces.
Your audio, your data, your cache, fully isolated.
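
One way to think about this scoping is as a cache key that includes region, customer, and use case, so identical text under different scopes never maps to the same entry. The sketch below is an assumption about how such a key could be built, not a description of the actual key format; `CacheScope`, `scoped_key`, and the example identifiers are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheScope:
    region: str       # where the request executes
    customer_id: str  # owner of the cache entries
    use_case: str     # e.g. agent, voice, and configuration combined


def scoped_key(scope: CacheScope, text: str) -> str:
    parts = (scope.region, scope.customer_id, scope.use_case, text)
    # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
    payload = b"".join(
        len(p.encode("utf-8")).to_bytes(4, "big") + p.encode("utf-8")
        for p in parts
    )
    return hashlib.sha256(payload).hexdigest()


scope_a = CacheScope("region-1", "customer-a", "support-agent/voice-x")
scope_b = CacheScope("region-1", "customer-b", "support-agent/voice-x")

# Same text, different customers: the keys differ, so entries are never shared.
print(scoped_key(scope_a, "Hello") != scoped_key(scope_b, "Hello"))  # True
```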

Pronunciation dictionaries

Before text reaches any TTS model, pronunciation rewrite rules normalize brand names, acronyms, and domain terms, regardless of which model synthesizes the audio. Your rules carry over when you swap TTS models. See Pronunciation Dictionaries for the full API.
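
As a rough illustration of rewriting text before synthesis, the Python sketch below applies a term-to-spoken-form mapping with whole-word matching. The function `apply_pronunciations` and the sample rules are hypothetical and do not reflect the Pronunciation Dictionaries API; they only show why model-independent rewriting survives a model swap.

```python
import re
from typing import Dict


def apply_pronunciations(text: str, rules: Dict[str, str]) -> str:
    """Replace whole-word occurrences of each term with its spoken form."""
    if not rules:
        return text
    # Try longer terms first so a multi-word term wins over a shorter one inside it.
    terms = sorted(rules, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"
    )
    return pattern.sub(lambda m: rules[m.group(1)], text)


# Illustrative rules; matching here is case-sensitive by assumption.
rules = {"SQL": "sequel", "APR": "A P R"}
print(apply_pronunciations("Your APR is stored in SQL.", rules))
# Your A P R is stored in sequel.
```

Because the rewrite happens on the text, the same `rules` mapping can be reused unchanged whichever model receives the output.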

BYOK (Bring Your Own Key)

With BYOK, you pass your own provider key on TTS requests. Output assembly still applies: a cache hit never reaches the provider, so you are not billed; a miss uses your key for synthesis and caches the result. Your existing provider contracts and volume discounts stay intact.

Configuration

Output assembly is automatic. Every TTS request goes through it. You shape the behavior through:
  • Model and voice selection: determines the cache namespace.
  • Pronunciation dictionaries: control text rewriting before synthesis.
  • BYOK: use your own provider keys with the same assembly logic.
  • Region: determines which edge cache is consulted first.
Timeout and failure-handling settings for the TTS stage:
| Setting | Default | Description |
| --- | --- | --- |
| tts_first_audio_timeout_s | 4.0 | Maximum wait for first audio from the TTS model before failover |
| failure_audio_enabled | true | Play a failure audio clip if all TTS paths fail |
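
The interaction between these two settings can be sketched with Python's `asyncio`. This is a guess at the control flow implied by the descriptions, not the actual implementation: the function `first_audio_with_failover`, the notion of an ordered list of "paths", and the placeholder `FAILURE_CLIP` are all assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

TTS_FIRST_AUDIO_TIMEOUT_S = 4.0    # documented default
FAILURE_AUDIO_ENABLED = True       # documented default
FAILURE_CLIP = b"<failure-audio>"  # placeholder bytes, purely illustrative

TtsPath = Callable[[], Awaitable[bytes]]


async def first_audio_with_failover(
    paths: Sequence[TtsPath],
    timeout_s: float = TTS_FIRST_AUDIO_TIMEOUT_S,
    failure_audio_enabled: bool = FAILURE_AUDIO_ENABLED,
) -> Optional[bytes]:
    """Try each TTS path in order; give each one timeout_s to produce first audio."""
    for path in paths:
        try:
            return await asyncio.wait_for(path(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            continue  # this path was too slow or unreachable: fail over to the next
    # Every path failed: fall back to the failure clip if enabled.
    return FAILURE_CLIP if failure_audio_enabled else None


async def slow_path() -> bytes:
    await asyncio.sleep(1.0)  # simulates a model that misses the deadline
    return b"slow"


async def fast_path() -> bytes:
    return b"fast"


result = asyncio.run(first_audio_with_failover([slow_path, fast_path], timeout_s=0.05))
print(result)  # b'fast'
```

The short `timeout_s=0.05` in the usage line only keeps the demo quick; with the default, each path would get the full 4.0 seconds before failover.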