Which approach should you use?
| I need to… | Use |
|---|---|
| Generate audio files for download or storage | HTTP |
| Stream audio in real time for a voice agent or app | WebSocket Streaming |
| Control how brand names and acronyms are pronounced | Add Pronunciation Dictionaries |
| Use my own provider keys with SLNG’s optimization | Enable BYOK |
Which model should you use?
| I need… | Recommended model | Why |
|---|---|---|
| Lowest latency English voice agent | Deepgram Aura 2 (SLNG-hosted) | Low-latency, deployed on SLNG infrastructure |
| Lowest latency with broad language coverage | Cartesia Sonic 3 | Many languages, low latency |
| Hindi / Indian languages | Sarvam Bulbul | Indian languages, many voices |
| Multilingual with expressiveness control | KugelAudio Kugel | Broad language coverage |
| Spanish | Deepgram Aura 2 Spanish (SLNG-hosted) | SLNG-hosted Spanish voice |
Execution layer behavior
Every TTS request goes through the output assembly stage, which checks whether this text, voice, and config combination has been assembled before:
- Cache hit: audio returns from the edge. No upstream model call. No provider billing.
- Cache miss: the request routes to the model endpoint, audio is synthesized, and the result is cached.
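
The hit/miss behavior described here can be sketched in a few lines of Python. This is only an illustration of the general idea, not SLNG's implementation: the `OutputAssembly` class, the `cache_key` helper, the SHA-256 key derivation, and the `fake_synthesize` stand-in for the upstream model are all hypothetical names and choices made for this example.

```python
import hashlib
import json


def cache_key(text: str, voice: str, config: dict) -> str:
    """Derive a deterministic key from text, voice, and config.

    Assumption: hashing a canonical JSON form is one plausible way to
    identify a combination; the real key scheme is not documented here.
    """
    payload = json.dumps(
        {"text": text, "voice": voice, "config": config},
        sort_keys=True,  # key order in config must not change the key
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class OutputAssembly:
    """Hypothetical in-memory model of the output assembly stage."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # stand-in for the upstream model call
        self._cache = {}
        self.upstream_calls = 0

    def request(self, text: str, voice: str, config: dict):
        key = cache_key(text, voice, config)
        if key in self._cache:
            # Cache hit: return stored audio, no upstream call.
            return self._cache[key], "hit"
        # Cache miss: synthesize upstream, then store the result.
        self.upstream_calls += 1
        audio = self._synthesize(text, voice, config)
        self._cache[key] = audio
        return audio, "miss"


def fake_synthesize(text: str, voice: str, config: dict) -> bytes:
    """Placeholder synthesizer that returns bytes instead of real audio."""
    return f"{voice}:{text}".encode("utf-8")
```

In this sketch, repeating a request with the same text, voice, and config returns the stored bytes without incrementing `upstream_calls`, while changing any one of the three produces a different key and therefore a miss.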