VUI
Generate audio from text using the VUI voice model (default region: USA). Provides high-quality text-to-speech synthesis.
Headers
Authorization
string · required
The Authorization header is used to authenticate with the API using your API key. Value is of the format Bearer YOUR_KEY_HERE.
Request Body
text
string · required
The text to convert to speech.
voice
string
Voice to use for synthesis (optional).
Default: default
stream
boolean
Whether to stream the audio response.
Default: false
async
boolean
Whether to use async prediction (returns prediction_id).
Default: false
Responses
Audio response or async prediction ID
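As a minimal sketch of how the header and body fields above fit together, the following builds a request object without sending it. The endpoint URL is a placeholder and the JSON content type is an assumption, since neither is stated in this section; `build_vui_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Hypothetical URL: this section does not state the real endpoint path.
VUI_URL = "https://api.example.com/vui"


def build_vui_request(api_key, text, voice="default", stream=False, run_async=False):
    """Build (but do not send) a POST request using the documented defaults."""
    if not text:
        raise ValueError("text is required")
    body = {"text": text, "voice": voice, "stream": stream, "async": run_async}
    return urllib.request.Request(
        VUI_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Documented format: "Bearer YOUR_KEY_HERE"
            "Authorization": f"Bearer {api_key}",
            # Assumption: the request body is sent as JSON.
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_vui_request("YOUR_KEY_HERE", "Hello, world.")
```

Passing `request` to `urllib.request.urlopen` would send it. How the response is shaped (audio bytes versus a prediction ID) is not specified here beyond the one-line description above.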
Orpheus
Generate audio from text using the Orpheus voice model. Optimized with TRT-LLM on H100 MIG 40GB hardware. Generates ~83 tokens/second for real-time streaming. Audio format: 24kHz, 16-bit, mono WAV.
Headers
Authorization
string · required
The Authorization header is used to authenticate with the API using your API key. Value is of the format Bearer YOUR_KEY_HERE.
Request Body
prompt
string · required
The text to convert to speech.
voice
string
Voice to use. English: 'tara', 'leah', 'jess', 'leo', 'dan', 'mia', 'zac', 'zoe'; French: 'pierre', 'amelie', 'marie'; German: 'jana', 'thomas', 'max'; etc.
Example: tara
Default: tara
max_tokens
number
Maximum tokens to generate.
Default: 2000
stream
boolean
Whether to stream the response.
Default: false
async
boolean
Whether to run asynchronously.
Default: false
output_language
string
Language code. English: 'en' (high quality); French: 'fr' (high quality); German: 'de' (high quality); Korean: 'ko' (high quality); Mandarin: 'zh' (high quality); Spanish: 'es' (medium); Italian: 'it' (medium); Hindi: 'hi' (medium).
Example: en
output_style
string
Style of speech (e.g., 'cheerful', 'serious', 'excited').
Responses
Audio response or async prediction ID
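The throughput and audio-format figures quoted for Orpheus allow some rough sizing arithmetic. The sketch below uses only the numbers stated above (about 83 tokens/second, default max_tokens of 2000, 24kHz 16-bit mono). It does not assume any mapping from tokens to seconds of audio, which this section does not give. The function names are illustrative.

```python
TOKENS_PER_SECOND = 83    # approximate rate quoted in the description
SAMPLE_RATE_HZ = 24_000   # 24kHz
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # mono


def generation_seconds(max_tokens=2000):
    """Rough upper bound on generation time for a given token budget."""
    return max_tokens / TOKENS_PER_SECOND


def pcm_bytes(audio_seconds):
    """Size of the raw sample data (excluding the WAV header) for a clip."""
    return int(audio_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


print(round(generation_seconds(), 1))  # 24.1 seconds at the default max_tokens
print(pcm_bytes(1))                    # 48000 bytes per second of audio
```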
Kokoro
Generate audio from text using Kokoro, a frontier TTS model with just 82 million parameters that offers efficient, high-quality speech synthesis. Audio format: 16-bit WAV.
Headers
Authorization
string · required
The Authorization header is used to authenticate with the API using your API key. Value is of the format Bearer YOUR_KEY_HERE.
Request Body
text
string · required
The text to convert to speech.
voice
string
Voice to use (if supported by the model).
stream
boolean
Whether to stream the response.
Default: false
async
boolean
Whether to run asynchronously.
Default: false
Responses
Audio response or async prediction ID
XTTS-V2
Generate audio from text using the XTTS-V2 voice model, which supports voice cloning in multiple languages. XTTS-V2 is a state-of-the-art text-to-speech model by Coqui. Audio format: WAV (16-bit, 24kHz).
Headers
Authorization
string · required
The Authorization header is used to authenticate with the API using your API key. Value is of the format Bearer YOUR_KEY_HERE.
Request Body
text
string · required
The text to convert to speech.
speaker_voice
string · required
Base64-encoded audio file for voice cloning (6+ seconds recommended).
language
string
Target language code. English: 'en', Spanish: 'es', French: 'fr', German: 'de', Italian: 'it', Portuguese: 'pt', Polish: 'pl', Turkish: 'tr', Russian: 'ru', Dutch: 'nl', Czech: 'cs', Arabic: 'ar', Chinese: 'zh', Japanese: 'ja', Korean: 'ko', Hungarian: 'hu', Hindi: 'hi'.
Example: en
Default: en
stream
boolean
Whether to stream the response.
Default: false
async
boolean
Whether to run asynchronously.
Default: false
Responses
Audio response or async prediction ID
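The speaker_voice field expects a Base64 string, so a reference clip has to be encoded before it goes into the request body. The sketch below assumes the clip is a WAV file (the accepted input formats are not listed in this section) and reports its duration so the caller can compare it against the 6-second recommendation. Both helper names are hypothetical.

```python
import base64
import io
import wave


def encode_reference_audio(wav_bytes):
    """Return (base64_string, duration_seconds) for a WAV reference clip."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return base64.b64encode(wav_bytes).decode("ascii"), duration


def build_xtts_body(text, wav_bytes, language="en"):
    """Assemble the request body with the documented field names and defaults."""
    speaker_voice, duration = encode_reference_audio(wav_bytes)
    if duration < 6.0:
        # 6+ seconds is a recommendation, not a hard limit, so only report it.
        print(f"reference clip is {duration:.1f}s; 6+ seconds is recommended")
    return {
        "text": text,
        "speaker_voice": speaker_voice,
        "language": language,
        "stream": False,
        "async": False,
    }
```

A typical call would read the clip with `open(path, "rb").read()` and pass the bytes to `build_xtts_body`.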
MARS6
Generate audio from text using the MARS6 voice model, which supports voice and prosody cloning in 10 languages. MARS6 is a frontier text-to-speech model by CAMB.AI. Audio format: AAC (ADTS stream) or FLAC, depending on the stream_format parameter.
Headers
Authorization
string · required
The Authorization header is used to authenticate with the API using your API key. Value is of the format Bearer YOUR_KEY_HERE.
Request Body
text
string · required
The text to convert to speech.
audio_ref
string · required
Base64-encoded audio file for voice cloning (6-90 seconds recommended).
language
string · required
Target language code. English: 'en-us', French: 'fr-fr', German: 'de-de', Spanish: 'es-es', Italian: 'it-it', Portuguese: 'pt-pt', Chinese: 'zh-cn', Japanese: 'ja-jp', Korean: 'ko-kr', Dutch: 'nl-nl'.
Example: en-us
ref_text
string
Text transcript of the reference audio (optional but recommended).
stream
boolean
Whether to stream the response.
Default: true
stream_format
string · enum
Format for streaming: 'adts' for AAC or 'flac' for FLAC.
Enum values: adts, flac
Default: adts
temperature
number
Temperature for generation.
Default: 0.7
top_p
number
Top-p for generation.
Default: 0.7
chunk_length
number
Text chunk length for splitting long input.
Default: 200
max_new_tokens
number
Limit on max tokens (0 = unlimited).
Default: 0
repetition_penalty
number
Repetition penalty for generation.
Default: 1.5
async
boolean
Whether to run asynchronously.
Default: false
Responses
Audio response or async prediction ID
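MARS6 has the most parameters of the models listed here, so checking the body client-side can catch mistakes before a request is sent. The following is a sketch under the assumption that the listed language codes and enum values are the complete accepted sets. `build_mars6_body` is a hypothetical helper, and the defaults are copied from the parameter list above.

```python
MARS6_LANGUAGES = {
    "en-us", "fr-fr", "de-de", "es-es", "it-it",
    "pt-pt", "zh-cn", "ja-jp", "ko-kr", "nl-nl",
}
STREAM_FORMATS = {"adts", "flac"}


def build_mars6_body(text, audio_ref_b64, language, ref_text=None, **overrides):
    """Return a request body with documented defaults, validating enum fields."""
    if language not in MARS6_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    body = {
        "text": text,
        "audio_ref": audio_ref_b64,
        "language": language,
        "stream": True,
        "stream_format": "adts",
        "temperature": 0.7,
        "top_p": 0.7,
        "chunk_length": 200,
        "max_new_tokens": 0,      # 0 means unlimited
        "repetition_penalty": 1.5,
        "async": False,
    }
    if ref_text is not None:
        body["ref_text"] = ref_text
    unknown = set(overrides) - set(body)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body.update(overrides)
    if body["stream_format"] not in STREAM_FORMATS:
        raise ValueError("stream_format must be 'adts' or 'flac'")
    return body
```

Because `async` is a reserved word in Python, it has to be overridden with dictionary unpacking, for example `build_mars6_body(text, ref, "en-us", **{"async": True})`.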
TWI SpeechT5 TTS
Synthesize speech from text using the TWI SpeechT5 model with customizable 512-dimensional speaker embeddings. Hosted on SLNG infrastructure for low-latency synthesis.
Headers
Authorization
string · required
The Authorization header is used to authenticate with the API using your API key. Value is of the format Bearer YOUR_KEY_HERE.
Request Body
text
string · required
The text to synthesize into speech.
speaker_embedding
number[] · minItems: 512 · maxItems: 512 · required
A 512-dimensional speaker embedding vector representing the target voice.
Responses
Synthesized audio waveform (array of floats)
audio
number[]
Raw waveform samples as a float array (16kHz sample rate, mono channel). Values typically range from -1.0 to 1.0.
fallback
boolean
True if fallback silent audio was returned due to an error.
error
string
Error message if fallback audio was returned.
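Unlike the other models, this endpoint returns raw float samples, so the caller has to package them into an audio file. The sketch below checks the embedding length, treats a fallback response as an error, and writes the samples as 16-bit mono WAV at 16kHz. It assumes the response has already been parsed into a dictionary with the three fields listed above. The helper names are hypothetical.

```python
import io
import struct
import wave


def build_speecht5_body(text, speaker_embedding):
    """Validate the 512-item embedding constraint and return the request body."""
    if len(speaker_embedding) != 512:
        raise ValueError("speaker_embedding must contain exactly 512 numbers")
    return {"text": text, "speaker_embedding": list(speaker_embedding)}


def waveform_to_wav(response, sample_rate=16_000):
    """Convert the 'audio' float array into 16-bit mono WAV bytes."""
    if response.get("fallback"):
        # Fallback audio is silence, so surface the reported error instead.
        raise RuntimeError(response.get("error") or "fallback audio returned")
    # Values typically lie in [-1.0, 1.0]; clip anything outside that range.
    clipped = [max(-1.0, min(1.0, s)) for s in response["audio"]]
    pcm = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

The returned bytes can be written straight to a `.wav` file.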