Not every turn deserves the same execution path. The layer adapts along four dimensions.

Geography

Route to the nearest region. Data stays in-jurisdiction. A caller in Frankfurt hits EU infrastructure. A caller in Mumbai hits AP South. No configuration needed: this is the default behavior. Override with X-Region-Override when you need a specific datacenter, or X-World-Part-Override when you need to stay within a geographic zone. See Regional Execution.
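The default routing and the two override headers can be sketched as a small selection function. This is a minimal sketch, not the layer's implementation. It assumes a hypothetical region catalog (the region IDs, world-part codes, and coordinates below are illustrative), assumes the caller's approximate location is known, and assumes that X-Region-Override takes precedence over X-World-Part-Override when both are sent. Only the two header names come from this page.

```python
import math

# Hypothetical region catalog for illustration only: region id -> (world part, lat, lon).
# Real region identifiers and coordinates are not specified in this document.
REGIONS = {
    "eu-central": ("EU", 50.11, 8.68),   # Frankfurt
    "eu-west": ("EU", 53.35, -6.26),     # Dublin
    "ap-south": ("AP", 19.08, 72.88),    # Mumbai
    "us-east": ("US", 38.90, -77.04),    # Washington, D.C. area
}


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def select_region(caller_lat: float, caller_lon: float, headers: dict) -> str:
    """Pinned region if overridden, else the nearest region (optionally within one zone)."""
    # HTTP header names are case-insensitive, so normalize them before lookup.
    h = {k.lower(): v for k, v in headers.items()}

    # Assumption: a specific-datacenter override wins over a zone override.
    pinned = h.get("x-region-override")
    if pinned is not None:
        if pinned not in REGIONS:
            raise ValueError(f"unknown region: {pinned}")
        return pinned

    # Optional zone restriction; with no override every region is a candidate.
    zone = h.get("x-world-part-override")
    candidates = [rid for rid, (part, _, _) in REGIONS.items() if zone is None or part == zone]
    if not candidates:
        raise ValueError(f"no region in world part: {zone}")

    # Default behavior: the nearest candidate by great-circle distance.
    return min(
        candidates,
        key=lambda rid: haversine_km(caller_lat, caller_lon, REGIONS[rid][1], REGIONS[rid][2]),
    )
```

With no headers, a caller at Frankfurt's coordinates resolves to "eu-central" and a caller at Mumbai's resolves to "ap-south"; sending {"X-World-Part-Override": "US"} from Frankfurt keeps the request inside the hypothetical "US" zone instead.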

Compliance

Execution constraints determine where data and models can operate. Healthcare, banking, and insurance workloads carry requirements on where audio may be processed and where transcripts may reside. Regional execution respects these constraints per request.
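A per-request residency check can be sketched as filtering candidate regions separately for audio processing and transcript storage. This is a minimal sketch under stated assumptions: `ResidencyPolicy`, `plan_execution`, and the region IDs are hypothetical names, and the real layer's policy format is not described on this page. The one design choice worth noting is that it fails closed: if no permitted region exists, the request is rejected rather than routed somewhere non-compliant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResidencyPolicy:
    """Hypothetical per-request constraints on where audio and transcripts may live."""
    audio_regions: frozenset[str]       # regions allowed to process audio
    transcript_regions: frozenset[str]  # regions allowed to store transcripts


def plan_execution(policy: ResidencyPolicy, preferred_regions: list[str]) -> tuple[str, str]:
    """Return (audio_region, transcript_region), taking the first permitted region in preference order."""
    audio = next((r for r in preferred_regions if r in policy.audio_regions), None)
    transcript = next((r for r in preferred_regions if r in policy.transcript_regions), None)
    if audio is None or transcript is None:
        # Fail closed: never fall back to a region the policy does not allow.
        raise PermissionError("no permitted region satisfies the residency policy")
    return audio, transcript
```

For example, a policy that allows audio only in "eu-central" but transcripts in either EU region, given the preference order ["us-east", "eu-west", "eu-central"], yields ("eu-central", "eu-west").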

Cost and latency

The layer chooses a cached path, local inference, or full reasoning based on the input, and it makes this decision per turn:
  • A greeting that has been said a thousand times: cached path.
  • A simple acknowledgment: local inference, no LLM call.
  • A complex question requiring reasoning: full inference path.
Same API call from your side. The path is selected for you.
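The per-turn decision in the list above can be sketched as a router that checks the cheapest path first. This is a minimal sketch assuming an exact-match cache keyed on normalized text and a fixed, hypothetical list of acknowledgments for the local path; `TurnRouter`, `Path`, and `ACKNOWLEDGMENTS` are illustrative names, and the layer's actual selection criteria are not documented here.

```python
import re
from enum import Enum


class Path(Enum):
    CACHED = "cached"  # reuse a stored response
    LOCAL = "local"    # small local model, no LLM call
    FULL = "full"      # full inference path


# Hypothetical set of short acknowledgments handled by local inference.
ACKNOWLEDGMENTS = {"ok", "okay", "yes", "yeah", "sure", "got it", "thanks", "thank you"}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so repeated phrasings match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


class TurnRouter:
    def __init__(self) -> None:
        # Hypothetical response cache keyed by normalized utterance.
        self.cache: dict[str, str] = {}

    def choose_path(self, utterance: str) -> Path:
        key = normalize(utterance)
        if key in self.cache:         # seen before: cheapest path
            return Path.CACHED
        if key in ACKNOWLEDGMENTS:    # trivial turn: no LLM call needed
            return Path.LOCAL
        return Path.FULL              # everything else needs reasoning

    def record(self, utterance: str, response: str) -> None:
        """Store a response so later identical turns can take the cached path."""
        self.cache[normalize(utterance)] = response
```

Checking the cache before the acknowledgment list means any turn that has already been answered is served from cache, even if a local model could also handle it.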

Workload

Adapt based on model availability, caller context, and interaction characteristics. If a primary model shows elevated latency, traffic can route to a fallback. If a caller’s language is better served by a specific STT model, the routing adapts.
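Both adaptations in the paragraph above can be sketched with a rolling latency window and a language lookup. This is a minimal sketch assuming a 95th-percentile latency threshold over the last 50 calls and a hypothetical language-to-STT-model table; the threshold, window size, and model names (`stt-eu`, `stt-indic`, `stt-general`) are illustrative, not documented values.

```python
from collections import deque
from statistics import quantiles


class LatencyTracker:
    """Rolling window of recent latencies (ms) for one model."""

    def __init__(self, window: int = 50) -> None:
        self.samples: deque[float] = deque(maxlen=window)

    def observe(self, ms: float) -> None:
        self.samples.append(ms)

    def p95(self) -> float:
        # statistics.quantiles needs at least two data points on older Pythons.
        if len(self.samples) < 2:
            return 0.0
        return quantiles(self.samples, n=20)[-1]  # last of 19 cut points = 95th percentile


def choose_llm(trackers: dict[str, LatencyTracker], primary: str, fallback: str,
               threshold_ms: float = 500.0) -> str:
    """Route to the fallback while the primary's rolling p95 latency exceeds the threshold."""
    if trackers[primary].p95() > threshold_ms:
        return fallback
    return primary


# Hypothetical mapping from caller language to a preferred STT model.
STT_BY_LANGUAGE = {"de": "stt-eu", "hi": "stt-indic"}
DEFAULT_STT = "stt-general"


def choose_stt(language_tag: str) -> str:
    """Pick an STT model by base language subtag, e.g. "de-DE" -> "de"."""
    return STT_BY_LANGUAGE.get(language_tag.split("-")[0].lower(), DEFAULT_STT)
```

This sketch switches back to the primary as soon as its p95 drops below the threshold; a production router would likely add hysteresis or a cooldown so traffic does not flap between models.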

The result

A system where cost and latency decrease with usage, while reliability increases. Every call improves routing decisions and cache coverage for the next one. Regulated industries (healthcare, banking, insurance) see the strongest compounding effect, because their workflows repeat heavily within each customer’s use cases and the layer learns those patterns.
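Why cost falls as usage grows can be made concrete with the three per-turn paths described under Cost and latency. As a sketch, let $h_c$ and $h_l$ be the fractions of turns served by the cached and local paths, and let $c_{\text{cache}}$, $c_{\text{local}}$, $c_{\text{full}}$ be the per-turn costs of each path, assuming $c_{\text{cache}} \le c_{\text{local}} \le c_{\text{full}}$. The symbols are illustrative; no actual rates are given on this page.

```latex
\mathbb{E}[C] = h_c\,c_{\text{cache}} + h_l\,c_{\text{local}} + (1 - h_c - h_l)\,c_{\text{full}}

% Moving a fraction \Delta of turns from the full path to the cache:
\mathbb{E}[C]' - \mathbb{E}[C]
  = (h_c + \Delta)\,c_{\text{cache}} + h_l\,c_{\text{local}} + (1 - h_c - \Delta - h_l)\,c_{\text{full}} - \mathbb{E}[C]
  = -\Delta\,(c_{\text{full}} - c_{\text{cache}}) \le 0
```

The same form holds with per-path latencies in place of costs. Workloads with high repetition raise $h_c$ fastest, which is why the effect compounds most for them.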