MoAI Performance Gateway
Routes inference requests across heterogeneous chips in your data center to extract optimal performance from every accelerator. Exposes OpenAI- and Anthropic-compatible APIs, built for production.
The Performance Gateway, Defined
AI gateways typically mean routing across model providers or regions. Moreh defines a new category: routing within a data center, across the chips you already have, for performance.
| Gateway | Scale | Across | Role |
|---|---|---|---|
| Semantic gateway | Within or across data centers | Multiple models | Selects the most appropriate (smaller) model based on request semantics |
| Multi-provider gateway | Across data centers | Multiple API providers | Selects the most cost-efficient or available region |
| Performance gateway | Within a data center | Multiple chips | Distributes requests across multiple (heterogeneous) chips within a data center to achieve optimal performance |
Engineered for Performance
Every routing decision is informed by per-request KV-cache state, workload characteristics, and live engine telemetry.
Prefix Cache-Aware Routing
Routes each request to the chip with the longest cached prefix, minimizing KV-cache recomputation in multi-turn and long-context conversations.
Request Length-Based Routing
Selects the chip and serving configuration that best fits the request's sequence length, matching workload characteristics to hardware.
Flexible Routing Composition
Compose filters, scorers, and pickers into a custom routing pipeline via declarative configuration. Plug in prefix cache-aware, load-aware, request length-based, or custom scorers.
Heterogeneous Prefill-Decode Disaggregation
Coordinates prefill and decode phases across chips of different vendors and architectures, with automatic fallback to single-phase serving on transfer failure.
Minimal Overhead Outside GPU Compute
Routing, scheduling, and event-driven telemetry that typically span multiple services run inside a single binary, minimizing inter-process hops on the request hot path. Under load, the only meaningful latency in your inference pipeline is GPU compute itself.
Modern API Complexity, Off Your Serving Engines
Tool calling, reasoning budgets, chat templates, structured outputs, streaming protocols — AI APIs grow more complex by the month, and most of that complexity is GPU-independent. MoAI Performance Gateway absorbs it at the edge so your serving engines stay simple: tokens in, tokens out. Update to the next API surface or reasoning model without touching GPU-dependent software.
token_ids → engine → token_idsSpeaks the APIs Your Apps Already Use
OpenAI and Anthropic compatibility, with the features that matter for agentic and reasoning workloads.
OpenAI Chat Completions API
POST /v1/chat/completionsOpenAI Responses API
POST /v1/responsesAnthropic Messages API
POST /v1/messages