Building Block

MoAI Performance Gateway

Routes inference requests across heterogeneous chips in your data center to extract optimal performance from every accelerator. Exposes OpenAI- and Anthropic-compatible APIs, built for production.

A New Category

The Performance Gateway, Defined

AI gateways typically mean routing across model providers or regions. Moreh defines a new category: routing within a data center, across the chips you already have, for performance.

GatewayScaleAcrossRole
Semantic gatewayWithin or across data centersMultiple modelsSelects the most appropriate (smaller) model based on request semantics
Multi-provider gatewayAcross data centersMultiple API providersSelects the most cost-efficient or available region
Performance gatewayWithin a data centerMultiple chipsDistributes requests across multiple (heterogeneous) chips within a data center to achieve optimal performance
Capabilities

Engineered for Performance

Every routing decision is informed by per-request KV-cache state, workload characteristics, and live engine telemetry.

Prefix Cache-Aware Routing

Routes each request to the chip with the longest cached prefix, minimizing KV-cache recomputation in multi-turn and long-context conversations.

Request Length-Based Routing

Selects the chip and serving configuration that best fits the request's sequence length, matching workload characteristics to hardware.

Flexible Routing Composition

Compose filters, scorers, and pickers into a custom routing pipeline via declarative configuration. Plug in prefix cache-aware, load-aware, request length-based, or custom scorers.

Heterogeneous Prefill-Decode Disaggregation

Coordinates prefill and decode phases across chips of different vendors and architectures, with automatic fallback to single-phase serving on transfer failure.

Minimal Overhead Outside GPU Compute

Routing, scheduling, and event-driven telemetry that typically span multiple services run inside a single binary, minimizing inter-process hops on the request hot path. Under load, the only meaningful latency in your inference pipeline is GPU compute itself.

16×Lower P99 Latencyvs. Istio + EPP
<1 µsScheduling Hot Path
Architecture

Modern API Complexity, Off Your Serving Engines

Tool calling, reasoning budgets, chat templates, structured outputs, streaming protocols — AI APIs grow more complex by the month, and most of that complexity is GPU-independent. MoAI Performance Gateway absorbs it at the edge so your serving engines stay simple: tokens in, tokens out. Update to the next API surface or reasoning model without touching GPU-dependent software.

Handled by the gateway
Chat templatesTool-call parsingReasoning extractionToken accountingStreaming SSERequest validationObservability events
What the engine sees
token_ids → engine → token_ids
API Surfaces

Speaks the APIs Your Apps Already Use

OpenAI and Anthropic compatibility, with the features that matter for agentic and reasoning workloads.

OpenAI Chat Completions API

POST /v1/chat/completions
Tool callsStreaming SSEReasoning contentSystem/developer roles

OpenAI Responses API

POST /v1/responses
Tool callsStreaming SSEReasoning contentSystem/developer rolesStateful conversation

Anthropic Messages API

POST /v1/messages
Tool useStreaming deltasExtended thinkingSystem prompts