Building Block

MoAI Performance Gateway

Routes inference requests across heterogeneous chips in your data center to extract optimal performance from every accelerator. Exposes OpenAI- and Anthropic-compatible APIs, built for production.

A New Category

The Performance Gateway, Defined

AI gateways typically mean routing across model providers or regions. Moreh defines a new category: routing within a data center, across the chips you already have, for performance.

Gateway	Scale	Across	Role
Semantic gateway	Within or across data centers	Multiple models	Selects the most appropriate (smaller) model based on request semantics
Multi-provider gateway	Across data centers	Multiple API providers	Selects the most cost-efficient or available region
Performance gateway	Within a data center	Multiple chips	Distributes requests across multiple (heterogeneous) chips within a data center to achieve optimal performance

Capabilities

Engineered for Performance

Every routing decision is informed by per-request KV-cache state, workload characteristics, and live engine telemetry.

Prefix Cache-Aware Routing

Routes each request to the chip with the longest cached prefix, minimizing KV-cache recomputation in multi-turn and long-context conversations.

Request Length-Based Routing

Selects the chip and serving configuration that best fits the request's sequence length, matching workload characteristics to hardware.

Flexible Routing Composition

Compose filters, scorers, and pickers into a custom routing pipeline via declarative configuration. Plug in prefix cache-aware, load-aware, request length-based, or custom scorers.

Heterogeneous Prefill-Decode Disaggregation

Coordinates prefill and decode phases across chips of different vendors and architectures, with automatic fallback to single-phase serving on transfer failure.

Minimal Overhead Outside GPU Compute

Routing, scheduling, and event-driven telemetry that typically span multiple services run inside a single binary, minimizing inter-process hops on the request hot path. Under load, the only meaningful latency in your inference pipeline is GPU compute itself.

16×Lower P99 Latencyvs. Istio + EPP

<1 µsScheduling Hot Path

Architecture

Modern API Complexity, Off Your Serving Engines

Tool calling, reasoning budgets, chat templates, structured outputs, streaming protocols — AI APIs grow more complex by the month, and most of that complexity is GPU-independent. MoAI Performance Gateway absorbs it at the edge so your serving engines stay simple: tokens in, tokens out. Update to the next API surface or reasoning model without touching GPU-dependent software.

Handled by the gateway

What the engine sees

token_ids → engine → token_ids

API Surfaces

Speaks the APIs Your Apps Already Use

OpenAI and Anthropic compatibility, with the features that matter for agentic and reasoning workloads.

OpenAI Chat Completions API

POST /v1/chat/completions

OpenAI Responses API

POST /v1/responses

Anthropic Messages API

POST /v1/messages