MoAI Inference Framework

Automating distributed inference at data center scale

Serve large models across every GPU you have — regardless of vendor, generation, or architecture — through a single API endpoint. MoAI Inference Framework automatically allocates resources, routes requests, and scales capacity so your cluster delivers maximum throughput at the lowest latency.
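
For a concrete picture of the single-endpoint model, here is a minimal client sketch, assuming an OpenAI-compatible HTTP API (the convention for vLLM- and SGLang-based stacks); the gateway URL and model name are placeholders:

    # Minimal client sketch. Assumes the gateway exposes an
    # OpenAI-compatible API, as vLLM/SGLang backends typically do;
    # the URL and model name below are placeholders.
    import requests

    resp = requests.post(
        "http://moai-gateway.example.com/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    print(resp.json()["choices"][0]["message"]["content"])

The router behind the endpoint decides which vendor's silicon actually serves each request.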

Key Differentiator

One Cluster, Every GPU

Most inference stacks lock you into a single vendor. MoAI Inference Framework breaks that constraint — split prefill and decode across chips from different vendors, squeeze remaining value out of legacy GPUs, or add non-GPU accelerators into the same cluster. Each device runs what it's best at.

1.7× throughput with cross-vendor prefill-decode (PD) disaggregation

Zero overhead in mixed-vendor unified routing


[Architecture diagram: a unified API endpoint feeds the router/scheduler, which dispatches work across NVIDIA, AMD, and Tenstorrent accelerators.]

Core Capabilities

Automatic Disaggregation

Efficient distributed inference requires combining multiple techniques, allocating GPU resources optimally, and scheduling requests intelligently. MoAI Inference Framework automates all of this based on your defined SLOs and real-time traffic patterns.

01

SLO-Driven Optimization

Specify latency constraints and let the framework automatically determine the optimal parallelization strategy and resource allocation to maximize throughput per dollar.
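
As a rough illustration of what SLO-driven selection means, the sketch below filters candidate parallelization plans by latency targets and keeps the best throughput per dollar. The candidate fields and selection rule are invented for illustration; the framework's actual planner is internal:

    # Illustrative only: pick the best-value plan that meets the SLO.
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        name: str            # parallelization strategy, e.g. "TP=8, PP=2"
        ttft_ms: float       # estimated time to first token
        tpot_ms: float       # estimated time per output token
        tokens_per_s: float  # aggregate throughput of the allocation
        usd_per_hour: float  # GPU cost of the allocation

    def pick(candidates, ttft_slo_ms, tpot_slo_ms):
        feasible = [c for c in candidates
                    if c.ttft_ms <= ttft_slo_ms and c.tpot_ms <= tpot_slo_ms]
        if not feasible:
            return None  # no plan meets the SLO; scale out or relax targets
        # Maximize throughput per dollar among SLO-compliant plans.
        return max(feasible, key=lambda c: c.tokens_per_s / c.usd_per_hour)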

02

Prefill-Decode Disaggregation

Separates the prefill and decode phases across different GPU pools — including across heterogeneous GPU types — so each pool is sized and tuned for its phase: compute-bound prefill and memory-bandwidth-bound decode.
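
That asymmetry is why separate pools, even from different vendors, pay off. A conceptual sketch with hypothetical pool names and stubbed phase calls:

    import random

    PREFILL_POOL = ["nvidia-h100-0", "nvidia-h100-1"]  # compute-bound phase
    DECODE_POOL = ["amd-mi300x-0", "amd-mi300x-1"]     # bandwidth-bound phase

    def least_loaded(pool):
        return random.choice(pool)  # stand-in for real load tracking

    def serve(prompt):
        p = least_loaded(PREFILL_POOL)  # 1) build the KV cache on a prefill GPU
        d = least_loaded(DECODE_POOL)   # 2) decode elsewhere, possibly another vendor
        # 3) In a real system the KV cache moves from p to d (e.g. over
        #    RoCE/InfiniBand) before token-by-token generation starts on d.
        return f"prefill on {p}, decode on {d}"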

03

Prefix Cache-Aware Routing

Routes requests to instances that already hold the matching prefix in cache, reducing time to first token (TTFT) by up to 20× and achieving 2.2× throughput with just 40% of the servers.
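
One common way to implement this, shown here as an illustrative sketch rather than MoAI's exact mechanism: hash the prompt block by block, track which prefixes each instance has cached, and route to the instance with the most matches:

    import hashlib

    BLOCK = 256  # characters per hashed block (real systems hash token blocks)

    def prefix_hashes(prompt):
        out, h = [], hashlib.sha256()
        for i in range(0, len(prompt), BLOCK):
            h.update(prompt[i:i + BLOCK].encode())
            out.append(h.hexdigest())  # each hash covers the whole prefix so far
        return out

    cache_index = {"inst-a": set(), "inst-b": set()}  # instance -> cached prefixes

    def route(prompt):
        hashes = prefix_hashes(prompt)
        best = max(cache_index,
                   key=lambda inst: sum(h in cache_index[inst] for h in hashes))
        cache_index[best].update(hashes)  # the chosen instance caches them now
        return best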

04

Request Length-Based Routing

Classifies incoming requests by expected length and routes them to GPU pools optimized for each workload profile — short prompts to latency-tuned instances, long contexts to throughput-tuned ones.
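
An illustrative classifier; the threshold, pool names, and the characters-per-token heuristic are all assumptions:

    SHORT_POOL = "latency-tuned"     # small batches, fast time to first token
    LONG_POOL = "throughput-tuned"   # large batches, long contexts

    def route_by_length(prompt, expected_output_tokens, threshold=2048):
        # Crude ~4-characters-per-token estimate; production systems use
        # tokenizers or learned output-length predictors instead.
        est = len(prompt) // 4 + expected_output_tokens
        return SHORT_POOL if est < threshold else LONG_POOL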

05

Auto Scaling

Automatically scales inference capacity up and down based on traffic patterns, ensuring optimal resource utilization and cost efficiency.
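
A toy version of such a scaling rule, targeting a fixed queue depth per replica; the numbers are illustrative, not the framework's policy:

    def desired_replicas(queued_requests, target_queue_per_replica=8,
                         min_replicas=1, max_replicas=64):
        want = -(-queued_requests // target_queue_per_replica)  # ceiling division
        return min(max(want, min_replicas), max_replicas)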

Architecture

Kubernetes Native

MoAI Inference Framework runs as a set of Kubernetes-native controllers — no sidecar daemons, no proprietary control plane. Deploy with Helm, expose through any controller compatible with the Gateway API Inference Extension (including Istio), and let Node Feature Discovery (NFD) auto-discover heterogeneous accelerators across your fleet.
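
As a sketch of what NFD-based discovery looks like from the outside, the snippet below lists accelerator nodes by their NFD PCI labels using the official kubernetes Python client. The exact label keys depend on cluster configuration (data-center GPUs often appear under PCI class 0302 or 1200 rather than 0300), so treat the keys below as assumptions:

    from kubernetes import client, config

    # PCI vendor IDs: NVIDIA 10de, AMD 1002, Tenstorrent 1e52. NFD's pci
    # source emits labels like "pci-<class>_<vendor>.present"; the class
    # prefixes here are assumptions and vary by device.
    VENDORS = {
        "feature.node.kubernetes.io/pci-0302_10de.present": "NVIDIA",
        "feature.node.kubernetes.io/pci-0302_1002.present": "AMD",
        "feature.node.kubernetes.io/pci-1200_1e52.present": "Tenstorrent",
    }

    config.load_kube_config()
    for node in client.CoreV1Api().list_node().items:
        labels = node.metadata.labels or {}
        found = [v for k, v in VENDORS.items() if labels.get(k) == "true"]
        if found:
            print(node.metadata.name, ", ".join(found))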

Kubernetes Native, Gateway API Inference Extension, Istio Compatible, Helm Charts, NFD Integration, RoCE Networking

Supported Models

MoAI Inference Framework works with any model supported by its underlying serving engines (Moreh vLLM, vLLM, SGLang, and others). This includes most open-source LLMs:

DeepSeek, GPT-OSS, Llama, Qwen, Mistral, GLM, Step, Gemma, Kimi, and more

Supported Hardware

Accelerators

NVIDIA
B300, B200, H200, H100, H20, A100
AMD
MI355X, MI325X, MI308X, MI300X, MI250X, MI250
Tenstorrent
Blackhole, Wormhole

Networking

RoCE, InfiniBand