MoAI Inference Framework
Automating distributed inference at data center scale
Serve large models across every GPU you have — regardless of vendor, generation, or architecture — through a single API endpoint. MoAI Inference Framework automatically allocates resources, routes requests, and scales capacity so your cluster delivers maximum throughput at the lowest latency.
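From the client's side, the single endpoint can be used like any ordinary serving API. A minimal sketch, assuming the unified endpoint exposes an OpenAI-style chat completions API (the URL and model name below are placeholders, not the framework's documented defaults):

```python
import json
import urllib.request

# Placeholder endpoint and model name; substitute your deployment's values.
ENDPOINT = "http://moai-gateway.example.com/v1/chat/completions"

def build_request(prompt: str, model: str = "llama-3.1-8b-instruct",
                  max_tokens: int = 128) -> dict:
    """Assemble an OpenAI-style chat completion payload. The same payload
    works regardless of which vendor's GPU ends up serving the request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> dict:
    """POST the request to the unified endpoint; the router/scheduler
    decides which backend instance handles it."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Because routing happens behind the endpoint, clients never need to know which vendor, generation, or pool of accelerators served a given request.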
Key Differentiator
One Cluster, Every GPU
Most inference stacks lock you into a single vendor. MoAI Inference Framework breaks that constraint — split prefill and decode across chips from different vendors, squeeze remaining value out of legacy GPUs, or add non-GPU accelerators into the same cluster. Each device runs what it's best at.
1.7× throughput with cross-vendor PD disaggregation
Zero overhead in mixed-vendor unified routing
[Diagram: a unified API endpoint feeds a router/scheduler, which dispatches requests across NVIDIA, AMD, and Tenstorrent accelerators in one cluster.]
Core Capabilities
Automatic Disaggregation
Efficient distributed inference requires combining multiple techniques, allocating GPU resources optimally, and scheduling requests intelligently. MoAI Inference Framework automates all of this based on your defined SLOs and real-time traffic patterns.
01
SLO-Driven Optimization
Specify latency constraints and let the framework automatically determine the optimal parallelization strategy and resource allocation to maximize throughput per dollar.
02
Prefill-Decode Disaggregation
Separates prefill and decode phases across different GPU pools — including across heterogeneous GPU types — to optimize resource utilization for each workload characteristic.
03
Prefix Cache-Aware Routing
Routes requests to instances that already hold cached prefix computations, reducing time to first token (TTFT) by up to 20× and achieving 2.2× throughput with just 40% of the servers.
04
Request Length-Based Routing
Classifies incoming requests by expected length and routes them to GPU pools optimized for each workload profile — short prompts to latency-tuned instances, long contexts to throughput-tuned ones.
05
Auto Scaling
Automatically scales inference capacity up and down based on traffic patterns, ensuring optimal resource utilization and cost efficiency.
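The routing capabilities above can be illustrated with a toy scheduler. This is a simplification for intuition, not the framework's implementation: the instance names are hypothetical, caches are represented as plain strings, and matching is done on raw characters rather than tokens.

```python
# Toy prefix cache-aware routing: send each request to the instance whose
# cached prefixes share the longest common prefix with the incoming prompt,
# so cached prefill work can be reused instead of recomputed.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: str, instance_caches: dict) -> str:
    """Pick the instance with the best cached-prefix match for the prompt."""
    best_instance, best_len = None, -1
    for instance, cached_prefixes in instance_caches.items():
        score = max(
            (shared_prefix_len(prompt, c) for c in cached_prefixes),
            default=0,
        )
        if score > best_len:
            best_instance, best_len = instance, score
    return best_instance

# Hypothetical pools, each holding previously computed prefixes in cache.
caches = {
    "gpu-pool-a": ["You are a helpful assistant."],
    "gpu-pool-b": ["Translate the following text"],
}
```

A real scheduler scores matches over token-block hashes and also weighs load, SLO targets, and request length, but the core idea is the same: reuse beats recompute.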
Architecture
Kubernetes Native
MoAI Inference Framework runs as a set of Kubernetes-native controllers: no sidecar daemons, no proprietary control plane. Deploy with Helm, expose the endpoint through any Gateway API Inference Extension-compatible controller (including Istio), and let Node Feature Discovery (NFD) auto-discover heterogeneous accelerators across your fleet.
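A deployment of this shape might be configured through a Helm values file. The chart structure and key names below are hypothetical, meant only to show how gateway choice, SLO targets, and NFD discovery could surface as configuration:

```yaml
# Illustrative values.yaml -- not the framework's documented schema.
router:
  gatewayClassName: istio      # any Gateway API Inference Extension-compatible controller
slo:
  ttftMillis: 500              # target time to first token
  tpotMillis: 50               # target time per output token
autoscaling:
  enabled: true                # scale capacity with traffic
nodeFeatureDiscovery:
  enabled: true                # auto-label NVIDIA / AMD / Tenstorrent nodes
```

With values like these, the controllers would have everything they need to pick parallelization strategies and place prefill and decode pools without per-node manual configuration.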
Supported Models
MoAI Inference Framework works with any model supported by its underlying serving engines (Moreh vLLM, vLLM, SGLang, and others), which covers most open-source LLMs.
Supported Hardware
Accelerators
Networking