MoAI Inference Framework
Data-center-scale distributed inference software
Serve large models across every GPU you have — regardless of vendor, generation, or architecture — through a single API endpoint. MoAI Inference Framework automatically allocates resources, routes requests, and scales capacity so your cluster delivers maximum throughput at the lowest latency.
One Cluster, Every GPU
Most inference stacks lock you into a single vendor. MoAI Inference Framework breaks that constraint — split prefill and decode across chips from different vendors, squeeze remaining value out of legacy GPUs, or add non-GPU accelerators into the same cluster. Each device runs what it's best at.
[Diagram labels: Unified API Endpoint; Performance Gateway; NVIDIA, AMD, Tenstorrent; Cross-Vendor Software Fabric]
Automatic Disaggregation
Efficient distributed inference requires combining multiple techniques, allocating GPU resources optimally, and scheduling requests intelligently. MoAI Inference Framework automates all of this based on your defined SLOs and real-time traffic patterns.
SLO-Driven Optimization
Specify latency constraints and let the framework automatically determine the optimal parallelization strategy and resource allocation to maximize throughput per dollar.
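As a rough illustration of the idea, SLO-driven selection can be thought of as filtering candidate configurations by the latency constraints and then ranking the survivors by throughput per dollar. The sketch below is not the framework's actual optimizer; the `Candidate` and `SLO` types, the configuration names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One hypothetical serving configuration with made-up profiled numbers."""
    name: str
    ttft_ms: float        # time to first token
    tpot_ms: float        # time per output token
    tokens_per_s: float   # aggregate throughput
    cost_per_hour: float  # hourly cost of the hardware it occupies


@dataclass(frozen=True)
class SLO:
    """Latency constraints supplied by the operator (assumed shape)."""
    max_ttft_ms: float
    max_tpot_ms: float


def pick_config(candidates: Sequence[Candidate], slo: SLO) -> Optional[Candidate]:
    """Return the feasible candidate with the best throughput per dollar, if any."""
    feasible = [
        c for c in candidates
        if c.ttft_ms <= slo.max_ttft_ms and c.tpot_ms <= slo.max_tpot_ms
    ]
    if not feasible:
        return None  # no configuration can meet the SLO
    return max(feasible, key=lambda c: c.tokens_per_s / c.cost_per_hour)


candidates = [
    Candidate("config-a", ttft_ms=120, tpot_ms=18, tokens_per_s=9000, cost_per_hour=32.0),
    Candidate("config-b", ttft_ms=180, tpot_ms=25, tokens_per_s=11000, cost_per_hour=32.0),
    Candidate("config-c", ttft_ms=400, tpot_ms=40, tokens_per_s=14000, cost_per_hour=32.0),
]
best = pick_config(candidates, SLO(max_ttft_ms=200, max_tpot_ms=30))
```

Here `config-c` has the highest raw throughput but violates both latency limits, so the selection falls to the cheaper-per-token of the two configurations that remain.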
Prefill-Decode Disaggregation
Runs the prefill and decode phases on separate GPU pools, which may use different GPU types, so that each phase gets hardware matched to its workload: prefill is typically compute-bound, while decode is typically bound by memory bandwidth.
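A minimal sketch of the placement step, assuming two independent worker pools and a simple least-loaded policy. The `Worker` and `Placement` types, the worker names, and the device labels are invented for illustration; releasing a worker when its phase finishes and the KV cache transfer itself are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    name: str
    device: str      # hypothetical device label
    active: int = 0  # requests currently assigned to this worker


@dataclass(frozen=True)
class Placement:
    prefill_worker: str
    decode_worker: str


def least_loaded(pool: List[Worker]) -> Worker:
    # min() returns the first worker among ties, so list order is the tie-break.
    return min(pool, key=lambda w: w.active)


def place(prefill_pool: List[Worker], decode_pool: List[Worker]) -> Placement:
    """Pick one worker per phase; the KV cache would move from the first to the second."""
    p = least_loaded(prefill_pool)
    d = least_loaded(decode_pool)
    p.active += 1
    d.active += 1
    return Placement(p.name, d.name)


prefill_pool = [Worker("prefill-0", "gpu-type-A"), Worker("prefill-1", "gpu-type-A")]
decode_pool = [Worker("decode-0", "gpu-type-B", active=1), Worker("decode-1", "gpu-type-B")]

first = place(prefill_pool, decode_pool)
second = place(prefill_pool, decode_pool)
```

Because the two pools are chosen independently, each can be sized and scaled for its own phase, which is the point of disaggregation.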
Prefix Cache-Aware Routing
Routes each request to an instance that already holds the cached computation (KV cache) for its prompt prefix, reducing time to first token (TTFT) by up to 20× and achieving 2.2× throughput with just 40% of the servers.
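One common way to implement this kind of routing is to hash the prompt in fixed-size token blocks, chain the hashes so each block's hash depends on everything before it, and send the request to the instance holding the longest matching run of blocks. The sketch below assumes that scheme; it is not the gateway's actual algorithm, and the block size, instance names, and load figures are illustrative.

```python
import hashlib
from typing import Dict, List, Sequence, Set

BLOCK = 4  # tokens per cache block (illustrative; real engines use larger blocks)


def block_hashes(tokens: Sequence[int]) -> List[str]:
    """Chained hashes over full blocks: block i's hash covers all tokens up to it."""
    hashes: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        chunk = ",".join(map(str, tokens[i:i + BLOCK]))
        prev = hashlib.sha256((prev + "|" + chunk).encode()).hexdigest()
        hashes.append(prev)
    return hashes


def matched_blocks(hashes: Sequence[str], cached: Set[str]) -> int:
    """Count leading blocks present in an instance's cache."""
    n = 0
    for h in hashes:
        if h not in cached:
            break
        n += 1
    return n


def route(tokens: Sequence[int], caches: Dict[str, Set[str]], load: Dict[str, int]) -> str:
    """Prefer the longest cached prefix; break ties by the lowest load."""
    hashes = block_hashes(tokens)
    return max(caches, key=lambda name: (matched_blocks(hashes, caches[name]), -load[name]))


system_prompt = list(range(8))  # a shared prefix spanning two blocks
caches = {"instance-a": set(block_hashes(system_prompt)), "instance-b": set()}
load = {"instance-a": 3, "instance-b": 0}

with_shared_prefix = route(system_prompt + [100, 101, 102, 103], caches, load)
unrelated = route([9, 9, 9, 9], caches, load)
```

The request that starts with the shared prefix goes to `instance-a` despite its higher load, while the unrelated request has no cache hit anywhere and falls back to the less loaded `instance-b`.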
Request Length-Based Routing
Classifies incoming requests by expected length and routes them to GPU pools optimized for each workload profile — short prompts to latency-tuned instances, long contexts to throughput-tuned ones.
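A minimal sketch of the classification step, assuming the expected length is estimated as prompt tokens plus the requested output budget. The threshold value, the `Request` type, and the pool names are hypothetical and not taken from the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    max_new_tokens: int


LONG_CONTEXT_THRESHOLD = 8192  # hypothetical cutoff, in total tokens


def expected_length(req: Request) -> int:
    """Upper-bound estimate: the prompt plus the most tokens the request may generate."""
    return req.prompt_tokens + req.max_new_tokens


def choose_pool(req: Request, threshold: int = LONG_CONTEXT_THRESHOLD) -> str:
    """Send long requests to the throughput-tuned pool, the rest to the latency-tuned pool."""
    if expected_length(req) >= threshold:
        return "throughput-tuned"
    return "latency-tuned"


chat_turn = choose_pool(Request(prompt_tokens=200, max_new_tokens=256))
long_document = choose_pool(Request(prompt_tokens=30000, max_new_tokens=1024))
```

Keeping the two classes apart means a long-context request does not sit in the same batch as short interactive prompts and inflate their latency.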
Auto Scaling
Automatically scales inference capacity up and down based on traffic patterns, ensuring optimal resource utilization and cost efficiency.
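A simple way to picture traffic-based scaling is a proportional target-tracking rule, the same general form used by the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of observed load to target load, then clamp it. The sketch below assumes that rule; the metric (in-flight requests per replica) and the bounds are illustrative, and the framework's own policy may differ.

```python
import math


def desired_replicas(
    current: int,
    observed_per_replica: float,
    target_per_replica: float,
    min_replicas: int = 1,
    max_replicas: int = 64,
) -> int:
    """Proportional target tracking: ceil(current * observed / target), clamped."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(current * observed_per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


scale_up = desired_replicas(current=4, observed_per_replica=30, target_per_replica=20)
scale_down = desired_replicas(current=4, observed_per_replica=5, target_per_replica=20)
```

With four replicas each carrying 30 in-flight requests against a target of 20, the rule asks for six replicas; at 5 in-flight requests each, it shrinks the pool to one.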
Building Blocks
MoAI Inference Framework is composed of purpose-built components that work together to deliver optimal inference across heterogeneous accelerators.
MoAI Performance Gateway
Intelligent workload distribution across heterogeneous accelerators.
MoAI Fabric
Software-defined, cross-vendor GPU memory fabric for KV cache transfer.
MoAI Autopilot
SLO-driven serving stack configuration and continuous optimization.
Moreh vLLM for AMD
Drop-in vLLM replacement with up to 2× higher throughput on AMD GPUs.
Moreh vLLM for Tenstorrent
High-performance vLLM serving on Tenstorrent accelerators.
Supported Models
MoAI Inference Framework works with any model supported by its underlying serving engines (Moreh vLLM, vLLM, SGLang, and others). This includes most open-source LLMs.
Supported Hardware
Accelerators
Networking