Full-Stack Software

MoAI Inference Framework

Data-center-scale distributed inference software

Serve large models across every GPU you have — regardless of vendor, generation, or architecture — through a single API endpoint. MoAI Inference Framework automatically allocates resources, routes requests, and scales capacity so your cluster delivers maximum throughput at the lowest latency.

Key Differentiator

One Cluster, Every GPU

Most inference stacks lock you into a single vendor. MoAI Inference Framework breaks that constraint — split prefill and decode across chips from different vendors, squeeze remaining value out of legacy GPUs, or add non-GPU accelerators into the same cluster. Each device runs what it's best at.

1.7× throughput with cross-vendor prefill-decode (PD) disaggregation

0 overhead in mixed-vendor unified routing

[Diagram: Unified API Endpoint, Performance Gateway, accelerators (NVIDIA, AMD, Tenstorrent, …), Cross-Vendor Software Fabric]

Core Capabilities

Automatic Disaggregation

Efficient distributed inference requires combining multiple techniques, allocating GPU resources optimally, and scheduling requests intelligently. MoAI Inference Framework automates all of this based on your defined SLOs and real-time traffic patterns.

SLO-Driven Optimization

Specify latency constraints and let the framework automatically determine the optimal parallelization strategy and resource allocation to maximize throughput per dollar.
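As a rough illustration of what SLO-driven selection can look like, the sketch below filters candidate serving configurations by two latency limits and then picks the one with the highest throughput per dollar. The `Candidate` type, the configuration names, and every number are hypothetical placeholders, not measurements or the framework's actual API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str                # hypothetical label for a serving configuration
    ttft_ms: float           # predicted time to first token
    tpot_ms: float           # predicted time per output token
    tokens_per_s: float      # predicted aggregate throughput
    dollars_per_hour: float  # cost of the hardware the configuration uses


def pick_config(candidates: Iterable[Candidate],
                max_ttft_ms: float,
                max_tpot_ms: float) -> Optional[Candidate]:
    """Return the feasible candidate with the best throughput per dollar."""
    feasible = [c for c in candidates
                if c.ttft_ms <= max_ttft_ms and c.tpot_ms <= max_tpot_ms]
    if not feasible:
        return None  # no configuration satisfies the SLO
    return max(feasible, key=lambda c: c.tokens_per_s / c.dollars_per_hour)


# Made-up candidates, for illustration only.
CANDIDATES = [
    Candidate("tp8", ttft_ms=120, tpot_ms=20, tokens_per_s=9000, dollars_per_hour=32),
    Candidate("tp4-x2", ttft_ms=200, tpot_ms=30, tokens_per_s=12000, dollars_per_hour=32),
    Candidate("tp2-x4", ttft_ms=450, tpot_ms=45, tokens_per_s=15000, dollars_per_hour=32),
]

best = pick_config(CANDIDATES, max_ttft_ms=250, max_tpot_ms=35)
```

With these placeholder numbers, the highest-throughput candidate is rejected because it misses both latency limits, so the selection falls to the best option that still meets the SLO.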

Prefill-Decode Disaggregation

Separates prefill and decode phases across different GPU pools — including across heterogeneous GPU types — to optimize resource utilization for each workload characteristic.
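To make the idea concrete, here is a minimal sketch of phase placement across heterogeneous pools. It relies on the general observation that prefill tends to be compute-bound while decode tends to be memory-bandwidth-bound. The `Pool` type, the pool names, and the hardware figures are hypothetical, and a real placement decision would also weigh KV cache transfer cost, memory capacity, and pool size.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Pool:
    name: str           # hypothetical pool label
    tflops: float       # peak compute (illustrative figure)
    mem_bw_tbps: float  # memory bandwidth in TB/s (illustrative figure)


def assign_phases(pools: Sequence[Pool]) -> Dict[str, str]:
    """Place prefill on the most compute-rich pool and decode on the
    pool with the highest memory bandwidth."""
    if not pools:
        raise ValueError("at least one pool is required")
    prefill = max(pools, key=lambda p: p.tflops)
    decode = max(pools, key=lambda p: p.mem_bw_tbps)
    return {"prefill": prefill.name, "decode": decode.name}


# Two made-up pools with different strengths.
placement = assign_phases([
    Pool("pool-a", tflops=1000.0, mem_bw_tbps=3.0),
    Pool("pool-b", tflops=600.0, mem_bw_tbps=5.0),
])
```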

Prefix Cache-Aware Routing

Routes requests to instances that already hold cached computations for a matching prompt prefix, reducing time to first token (TTFT) by up to 20× and achieving 2.2× throughput with just 40% of the servers.
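One common way to implement this kind of routing is to hash the prompt in fixed-size token blocks and send each request to the instance whose cache covers the longest leading run of those blocks. The sketch below assumes that approach; the block size, instance names, and load-based tie-break are illustrative choices, not a description of the Performance Gateway's internals.

```python
import hashlib
from typing import Dict, List, Sequence, Set

BLOCK = 4  # tokens per cache block (illustrative; real engines use larger blocks)


def block_hashes(tokens: Sequence[int], block: int = BLOCK) -> List[str]:
    """Hash each full block, chaining in the previous hash so that a hash
    identifies the entire prefix up to and including that block."""
    hashes: List[str] = []
    prev = ""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block]))
        prev = hashlib.sha256((prev + "|" + chunk).encode()).hexdigest()
        hashes.append(prev)
    return hashes


def matched_blocks(request_hashes: Sequence[str], cached: Set[str]) -> int:
    """Count how many leading blocks of the request are already cached."""
    count = 0
    for h in request_hashes:
        if h not in cached:
            break
        count += 1
    return count


def route(tokens: Sequence[int],
          caches: Dict[str, Set[str]],
          load: Dict[str, int]) -> str:
    """Prefer the longest cached prefix, then the least-loaded instance."""
    req = block_hashes(tokens)
    return max(caches, key=lambda name: (matched_blocks(req, caches[name]), -load[name]))


# Hypothetical instances: "a" has served an 8-token prefix, "b" is cold but idle.
caches = {"a": set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8])), "b": set()}
load = {"a": 5, "b": 0}

warm = route([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], caches, load)
cold = route([9, 9, 9, 9], caches, load)
```

Here `warm` goes to the busier instance because it can reuse two cached blocks, while `cold` has no cached prefix anywhere and falls back to the least-loaded instance.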

Request Length-Based Routing

Classifies incoming requests by expected length and routes them to GPU pools optimized for each workload profile — short prompts to latency-tuned instances, long contexts to throughput-tuned ones.
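A minimal sketch of length-based routing follows. It assumes expected length can be approximated as prompt tokens plus the requested output budget, and it uses a single fixed threshold; the pool names, instance names, and threshold value are all hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterator, List

# Hypothetical pools and instance names, for illustration only.
POOLS: Dict[str, List[str]] = {
    "latency": ["lat-0", "lat-1"],
    "throughput": ["thr-0", "thr-1"],
}

# One round-robin iterator per pool to spread requests across its instances.
_round_robin: Dict[str, Iterator[str]] = {
    name: cycle(instances) for name, instances in POOLS.items()
}


def classify(prompt_tokens: int, max_new_tokens: int, threshold: int = 4096) -> str:
    """Pick a pool from the expected total length of the request."""
    expected = prompt_tokens + max_new_tokens
    return "latency" if expected < threshold else "throughput"


def route(prompt_tokens: int, max_new_tokens: int) -> str:
    """Return the next instance in the pool chosen for this request."""
    return next(_round_robin[classify(prompt_tokens, max_new_tokens)])


short_target = route(prompt_tokens=200, max_new_tokens=100)
long_target = route(prompt_tokens=30000, max_new_tokens=512)
```

A production router would likely use more than two classes and a learned or measured length estimate, but the split between latency-tuned and throughput-tuned pools keeps the same shape.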

Auto Scaling

Automatically scales inference capacity up and down based on traffic patterns, ensuring optimal resource utilization and cost efficiency.
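For intuition, the sketch below applies a simple target-tracking rule of the same shape as the one used by the Kubernetes Horizontal Pod Autoscaler: scale the replica count by the ratio of the observed per-replica load to the target, then clamp to configured bounds. The metric, the bounds, and the function name are assumptions for illustration; the framework's own scaling policy is not described here.

```python
import math


def desired_replicas(current: int,
                     observed_per_replica: float,
                     target_per_replica: float,
                     min_replicas: int = 1,
                     max_replicas: int = 64) -> int:
    """Target-tracking rule: ceil(current * observed / target), clamped."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(current * observed_per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical metric: in-flight requests per replica, with a target of 20.
scale_up = desired_replicas(current=4, observed_per_replica=30, target_per_replica=20)
scale_down = desired_replicas(current=4, observed_per_replica=5, target_per_replica=20)
```

In practice an autoscaler usually adds a stabilization window or cooldown so that short traffic spikes do not cause replicas to flap.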

Architecture

Building Blocks

MoAI Inference Framework is composed of purpose-built components that work together to deliver optimal inference across heterogeneous accelerators.

MoAI Performance Gateway

Intelligent workload distribution across heterogeneous accelerators.

MoAI Fabric

Software-defined, cross-vendor GPU memory fabric for KV cache transfer.

MoAI Autopilot

SLO-driven serving stack configuration and continuous optimization.

Coming soon

Moreh vLLM for AMD

Drop-in vLLM replacement with up to 2× higher throughput on AMD GPUs.

Moreh vLLM for Tenstorrent

High-performance vLLM serving on Tenstorrent accelerators.

Models

Supported Models

MoAI Inference Framework works with any model supported by its underlying serving engines (Moreh vLLM, vLLM, SGLang, and others). This includes most open-source LLMs.

Hardware

Supported Hardware

Accelerators

NVIDIA

AMD

Tenstorrent

Networking

RDMA interconnect