MoAI Inference Framework

Automating distributed inference at data center scale

Serve large models across every GPU you have — regardless of vendor, generation, or architecture — through a single API endpoint. MoAI Inference Framework automatically allocates resources, routes requests, and scales capacity so your cluster delivers maximum throughput at the lowest latency.
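
For a concrete picture of the single-endpoint model, here is a minimal client sketch, assuming an OpenAI-compatible HTTP API (the convention for vLLM- and SGLang-based stacks); the gateway URL and model name are placeholders:

    # Minimal client sketch. Assumes the gateway exposes an
    # OpenAI-compatible API, as vLLM/SGLang backends typically do;
    # the URL and model name below are placeholders.
    import requests

    resp = requests.post(
        "http://moai-gateway.example.com/v1/chat/completions",
        json={
            "model": "meta-llama/Llama-3.1-70B-Instruct",
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 64,
        },
        timeout=60,
    )
    print(resp.json()["choices"][0]["message"]["content"])

The router behind the endpoint decides which vendor's silicon actually serves each request.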

Key Differentiator

One Cluster, Every GPU

Most inference stacks lock you into a single vendor. MoAI Inference Framework breaks that constraint — split prefill and decode across chips from different vendors, squeeze remaining value out of legacy GPUs, or add non-GPU accelerators into the same cluster. Each device runs what it's best at.

1.7× throughput with cross-vendor prefill-decode (PD) disaggregation

Zero overhead in mixed-vendor unified routing


[Architecture diagram: a unified API endpoint feeds the router/scheduler, which dispatches work across NVIDIA, AMD, and Tenstorrent accelerators.]

Core Capabilities

Automatic Disaggregation

Efficient distributed inference requires combining multiple techniques, allocating GPU resources optimally, and scheduling requests intelligently. MoAI Inference Framework automates all of this based on your defined SLOs and real-time traffic patterns.

01

SLO-Driven Optimization

Specify latency constraints and let the framework automatically determine the optimal parallelization strategy and resource allocation to maximize throughput per dollar.
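
As a rough illustration of what SLO-driven selection means, the sketch below filters candidate parallelization plans by latency targets and keeps the best throughput per dollar. The candidate fields and selection rule are invented for illustration; the framework's actual planner is internal:

    # Illustrative only: pick the best-value plan that meets the SLO.
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        name: str            # parallelization strategy, e.g. "TP=8, PP=2"
        ttft_ms: float       # estimated time to first token
        tpot_ms: float       # estimated time per output token
        tokens_per_s: float  # aggregate throughput of the allocation
        usd_per_hour: float  # GPU cost of the allocation

    def pick(candidates, ttft_slo_ms, tpot_slo_ms):
        feasible = [c for c in candidates
                    if c.ttft_ms <= ttft_slo_ms and c.tpot_ms <= tpot_slo_ms]
        if not feasible:
            return None  # no plan meets the SLO; scale out or relax targets
        # Maximize throughput per dollar among SLO-compliant plans.
        return max(feasible, key=lambda c: c.tokens_per_s / c.usd_per_hour)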

02

Prefill-Decode Disaggregation

Separates the prefill and decode phases across different GPU pools — including across heterogeneous GPU types — so each pool is sized and tuned for its phase: compute-bound prefill and memory-bandwidth-bound decode.
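
That asymmetry is why separate pools, even from different vendors, pay off. A conceptual sketch with hypothetical pool names and stubbed phase calls:

    import random

    PREFILL_POOL = ["nvidia-h100-0", "nvidia-h100-1"]  # compute-bound phase
    DECODE_POOL = ["amd-mi300x-0", "amd-mi300x-1"]     # bandwidth-bound phase

    def least_loaded(pool):
        return random.choice(pool)  # stand-in for real load tracking

    def serve(prompt):
        p = least_loaded(PREFILL_POOL)  # 1) build the KV cache on a prefill GPU
        d = least_loaded(DECODE_POOL)   # 2) decode elsewhere, possibly another vendor
        # 3) In a real system the KV cache moves from p to d (e.g. over
        #    RoCE/InfiniBand) before token-by-token generation starts on d.
        return f"prefill on {p}, decode on {d}"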

03

Prefix Cache-Aware Routing

Routes requests to instances that already hold the matching prefix in cache, reducing time to first token (TTFT) by up to 20× and achieving 2.2× throughput with just 40% of the servers.
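
One common way to implement this, shown here as an illustrative sketch rather than MoAI's exact mechanism: hash the prompt block by block, track which prefixes each instance has cached, and route to the instance with the most matches:

    import hashlib

    BLOCK = 256  # characters per hashed block (real systems hash token blocks)

    def prefix_hashes(prompt):
        out, h = [], hashlib.sha256()
        for i in range(0, len(prompt), BLOCK):
            h.update(prompt[i:i + BLOCK].encode())
            out.append(h.hexdigest())  # each hash covers the whole prefix so far
        return out

    cache_index = {"inst-a": set(), "inst-b": set()}  # instance -> cached prefixes

    def route(prompt):
        hashes = prefix_hashes(prompt)
        best = max(cache_index,
                   key=lambda inst: sum(h in cache_index[inst] for h in hashes))
        cache_index[best].update(hashes)  # the chosen instance caches them now
        return best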

04

Request Length-Based Routing

Classifies incoming requests by expected length and routes them to GPU pools optimized for each workload profile — short prompts to latency-tuned instances, long contexts to throughput-tuned ones.
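
An illustrative classifier; the threshold, pool names, and the characters-per-token heuristic are all assumptions:

    SHORT_POOL = "latency-tuned"     # small batches, fast time to first token
    LONG_POOL = "throughput-tuned"   # large batches, long contexts

    def route_by_length(prompt, expected_output_tokens, threshold=2048):
        # Crude ~4-characters-per-token estimate; production systems use
        # tokenizers or learned output-length predictors instead.
        est = len(prompt) // 4 + expected_output_tokens
        return SHORT_POOL if est < threshold else LONG_POOL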

05

Auto Scaling

Automatically scales inference capacity up and down based on traffic patterns, ensuring optimal resource utilization and cost efficiency.
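
A toy version of such a scaling rule, targeting a fixed queue depth per replica; the numbers are illustrative, not the framework's policy:

    def desired_replicas(queued_requests, target_queue_per_replica=8,
                         min_replicas=1, max_replicas=64):
        want = -(-queued_requests // target_queue_per_replica)  # ceiling division
        return min(max(want, min_replicas), max_replicas)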

Architecture

Kubernetes Native

MoAI Inference Framework runs as a set of Kubernetes-native controllers — no sidecar daemons, no proprietary control plane. Deploy with Helm, expose through any controller compatible with the Gateway API Inference Extension (including Istio), and let Node Feature Discovery (NFD) auto-discover heterogeneous accelerators across your fleet.
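
As a sketch of what NFD-based discovery looks like from the outside, the snippet below lists accelerator nodes by their NFD PCI labels using the official kubernetes Python client. The exact label keys depend on cluster configuration (data-center GPUs often appear under PCI class 0302 or 1200 rather than 0300), so treat the keys below as assumptions:

    from kubernetes import client, config

    # PCI vendor IDs: NVIDIA 10de, AMD 1002, Tenstorrent 1e52. NFD's pci
    # source emits labels like "pci-<class>_<vendor>.present"; the class
    # prefixes here are assumptions and vary by device.
    VENDORS = {
        "feature.node.kubernetes.io/pci-0302_10de.present": "NVIDIA",
        "feature.node.kubernetes.io/pci-0302_1002.present": "AMD",
        "feature.node.kubernetes.io/pci-1200_1e52.present": "Tenstorrent",
    }

    config.load_kube_config()
    for node in client.CoreV1Api().list_node().items:
        labels = node.metadata.labels or {}
        found = [v for k, v in VENDORS.items() if labels.get(k) == "true"]
        if found:
            print(node.metadata.name, ", ".join(found))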

Kubernetes Native, Gateway API Inference Extension, Istio Compatible, Helm Charts, NFD Integration, RoCE Networking

Supported Models

MoAI Inference Framework works with any model supported by its underlying serving engines (Moreh vLLM, vLLM, SGLang, and others). This includes most open-source LLMs:

DeepSeek, GPT-OSS, Llama, Qwen, Mistral, GLM, Step, Gemma, Kimi, and more

Supported Hardware

Accelerators

NVIDIA
B300, B200, H200, H100, H20, A100
AMD
MI355X, MI325X, MI308X, MI300X, MI250X, MI250
Tenstorrent
Blackhole, Wormhole

Networking

RoCE, InfiniBand