Powering the fastest serving on GPU clusters

The efficiency of cluster-level distributed inference has become the dominant factor in AI service costs. MoAI Inference Framework optimizes AI models at data center scale to achieve superlinear efficiency gains.

Automatic Disaggregation, Scheduling, Routing, and Scaling

Implementing efficient distributed inference goes beyond simply applying individual techniques such as prefill-decode disaggregation and prefix cache-aware routing. The greater challenge lies in combining multiple techniques, allocating GPU resources appropriately, and scheduling incoming requests. MoAI Inference Framework automates all of this based on defined service-level objectives (SLOs) and real-time traffic patterns.
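As an illustration of how these signals interact, the sketch below combines prefix cache-aware routing with prefill/decode disaggregation when picking a replica for an incoming request. All class names, fields, and the scoring rule are hypothetical simplifications for explanation, not MoAI Inference Framework APIs.

```python
# Minimal sketch (hypothetical names, not MoAI Inference Framework APIs):
# route a request to a prefill replica by combining prefix cache hits with
# queue depth in a prefill/decode-disaggregated deployment.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    role: str                  # "prefill" or "decode"
    cached_prefixes: set[str]  # prompt prefixes resident in this replica's KV cache
    queue_depth: int           # requests currently waiting

@dataclass
class Request:
    prompt: str
    ttft_slo_ms: int           # time-to-first-token objective for this request

def pick_prefill_replica(req: Request, replicas: list[Replica]) -> Replica:
    """Prefer replicas that already cache a matching prefix; break ties by load."""
    candidates = [r for r in replicas if r.role == "prefill"]
    def score(r: Replica) -> tuple[bool, int]:
        cache_hit = any(req.prompt.startswith(p) for p in r.cached_prefixes)
        return (not cache_hit, r.queue_depth)  # cache hits sort first, then least loaded
    return min(candidates, key=score)

replicas = [
    Replica("prefill-0", "prefill", {"You are a helpful assistant."}, queue_depth=2),
    Replica("prefill-1", "prefill", set(), queue_depth=0),
    Replica("decode-0",  "decode",  set(), queue_depth=5),
]
req = Request("You are a helpful assistant. Summarize this report: ...", ttft_slo_ms=300)
print(pick_prefill_replica(req, replicas).name)  # -> prefill-0 (prefix cache hit wins)
```

A production scheduler would also weigh the decode pool, KV-cache transfer cost, and the SLO itself; the point here is only that these signals are combined per request rather than applied in isolation.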

Heterogeneous Accelerators

No single accelerator is optimal for every inference workload in a data center. Using different types of chips together will become essential for reducing the total cost of operation. MoAI Inference Framework efficiently integrates heterogeneous accelerators and provides a single, unified inference endpoint.

Scenario 1

Legacy + Latest GPU

Enhance service quality (e.g., latency) by utilizing the latest GPUs, while achieving high throughput at low cost by leveraging legacy GPUs; see the sketch after the scenarios below.

Scenario 2

NVIDIA + AMD GPU

Combine alternative accelerators such as AMD GPUs to reduce dependency on a single hardware vendor.

Scenario 3

GPU + Others

Maximize efficiency by mixing chips with different performance characteristics, such as Rubin GPU + Rubin CPX or GPU + Tenstorrent.
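To make the cost/latency trade-off behind these scenarios (particularly Scenario 1) concrete, here is a minimal sketch of picking the cheapest GPU pool that still meets a request's latency objective. The pool names, latency figures, and prices are invented placeholders, not measurements of any specific hardware.

```python
# Hypothetical numbers for illustration only: pick the cheapest GPU pool
# whose typical time-to-first-token (TTFT) still satisfies the request's SLO.
POOLS = {
    # pool name:       (typical TTFT in ms, cost per 1M tokens in USD)
    "latest-gpu-pool": (150, 4.00),
    "legacy-gpu-pool": (600, 1.50),
}

def choose_pool(ttft_slo_ms: int) -> str:
    """Return the cheapest pool that meets the latency objective."""
    feasible = {name: spec for name, spec in POOLS.items() if spec[0] <= ttft_slo_ms}
    if not feasible:
        return "latest-gpu-pool"  # fall back to the fastest hardware available
    return min(feasible, key=lambda name: feasible[name][1])

print(choose_pool(200))   # interactive chat  -> latest-gpu-pool
print(choose_pool(2000))  # offline batch job -> legacy-gpu-pool
```

Applied across a real traffic mix, this kind of decision lets latency-sensitive requests land on new silicon while bulk workloads run on cheaper hardware, without the client needing to know which pool served them.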

Fully Integrated for AMD GPUs and Tenstorrent Chips

Our goal is to optimize the entire inference software stack end-to-end, from GPU kernels to distributed inference. By integrating Moreh vLLM with MoAI Inference Framework, we achieve top-tier data center-scale inference performance on non-NVIDIA GPUs.
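As a usage sketch: upstream vLLM exposes an OpenAI-compatible HTTP API, and assuming an integrated deployment surfaces the same interface behind the framework's unified endpoint, a client call might look like the following. The endpoint URL, API key, and model name are placeholders, not real Moreh endpoints.

```python
# Sketch of a client call against a unified, OpenAI-compatible inference endpoint.
# Assumes the deployment exposes the OpenAI-compatible API that upstream vLLM provides;
# the URL, key, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://moai-gateway.example.com/v1",  # single endpoint for the whole cluster
    api_key="EMPTY",                                # vLLM-style servers often accept a dummy key
)

response = client.chat.completions.create(
    model="Llama-3.1-70B-Instruct",                 # placeholder model name
    messages=[{"role": "user", "content": "Summarize this incident report: ..."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Which GPU pool, vendor, or prefill/decode stage actually served the request is a scheduling decision inside the cluster; the client-facing interface stays the same.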

3 Ways to Get Started

Install on Existing AMD GPU Cluster

Moreh provides turn-key services to install and configure MoAI Inference Framework on existing AMD GPU systems.

Build a New Inference System

Moreh provides on-premises or cloud-based distributed inference environments, optimized for customers’ models and applications.

Try on Moreh Cloud