Solution

One Inference Cluster, Every GPU

AI datacenters accumulate GPUs across procurement cycles: different vendors, architectures, and generations. Traditional inference software can't serve them together, leaving older GPUs idle and locking you into a single vendor. Moreh's software unifies every chip into a single inference system.

Three Scenarios, One Platform

Scenario 1

Old + New Generation

e.g., H100 + B200

Put older GPUs back to work — offload tasks from newer GPUs and boost speculative decoding efficiency, so every generation contributes to cluster throughput.

Scenario 2

NVIDIA + AMD

e.g., H200 + MI355X

Route inference across NVIDIA and AMD GPUs from a single API endpoint, and split prefill and decode across vendors for even higher throughput.

Scenario 3

GPU + AI Accelerator

e.g., GPU + Tenstorrent

Mix GPUs with specialized AI accelerators like Tenstorrent chips, using each for the workloads where it excels.

Enabling Technologies

All of these capabilities are built into MoAI Inference Framework — a single platform that orchestrates heterogeneous GPUs at cluster scale.

Model-Aware GPU Placement

Large models on newer GPUs, smaller models on older GPUs

Automatically assign models to the most suitable GPU pool based on model size and hardware capability — run flagship models on latest-gen GPUs while older GPUs handle lighter models.
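
To make the policy concrete, here is a minimal sketch of size-based placement. The pool names, capacities, and the 2-bytes-per-parameter footprint estimate are illustrative assumptions, not MoAI's actual scheduler:

```python
from dataclasses import dataclass

# Hypothetical GPU pools; names and capacities are illustrative only.
@dataclass
class GpuPool:
    name: str
    mem_per_gpu_gb: int  # device memory per GPU
    num_gpus: int

POOLS = [
    GpuPool("b200", mem_per_gpu_gb=192, num_gpus=8),
    GpuPool("h100", mem_per_gpu_gb=80, num_gpus=8),
    GpuPool("a100", mem_per_gpu_gb=40, num_gpus=8),
]

def weight_footprint_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory for a model with `params_b` billion parameters."""
    return params_b * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

def place(params_b: float, overhead: float = 1.3) -> GpuPool:
    """Pick the least-capable pool whose total memory fits the model plus
    KV-cache headroom, keeping the newest GPUs free for the biggest models."""
    need = weight_footprint_gb(params_b) * overhead
    for pool in sorted(POOLS, key=lambda p: p.mem_per_gpu_gb):
        if pool.mem_per_gpu_gb * pool.num_gpus >= need:
            return pool
    raise RuntimeError("no pool large enough")

print(place(8).name)    # -> "a100": light model lands on the older pool
print(place(405).name)  # -> "b200": flagship model lands on the newest pool
```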

Cross-Vendor Prefill-Decode Disaggregation

Vendor A for prefill, Vendor B for decode

Use NVIDIA GPUs for prefill and AMD GPUs for decode, achieving 1.7× higher throughput than same-vendor configurations. Enabled by our cross-vendor RDMA communication library for direct GPU-to-GPU data transfer over RoCE.
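
Conceptually, each request runs its prefill on one vendor's pool, ships its KV cache across vendors, and decodes on the other pool. The toy sketch below shows only the control flow; every name is hypothetical, and the real KV transfer happens GPU-to-GPU over RoCE via the RDMA library rather than in Python:

```python
# Toy model of cross-vendor prefill/decode disaggregation.
# Pool names, the KVCache stand-in, and transfer() are illustrative only.

from dataclasses import dataclass

@dataclass
class KVCache:
    tokens: int    # number of cached positions
    location: str  # which pool currently holds the cache

def prefill(prompt_tokens: int, pool: str = "nvidia-h200") -> KVCache:
    """Compute-heavy phase: build the KV cache for the full prompt."""
    return KVCache(tokens=prompt_tokens, location=pool)

def transfer(cache: KVCache, dst_pool: str) -> KVCache:
    """Stand-in for the cross-vendor RDMA copy of KV blocks over RoCE."""
    cache.location = dst_pool
    return cache

def decode(cache: KVCache, max_new_tokens: int) -> list[int]:
    """Bandwidth-heavy phase: generate tokens against the transferred cache."""
    assert cache.location == "amd-mi355x", "decode runs on the AMD pool"
    return list(range(max_new_tokens))  # placeholder token ids

cache = prefill(prompt_tokens=4096)        # NVIDIA pool does prefill
cache = transfer(cache, "amd-mi355x")      # KV cache crosses vendors
out = decode(cache, max_new_tokens=128)    # AMD pool does decode
print(len(out), cache.location)
```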

Workload-Aware Prefill-Decode Disaggregation

Compute-rich GPU for prefill, high-bandwidth GPU for decode

Match each inference phase to the GPU that fits its profile — compute-intensive prefill on one chip, bandwidth-hungry decode on another. Works across chip variants within the same vendor, such as H100 + H20 or MI300X + MI308X.

Read more
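
The matching rule can be pictured as a two-line optimization: prefill is compute-bound and decode is bandwidth-bound, so each phase goes to the chip that maximizes its bottleneck resource. A sketch with rough public spec figures (treat them as placeholders):

```python
# Illustrative per-chip specs (rough public figures; treat as placeholders).
CHIPS = {
    "H100": {"tflops": 989, "mem_bw_gbps": 3350},  # compute-rich
    "H20":  {"tflops": 148, "mem_bw_gbps": 4000},  # bandwidth-rich
}

def assign_phases(chips: dict) -> dict:
    """Send prefill to the most compute-rich chip and decode to the
    highest-bandwidth chip -- the core of workload-aware disaggregation."""
    prefill = max(chips, key=lambda c: chips[c]["tflops"])
    decode = max(chips, key=lambda c: chips[c]["mem_bw_gbps"])
    return {"prefill": prefill, "decode": decode}

print(assign_phases(CHIPS))  # {'prefill': 'H100', 'decode': 'H20'}
```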

Request Length-Based Routing

Short sequences to older GPUs, long sequences to newer GPUs

Route incoming requests by sequence length to the GPU pool best equipped to handle them — keeping older GPUs productive on shorter workloads while newer GPUs tackle long-context requests.
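
A minimal sketch of the routing rule, assuming a single token-count threshold and hypothetical pool names:

```python
# Minimal sketch of length-based routing (threshold and pool names are
# illustrative, not MoAI configuration).

LONG_CONTEXT_THRESHOLD = 8192  # tokens; tune per hardware mix

def route(prompt_tokens: int) -> str:
    """Short requests go to the older pool, long-context ones to the newer pool."""
    return "h100-pool" if prompt_tokens <= LONG_CONTEXT_THRESHOLD else "b200-pool"

for n in (512, 4096, 32768):
    print(n, "->", route(n))
```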

Multi-Node Prefill Engine (SLOPE)

Older GPUs for prefill, newer GPUs for decode

Distribute long-context prefill across multiple older-generation GPU nodes, freeing newer GPUs to focus on decode.

Read more
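
As a rough illustration of the idea, the sketch below partitions a long prompt into contiguous chunks, one per older-generation node. Node names and the chunking scheme are assumptions, not SLOPE's actual algorithm:

```python
# Toy sketch of spreading a long prompt's prefill across several
# older-generation nodes (names and chunking are illustrative).

def split_prefill(prompt_tokens: int, nodes: list[str]) -> list[tuple[str, range]]:
    """Partition the prompt into contiguous chunks, one per prefill node.
    Each node computes the KV cache for its slice; chunk i still attends
    to KV produced for chunks 0..i-1, so chunks are pipelined in order."""
    chunk = -(-prompt_tokens // len(nodes))  # ceiling division
    plan = []
    for i, node in enumerate(nodes):
        start = i * chunk
        end = min(start + chunk, prompt_tokens)
        if start < end:
            plan.append((node, range(start, end)))
    return plan

plan = split_prefill(100_000, ["a100-node-0", "a100-node-1", "a100-node-2"])
for node, span in plan:
    print(node, span.start, span.stop)
# The newer-GPU decode pool then consumes the assembled KV cache.
```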

Online Draft Model Training

Older GPUs train draft models, newer GPUs decode faster

Continuously improve draft models on older GPUs to boost speculative decoding efficiency on newer GPUs — making every generation useful.

Read more
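
One way to picture the feedback loop: serve with speculative decoding, watch the draft model's acceptance rate, and schedule a retraining job on the older pool when it drifts. Everything below is a hypothetical stand-in, not the production training pipeline:

```python
# Conceptual loop for online draft-model training (all names hypothetical).
# The older-GPU pool periodically refreshes the draft model on recent
# traffic, lifting speculative-decoding acceptance on the newer GPUs.

import random

acceptance_history: list[float] = []

def serve_batch() -> float:
    """Stand-in for one speculative-decoding step; returns the fraction
    of draft tokens the target model accepted."""
    return random.uniform(0.5, 0.9)

def retrain_draft(samples: list[float]) -> None:
    """Stand-in for a fine-tuning job scheduled on the older GPU pool."""
    mean = sum(samples) / len(samples)
    print(f"retraining draft on older GPUs; mean acceptance {mean:.2f}")

for step in range(1, 101):
    acceptance_history.append(serve_batch())
    # Retrain whenever recent acceptance drifts below a target threshold.
    recent = acceptance_history[-20:]
    if len(recent) == 20 and sum(recent) / 20 < 0.7:
        retrain_draft(recent)
        acceptance_history.clear()
```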

Kubernetes NFD Auto-Discovery

Detect and classify every accelerator automatically

Automatic GPU detection and classification via Kubernetes Node Feature Discovery, with unified routing across all discovered accelerators.
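
For a sense of what NFD-based discovery looks like, the sketch below groups nodes into vendor pools from the PCI labels NFD publishes (default key form `feature.node.kubernetes.io/pci-<class>_<vendor>.present`). It assumes the `kubernetes` Python client is installed; the Tenstorrent vendor ID and the pool grouping are illustrative, not MoAI's API:

```python
# Sketch: classify cluster nodes by the PCI vendor labels that Kubernetes
# Node Feature Discovery publishes.

import re
from kubernetes import client, config

# Match NFD PCI labels and capture the 4-hex-digit PCI vendor ID.
PCI_LABEL = re.compile(
    r"feature\.node\.kubernetes\.io/pci-(?:[0-9a-f]{4}_)?([0-9a-f]{4})\.present"
)

VENDOR_NAMES = {
    "10de": "nvidia",
    "1002": "amd",
    "1e52": "tenstorrent",  # assumed vendor ID; verify against your hardware
}

def discover_pools() -> dict[str, list[str]]:
    """Return {vendor: [node names]} for every accelerator NFD has labeled."""
    config.load_kube_config()  # use load_incluster_config() when in-cluster
    pools: dict[str, list[str]] = {}
    for node in client.CoreV1Api().list_node().items:
        for key, value in (node.metadata.labels or {}).items():
            m = PCI_LABEL.fullmatch(key)
            if m and value == "true":
                vendor = VENDOR_NAMES.get(m.group(1), m.group(1))
                pools.setdefault(vendor, []).append(node.metadata.name)
    return pools

if __name__ == "__main__":
    for vendor, nodes in discover_pools().items():
        print(vendor, nodes)
```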

Ready to Unify Your GPU Fleet?

Talk to our team about deploying MoAI Inference Framework across your heterogeneous infrastructure.