Solution
One Inference Cluster, Every GPU
AI datacenters accumulate GPUs across procurement cycles — different vendors, architectures, and generations. Traditional software can't serve them together, leaving older GPUs idle and locking you into a single vendor. Moreh's software unifies every chip into a single inference system.
Three Scenarios, One Platform
Old + New Generation
e.g., H100 + B200
Put older GPUs back to work — offload tasks from newer GPUs and boost speculative decoding efficiency, so every generation contributes to cluster throughput.
NVIDIA + AMD
e.g., H200 + MI355X
Route inference across NVIDIA and AMD GPUs from a single API endpoint, and split prefill and decode across vendors for even higher throughput.
GPU + AI Accelerator
e.g., GPU + Tenstorrent
Mix GPUs with specialized AI accelerators like Tenstorrent chips, using each for the workloads where it excels.
Enabling Technologies
All of these capabilities are built into MoAI Inference Framework — a single platform that orchestrates heterogeneous GPUs at cluster scale.
Model-Aware GPU Placement
Large models on newer GPUs, smaller models on older GPUs
Automatically assign models to the most suitable GPU pool based on model size and hardware capability — run flagship models on latest-gen GPUs while older GPUs handle lighter models.
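The placement idea can be sketched as a simple capacity check. This is an illustrative sketch only — the pool names, the fp16 sizing rule, and the 20% headroom factor are assumptions, not MoAI's actual placement logic or API.

```python
# Sketch: pick the smallest GPU pool whose per-GPU memory fits the model.
# Pool names and sizing heuristics are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class GpuPool:
    name: str
    memory_gb: int  # per-GPU memory capacity

def place_model(param_count_b: float, pools: list[GpuPool]) -> GpuPool:
    """Assign a model (size in billions of params) to the smallest pool
    that fits its fp16 weights (~2 bytes/param) plus 20% headroom."""
    needed_gb = param_count_b * 2 * 1.2
    for pool in sorted(pools, key=lambda p: p.memory_gb):
        if pool.memory_gb >= needed_gb:
            return pool
    return max(pools, key=lambda p: p.memory_gb)  # fall back to largest pool

pools = [GpuPool("older-pool", 80), GpuPool("newer-pool", 192)]
print(place_model(7, pools).name)   # 7B model fits the older pool
print(place_model(70, pools).name)  # 70B model needs the newer pool
```

A real scheduler would also weigh KV-cache budget, interconnect topology, and current load, but the core decision is this size-to-capability match.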
Cross-Vendor Prefill-Decode Disaggregation
Vendor A for prefill, Vendor B for decode
Use NVIDIA GPUs for prefill and AMD GPUs for decode, achieving 1.7× higher throughput than same-vendor configurations. Enabled by our cross-vendor RDMA communication library for direct GPU-to-GPU data transfer over RoCE.
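The control flow of disaggregated serving can be illustrated with a minimal sketch. Everything here is a stand-in: the real system runs each phase on a different vendor's GPUs and moves KV blocks GPU-to-GPU over RoCE, whereas this mock hands a dict between two plain functions.

```python
# Hedged sketch of prefill/decode disaggregation: phase 1 builds the KV
# cache on one pool, phase 2 consumes it on another. KV transfer is
# mocked as a dict hand-off; the real path is cross-vendor RDMA.

def prefill(prompt_tokens: list[int]) -> dict:
    """Runs on the prefill pool (e.g., NVIDIA); returns KV cache + first token."""
    kv_cache = {"num_positions": len(prompt_tokens)}  # stand-in for KV blocks
    first_token = prompt_tokens[-1] + 1               # stand-in for model output
    return {"kv": kv_cache, "next": first_token}

def decode(state: dict, max_new: int) -> list[int]:
    """Runs on the decode pool (e.g., AMD); extends the transferred KV cache."""
    out, tok = [], state["next"]
    for _ in range(max_new):
        out.append(tok)
        tok += 1  # stand-in for sampling the next token
    return out

state = prefill([1, 2, 3])  # compute-intensive phase on pool A
tokens = decode(state, 4)   # bandwidth-intensive phase on pool B
print(tokens)  # [4, 5, 6, 7]
```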
Workload-Aware Prefill-Decode Disaggregation
Compute-rich GPU for prefill, high-bandwidth GPU for decode
Match each inference phase to the GPU that fits its profile — compute-intensive prefill on one chip, bandwidth-hungry decode on another. Works across chip variants within the same vendor, such as H100 + H20 or MI300X + MI308X.
Request Length-Based Routing
Short sequences to older GPUs, long sequences to newer GPUs
Route incoming requests by sequence length to the GPU pool best equipped to handle them — keeping older GPUs productive on shorter workloads while newer GPUs tackle long-context requests.
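At its core this is a threshold policy on prompt length. A minimal sketch, assuming a single token-count cutoff and illustrative pool names (the actual routing policy and its configuration are not specified here):

```python
# Sketch: route requests by prompt length. The 4096-token threshold
# and pool names are illustrative assumptions.
def route_by_length(num_prompt_tokens: int, threshold: int = 4096) -> str:
    """Send short prompts to the older pool, long-context ones to the newer pool."""
    if num_prompt_tokens <= threshold:
        return "older-gpu-pool"
    return "newer-gpu-pool"

print(route_by_length(512))     # older-gpu-pool
print(route_by_length(32000))   # newer-gpu-pool
```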
Multi-Node Prefill Engine (SLOPE)
Older GPUs for prefill, newer GPUs for decode
Distribute long-context prefill across multiple older-generation GPU nodes, freeing newer GPUs to focus on decode.
Online Draft Model Training
Older GPUs train draft models, newer GPUs decode faster
Continuously improve draft models on older GPUs to boost speculative decoding efficiency on newer GPUs — making every generation useful.
Kubernetes NFD Auto-Discovery
Detect and classify every accelerator automatically
Automatic GPU detection and classification via Kubernetes Node Feature Discovery, with unified routing across all discovered accelerators.
Ready to Unify Your GPU Fleet?
Talk to our team about deploying MoAI Inference Framework across your heterogeneous infrastructure.