Optimal LLM Inference on Every Accelerator

From custom kernels to distributed serving, we build the full-stack software that unlocks peak inference performance on AMD GPUs, Tenstorrent chips, and heterogeneous clusters.

1.68× vs ROCm vLLM (DeepSeek R1 on a single server)

20,000+ tok/s per node (DeepSeek R1 on an MI300X cluster)

1.7× with cross-vendor GPUs (NVIDIA + AMD prefill/decode (PD) disaggregation)

2.2× throughput on 40% fewer servers (prefix cache-aware routing; see the sketch below)
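To make the last figure concrete, here is a minimal sketch of the idea behind prefix cache-aware routing. All names are hypothetical and this is not Moreh's actual API: each request goes to the worker whose KV cache already holds the longest prefix of the prompt, so shared prefixes (system prompts, few-shot examples, chat history) are prefilled once rather than on every server.

```python
# Illustrative prefix cache-aware router (hypothetical names, not
# Moreh's actual API). Route each request to the worker whose KV cache
# already covers the longest prefix of the prompt.

def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixAwareRouter:
    def __init__(self, workers: list[str]):
        # Per-worker list of token sequences whose KV blocks are cached.
        self.cached: dict[str, list[list[int]]] = {w: [] for w in workers}

    def route(self, prompt_tokens: list[int]) -> str:
        # Score each worker by its best cached prefix match; break ties
        # by cache size as a cheap proxy for load.
        def score(worker: str) -> tuple[int, int]:
            best = max(
                (shared_prefix_len(prompt_tokens, seq)
                 for seq in self.cached[worker]),
                default=0,
            )
            return (best, -len(self.cached[worker]))

        target = max(self.cached, key=score)
        self.cached[target].append(prompt_tokens)
        return target

router = PrefixAwareRouter(["worker-0", "worker-1"])
system = list(range(100))                # stand-in for a tokenized system prompt
router.route(system + [7, 8, 9])         # lands on worker-0 (empty caches tie)
print(router.route(system + [1, 2, 3]))  # reuses worker-0's cached prefix
```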

Full-Stack Inference Software

From Kernels to Clusters

Moreh covers the entire inference stack across heterogeneous accelerators, from chip-level kernels to distributed serving.

MoAI Inference Framework
Routing & Scheduling · Auto Scaling · SLO-Driven Optimization (sketched below) · KV Cache

Engines: Moreh vLLM (SOTA Model Optimization · Quantization · Graph Execution), running alongside Native vLLM

Moreh Libraries
Custom Kernels · GEMM/Attention/MoE · Communication

Hardware: AMD Instinct GPUs · Tenstorrent Chips · NVIDIA GPUs
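As an illustration of what SLO-driven optimization at the framework layer can mean, here is a minimal scaling rule, with hypothetical thresholds rather than the MoAI framework's actual policy: grow the replica set while the observed p99 time-to-first-token violates the SLO, and shrink it when there is clear headroom.

```python
# Illustrative SLO-driven scaling rule (hypothetical thresholds, not
# the MoAI framework's actual policy).

def target_replicas(current: int, p99_ttft_ms: float,
                    slo_ms: float = 500.0,
                    max_replicas: int = 32) -> int:
    if p99_ttft_ms > slo_ms:        # SLO violated: scale out
        return min(current + 1, max_replicas)
    if p99_ttft_ms < 0.5 * slo_ms:  # ample headroom: scale in
        return max(current - 1, 1)
    return current
```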

Why Moreh

Three ways our inference software creates value for your AI infrastructure.

Inference on Non-NVIDIA Accelerators

Full-stack software from kernels to a cluster-level framework, optimized for AMD GPUs and enabling inference on Tenstorrent chips.

Heterogeneous GPU Inference

Unify GPUs across vendors, architectures, and generations into a single inference cluster, maximizing the efficiency of every chip in your data center (see the sketch below).
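One simple building block for such a cluster, sketched here with made-up throughput numbers (a real scheduler would profile devices online), is splitting a decode batch across mixed GPUs in proportion to each device's measured token rate:

```python
# Split one decode batch across heterogeneous GPUs in proportion to
# measured per-device throughput. The rates below are illustrative.

def split_batch(batch_size: int, tok_per_s: dict[str, float]) -> dict[str, int]:
    total = sum(tok_per_s.values())
    shares = {gpu: int(batch_size * rate / total)
              for gpu, rate in tok_per_s.items()}
    # Hand the rounding remainder to the fastest device.
    fastest = max(tok_per_s, key=tok_per_s.get)
    shares[fastest] += batch_size - sum(shares.values())
    return shares

# e.g. an MI300X next to an older-generation GPU:
print(split_batch(256, {"mi300x": 20_000.0, "a100": 8_000.0}))
# -> {'mi300x': 183, 'a100': 73}
```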

Inference Cost Optimization

Maximize tokens per dollar through chip-level and communication optimization, and by putting multi-vendor infrastructure to full use (a quick example below shows the arithmetic).
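The arithmetic behind tokens per dollar is simple. A quick check using the per-node throughput quoted above and a placeholder node price (substitute your own cost):

```python
# Tokens-per-dollar back-of-the-envelope. The throughput comes from the
# figure quoted above; the hourly node cost is a placeholder.

tok_per_s = 20_000         # DeepSeek R1 throughput per MI300X node (from above)
node_usd_per_hour = 20.0   # hypothetical all-in cost; plug in your own

tokens_per_dollar = tok_per_s * 3600 / node_usd_per_hour
print(f"{tokens_per_dollar:,.0f} tokens per dollar")  # -> 3,600,000
```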

Ecosystem & Open Source

We contribute to the open-source ecosystem and partner with leading chip vendors.

AMD ROCm
llm-d
Tenstorrent Metalium
SGLang
SkyPilot