Optimal LLM Inference on Every Accelerator
From custom kernels to distributed serving, we build the full-stack software that unlocks peak inference performance on AMD GPUs, Tenstorrent chips, and heterogeneous clusters.
1.68× vs ROCm vLLM (DeepSeek R1 on a single server)
20,000+ tok/s per node (DeepSeek R1 on an MI300X cluster)
1.7× with cross-vendor GPUs (prefill/decode (PD) disaggregation across NVIDIA + AMD)
2.2× throughput on 40% fewer servers (prefix cache-aware routing)
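The last figure comes from prefix cache-aware routing: requests sharing a prompt prefix are steered to the worker that already holds that prefix's KV cache. A minimal sketch of the idea, with hypothetical worker names and a made-up block size (this is not Moreh's implementation):

```python
# Minimal sketch of prefix cache-aware routing. Worker names and the
# block size are hypothetical, for illustration only.
import hashlib

BLOCK = 16  # tokens per KV-cache block (assumed)

def block_hashes(token_ids):
    """Rolling hash per full prefix block; these act as cache keys."""
    hashes, h = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, usable, BLOCK):
        h.update(repr(token_ids[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

def route(token_ids, worker_caches):
    """Send the request to the worker caching the longest prefix."""
    prefix = block_hashes(token_ids)
    def hits(cache):
        n = 0
        for key in prefix:
            if key not in cache:
                break
            n += 1
        return n
    return max(worker_caches, key=lambda w: hits(worker_caches[w]))

# Hypothetical cluster state: worker name -> cached block hashes.
caches = {"worker-a": set(), "worker-b": set()}
shared_prompt = list(range(48))                     # 3 full blocks
caches["worker-a"].update(block_hashes(shared_prompt))
print(route(shared_prompt + [7, 8], caches))        # -> worker-a
```

A production router also weighs load and cache eviction; the hash-per-block scheme broadly mirrors how vLLM's automatic prefix caching keys its blocks.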
Full-Stack Inference Software
From Kernels to Clusters
Moreh covers the entire inference stack across heterogeneous accelerators — from chip-level kernels to distributed serving.
MoAI Inference Framework: Routing & Scheduling · Auto Scaling · SLO-Driven Optimization · KV Cache
Moreh vLLM (SOTA Model Optimization · Quantization · Graph Execution) or Native vLLM
Moreh Libraries: Custom Kernels · GEMM/Attention/MoE · Communication
AMD Instinct GPUs · Tenstorrent Chips · NVIDIA GPUs
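Because the engine layer is vLLM-based, serving typically goes through vLLM's OpenAI-compatible HTTP API (a standard vLLM feature, not a Moreh-specific one). A minimal sketch, assuming a server is already running locally; host, port, and model name are placeholders:

```python
# Minimal sketch: querying a vLLM OpenAI-compatible server, e.g. one
# started with `vllm serve deepseek-ai/DeepSeek-R1`. Host, port, and
# model name below are placeholders, not Moreh-specific defaults.
import json
import urllib.request

payload = {
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Hello from a heterogeneous cluster!"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```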
Why Moreh
Three ways our inference software creates value for your AI infrastructure.
Inference on Non-NVIDIA Accelerators
Full-stack software, from custom kernels to a cluster-level serving framework, optimized for AMD GPUs and enabling inference on Tenstorrent chips.
Heterogeneous GPU Inference
Unify GPUs across vendors, architectures, and generations into a single inference cluster — maximizing the efficiency of every chip in your data center.
Inference Cost Optimization
Maximize tokens per dollar through chip-level optimization, communication optimization, and multi-vendor infrastructure utilization.
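To make tokens per dollar concrete, here is a back-of-the-envelope calculation; the hourly node price is an assumption for illustration, while the throughput reuses the 20,000 tok/s per-node figure above:

```python
# Back-of-the-envelope tokens-per-dollar math. The $/hour figure is
# hypothetical; the throughput is the per-node number cited above.
tok_per_s = 20_000           # DeepSeek R1 tokens/s per node (from above)
node_usd_per_hr = 50.0       # assumed all-in node cost, $/hour

usd_per_million_tok = node_usd_per_hr / (tok_per_s * 3600 / 1e6)
print(f"${usd_per_million_tok:.2f} per million tokens")  # -> $0.69
```

Every multiplier on this page compounds through this formula: higher tok/s, cheaper accelerators, or fewer servers all lower the cost per million tokens.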
From Our Blog
Moreh Unlocks AMD MI300X Potential: 1.5× Faster DeepSeek R1 Inference vs. SGLang (InferenceMAX)
March 16, 2026
Moreh’s optimized inference engine achieves a 1.47× improvement in end-to-end latency and throughput per GPU for DeepSeek R1 on AMD MI300X, compared to the InferenceMAX baseline.

TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference
February 5, 2026
TIDE continuously improves inference speed by training a lightweight draft model in the background, using idle GPUs in the cluster — no extra data preparation or downtime required.
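The draft model TIDE trains feeds speculative decoding. As a general illustration of that technique with toy stand-in models (this is not TIDE's training loop or Moreh's API), the expensive target model verifies a batch of cheap draft guesses each step and keeps the longest agreeing prefix:

```python
# Toy sketch of draft-model speculative decoding (greedy variant).
# draft_next/target_next are stand-in functions, not TIDE components.
# In a real engine the k verifications below happen in ONE target
# forward pass; that batching is where the speedup comes from.

def draft_next(ctx):
    """Cheap draft model: fast guess for the next token."""
    return (ctx[-1] * 2) % 97

def target_next(ctx):
    """Expensive target model: the output we must reproduce exactly."""
    return (ctx[-1] * 2) % 97 if ctx[-1] % 5 else (ctx[-1] + 1) % 97

def speculative_step(ctx, k=4):
    """Draft proposes k tokens; target verifies the whole batch."""
    proposal = list(ctx)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    accepted = list(ctx)
    for i in range(k):
        t = target_next(accepted)     # target's token is always kept,
        accepted.append(t)            # so output matches target exactly
        if t != proposal[len(ctx) + i]:
            break                     # draft diverged: stop accepting
    else:
        accepted.append(target_next(accepted))  # all k matched: bonus token
    return accepted

ctx = [3]
for _ in range(4):
    ctx = speculative_step(ctx)
print(ctx)  # up to k+1 tokens per target pass when the draft agrees
```

A better draft model raises the acceptance rate, which is exactly what TIDE's continuous background training targets.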

Step3 Inference Optimization on AMD Instinct MI308X: 1.30× Higher Decode Throughput vs. NVIDIA H20
December 29, 2025
Moreh optimized StepFun’s Step3 321B MoE model for AMD Instinct MI308X GPUs, achieving 1.30× higher decode throughput and 23% lower decode latency compared to NVIDIA H20.
Ecosystem & Open Source
We contribute to the open-source ecosystem and partner with leading chip vendors.