Optimal LLM Inference on Every Accelerator
From custom kernels to distributed serving, we build the full-stack software that unlocks peak inference performance on AMD GPUs, Tenstorrent chips, and heterogeneous clusters.
1.68× vs ROCm vLLM (DeepSeek R1 on a single server)
20,000+ tok/s per node (DeepSeek R1 on an MI300X cluster)
1.7× with cross-vendor GPUs (prefill/decode (PD) disaggregation across NVIDIA + AMD)
2.2× throughput on 40% fewer servers (prefix cache-aware routing)
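The last figure comes from prefix cache-aware routing: requests sharing a prompt prefix are steered to the worker that already holds that prefix's KV cache. A minimal sketch of the idea, with hypothetical worker names and a made-up block size (this is not Moreh's implementation):

```python
# Minimal sketch of prefix cache-aware routing. Worker names and the
# block size are hypothetical, for illustration only.
import hashlib

BLOCK = 16  # tokens per KV-cache block (assumed)

def block_hashes(token_ids):
    """Rolling hash per full prefix block; these act as cache keys."""
    hashes, h = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, usable, BLOCK):
        h.update(repr(token_ids[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

def route(token_ids, worker_caches):
    """Send the request to the worker caching the longest prefix."""
    prefix = block_hashes(token_ids)
    def hits(cache):
        n = 0
        for key in prefix:
            if key not in cache:
                break
            n += 1
        return n
    return max(worker_caches, key=lambda w: hits(worker_caches[w]))

# Hypothetical cluster state: worker name -> cached block hashes.
caches = {"worker-a": set(), "worker-b": set()}
shared_prompt = list(range(48))                     # 3 full blocks
caches["worker-a"].update(block_hashes(shared_prompt))
print(route(shared_prompt + [7, 8], caches))        # -> worker-a
```

A production router also weighs load and cache eviction; the hash-per-block scheme broadly mirrors how vLLM's automatic prefix caching keys its blocks.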
Full-Stack Inference Software
From Kernels to Clusters
Moreh covers the entire inference stack across heterogeneous accelerators — from chip-level kernels to distributed serving.
MoAI Inference Framework: Routing & Scheduling · Auto Scaling · SLO-Driven Optimization · KV Cache
Moreh vLLM (SOTA Model Optimization · Quantization · Graph Execution) or Native vLLM
Moreh Libraries: Custom Kernels · GEMM/Attention/MoE · Communication
AMD Instinct GPUs · Tenstorrent Chips · NVIDIA GPUs
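Because the engine layer is vLLM-based, serving typically goes through vLLM's OpenAI-compatible HTTP API (a standard vLLM feature, not a Moreh-specific one). A minimal sketch, assuming a server is already running locally; host, port, and model name are placeholders:

```python
# Minimal sketch: querying a vLLM OpenAI-compatible server, e.g. one
# started with `vllm serve deepseek-ai/DeepSeek-R1`. Host, port, and
# model name below are placeholders, not Moreh-specific defaults.
import json
import urllib.request

payload = {
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Hello from a heterogeneous cluster!"}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```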
Why Moreh
Three ways our inference software creates value for your AI infrastructure.
Inference on Non-NVIDIA Accelerators
Full-stack software, from custom kernels to a cluster-level serving framework, optimized for AMD GPUs and enabling inference on Tenstorrent chips.
Heterogeneous GPU Inference
Unify GPUs across vendors, architectures, and generations into a single inference cluster — maximizing the efficiency of every chip in your data center.
Inference Cost Optimization
Maximize tokens per dollar through chip-level optimization, communication optimization, and multi-vendor infrastructure utilization.
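To make tokens per dollar concrete, here is a back-of-the-envelope calculation; the hourly node price is an assumption for illustration, while the throughput reuses the 20,000 tok/s per-node figure above:

```python
# Back-of-the-envelope tokens-per-dollar math. The $/hour figure is
# hypothetical; the throughput is the per-node number cited above.
tok_per_s = 20_000           # DeepSeek R1 tokens/s per node (from above)
node_usd_per_hr = 50.0       # assumed all-in node cost, $/hour

usd_per_million_tok = node_usd_per_hr / (tok_per_s * 3600 / 1e6)
print(f"${usd_per_million_tok:.2f} per million tokens")  # -> $0.69
```

Every multiplier on this page compounds through this formula: higher tok/s, cheaper accelerators, or fewer servers all lower the cost per million tokens.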
From Our Blog
Moreh Unlocks AMD MI300X Potential: 1.5× Faster DeepSeek R1 Inference vs. SGLang (InferenceMAX)
March 16, 2026
Moreh’s optimized inference engine achieves a 1.47× improvement in end-to-end latency and throughput per GPU for DeepSeek R1 on AMD MI300X, compared to the InferenceMAX baseline.

TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference
February 5, 2026
TIDE continuously improves inference speed by training a lightweight draft model in the background, using idle GPUs in the cluster — no extra data preparation or downtime required.
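The draft model TIDE trains feeds speculative decoding. As a general illustration of that technique with toy stand-in models (this is not TIDE's training loop or Moreh's API), the expensive target model verifies a batch of cheap draft guesses each step and keeps the longest agreeing prefix:

```python
# Toy sketch of draft-model speculative decoding (greedy variant).
# draft_next/target_next are stand-in functions, not TIDE components.
# In a real engine the k verifications below happen in ONE target
# forward pass; that batching is where the speedup comes from.

def draft_next(ctx):
    """Cheap draft model: fast guess for the next token."""
    return (ctx[-1] * 2) % 97

def target_next(ctx):
    """Expensive target model: the output we must reproduce exactly."""
    return (ctx[-1] * 2) % 97 if ctx[-1] % 5 else (ctx[-1] + 1) % 97

def speculative_step(ctx, k=4):
    """Draft proposes k tokens; target verifies the whole batch."""
    proposal = list(ctx)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    accepted = list(ctx)
    for i in range(k):
        t = target_next(accepted)     # target's token is always kept,
        accepted.append(t)            # so output matches target exactly
        if t != proposal[len(ctx) + i]:
            break                     # draft diverged: stop accepting
    else:
        accepted.append(target_next(accepted))  # all k matched: bonus token
    return accepted

ctx = [3]
for _ in range(4):
    ctx = speculative_step(ctx)
print(ctx)  # up to k+1 tokens per target pass when the draft agrees
```

A better draft model raises the acceptance rate, which is exactly what TIDE's continuous background training targets.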

Step3 Inference Optimization on AMD Instinct MI308X: 1.30× Higher Decode Throughput vs. NVIDIA H20
December 29, 2025
Moreh optimized StepFun’s Step3 321B MoE model for AMD Instinct MI308X GPUs, achieving 1.30× higher decode throughput and 23% lower decode latency compared to NVIDIA H20.
Ecosystem & Open Source
We contribute to the open-source ecosystem and partner with leading chip vendors.