Moreh vLLM

The fastest way to serve LLMs on AMD GPUs

Drop-in replacement for vLLM with up to 2× higher throughput on AMD Instinct GPUs. Same API, same model formats — just faster. Deploy in minutes with a single Docker image.

Benchmarks

Proven Performance Across Models

DeepSeek R1 671B · 8× AMD Instinct MI300X

Output tokens/s normalized to ROCm vLLM, across input lengths, output lengths, and concurrency levels.

Moreh vLLM 0.9.0
ROCm vLLM 0.9.2
SGLang 0.4.8
[Bar chart: normalized output TPS (ROCm vLLM = 1), y-axis 0–2.5, across (input length, output length, concurrency) configurations (1K, 1K, 1) through (32K, 1K, 32).]

Measured using vLLM’s benchmark_serving tool.
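For reference, a single (input length, output length, concurrency) point can be reproduced against a running server with an invocation along these lines. This is a sketch: the flag names follow vLLM's benchmark_serving.py, while the model path and concrete values are illustrative, and the exact flags available depend on your vLLM version.

```shell
# Benchmark one configuration, e.g. (1K input, 1K output, concurrency 8),
# against an already-running server. Requires a checkout of the vLLM repo.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 8 \
  --num-prompts 64
```

The tool reports output token throughput among other metrics, which is the quantity normalized in the chart above.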

Getting Started

Preset-Based Deployment

Moreh vLLM ships with optimized presets for popular models and hardware configurations. Pick a preset, point to your model, and serve — parallelism, memory, and kernel settings are handled automatically.

Example Deployments

$ docker run --device /dev/kfd --device /dev/dri \
  --network host -v /models:/models \
  moreh/moreh-vllm:latest \
  serve.sh /models/DeepSeek-R1 \
    presets/deepseek-ai-deepseek-r1-amd-mi300x-dp8-moe-ep8.yaml
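Once the server is up, you can verify it with a standard OpenAI-compatible request. This assumes Moreh vLLM keeps vLLM's OpenAI-compatible API and default port 8000 (the page's "same API" claim); the model name and prompt below are illustrative.

```shell
# Send a chat completion request to the OpenAI-compatible endpoint.
# Port 8000 is vLLM's default; adjust if your deployment overrides it.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```

Because the API is unchanged, existing OpenAI SDK clients only need their base URL pointed at the server.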

Under the Hood

Why It’s Faster

Moreh vLLM replaces the compute backend with engines purpose-built for AMD GPU architecture.

Custom Libraries for AMD GPUs

Compute libraries — including GEMM, attention, MoE, and fused operations — built specifically for AMD GPU architecture.

Model Optimization

Techniques such as operation fusion, graph-level execution, and quantization to run each model as efficiently as possible.

Multi-GPU Scaling

Communication/compute overlap, EP load balancing, and other optimizations to scale across GPUs within a server.

Supported Models

Optimized for popular open-source LLMs, including:

DeepSeek · GPT-OSS · Llama · Qwen · Mistral · GLM · Step · and more

Supported Hardware

AMD Instinct MI355X · AMD Instinct MI325X · AMD Instinct MI308X · AMD Instinct MI300X · AMD Instinct MI250

Running a proprietary model?

Moreh provides on-demand vLLM optimization for your private and fine-tuned models on AMD GPUs. We build a custom Moreh vLLM tailored to your model architecture, so you get the same performance gains without any extra work on your side.

We’ve done this for customers including StepFun (Step3 321B on MI308X, 1.30× higher decode throughput vs. NVIDIA H20) and a major Korean telco (7.8B affiliate model on MI300X, 1.38× higher serving capacity vs. NVIDIA H100).

Contact us ›