Moreh vLLM

The fastest way to serve LLMs on AMD GPUs

Drop-in replacement for vLLM with up to 2× higher throughput on AMD Instinct GPUs. Same API, same model formats — just faster. Deploy in minutes with a single Docker image.

Benchmarks

Proven Performance Across Models

DeepSeek R1 671B · 8× AMD Instinct MI300X

Output tokens/s normalized to ROCm vLLM, across input lengths, output lengths, and concurrency levels.

Moreh vLLM 0.9.0
ROCm vLLM 0.9.2
SGLang 0.4.8
[Bar chart: normalized output TPS (ROCm vLLM = 1), y-axis 0–2.5, across (input length, output length, concurrency) configurations (1K, 1K, 1) through (32K, 1K, 32).]

Measured using vLLM’s benchmark_serving tool.
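For reference, a single (input length, output length, concurrency) point can be reproduced against a running server with an invocation along these lines. This is a sketch: the flag names follow vLLM's benchmark_serving.py, while the model path and concrete values are illustrative, and the exact flags available depend on your vLLM version.

```shell
# Benchmark one configuration, e.g. (1K input, 1K output, concurrency 8),
# against an already-running server. Requires a checkout of the vLLM repo.
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model /models/DeepSeek-R1 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --max-concurrency 8 \
  --num-prompts 64
```

The tool reports output token throughput among other metrics, which is the quantity normalized in the chart above.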

Getting Started

Preset-Based Deployment

Moreh vLLM ships with optimized presets for popular models and hardware configurations. Pick a preset, point to your model, and serve — parallelism, memory, and kernel settings are handled automatically.

Example Deployments

$ docker run --device /dev/kfd --device /dev/dri \
  --network host -v /models:/models \
  moreh/moreh-vllm:latest \
  serve.sh /models/DeepSeek-R1 \
    presets/deepseek-ai-deepseek-r1-amd-mi300x-dp8-moe-ep8.yaml
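Once the server is up, you can verify it with a standard OpenAI-compatible request. This assumes Moreh vLLM keeps vLLM's OpenAI-compatible API and default port 8000 (the page's "same API" claim); the model name and prompt below are illustrative.

```shell
# Send a chat completion request to the OpenAI-compatible endpoint.
# Port 8000 is vLLM's default; adjust if your deployment overrides it.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```

Because the API is unchanged, existing OpenAI SDK clients only need their base URL pointed at the server.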

Under the Hood

Why It’s Faster

Moreh vLLM replaces the compute backend with engines purpose-built for AMD GPU architecture.

Custom Libraries for AMD GPUs

Compute libraries — including GEMM, attention, MoE, and fused operations — built specifically for AMD GPU architecture.

Model Optimization

Techniques such as operation fusion, graph-level execution, and quantization to run each model as efficiently as possible.

Multi-GPU Scaling

Communication/compute overlap, EP load balancing, and other optimizations to scale across GPUs within a server.

Supported Models

Optimized for popular open-source LLMs, including:

DeepSeek · GPT-OSS · Llama · Qwen · Mistral · GLM · Step · and more

Supported Hardware

AMD Instinct MI355X · AMD Instinct MI325X · AMD Instinct MI308X · AMD Instinct MI300X · AMD Instinct MI250

Running a proprietary model?

Moreh provides on-demand vLLM optimization for your private and fine-tuned models on AMD GPUs. We build a custom Moreh vLLM tailored to your model architecture, so you get the same performance gains without any extra work on your side.

We’ve done this for customers including StepFun (Step3 321B on MI308X, 1.30× higher decode throughput vs. NVIDIA H20) and a major Korean telco (7.8B affiliate model on MI300X, 1.38× higher serving capacity vs. NVIDIA H100).

Contact us ›