
Customer Case

Step3 Inference Optimization on AMD Instinct MI308X: 1.30× Higher Decode Throughput vs. NVIDIA H20

December 29, 2025

Background

StepFun's Step3 is a 321B-parameter Mixture-of-Experts (MoE) multimodal model with 38B activated parameters per token. It features 61 layers with 56 MoE layers using a 3-in-48 expert selection, and introduces Multi-Matrix Factorization Attention (MFA) that reduces KV-cache demands to approximately 22% of DeepSeek V3's per-token attention cost.
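To build intuition for why a single shared KV head keeps the cache small, the per-token KV footprint can be sketched from the shapes above (61 layers, and the 1-KV-head, 256-dimension MFA configuration described later in this post). This is a back-of-the-envelope estimate assuming a BF16 cache in every layer, not a measurement of the actual model:

```python
# Back-of-the-envelope KV-cache footprint for an MFA-style attention layout.
# Assumptions (illustrative only): all 61 layers cache K/V, 1 shared KV head,
# head dimension 256, BF16 entries (2 bytes each).

def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # One K vector and one V vector per layer per token.
    return layers * 2 * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(layers=61, kv_heads=1, head_dim=256)
print(f"KV cache per token: {per_token / 1024:.1f} KiB")  # 61.0 KiB

# At a 4096-token context with 256 concurrent requests (the benchmark
# setting used later in this post):
total = per_token * 4096 * 256
print(f"Total KV cache: {total / 2**30:.1f} GiB")  # 61.0 GiB
```

At roughly 61 KiB per token, the whole benchmark workload's KV cache fits comfortably in a single MI308X's HBM, which is part of what makes large decode batches feasible.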

StepFun was serving Step3 on NVIDIA H20 GPUs and wanted to evaluate AMD Instinct MI308X as an alternative. Moreh was asked to optimize inference of a private model with the same architecture as Step3 on MI308X, before Step3 was publicly released as open source. This is an example of Moreh's custom model optimization service, where we adapt Moreh vLLM for proprietary model architectures.

Why MI308X for Decode

AMD Instinct MI308X is a variant of MI300X available in the Chinese market. It has 1/4 the compute cores of MI300X but retains the same HBM3 memory capacity and bandwidth (192 GB at 5.3 TB/s). This makes MI308X particularly well-suited for the decode phase of LLM inference, which is memory-bandwidth-bound rather than compute-bound: tokens are generated one at a time in an autoregressive manner, so each step is dominated by loading model weights and KV-cache from memory rather than by matrix multiplications.
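The memory-bound nature of decode can be seen with simple roofline-style arithmetic: each decode step must stream at least the activated weights from HBM, which sets a floor on step time independent of compute. The numbers below are illustrative assumptions (38B activated parameters, FP8 weights at 1 byte each, 8 GPUs at MI300X-class bandwidth), and the estimate ignores KV-cache reads, activation traffic, and the fact that a batch shares one weight read per activated expert:

```python
# Rough lower bound on decode step time from weight memory traffic alone.
# Illustrative assumptions, not a performance prediction:
#   - 38e9 activated parameters per token (from the Step3 architecture)
#   - FP8 weights, 1 byte per parameter
#   - 8 GPUs, each with ~5.3 TB/s of HBM bandwidth

activated_params = 38e9
bytes_per_param = 1.0
hbm_bw_per_gpu = 5.3e12   # bytes/s
num_gpus = 8

bytes_per_step = activated_params * bytes_per_param
step_time_s = bytes_per_step / (hbm_bw_per_gpu * num_gpus)
print(f"Memory-bound floor per decode step: {step_time_s * 1e3:.2f} ms")  # ~0.90 ms
```

Because this floor is set entirely by bytes moved per step, a chip that keeps full bandwidth while shedding compute cores loses little on decode, which is exactly the MI308X trade-off.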

Optimization Techniques

  • Custom HIP attention kernel: The default vLLM Triton attention kernel was the most significant bottleneck, occupying around 50% of GPU time. We developed custom HIP attention kernels optimized for Step3's MFA configuration (64 query heads, 1 KV head, head dimension 256) with data parallelism. Our kernels reduce attention latency by 72% for decode batches and 37% for mixed prefill/decode batches.
  • CUDA graph: Once GPU kernel latency was significantly reduced, CPU-side overhead became the next bottleneck for decode steps. We enabled full CUDA graph capture for the Step3 model with DP8-EP8 parallelism, improving decode throughput from approximately 2,900 to 4,100 tok/s.
  • Mixed BF16–FP8 blockscale quantization: We exhaustively tuned GEMMs for both BF16 and FP8 blockscale computation to reach the best precision–efficiency trade-off.
  • Optimized MoE one-stage kernel: We built a custom one-stage kernel for Step3's MoE layer, tuned around its inter_dim parameter.
  • Shared-expert MLP fusion: We fused the shared-expert MLP into the MoE layer, reducing redundant computation and improving inference latency.
  • MoRI EP integration: We integrated the MoRI library for efficient expert-parallel all-to-all communication on AMD GPUs.
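To make the routing and shared-expert structure concrete, here is a toy, framework-free sketch of 3-of-48 expert selection with an always-on shared path. Everything in it (the stand-in expert functions, the renormalization over the top-k router weights, the function names) is illustrative rather than taken from Step3 or Moreh vLLM; a real fused implementation would do this work inside a single GPU kernel launch instead of separate Python calls:

```python
import math

# Toy sketch of MoE routing in a Step3-like layer: each token activates its
# top 3 of 48 routed experts, and a shared-expert MLP runs unconditionally.
# Experts here are scalar stand-in functions, not real MLPs.

NUM_EXPERTS, TOP_K = 48, 3

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, router_logits, experts, shared_expert):
    probs = softmax(router_logits)
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)               # renormalize over the top-k
    routed = sum(probs[i] / norm * experts[i](x) for i in top)
    return routed + shared_expert(x)                # shared path always runs

# Toy usage with scalar "activations": expert i just multiplies by i.
experts = [lambda x, w=i: w * x for i in range(NUM_EXPERTS)]
shared = lambda x: 0.5 * x
y = moe_forward(1.0, router_logits=[float(i) for i in range(NUM_EXPERTS)],
                experts=experts, shared_expert=shared)
```

The point of the fusion described above is that the `shared_expert(x)` term has no data dependence on the routing result, so its computation can be folded into the same kernel as the routed experts rather than launched separately.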

Performance Results

We benchmarked Moreh vLLM on 8× MI308X against StepFun's reported numbers on 8× NVIDIA H20, using the same test configuration: ISL=4096, OSL=256, Concurrency=256, with DP8-EP8 parallelism (8-way data parallelism for attention, 8-way expert parallelism for MoE).

Decode performance comparison: Moreh vLLM (MI308X) vs. StepFun (H20).

|  | Decode throughput (tok/s) | Decode latency (ms) | Prefill throughput (tok/s) | Prefill latency (ms) |
| --- | --- | --- | --- | --- |
| Moreh vLLM (MI308X) | 4,082 | 63 | 9,601 | 109,217 |
| StepFun (H20) | 3,147 | 82 | 13,780 | 76,420 |
| Speedup | 1.30× | 1.30× | 0.70× | 0.70× |

ISL=4096, OSL=256, Concurrency=256, DP8-EP8. Speedup is Moreh/StepFun for throughput and StepFun/Moreh for latency (higher is better for Moreh in both cases).
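The speedup figures follow directly from the raw throughput and latency numbers in the table, which a few lines of arithmetic confirm:

```python
# Recomputing the speedup row from the benchmark table's raw numbers.
decode_tp  = {"moreh": 4082, "stepfun": 3147}   # tok/s
decode_lat = {"moreh": 63,   "stepfun": 82}     # ms
prefill_tp = {"moreh": 9601, "stepfun": 13780}  # tok/s

print(f"Decode throughput:  {decode_tp['moreh'] / decode_tp['stepfun']:.2f}x")   # 1.30x
print(f"Decode latency:     {decode_lat['stepfun'] / decode_lat['moreh']:.2f}x") # 1.30x
print(f"Prefill throughput: {prefill_tp['moreh'] / prefill_tp['stepfun']:.2f}x") # 0.70x
```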

The results show a clear split between the two phases:

  • Decode: Moreh vLLM on MI308X achieves 4,082 tok/s, 1.30× the throughput of StepFun's H20 baseline (3,147 tok/s), while cutting decode latency from 82 ms to 63 ms.
  • Prefill: H20 retains an advantage in the compute-bound prefill phase (13,780 vs. 9,601 tok/s), which is expected given its stronger on-chip cache subsystem.

In production serving with prefill–decode disaggregation, the decode phase is where most GPUs are allocated. MI308X's strong decode performance translates directly into cost-effective serving at scale.

Summary

This engagement demonstrates that AMD Instinct MI308X, paired with Moreh vLLM's model-specific optimizations, can deliver higher decode throughput than NVIDIA H20 for large MoE models. MI308X's high memory bandwidth relative to its compute capacity makes it a cost-effective choice for the decode phase, which dominates GPU allocation in production LLM serving deployments.

Moreh provides custom vLLM optimization for proprietary and fine-tuned models. If you are evaluating AMD GPUs for your model, contact us to discuss how we can help.