Blog
Moreh Unlocks AMD MI300X Potential: 1.5× Faster DeepSeek R1 Inference vs. SGLang (InferenceMAX)
March 16, 2026
Author: Bongwon Jang
Introduction
With the emergence of large-scale models like DeepSeek R1 and the surge in AI demand, even small differences in inference performance can translate into millions of dollars in cost-per-token gaps at scale. This has made objective measurement of GPU inference performance increasingly important, and SemiAnalysis's open-source benchmarking framework, InferenceMAX, is a prime example. Running nightly tests across hundreds of GPUs to track inference performance in real time, InferenceMAX has established itself as one of the most trusted measurement systems in the industry.
The problem is that many people treat the public InferenceMAX numbers as the actual performance ceiling of the hardware. But these numbers represent what the default open-source software (SGLang) achieves—not the limits of the hardware itself. Depending on how deeply you optimize the software, you can extract significantly higher performance from the same hardware. This is especially true for AMD, where inference software is still less mature compared to NVIDIA's CUDA ecosystem—meaning there's more room for optimization and software plays an even more critical role.
We ran the same InferenceMAX benchmark using our own optimized inference engine and observed a 1.47x improvement in end-to-end latency and a 1.47x increase in throughput per GPU (geometric mean) compared to the published InferenceMAX numbers. This confirms that software optimization remains key to unlocking the full potential of AMD GPUs—and that Moreh has the technical capability to deliver it. For organizations considering AMD infrastructure, this suggests that partnering with Moreh can help achieve higher inference performance on the same hardware, ultimately reducing inference costs at scale. In this post, we'll walk through our test results to show just how much of a performance difference software optimization can make on identical hardware.
The Challenge: Software Is the Real Bottleneck for AMD GPUs
On paper, the AMD Instinct MI300X is an impressive inference accelerator. It features 192 GB of HBM3 memory and 5.3 TB/s of memory bandwidth—roughly 2.4x and 1.7x higher than its competitor, the NVIDIA H100, respectively.
But in practice, AMD's inference software ecosystem is less mature than NVIDIA's CUDA-based stack. SemiAnalysis also identified composability as AMD's biggest challenge in its report. Individual optimization techniques—FP8 quantization, MoE kernels, Expert Parallelism—each work well on their own, but integrating them into a single production-grade pipeline remains difficult.
The nature of DeepSeek R1 as a model adds further complexity. It's a 671B-parameter MoE model with 256 experts per decoder block, combined with Multi-Head Latent Attention (MLA) and long chain-of-thought outputs—resulting in a broad optimization surface. This also means there's significant performance headroom that default open-source software configurations simply can't tap into.
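To see why FP8 KV cache utilization matters for this model, consider a back-of-the-envelope estimate of the MLA cache footprint. The config values below are taken from the public DeepSeek-V3/R1 model configuration (61 decoder layers, `kv_lora_rank=512`, `qk_rope_head_dim=64`) and are illustrative assumptions, not Moreh's internal numbers:

```python
# Back-of-the-envelope MLA KV-cache footprint for DeepSeek R1.
# Values below come from the public DeepSeek-V3/R1 model config and are
# assumptions for illustration only.
NUM_LAYERS = 61        # decoder blocks
KV_LORA_RANK = 512     # compressed latent KV dimension (MLA)
QK_ROPE_HEAD_DIM = 64  # decoupled RoPE key dimension, cached alongside

def kv_cache_bytes_per_token(bytes_per_elem: int) -> int:
    """MLA caches one compressed latent plus a RoPE key per layer per token."""
    return (KV_LORA_RANK + QK_ROPE_HEAD_DIM) * NUM_LAYERS * bytes_per_elem

fp16 = kv_cache_bytes_per_token(2)  # BF16/FP16 cache
fp8 = kv_cache_bytes_per_token(1)   # FP8 cache: half the footprint

print(f"FP16 KV cache: {fp16 / 1024:.1f} KiB/token")   # 68.6 KiB/token
print(f"FP8  KV cache: {fp8 / 1024:.1f} KiB/token")    # 34.3 KiB/token
print(f"8K-token request (FP8): {fp8 * 8192 / 2**20:.1f} MiB")
```

Halving the per-token cache footprint directly increases how many concurrent long-context requests fit in the MI300X's 192 GB of HBM3, which is one reason KV-cache precision shows up as an optimization lever at all.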
To address these bottlenecks, Moreh developed its own inference engine with optimizations down to the GPU kernel level. We tackled the areas that default open-source software overlooks—MoE kernel efficiency, FP8 KV cache utilization, kernel launch overhead, and more—to push performance further. Below, we examine how Moreh's optimized inference engine outperformed the InferenceMAX benchmark results measured with existing open-source software.
Test Environment
| Category | Specification |
|---|---|
| GPU | AMD Instinct MI300X (8 GPUs per node) |
| Model | DeepSeek R1 0528 |
| Precision | FP8 |
| Benchmark | InferenceMAX benchmark suite |
| Baseline | Public SGLang results (January 26, 2026) |
| Inference Framework | Moreh Optimized Inference Engine (Moreh-vLLM) |
Benchmark Configuration
We replicated the exact InferenceMAX benchmark configuration, covering three representative ISL/OSL (Input Sequence Length / Output Sequence Length) scenarios:
- 1K/1K — Balanced workload (short-context Q&A, chat)
- 1K/8K — Long output workload (reasoning, coding, chain-of-thought)
- 8K/1K — Long input workload (document processing, summarization, RAG)
Each scenario was tested at concurrency levels of 4, 8, 16, 32, and 64 (total requests ranging from 40 to 640), with an infinite request rate applied to measure maximum throughput.
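The sweep above can be written out as a small configuration matrix. This sketch assumes, as the stated request totals (40 to 640) imply, that each run issues 10× its concurrency level in total requests; the dictionary fields are illustrative names, not InferenceMAX's actual harness parameters:

```python
from itertools import product

# ISL/OSL scenarios and the concurrency sweep from the benchmark setup.
SCENARIOS = {"1K/1K": (1024, 1024), "1K/8K": (1024, 8192), "8K/1K": (8192, 1024)}
CONCURRENCY = [4, 8, 16, 32, 64]
REQUESTS_PER_CONCURRENCY = 10  # implied by "total requests ranging from 40 to 640"

configs = [
    {
        "isl": isl,
        "osl": osl,
        "concurrency": c,
        "num_requests": c * REQUESTS_PER_CONCURRENCY,
        "request_rate": float("inf"),  # infinite rate: all requests queued at once
    }
    for (isl, osl), c in product(SCENARIOS.values(), CONCURRENCY)
]

print(len(configs))  # 15 benchmark configurations
print(configs[0]["num_requests"], configs[-1]["num_requests"])  # 40 640
```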
Performance Evaluation
Results Summary
Across all 15 benchmark configurations, Moreh-vLLM—our inference engine built with Moreh's optimization techniques—consistently outperformed the published InferenceMAX numbers on the same AMD MI300X hardware.
| Metric | Geometric Mean Improvement |
|---|---|
| Median End-to-End Latency (E2EL) | 1.47x |
| Total Throughput per GPU (tok/s/gpu) | 1.47x |
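The geometric mean can be reproduced from the 15 per-configuration improvement ratios reported in the detail tables. A minimal check, using the end-to-end latency ratios:

```python
import math

# Per-configuration E2E-latency improvement ratios (SGLang / Moreh-vLLM),
# copied from the three detail tables at CON = 4, 8, 16, 32, 64.
e2el_improvements = [
    1.60, 1.53, 1.33, 1.28, 1.24,  # 1K/1K
    1.73, 1.56, 1.38, 1.57, 1.36,  # 1K/8K
    1.83, 1.60, 1.37, 1.46, 1.36,  # 8K/1K
]

def geometric_mean(xs):
    """n-th root of the product, computed in log space for stability."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(f"{geometric_mean(e2el_improvements):.2f}x")  # 1.47x
```

The geometric mean is the right aggregate here because the per-configuration numbers are ratios; an arithmetic mean would overweight the largest gains.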


Detailed Analysis by Scenario
1K/1K (ISL=1,024, OSL=1,024)
| CON | SGLang Median E2EL (s) | Moreh-vLLM Median E2EL (s) | E2EL Improvement | SGLang Throughput (tok/s/gpu) | Moreh-vLLM Throughput (tok/s/gpu) | Throughput Improvement |
|---|---|---|---|---|---|---|
| 4 | 24.68 | 15.43 | 1.60x | 35.91 | 58.29 | 1.62x |
| 8 | 27.06 | 17.64 | 1.53x | 66.15 | 103.44 | 1.56x |
| 16 | 29.6 | 22.18 | 1.33x | 120.13 | 163.57 | 1.36x |
| 32 | 37.57 | 29.25 | 1.28x | 190.84 | 247.98 | 1.30x |
| 64 | 48.55 | 39.15 | 1.24x | 294.07 | 371.63 | 1.26x |
Performance improvements are most pronounced at low concurrency (CON=4), with latency improving by 1.60x and throughput increasing by 1.62x. This is the result of Moreh's optimizations effectively eliminating kernel launch overhead, which dominates at small batch sizes.
While the gains taper off as concurrency increases, meaningful improvements of at least 1.24x are sustained even at CON=64.
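As a sanity check, the improvement columns in the table above are simply the SGLang-to-Moreh-vLLM ratios. For the latency side of the 1K/1K table:

```python
# 1K/1K rows: (concurrency, SGLang median E2EL, Moreh-vLLM median E2EL),
# copied from the table above.
rows = [
    (4, 24.68, 15.43),
    (8, 27.06, 17.64),
    (16, 29.6, 22.18),
    (32, 37.57, 29.25),
    (64, 48.55, 39.15),
]

for con, sglang, moreh in rows:
    print(f"CON={con:>2}: {sglang / moreh:.2f}x")  # 1.60x, 1.53x, 1.33x, 1.28x, 1.24x
```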

1K/8K (ISL=1,024, OSL=8,192)
| CON | SGLang Median E2EL (s) | Moreh-vLLM Median E2EL (s) | E2EL Improvement | SGLang Throughput (tok/s/gpu) | Moreh-vLLM Throughput (tok/s/gpu) | Throughput Improvement |
|---|---|---|---|---|---|---|
| 4 | 203.9 | 117.62 | 1.73x | 19.4 | 33.69 | 1.74x |
| 8 | 210.22 | 134.7 | 1.56x | 38.48 | 60.11 | 1.56x |
| 16 | 239.432 | 173.8 | 1.38x | 67.84 | 93.49 | 1.38x |
| 32 | 347.05 | 221.34 | 1.57x | 93.95 | 147.16 | 1.57x |
| 64 | 395.78 | 291.09 | 1.36x | 162.89 | 221.7 | 1.36x |
The 1K/8K scenario involves generating long outputs and is designed to stress-test decode performance. This is where Moreh's optimizations for maximizing memory bandwidth utilization stood out the most. In particular, the 1.73x latency improvement and 1.74x throughput gain at CON=4 clearly demonstrate the impact of our optimizations on long-generation workloads.
As concurrency increases, the workload gradually shifts toward being compute-bound, which narrows the software optimization gap. However, even at CON=64, we recorded meaningful performance gains of 1.36x in both end-to-end latency and throughput.

8K/1K (ISL=8,192, OSL=1,024)
| CON | SGLang Median E2EL (s) | Moreh-vLLM Median E2EL (s) | E2EL Improvement | SGLang Throughput (tok/s/gpu) | Moreh-vLLM Throughput (tok/s/gpu) | Throughput Improvement |
|---|---|---|---|---|---|---|
| 4 | 30.84 | 16.82 | 1.83x | 129.74 | 236.7 | 1.82x |
| 8 | 32.72 | 20.49 | 1.60x | 243.75 | 396.34 | 1.63x |
| 16 | 38.77 | 28.24 | 1.37x | 402.33 | 567.92 | 1.41x |
| 32 | 60.31 | 41.33 | 1.46x | 522.94 | 781.02 | 1.49x |
| 64 | 88.06 | 64.75 | 1.36x | 722.49 | 840.53 | 1.16x |
The 8K/1K scenario is a prefill-dominant workload. The peak latency improvement of 1.83x at CON=4 is attributed to Moreh's kernel optimizations for the prefill phase. Notably, even at maximum concurrency (CON=64), we achieved a 1.36x latency improvement and 1.16x throughput gain—demonstrating meaningful performance advantages even under heavy load.

Key Observations
- Consistent performance improvements across all concurrency levels. The same pattern appears across all three scenarios. At small batch sizes, kernel launch overhead and per-operation inefficiencies dominate overall performance—and this is where Moreh's optimizations deliver the greatest impact. Even as concurrency increases, stable performance gains of at least 1.16x are maintained across all configurations, demonstrating that the optimization benefits are not limited to specific conditions but apply consistently across the board.
- Moreh's optimizations also prove valuable for long output workloads. With the rise of reasoning models, long output workloads like chain-of-thought are growing rapidly. In the 1K/8K scenario, we observed performance improvements ranging from 1.36x to 1.74x—a result of sustained bandwidth utilization optimizations during long decode sequences.
- Throughput and latency improvements scale at nearly the same rate. The geometric means match at 1.47x for both metrics. This indicates that our optimizations didn't simply shift the latency-throughput tradeoff—they improved actual computational efficiency.
- The hardware is identical. Only the software changed. All results were achieved on the same AMD MI300X GPUs. The performance difference comes from our own optimizations that go deeper than default open-source software—reducing kernel launch overhead at small batch sizes, maximizing GPU memory bandwidth utilization, optimizing prefill operations, and so on.
Conclusion
Default open-source software is not the last word on AMD GPU optimization, and the numbers published on InferenceMAX do not represent the performance limits of the hardware. In this evaluation, we demonstrated that with deeper software optimization, the AMD MI300X can achieve a 1.47x improvement in end-to-end latency and a 1.47x improvement in throughput per GPU for DeepSeek R1 FP8 inference—compared to the currently published InferenceMAX baseline.
Every percentage point of inference efficiency translates directly into cost-per-token savings for CSPs and enterprises serving open-weight models at scale. Moreh can be a proven software partner for organizations looking to adopt AMD infrastructure, helping them extract maximum performance from the same hardware. We will continue pushing the boundaries of inference performance on AMD GPUs, enabling more organizations to fully realize the value of AMD infrastructure.
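To make the cost argument concrete, here is a rough cost-per-token calculation using one configuration from the tables above (1K/1K, CON=64). The $2.00/GPU-hour rate is a hypothetical placeholder for illustration, not a quoted price:

```python
# Illustrative cost-per-token impact of a throughput gain.
# The $2.00/GPU-hour rate is a hypothetical assumption, not a real quote.
GPU_HOUR_USD = 2.00
baseline_tps = 294.07   # SGLang, 1K/1K, CON=64 (tok/s/gpu, from the tables above)
optimized_tps = 371.63  # Moreh-vLLM, same configuration

def usd_per_million_tokens(tok_per_s_per_gpu: float) -> float:
    """Cost per million tokens at a given per-GPU throughput."""
    tokens_per_gpu_hour = tok_per_s_per_gpu * 3600
    return GPU_HOUR_USD / tokens_per_gpu_hour * 1_000_000

base = usd_per_million_tokens(baseline_tps)
opt = usd_per_million_tokens(optimized_tps)
print(f"baseline:  ${base:.3f}/M tokens")
print(f"optimized: ${opt:.3f}/M tokens ({1 - opt / base:.0%} cheaper)")
```

Because cost per token is inversely proportional to throughput, a 1.26x throughput gain at this operating point translates into roughly a 21% reduction in serving cost on the same hardware, at whatever GPU-hour rate actually applies.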
For more details on Moreh's inference optimization, visit moreh.io and docs.moreh.io.