
Moreh Unlocks AMD MI300X Potential: 1.5× Faster DeepSeek R1 Inference vs. SGLang (InferenceMAX)

March 16, 2026

Author: Bongwon Jang

Introduction

With the emergence of large-scale models like DeepSeek R1 and the surge in AI demand, even small differences in inference performance can translate into millions of dollars in cost-per-token gaps at scale. This has made objective measurement of GPU inference performance increasingly important, and SemiAnalysis's open-source benchmarking framework, InferenceMAX, is a prime example. Running nightly tests across hundreds of GPUs to track inference performance in real time, InferenceMAX has established itself as one of the most trusted measurement systems in the industry.

The problem is that many people treat the public InferenceMAX numbers as the actual performance ceiling of the hardware. But these numbers represent what the default open-source software (SGLang) achieves—not the limits of the hardware itself. Depending on how deeply you optimize the software, you can extract significantly higher performance from the same hardware. This is especially true for AMD, where inference software is still less mature compared to NVIDIA's CUDA ecosystem—meaning there's more room for optimization and software plays an even more critical role.

We ran the same InferenceMAX benchmark using our own optimized inference engine and observed a 1.47x improvement in end-to-end latency and a 1.47x increase in throughput per GPU (geometric mean) compared to the published InferenceMAX numbers. This confirms that software optimization remains key to unlocking the full potential of AMD GPUs—and that Moreh has the technical capability to deliver it. For organizations considering AMD infrastructure, this suggests that partnering with Moreh can help achieve higher inference performance on the same hardware, ultimately reducing inference costs at scale. In this post, we'll walk through our test results to show just how much of a performance difference software optimization can make on identical hardware.

The Challenge: Software Is the Real Bottleneck for AMD GPUs

On paper, the AMD Instinct MI300X is an impressive inference accelerator. It features 192 GB of HBM3 memory and 5.3 TB/s of memory bandwidth—roughly 2.4x and 1.7x higher than its competitor, the NVIDIA H100, respectively.

But in practice, AMD's inference software ecosystem is less mature than NVIDIA's CUDA-based stack. SemiAnalysis also identified composability as AMD's biggest challenge in its report. Individual optimization techniques—FP8 quantization, MoE kernels, Expert Parallelism—each work well on their own, but integrating them into a single production-grade pipeline remains difficult.

The nature of DeepSeek R1 as a model adds further complexity. It's a 671B-parameter MoE model with 256 experts per decoder block, combined with Multi-Head Latent Attention (MLA) and long chain-of-thought outputs—resulting in a broad optimization surface. This also means there's significant performance headroom that default open-source software configurations simply can't tap into.

To address these bottlenecks, Moreh developed its own inference engine with optimizations down to the GPU kernel level. We tackled the areas that default open-source software overlooks—MoE kernel efficiency, FP8 KV cache utilization, kernel launch overhead, and more—to push performance further. Below, we examine how Moreh's optimized inference engine outperformed the InferenceMAX benchmark results measured with existing open-source software.

Test Environment

| Category | Specification |
| --- | --- |
| GPU | AMD Instinct MI300X (8 GPUs per node) |
| Model | DeepSeek R1 0528 |
| Precision | FP8 |
| Benchmark | InferenceMAX benchmark suite |
| Baseline | Public SGLang results (January 26, 2026) |
| Inference Framework | Moreh Optimized Inference Engine (Moreh-vLLM) |

Benchmark Configuration

We replicated the exact InferenceMAX benchmark configuration, covering three representative ISL/OSL (Input Sequence Length / Output Sequence Length) scenarios:

  • 1K/1K — Balanced workload (short-context Q&A, chat)
  • 1K/8K — Long output workload (reasoning, coding, chain-of-thought)
  • 8K/1K — Long input workload (document processing, summarization, RAG)

Each scenario was tested at concurrency levels of 4, 8, 16, 32, and 64 (total requests ranging from 40 to 640), with an infinite request rate applied to measure maximum throughput.
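The sweep above amounts to 15 (scenario, concurrency) configurations. As a rough sketch of how the matrix can be enumerated (the 10-requests-per-client constant is inferred from the 40–640 totals; the names here are illustrative, not our actual harness):

```python
from itertools import product

# ISL/OSL scenarios and concurrency levels from the benchmark configuration.
SCENARIOS = {"1K/1K": (1024, 1024), "1K/8K": (1024, 8192), "8K/1K": (8192, 1024)}
CONCURRENCY = [4, 8, 16, 32, 64]
REQUESTS_PER_CLIENT = 10  # assumed: CON=4 -> 40 total requests, CON=64 -> 640

def sweep():
    """Yield one benchmark configuration per (scenario, concurrency) pair."""
    for (name, (isl, osl)), con in product(SCENARIOS.items(), CONCURRENCY):
        yield {"scenario": name, "isl": isl, "osl": osl,
               "concurrency": con, "total_requests": con * REQUESTS_PER_CLIENT}

configs = list(sweep())
print(len(configs))  # 15
```

Each configuration is then driven at an infinite request rate (all requests issued as soon as a client slot frees up), so the measured throughput reflects the engine's maximum sustained rate at that concurrency.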

Performance Evaluation

Results Summary

Across all 15 benchmark configurations, Moreh-vLLM—our inference engine built with Moreh's optimization techniques—consistently outperformed the published InferenceMAX numbers on the same AMD MI300X hardware.

| Metric | Geometric Mean Improvement |
| --- | --- |
| Median End-to-End Latency (E2EL) | 1.47x |
| Total Throughput per GPU (tok/s/gpu) | 1.47x |

Figure 1. Performance speedup for various request patterns (end-to-end latency). Higher is better. Moreh-vLLM shows an average of 1.47x lower end-to-end latency.

Figure 2. Performance speedup for various request patterns (throughput). Higher is better. Moreh-vLLM shows an average of 1.47x higher throughput.
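For reference, the 1.47x headline figures can be reproduced directly from the per-configuration improvement factors reported in the detailed tables below (a minimal sketch):

```python
from math import prod

# Improvement factors for all 15 configurations, taken from the detailed
# tables (rows ordered 1K/1K, 1K/8K, 8K/1K; CON = 4, 8, 16, 32, 64).
latency_gain = [1.60, 1.53, 1.33, 1.28, 1.24,
                1.73, 1.56, 1.38, 1.57, 1.36,
                1.83, 1.60, 1.37, 1.46, 1.36]
throughput_gain = [1.62, 1.56, 1.36, 1.30, 1.26,
                   1.74, 1.56, 1.38, 1.57, 1.36,
                   1.82, 1.63, 1.41, 1.49, 1.16]

def geomean(xs):
    """Geometric mean: the n-th root of the product of n factors."""
    return prod(xs) ** (1 / len(xs))

print(f"E2EL geomean:       {geomean(latency_gain):.2f}x")     # 1.47x
print(f"throughput geomean: {geomean(throughput_gain):.2f}x")  # 1.47x
```

The geometric mean is the appropriate average for ratios like speedups, since it weights a 2x gain and a 0.5x regression symmetrically.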

Detailed Analysis by Scenario

1K/1K (ISL=1,024, OSL=1,024)

| CON | SGLang E2EL (s) | Moreh-vLLM E2EL (s) | E2EL Improvement | SGLang tok/s/gpu | Moreh-vLLM tok/s/gpu | Throughput Improvement |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 24.68 | 15.43 | 1.60x | 35.91 | 58.29 | 1.62x |
| 8 | 27.06 | 17.64 | 1.53x | 66.15 | 103.44 | 1.56x |
| 16 | 29.6 | 22.18 | 1.33x | 120.13 | 163.57 | 1.36x |
| 32 | 37.57 | 29.25 | 1.28x | 190.84 | 247.98 | 1.30x |
| 64 | 48.55 | 39.15 | 1.24x | 294.07 | 371.63 | 1.26x |

Performance improvements are most pronounced at low concurrency (CON=4), with latency improving by 1.60x and throughput increasing by 1.62x. This is the result of Moreh's optimizations effectively eliminating kernel launch overhead, which dominates at small batch sizes.

While the gains taper off as concurrency increases, meaningful improvements of at least 1.24x are sustained even at CON=64.

Figure 3. Throughput-latency trade-off comparison (ISL=1,024, OSL=1,024). Moreh-vLLM demonstrates superior efficiency over SGLang by maintaining higher throughput at significantly lower end-to-end latency.

1K/8K (ISL=1,024, OSL=8,192)

| CON | SGLang E2EL (s) | Moreh-vLLM E2EL (s) | E2EL Improvement | SGLang tok/s/gpu | Moreh-vLLM tok/s/gpu | Throughput Improvement |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 203.9 | 117.62 | 1.73x | 19.4 | 33.69 | 1.74x |
| 8 | 210.22 | 134.7 | 1.56x | 38.48 | 60.11 | 1.56x |
| 16 | 239.43 | 173.8 | 1.38x | 67.84 | 93.49 | 1.38x |
| 32 | 347.05 | 221.34 | 1.57x | 93.95 | 147.16 | 1.57x |
| 64 | 395.78 | 291.09 | 1.36x | 162.89 | 221.7 | 1.36x |

The 1K/8K scenario involves generating long outputs and is designed to stress-test decode performance. This is where Moreh's optimizations for maximizing memory bandwidth utilization stood out the most. In particular, the 1.73x latency improvement and 1.74x throughput gain at CON=4 clearly demonstrate the impact of our optimizations on long-generation workloads.

As concurrency increases, the workload gradually shifts toward being compute-bound, which narrows the software optimization gap. However, even at CON=64, we recorded meaningful performance gains of 1.36x in both end-to-end latency and throughput.
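A back-of-envelope roofline makes the bandwidth argument concrete. Assuming decode at batch 1 is dominated by streaming DeepSeek R1's roughly 37B active FP8 parameters from HBM once per generated token (ignoring KV-cache reads and inter-GPU communication, which only lower the ceiling further):

```python
# Rough decode-throughput ceiling for one 8x MI300X node (illustrative only).
HBM_BW_TBPS = 5.3        # MI300X memory bandwidth per GPU (TB/s)
GPUS_PER_NODE = 8
ACTIVE_PARAMS = 37e9     # DeepSeek R1 activates ~37B of its 671B parameters per token
BYTES_PER_PARAM = 1      # FP8

def decode_tokens_per_sec_upper_bound():
    # Every generated token must stream each active weight once from HBM.
    node_bw = HBM_BW_TBPS * 1e12 * GPUS_PER_NODE
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
    return node_bw / bytes_per_token

print(f"~{decode_tokens_per_sec_upper_bound():.0f} tok/s/node ceiling at batch 1")
```

The gap between this theoretical ceiling and what an engine actually sustains is exactly where bandwidth-utilization optimizations pay off; batching recovers throughput by amortizing each weight read across many sequences.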

Figure 4. Throughput-latency trade-off comparison (ISL=1,024, OSL=8,192). Moreh-vLLM demonstrates superior efficiency over SGLang by maintaining higher throughput at significantly lower end-to-end latency.

8K/1K (ISL=8,192, OSL=1,024)

| CON | SGLang E2EL (s) | Moreh-vLLM E2EL (s) | E2EL Improvement | SGLang tok/s/gpu | Moreh-vLLM tok/s/gpu | Throughput Improvement |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 30.84 | 16.82 | 1.83x | 129.74 | 236.7 | 1.82x |
| 8 | 32.72 | 20.49 | 1.60x | 243.75 | 396.34 | 1.63x |
| 16 | 38.77 | 28.24 | 1.37x | 402.33 | 567.92 | 1.41x |
| 32 | 60.31 | 41.33 | 1.46x | 522.94 | 781.02 | 1.49x |
| 64 | 88.06 | 64.75 | 1.36x | 722.49 | 840.53 | 1.16x |

The 8K/1K scenario is a prefill-dominant workload. The peak latency improvement of 1.83x at CON=4 is attributed to Moreh's kernel optimizations for the prefill phase. Notably, even at maximum concurrency (CON=64), we achieved a 1.36x latency improvement and 1.16x throughput gain—demonstrating meaningful performance advantages even under heavy load.
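A quick token count shows why this scenario is prefill-dominant:

```python
# Per-request token accounting for the 8K/1K scenario: almost all token work
# sits in the parallel, compute-bound prefill phase, while the remaining
# decode tokens are generated one sequential step at a time.
ISL, OSL = 8192, 1024

prefill_frac = ISL / (ISL + OSL)
decode_steps = OSL
print(f"{prefill_frac:.0%} of tokens are prefill; {decode_steps} sequential decode steps remain")
```

Because nearly 9 of every 10 tokens are processed during prefill, kernel efficiency in that phase dominates end-to-end latency for this workload.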

Figure 5. Throughput-latency trade-off comparison (ISL=8,192, OSL=1,024). Moreh-vLLM demonstrates superior efficiency over SGLang by maintaining higher throughput at significantly lower end-to-end latency.

Key Observations

  1. Consistent performance improvements across all concurrency levels. The same pattern appears across all three scenarios. At small batch sizes, kernel launch overhead and per-operation inefficiencies dominate overall performance—and this is where Moreh's optimizations deliver the greatest impact. Even as concurrency increases, stable performance gains of at least 1.16x are maintained across all configurations, demonstrating that the optimization benefits are not limited to specific conditions but apply consistently across the board.
  2. Moreh's optimizations also prove valuable for long output workloads. With the rise of reasoning models, long output workloads like chain-of-thought are growing rapidly. In the 1K/8K scenario, we observed performance improvements ranging from 1.36x to 1.74x—a result of sustained bandwidth utilization optimizations during long decode sequences.
  3. Throughput and latency improvements scale at nearly the same rate. The geometric means are nearly identical: 1.47x for latency and 1.47x for throughput. This indicates that our optimizations didn't simply shift the latency-throughput tradeoff; they improved actual computational efficiency.
  4. The hardware is identical. Only the software changed. All results were achieved on the same AMD MI300X GPUs. The performance difference comes from our own optimizations that go deeper than default open-source software—reducing kernel launch overhead at small batch sizes, maximizing GPU memory bandwidth utilization, optimizing prefill operations, and so on.

Conclusion

Software optimization on AMD GPUs does not end with today's default open-source stacks, and the numbers published on InferenceMAX do not represent the performance limits of the hardware. In this evaluation, we demonstrated that with deeper software optimization, the AMD MI300X can achieve a 1.47x improvement in end-to-end latency and a 1.47x improvement in throughput per GPU for DeepSeek R1 FP8 inference, compared to the currently published InferenceMAX baseline.

Every percentage point of inference efficiency translates directly into cost-per-token savings for CSPs and enterprises serving open-weight models at scale. Moreh can be a proven software partner for organizations looking to adopt AMD infrastructure, helping them extract maximum performance from the same hardware. We will continue pushing the boundaries of inference performance on AMD GPUs, enabling more organizations to fully realize the value of AMD infrastructure.

For more details on Moreh's inference optimization, visit moreh.io and docs.moreh.io.