Customer Case
Telco LLM Inference Optimization on AMD MI300X: 1.38× Higher Serving Capacity
November 25, 2025
Background
One of the major Korean telcos was planning to deploy an LLM-powered service using a 7.8B-parameter dense LLM developed by an affiliate within their corporate group. As part of their infrastructure evaluation, they wanted to compare AMD Instinct MI300X against their existing NVIDIA H100 GPUs for serving this model in production.
The customer asked Moreh to optimize inference for this affiliate model on MI300X and run a head-to-head benchmark against H100. The goal was not just to measure raw speed, but to answer a concrete business question: how many concurrent users can a single GPU serve while maintaining acceptable response quality?
This is a common question for telcos deploying customer-facing AI services, where the number of concurrent sessions directly determines how many GPUs are needed — and therefore the total infrastructure cost.
Why These Metrics Matter
Before diving into the results, it is worth explaining why each metric was chosen. The customer was designing a subscriber-facing LLM service, so every metric maps to a specific aspect of user experience and operational cost:
- TTFT (Time To First Token): How long a user waits before the service starts responding. In a conversational interface, high TTFT feels sluggish and drives users away. This is the "perceived responsiveness" metric.
- TPOT (Time Per Output Token): The interval between successive tokens during generation, which determines the streaming speed of the response. Lower TPOT produces text that feels like natural, real-time typing; higher values cause noticeable stuttering or lag.
- End-to-End Latency (E2EL): Total time from request submission to the last token. This captures the complete user wait time for a full response.
- Output TPS (Tokens Per Second): Aggregate throughput — how many tokens the system produces per second. Higher TPS means more work done per GPU per unit of time.
- Max Concurrency: The maximum number of simultaneous requests a single GPU can handle while keeping TTFT and TPOT within customer-specified thresholds (the Service Level Objectives, or SLOs). This is the most operationally important metric: it directly determines how many GPUs the customer needs to purchase for a given user base.
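To make the definitions concrete, all of the per-request metrics above can be derived from a request's submission time and the wall-clock arrival time of each streamed token. The helper below is a minimal illustrative sketch, not part of the benchmark harness used in this engagement:

```python
def request_metrics(submit_ts, token_ts):
    """Compute TTFT, TPOT, E2EL, and output TPS for one request.

    submit_ts: wall-clock time the request was submitted (seconds)
    token_ts:  wall-clock arrival time of each generated token (seconds)
    """
    ttft = token_ts[0] - submit_ts   # time to first token
    e2el = token_ts[-1] - submit_ts  # end-to-end latency
    n_out = len(token_ts)
    # TPOT: mean gap between successive tokens after the first one
    tpot = (e2el - ttft) / (n_out - 1) if n_out > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "e2el": e2el,
            "output_tps": n_out / e2el}
```

For example, four tokens arriving at 0.25 s intervals after submission yield a TTFT of 0.25 s, a TPOT of 0.25 s, and an output TPS of 4.0.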
Test Setup
All tests were run head-to-head on a single GPU per platform:
- MI300X side: 1× AMD Instinct MI300X (192 GB HBM3), running Moreh vLLM
- H100 side: 1× NVIDIA H100 SXM (80 GB HBM3), running vLLM
The workload used ShareGPT traces — real conversation logs from a ChatGPT-like service — to simulate realistic conversational interactions. Unlike synthetic benchmarks with fixed input/output lengths, ShareGPT traces reflect the highly variable request patterns of actual users: short follow-up questions, long initial prompts, varying response lengths, and so on. This makes the results more representative of what the customer would see in production.
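For context, ShareGPT traces are typically distributed as a JSON list of conversations with alternating human/assistant turns. A minimal loader might look like the sketch below; the field names follow the common ShareGPT dump format, though the exact trace file used in this benchmark is not shown here:

```python
import json

def load_sharegpt_pairs(path):
    """Extract (prompt, reference_reply) pairs from a ShareGPT-style dump:
    a JSON list of {"conversations": [{"from": ..., "value": ...}, ...]}."""
    with open(path) as f:
        data = json.load(f)
    pairs = []
    for conv in data:
        turns = conv.get("conversations", [])
        # Pair each human turn with the assistant reply that follows it;
        # the reply length can serve as the target output length on replay.
        for cur, nxt in zip(turns, turns[1:]):
            if cur["from"] == "human" and nxt["from"] == "gpt":
                pairs.append((cur["value"], nxt["value"]))
    return pairs
```

Replaying such pairs preserves the naturally skewed input/output length distribution that fixed-length synthetic workloads miss.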
Optimization Techniques
Running an affiliate-developed model on a new GPU platform is not simply a matter of swapping hardware. The model had been developed and tested on NVIDIA GPUs, and the default open-source vLLM on AMD's ROCm stack left significant performance on the table. Moreh applied two key optimizations to close this gap and unlock MI300X's full potential:
- Custom attention backend: Multiple attention kernel implementations exist for AMD ROCm, but none consistently outperformed the others across all scenarios for this model architecture. Moreh profiled each candidate separately in prefill and decode phases, then combined the best-performing kernel for each phase into a unified custom attention backend. This alone improved output throughput and inter-token latency by 17% compared to baseline ROCm vLLM.
- GEMM tuning with shape-aware dispatch: The model's BF16 matrix multiplications were served through a generic GEMM path. Moreh built a custom dispatch layer on top of multiple GEMM backends (including aiter.tgemm and specialized skinny-GEMM kernels optimized for small batch sizes typical in decode), then tuned a shape-specific dispatch table for every GEMM shape that occurs in the model. This added another 10% improvement in output throughput and 3% in TTFT.
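The dispatch idea behind both optimizations can be sketched as a lookup keyed on what is known at call time (the execution phase, or the GEMM problem shape). The backend names and shapes below are illustrative placeholders, not the actual Moreh or ROCm kernel APIs:

```python
# Placeholder "kernels" standing in for real GEMM backends: a generic path,
# a skinny-GEMM path for the small-M shapes typical of decode, and a
# shape-tuned path. Each returns its name so the dispatch can be observed.
def generic_gemm(m, n, k):
    return "generic"

def skinny_gemm(m, n, k):
    return "skinny"

def tuned_gemm(m, n, k):
    return "tuned"

# Offline profiling produces a table mapping each exact (M, N, K) shape that
# occurs in the model to the fastest backend measured for that shape.
# These entries are hypothetical examples, not the model's real shapes.
DISPATCH_TABLE = {
    (1, 4096, 4096): skinny_gemm,    # decode-time projection, batch 1
    (2048, 4096, 4096): tuned_gemm,  # prefill-time projection
}

def dispatch_gemm(m, n, k):
    # Shapes not seen during tuning fall back to the generic path.
    backend = DISPATCH_TABLE.get((m, n, k), generic_gemm)
    return backend(m, n, k)
```

The attention backend follows the same pattern with a coarser key: one kernel selected for the prefill phase, another for decode.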
Combined, these optimizations made Moreh vLLM on MI300X up to 27% faster than baseline ROCm vLLM on the same MI300X hardware — before even comparing against H100. The results below reflect this fully optimized configuration.
Single-Request Latency
The first test measured baseline performance with a single request (no concurrent load). This isolates the raw inference speed of each platform without batching effects:
| Metric | Moreh vLLM (MI300X) | vLLM (H100) | Comparison |
|---|---|---|---|
| Output TPS (tok/s) | 186.75 | 143.39 | 1.30× higher |
| TPOT (ms) | 5.33 | 6.96 | 1.31× faster |
| End-to-End Latency (ms) | 2,913 | 3,808 | 1.31× faster |
Single request, ShareGPT workload, single GPU. TPOT = Time Per Output Token, E2EL = End-to-End Latency.
With a single request, Moreh vLLM on MI300X delivered 1.30× higher output throughput and 1.31× lower latency across all metrics. In practical terms, a user would see the full response arrive about 900 ms faster (2.9 s vs. 3.8 s) — a noticeable improvement in a conversational interface.
This advantage comes from MI300X's higher HBM3 memory bandwidth (5.3 TB/s vs. H100's 3.35 TB/s) combined with Moreh vLLM's kernel-level optimizations described above.
SLO-Compliant Maximum Serving Capacity
Raw single-request speed is useful, but production deployment decisions are driven by a different question: how many users can one GPU serve simultaneously while maintaining acceptable quality of service?
To answer this, the test gradually increased the number of concurrent requests until the system could no longer meet the Service Level Objectives (SLOs) specified by the customer:
- TTFT < 1,000 ms
- TPOT < 100 ms
These thresholds were defined by the customer based on their own service requirements. The maximum concurrency that stays within both SLOs represents the effective serving capacity of a single GPU.
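The search procedure can be sketched as follows, assuming latency grows monotonically with concurrency; `measure` stands in for running the actual benchmark at a given concurrency level:

```python
def max_slo_concurrency(measure, slo_ttft_ms=1000.0, slo_tpot_ms=100.0,
                        limit=4096):
    """Highest concurrency at which measured TTFT and TPOT both meet the SLOs.

    measure(c) -> (ttft_ms, tpot_ms) runs the benchmark at concurrency c;
    latencies are assumed to be non-decreasing in c.
    """
    def ok(c):
        ttft, tpot = measure(c)
        return ttft < slo_ttft_ms and tpot < slo_tpot_ms

    if not ok(1):
        return 0
    # Double the load until an SLO is violated (or the search cap is hit)...
    lo, hi = 1, 2
    while hi < limit and ok(hi):
        lo, hi = hi, hi * 2
    hi = min(hi, limit)
    if ok(hi):
        return hi
    # ...then binary-search the boundary between lo (passing) and hi (failing).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ok(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

In practice each `measure(c)` call is a full benchmark run at concurrency `c`, so the doubling-plus-bisection strategy keeps the number of expensive runs logarithmic in the final capacity.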
| Metric | Moreh vLLM (MI300X) | vLLM (H100) | Comparison |
|---|---|---|---|
| Max Concurrency (SLO-compliant) | 880 | 636 | 1.38× higher |
Customer-specified SLO thresholds: TTFT < 1,000 ms, TPOT < 100 ms. ShareGPT workload on a single GPU.
Moreh vLLM on MI300X achieved 1.38× higher SLO-compliant serving capacity: 880 concurrent requests per GPU vs. 636 on H100. A single MI300X can serve 38% more simultaneous sessions while keeping both TTFT and TPOT within the customer's specified bounds.
For a telco planning to serve millions of subscribers, this difference compounds at scale. If the service needs to handle 10,000 concurrent sessions, it requires roughly 12 MI300X GPUs vs. 16 H100 GPUs — a 25% reduction in GPU count from the serving capacity advantage alone, before considering hardware cost differences.
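The GPU counts above follow from simple ceiling division over the measured SLO-compliant capacities:

```python
import math

def gpus_needed(concurrent_sessions, slo_capacity_per_gpu):
    """Minimum GPU count to serve a target number of concurrent sessions."""
    return math.ceil(concurrent_sessions / slo_capacity_per_gpu)

mi300x = gpus_needed(10_000, 880)  # ceil(10000 / 880) = 12
h100 = gpus_needed(10_000, 636)    # ceil(10000 / 636) = 16
print(mi300x, h100)
```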
Model Accuracy Verification
Switching GPU platforms and inference engines introduces a risk of subtle numerical differences that could affect model output quality. To verify that the migration to MI300X with Moreh vLLM does not compromise the model's capabilities, MMLU (Massive Multitask Language Understanding, 5-shot) accuracy was measured on both platforms:
| Benchmark | Moreh vLLM (MI300X) | vLLM (H100) |
|---|---|---|
| MMLU (5-shot) | 65.25 | 65.80 |
MMLU = Massive Multitask Language Understanding, 5-shot accuracy (%).
The 0.55-point difference is well within normal variance for MMLU evaluations and confirms that Moreh vLLM's optimizations for MI300X introduce no meaningful quality degradation. The customer can deploy on MI300X with confidence that response quality will be equivalent to their H100 baseline.
TCO Analysis
Combining the performance results with hardware economics paints a clear picture for total cost of ownership (TCO):
- Serving capacity advantage: Each MI300X serves 1.38× more concurrent users than an H100, reducing the number of GPUs needed for a given workload.
- Hardware cost advantage: AMD Instinct MI300X has a lower acquisition cost than NVIDIA H100 SXM.
When both factors are combined, our internal analysis projected up to 70% better cost-efficiency for this inference workload on the MI300X + Moreh vLLM platform. For a telco deploying AI services at national scale, this translates to significant capital expenditure savings.
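How the two factors combine can be made explicit. Defining cost-efficiency as SLO-compliant capacity per unit of hardware cost, the relative gain is the capacity ratio divided by the price ratio. The price ratio below is a purely illustrative placeholder, since actual pricing is not disclosed in this case study:

```python
def cost_efficiency_gain(capacity_ratio, price_ratio):
    """Relative cost-efficiency of platform A over platform B, where
    capacity_ratio = capacity_A / capacity_B and
    price_ratio    = price_A / price_B (hardware acquisition cost)."""
    return capacity_ratio / price_ratio

# Capacity ratio is the measured 1.38x; the price ratio is hypothetical.
gain = cost_efficiency_gain(1.38, 0.81)
print(f"{(gain - 1) * 100:.0f}% better cost-efficiency")
```

Under this illustrative price assumption, the 1.38× capacity advantage compounds with the lower acquisition cost to roughly a 70% cost-efficiency gain; the actual figure depends on negotiated hardware pricing.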
Summary
This engagement with one of the major Korean telcos demonstrates that AMD Instinct MI300X, paired with Moreh vLLM, is a compelling alternative to NVIDIA H100 for production LLM serving. For their affiliate-developed 7.8B-parameter model:
- 1.30× higher single-request throughput with 1.31× lower end-to-end latency
- 1.38× higher SLO-compliant serving capacity (880 vs. 636 concurrent sessions per GPU)
- Equivalent model accuracy (MMLU 65.25 vs. 65.80)
- Up to 70% better cost-efficiency when accounting for both performance and hardware cost advantages
The affiliate-developed LLM required custom optimization work by Moreh — including a model-specific attention backend and shape-aware GEMM tuning — to run efficiently on AMD hardware. This demonstrates Moreh's ability to optimize models for AMD GPUs, enabling customers to diversify their GPU supply chain and reduce dependency on a single vendor.
Moreh provides custom vLLM optimization for models on AMD GPUs. If you are evaluating AMD Instinct GPUs for your inference workloads, contact us to discuss how we can help.