
Technical Report

TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference

February 5, 2026

Authors: Jiyoung Park, Hankyu Jang, Changseok Song, and Wookeun Jung

Read full paper on arXiv

Abstract

Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15× throughput improvement over static speculative decoding while reducing draft training time by 1.67× compared to approaches that recompute training signals.

Overview of TIDE architecture and workflow
Figure 1: Overview of TIDE architecture and workflow.

1. Introduction

Large language models (LLMs) increasingly achieve state-of-the-art performance by scaling test-time computation, particularly for reasoning-intensive tasks such as mathematics and code generation (Snell et al., 2024; Muennighoff et al., 2025). As a result, inference efficiency has become a central bottleneck for deploying modern reasoning-oriented LLMs in real-world systems.

Speculative decoding is one of the most effective techniques for accelerating LLM inference. By allowing a lightweight draft model to propose multiple tokens that are then verified in batch by a target model, speculative decoding can significantly improve throughput and latency when the draft and target models are well aligned (Leviathan et al., 2023; Chen et al., 2023). However, its effectiveness is highly sensitive to draft–target alignment: when alignment degrades, acceptance rates drop sharply and speculative decoding yields little or no performance gain.
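To make the propose-then-verify loop concrete, here is a minimal sketch of the standard speculative sampling acceptance rule from Leviathan et al. (2023) and Chen et al. (2023). The token IDs and distributions are toy values, and this omits the batched verification pass and bonus-token sampling that a real engine performs; it only shows why acceptance hinges on draft–target alignment: each proposed token is kept with probability min(1, p/q), so acceptance rates fall as the draft distribution q drifts from the target distribution p.

```python
import random

def speculative_step(draft_probs, target_probs, proposed, rng):
    """Verify a run of draft-proposed tokens against the target model.

    draft_probs[i][t]  : draft probability q_i(t) at position i
    target_probs[i][t] : target probability p_i(t) (obtained in one batched
                         verification pass over all proposed positions)
    proposed[i]        : token the draft sampled at position i
    Returns the accepted prefix; on the first rejection, appends one token
    resampled from the residual distribution max(0, p - q).
    """
    accepted = []
    for i, tok in enumerate(proposed):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):  # accept with probability min(1, p/q)
            accepted.append(tok)
            continue
        # Rejected: resample from the normalized residual max(0, p - q),
        # which preserves the target model's output distribution exactly.
        residual = {t: max(0.0, target_probs[i][t] - draft_probs[i][t])
                    for t in target_probs[i]}
        z = sum(residual.values())
        r, acc = rng.random() * z, 0.0
        for t, w in residual.items():
            acc += w
            if r <= acc:
                accepted.append(t)
                break
        return accepted
    return accepted  # all proposals accepted; the engine then samples a bonus token
```

When p and q agree exactly, every proposal is accepted and the target effectively emits several tokens per forward pass; when they diverge, the loop terminates early and the speculation overhead is wasted, which is the sensitivity the paragraph above describes.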

A fundamental challenge is that draft–target alignment is inherently workload-dependent. In production LLM services, inference workloads evolve continuously as user behavior changes, application logic is updated, and prompt templates are modified. While workloads are globally non-stationary, prior studies show that they exhibit strong short-term temporal locality, with recent inference history remaining predictive of near-future requests (Wang et al., 2024; Gim et al., 2024; Zheng et al., 2024a; Xiang et al., 2025). This suggests that alignment can, in principle, be preserved by adapting to recent inference behavior, even as long-term distributions shift.

Recent work has explored this opportunity by adapting draft models online using inference-time signals, for example via online distillation from target model corrections or logits (Zhou et al., 2024; Yan et al., 2025). While these approaches demonstrate that alignment can be recovered under distribution shift, they primarily focus on the learning algorithm itself. Whether online draft training can be integrated into high-performance inference engines in a way that yields sustained end-to-end throughput improvements remains an open systems-level question.

In practice, addressing this question requires careful coordination between learning and serving. Online draft training must introduce minimal interference to latency-critical inference, operate under realistic resource constraints, and adapt only when beneficial. Because the performance impact of speculative decoding varies across workload phases, continuous speculation or training is often unnecessary and can even be counterproductive. Effective deployment therefore requires dynamic runtime control over when to speculate and when to train, based solely on signals observable during inference serving.

To address these challenges, we introduce Temporal Incremental Draft Engine (TIDE), a serving-engine-native framework for adaptive speculative decoding under evolving workloads. Rather than treating draft adaptation as an isolated learning problem, TIDE jointly manages training signal collection, draft model updates, and speculative decoding decisions entirely within the inference serving engine.

TIDE exploits short-term temporal locality by incrementally adapting the draft model based on recent inference behavior, while dynamically controlling when speculative decoding and training are beneficial. Crucially, TIDE generates training data with zero additional inference overhead by reusing intermediate hidden representations already computed by the target model during verification, eliminating the need to reload or recompute target model activations during training.
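The mechanics of this reuse can be sketched as a producer–consumer handoff between serving and training. The sketch below is a simplified illustration under assumptions not stated in the report (the class name, capacity, and drop-on-full policy are hypothetical): the serving side enqueues (hidden state, accepted token) pairs it has already computed during verification, never blocking the latency-critical path, and the decoupled trainer drains them in batches, so no target-model forward pass is ever rerun to produce training signals.

```python
import queue

class HiddenStateBuffer:
    """Toy zero-overhead signal collection: verification byproducts flow from
    the serving engine to the draft trainer. Names/policies are illustrative."""

    def __init__(self, capacity=1024):
        self.q = queue.Queue(maxsize=capacity)

    def put_from_inference(self, hidden, token):
        """Called on the serving side after verification; must never block."""
        try:
            self.q.put_nowait((hidden, token))
            return True
        except queue.Full:
            return False  # drop the sample: serving latency comes first

    def drain_for_training(self, max_batch):
        """Called by the decoupled trainer; collects up to max_batch samples."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self.q.get_nowait())
            except queue.Empty:
                break
        return batch
```

The key property is that the producer side only hands off tensors the target model materialized anyway, which is what makes the training-data generation free from the serving engine's perspective.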

Finally, TIDE decouples inference serving and draft training to enable efficient deployment under realistic hardware constraints. In our evaluation, we demonstrate that inference serving on NVIDIA H100 GPUs can be paired with draft model training on AMD Instinct MI250 GPUs, improving overall system throughput while maintaining high speculative decoding performance.

In summary, our main contributions are:

  • We propose TIDE, a serving-engine-native framework for adaptive speculative decoding that incrementally maintains draft–target alignment under non-stationary inference workloads.
  • We enable zero-overhead training data generation by reusing intermediate hidden states computed during inference, allowing efficient draft training without loading the large target model.
  • We introduce adaptive runtime control mechanisms that determine when to speculate and when to train, avoiding unnecessary overhead under unfavorable workload conditions.
  • We demonstrate effective heterogeneous GPU utilization by decoupling inference and training, running inference on NVIDIA H100 GPUs and draft training on AMD MI250 GPUs.
  • We implement a complete TIDE prototype and show consistent system-level throughput improvements across diverse real-world workload patterns.

5. Evaluation

5.5. Heterogeneous GPU Allocation

We evaluate TIDE's performance benefits when deploying on heterogeneous GPU clusters with varying compute capabilities. Figure 11 presents a throughput comparison for inference and draft model training across different GPU types, normalized to the MI250 baseline. The results reveal a disproportionate throughput gap between inference and training workloads. For inference, H100 achieves 6.76× higher throughput than MI250, with MI300X at 4.42×. For training, however, the gap is much smaller: H100 shows only a 2.44× improvement over MI250, with MI300X at 1.77×. This disparity motivates TIDE's heterogeneous resource allocation strategy, in which lower-end GPUs like MI250 contribute more effectively to training while higher-end GPUs handle inference workloads.

Figure 11: Per-GPU throughput comparison for inference and draft model training, normalized to MI250 baseline. Inference throughput measured on gpt-oss-120b with ShareGPT dataset using SGLang. Training throughput measured on single nodes with 8 GPU devices using PyTorch with FSDP parallelization.
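The allocation argument can be made explicit with a back-of-envelope calculation over the Figure 11 ratios. The "opportunity cost" metric below is our own illustrative framing, not a quantity from the report: it asks how much inference capacity each GPU sacrifices per unit of training it provides, normalized so H100 equals 1.

```python
# Per-GPU throughput relative to MI250, from Figure 11.
inference = {"MI250": 1.00, "MI300X": 4.42, "H100": 6.76}
training  = {"MI250": 1.00, "MI300X": 1.77, "H100": 2.44}

# Inference capacity forgone per unit of training delivered, H100-normalized.
cost = {g: (inference[g] / training[g]) / (inference["H100"] / training["H100"])
        for g in inference}
# MI250 delivers a unit of draft training at roughly 0.36x the inference
# opportunity cost of H100, so training belongs on the lower-end GPUs.
```

Under this framing, MI250 is by far the cheapest place to run draft training, which is exactly the mapping TIDE adopts.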

To quantify the benefits of this approach, we evaluate TIDE across four diverse datasets, comparing two resource allocation strategies: (1) all GPUs performing inference with speculative decoding disabled, and (2) TIDE allocating a single MI250 node with 4 GPUs for draft model training while a single H100 node with 8 GPUs handles inference. Figure 10 shows that TIDE achieves 1.08–1.22× throughput improvement over the all-inference baseline. The improvement correlates with the speculative decoding speedup achieved through draft model training, ranging from s=1.15 (ShareGPT, 1.08× throughput) to s=1.30 (Science, 1.22× throughput). These variations reflect differences in output distribution characteristics and draft model learning difficulty across datasets. For instance, the Science dataset's more structured output enables better draft model learning, resulting in higher acceptance rates and greater speedup. This result demonstrates that TIDE's benefits vary with dataset characteristics and highlights the importance of considering workload properties when deploying heterogeneous training strategies.

Figure 10: Relative throughput comparison between all-inference baseline and TIDE across four datasets using a single MI250 node with 4 GPUs for draft model training and a single H100 node with 8 GPUs for inference. Values in parentheses indicate the speculative decoding speedup (s) achieved through draft model training on each dataset.
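The reported range is consistent with a simple capacity model built from the Figure 11 per-GPU numbers. The model below is our own idealized back-of-envelope check (it ignores training-side overheads and any interference), not an analysis from the paper: the baseline puts every GPU on serving, while TIDE gives up the MI250s' serving capacity in exchange for a speculative speedup s on the H100s.

```python
H100_INF, MI250_INF = 6.76, 1.00   # per-GPU inference throughput vs. MI250 (Fig. 11)
N_H100, N_MI250 = 8, 4             # node sizes used in this experiment

def relative_throughput(s):
    """TIDE (H100s speculate with speedup s, MI250s train) relative to the
    all-inference baseline (every GPU serves, speculation disabled)."""
    baseline = N_H100 * H100_INF + N_MI250 * MI250_INF
    tide = N_H100 * H100_INF * s   # MI250s contribute no serving capacity
    return tide / baseline
```

Plugging in the measured speedups gives relative_throughput(1.15) ≈ 1.07 and relative_throughput(1.30) ≈ 1.21, closely matching the 1.08×–1.22× range of Figure 10 and confirming that the MI250s' lost serving capacity is more than recovered by faster speculation on the H100s.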

Please read the full paper on arXiv.