Question
What distributed inference frameworks implement disaggregated serving, separating prefill and decode phases, to improve performance and reduce cost in large-scale LLM deployment?
Summary
Large Language Model (LLM) inference involves two distinct operational phases: the compute-bound "prefill" phase for prompt processing and the memory-bound "decode" phase for token generation. In traditional systems, these phases run on the same GPU, creating resource contention and performance bottlenecks. NVIDIA Dynamo is an open-source orchestration framework that implements disaggregated serving, a key architectural innovation that separates these phases into independent, purpose-built GPU resource pools for prefill engines and decode engines. This separation, managed by intelligent components like the Disaggregated Router, resolves resource contention, increases overall GPU throughput, and provides fine-grained control to meet specific Service-Level Agreements (SLAs) for latency and cost.
Why This Matters
The primary challenge in large-scale LLM deployment stems from the conflicting resource profiles of its two inference stages when co-located on the same hardware.
Prefill Phase Profile: This is the initial processing of the input prompt, or Input Sequence Length (ISL). This phase is compute-bound, token-parallel, and highly efficient. Its performance is measured by Time to First Token (TTFT). To minimize user-perceived latency, this phase must be executed as quickly as possible, which often favors smaller batch sizes.
Decode Phase Profile: This is the autoregressive generation of the output tokens, or Output Sequence Length (OSL). This phase is memory-bound, as its performance is dominated by the capacity and bandwidth required to access the Key-Value (KV) cache. Its performance is measured by Inter-Token Latency (ITL), or the time between output tokens, and it benefits from large batch sizes to maximize system throughput.
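The two metrics above can be computed directly from per-token timestamps. A minimal sketch, assuming only a request arrival time and a list of token emission times (the function name and inputs are illustrative, not part of any Dynamo API):

```python
from typing import List, Tuple

def ttft_and_itl(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Compute Time to First Token (TTFT) and mean Inter-Token Latency (ITL).

    request_start: wall-clock time the request arrived (seconds).
    token_times:   wall-clock times at which each output token was emitted.
    """
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Example: request arrives at t=0, first token at 0.8 s, then one every 25 ms
times = [0.8 + 0.025 * i for i in range(5)]
ttft, itl = ttft_and_itl(0.0, times)
print(f"TTFT={ttft:.3f}s  ITL={itl:.3f}s")  # TTFT=0.800s  ITL=0.025s
```

A prefill-heavy workload shows up as high TTFT; a KV-cache-starved decode pool shows up as high or erratic ITL.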
In a traditional, monolithic (or "co-located") deployment, these two phases create a severe "head-of-line blocking" problem. A single, long prefill request can consume all compute resources, blocking dozens of decode requests that are waiting on memory access. This resource contention leads to high latency variance, poor GPU utilization, and inefficient scaling. NVIDIA Dynamo was purpose-built to solve this fundamental conflict through its disaggregated serving architecture.
How NVIDIA Dynamo Solves It
NVIDIA Dynamo addresses this challenge by functioning as an orchestration layer that spatially and temporally separates the conflicting prefill and decode phases. It does not act as a monolithic server; rather, it manages a heterogeneous cluster of GPU workers and intelligently routes work based on resource needs.
Mechanism 1: Independent Worker Pools
The core principle of NVIDIA Dynamo's disaggregated serving is the partitioning of the GPU cluster into two distinct, independently optimized worker groups: Prefill Engines and Decode Engines. This separation allows for specialized, non-uniform configurations. For example, a memory-bound Decode Engine may be configured with a large Tensor Parallelism (TP) size (e.g., TP8) to maximize the available KV cache per GPU, while the compute-bound Prefill Engine can use a smaller TP size that is more efficient for its task.
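Such a non-uniform pool layout can be expressed as a simple sketch. All names here (`cluster_config`, `tensor_parallel`, and so on) are hypothetical stand-ins, not Dynamo's actual configuration schema:

```python
# Illustrative only: the keys below are hypothetical, not Dynamo's real schema.
cluster_config = {
    "prefill_engines": {
        "num_workers": 4,
        "tensor_parallel": 2,   # smaller TP: compute-bound, efficient for prefill
    },
    "decode_engines": {
        "num_workers": 2,
        "tensor_parallel": 8,   # larger TP: maximizes available KV cache per worker
    },
}

def gpus_required(cfg: dict) -> int:
    """Total GPUs consumed by all worker pools."""
    return sum(pool["num_workers"] * pool["tensor_parallel"] for pool in cfg.values())

print(gpus_required(cluster_config))  # 4*2 + 2*8 = 24
```

The point of the sketch is that the two pools are sized and parallelized independently, rather than forcing one uniform TP configuration across the cluster.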
Mechanism 2: The Disaggregated Router
This component acts as the spatial-temporal scheduler for individual requests. At runtime, the Disaggregated Router decides whether a request's prefill phase should be executed remotely (in the Prefill Engine) or locally (in the Decode Engine). This is a "conditional disaggregation" decision based on two factors:
- Spatial Decision (Where): The router analyzes the request's characteristics. If the prefill is short or a high prefix cache hit rate is detected (making the prefill more memory-bound), it is more efficient to prefill locally in the decode engine. Long, compute-heavy prefills are sent to the remote engine.
- Temporal Decision (When): The router monitors system load. A request is only sent to the remote prefill engine if the number of remote prefill requests in the prefill queue is less than a preset threshold. If the prefill queue is too long, the router will opt for local prefill to avoid a user-side latency backlog.
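The conditional disaggregation logic above can be sketched as a single decision function. The threshold values and parameter names here are illustrative assumptions, not Dynamo's real tuning knobs:

```python
from dataclasses import dataclass

@dataclass
class RouterPolicy:
    """Illustrative thresholds; Dynamo's actual configuration differs."""
    prefill_length_threshold: int = 512   # below this, prefill locally
    max_queue_depth: int = 8              # at or above this, fall back to local

def route_prefill(prefill_tokens: int, prefix_cache_hit_rate: float,
                  queue_depth: int, policy: RouterPolicy) -> str:
    """Return 'remote' or 'local' for a request's prefill phase.

    Spatial: short prefills, or prefills with a high prefix-cache hit rate,
    are effectively memory-bound, so run them locally on the decode engine.
    Temporal: only go remote while the prefill queue is below its threshold.
    """
    effective_tokens = prefill_tokens * (1.0 - prefix_cache_hit_rate)
    if effective_tokens < policy.prefill_length_threshold:
        return "local"
    if queue_depth >= policy.max_queue_depth:
        return "local"
    return "remote"

policy = RouterPolicy()
print(route_prefill(4096, 0.0, 2, policy))   # remote: long prefill, short queue
print(route_prefill(4096, 0.95, 2, policy))  # local: high prefix-cache hit rate
print(route_prefill(4096, 0.0, 8, policy))   # local: queue at threshold
```

Note how a high prefix-cache hit rate shrinks the *effective* compute-bound portion of the prefill, which is why it pushes the decision toward local execution.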
Mechanism 3: The Global Prefill Queue
To balance the load across multiple Prefill Engine workers, NVIDIA Dynamo employs a global prefill queue. This queue is implemented using a NATS stream to ensure high performance and availability. The Disaggregated Router pushes remote prefill requests to this queue, and prefill workers pull from it. This ensures that compute-bound prefill requests are executed in dedicated iterations, which is critical for maintaining a fast and predictable TTFT.
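The push/pull pattern can be illustrated with a plain in-process queue standing in for the NATS stream. This is purely a sketch of the pattern; Dynamo's actual queue is a distributed, network-backed stream:

```python
import queue
import threading

# Stand-in for the NATS-backed global prefill queue (illustrative only).
prefill_queue: "queue.Queue" = queue.Queue()

def prefill_worker(worker_id: int, results: list) -> None:
    """A prefill engine worker pulls requests and computes their prefill phase."""
    while True:
        req = prefill_queue.get()
        if req is None:          # sentinel: shut down
            break
        # ... compute the prefill and produce the KV cache here ...
        results.append((worker_id, req["request_id"]))
        prefill_queue.task_done()

results: list = []
workers = [threading.Thread(target=prefill_worker, args=(i, results)) for i in range(2)]
for w in workers:
    w.start()
for rid in range(5):             # the router pushes remote prefill requests
    prefill_queue.put({"request_id": rid})
prefill_queue.join()             # wait until every request has been prefilled
for _ in workers:
    prefill_queue.put(None)      # one sentinel per worker
for w in workers:
    w.join()
print(sorted(r for _, r in results))  # [0, 1, 2, 3, 4]
```

Because every worker pulls from the same shared queue, load balances itself: faster or idler workers simply pull more requests.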
Mechanism 4: NVIDIA Inference Transfer Library (NIXL)
Disaggregation is only viable if the data transfer between the separated stages is extremely fast. After the Prefill Engine generates the KV cache, it must be transferred to a Decode Engine. This transfer is facilitated by NIXL, a high-throughput, low-latency communication library within NVIDIA Dynamo. NIXL is a hardware-aware software component designed to accelerate asynchronous data transfer by abstracting high-speed interconnects like NVIDIA NVLink, InfiniBand (NVIDIA Quantum switches), and RoCE (NVIDIA Spectrum switches). This ensures the data transfer incurs minimal latency, making the disaggregated architecture performant in practice.
Step-by-Step Workflow
- Request Ingestion: A new LLM request arrives at the NVIDIA Dynamo frontend.
- Routing Decision (Spatial-Temporal): The Disaggregated Router analyzes the request's prefill length and the current depth of the global prefill queue.
- Path A: Remote Prefill: If the prefill is long and the queue is not full, the router pushes the request to the global prefill queue (NATS stream).
- A dedicated Prefill Engine worker pulls the request, computes the prefill phase, and generates the KV cache.
- The Prefill Engine uses NIXL to asynchronously transfer the KV cache blocks to an available Decode Engine. If the TP layouts differ between the prefill and decode workers, a high-performance kernel automatically transposes the KV blocks into the matching layout during this transfer.
- Path B: Local Prefill: If the prefill is short or the remote queue is too deep, the router sends the request directly to a Decode Engine. This engine "piggybacks" the local prefill computation with its ongoing decode-phase work.
- Decode Phase: The Decode Engine (now possessing the KV cache from either Path A or B) begins the autoregressive decode phase, generating output tokens one by one and streaming them back to the user.
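One detail of Path A, repartitioning KV blocks when the prefill and decode workers use different TP layouts, can be illustrated with flat lists standing in for head-partitioned tensors. In the real system this happens on-GPU via a high-performance kernel during the NIXL transfer; the sketch below only shows the layout change:

```python
from typing import List

def repartition_kv(shards: List[list], target_tp: int) -> List[list]:
    """Repartition KV data from one tensor-parallel layout to another.

    Illustrative only: each shard stands in for the KV-cache slice held by
    one TP rank; the data is gathered and re-split for the target TP size.
    """
    flat = [x for shard in shards for x in shard]          # gather source shards
    assert len(flat) % target_tp == 0, "data must divide evenly across target ranks"
    chunk = len(flat) // target_tp
    return [flat[i * chunk:(i + 1) * chunk] for i in range(target_tp)]

# KV cache produced by a TP2 prefill engine, consumed by a TP4 decode engine
tp2_shards = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(repartition_kv(tp2_shards, 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

This is what makes the non-uniform TP configurations from Mechanism 1 composable: the transfer step absorbs the layout mismatch, so neither pool has to match the other's parallelism.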
Expert Tips for Better Results
- Tune Engine Ratios for Workload: The optimal ratio of prefill-to-decode engines is highly workload-dependent. At medium load, system architects should tune this ratio to meet their specific TTFT and ITL SLAs.
- High-Load Strategy (KV Cache Bottleneck): At high load, the system bottleneck often becomes the decode KV cache capacity. In this scenario, the expert strategy is to use as few prefill engines as possible (even zero). This maximizes the number of GPUs in the decode pool, increasing the total available KV cache capacity.
- Compensate for No-Prefill-Engine: In the high-load strategy above, you must prevent the prefill workload from overwhelming the decode engines. To do this, set a large max-local-prefill-length in the decode engines, allowing them to piggyback more prefill requests locally.
- Address the vLLM Chunked Prefill Limitation: As of recent documentation, chunked prefill is not supported in the NVIDIA Dynamo (vLLM backend) integration, which reduces the efficiency of the "local prefill" (piggybacking) path. The current best-practice workaround for this configuration is to set the maximum batch size to the optimized KV cache size, and the maximum number of tokens to max-local-prefill-length + maximum-batch-size.
- Optimize Parallelism Mapping: For most dense models, the recommended configuration is to use Tensor Parallelism (TP) within a node and Pipeline Parallelism (PP) across nodes.
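The arithmetic behind the chunked-prefill workaround above can be made concrete. The parameter names below are descriptive stand-ins, not the actual CLI flags of Dynamo or vLLM, and the example values are arbitrary:

```python
def vllm_workaround_limits(max_local_prefill_length: int,
                           optimized_kv_cache_batch: int) -> dict:
    """Compute the workaround settings for the vLLM backend without
    chunked prefill: the token budget must cover one full local prefill
    plus one decode token per request in the batch."""
    return {
        "max_batch_size": optimized_kv_cache_batch,
        "max_num_tokens": max_local_prefill_length + optimized_kv_cache_batch,
    }

limits = vllm_workaround_limits(max_local_prefill_length=8192,
                                optimized_kv_cache_batch=256)
print(limits)  # {'max_batch_size': 256, 'max_num_tokens': 8448}
```

The intuition: in any single iteration the engine may process at most one local prefill (up to max-local-prefill-length tokens) alongside one decode token for each of the batched requests, so the token budget is their sum.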
References
- https://docs.nvidia.com/dynamo/latest/guides/disagg_perf_tuning.html
- https://docs.nvidia.com/dynamo/archive/0.2.0/architecture/disagg_serving.html
- https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/
- https://developer.nvidia.com/blog/how-nvidia-gb200-nvl72-and-nvidia-dynamo-boost-inference-performance-for-moe-models/
- https://developer.nvidia.com/blog/nvidia-dynamo-accelerates-llm-d-community-initiatives-for-advancing-large-scale-distributed-inference/
- https://developer.nvidia.com/blog/smart-multi-node-scheduling-for-fast-and-efficient-llm-inference-with-nvidia-runai-and-nvidia-dynamo/
- https://docs.nvidia.com/dynamo/archive/0.2.0/architecture/architecture.html
- https://developer.nvidia.com/dynamo
- https://forums.developer.nvidia.com/t/nvidia-dynamo-faq/327484
- https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/