# Dynamo Architecture Flow

Source: https://docs.nvidia.com/dynamo/latest/architecture/dynamo_flow.html
Published: Tue, 14 Oct 2025 15:19:31 GMT
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in `components/backends/vllm`. Color-coded flows indicate different types of operations:
## 🔵 Main Request Flow (Blue)

The primary user journey through the system:

- Discovery (S1): Client discovers the service endpoint
- Request (S2): HTTP client sends an API request to the Frontend (OpenAI-compatible server on port 8000)
- Validate (S3): Frontend forwards the request to the Processor for validation and routing
- Route (S3): Processor routes the validated request to an appropriate Decode Worker
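The S2 entry point can be sketched by building the kind of OpenAI-compatible request body the Frontend accepts on port 8000. The URL, model name, and helper function below are illustrative assumptions, not part of Dynamo's API surface:

```python
import json

# Hypothetical endpoint: the Frontend serves an OpenAI-compatible API
# on port 8000 (step S2 in the flow above).
FRONTEND_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """Serialize an OpenAI-style chat completion request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("example-model", "Hello!")
# The Frontend would receive this body via HTTP POST, then hand the
# parsed request to the Processor for validation and routing (S3).
print(json.loads(body)["model"])  # -> example-model
```

In the real system this body would be POSTed to the Frontend; the sketch stops at serialization so it stays self-contained.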
## 🟠 Decision and Allocation Flow (Orange)

The system's intelligent routing and resource allocation:

- Query (S4): Decode Worker queries for prefix cache hits to optimize processing
- Disagg Decision (S5): Based on prefill length and queue size, the system decides whether it needs remote prefill
  - Allocate (S5a): Decode Worker pre-allocates KV cache blocks in its local GPU memory
- Queue (S6): If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
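A minimal sketch of the S5 decision, assuming hypothetical thresholds on prefill length and queue depth; the actual policy and its configuration keys live in the vLLM backend and may differ:

```python
from dataclasses import dataclass

@dataclass
class DisaggPolicy:
    """Illustrative disagg-decision policy (S5). Field names and
    defaults are assumptions, not Dynamo's real configuration."""
    max_local_prefill_tokens: int = 512   # longer prefills go remote
    max_prefill_queue_depth: int = 8      # back off if the queue is full

    def needs_remote_prefill(self, prefill_len: int, queue_depth: int) -> bool:
        # Offload only when the prefill is long enough to be worth the
        # transfer AND the PrefillQueue has capacity to absorb it.
        return (
            prefill_len > self.max_local_prefill_tokens
            and queue_depth < self.max_prefill_queue_depth
        )

policy = DisaggPolicy()
print(policy.needs_remote_prefill(prefill_len=2048, queue_depth=3))  # True
print(policy.needs_remote_prefill(prefill_len=128, queue_depth=3))   # False
```

If the decision comes out `True`, the Decode Worker pre-allocates blocks (S5a) and enqueues the RemotePrefillRequest (S6); otherwise it prefills locally.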
## 🟢 Prefill Worker Flow (Green)

The dedicated prefill processing pipeline:

- NATS Pull (S7): PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
- Load Metadata (S8): PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
- Prefill (S9): Worker executes the prefill computation on the input tokens
- NIXL Transfer (S10): Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker's pre-allocated blocks
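The pull-based worker loop (S7–S10) can be modeled with an in-process queue standing in for NATS and a plain dict standing in for the Decode Worker's pre-allocated GPU blocks. All names and the "compute" step are illustrative stand-ins:

```python
import queue
from dataclasses import dataclass

@dataclass
class RemotePrefillRequest:
    request_id: str
    token_ids: list[int]
    block_ids: list[int]  # Decode Worker's pre-allocated KV blocks (S5a)

# Stand-in for the NATS-backed PrefillQueue: workers pull items, so
# each request is processed by exactly one PrefillWorker (S7).
prefill_queue: "queue.Queue[RemotePrefillRequest]" = queue.Queue()

# Stand-in for the Decode Worker's GPU memory: block_id -> KV data.
decode_worker_kv: dict[int, list[int]] = {}

def prefill_worker_step(block_size: int = 64) -> str:
    req = prefill_queue.get()            # S7: pull one work item
    kv = [t * 2 for t in req.token_ids]  # S9: fake "prefill" compute
    for i, block_id in enumerate(req.block_ids):
        # S10: write each block into the Decode Worker's pre-allocated
        # slots (the real system does this with a NIXL GPU-to-GPU write).
        decode_worker_kv[block_id] = kv[i * block_size:(i + 1) * block_size]
    return req.request_id

prefill_queue.put(RemotePrefillRequest("r1", list(range(128)), [10, 11]))
done = prefill_worker_step()
print(done, sorted(decode_worker_kv))  # r1 [10, 11]
```

Writing by block ID is what makes the transfer addressable: the Decode Worker chose the destination blocks up front, so the PrefillWorker never needs to negotiate placement.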
## 🟣 Completion Flow (Purple)

The response generation and delivery:

- Notify (S11): PrefillWorker sends a completion notification to the Decode Worker
- Decode (S12): Decode Worker decodes from its local KV cache containing the prefilled data
- Response (S13): The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
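The notify-then-decode handoff (S11–S12) is essentially an event wait: the Decode Worker must not start decoding until the prefilled KV data has landed. A minimal `asyncio` sketch, with the event standing in for the real completion notification:

```python
import asyncio

async def decode_worker(prefill_done: asyncio.Event, kv_cache: dict) -> str:
    # S11/S12: wait for the PrefillWorker's completion notification,
    # then decode from the locally held, now-prefilled KV cache.
    await prefill_done.wait()
    return f"decoded {len(kv_cache)} blocks"

async def prefill_worker(prefill_done: asyncio.Event, kv_cache: dict) -> None:
    kv_cache[0] = "prefilled"  # stand-in for the NIXL write (S10)
    prefill_done.set()         # S11: notify the Decode Worker

async def main() -> str:
    done = asyncio.Event()
    kv: dict = {}
    decoded, _ = await asyncio.gather(decode_worker(done, kv),
                                      prefill_worker(done, kv))
    return decoded

print(asyncio.run(main()))  # decoded 1 blocks
```

Because the blocks were pre-allocated in S5a, the notification carries no payload in this sketch; arrival of the event alone means the local cache is ready to decode from.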
## 🔗 Infrastructure Connections (Dotted lines)

Coordination and messaging support:

### ETCD Connections (Gray, dotted)

- Frontend, Processor, Planner: Service discovery and registration
- Decode Worker, PrefillWorker: NIXL metadata storage for GPU communication setup

### NATS Connections (Teal, dotted)

- PrefillQueue: JetStream consumer group for reliable work distribution
- Processor: Load balancing across workers

### Planning Connections (Gold, dotted)

- Frontend → Planner: Metrics collection for auto-scaling decisions
- Planner → Workers: Resource scaling commands for both Decode Worker and PrefillWorker
## Technical Implementation Details

### NIXL (NVIDIA Inference Xfer Library)

- Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
- Decode Worker publishes GPU metadata to ETCD for coordination
- PrefillWorker loads metadata to establish direct communication channels
- Block-based transfers (64–128 tokens per block) for efficient batching
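The block-based accounting is a simple ceiling division: a prompt's KV cache occupies as many fixed-size blocks as its token count requires, with the last block possibly partial. A small sketch using the block sizes quoted above:

```python
import math

def blocks_needed(num_tokens: int, block_size: int = 64) -> int:
    """Number of fixed-size KV blocks a prefill of num_tokens occupies.
    The default of 64 is the low end of the 64-128 range noted above."""
    return math.ceil(num_tokens / block_size)

# A 1000-token prompt spans 16 blocks of 64 tokens (the 16th is only
# partially filled), or 8 blocks of 128 tokens.
print(blocks_needed(1000))       # 16
print(blocks_needed(1000, 128))  # 8
```

This block count is what the Decode Worker pre-allocates in S5a and what the RemotePrefillRequest's block ID list enumerates.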
### Disaggregated KV Cache

- Each Decode Worker maintains a local KV cache in its GPU memory
- No shared storage bottlenecks: all transfers are direct worker-to-worker
- Pre-allocated blocks ensure a deterministic memory layout and performance
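One way to see why pre-allocation gives deterministic behavior: if all blocks are carved out of GPU memory up front, "allocation" reduces to handing out free block IDs from a pool, with no runtime memory allocation at all. A hypothetical sketch (not Dynamo's actual allocator):

```python
class BlockPool:
    """Illustrative fixed-size KV block pool: every block exists up
    front, so allocate/release just move block IDs between lists."""
    def __init__(self, num_blocks: int) -> None:
        self._free = list(range(num_blocks))

    def allocate(self, n: int) -> list[int]:
        if n > len(self._free):
            raise MemoryError("not enough free KV blocks")
        ids, self._free = self._free[:n], self._free[n:]
        return ids

    def release(self, ids: list[int]) -> None:
        self._free.extend(ids)

pool = BlockPool(num_blocks=4)
ids = pool.allocate(2)  # S5a: reserve blocks before enqueueing prefill
print(ids)              # [0, 1]
pool.release(ids)
```

Because block IDs map to fixed GPU addresses, the PrefillWorker can target its NIXL writes at those addresses without any coordination beyond the ID list in the request.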