Dynamo Architecture Flow — NVIDIA Dynamo Documentation
Title: Dynamo Architecture Flow — NVIDIA Dynamo Documentation
URL Source: https://docs.nvidia.com/dynamo/latest/design_docs/dynamo_flow.html
Published Time: Thu, 30 Oct 2025 05:14:55 GMT
Markdown Content: Skip to main content
Back to top Ctrl+K
latest
latest0.6.00.5.10.5.00.4.10.4.00.3.20.3.10.3.00.2.10.2.0
Search Ctrl+K
Search Ctrl+K
latest
latest0.6.00.5.10.5.00.4.10.4.00.3.20.3.10.3.00.2.10.2.0
Table of Contents
Getting Started
Kubernetes Deployment
User Guides
Components
Design Docs
-
Dynamo Architecture Flow
Dynamo Architecture Flow#
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in components/backends/vllm. Color-coded flows indicate different types of operations:
🔵 Main Request Flow (Blue)#
The primary user journey through the system:
-
Discovery (S1): Client discovers the service endpoint
-
Request (S2): HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
-
Validate (S3): Frontend forwards request to Processor for validation and routing
-
Route (S3): Processor routes the validated request to appropriate Decode Worker
🟠 Decision and Allocation Flow (Orange)#
The system’s intelligent routing and resource allocation:
-
Query (S4): Decode Worker queries for prefix cache hits to optimize processing
-
Disagg Decision (S5): Based on prefill length and queue size, the system decides whether it needs remote prefill 5a. Allocate (S5a): Decode Worker pre-allocates KV cache blocks in its local GPU memory
-
Queue (S6): If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
🟢 Prefill Worker Flow (Green)#
The dedicated prefill processing pipeline:
-
NATS Pull (S7): PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers
-
Load Metadata (S8): PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
-
Prefill (S9): Worker executes the prefill computation on the input tokens
-
NIXL Transfer (S10): Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker’s pre-allocated blocks
🟣 Completion Flow (Purple)#
The response generation and delivery:
-
Notify (S11): PrefillWorker sends completion notification to Decode Worker
-
Decode (S12): Decode Worker decodes from its local KV cache containing prefilled data
-
Response (S13): The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
🔗 Infrastructure Connections (Dotted lines)#
Coordination and messaging support:
ETCD Connections (Gray, dotted)#
-
Frontend, Processor, Planner: Service discovery and registration
-
Decode Worker, PrefillWorker: NIXL metadata storage for GPU communication setup
NATS Connections (Teal, dotted)#
-
PrefillQueue: JetStream consumer group for reliable work distribution
-
Processor: Load balancing across workers
Planning Connections (Gold, dotted)#
-
Frontend → Planner: Metrics collection for auto-scaling decisions
-
Planner → Workers: Resource scaling commands for both Decode Worker and PrefillWorker
Technical Implementation Details#
NIXL (NVIDIA Interchange Library):#
-
Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe
-
Decode Worker publishes GPU metadata to ETCD for coordination
-
PrefillWorker loads metadata to establish direct communication channels
-
Block-based transfers (64–128 tokens per block) for efficient batching
Disaggregated KV Cache:#
-
Each Decode Worker maintains local KV cache in its GPU memory
-
No shared storage bottlenecks—all transfers are direct worker-to-worker
-
Pre-allocated blocks ensure deterministic memory layout and performance
previous High Level Architecturenext Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
On this page
Privacy Policy | Manage My Privacy | Do Not Sell or Share My Data | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact
Copyright © 2024-2025, NVIDIA CORPORATION & AFFILIATES.
NVIDIA uses cookies to improve your experience on our web site. We and our third-party partners also use cookies and other tools to collect and record information you provide as well as information about your interactions with our websites for performance improvement, analytics, and to assist in marketing efforts. By continuing to use this site or by clicking one of the buttons below, you agree to the use of cookies and other tools as described in our Privacy Policy and Cookie Policy (subject to your settings) and accept our Terms of Service (which contains important waivers). Please see our Privacy Policy for more information on our privacy practices.
We have detected the Global Privacy Control (GPC) signal and have opted you out of all optional cookies on this site for this browser. You can manage your cookie settings by clicking on "Manage Settings". Please see our Cookie Policy for more information. To opt out of non-cookie personal information "sales" / "sharing" for targeted advertising purposes, please visit the NVIDIA Preference Center. Please see our Privacy Policy for more information on our privacy practices.
We have detected the Global Privacy Control Signal (GPC) and have opted you out of all optional cookies on this browser. You can manage your cookie settings by clicking on "Manage Settings". Please see our Cookie Policy for more information. We have also opted you out of "sharing"/"sales" of personal information outside of cookies. You can manage these settings in the NVIDIA NVIDIA Preference Center. Please see our Privacy Policy for more information.
We have detected the Global Privacy Control Signal (GPC) and have opted you out of all optional cookies on this browser. You can manage your cookie settings by clicking on "Manage Settings". Please see our Cookie Policy for more information. We have also opted you out of "sharing"/"sales" of personal information outside of cookies which overrides at least one of your previous settings. You can manage them in the NVIDIA Preference Center. Please see our Privacy Policy for more information.
Manage Settings
Turn Off Optional Cookies Agree

Cookie Settings
We and our third-party partners (including social media, advertising, and analytics partners) use cookies and other tracking technologies to collect, store, monitor, and process certain information about you when you visit our website. The information collected might relate to you, your preferences, or your device. We use that information to make the site work, analyze performance and traffic on our website, provide a more personalized web experience, and assist in our marketing efforts.
Under certain privacy laws, you have the right to direct us not to "sell" or "share" your personal information for targeted advertising. To opt-out of the "sale" and "sharing" of personal information through cookies, you must opt-out of optional cookies using the toggles below. To opt out of the "sale" and "sharing" of data collected by other means (e.g., online forms) you must also update your data sharing preferences through the NVIDIA Preference Center.
Click on the different category headings below to find out more and change the settings according to your preference. You cannot opt out of Required Cookies as they are deployed to ensure the proper functioning of our website (such as prompting the cookie banner and remembering your settings, etc.). By clicking "Save and Accept" or "Decline All" at the bottom, you consent to the use of cookies and other tools as described in our Cookie Policy in accordance with your settings and accept our Terms of Service (which contains important waivers). For more information about our privacy practices, please see our Privacy Policy.
Required Cookies
Always Active
These cookies enable core functionality such as security, network management, and accessibility. These cookies are required for the site to function and cannot be turned off.
Cookies Details
Performance Cookies
- Performance Cookies
These cookies are used to provide quantitative measures of our website visitors, such as the number of times you visit, time on page, your mouse movements, scrolling, clicks and keystroke activity on the websites; other browsing, search, or product research behavior; and what brought you to our site. These cookies may store a unique ID so that our system will remember you when you return. Information collected with these cookies is used to measure and find ways to improve website performance.
Cookies Details
Personalization Cookies
- Personalization Cookies
These cookies collect data about how you have interacted with our website to help us improve your web experience, such as which pages you have visited. These cookies may store a unique ID so that our system will remember you when you return. They may be set by us or by third party providers whose services we have added to our pages. These cookies enable us to provide enhanced website functionality and personalization as well as make the marketing messages we send to you more relevant to your interests. If you do not allow these cookies, then some or all of these services may not function properly.
Cookies Details
Advertising Cookies
- Advertising Cookies
These cookies record your visit to our websites, the pages you have visited and the links you have followed to influence the advertisements that you see on other websites. These cookies and the information they collect may be managed by other companies, including our advertising partners, and may be used to build a profile of your interests and show you relevant advertising on other sites. We and our advertising partners will use this information to make our websites and the advertising displayed on it, more relevant to your interests.
Cookies Details
Cookie List
Clear
-
- checkbox label label
Apply Cancel
Consent Leg.Interest
-
checkbox label label
-
checkbox label label
-
checkbox label label
Decline All Save and Accept
Links/Buttons:
- Skip to main content
- NVIDIA Dynamo Documentation
- latest
- 0.6.0
- 0.5.1
- 0.5.0
- 0.4.1
- 0.4.0
- 0.3.2
- 0.3.1
- 0.3.0
- 0.2.1
- 0.2.0
- GitHub
- Installation
- Support Matrix
- Examples
- Deployment Guide
- Kubernetes Quickstart
- Detailed Installation Guide
- Dynamo Operator
- Minikube Setup
- Observability (K8s)
- Metrics
- Logging
- Multinode
- Multinode Deployments
- Grove
- Tool Calling
- Multimodality Support
- Finding Best Initial Configs
- Dynamo Benchmarking Guide
- Tuning Disaggregated Performance
- Writing Python Workers in Dynamo
- Observability (Local)
- Health Checks
- Glossary
- Backends
- vLLM
- SGLang
- TensorRT-LLM
- Router
- Planner
- SLA Planner Quick Start
- Pre-Deployment Profiling
- SLA-based Planner
- KVBM
- Motivation
- Architecture
- Components
- Design Deep Dive
- Integrations
- KVBM in vLLM
- KVBM in TRTLLM
- LMCache Integration
- Further Reading
- Overall Architecture
- Architecture Flow
- Disaggregated Serving
- Distributed Runtime
- #
- components/backends/vllm
- Privacy Policy
- Manage My Privacy
- Do Not Sell or Share My Data
- Terms of Service
- Accessibility
- Corporate Policies
- Product Security
- Contact
- Cookie Policy
- NVIDIA Preference Center