KV Block Manager — Dynamo
URL Source: https://docs.nvidia.com/dynamo/archive/0.3.0/architecture/kvbm_intro.html
Published Time: Fri, 13 Jun 2025 20:22:45 GMT
# KV Block Manager
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM, SGLang, and TRT-LLM.
It offers:

- A unified memory API that spans GPU memory, pinned host memory, remote RDMA-accessible memory, local or distributed SSD pools, and remote file/object/cloud storage systems.
- Support for evolving block lifecycles (allocate → register → match) with event-based state transitions that storage backends can subscribe to.
- Integration with NIXL, a dynamic memory exchange layer used for remote registration, sharing, and access of memory blocks over RDMA/NVLink.
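The block lifecycle described above can be pictured as a small state machine with subscribers notified on each transition. The sketch below is purely illustrative, assuming hypothetical names (`KVBlock`, `BlockState`); it is not the actual KVBM API, which is not shown on this page.

```python
from enum import Enum, auto
from typing import Callable, List, Optional


class BlockState(Enum):
    ALLOCATED = auto()
    REGISTERED = auto()
    MATCHED = auto()


class KVBlock:
    """Hypothetical block following the allocate -> register -> match lifecycle."""

    # Each state may only be entered from the one listed here (None = fresh block).
    _NEXT = {
        None: BlockState.ALLOCATED,
        BlockState.ALLOCATED: BlockState.REGISTERED,
        BlockState.REGISTERED: BlockState.MATCHED,
    }

    def __init__(self, block_id: int):
        self.block_id = block_id
        self.state: Optional[BlockState] = None
        self._subscribers: List[Callable[[int, BlockState], None]] = []

    def subscribe(self, callback: Callable[[int, BlockState], None]) -> None:
        # E.g. a storage backend reacting to lifecycle events.
        self._subscribers.append(callback)

    def _advance(self, target: BlockState) -> None:
        if self._NEXT.get(self.state) is not target:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target
        for cb in self._subscribers:  # event-based notification
            cb(self.block_id, target)

    def allocate(self) -> None:
        self._advance(BlockState.ALLOCATED)

    def register(self) -> None:
        self._advance(BlockState.REGISTERED)

    def match(self) -> None:
        self._advance(BlockState.MATCHED)
```

A subscriber attached via `subscribe` sees every transition in order, which is how a storage tier could learn that a block has become registered and is safe to share remotely.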
The Dynamo KV Block Manager serves as a reference implementation that emphasizes modularity and extensibility. Its pluggable design enables developers to customize components and optimize for specific performance, memory, and deployment needs.
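To make the "unified memory layer across tiers" idea concrete, here is a minimal sketch of a tiered lookup with promotion into the fastest tier. All names (`TieredBlockPool`, the tier labels) are assumptions for illustration, not KVBM identifiers.

```python
from collections import OrderedDict
from typing import Any, Dict, Optional, Tuple


class TieredBlockPool:
    """Illustrative lookup across ordered memory tiers (fastest first),
    e.g. device -> pinned host -> SSD -> remote object store."""

    def __init__(self, tiers: "OrderedDict[str, Dict[int, Any]]"):
        self.tiers = tiers

    def lookup(self, block_id: int) -> Optional[Tuple[str, Any]]:
        # Return (tier_name, payload) from the fastest tier holding the block.
        for name, store in self.tiers.items():
            if block_id in store:
                return name, store[block_id]
        return None

    def promote(self, block_id: int) -> bool:
        # Copy a found block into the fastest tier (a naive caching policy).
        hit = self.lookup(block_id)
        if hit is None:
            return False
        fastest = next(iter(self.tiers))
        self.tiers[fastest][block_id] = hit[1]
        return True
```

A real implementation would additionally track capacity, evict under pressure, and move data asynchronously; the point here is only the uniform API over heterogeneous tiers.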
Copyright © 2025, NVIDIA Corporation.