Motivation behind KVBM — Dynamo
Published Time: Fri, 18 Jul 2025 15:25:45 GMT
Motivation behind KVBM
Large language models (LLMs) and other AI workloads increasingly rely on KV caches that extend beyond GPU and local CPU memory into remote storage tiers. However, efficiently managing the lifecycle of KV blocks in remote storage presents challenges:
- Need for a solution tailored to GenAI use cases.
- Lack of visibility into real-time block usage patterns.
- Need for lightweight, ownership-driven memory management rather than complex object stores with unneeded overhead.
- Need for a modular design with a simplified UX and memory safety.
- Inability to differentiate between hot (frequently accessed) and cold (infrequently accessed) blocks across the stack without intrusive application-level changes.
- Difficulty in optimizing storage placement across heterogeneous storage tiers (for example, SSDs, object storage, and cloud storage).
Conventional systems either lack dynamic feedback mechanisms or require deep integration into core storage paths, which both increases complexity and reduces portability.
Copyright © 2025-2025, NVIDIA Corporation.