Running KVBM in vLLM — NVIDIA Dynamo Documentation
Title: Running KVBM in vLLM — NVIDIA Dynamo Documentation
URL Source: https://docs.nvidia.com/dynamo/archive/0.6.0/kvbm/vllm-setup.html
Published Time: Thu, 30 Oct 2025 05:14:53 GMT
Running KVBM in vLLM#
This guide explains how to use KVBM (KV Block Manager) to manage the KV cache and offload KV blocks in vLLM.
For background on what KVBM is, see the KVBM section of the Dynamo documentation.
Quick Start#
To use KVBM in vLLM, follow the steps below.
Docker Setup#
Start etcd for KVBM leader/worker registration and discovery:
docker compose -f deploy/docker-compose.yml up -d
Build a container that includes vLLM and KVBM:
./container/build.sh --framework vllm --enable-kvbm
Launch the container:
./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds
Aggregated Serving with KVBM#
cd $DYNAMO_HOME/components/backends/vllm
./launch/agg_kvbm.sh
Disaggregated Serving with KVBM#
1P1D: one prefill worker and one decode worker.
NOTE: requires at least 2 GPUs.
cd $DYNAMO_HOME/components/backends/vllm
./launch/disagg_kvbm.sh
2P2D: two prefill workers and two decode workers.
NOTE: requires at least 4 GPUs.
cd $DYNAMO_HOME/components/backends/vllm
./launch/disagg_kvbm_2p2d.sh
Note
To tune the size of the CPU or disk cache, set DYN_KVBM_CPU_CACHE_GB and DYN_KVBM_DISK_CACHE_GB accordingly. The scripts above set only DYN_KVBM_CPU_CACHE_GB=20.
Note
DYN_KVBM_CPU_CACHE_GB must be set; DYN_KVBM_DISK_CACHE_GB is optional.
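As a rough illustration of the two notes above, the snippet below exports both variables and runs a small pre-flight check before launching. The disk value of 50 and the helper kvbm_check_cache_env are assumptions made for this example and are not part of Dynamo; only the variable names and the CPU value of 20 come from this guide.

```shell
# Illustrative values; size them to the host's free RAM and disk space.
export DYN_KVBM_CPU_CACHE_GB=20     # required (20 is what the launch scripts set)
export DYN_KVBM_DISK_CACHE_GB=50    # optional; 50 is an assumed example value

# Hypothetical pre-flight helper, not part of Dynamo: fail early when the
# required variable is missing, and report what will be used.
kvbm_check_cache_env() {
  if [ -z "${DYN_KVBM_CPU_CACHE_GB:-}" ]; then
    echo "DYN_KVBM_CPU_CACHE_GB must be set" >&2
    return 1
  fi
  echo "CPU cache GB: ${DYN_KVBM_CPU_CACHE_GB}; disk cache GB: ${DYN_KVBM_DISK_CACHE_GB:-not set}"
}

kvbm_check_cache_env
```

Running the check in the same shell that starts a launch script makes a missing CPU cache setting visible before the worker starts.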
Note
When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if the block's frequency is at least 2. A block's frequency starts at 1, doubles on each cache hit, and decreases by 1 at each time-decay step.
To disable disk offload filtering, set DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER to true or 1.
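The filtering rule can be easier to follow as a toy model. The sketch below tracks the frequency of a single block and applies the threshold of 2 described above. Every function and variable name is hypothetical, the floor at 0 on decay is an assumption, and the real KVBM implementation may differ in its details.

```shell
# Sketch of the described policy for a single block. All names are
# hypothetical; the real KVBM implementation may differ.
unset DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER   # start with filtering on (the default)
freq=1          # a newly tracked block starts at frequency 1
THRESHOLD=2     # offload to disk only when freq >= 2

on_cache_hit()  { freq=$((freq * 2)); }
# Assumption: the frequency never drops below 0.
on_decay_step() { if [ "$freq" -gt 0 ]; then freq=$((freq - 1)); fi; }

# True when the filter is switched off via the documented variable.
filter_disabled() {
  case "${DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER:-}" in
    true|1) return 0 ;;
    *)      return 1 ;;
  esac
}

should_offload_to_disk() {
  filter_disabled || [ "$freq" -ge "$THRESHOLD" ]
}

decision() { if should_offload_to_disk; then echo offload; else echo skip; fi; }

echo "new block:   freq=$freq -> $(decision)"
on_cache_hit
echo "after a hit: freq=$freq -> $(decision)"
on_decay_step
echo "after decay: freq=$freq -> $(decision)"
```

In this model a block that is never hit again is not written to disk, which is the point of the filter: only blocks that have shown reuse cost SSD writes.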
Sample Request#
Send a request to verify that vLLM with KVBM started correctly.
NOTE: change the model name if you are serving a different model.
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
      }
    ],
    "stream": false,
    "max_tokens": 10
  }'
Alternatively, use vllm serve directly to run KVBM for aggregated serving:
vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "dynamo.llm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
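Because the model name in the sample request has to match the served model, it can help to build the payload from a variable. The helper below is a hypothetical convenience for this guide, not part of Dynamo or vLLM. It does no JSON escaping, so it assumes the prompt contains no double quotes or backslashes.

```shell
# Hypothetical helper: emit a chat-completions payload for a given model
# and prompt. No JSON escaping is done, so keep quotes and backslashes
# out of the arguments.
kvbm_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "stream": false, "max_tokens": 10}\n' "$1" "$2"
}

# Print the payload so it can be inspected before sending.
kvbm_payload "Qwen/Qwen3-0.6B" "Hello"

# With a server running on the endpoint used in the sample request:
#   kvbm_payload "Qwen/Qwen3-0.6B" "Hello" |
#     curl -s localhost:8000/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```

Here `-d @-` tells curl to read the request body from standard input.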
Enable and View KVBM Metrics#
Follow the steps below to enable metrics collection and view the metrics in a Grafana dashboard.
Start the basic services (etcd and natsd), along with Prometheus and Grafana:
docker compose -f deploy/docker-compose.yml --profile metrics up -d
When launching via dynamo, set the environment variable DYN_KVBM_METRICS to true. Optionally, set DYN_KVBM_METRICS_PORT to choose the /metrics port (default: 6880).
NOTE: update launch/disagg_kvbm.sh or launch/disagg_kvbm_2p2d.sh as needed.
DYN_KVBM_METRICS=true \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--enforce-eager \
--connector kvbm
Optional: if a firewall blocks the KVBM metrics port, open it so that Prometheus metrics can be sent:
sudo ufw allow 6880/tcp
View the metrics in Grafana at http://localhost:3001 (default login: dynamo/dynamo) and look for the KVBM Dashboard.
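Before opening Grafana, it can be useful to confirm that the metrics endpoint is reachable at all. The sketch below derives the URL from DYN_KVBM_METRICS_PORT, falling back to the documented default of 6880. The helper name and the use of localhost are assumptions for a worker running on the same machine.

```shell
# Hypothetical helper: build the KVBM metrics URL, using the documented
# default port 6880 when DYN_KVBM_METRICS_PORT is unset or empty.
# Assumes the worker runs on this machine.
kvbm_metrics_url() {
  echo "http://localhost:${DYN_KVBM_METRICS_PORT:-6880}/metrics"
}

kvbm_metrics_url

# With the worker running and DYN_KVBM_METRICS=true:
#   curl -s "$(kvbm_metrics_url)" | head
```

An empty or failed response from the curl command points at the worker or firewall rather than at the Prometheus or Grafana configuration.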
Benchmark KVBM#
Once the model is loaded and ready, follow the steps below to benchmark KVBM performance with LMBenchmark:
git clone https://github.com/LMCache/LMBenchmark.git
The following example runs the synthetic multi-turn chat dataset, passing the model, endpoint, output file prefix, and QPS to the script:
cd LMBenchmark/synthetic-multi-round-qa
./long_input_short_output_run.sh \
"Qwen/Qwen3-0.6B" \
"http://localhost:8000" \
"benchmark_kvbm" \
1
The output of the command above reports the average TTFT and other performance numbers.
For more details on using LMBenchmark, see the LMBenchmark repository (https://github.com/LMCache/LMBenchmark).
NOTE: if metrics are enabled as described in the section above, you can observe KV offloading and KV onboarding in the Grafana dashboard.
To compare, run vllm serve Qwen/Qwen3-0.6B, which runs without KVBM, as the baseline.
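Once both runs have finished, the two average TTFT values can be compared as a relative change. The helper below is a hypothetical convenience, and the inputs 200 and 150 are placeholders chosen to show the arithmetic, not measured results.

```shell
# Hypothetical helper: percentage change in TTFT from a baseline run to a
# KVBM run. Both arguments must use the same unit; negative means faster.
ttft_change_pct() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Placeholder inputs, not measurements: baseline 200, KVBM 150.
ttft_change_pct 200 150   # prints -25.0
```

Use the averages reported by the two benchmark runs in place of the placeholders.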