Title: Running KVBM in vLLM — NVIDIA Dynamo Documentation

URL Source: https://docs.nvidia.com/dynamo/archive/0.6.0/kvbm/vllm-setup.html

Published Time: Thu, 30 Oct 2025 05:14:53 GMT

Markdown Content:

Running KVBM in vLLM#

This guide explains how to leverage KVBM (KV Block Manager) to manage KV cache and do KV offloading in vLLM.

To learn what KVBM is, see the KVBM documentation.

Quick Start#

To use KVBM in vLLM, you can follow the steps below:

Docker Setup#

Start etcd for KVBM leader/worker registration and discovery:

docker compose -f deploy/docker-compose.yml up -d

Build a container that includes vLLM and KVBM:

./container/build.sh --framework vllm --enable-kvbm

Launch the container:

./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds

Aggregated Serving with KVBM#

cd $DYNAMO_HOME/components/backends/vllm
./launch/agg_kvbm.sh

Disaggregated Serving with KVBM#

1P1D - one prefill worker and one decode worker

NOTE: requires at least 2 GPUs

cd $DYNAMO_HOME/components/backends/vllm
./launch/disagg_kvbm.sh

2P2D - two prefill workers and two decode workers

NOTE: requires at least 4 GPUs

cd $DYNAMO_HOME/components/backends/vllm
./launch/disagg_kvbm_2p2d.sh
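The GPU requirements noted above can be checked before launching. The following is a minimal sketch, not part of Dynamo: the `MIN_GPUS` table, `count_gpus`, and `can_run` are hypothetical helpers, and counting GPUs by parsing `nvidia-smi -L` output (one `GPU N:` line per device) is an assumption about the host setup.

```python
import subprocess

# Minimum GPU counts taken from the notes above; script paths are relative to
# $DYNAMO_HOME/components/backends/vllm.
MIN_GPUS = {
    "./launch/disagg_kvbm.sh": 2,       # 1P1D
    "./launch/disagg_kvbm_2p2d.sh": 4,  # 2P2D
}


def count_gpus() -> int:
    """Count GPUs by listing them with `nvidia-smi -L` (one 'GPU N:' line each)."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


def can_run(script: str, available_gpus: int) -> bool:
    """Return True if the host has enough GPUs for the given launch script."""
    return available_gpus >= MIN_GPUS[script]


# Example with an explicit GPU count, so this runs on machines without GPUs.
print(can_run("./launch/disagg_kvbm_2p2d.sh", available_gpus=2))  # False
print(can_run("./launch/disagg_kvbm.sh", available_gpus=2))       # True
```

On a GPU host, `can_run(script, count_gpus())` would perform the same check against the detected device count.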

Note

To tune the size of the CPU or disk cache, set DYN_KVBM_CPU_CACHE_GB and DYN_KVBM_DISK_CACHE_GB accordingly. Both scripts above set only DYN_KVBM_CPU_CACHE_GB=20.

Note

DYN_KVBM_CPU_CACHE_GB must be set and DYN_KVBM_DISK_CACHE_GB is optional.
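As an illustration of the two cache settings, here is a minimal sketch that assembles them into an environment mapping for a launch command. `kvbm_cache_env` is a hypothetical helper, not a Dynamo API. The positive-integer check and the disk size of 50 are assumptions made for the example. Whether a given launch script overrides values passed this way depends on the script itself.

```python
import os
from typing import Dict, Optional


def kvbm_cache_env(cpu_gb: int, disk_gb: Optional[int] = None) -> Dict[str, str]:
    """Build the KVBM cache variables; the CPU size is required, disk is optional."""
    if cpu_gb <= 0:
        # Assumption: a usable CPU cache size is a positive number of GB.
        raise ValueError("DYN_KVBM_CPU_CACHE_GB must be set to a positive size")
    env = {"DYN_KVBM_CPU_CACHE_GB": str(cpu_gb)}
    if disk_gb is not None:
        env["DYN_KVBM_DISK_CACHE_GB"] = str(disk_gb)
    return env


# CPU cache only, matching the value used by the launch scripts.
print(kvbm_cache_env(cpu_gb=20))

# CPU and disk cache; 50 is an illustrative value, not a recommendation.
launch_env = {**os.environ, **kvbm_cache_env(cpu_gb=20, disk_gb=50)}
print(launch_env["DYN_KVBM_DISK_CACHE_GB"])
```

The resulting `launch_env` could be passed as the `env` argument of `subprocess.run` when starting a serving command from Python.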

Note

When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if its frequency is at least 2. A block's frequency starts at 1, doubles on each cache hit, and decreases by 1 at each time decay step.

To disable disk offload filtering, set DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER to true or 1.
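The frequency rule described above can be traced with a small simulation. This is a sketch of the stated policy only, not KVBM's implementation: `BlockFrequency` is a hypothetical class, and how the real counter behaves at its lower bound is not specified in the text, so the example avoids that case.

```python
from dataclasses import dataclass

OFFLOAD_THRESHOLD = 2  # from the text: offload when frequency >= 2


@dataclass
class BlockFrequency:
    """Hypothetical per-block counter illustrating the described policy."""

    value: int = 1  # frequency is initialized to 1

    def on_cache_hit(self) -> None:
        # Frequency doubles on each cache hit.
        self.value *= 2

    def on_decay_step(self) -> None:
        # Frequency decreases by 1 at each time decay step.
        # Lower-bound behavior is not described in the text and is not modeled.
        self.value -= 1

    def eligible_for_disk(self) -> bool:
        return self.value >= OFFLOAD_THRESHOLD


block = BlockFrequency()
history = [(block.value, block.eligible_for_disk())]

block.on_cache_hit()
history.append((block.value, block.eligible_for_disk()))

block.on_cache_hit()
history.append((block.value, block.eligible_for_disk()))

block.on_decay_step()
history.append((block.value, block.eligible_for_disk()))

print(history)  # [(1, False), (2, True), (4, True), (3, True)]
```

A block that is never hit stays at frequency 1 and is never written to disk under this rule, which is how the filter avoids spending SSD writes on blocks that are not reused.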

Sample Request#

Make a request to verify that vLLM with KVBM started correctly.

NOTE: change the model name if served with a different one

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "Qwen/Qwen3-0.6B", "messages": [ { "role": "user", "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden." } ], "stream":false, "max_tokens": 10 }'
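The same request can be sent from Python with only the standard library. This is a minimal sketch: `build_payload` and `send_chat_request` are hypothetical helper names, the endpoint and model come from the curl example above, and the response parsing assumes an OpenAI-style body with `choices[0].message.content`.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
URL = "http://localhost:8000/v1/chat/completions"


def build_payload(prompt: str, model: str = "Qwen/Qwen3-0.6B", max_tokens: int = 10) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }


def send_chat_request(prompt: str, url: str = URL, timeout: float = 60.0) -> str:
    """POST the payload and return the text of the first completion.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the payload needs no server, so it is safe to run anywhere.
payload = build_payload("Hello")
print(json.dumps(payload, indent=2))

# With the server running, send the request like this:
# print(send_chat_request("Hello"))
```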

Alternatively, you can use vllm serve directly to run KVBM for aggregated serving:

vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "dynamo.llm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
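Hand-quoting the JSON argument is easy to get wrong, so it can help to generate the command. The sketch below builds the same `vllm serve` invocation from a dictionary. The configuration values are copied from the command above, while `vllm_serve_command` is a hypothetical helper.

```python
import json
import shlex
from typing import List

# Values copied from the vllm serve command above.
kv_transfer_config = {
    "kv_connector": "DynamoConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "dynamo.llm.vllm_integration.connector",
}


def vllm_serve_command(model: str = "Qwen/Qwen3-0.6B") -> List[str]:
    """Return the argument list for vllm serve with the KVBM connector config."""
    return [
        "vllm",
        "serve",
        "--kv-transfer-config",
        json.dumps(kv_transfer_config),
        model,
    ]


command = vllm_serve_command()

# shlex.join produces a shell-safe string that can be pasted into a terminal.
print(shlex.join(command))
```

The argument list can also be passed directly to `subprocess.run(command)`, which avoids shell quoting entirely.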

Enable and View KVBM Metrics#

Follow the steps below to enable metrics collection and view the metrics in a Grafana dashboard:

Start the basic services (etcd & natsd), along with Prometheus and Grafana

docker compose -f deploy/docker-compose.yml --profile metrics up -d

Set the environment variable DYN_KVBM_METRICS to true when launching via dynamo.

Optionally set DYN_KVBM_METRICS_PORT to choose the /metrics port (default: 6880).

NOTE: update launch/disagg_kvbm.sh or launch/disagg_kvbm_2p2d.sh as needed

DYN_KVBM_METRICS=true \
python -m dynamo.vllm \
    --model Qwen/Qwen3-0.6B \
    --enforce-eager \
    --connector kvbm

Optional: if a firewall blocks the KVBM metrics port, open it so that Prometheus metrics can be collected.

sudo ufw allow 6880/tcp

View the metrics in Grafana at http://localhost:3001 (default login: dynamo/dynamo) and look for the KVBM Dashboard.
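Before opening Grafana, the /metrics endpoint can be read directly to confirm that metrics are being exported. The following sketch parses the Prometheus text format into a dictionary. `parse_metrics` and `fetch_metrics` are hypothetical helpers, the URL assumes the default port 6880 on localhost, and the metric names in `SAMPLE` are invented placeholders, not real KVBM metric names.

```python
import urllib.request
from typing import Dict

# Assumes the default DYN_KVBM_METRICS_PORT on the local machine.
METRICS_URL = "http://localhost:6880/metrics"


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text-format samples into {series: value}."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labeled series: keep everything up to the closing brace as the key.
            head, tail = line.rsplit("}", 1)
            name = head + "}"
        else:
            name, tail = line.split(None, 1)
        samples[name] = float(tail.split()[0])
    return samples


def fetch_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> Dict[str, float]:
    """Download and parse the metrics page; requires a running KVBM process."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Placeholder input so the parser can be tried without a running server.
SAMPLE = """\
# HELP example_blocks_total Hypothetical counter for illustration.
# TYPE example_blocks_total counter
example_blocks_total 3
example_blocks_by_tier{tier="cpu"} 1.5
"""

parsed = parse_metrics(SAMPLE)
print(parsed)
```

With metrics enabled and the process running, `fetch_metrics()` returns the live values in the same form.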

Benchmark KVBM#

Once the model is loaded and ready, follow the steps below to benchmark KVBM performance with LMBenchmark:

git clone https://github.com/LMCache/LMBenchmark.git

Example: run the synthetic multi-turn chat dataset.

The script takes the model, endpoint, output file prefix, and QPS as arguments.

cd LMBenchmark/synthetic-multi-round-qa
./long_input_short_output_run.sh \
    "Qwen/Qwen3-0.6B" \
    "http://localhost:8000" \
    "benchmark_kvbm" \
    1

The output of the command above reports the average TTFT and other performance numbers.

For more details on using LMBenchmark, see the LMBenchmark repository.

NOTE: if metrics are enabled as described in the section above, you can observe KV offloading and KV onboarding in the Grafana dashboard.

To compare, you can run vllm serve Qwen/Qwen3-0.6B to turn KVBM off as the baseline.
