# Deploying Inference Graphs to Kubernetes
URL Source: https://docs.nvidia.com/dynamo/archive/0.5.1/kubernetes/README.html
Published Time: Tue, 14 Oct 2025 16:26:34 GMT
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
## Install Platform First

1. Set environment:

   ```bash
   export NAMESPACE=dynamo-system
   export RELEASE_VERSION=0.x.x  # any Dynamo release 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
   ```

2. Install CRDs:

   ```bash
   helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
   helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
   ```

3. Install Platform:

   ```bash
   helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
   helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
   ```
For more details or customization options (including multinode deployments), see Installation Guide for Dynamo Kubernetes Platform.
## Choose Your Backend
Each backend has deployment examples and configuration options:
| Backend | Available Configurations |
|---|---|
| vLLM | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner, Disaggregated Multi-node |
| SGLang | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Planner, Disaggregated Multi-node |
| TensorRT-LLM | Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated Multi-node |
## Deploy Your First Model

```bash
export NAMESPACE=dynamo-cloud
kubectl create namespace ${NAMESPACE}
```

Deploy any example (this uses vLLM serving a Qwen model with aggregated serving):

```bash
kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
```

Check status:

```bash
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
```

Test it:

```bash
kubectl port-forward svc/agg-vllm-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
```
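With the port-forward active, the same frontend can be exercised from Python. A minimal standard-library sketch of an OpenAI-compatible chat request; the model name and port are assumptions carried over from the example above, and `build_chat_payload`/`chat` are hypothetical helper names:

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST the payload to the frontend and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the port-forward above to be running):
#   reply = chat("http://localhost:8000", "Qwen/Qwen3-0.6B", "Hello!")
```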
## What’s a DynamoGraphDeployment?

It’s a Kubernetes Custom Resource that defines your inference pipeline:

- Model configuration
- Resource allocation (GPUs, memory)
- Scaling policies
- Frontend/backend connections

Refer to the API Reference and Documentation for more details.
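Because `kubectl apply` accepts JSON as well as YAML, such a resource can also be assembled programmatically. A sketch that mirrors the example manifest shown later in this guide; `minimal_graph_deployment` is a hypothetical helper, and the exact field schema should be confirmed against the API Reference:

```python
import json


def minimal_graph_deployment(name: str, image: str, model: str) -> dict:
    """Assemble a minimal DynamoGraphDeployment manifest as a plain dict.

    Field names follow the example manifest in this guide; treat them as
    illustrative and check the API Reference for the authoritative schema.
    """
    return {
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "DynamoGraphDeployment",
        "metadata": {"name": name},
        "spec": {
            "services": {
                "Frontend": {
                    "dynamoNamespace": name,
                    "componentType": "frontend",
                    "replicas": 1,
                    "extraPodSpec": {"mainContainer": {"image": image}},
                },
                "VllmDecodeWorker": {
                    "dynamoNamespace": name,
                    "componentType": "worker",
                    "replicas": 1,
                    "resources": {"limits": {"gpu": "1"}},
                    "extraPodSpec": {
                        "mainContainer": {
                            "image": image,
                            "command": ["/bin/sh", "-c"],
                            "args": [f"python3 -m dynamo.vllm --model {model}"],
                        }
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    # Pipe to kubectl, e.g.:  python3 make_manifest.py | kubectl apply -n $NAMESPACE -f -
    print(json.dumps(minimal_graph_deployment("my-llm", "your-image", "Qwen/Qwen3-0.6B"), indent=2))
```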
## 📖 API Reference & Documentation

For detailed technical specifications of Dynamo’s Kubernetes resources:

- API Reference - Complete CRD field specifications for `DynamoGraphDeployment` and `DynamoComponentDeployment`
- Operator Guide - Dynamo operator configuration and management
- Create Deployment - Step-by-step deployment creation examples
## Choosing Your Architecture Pattern

When creating a deployment, select the architecture pattern that best fits your use case:

- Development / Testing - Use `agg.yaml` as the base configuration
- Production with Load Balancing - Use `agg_router.yaml` to enable scalable, load-balanced inference
- High Performance / Disaggregated - Use `disagg_router.yaml` for maximum throughput and modular scalability
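The mapping above is small enough to capture in a lookup, which can be handy in deployment tooling. An illustrative sketch only; the use-case labels and `base_manifest` helper are invented here, while the file names come from the list above:

```python
# Recommended starting manifest per use case, per the list above.
BASE_CONFIG = {
    "dev": "agg.yaml",
    "production": "agg_router.yaml",
    "high-performance": "disagg_router.yaml",
}


def base_manifest(use_case: str) -> str:
    """Return the recommended base manifest file for a use case."""
    try:
        return BASE_CONFIG[use_case]
    except KeyError:
        raise ValueError(
            f"unknown use case {use_case!r}; expected one of {sorted(BASE_CONFIG)}"
        ) from None
```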
## Frontend and Worker Components

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:

- Provides an OpenAI-compatible `/v1/chat/completions` endpoint
- Auto-discovers backend workers via etcd
- Routes requests and handles load balancing
- Validates and preprocesses requests
## Customizing Your Deployment

Example structure:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-llm
spec:
  services:
    Frontend:
      dynamoNamespace: my-llm
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image
    VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
      dynamoNamespace: dynamo-dev
      componentType: worker
      replicas: 1
      envFromSecret: hf-token-secret  # for HuggingFace models
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: your-image
          command: ["/bin/sh", "-c"]
          args:
            - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
```
Worker command examples per backend:
vLLM worker:

```yaml
args:
  - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
```

SGLang worker:

```yaml
args:
  - python3 -m dynamo.sglang --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B --tp 1 --trust-remote-code
```

TensorRT-LLM worker:

```yaml
args:
  - python3 -m dynamo.trtllm --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B --extra-engine-args engine_configs/agg.yaml
```
Key customization points include:
- Model Configuration: Specify the model in the `args` command
- Resource Allocation: Configure GPU requirements under `resources.limits`
- Scaling: Set `replicas` for the number of worker instances
- Routing Mode: Enable KV-cache routing by setting `DYN_ROUTER_MODE=kv` in the Frontend envs
- Worker Specialization: Add the `--is-prefill-worker` flag for disaggregated prefill workers
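These knobs can also be applied programmatically to a manifest loaded as a dict before piping it to `kubectl apply`. A minimal sketch under stated assumptions: the helper names are invented here, and the `envs` key used for Frontend environment variables is an assumption to verify against the API Reference:

```python
def enable_kv_routing(manifest: dict) -> dict:
    """Set DYN_ROUTER_MODE=kv on the Frontend service.

    Assumes environment variables live under an `envs` list on the service;
    check the CRD field specifications to confirm.
    """
    frontend = manifest["spec"]["services"]["Frontend"]
    envs = frontend.setdefault("envs", [])
    envs.append({"name": "DYN_ROUTER_MODE", "value": "kv"})
    return manifest


def scale_workers(manifest: dict, service: str, replicas: int) -> dict:
    """Set the replica count for one service in the manifest."""
    manifest["spec"]["services"][service]["replicas"] = replicas
    return manifest
```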
## Additional Resources

- Examples - Complete working examples
- Create Custom Deployments - Build your own CRDs
- Operator Documentation - How the platform works
- Helm Charts - For advanced users
- GitOps Deployment with FluxCD - For advanced users
- Logging - Logging setup
- Multinode Deployment - Multinode deployments
- Grove - Grove details and custom installation
- Monitoring - Monitoring setup
- Model Caching with Fluid - Model caching with Fluid