Expert Parallelism Load Balancer (EPLB) in SGLang — NVIDIA Dynamo Documentation
Title: Expert Parallelism Load Balancer (EPLB) in SGLang#
URL Source: https://docs.nvidia.com/dynamo/archive/0.6.0/backends/sglang/expert-distribution-eplb.html
Published Time: Thu, 30 Oct 2025 05:14:51 GMT
# Expert Parallelism Load Balancer (EPLB) in SGLang
Mixture-of-Experts (MoE) models utilize a technique called Expert Parallelism (EP), where experts are distributed across multiple GPUs. While this allows for much larger and more powerful models, it can lead to an uneven workload distribution. Because the load on different experts may vary depending on the workload, some GPUs can become bottlenecks, forcing the entire system to wait. This imbalance leads to wasted compute cycles and increased memory usage.
To address this, SGLang implements an Expert Parallelism Load Balancer (EPLB) inspired by the work in the DeepSeek-V3 paper. EPLB analyzes expert usage patterns and dynamically re-arranges the experts across the available GPUs to ensure a more balanced workload.
## The EPLB Algorithm: Core Concepts
The load balancing algorithm revolves around a few key ideas to achieve an optimal distribution of work.
### Redundant Experts for Flexibility
The core strategy is to create redundant experts. Instead of being limited to the model’s original number of experts, EPLB can create duplicates of heavily-loaded experts. For example, if a model has 256 experts, you can configure EPLB to create an additional 32 “redundant” experts, bringing the total to 288. This pool of replicated experts is then strategically packed onto the available GPUs. A popular expert might be duplicated multiple times, while a moderately used expert might be grouped with several rarely used ones on a single GPU.
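The replication step above can be sketched as a simple greedy allocation: give each expert one replica, then spend the redundant-expert budget on whichever expert is currently hottest per replica. `allocate_replicas` is a hypothetical helper for illustration, not SGLang's actual API.

```python
# Sketch: allocate a budget of redundant replicas to the most-loaded experts.
# Hypothetical helper, not SGLang's actual implementation.

def allocate_replicas(loads, num_redundant):
    """Give every expert one replica, then hand out `num_redundant` extra
    replicas greedily to whichever expert has the highest load per replica."""
    replicas = [1] * len(loads)
    for _ in range(num_redundant):
        hottest = max(range(len(loads)), key=lambda i: loads[i] / replicas[i])
        replicas[hottest] += 1
    return replicas

# A hot expert absorbs the extra capacity:
print(allocate_replicas([100, 10, 10, 40], 2))  # [3, 1, 1, 1]
```

With 256 original experts and 32 redundant slots, the same greedy loop would hand the 32 extra replicas to the experts with the highest observed load.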
### Group-Limited Routing for Efficiency
Modern MoE models like DeepSeek-V3 use group-limited expert routing. In this design, experts are organized into groups, and routing decisions are constrained within these groups. EPLB can take advantage of this structure to reduce inter-node data traffic by attempting to place all experts from the same group onto the same node whenever possible.
### Load Balancing Policies

The algorithm comes with two policies for different scenarios:

- Hierarchical Load Balancing: This policy is used when the number of server nodes evenly divides the number of expert groups. It first exploits group-limited routing by packing whole expert groups onto nodes to balance the load between nodes. Then, within each node, it replicates experts and packs them onto individual GPUs to balance the load locally. This policy is typically used during prefill, where the expert-parallel size tends to be smaller.
- Global Load Balancing: In all other cases, the global policy is used. It replicates experts globally, without regard to group affiliation, and packs them onto individual GPUs. This more general policy is typically used during decoding, where the expert-parallel size is larger.
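The "pack onto individual GPUs" step in both policies is essentially load-aware bin packing. A minimal sketch, using the classic greedy longest-processing-time heuristic as an illustrative stand-in for SGLang's actual packing algorithm:

```python
import heapq

def pack_experts(replica_loads, num_gpus):
    """Greedy longest-processing-time packing: visit replicas in decreasing
    load order and place each one on the currently least-loaded GPU.
    Illustrative only; not SGLang's exact algorithm."""
    heap = [(0, gpu, []) for gpu in range(num_gpus)]  # (load, gpu_id, experts)
    heapq.heapify(heap)
    for expert, load in sorted(replica_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu, members = heapq.heappop(heap)
        members.append(expert)
        heapq.heappush(heap, (gpu_load + load, gpu, members))
    return {gpu: members for _, gpu, members in heap}

placement = pack_experts({"a": 4, "b": 3, "c": 2, "d": 1}, num_gpus=2)
print(placement)  # {0: ['a', 'd'], 1: ['b', 'c']} -- both GPUs carry load 5
```

The hierarchical policy would run a packing step like this twice: once at group-to-node granularity, then again per node at expert-to-GPU granularity.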
## How SGLang Implements EPLB
SGLang provides a robust implementation of EPLB, allowing for dynamic, online rebalancing of expert locations based on real-world traffic.
### Dynamic Rebalancing
You can enable dynamic rebalancing by setting the --enable-eplb flag. When enabled, the EPLBManager runs in the background. It periodically triggers a rebalance after a certain number of requests, configured with --eplb-rebalance-num-iterations. At each rebalance, it computes a new expert placement plan based on the latest usage statistics and updates the model’s expert locations on the fly.
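The trigger logic described above can be sketched as a small counter loop. The flag names are real, but `EplbManagerSketch`, `get_stats`, and `rebalance_fn` are hypothetical stand-ins for SGLang's internal EPLBManager machinery.

```python
# Sketch of a periodic rebalance trigger, assuming hypothetical helpers.
class EplbManagerSketch:
    def __init__(self, rebalance_num_iterations, get_stats, rebalance_fn):
        # rebalance_num_iterations mirrors --eplb-rebalance-num-iterations
        self.interval = rebalance_num_iterations
        self.get_stats = get_stats        # e.g. aggregated expert counts
        self.rebalance_fn = rebalance_fn  # computes + applies a new placement
        self.step = 0

    def on_forward_pass(self):
        self.step += 1
        if self.step % self.interval == 0:
            self.rebalance_fn(self.get_stats())

calls = []
mgr = EplbManagerSketch(3, lambda: [5, 1, 2], calls.append)
for _ in range(7):
    mgr.on_forward_pass()
print(len(calls))  # 2 rebalances triggered over 7 passes with interval 3
```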
### Expert Usage Recording

To make intelligent balancing decisions, SGLang needs to collect data on expert usage. The ExpertDistributionRecorder is responsible for this, and its behavior is controlled by the --expert-distribution-recorder-mode flag, which determines the granularity of the collected data. When --enable-eplb is set, this mode defaults to stat so that statistics are available for rebalancing. The available modes are:
- per_token: The most detailed mode. It records the specific expert choices for every single token processed by the model. While it provides the richest data, it also has the highest performance overhead. The raw, unaggregated data for each forward pass is stored.
- per_pass: SGLang records the aggregated expert usage counts for each individual forward pass. The data is not aggregated across different passes, giving you a snapshot of expert popularity for each batch of requests.
- stat: This mode also records the exact expert usage counts for each forward pass, but then aggregates these counts across multiple passes (the number of passes is determined by --expert-distribution-recorder-buffer-size). This provides a moving average of expert usage statistics and is the default when EPLB is enabled.
- stat_approx: Similar to stat, but gathers approximate statistics, usually from the DeepEP dispatcher. This method has lower overhead than stat but is less precise, especially for small batch sizes. It is a good choice when performance is critical.
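The stat mode's windowed aggregation can be sketched with a bounded buffer of per-pass counts. `StatBufferSketch` is illustrative only, not SGLang's actual ExpertDistributionRecorder.

```python
from collections import deque

class StatBufferSketch:
    """Illustrative 'stat'-style aggregation: keep per-pass expert counts in
    a fixed-size window (cf. --expert-distribution-recorder-buffer-size) and
    sum them. Not SGLang's actual recorder."""

    def __init__(self, buffer_size, num_experts):
        self.passes = deque(maxlen=buffer_size)  # oldest pass evicted first
        self.num_experts = num_experts

    def record_pass(self, counts):
        self.passes.append(list(counts))

    def aggregated(self):
        totals = [0] * self.num_experts
        for counts in self.passes:
            for i, c in enumerate(counts):
                totals[i] += c
        return totals

buf = StatBufferSketch(buffer_size=2, num_experts=2)
for counts in ([1, 0], [2, 2], [3, 1]):  # third pass evicts the first
    buf.record_pass(counts)
print(buf.aggregated())  # [5, 3]
```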
The collected statistics are then fed into the rebalancing algorithm to generate a new expert placement plan.
### Initializing with a Pre-computed Distribution
While SGLang can start with a simple default layout and learn a better one over time, you can also provide it with a pre-computed expert distribution to start with. The --init-expert-location flag allows you to specify a file path (.pt or .json) or a JSON string containing an expert layout. This is useful if you have already analyzed a representative workload offline and want the server to start immediately with a balanced configuration. If this flag is not set, it defaults to a trivial sequential layout.
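For illustration, a trivial sequential layout (the default described above) could be generated and serialized like this. The JSON shape shown is an assumption for the sketch; consult SGLang for the exact schema --init-expert-location expects.

```python
import json

# Sketch: build a trivial sequential expert layout, GPU by GPU.
# The JSON schema here is illustrative only, not SGLang's documented format.
num_experts, num_gpus = 8, 4
per_gpu = num_experts // num_gpus
layout = {gpu: list(range(gpu * per_gpu, (gpu + 1) * per_gpu))
          for gpu in range(num_gpus)}
print(json.dumps(layout))  # {"0": [0, 1], "1": [2, 3], "2": [4, 5], "3": [6, 7]}
```

An offline analysis of a representative workload would replace this sequential mapping with a balanced one before the server starts.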
## References and further reading
- SGLang Large Scale P/D + WideEP Deployment
- DeepSeek's EPLB repository