Expert Parallelism Load Balancer (EPLB) in SGLang — NVIDIA Dynamo Documentation
Title: Expert Parallelism Load Balancer (EPLB) in SGLang#
URL Source: https://docs.nvidia.com/dynamo/archive/0.6.0/backends/sglang/expert-distribution-eplb.html
Published Time: Thu, 30 Oct 2025 05:14:51 GMT
# Expert Parallelism Load Balancer (EPLB) in SGLang
Mixture-of-Experts (MoE) models utilize a technique called Expert Parallelism (EP), where experts are distributed across multiple GPUs. While this allows for much larger and more powerful models, it can lead to an uneven workload distribution. Because the load on different experts may vary depending on the workload, some GPUs can become bottlenecks, forcing the entire system to wait. This imbalance leads to wasted compute cycles and increased memory usage.
To address this, SGLang implements an Expert Parallelism Load Balancer (EPLB) inspired by the work in the DeepSeek-V3 paper. EPLB analyzes expert usage patterns and dynamically re-arranges the experts across the available GPUs to ensure a more balanced workload.
## The EPLB Algorithm: Core Concepts
The load balancing algorithm revolves around a few key ideas to achieve an optimal distribution of work.
### Redundant Experts for Flexibility
The core strategy is to create redundant experts. Instead of being limited to the model’s original number of experts, EPLB can create duplicates of heavily-loaded experts. For example, if a model has 256 experts, you can configure EPLB to create an additional 32 “redundant” experts, bringing the total to 288. This pool of replicated experts is then strategically packed onto the available GPUs. A popular expert might be duplicated multiple times, while a moderately used expert might be grouped with several rarely used ones on a single GPU.
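The replication step above can be sketched as a simple greedy allocation: give each expert one replica, then spend the redundant-expert budget on whichever expert is currently hottest per replica. `allocate_replicas` is a hypothetical helper for illustration, not SGLang's actual API.

```python
# Sketch: allocate a budget of redundant replicas to the most-loaded experts.
# Hypothetical helper, not SGLang's actual implementation.

def allocate_replicas(loads, num_redundant):
    """Give every expert one replica, then hand out `num_redundant` extra
    replicas greedily to whichever expert has the highest load per replica."""
    replicas = [1] * len(loads)
    for _ in range(num_redundant):
        hottest = max(range(len(loads)), key=lambda i: loads[i] / replicas[i])
        replicas[hottest] += 1
    return replicas

# A hot expert absorbs the extra capacity:
print(allocate_replicas([100, 10, 10, 40], 2))  # [3, 1, 1, 1]
```

With 256 original experts and 32 redundant slots, the same greedy loop would hand the 32 extra replicas to the experts with the highest observed load.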
### Group-Limited Routing for Efficiency
Modern MoE models like DeepSeek-V3 use group-limited expert routing. In this design, experts are organized into groups, and routing decisions are constrained within these groups. EPLB can take advantage of this structure to reduce inter-node data traffic by attempting to place all experts from the same group onto the same node whenever possible.
### Load Balancing Policies

The algorithm comes with two policies for different scenarios:

- Hierarchical Load Balancing: This policy is used when the number of server nodes evenly divides the number of expert groups. It first exploits group-limited routing by packing whole expert groups onto nodes to balance the load between nodes. Then, within each node, it replicates experts and packs them onto individual GPUs to balance the load locally. This policy is typically used during prefill, where the expert-parallel size tends to be smaller.
- Global Load Balancing: In all other cases, the global policy is used. It replicates experts globally, without regard to group affiliation, and packs them onto individual GPUs. This more general policy is typically used during decoding, where the expert-parallel size is larger.
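The "pack onto individual GPUs" step in both policies is essentially load-aware bin packing. A minimal sketch, using the classic greedy longest-processing-time heuristic as an illustrative stand-in for SGLang's actual packing algorithm:

```python
import heapq

def pack_experts(replica_loads, num_gpus):
    """Greedy longest-processing-time packing: visit replicas in decreasing
    load order and place each one on the currently least-loaded GPU.
    Illustrative only; not SGLang's exact algorithm."""
    heap = [(0, gpu, []) for gpu in range(num_gpus)]  # (load, gpu_id, experts)
    heapq.heapify(heap)
    for expert, load in sorted(replica_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu, members = heapq.heappop(heap)
        members.append(expert)
        heapq.heappush(heap, (gpu_load + load, gpu, members))
    return {gpu: members for _, gpu, members in heap}

placement = pack_experts({"a": 4, "b": 3, "c": 2, "d": 1}, num_gpus=2)
print(placement)  # {0: ['a', 'd'], 1: ['b', 'c']} -- both GPUs carry load 5
```

The hierarchical policy would run a packing step like this twice: once at group-to-node granularity, then again per node at expert-to-GPU granularity.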
## How SGLang Implements EPLB
SGLang provides a robust implementation of EPLB, allowing for dynamic, online rebalancing of expert locations based on real-world traffic.
### Dynamic Rebalancing
You can enable dynamic rebalancing by setting the --enable-eplb flag. When enabled, the EPLBManager runs in the background. It periodically triggers a rebalance after a certain number of requests, configured with --eplb-rebalance-num-iterations. At each rebalance, it computes a new expert placement plan based on the latest usage statistics and updates the model’s expert locations on the fly.
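The trigger logic described above can be sketched as a small counter loop. The flag names are real, but `EplbManagerSketch`, `get_stats`, and `rebalance_fn` are hypothetical stand-ins for SGLang's internal EPLBManager machinery.

```python
# Sketch of a periodic rebalance trigger, assuming hypothetical helpers.
class EplbManagerSketch:
    def __init__(self, rebalance_num_iterations, get_stats, rebalance_fn):
        # rebalance_num_iterations mirrors --eplb-rebalance-num-iterations
        self.interval = rebalance_num_iterations
        self.get_stats = get_stats        # e.g. aggregated expert counts
        self.rebalance_fn = rebalance_fn  # computes + applies a new placement
        self.step = 0

    def on_forward_pass(self):
        self.step += 1
        if self.step % self.interval == 0:
            self.rebalance_fn(self.get_stats())

calls = []
mgr = EplbManagerSketch(3, lambda: [5, 1, 2], calls.append)
for _ in range(7):
    mgr.on_forward_pass()
print(len(calls))  # 2 rebalances triggered over 7 passes with interval 3
```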
### Expert Usage Recording

To make intelligent balancing decisions, SGLang needs to collect data on expert usage. The ExpertDistributionRecorder is responsible for this, and its behavior is controlled by the --expert-distribution-recorder-mode flag, which determines the granularity of the collected data. When --enable-eplb is set, this mode defaults to stat so that statistics are available for rebalancing. The available modes are:
- per_token: The most detailed mode. It records the specific expert choices for every single token processed by the model. While it provides the richest data, it also has the highest performance overhead. The raw, unaggregated data for each forward pass is stored.
- per_pass: SGLang records the aggregated expert usage counts for each individual forward pass. The data is not aggregated across different passes, giving you a snapshot of expert popularity for each batch of requests.
- stat: This mode also records the exact expert usage counts for each forward pass, but then aggregates these counts across multiple passes (the number of passes is determined by --expert-distribution-recorder-buffer-size). This provides a moving average of expert usage statistics and is the default when EPLB is enabled.
- stat_approx: Similar to stat, but gathers approximate statistics, usually from the DeepEP dispatcher. This method has lower overhead than stat but is less precise, especially for small batch sizes. It is a good choice when performance is critical.
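The stat mode's windowed aggregation can be sketched with a bounded buffer of per-pass counts. `StatBufferSketch` is illustrative only, not SGLang's actual ExpertDistributionRecorder.

```python
from collections import deque

class StatBufferSketch:
    """Illustrative 'stat'-style aggregation: keep per-pass expert counts in
    a fixed-size window (cf. --expert-distribution-recorder-buffer-size) and
    sum them. Not SGLang's actual recorder."""

    def __init__(self, buffer_size, num_experts):
        self.passes = deque(maxlen=buffer_size)  # oldest pass evicted first
        self.num_experts = num_experts

    def record_pass(self, counts):
        self.passes.append(list(counts))

    def aggregated(self):
        totals = [0] * self.num_experts
        for counts in self.passes:
            for i, c in enumerate(counts):
                totals[i] += c
        return totals

buf = StatBufferSketch(buffer_size=2, num_experts=2)
for counts in ([1, 0], [2, 2], [3, 1]):  # third pass evicts the first
    buf.record_pass(counts)
print(buf.aggregated())  # [5, 3]
```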
The collected statistics are then fed into the rebalancing algorithm to generate a new expert placement plan.
### Initializing with a Pre-computed Distribution
While SGLang can start with a simple default layout and learn a better one over time, you can also provide it with a pre-computed expert distribution to start with. The --init-expert-location flag allows you to specify a file path (.pt or .json) or a JSON string containing an expert layout. This is useful if you have already analyzed a representative workload offline and want the server to start immediately with a balanced configuration. If this flag is not set, it defaults to a trivial sequential layout.
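For illustration, a trivial sequential layout (the default described above) could be generated and serialized like this. The JSON shape shown is an assumption for the sketch; consult SGLang for the exact schema --init-expert-location expects.

```python
import json

# Sketch: build a trivial sequential expert layout, GPU by GPU.
# The JSON schema here is illustrative only, not SGLang's documented format.
num_experts, num_gpus = 8, 4
per_gpu = num_experts // num_gpus
layout = {gpu: list(range(gpu * per_gpu, (gpu + 1) * per_gpu))
          for gpu in range(num_gpus)}
print(json.dumps(layout))  # {"0": [0, 1], "1": [2, 3], "2": [4, 5], "3": [6, 7]}
```

An offline analysis of a representative workload would replace this sequential mapping with a balanced one before the server starts.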
## References and further reading
- SGLang Large Scale P/D + WideEP Deployment
- DeepSeek's EPLB repository