What is Cluster Launch Control (CLC)?
Cluster Launch Control (CLC) represents a significant advancement in GPU architecture, introduced by Blackwell. Primarily aimed at enhancing dynamic scheduling, CLC allows the kernel to launch a grid with an adaptive number of thread blocks. This dynamic approach mimics the efficiency seen in non-persistent kernels while harnessing two key benefits: fewer thread block launches, a perk of persistent kernels, and improved load balancing provided by hardware capabilities.
To illustrate its functionality, consider a straightforward General Matrix Multiplication (GEMM) kernel utilizing 32×32 output tiles across 144 streaming multiprocessors (SMs) available for processing. The innovative CLC modifies traditional scheduling techniques, making it possible to optimize workload distribution among the available SMs.
Fig-1. Non-persistent scheduling
With CLC enabled, a 32×32 grid launched from the host assigns Compute Thread Arrays (CTAs) 0 through 143 to SMs in a sequential mapping.
Fig-2. CLC assigns the initial CTAs to SMs
One of CLC’s standout features is its capability for dynamic scheduling, allowing the system to adapt to changing workloads and resource availability during execution. For instance, if additional SMs become available at runtime, they can immediately engage in stealing work. This capability significantly improves throughput and efficiency.
Fig-3. CLC steals work
What is TLX?
The Triton Low-level Extension (TLX) is an innovative tool designed for expert users who require granular control over GPU operations. TLX opens up a diverse array of functionalities, including:
- Hardware-specific intrinsics, such as
wgmma,async_copy, andbarrier. - Management of shared and local memory.
- Instruction-level scheduling and control.
- Cross-warpgroup synchronization.
By exposing low-level GPU primitives, TLX empowers advanced kernel development and enables architecture-specific optimizations. This offers developers a comprehensive toolkit for efficient programming, albeit with the responsibility of managing performance across varying hardware platforms.
CLC in TLX
Within TLX, CLC is facilitated through three primary APIs, enhancing the user experience while retaining performance.
-
Initialization:
tlx.clc_create_context(num_stages,num_consumers)allocates shared memory for the CLC, supporting pipelined workload stealing and multi-consumer scenarios. -
Producer:
tlx.clc_producer(context, k, p_producer)allows for the attempt to steal a workload stage, utilizing the returned context from the initialization API. -
Consumer:
tlx.clc_consumer(context, k, p_consumer)is used for CTA ID decoding.
During setup, the initialization API enables a streamlined process for both multi-consumer and multi-stage pipelining. Developers can leverage this structure to create efficient workflows that enhance GPU performance. The producer API interacts with barrier synchronization mechanisms to optimize throughput, while the consumer API efficiently decodes tile IDs for responsive processing.
Case Study: WS GEMM vs. CLC + WS GEMM
A comparative analysis between traditional Worker-Server (WS) GEMM and the enhanced CLC + WS GEMM reveals critical insights into performance optimization. In this case study, both methods executed with three WS regions demonstrate the efficiency gains from CLC integration.
Fig-4. Initialize context outside tlx.async_tasks and call producer API in WS region
Fig-5. Call consumer API in the epilogue WS region
The analysis identifies distinct variations in invocations between default and non-default worker groups (WGs). While the default WG utilizes both tlx.clc_producer and tlx.clc_consumer, alternatives such as the non-default WG rely on the consumer API alone, maximizing efficiency.
Fig-6. Call consumer API in the MMA WS region
Fig-7. Call consumer API in the TMA load WS region
This structured differentiation aids in achieving optimal load distribution, particularly critical for kernels with uneven workloads across thread blocks. Visualizing such efficiencies becomes apparent through occupancy heatmaps, where CLC’s impact is vivid.
Fig-8. Mirror the grid size used in non-persistent kernels
The heatmaps, indexing time (clock cycles) against SM IDs, demonstrate a marked improvement in load occupancy and reduced idle states, showcasing CLC’s ability to enhance operational throughput significantly.
Fig-9. SM-occupancy heatmaps between pipelined GEMM and CLC GEMM
Future ideal scenarios will leverage CLC to maximize load balancing even further, particularly in dynamic workloads where processing demands fluctuate.
Fig-10. SM-occupancy heatmap of an internal kernel with CLC enabled
Acknowledgments
This exploration of CLC and its significance in GPU architecture would not have been possible without the insightful discussions with industry experts like Bingyi Zhang from Nvidia. I also extend my gratitude to Srivatsan Ramesh and Yuanwei (Kevin) Fang from Meta for their invaluable contributions toward generating SM-occupancy heatmaps, aiding in the understanding of CLC’s practical applications.
Inspired by: Source










