cuBLASLt Grouped GEMM

The core API call is cublasLtMatmul, but it requires descriptor setup first: a cublasLtMatrixLayout_t for each matrix and a cublasLtMatmulDesc_t for the operation itself.

Grouped GEMM solves this by allowing you to define a "group" of these operations. The cuBLASLt kernel batches them into a single grid, which keeps the GPU saturated and reduces launch overhead.
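To make "batching into a single grid" concrete, here is a minimal host-side sketch of one way a linear block index can be mapped onto problems of different shapes. It is an illustration only, not cuBLASLt's internal scheduler; the names Problem, TileCoord, tiles_for, total_tiles, and locate are hypothetical, and a square output tile is assumed.

```cpp
#include <vector>

// One GEMM in the group: C (m x n) = A (m x k) * B (k x n).
// Hypothetical type for illustration; not part of cuBLASLt.
struct Problem { int m, n, k; };

// The work assigned to one thread block: which problem, and which output tile.
struct TileCoord { int problem, tile_row, tile_col; };

// Number of tile x tile output tiles needed to cover an m x n result.
int tiles_for(const Problem& p, int tile) {
    int rows = (p.m + tile - 1) / tile;  // ceiling division
    int cols = (p.n + tile - 1) / tile;
    return rows * cols;
}

// Grid size for the whole group: the sum of every problem's tile count.
int total_tiles(const std::vector<Problem>& group, int tile) {
    int total = 0;
    for (const Problem& p : group) total += tiles_for(p, tile);
    return total;
}

// Map a linear block index to (problem, tile_row, tile_col) by walking the
// group in order. Returns problem = -1 when block_id is past the last tile.
TileCoord locate(const std::vector<Problem>& group, int tile, int block_id) {
    for (int i = 0; i < static_cast<int>(group.size()); ++i) {
        int count = tiles_for(group[i], tile);
        if (block_id < count) {
            int cols = (group[i].n + tile - 1) / tile;
            return {i, block_id / cols, block_id % cols};
        }
        block_id -= count;  // skip past this problem's tiles
    }
    return {-1, 0, 0};
}
```

For a group with shapes (128, 256, 64), (40, 256, 64), and (300, 256, 64) and a tile size of 64, the problems need 8, 4, and 20 tiles, so a single grid of 32 blocks covers all three. Block 9 lands on the second problem and block 12 on the first tile of the third.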

// groupPlans holds 3 plans, so groupCount must not exceed 3.
cublasLtGroupedMatmulPlan_t groupPlans[3];
for (int i = 0; i < groupCount; i++) {
    // Each group gets its own row count m_arr[i]; n and k are shared.
    cublasLtGroupedMatmulPlanInit(handle, matmulDesc, &groupPlans[i],
                                  CUDA_R_16F, CUDA_R_16F, CUDA_R_16F, CUDA_R_32F,
                                  m_arr[i], n, k);
}

A single kernel launch for 1,024 GEMMs vs. 1,024 separate launches. On GPUs, kernel launch latency is measured in microseconds, but over thousands of operations it adds up. Grouped GEMM reduces the total to roughly the cost of one launch.
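The arithmetic behind that claim can be sketched in a few lines. The per-launch latency is an assumption to be replaced with a measured value for your GPU and driver, and launch_overhead_us is a hypothetical helper, not a cuBLASLt function.

```cpp
// Total launch overhead in microseconds for a given number of kernel launches.
// per_launch_us is an assumed figure; measure it on your own system.
double launch_overhead_us(int launches, double per_launch_us) {
    return launches * per_launch_us;
}
```

With an assumed 5 microseconds per launch, 1,024 separate launches cost 5,120 microseconds of overhead, while one grouped launch costs 5.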

Since each operation has its own descriptor, you store and compute exactly what you need, with no padding to a common shape. This saves memory bandwidth and avoids wasted computation.
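A rough way to see the saving is to count floating-point operations with and without padding. The sketch below assumes, as in the snippet earlier, that only the row count m varies per group while n and k are shared; gemm_flops, grouped_flops, and padded_flops are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// FLOPs for one m x n x k GEMM: one multiply and one add per inner-product term.
std::int64_t gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2 * m * n * k;
}

// Grouped: every problem runs at its own row count m_arr[i].
std::int64_t grouped_flops(const std::vector<int>& m_arr, int n, int k) {
    std::int64_t total = 0;
    for (int m : m_arr) total += gemm_flops(m, n, k);
    return total;
}

// Padded batched: every problem is padded up to the largest row count.
std::int64_t padded_flops(const std::vector<int>& m_arr, int n, int k) {
    if (m_arr.empty()) return 0;
    int max_m = *std::max_element(m_arr.begin(), m_arr.end());
    return static_cast<std::int64_t>(m_arr.size()) * gemm_flops(max_m, n, k);
}
```

For row counts of 128, 40, and 300 with n = 256 and k = 64, the grouped form performs 15,335,424 FLOPs against 29,491,200 for the padded form, so nearly half of the padded work is spent on rows that carry no data.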

cuBLASLt Grouped GEMM: Accelerating Irregular Matrix Workloads

Grouped GEMM removes this restriction, enabling developers to process "irregular" workloads, such as those found in Mixture-of-Experts (MoE) models or LoRA (Low-Rank Adaptation) fine-tuning, with significantly higher GPU efficiency.

Why Grouped GEMM?