News - CUDA 12.6 Release

| Library | Key Changes in CUDA 12.6 |
|---------|--------------------------|
| cuBLAS | New FP8 GEMM kernels for Hopper (up to 2x faster than 12.5). cublasGemmEx supports CUBLAS_COMPUTE_32I for integer GEMM. |
| cuDNN | Version 9.2.0 integrated. Adds FlashAttention-3 (FP8) support on H200. Grouped convolutions optimized for 4D tensors. |
| cuFFT | Support for half-precision R2C and C2R transforms up to 3D. Reduced memory footprint for multi-GPU transforms. |
| cuSPARSE | New sparse matrix-vector multiplication (SpMV) for block compressed sparse row (BSR) format with FP16/BF16. |
| NCCL | Included NCCL 2.21.5. Adds NVLS (NVLink SHARP) support for multi-node all-reduce. Improved ring/tree autotuning. |
| CUDA Math API | New __h2bf16 and __bf162h intrinsics for Hopper. |
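To make the integer GEMM mode mentioned for cublasGemmEx more concrete, here is a minimal CPU reference of what a GEMM with 32-bit integer accumulation computes. It is a sketch under assumptions, not a cuBLAS call: the function name `gemm_int8_ref` is hypothetical, the inputs are assumed to be signed 8-bit values, and the matrices are assumed to be tightly packed in column-major order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for C = A * B with 8-bit inputs and 32-bit accumulation.
// All matrices are column-major and tightly packed, so element (row, col)
// of a matrix with `rows` rows lives at index row + col * rows.
// A is m-by-k, B is k-by-n, and the returned C is m-by-n.
std::vector<std::int32_t> gemm_int8_ref(const std::vector<std::int8_t>& a,
                                        const std::vector<std::int8_t>& b,
                                        std::size_t m, std::size_t n,
                                        std::size_t k) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<std::int32_t> c(m * n, 0);
    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t p = 0; p < k; ++p) {
            // Widen before multiplying so the product cannot wrap in 8 bits.
            const std::int32_t b_val = b[p + col * k];
            for (std::size_t row = 0; row < m; ++row) {
                c[row + col * m] +=
                    static_cast<std::int32_t>(a[row + p * m]) * b_val;
            }
        }
    }
    return c;
}
```

A reference like this can be useful for checking GPU results on small inputs. The wide accumulator matters because even four products of 127 * 127 already exceed the range of a 16-bit integer.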

NVIDIA maintains a frequent release cycle for the CUDA Toolkit (see the CUDA Toolkit Archive), with 12.6 seeing the following milestones:

| Version | Release Date | Notes |
|---------|--------------|-------|
| CUDA 12.6 | August 2024 | Initial General Availability (GA) |
| CUDA 12.6.1 | August 2024 | Performance updates and initial 12.6 patches |
| CUDA 12.6.2 | October 2024 | Expanded target APIs for CUPTI |
| CUDA 12.6.3 | November 2024 | Final major patch for the 12.6 branch |

Source: CUDA Toolkit Archive, NVIDIA Developer (https://developer.nvidia.com)
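When checking which of these releases a build targets, it helps to know that the CUDA runtime reports its version as a single integer, 1000 * major + 10 * minor (so 12.6 appears as 12060 in `CUDART_VERSION` and in the value returned by `cudaRuntimeGetVersion`). The sketch below decodes that integer in plain C++; the struct and function names are hypothetical helpers, not part of the toolkit.

```cpp
#include <string>

// Decoded form of the integer version used by the CUDA runtime.
struct CudaVersion {
    int major_version;
    int minor_version;
};

// The runtime encodes versions as 1000 * major + 10 * minor,
// for example 12060 for CUDA 12.6.
CudaVersion decode_cuda_version(int encoded) {
    return CudaVersion{encoded / 1000, (encoded % 1000) / 10};
}

// Render the decoded version as "major.minor".
std::string format_cuda_version(int encoded) {
    const CudaVersion v = decode_cuda_version(encoded);
    return std::to_string(v.major_version) + "." +
           std::to_string(v.minor_version);
}
```

Note that this encoding carries no update number, so it should not be relied on to tell the patch releases in the table apart.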

CUDA 12.6 is likely to be followed either by a further 12.x release (expected late Q4 2024) or by a direct jump to CUDA 13 (speculated for 2025).

Report compiled from NVIDIA CUDA 12.6 Release Notes, NVIDIA Developer Blog (August 2024), and technical documentation.

The release of CUDA 12.6 marks a significant, albeit nuanced, milestone in the evolution of NVIDIA’s parallel computing platform. While previous major releases often introduced radical new features or architectural shifts, CUDA 12.6 is best characterized as a foundational release. It serves as the stable bedrock required to support NVIDIA’s next-generation Blackwell architecture while simultaneously refining the developer experience (DX) through enhanced debugging tools and streamlined deployment workflows.


Updates to cuBLAS include fixes for large leading dimensions on Compute Capability 9.0 (Hopper) and 10.x (Blackwell) architectures.
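The source does not say what the underlying defect was, but large leading dimensions are a common source of index overflow in general. As an illustration only, the following sketch computes the column-major offset of an element with 64-bit arithmetic and checks whether it would still fit in a signed 32-bit integer. Both helper names are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Offset of element (row, col) in a column-major matrix whose consecutive
// columns are `ld` elements apart (the leading dimension). Computed in
// 64 bits so that col * ld cannot wrap for large matrices.
constexpr std::int64_t column_major_offset(std::int64_t row,
                                           std::int64_t col,
                                           std::int64_t ld) {
    return row + col * ld;
}

// True when an offset is too large to be held in a signed 32-bit index.
constexpr bool exceeds_int32(std::int64_t offset) {
    return offset > std::numeric_limits<std::int32_t>::max();
}
```

For example, with a leading dimension of 70000 the first element of column 70000 sits at offset 4,900,000,000, which is beyond the 32-bit limit of 2,147,483,647.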

Another major focus of this release is the maturation of the debugging and analysis toolchain. As GPU code becomes more complex, handling millions of threads across massive datasets, finding bottlenecks becomes considerably harder. CUDA 12.6 brings updates to tools like cuda-gdb and the Nsight suite, offering improved visibility into how kernels execute on the hardware. These improvements allow for more precise profiling, which helps developers locate and remove the remaining inefficiencies in their code. In an era where optimizing a large language model (LLM) by even a few percentage points can result in millions of dollars in energy and compute savings, these developer tools are as valuable as raw hardware speed.

| Issue | Workaround |
|-------|------------|
| cuBLAS FP8 kernels may cause incorrect results on H200 with TMA (Tensor Memory Accelerator) enabled | Disable TMA via environment variable CUBLAS_TMA=0 |
| NVCC with -arch=native fails on some Windows 11 24H2 builds | Explicitly specify -arch=sm_90 for Hopper |
| Multi-GPU cudaMemcpyPeerAsync can deadlock if graphs are used | Use blocking peer copies or separate streams |
| Nsight Systems fails to profile WSL2 kernels >1 second | Upgrade to WSL2 kernel 5.15+ |