SageAttention
SageAttention is a high-performance, plug-and-play quantized attention library that accelerates inference for large language, image, and video models. The work was recognized as a spotlight or featured track at ICLR 2025, ICML 2025, and NeurIPS 2025, and it reports 2x to 5x speedups over FlashAttention variants without degrading end-to-end accuracy metrics. The project spans three versions: SageAttention uses accurate 8-bit quantization; SageAttention2 adds thorough outlier smoothing and per-thread INT4 quantization; and SageAttention3 explores microscaling FP4 attention and 8-bit training. The library provides optimized kernels for Ampere, Ada, and Hopper GPUs, as well as support for the RTX 5090, and it applies INT8 quantization to the QK computation and FP8 quantization to the PV computation. Other key capabilities include a two-level accumulation strategy that preserves precision in low-precision matrix multiplications, and compatibility with torch.compile in non-cudagraphs mode for distributed inference.
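To make the idea of INT8 quantization for the QK computation concrete, here is a minimal pure-Python sketch. It is not SageAttention's kernel or API: the function names (`quantize_int8`, `qk_scores_int8`) are hypothetical, and the per-row symmetric scaling is an assumption chosen for simplicity. The sketch only shows the general pattern of quantizing both operands, accumulating in integers, and dequantizing with the product of the two scales.

```python
import random


def quantize_int8(row):
    """Symmetric INT8 quantization of one row (hypothetical helper).

    Returns (integer values in [-127, 127], scale) with row[i] ~= q[i] * scale.
    """
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def qk_scores_int8(Q, K):
    """Approximate Q @ K^T with INT8 operands and integer accumulation."""
    q_rows = [quantize_int8(r) for r in Q]
    k_rows = [quantize_int8(r) for r in K]
    scores = []
    for q_int, q_scale in q_rows:
        out = []
        for k_int, k_scale in k_rows:
            acc = sum(a * b for a, b in zip(q_int, k_int))  # exact integer sum
            out.append(acc * q_scale * k_scale)  # dequantize once per score
        scores.append(out)
    return scores


def qk_scores_ref(Q, K):
    """Full-precision reference for Q @ K^T."""
    return [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]


def max_abs_error(A, B):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


random.seed(0)
Q = [[random.gauss(0, 1) for _ in range(64)] for _ in range(8)]
K = [[random.gauss(0, 1) for _ in range(64)] for _ in range(8)]

err = max_abs_error(qk_scores_int8(Q, K), qk_scores_ref(Q, K))
print(f"max abs error vs. full precision: {err:.4f}")
```

Because the integer dot product is exact, all of the approximation error in this sketch comes from rounding the operands. That is why a real implementation has to choose the scaling granularity carefully and handle outliers, which the sketch does not attempt.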