Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile | NVIDIA Technical Blog
…The autotuner discovers that 64×64 tiles are best for sequences ≤2,048, then transitions to larger tiles for longer sequences. This delivers 45% additional performance at short sequences compared to fixed…
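The selection step described in the excerpt, choosing a tile shape per sequence length, can be sketched in a few lines of Python. This is a minimal illustration, not the blog's autotuner and not a CUDA Tile API. The names `autotune` and `toy_cost` and the constants `HEADS`, `SLOTS`, and `EFFICIENCY` are all hypothetical. The cost model uses made-up numbers rather than measurements, picked by hand so that the crossover falls near the one the text describes.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Tile = Tuple[int, int]  # (query rows per tile, key/value columns per tile)


def autotune(
    measure: Callable[[int, Tile], float],
    seq_lens: Sequence[int],
    candidates: Sequence[Tile],
) -> Dict[int, Tile]:
    """For each sequence length, keep the candidate tile with the lowest cost."""
    return {n: min(candidates, key=lambda t: measure(n, t)) for n in seq_lens}


# Everything below is a hypothetical stand-in for real kernel timings.
HEADS = 8                         # assumed number of attention heads
SLOTS = 256                       # assumed number of blocks the GPU runs at once
EFFICIENCY = {64: 1.0, 128: 1.5}  # assumed relative throughput per tile size


def toy_cost(seq_len: int, tile: Tile) -> float:
    """Toy model: larger tiles do more work per block at higher efficiency,
    but produce fewer blocks, which can leave the GPU under-filled."""
    tm, tn = tile
    blocks = math.ceil(seq_len / tm) * HEADS   # one block per query tile per head
    waves = math.ceil(blocks / SLOTS)          # rounds needed to schedule all blocks
    per_block = math.ceil(seq_len / tn) * tm * tn / EFFICIENCY[tm]
    return waves * per_block


table = autotune(toy_cost, [512, 1024, 2048, 4096, 8192], [(64, 64), (128, 128)])
for n, tile in table.items():
    print(n, tile)
```

In a real setting, `measure` would launch the attention kernel with the given tile shape and return a measured time, typically after a warm-up run and averaged over several repetitions. The resulting table can then be cached and consulted at dispatch time, so the search cost is paid once per sequence-length bucket.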