FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs, (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory.
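
As a rough illustration of the first tweak, the sketch below is a minimal NumPy version of a tiled attention forward pass with an online softmax, in which the output accumulator is corrected with the running max per block but divided by the softmax normalizer only once at the end, rather than renormalized in every block. The function and parameter names (flash_attention_forward, block_k) are illustrative and not from the paper's CUDA implementation, and the loop is sequential, so it shows the FLOP-saving restructuring rather than the thread-block or warp-level parallelism.

```python
import numpy as np

def flash_attention_forward(Q, K, V, block_k=64):
    """Tiled attention over key/value blocks with an online softmax.
    Q, K, V: arrays of shape (N, d). Returns softmax(Q K^T / sqrt(d)) V."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))          # unnormalized output accumulator
    m = np.full(N, -np.inf)       # running row-wise max of the scores
    l = np.zeros(N)               # running softmax denominator

    for start in range(0, N, block_k):
        Kj = K[start:start + block_k]      # current key block
        Vj = V[start:start + block_k]      # current value block
        S = (Q @ Kj.T) * scale             # scores for this block, (N, Bk)

        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])     # block-local softmax numerator
        alpha = np.exp(m - m_new)          # correction factor for the old max

        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ Vj    # rescale old accumulator, add new block
        m = m_new

    # Single final division by the normalizer: the accumulator is never
    # divided inside the loop, which trims per-block non-matmul FLOPs.
    return O / l[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 256, 64
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

    # Reference: standard (untiled) softmax attention.
    S = Q @ K.T / np.sqrt(d)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (P / P.sum(axis=1, keepdims=True)) @ V

    assert np.allclose(flash_attention_forward(Q, K, V), ref, atol=1e-6)
```

The same deferred-normalization idea carries over to the kernel setting, where avoiding per-block divisions matters because non-matmul operations run on much slower hardware units than the matmuls themselves.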