Accelerating Long-Context Model Training in JAX and XLA | NVIDIA Technical Blog
…During attention computation, each device processes its local portion of the sequence, exchanges key-value (KV) tensors with neighboring devices in a ring topology, and incrementally computes attention scores as KV blocks circulate…
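
The three steps above can be illustrated with a small single-host simulation. This is a minimal sketch rather than the implementation the post describes: the leading array axis stands in for the ring of devices, `jnp.roll` stands in for the neighbor-to-neighbor exchange (on real hardware this would be a collective such as `jax.lax.ppermute`), and attention is non-causal with no masking. The function names `ring_attention_sim` and `full_attention` are hypothetical and chosen only for this example.

```python
import jax
import jax.numpy as jnp


def ring_attention_sim(q, k, v):
    """Simulate ring attention on one host (illustrative only).

    q, k, v have shape [num_devices, block_len, head_dim]. Axis 0 plays the
    role of the device ring; this layout is an assumption of the sketch.
    """
    n_dev, blk, head_dim = q.shape
    scale = head_dim ** -0.5

    # Running statistics for a numerically stable streaming softmax.
    m = jnp.full((n_dev, blk), -jnp.inf)  # running row-wise max of the scores
    l = jnp.zeros((n_dev, blk))           # running softmax denominator
    o = jnp.zeros_like(q)                 # running unnormalized output

    for _ in range(n_dev):
        # Each "device" scores its local queries against the KV block it holds.
        s = jnp.einsum("dqh,dkh->dqk", q, k) * scale
        m_new = jnp.maximum(m, s.max(axis=-1))
        p = jnp.exp(s - m_new[..., None])
        # Rescale earlier partial results to the new running max.
        corr = jnp.exp(m - m_new)
        l = l * corr + p.sum(axis=-1)
        o = o * corr[..., None] + jnp.einsum("dqk,dkh->dqh", p, v)
        m = m_new
        # Ring exchange: every device hands its KV block to the next neighbor.
        k = jnp.roll(k, shift=1, axis=0)
        v = jnp.roll(v, shift=1, axis=0)

    # Normalize once every KV block has visited every device.
    return o / l[..., None]


def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    n_dev, blk, head_dim = q.shape
    qf, kf, vf = (x.reshape(n_dev * blk, head_dim) for x in (q, k, v))
    w = jax.nn.softmax(qf @ kf.T * head_dim ** -0.5, axis=-1)
    return (w @ vf).reshape(n_dev, blk, head_dim)
```

After `num_devices` rotations every query block has seen every KV block, so the streamed result should match ordinary attention over the full sequence up to floating-point error, while each simulated device only ever holds one KV block at a time.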