Advancing Emerging Optimizers for Accelerated LLM Training with NVIDIA Megatron | NVIDIA Technical Blog
…
Reduce-scatter gradient: A reduce-scatter is performed over all gradients and each GPU gets a portion of gradients corresponding to the parameters it “owns.”
Local updates: Each GPU updates only the specific portion of the model parameters it “owns.”
AllGather parameters: After the update, GPUs …
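
The three steps named above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not Megatron's implementation: each "GPU" is a plain Python list, the reduce-scatter sums gradients (a real setup may average instead), the local update is plain SGD standing in for whatever optimizer is actually used, and the all-gather follows the standard collective semantics in which every rank ends up with every shard. The function names (`reduce_scatter`, `local_update`, `all_gather`, `sharded_step`) are hypothetical and chosen only for illustration.

```python
def reduce_scatter(grads_per_rank):
    """Sum gradients elementwise across ranks; rank r keeps only shard r.

    Assumption: summation (not averaging) and evenly divisible shards.
    """
    world_size = len(grads_per_rank)
    n = len(grads_per_rank[0])
    assert n % world_size == 0, "sketch assumes evenly divisible parameters"
    shard = n // world_size
    summed = [sum(g[i] for g in grads_per_rank) for i in range(n)]
    return [summed[r * shard:(r + 1) * shard] for r in range(world_size)]


def local_update(param_shard, grad_shard, lr):
    """Update only the owned shard. Plain SGD is a stand-in optimizer."""
    return [p - lr * g for p, g in zip(param_shard, grad_shard)]


def all_gather(shards):
    """Give every rank the concatenation of all ranks' shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


def sharded_step(params, grads_per_rank, lr=0.1):
    """One simulated step: reduce-scatter, local update, all-gather.

    Returns one full parameter list per simulated rank.
    """
    world_size = len(grads_per_rank)
    shard = len(params) // world_size
    grad_shards = reduce_scatter(grads_per_rank)
    new_shards = [
        local_update(params[r * shard:(r + 1) * shard], grad_shards[r], lr)
        for r in range(world_size)
    ]
    return all_gather(new_shards)


if __name__ == "__main__":
    params = [1.0, 2.0, 3.0, 4.0]
    # Two simulated ranks, each with its own locally computed gradient.
    grads = [[1.0, 1.0, 1.0, 1.0], [1.0, 3.0, 5.0, 7.0]]
    for rank, full in enumerate(sharded_step(params, grads, lr=0.5)):
        print(rank, full)
```

In this toy setup each rank runs the update on only half of the parameters, yet after the gather step both ranks hold identical full parameter lists, which is the property the three-step scheme is meant to preserve.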