Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core | NVIDIA Technical Blog
…Even though these sequences fit on a single GPU, they are still partitioned because the same batch contains a longer sequence, which adds unnecessary context-parallel (CP) communication overhead. Usually, computation hides CP communication…
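To make the over-partitioning concrete, here is a minimal sketch, not Megatron Core code. It assumes a hypothetical per-GPU token budget (`tokens_per_gpu`), assumes CP sizes are powers of two, and assumes a static setup where the longest sequence in the batch decides the CP size for every sequence. The helper names `min_cp_size` and `cp_report` are made up for illustration.

```python
import math


def min_cp_size(seq_len: int, tokens_per_gpu: int) -> int:
    """Smallest power-of-two CP size whose per-rank shard fits the budget.

    Both the token budget and the power-of-two restriction are
    illustrative assumptions, not values taken from Megatron Core.
    """
    needed_ranks = math.ceil(seq_len / tokens_per_gpu)
    size = 1
    while size < needed_ranks:
        size *= 2
    return size


def cp_report(seq_lens, tokens_per_gpu):
    """Compare the CP size each sequence needs with a static, batch-wide one."""
    # Static CP: the longest sequence in the batch dictates the CP size.
    static_cp = max(min_cp_size(n, tokens_per_gpu) for n in seq_lens)
    rows = []
    for n in seq_lens:
        needed = min_cp_size(n, tokens_per_gpu)
        rows.append(
            {
                "seq_len": n,
                "needed_cp": needed,
                "static_cp": static_cp,
                # True when the sequence is split across more ranks than it needs.
                "over_partitioned": needed < static_cp,
            }
        )
    return static_cp, rows


# Hypothetical batch: two short sequences packed with one long sequence.
static_cp, rows = cp_report([2048, 4096, 32768], tokens_per_gpu=8192)
for row in rows:
    print(row)
```

In this made-up batch, the two short sequences would fit on one GPU but are still split across the CP size that the 32768-token sequence requires, so they pay for CP communication they do not need.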