Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core | NVIDIA Technical Blog
… At the same time, activation memory grows linearly, \(\mathcal{O}(S)\), meaning even small variances can lead to major imbalances in compute and memory across DP ranks and micro-batches. To balance a large sample’s workload, we may pack small samples together, but this causes severe memory pressure. …
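The packing trade-off can be illustrated with a little arithmetic. The sketch below assumes that attention compute scales quadratically with sequence length (an assumption here, since the excerpt only states the linear memory term) and that activation memory scales linearly. The function names and the sequence lengths 8192 and 1024 are hypothetical, chosen only for illustration.

```python
def attention_cost(seq_len: int) -> int:
    """Relative attention compute; ASSUMED quadratic in sequence length."""
    return seq_len ** 2


def activation_memory(seq_len: int) -> int:
    """Relative activation memory; linear in sequence length, O(S)."""
    return seq_len


def pack_to_match_compute(long_len: int, short_len: int):
    """Count how many short samples match one long sample's attention compute,
    and how much more activation memory the packed batch then needs."""
    n_short = attention_cost(long_len) // attention_cost(short_len)
    packed_tokens = n_short * short_len
    memory_ratio = activation_memory(packed_tokens) / activation_memory(long_len)
    return n_short, packed_tokens, memory_ratio


n_short, packed_tokens, memory_ratio = pack_to_match_compute(8192, 1024)
print(f"{n_short} short samples, {packed_tokens} tokens, "
      f"{memory_ratio:.0f}x the activation memory")
```

Under these assumptions, matching the compute of one 8192-token sample takes 64 samples of 1024 tokens, which is 65,536 packed tokens and roughly 8x the activation memory of the long sample. That gap is the memory pressure the passage describes.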