Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel | NVIDIA Technical Blog
…Refers to specially registered GPU memory that can be accessed by kernels on other ranks. It is the only global static buffer. Registration depends on the scenario: cross-node communication registers GPU…