Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel | NVIDIA Technical Blog
…Dynamic routing mechanisms cause some “hot experts” to receive more tokens than average, while “cold experts” are underutilized, resulting in uneven computing load across devices and wasted computing power. This problem becomes…
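To make the imbalance concrete, here is a minimal sketch that counts how many token slots each expert receives under top-2 routing and reports the ratio of the busiest expert's load to the mean load. The routing table, the function names `tokens_per_expert` and `imbalance_ratio`, and the one-expert-per-device assumption are all hypothetical illustrations, not taken from the blog post or any library.

```python
from collections import Counter


def tokens_per_expert(routing, num_experts):
    """Count how many token slots each expert receives.

    routing: one tuple of expert ids per token (the router's top-k picks).
    """
    counts = Counter(e for picks in routing for e in picks)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance_ratio(counts):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Hypothetical toy routing: 8 tokens, top-2 routing over 4 experts.
# Expert 0 is "hot" (picked by every token); expert 3 is "cold".
routing = [(0, 1), (0, 2), (0, 1), (0, 3), (0, 1), (0, 2), (1, 0), (0, 1)]

counts = tokens_per_expert(routing, num_experts=4)
ratio = imbalance_ratio(counts)
print(counts, ratio)  # [8, 5, 2, 1] 2.0
```

Assuming one expert per device, a ratio of 2.0 means the device hosting the hot expert processes twice the average load, while the device hosting expert 3 handles a quarter of it and sits idle for most of the step.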
