Real-Time Performance Monitoring and Faster Debugging with NCCL Inspector and Prometheus | NVIDIA Technical Blog
…Normal workflow Figure 5 shows the AllGather bus bandwidth for the mixed network + NVLink collectives in one of the experiments. The compute performance for this AI pretraining workload was ~310 TFLOPs/GPU…