

People also ask

What is the impact of segment size on node availability?

To study the importance of setting --segment appropriately, one can use a simplified mathematical model that demonstrates the impact of the segment size on the effective available cluster capacity for a given job. Administrators need to be aware of how segment size can affect node availability. You can also observe the impact of --segment=9: the expected usable capacity degrades quickly as the node unavailability rate λ increases, since having only a single unavailable node means the domain can only contribute nine nodes for jobs using --segment=9. Whereas for --segment=16, a domain will cont

Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling | NVIDIA Technical Blog
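
The snippet above refers to a simplified model without showing it. One possible sketch, not necessarily the model used in the blog post, is given below. It assumes that each domain holds 18 compute nodes, that λ is a per-node probability of being unavailable, and that nodes fail independently. The names `DOMAIN_NODES` and `expected_usable_nodes` are hypothetical and introduced only for illustration.

```python
from math import comb

# Assumption: 18 compute nodes per NVLink domain.
DOMAIN_NODES = 18


def expected_usable_nodes(segment, unavail, domain_nodes=DOMAIN_NODES):
    """Expected number of nodes one domain can contribute to a job
    that requests --segment=<segment>.

    Assumption: each node is unavailable independently with
    probability `unavail` (the λ of the text), so the number of
    available nodes follows a binomial distribution.
    """
    expected = 0.0
    for up in range(domain_nodes + 1):
        # Probability that exactly `up` nodes are available.
        p = (comb(domain_nodes, up)
             * (1 - unavail) ** up
             * unavail ** (domain_nodes - up))
        # Only whole segments can be placed inside a domain.
        expected += p * (up // segment) * segment
    return expected


if __name__ == "__main__":
    for lam in (0.0, 0.02, 0.05, 0.10):
        row = ", ".join(
            f"segment={s}: {expected_usable_nodes(s, lam):.2f}"
            for s in (9, 16)
        )
        print(f"lambda={lam:.2f} -> {row}")
```

Under these assumptions, a domain with every node available contributes 18 nodes at --segment=9 but only 16 at --segment=16. As soon as a single node is unavailable, the --segment=9 contribution drops to nine nodes, while --segment=16 still yields 16 as long as no more than two nodes are unavailable.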
What is topology-aware job scheduling?

Topology-aware job scheduling allows a job scheduler such as Slurm to make resource allocation decisions based on the cluster’s physical network layout, such as the hierarchy of switches and racks. The scheduler should preserve locality, keeping workloads within the same NVLink domain whenever possible. In addition, because multiple training or inference jobs can fit in a group of NVL72 racks, the scheduler must provide efficient bin-packing to avoid resource fragmentation. The longstanding Slurm topology/tree plugin provides topology-aware scheduling for large clusters, but its best-effort ap

Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling | NVIDIA Technical Blog
How do cluster segmentation and job scheduling work on GB200 NVL72?

As clusters grow in scale and complexity, managing GPU resources becomes critical for achieving both high utilization and predictable performance. The GB200 NVL72 system introduces larger AI job segment sizes and fine-grained scheduling control, enabling operators to align segment configurations with workload needs. Together with GB200 NVL72-aware scheduling extensions in the Slurm workload manager, this approach balances large and small jobs to maximize efficiency even in the presence of hardware faults.

Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling | NVIDIA Technical Blog
What is the best job scheduling approach for GB200 NVL72?

Based on our simulation results and performance testing, we recommend a scheduling approach for NVIDIA GB200 NVL72 clusters that prioritizes large job performance while maintaining high utilization. Large jobs of 64 GPUs or more should be given access to the maximum number of NVLink domains, using segment sizing to ensure proportional GPU allocation across domains. Segment-based scheduling is essential for aligning resources with workload patterns. For jobs of 32 nodes or more, a segment size of 16 is recommended if the application can benefit from it, while smaller jobs are better suited to s

Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling | NVIDIA Technical Blog
