Search: first-party performance

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile | NVIDIA Technical Blog

… First, for best performance, the input and output arrays should only be accessed through their respective pointers while the kernel is running. …

May 26, 2026 · Jonathan Bentz

Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling | NVIDIA Technical Blog

… By doing this, ComputeDomains make the high-performance fabric first-class in scheduling. …

Apr 7, 2026 · Ryan Prout

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads | NVIDIA Technical Blog

… One workload can’t impact the performance or memory stability of another. …

Mar 25, 2026 · Sagar Desai

NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates | NVIDIA Technical Blog

… PCG64 is the default PRNG in NumPy and provides a good balance between quality and performance. #include … #include … __global__ void sample_kernel() { cuda::pcg64 rng(threadIdx.x); cuda::std::normal_distribution dist(0.0f, 1.0f); float sample = dist(rng); } Search: cub::DeviceFind::FindIf CCCL 3.3 adds cub::D… …

May 26, 2026 · Jonathan Bentz

Advancing Emerging Optimizers for Accelerated LLM Training with NVIDIA Megatron | NVIDIA Technical Blog

… Check out the Megatron Bridge performance recipes. …

Apr 22, 2026 · Hao Wu

Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling | NVIDIA Technical Blog

… For example, to define two GB200 NVL72 domains, use the following script:
    ---
    - topology: gb200-nvl72
      cluster_default: true
      block:
        block_sizes:
          - 18
        blocks:
          - block: block01
            nodes: node[0001-0018]
          - block: block02
            nodes: node[0019-0036]
The Slurm topology/block plugin supports multiple levels of hiera… …

May 7, 2026 · Felix Abecassis

Scaling the AI-Ready Data Center with NVIDIA RTX PRO 4500 Blackwell Server Edition and NVIDIA vGPU 20 | NVIDIA Technical Blog

… To do this, first ensure the VM is powered off. …

Apr 22, 2026 · Phoebe Lee

Controlling Floating-Point Determinism in NVIDIA CCCL | NVIDIA Technical Blog

… Determinism performance comparison: The level of determinism selected affects the performance of cub::DeviceReduce. Not-guaranteed determinism, with its relaxed requirements, provides the highest performance. …

Mar 5, 2026 · Nader Al Awar

Accelerating Long-Context Model Training in JAX and XLA | NVIDIA Technical Blog

… Communication backend comparison: Each configuration was evaluated with two communication backends: an NCCL baseline and an NVSHMEM-enabled implementation. Measurements: TFLOP/s per device: GPU computational throughput; Step time (seconds): time per training iteration; Speedup: relative performance improvement … …

Feb 3, 2026 · Sevin Fide Varoglu

Removing the Guesswork from Disaggregated Serving | NVIDIA Technical Blog

… HiSim also aids HiCache architecture exploration and cost/performance optimization through three-level KV cache design (e.g., L2 size, prefetch/eviction policy, L3 bandwidth needs, write-through vs write-back) to find the best cost–performance point. …

Mar 9, 2026 · Tianhao Xu
