Showing top 57 results for "Hardware/support requests"

People also ask

What metrics should you measure for LLM inference performance?

The prerequisite for sizing and TCO estimation is benchmarking the performance of each deployment unit, e.g., an inference server. The goal of this step is to measure the throughput a system can produce under load, and at what latency. These throughput and latency metrics, together with quality-of-service requirements (e.g., max latency) and expected peak demand (e.g., max concurrent users or requests per second), help estimate the required hardware, that is, size the deployment. In turn, sizing information is a prerequisite for estimating the total cost of ownership (TCO) of the given s
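
The sizing step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the calculator from the article: the `BenchmarkPoint` fields, the function names, the use of p95 latency as the constraint, and all numbers are hypothetical. It picks the highest benchmarked throughput that still meets the latency limit, then divides peak demand by it.

```python
import math
from dataclasses import dataclass


@dataclass
class BenchmarkPoint:
    """One measured load level for a single model instance (hypothetical schema)."""
    concurrency: int
    requests_per_second: float
    p95_latency_s: float


def max_rps_within_latency(points, max_latency_s):
    """Highest measured throughput whose latency stays within the limit."""
    eligible = [p.requests_per_second for p in points
                if p.p95_latency_s <= max_latency_s]
    if not eligible:
        raise ValueError("no benchmark point meets the latency constraint")
    return max(eligible)


def size_deployment(points, max_latency_s, peak_rps, instances_per_server):
    """Return (model instances, servers) needed to serve peak demand."""
    per_instance_rps = max_rps_within_latency(points, max_latency_s)
    instances = math.ceil(peak_rps / per_instance_rps)
    servers = math.ceil(instances / instances_per_server)
    return instances, servers


# Illustrative numbers only, not measurements from any real system.
points = [
    BenchmarkPoint(concurrency=1, requests_per_second=2.0, p95_latency_s=0.5),
    BenchmarkPoint(concurrency=8, requests_per_second=10.0, p95_latency_s=1.2),
    BenchmarkPoint(concurrency=32, requests_per_second=20.0, p95_latency_s=3.0),
]
instances, servers = size_deployment(
    points, max_latency_s=2.0, peak_rps=95.0, instances_per_server=4
)
print(instances, servers)
```

With these made-up inputs, the 32-concurrency point is excluded because it exceeds the latency limit, so each instance is credited with 10 requests per second. A cost estimate would then multiply the server count by a per-server cost, which is left out here because the source gives no cost figures.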

LLM Inference Benchmarking: How Much Does Your LLM Inference Cost? | NVIDIA Technical Blog

… By using the latency constraint and peak requests per second, developers can calculate the required number of model instances and servers, and then build a TCO calculator to estimate hardware and software costs. …

Jun 18, 2025 · Vinh Nguyen

Enhancing Distributed Inference Performance with the NVIDIA Inference Transfer Library | NVIDIA Technical Blog

… Finally, while there is a need for heterogeneous hardware support in terms of memory and storage, there can be heterogeneity in compute hardware as well. Handling each of these unique hardware components can become cumbersome. …

Mar 9, 2026 · Seonghee Lee

How NVIDIA Dynamo 1.0 Powers Multi-Node Inference at Production Scale | NVIDIA Technical Blog

… NVIDIA Dynamo 1.0 delivers a mature, production-grade distributed inference framework for large-scale, multi… …

Mar 16, 2026 · Amr Elmeleegy

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads | NVIDIA Technical Blog

… Consolidating support models like ASR and TTS provides a strategic path to maximize hardware utilization while maintaining end-to-end responsiveness. …

Mar 25, 2026 · Sagar Desai

Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM | NVIDIA Technical Blog

… For example: Nemotron-3-Nano-30B sustained 1,025 tokens/s at 256 concurrent requests with dynamic fractions, compared to a static-fraction ceiling of 721 tokens/s at just four concurrent requests before instability (1.4x). …

Feb 27, 2026 · Shwetha Krishnamurthy

How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models | NVIDIA Technical Blog

… He takes care of model optimization across target hardware and also maintains accelerator infrastructure at Sarvam. He likes to dive deep into model architecture, kernels, and hardware. …

Feb 18, 2026 · Utkarsh Uppal

Streaming Tokens and Tools: Multi-Turn Agentic Harness Support in NVIDIA Dynamo | NVIDIA Technical Blog

… Codex sends a Responses reasoning object when the selected model metadata says reasoning summaries are supported. In that path, Codex also requests reasoning.encrypted_content so the reasoning state can be replayed across turns. …

May 8, 2026 · Matej Kosec

DynoSim: Simulating the Pareto Frontier | NVIDIA Technical Blog

… The goal is not a purely analytical estimate and not a bit-exact hardware emulator. …

May 29, 2026 · Yongming Ding

Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark | NVIDIA Technical Blog

… Two, four, or even eight subagents concurrently working through requests can make use of the strong concurrency capabilities in DGX Spark. With support from frameworks that handle concurrency well, such as NVIDIA TensorRT LLM, vLLM, and SGLang, multiagent workloads run smoothly on NVIDIA DGX Spark. …

Mar 16, 2026 · Allen Bourgoyne

Advancing AI Infrastructure for Agentic AI with NVIDIA DOCA In-Silicon Security | NVIDIA Technical Blog

… This establishes a consistent, hardware-enforced security foundation across the platform. …

Jun 1, 2026 · Ofir Arkin
