Showing top 111 results for "AI cost and performance"

People also ask

What metrics should you measure for LLM inference performance?

The prerequisite for sizing and TCO estimation is benchmarking the performance of each deployment unit, e.g., an inference server. The goal of this step is to measure the throughput a system can produce under load, and at what latency. These throughput and latency metrics, together with quality-of-service requirements (e.g., max latency) and expected peak demand (e.g., max concurrent users or requests per second), help estimate the required hardware, that is, size the deployment. In turn, sizing information is a prerequisite for estimating the total cost of ownership (TCO) of the given s

LLM Inference Benchmarking: How Much Does Your LLM Inference Cost? | NVIDIA Technical Blog
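
The sizing step described in the snippet above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the blog's own method: it assumes the benchmark has already produced a per-instance throughput that meets the latency requirement, and that demand divides evenly across instances. The function name and the numbers are hypothetical.

```python
import math


def instances_needed(peak_requests_per_second: float,
                     requests_per_second_per_instance: float) -> int:
    """Estimate how many inference instances cover the expected peak demand.

    requests_per_second_per_instance is assumed to be the throughput one
    instance sustained in benchmarking while staying within the max latency.
    """
    if requests_per_second_per_instance <= 0:
        raise ValueError("per-instance throughput must be positive")
    # Round up: a fraction of an instance cannot be deployed.
    return math.ceil(peak_requests_per_second / requests_per_second_per_instance)


# Hypothetical numbers: 100 req/s at peak, 12.5 req/s per instance.
print(instances_needed(100, 12.5))
```

The instance count is the input that the cost estimate then builds on.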

What formulas determine cost per token and yearly depreciation for LLM inference?

To estimate the amount of hardware and software licenses required and the associated cost, follow these steps, illustrated with a hypothetical example. First, collect and identify the cost information for both hardware and software. Next, calculate the total cost as follows: The number of servers is the number of instances times the GPUs per instance, divided by the number of GPUs per server. The yearly server cost is the initial server cost divided by the depreciation period (in years), plus the yearly software licensing and hosting costs per server. Total cost is

LLM Inference Benchmarking: How Much Does Your LLM Inference Cost? | NVIDIA Technical Blog
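
The two formulas quoted in the snippet above can be written out in Python. This is a minimal sketch, not the blog's own code: the function names and every number in the usage lines are hypothetical, and rounding the server count up to a whole server is an assumption the snippet does not state. The total-cost formula is cut off in the snippet, so it is left out here.

```python
import math


def number_of_servers(instances: int, gpus_per_instance: int,
                      gpus_per_server: int) -> int:
    # instances * GPUs per instance / GPUs per server.
    # Rounded up to a whole server (assumption, not stated in the snippet).
    return math.ceil(instances * gpus_per_instance / gpus_per_server)


def yearly_server_cost(initial_server_cost: float,
                       depreciation_years: float,
                       yearly_software_cost: float,
                       yearly_hosting_cost: float) -> float:
    # Initial cost spread over the depreciation period, plus the yearly
    # software licensing and hosting costs for that server.
    return (initial_server_cost / depreciation_years
            + yearly_software_cost + yearly_hosting_cost)


# Hypothetical inputs, for illustration only.
servers = number_of_servers(instances=10, gpus_per_instance=4, gpus_per_server=8)
cost = yearly_server_cost(320_000, 4, 36_000, 4_000)
print(servers, cost)
```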

How do latency-throughput trade-offs affect deployment optimization?

Once raw benchmark data are collected, they are analyzed to gain insight into the various performance characteristics of the system. Read our LLM inference benchmarking guide, where we gather NIM performance data with GenAI-perf and use a simple Python script to analyze the data. For example, performance data provided by GenAI-perf can be used to establish the latency-throughput trade-off curve, shown in Figure 1. Each dot on this graph corresponds to a “concurrency” level, that is, the number of concurrent requests being put into the system at any given time throughout the benchmark process

LLM Inference Benchmarking: How Much Does Your LLM Inference Cost? | NVIDIA Technical Blog
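
One way to read a latency-throughput curve like the one described in the snippet above is to pick the concurrency level with the highest throughput that still meets a latency limit. The sketch below assumes that selection rule, which the snippet does not spell out; the `BenchmarkPoint` type, the function name, and the sample data are all hypothetical and do not come from GenAI-perf output.

```python
from typing import NamedTuple, Optional, Sequence


class BenchmarkPoint(NamedTuple):
    concurrency: int        # concurrent requests during the run
    latency_ms: float       # measured latency at this concurrency
    throughput_rps: float   # measured throughput at this concurrency


def best_operating_point(points: Sequence[BenchmarkPoint],
                         max_latency_ms: float) -> Optional[BenchmarkPoint]:
    # Keep only the points that satisfy the latency requirement.
    feasible = [p for p in points if p.latency_ms <= max_latency_ms]
    if not feasible:
        return None  # no concurrency level meets the requirement
    # Among those, take the point with the highest throughput.
    return max(feasible, key=lambda p: p.throughput_rps)


# Made-up curve: throughput rises with concurrency, and so does latency.
curve = [
    BenchmarkPoint(1, 100.0, 5.0),
    BenchmarkPoint(4, 150.0, 15.0),
    BenchmarkPoint(16, 300.0, 30.0),
    BenchmarkPoint(64, 900.0, 40.0),
]
print(best_operating_point(curve, max_latency_ms=500.0))
```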

Add a Specialized Deep Research Skill to Agent Harnesses | NVIDIA Technical Blog

…The pipeline runs where the data is. AI-Q can read enterprise data, perform retrieval and synthesis, and create reports without raw documents leaving the controlled environment. This is critical for enterprises…

May 20, 2026 · William Markito Oliveira

How to Build In-Vehicle AI Agents with NVIDIA: From Cloud to Car | NVIDIA Technical Blog

…In addition, DriveOS 7 on Thor supports multiple QNX and Linux virtual machines enabling secure software environments for both AV and in-vehicle AI domains. DRIVE AGX Thor’s powerful AI performance…

May 5, 2026 · Felix Friedmann

Accelerated X-Ray Analysis for Nanoscale Imaging (XANI) of Novel Materials | NVIDIA Technical Blog

…Through extensive experimentation on the latest GPUs and high-performance Lustre storage systems, three critical optimizations were performed to achieve peak I/O performance: GDS, multithreaded HDF5, and data layout (details to…

May 13, 2026 · Irina Demeshko

How to Build License-Compliant Synthetic Data Pipelines for AI Model Distillation | NVIDIA Technical Blog

…design, deploy, and scale AI systems, with deep expertise in both model training and high-performance inference. His work centers on practical, production-ready AI—particularly in vision and large-scale model…

Feb 5, 2026 · Alex Steiner

Integrate Physical AI Capabilities into Existing Apps with NVIDIA Omniverse Libraries | NVIDIA Technical Blog

…The future of modular physical AI NVIDIA Omniverse is becoming a set of modular building blocks—libraries and frameworks you can compose into your own physical AI stack. By providing high-performance…

Apr 8, 2026 · Ashley Goldstein

Advance Video Analytics AI Agents Using the NVIDIA AI Blueprint for Video Search and Summarization | NVIDIA Technical Blog

…and aggregates the information to perform useful tasks such as summarization, Q&A and alerts. For more information about each task, see Build a Video Search and Summarization Agent with NVIDIA AI…

May 19, 2025 · Adam Ryason

Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA Technical Blog

…Advanced scheduling techniques (gang scheduling, hierarchical gang scheduling, and topology-aware placement) are crucial for performant deployment on Kubernetes, with AI schedulers like KAI Scheduler and abstractions such as LeaderWorkerSet and NVIDIA Grove translating…

Mar 23, 2026 · Anish Maddipoti

How to Build a Document Processing Pipeline for RAG with Nemotron | NVIDIA Technical Blog

…By pairing frontier models with NVIDIA Nemotron via an LLM router, you can sustain this high performance while optimizing for cost and efficiency. You can also find more information on how Justt…

Feb 4, 2026 · Chia-Chih Chen

How to Accelerate Protein Structure Prediction at Proteome-Scale | NVIDIA Technical Blog

…With a PhD in molecular microbiology and immunology, Kyle bridges science and strategy, translating breakthroughs in AI, chemistry, and biology into platforms that accelerate discovery for researchers, startups, and pharmaceutical companies worldwide…

Apr 9, 2026 · Christian Dallago

Streaming Tokens and Tools: Multi-Turn Agentic Harness Support in NVIDIA Dynamo | NVIDIA Technical Blog

…AI agent architecture research with a focus on accelerating agent performance through a co-designed software and hardware stack. Benjamin also works on accelerating software development velocity through the design and deployment…

May 8, 2026 · Matej Kosec
