Search: fact-check accuracy

Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer | NVIDIA Technical Blog

… Export and deploy: Once the accuracy is acceptable, the fake quantized weights are compressed into their true low-precision form and exported as a checkpoint for downstream engines. In our case, we export the PyTorch checkpoint to ONNX and run inference with TensorRT. …

May 7, 2026 · Ruixiang Wang

Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM | NVIDIA Technical Blog

… Skip Softmax Attention can be enabled through the sparse attention configuration of the LLM API: from tensorrt_llm import LLM; from tensorrt_llm.llmapi import SkipSoftmaxAttentionConfig; sparse_attention_config = SkipSoftmaxAttentionConfig(threshold_scale_factor=1000.0) Additionally, the threshold sca… …

Dec 16, 2025 · Laikh Tewari

Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer | NVIDIA Technical Blog

… Calibrating the model to obtain scaling factors for lower-precision GEMMs and exporting the quantized model to the TensorRT-LLM checkpoint. …

Sep 10, 2024 · Jan Lasek

Boosting Llama 3.1 405B Performance up to 1.44x with NVIDIA TensorRT Model Optimizer on NVIDIA H200 GPUs | NVIDIA Technical Blog

… This involves calculating a static scaling factor for each output weight channel before execution and a dynamic scaling factor for each token during execution to preserve maximum accuracy. …

Aug 28, 2024 · Anjali Shah

Build AI-Ready Knowledge Systems Using 5 Essential Multimodal RAG Capabilities | NVIDIA Technical Blog

Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning | NVIDIA Technical Blog

How Justt Scaled Chargeback Extraction with Nemotron Parse

Mastering Agentic Techniques: AI Agent Evaluation | NVIDIA Technical Blog

Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy | NVIDIA Technical Blog
