Search

Showing top 106 results for "GPU needs for LLMs"

Videos

Healthcare and Life Sciences Archives

…GPU and 1,000 tokens per second per user on gpt-oss with the latest NVIDIA TensorRT-LLM stack. As AI shifts from one-shot answers to complex reasoning, the demand for…

May 7, 2026

Fast, Low-Cost Inference Offers Key to Profitable AI

…To run state-of-the-art LLMs in real time, enterprises need multiple GPUs working in concert. Tools like the NVIDIA Collective Communication Library, or NCCL, enable multi-GPU systems to quickly…

Jan 23, 2025 · Dave Salvator

I replaced my local LLM with a model half its size and got better results — And it wasn't about the parameters

…It ran smoothly on my setup through GPU offloading, even though it’s officially designed for 16GB of VRAM. For most things, it was fine. I would primarily prompt it for quick…

Apr 8, 2026 · Nolen Jonker

Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer | NVIDIA Technical Blog

…samples. The script for this process is provided below, showing how to prune using a two-GPU pipeline parallel setup. torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py…

Oct 7, 2025 · Max Xu

Optimized Software for Professionals With AMD and ISV Solutions

…Epic Games AMD partners with Epic Games to optimize Unreal Engine for peak performance on AMD CPUs and GPUs, delivering enhanced graphics and speed for developers and users. Maxon AMD and Maxon…

AR / VR – NVIDIA Technical Blog

…Your Essential Tool for Measuring GPU Interconnect and Memory Performance When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is…

May 22, 2026

Developer Tools & Techniques – NVIDIA Technical Blog

…Your Essential Tool for Measuring GPU Interconnect and Memory Performance When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is…

May 22, 2026

Discussions and forums

r/homelab · u/AntifaAustralia · 2w ago

My first 10 inch rack with local LLM! No more Spotify, Google Home, Netflix, ChatGPT...

I'm pretty new to homelabbing and this is my first mini rack! Started with the Beelink ME Mini and then just kinda grew from there (it's always the way hey haha). It idles at 70 watts (not too shabby for how much is goin…

r/LocalLLaMA · u/APFrisco · 2w ago

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec

As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and al…

r/LocalLLaMA · u/janvitos · 2w ago

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec w…

r/LocalLLaMA · u/ex-arman68 · 2w ago

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

2026-05-07 edit: I have updated the hardware based recommendations with more focus on quality. I do not recommend q4_0 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants,…

r/selfhosted · u/lazycodewiz · 1w ago

Top stories

I built my own Googlebook with a Raspberry Pi, local LLMs, and old hardware

I added a second GPU just for local AI workloads, and it cost less than upgrading my main one

13 years later, the GTX Titan is still the most important GPU Nvidia ever made

My RTX 5090 can't keep up with Apple Silicon on the biggest local LLMs, and I hate to admit it

Discussions and forums

services with actually generous free tiers for open-source projects. my list, what would you add?

Nvidia slaps Groq into new LPX racks for faster AI response

Launching AMD AI Playbooks: Step-by-Step Guides for Building with AI Locally with AMD

My local LLM is the best productivity tool I've installed in years, and it costs nothing to run