
Showing top 27 results for "GPU needs for LLMs"

Filtered by topic: LLMs

Sources: xda-developers.com (18), developer.nvidia.com (8), nextplatform.com (1)

I replaced my local LLM with a model half its size and got better results — And it wasn't about the parameters

…It ran smoothly on my setup through GPU offloading, even though it’s officially designed for 16GB of VRAM. For most things, it was fine. I would primarily prompt it for quick…

Apr 8, 2026 · Nolen Jonker

Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer | NVIDIA Technical Blog

…samples. The script for this process is provided below, showing how to prune using a two-GPU pipeline parallel setup. torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py…

Oct 7, 2025 · Max Xu

Local LLMs are actually good now, and I wasted months not realizing it

…Related 7 things I wish I knew when I started self-hosting LLMs I've been self-hosting LLMs for quite a while now, and these are all of the things I…

Apr 18, 2026 · Nolen Jonker

Your local LLM feels weak because you're treating it like a search engine

…I’ve been running my local LLM for a while now, and it’s been hit and miss for me. For starters, I do kind of…

Apr 6, 2026 · Nolen Jonker

I tested 3 local LLMs on my actual work — and each model won at something different

…how deep the conversation can actually go for anyone with a smaller GPU…

Apr 21, 2026 · Nolen Jonker

TurboQuant tackles the hidden memory problem that's been limiting your local LLMs

…That's more than many GPUs can hold, and that's before you account for the model weights. Modern models have largely moved to GQA, which shares KV heads across multiple query…

Mar 30, 2026 · Adam Conway

Build a Retrieval-Augmented Generation (RAG) Agent with NVIDIA Nemotron | NVIDIA Technical Blog

…v2 LLM from the NVIDIA API Catalog. These APIs are useful for evaluating many models, quick experimentation, and getting started is free. However, for the unlimited performance and control needed in production…

Sep 23, 2025 · Edward Li

Discussions and forums

r/LocalLLaMA · u/APFrisco · 1w ago

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec

As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and al…

r/LocalLLaMA · u/janvitos · 2w ago

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec w…

r/LocalLLaMA · u/ex-arman68 · 2w ago

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

2026-05-07 edit: I have updated the hardware based recommendations with more focus on quality. I do not recommend q4_0 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants,…

r/LocalLLaMA · u/bobaburger · 2w ago

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)

The following is a non-comprehensive test I came up with to test the quality difference (a.k.a degradation) between different quantizations of Qwen 3.6 27B. I want to figure out what's the best quant to run on my 16 GB V…
