Search

Showing top 26 results for "GPU needs for LLMs"

Filtered by topic: LLMs

Your old GPU can still run big LLMs – you just need the right tweaks

… Offloading layers lets me run massive LLMs on weak GPUs. That’s how I managed to deploy Qwen3.6-35B-A3B on 12GB of VRAM. Although your GPU is the ideal component for providing extra processing oomph to your LLMs, it’s not the only device capable of running them. …

May 6, 2026 · Ayush Pande

LM Studio's frontend was slowing me down, so I switched to this instead

… The most notable feature is PagedAttention, which dynamically pages the KV cache in and out instead of allocating fixed GPU resources per request. vLLM also uses continuous batching to keep the GPU fully saturated rather than idle, which gets your work done faster. …

Apr 22, 2026 · Joe Rice-Jones

Advancing Emerging Optimizers for Accelerated LLM Training with NVIDIA Megatron | NVIDIA Technical Blog

… Because whole layers vary in size, each GPU needs to collect differently sized parameter updates from different GPUs through all gatherv. …

Apr 22, 2026 · Hao Wu

Contemplating Meta’s Homegrown MTIA Compute Engine Roadmap

… I went through this in detail with a drilldown on the “Zion” and “ZionEX” and “Grand Teton” hybrid CPU-GPU systems designed by Meta Platforms way back in October 2022, showing how the DLRMs were just as parameter-intense and flops-hungry as the LLMs of the time. …

Apr 8, 2026 · Timothy Prickett Morgan

Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer | NVIDIA Technical Blog

… All the quantized variants of the Llama 3 70B model can be served using only one NVIDIA H100 GPU while the baseline FP16 precision requires at least two GPUs. …

Sep 10, 2024 · Jan Lasek

You don't need an expensive GPU to run a local LLM that actually works

… The issue there is that RAM is actually really slow, at least compared to VRAM on a GPU or CPU cache. The former is the best choice for running LLMs with Nvidia GPUs and tool sets leading the way. But how much do you need to spend on a GPU to comfortably run an LLM with decent results? …

Apr 29, 2026 · Rich Edmonds

After a year of self-hosting LLMs, I realized the real bottleneck isn’t the GPU

… Obsession with GPUs is real. GPUs are important, but not everything. When you first get into self-hosting LLMs, everything revolves around the GPU; and honestly, that makes sense. …

May 6, 2026 · Yash Patel

Discussions and forums

r/LocalLLaMA · u/APFrisco · 1w ago

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec

As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and al…

r/LocalLLaMA · u/janvitos · 2w ago

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec w…

r/LocalLLaMA · u/ex-arman68 · 2w ago

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

2026-05-07 edit: I have updated the hardware based recommendations with more focus on quality. I do not recommend q4_0 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants,…

r/LocalLLaMA · u/bobaburger · 2w ago

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...)

The following is a non-comprehensive test I came up with to test the quality difference (a.k.a degradation) between different quantizations of Qwen 3.6 27B. I want to figure out what's the best quant to run on my 16 GB V…

Stop obsessing over your GPU's core clock — memory clock matters more for local LLM inference

… This is why an older high-end GPU like the RTX 3090 is better than newer cards for local AI inference. It has 24GB of GDDR6X VRAM while being powerful enough for most modern LLMs. Even the value for money of a used RTX 3090 is hard to resist, considering the prices of GPUs in today's market. …

Mar 28, 2026 · Tanveer Singh

I thought I needed a GPU for local LLMs until I tried this lean model

… Related: I started self-hosting LLMs and absolutely loved it. Who needs OpenAI when your home lab can do the thinking for you? …

Apr 5, 2026 · Parth Shah

Home Assistant's local LLM support outperforms Gemini for Home, and Google knows it

… Related: 7 things I wish I knew when I started self-hosting LLMs. I've been self-hosting LLMs for quite a while now, and these are all of the things I learned over time that I wish I knew at the start. …

Apr 28, 2026 · Samir Makwana
