Search

Showing top 27 results for "GPU needs for LLMs"

Filtered by topic: LLMs

Sources: xda-developers.com (18), developer.nvidia.com (8), nextplatform.com (1)

Home Assistant's local LLM support outperforms Gemini for Home, and Google knows it

…You’re at the mercy of the model Google chooses and must wait for an update if it falls short. Running LLMs locally frees you from those constraints; of course, you still need…

Apr 28, 2026 · Samir Makwana

I use this local AI tool to turn boring documents into cool narrations

…I recently started integrating local LLMs with my arsenal of free and open-source tools, and they’ve been a game-changer for my productivity needs…

May 17, 2026 · Ayush Pande

Ollama is still the easiest way to start local LLMs, but it's the worst way to keep running them

…VRAM is the key bottleneck for local LLM inference. If a model fits entirely in your GPU's VRAM, it runs dramatically…

Apr 8, 2026 · Adam Conway

I built a local LLM server I can access from anywhere, and it uses a Raspberry Pi

…Related: I ran local LLMs on a "dead" GPU, and the results surprised me. My Pascal card may not be ideal for intensive workloads, but it's more than enough for light…

Apr 23, 2026 · Ayush Pande

LLM Inference Benchmarking: How Much Does Your LLM Inference Cost? | NVIDIA Technical Blog

…How do you calculate required server capacity for peak LLM request volumes? To calculate the required infrastructure for a given LLM application, we need to identify the following constraints: Latency type and…

Jun 18, 2025 · Vinh Nguyen

MLOps – NVIDIA Technical Blog

…You can optimize for specific GPU configurations and achieve... Jan 08, 2026 · Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM · Large language models…

May 12, 2026

How Small Language Models Are Key to Scalable Agentic AI | NVIDIA Technical Blog

…adding a new skill or fixing a behavior can be done in a few GPU hours on an SLM, compared to days or weeks of fine-tuning for LLMs. With edge deployments…

Aug 29, 2025 · Peter Belcak

Discussions and forums

r/LocalLLaMA · u/APFrisco · 1w ago

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec

As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and al…

r/LocalLLaMA · u/janvitos · 2w ago

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec w…

r/LocalLLaMA · u/ex-arman68 · 2w ago

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

2026-05-07 edit: I have updated the hardware-based recommendations with more focus on quality. I do not recommend q4_0 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants,…

r/LocalLLaMA · u/bobaburger · 2w ago
