Search

Showing top 108 results for "GPU needs for LLMs"

Videos

Ollama is still the easiest way to start local LLMs, but it's the worst way to keep running them

…Available VRAM on the GPU D Internet bandwidth Spot on! VRAM is the key bottleneck for local LLM inference. If a model fits entirely in your GPU's VRAM, it runs dramatically…

Apr 8, 2026 · Adam Conway

Tech Mahindra Bridges India's Language Gap with AI

…We needed to ensure GPUs were not a requirement, and that it could run on a typical PC as well as a server.” Solution: Open-Source AI for India One of the…

· PDF

LLM Inference Benchmarking: How Much Does Your LLM Inference Cost? | NVIDIA Technical Blog

…How do you calculate required server capacity for peak LLM request volumes? To calculate the required infrastructure for a given LLM application, we need to identify the following constraints: Latency type and…

Jun 18, 2025 · Vinh Nguyen

LLM From Scratch is a hands-on workshop where you write every piece of an AI from nothing

…Related After a year of self-hosting LLMs, I realized the real bottleneck isn’t the GPU Hardware is just the entry fee for local intelligence.

May 8, 2026 · Simon Batt

MLOps – NVIDIA Technical Blog

…You can optimize for specific GPU configurations and achieve... 9 MIN READ Jan 08, 2026 Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM Large language models…

May 12, 2026

I ditched my dedicated GPU for integrated graphics and cut my power bill in half

…for my media center Using a GPU to handle specific workloads is a must-have, depending on what you wish to achieve. For running a large language model (LLM), you absolutely need…

Mar 25, 2026 · Rich Edmonds

Nemotron-Nano-9B-v2-Japanese Inference Tutorial

…The user is asking which prefecture is famous for "Kusatsu Senbei," which is a type of cracker. Wait, the user wrote "草加せんべい" which is "Kusatsu Senbei." But I need to check if…

Mar 17, 2026 · Atsunori Fujita

Discussions and forums

r/homelab · u/AntifaAustralia · 2w ago

My first 10 inch rack with local LLM! No more Spotify, Google Home, Netflix, ChatGPT...

I'm pretty new to homelabbing and this is my first mini rack! Started with the Beelink ME Mini and then just kinda grew from there (it's always the way hey haha). It idles at 70 watts (not too shabby for how much is goin…

r/LocalLLaMA · u/APFrisco · 2w ago

Computer build using Intel Optane Persistent Memory - Can run 1 trillion parameter model at over 4 tokens/sec

As the title states, my build is indeed able to run a 1 trillion parameter model (in this case Kimi K2.5) locally at ~4 tokens/second. I thought r/LocalLLaMA would be interested in the build due to that stat line, and al…

r/LocalLLaMA · u/janvitos · 2w ago

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec w…

r/LocalLLaMA · u/ex-arman68 · 2w ago

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

2026-05-07 edit: I have updated the hardware based recommendations with more focus on quality. I do not recommend q4_0 KV cache anymore beyond 64k context. After multiple rounds of testing with the different size quants,…

r/selfhosted · u/lazycodewiz · 1w ago

Top stories

AMD just dropped a compact AI workstation that makes discrete GPUs look outdated for running LLMs

I added a second GPU just for local AI workloads, and it cost less than upgrading my main one

13 years later, the GTX Titan is still the most important GPU Nvidia ever made

My RTX 5090 can't keep up with Apple Silicon on the biggest local LLMs, and I hate to admit it

Discussions and forums

services with actually generous free tiers for open-source projects. my list, what would you add?

Peak Training: Blackwell Delivers Next-Level MLPerf Training Performance

I ditched Copilot on VS Code for this free extension, and it's miles ahead

How Small Language Models Are Key to Scalable Agentic AI | NVIDIA Technical Blog