Search: AI token costs

Paper page - Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

… We therefore call for inference papers and benchmarks to report Joules/token, active binding constraint, PUE-adjusted delivered power, and utilization-adjusted token output alongside accuracy and latency. …

May 14, 2026

Paper page - DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

… Across in-domain and out-of-domain evaluation, multi-token DiffRetriever substantially improves over single-token on every diffusion backbone we test, while autoregressive multi-token is flat or negative and pays a latency cost that scales with K where diffusion does not. …

May 12, 2026

Paper page - Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

… arXiv:2605.06105. Published on May 7. Submitted by jeongseokoh on May 11. Authors: Jungsuk Oh, … Abstract: SPEED is a phase-asymmetric KV-visibility policy that reduces long-context inference… …

May 11, 2026

Paper page - MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

… AI-generated summary: DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. …

May 11, 2026

Paper page - Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

… arXiv:2605.09649. Published on May 10. Submitted by Ngoc Bui on May 12. Yale University. Abstract: Learned global retention-based key-value cache eviction improves long-context reasoning by sel… …

May 12, 2026

Paper page - LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

… The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. …

May 12, 2026

Paper page - Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

… arXiv:2605.12825. Published on May 12. Submitted by Nguyen Van Chien on May 14. Authors: Chien Van Nguyen, …, Franck Dernoncourt, … Abstract: Orthrus is a dual-architecture framework that combines autoregressive LLMs … …

May 14, 2026

Paper page - Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

… AI-generated summary: On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. …

May 12, 2026

Paper page - PAAC: Privacy-Aware Agentic Device-Cloud Collaboration

… This preserves tool-call fidelity, gives cross-round consistency, and locks in first-turn protection even if the on-device LLM is later compromised. 📊 Results (Qwen3-4B + Gemini 3 Flash): 📈 +15-36% accuracy and 2-6× lower leakage vs. SOTA device-cloud baselines on $\tau^2$-Bench Airline/Retail and GAIA… …

May 13, 2026

We Got Claude to Fine-Tune an Open Source LLM

… I also recently read a related guide: https://mobisoftinfotech.com/resources/blog/ai‑development/llm‑api‑pricing‑guide — which gives practical advice on LLM API usage, token‑based pricing, and how to plan costs when working with LLMs. …

Oct 14, 2025 · ben burtenshaw
