Search

Showing top 117 results for "model-by-model evaluation"

Paper page - PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

…The following papers were recommended by the Semantic Scholar API: HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks (2026); Medical Reasoning with Large Language Models: A Survey and MR-Bench (2026…

May 5, 2026

Paper page - Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

…Generated by Qwen/Qwen2.5-Coder-32B-Instruct · Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual…

Jun 3, 2026

Paper page - Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

…AI-generated summary · Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to text-to-speech (TTS) introduces high variance due to linguistic diversity and…

Apr 29, 2026

Paper page - IntentGrasp: A Comprehensive Benchmark for Intent Understanding

…May 7 · Submitted by Yuwei Yin on May 11 · University of British Columbia · Authors: Yuwei Yin, Chuyuan Li · Abstract: IntentGrasp is a benchmark for evaluating large language models' intent understanding capability, demonstrating…

May 11, 2026

Discussions and forums

Hacker News · u/linzhiqiu · 1w ago

Show HN: VQAScore – open eval metric/reward model, now for text-to-video

Two years ago we released VQAScore: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It became a go-to evaluation metric and reward model for image generation, replacing CLIPScore across the field …
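The scoring idea in the post above can be sketched in a few lines. This is only an illustration, not the VQAScore implementation: `build_question` and `p_yes` are hypothetical helper names, the logits are assumed to come from some VLM that is not shown here, and normalizing over just the "Yes" and "No" candidates is a simplifying assumption that may differ from how the released metric computes P(Yes).

```python
import math


def build_question(prompt: str) -> str:
    """Fill the question template quoted in the post (hypothetical helper)."""
    return f"does this image show {prompt}?"


def p_yes(yes_logit: float, no_logit: float) -> float:
    """Probability of "Yes" from two answer logits.

    Assumption: we normalize over only the "Yes" and "No" candidates
    (a two-way softmax). The real metric may normalize differently.
    """
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


if __name__ == "__main__":
    question = build_question("a red car parked on a beach")
    # Placeholder logits; in practice these would come from a VLM
    # given the image and the question.
    score = p_yes(yes_logit=2.0, no_logit=-1.0)
    print(question)
    print(f"score = {score:.3f}")
```

A higher score means the model puts more weight on "Yes" for that image and prompt pair, which is what lets the same number serve as an evaluation metric or a reward signal.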

Hacker News · u/deepakakkil · May 15, 2026

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long-horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects such as the need for advanced reasoning, tool calling, working…

Hacker News · u/dhavalt · 11h ago

Show HN: AptSelect – A local LLM client for parallel testing and evaluation

I built AptSelect to stop writing throwaway scripts every time I needed to test how different LLMs handle specific instructions and prompt edge cases. What it does: Parallel Execution: Send a single prompt to OpenAI, Anthr…

Hacker News · u/jrhizor · May 18, 2026

Show HN: Elmo (Open Source AEO)

I'm excited to announce Elmo, an MIT-licensed, open source AEO/GEO tool. We help you scrape ChatGPT/Google AI Mode/etc using web scrapers like BrightData/Olostep/etc, evaluate prompts against the OpenAI/Anthropic/Mistral …

Hacker News · u/JohannaAlmeida · Apr 7, 2026

Hybrid Attention

TLDR: Forked PyTorch and Triton internals. Changed attention so that it uses a linear first layer, a quadratic middle layer, and a linear last layer. Inference got much faster with a low perplexity hit in tests. Full attention O(n²): 17.9…

Paper page - AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

…Across the eight model backends, Lucky rates range from 0.5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of…

May 14, 2026

Paper page - Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues

…We evaluate 10 frontier LLMs on Ψ-Bench and find that while most models can produce coherent and reasonable arguments, even state-of-the-art models still leave considerable room for improvement…

Jun 3, 2026

Paper page - τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

…to evaluate conversational agent reliability, revealing significant performance gaps among leading models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct · As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms…

Jun 11, 2026

Top stories

Paper page - When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Paper page - Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Paper page - Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Paper page - Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases