Showing top 117 results for "model-by-model evaluation"

Paper page - WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

…A Benchmark for Real-World, Long-Horizon Agent Evaluation. Published on May 11. Submitted by Shuangrui Ding on May 15. Intern Large Models. Authors include Xuanlang Dai, Yang JingYi, and Yuhang Zang. Abstract: WildClawBench…

May 15, 2026

Paper page - M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

…Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks. Published on Jun 3. Submitted by Huang Jie on Jun 4. PKU-VaLuE-Lab. Abstract: Multi-modal models exhibit significant limitations in…

Jun 4, 2026

Paper page - MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

…Evaluating Open-World Exploration of MLLM Agents in Minecraft. Published on May 29. Submitted by Tianjie Ju on Jun 2. Abstract: MineExplorer benchmark evaluates multimodal large language models' open-world exploration…

Jun 2, 2026

Paper page - XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

…by the Semantic Scholar API IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia (2026) ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models (2026…

May 8, 2026

Discussions and forums

Hacker News · u/linzhiqiu · 1w ago

Show HN: VQAScore – open eval metric/reward model, now for text-to-video

Two years ago we released VQAScore: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It became a go-to evaluation metric and reward model for image generation, replacing CLIPScore across the field …

Hacker News · u/deepakakkil · May 15, 2026

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects such as need for advanced reasoning, tool calling, working…

Hacker News · u/dhavalt · 1d ago

Show HN: AptSelect – A local LLM client for parallel testing and evaluation

I built AptSelect to stop writing throwaway scripts every time I needed to test how different LLMs handle specific instructions and prompt edge cases. What it does: Parallel Execution: Send a single prompt to OpenAI, Anthr…

Hacker News · u/jrhizor · May 18, 2026

Show HN: Elmo (Open Source AEO)

I'm excited to announce Elmo, an MIT-licensed, open source AEO/GEO tool. We help you scrape ChatGPT/Google AI Mode/etc using web scrapers like BrightData/Olostep/etc, evaluate prompts against the OpenAI/Anthropic/Mistral …

Hacker News · u/JohannaAlmeida · Apr 7, 2026

Hybrid Attention

TLDR: Forked PyTorch and Triton internals. Changed attention so it's a linear first layer, a quadratic middle layer, and a linear last layer. Inference got much faster with a low perplexity hit in tests. Full attention O(n²): 17.9…

Spooked by Mythos, Trump suddenly realized AI safety testing might be good

…To date, CAISI said it has completed about 40 evaluations, including those of frontier models that have yet to be released. When conducting tests, CAISI frequently gains access to models with “reduced…

May 6, 2026 · Ashley Belanger

New Microsoft tool lets devs spin up AI behavior tests using text descriptions | TechCrunch

AI researchers and labs have advanced by leaps and bounds in evaluating AI models for everything from safety and compliance to sycophancy and alignment. But it appears companies and developers are faced…

Jun 2, 2026 · Ram Iyer

Paper page - EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

…Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions. Published on Apr 30. Submitted by Weiyu Sun on May 8. Authors: Weiyu Sun et al. Abstract: EDU-CIRCUIT-HW…

May 8, 2026

Top stories

Paper page - When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Paper page - Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Paper page - Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Paper page - LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs