Showing top 117 results for "model-by-model evaluation"

Paper page - SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

…The following papers were recommended by the Semantic Scholar API: Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews (2026); Teaching Language Models to Check Grounded Claim Factuality with Human Test…

Jun 1, 2026

Paper page - Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

…Can Video Generation Models Dream Executable Robot Manipulation? Published on Jun 4. Submitted by Rui Zhao on Jun 5. Abstract: Video generation models were evaluated through robotic manipulation tasks to assess…

Jun 5, 2026

Paper page - ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

…narratives, necessitating benchmarks that evaluate psychological trajectory alignment rather than static factual recall, with ArcANE demonstrating superior performance when character arc information is conditioned into models. Generated by Qwen/Qwen2.5-Coder…

Jun 5, 2026

Paper page - YoCausal: How Far is Video Generation from World Model? A Causality Perspective

…Generated by Qwen/Qwen2.5-Coder-32B-Instruct. As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal…

May 29, 2026

Discussions and forums

Hacker News · u/linzhiqiu · 1w ago

Show HN: VQAScore – open eval metric/reward model, now for text-to-video

Two years ago we released VQAScore: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It became a go-to evaluation metric and reward model for image generation, replacing CLIPScore across the field …

Hacker News · u/deepakakkil · May 15, 2026

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects such as need for advanced reasoning, tool calling, working…

Hacker News · u/dhavalt · 12h ago

Show HN: AptSelect – A local LLM client for parallel testing and evaluation

I built AptSelect to stop writing throwaway scripts every time I needed to test how different LLMs handle specific instructions and prompt edge cases. What it does: Parallel Execution: Send a single prompt to OpenAI, Anthr…

Hacker News · u/jrhizor · May 18, 2026

Show HN: Elmo (Open Source AEO)

I'm excited to announce Elmo, an MIT-licensed, open source AEO/GEO tool. We help you scrape ChatGPT/Google AI Mode/etc using web scrapers like BrightData/Olostep/etc, evaluate prompts against the OpenAI/Anthropic/Mistral …

Hacker News · u/JohannaAlmeida · Apr 7, 2026

Hybrid Attention

TLDR: Forked PyTorch and Triton internals. Changed attention so it is a linear first layer, a quadratic middle layer, and a linear last layer. Inference got much faster with a low perplexity hit in tests. Full attention O(n²): 17.9…

Test AI models and workflows with AMD Instinct GPUs on Hot Aisle

…on large generative models. Choose fully dedicated nodes or flexible VMs. You control the OS, networking, and stack, while Hot Aisle and AMD provide the performance foundation. Evaluate AMD Instinct GPUs on…

Model Quantization: Post-Training Quantization Using NVIDIA Model Optimizer | NVIDIA Technical Blog

…saving should be achieved by exporting the model to deployment frameworks such as NVIDIA TensorRT. This simulation is crucial because it enables you to evaluate the model’s accuracy before committing to…

May 7, 2026 · Ruixiang Wang

Paper page - Bridging the Agent-World Gap: Text World Models for LLM-based Agents

…principled evaluation. We systematically review text world models for LLM-based agents, organized around a formal framework and the agent lifecycle: (1) Foundations, defining text world models and characterizing them by state…

Jun 10, 2026

Paper page - RewardHarness: Self-Evolving Agentic Post-Training

…framework that improves image edit evaluation by iteratively developing tools and skills from limited human demonstrations, achieving superior performance compared to existing models. AI-generated summary: Evaluating instruction-guided image edits requires…

May 14, 2026

Paper page - IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

…This remains the most persistent weakness across all 17 evaluated models (including frontier models from Google, OpenAI, Anthropic, and the Qwen family). ⚖️ New Evaluation Paradigm: We decouple raw correctness from strict safety…

May 13, 2026

Paper page - WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

…Evaluating Multimodal Agent Memory Through Action-World Interaction. Published on May 28. Submitted by taesiri on May 29. Abstract: Multimodal large language models require sophisticated memory systems that can track evolving…