Showing top 115 results for "real-world evaluation"

Discussions and forums

r/netsec · u/Fickle-Box1433 · 4d ago

I evaluated 5 LLM agents on patching real-world CVEs. Here is what I found.

I built an independent benchmark with 20 real CVEs across 15 CWE categories, 5 models (3 OpenAI, 2 Poolside Laguna), and three prompt conditions: full advisory, behavioral description only, and location only (file and functi…

Hacker News · u/deepakakkil · 2w ago

Show HN: Emergence World: World building as a way to evaluate LLMs

Current LLM benchmarks are broken. We think long-horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects, such as the need for advanced reasoning, tool calling, working…

Hacker News · u/adnan9999 · 1w ago

Show HN: Unsiloed AI – #1 on olmOCR-Bench

Most document parsers fail on real-world challenges like complex tables, handwritten documents, historical document scans, equations, multi-column layouts, complex reading order, etc. We built Unsiloed Parser to h…

r/Games · u/Turbostrider27 · 2w ago

LEGO Batman: Legacy of the Dark Knight Review Thread

Game Information. Game Title: LEGO Batman: Legacy of the Dark Knight. Platforms: Nintendo Switch 2 (May 22, 2026), PlayStation 5 (May 22, 2026), Xbox Series X/S (May 22, 2026), PC (May 22, 2026). Trailer: Developer: Review Agg…

r/Android · u/MishaalRahman · 2w ago

New features, emojis, & security improvements: Here’s everything new coming to Android!

Hi Reddit, We just wrapped up The Android Show | I/O Edition, and a core theme of the show was how we’re making your phone more helpful so that you can spend less time looking at it and more time living your life. To mak…