A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
As agent capabilities advance, existing benchmarks, such as τ²-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and…