Benchmarking Visual State Tracking in Multimodal Video Understanding
…Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures and still fall short on VSTAT.