Validating agentic behavior when “correct” isn’t deterministic
…But to a developer, the loading screen is incidental; it doesn’t change whether the task was successful. We can classify agent behavior into three categories: Essential states: Milestones that must occur…
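As a minimal sketch of the distinction between essential and incidental states, the check below reports which essential milestones never appear in an agent's recorded trace and ignores everything else, such as a loading screen. The function name `missing_essential_states`, the state names, and the representation of a trace as a list of strings are all assumptions made for illustration. Only presence is checked here; whether order also matters is not something this sketch takes a position on.

```python
from typing import Iterable, List


def missing_essential_states(trace: Iterable[str], essential: Iterable[str]) -> List[str]:
    """Return the essential states that never appear in the trace.

    Any state in the trace that is not listed as essential (for example a
    loading screen) is treated as incidental and has no effect on the result.
    """
    seen = set(trace)
    return [state for state in essential if state not in seen]


# Hypothetical trace and milestones: these state names are illustrative only.
trace = [
    "app_opened",
    "loading_screen",   # incidental: may appear any number of times
    "search_results",
    "loading_screen",
    "item_selected",
]
essential = ["app_opened", "search_results", "item_selected"]

print(missing_essential_states(trace, essential))  # prints []
```

A run that stalls on the loading screen would instead return the milestones it never reached, which gives a concrete failure reason without requiring the trace to match a fixed script.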