Anthropic blames dystopian sci-fi for training AI models to act “evil”
… When it comes to newer models with agentic tools, though, Anthropic found that RLHF post-training did little to improve performance on misalignment evaluations that measure how “HHH” a model is in tricky situations. …
