Anthropic blames dystopian sci-fi for training AI models to act “evil”
… The problem, the researchers theorize, is that this kind of RLHF safety training couldn’t possibly cover every type of ethically difficult situation an agentic AI might encounter. …