A “diff” tool for AI: Finding behavioral differences in new models
… By amplifying it, we can cause the model to produce highly pro-government statements. In Llama, we found a feature for “American exceptionalism.” When we amplify this feature, the model’s responses shift from balanced to strong assertions of American superiority. …
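
To make "amplifying a feature" concrete, here is a minimal sketch. It assumes a feature corresponds to a direction in the model's activation space and that amplification means adding a scaled copy of that direction to an activation vector. The excerpt does not say how the amplification is actually done, so the function names `amplify_feature` and `feature_score`, the `strength` parameter, and the toy vectors are all hypothetical.

```python
from typing import List


def _norm(vec: List[float]) -> float:
    """Euclidean length of a vector."""
    return sum(v * v for v in vec) ** 0.5


def feature_score(activation: List[float], direction: List[float]) -> float:
    """Projection of the activation onto the unit-length feature direction."""
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    norm = _norm(direction)
    if norm == 0:
        raise ValueError("feature direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / norm


def amplify_feature(
    activation: List[float], direction: List[float], strength: float
) -> List[float]:
    """Assumed steering rule: activation + strength * unit(direction)."""
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    norm = _norm(direction)
    if norm == 0:
        raise ValueError("feature direction must be non-zero")
    return [a + strength * d / norm for a, d in zip(activation, direction)]


if __name__ == "__main__":
    # Toy 3-dimensional activation and a hypothetical feature direction.
    activation = [1.0, 0.0, 0.0]
    direction = [0.0, 3.0, 4.0]
    steered = amplify_feature(activation, direction, strength=5.0)
    print("score before:", feature_score(activation, direction))
    print("score after: ", feature_score(steered, direction))
```

Under this assumed rule, the projection onto the feature direction rises by exactly `strength`, while components orthogonal to the direction are left unchanged. In a real model the same addition would be applied to a layer's activations during the forward pass.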