Before explaining our results, we should take a moment to consider what it means for an AI model to introspect. What could they even be introspecting on? Language models like Claude process text (and image) inputs and produce text outputs. Along the way, they perform complex internal computations in order to decide what to say. These internal processes remain largely mysterious, but we know that models use their internal neural activity to represent abstract concepts. For instance, prior research has shown that language models use specific neural patterns to distinguish known vs. unknown peopl