AI concept detection - Mochiai.blog

Anthropic's New Research Shows Claude Can Detect Injected Concepts, but Only in Controlled Layers

How do you tell whether a model is actually noticing its own internal state instead of just repeating what training data said about thinking?…