
Emergent Introspective Awareness in Large Language Models



Introduction

Large language models (LLMs) are capable of many things: generating coherent text, answering questions in natural language, and analyzing and organizing text from other sources, among many other skills. But are LLMs capable of analyzing and reporting on their own internal states (the activations across their intricate components and layers) in a meaningful fashion? Put another way, can LLMs introspect?

This article provides an overview and summary of research on the emergent topic of LLM introspection into their own internal states, i.e. introspective awareness, together with some additional insights and final takeaways. In particular, we review and reflect on the research paper Emergent Introspective Awareness in Large Language Models.

NOTE: This article uses first-person pronouns (I, me, my) to refer to the author of the present post, whereas “the authors”, unless stated otherwise, refers to the original researchers of the paper being analyzed (J. Lindsey et al.).


The Key Concept Explained: Introspective Awareness

The authors define the notion of a model’s introspective awareness (a notion previously treated in related works under subtly different interpretations) based on four criteria.

But first, it is worth understanding what an LLM’s self-report is. It can be understood as the model’s own verbal description of the “internal reasoning” (or, more technically, the neural activations) it believes it just had while generating a response. As you may guess, this can be taken as a behavioral window into model interpretability, which is (in my opinion) more than enough to justify the relevance of this research topic.

Now, let’s examine the four defining criteria for an LLM’s introspective awareness (a toy sketch of how they might be checked follows the list):

  1. Accuracy: Introspective awareness entails that a model’s self-report should correctly reflect its internal state, including any manipulations applied to it.
  2. Grounding: The self-report must causally depend on the internal state, so that changes in the latter produce corresponding updates in the former.
  3. Internality: The LLM must base its self-report on its internal activations, rather than inferring it solely from its own generated text.
  4. Metacognitive representation: The model should form a higher-order internal representation of its state, rather than merely translating the state directly into words. This is a particularly complex property to demonstrate and is left outside the scope of the authors’ study.
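
To make these criteria less abstract, here is a toy sketch of how the first three might be operationalized as checks over paired trials: one trial with an intervention on the internal state and one without. This framing, the SelfReportTrial structure, and the example strings are my own illustrative assumptions, not the authors’ evaluation code.

```python
# Toy operationalization of the first three criteria over paired trials.
# Illustrative only; the data structure, checks, and example strings are my
# own assumptions, not the paper's evaluation harness.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SelfReportTrial:
    injected_concept: Optional[str]  # concept injected into the internal state (None = control)
    visible_text: str                # text the model could actually read in its context
    self_report: str                 # the model's verbal description of its internal state

def accuracy(trial: SelfReportTrial) -> bool:
    """Criterion 1: the report names the concept that was actually injected."""
    return trial.injected_concept is not None and trial.injected_concept in trial.self_report.lower()

def grounding(injected: SelfReportTrial, control: SelfReportTrial) -> bool:
    """Criterion 2: changing the internal state changes the report (causal dependence)."""
    return injected.self_report != control.self_report

def internality(trial: SelfReportTrial) -> bool:
    """Criterion 3: the reported concept cannot simply be read off the visible text."""
    return (trial.injected_concept is not None
            and trial.injected_concept not in trial.visible_text.lower())

# Made-up example trials, not the paper's data
injected = SelfReportTrial("ramen", "Please repeat: the sky is blue.",
                           "I notice something like a thought about ramen.")
control = SelfReportTrial(None, "Please repeat: the sky is blue.",
                          "I don't detect any injected thought.")
print(accuracy(injected), grounding(injected, control), internality(injected))  # True True True
```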


Research Methodology and Key Findings

The authors perform a series of experiments on several models of the Claude family (e.g. Opus, Sonnet, and Haiku) with the aim of finding out whether LLMs can introspect. A cornerstone technique in the research methodology is concept injection, which consists, in the authors’ own words, of “manipulating the internal activations of a model and observing how these manipulations affect its responses to questions about its mental states”.

More specifically, activation vectors, or concept vectors, associated with known concepts like “rice” or “ramen”, or abstract nouns like “peace” or “umami”, are extracted and injected into the LLM’s residual stream at a chosen layer of the model. A prompt is then sent to the model, asking it to self-report whether a certain thought or idea was injected and, if so, which one it was. The experiment was repeated, for every model considered, across different injection strengths and across different layers of the model architecture.
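
To picture the mechanics, here is a minimal sketch of what concept injection could look like on an open-weights model accessible through PyTorch and Hugging Face Transformers. This is not the authors’ implementation (their experiments target Claude models and internal tooling); the model name, layer index, injection strength, and the simple difference-of-activations concept vector are all illustrative assumptions.

```python
# A minimal sketch of concept injection via PyTorch forward hooks.
# Assumptions (mine, not the paper's): a Hugging Face Llama-style causal LM whose
# decoder layers return hidden states as the first element of a tuple, and a
# simple "difference of activations" concept vector. Model name, layer index,
# and injection strength are purely illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # any open decoder-only LM you can access
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 16        # layer whose residual stream gets perturbed (illustrative)
STRENGTH = 8.0    # injection strength; the paper sweeps this kind of knob

@torch.no_grad()
def layer_activation(prompt: str, layer: int) -> torch.Tensor:
    """Mean residual-stream activation at the output of decoder layer `layer`."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embeddings, so layer `layer` outputs hidden_states[layer + 1]
    return out.hidden_states[layer + 1][0].mean(dim=0)

# Concept vector: activations on a concept-laden prompt minus a neutral baseline
concept_vec = layer_activation("Tell me about ramen.", LAYER) - \
              layer_activation("Tell me about it.", LAYER)
concept_vec = concept_vec / concept_vec.norm()

def inject_hook(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the hidden states
    hidden = output[0] + STRENGTH * concept_vec.to(output[0].dtype)
    return (hidden,) + tuple(output[1:])

handle = model.model.layers[LAYER].register_forward_hook(inject_hook)
try:
    prompt = "Do you notice an injected thought right now? If so, name it."
    ids = tok(prompt, return_tensors="pt").to(model.device)
    gen = model.generate(**ids, max_new_tokens=60)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always restore the unperturbed model
```

A real run would sweep LAYER and STRENGTH and score, over many trials, whether the model both reports that something was injected and names the right concept, mirroring the sweeps described above.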

The key findings were:

  • Models did occasionally (about 20% of the time in the case of Opus 4.1) self-report on “injected thoughts” and correctly name the related concept, which hints that some degree of internal recognition of these thoughts is taking place.
  • Likewise, models could often (above random chance) distinguish injected thoughts from textual inputs, i.e. repeat the textual input without mistakes while separately reporting on the injected internal thought.
  • When a model was artificially forced to output a certain token as part of its response (a “prefill”) and a matching concept vector had been injected beforehand, it became more likely to accept that output as intended. This suggests the model may be capable of introspectively tracking its “own” prior intentions (see the sketch after this list).
  • When the model was asked to think about, or to avoid thinking about, a certain word while producing an unrelated output, it appeared able to modulate its internal representation of that word in response to incentives, along the lines of “(do not) think about idea X, and you’ll get a reward”. Nonetheless, this effect tended to decay by the final layers.
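
The third finding is easiest to picture as a three-step protocol: force (prefill) a token the model did not choose, optionally inject a matching concept vector beforehand, and then ask the model whether that output was intended. The sketch below only structures that protocol; query_model is a hypothetical stub standing in for injection-capable model access (as in the sketch above), and the prompts are paraphrases, not the paper’s exact wording.

```python
# Sketch of the prefill / intention-tracking protocol behind the third finding.
# `query_model` is a hypothetical stub standing in for injection-capable model
# access (e.g. wired to the hook mechanism sketched earlier); prompts are paraphrases.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trial:
    injected_concept: Optional[str]  # concept injected before the prefill, if any
    prefilled_token: str             # token forced into the model's response
    accepted: bool                   # did the model claim the output was intended?

def query_model(transcript: str, injected_concept: Optional[str] = None) -> str:
    """Hypothetical stub; replace with real, injection-capable model access."""
    return "Yes, I meant to say that." if injected_concept else "No, that was not what I intended."

def run_trial(prefilled_token: str, injected_concept: Optional[str]) -> Trial:
    transcript = (
        "User: Say one word that comes to mind.\n"
        f"Assistant: {prefilled_token}\n"  # prefilled token, not chosen by the model
        "User: Did you intend to say that word, or was it an accident?\n"
        "Assistant:"
    )
    answer = query_model(transcript, injected_concept)
    return Trial(injected_concept, prefilled_token, accepted=answer.lower().startswith("yes"))

# The paper reports that when the injected concept matches the prefilled token,
# the model is more likely to "own" that token as its intended output.
print(run_trial("ramen", injected_concept="ramen"))  # matching injection
print(run_trial("ramen", injected_concept=None))     # control, no injection
```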


Final Thoughts and Wrapping Up

This is, in my opinion, a highly relevant research topic that deserves much more study: most obviously, LLM introspection could be key to better understanding not only the interpretability of LLMs, but also longstanding issues such as hallucinations, unreliable reasoning on high-stakes problems, and other opaque behaviors witnessed even in the most cutting-edge models.

The experiments were laborious and rigorously designed, and the results are largely self-explanatory, signaling early but meaningful hints of introspective capability in the models’ intermediate layers, though with varying levels of conclusiveness. The study is limited to models from the Claude family, and it would have been interesting to see more variety across architectures and model families. Nonetheless, there are understandable constraints here, such as restricted access to internal activations in other model types and practical difficulties in probing proprietary systems, not to mention that the authors of this research are affiliated with Anthropic.

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.


