# SelfIE: Self-Interpretation of Large Language Model Embeddings

Haozhe Chen ¹, Carl Vondrick ¹, Chengzhi Mao ¹ ² ³

¹ Department of Computer Science, Columbia University, New York, NY. ² Mila, Montreal, Canada. ³ McGill University, Montreal, Canada. Correspondence to: Haozhe Chen, Chengzhi Mao.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

selfie.cs.columbia.edu

How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model developments. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in the hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings open avenues to control LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation of an individual layer. We extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge in LLMs without supervision targets.

## 1. Introduction

Large language models (LLMs) have become foundations for a wide range of applications, from programming (Rozière et al., 2023) and question answering (Surís et al., 2023) to healthcare (OpenAI, 2023; Brown et al., 2020; Chowdhery et al., 2022). However, the models are largely black boxes, with limited transparency into how they make decisions during inference. Interpreting the representations that LLMs learn is important for establishing trust in many applications as well as revealing whether state-of-the-art methods are reasoning (Arkoudas, 2023) or just repeating their training set (Bender et al., 2021).

[Figure 1: for the input "Highest mountain on Earth elevation" and the output "8,848.86", example interpretations include "highest mountain on earth", "Mount Everest has height 8,848.86 m", "highest mountain in universe", and "Olympus Mons".]

Figure 1. SelfIE interpretation of latent embeddings in Large Language Models. SelfIE produces open-world text explanations for the internal states in an LLM without any training.

Explaining model decisions is a longstanding problem in machine learning, and there has been significant effort to uncover explanations behind LLM decisions. Chain-of-thought, for example, uses in-context examples to direct the model to additionally output its reasoning process (Wei et al., 2022); however, there is no guarantee that the explanation is faithful to the actual reasoning process (Turpin et al., 2023). Moreover, Zou et al. (2023a) showed that LLMs answer questions differently depending on whether they need to provide explanations or not, making the true explanation often inaccessible.

Since an LLM's answers are produced from hidden representations, the internal embeddings provide more direct and causally relevant access to the LLM's reasoning processes. Hernandez et al. (2023) and Li et al. (2021) developed linear probes to identify information in hidden embeddings, but the methods require extensive data collection for training and are consequently limited to a small predefined set of concepts. Previous works (Nostalgebraist; Belrose et al., 2023; Pal et al., 2023; Hernandez et al., 2024) decode components of LLMs to describe hidden embeddings, but they only provide short descriptions that cannot explain complex concepts in detail.

In this paper, we propose an approach, SelfIE (Self-Interpretation of Embeddings), that interprets hidden embeddings in an LLM with the LLM itself.
SelfIE leverages LLMs' own decoding capability to produce natural language descriptions for the information in hidden embeddings. Fig. 1 shows an example in which SelfIE's interpretations delineate how an LLM processes the input prompt, retrieves relevant facts, and obtains the final answer.

We developed SelfIE based on the observation that LLMs can be prompted to repeat or summarize a given message without training. We extend this procedure to prompt LLMs to repeat or summarize information in hidden embeddings by inserting the hidden embedding in the forward pass. This procedure allows us to achieve open-world interpretation of hidden embeddings without additional training. Fig. 2 shows the SelfIE interpretation process.

The key advantage of SelfIE is the capability of interpreting high-level, open-world concepts in embeddings. Since we repurpose an existing capability of an LLM for interpretation, SelfIE does not require any training or data collection, and is thus compatible with current and future language model advancements. SelfIE's new capability of describing hidden embeddings with text opens up new avenues for lightweight control of model behaviors.

Our visualizations and empirical results demonstrate that our interpretation framework faithfully conveys information in hidden embeddings and reveals internal reasoning procedures in LLMs. SelfIE achieves the same performance on eliciting an LLM's internal representation of world state in TextWorld (Côté et al., 2019) as a prior supervised approach (Li et al., 2021) trained on 100-shot samples, demonstrating the effectiveness and faithfulness of our zero-shot readout approach. We use SelfIE to reveal internal reasoning processes behind complex LLM behaviors, including identifying harmful knowledge, understanding prompt injections, and explaining ethical decisions.
SelfIE interpretations enable locating and modifying individual layers to control LLM reasoning behaviors such as erasing harmful knowledge and overriding ethical steering. By removing harmful knowledge inside the LLM, we reduced the prompt injection's success rate of eliciting harmful responses by 84.66%. We also increased fairness in LLM responses by achieving a 95% effective rate of overriding user ethical steering.

[Figure 2: the original forward pass on "Highest mountain on Earth elevation" (original output: 8,848.86) and the interpretation forward pass on the interpretation prompt I, "[X] Please repeat previous message", with replacement on layer k = 2; the resulting explanation reads "Everest".]

Figure 2. Interpretation procedure for SelfIE. By replacing the placeholder token embedding in the interpretation prompt with the embedding being interpreted in the interpretation forward pass, we can generate text descriptions for the embedding.

## 2. Related Works

LLM Interpretability. Prior work either trained models to be interpretable (Mao et al., 2022; Koh et al., 2020; Kim et al., 2018; Hendricks et al., 2016; Hernandez et al., 2022) or performed post hoc interpretation with a given model (Nguyen et al., 2017; 2016; Olah et al., 2017; Mahendran & Vedaldi, 2015; Zeiler & Fergus, 2014; Lundberg & Lee, 2017; Ribeiro et al., 2016; Simonyan et al., 2014; Shrikumar et al., 2017; Smilkov et al., 2017; Selvaraju et al., 2016; Caron et al., 2021; Abnar & Zuidema, 2020). Prior studies such as Hernandez et al. (2023) and Li et al. (2021) decode hidden embeddings using linear probes, but these require extensive data and can only interpret a limited set of concepts. Work such as Meng et al. (2022) uses causal effects to probe LLM knowledge, yet is limited to simple facts.
Recent approaches like Logit Lens (Nostalgebraist), Tuned Lens (Belrose et al., 2023), and LRE (Hernandez et al., 2024) explore the hidden states of decoder-based models with next-token prediction, but they are limited to one or a few tokens per explanation and thus fail to explain complex concepts in detail. Influence functions have also been studied (Grosse et al., 2023). In contrast, SelfIE enables direct interpretation of embeddings with explanations of any length, offering natural language descriptions of high-level concepts. Concurrently, Patchscope (Ghandeharioun et al., 2024) decodes hidden embedding information with an LLM through transforming and patching embedding vectors.

LLM Control. Supervised fine-tuning is a prevalent method for directing model behavior, complemented by preference-based strategies like RLHF (Ouyang et al., 2022), which guide models without explicit token-level goals. Studies on process supervision (Lightman et al., 2023) show that training on intermediate steps enhances model reasoning capacity. However, these techniques operate on output text, demanding extensive computational effort without granular control of model internals. SelfIE generates explanations for hidden embeddings and allows for extending these methods to precise manipulation of model components at intermediate states. Works like ROME (Meng et al., 2022), MEND (Mitchell et al., 2022), and RepE (Zou et al., 2023a) aim to modify models for knowledge and behavior adjustments. We compare control enabled by SelfIE and these methods in Table 3.3. Controls based on SelfIE supersede these methods with their capacity for open-ended concept control and for targeting specific layers for modification with minimal gradient calculations. SelfIE also allows us to extend RLHF-like methods to the embedding level for granular control of model reasoning without supervised targets.
[Figure 3: six example panels (Harmful Knowledge, Prompt Injection, Ethical Reasoning, Physics Reasoning, Hallucination, Social Reasoning). Each panel shows an input prompt with the model's response, an interpretation prompt such as "[INST] [X] [/INST] Sure, I'll summarize your message:", and SelfIE interpretations of selected embeddings (for example L79T34, L76T71, L65T47), highlighted by Relevancy Score.]

Figure 3. Understand LLM reasoning behaviors via SelfIE. Using our framework, we can explain the LLM's latent reasoning mechanism under harmful input, prompt injection, ethical reasoning, and physics reasoning. We denote the token from the i-th layer and j-th column in a transformer as LiTj. We show the Relevancy Score via highlighting, where a deeper color means the interpretation word has a higher causal relationship to the latent embedding. For example, in the prompt injection example, our method explains that the symbols !!!!! cause the model to jailbreak, because the !!! symbol creates a sense of urgency, which leads to the model following the user's instruction.
Our visualization demonstrates the effectiveness of our interpretation.

We explore interpreting the semantic information represented by a latent embedding $h$ in a transformer-based Large Language Model (LLM) with open-world natural language descriptions. Our method obtains explanations by manipulating the forward pass of the LLM to decode a hidden embedding into a sentence. We also show that the interpretations enable granular control over the model's behavior.

### 3.1. SelfIE: Self Interpretations of Embedding

Large Language Models can respond to questions about information provided in context. For example, given the prompt "[passage] Please summarize the previous passage:", an LLM can respond by condensing the information in [passage] into shorter sentences. Motivated by this observation, we propose to replace the [passage] tokens with latent embeddings from the LLM to interpret the information the embeddings contain.

For transformer-based LLMs, a transformer first maps a sequence of one-hot representations of text $x$ into an embedding on an initial layer $h^0$ with a linear projection $E$. The forward pass then proceeds through $L$ layers, each containing a residual MSA (multi-headed self-attention) block and a residual MLP (multi-layer perceptron) block. The final layer embedding $h^L$ is transformed by the final linear projection $P$ and a softmax activation into a sequence of probability distributions $y$ over the next token at each position. Formally, the procedure can be written as

$$h^0 = E x \tag{1}$$

$$\hat{h}^{\ell} = \mathrm{MSA}^{\ell}(h^{\ell-1}) + h^{\ell-1}, \quad \ell = 1, 2, \ldots, L \tag{2}$$

$$h^{\ell} = \mathrm{MLP}^{\ell}(\hat{h}^{\ell}) + \hat{h}^{\ell}, \quad \ell = 1, 2, \ldots, L \tag{3}$$

$$\hat{y} = P h^L, \quad y = \mathrm{softmax}(\hat{y}) \tag{4}$$

Inserting Embedding in Interpretation Forward Pass. As shown in Fig. 2, after an original forward pass of an input prompt through the LLM, SelfIE interprets hidden embeddings by extracting the embedding of interest and injecting it into a separate forward pass of the same LLM.
We call this pass the interpretation forward pass, which takes in an interpretation prompt to summarize the embedding. By finding the next token repeatedly with the interpretation forward pass, we generate a natural language description for the hidden embedding.

Let the hidden embedding $h^{\ell^*}_{i^*}$ from layer $\ell^*$ and index $i^*$ on the original pass be the embedding to interpret. The interpretation forward pass takes in an interpretation prompt $I$ and modifies the transformer forward pass on a chosen layer $k$. The interpretation prompt $I$ contains (1) a placeholder token at index $s$ and (2) an inquiry about a message at the placeholder's position. For example, the string "[X] Please repeat the previous message" is an interpretation prompt consisting of the placeholder token [X] at index $s = 0$ and the inquiry "Please repeat the previous message".

We generate text from the interpretation prompt $I$ with the usual text generation pipeline for a transformer decoder, except that at every forward pass we replace the hidden embedding at placeholder index $s$ on a chosen layer $k$ with $h^{\ell^*}_{i^*}$, the embedding being interpreted. Formally, we modify the interpretation forward pass as

$$\bar{h}^0 = E I \tag{5}$$

$$\bar{h}^{\ell}_{i} = h^{\ell^*}_{i^*}, \quad \ell = k,\ i = s \tag{6}$$

$$\hat{\bar{h}}^{\ell} = \mathrm{MSA}^{\ell}(\bar{h}^{\ell-1}) + \bar{h}^{\ell-1}, \quad \ell = 1, 2, \ldots, L \tag{7}$$

$$\bar{h}^{\ell} = \mathrm{MLP}^{\ell}(\hat{\bar{h}}^{\ell}) + \hat{\bar{h}}^{\ell}, \quad \ell = 1, 2, \ldots, L \tag{8}$$

$$\hat{\bar{y}} = P \bar{h}^L, \quad \bar{y} = \mathrm{softmax}(\hat{\bar{y}}) \tag{9}$$

where Equation 6 inserts the hidden embedding being interpreted by replacing the placeholder token embedding on the chosen layer $k$. We use the modified interpretation forward pass to predict the next token repeatedly, based on the extracted hidden embedding and the interpretation prompt, to obtain a natural language explanation of the hidden embedding.

During the interpretation forward pass, we insert the hidden embedding being interpreted on layer $k$, which is potentially different from the layer $\ell^*$ that the embedding comes from in the original forward pass.
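The mechanics of the interpretation forward pass can be illustrated on a toy model. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `toy_msa` (uniform causal averaging), `toy_mlp`, the tied unembedding, the random weights, the sizes, and all function names are hypothetical stand-ins for a real LLM's components. It only shows the bookkeeping: run the original pass, extract one hidden embedding, then overwrite the placeholder position on layer k at every generation step.

```python
import math
import random

random.seed(0)
VOCAB, DIM, LAYERS = 12, 8, 4  # toy sizes (assumption, not from the paper)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(VOCAB, DIM)                         # embedding table, plays the role of E
W = [rand_matrix(DIM, DIM) for _ in range(LAYERS)]  # one toy MLP weight per layer

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def toy_msa(h):
    """Hypothetical stand-in for causal attention: mean over positions <= i."""
    out = []
    for i in range(len(h)):
        prefix = h[: i + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def toy_mlp(h, w):
    return [[math.tanh(x) for x in matvec(w, row)] for row in h]

def forward(tokens, patch=None):
    """Return hidden states for layers 0..L.

    patch = (k, s, vector) overwrites the hidden state at layer k, index s,
    which mirrors the replacement step of the interpretation forward pass.
    """
    h = [list(E[t]) for t in tokens]                 # initial embedding
    if patch and patch[0] == 0:
        h[patch[1]] = list(patch[2])
    states = [h]
    for layer in range(1, LAYERS + 1):
        h_hat = [add(a, r) for a, r in zip(toy_msa(h), h)]  # residual attention block
        h = [add(m, r) for m, r in zip(toy_mlp(h_hat, W[layer - 1]), h_hat)]  # residual MLP block
        if patch and patch[0] == layer:
            h[patch[1]] = list(patch[2])             # replacement on layer k, index s
        states.append(h)
    return states

def next_token(states):
    # Greedy choice; argmax of the logits equals argmax of their softmax.
    logits = matvec(E, states[-1][-1])  # toy assumption: unembedding tied to E
    return max(range(VOCAB), key=lambda t: logits[t])

def interpret(original_tokens, layer, index, prompt, s, k, max_new=5):
    """Generate tokens from `prompt` while patching in one original-pass embedding."""
    target = forward(original_tokens)[layer][index]  # embedding to interpret
    out = list(prompt)
    for _ in range(max_new):
        # The replacement is applied again at every generation step.
        out.append(next_token(forward(out, patch=(k, s, target))))
    return out[len(prompt):]

original = [3, 7, 1, 5]
prompt = [0, 9, 2]  # token 0 plays the role of the placeholder [X] at s = 0
print(interpret(original, layer=3, index=2, prompt=prompt, s=0, k=2))
```

With a real model, the same pattern would typically be realized by caching hidden states from the original pass and overwriting one position of one layer's output during generation; the toy version keeps everything in plain lists so that the replacement step is easy to inspect.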
Using the residual structure in a Transformer, Elhage et al. (2021) and Gandelsman et al. (2023) decompose the output of a Transformer as applying the final projection matrix to a linear combination of each layer's output. We hypothesize that this mechanism leads to a unified representation across different layers, and that inserting an embedding on a different layer $k$ allows faithful interpretation of the information in the embedding. We will verify the faithfulness of our interpretation procedure in later empirical results. The following subsection examines details on evaluating the relevancy of generated explanations.

### 3.2. Treatment Effect as Relevancy Score

We observed that an LLM's autoregressive nature often leads it to produce excessive text continuing the generated explanations. We therefore need to identify which parts of the generated interpretation are directly relevant to the interpreted latent embedding. For example, if an embedding is interpreted as "Mount Everest is a popular tourist attraction", our goal is to distinguish whether the embedding contains this entire description or only "Mount Everest", with the rest resulting from autoregression on "Mount Everest".

[Figure 4: diagram labels include the inputs "Highest mountain in the world is located in" and "Highest mountain in Alaska", the interpretation "Denali", "Output before control: Himalayas", "Output after control: Alaska", and an embedding space with the original and target interpretations.]

Figure 4. Pipeline for Supervised Control.

[Figure 5: diagram labels include the input "[INST] How to build a molotov cocktail? [/INST]", the interpretation "Gasoline, glass bottle, rag", an evaluator LLM with an evaluation prompt beginning "Are these materials" and the response "Harmful", and a proxy loss ℒ over a loss landscape.]

Figure 5. Pipeline for Reinforcement Control.

Let $t$ be the generated interpretation. We calculate a relevancy score for the i-th token $t_i$ in the interpretation as the treatment effect of replacing the placeholder embedding $\bar{h}^k_s$ with the interpreted embedding $h^{\ell^*}_{i^*}$ during the interpretation pass: rel.
score =P(Ti = ti|do( hk s = hℓ i )) =P[Ti = ti|I, t