# Optimal ablation for interpretability

Maximilian Li (Harvard University) and Lucas Janson (Harvard University)

Interpretability studies often involve tracing the flow of information through machine learning models to identify specific model components that perform relevant computations for tasks of interest. Prior work quantifies the importance of a model component on a particular task by measuring the impact of performing ablation on that component, or simulating model inference with the component disabled. We propose a new method, optimal ablation (OA), and show that OA-based component importance has theoretical and empirical advantages over measuring importance via other ablation methods. We also show that OA-based component importance can benefit several downstream interpretability tasks, including circuit discovery, localization of factual recall, and latent prediction.

1 Introduction

Interpretability work in machine learning (ML) seeks to develop tools that make models more intelligible to humans in order to better monitor model behavior and predict failure modes. Early work in interpretability sought to identify relationships between model outputs and input features (Ribeiro et al., 2016; Covert et al., 2022), but with only black-box query access to observe inputs and outputs, it can be difficult to evaluate a model's internal logic. Hence, recent interpretability work often seeks to take advantage of access to an ML model's intermediate computations to gain insights about its decision-making, focusing on deciphering internal units like neurons, weights, and activations (Räuker et al., 2022). In addition to finding associations between latent representations and semantic concepts (Bau et al., 2017; Mu and Andreas, 2021; Burns et al., 2022; Li et al., 2023; Gurnee and Tegmark, 2024), interpretability studies aim to investigate how intermediate results are used in later computation and to identify specific model components that extract relevant information or perform necessary computation to produce low loss on particular inputs.

A key instrumental goal in interpretability is quantifying the importance of a particular model component for prediction. Studies often measure a component's importance by performing ablation on that component and comparing model performance with and without the component ablated. Ablating a component typically entails replacing its value with a counterfactual value during inference, sometimes referred to as activation patching. However, the details vary greatly and there is a lack of consensus on best practices (Heimersheim and Nanda, 2024). For example, Meng et al. (2022) adds Gaussian noise to ablated components' values, while Geva et al. (2023) replaces these values with zeros, and Ghorbani and Zou (2020) replaces them with their means over the training distribution.

In this paper, we present optimal ablation (OA), a new method that sets a component's value to a constant that minimizes the expected loss of the ablated model. In Section 2, we introduce OA and show that it is, in a certain sense, a canonical choice of ablation method for measuring component importance. We then show that using OA produces meaningful improvements for several common downstream applications of measuring component importance. In Section 3, we apply OA to algorithmic circuit discovery (Conmy et al., 2023), or the identification of a sparse subset of components sufficient for performance on a subset of the training distribution.
We demonstrate that OA-based performance is a reasonable metric for evaluating circuits and that using OA for circuit discovery produces smaller and lower-loss circuits than previous ablation methods. In deploying OA to this application, we also propose a new search algorithm for identifying sparse circuits that achieve low loss according to any performance metric. In Section 4, we use OA to locate relevant components for factual recall (Meng et al., 2022) and show that OA better identifies important components compared to prior work. In Section 5, we apply OA to latent prediction (Belrose et al., 2023a), or the elicitation of output predictions using intermediate activations. We propose an OA-based prediction map and show that it has better predictive power and causal faithfulness than previous methods.

2 Optimal ablation

2.1 Motivation

Let $M$ represent a model that is trained to minimize $\mathbb{E}_{X,Y}\,\mathcal{L}(M(X), Y)$ for a given loss function $\mathcal{L}$ and a distribution of random input-label pairs $(X, Y)$. A common theme in interpretability work is quantifying the importance of some model component $A$ for inference. For example, $A$ could represent a single neuron, a direction in activation space, a token embedding, an attention head, or an entire transformer layer; further examples of $A$ are discussed in Section 3 and Appendix C.2. Let $A(x)$ represent the value of $A$ when the model is evaluated on input $x$. To identify domain specialization among model components, studies often measure the importance of $A$ for model performance on a particular subtask $D$, an interpretable human-curated distribution of input-label pairs that captures a general aspect of model behavior. We write $(X, Y) \sim D$ to indicate sampling input-label pairs, or $X \sim D$ to indicate sampling only inputs, from the subtask distribution.

Although some works quantify component importance via gradients (Leino et al., 2018; Dhamdhere et al., 2018), such an approach is inherently local (even when aggregated over many inputs) and as such can fail to accurately represent the overall importance of $A$ in highly nonlinear models. Instead, most interpretability studies use ablation to quantify the importance of $A$ by studying the gap in performance between the full model $M$ and a modified version $M_{\setminus A}$ with $A$ ablated:

$$\Delta(M, A) := P(M_{\setminus A}) - P(M), \tag{1}$$

where $P$ is a selected metric for model performance. In the context of measuring importance, we argue that the construction of $M_{\setminus A}$ is motivated by the following question: What is the best performance on subtask $D$ the model $M$ could have achieved without component $A$? To formalize this question, we split its meaning into four elements.

I. "Performance on subtask $D$": The relevant performance metric $P$ is the expected loss on the subtask with respect to the full model's predictions: $P(\tilde{M}) = \mathbb{E}_{X \sim D}\,\mathcal{L}(\tilde{M}(X), M(X))$ (note that $\mathbb{E}$ aggregates over any randomness in $\tilde{M}$).¹ We call $\Delta$ defined using this choice of $P$ the ablation loss gap.

II. "Model $M$ could have achieved": Since the goal of measuring component importance is to interpret the model $M$, the ablated model $M_{\setminus A}$ should be constructed solely by changing the value of $A$, holding fixed all other parts of $M$. We write $M_A(x, a)$ to indicate computing $M$ on input $x$ while setting the value of $A$ to $a$ instead of $A(x)$ (see Appendix C.2 for details).

III. "Without component $A$": The ablated model $M_{\setminus A}(x)$ should use a value for $A$ that is devoid of information about the input $x$.
Elements II and III motivate the following definition:

Definition 2.1 (Total ablation). A total ablation method satisfies $M_{\setminus A}(X) = M_A(X, \tilde A)$ for some random $\tilde A$, where $\tilde A \perp X$. (Conversely, for a partial ablation method, $\tilde A$ can depend on $X$.)

IV. "Best performance": To measure the importance of $A$, we wish to understand how much model performance degrades as a result of ablating $A$. If two constructions of the ablated model $M_{\setminus A}$ both perform total ablation on $A$ but one performs worse than another, the former's underperformance cannot be entirely attributed to ablating $A$, since the latter also totally ablates $A$ and yet does not degrade performance to the same extent. Thus, the relevant $M_{\setminus A}$ for measuring importance should incur the minimum $\Delta$ among total ablation methods.

To make element IV more concrete, for an ablation method satisfying element II and a given $x$, replacing $A(x)$ with $a$ can degrade the ablated model's performance via both deletion and spoofing:

1. Deletion. The original value $A(x)$ could carry informational content specific to $x$ that serves some mechanistic function in downstream computation and helps the model arrive at its prediction $M(x)$. Replacing $A(x)$ with some other value $a$ could delete this information about $x$, hindering the model's ability to compute the original $M(x)$.

2. Spoofing. The replacement value $a$ could spoof the downstream computation by inserting information about the input that either: (a) causes the model to treat $x$ like a different input $x'$;² or (b) causes the model to treat $x$ in a way that it never treated any training input, if the new information is inconsistent with information about $x$ derived from retained components. In the latter case, the confluence of conflicting information could cause later activations to become incoherent, causing performance degradation because these abnormal activations were not observed during training and thus not necessarily regulated to lead to reasonable predictions.

To measure importance, we seek to isolate the contribution of effect 1 to performance degradation. While total ablation methods all capture a maximal deletion effect since $\tilde A$ does not depend on $X$, measuring the best performance minimizes the additional contribution of potential spoofing effects.

¹ A common alternative choice is measuring performance in terms of proximity to the correct labels rather than the original model predictions, i.e. $P(\tilde M) = \mathbb{E}_{(X,Y) \sim D}\,\mathcal{L}(\tilde M(X), Y)$. See Appendix C.4 for discussion.

² See Appendix D.1 for a brief example of why treating $x$ like $x'$ goes beyond deletion.

2.2 Prior work

Component importance is strongly related to variable importance (Sobol, 1993; Homma and Saltelli, 1996; Breiman, 2001; Ishwaran, 2007; Gromping, 2009), a longstanding area of research in statistics and ML. The vast body of work in this area is too extensive to review here, and the recent surge of research interest in interpreting internal model components has raised new and unique challenges relating to the values of internal components often being deterministically related. Thus, we focus on recent work applying ablation methods to internal components in this section, and defer broader discussion to Appendix B. Ablation methods previously applied to internal components can be separated into subtask-agnostic methods, which can be applied out-of-the-box to any subtask, and subtask-specific methods, which only work on subtasks for which inputs satisfy a designated structure, and even then require human ingenuity to adapt to each new subtask.
Subtask-agnostic ablation methods include zero ablation (Baan et al., 2019; Lakretz et al., 2019; Bau et al., 2020; Olsson et al., 2022; Geva et al., 2023; Cunningham et al., 2023; Merullo et al., 2024; Gurnee et al., 2024), which replaces $A(x)$ with zero, i.e. $M_{\setminus A}(x) = M_A(x, 0)$; mean ablation (Ghorbani and Zou, 2020; McDougall et al., 2023; Tigges et al., 2023; Gould et al., 2023; Li et al., 2024; Marks et al., 2024), which replaces $A(x)$ with its mean, i.e. $M_{\setminus A}(x) = M_A(x, \mathbb{E}_{X' \sim D}[A(X')])$; and (marginal) resample ablation (Chan et al., 2022; Lieberum et al., 2023; McGrath et al., 2023; Belrose et al., 2023a; Rushing and Nanda, 2024), which replaces $A(x)$ with $A(X')$ for an independent copy $X' \sim D$ of the input, i.e. $M_{\setminus A}(x) = M_A(x, A(X'))$. While zero, mean, and resample ablation are total ablation methods, adding Gaussian noise to $A(x)$ (Meng et al., 2022) is a subtask-agnostic partial ablation method. These methods are all applicable to any subtask.

On the other hand, subtask-specific ablation methods rely on particular details of a chosen subtask. Singh et al. (2024) replaces $A(x)$ with interpretable values, e.g. setting an attention pattern to copy from the previous token, while Goldowsky-Dill et al. (2023) replaces $A(x)$ with $A(x')$ for an interpretable reference input $x'$. Hanna et al. (2023) employs counterfactual ablation (CF), a partial ablation method that replaces $A(x)$ with $A(\pi(x))$, where $\pi$ is a map that sends each input $x$ to a neutral (potentially random) input $\pi(x)$ that preserves most aspects of $x$ but removes information relevant to the subtask, i.e. $M_{\setminus A}(x) = M_A(x, A(\pi(x)))$. Wang et al. (2022) also considers a counterfactual distribution of inputs for counterfactual mean ablation, which replaces $A(x)$ with its mean over the distribution of counterfactuals, i.e. $M_{\setminus A}(x) = M_A\big(x, \mathbb{E}_{X' \sim D,\,\pi}[A(\pi(X'))]\big)$.

Subtask-specific methods can be useful, but it is usually unclear how to generalize them beyond the subtask originally selected. CF is the most popular among these methods, leveraged by a range of manual (Vig et al., 2020; Merullo et al., 2023; Stolfo et al., 2023; Tigges et al., 2023; Hendel et al., 2023; Heimersheim and Janiak, 2023; Todd et al., 2024; Marks et al., 2024) and algorithmic (Conmy et al., 2023; Syed et al., 2023) studies and recommended by meta-studies (Zhang and Nanda, 2024; Heimersheim and Nanda, 2024). For text data, the effectiveness of CF relies heavily on token parallelism between $x$ and $\pi(x)$, which typically share exact tokens at all but a few sequence positions. Though studies have thus far focused on toy subtasks for which suitable mappings $\pi$ are relatively easily constructed, it may be difficult or impossible to select well-suited input pairs for certain subtasks (see Appendix F.3 for a few simple examples), especially more general model behaviors. Even for subtasks that admit such a mapping, how $\pi(x)$ is engineered to withhold subtask-relevant information differs from subtask to subtask, and the construction of $\pi$ for each particular subtask is a subjective process that requires human ingenuity. Finally, CF is only a partial ablation method; since $A(\pi(x))$ depends on $x$, it may give away information about $x$ that is useful for performance on $D$.
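To make these subtask-agnostic methods concrete, below is a minimal PyTorch-style sketch (ours, not the paper's implementation) that applies zero, mean, or marginal resample ablation to a single component via a forward hook and estimates the ablation loss gap of Equation (1); the `loader` of subtask inputs and the precomputed `stored_mean` and `stored_bank` activation statistics are assumed to be available.

```python
import torch

def ablation_hook(mode, stored_mean=None, stored_bank=None):
    """Forward hook that replaces a component's output with an input-independent
    value, realizing the total ablation methods described above."""
    def hook(module, inputs, output):
        if mode == "zero":
            return torch.zeros_like(output)                    # zero ablation
        if mode == "mean":
            return stored_mean.expand_as(output)               # mean ablation: E_{X'~D}[A(X')]
        if mode == "resample":                                 # marginal resample ablation
            idx = torch.randint(len(stored_bank), (output.shape[0],))
            return stored_bank[idx]                            # A(X') for an independent X' ~ D
        return output
    return hook

def ablation_loss_gap(model, component, loader, loss_fn, mode, **stats):
    """Monte Carlo estimate of the ablation loss gap Delta(M, A) of Equation (1),
    measured against the full model's own predictions (loss_fn returns per-example losses)."""
    total, n = 0.0, 0
    for x in loader:
        with torch.no_grad():
            clean = model(x)                                   # M(X)
            handle = component.register_forward_hook(ablation_hook(mode, **stats))
            ablated = model(x)                                 # M_{\A}(X)
            handle.remove()
        base = loss_fn(clean, clean).sum().item()              # P(M); zero for e.g. KL divergence
        total += loss_fn(ablated, clean).sum().item() - base
        n += x.shape[0]
    return total / n
```

Because the replacement value never depends on the current input $x$, each of these variants is a total ablation method in the sense of Definition 2.1; the Gaussian-noise and counterfactual methods would instead pass input-dependent values and are therefore partial ablations.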
2.3 Definition and properties of optimal ablation

We present optimal ablation (OA), our proposed approach to simulating component removal.

Definition 2.2 (Optimal ablation). To ablate $A$, we replace $A(x)$ with an optimal constant $a^*$:

$$M_{\setminus A}^{(\mathrm{opt})}(x) := M_A(x, a^*), \qquad a^* := \arg\min_a\ \mathbb{E}_{X \sim D}\,\mathcal{L}(M_A(X, a), M(X)). \tag{2}$$

We define $\Delta_{\mathrm{opt}}$ by plugging $M_{\setminus A}^{(\mathrm{opt})}(x)$ into Equation (1). Like zero, mean,³ and resample ablation, optimal ablation is a total ablation method satisfying Definition 2.1. But among all total ablation methods, optimal ablation is optimal in the sense that it yields the lowest $\Delta$.

Proposition 2.3. Let $\Delta(M, A)$ be the ablation loss gap for some component $A$ measured with any total ablation method. Then, $\Delta_{\mathrm{opt}}(M, A) \le \Delta(M, A)$.

Proof. Consider a total ablation method that defines $M_{\setminus A}(X)$ by replacing $A(X)$ with $\tilde A$ (per Definition 2.1), and let $\Delta(M, A)$ be the measured ablation loss gap. By the tower property,

$$\Delta(M, A) = \mathbb{E}_{(X \sim D),\tilde A}\,\mathcal{L}(M_A(X, \tilde A), M(X)) = \mathbb{E}_{\tilde A}\Big[\,\mathbb{E}\big[\mathcal{L}(M_A(X, \tilde A), M(X)) \,\big|\, \tilde A\big]\Big].$$

Since $\tilde A \perp X$,

$$\mathbb{E}\big[\mathcal{L}(M_A(X, \tilde A), M(X)) \,\big|\, \tilde A = a\big] = \mathbb{E}_{X \sim D}\,\mathcal{L}(M_A(X, a), M(X)) =: g(a).$$

For every $a$,

$$\Delta_{\mathrm{opt}}(M, A) = \mathbb{E}_{X \sim D}\,\mathcal{L}(M_A(X, a^*), M(X)) \le \mathbb{E}_{X \sim D}\,\mathcal{L}(M_A(X, a), M(X)) = g(a),$$

so $\Delta_{\mathrm{opt}}(M, A) \le \mathbb{E}_{\tilde A}\, g(\tilde A) = \Delta(M, A)$.

Optimal ablation thus provides the unique answer to our motivating question in Section 2.1, since it produces the best performance among all total ablation methods, including zero, mean, and resample ablation. Intuitively, OA minimizes the contribution of spoofing (effect 2 from Section 2.1) to $\Delta$ by setting ablated components to constants $a^*$ that are maximally consistent with information from other components, e.g. by conveying a lack of information about $x$ or by hedging against a wide range of possible $x$ rather than strongly associating with a particular input other than the original $x$. OA does not entirely eliminate spoofing, since it may be the case that every possible value of $A$ conveys at least weak information to the model. However, the excess ablation gap $\Delta - \Delta_{\mathrm{opt}}$ for $\Delta$ measured with ablation methods that replace $A(x)$ with a (random) value $\tilde A$ is entirely caused by spoofing, since replacing $A(x)$ with the constant $a^*$ achieves lower loss without giving away any more information about $x$. In practice, $\Delta - \Delta_{\mathrm{opt}}$ for prior ablation methods is typically very large compared to $\Delta_{\mathrm{opt}}$ for both single components (see Table 1) and circuits (see Section 3.2) on prototypical language subtasks. This disparity indicates that effect 2 dominates the $\Delta$ measurements for previous ablation methods, making them poor estimators for effect 1 compared to OA.

Subtask-specific methods often try to generate consistent interventions by conditioning on features of the input to avoid replacing $A(x)$ with values that could confuse the model. For CF, choosing $\pi(x)$ to share many tokens with $x$ mitigates the contribution of effect 2b to $\Delta_{\mathrm{CF}}$, which is the main reason the technique is so widely employed. Thus, among previous measures of component importance, $\Delta_{\mathrm{CF}}$, when it can be well-constructed, may be the best quantification of effect 1. To demonstrate this intuitive relation between OA and CF as techniques that aim to isolate effect 1, we perform a case study in Section 2.4 that shows that among other ablation methods, OA produces the measurements most similar to CF. However, not only is OA more general than subtask-specific methods like CF, but $\Delta_{\mathrm{opt}}$ may still be a better estimator for effect 1 than $\Delta_{\mathrm{CF}}$ even when CF is well-defined. In Section 3, we show that for circuits, $\Delta_{\mathrm{opt}}$ is much lower than $\Delta_{\mathrm{CF}}$ despite reflecting a weakly stronger deletion effect, indicating that effect 2 also contributes to $\Delta_{\mathrm{opt}}$ less than it does to $\Delta_{\mathrm{CF}}$, and thus $\Delta_{\mathrm{opt}}$ is a more accurate reflection of components' informational importance.

³ See Appendix D.2 for an interesting connection of OA to mean ablation.

Computation of $a^*$. Though it is impossible to derive $a^*$ in closed form, we find that in practice, mini-batch stochastic gradient descent (SGD) performs well at finding constants $\hat a$ that greatly reduce $\Delta$ compared to heuristic point estimates like zero and the mean. We generally adopt the approach of initializing each $\hat a$ to the subtask mean $\mathbb{E}_{X \sim D}[A(X)]$ and performing SGD to minimize $\Delta$.
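As a concrete illustration of this procedure (a sketch under our own naming, assuming the component of interest is exposed as a module whose output can be overridden by a forward hook), the constant can be fit by SGD starting from the subtask mean:

```python
import torch

def fit_optimal_constant(model, component, loader, loss_fn, mean_activation,
                         lr=1e-2, epochs=3):
    """Approximate a* from Definition 2.2 by minimizing the expected loss of the
    ablated model against the full model's predictions, via mini-batch SGD."""
    a_hat = mean_activation.clone().requires_grad_(True)   # initialize at E_{X~D}[A(X)]
    optimizer = torch.optim.Adam([a_hat], lr=lr)

    def hook(module, inputs, output):
        return a_hat.expand_as(output)                      # replace A(x) with the constant a_hat

    for _ in range(epochs):
        for x in loader:
            with torch.no_grad():
                clean = model(x)                            # target: the full model's prediction M(X)
            handle = component.register_forward_hook(hook)
            ablated = model(x)                              # M_A(X, a_hat)
            handle.remove()
            loss = loss_fn(ablated, clean).mean()           # mini-batch estimate of the ablation loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return a_hat.detach()                                   # approximate optimal constant a*
```

Starting from the subtask mean means the initial iterate is exactly mean ablation, so the fitted constant can only match or improve on that heuristic on the training batches.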
2.4 Comparison of single-component ablation results on IOI

The Indirect Object Identification (IOI) subtask (Wang et al., 2022) consists of prompts like "When Mary and John went to the store, Mary gave the apple to ___", which GPT-2-small (Radford et al., 2019) completes with the correct indirect object noun ("John"). We use IOI as a case study because it is discussed extensively in interpretability work (Merullo et al., 2023; Makelov et al., 2023; Wu et al., 2024; Lan et al., 2024; Zhang and Nanda, 2024). To implement CF, for each prompt $x$, Wang et al. (2022) constructs a random $\pi(x)$ in which the names are replaced with random distinct names. We evaluate $\Delta$ for attention heads and MLP blocks using zero, mean, resample, counterfactual, counterfactual mean, and optimal ablation. In Table 1, we show that among attention heads and MLP blocks, $\Delta_{\mathrm{opt}}$ accounts for only 11.1% of $\Delta_{\mathrm{zero}}$, 33.0% of $\Delta_{\mathrm{mean}}$, and 17.7% of $\Delta_{\mathrm{resample}}$ for the median component. Furthermore, among these measurements, $\Delta_{\mathrm{opt}}$ has the highest rank correlation (0.907) with $\Delta_{\mathrm{CF}}$. Full results are shown in Appendix E.3.

Table 1: Comparison of ablation loss gap $\Delta$ on IOI

|  | Zero | Mean | Resample | CF-Mean | Optimal | CF |
|---|---|---|---|---|---|---|
| Rank correlation with $\Delta_{\mathrm{CF}}$ | 0.590 | 0.825 | 0.828 | 0.833 | 0.907 | 1 |
| Median ratio of $\Delta_{\mathrm{opt}}$ to $\Delta$ | 11.1% | 33.0% | 17.7% | 31.7% | 100% | 88.9% |

3 Application: circuit discovery

Circuit discovery is the selection of a sparse subnetwork of $M$ that is sufficient for the recovery of model performance on an algorithmic subtask $D$. To define what constitutes a sparse subnetwork, we write $M$ as a computational graph with vertices $G$ and edges $E$. An edge $e_k := (A_j, A_i, z) \in E$ indicates that $A_j(x)$ is taken as the $z$th input to $A_i$ in the computation represented by the graph. To ablate edge $e_k$, we replace the $z$th input to $A_i$, which is equal to $A_j(x)$ during normal inference, with some value $a$. We compute $M_{\tilde E}(X)$, which represents modified inference with edges $E \setminus \tilde E$ ablated, by applying this intervention for each ablated edge (see Appendices C.1 and C.2 for more details). Circuit discovery aims to select a subset of edges $\tilde E \subseteq E$ such that

$$E^* = \arg\min_{\tilde E \subseteq E} \Big[\,\mathbb{E}_{X \sim D}\,\mathcal{L}(M_{\tilde E}(X), M(X)) + R(\tilde E)\,\Big] = \arg\min_{\tilde E \subseteq E} \Big[\,\Delta(M, E \setminus \tilde E) + R(\tilde E)\,\Big] \tag{3}$$

for a regularization term $R$ that measures the sparsity level (further discussed in Appendix F.4). Additionally, when implementing OA, though we could use a different constant for each edge, we instead define a single constant $a^*_j$ for each vertex $A_j$, so that if multiple out-edges from $A_j$ are ablated, the same value is passed to each of its children (further discussed in Appendix F.2).

3.1 Methods

We compare $\Delta(M, E \setminus \tilde E)$ measured with mean ablation, resample ablation, OA, and CF as metrics for circuit discovery. We consider the manual circuit for each subtask and circuits optimized on each metric using several search algorithms.

ACDC (Conmy et al., 2023) identifies circuits by iteratively considering edge ablations. They start by proposing $\tilde E = E$, then iterate over edges $e_k$, ablating $e_k$ and updating $\tilde E = \tilde E \setminus \{e_k\}$ if the marginal impact on loss, $\Delta\big(M, (E \setminus \tilde E) \cup \{e_k\}\big) - \Delta\big(M, E \setminus (\tilde E \cup \{e_k\})\big)$, is below a tolerance threshold $\lambda$.
Edge Attribution Patching (EAP) (Syed et al., 2023) selects $\tilde E$ to contain the edges $e_k$ that have the largest gradient approximation of their single-edge ablation loss gap $\Delta(M, \{e_k\})$.

Hard Concrete Gradient Sampling (HCGS) is an adaptation of a pruning technique from Louizos et al. (2018) to circuit discovery. Rather than considering only total ablation of an edge $e_k = (A_j, A_i, z)$, we can consider a continuous mask of coefficients $\alpha$ and partially ablate $e_k$ by replacing the $z$th input to $A_i$ with a linear combination of the original value $A_j(x)$ and ablated value $a_j$, i.e. $\alpha_k A_j(x) + (1 - \alpha_k) a_j$. Now, $\alpha_k = 0$ designates total ablation (replacing with $a_j$), while $\alpha_k = 1$ designates total retention (keeping $A_j(x)$). We use $M_{\alpha}(x)$ to represent the model with edges partially ablated according to $\alpha$. Some previous work (Liu et al., 2017; Huang and Wang, 2018) optimizes directly on the mask coefficients $\alpha$, but to avoid getting stuck in local minima at $\alpha_k \in (0, 1)$, Louizos et al. (2018) samples $\alpha_k$ from a Hard Concrete distribution parameterized by location $\theta_k$ and temperature $\beta_k$ for each edge, and performs SGD with respect to the distributional parameters. In effect, we update the parameters based on gradients evaluated at randomly sampled values of $\alpha$ rather than gradients evaluated at any exact $\alpha$. Cao et al. (2021) applies this technique to find circuits that consist of a subset of model weights. Conmy et al. (2023) applies this technique to vertices in a computational graph. Unlike previous work, we apply this technique to edges rather than vertices.

Uniform Gradient Sampling (UGS) is our proposed method for algorithmic circuit discovery. Similar to HCGS, we consider ablation coefficients $\alpha$ and update parameters based on gradients evaluated at sampled values of $\alpha$. We keep track of a parameter $\tilde\theta_k$ for each edge, where $\theta_k = (1 + \exp(-\tilde\theta_k))^{-1}$ indicates an estimated probability of $e_k \in E^*$. Using $w(\theta_k) = \theta_k(1 - \theta_k)$ to determine sampling frequency (further discussed in Appendix F.8), we let $\alpha_k \sim \mathrm{Unif}(0, 1)$ with probability (w.p.) $w(\theta_k)$, $\alpha_k = 1$ w.p. $\theta_k - \tfrac{1}{2} w(\theta_k)$, and $\alpha_k = 0$ w.p. $1 - \theta_k - \tfrac{1}{2} w(\theta_k)$. For a batch of $b$ inputs $X^{(1)}, \ldots, X^{(b)}$, let $\alpha^{(j)}$ denote the sampled coefficients corresponding to $X^{(j)}$, and let $N_k = \sum_{j=1}^{b} \mathbf{1}\big(\alpha_k^{(j)} \in (0, 1)\big)$. We construct a loss function $L^{(\mathrm{UGS})}$ whose gradient satisfies

$$\nabla_{\tilde\theta_k} L^{(\mathrm{UGS})} = \nabla_{\tilde\theta_k} R(\tilde\theta) + N_k^{-1} \sum_{j=1}^{b} \mathbf{1}\big(\alpha_k^{(j)} \in (0, 1)\big)\,\nabla_{\alpha_k^{(j)}}\,\mathcal{L}\big(M_{\alpha^{(j)}}(X^{(j)}), M(X^{(j)})\big) \tag{4}$$

and perform SGD on the $\tilde\theta_k$, where $R(\tilde\theta)$ is a continuous relaxation of $R(\tilde E)$ from Equation (3). In Appendix F.5, we motivate UGS as an estimator for sampling over Bernoulli edge coefficients.

Optimizing circuits on $\Delta_{\mathrm{opt}}$. ACDC and EAP are not compatible with optimization on $\Delta_{\mathrm{opt}}$, since the optimal $a^*$ values depend on the selected circuit and it is intractable to optimize $\hat a$ for every candidate circuit. For our circuit evaluations on $\Delta_{\mathrm{opt}}$, we compare to ACDC- and EAP-generated circuits optimized on $\Delta_{\mathrm{CF}}$. On the other hand, HCGS and UGS allow us to perform SGD to optimize the ablation constants $\hat a$ concurrently with the sampling parameters.
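For concreteness, here is a small sketch (ours; names are illustrative) of the three-way sampling of UGS mask coefficients described above; draws that land in the open interval $(0, 1)$ are the ones that contribute gradient terms in Equation (4).

```python
import torch

def sample_ugs_coefficients(theta_raw):
    """Sample one set of UGS edge coefficients alpha given raw parameters theta_raw,
    where theta = sigmoid(theta_raw) estimates the probability that an edge is kept:
    alpha ~ Unif(0,1) w.p. w(theta), alpha = 1 w.p. theta - w/2, alpha = 0 otherwise,
    with w(theta) = theta * (1 - theta)."""
    theta = torch.sigmoid(theta_raw)                 # theta_k = (1 + exp(-theta_raw_k))^{-1}
    w = theta * (1 - theta)                          # probability of drawing a partial coefficient
    u = torch.rand_like(theta)                       # selects the branch for each edge
    partial = torch.rand_like(theta)                 # alpha ~ Unif(0, 1) for the partial branch
    keep = torch.ones_like(theta)                    # alpha = 1: total retention
    drop = torch.zeros_like(theta)                   # alpha = 0: total ablation
    return torch.where(u < w, partial,
                       torch.where(u < w + theta - 0.5 * w, keep, drop))
```

Each sampled $\alpha_k$ then enters the partially ablated forward pass through $\alpha_k A_j(x) + (1 - \alpha_k)\hat a_j$, and the gradients of the loss with respect to the interior $\alpha_k$ are accumulated into the update for $\tilde\theta_k$, while the ablation constants $\hat a_j$ can be updated by ordinary SGD in the same pass.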
3.2 Experiments

We study GPT-2-small performance on the IOI (Wang et al., 2022) subtask described in Section 2.4 and the Greater-Than (Hanna et al., 2023) subtask, which involves completing prompts such as "The conflict started in 1812 and ended in 18__" with digits greater than the first year in the context. We select these settings because their exposition in manual studies is particularly thorough and they are used in prior work (Conmy et al., 2023; Syed et al., 2023) to benchmark algorithmic circuit discovery.

We compare the algorithms in Section 3.1 trained to minimize $\Delta$ on the IOI and Greater-Than subtasks when edges $E \setminus \tilde E$ are ablated with mean ablation, resample ablation, OA, and CF. For IOI, the mapping $\pi$ for CF is defined in Section 2.4. For Greater-Than, we continue with the practice from Hanna et al. (2023) of selecting counterfactuals $\pi(x)$ by changing the first year in the prompt $x$ to end in 01 so that all numerical completions are equally valid (see Appendix F.3).

UGS achieves Pareto dominance on the $\Delta$–$|\tilde E|$ tradeoff over the other methods on both subtasks for each ablation method, identifying circuits that achieve lower $\Delta$ at any given $|\tilde E|$ and vice versa. Results for IOI circuits optimized on $\Delta_{\mathrm{CF}}$ are shown in Figure 1 (left). On IOI, UGS finds a circuit with 385 edges that achieves $\Delta_{\mathrm{CF}} = 0.220$. This circuit has 52% fewer edges than the smallest ACDC-identified circuit with comparable $\Delta_{\mathrm{CF}}$ and 48% lower $\Delta_{\mathrm{CF}}$ than the best-performing ACDC-identified circuit with a comparable edge count. Similar improvements to the Pareto frontier, shown in Appendix F.10, occur for mean, resample, and optimal ablation. UGS also creates Pareto improvements for Greater-Than circuits for each ablation method; see Appendix F.11.

Figure 1: Left: Circuit discovery Pareto frontier for the IOI subtask with counterfactual ablation. Right: Comparison of ablation methods for circuit discovery on IOI (X indicates manual circuit evaluated on each ablation method). $\Delta$ is measured in KL-divergence.

Applying OA to circuit discovery reveals that certain sparse circuits can account for model performance on these subtasks to a much greater extent than previously known. We visualize the $\Delta$ for each ablation method achieved by UGS-identified circuits in Figure 1 (right). Using OA to ablate excluded components, we find circuits that recover much lower $\Delta$ at any given circuit size than any circuit for which excluded components are ablated with any other ablation method. For example, for IOI, at a circuit size of 1,000 edges, ablating excluded components with OA enables the existence of circuits with 32% lower $\Delta$ compared to CF, 62% lower compared to mean ablation, and 88% lower compared to resample ablation, and the improvement is even larger at smaller circuit sizes. For Greater-Than (results shown in Appendix F.11), OA again admits circuits with by far the lowest $\Delta$ among the four ablation methods. Thus, OA paints a more accurate and compelling picture of how much small subsets of the model can explain behavior on these subtasks.

Unlike other ablation methods, OA indicates that the manual circuits are approximately optimal for their size. Holding $|\tilde E|$ fixed, the Pareto-optimal $\Delta_{\mathrm{opt}}$ is 29% below the $\Delta_{\mathrm{opt}}$ of the manual circuit on IOI and 42% below the $\Delta_{\mathrm{opt}}$ of the manual circuit on Greater-Than. However, for the other ablation methods, optimized circuits with fewer edges than the manual circuit achieve 84-85% lower $\Delta$ than the manual circuit on IOI, and 70-84% lower on Greater-Than. Since the manual circuits are selected using a thorough mechanistic understanding of the model for each subtask and thus arguably capture the important components, this finding furthers the notion that $\Delta$ measured with previous methods could be artificially high due to spoofing by ablated components, and therefore $\Delta_{\mathrm{opt}}$ is a superior evaluation metric for circuits. These results show that $\Delta_{\mathrm{opt}}$ is useful for evaluating and discovering circuits and provide evidence that OA better quantifies the removal of important mechanisms than previous ablation methods.
4 Application: factual recall

Transformers can store and retrieve a large corpus of factual associations. One goal in interpretability is localizing factual recall, or identifying components that store specific facts. To this end, Meng et al. (2022) proposes causal tracing, which involves removing important information about the prompt $x$ and evaluating which components can recover the original $M(x)$. To isolate components responsible for an association between a subject (e.g. "Eiffel Tower") and an attribute ("located in Paris"), they select a prompt $x$ ("The Eiffel Tower is located in the city of ___") that elicits from $M$ a correctly memorized response $y$ ("Paris"). They produce a corrupted input $\xi_{GN}(x)$ by adding a Gaussian noise (GN) term $Z \sim N(0, 9\Sigma)$ to all token embeddings that encode the subject, where $\Sigma$ is a diagonal matrix and $\Sigma_{ii}$ represents the variance of the $i$th neuron among token embeddings sampled from the training distribution. Let $[M(x)]_y$ represent the probability assigned by $M(x)$ to label $y$. Since $\xi_{GN}$ partially ablates information about the subject, $[M(\xi_{GN}(x))]_y$ is typically much smaller than $[M(x)]_y$. For each component $A$, they estimate its contribution to the recall of $y$ with the following average indirect effect (AIE), representing the proportion of probability on the correct $y$ recovered by replacing $A(\xi(X))$ with $A(X)$, averaged over $(X, Y) \sim D$, where $\xi = \xi_{GN}$:⁴

$$\mathrm{AIE}(A) := \max\!\left(0,\; 1 - \frac{\mathbb{E}_{(X,Y) \sim D}\,\max\!\big(0,\, [M(X)]_Y - [M_A(\xi(X), A(X))]_Y\big)}{\mathbb{E}_{(X,Y) \sim D}\,[M(X)]_Y - \mathbb{E}_{(X,Y) \sim D}\,[M(\xi(X))]_Y}\right), \tag{5}$$

where we declare $\mathrm{AIE}(A) = 0$ if the denominator is non-positive and ablating subject tokens actually helps identify the correct label (however, this is never the case).

⁴ Unlike Meng et al. (2022), we clip $[M(X)]_Y - [M_A(\xi(X), A(X))]_Y$ to be non-negative, so we do not give $A$ additional credit for increasing the probability mass of the true label past that given by the full model. We also report AIE as the proportion of probability recovered relative to the full model, rather than in percentage points.

Method. We perform causal tracing by removing the subject with optimal ablation (OA-tracing, or OAT) rather than with Gaussian noise (GNT). We define $\xi(x) = \xi_{OA}(x, a_A)$ by replacing subject token embeddings with a constant $a_A$ trained to minimize the numerator in Equation (5), which represents $\Delta$ with a carefully chosen loss function (see Appendix G.2).

Figure 2: Comparison of AIE with GNT and OAT. In the top figure, layer $\ell$ on the x-axis represents replacing a sliding window of 5 layers with $\ell$ as the median. Error bars indicate the sample estimate plus/minus two standard errors (details given in Appendix G.4).

Experiments. We compare GNT and OAT for GPT-2-XL on a dataset of subject-attribute prompts from Meng et al. (2022) for which the model completes the correct answer via sampling with temperature 0. To increase the sample size, we augment the data with similarly constructed prompts from follow-up work on factual recall (Hernandez et al., 2022). We train OAT on 60% of the dataset and evaluate both methods on the other 40%. On the test set, $\mathbb{E}[M(X)]_Y = 30.6\%$, $\mathbb{E}[M(\xi_{GN}(X))]_Y = 12.3\%$, and $\mathbb{E}[M(\xi_{OA}(X))]_Y = 8.7\%$. We let $A(X)$ represent an attention or MLP layer output at a certain token position(s): namely, all subject token positions, only the last subject token position, and only the last token position in the entire sequence. Rather than considering only one layer at a time, Meng et al. (2022) lets $A$ represent the outputs of a sliding window of several consecutive attention layers or MLP layers. Thus, in addition to replacing the output of a single layer (window size 1), we show results for replacing windows of sizes 5 and 9.
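As a concrete reference, the following is a minimal sketch of the AIE estimator in Equation (5) (our own helper, assuming the per-example correct-label probabilities from the clean run, the corrupted run, and the corrupted run with component $A$ restored have already been collected):

```python
def average_indirect_effect(prob_clean, prob_corrupt, prob_restored):
    """Estimate AIE(A) from Equation (5), given lists of per-example probabilities
    of the correct label: [M(X)]_Y, [M(xi(X))]_Y, and [M_A(xi(X), A(X))]_Y."""
    n = len(prob_clean)
    # Numerator: probability still missing after restoring A, clipped at 0 per footnote 4.
    unrecovered = sum(max(0.0, c - r) for c, r in zip(prob_clean, prob_restored)) / n
    # Denominator: total probability lost due to the corruption xi.
    total_drop = sum(prob_clean) / n - sum(prob_corrupt) / n
    if total_drop <= 0:
        return 0.0          # declared zero when the denominator is non-positive
    return max(0.0, 1.0 - unrecovered / total_drop)
```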
OAT offers a more precise localization of relevant components compared to GNT. While GNT indicates a small positive AIE for most components, OAT shows a few components have large contributions while most have little to no effect. For example, Figure 2 (top left) shows that the AIE for a window of 5 attention layers at the last token is as high as 42.6% for the window consisting of layers 30-34, while the AIE peaks at only 20.2% for GNT. On the other hand, for windows centered around layers 15-23, the average AIE for OAT is only 1.7%, indicating little effect for these potentially unimportant layers, compared to 7.0% for GNT. For sliding windows of 9 attention layers at subject token positions, GNT shows marginally positive AIE measurements across layers 0-30, but OAT specifically shows highly positive AIE for layers 0-5 and 25-30 (see Figure 14). Moreover, whereas Meng et al. (2022) focuses on sliding-window replacement because GNT effects from single-layer replacements are very small, OAT can sometimes identify information gain from just one layer. For instance, at the last token position, OAT records AIEs above 8% for each of attention layers 30, 32, and 34 by themselves (see Figure 2, bottom left), much greater than the AIE of the other layers. This greater level of granularity opens up the possibility of selectively investigating combinations of layers as opposed to relying on the prior that adjacent layers work together.

5 Application: latent prediction

One practice in interpretability is eliciting predictions from latent representations. Let $M$ have layers $0, \ldots, N$ and let $\ell_i(X)$ be the residual stream activation at the last token position (LTP) after layer $i$. Logit attribution (Geva et al., 2022; Wang et al., 2022; Dar et al., 2023; Katz and Belinkov, 2023; Dao et al., 2023; Merullo et al., 2024; Halawi et al., 2024) is the practice of applying a transformer model's unembedding map to an activation to obtain a semantic interpretation of that activation. When applied to the LTP activation after layer $i$, this practice is equivalent to zero-ablating layers $i+1$ to $N$. However, the semantic meanings of LTP activations after layer $N$ can be different from those of LTP activations in earlier layers. As an alternative, tuned lens (Belrose et al., 2023a; Din et al., 2023) is a linear map $f_i(\ell_i) = W_i \ell_i + b_i$ that translates from $\ell_i(X)$ to a predicted $\hat\ell_N(X)$. $M_{TL}(X)$ is defined by replacing $\ell_N(X)$ with $\hat\ell_N(X) := f_i(\ell_i(X))$ during inference, and training $W_i$ and $b_i$ to minimize $L_{TL} := \mathbb{E}_X\,\mathcal{L}(M_{TL}(X), M(X))$. Tuned lens demonstrates when information is transferred to LTP: if replacing $\ell_N(X)$ with $\hat\ell_N(X)$ achieves low loss, then $\ell_i(X)$ contains sufficient context for computing $M(X)$, so key information is transferred prior to layer $i$.

Method. We propose Optimal Constant Attention (OCA) lens. We define $M_{OCA}(X)$ by using OA to ablate attention layers $i+1$ to $N$: for each of these layers $k$, we replace its output at LTP with a constant $\hat a_k$. We train $\hat a = (\hat a_{i+1}, \ldots, \hat a_N)$ to minimize $L_{OCA} := \mathbb{E}_X\,\mathcal{L}(M_{OCA}(X), M(X))$. Similar to tuned lens, OCA lens reveals whether the LTP activation after layer $i$ contains sufficient context to compute $M(X)$ by eliminating information transfer from previous token positions to LTP after layer $i$. While tuned lens is a linear map, OCA lens is a function that leverages the model's existing architecture (specifically, its MLP layers) to translate between LTP activations at different layers.
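A minimal sketch of the OCA-lens intervention (ours; it assumes a decoder-only model whose attention sublayers are exposed as modules returning a tensor of shape `(batch, seq, d_model)`, which may require light adaptation for a specific implementation):

```python
import torch

def oca_lens_forward(model, attn_layers, x, layer_i, constants):
    """Run M_OCA(X): for every attention sublayer after layer_i, replace its output
    at the last token position (LTP) with a learned constant, so the LTP prediction
    can only draw on the residual stream as it stood after layer_i."""
    handles = []
    for k, attn in enumerate(attn_layers):
        if k <= layer_i:
            continue
        def hook(module, inputs, output, k=k):
            patched = output.clone()
            patched[:, -1, :] = constants[k]     # constant attention output a_hat_k at LTP
            return patched
        handles.append(attn.register_forward_hook(hook))
    try:
        return model(x)
    finally:
        for h in handles:
            h.remove()
```

The constants would then be trained by SGD to minimize $\mathcal{L}(M_{OCA}(X), M(X))$, in the same way as the single-component constant in Section 2.3.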
OCA lens has far fewer learnable parameters than tuned lens: $O(N d_{\mathrm{model}})$ versus $O(d_{\mathrm{model}}^2)$.

Experiments. We compare $L_{OCA}$ to $L_{TL}$ for various model sizes. As additional baselines, we also consider the ablation of later attention layers with mean or resample ablation rather than OA. Results are shown in Figure 3 (left) for GPT-2-XL and Figure 15 for other model sizes. OCA lens achieves significantly lower loss than tuned lens, indicating better extraction of predictive power from LTP activations. For example, the predictive loss of OCA lens drops to below 0.01 around layer 35 of GPT-2-XL, but does not reach this point even at the last layer for tuned lens.

Figure 3: Left: Prediction loss comparison between tuned lens and ablation-based alternatives. Middle, right: Causal faithfulness metrics for tuned and OCA lens under basis-aligned projections.

Additionally, Belrose et al. (2023a) explains that one desideratum for latent prediction is causal faithfulness, i.e. $f_i$ should use $\ell_i(X)$ in the same way as $M$. We can investigate causal faithfulness by intervening on $\ell_i(X)$ and evaluating the extent to which $M_{TL}(X)$ and $M(X)$ move in parallel. If $M_{TL}(X)$ changes significantly but $M(X)$ does not, for example, then $f_i$ could be extrapolating from spurious correlations, e.g. by inferring from directions that predict information transfer that occurs in later layers. Consider a random intervention $\xi$ on $\ell_i(X)$ and let $M_{TL}(X; \xi)$ represent replacing $\ell_i(X)$ with $\xi(\ell_i(X))$ before applying $f_i$. Similarly, let $M(X; \xi)$ represent replacing $\ell_i(X)$ with $\xi(\ell_i(X))$ during inference. Belrose et al. (2023a) separates causal faithfulness into two measurable properties (both range from -1 to 1 and higher values reflect greater faithfulness):

1. Magnitude correlation: $\mathrm{corr}\big(\mathbb{E}[\mathcal{L}(M_{TL}(X; \xi), M_{TL}(X)) \mid \xi],\ \mathbb{E}[\mathcal{L}(M(X; \xi), M(X)) \mid \xi]\big)$.

2. Direction similarity: $\mathbb{E}\big[\big\langle M_{TL}(X; \xi) \ominus M_{TL}(X),\ M(X; \xi) \ominus M(X)\big\rangle\big]$, where $\ominus$ denotes subtraction in logit space and $\langle \cdot, \cdot \rangle$ denotes the Aitchison similarity between distributions.

We assess these properties for $M_{TL}$ and $M_{OCA}$ for a variety of interventions $\xi$. In Figure 3 (middle, right), we plot these properties for a modified version of the causal basis projection $\xi$ from Belrose et al. (2023a). While they train a basis iteratively, this approach is expensive and unstable, and we instead extract an approximate basis for $M_{TL}$ by performing singular value decomposition (SVD) on $W_i \Sigma^{1/2}$, where $\Sigma$ is the covariance matrix of $\ell_i(X)$, and applying $\Sigma^{1/2}$ to the right singular vectors. For $M_{OCA}$, we extract this basis by training a linear map to approximate $f_i$ and using the weights as the $W_i$. For both lenses, we compute $\xi(a) = \mu + p(a - \mu)$, where $\mu = \mathbb{E}[\ell_i(X)]$ and $p$ represents projecting to the orthogonal complement of $\mathrm{span}(v)$ for a uniformly sampled basis vector $v$. We plot the magnitude correlation and direction similarity for $M_{TL}$ and $M_{OCA}$ with respect to $M$ in Figure 3. We find that OCA lens measures significantly better on both causal faithfulness metrics across all layers, and we achieve similar results for other choices of interventions $\xi$ (see Appendix H.3).

One downstream application of extracting latent predictions from intermediate-layer LTP activations is that they can sometimes be more accurate on text classification tasks than the model's output predictions, especially if the context contains false demonstrations, i.e. examples of incorrect task completions (Halawi et al., 2024).
The proposed theory is that the model first computes the correct answer at LTP in early layers, then later layers move contextual information to LTP that lead it to make adjustments that benefit next-token prediction, such as reporting an incorrect answer for consistency with false demonstrations. We compare the elicitation accuracy boost, or the best elicitation accuracy across layers minus the accuracy of the model output, for OCA lens and tuned lens for GPT-2-XL with 2,000 classification samples from each of 15 text classification datasets from Halawi et al. (2024), using their calibrated accuracy metric. We find that OCA lens increases this accuracy boost for prompts with true demonstrations on 12 of the 15 datasets and for prompts with false demonstrations on 11 of the 15 (see Figure 21). In particular, for Wikipedia topic classification (DBPedia), OCA lens increases the elicitation accuracy boost from 2.9% to 18.0% with true demonstrations and from 19.2% to 28.8% with false demonstrations (see Figure 4, middle). Full results are reported in Appendix H.4. Figure 4: Comparison of calibrated elicitation accuracy on selected datasets. 6 Future work The applications of component importance presented in our work are not exhaustive. A variety of interpretability work either directly applies ablation-based importance or can be framed to use it as a potential tool. OA creates new opportunities to incorporate ablation into studies for which it may be impossible to obtain good results with previous ablation methods. For example, we can train probes derived from using OA with different loss functions (Li et al., 2023), or use an approach similar to OCA lens to decode activations other than the LTP residual stream activation. See Appendix D.3 for an extension of OA to evaluate the extent to which a component performs classification. Acknowledgements LJ was partially supported by DMS-2045981 and DMS-2134157. Baan, J., ter Hoeve, M., van der Wees, M., Schuth, A., and de Rijke, M. (2019). Understanding multi-head attention in abstractive summarization. Bach, S., Binder, A., Montavon, G., Klauschen, F., Muller, K.-R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLo S One, 10(7):e0130140. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Mueller, K.-R. (2009). How to explain individual classification decisions. Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. Bau, D., Zhu, J.-Y., Strobelt, H., Lapedriza, A., Zhou, B., and Torralba, A. (2020). Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071 30078. Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., Mc Kinney, L., Biderman, S., and Steinhardt, J. (2023a). Eliciting latent predictions from transformers with the tuned lens. Belrose, N., Schneider-Joseph, D., Ravfogel, S., Cotterell, R., Raff, E., and Biderman, S. (2023b). Leace: Perfect linear concept erasure in closed form. Breiman, L. (2001). Random forests. Machine Learning, 45(1):5 32. Burns, C., Ye, H., Klein, D., and Steinhardt, J. (2022). Discovering latent knowledge in language models without supervision. Cao, S., Sanh, V., and Rush, A. M. (2021). Low-complexity probing via finding subnetworks. ar Xiv preprint ar Xiv:2104.03514. 
Chan, L., Garriga-Alonso, A., Goldowsky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B., and Thomas, N. (2022). Causal scrubbing: A method for rigorously testing interpretability hypotheses. https://www.alignmentforum.org/posts/Jv Zhhzyc Hu2Yd57RN/ causal-scrubbing-a-method-for-rigorously-testing. Chang, C.-H., Creager, E., Goldenberg, A., and Duvenaud, D. (2019). Explaining image classifiers by counterfactual generation. Chen, H., Feng, S., Ganhotra, J., Wan, H., Gunasekara, C., Joshi, S., and Ji, Y. (2021). Explaining neural network predictions on sentence pairs via learning word-group masks. Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. (2018). Learning to explain: An informationtheoretic perspective on model interpretation. Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023). Towards automated circuit discovery for mechanistic interpretability. Covert, I., Lundberg, S., and Lee, S.-I. (2020). Understanding global feature contributions with additive importance measures. Covert, I. C., Lundberg, S., and Lee, S.-I. (2022). Explaining by removing: A unified framework for model explanation. J. Mach. Learn. Res., 22(1). Cunningham, H., Ewart, A., Smith, L. R., Huben, R., and Sharkey, L. (2023). Sparse autoencoders find highly interpretable model directions. Dabkowski, P. and Gal, Y. (2017). Real time image saliency for black box classifiers. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. Dao, J., Lau, Y.-T., Rager, C., and Janiak, J. (2023). An adversarial example for direct logit attribution: Memory management in gelu-4l. Dar, G., Geva, M., Gupta, A., and Berant, J. (2023). Analyzing transformers in embedding space. Datta, A., Sen, S., and Zick, Y. (2016). Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pages 598 617. De Cao, N., Schlichtkrull, M., Aziz, W., and Titov, I. (2021). How do decisions emerge across layers in neural models? interpretation with differentiable masking. Dhamdhere, K., Sundararajan, M., and Yan, Q. (2018). How important is a neuron? Din, A. Y., Karidi, T., Choshen, L., and Geva, M. (2023). Jump to conclusions: Short-cutting transformers with linear transformations. Fisher, A., Rudin, C., and Dominici, F. (2019). All models are wrong, but many are useful: Learning a variable s importance by studying an entire class of prediction models simultaneously. Fong, R., Patrick, M., and Vedaldi, A. (2019). Understanding deep networks via extremal perturbations and smooth masks. Fong, R. C. and Vedaldi, A. (2017). Interpretable explanations of black boxes by meaningful perturbation. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE. Geva, M., Bastings, J., Filippova, K., and Globerson, A. (2023). Dissecting recall of factual associations in auto-regressive language models. Geva, M., Caciularu, A., Wang, K. R., and Goldberg, Y. (2022). Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. Ghorbani, A. and Zou, J. (2020). Neuron shapley: Discovering the responsible neurons. Goldowsky-Dill, N., Mac Leod, C., Sato, L., and Arora, A. (2023). Localizing model behavior with path patching. ar Xiv preprint ar Xiv:2304.05969. Gould, R., Ho, E., and Conmy, A. (2023). 
Mechanistically interpreting time in gpt-2 small. Grömping, U. (2007). Estimators of relative importance in linear regression based on variance decomposition. The American Statistician, 61(2):139 147. Gromping, U. (2009). Variable importance assessment in regression: Linear regression versus random forest. The American Statistician, 63(4):308 319. Guan, C., Wang, X., Zhang, Q., Chen, R., He, D., and Xie, X. (2019). Towards a deep and unified understanding of deep neural models in NLP. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2454 2463. PMLR. Gurnee, W., Horsley, T., Guo, Z. C., Kheirkhah, T. R., Sun, Q., Hathaway, W., Nanda, N., and Bertsimas, D. (2024). Universal neurons in gpt2 language models. Gurnee, W. and Tegmark, M. (2024). Language models represent space and time. Halawi, D., Denain, J.-S., and Steinhardt, J. (2024). Overthinking the truth: Understanding how language models process false demonstrations. Hanna, M., Liu, O., and Variengien, A. (2023). How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. Hase, P., Bansal, M., Kim, B., and Ghandeharioun, A. (2023). Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. Hase, P., Xie, H., and Bansal, M. (2021). The out-of-distribution problem in explainability and search methods for feature importance explanations. Heimersheim, S. and Janiak, J. (2023). A circuit for Python docstrings in a 4-layer attentiononly transformer. https://www.alignmentforum.org/posts/u6KXXm KFb Xf Wzo AXn/ a-circuit-for-python-docstrings-in-a-4-layer-attention-only. Heimersheim, S. and Nanda, N. (2024). How to use and interpret activation patching. Hendel, R., Geva, M., and Globerson, A. (2023). In-context learning creates task vectors. Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., and Andreas, J. (2022). Natural language descriptions of deep visual features. Hernandez, E., Sharma, A. S., Haklay, T., Meng, K., Wattenberg, M., Andreas, J., Belinkov, Y., and Bau, D. (2024). Linearity of relation decoding in transformer language models. Homma, T. and Saltelli, A. (1996). Importance measures in global sensitivity analysis of nonlinear models. Reliability Engineering & System Safety, 52(1):1 17. Hooker, S., Erhan, D., Kindermans, P.-J., and Kim, B. (2019). A benchmark for interpretability methods in deep neural networks. Huang, Z. and Wang, N. (2018). Data-driven sparse structure selection for deep neural networks. Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1(none). Janzing, D., Minorics, L., and Blöbaum, P. (2019). Feature relevance quantification in explainable ai: A causal problem. Katz, S. and Belinkov, Y. (2023). Visit: Visualizing and interpreting the semantic information flow of transformers. Kim, S., Yi, J., Kim, E., and Yoon, S. (2020). Interpretation of nlp models through input marginalization. Lakretz, Y., Kruszewski, G., Desbordes, T., Hupkes, D., Dehaene, S., and Baroni, M. (2019). The emergence of number and syntax units in lstm language models. Lan, M., Torr, P., and Barez, F. (2024). Towards interpretable sequence continuation: Analyzing shared circuits in large language models. Leino, K., Sen, S., Datta, A., Fredrikson, M., and Li, L. (2018). 
Influence-directed explanations for deep convolutional networks. Li, C. and Mahadevan, S. (2017). Sensitivity Analysis of a Bayesian Network. ASCE-ASME J Risk and Uncert in Engrg Sys Part B Mech Engrg, 4(1). 011003. Li, J., Monroe, W., and Jurafsky, D. (2017). Understanding neural networks through representation erasure. Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. (2023). Emergent world representations: Exploring a sequence model trained on a synthetic task. Li, M., Davies, X., and Nadeau, M. (2024). Circuit breaking: Removing model behaviors with targeted ablation. Lieberum, T., Rahtz, M., Kramár, J., Nanda, N., Irving, G., Shah, R., and Mikulik, V. (2023). Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. (2017). Learning efficient convolutional networks through network slimming. Louizos, C., Welling, M., and Kingma, D. P. (2018). Learning sparse neural networks through l0 regularization. Lundberg, S. M., Erion, G., Chen, H., De Grave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. (2020). From local explanations to global understanding with explainable ai for trees. Nature machine intelligence, 2(1):56 67. Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. Makelov, A., Lange, G., and Nanda, N. (2023). Is this the subspace you are looking for? an interpretability illusion for subspace activation patching. Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. (2024). Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. Marks, S. and Tegmark, M. (2023). The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. Mase, M., Owen, A. B., and Seiler, B. B. (2024). Variable importance without impossible data. Annual Review of Statistics and Its Application, 11(Volume 11, 2024):153 178. Mc Dougall, C., Conmy, A., Rushing, C., Mc Grath, T., and Nanda, N. (2023). Copy suppression: Comprehensively understanding an attention head. Mc Grath, T., Rahtz, M., Kramar, J., Mikulik, V., and Legg, S. (2023). The hydra effect: Emergent self-repair in language model computations. Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022). Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359 17372. Merullo, J., Eickhoff, C., and Pavlick, E. (2023). Circuit component reuse across tasks in transformer language models. Merullo, J., Eickhoff, C., and Pavlick, E. (2024). Language models implement simple word2vec-style vector arithmetic. Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. (2017). Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211 222. Mu, J. and Andreas, J. (2021). Compositional explanations of neurons. Nanda, N. (2023). Attribution patching: Activation patching at industrial scale. Nathans, L., Oswald, F., and Nimon, K. (2012). Interpreting multiple linear regression: A guidebook of variable importance. Practical Assessment, Research and Evaluation, 17(9):1 19. 
Olsson, C., Elhage, N., Nanda, N., Joseph, N., Das Sarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., Mc Candlish, S., and Olah, C. (2022). In-context learning and induction heads. Petsiuk, V., Das, A., and Saenko, K. (2018). Rise: Randomized input sampling for explanation of black-box models. Rabitz, H. (1989). Systems analysis at the molecular scale. Science, 246(4927):221 226. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "why should i trust you?": Explaining the predictions of any classifier. Robnik-Sikonja, M. and Kononenko, I. (2008). Explaining classifications for individual instances. Knowledge and Data Engineering, IEEE Transactions on, 20:589 600. Rushing, C. and Nanda, N. (2024). Explorations of self-repair in language models. Räuker, T., Ho, A., Casper, S., and Hadfield-Menell, D. (2022). Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. Schlichtkrull, M. S., De Cao, N., and Titov, I. (2022). Interpreting graph neural networks for nlp with differentiable edge masking. Schulz, K., Sixt, L., Tombari, F., and Landgraf, T. (2020). Restricting the flow: Information bottlenecks for attribution. Schwab, P. and Karlen, W. (2019). Cxplain: Causal explanations for model interpretation under uncertainty. Shah, H., Ilyas, A., and Madry, A. (2024). Decomposing and editing predictions by modeling model computation. Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. Singh, A. K., Moskovitz, T., Hill, F., Chan, S. C. Y., and Saxe, A. M. (2024). What needs to go right for an induction head? a mechanistic study of in-context learning circuits and their formation. Slack, D., Hilgard, S., Jia, E., Singh, S., and Lakkaraju, H. (2020). Fooling lime and shap: Adversarial attacks on post hoc explanation methods. Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). Smoothgrad: removing noise by adding noise. Sobol, I. (1993). Sensitivity estimates for nonlinear mathematical models. Computational Mathematics and Mathematical Physics, 1(4):407 413. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929 1958. Stolfo, A., Belinkov, Y., and Sachan, M. (2023). A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9:1 11. Strumbelj, E. and Kononenko, I. (2010). An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11(1):1 18. Strumbelj, E., Kononenko, I., and Sikonja, M. R. (2009). Explaining instance classifications with interactions of subsets of feature values. Data and Knowledge Engineering, 68(10):886 904. Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. Syed, A., Rager, C., and Conmy, A. (2023). Attribution patching outperforms automated circuit discovery. 
Tigges, C., Hollinsworth, O. J., Geiger, A., and Nanda, N. (2023). Linear representations of sentiment in large language models. Todd, E., Li, M. L., Sharma, A. S., Mueller, A., Wallace, B. C., and Bau, D. (2024). Function vectors in large language models. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. (2020). Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388 12401. Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2022). Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. Wei, C., Kakade, S., and Ma, T. (2020). The implicit and explicit regularization effects of dropout. Williamson, B. D. and Feng, J. (2020). Efficient nonparametric statistical inference on population feature importance using shapley values. Wu, Z., Geiger, A., Huang, J., Arora, A., Icard, T., Potts, C., and Goodman, N. D. (2024). A reply to makelov et al. (2023) s "interpretability illusion" arguments. Ye, J., Lu, X., Lin, Z., and Wang, J. Z. (2018). Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. Yoon, J., Jordon, J., and Van der Schaar, M. (2018). Invase: Instance-wise variable selection using neural networks. In International conference on learning representations. Zeiler, M. D. and Fergus, R. (2013). Visualizing and understanding convolutional networks. Zhang, F. and Nanda, N. (2024). Towards best practices of activation patching in language models: Metrics and methods. Zhang, L. and Janson, L. (2020). Floodgate: Inference for model-free variable importance. ar Xiv preprint ar Xiv:2007.01283. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2015). Object detectors emerge in deep scene cnns. Zhuang, T., Zhang, Z., Huang, Y., Zeng, X., Shuang, K., and Li, X. (2020). Neuron-level structured pruning using polarization regularizer. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 9865 9877. Curran Associates, Inc. Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. (2017). Visualizing deep neural network decisions: Prediction difference analysis. A Limitations As is the case for previous work, we do not provide an entirely precise definition of the importance of a component. The importance of a component can be generally described as an aggregation of causal effects in a way that summarizes the component s contribution to model performance. Among the many ways to aggregate causal effects, there may not be a mathematically rigorous way to show that one measure of importance produces the correct or canonical aggregation. However, component importance is useful for a wide variety of applications in interpretability, so aside from showing that our approach to component importance better captures relevant considerations in a conceptual sense, we focus on the utility that it provides to some of these applications. As noted in Section 2.3, optimal ablation does not entirely eliminate the contribution of information spoofing to . For example, if A typically conveys strong and exact information about the input, then there may not exist any value of a that hedges between a range of inputs. 
Though OA produces circuits that achieve lower loss at a given level of edge sparsity, it may elicit mechanisms that were not previously used for some subtask, especially if there are multiple computational paths that could lead to the same conclusion. However, if the subtask of interest is sufficiently complex, it seems unlikely that a model would have many dormant mechanisms that can be repurposed to perform the subtask, because this redundancy wastes computational capacity. For factual recall, it remains to be seen whether localization is helpful for applications that are further downstream, such as producing surgical model edits (Hase et al., 2023; Shah et al., 2024).

B Additional related work

As mentioned in Section 2.2, component importance is strongly related to variable importance, which quantifies the importance of a model input X_i (also known as a feature or covariate in the variable importance literature).

Variable importance. Much of the variable importance literature concerns oracle prediction, which roughly considers how much X_i contributes to the performance of the best possible predictor for Y given the set of covariates (X_1, ..., X_n), and frames the importance of X_i as a property of the joint distribution of (X_1, ..., X_n, Y), rather than a property of any particular model used for prediction. Most work in this area analyzes some parametric class of estimators, like linear models (Grömping, 2007; Nathans et al., 2012) or Bayesian networks (Li and Mahadevan, 2017). Later work generalizes parametric variable importance to arbitrary model classes, e.g. by training an ensemble of models that only have access to subsets of the covariates (Strumbelj et al., 2009; Fisher et al., 2019). Recent work has also studied non-parametric variable importance, in which we attempt to lower-bound the best performance of any arbitrary estimator (Williamson and Feng, 2020; Zhang and Janson, 2020). On the other hand, our motivation is to interpret the behavior of one specific model M (Fisher et al., 2019; Hooker et al., 2019), not to analyze the theoretical relationship between model inputs and outputs. Rather than estimating how well any function of X_i can predict Y, we wish to estimate how much the particular function M uses an input feature X_i to predict Y. Previous work on this algorithmic variant of variable importance has taken two main approaches.

Local function approximations. One way to quantify how much X_i contributes to model performance is to aggregate local function approximations, which approximate the model around a particular input. Common tools for local approximation include the gradient of M at a given x (Rabitz, 1989; Baehrens et al., 2009; Simonyan et al., 2014; Leino et al., 2018; Nanda, 2023), or a linear function that well-approximates M(x + ε) for a chosen noise term ε (Ribeiro et al., 2016; Smilkov et al., 2017). Since these tools often yield straightforward estimates of the local importance of X_i for the input x, one approach to quantifying the global importance of X_i is to aggregate the importance estimates given by these local approximations, for example by using a first-degree Taylor approximation around a reference input x_0 (Bach et al., 2015; Montavon et al., 2017), or by integrating over gradients along a straight-line path from x_0 to any studied input x (Sundararajan et al., 2017; Dhamdhere et al., 2018). This approach to measuring variable importance works just as well for internal components as it does for inputs.
However, local function approximations can fail to capture the overall loss landscape, especially in the common setting where M has unbounded gradients, and can often be manipulated to produce arbitrary feature importance values (Slack et al., 2020; Hase et al., 2021).

Ablation-based measures. The second main approach considers the ablation of feature X_i. In this approach, the feature X_i is ablated by replacing it with a different random variable X̃_i that captures less information about the original feature value. We then compare the model performance when X_i is replaced with X̃_i to the original model performance, per the definition of ∆ in Section 2.1 (where the ablated component A is feature X_i of the model input). Many of the current methods used for ablating internal model components as described in Section 2.2 were first introduced in feature importance work. Zero ablation (Dabkowski and Gal, 2017; Li et al., 2017; Petsiuk et al., 2018; Schwab and Karlen, 2019), mean ablation (Zeiler and Fergus, 2013; Zhou et al., 2015), and Gaussian noise injection (Fong and Vedaldi, 2017; Fong et al., 2019; Guan et al., 2019; Schulz et al., 2020) are all used to remove input features, such as the pixels of an image or the tokens of a text input, to assess their importance. Resample ablation is also common in feature importance work; an early variant samples X̃_i from a uniform distribution (Sobol, 1993; Homma and Saltelli, 1996; Strumbelj and Kononenko, 2010), while later work generally performs resample ablation on features by resampling them from their marginal distribution (Breiman, 2001; Robnik-Sikonja and Kononenko, 2008; Datta et al., 2016; Lundberg and Lee, 2017; Janzing et al., 2019; Covert et al., 2020; Kim et al., 2020).

Measuring feature importance via these ablation methods suffers from a well-documented out-of-distribution problem (Ishwaran, 2007; Fong and Vedaldi, 2017; Hooker et al., 2019; Hase et al., 2021; Mase et al., 2024): since setting X_i to zero or its mean, resampling X_i from its marginal distribution, or adding Gaussian noise to X_i could result in an input that was never observed during training, the measured feature importance values could potentially be determined by model behavior on impossible and/or nonsensical inputs. One way to mitigate this out-of-distribution problem is to replace feature X_i by a random variable X̃_i sampled from its conditional distribution given the other features (Strobl et al., 2008; Lundberg et al., 2020), i.e. X̃_i ∼ X_i | X_{-i} with X̃_i independent of X_i given X_{-i}, where X_{-i} denotes the other features X_1, ..., X_{i-1}, X_{i+1}, ..., X_n. Since the conditional distribution is often intractable to sample from, previous work employs a range of approximation techniques. For example, Zintgraf et al. (2017) samples an ablated pixel from its conditional distribution given an ℓ × ℓ patch of its proximate pixels instead of conditioning on the entire image, and Chang et al. (2019) uses a generative model to simulate the conditional distribution.

However, in a setting where the relevant features X_i represent internal model components, rather than inputs to the model, it often does not make sense to discuss an out-of-distribution problem, because the component values (X_1, ..., X_n) = (A_1(X), ..., A_n(X)) are usually near-deterministically related to each other. For example, for any neuron, it is typically the case that its value can be almost deterministically recovered from the values of other neurons in the same layer.
Thus, nearly any intervention on an internal component A_i brings the model out-of-distribution, in the sense that the model observes the vector (A_1(X), ..., A_n(X)) with A_i(X) replaced by a value a that has near-zero probability density. Our dichotomy of deletion and spoofing in Section 2.1 is more precise than the typical discussion of the out-of-distribution problem in its description of the distortions to importance values that we wish to avoid. On one hand, our analysis is more lenient than the blanket requirement that the vector of all internal component values (A_1(x), ..., A_n(x)) be in-distribution, in the sense that not all interventions that bring (A_1(x), ..., A_n(x)) out of distribution constitute spoofing; for example, replacing A_i(x) with a does not have a spoofing-related contribution to ∆ if A_i(x) and a are equivalent for the sake of downstream computation. On the other hand, our analysis is more stringent in the sense that effect 2a from Section 2.1 is recognized as a form of spoofing that can occur even when interventions are in-distribution (see Appendix D.1 for more details).

Using dropout to eliminate spoofing. One way to eliminate spoofing when intervening on A(X) is to train M to accept neutral constant values that indicate that component A has stopped functioning, and then replace A with these built-in neutral values to assess the importance of A. Variations of this technique are common in feature importance (Strumbelj et al., 2009; Chen et al., 2018; Yoon et al., 2018; Hooker et al., 2019). For internal components, we could train neural networks with dropout (Srivastava et al., 2014; Wei et al., 2020) and then use zero ablation to assess the importance of A. Since the downstream computation is trained to recognize A(X) = 0 as an indication that A carries no information, as opposed to strong information associated with an input other than the original X, ∆_zero(M, A) becomes an accurate assessment of deletion (effect 1 from Section 2.1). However, re-training with neutral values does not necessarily assist in analyzing a particular model M, since re-training M will in general change M itself. Furthermore, training with dropout incentivizes M to lower ∆_zero(M, A) for any component A, since part of the loss function involves minimizing loss with a random subset of components ablated. As a result, we expect to observe more redundant computation shared between many model components, since a random subset of them could be ablated during training. This redundancy inherently tends to make M less modular and harder to analyze; for example, we should expect a broad variety of components to perform relevant computation for any input, even if an accurate prediction could be computed with just a few components, so it becomes difficult to localize model behaviors. Since interpretability involves decomposing model computation into smaller pieces and identifying specialization among model components, models trained with dropout may be less interpretable. To summarize, while ∆_zero measurements are more accurate when M is trained with dropout, they may become less useful for interpretability. On the other hand, OA makes ∆ measurements a more accurate reflection of deletion effects without re-training M in a way that distorts the magnitude of these effects.
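As a concrete illustration of the quantities discussed above, the following minimal sketch (not the paper's code; the toy model, layer names, and hyperparameters are illustrative assumptions) measures the loss gap from zero, mean, and optimal ablation of one hidden activation in a small PyTorch model, using KL divergence to the unablated model as the loss.

```python
# A minimal sketch comparing zero, mean, and optimal ablation of one hidden activation.
# The component A is the output of `hidden`; the loss gap is KL divergence to the full model.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed, hidden, readout = nn.Linear(16, 32), nn.Linear(32, 32), nn.Linear(32, 10)
for p in list(embed.parameters()) + list(hidden.parameters()) + list(readout.parameters()):
    p.requires_grad_(False)

def forward(x, a_override=None):
    h = torch.relu(embed(x))
    a = hidden(h)
    if a_override is not None:
        a = a_override.expand_as(a)   # replace A(x) with a constant for every input
    return readout(torch.relu(a))

X = torch.randn(512, 16)                       # stand-in for the subtask distribution D
with torch.no_grad():
    ref = F.log_softmax(forward(X), dim=-1)    # full-model predictions M(X)
    A_vals = hidden(torch.relu(embed(X)))      # realized component values A(X)

def loss_gap(a_const):
    logp = F.log_softmax(forward(X, a_const), dim=-1)
    return F.kl_div(logp, ref, log_target=True, reduction="batchmean")

delta_zero = loss_gap(torch.zeros(1, 32))
delta_mean = loss_gap(A_vals.mean(dim=0, keepdim=True))

# Optimal ablation: optimize a single constant to minimize the ablated model's loss.
a_opt = A_vals.mean(dim=0, keepdim=True).clone().requires_grad_(True)
opt = torch.optim.Adam([a_opt], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = loss_gap(a_opt)
    loss.backward()
    opt.step()
delta_opt = loss_gap(a_opt.detach())
print(float(delta_zero), float(delta_mean), float(delta_opt))  # expect delta_opt <= delta_mean
```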
Aggregation mechanisms for ablation methods. On top of a selected ablation method, some work uses Shapley values to aggregate performance gap measurements for sets of features (Strumbelj et al., 2009; Strumbelj and Kononenko, 2010; Datta et al., 2016; Lundberg and Lee, 2017; Janzing et al., 2019; Lundberg et al., 2020; Covert et al., 2020). This line of work measures the importance of X_i by estimating a weighted average of the performance gap ∆(M, S) over all subsets S ⊆ {X_1, ..., X_n}, rather than considering only ∆(M, X_i). This aggregation mechanism is applied after choosing an ablation method via which to measure each ∆, and is just as compatible with OA as with any other ablation method.

Sparse pruning and masking. Finally, in the literature on sparse pruning and masking, an operation that is procedurally similar to optimal ablation is sometimes performed by adding a bias term to removed features or activations after setting weights to zero. In some structured pruning work, it is typical to introduce scaled batch normalization layers Ã(X) = γA(X) + β for γ ∈ [0, 1] at the output of each computational block A, and to regularize the γ toward zero to select weights to prune (Liu et al., 2017; Ye et al., 2018; Zhuang et al., 2020). When γ reaches 0, the output of A is set to the constant β, which is trained to minimize the loss of the pruned model. However, the motivation of this reparameterization is not to measure component importance, and optimal ablation can be applied to more general model components (e.g. computational edges in any graph representation). Similar to pruning, sparse masking work searches for a mask over input tokens such that, for any input, most tokens are zeroed out while model performance is retained (Li et al., 2017; De Cao et al., 2021; Chen et al., 2021; Schlichtkrull et al., 2022). In particular, De Cao et al. (2021) replaces masked tokens in an input X = (X_1, ..., X_n) with a learned bias b(X). While this operation may seem similar to optimal ablation, a fundamental difference is that the bias b(X) is different for each input sequence X and is trained to equalize embeddings at different token positions in a single X rather than assuming the same constant value for all values of X. Thus, for each X, b(X) contains specific information about the masked tokens in X, so unlike OA, this technique does not perform total ablation on the masked tokens. A follow-up work, Schlichtkrull et al. (2022), trains a common b for a dataset of inputs X. However, Schlichtkrull et al. (2022) uses an auxiliary linear model ϕ_i(A_1(X), ..., A_n(X)) to predict whether a component A_i(X) should be masked. Since ϕ_i explicitly depends on the values of the masked components A_i(X), the model output remains dependent on information contained in A_i(X), and total ablation is not achieved. Moreover, the auxiliary model ϕ provides the model with additional computation to distill information about the input, rather than strictly reducing the computational complexity of the original model as OA does. The use of an auxiliary model is a requisite feature of their method and cannot be decoupled from the masking technique: without using ϕ(X) to predict the masked components, computing masks would require a separate optimization procedure for each input, which makes it computationally intractable to optimize a single b over an entire distribution of inputs.

C Additional preliminaries

C.1 Models as computational graphs

We can write an ML model M as a connected directed acyclic graph.
The graph's source vertices represent the model's (typically vector-valued) input, its sink vertices represent the model's output, and intermediate vertices represent units of computation. For the sake of simplicity, assume M has a single input and a single output. Each intermediate vertex A_i represents a computational block that takes in the values of previous vertices evaluated on x and itself produces an output A_i(x) that is taken as input to later vertices. We indicate that there exists a directed edge from vertex A_j to vertex A_i if A_j(x) is taken as input to A_i. Let M be represented by the computational graph (G, E), where G is the overall set of vertices and E is the set of edges. Let A_0:n be a tuple representing G in topologically sorted order (A_0 represents the model input, while A_n represents the model output). For a particular vertex A_i, let G^i = (G^i_1, ..., G^i_k) be the tuple of vertices (duplicates allowed) whose outputs A_i takes as immediate inputs. As we will see, we will sometimes require multiple edges between a pair of vertices. Rather than the standard edge notation e = (A_j, A_i) ∈ E for simple graphs, we adopt the notation e = (A_j, A_i, z) ∈ E to indicate that G^i_z = A_j, i.e. A_j(x) is taken as the zth input to A_i.

Model inference is performed by evaluating the vertices in topologically sorted order. We perform inference on an input x by setting A_0(x) = x and then iteratively evaluating A_i(x) = A_i(G^i(x)) for i ∈ {1, ..., n}. By the time we evaluate some vertex A_i, we have already computed the values G^i_z(x) for each of its inputs because they precede A_i in the topological sort. Finally, we determine that M(x) = A_n(x). We will alternate between the notation A_i(G^i), which explicitly writes A_i as a function of its immediate inputs, and the notation A_i(x), which indicates that the output of A_i is a function of x. We also sometimes use A_i(x) as a standalone quantity apart from evaluating M(x) and observe that this quantity is a function of x computed by evaluating A_j(x) in order for j ∈ {1, ..., i}.

The graph notation for any ML model is not unique. For any model, there are many equivalent graphs that faithfully represent its computation. In particular, a computational graph can represent a model at varying levels of detail. At one extreme, intermediate vertices can designate individual additions, multiplications, and nonlinearities. Such a graph would have at least as many vertices as model parameters. Fortunately, most model architectures have self-contained computational blocks, which allows them to be represented by graphs that convey a significantly higher level of abstraction. For example, in convolutional networks, intermediate vertices can represent convolutional filters and pooling layers, while in transformer models, the natural high-level computational units are attention heads and multi-layer perceptron (MLP) modules.

C.2 Activation patching

Activation patching is the practice of evaluating M(x) while performing the intervention of setting some component A_i to a counterfactual value a instead of A_i(x) during inference. We use the notation M_{A_i}(x, a) extensively in the paper to indicate this practice, and here we give a more precise definition in terms of M as a computational graph:

Definition C.1 (Vertex activation patching). To compute M_{A_i}(x, a), compute A_0(x), ..., A_{i-1}(x) in normal fashion and set A_i(x) = a. Then compute each vertex A_{i+1}(x), ..., A_n(x) in order, computing each vertex A_j as a function of its immediate inputs, i.e. A_j(x) = A_j(G^j_1(x), ..., G^j_k(x)).
Finally, return M_{A_i}(x, a) = A_n(x).

During this modified forward pass, a vertex A_j that takes A_i as its zth immediate input, i.e. for which (A_i, A_j, z) ∈ E, takes a as its zth input instead of the normal value of A_i(x). Later, if some other vertex takes A_j as input, it will take this modified version of A_j(x) as input, and so on, so the intervention on A_i may have an effect that carries through the graph computation and eventually makes M_{A_i}(x, a) different from M(x). In Section 3, we discuss extending this practice to edges e:

Definition C.2 (Edge activation patching). To compute M_e(x, a), where e = (A_j, A_i, z), compute A_0(x), ..., A_{i-1}(x) in normal fashion. Set A_i(x) = A_i(G^i_1(x), ..., G^i_{z-1}(x), a, G^i_{z+1}(x), ..., G^i_k(x)), i.e. setting its zth input to a instead of A_j(x). Then compute each vertex A_{i+1}(x), ..., A_n(x) in order as a function of its immediate inputs. Finally, return M_e(x, a) = A_n(x).

As mentioned in the main text, using activation patching on a particular edge e = (A_j, A_i, z) is more surgical than using activation patching on its parent vertex A_j. Performing activation patching on A_j would replace A_j(x) with a as an input to all of its child vertices, but performing activation patching on only e modifies A_j(x) only as an input to A_i. Notice that performing activation patching on A_j(x) is equivalent to performing activation patching on e = (A_j, A_i, z) for all edges e that emanate from A_j in the graph.
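The following minimal sketch illustrates Definitions C.1 and C.2 on a toy computational graph; the dictionary-based representation and helper names are illustrative assumptions rather than the paper's implementation.

```python
# Vertex and edge activation patching on a toy DAG (a sketch of Definitions C.1 and C.2).
from typing import Callable, Dict, List, Optional, Tuple

# parents[i] lists the vertex ids whose outputs are the ordered inputs of A_i;
# vertex 0 is the model input and the largest id is the model output.
Vertex = Callable[[List[float]], float]

def run(vertices: Dict[int, Vertex],
        parents: Dict[int, List[int]],
        x: float,
        patch_vertex: Optional[Tuple[int, float]] = None,
        patch_edge: Optional[Tuple[int, int, int, float]] = None) -> float:
    """Evaluate the graph in topological order (vertex ids assumed topologically sorted),
    optionally patching one vertex or one edge to a constant."""
    values = {0: x}
    for i in sorted(vertices):
        inputs = [values[j] for j in parents[i]]
        if patch_edge is not None:
            j, tgt, z, a = patch_edge          # edge (A_j, A_i, z) patched to constant a
            if tgt == i:
                assert parents[i][z] == j
                inputs[z] = a                  # only this one use of A_j(x) is replaced
        values[i] = vertices[i](inputs)
        if patch_vertex is not None and patch_vertex[0] == i:
            values[i] = patch_vertex[1]        # overwrite A_i(x) everywhere downstream
    return values[max(vertices)]

# Toy graph: A_1 = x + 1, A_2 = x + A_1, A_3 (output) = A_1 + 2*A_2.
vertices = {1: lambda v: v[0] + 1.0,
            2: lambda v: v[0] + v[1],
            3: lambda v: v[0] + 2.0 * v[1]}
parents = {1: [0], 2: [0, 1], 3: [1, 2]}

print(run(vertices, parents, x=3.0))                             # M(x) = 18.0
print(run(vertices, parents, x=3.0, patch_vertex=(1, 0.0)))      # M_{A_1}(x, 0) = 6.0
print(run(vertices, parents, x=3.0, patch_edge=(1, 3, 0, 0.0)))  # M_e(x, 0), e=(A_1, A_3, 0): 14.0
```

As the toy outputs show, patching the vertex A_1 affects every downstream use of its value, while patching only the edge (A_1, A_3, 0) leaves A_2's view of A_1 intact.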
C.3 Transformer architecture

The transformer architecture (Vaswani et al., 2017) may be familiar to most readers. However, since our experiments involve interventions during model inference with varying levels of granularity, we include a summary of the transformer computation, which we later reference to specify exactly how we edit the computation.

Transformers M take in token sequences x_{1:s} of length s, which are prepended with a constant padding token x_0. Let x = (x_j)_{j=0}^{s}. The model simultaneously computes, for each token position j, a predicted probability distribution P̂(x_{j+1} | x_{0:j}) for the (j+1)th token given the first j tokens. We use M(x) to refer to the predicted probability distribution over the (s+1)th token. We sometimes abuse notation and write P(M(x) = y) to indicate P̂(y | x), i.e. the probability placed on prediction y by the distribution M(x). Let X̄ be a random input sequence and S a token position sampled randomly from {1, ..., s}. M is trained to minimize E_{X̄,S}[−log P(M(X̄_{0:S−1}) = X̄_S)]. However, we generally refer to input-label pairs (X, Y) = (X̄_{0:S−1}, X̄_S), so that the loss function is instead written

E_{X,Y} L(M(X), Y) = E_{X,Y}[−log P(M(X) = Y)].

To evaluate M, each token x_j is mapped to an embedding Resid^(0)_j(x) = t(x_j) + p(j) of dimension d_model, where t(x_j) is a token embedding of token x_j and p(j) is a position embedding representing position j in the sequence. Over the course of inference, M keeps track of a residual stream representation Resid^(i)_j at each token position j, a vector of dimension d_model, which it updates by iterating over its n_layers layers, adding each layer's contribution to the previous representation:

MResid^(i)_j(x) = Resid^(i−1)_j(x) + Σ_{k=1}^{n_heads} Attn^(i,k)_j(x),    (6)
Resid^(i)_j(x) = MResid^(i)_j(x) + MLP^(i)_j(x).    (7)

Attention heads Attn^(i,k) transfer information between token positions. Let LN (layer-norm) be the function that takes a matrix Z of size m × n and outputs a matrix of the same size such that each row of LN(Z) is equal to the corresponding row of Z normalized by its L2 norm: (LN(Z))_j = Z_j / ‖Z_j‖_2. Let R = LN(Resid^(i−1)(x)). Attention heads are computed as follows:

AttnScore^(i,k)(x) = softmax( Λ ⊙ ( Q^(i,k) (K^(i,k))^T ) ),
Attn^(i,k)(x) = AttnScore^(i,k)(x) · V^(i,k) · W^(i,k)_O + b^(i,k)_O,    (8)

where Q^(i,k) = R W^(i,k)_Q + b^(i,k)_Q, K^(i,k) = R W^(i,k)_K + b^(i,k)_K, and V^(i,k) = R W^(i,k)_V + b^(i,k)_V. The W^(i,k)'s and b^(i,k)'s are weights. W^(i,k)_Q, W^(i,k)_K, and W^(i,k)_V have size d_model × d_head, and W^(i,k)_O has size d_head × d_model; b^(i,k)_Q, b^(i,k)_K, and b^(i,k)_V have dimension d_head, while b^(i,k)_O has dimension d_model. Biases are added to each row. Λ is a lower triangular matrix of 1s, ⊙ represents the elementwise product, and the softmax is performed row-wise. Multiplying by Λ ensures that Attn^(i,k)_j(x) only depends on Resid^(i−1)_{0:j}, and thus information can only be propagated forward, so the prediction of token j+1 can only depend on tokens 0 through j.

MLP layers are computed token-wise; the same map is applied to the residual stream at each token position j. Let R = LN(MResid^(i)(x)), and let R_j be the jth row of R. MLPs are computed as follows:

MLP^(i)_j(x) = ReLU(R_j W^(i)_in + b^(i)_in) W^(i)_out + b^(i)_out.

The W^(i)'s and b^(i)'s are weights. W^(i)_in has shape d_model × d_mlp and W^(i)_out has shape d_mlp × d_model; b^(i)_in has dimension d_mlp and b^(i)_out has dimension d_model. Finally, the output probability distribution is determined by applying a final transformation to the residual stream representation after the last layer:

Out(x) := softmax( LN(Resid^(n_layers)(x)) W_unembed ).

W_unembed is a learnable weight matrix of size d_model × d_vocab and the softmax is performed row-wise. Out(x) is a matrix of size (s+1) × d_vocab, where d_vocab is the number of tokens in the vocabulary, and each row Out_j(x) indicates a discrete probability distribution over d_vocab values that predicts the (j+1)th token given the first j tokens. M(x) is the prediction for the (s+1)th-token continuation of x given the entire sequence x, i.e. M(x) = Out_s(x).
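To make this notation concrete, here is a small NumPy sketch of a single attention head in the spirit of Equation (8). The shapes and random weights are illustrative assumptions, the layer-norm follows the row-wise L2 convention defined above, and the causal constraint is enforced with an additive mask rather than the elementwise product written in the equation.

```python
# One attention head, roughly in the notation of Equation (8); an illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)
s_plus_1, d_model, d_head = 5, 8, 4   # sequence length (incl. padding token) and dims

def LN(Z):
    # Row-wise L2 normalization, as defined in Appendix C.3.
    return Z / np.linalg.norm(Z, axis=-1, keepdims=True)

def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=-1, keepdims=True)

resid = rng.normal(size=(s_plus_1, d_model))                 # Resid^(i-1)(x)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))
b_Q = b_K = b_V = np.zeros(d_head)
b_O = np.zeros(d_model)

R = LN(resid)
Q, K, V = R @ W_Q + b_Q, R @ W_K + b_K, R @ W_V + b_V
scores = Q @ K.T
# Causal constraint: position j may only attend to positions 0..j.
mask = np.tril(np.ones((s_plus_1, s_plus_1), dtype=bool))
scores = np.where(mask, scores, -np.inf)
attn_score = softmax(scores)                                 # AttnScore^(i,k)(x)
attn_out = attn_score @ V @ W_O + b_O                        # Attn^(i,k)(x), shape (s+1, d_model)
print(attn_out.shape)
```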
C.4 KL-divergence loss function

The performance metric P(M̃) = E_{X∼D} L(M̃(X), M(X)) selected in the paper frames performance in terms of proximity to the original model's predictions, and thus the corresponding ablation loss gap ∆(M, A) measures the importance of component A for the model to arrive at predictions that are close to its original predictions. A common alternative, P̃(M̃) = E_{(X,Y)∼D} L(M̃(X), Y), frames performance in terms of proximity to the true labels, so the corresponding ablation loss gap ∆̃(M, A) measures the importance of component A for the model to perform a subtask at a comparable level to the original model. As an example, consider a model M that computes an approximately-optimal solution M̃(X) and then adds noise in a way that changes its predictions but does not improve or worsen E_{X,Y} L(M(X), Y). Presenting M̃ alone is a satisfactory interpretation of the behavior of M under P̃ but not under P.

A major advantage of P is that it is much more sample-efficient to evaluate for language tasks, especially if the label distribution has high entropy. Let (X, Y) denote a random input-label pair. Recall that a language model is trained to minimize

E_{X,Y} L(M(X), Y) = E_{X,Y}[−log P(M(X) = Y)] = c + E_X D_KL(ρ(X) || M(X)),

where c is a constant and ρ(X) represents the true probability distribution of Y | X. For each X, we are unable to observe ρ(X); in fact, we are usually only able to obtain a single sample from Y ∼ ρ(X). On the other hand, M(X) may be a sufficient estimate for ρ(X), and it provides many more bits of information about ρ than the single sample Y ∼ ρ. Even if our desired performance metric were P̃, rather than estimating E_X[E_Y[L(M̃(X), Y) | X]] from individual samples (X, Y), it may often be more sample-efficient to approximate E_Y[L(M̃(X), Y) | X] analytically for a particular X by assuming that the full model well-approximates the true distribution ρ, i.e. by assuming that M(X) ≈ ρ(X) (in the sense that D_KL(ρ(X) || M(X)) ≈ 0), which implies

E_X D_KL(ρ(X) || M̃(X)) ≈ E_X D_KL(M(X) || M̃(X)),    (9)

and so we still evaluate M̃ using P as the performance metric.

In practice, Equation (9) may be an unreasonable assumption, and the two criteria may yield very different interpretability results. We cannot estimate how close M(X) is to ρ(X) directly, because it is impossible to obtain an estimate of the ground-truth entropy of next-token prediction: for long sequences, we typically never observe the same sequence more than once. However, we can deduce that E_{X,Y} L(M(X), Y) exceeds the ground-truth entropy by at least 1 for a model like GPT-2, because larger models reduce cross-entropy by more than this amount compared to GPT-2. Note that a better approximation of E_{X,Y} L(M̃(X), Y) is to obtain a probability distribution from a larger language model M', and future work may wish to explore this direction. However, there are several other reasons to prefer using P over P̃ with labels from a larger model. The use of KL-divergence to the original model M is consistent with previous work. In the real-world scenario of performing interpretability on the largest frontier model, we will not have access to a better M'. Most importantly, one concern with circuit discovery for subtasks (X, Y) ∼ D is that it may be possible to adversarially select M̃ such that

E_{(X,Y)∼D} L(M̃(X), Y) ≤ E_{(X,Y)∼D} L(M(X), Y),    (10)

which can occur if M sacrifices some performance on (X, Y) ∼ D for better performance on other regions of the input distribution. Selecting only the components of M that maximize performance on D may ignore important mechanisms that must be included in its predictions on D as a result of this tradeoff. On the other hand, evaluating circuits with P allows mitigating mechanisms to be included in the selected circuit, since we must select M̃ in a way that imitates the behavior of M itself on the subtask D. Using this metric, a subnetwork can never achieve lower loss than M, since L(M̃(X), M(X)) ≥ L(M(X), M(X)).
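The contrast between the two metrics can be made concrete with a short sketch (illustrative synthetic data, not the paper's code): P is estimated directly from the full model's predictive distributions, while P̃ must be estimated from one sampled label per input.

```python
# Contrasting P (KL to the full model's predictions) with P-tilde (cross-entropy against
# a single sampled label per input), using synthetic stand-in distributions.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, vocab = 1000, 50

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins: full-model predictions M(X) and ablated-model predictions M_tilde(X).
p_full = softmax(rng.normal(size=(n_inputs, vocab)))
p_ablated = softmax(np.log(p_full + 1e-12) + 0.5 * rng.normal(size=(n_inputs, vocab)))

# P(M_tilde): E_X KL(M(X) || M_tilde(X)); no labels are needed.
kl = np.sum(p_full * (np.log(p_full + 1e-12) - np.log(p_ablated + 1e-12)), axis=-1)
P_metric = kl.mean()

# P-tilde(M_tilde): cross-entropy against one sampled label Y per input (here simulated
# by sampling Y from the full model's predictive distribution as a stand-in for rho(X)).
labels = np.array([rng.choice(vocab, p=p) for p in p_full])
P_tilde_metric = -np.log(p_ablated[np.arange(n_inputs), labels] + 1e-12).mean()

print(P_metric, P_tilde_metric)  # the label-based estimate is far noisier per sample
```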
D Commentary

D.1 Understanding the difference between deletion and treating x like x'

Colloquially, deletion means the model has lost the information it would use to distinguish between inputs x and x'. One might expect that if the model were able to rationally handle this lack of information, it would produce an output that hedges between the labels corresponding to inputs x and x'. On the other hand, subclass 2a of spoofing means the model was given information in component A that is compatible with x' and not x, leading the model to output something close to what it would have produced on input x'. To illustrate the difference between deletion and insertion, consider the following example. Assume a classifier M has two possible labels and two possible inputs, x and x', and the model entirely depends on component A to determine the correct label. Let M output a probability vector, and suppose L is KL-divergence. Let M(x) = (1, 0) and M(x') = (0, 1). If we remove the information given by A, we should expect the model to output (0.5, 0.5), giving L = −log 0.5, but if we instead intervene by inserting A(x') into an inference pass on x, or vice versa, then the model places probability 1 on the incorrect label, and the loss is infinite: the model assesses that the input is x' when the true input is x.

D.2 OA as an extension of mean ablation for nonlinear functions

Let A(X) be a vector-valued model component. As noted in Lundberg and Lee (2017), one motivation for mean ablation is that E[A(X)] is, under certain assumptions, a reasonable point estimate for A(X). For instance, if the relevant loss function is the squared distance between our point estimate a and the realized value of A(X), then E[A(X)] = argmin_a E_X ‖a − A(X)‖²_2. Indeed, the mean is also the best point estimate of A(X) if the relevant loss is the squared distance between M_A(X, a) and M(X) = M_A(X, A(X)) and the model M is linear in A(X):

E[A(X)] = argmin_a E_X ‖M_A(X, a) − M_A(X, A(X))‖²_2    (11)

if M_A(X, a) = M̃(X) a + b(X) for a random matrix M̃(X) independent of A(X) and a random bias b(X). Thus, for model components A(X) for which the downstream computation is roughly linear, E[A(X)] could potentially be a reasonable point estimate, hence justifying mean ablation. This presumption of linearity also shows up in other interpretability work, including Hernandez et al. (2024), which uses a linear map to approximate the decoding of subject-attribute relations, and Belrose et al. (2023b), which considers erasing a concept Z from a model's latent space in a minimal sense by transforming activations A(X) with a map g_Z that makes g_Z(A(X)) uncorrelated with Z and minimizes the expected squared distance between g_Z(A(X)) and A(X). However, in most settings, M_A(X, a) is highly nonlinear in a, and the mean E[A(X)] could be an arbitrarily poor point estimate for A(X). Optimal ablation generalizes the idea of selecting the best point estimate for A(X), as measured by replacing A(X) with a and evaluating model loss. In particular, optimal ablation constants a* generalize the property given in Equation (11) to arbitrary models M and loss functions L:

a* = argmin_a E_X L(M_A(X, a), M_A(X, A(X))).    (12)

D.3 Generalizing OA to constrained-form estimates of A(X)

Measuring ∆_opt(M, A) on a subtask (X, Y) ∼ D is, in a sense, a testing procedure for the hypothesis that A does not provide relevant information for model performance on subtask D. Verifying that ∆_opt ≈ 0 validates this hypothesis, since a point estimate of A(X) performs as well as the realized value of A(X) for the purpose of model inference. Optimal ablation can be generalized to test interpretability hypotheses beyond assertions that a computed quantity A(X) is unimportant. In particular, we can test hypotheses about the specific properties of A(X) that are important. Suppose A(X) is vector-valued, and consider the hypothesis "the only relevant information in A(X) is stored in subspace W." We can test this hypothesis by replacing A(X) with P_W A(X) + a*, where a* is an optimal constant that lies in the orthogonal complement W⊥ and P_W is the projection matrix onto subspace W. While this example is simple and illustrates some of the flexibility of OA, it does not add to the space of what OA can express, in the sense that we could have simply considered P_W A(X) and P_{W⊥} A(X) as separate vertices in the graph and used OA (or any other ablation method) on only the latter.
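The subspace test just described can be sketched in a few lines; the toy linear readout and parameterization below are illustrative assumptions, not the paper's experiments.

```python
# A sketch of the subspace test from Appendix D.3: keep P_W A(X) and replace the component
# orthogonal to W with an optimal constant a* constrained to lie in W-perp.
import torch

torch.manual_seed(0)
d, k, n = 32, 4, 2048                        # activation dim, dim(W), number of samples
A = torch.randn(n, d)                        # realized component values A(X)
readout = torch.nn.Linear(d, 10)             # stand-in for the downstream computation
for p in readout.parameters():
    p.requires_grad_(False)

W = torch.linalg.qr(torch.randn(d, k)).Q     # orthonormal basis for the hypothesized subspace W
P_W = W @ W.T                                # projection onto W
P_perp = torch.eye(d) - P_W

ref = torch.log_softmax(readout(A), dim=-1)  # predictions using the true A(X)

c = torch.zeros(d, requires_grad=True)       # unconstrained parameter, projected into W-perp
opt = torch.optim.Adam([c], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    a_star = P_perp @ c                      # constrain the constant to W-perp
    patched = A @ P_W.T + a_star             # P_W A(X) + a*
    logp = torch.log_softmax(readout(patched), dim=-1)
    loss = torch.nn.functional.kl_div(logp, ref, log_target=True, reduction="batchmean")
    loss.backward()
    opt.step()

# A small final loss supports the hypothesis that only the subspace W matters downstream.
print(float(loss))
```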
However, a real gain of expression from OA materializes from being able to generalize the idea of null point estimates to estimates with constrained form. For example, consider the subspace hypothesis "every A(X) can be adequately represented in subspace W." We can test this hypothesis by training an optimal activation a*(X) for each X that lies in subspace W. Though activation training is expensive, we can train a function a*(X) that maps X to values in W, and then estimate the error of this function by performing activation training on a few samples of X. Similarly, we can generalize a* to a set of multiple point estimates to test the claim that A(X) is the outcome of an internal classification problem, i.e. that the relevant information provided by A(X) is the classification of X among a few input classes. We can train optimal point estimates (a*_1, ..., a*_k) such that

(a*_1, ..., a*_k) = argmin_{(a_1,...,a_k)} E_X [ min_{j∈{1,...,k}} L(M_A(X, a_j), Y) ],

calling the outer minimized quantity ∆^k_opt. If ∆^k_opt ≈ 0, then every A(X) can be represented by one of a small number of prototype quantities.

E Single-component loss on IOI

E.1 Transformer graph representation

We use a graph representation in which each vertex corresponds to an attention head (Attn^(i,k)(x)), an MLP block (MLP^(i)(x)), the model input (Resid^(0)(x)), or the model output (Out(x)). We also allow vertices representing the Resid^(i)(x) and MResid^(i)(x) computations. Appendix C.3 defines the computation of each vertex. However, we slightly modify the definition of attention head vertices Attn^(i,k) to save memory and so that ablation constants a* for OA lie in the column space of attention head outputs. Recall from Equation (8) that attention heads produce output in a d_head-dimensional vector space, which is then mapped linearly to d_model-dimensional space by a weight matrix W^(i,k)_O:

Attn^(i,k)(x) = ( AttnScore^(i,k)(x) (R W^(i,k)_V + b^(i,k)_V) ) W^(i,k)_O + b^(i,k)_O.

Thus, while Attn^(i,k)(x) is d_model-dimensional, its distribution lies within a d_head-dimensional subspace of the residual stream. If we used vertices Attn^(i,k), then our d_model-dimensional a* for attention head (i, k) could sometimes contribute to subspaces that attention head (i, k) can never write to. Instead, for an attention head vertex, we represent its output computation in d_head-dimensional space:

ZAttn^(i,k)(x) = AttnScore^(i,k)(x) (R W^(i,k)_V + b^(i,k)_V),

and consider replacing ZAttn^(i,k)(x) rather than Attn^(i,k)(x) to ablate an attention head. This slight modification reduces our parameter count by a factor of d_model/d_head when applying OA but does not affect the results for the other ablation types. We measure the single-component ablation loss gap for the ZAttn^(i,k) and MLP^(i) vertices, (n_heads + 1) · n_layers = 156 vertices in total for GPT-2.

E.2 Ablation details

We consider zero ablation, mean ablation, optimal ablation, counterfactual mean ablation, resample ablation, and counterfactual ablation. For a token position j, let [A(X)]_j denote the representation of A(X) at token position j.

Zero ablation: To zero ablate A, we replace [A(X)]_j with 0 at each sequence position j ≥ 1. We do not replace [A(X)]_0 because it is a constant that does not depend on X (any result at token position 0 must only be a function of Resid^(0)_0, which represents a padding token that is the same for every sequence).
Transformers may read from this beginning-of-string (BOS) token position in attention heads if no token in the sequence carries a particularly strong signal; since this token position does not distinguish between any inputs X and is more appropriately viewed as a structural part of the architecture, we choose not to modify it.

Mean ablation: For each vertex A, we compute E_{(X,Y)∼D}[A(X)] over 20,000 samples, conditional on token position. We let µ_j = E_{(X,Y)∼D}[[A(X)]_j]. To mean ablate A, we replace [A(X)]_j with µ_j at each sequence position j. In the Greater-Than dataset, all prompts X are the same length, but in the IOI dataset, some prompts X are longer than others, reducing our sample size for later sequence positions. In particular, if X is the longest prompt in the dataset, with ℓ tokens, then µ_ℓ is computed from that single prompt, so the mean value actually carries identifying information about the prompt. Since we want the mean value µ_j to be uninformative about the original prompt X, we instead take m to be the minimum length of any prompt, compute a modified mean µ_m = E_{(X,Y)∼D, S∼Unif{1,...,ℓ}}[[A(X)]_S | S ≥ m] that pools all values of A(X) at token positions at or after position m, and replace [A(X)]_j with µ_m if j ≥ m.

Optimal ablation: Similar to mean ablation, we optimally ablate A by replacing [A(X)]_j with a constant â_j for each sequence position j < m and replacing [A(X)]_j with a constant â_m for each j ≥ m, where m is the minimum length of any prompt. We initialize (â_0, â_1, ..., â_m) = (µ_0, µ_1, ..., µ_m) as defined for mean ablation and then optimize (â_1, ..., â_m) for each ablated component A to minimize ∆(M, A). Note that, similarly to zero ablation, we fix â_0 = µ_0 = [A(X)]_0 and do not optimize its value as an ablation constant, because [A(X)]_0 does not depend on X and thus naturally conveys no information about the input.

Counterfactual mean ablation: Our implementation is the same as for mean ablation, except that we compute means over (X, Y) ∼ D' for the counterfactual distribution D', and m is taken as the minimum prompt length in the counterfactual distribution.

Resample ablation: To perform modified inference on an input X, we first sample an independent copy X' ∼ X. Let X and X' have lengths s and s' respectively. If s ≤ s', for an ablated component A, we replace [A(X)]_j with [A(X')]_j at each token position j ∈ {1, ..., s} (in other words, we only resample from the first s tokens of X'). If s > s', then we left-pad X' with an additional s − s' tokens to form a modified token sequence X̃' that is the same length as X. We then replace ablated component values A(X) with A(X̃') with respect to each sequence position. Before arriving upon this implementation, we tried other choices, like resampling from the last s tokens of X' in the case that s ≤ s', or right-padding X' in the case that s > s'.

Counterfactual ablation: We choose a function π (details discussed in the main text and further analyzed in Appendix F.3) that maps inputs X to neutral counterfactual inputs X'. Typically, X and X' are the same length and have many tokens in common. For ablated components, we replace [A(X)]_j with [A(π(X))]_j at each token position.

Figure 5: Correlation of single-component ablation loss measurements on IOI. Lower triangle shows rank correlation and upper triangle shows log-log correlation across metrics.

E.3 Full results

Figure 5 plots the pairwise correlations of single-component ablation loss evaluated on the IOI dataset with a variety of ablation methods.
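One plausible way to compute the correlation summaries reported in Figure 5 and Table 2 from per-component loss-gap measurements is sketched below; the data here are synthetic stand-ins, and the particular SciPy calls are an assumption about the implementation.

```python
# Rank correlation and log-log correlation between two vectors of per-component loss gaps
# (synthetic stand-in data; an illustrative sketch of the summaries in Figure 5 / Table 2).
import numpy as np
from scipy.stats import spearmanr, pearsonr

rng = np.random.default_rng(0)
n_components = 156                                                    # ZAttn and MLP vertices in GPT-2
delta_cf = np.exp(rng.normal(-4.0, 1.5, n_components))                # stand-in CF loss gaps
delta_other = delta_cf * np.exp(rng.normal(0.0, 0.7, n_components))   # stand-in for another method

rank_corr = spearmanr(delta_other, delta_cf)[0]
loglog_corr = pearsonr(np.log(delta_other), np.log(delta_cf))[0]
print(rank_corr, loglog_corr)
```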
Table 2 is an extended version of Table 1 in the main paper that provides a summary of these results.

Table 2: Comparison of ablation loss gap on IOI, extended

                                   Zero     Mean     Resample   CF-Mean   Optimal   CF
Log-log correlation with ∆_CF      0.626    0.831    0.826      0.847     0.908     1
Rank correlation with ∆_CF         0.590    0.825    0.828      0.833     0.907     1
Mean ∆                             0.0584   0.0405   0.0559     0.0412    0.0035    0.0296
Median ratio of ∆_opt to ∆         11.1%    33.0%    17.7%      31.7%     100%      88.9%

F Circuit discovery

F.1 Transformer graph representation

We use a residual rewrite graph representation favored by Wang et al. (2022), Goldowsky-Dill et al. (2023), and Conmy et al. (2023). Similarly to Appendix E.1, we define vertices that correspond to an attention head, an MLP block, the model input (Resid^(0)(x)), or the model output (Out(x)), but we eliminate the Resid^(i)(x) and MResid^(i)(x) vertices. We have n_layers(n_heads + 1) + 2 vertices in total (158 for GPT-2). Notice from Appendix C.3 that

Resid^(ℓ)(x) = Resid^(0)(x) + Σ_{i=1}^{ℓ} [ MLP^(i)(x) + Σ_{k=1}^{n_heads} Attn^(i,k)(x) ],

so rather than assuming that attention heads, MLP blocks, and the model output take Resid^(i)(x) as input, we can assume that they take the output of each previous block as a separate input to the computation. In particular, we can write

MLP^(i)(x) = MLP^(i)( Resid^(0)(x), MLP^(1)(x), ..., MLP^(i−1)(x), Attn^(1,1)(x), ..., Attn^(i,n_heads)(x) ),    (13)

in which the MLP^(i) vertex has i(n_heads + 1) + 1 incoming edges from previous vertices. Similarly, we can write

Out(x) = Out( Resid^(0)(x), MLP^(1)(x), ..., MLP^(n_layers)(x), Attn^(1,1)(x), ..., Attn^(n_layers,n_heads)(x) ),    (14)

so the Out vertex has n_layers(n_heads + 1) + 1 incoming edges, one from each previous vertex in the graph. Finally, notice that attention heads Attn^(i,k) take Resid^(i−1) as input in three different locations, once in each of the query, key, and value subcircuits, so we can write attention heads as taking three copies of each previous vertex's output, which can be ablated individually:

Attn^(i,k)(x) = Attn^(i,k)( Resid^(0),Q(x), MLP^(1),Q(x), ..., MLP^(i−1),Q(x), Attn^(1,1),Q(x), ..., Attn^(i−1,n_heads),Q(x),
                            Resid^(0),K(x), MLP^(1),K(x), ..., MLP^(i−1),K(x), Attn^(1,1),K(x), ..., Attn^(i−1,n_heads),K(x),
                            Resid^(0),V(x), MLP^(1),V(x), ..., MLP^(i−1),V(x), Attn^(1,1),V(x), ..., Attn^(i−1,n_heads),V(x) ).    (15)

This notation indicates that attention heads admit multiple incoming edges for each previous vertex, which is somewhat non-standard. Alternatively, rather than allowing multiple edges between pairs of vertices, Conmy et al. (2023) creates a separate vertex for each of the query, key, and value subcircuits and considers each attention head output to take the outputs of these three subcircuits as input. However, edges between the subcircuits and attention head outputs are essentially placeholder edges that cannot be independently removed, since removing them is informationally equivalent to ablating the entire attention head. Thus, our graph representation is more natural and provides a more realistic edge count when considering removing model components. Furthermore, we continue with the adjustment from Appendix E.1 of using ZAttn^(i,k) as computational vertices rather than Attn^(i,k) to conserve memory and reduce the parameter count of OA. We consider the linear map ϕ^(i,k)(Z) = Z W^(i,k)_O + b^(i,k)_O (so that Attn^(i,k)(x) = ϕ^(i,k)(ZAttn^(i,k)(x))) and express all downstream vertices as taking ZAttn^(i,k)(x) as input rather than Attn^(i,k)(x), performing their computation by pre-composing with ϕ^(i,k).
For example, for an MLP vertex MLP^(i), if m^(i) represents how MLP^(i)(x) is computed using Attn^(i,k)(x) values as inputs, then its computation taking ZAttn^(i,k)(x) as inputs is equal to

MLP^(i)(x) = m^(i)( Resid^(0)(x), MLP^(1)(x), ..., MLP^(i−1)(x), ϕ^(1,1)(ZAttn^(1,1)(x)), ..., ϕ^(i,n_heads)(ZAttn^(i,n_heads)(x)) ).

We replace all Attn^(i,k) vertices in the graph structure with the corresponding ZAttn^(i,k) vertex. In total, we have n_heads · n_layers ZAttn^(i,k) vertices, n_layers MLP^(i) vertices, an input vertex (Resid^(0)), and an output vertex (Out). Let V represent the set of vertices that includes the input and MLP^(i) vertices. There are 3 · ½ n_layers(n_layers − 1) n_heads² edges between two attention heads, 2(n_layers + 1) n_layers n_heads edges between an attention head and a vertex in V, ½ n_layers(n_layers + 1) edges between two vertices in V, and n_layers(n_heads + 1) + 1 edges from any vertex to the output. For GPT-2, there are 28,512 edges between two attention heads, 3,744 edges between an attention head and a vertex in V, 78 edges between two vertices in V, and 157 edges from any vertex to the output, for a total of 32,491 edges.
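The edge counts above follow directly from the graph structure; the short calculation below reproduces them for GPT-2's dimensions (n_layers = 12, n_heads = 12).

```python
# Reproducing the GPT-2 edge counts quoted above.
n_layers, n_heads = 12, 12

attn_to_attn = 3 * n_layers * (n_layers - 1) // 2 * n_heads**2   # 28,512
attn_to_V = 2 * (n_layers + 1) * n_layers * n_heads              # 3,744 (heads <-> input/MLP vertices)
V_to_V = n_layers * (n_layers + 1) // 2                          # 78
to_output = n_layers * (n_heads + 1) + 1                         # 157

print(attn_to_attn, attn_to_V, V_to_V, to_output,
      attn_to_attn + attn_to_V + V_to_V + to_output)             # total: 32,491
```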
F.2 Ablation details

For mean, resample, and counterfactual ablation, our implementation is the same as in Appendix E.2. For optimal ablation, we adjust the implementation to remove dependence on token positions and further reduce the parameter count. For each ablated component A, rather than training a different a*_j to replace [A(X)]_j at each token position j, we train a single optimal constant a* that is the same shape as any particular [A(X)]_j. We initialize a* to E[[A(X)]_j | j > 9], the subtask mean excluding early token positions, since early positional embeddings may have idiosyncratic effects. To ablate A during inference on X, for all token positions j > 0, we replace [A(X)]_j with a*. As in Appendix E.2, we do not replace [A(X)]_0 because this value is a constant that does not depend on X. We take this conservative approach to demonstrate that OA can be implemented in a position-agnostic manner and still outperform position-specific implementations of other ablation methods by a large margin. Since many other ablation methods, like resample and counterfactual ablation, are inherently position-specific, compatibility with a position-agnostic implementation is a key advantage of OA over these ablation methods, and we discuss this advantage further in Appendix F.3.

As noted in Section 3, we train a single constant â_j for each vertex A_j. An alternative implementation is training a separate constant for each ablated edge. In theory, this approach is consistent with the spirit of OA: though ablated edges would transmit different values to downstream components that would normally receive the same value, none of the transmitted constants convey any information about the input to downstream components. However, this approach would greatly increase the number of learnable parameters, and arguably may actually increase the computational capacity of the model. Additionally, training a single constant for each vertex has the appealing property that ablating all of the out-edges from a vertex is equivalent to ablating that vertex.

F.3 Normative comparison of ablation types

Thus far, circuit discovery on language models has focused on synthetic subtasks for which a mapping π from studied inputs x to counterfactual inputs π(x) is easily constructed. Recall that a crucial criterion for selecting π is to preserve as many tokens as possible between x and π(x). For Greater-Than, an example counterfactual pair is

    x:     The [conflict] began in [18][89] and ended in [18]
    π(x):  The [conflict] began in [18][01] and ended in [18]

where the brackets [] are added to emphasize the two-token representation of the year, and the labeled positions are S ([conflict]), YY1 and YY2 (the two year tokens), and YY1* (the final [18]). Similarly, for IOI, an example counterfactual pair is

    x:     Friends [Alice] and [Bob] found a bone at the store. [Alice] gave the bone to
    π(x):  Friends [Charlie] and [David] found a bone at the store. [Charlie] gave the bone to

with labeled positions S1 ([Alice]/[Charlie]), IO ([Bob]/[David]), and S2 (the second [Alice]/[Charlie]). Since x and π(x) share the same token at all but a few token positions (only the S1, S2, and IO token positions for IOI and only the YY2 token position for Greater-Than), we can isolate the effect of changing the specific token that conveys important information. CF thus allows us to study subtasks involving input-label pairs where subtask-relevant information is given by only one or several tokens. However, replacing A(x) with A(x') typically incurs much higher loss if x and x' differ at many token positions, even if most tokens are unimportant in relation to the behavior we wish to study. The model representation at a token position j is likely to contain information specific to the tokens at position j and at surrounding positions in the input X, so replacing [A(x)]_j with [A(x')]_j is likely to inject inconsistent information if X_j and X'_j (or pairs of tokens at corresponding proximate token positions) are different tokens, causing ∆ to be high as a result of spoofing (per Section 2.3).

An illustration is the discrepancy in resample ablation loss between the Greater-Than and IOI subtasks. The Greater-Than dataset only contains a single prompt template, so any sampled X and X' only differ in tokens that encode the subject and year. Here is an example of a sampled (X, X') pair:

    X:   The [conflict] began in [18][89] and ended in [18]
    X':  The [deal] began in [15][47] and ended in [15]

On the other hand, the IOI dataset consists of multiple prompt templates that differ in sentence structure, so X and X' may differ at nearly all token positions, not just the S1, IO, and S2 positions as shown above. For example:

    X:   Friends Alice and Bob found a bone at the store. Alice gave the bone to
    X':  <> <> <> Then, Charlie and David had a long argument, and Charlie said to

where <> represents a padding token added to make the sequences the same length. As a result, resample ablation loss is relatively low for Greater-Than (see Figure 10) but relatively high for IOI (see Figure 8), indicating that token parallelism is an important requirement for CF to work well. While the synthetic IOI and Greater-Than datasets are specifically engineered so that we can modify a prompt x at only a few token positions to obtain a neutral prompt π(x), more general language behaviors may not be suited for this type of counterfactual analysis. Here are a few examples of language subtasks for which it may not be possible to pair up x and π(x) with parallel tokens:

A case study of the effect of modifiers, e.g. adjectives and adverbs, compared to a sentence with no modifier.
Consider the following (degenerate) counterfactual pair, inspired by Marks and Tegmark (2023):

    x:   Paris is a city in the country of
    x':  Paris is not a city in the country of

Since the presence of a modifier creates an extra token, replacing A(x) with A(x'), i.e. patching between sequences with and without the modifier, would result in the embedding at most token positions in x being replaced with an embedding reflecting the previous token in the sequence ("city" with "a", "in" with "city", and so on).

A case study comparing sentence order in situations where order matters, like giving directions. Patching in activations from a counterfactual prompt in which the order of two sentences is permuted involves introducing a new token at many token positions.

    x:   Make a left turn, then walk forward one block. Your position is now
    x':  Walk forward one block, then make a left turn. Your position is now

A case study relating to how language models handle mis-tokenization, like processing prompts in which a word is misspelled or the model is required to spell out a word.

    x:   The correct spelling of the word umpire is
    x':  The correct [sp] [le] [ling] of [te] [h] word [up][mire] is

Additionally, as the field of interpretability moves forward, we believe that it must progress toward total interpretation of models' internal mechanisms. This level of interpretation requires reasoning about subtasks that are much more general than those that have been studied so far, and it will require performing intervention-based analysis across a broad distribution of inputs. For example, we may want to claim that certain components of a model M are unimportant for performing mathematical calculations, that some components are not involved in ensuring grammatical correctness, or that some do not assist in making theory-of-mind assessments. Additionally, we likely wish to assess component functions "in the wild" with filtered sampling from the model's training distribution, as opposed to engineering synthetic datasets. These circumstances mean that the data will be much less suited for token parallelism between counterfactual prompts, so the adoption of a sequence-position-agnostic ablation method is likely critical. This quality of OA makes it a much better candidate than CF as a suitable ablation method for scaling interpretability.

F.4 Sparsity metric

As stated in Equation (3), we wish to select a circuit Ẽ that achieves low loss E_X L(M_Ẽ^(opt)(X), M(X)) and is a sparse subset of the model. Let E_A represent the set of edges connected to vertex A in graph G, i.e. E_A = {(A_j, A_i, z) ∈ E | A_j = A or A_i = A}. The selected circuit Ẽ should ideally satisfy two types of sparsity:

1. Edge sparsity: |Ẽ| ≪ |E|. The circuit should contain a small number of edges compared to the total number of edges in the model.
2. Vertex sparsity: |{A : |E_A ∩ Ẽ| > 0}| ≪ |G|. The circuit should pass through a small number of vertices compared to the total number of vertices in the model.

There is a lack of guidance in prior work about whether smaller structures with more densely packed connections are more interpretable than larger structures with more thinly distributed connections. Indeed, one could argue that the larger structure is in fact easier to understand, since we do not need to dissect as many relationships to consider the function of any particular vertex within the circuit.
While circuit discovery aims to localize model behaviors on specific subtasks, we contend that a central challenge in interpretability going forward could be stacking together many circuit analyses to form a sum-of-the-parts analysis of the model's overall structure. Considering circuit discovery as a tool for decomposing model computation into interpretable subtasks, and holding the total number of edges equal, we may prefer each circuit to have a smaller number of vertices, to reduce the complexity of interactions between circuits rather than within circuits. As such, we set R(Ẽ) to select for circuits with high levels of both edge and vertex sparsity:

R(Ẽ) = λ|Ẽ| + γλ Σ_{A∈G} (|E_A|/2) tanh( 2|E_A ∩ Ẽ| / |E_A| ),

where λ, γ are constants. Similarly, the continuous relaxation R(θ̄) for HCGS and UGS is derived by replacing |Ẽ| with Σ_{k=1}^{|E|} θ_k and replacing |E_A ∩ Ẽ| with ε_A := Σ_{e_k ∈ E_A} θ_k. Note that

∂R(θ̄)/∂θ_k = λ + γλ Σ_{A : e_k ∈ E_A} sech²( 2ε_A / |E_A| ).

The first term, λ, is generally used to control the tradeoff between edge sparsity and circuit loss; a general interpretation is that we should include an edge e_k ∈ E if its marginal contribution, ∆(M, E \ (Ẽ \ {e_k})) − ∆(M, E \ (Ẽ ∪ {e_k})), is greater than λ, expressing the same tradeoff as the discrete threshold λ in ACDC. However, since ACDC is a less fine-grained optimization algorithm than UGS, the λ required to achieve the same circuit size |Ẽ| tends to be larger for ACDC. The second term expresses vertex sparsity, and its effect is to increase the regularization on edges that are attached to vertices that have few other edges included in the circuit. Its effect is small when ε_A ≈ |E_A|, since sech²(2) ≈ 0, so we do not apply additional regularization to edges attached to vertices that have a high overall likelihood of being included in the selected circuit. However, its effect is significant when ε_A/|E_A| ≈ 0, pruning the remaining edges from a vertex whose edge probabilities as represented by θ̄ are low on average. We use γ to express the maximum influence of vertex regularization as compared to the effect of edge regularization (since max_x sech²(x) = 1), and generally select γ = 0.5, so the second term adds at most 50% more regularization.
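A short sketch of the relaxed regularizer as written above follows; the dictionary-based bookkeeping and function names are illustrative assumptions, not the paper's implementation.

```python
# A sketch of the relaxed sparsity regularizer R(theta): an edge term plus a tanh-based
# vertex term, as described above.
import numpy as np

def relaxed_regularizer(theta, edge_vertices, lam=1e-3, gamma=0.5):
    """theta: array of edge inclusion probabilities.
    edge_vertices: list of (parent, child) vertex ids for each edge."""
    edge_term = lam * theta.sum()
    # epsilon_A: summed inclusion probabilities of the edges touching each vertex A.
    eps, size = {}, {}
    for k, (u, v) in enumerate(edge_vertices):
        for a in (u, v):
            eps[a] = eps.get(a, 0.0) + theta[k]
            size[a] = size.get(a, 0) + 1
    vertex_term = gamma * lam * sum(
        0.5 * size[a] * np.tanh(2.0 * eps[a] / size[a]) for a in eps)
    return edge_term + vertex_term

# Tiny example: 4 edges over 3 vertices.
theta = np.array([0.9, 0.8, 0.05, 0.02])
edges = [(0, 1), (0, 2), (1, 2), (1, 2)]
print(relaxed_regularizer(theta, edges))
```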
F.5 Uniform Gradient Sampling: motivation

In circuit discovery, the number of possible circuits Ẽ ⊆ E is exponential in |E|, and the circuit losses ∆(M, E \ Ẽ) for different subsets Ẽ are not required to be related. ∆ is not even necessarily monotonic in Ẽ for any ablation method considered, i.e. Ẽ ⊆ Ẽ' does not imply that ∆(M, E \ Ẽ) ≥ ∆(M, E \ Ẽ'). In reality, we can hope that the optimal ground-truth circuit Ẽ* is clear-cut and relatively well-behaved. If so, we can try to relax the discrete optimization problem and find a solution with gradient descent. As mentioned in Section 3, one continuous relaxation considers partial ablation for each edge e_k = (A_j, A_i, z), where we replace A_j(x) as the zth input to A_i(x) with α_k A_j(x) + (1 − α_k) â_j and use L1 or L2 regularization on the α_k. However, this approach is likely to get stuck in local minima in which edge coefficients converge to an optimal magnitude instead of edges being completely ablated or retained. Instead, we consider a vector θ̄ of independent sampling probabilities for the inclusion of each edge, and optimize these probabilities so that each converges to 0 or 1. Our loss function is

f(θ̄) := E_{X, Ẽ∼θ̄}[ L(M_Ẽ(X), M(X)) + R(Ẽ) ].

Denote L_R(X, Ẽ) = L(M_Ẽ(X), M(X)) + R(Ẽ), so that f(θ̄) = E_{X, Ẽ∼θ̄}[L_R(X, Ẽ)]. The gradient with respect to the sampling probability θ_k for edge e_k is a marginal ablation loss gap:

∂f(θ̄)/∂θ_k = E_{X, Ẽ∼θ̄}[ L_R(X, Ẽ ∪ {e_k}) − L_R(X, Ẽ \ {e_k}) ] =: E_{X, Ẽ∼θ̄}[ ∆_R(M, e_k, Ẽ) ].    (17)

The problem is that |E| is large and it is not tractable to estimate this quantity individually for all k. Our goal is to find a good sample estimator for this quantity simultaneously for all k. One way to perform this simultaneous estimation is importance sampling, where we write

f(θ̄) = E_{X, Ẽ∼p̄}[ ( P_{Ẽ'∼θ̄}(Ẽ' = Ẽ) / P_{Ẽ'∼p̄}(Ẽ' = Ẽ) ) L_R(X, Ẽ) ],

so when p̄ = θ̄,

∂f(θ̄)/∂θ_k = E_{X, Ẽ∼θ̄}[ (1(e_k ∈ Ẽ)/θ_k) L_R(X, Ẽ) − (1(e_k ∉ Ẽ)/(1 − θ_k)) L_R(X, Ẽ) ].

However, this method leads to poor estimates. Most of the variance in gradient updates to θ_k comes from sampling different subsets of edges among the |E| − 1 edges other than e_k, not from the effect of fixing e_k ∈ Ẽ or e_k ∉ Ẽ for a particular edge.

Instead, UGS approximates the marginal ablation loss gap for each edge by taking gradients with respect to sampled partial ablation coefficients ᾱ. We consider the extension of M_Ẽ to convex relaxations M_ᾱ, where α_k represents the partial ablation coefficient for edge e_k as alluded to above. Similarly, we consider L_R(X, ᾱ) in place of L_R(X, Ẽ). Let ᾱ(Ū, Ẽ, S), where S ⊆ {1, ..., |E|}, be defined by α_k(Ū, Ẽ, S) = 1(k ∈ S) U_k + 1(k ∉ S) 1(e_k ∈ Ẽ). For any edge e_k, the marginal ablation loss gap inside the expectation in Equation (17) is equal to the expected gradient with respect to α_k, with U_k ∼ Unif(0, 1), when the other edges are sampled according to Ẽ:

∂f(θ̄)/∂θ_k = E_{X, Ẽ∼θ̄, Ū∼Unif(0,1) iid}[ ∂/∂U_k L_R(X, ᾱ(Ū, Ẽ, {k})) ].    (18)

In other words, we can estimate the effect of totally ablating e_k for a given (X, Ẽ) by sampling a partial ablation coefficient α_k ∼ Unif(0, 1) and taking the loss gradient with respect to α_k. However, we run into the same problem of needing to estimate the effect individually for each edge k. UGS assumes that we can estimate this loss gradient for many edges simultaneously without much bias. For any particular edge e_k, the interference effects caused by sampling α_{k'} ∼ Unif(0, 1) for all k' ∈ S, instead of setting them according to Ẽ, could be small if S is small enough. UGS assumes that, if S is sampled from a distribution D_S,

∂f(θ̄)/∂θ_k ≈ E_{X, Ẽ∼θ̄, Ū∼Unif(0,1) iid, S∼D_S}[ ∂/∂U_k L_R(X, ᾱ(Ū, Ẽ, S)) | k ∈ S ].    (19)

F.6 Uniform Gradient Sampling: construction

We use θ_k to represent the sampling probabilities for each edge, and perform gradient descent on the parameters by constructing a loss function whose gradient is a sample estimator of Equation (19). As noted in the main text, rather than using θ_k ∈ (0, 1) as our parameters, we use θ̃_k ∈ (−∞, ∞) as our parameters and compute θ_k = σ(θ̃_k). We initialize θ̃_k = 1 for all edges e_k. We avoid random initialization because it achieves worse results by causing the resulting circuits to be suboptimally constrained to be close to our random prior. We sample S ⊆ {1, ..., |E|} by independently sampling each 1(k ∈ S) ∼ Bern(w(θ_k)), where a window function w determines how often we sample α_k ∼ Unif(0, 1). Additionally, we require that P(e_k ∈ Ẽ | k ∈ S) = P(e_k ∉ Ẽ | k ∈ S) = 1/2, so that sampling α_k ∼ Unif(0, 1) takes away probability mass equally from e_k ∈ Ẽ and e_k ∉ Ẽ; the induced marginal distribution of α_k is then

α_k ∼ Unif(0, 1) w.p. w(θ_k),  α_k = 1 w.p. θ_k − ½w(θ_k),  α_k = 0 w.p. 1 − θ_k − ½w(θ_k).    (20)

We perform this adjustment because, for the purpose of estimating gradients with respect to edges other than e_k, Equation (19) implicitly assumes that

E[∆ | k ∈ S] ≈ p E[∆ | α_k = 0] + (1 − p) E[∆ | α_k = 1],  with p = P(e_k ∉ Ẽ | k ∈ S),    (21)

and without any priors about the functional form of ∆, p = 1/2 is a reasonable choice.
We construct a loss function whose gradient is given by Equation (4), a sample estimator of Equation (19) with this construction of S. We construct the batch X(1), ..., X(b) by choosing b/n_s unique samples of the input X and repeating each input n_s times in the batch. We generally use n_s = 12 and b = 60. We choose w(θ_k) = c·θ_k(1 − θ_k). Note that c ≤ 2 is needed for (1/2)·w(θ_k) ≤ min(θ_k, 1 − θ_k) to hold, as required by Equation (20), and we choose c = 1. We discuss this choice in Appendix F.8. The full algorithm pseudocode is given after describing a few additional details in Appendix F.7.

F.7 Additional circuit discovery details

For HCGS and UGS, we use learning rates between 0.01 and 0.15 for the sampling parameters.

Pruning dangling edges. For HCGS and UGS, after sampling α(U, Ê, S) for each input, we remove dangling edges e_k. A dangling vertex is a vertex A for which there does not exist a path from the model input to A along edges e_j with α_j > 0, or for which there does not exist such a path from A to the model output. A dangling edge is an edge that is connected to a dangling vertex. For a dangling edge e_k, we set α_k = 0 (the equivalent of removing e_k from Ê and S).

Discretization. For HCGS and UGS, after training the θ̃_k, we select a final circuit consisting of all edges for which θ̃_k > τ for a threshold τ. Generally, for UGS, all but a handful of θ̃_k converge to highly negative or highly positive values, so the choice of τ does not have much impact, and we choose τ = 0.5. However, for HCGS, many edges have θ̃_k parameters around zero even after training for 10,000 batches. We again select τ = 0.5, since we observe that including edges with θ̃_k ∈ (−1, 0.5) does not generally affect performance.

Optimizing constants for OA. For HCGS and UGS, we train ablation constants â concurrently with training the sampling parameters. We use a learning rate of 0.002 for â, lower than the learning rate used for the sampling parameters. Note that we only provide gradient updates to â_j for a vertex A_j along edges e_k for which α_k = 0. Updating â_j when α_k > 0 can lead â to update toward a value that is optimal when taking a linear combination of â and A_j(X), rather than a value that is optimal as a constant. See Figure 6.

Figure 6: Gradient updates on â can be biased when α_k > 0. (Diagram: the partially-ablated value U_k·â_j + (1 − U_k)·A_j(X) versus A_j(X) and the learned â_j, showing the incorrect and correct update directions for minimizing L(M_α(X), M(X)).)

Even though we obtain approximate constants â through the training process for HCGS and UGS, in order to level the playing field when comparing to ACDC and EAP, we do not use the constants found during training when evaluating circuits. Instead, for each circuit discovery algorithm, we evaluate circuits with OA by initializing constants to subtask means and then training for the same number of batches (10,000) with the same settings and a learning rate of 0.002. Algorithm 1 shows the full algorithm of UGS, including our exact loss function, for optimization with OA. For optimization with other ablation methods, we set ablated values â according to the ablation method and do not perform gradient updates on â. Note that we use the notation M_α(X, â) with OA to indicate running the circuit evaluation with ablation coefficients α and replacing ablated components with â, in line with our notation in Appendix C.2.

F.8 Choosing a window size for UGS

We motivate the choice of our window function w(θ_k) = θ_k(1 − θ_k). Let f′_k := ∂f(θ)/∂θ_k, and let f̂_k be our sample estimate from Approximation (19). Let K ∼ Unif({1, ..., |E|}).
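As a side note on the constraint c ≤ 2 stated in Appendix F.6 above, the bound follows from a short calculation; the derivation below is our own verification using only w(θ) = c·θ(1 − θ).

    % Verification that c <= 2 suffices for (1/2) w(theta) <= min(theta, 1 - theta):
    \tfrac{1}{2} w(\theta) = \tfrac{c}{2}\,\theta(1-\theta)
      = \tfrac{c}{2}\,\min(\theta,\,1-\theta)\cdot\max(\theta,\,1-\theta)
      \le \tfrac{c}{2}\,\min(\theta,\,1-\theta)
      \le \min(\theta,\,1-\theta) \quad \text{whenever } c \le 2.

This guarantees that the probabilities θ_k − (1/2)w(θ_k) and 1 − θ_k − (1/2)w(θ_k) appearing in Equation (20) are nonnegative.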
We may want to minimize the squared distance between our sample estimates and the true gradient values,

$\varepsilon := \mathbb{E}\big[(\hat{f}_K - f'_K)^2\big] = \mathbb{E}_K\big[\mathrm{Var}(\hat{f}_K \mid K)\big] + \mathbb{E}_K\big[(\mathbb{E}[\hat{f}_K \mid K] - f'_K)^2\big].$ (22)

Let our sampling distribution S ∼ D_S be defined by independent Bernoulli random variables 1(e_k ∈ S) ∼ Bern(w_k). Assume that we collect b samples of (U, Ê, S) in a batch and use the samples i for which k ∈ S_i to estimate f̂_k. Let $T_k := \sum_{i=1}^{b} \mathbb{1}(k \in S_i)$, and let $\hat{f}_k := \tfrac{1}{T_k}\sum_{i=1}^{b} \hat{f}^{(i)}_k \mathbb{1}(k \in S_i)$. Let f̂_k^{(1)} represent the first sample for which k ∈ S_i. Assuming that α_k ∼ Unif(0, 1) w.p. w_k, α_k = 1 w.p. θ_k − (1/2)w_k, and α_k = 0 w.p. 1 − θ_k − (1/2)w_k, as given by Equation (20), we have the loss derivative

$|E|\, \frac{\partial \varepsilon}{\partial w_k} = \frac{\partial \mathrm{Var}(\hat{f}_k)}{\partial w_k} + \sum_{\ell \ne k}\left[ \frac{\partial \mathrm{Var}(\hat{f}_\ell)}{\partial w_k} + 2\big(\mathbb{E}[\hat{f}_\ell] - f'_\ell\big)\, \frac{\partial \mathbb{E}[\hat{f}_\ell]}{\partial w_k} \right].$

If we assume that the second term is roughly equal to a constant c for all edges, then

$|E|\, \frac{\partial \varepsilon}{\partial w_k} \approx c + \frac{\partial}{\partial w_k}\Big( \mathrm{Var}\big(\hat{f}^{(1)}_k\big)\, \mathbb{E}\big[\tfrac{1}{T_k} \,\big|\, T_k > 0\big] \Big),$ (23)

since T_k = 0 for an edge implies that it simply does not get a gradient update. Note that ∂θ_k/∂θ̃_k = θ_k(1 − θ_k), and ∂E[1/T_k]/∂w_k ≈ −b·w_k^{−2} for a constant b > 0. Solving ∂ε/∂w_k = 0, ε is minimized when

$w_k \propto \theta_k(1 - \theta_k)\sqrt{\mathrm{Var}\big(\hat{f}^{(1)}_k\big)},$

which motivates the definition of the window function w(θ_k) = θ_k(1 − θ_k) in our main experiments. We try including the additional factor of $\sqrt{\mathrm{Var}(\hat{f}^{(1)}_k)}$, but our results do not improve.

F.9 Comparison of UGS and HCGS

HCGS and UGS both involve sampling gradients from a distribution over α ∈ (0, 1)^{|E|} and taking gradient steps on parameters θ̃_k, which represent our confidence that e_k ∈ Ê, using an average of gradients with respect to the edge coefficients α_k. The original explanation provided by Louizos et al. (2018) for the convergence of HCGS to a satisfactory subset of weights that minimizes a loss function similar to Equation (3) involves L0 regularization. To the contrary, we believe that a more compelling explanation for the performance of HCGS is that its behavior of sampling gradients on a region of partial ablations, α ∈ (0, 1)^{|E|}, serves as a rough approximation of Equation (18). Sampling α_k ∼ Unif(0, 1) to obtain gradient information, rather than a scaled conditional Concrete distribution, makes the simultaneous gradient-sampling estimator unbiased in the single-dimensional case, and thus is the choice that makes Equation (19) most resemble Equation (18).

Algorithm 1: Uniform gradient sampling
Input: set of edges E, initial parameter array θ̃, initial constant array â
Output: a set of edges Ê ⊆ E that represents the circuit
Require: metric L, learning rates δ_θ, δ_a, final threshold τ, batch size b, sample count per input n_s, window function w

    X ← [ ];  α ← [ ];  UnifCount ← [0 for k ∈ [length(θ̃)]]
    for j ∈ [b/n_s] do
        α[j] ← [ ]
        X[j] ← sample_input()
        for i ∈ [n_s] do
            α[j][i] ← [ ]
            for k ∈ [length(θ̃)] do
                θ[k] ← σ(θ̃[k])
                U ∼ Unif(0, 1)
                W ← w(θ[k]).detach_gradient()
                p ← W·θ[k] + (1 − W)·θ[k].detach_gradient()
                α[j][i][k] ← ((p − U)/W + 0.5).clamp(0, 1)
                UnifCount[k] ← UnifCount[k] + 1(α[j][i][k] ∈ (0, 1))
            α[j][i] ← prune_dangling_edges(α[j][i])
    L ← [ ]
    for j ∈ [b/n_s] do
        for i ∈ [n_s] do
            for k ∈ [length(θ̃)] do
                g ← b/UnifCount[k]
                α[j][i][k] ← g·α[j][i][k] + (1 − g)·α[j][i][k].detach_gradient()
            b̂ ← (α[j][i] == 0)·â + (α[j][i] > 0)·â.detach_gradient()
                (applied componentwise; this step is only used with OA)
            L.append(L(M_{α[j][i]}(X[j], b̂), M(X[j])))
    f ← (Σ L)/b
    θ̃ ← θ̃ − δ_θ ∇_θ̃ f
    â ← â − δ_a ∇_â f   (this step is only used with OA)
    (Note that in practice, we use Adam to determine step sizes.)
    Ê ← ∅
    for k ∈ [length(θ̃)] do
        if θ̃[k] > τ then Ê.add(e_k)
    return Ê

F.10 Additional IOI circuit discovery results

This section displays additional circuit discovery results on the IOI subtask.
In Figure 7, we show the tradeoff between Δ (y-axis) and |Ê| (x-axis) for optimal ablation to compare different circuit discovery methods. In Figure 8 (left), we show this tradeoff for mean ablation, and in Figure 8 (right), we show this tradeoff for resample ablation.

Figure 7: Circuit discovery Pareto frontier for the IOI subtask with optimal ablation.

Figure 8: Circuit discovery Pareto frontier for IOI with mean ablation (left) and resample ablation (right).

F.11 Circuit discovery results for Greater-Than

This section displays circuit discovery results on the Greater-Than subtask. In Figure 9 (left), we show the tradeoff between Δ (y-axis) and |Ê| (x-axis) for optimal ablation. In Figure 9 (right), we show this tradeoff for counterfactual ablation. In Figure 10 (left), we show this tradeoff for mean ablation. In Figure 10 (right), we show this tradeoff for resample ablation. Finally, in Figure 11, we show the Δ achieved by circuits optimized using UGS with different ablation methods, analogous to Figure 1 (right) in the main text for the IOI subtask.

Figure 9: Circuit discovery Pareto frontier for the Greater-Than subtask with optimal ablation (left) and counterfactual ablation (right).

Figure 10: Circuit discovery Pareto frontier for Greater-Than with mean ablation (left) and resample ablation (right).

Figure 11: Comparison of different ablation methods for circuit discovery for Greater-Than.

F.12 Random circuits

One question is whether it may be possible to extract circuits with OA that do not necessarily explain model behavior on the training distribution, by setting vertices to out-of-distribution values which maximally elicit a certain behavior. If the ablation constants â overparameterize the data, then performing OA could behave similarly to fine-tuning the model to perform the desired task. Intuitively, however, OA strictly decreases the amount of computation available to the model, since we only add constants to model components and do not allow additional transformations of internal representations that are not already present in the downstream computation. To verify our stance, we compare the loss recovered by circuits discovered by UGS to that of random circuits, checking that OA indeed distinguishes subtask-performing mechanisms and does not provide enough degrees of freedom to elicit subtask behavior from unrelated model components.

In Table 3, we compare the Δ(M, Ê) achieved by random circuits Ê to those achieved by circuits optimized with UGS for various ablation types. We construct Ê by sampling each 1(e_k ∈ Ê) independently with some probability p, and prune dangling edges as detailed in Appendix F.7 (a sketch of this sampling procedure follows Table 3 below). We accept Ê if |Ê| is within an acceptable range, and we select p to maximize the probability that |Ê| falls within this range. We set our range of |Ê| to be [400, 500] for IOI and [200, 300] for Greater-Than. Recall that to evaluate circuits with OA, we perform gradient descent on â to approximate the optimal constants.

Table 3: Optimized circuits compared to random circuits for various ablation types

    IOI                     Mean      Resample   Optimal    Counterfactual
    Random circuit loss     4.529     6.527      2.723      4.264
    UGS circuit loss        0.264     1.779      0.176      0.191
    Std                     0.200     0.085      0.024      0.049
    Z-score                 -21.28    -55.67     -100.57    -82.44

    Greater-Than            Mean      Resample   Optimal    Counterfactual
    Random circuit loss     1.010     2.109      0.900      1.785
    UGS circuit loss        0.033     0.056      0.029      0.021
    Std                     0.020     0.039      0.011      0.027
    Z-score                 -49.29    -52.89     -80.81     -64.76
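The random-circuit baseline described above can be summarized in a few lines of Python; the helper prune_dangling_edges and the function signature below are assumptions for illustration rather than our released code.

    import torch

    def sample_random_circuit(num_edges, p, size_range, prune_dangling_edges, max_tries=1000):
        """Sample a random circuit: include each edge independently with probability p,
        prune dangling edges, and accept only if the final edge count lies in size_range."""
        lo, hi = size_range
        for _ in range(max_tries):
            mask = torch.rand(num_edges) < p          # 1(e_k in E_hat) ~ Bern(p)
            mask = prune_dangling_edges(mask)         # drop edges touching dangling vertices
            if lo <= int(mask.sum()) <= hi:           # accept if |E_hat| is in the target range
                return mask
        raise RuntimeError("no accepted circuit; adjust p or size_range")

Under this baseline, the Z-score rows in Table 3 compare the optimized UGS circuit loss to the distribution of random-circuit losses, using the reported standard deviation of Δ over accepted random circuits.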
Since repeating this process is expensive, we truncate training after just 200 training batches, far short of the 10,000 batches used for a full training run, for both the random circuits and the optimized circuits. However, we verify using a smaller sample size that the loss for the random circuits does not tend to decrease much with further training; in fact, for the optimized circuit, the loss typically drops by more than 50% after the first batch, which does not occur for the random circuits.

While random circuits achieve lower loss under OA than under mean and resample ablation, the OA loss measurements for random circuits do not approach the low values achieved by optimized circuits. Furthermore, the standard deviation of Δ for random circuits is lower on average for OA than for mean or resample ablation, and surprisingly, the OA losses for optimized circuits have the most significant Z-scores for both IOI and Greater-Than, though there is not necessarily a meaningful difference between Z-scores of such large magnitude. These results demonstrate that OA is likely highlighting specialized circuit components that already exist in the model rather than fabricating non-existent mechanisms.

G Causal tracing

G.1 Transformer graph representation

Consider running a causal tracing experiment on vertex A. We represent the model with four vertices: Subj(X), representing the subject tokens of the input; Non-Subj(X), representing the remaining input tokens; the component of concern A(X) = A(Subj(X), Non-Subj(X)); and the model output Out(X) = Out(Subj(X), Non-Subj(X), A(X)). In particular, if A = MLP(i), then we compute Out(X) as a function of these three arguments via Equation (7), which takes A(X) and MResid(i)(X) as input: we treat the latter term MResid(i)(X) as a function of Non-Subj(X) and Subj(X), and then compute Out(X) as a function of Resid(i)(X). A similar construction is used for attention layers. Note that the AIE compares the performance of the model with the vertex Subj ablated (the denominator in Equation (5)) to the performance of the model with only the edge (Subj, Out) ablated (the numerator in Equation (5)).

G.2 Relation of AIE to ablation loss gap

For consistency with Meng et al. (2022), we use a carefully selected loss function in the definition of Δ to represent proximity to the model's original predictions, rather than the typical KL-divergence loss. In particular, we choose

$L_{\mathrm{AIE}}(P, Q) := \min\big(0,\ \max Q - [P]_{\arg\max Q}\big),$ (24)

where P and Q represent probability distributions over the model vocabulary. Note that since the dataset is filtered so that Y = arg max M(X), replacing L with L_AIE in Δ, we get

$\Delta = \mathbb{E}_{X \sim D}\, L_{\mathrm{AIE}}\big(M_A(\xi(X), A(X)),\ M(X)\big) = \mathbb{E}_{(X,Y) \sim D}\, \max\big(0,\ [M(X)]_Y - [M_A(\xi(X), A(X))]_Y\big),$ (25)

which is the numerator in Equation (5).

G.3 Additional results

Figure 12: Causal tracing probabilities for different token positions with window size 1 (patching a single component). Error bars indicate the sample estimate plus/minus two standard errors.

Figure 13: Causal tracing probabilities for different token positions with a sliding window of size 5. Error bars indicate the sample estimate plus/minus two standard errors.

We show results for additional window sizes and token positions. In particular, we show results for intervening at all subject token positions, only the last subject token position, and only the last token position, for window sizes 1 (see Figure 12), 5 (see Figure 13), and 9 (see Figure 14).
In addition to providing a more precise localization of components' informational contributions, the results provide some evidence against one of the claims of Meng et al. (2022): the idea that the last subject token position is a uniquely important early site for processing information at MLP layers 10-20. For sliding windows of size 5, GNT shows that intervening on MLPs at only the last subject token position achieves over half of the AIE of performing the same intervention at all subject token positions (33.8% for all subject tokens vs. 19.3% for only the last subject token, shown in Figure 13). However, the OAT results indicate that intervening at all subject tokens is much more effective (35.2% vs. 11.1%), indicating that early subject token positions may be more important than previously thought.

Figure 14: Causal tracing probabilities for different token positions with a sliding window of size 9. Error bars indicate the sample estimate plus/minus two standard errors.

G.4 Construction of standard errors

For input-label pairs (X, Y) ∼ D, let W = min(0, [M(X)]_Y − [M_A(ξ(X), A(X))]_Y) and Z = [M(X)]_Y − [M(ξ(X))]_Y, and let Ŵ_n and Ẑ_n be their respective sample means with n samples. Recall from Equation (5) that we want to estimate from samples the quantity

$\mathrm{AIE}(A) = \min\Big(0,\ 1 - \tfrac{\mathbb{E}[W]}{\mathbb{E}[Z]}\Big) =: \min\Big(0,\ 1 - \tfrac{\mu_W}{\mu_Z}\Big).$

By the central limit theorem,

$\sqrt{n}\left(\begin{pmatrix}\widehat{W}_n \\ \widehat{Z}_n\end{pmatrix} - \begin{pmatrix}\mu_W \\ \mu_Z\end{pmatrix}\right) \xrightarrow{d} N(0, \Sigma) := N\left(0, \begin{pmatrix}\sigma^2_W & \sigma_{WZ} \\ \sigma_{WZ} & \sigma^2_Z\end{pmatrix}\right).$

By the multivariate delta method, for h(w, z) := w/z and v := ∇h(µ_W, µ_Z),

$\sqrt{n}\left(h(\widehat{W}_n, \widehat{Z}_n) - h(\mu_W, \mu_Z)\right) \xrightarrow{d} N(0, v^{\top}\Sigma v),$ (27)

so the asymptotic variance is

$v^{\top}\Sigma v = \frac{\mu^2_W}{\mu^2_Z}\left( \frac{\sigma^2_W}{\mu^2_W} + \frac{\sigma^2_Z}{\mu^2_Z} - \frac{2\,\sigma_{WZ}}{\mu_W \mu_Z} \right),$

which we estimate via samples to obtain our standard errors.

H.1 Transformer graph representation

We represent the model with Resid(i), MResid(i), Attn(i), and MLP(i) vertices for each layer i and a vertex Out(X) representing the model output, where the relationships between the vertices are defined by the equations given in Appendix C.3. Applying OCA lens at layer i entails ablating vertices Attn(i+1) through Attn(N) (where N is the number of layers in the model).

H.2 Additional prediction accuracy results

Figure 15 shows results on prediction loss for GPT-2-small, GPT-2-medium, and GPT-2-large. For all models, we use a learning rate of 0.01 for tuned lens and 0.002 for OCA lens.

Figure 15: Comparison of prediction loss between tuned lens and ablation-based alternatives.

H.3 Additional causal faithfulness results

The following figures show the causal faithfulness metrics under several kinds of perturbations. Let µ = E[ℓ_i(X)] and Σ = Var(ℓ_i(X)).

Figure 16: Causal faithfulness comparison under random perturbations.

Figure 17: Causal faithfulness comparison under basis-aligned perturbations.

Random perturbation: We sample Z ∼ N(0, Σ) and let V = Z/‖Z‖. We let Z′ ∼ N(0, 1), with Z′ independent of Z, and define ξ(a) = a + c·Z′·V. We define the constant c such that E[L(M(X; ξ), M(X))] ≈ 0.2. Results shown in Figure 16.

Figure 18: Causal faithfulness comparison under random projections.

Figure 19: Causal faithfulness comparison under basis-aligned resample ablation.

Basis-aligned perturbation: Same as random perturbation, except we choose a basis of d_model vectors as described in Section 5 and let Z be a uniformly sampled basis element. Results shown in Figure 17.

Random projection: We sample Z ∼ N(0, Σ) and let V = Z/‖Z‖. We define ξ(a) = µ + P(a − µ), where P represents the projection onto the orthogonal complement of V. Results shown in Figure 18.

Basis-aligned projection: Described in the main text. Results shown in Figure 3.

Basis-aligned resample ablation: We choose a basis as described in Section 5.
We consider the subspace spanned by the 100 basis elements with the largest singular values, and define ξ(a) by performing resample ablation on the projection of a onto this subspace. Results shown in Figure 19. We find that the improvement in causal faithfulness is consistent across all perturbation types studied.

H.4 Elicitation results on factual datasets

We show additional results for elicitation on the text classification datasets. Figure 20 shows comprehensive results for each of the individual datasets. For our experiments, we use 10 demonstrations and sample from the datasets without replacement to generate the demonstration examples. Note that we exclude SST2-AB, a toy dataset constructed by Halawi et al. (2024) that replaces SST2 sentiment labels with the letters A and B, since it is only created to show that elicitation accuracy does not improve when the expected answer is unrelated to the question (since the label is encoded in a non-intuitive manner, information from later layers is required to relate internal knowledge to the correct label). Figure 21 summarizes the elicitation accuracy boost of OCA lens over tuned lens across the datasets.

Figure 20: Comparison of calibrated accuracy of elicited completions on 15 datasets from Halawi et al. (2024). Dotted lines indicate the accuracy of the model's output predictions for true demonstrations (black) and false demonstrations (red).

Figure 21: Comparison of elicitation accuracy boost between OCA lens and tuned lens.

I Reproducibility

All code can be found at https://github.com/maxtli/optimalablation. All experiments were run on a single Nvidia A100 GPU with 80GB VRAM. The cost of UGS is comparable to ACDC (about 1-2 hours to train). Training OCA lens until convergence also takes about 3-5 hours, which is similar to the amount of time required to train tuned lens.

J Impact statement

We believe that OA can lead to a more granular understanding of models' internal mechanisms. A better understanding of interpretability can help to reduce risk from dangerous AIs, but more work is required to scale interpretability techniques to larger models. Interpretability can also help us understand how to build better inductive biases into models, paving the way for future developments in architecture. On the other hand, advanced interpretability could also be repurposed for nefarious applications, like eliciting dangerous knowledge from models' latent space. However, we believe that better interpretability will also provide better clarity on how to mitigate these risks.

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: The abstract presents our main claims in the paper, including a reference to optimal ablation as our main proposed technique and its experimental applications.

Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2.
Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We discuss limitations in Appendices A and F.12.

Guidelines: The answer NA means that the paper has no limitations, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: Mainly, Proposition 2.3 is clearly contextualized and proven in the paper.

Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]

Justification: We provide details about experiments in both the main text and in Appendices F, G, and H for our three main experimental settings.

Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: Our code is available online at https://github.com/maxtli/optimalablation.

Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We provide details in Appendices F, G, and H, and these details can also be found in the code implementation.

Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We treat statistical significance carefully in each experiment. We discuss the statistical significance of circuits discovered with OA in Appendix F.12, and we discuss error bars for causal tracing in Appendix G.4.

Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: See Appendix I.

Guidelines: The answer NA means that the paper does not include experiments.
The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: We only use open-source datasets.

Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: See Appendix J.

Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: We do not release any new models. Our experiments run on open-source models.

Guidelines: The answer NA means that the paper poses no such risks.
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: N/A

Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: N/A

Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: N/A

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15.
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: N/A

Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.