# Scaling Up Influence Functions

Andrea Schioppa, Polina Zablotskaia, David Vilar, Artem Sokolov
Google Research
{arischioppa, polinaz, vilar, artemsok}@google.com

We address the efficient calculation of influence functions for tracking predictions back to the training data. We propose and analyze a new approach to speeding up the inverse-Hessian calculation based on Arnoldi iteration. With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) Transformer models with several hundred million parameters. We evaluate our approach on image classification and sequence-to-sequence tasks with tens to a hundred million training examples. Our code will be available at https://github.com/google-research/jax-influence.

1 Introduction

Recognizing that data has the highest agency in determining the performance of deep neural networks (DNNs), the pursuit of the state of the art has made datasets for training modern DNNs grow to sizes that can no longer be curated by humans. This has acutely aggravated data issues like noise and mislabeled examples: noise is characteristic of tasks where training data is crawled from the Web (e.g. machine translation) and where ground-truth labels are heuristically paired to inputs (Uszkoreit et al. 2010), leaving ample room for errors and inheriting the biases of the heuristic. Wrong labels can also be introduced by non-expert crowd annotators who, considering the amount of data to be labeled, are hard to incentivize for quality within available budgets (Bowman and Dahl 2021).

Given the above, a natural way to interpret and fix DNN models is to track their bad (or good) predictions down to the training examples that caused them (Cook and Weisberg 1980; Koh and Liang 2017; Yeh et al. 2018), and to take appropriate action on the found examples or annotation policies. Addressing this, Koh and Liang (2017) proposed influence functions (IFs) as a theoretically motivated method, grounded in robust statistics (Cook and Weisberg 1982), of quantifying the effect of training examples on predictions: for a query example z, IFs estimate the most influential example x in the training data D, in terms of the absolute change of the loss L if x were infinitesimally up-weighted in D, with

$$I_H(x, z) = \langle \nabla_\Theta L(z),\; H^{-1} \nabla_\Theta L(x) \rangle, \qquad (1)$$

where $H = \nabla^2_\Theta L$ is the Hessian of the model at parameters Θ.

The straightforward IF implementation, using the approximate Hessian-inversion procedure LISSA (Agarwal, Bullins, and Hazan 2017), has O(p) memory and O(r·p) time complexity, where r is the LISSA iteration count and p = |Θ|, incurred at every x. Besides the need for careful tuning of LISSA, the O(p) memory has been the major obstacle to deploying IFs for debugging application-relevant DNNs with hundreds of millions (or more) of training examples and model parameters; so noise or mislabeling issues remain unfixed or even undetected, and adversely impact predictions.

In this work, we focus on reducing the IF memory footprint by materializing neither O(p)-size gradients nor Hessians, and on decoupling the required number of Hessian estimations, O(r·|D|), from the training-data size. This allows us to parallelize the computation over larger batch sizes b and to scale to huge datasets and models.
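To make (1) concrete, here is a minimal sketch (not from our released code; all names are illustrative) that evaluates the influence score exactly on a toy model whose Hessian fits in memory. The explicit Hessian and its inversion below are precisely what become infeasible at the scale we target.

```python
import jax
import jax.numpy as jnp

def loss(theta, example):
    x, y = example
    return (jnp.dot(theta, x) - y) ** 2  # toy squared loss

theta = jnp.array([0.5, -1.0, 2.0])          # "converged" parameters
x_train = (jnp.array([1.0, 0.0, 1.0]), 1.0)  # training example x
z_query = (jnp.array([0.0, 1.0, 1.0]), 0.0)  # query example z

grad_fn = jax.grad(loss)
# Explicit p x p Hessian: only viable for tiny p. In (1), H is the Hessian of
# the full training loss; here it is taken on a single example for brevity,
# and a small damping term keeps the rank-deficient toy Hessian invertible.
H = jax.hessian(loss)(theta, x_train) + 1e-3 * jnp.eye(theta.shape[0])
# Eq. (1): <grad L(z), H^{-1} grad L(x)>
score = jnp.vdot(grad_fn(theta, z_query),
                 jnp.linalg.solve(H, grad_fn(theta, x_train)))
```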
Specifically, we use Arnoldi iteration (Arnoldi 1951) to find the dominant (in absolute value) eigenvalues of H and their orthonormal eigenvectors on a random data subset D', |D'| ≪ |D|, and then cheaply invert the diagonalized H, avoiding calls to LISSA as well as its convergence and stability issues. As H is Hermitian, (1) is symmetric w.r.t. x and z, so previous work cached {∇_Θ L(x)} to improve IF usability (Guo et al. 2021); however, this only spares one backward pass per x and requires re-estimating the product of the (unstorable) H^{-1} with ∇_Θ L(z) every time an influence on z is requested. The crux of our approach is to instead cache H in its trivially invertible diagonalized form, restricted to the small-dimensional subspace spanned by a few dominant eigenvectors, p' ≪ p. Hessian-gradient products are then reduced to simple scalar-gradient products, which do not need to be materialized in memory, as (1) can now be implemented with Jacobian-vector products. In summary, our approach renders repeated re-estimations of H^{-1}∇_Θ L(x) at every x unnecessary, replacing them with memory- and time-efficient forward-mode differentiation.

Empirically, IFs with Arnoldi iteration achieve speed-ups of 3-4 orders of magnitude over the LISSA-powered IFs (Koh and Liang 2017) and of 10x over TracIn (Pruthi et al. 2020), a heuristic gradient-only alternative to IFs (§5), with better or similar accuracy. With this improvement, we successfully evaluated IFs on full-size language and vision Transformer models (up to 300M parameters) in sequence-to-sequence and image classification tasks, respectively, with 100M (Paracrawl) and 14M (ImageNet) training examples.

Note that the standard conditions for (1) to be a correct influence estimate, i.e. a locally strictly convex $L \in C^2$ (Koh and Liang 2017), remain in place, and their fulfillment depends on the concrete task, network, training algorithm and its convergence status. The time/memory complexities also remain; however, our contribution improves the constants hidden in the O-notation, and thus permits IF evaluation on the full data/models that are relevant in applications, on standard memory-limited hardware. This opens the way for developers to make an informed decision on whether IFs are appropriate for their task, rather than resorting to heuristics from the start. This is encouraging, since the existing brute-force recipe of soothing the O(p) complexity by subsetting parameters, e.g. focusing on a few layers only (Chen et al. 2020), is prone to producing incorrect results (Feldman and Zhang 2020) (see also §5.2). On the other hand, running IFs on subsets of D to reduce runtime (Guo et al. 2021) may introduce unwanted biases, misattribute prediction failures, and would not be enough for identifying mislabeled examples (Pruthi et al. 2020) or hard examples requiring memorization (Feldman and Zhang 2020). Yet these approaches are compatible with our method and should result in compound speed-ups. We will open-source our implementation of Arnoldi iteration at https://github.com/google-research/jax-influence.

2 Related Work

Explaining DNN predictions falls under a broader interpretability umbrella, where the lingering complexity of the data-explainability approach historically made research focus on instance-based methods, which explain predictions in terms of task-specific structural units of inputs, e.g. pixels or tokens.
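The following sketch shows the scoring step this caching enables, assuming a flat parameter vector instead of a pytree and illustrative names (`loss`, `theta`, `G`, `eigvals`): given the approximate top eigenvalues and the matrix G of corresponding eigenvectors produced once by Arnoldi, each influence query needs only p' forward-mode JVPs and a diagonal rescaling, and never materializes an O(p) gradient.

```python
import jax
import jax.numpy as jnp

def project_grad(loss, theta, example, G):
    """Compute G @ grad(loss)(theta) without materializing the gradient:
    for a scalar loss, the JVP in direction g_i is the scalar <g_i, grad L>."""
    f = lambda t: loss(t, example)
    return jnp.stack([jax.jvp(f, (theta,), (g,))[1] for g in G])

def influence(loss, theta, x, z, G, eigvals):
    gx = project_grad(loss, theta, x, G)  # shape [p']
    gz = project_grad(loss, theta, z, G)  # shape [p']
    # H is diagonal in the cached eigenbasis: H^{-1} is elementwise 1/eigvals.
    return jnp.vdot(gz, gx / eigvals)
```

In practice one would batch the JVPs over examples; the point is that neither H^{-1} nor any O(p) gradient ever appears explicitly at query time.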
Rich literature offers different instantiations of this idea: gradient-based saliency maps (Simonyan, Vedaldi, and Zisserman 2013), input perturbations (Li, Monroe, and Jurafsky 2016), or LIME (Ribeiro, Singh, and Guestrin 2016), which fits a linear model in the input's neighborhood. However, being limited to specific inputs, their insights are rarely actionable for system developers. And while it is possible to repurpose them for data explainability, e.g. via clustering of saliency maps (Lapuschkin et al. 2019), this solves a more difficult task than necessary, introduces new hyperparameters (incl. the saliency method itself) and relies on human experts to make sense of the clusters. In contrast to instance explanations in the form of token-level heatmaps, IFs provide a method for tracing model predictions back to training examples.

Existing approaches to reducing IF runtime mostly address salient problem axes (dimensionality of active parameters, cardinality of the data subset, or number of iterations) without addressing the procedure itself; or they drop the theoretical foundations and use heuristics simpler than IFs: e.g. Pruthi et al. (2020) reduce influence to tracking loss changes with cumulative (over training model checkpoints, i.e. model snapshots) dot products of gradients, and Yeh et al. (2018) leverage kernel functions evaluated at the training samples to explain inference decisions. Mathematically, the closest to our work is (Ghorbani, Krishnan, and Xiao 2019), who use a specialization of Arnoldi iteration to Hermitian matrices (Lanczos iteration) to study the dynamics of the spectra of entire Hessians at different snapshots during training. Because of this different goal, they use full-batch Hessians (i.e. computed on the full dataset), while we spread the Hessian approximation across smaller batches that do not cover the full D. As a result, we can work with larger models and datasets, e.g. ResNet50/ViT vs. ResNet18, and at greater speed (while simultaneously raising the number of Lanczos/Arnoldi iterations from 90 to 200 to increase precision).

3 Influence and Influence Functions

The true influence of x on z, $I_{\mathrm{true}}(x, z)$, is defined as the change in the loss L at z between having learned the model parameters Θ without and with x in the training data (Cook and Weisberg 1980):

$$I_{\mathrm{true}}(x, z) = L(z \mid \Theta : x \notin D) - L(z \mid \Theta : x \in D).$$

Explicit calculation of $I_{\mathrm{true}}(x, z)$, by removing every x and retraining, is infeasible for large D, and several approximation techniques have been proposed. Feldman (2020) and Feldman and Zhang (2020) propose to train multiple models on randomly selected data subsets while tracking, for each x, the subsets it belonged to; this way, one can obtain an unbiased estimator $I_{\mathrm{mem}}(x, x)$ of $I_{\mathrm{true}}(x, x)$, which, however, requires a substantial number of model retrainings (up to thousands to satisfy theoretical guarantees (Feldman 2020)).

Koh and Liang (2017) advocated the use of the IFs in (1), which approximate the loss change after an infinitesimal up-weighting of x in D. For models used in practice, H cannot be materialized in memory, let alone inverted by standard linear algebra. However, for a fixed vector v, the Hessian-vector product (HVP), Hv, can be computed in O(b·p) time and memory (Pearlmutter 1994), where the batch size b defines the number of training examples on which H (of an implicitly given loss L) is approximated. The HVP is commonly implemented in modern autodiff toolkits (Baydin et al. 2018) as reverse-mode differentiation (a gradient computation) followed by a forward-mode Jacobian-vector product (JVP). Repeated HVP calls are the workhorse of the iterative procedure LISSA (Agarwal, Bullins, and Hazan 2017), used by Koh and Liang (2017), which estimates the inverse HVP as

$$H_r^{-1} v = v + (I - H) H_{r-1}^{-1} v,$$

where H is approximated on random batches and v is a gradient. Even for small r, the procedure is both time- and memory-expensive, as the O(p) memory of the HVP on an explicitly instantiated v forces one to estimate H on a single sampled training point per iteration (b = 1), impacting accuracy. Moreover, the total O(r·b·p) time complexity is incurred at every x whose influence on z we are interested in.
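As a reference point for the rest of the paper, here is a minimal sketch of both primitives, assuming a scalar loss over a flat parameter vector (real models use pytrees; all names are illustrative, not our released code). `hvp` is the standard forward-over-reverse construction; `lissa` implements the recursion above with H re-estimated on a fresh batch per iteration.

```python
import jax

def hvp(loss, theta, batch, v):
    # Reverse mode for the gradient, then one forward-mode JVP of the
    # gradient function in direction v: computes Hv without materializing H.
    g = lambda t: jax.grad(loss)(t, batch)
    return jax.jvp(g, (theta,), (v,))[1]

def lissa(loss, theta, batches, v, scale=1.0):
    # `scale` must upper-bound the top Hessian eigenvalue so that
    # I - H/scale is a contraction; tuning it (and the iteration count)
    # is the delicate part this paper seeks to avoid.
    estimate = v
    for batch in batches:  # one (often single-example) batch per step
        estimate = v + estimate - hvp(loss, theta, batch, estimate) / scale
    return estimate / scale  # approximates H^{-1} v
```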
Evaluating influence methods. A practical problem with influence-based explainability is the absence of a ground truth to verify that a method produces correct results. In this paper we use two proxies, following the assumption that xs with high self-influence $I_H(x, x)$ correspond to data outliers (Koh and Liang 2017; Pruthi et al. 2020): we either introduce a known synthetic data corruption and check for its correct retrieval by a method, or filter high-influence points out and measure the change in downstream task metrics (Guo et al. 2021; Kocijan and Bowman 2020). Using $I_H(x, x)$ as a retrieval score for corrupted data, we measure the retrieval quality as the areas under the ROC and precision-recall curves, respectively denoted as the Area Under the Curve (AUC) and the Average Precision (AP).

4 Scaling Influence Functions

From the discussion above, the O(p) memory complexity is the major bottleneck for an efficient implementation of IFs. We start with an overview of existing techniques, showing their commonalities. Note that all are compatible with the approach we propose.

Caching. For data-interpretability purposes one might consider limiting (1) to a sufficiently promising subset D' ⊆ D; e.g. Guo et al. (2021) define D' as the top-k ℓ2-neighbors of z in D. Besides more hyperparameters, this still requires computing $H^{-1}\nabla_\Theta L(z)$ for every z and, as discussed above, reducing D would not be enough for applications that require computing $I_H(x, z)$ on all training x. As H is symmetric, one could swap x and z and cache $H^{-1}\nabla_\Theta L(x)$ instead, bringing down the query complexity (only $\nabla_\Theta L(z)$ has to be computed then), but this would just shift the computational burden to building the search index over $H^{-1}\nabla_\Theta L(x)$.

Restricting parameters. Reducing the required memory is possible by naively limiting the computation to a smaller subset of parameters of cardinality p', e.g. selecting one or a few layer(s); usually, the last layer is selected (Koh and Liang 2017). This has two drawbacks: the choice of layers becomes a hyperparameter and the viable values of p' depend on the model architecture; and, as Feldman and Zhang (2020, §3.6) show, using just one layer can result in influence estimates that differ from those of the full model.

Random projections. For simplification one might assume H = I and reduce influence estimates to dot products of gradients. To account for multiple layers at once and to get finer-grained control of p', we consider a simple baseline, RandSelect, which randomly selects p' parameters Θ' ⊆ Θ and computes influence using the final checkpoint and gradients with respect to Θ'; it can be combined with layer selection. The RandSelect estimator can be equivalently expressed as

$$I_G(x, z) = \langle G \nabla_\Theta L(x),\; G \nabla_\Theta L(z) \rangle, \qquad (2)$$

where $G \in \mathbb{R}^{p' \times p}$ is a row-selection matrix picking the gradient's components corresponding to Θ'.
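A hedged sketch of the RandSelect estimator (2), with illustrative names and flat parameters assumed: since G is a 0/1 row-selection matrix, G∇_Θ L is just indexing into a materialized gradient, which is why RandSelect still pays for full back-propagation.

```python
import jax
import jax.numpy as jnp

def rand_select_influence(loss, theta, x, z, idx):
    """idx holds the p' randomly chosen parameter indices (the rows of G)."""
    gx = jax.grad(loss)(theta, x)[idx]  # G grad L(x): full backprop, then select
    gz = jax.grad(loss)(theta, z)[idx]  # G grad L(z)
    return jnp.vdot(gz, gx)

# idx would be drawn once and reused, e.g.:
# idx = jax.random.choice(jax.random.PRNGKey(0), p, shape=(p_prime,), replace=False)
```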
We also use another sketching (Woodruff 2014) baseline, RandProj, initially proposed by Wojnowicz et al. (2016) for generalized linear models: for a random Gaussian projection matrix G, $E[G^T G] = I$, which leads to an unbiased estimate in (2). Since normally p' ≪ p, it allows a memory-efficient implementation in forward mode: one just estimates p' JVPs with the rows of G, which avoids materializing O(p)-size gradients. This has a lower memory footprint than RandSelect, which requires back-propagation as its p' (e.g. one layer) is still of the same order as p (see §5.2).

Tracing updates. A development of the H = I idea is proposed in (Pruthi et al. 2020), where the influence approximation works when one can trace changes to the loss L across all gradient steps. To make this feasible, they propose the TracIn estimator, defined on a subset of gradient steps:

$$I_{\mathrm{TracIn}}(x, z) = \frac{1}{C} \sum_{i=1}^{C} \langle \nabla_{\Theta_i} L(x),\; \nabla_{\Theta_i} L(z) \rangle,$$

where {Θ_i} is a set of C checkpoints. Note that the complexity of TracIn is C times that of exact gradient similarity and, as discussed in (Pruthi et al. 2020), care needs to be taken in selecting the checkpoints. Another practical obstacle to TracIn is that, when analyzing publicly released models, usually only the final model checkpoint is provided.

Compatible projections. RandProj assumes the full-dimension Hessian, H = I; if we drop this requirement, we might consider H restricted to the subspace S_G that is the image of G, and work with $G H G^T$ instead of the larger H. However, S_G is in general not H-invariant, which can lead to approximation errors: when H is applied to a vector v ∈ S_G, the result might have non-negligible components orthogonal to S_G. We will see an example of this later in the experiments on eigenvalue retrieval for MNIST (Figure 1), where RandProj requires a considerably larger p' than Arnoldi to retrieve the top-p' eigenvalues of H.

Our approach. We propose to use the standard technique of building an approximately H-invariant subspace by selecting an arbitrary (e.g. random) vector v ∈ R^p and constructing the n-th order Krylov subspace:

$$K_n(H; v) = \mathrm{span}\{v, Hv, H^2 v, \dots, H^n v\}.$$

The Arnoldi iteration (Arnoldi 1951) additionally builds an orthonormal basis for $K_n(H; v)$, so that the diagonalization of the restriction $\tilde{H}$ of H to $K_n(H; v)$ yields an approximation of the largest (in absolute value) eigenvalues of H and of the corresponding eigenvectors (Trefethen and Bau 1997, Ch. 33-34). Assuming n is large enough to estimate the largest p' eigenvalues, in summary we obtain a projection matrix G and work with the much smaller matrix $\tilde{H} = G H G^T$. We will call this algorithm Arnoldi, with the pseudocode in Algorithm 1.

Algorithm 1: Arnoldi
1: procedure ARNOLDI(v, n) ▷ build an orthonormal basis of the Krylov subspace $K_n(H; v)$
2:   $w_0 \leftarrow v / \|v\|_2$
3:   $A_{l,m} \leftarrow 0$ for $0 \le l \le n$ and $0 \le m < n$
4:   for $i \in 1, \dots, n$ do
5:     $w_i \leftarrow H w_{i-1}$ ▷ HVP in fwd-over-rev mode
6:     $A_{j,i-1} \leftarrow \langle w_i, w_j \rangle$ for $j < i$
7:     $w_i \leftarrow w_i - \sum_{j<i} A_{j,i-1} w_j$ ▷ project out previous directions
8:     $A_{i,i-1} \leftarrow \|w_i\|_2$
9:     $w_i \leftarrow w_i / A_{i,i-1}$ ▷ normalize
10:  return A, {w_i}

The common feature of RandProj and Arnoldi is that, instead of working with the full gradient $\nabla_\Theta L(x)$, one takes the JVPs $\langle g_i, \nabla_\Theta L(x) \rangle$ with respect to the rows $g_i$ of G. The implementation then becomes considerably more efficient, as it can be done in forward-mode differentiation and on larger batches. Moreover, in the case of Arnoldi the matrix H is replaced with the now diagonal $\tilde{H}$, simplifying the matrix inversion appearing in the definition of $I_H$ and dispensing with the expensive LISSA procedure.
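The sketch below renders Algorithm 1 as runnable NumPy (illustrative, not our released implementation): it assumes flat parameters and an `hvp_fn` callable like the one sketched in Section 3, builds the orthonormal Krylov basis, diagonalizes the restriction of H (symmetrized, since H is Hermitian up to batching noise), and returns the dominant eigenvalue estimates together with the projection matrix G whose rows play the role of the $g_i$ above.

```python
import numpy as np

def arnoldi(hvp_fn, v, n, top_k):
    p = v.shape[0]
    W = np.zeros((n + 1, p))   # orthonormal basis vectors w_0..w_n
    A = np.zeros((n + 1, n))   # projection coefficients (Hessenberg matrix)
    W[0] = v / np.linalg.norm(v)
    for i in range(1, n + 1):
        w = np.asarray(hvp_fn(W[i - 1]))      # one HVP per iteration
        for j in range(i):                    # project out previous directions
            A[j, i - 1] = np.dot(w, W[j])
            w -= A[j, i - 1] * W[j]
        A[i, i - 1] = np.linalg.norm(w)
        if A[i, i - 1] < 1e-8:                # exact invariant subspace found
            n = i
            break
        W[i] = w / A[i, i - 1]
    # Restriction of H to K_n is the upper-left n x n block; symmetrize,
    # treating deviations as numerical/batching noise on a Hermitian H.
    T = 0.5 * (A[:n, :n] + A[:n, :n].T)
    evals, evecs = np.linalg.eigh(T)
    order = np.argsort(np.abs(evals))[::-1][:top_k]  # dominant |eigenvalues|
    G = evecs[:, order].T @ W[:n]             # rows: approx. eigenvectors in R^p
    return evals[order], G
```

On a small problem the output can be sanity-checked against `np.linalg.eigh` of the explicit Hessian; in our setting each `hvp_fn` call is itself estimated on a random batch, which is what spreads the Hessian approximation across D'.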
Error analysis. It remains to analyze the effect of using the top-p' eigenvalues in Arnoldi. Recall that Koh and Liang (2017) derive (1) by minimizing the quadratic form

$$Q(\theta) = \frac{1}{2}\langle \theta, H\theta \rangle - \frac{1}{N}\langle \nabla_\Theta L(x|\Theta_0), \theta \rangle,$$

where Θ0 are the parameters at convergence. Ordering the eigenvalues of H at Θ0 as $|\lambda_1| \ge |\lambda_2| \ge \cdots$ and letting $e_1, e_2, \dots$ be the corresponding eigenvectors, Ghorbani, Krishnan, and Xiao (2019) empirically observe (and prove in the quadratic case) that gradient updates align with the subspace of H corresponding to the dominant λs. We provide two additional arguments in the same direction: we upper-bound the error of approximating Q using such a subspace in Lemma 1, and discuss the effect of noise in H and of the size of $\lambda_k$ on applying $H^{-1}$ to a vector in Lemma 2 (with proofs in Appendix A). Let $Q_k$ be the form Q restricted to the H-subspace spanned by the top-k λs. We show that, as k increases, $Q_k$ approximates Q better, and that errors in the directions of the $e_k$ corresponding to smaller $|\lambda_k|$ matter less:¹

Lemma 1. $Q_k$ approximates Q with an error bounded by

$$0 \le Q(\theta) - Q_k(\theta) \le \tfrac{1}{2}\,|\lambda_{k+1}|\,\|\theta\|_2^2.$$

Further, if minimizing Q introduces an error ε in the direction of $e_{k+1}$, yielding an estimate θ' of the minimizer θ*, then $Q(\theta') - Q(\theta^*) = \tfrac{\varepsilon^2}{2}\,|\lambda_{k+1}|$.

Another way of looking at the same phenomenon is to consider the variance of the estimated influence as a function of $|\lambda_k|$. Consider the computation of $y = H^{-1}u$, where the vector u is known exactly. Assume also that the estimation of H is noisy, yielding $H + \delta H$, that $E[\delta H] = 0$, and that δH is isotropic (i.e. it does not preferentially align with any $e_k$ nor co-vary with $\lambda_k$). Then the variance of the estimator $\hat{y} = (H + \delta H)^{-1} u$ in the direction of $e_k$ is proportional to $|\lambda_k|^{-2}$ (to first order, $\langle \hat{y}, e_k \rangle \approx \langle y, e_k \rangle - \lambda_k^{-1} \langle \delta H\, e_k, y \rangle$, using the symmetry of δH):

Lemma 2. The variance of ŷ in the direction of $e_k$ satisfies

$$\mathrm{Var}(\langle \hat{y}, e_k \rangle) \approx \frac{1}{|\lambda_k|^2}\,\mathrm{Var}(\langle \delta H\, e_k,\; y \rangle).$$

¹One might attempt Arnoldi on $H^{-1}_{\Theta_0}$ to obtain an approximation directly in the subspace of the top-k eigenvalues of $H^{-1}_{\Theta_0}$. We found this approach, however, to be less performant (see Appendix A.1).

5 Experiments

5.1 Small Model & Data Scale: Digit Recognition

In this subsection, to be able to compare all baselines, we pick the small MNIST dataset (LeCun, Cortes, and Burges 1994) and consider two CNNs of different sizes: a small one that permits exact Hessian calculation, and a larger one on which we can gauge the scalability potential. Because influence calculation with LISSA and TracIn is slow, following (Koh and Liang 2017) we take two 10% subsamples of the original data for training and evaluation, and randomly relabel 20% of the training examples to create a corrupted dataset on which to evaluate the retrieval of mislabeled examples with influence estimates. Unlike (Koh and Liang 2017; Pruthi et al. 2020), we introduce the noise before training the models; by design, a perfect model on correctly labeled data would achieve only 80% accuracy on our eval set.

Small network. We re-implemented the small convolutional network with smooth non-linearities from (Koh and Liang 2017).

We mislabeled 10% of the test examples and compared their retrieval by Arnoldi and RandProj, accumulating parameters from the top 10%, 20% and the full model, in Appendix D.2, Table 6. Arnoldi wins, but the gap to RandProj is small.

⁵https://github.com/google/flax/tree/main/examples/imagenet
⁶Even though the model parameter count is about 50M, half of the parameters are batch statistics, so we treat p as 25M.
⁷https://github.com/google-research/vision_transformer
⁸Among the top-100 λs, 7% of the mass belongs to negative λs.

Figure 3: Runtime of Arnoldi for n = 200 iterations on the full set of parameters of the respective networks.
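The corrupted-data retrieval protocol of Section 3, used throughout these experiments, is straightforward to express; a minimal sketch (placeholder data, assuming scikit-learn is available; `self_influence_scores` stands for any estimator, e.g. the Arnoldi scorer sketched earlier):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(self_influence_scores, is_corrupted):
    """Score each training example by I(x, x) and measure how well the
    scores rank the examples whose labels were flipped."""
    return {"AUC": roc_auc_score(is_corrupted, self_influence_scores),
            "AP": average_precision_score(is_corrupted, self_influence_scores)}

# Placeholder usage with random scores and a 20% corruption rate as in Sec. 5.1:
rng = np.random.default_rng(0)
print(retrieval_metrics(rng.normal(size=1000), rng.random(1000) < 0.2))
```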
This perhaps indicates that accounting for local curvature is superfluous for this particular self-influence benchmark and model. As demonstrated by Koh and Liang (2017, §2.2), this may not be the case in general, and for other IF-based applications our contribution enables verifying whether accounting for Hessians is required. Unlike in the machine translation case, increasing the number of parameters leads here to a slight decrease in performance, suggesting that one may restrict IFs to the top layers in the fine-tuning setting. Another reason for the near-matching performance could be the increased fragility of IFs for large models (Basu, Pope, and Feizi 2021), also called out for natural language inference and the RoBERTa model by Kocijan and Bowman (2020), where performance dropped after retraining on either high- or low-influence (w.r.t. a validation set) examples. In Appendix D.4 we also investigate the memorization vs. generalization trade-off of removing high or low self-influence images on ImageNet. In Appendix D.3, for the whole ImageNet and the full ResNet50 model, we picture the most self-influential images and the most influential training images retrieved for a test point.

6 Conclusion

We proposed a new way of calculating the influence scores of (Koh and Liang 2017) for large DNNs by approximately diagonalizing their Hessians, avoiding re-estimating them on every training example. We demonstrated finding influential or noisy examples in datasets of up to 100M training examples and models with up to 300M parameters.

Acknowledgements

We thank Behrooz Ghorbani and Mukund Sundararajan for their valuable feedback on the paper.

References

Agarwal, N.; Bullins, B.; and Hazan, E. 2017. Second Order Stochastic Optimization for Machine Learning in Linear Time. JMLR, 18(116): 1-40.
Arnoldi, W. E. 1951. The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quarterly of Applied Mathematics, 9(1): 17-29.
Barshan, E.; Brunet, M.; and Dziugaite, G. K. 2020. RelatIF: Identifying Explanatory Training Samples via Relative Influence. In AISTATS.
Basu, S.; Pope, P.; and Feizi, S. 2021. Influence Functions in Deep Learning Are Fragile. In ICLR.
Baydin, A. G.; Pearlmutter, B. A.; Radul, A. A.; and Siskind, J. M. 2018. Automatic differentiation in machine learning: a survey. JMLR, 18(153): 1-43.
Bowman, S. R.; and Dahl, G. 2021. What Will it Take to Fix Benchmarking in Natural Language Understanding? In NAACL.
Chen, H.; Si, S.; Li, Y.; Chelba, C.; Kumar, S.; Boning, D.; and Hsieh, C.-J. 2020. Multi-Stage Influence Function. In NeurIPS.
Chen, X.; Hsieh, C.-J.; and Gong, B. 2021. When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations. CoRR, abs/2106.01548.
Cook, R. D.; and Weisberg, S. 1980. Characterizations of an Empirical Influence Function for Detecting Influential Cases in Regression. Technometrics, 22(4): 495-508.
Cook, R. D.; and Weisberg, S. 1982. Residuals and Influence in Regression. Chapman and Hall.
Deng, Y.; Cheng, S.; Lu, J.; Song, K.; Wang, J.; Wu, S.; Yao, L.; Zhang, G.; et al. 2018. Alibaba's Neural Machine Translation Systems for WMT18. In WMT.
Feldman, V. 2020. Does Learning Require Memorization? A Short Tale about a Long Tail. In STOC.
Feldman, V.; and Zhang, C. 2020. What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation. In NeurIPS.
Foret, P.; Kleiner, A.; Mobahi, H.; and Neyshabur, B. 2021. Sharpness-aware Minimization for Efficiently Improving Generalization. In ICLR.
Ghorbani, B.; Krishnan, S.; and Xiao, Y. 2019. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. In ICML.
Guo, H.; Rajani, N. F.; Hase, P.; Bansal, M.; and Xiong, C. 2021. FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging. In EMNLP.
Gwinnup, J.; Anderson, T.; Erdmann, G.; and Young, K. 2018. The AFRL WMT18 Systems: Ensembling, Continuation and Combination. In WMT.
Han, X.; Wallace, B. C.; and Tsvetkov, Y. 2020. Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions. In ACL.
Junczys-Dowmunt, M. 2018. Microsoft's Submission to the WMT2018 News Translation Task: How I Learned to Stop Worrying and Love the Data. In WMT.
Kocijan, V.; and Bowman, S. 2020. Influence Functions Do Not Seem to Predict Usefulness in NLP Transfer Learning. https://wp.nyu.edu/cilvr/2020/08/27.
Koh, P. W.; and Liang, P. 2017. Understanding Black-box Predictions via Influence Functions. In ICML.
Kreutzer, J.; Vilar, D.; and Sokolov, A. 2021. Bandits Don't Follow Rules: Balancing Multi-Facet Machine Translation with Multi-Armed Bandits. In EMNLP.
Kudo, T.; and Richardson, J. 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. In EMNLP.
Lapuschkin, S.; Wäldchen, S.; Binder, A.; Montavon, G.; Samek, W.; and Müller, K. 2019. Unmasking Clever Hans Predictors and Assessing What Machines Really Learn. CoRR, abs/1902.10178.
LeCun, Y.; Cortes, C.; and Burges, C. 1994. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
Li, J.; Monroe, W.; and Jurafsky, D. 2016. Understanding Neural Networks through Representation Erasure. CoRR, abs/1612.08220.
Pearlmutter, B. A. 1994. Fast Exact Multiplication by the Hessian. Neural Computation, 6: 147-160.
Pruthi, G.; Liu, F.; Sundararajan, M.; and Kale, S. 2020. Estimating Training Data Influence by Tracing Gradient Descent. In NeurIPS.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?" Explaining the Predictions of Any Classifier. In KDD.
Shazeer, N.; Cheng, Y.; Parmar, N.; Tran, D.; Vaswani, A.; Koanantakool, P.; Hawkins, P.; Lee, H.; et al. 2018. Mesh-TensorFlow: Deep Learning for Supercomputers. In NIPS.
Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In ICLR.
Trefethen, L. N.; and Bau, D. 1997. Numerical Linear Algebra. SIAM.
Uszkoreit, J.; Ponte, J.; Popat, A.; and Dubiner, M. 2010. Large Scale Parallel Document Mining for Machine Translation. In COLING.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All you Need. In NIPS.
Wojnowicz, M.; Cruz, B.; Zhao, X.; Wallace, B.; Wolff, M.; Luan, J.; and Crable, C. 2016. Influence Sketching: Finding Influential Samples in Large-Scale Regressions. In Big Data.
Woodruff, D. P. 2014. Sketching as a Tool for Numerical Linear Algebra. CoRR, abs/1411.4357.
Yeh, C.; Kim, J. S.; Yen, I. E.; and Ravikumar, P. 2018. Representer Point Selection for Explaining Deep Neural Networks. In NIPS.
Zhang, Y.; Riesa, J.; Gillick, D.; Bakalov, A.; Baldridge, J.; and Weiss, D. 2018. A Fast, Compact, Accurate Model for Language Identification of Code-Mixed Text. In EMNLP.