# Scaling Up Influence Functions

Andrea Schioppa, Polina Zablotskaia, David Vilar, Artem Sokolov
Google Research
{arischioppa, polinaz, vilar, artemsok}@google.com

We address the efficient calculation of influence functions for tracking predictions back to the training data. We propose and analyze a new approach to speeding up the inverse-Hessian calculation based on Arnoldi iteration. With this improvement, we achieve, to the best of our knowledge, the first successful implementation of influence functions that scales to full-size (language and vision) Transformer models with several hundred million parameters. We evaluate our approach on image classification and sequence-to-sequence tasks with tens to a hundred million training examples. Our code will be available at https://github.com/google-research/jax-influence.

1 Introduction

Recognizing that data has the highest agency in determining the performance of deep neural networks (DNNs), the pursuit of the state of the art has made datasets for training modern DNNs grow to sizes that can no longer be curated by humans. This has acutely aggravated data issues like noise and mislabeled examples: noise is characteristic of tasks where training data is crawled from the Web (e.g. machine translation) and where ground-truth labels are heuristically paired to inputs (Uszkoreit et al. 2010), leaving ample room for errors and inheriting the biases of the heuristic. Wrong labels can also be introduced by non-expert crowd annotators who, considering the amount of data to be labeled, are hard to incentivize for quality within available budgets (Bowman and Dahl 2021).

Given the above, a natural way to interpret and fix DNN models is to track their bad (or good) predictions down to the training examples that caused them (Cook and Weisberg 1980; Koh and Liang 2017; Yeh et al. 2018), and to take appropriate action on the found examples or annotation policies. Addressing this, Koh and Liang (2017) proposed influence functions (IFs) as a theoretically motivated method, grounded in robust statistics (Cook and Weisberg 1982), of quantifying the effect of training examples on predictions: for a query example z, IFs estimate the most influential example x in the training data D, in terms of the absolute change of the loss L if x were infinitesimally up-weighted in D, with

$$I_H(x, z) = \langle \nabla_\Theta L(z),\; H^{-1} \nabla_\Theta L(x) \rangle, \qquad (1)$$

where $H = \nabla^2_\Theta L$ is the Hessian of the model at parameters Θ.

The straightforward IF implementation, using the approximate Hessian-inversion procedure LISSA (Agarwal, Bullins, and Hazan 2017), has O(p) memory and O(r·p) time complexity, where r is the LISSA iteration count and p = |Θ|, incurred at every x. Besides the need for careful tuning of LISSA, the O(p) memory has been the major obstacle to deploying IFs for debugging application-relevant DNNs with hundreds of millions (or more) of training examples and model parameters; so noise or mislabeling issues remain unfixed or even undetected, and adversely impact predictions.

In this work, we focus on reducing the IF memory footprint by materializing neither O(p)-size gradients nor Hessians, and on decoupling the required number of Hessian estimations, O(r·|D|), from the training-data size. This allows us to parallelize the computation over larger batch sizes b and to scale to huge datasets and models.
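To make (1) concrete, here is a minimal sketch (not from our released code; all names are illustrative) that evaluates the influence score exactly on a toy model whose Hessian fits in memory. The explicit Hessian and its inversion below are precisely what become infeasible at the scale we target.

```python
import jax
import jax.numpy as jnp

def loss(theta, example):
    x, y = example
    return (jnp.dot(theta, x) - y) ** 2  # toy squared loss

theta = jnp.array([0.5, -1.0, 2.0])          # "converged" parameters
x_train = (jnp.array([1.0, 0.0, 1.0]), 1.0)  # training example x
z_query = (jnp.array([0.0, 1.0, 1.0]), 0.0)  # query example z

grad_fn = jax.grad(loss)
# Explicit p x p Hessian: only viable for tiny p. In (1), H is the Hessian of
# the full training loss; here it is taken on a single example for brevity,
# and a small damping term keeps the rank-deficient toy Hessian invertible.
H = jax.hessian(loss)(theta, x_train) + 1e-3 * jnp.eye(theta.shape[0])
# Eq. (1): <grad L(z), H^{-1} grad L(x)>
score = jnp.vdot(grad_fn(theta, z_query),
                 jnp.linalg.solve(H, grad_fn(theta, x_train)))
```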
Specifically, we use Arnoldi iteration (Arnoldi 1951) to find the dominant (in absolute value) eigenvalues of H and their orthonormal eigenvectors on a random data subset D', |D'| ≪ |D|, and then cheaply invert the diagonalized H, avoiding calls to LISSA as well as its convergence and stability issues. As H is Hermitian, (1) is symmetric w.r.t. x and z, so previous work cached {∇_Θ L(x)} to improve IF usability (Guo et al. 2021); however, this only spares one backward pass per x and requires re-estimating the product of the (unstorable) H^{-1} with ∇_Θ L(z) every time an influence on z is requested. The crux of our approach is to instead cache H in its trivially invertible diagonalized form, restricted to the small-dimensional subspace spanned by a few dominant eigenvectors, p' ≪ p. Hessian-gradient products are then reduced to simple scalar-gradient products, which do not need to be materialized in memory, as (1) can now be implemented with Jacobian-vector products. In summary, our approach renders repeated re-estimations of H^{-1}∇_Θ L(x) at every x unnecessary, replacing them with memory- and time-efficient forward-mode differentiation.

Empirically, IFs with Arnoldi iteration achieve speed-ups of 3-4 orders of magnitude over the LISSA-powered IFs (Koh and Liang 2017) and of 10x over TracIn (Pruthi et al. 2020), a heuristic gradient-only alternative to IFs (§5), with better or similar accuracy. With this improvement, we successfully evaluated IFs on full-size language and vision Transformer models (up to 300M parameters) in sequence-to-sequence and image classification tasks, respectively, with 100M (Paracrawl) and 14M (ImageNet) training examples.

Note that the standard conditions for (1) to be a correct influence estimate, i.e. a locally strictly convex $L \in C^2$ (Koh and Liang 2017), remain in place, and their fulfillment depends on the concrete task, network, training algorithm and its convergence status. The time/memory complexities also remain; however, our contribution improves the constants hidden in the O-notation, and thus permits IF evaluation on the full data/models that are relevant in applications, on standard memory-limited hardware. This opens the way for developers to make an informed decision on whether IFs are appropriate for their task, rather than resorting to heuristics from the start. This is encouraging, since the existing brute-force recipe of soothing the O(p) complexity by subsetting parameters, e.g. focusing on a few layers only (Chen et al. 2020), is prone to producing incorrect results (Feldman and Zhang 2020) (see also §5.2). On the other hand, running IFs on subsets of D to reduce runtime (Guo et al. 2021) may introduce unwanted biases, misattribute prediction failures, and would not be enough for identifying mislabeled examples (Pruthi et al. 2020) or hard examples requiring memorization (Feldman and Zhang 2020). Yet these approaches are compatible with our method and should result in compound speed-ups. We will open-source our implementation of Arnoldi iteration at https://github.com/google-research/jax-influence.

2 Related Work

Explaining DNN predictions falls under a broader interpretability umbrella, where the lingering complexity of the data-explainability approach historically made research focus on instance-based methods, which explain predictions in terms of task-specific structural units of inputs, e.g. pixels or tokens.
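The following sketch shows the scoring step this caching enables, assuming a flat parameter vector instead of a pytree and illustrative names (`loss`, `theta`, `G`, `eigvals`): given the approximate top eigenvalues and the matrix G of corresponding eigenvectors produced once by Arnoldi, each influence query needs only p' forward-mode JVPs and a diagonal rescaling, and never materializes an O(p) gradient.

```python
import jax
import jax.numpy as jnp

def project_grad(loss, theta, example, G):
    """Compute G @ grad(loss)(theta) without materializing the gradient:
    for a scalar loss, the JVP in direction g_i is the scalar <g_i, grad L>."""
    f = lambda t: loss(t, example)
    return jnp.stack([jax.jvp(f, (theta,), (g,))[1] for g in G])

def influence(loss, theta, x, z, G, eigvals):
    gx = project_grad(loss, theta, x, G)  # shape [p']
    gz = project_grad(loss, theta, z, G)  # shape [p']
    # H is diagonal in the cached eigenbasis: H^{-1} is elementwise 1/eigvals.
    return jnp.vdot(gz, gx / eigvals)
```

In practice one would batch the JVPs over examples; the point is that neither H^{-1} nor any O(p) gradient ever appears explicitly at query time.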
Rich literature offers different instantiations of this idea: gradient-based saliency maps (Simonyan, Vedaldi, and Zisserman 2013), input perturbations (Li, Monroe, and Jurafsky 2016), or LIME (Ribeiro, Singh, and Guestrin 2016), which fits a linear model in the input's neighborhood. However, being limited to specific inputs, their insights are rarely actionable for system developers. And while it is possible to repurpose them for data explainability, e.g. via clustering of saliency maps (Lapuschkin et al. 2019), this solves a more difficult task than necessary, introduces new hyperparameters (incl. the saliency method itself) and relies on human experts to make sense of the clusters. In contrast to instance explanations in the form of token-level heatmaps, IFs provide a method for tracing model predictions back to training examples.

Existing approaches to reducing IF runtime mostly address salient problem axes (dimensionality of active parameters, cardinality of the data subset, or number of iterations) without addressing the procedure itself; or they drop the theoretical foundations and use heuristics simpler than IFs: e.g. Pruthi et al. (2020) reduce influence to tracking loss changes with cumulative (over training model checkpoints, i.e. model snapshots) dot products of gradients, and Yeh et al. (2018) leverage kernel functions evaluated at the training samples to explain inference decisions. Mathematically, the closest to our work is (Ghorbani, Krishnan, and Xiao 2019), who use a specialization of Arnoldi iteration to Hermitian matrices (Lanczos iteration) to study the dynamics of the spectra of entire Hessians at different snapshots during training. Because of this different goal, they use full-batch Hessians (i.e. computed on the full dataset), while we spread the Hessian approximation across smaller batches that do not cover the full D. As a result, we can work with larger models and datasets, e.g. ResNet50/ViT vs. ResNet18, and at greater speed (while simultaneously raising the number of Lanczos/Arnoldi iterations from 90 to 200 to increase precision).

3 Influence and Influence Functions

The true influence of x on z, $I_{\mathrm{true}}(x, z)$, is defined as the change in the loss L at z between having learned the model parameters Θ without and with x in the training data (Cook and Weisberg 1980):

$$I_{\mathrm{true}}(x, z) = L(z \mid \Theta : x \notin D) - L(z \mid \Theta : x \in D).$$

Explicit calculation of $I_{\mathrm{true}}(x, z)$, by removing every x and retraining, is infeasible for large D, and several approximation techniques have been proposed. Feldman (2020) and Feldman and Zhang (2020) propose to train multiple models on randomly selected data subsets while tracking, for each x, the subsets it belonged to; this way, one can obtain an unbiased estimator $I_{\mathrm{mem}}(x, x)$ of $I_{\mathrm{true}}(x, x)$, which, however, requires a substantial number of model retrainings (up to thousands to satisfy theoretical guarantees (Feldman 2020)).

Koh and Liang (2017) advocated the use of the IFs in (1), which approximate the loss change after an infinitesimal up-weighting of x in D. For models used in practice, H cannot be materialized in memory, let alone inverted by standard linear algebra. However, for a fixed vector v, the Hessian-vector product (HVP), Hv, can be computed in O(b·p) time and memory (Pearlmutter 1994), where the batch size b defines the number of training examples on which H (of an implicitly given loss L) is approximated. The HVP is commonly implemented in modern autodiff toolkits (Baydin et al. 2018) as reverse-mode differentiation (a gradient computation) followed by a forward-mode Jacobian-vector product (JVP). Repeated HVP calls are the workhorse of the iterative procedure LISSA (Agarwal, Bullins, and Hazan 2017), used by Koh and Liang (2017), which estimates the inverse HVP as

$$H_r^{-1} v = v + (I - H) H_{r-1}^{-1} v,$$

where H is approximated on random batches and v is a gradient. Even for small r, the procedure is both time- and memory-expensive, as the O(p) memory of the HVP on an explicitly instantiated v forces one to estimate H on a single sampled training point per iteration (b = 1), impacting accuracy. Moreover, the total O(r·b·p) time complexity is incurred at every x whose influence on z we are interested in.
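As a reference point for the rest of the paper, here is a minimal sketch of both primitives, assuming a scalar loss over a flat parameter vector (real models use pytrees; all names are illustrative, not our released code). `hvp` is the standard forward-over-reverse construction; `lissa` implements the recursion above with H re-estimated on a fresh batch per iteration.

```python
import jax

def hvp(loss, theta, batch, v):
    # Reverse mode for the gradient, then one forward-mode JVP of the
    # gradient function in direction v: computes Hv without materializing H.
    g = lambda t: jax.grad(loss)(t, batch)
    return jax.jvp(g, (theta,), (v,))[1]

def lissa(loss, theta, batches, v, scale=1.0):
    # `scale` must upper-bound the top Hessian eigenvalue so that
    # I - H/scale is a contraction; tuning it (and the iteration count)
    # is the delicate part this paper seeks to avoid.
    estimate = v
    for batch in batches:  # one (often single-example) batch per step
        estimate = v + estimate - hvp(loss, theta, batch, estimate) / scale
    return estimate / scale  # approximates H^{-1} v
```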
Evaluating influence methods. A practical problem with influence-based explainability is the absence of a ground truth to verify that a method produces correct results. In this paper we use two proxies, following the assumption that xs with high self-influence $I_H(x, x)$ correspond to data outliers (Koh and Liang 2017; Pruthi et al. 2020): we either introduce a known synthetic data corruption and check for its correct retrieval by a method, or filter high-influence points out and measure the change in downstream task metrics (Guo et al. 2021; Kocijan and Bowman 2020). Using $I_H(x, x)$ as a retrieval score for corrupted data, we measure the retrieval quality as the areas under the ROC and precision-recall curves, respectively denoted as the Area Under the Curve (AUC) and the Average Precision (AP).

4 Scaling Influence Functions

From the discussion above, the O(p) memory complexity is the major bottleneck for an efficient implementation of IFs. We start with an overview of existing techniques, showing their commonalities. Note that all are compatible with the approach we propose.

Caching. For data-interpretability purposes one might consider limiting (1) to a sufficiently promising subset D' ⊆ D; e.g. Guo et al. (2021) define D' as the top-k ℓ2-neighbors of z in D. Besides more hyperparameters, this still requires computing $H^{-1}\nabla_\Theta L(z)$ for every z and, as discussed above, reducing D would not be enough for applications that require computing $I_H(x, z)$ on all training x. As H is symmetric, one could swap x and z and cache $H^{-1}\nabla_\Theta L(x)$ instead, bringing down the query complexity (only $\nabla_\Theta L(z)$ has to be computed then), but this would just shift the computational burden to building the search index over $H^{-1}\nabla_\Theta L(x)$.

Restricting parameters. Reducing the required memory is possible by naively limiting the computation to a smaller subset of parameters of cardinality p', e.g. selecting one or a few layer(s); usually, the last layer is selected (Koh and Liang 2017). This has two drawbacks: the choice of layers becomes a hyperparameter and the viable values of p' depend on the model architecture; and, as Feldman and Zhang (2020, §3.6) show, using just one layer can result in influence estimates that differ from those of the full model.

Random projections. For simplification one might assume H = I and reduce influence estimates to dot products of gradients. To account for multiple layers at once and to get finer-grained control of p', we consider a simple baseline, RandSelect, which randomly selects p' parameters Θ' ⊆ Θ and computes influence using the final checkpoint and gradients with respect to Θ'; it can be combined with layer selection. The RandSelect estimator can be equivalently expressed as

$$I_G(x, z) = \langle G \nabla_\Theta L(x),\; G \nabla_\Theta L(z) \rangle, \qquad (2)$$

where $G \in \mathbb{R}^{p' \times p}$ is a row-selection matrix picking the gradient's components corresponding to Θ'.
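A hedged sketch of the RandSelect estimator (2), with illustrative names and flat parameters assumed: since G is a 0/1 row-selection matrix, G∇_Θ L is just indexing into a materialized gradient, which is why RandSelect still pays for full back-propagation.

```python
import jax
import jax.numpy as jnp

def rand_select_influence(loss, theta, x, z, idx):
    """idx holds the p' randomly chosen parameter indices (the rows of G)."""
    gx = jax.grad(loss)(theta, x)[idx]  # G grad L(x): full backprop, then select
    gz = jax.grad(loss)(theta, z)[idx]  # G grad L(z)
    return jnp.vdot(gz, gx)

# idx would be drawn once and reused, e.g.:
# idx = jax.random.choice(jax.random.PRNGKey(0), p, shape=(p_prime,), replace=False)
```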
We also use another sketching (Woodruff 2014) baseline, RandProj, initially proposed by Wojnowicz et al. (2016) for generalized linear models: for a random Gaussian projection matrix G, $E[G^T G] = I$, which leads to an unbiased estimate in (2). Since normally p' ≪ p, it allows a memory-efficient implementation in forward mode: one just estimates p' JVPs with the rows of G, which avoids materializing O(p)-size gradients. This has a lower memory footprint than RandSelect, which requires back-propagation as its p' (e.g. one layer) is still of the same order as p (see §5.2).

Tracing updates. A development of the H = I idea is proposed in (Pruthi et al. 2020), where the influence approximation works when one can trace changes to the loss L across all gradient steps. To make this feasible, they propose the TracIn estimator, defined on a subset of gradient steps:

$$I_{\mathrm{TracIn}}(x, z) = \frac{1}{C} \sum_{i=1}^{C} \langle \nabla_{\Theta_i} L(x),\; \nabla_{\Theta_i} L(z) \rangle,$$

where {Θ_i} is a set of C checkpoints. Note that the complexity of TracIn is C times that of exact gradient similarity and, as discussed in (Pruthi et al. 2020), care needs to be taken in selecting the checkpoints. Another practical obstacle to TracIn is that, when analyzing publicly released models, usually only the final model checkpoint is provided.

Compatible projections. RandProj assumes the full-dimension Hessian, H = I; if we drop this requirement, we might consider H restricted to the subspace S_G that is the image of G, and work with $G H G^T$ instead of the larger H. However, S_G is in general not H-invariant, which can lead to approximation errors: when H is applied to a vector v ∈ S_G, the result might have non-negligible components orthogonal to S_G. We will see an example of this later in the experiments on eigenvalue retrieval for MNIST (Figure 1), where RandProj requires a considerably larger p' than Arnoldi to retrieve the top-p' eigenvalues of H.

Our approach. We propose to use the standard technique of building an approximately H-invariant subspace by selecting an arbitrary (e.g. random) vector v ∈ R^p and constructing the n-th order Krylov subspace:

$$K_n(H; v) = \mathrm{span}\{v, Hv, H^2 v, \dots, H^n v\}.$$

The Arnoldi iteration (Arnoldi 1951) additionally builds an orthonormal basis for $K_n(H; v)$, so that the diagonalization of the restriction $\tilde{H}$ of H to $K_n(H; v)$ yields an approximation of the largest (in absolute value) eigenvalues of H and of the corresponding eigenvectors (Trefethen and Bau 1997, Ch. 33-34). Assuming n is large enough to estimate the largest p' eigenvalues, in summary we obtain a projection matrix G and work with the much smaller matrix $\tilde{H} = G H G^T$. We will call this algorithm Arnoldi, with the pseudocode in Algorithm 1.

Algorithm 1: Arnoldi
1: procedure ARNOLDI(v, n) ▷ build an orthonormal basis of the Krylov subspace $K_n(H; v)$
2:   $w_0 \leftarrow v / \|v\|_2$
3:   $A_{l,m} \leftarrow 0$ for $0 \le l \le n$ and $0 \le m < n$
4:   for $i \in 1, \dots, n$ do
5:     $w_i \leftarrow H w_{i-1}$ ▷ HVP in fwd-over-rev mode
6:     $A_{j,i-1} \leftarrow \langle w_i, w_j \rangle$ for $j < i$
7:     $w_i \leftarrow w_i - \sum_{j<i} A_{j,i-1} w_j$ ▷ project out previous directions
8:     $A_{i,i-1} \leftarrow \|w_i\|_2$
9:     $w_i \leftarrow w_i / A_{i,i-1}$ ▷ normalize
10:  return A, {w_i}

The common feature of RandProj and Arnoldi is that, instead of working with the full gradient $\nabla_\Theta L(x)$, one takes the JVPs $\langle g_i, \nabla_\Theta L(x) \rangle$ with respect to the rows $g_i$ of G. The implementation then becomes considerably more efficient, as it can be done in forward-mode differentiation and on larger batches. Moreover, in the case of Arnoldi the matrix H is replaced with the now diagonal $\tilde{H}$, simplifying the matrix inversion appearing in the definition of $I_H$ and dispensing with the expensive LISSA procedure.
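The sketch below renders Algorithm 1 as runnable NumPy (illustrative, not our released implementation): it assumes flat parameters and an `hvp_fn` callable like the one sketched in Section 3, builds the orthonormal Krylov basis, diagonalizes the restriction of H (symmetrized, since H is Hermitian up to batching noise), and returns the dominant eigenvalue estimates together with the projection matrix G whose rows play the role of the $g_i$ above.

```python
import numpy as np

def arnoldi(hvp_fn, v, n, top_k):
    p = v.shape[0]
    W = np.zeros((n + 1, p))   # orthonormal basis vectors w_0..w_n
    A = np.zeros((n + 1, n))   # projection coefficients (Hessenberg matrix)
    W[0] = v / np.linalg.norm(v)
    for i in range(1, n + 1):
        w = np.asarray(hvp_fn(W[i - 1]))      # one HVP per iteration
        for j in range(i):                    # project out previous directions
            A[j, i - 1] = np.dot(w, W[j])
            w -= A[j, i - 1] * W[j]
        A[i, i - 1] = np.linalg.norm(w)
        if A[i, i - 1] < 1e-8:                # exact invariant subspace found
            n = i
            break
        W[i] = w / A[i, i - 1]
    # Restriction of H to K_n is the upper-left n x n block; symmetrize,
    # treating deviations as numerical/batching noise on a Hermitian H.
    T = 0.5 * (A[:n, :n] + A[:n, :n].T)
    evals, evecs = np.linalg.eigh(T)
    order = np.argsort(np.abs(evals))[::-1][:top_k]  # dominant |eigenvalues|
    G = evecs[:, order].T @ W[:n]             # rows: approx. eigenvectors in R^p
    return evals[order], G
```

On a small problem the output can be sanity-checked against `np.linalg.eigh` of the explicit Hessian; in our setting each `hvp_fn` call is itself estimated on a random batch, which is what spreads the Hessian approximation across D'.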
Error analysis. It remains to analyze the effect of using the top-p' eigenvalues in Arnoldi. Recall that Koh and Liang (2017) derive (1) by minimizing the quadratic form

$$Q(\theta) = \frac{1}{2}\langle \theta, H\theta \rangle - \frac{1}{N}\langle \nabla_\Theta L(x|\Theta_0), \theta \rangle,$$

where Θ0 are the parameters at convergence. Ordering the eigenvalues of H at Θ0 as $|\lambda_1| \ge |\lambda_2| \ge \cdots$ and letting $e_1, e_2, \dots$ be the corresponding eigenvectors, Ghorbani, Krishnan, and Xiao (2019) empirically observe (and prove in the quadratic case) that gradient updates align with the subspace of H corresponding to the dominant λs. We provide two additional arguments in the same direction: we upper-bound the error of approximating Q using such a subspace in Lemma 1, and discuss the effect of noise in H and of the size of $\lambda_k$ on applying $H^{-1}$ to a vector in Lemma 2 (with proofs in Appendix A). Let $Q_k$ be the form Q restricted to the H-subspace spanned by the top-k λs. We show that, as k increases, $Q_k$ approximates Q better, and that errors in the directions of the $e_k$ corresponding to smaller $|\lambda_k|$ matter less:¹

Lemma 1. $Q_k$ approximates Q with an error bounded by

$$0 \le Q(\theta) - Q_k(\theta) \le \tfrac{1}{2}\,|\lambda_{k+1}|\,\|\theta\|_2^2.$$

Further, if minimizing Q introduces an error ε in the direction of $e_{k+1}$, yielding an estimate θ' of the minimizer θ*, then $Q(\theta') - Q(\theta^*) = \tfrac{\varepsilon^2}{2}\,|\lambda_{k+1}|$.

Another way of looking at the same phenomenon is to consider the variance of the estimated influence as a function of $|\lambda_k|$. Consider the computation of $y = H^{-1}u$, where the vector u is known exactly. Assume also that the estimation of H is noisy, yielding $H + \delta H$, that $E[\delta H] = 0$, and that δH is isotropic (i.e. it does not preferentially align with any $e_k$ nor co-vary with $\lambda_k$). Then the variance of the estimator $\hat{y} = (H + \delta H)^{-1} u$ in the direction of $e_k$ is proportional to $|\lambda_k|^{-2}$ (to first order, $\langle \hat{y}, e_k \rangle \approx \langle y, e_k \rangle - \lambda_k^{-1} \langle \delta H\, e_k, y \rangle$, using the symmetry of δH):

Lemma 2. The variance of ŷ in the direction of $e_k$ satisfies

$$\mathrm{Var}(\langle \hat{y}, e_k \rangle) \approx \frac{1}{|\lambda_k|^2}\,\mathrm{Var}(\langle \delta H\, e_k,\; y \rangle).$$

¹One might attempt Arnoldi on $H^{-1}_{\Theta_0}$ to obtain an approximation directly in the subspace of the top-k eigenvalues of $H^{-1}_{\Theta_0}$. We found this approach, however, to be less performant (see Appendix A.1).

5 Experiments

5.1 Small Model & Data Scale: Digit Recognition

In this subsection, to be able to compare all baselines, we pick the small MNIST dataset (LeCun, Cortes, and Burges 1994) and consider two CNNs of different sizes: a small one that permits exact Hessian calculation, and a larger one on which we can gauge the scalability potential. Because influence calculation with LISSA and TracIn is slow, following (Koh and Liang 2017) we take two 10% subsamples of the original data for training and evaluation, and randomly relabel 20% of the training examples to create a corrupted dataset on which to evaluate the retrieval of mislabeled examples with influence estimates. Unlike (Koh and Liang 2017; Pruthi et al. 2020), we introduce the noise before training the models; by design, a perfect model on correctly labeled data would achieve only 80% accuracy on our eval set.

Small network. We re-implemented the small convolutional network with smooth non-linearities from (Koh and Liang 2017).

We mislabeled 10% of the test examples and compared their retrieval by Arnoldi and RandProj, accumulating parameters from the top 10%, 20% and the full model, in Appendix D.2, Table 6. Arnoldi wins, but the gap to RandProj is small.

⁵https://github.com/google/flax/tree/main/examples/imagenet
⁶Even though the model parameter count is about 50M, half of the parameters are batch statistics, so we treat p as 25M.
⁷https://github.com/google-research/vision_transformer
⁸Among the top-100 λs, 7% of the mass belongs to negative λs.

Figure 3: Runtime of Arnoldi for n = 200 iterations on the full set of parameters of the respective networks.
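The corrupted-data retrieval protocol of Section 3, used throughout these experiments, is straightforward to express; a minimal sketch (placeholder data, assuming scikit-learn is available; `self_influence_scores` stands for any estimator, e.g. the Arnoldi scorer sketched earlier):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def retrieval_metrics(self_influence_scores, is_corrupted):
    """Score each training example by I(x, x) and measure how well the
    scores rank the examples whose labels were flipped."""
    return {"AUC": roc_auc_score(is_corrupted, self_influence_scores),
            "AP": average_precision_score(is_corrupted, self_influence_scores)}

# Placeholder usage with random scores and a 20% corruption rate as in Sec. 5.1:
rng = np.random.default_rng(0)
print(retrieval_metrics(rng.normal(size=1000), rng.random(1000) < 0.2))
```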
This perhaps indicates that accounting for local curvature is superfluous for this particular self-influence benchmark and model. As demonstrated by Koh and Liang (2017, §2.2), this may not be the case in general, and for other IF-based applications our contribution enables verifying whether accounting for Hessians is required. Unlike in the machine translation case, increasing the number of parameters leads here to a slight decrease in performance, suggesting that one may restrict IFs to the top layers in the fine-tuning setting. Another reason for the near-matching performance could be the increased fragility of IFs for large models (Basu, Pope, and Feizi 2021), also called out for natural language inference and the RoBERTa model by Kocijan and Bowman (2020), where performance dropped after retraining on either high- or low-influence (w.r.t. a validation set) examples. In Appendix D.4 we also investigate the memorization vs. generalization trade-off of removing high or low self-influence images on ImageNet. In Appendix D.3, for the whole ImageNet and the full ResNet50 model, we picture the most self-influential images and the most influential training images retrieved for a test point.

6 Conclusion

We proposed a new way of calculating the influence scores of (Koh and Liang 2017) for large DNNs by approximately diagonalizing their Hessians, avoiding re-estimating them on every training example. We demonstrated finding influential or noisy examples in datasets of up to 100M training examples and models with up to 300M parameters.

Acknowledgements

We thank Behrooz Ghorbani and Mukund Sundararajan for their valuable feedback on the paper.

References

Agarwal, N.; Bullins, B.; and Hazan, E. 2017. Second Order Stochastic Optimization for Machine Learning in Linear Time. JMLR, 18(116): 1-40.
Arnoldi, W. E. 1951. The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quarterly of Applied Mathematics, 9(1): 17-29.
Barshan, E.; Brunet, M.; and Dziugaite, G. K. 2020. RelatIF: Identifying Explanatory Training Samples via Relative Influence. In AISTATS.
Basu, S.; Pope, P.; and Feizi, S. 2021. Influence Functions in Deep Learning Are Fragile. In ICLR.
Baydin, A. G.; Pearlmutter, B. A.; Radul, A. A.; and Siskind, J. M. 2018. Automatic differentiation in machine learning: a survey. JMLR, 18(153): 1-43.
Bowman, S. R.; and Dahl, G. 2021. What Will it Take to Fix Benchmarking in Natural Language Understanding? In NAACL.
Chen, H.; Si, S.; Li, Y.; Chelba, C.; Kumar, S.; Boning, D.; and Hsieh, C.-J. 2020. Multi-Stage Influence Function. In NeurIPS.
Chen, X.; Hsieh, C.-J.; and Gong, B. 2021. When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations. CoRR, abs/2106.01548.
Cook, R. D.; and Weisberg, S. 1980. Characterizations of an Empirical Influence Function for Detecting Influential Cases in Regression. Technometrics, 22(4): 495-508.
Cook, R. D.; and Weisberg, S. 1982. Residuals and Influence in Regression. Chapman and Hall.
Deng, Y.; Cheng, S.; Lu, J.; Song, K.; Wang, J.; Wu, S.; Yao, L.; Zhang, G.; et al. 2018. Alibaba's Neural Machine Translation Systems for WMT18. In WMT.
Feldman, V. 2020. Does Learning Require Memorization? A Short Tale about a Long Tail. In STOC.
Feldman, V.; and Zhang, C. 2020. What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation. In NeurIPS.
Foret, P.; Kleiner, A.; Mobahi, H.; and Neyshabur, B. 2021. Sharpness-aware Minimization for Efficiently Improving Generalization. In ICLR.
Ghorbani, B.; Krishnan, S.; and Xiao, Y. 2019. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density. In ICML.
Guo, H.; Rajani, N. F.; Hase, P.; Bansal, M.; and Xiong, C. 2021. FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging. In EMNLP.
Gwinnup, J.; Anderson, T.; Erdmann, G.; and Young, K. 2018. The AFRL WMT18 Systems: Ensembling, Continuation and Combination. In WMT.
Han, X.; Wallace, B. C.; and Tsvetkov, Y. 2020. Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions. In ACL.
Junczys-Dowmunt, M. 2018. Microsoft's Submission to the WMT2018 News Translation Task: How I Learned to Stop Worrying and Love the Data. In WMT.
Kocijan, V.; and Bowman, S. 2020. Influence Functions Do Not Seem to Predict Usefulness in NLP Transfer Learning. https://wp.nyu.edu/cilvr/2020/08/27.
Koh, P. W.; and Liang, P. 2017. Understanding Black-box Predictions via Influence Functions. In ICML.
Kreutzer, J.; Vilar, D.; and Sokolov, A. 2021. Bandits Don't Follow Rules: Balancing Multi-Facet Machine Translation with Multi-Armed Bandits. In EMNLP.
Kudo, T.; and Richardson, J. 2018. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. In EMNLP.
Lapuschkin, S.; Wäldchen, S.; Binder, A.; Montavon, G.; Samek, W.; and Müller, K. 2019. Unmasking Clever Hans Predictors and Assessing What Machines Really Learn. CoRR, abs/1902.10178.
LeCun, Y.; Cortes, C.; and Burges, C. 1994. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
Li, J.; Monroe, W.; and Jurafsky, D. 2016. Understanding Neural Networks through Representation Erasure. CoRR, abs/1612.08220.
Pearlmutter, B. A. 1994. Fast Exact Multiplication by the Hessian. Neural Computation, 6: 147-160.
Pruthi, G.; Liu, F.; Sundararajan, M.; and Kale, S. 2020. Estimating Training Data Influence by Tracing Gradient Descent. In NeurIPS.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?" Explaining the Predictions of Any Classifier. In KDD.
Shazeer, N.; Cheng, Y.; Parmar, N.; Tran, D.; Vaswani, A.; Koanantakool, P.; Hawkins, P.; Lee, H.; et al. 2018. Mesh-TensorFlow: Deep Learning for Supercomputers. In NIPS.
Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In ICLR.
Trefethen, L. N.; and Bau, D. 1997. Numerical Linear Algebra. SIAM.
Uszkoreit, J.; Ponte, J.; Popat, A.; and Dubiner, M. 2010. Large Scale Parallel Document Mining for Machine Translation. In COLING.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All you Need. In NIPS.
Wojnowicz, M.; Cruz, B.; Zhao, X.; Wallace, B.; Wolff, M.; Luan, J.; and Crable, C. 2016. Influence Sketching: Finding Influential Samples in Large-Scale Regressions. In Big Data.
Woodruff, D. P. 2014. Sketching as a Tool for Numerical Linear Algebra. CoRR, abs/1411.4357.
Yeh, C.; Kim, J. S.; Yen, I. E.; and Ravikumar, P. 2018. Representer Point Selection for Explaining Deep Neural Networks. In NIPS.
Zhang, Y.; Riesa, J.; Gillick, D.; Bakalov, A.; Baldridge, J.; and Weiss, D. 2018. A Fast, Compact, Accurate Model for Language Identification of Code-Mixed Text. In EMNLP.