FIT: A METRIC FOR MODEL SENSITIVITY

Ben Zandonati, University of Cambridge, baz23@cam.ac.uk
Adrian Alan Pol, Princeton University, ap6964@princeton.edu
Maurizio Pierini, CERN, maurizio.pierini@cern.ch
Olya Sirkin, CEVA Inc., sirkinolya@gmail.com
Tal Kopetz, CEVA Inc., tal.kopetz@ceva-dsp.com

ABSTRACT

Model compression is vital to the deployment of deep learning on edge devices. Low precision representations, achieved via quantization of weights and activations, can reduce inference time and memory requirements. However, quantifying and predicting the response of a model to the changes associated with this procedure remains challenging. This response is non-linear and heterogeneous throughout the network. Understanding which groups of parameters and activations are more sensitive to quantization than others is a critical stage in maximizing efficiency. For this purpose, we propose FIT. Motivated by an information geometric perspective, FIT combines the Fisher information with a model of quantization. We find that FIT can estimate the final performance of a network without retraining. FIT effectively fuses contributions from both parameter and activation quantization into a single metric. Additionally, FIT is fast to compute when compared to existing methods, demonstrating favourable convergence properties. These properties are validated experimentally across hundreds of quantization configurations, with a focus on layer-wise mixed-precision quantization.

1 INTRODUCTION

The computational costs and memory footprints associated with deep neural networks (DNNs) hamper their deployment to resource-constrained environments such as mobile devices (Ignatov et al., 2018), self-driving cars (Liu et al., 2019), or high-energy physics experiments (Coelho et al., 2021). Latency, storage and even environmental limitations directly conflict with the current machine learning regime of performance improvement through scale. For deep learning practitioners, adhering to these strict requirements whilst implementing state-of-the-art solutions is a constant challenge. As a result, compression methods such as quantization (Gray & Neuhoff, 1998) and pruning (Janowsky, 1989) have become essential stages in deployment on the edge. In this paper, we focus on quantization.

Quantization refers to the use of lower-precision representations for values within the network, such as weights, activations and even gradients. This could, for example, involve reducing values stored in 32-bit floating point (FP) single precision to INT-8/4/2 integer precision (IP). This reduces the memory requirements whilst allowing models to meet strict latency and energy consumption criteria on high-performance hardware such as FPGAs. Despite these benefits, there is a trade-off associated with quantization. As full precision computation is approximated with less precise representations, the model often incurs a drop in performance. In practice, this trade-off is worthwhile for resource-constrained applications. However, the DNN performance degradation associated with quantization can become unacceptable under aggressive schemes, where post-training quantization (PTQ) to 8 bits and below is applied to the whole network (Jacob et al., 2018). Quantization Aware Training (QAT) (Jacob et al., 2018) is often used to recover lost performance. However, even after QAT, aggressive quantization may still result in a large performance drop.
The model performance is limited by sub-optimal quantization schemes. It is known that different layers, within different architectures, respond differently to quantization (Wu et al., 2018). Just as more detailed regions of an image are more challenging to compress, so too are certain groups of parameters. As is shown clearly by Wu et al. (2018), uniform bit-width schemes fail to capture this heterogeneity. Mixed-precision quantization (MPQ), where each layer within the network is assigned a different precision, allows us to push the performance-compression trade-off to the limit. However, determining which bit widths to assign to each layer is non-trivial. Furthermore, the search space of quantization configurations is exponential in the number of layers and activations. Existing methods employ techniques such as neural architecture search and deep reinforcement learning, which are computationally expensive and less general. Methods which aim to explicitly capture the sensitivity (or importance) of layers within the network present improved performance and reduced complexity. In particular, previous works employ the Hessian, taking the loss landscape curvature as sensitivity and achieving state-of-the-art compression. Even so, many explicit methods are slow to compute, grounded in intuition, and fail to include activation quantization. Furthermore, previous works determine performance based on only a handful of configurations. Further elaboration is presented in Section 2.

The Fisher Information and the Hessian are closely related. In particular, many previous works in optimisation present the Fisher Information as an alternative to the Hessian. In this paper, we use the Fisher Information Trace as a means of capturing the network dynamics. We obtain our final FIT metric, which includes a quantization model, through a general proof in Section 3 grounded within the field of information geometry. The layer-wise form of FIT closely resembles that of Hessian Aware Quantization (HAWQ), presented by Dong et al. (2020). Our contributions in this work are as follows:

1. We introduce the Fisher Information Trace (FIT) metric, to determine the effects of quantization. To the best of our knowledge, this is the first application of the Fisher Information to generate MPQ configurations and predict final model performance. We show that FIT demonstrates improved convergence properties, is faster to compute than alternative metrics, and can be used to predict final model performance after quantization.
2. The sensitivity of parameters and activations to quantization is combined within FIT as a single metric. We show that this consistently improves performance.
3. We introduce a rank correlation evaluation procedure for mixed-precision quantization, which yields more significant results with which to inform practitioners.

2 PREVIOUS WORK

In this section, we primarily focus on mixed-precision quantization (MPQ), and also give context to the information geometric perspective and the Hessian.

Mixed Precision Quantization. As noted in Section 1, the search space of possible quantization configurations, i.e. the bit setting for each layer and/or activation, is exponential in the number of layers: $O(|B|^{2L})$, where $B$ is the set of bit precisions and $L$ the number of layers. Tackling this large search space has proved challenging; however, recent works have made headway in improving the state-of-the-art.
CW-HAWQ (Qian et al., 2020), AutoQ (Lou et al., 2019) and HAQ (Wang et al., 2019) deploy Deep Reinforcement Learning (DRL) to automatically determine the required quantization configuration, given a set of constraints (e.g. accuracy, latency or size). AutoQ improves upon HAQ by employing a hierarchical agent with a hardware-aware objective function. CW-HAWQ seeks further improvements by reducing the search space with explicit second-order information, as outlined by Dong et al. (2020). The search space is also often explored using Neural Architecture Search (NAS). For instance, Wu et al. (2018) obtain 10-20x model compression with little to no accuracy degradation. Unfortunately, both the DRL and NAS approaches suffer from large computational resource requirements. As a result, evaluation is only possible on a small number of configurations. These methods explore the search space of possible model configurations without explicitly capturing the dynamics of the network. Instead, this is learned implicitly, which restricts generalisation.

More recent works have successfully reduced the search space of model configurations through explicit methods, which capture the relative sensitivity of layers to quantization. The bit-width assignment is based on this sensitivity. The eigenvalues of the Hessian matrix yield an analogous heuristic to the local curvature. Higher local curvature indicates higher sensitivity to parameter perturbation, as would result from quantization to a lower bit precision. Choi et al. (2016) exploit this to inform bit-precision configurations. Popularised by Dong et al. (2020), HAWQ presents the following metric, where a trace-based method is combined with a measure of quantization error:

$\sum_{l=1}^{L} \overline{\mathrm{Tr}}(H_l)\,\|Q(\theta_l) - \theta_l\|_2^2$.

Here, $\theta_l$ and $Q(\theta_l)$ represent the full-precision and quantized model parameters respectively, for each block $l$ of the model, and $\overline{\mathrm{Tr}}(H_l)$ denotes the parameter-normalised Hessian trace. HAWQ-V1 (Dong et al., 2019) used the top eigenvalue as an estimate of block sensitivity. However, this proved less effective than using the trace as in HAWQ-V2 (Dong et al., 2020). The ordering of quantization depth established over the set of network blocks reduces the search space of possible model configurations. The Pareto front associated with the trade-off between sensitivity and size is used to quickly determine the best MPQ configuration for a given set of constraints. HAWQ-V3 (Yao et al., 2021) involves integer linear programming to determine the quantization configuration. Although Dong et al. (2020) discusses activation quantization, it is considered separately from weight quantization. In addition, trace computation can become very expensive for large networks. This is especially the case for activation traces, which require large amounts of GPU memory. Furthermore, only a handful of configurations are analysed. Other effective heuristics have been proposed, such as the batch normalisation scaling parameter γ (Chen et al., 2021) and quantization scaling parameters. For these intuition-grounded heuristics, it is more challenging to assess their generality.

More complex methods of obtaining MPQ configurations exist. Kundu et al. (2021) employ straight-through estimators of the gradient (Hubara et al., 2016) with respect to the bit-setting parameter. Adjacent work closes the gap between final accuracy and quantization configuration. In Liu et al. (2021), a classifier after each layer is used to estimate the contribution to accuracy.
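To make the trace-weighted form of these sensitivity metrics concrete, the sketch below shows one way such a score could be evaluated in PyTorch. The helper names (`uniform_quantize`, `trace_weighted_sensitivity`) and the min-max quantizer are our own illustrative choices, and the per-block traces are assumed to have been estimated beforehand (e.g. via Hutchinson's method, or the EF estimator introduced later); this is not the reference implementation of any of the cited works.

```python
import torch

def uniform_quantize(theta: torch.Tensor, bits: int) -> torch.Tensor:
    # Simulated uniform (min-max) quantization of a weight tensor.
    t_min, t_max = theta.min(), theta.max()
    delta = (t_max - t_min) / (2 ** bits - 1)
    return torch.round((theta - t_min) / delta) * delta + t_min

def trace_weighted_sensitivity(weights, traces, bit_config):
    # sum_l (parameter-normalised trace_l) * ||Q(theta_l) - theta_l||^2
    total = 0.0
    for theta_l, trace_l, b_l in zip(weights, traces, bit_config):
        perturbation = uniform_quantize(theta_l, b_l) - theta_l
        total += (trace_l / theta_l.numel()) * perturbation.pow(2).sum().item()
    return total

# Example with two random "blocks", hypothetical precomputed traces, and an 8/4-bit configuration.
weights = [torch.randn(64, 64), torch.randn(128, 64)]
traces = [25.0, 4.0]
print(trace_weighted_sensitivity(weights, traces, bit_config=[8, 4]))
```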
Previous works within the field of quantization have used the Fisher Information for adjacent tasks. Kadambi (2020) uses it as a regularisation method to reduce quantization error, whilst it is employed by Tu et al. (2016) as a method for computing importance rankings for blocks/parameters. In addition, Li et al. (2021) use the Fisher Information as part of a block/layer-wise reconstruction loss during post-training quantization.

Connections to the loss landscape perspective. Gradient preconditioning using the Fisher Information Metric (FIM) - the Natural Gradient - acts to normalise the distortion of the loss landscape. This information is extracted via the eigenvalues, which are characterised by the trace. This is closely related to the Hessian matrix (and Newton's method). The two coincide, provided that the model has converged to the global minimum $\theta^*$ and specific regularity conditions are upheld (Amari, 2016) (see Appendix G.3). The Hessian matrix H is derived via the second-order expansion of the loss function at a minimum. Using it as a method for determining the effects of perturbations is common. The FIM is more general, as it is yielded (infinitesimally) from the invariant f-divergences. As a result, FIT applies to a greater subset of models, even those which have not converged to critical points.

Many previous works have analysed, and provided examples for, the relation between second-order information and the behaviour of the Fisher Information (Kunstner et al., 2019; Becker & Lecun, 1989; Martens, 2014; Li et al., 2020). This has primarily been with reference to natural gradient descent (Amari, 1998), preconditioning matrices, and Newton's method during optimization. Recent works serve to highlight the success of the layer-wise scaling factors associated with the Adam optimiser (Kingma & Ba, 2014) whilst moving in stochastic gradient descent (SGD) directions (Agarwal et al., 2020; 2022). This is consistent with previous work (Kunstner et al., 2019), which highlights the issues associated with sub-optimal scaling for SGD, and erroneous directions for Empirical Fisher (EF)-based preconditioning. The EF and its properties are also explored by Karakida et al. (2019). It is important to make the distinction that FIT provides a different use case. Rather than focusing on optimisation, FIT is used to quantify the effects of small parameter movements away from the full precision model, as would arise during quantization. Additionally, the Fisher-Rao metric has been previously suggested as a measure of network capacity (Liang et al., 2019). In this work, we consider taking the expectation over this quantity. From this perspective, FIT denotes expected changes in network capacity as a result of quantization.

3 FIT

First, we outline preliminary notation. We then introduce an information geometric perspective to quantify the effects of model perturbation. FIT is reached via weak assumptions regarding quantization. We then discuss computational details and finally connect FIT to the loss landscape perspective.

3.1 PRELIMINARY NOTATION

Consider training a parameterised model as an estimate of the underlying conditional distribution $p(y|x, \theta)$, where the empirical loss on the training set is given by $L(\theta) = \frac{1}{N}\sum_{i=1}^{N} f(x_i, y_i, \theta)$. In this case, $\theta \in \mathbb{R}^w$ are the model parameters, and $f(x_i, y_i, \theta)$ is the loss with respect to a single member $z_i = (x_i, y_i) \in \mathbb{R}^d \times \mathcal{Y}$ of the training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, drawn i.i.d. from the true distribution.
Consider, for example, the cross-entropy criterion for training, given by $f(x_i, y_i, \theta) = -\log p(y_i|x_i, \theta)$. Note that in parts we follow signal processing convention and refer to the parameter movement associated with a change in precision as quantization noise.

3.2 FISHER INFORMATION TRACE (FIT) METRIC

Consider a general perturbation to the model parameters: $p(y|x, \theta + \delta\theta)$. For brevity, we denote $p(y|x, \phi)$ as $p_\phi$. To measure the effect of this perturbation on the model, we use an f-divergence (e.g. KL, total variation or $\chi^2$) between them: $D_f(p_\theta \| p_{\theta+\delta\theta})$. It is a well-known concept in information geometry (Amari, 2016; Nielsen, 2018) that the FIM arises from such a divergence:

$D_{KL}(p_\theta \| p_{\theta+\delta\theta}) \approx \frac{1}{2}\,\delta\theta^T I(\theta)\,\delta\theta$.

The FIM, $I(\theta)$, takes the following form:

$I(\theta) = \mathbb{E}_{p_\theta(x,y)}\left[\nabla_\theta \log p(y|x,\theta)\, \nabla_\theta \log p(y|x,\theta)^T\right]$.

In this case, $p_\theta(x, y)$ denotes the fact that the expectation is taken over the joint distribution of x and y. The Rao distance (Atkinson & Mitchell, 1981) between two distributions then follows. The exact (per parameter) perturbations $\delta\theta$ associated with quantization are often unknown. As such, we assume they are drawn from an underlying quantization noise distribution, and obtain an expectation of this quadratic differential:

$\mathbb{E}\left[\delta\theta^T I(\theta)\,\delta\theta\right] = \mathbb{E}[\delta\theta]^T I(\theta)\,\mathbb{E}[\delta\theta] + \mathrm{Tr}\left(I(\theta)\,\mathrm{Cov}[\delta\theta]\right)$.

We assume that the random noise associated with quantization is symmetrically distributed around a mean of zero, $\mathbb{E}[\delta\theta] = 0$, and uncorrelated, $\mathrm{Cov}[\delta\theta] = \mathrm{Diag}(\mathbb{E}[\delta\theta^2])$. This yields the FIT heuristic in a general form:

$\Omega = \mathrm{Tr}\left(I(\theta)\,\mathrm{Diag}(\mathbb{E}[\delta\theta^2])\right)$.

To develop this further, we assume that parameters within a single layer or block will have the same noise power, as this will be highly dependent on block-based configuration factors. As a concrete example, take the quantization noise power associated with heterogeneous quantization across different layers, which is primarily dependent on layer-wise bit-precision. As such, we can rephrase FIT in a layer-wise form:

$\Omega = \sum_{l=1}^{L} \mathrm{Tr}(I(\theta_l))\, \mathbb{E}[\delta\theta^2]_l$,

where $l$ denotes a single layer/block in a set of $L$ model layers/blocks.

3.2.1 EXTENDING THE NEURAL MANIFOLD

The previous analysis applies well to network parameters; however, quantization is also applied to the activations themselves. As such, to determine the effects of the activation noise associated with quantization, we must extend the notion of neural manifolds to also include activation statistics, $\hat{a}$, over the dataset as well. The general perturbed model is denoted as follows: $p(y|x, \theta + \delta\theta, \hat{a} + \delta\hat{a})$. The primary practical change required for this extension involves taking derivatives w.r.t. activations rather than parameters - a feature which is well supported in deep learning frameworks (see Appendix C for activation trace examples). After this, the expectation is taken over the data. Once again, this change is simple, as the empirical form of the Fisher Information requires a similar approximation (Section 3.3), and the two can be computed at the same time. Having considered the quantization of activations and weights within the same space, these can now be meaningfully combined in the FIT heuristic. This is illustrated in Section 4.

3.3 COMPUTATIONAL DETAILS

The empirical form of FIT can be obtained via approximations of $\mathbb{E}[\delta\theta^2]_l$ and $\mathrm{Tr}(I(\theta_l))$. The former requires either a Monte-Carlo estimate or an approximate noise model (see Appendix F), whilst the latter requires computing the Empirical Fisher (EF), $\hat{I}(\theta)$. This yields the following form:

$\Omega \approx \sum_{l=1}^{L} \frac{1}{n^{(l)}}\, \mathrm{Tr}(\hat{I}(\theta_l))\, \|\delta\theta\|^2_l$.
Kunstner et al. (2019) illustrate the limitations, as well as the notational inconsistency, of the EF, highlighting several failure cases when using the EF during optimization. Recall that the FIM involves computing an expectation over the joint distribution of x and y. We do not have access to this distribution; in fact, we only model the conditional distribution. This leads naturally to the notion of the EF:

$\hat{I}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta f_\theta(z_i)\, \nabla_\theta f_\theta(z_i)^T$.

The EF trace is far less computationally intensive than the Hessian matrix. Previous methods (Dong et al., 2020; Yao et al., 2021; Yu et al., 2022) suggested the use of the Hutchinson algorithm (Hutchinson, 1989). The trace of the Hessian is extracted in a matrix-free manner via zero-mean random variables with a variance of one:

$\mathrm{Tr}(H) \approx \frac{1}{m}\sum_{i=1}^{m} r_i^T H r_i$,

where m is the number of estimator iterations. It is common to use Rademacher random variables, whose entries are drawn from $\{-1, 1\}$. First, the above method requires a second backwards pass through the network for every iteration. This is costly, especially for DNNs with many layers, where the increased memory requirements associated with storing the computation graph become prohibitive. Second, the variance of each estimator can be large. This is given by $\mathbb{V}[r_i^T H r_i] = 2\left(\|H\|_F^2 - \sum_i H_{ii}^2\right)$ (see Appendix G.4). Even for the Hessian, which has a high diagonal norm, this variance can still be large. This is validated empirically in Section 4. In contrast, the EF admits a simpler form (see Appendix G.4):

$\mathrm{Tr}[\hat{I}(\theta)] = \frac{1}{N}\sum_{i=1}^{N} \|\nabla f(z_i, \theta)\|^2$.

The convergence of this trace estimator improves upon that of the Hutchinson estimator in having lower variance. Additionally, the computation is faster and better supported by deep learning frameworks: it can be performed with a single network pass, as no second derivative is required. A similar scheme is used in the Adam optimizer (Kingma & Ba, 2014), where an exponential moving average is used. Importantly, the EF trace estimation has a more model-agnostic variance, also validated in Section 4.

4 EXPERIMENTS

We perform several experiments to determine the performance of FIT. First, we examine the properties of FIT, and compare it with the commonly used Hessian. We show that FIT preserves the relative block sensitivities of the Hessian, whilst having significantly favourable convergence properties. Second, we illustrate the predictive performance of FIT, in comparison to other sensitivity metrics. We then show the generality of FIT, by analysing the performance on a semantic segmentation task. Finally, we conclude by discussing practical implications.

4.1 COMPARISON WITH THE HESSIAN

To evaluate trace performance and convergence in comparison to Hessian-based methods, we consider several computer vision architectures, trained on the ImageNet (Deng et al., 2009) dataset. We first illustrate the similarity between the EF trace and the Hessian trace, and then illustrate the favourable convergence properties of the EF. Further analysis for BERT on SST-2 is given in Appendix A.

Trace Similarity. Figure 1 shows that the EF preserves the relative block sensitivity of the Hessian. Substituting the EF trace for the Hessian will not affect the performance of a heuristic-based search algorithm. Additionally, even for the Inception-V3 trace in Figure 1(d), the scaling discrepancy would present no change in the final generated model configurations, because heuristic methods (which search for a minimum) are scale agnostic.
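As a point of reference for how such per-layer EF traces can be obtained, the sketch below estimates $\mathrm{Tr}[\hat{I}(\theta_l)]$ for each parameter tensor from squared per-sample gradients, following the estimator of Section 3.3. The function name `empirical_fisher_traces` is our own, and a classification model with a cross-entropy loss and a data loader yielding single samples are assumed.

```python
import torch
import torch.nn.functional as F

def empirical_fisher_traces(model, data_loader, device="cpu", max_samples=512):
    # Tr(I_hat(theta_l)) ~= (1/N) * sum_i ||grad_theta_l f(z_i, theta)||^2.
    # Gradients must be per sample, so the loader is assumed to yield batch size 1.
    model.to(device).eval()
    traces = {n: 0.0 for n, p in model.named_parameters() if p.requires_grad}
    n_seen = 0
    for x, y in data_loader:
        if n_seen >= max_samples:
            break
        model.zero_grad()
        loss = F.cross_entropy(model(x.to(device)), y.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                traces[n] += p.grad.detach().pow(2).sum().item()
        n_seen += x.shape[0]
    return {n: t / max(n_seen, 1) for n, t in traces.items()}
```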
Figure 1: Hessian and EF parameter traces for four classification models: (a) ResNet-18, (b) ResNet-50, (c) MobileNet-V2, (d) Inception-V3. The Hessian and EF traces for the parameters are very similar. For Inception-V3, this holds up to a constant scaling factor.

Convergence Rate. Across all models, the variance associated with the EF trace is orders of magnitude lower than that of the Hessian trace, as shown in Table 1. This is made clear in Figure 2, where the EF trace stabilises in far fewer iterations than the Hessian. The analysis in Section 3, where we suggest that the EF estimation process has lower variance and converges faster, holds well in practice. Importantly, whilst the Hessian variance shown in Table 1 is very model dependent, the EF trace estimator variance is more consistent across all models. These results also hold across differing batch sizes (see Appendix D). The results of this favourable convergence are illustrated in Table 1. For fixed tolerances, the model-agnostic behaviour, faster computation, and lower variance all contribute to a large speedup.

Model          Estimator variance (EF / Hessian)   Iteration time, ms (EF / Hessian)   Relative speedup
ResNet-18      0.15 ± 0.03 / 1.09 ± 0.02           47.78 ± 0.03 / 186.54 ± 0.56        27.67 ± 5.40
ResNet-50      0.31 ± 0.04 / 6.91 ± 1.52           152.02 ± 0.38 / 639.13 ± 1.02       94.24 ± 34.06
MobileNet-V2   0.24 ± 0.01 / 4.81 ± 0.38           58.84 ± 0.55 / 2573.50 ± 3.06       894.24 ± 121.25
Inception-V3   0.43 ± 0.03 / 13.62 ± 0.46          235.43 ± 0.21 / 905.04 ± 4.69       122.06 ± 14.90

Table 1: Representative examples of the typical speedup associated with using the EF over the Hessian. Iteration times and variances are computed as sample statistics over multiple runs of many iterations, with a batch size of 32. The resulting speedup is denoted for a fixed tolerance, which can be practically computed via a moving variation of the mean trace. Early stopping is possible when we first reach the desired tolerance. The measurements were performed on an NVIDIA 2080 Ti GPU.

Figure 2: Comparison between Hessian and EF trace convergence for four classification models: (a) ResNet-18, (b) ResNet-50, (c) MobileNet-V2, (d) Inception-V3.

4.2 FROM FIT TO ACCURACY

In this section, we use correlation as a novel evaluation criterion for sensitivity metrics used to inform quantization. Strong correlation implies that the metric is indicative of final performance, whilst low correlation implies that the metric is less informative for MPQ configuration generation. In this MPQ setting, FIT is calculated as follows (Appendix F):

$\Omega = \sum_{l=1}^{L} \mathrm{Tr}(\hat{I}(\theta_l)) \left(\frac{\theta_{\max} - \theta_{\min}}{2^{b_l} - 1}\right)^2$.

Each bit configuration $\{b_l\}_{l=1}^{L}$ yields a unique FIT value from which we can estimate final performance. The space of possible models is large, as outlined in Section 2, thus we randomly sample configuration space. 100 CNN models, with and without batch-normalisation, are trained on the CIFAR-10 and MNIST datasets, giving a total of 4 studies.

Comparison Metrics. As noted in Section 2, previous works (Kundu et al., 2021; Liu et al., 2021) provide layer-wise heuristics with which to generate quantization configurations. This differs from FIT, which yields a single value. These previous methods are not directly comparable. However, we take inspiration from Chen et al. (2021), using the quantization ranges (QR: $|\theta_{\max} - \theta_{\min}|$) as well as the batch normalisation scaling parameter (BN: γ) (Ioffe & Szegedy, 2015), to evaluate the efficacy of FIT in characterising sensitivity.
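A small self-contained sketch of this procedure is given below: the hypothetical helper `fit_score` evaluates the FIT expression above for one bit configuration, and the rank correlation against final accuracy is then computed with `scipy.stats.spearmanr`. All numbers are made up for illustration; in practice the traces come from the EF estimator and the accuracies from trained MPQ models.

```python
from scipy.stats import spearmanr

def fit_score(traces, theta_ranges, bits):
    # sum_l Tr(I_hat(theta_l)) * ((theta_max - theta_min) / (2^b_l - 1))^2
    return sum(
        tr * ((t_max - t_min) / (2 ** b - 1)) ** 2
        for tr, (t_min, t_max), b in zip(traces, theta_ranges, bits)
    )

traces = [12.0, 3.5, 0.8]                          # hypothetical per-layer EF traces
ranges = [(-0.4, 0.4), (-0.2, 0.2), (-0.1, 0.1)]   # hypothetical weight ranges
configs = [[8, 8, 8], [4, 8, 8], [3, 4, 8], [3, 3, 4]]
accuracies = [0.92, 0.91, 0.88, 0.80]              # hypothetical final accuracies

fit_values = [fit_score(traces, ranges, cfg) for cfg in configs]
rho, _ = spearmanr(fit_values, accuracies)         # strongly negative when FIT is predictive
print(fit_values, rho)
```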
In addition to these, we also perform ablation studies by removing components of FIT, resulting in FIT_W, FIT_A, and the isolated quantization noise model, i.e. $\mathbb{E}[\delta\theta^2]$. Note that we do not include HAWQ here, as it generates results equivalent to FIT_W. Equations for these comparison heuristics are shown in Appendix E. This decomposition helps determine how much the components which comprise FIT contribute to its overall performance.

Experiment   Dataset    BN layers   FIT    QR     Noise   FIT_W   QR_W   FIT_A   QR_A   BN
A            CIFAR-10   yes         0.89   0.76   0.85    0.87    0.86   0.38    0.36   0.33
B            CIFAR-10   no          0.77   0.67   0.60    0.65    0.61   0.61    0.60   -
C            MNIST      yes         0.86   0.89   0.83    0.72    0.80   0.44    0.39   0.39
D            MNIST      no          0.90   0.58   0.70    0.72    0.72   0.55    0.44   -

Table 2: Rank correlation coefficient for each combination of sensitivity metric and quantization experiment. The W/A subscript indicates using only either weights or activations. "BN layers" indicates the presence of batch-normalisation within the architecture.

Figure 3 and Table 2 show the plots and rank-correlation results across all datasets and metrics. From these results, the key benefits of FIT are demonstrated.

FIT correlates well with final model performance. From Table 2, we can see that FIT has a consistently high rank correlation coefficient, demonstrating its application in informing final model performance. Whilst other methods vary in correlation, FIT remains consistent across experiments.

FIT combines contributions from both parameters and activations effectively. We note from Table 2 that combining FIT_W and FIT_A consistently improves performance. The same is not the case for QR. Concretely, the average increase in correlation with the inclusion of FIT_A is 0.12, whilst for QR_A, the correlation decreases on average by 0.02. FIT scales the combination of activation and parameter contributions correctly, leading to a consistent increase in performance. Furthermore, from Table 2 we observe that the inclusion of batch-normalisation alters the relative contribution from parameters and activations. As a result, whilst FIT combines each contribution effectively and remains consistently high, QR does not.

Figure 3: Plots of the chosen predictive heuristics (FIT, QR, Noise, FIT_W/HAWQ) against final model performance.

4.3 SEMANTIC SEGMENTATION

In this section, we illustrate the ability of FIT to generalise to larger datasets and architectures, as well as more diverse tasks. We choose to quantify the effects of MPQ on the U-Net architecture (Ronneberger et al., 2015), for the Cityscapes semantic segmentation dataset (Cordts et al., 2016). Additional analysis of BERT on SST-2 is shown in Appendix A. In this experiment, we train 50 models with randomly generated bit configurations for both weights and activations. For evaluation, we use the Jaccard similarity, or more commonly, the mean Intersection over Union (mIoU). The final correlation coefficient is between FIT and mIoU. EF trace computation is stopped at a tolerance of 0.01, which is reached after 82 individual iterations. The weight and activation traces for the trained, full precision U-Net architecture are shown in Figure 4. Figure 4(c) shows the correlation between FIT and final model performance for the U-Net architecture on the Cityscapes dataset. In particular, we obtain a high final rank correlation coefficient of 0.86.

Figure 4: EF weight (a) and activation (b) traces for the U-Net architecture on the Cityscapes dataset.
(c) FIT against mIoU for 50 random MPQ configurations of U-Net on Cityscapes semantic segmentation. Example configurations are highlighted, showing the average bit precision for weights and activations.

4.4 FURTHER PRACTICAL DETAILS

Small perturbations. In Section 3, we assume that the quantization noise, $\delta\theta$, is small with respect to the parameters themselves. This allows us to reliably consider a second-order approximation of the divergence. In Figure 5, we plot every parameter within the model for every single quantization configuration from experiment A. Almost all parameters adhere to this approximation. We observe that our setting covers most practical situations, and we leave characterising more aggressive quantization (1/2 bit) to future work.

Distributional shift. Modern DNNs often over-fit to their training dataset. More precisely, models are able to capture the small distributional shifts between training and testing subsets. FIT is computed from the trained model, using samples from the training dataset. The extent to which FIT captures the final quantized model performance on the test dataset is, therefore, dependent on model over-fitting. Consider, for example, dataset D. We observe a high correlation of 0.98 between FIT and final training accuracy, which decreases to 0.90 during testing. This is further demonstrated in Figure 5. For practitioners, FIT is more effective where over-fitting is less prevalent.

Figure 5: (a) Noise vs parameter magnitude; the line indicates equal magnitude and is shown for reference. (b) FIT against final training accuracy for experiment D; the correlation coefficient in this case is 0.98.

5 CONCLUSION

In this paper, we introduced FIT, a metric for characterising how quantization affects final model performance. Such a method is vital in determining high-performing mixed-precision quantization configurations which maximise performance given constraints on compression. We presented FIT from an information geometric perspective, justifying its general application and connection to the Hessian. By applying a quantization-specific noise model, as well as using the empirical Fisher, we obtained a well-grounded and practical form of FIT. Empirically, we show that FIT correlates highly with final model performance, remaining consistent across varying datasets, architectures and tasks. Our ablation studies established the importance of including activations. Moreover, FIT fuses the sensitivity contributions from parameters and activations, yielding a single, simple to use, and informative metric. In addition, we show that FIT has very favourable convergence properties, making it orders of magnitude faster to compute. We also explored assumptions which help to guide practitioners. Finally, by training hundreds of MPQ configurations, we obtained correlation metrics with which to demonstrate the benefits of using FIT. Previous works train a small number of configurations. We encourage future work in this field to replicate our approach, for meaningful evaluation.

Future Work. Our contribution, FIT, worked quickly and effectively in estimating the final performance of a quantized model. However, as noted in Section 4, FIT is susceptible to the distributional shifts associated with model over-fitting. In addition, FIT must be computed from the trained, full-precision model.
We believe dataset agnostic methods provide a promising direction for future research, where the MPQ configurations can be determined from initialisation. Published as a conference paper at ICLR 2023 REPRODUCIBILITY STATEMENT We have included sample code for generating the parameter and activation traces, as well as generating and analysing quantized models. Further experimental details are presented in Appendix E. Relevant complete proofs are shown in Appendices F and G. Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, and Cyril Zhang. Disentangling adaptive gradient methods from learning rates, 2020. URL https://arxiv.org/abs/2002.11803. Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, and Cyril Zhang. Learning rate grafting: Transferability of optimizer tuning, 2022. URL https://openreview.net/forum?id= Fp Kg G31Z_i9. Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251 276, 1998. doi: 10.1162/089976698300017746. Shun ichi Amari. Information geometry and its applications. Springer, 2016. Colin Atkinson and Ann F. S. Mitchell. Rao s distance measure. Sankhy a: The Indian Journal of Statistics, Series A (1961-2002), 43(3):345 365, 1981. ISSN 0581572X. URL http://www. jstor.org/stable/25050283. Suzanna Becker and Yann Lecun. Improving the convergence of back-propagation learning with second-order methods. 01 1989. Sung-En Chang, Yanyu Li, Mengshu Sun, Weiwen Jiang, Sijia Liu, Yanzhi Wang, and Xue Lin. Rmsmp: A novel deep neural network quantization framework with row-wise mixed schemes and multiple precisions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5251 5260, 2021a. Sung-En Chang, Yanyu Li, Mengshu Sun, Runbin Shi, Hayden K-H So, Xuehai Qian, Yanzhi Wang, and Xue Lin. Mix and match: A novel fpga-centric deep neural network quantization framework. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 208 220. IEEE, 2021b. Boyu Chen, Peixia Li, Baopu Li, Chen Lin, Chuming Li, Ming Sun, Junjie Yan, and Wanli Ouyang. Bn-nas: Neural architecture search with batch normalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 307 316, 2021. Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Towards the limit of network quantization, 2016. URL https://arxiv.org/abs/1612.01543. Claudionor N Coelho, Aki Kuusela, Shan Li, Hao Zhuang, Jennifer Ngadiuba, Thea Klaeboe Aarrestad, Vladimir Loncar, Maurizio Pierini, Adrian Alan Pol, and Sioni Summers. Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors. Nature Machine Intelligence, 3(8):675 686, 2021. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248 255. Ieee, 2009. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ar Xiv preprint ar Xiv:1810.04805, 2018. Published as a conference paper at ICLR 2023 Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. 
Hawq: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 293 302, 2019. Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. Advances in neural information processing systems, 33:18518 18529, 2020. Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference, 2021. URL https: //arxiv.org/abs/2103.13630. R.M. Gray and D.L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6): 2325 2383, 1998. doi: 10.1109/18.720541. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. Advances in neural information processing systems, 29, 2016. M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communication in Statistics Simulation and Computation, 18:1059 1076, 01 1989. doi: 10.1080/03610919008812866. Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. Ai benchmark: Running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0 0, 2018. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448 456. PMLR, 2015. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704 2713, 2018. Steven A. Janowsky. Pruning versus clipping in neural networks. Phys. Rev. A, 39:6600 6603, Jun 1989. doi: 10.1103/Phys Rev A.39.6600. URL https://link.aps.org/doi/10.1103/ Phys Rev A.39.6600. Prad Kadambi. Comparing fisher information regularization with distillation for dnn quantization. 2020. URL https://openreview.net/pdf?id=Js Rdc90lpws. Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. Universal statistics of fisher information in deep neural networks: Mean field approach. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1032 1041. PMLR, 2019. Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. ar Xiv e-prints, art. ar Xiv:1412.6980, December 2014. Souvik Kundu, Shikai Wang, Qirui Sun, Peter A. Beerel, and Massoud Pedram. Bmpq: Bitgradient sensitivity driven mixed-precision quantization of dnns from scratch, 2021. URL https: //arxiv.org/abs/2112.13843. Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical fisher approximation for natural gradient descent. Advances in neural information processing systems, 32, 2019. Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. ar Xiv preprint ar Xiv:1910.09700, 2019. Xinyan Li, Qilong Gu, Yingxue Zhou, Tiancong Chen, and Arindam Banerjee. Hessian based analysis of sgd for deep nets: Dynamics and generalization. In Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 190 198. SIAM, 2020. 
Published as a conference paper at ICLR 2023 Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. ar Xiv preprint ar Xiv:2102.05426, 2021. Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-rao metric, geometry, and complexity of neural networks. In The 22nd international conference on artificial intelligence and statistics, pp. 888 896. PMLR, 2019. Hongyang Liu, Sara Elkerdawy, Nilanjan Ray, and Mostafa Elhoushi. Layer importance estimation with imprinting for neural network quantization. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2408 2417, 2021. doi: 10.1109/ CVPRW53098.2021.00273. Shaoshan Liu, Liangkai Liu, Jie Tang, Bo Yu, Yifan Wang, and Weisong Shi. Edge computing for autonomous driving: Opportunities and challenges. Proceedings of the IEEE, 107(8):1697 1716, 2019. Qian Lou, Feng Guo, Lantao Liu, Minje Kim, and Lei Jiang. Autoq: Automated kernel-wise neural network quantization, 2019. URL https://arxiv.org/abs/1902.05690. James Martens. New perspectives on the natural gradient method. Co RR, abs/1412.1193, 2014. URL http://arxiv.org/abs/1412.1193. Frank Nielsen. An elementary introduction to information geometry. ar Xiv e-prints, art. ar Xiv:1808.08271, August 2018. Xu Qian, Victor Li, and Crews Darren. Channel-wise hessian aware trace-weighted quantization of neural networks, 2020. URL https://arxiv.org/abs/2008.08284. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computerassisted intervention, pp. 234 241. Springer, 2015. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631 1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170. Ming Tu, Visar Berisha, Yu Cao, and Jae-sun Seo. Reducing the model order of deep neural networks using information theory. In 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 93 98. IEEE, 2016. Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8612 8620, 2019. Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. Mixed precision quantization of convnets via differentiable neural architecture search, 2018. URL https://arxiv.org/abs/1812.00090. Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, et al. Hawq-v3: Dyadic neural network quantization. In International Conference on Machine Learning, pp. 11875 11886. PMLR, 2021. Shixing Yu, Zhewei Yao, Amir Gholami, Zhen Dong, Sehoon Kim, Michael W Mahoney, and Kurt Keutzer. Hessian-aware pruning and optimal neural implant. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3880 3891, 2022. 
A ADDITIONAL EXPERIMENTS: BERT ON SST-2

[Figure 6 charts omitted: per-layer weight and activation bit precisions for BERT-base. Legend: FIT [3,4,6,8], Accuracy: 89.11; FIT [4,8], Accuracy: 90.94; Uniform W4 A8, Accuracy: 85.66. Panels: (a) Weight Configuration, (b) Activation Configuration.]

Figure 6: Comparison between bit precision configurations for the uniform scheme (W4, A8) and comparable (BOPs) configurations generated via FIT.
[Figure 7 charts omitted: per-layer weight and activation bit precisions for BERT-base. Legend: FIT [2,3,4], Accuracy: 81.08; Uniform W2 A8, Accuracy: 80.62. Panels: (a) Weight Configuration, (b) Activation Configuration.]

Figure 7: Comparison between bit precision configurations for the uniform scheme (W2, A8) and comparable (BOPs) configurations generated via FIT.

This section extends our experimental evaluation of FIT to other challenging datasets and benchmarks. Notably, we focus on quantizing BERT-base (Devlin et al., 2018) for the SST-2 dataset (Socher et al., 2013). BERT-base is quantized via layer-wise mixed-precision simulated asymmetric quantization of weights and activations. We chose 50 distinct MPQ configurations for BERT on SST-2 across a range of bit precisions [2,3,4,6,8].
We obtain a final correlation score of 0.752 between post-fine-tuning accuracy and FIT score for these MPQ configurations. As in our other experiments, this demonstrates that FIT has excellent predictive power for the impact of quantization, even for large and deep models such as BERT. To illustrate a use case for FIT, Figures 6 and 7 show highly non-trivial MPQ configurations compared to uniform baselines constrained to comparable bit operations (BOPs). In this case, we obtain higher MPQ accuracy at comparable BOPs, showing the superior performance of FIT in selecting high-performing MPQ configurations. Notably, FIT can trade off against any arbitrary hardware constraint (e.g. latency, power), making it very flexible. From a computational perspective, given a batch size of 256, FIT converged with a variance of 0.73 in very few iterations. Conversely, the Hessian trace method could not properly converge even after 200 iterations, which is reflected in a variance of 5256. This granularity of layer-wise quantization renders Hessian-based methods challenging to use in practice. The EF traces for the weights and activations of BERT are shown in Figure 8.

Figure 8: EF weight (a) and activation (b) traces for BERT-base on the SST-2 dataset.

B QUANTIZATION AWARE TRAINING (QAT)

Quantization can lead to significant performance degradation (Gholami et al., 2021). It is, therefore, necessary to perform additional quantization aware training (QAT) in order to recover this lost performance. QAT involves simulating the effects of quantization during training. The quantization function Q(θ) is applied to the floating point parameter values during the forward pass of the network. Q(θ) is piece-wise flat. As a result, to propagate gradients, we use a straight-through estimator (STE) (Hubara et al., 2016). In effect, this bypasses the quantization function during gradient computation in the backwards pass. Figure 9 illustrates this process. QAT also involves learning the quantization ranges. For weights, a max-min approach is taken, whilst for activations, an exponential moving average is used. As a result, the scaling and zero points for quantization are mapped correctly. In addition, batch-normalisation parameters can be folded into weights for efficiency.

Figure 9: Overview of QAT: Q(θ) is applied during the forward pass, and an STE is used in the backwards pass.

B.1 FURTHER QUANTIZATION DETAILS

During our experiments, we employ layer-wise symmetric/asymmetric simulated quantization. In addition, layer-wise quantization ranges are accumulated for a short period (30 iterations) at the beginning of QAT to accurately tune the quantization function across the expected range. Note that this is particularly important for activation quantization. As a result of the simulated nature of the quantization, we do not perform batch-norm folding in our experiments. However, this may improve results further, given the additional scaling factor per channel (rather than just per layer). Notably, investigating FIT combined with other, more complicated quantization schemes, such as those proposed by Chang et al. (2021b) and Chang et al. (2021a), provides an appealing direction for future work. In particular, we note that due to the formulation of FIT, differing quantization schemes are realised by changing the quantization noise model. We look forward to investigating this in future.
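The forward/backward behaviour described above can be written compactly as a custom autograd function. The sketch below is a minimal PyTorch illustration of simulated (fake) quantization with an STE; the class name `FakeQuantizeSTE` and the fixed min-max range are our own simplifications, not the exact training code used in our experiments.

```python
import torch

class FakeQuantizeSTE(torch.autograd.Function):
    # Forward: uniform min-max quantization. Backward: straight-through estimator,
    # i.e. the piece-wise flat quantizer is treated as the identity for gradients.

    @staticmethod
    def forward(ctx, theta, theta_min, theta_max, bits):
        delta = (theta_max - theta_min) / (2 ** bits - 1)        # step width
        q = torch.round((theta.clamp(theta_min, theta_max) - theta_min) / delta)
        return q * delta + theta_min                             # de-quantized values

    @staticmethod
    def backward(ctx, grad_output):
        # Pass gradients straight through to theta; range and bit width get no gradient.
        return grad_output, None, None, None

# Example: 4-bit fake quantization of a weight tensor during the forward pass.
w = torch.randn(16, 16, requires_grad=True)
w_q = FakeQuantizeSTE.apply(w, w.min().item(), w.max().item(), 4)
w_q.sum().backward()   # gradients reach w via the STE
```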
C ACTIVATION TRACES

Figure 10: EF activation traces for four classification models: (a) ResNet-18, (b) ResNet-50, (c) MobileNet-V2, (d) Inception-V3.

D ESTIMATOR COMPARISON

For the four classification models considered, Tables 3 and 4 indicate the estimator variances and iteration times associated with the EF and Hessian for a variety of batch sizes. Whilst the EF exhibits the expected variance reduction behaviour, the Hessian does not, and requires a minimum, model-dependent batch size to achieve stable behaviour. In all cases, the EF estimator variance is orders of magnitude lower than that of the Hessian. Means and variances are estimated over 3 runs of 200 samples. Deviations are normalised w.r.t. the trace magnitude, taking the average across blocks/layers. This ensures all the statistics are comparable, regardless of convergence bias. The relative speedup associated with a fixed tolerance is computed as follows:

$s = \frac{\sigma^2_H\, t_H}{\sigma^2_{EF}\, t_{EF}}$,

where $\sigma^2$ indicates the estimator variance and $t$ the iteration time. This follows from the variance properties of Monte-Carlo estimates.

Table 3: Estimator variances associated with the EF and Hessian for a variety of batch sizes, over 3 runs of 200 iterations for each model and batch size.

Model / Batch size     EF: run 1  run 2  run 3  mean  stdev     Hessian: run 1  run 2  run 3  mean  stdev
ResNet-18 / 4          0.82  0.99  1.27  1.03  0.23              10.64  9.77  6.89  9.10  1.96
ResNet-18 / 8          0.61  0.42  0.61  0.55  0.11              5.01  4.28  4.38  4.56  0.40
ResNet-18 / 16         0.33  0.30  0.26  0.30  0.03              1.77  2.32  2.03  2.04  0.27
ResNet-18 / 32         0.13  0.18  0.15  0.15  0.03              1.09  1.07  1.11  1.09  0.02
ResNet-50 / 4          1.81  2.44  2.03  2.10  0.32              -  -  -  -  -
ResNet-50 / 8          1.15  1.06  1.19  1.13  0.07              22.57  45.45  76.14  48.05  26.88
ResNet-50 / 16         0.55  0.71  0.51  0.59  0.11              17.68  22.41  14.56  18.21  3.95
ResNet-50 / 32         0.26  0.34  0.32  0.31  0.04              8.60  5.65  6.49  6.91  1.52
MobileNet-V2 / 4       1.18  1.29  1.03  1.17  0.13              -  -  -  -  -
MobileNet-V2 / 8       0.67  0.61  0.71  0.66  0.05              17.65  72.02  34.19  41.28  27.87
MobileNet-V2 / 16      0.36  0.35  0.37  0.36  0.01              9.36  10.54  5,092.34  9.95  0.84
MobileNet-V2 / 32      0.25  0.24  0.22  0.24  0.01              4.47  4.75  5.23  4.81  0.38
Inception-V3 / 4       3.97  3.00  2.77  3.24  0.64              -  -  -  -  -
Inception-V3 / 8       1.92  1.24  1.95  1.70  0.40              -  -  -  -  -
Inception-V3 / 16      0.89  0.66  0.69  0.75  0.13              31.55  65.36  109.84  68.92  39.26
Inception-V3 / 32      0.44  0.45  0.39  0.43  0.03              13.81  13.96  13.10  13.62  0.46

Table 4: Iteration times (ms) associated with the EF and Hessian for a variety of batch sizes, averaged over 3 runs of 200 iterations for each model and batch size.

Model / Batch size     EF (ms): run 1  run 2  run 3  mean  stdev     Hessian (ms): run 1  run 2  run 3  mean  stdev
ResNet-18 / 4          11.06  11.29  11.29  11.21  0.13              41.91  40.99  40.58  41.16  0.68
ResNet-18 / 8          16.80  16.86  16.98  16.88  0.09              62.80  61.54  61.16  61.83  0.86
ResNet-18 / 16         24.07  24.46  24.12  24.22  0.21              100.01  95.40  98.38  97.93  2.34
ResNet-18 / 32         47.81  47.77  47.75  47.78  0.03              186.92  186.80  185.90  186.54  0.56
ResNet-50 / 4          30.89  30.77  30.75  30.80  0.07              150.69  148.17  149.08  149.31  1.27
ResNet-50 / 8          47.90  48.50  48.49  48.30  0.34              196.37  201.74  200.99  199.70  2.91
ResNet-50 / 16         83.67  83.54  82.94  83.38  0.39              332.11  341.01  338.11  337.08  4.54
ResNet-50 / 32         152.44  151.92  151.69  152.02  0.38          639.62  639.82  637.96  639.13  1.02
MobileNet-V2 / 4       19.32  19.38  19.57  19.42  0.13              709.06  705.77  704.53  706.45  2.34
MobileNet-V2 / 8       21.49  21.87  21.96  21.77  0.25              913.47  917.50  914.63  915.20  2.07
MobileNet-V2 / 16      33.85  33.83  33.73  33.80  0.06              1,460.03  1,464.95  1,460.89  1,461.96  2.62
MobileNet-V2 / 32      58.65  58.80  59.07  58.84  0.21              2,570.02  2,575.81  2,574.66  2,573.50  3.06
Inception-V3 / 4       56.30  59.83  57.00  57.71  1.87              265.04  276.99  279.17  273.73  7.61
Inception-V3 / 8       83.77  83.21  82.41  83.13  0.68              325.06  335.16  332.37  330.86  5.21
Inception-V3 / 16      135.46  132.97  132.58  133.67  1.56          530.60  537.80  535.57  534.66  3.69
Inception-V3 / 32      236.07  235.14  235.09  235.43  0.55          901.72  908.35  955.94  905.04  4.69

E FURTHER EXPERIMENTAL DETAILS

For each of experiments A, B, C and D, we trained 100 convolutional classifiers on the CIFAR-10 and MNIST datasets. In both cases, we performed experiments with and without the inclusion of batch normalisation layers before each activation. The architecture used consisted of three convolutional layers, followed by a fully connected classification head, with ReLU activations in between. The first two blocks are also followed by MaxPool layers. This is shown in Figure 11. Between the MNIST and CIFAR-10 datasets, the number of filters was scaled by two. To obtain the data, we first trained a full precision version of the network for 50 epochs using the Adam optimizer. A learning rate of 0.01 was chosen, and increased to 0.1 with the inclusion of batch normalization. A cosine-annealing learning rate schedule was used. We then used this trained full precision model as a checkpoint to initialise our randomly chosen mixed precision configurations, and training was continued for another 30 epochs with a learning rate reduction of 0.1, using the same schedule. Quantization configurations were chosen uniformly at random from the possible set of bit precisions: [8,6,4,3]. In these cases, initialisation and training were identical across all MPQ models, so as to compare the final performance.

Figure 11: Small convolutional classifier architecture used in the experiments detailed in Section 4.

E.1 FURTHER DETAILS FOR COMPARISON METRICS

QR: the QR baseline replaces the EF trace as the sensitivity metric with a quantity based on the per-layer quantization range, $|\theta_{\max} - \theta_{\min}|$, combined with the same per-layer quantization noise term as FIT.

BN: the BN baseline replaces the EF trace as the sensitivity metric with a quantity based on the batch-norm scaling factor γ, combined with the same per-layer quantization noise term as FIT.

FIT_W/A: to obtain FIT_W, we remove the component which takes into account activation quantization sensitivity. Similarly, to obtain FIT_A, we remove the component which takes into account weight quantization sensitivity. Recall that in Section 3, we extend the parameter space to include the activation statistics. In these ablations, we remove (W) or retain (A) these contributions.

F QUANTIZATION AND NOISE MODEL

The following analysis serves to motivate the direct connection between model perturbation and bit configuration. During quantization, it is historically common to assume that the quantization error is uncorrelated with the original signal, yielding an approximately uniform distribution. This leads to an important assumption: the quantization error of each parameter is independent and uniformly distributed with mean zero. Figure 12 serves to motivate the validity of this assumption.

Figure 12: ResNet-18, block 12, quantization distribution analysis showing the validity of the uniform noise assumption: (a) weight distribution, (b) quantized weight distribution, (c) quantization error distribution.
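The uniform-error check illustrated in Figure 12 is easy to reproduce numerically. The snippet below is a small standalone sketch (our own helper, synthetic Gaussian weights) that quantizes a tensor with the min-max scheme and compares the empirical error variance with the $\Delta^2/12$ noise power derived next.

```python
import numpy as np

def uniform_quantize(theta, bits):
    # Uniform (min-max) quantization; returns quantized values and the step width delta.
    t_min, t_max = theta.min(), theta.max()
    delta = (t_max - t_min) / (2 ** bits - 1)
    return np.round((theta - t_min) / delta) * delta + t_min, delta

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 0.05, size=100_000)      # stand-in for one layer's weights
theta_q, delta = uniform_quantize(theta, bits=4)
error = theta_q - theta

# Empirical noise power vs. the uniform-noise prediction delta^2 / 12.
print(error.var(), delta ** 2 / 12)              # the two should roughly agree
```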
Under this assumption, it is possible to evaluate this noise power:

$\mathbb{E}[\delta\theta^2] = \frac{\Delta^2}{12}$,

where $\Delta$ denotes the step width of quantization, and can be modelled directly by the quantization scheme being used; in this case, uniform (min-max) quantization. Letting $Q(\theta): \theta \mapsto \theta_q$ denote the quantization function, uniform quantization can be expressed as follows:

$Q(\theta) = \mathrm{round}\!\left(\frac{\theta - \theta_{\min}}{\Delta}\right)\Delta + \theta_{\min}, \qquad \Delta = \frac{\theta_{\max} - \theta_{\min}}{2^b - 1}$,

where $b$ is the quantization bit precision. This yields the following quantization noise power:

$\mathbb{E}[\delta\theta^2] = \frac{1}{12}\left(\frac{\theta_{\max} - \theta_{\min}}{2^b - 1}\right)^2$.

Consider now our revised model perturbation (w.l.o.g. removing the constant factor):

$\Omega = \sum_{l=1}^{L} \mathrm{Tr}(\hat{I}(\theta_l)) \left(\frac{\theta_{\max} - \theta_{\min}}{2^{b_l} - 1}\right)^2$.

Each bit configuration $\{b_l\}_{l=1}^{L}$ yields a unique FIT value from which we can estimate final performance.

G FIT DETAILS

G.1 STANDARD RELATIONS IN INFORMATION GEOMETRY

Proposition 1. The EF defines an estimate for the natural metric tensor of the statistical manifold induced by the parameterised model, obtained infinitesimally (in this case) from an expansion of the KL divergence.

Proof. For brevity, denote $p(z, \theta)$ as $p_\theta$. Here we describe the discrete case; the continuous case follows similarly.

$D_{KL}(p_\theta \| p_{\theta+\delta\theta}) = \sum_{D} p_\theta \log \frac{p_\theta}{p_{\theta+\delta\theta}}$
$\approx -\sum_{D} p_\theta \left( \nabla_\theta \log p_\theta^{\,T} \delta\theta + \tfrac{1}{2}\, \delta\theta^T \nabla^2_\theta \log p_\theta\, \delta\theta \right)$
$= \tfrac{1}{2}\, \delta\theta^T \left( \sum_{D} p_\theta\, \nabla_\theta \log p_\theta\, \nabla_\theta \log p_\theta^{\,T} - \sum_{D} \nabla^2_\theta\, p_\theta \right) \delta\theta$
$= \tfrac{1}{2}\, \delta\theta^T\, \mathbb{E}_{p_\theta}\!\left[ \nabla_\theta \log p_\theta\, \nabla_\theta \log p_\theta^{\,T} \right] \delta\theta$
$\approx \tfrac{1}{2}\, \delta\theta^T\, \hat{I}(\theta)\, \delta\theta$,

where the first-order term and $\sum_D \nabla^2_\theta p_\theta = \nabla^2_\theta \sum_D p_\theta = 0$ vanish, and the final step replaces the expectation with its empirical average over $D$.

G.2 OBTAINING FIT

Proposition 2. FIT is obtained (under mild noise assumptions) by taking the expectation over the Fisher-Rao metric:

$\Omega = \mathbb{E}\left[\delta\theta^T I(\theta)\, \delta\theta\right] = \mathbb{E}[\delta\theta]^T I(\theta)\, \mathbb{E}[\delta\theta] + \mathrm{Tr}\left(I(\theta)\, \mathrm{Cov}[\delta\theta]\right)$.

We assume that the random noise is symmetrically distributed around a mean of zero, $\mathbb{E}[\delta\theta] = 0$, and uncorrelated, $\mathrm{Cov}[\delta\theta] = \mathrm{Diag}(\mathbb{E}[\delta\theta^2])$. This yields the FIT heuristic:

$\Omega = \mathrm{Tr}\left(I(\theta)\, \mathrm{Diag}(\mathbb{E}[\delta\theta^2])\right)$,

and, in its block-wise empirical form with an approximate noise model:

$\Omega \approx \sum_{l=1}^{L} \mathrm{Tr}\left(\hat{I}(\theta_l)\, \mathrm{Diag}(\hat{\mathbb{E}}_l[\delta\theta^2])\right)$.

In cases where we compute a numerical approximation of our noise model, $\frac{1}{n^{(l)}}\|\delta\theta\|^2_l$, rather than averaging over many trained models, we leverage the high dimensionality of the parameters in each layer to obtain a good approximation.

G.3 CONNECTIONS TO THE HESSIAN

Proposition 3. The Fisher Information Matrix is equal to the expectation of the Hessian under the correctly specified model.

$\mathbb{E}_{p_\theta(z)}\left[\nabla^2_\theta f(z, \theta)\right] = -\,\mathbb{E}_{p_\theta(x,y)}\left[\frac{\nabla^2_\theta\, p(y|x,\theta)}{p(y|x,\theta)}\right] + \mathbb{E}_{p_\theta(x,y)}\left[\nabla_\theta \log p(y|x,\theta)\, \nabla_\theta \log p(y|x,\theta)^T\right]$

Taking the first term of the right-hand side (RHS), we assume in this case that the function is sufficiently smooth such that the order of integration and (second) differentiation can be exchanged. In this case, the first term on the RHS evaluates to 0:

$\mathbb{E}_{p_\theta(x,y)}\left[\frac{\nabla^2_\theta\, p(y|x,\theta)}{p(y|x,\theta)}\right] = \int \frac{\nabla^2_\theta\, p(y|x,\theta)}{p(y|x,\theta)}\, p(y|x,\theta)\, dz \quad (2)$
$= \int \nabla^2_\theta\, p(y|x,\theta)\, dz \quad (3)$
$= \nabla^2_\theta \int p(y|x,\theta)\, dz \quad (4)$
$= \nabla^2_\theta\, [1] = 0 \quad (5)$

The second term on the RHS is then simplified as follows, yielding our result:

$\mathbb{E}_{p_\theta(z)}\left[\nabla^2_\theta f(z, \theta)\right] = \mathbb{E}_{p_\theta(x,y)}\left[\nabla_\theta \log p(y|x,\theta)\, \nabla_\theta \log p(y|x,\theta)^T\right] = I(\theta) \quad (7)$

Proposition 4. Provided $\hat{\theta}_n$ is a consistent estimator for the true model parameters $\theta^*$, the Empirical Fisher Information will converge in probability to the Fisher Information as $n \to \infty$.

$\hat{I}(\hat{\theta}_n) - I(\theta^*) = [\hat{I}(\hat{\theta}_n) - I(\hat{\theta}_n)] + [I(\hat{\theta}_n) - I(\theta^*)] \quad (8)$

Considering the first term on the RHS, we are able to apply the uniform law of large numbers:

$[\hat{I}(\hat{\theta}_n) - I(\hat{\theta}_n)] = \frac{1}{n}\sum_{i=1}^{n} \nabla f(z_i, \hat{\theta}_n)\, \nabla f(z_i, \hat{\theta}_n)^T - \mathbb{E}_{p_{\theta}(x,y)}\left[\nabla_\theta f(z, \hat{\theta}_n)\, \nabla_\theta f(z, \hat{\theta}_n)^T\right] \quad (9)$

$\leq \sup_{\theta \in \Theta}\left\| \frac{1}{n}\sum_{i=1}^{n} \nabla f(z_i, \theta)\, \nabla f(z_i, \theta)^T - \mathbb{E}_{p_{\theta}(x,y)}\left[\nabla_\theta f(z, \theta)\, \nabla_\theta f(z, \theta)^T\right] \right\| \xrightarrow{p} 0,$

provided the following regularity conditions are upheld:

1. Θ must be compact.
2. f(z, θ) is continuous at each θ ∈ Θ for almost all z ∈ Z.
3. The dominating function, $d(z) \geq \|\nabla f(z,\theta)\,\nabla f(z,\theta)^T\|$ for all $\theta \in \Theta$, must be finite in expectation.

Now considering the second term on the RHS. Our estimator in this case is defined as consistent, and therefore $\hat{\theta}_n \xrightarrow{p} \theta^*$ as $n \to \infty$. Via the continuous mapping theorem, $[I(\hat{\theta}_n) - I(\theta^*)] \xrightarrow{p} 0$. We therefore arrive at the desired result.

G.4 COMPUTATION

Proposition 5. The trace of the EF can be computed via the sum of the squares of the gradients.

Proof. The trace is a linear operator, which allows each individual estimator to be extracted from the summation. Additionally, the trace of the second moment matrix can then be computed as the squared norm of the gradient vector from which it is composed:

$\mathrm{Tr}[\hat{I}(\theta)] = \mathrm{Tr}\left[\frac{1}{N}\sum_{i=1}^{N} \nabla f(z_i, \theta)\, \nabla f(z_i, \theta)^T\right]$
$= \frac{1}{N}\sum_{i=1}^{N} \mathrm{Tr}\left[\nabla f(z_i, \theta)\, \nabla f(z_i, \theta)^T\right]$
$= \frac{1}{N}\sum_{i=1}^{N} \nabla f(z_i, \theta)^T\, \nabla f(z_i, \theta)$
$= \frac{1}{N}\sum_{i=1}^{N} \|\nabla f(z_i, \theta)\|^2$

Proposition 6. The variance of the Hutchinson estimator is given by:

$\mathbb{V}[r_i^T H r_i] = 2\left(\|H\|_F^2 - \sum_i H_{ii}^2\right)$

Proof. Assuming $r$ follows a Rademacher distribution, its first four moments are $m = (0, 1, 0, 1)$. As such, we can compute the variance of the quadratic form:

$\mathbb{V}[r_i^T H r_i] = (m_4 - 3m_2^2)\sum_i H_{ii}^2 + m_2^2\left(\mathrm{Tr}(H)^2 + 2\,\mathrm{Tr}(H^2)\right) - \mathbb{E}[r_i^T H r_i]^2$
$= -2\sum_i H_{ii}^2 + \mathrm{Tr}(H)^2 + 2\,\mathrm{Tr}(H^2) - \mathrm{Tr}(H)^2$
$= 2\left(\|H\|_F^2 - \sum_i H_{ii}^2\right),$

using $\mathbb{E}[r_i^T H r_i] = \mathrm{Tr}(H)$ and, for the symmetric Hessian, $\mathrm{Tr}(H^2) = \|H\|_F^2$.

H ENVIRONMENTAL ANALYSIS

During the course of this research, we estimate our total emissions to be 10.8 kg CO2eq. O(100) hours of computation was performed on an RTX 2080 Ti (TDP of 250 W), using a private infrastructure which has a carbon efficiency of 0.432 kg CO2eq/kWh. Estimations were conducted using the Machine Learning Impact calculator presented by Lacoste et al. (2019).