# Streamlining Prediction in Bayesian Deep Learning

Published as a conference paper at ICLR 2025

Rui Li, Marcus Klasson, Arno Solin, Martin Trapp
Department of Computer Science, Aalto University, Finland
{firstname.lastname}@aalto.fi

The rising interest in Bayesian deep learning (BDL) has led to a plethora of methods for estimating the posterior distribution. However, efficient computation of inferences, such as predictions, has been largely overlooked, with Monte Carlo integration remaining the standard. In this work we examine streamlining prediction in BDL through a single forward pass without sampling. For this, we use local linearisation of activation functions and local Gaussian approximations at linear layers, which allows us to analytically compute an approximation of the posterior predictive distribution. We showcase our approach for both MLPs and transformers, such as ViT and GPT-2, and assess its performance on regression and classification tasks. Open-source library: https://github.com/AaltoML/SUQ.

1 INTRODUCTION

Recent progress and adoption of deep learning models have led to a sharp increase in interest in improving their reliability and robustness. In applications such as aided medical diagnosis (Begoli et al., 2019), autonomous driving (Michelmore et al., 2020), or supporting scientific discovery (Psaros et al., 2023), providing reliable and robust predictions as well as identifying failure modes is vital. A principled approach to address these challenges is the use of Bayesian deep learning (BDL, Wilson & Izmailov, 2020; Papamarkou et al., 2024), which promises a plug & play framework for uncertainty quantification. However, while plugging the Bayesian approach into deep learning is relatively straightforward (Blundell et al., 2015; Gal & Ghahramani, 2016; Wu et al., 2019), the play part is typically severely hampered by computational and practical challenges (Wenzel et al., 2020; Foong et al., 2020; Gelberg et al., 2024; Coker et al., 2022; Kristiadi et al., 2023).

The key challenges associated with BDL can roughly be divided into three parts: (i) defining a meaningful prior, (ii) estimating the posterior distribution, and (iii) performing inferences of interest, e.g., making predictions for unseen data, detecting out-of-distribution settings, or analysing model sensitivities. While constructing a meaningful prior is an important research direction (Nalisnick, 2018; Meronen et al., 2021; Fortuin et al., 2021; Tran et al., 2022), it has been argued that the differentiating aspect of Bayesian deep learning is marginalisation (Wilson & Izmailov, 2020; Wilson, 2020) rather than the prior itself. Hence, estimating the posterior distribution has seen significant progress in recent years (Blundell et al., 2015; Maddox et al., 2019; Daxberger et al., 2021a), with a particular focus on post-hoc approximations (Kristiadi et al., 2020; Daxberger et al., 2021b).

Figure 1: Our streamlined approach allows for practical outlier detection (out-of-domain inputs) and input sensitivity analysis. Locally linearising the network function with local Gaussian approximations enables many relevant inference tasks to be solved analytically, helping render BDL a practical tool for downstream tasks.
However, while these approaches have shown promise in making BDL useful for real-world applications, they tackle only part of the computational and practical challenges associated with BDL. In this work, we focus on streamlining prediction in BDL for downstream tasks by providing a straightforward and effective method to compute inferences of interest, cf., Fig. 1. For this, we make the neural network locally linear with respect to the inputs. Thus, inferences, such as computing predictions, admit a closed-form solution and can be estimated efficiently. In particular, we propose using local linearisation of non-linear activation functions at every layer of the network and local Gaussian approximations at linear layers. Empirically, we find that local linearisation combined with Gaussian approximation of Bayesian neural networks provides accurate predictions, with useful predictive uncertainties, while being conceptually simple. Moreover, complex inference tasks w.r.t. the inputs, such as analysing model sensitivities to input perturbations, can be computed efficiently. Thus allowing us to truly account for all sources of uncertainties. Contributions: (i) We propose layer-wise local linearisation and local Gaussian approximations of neural networks to streamline BDL for downstream tasks (Sec. 3). (ii) We discuss how to handle different covariance structures and architecture choices (Sec. 3.2 & Sec. 3.3). (iii) Finally, we present an empirical assessment of our approach on regression and classification tasks, and showcase its utility for uncertainty quantification, out-of-domain detection, and sensitivity analysis (Sec. 4). 2 RELATED WORK To estimate the posterior in BDL, variational inference (VI, Blei et al., 2017; Zhang et al., 2018) utilises a variational approximation to the true posterior distribution and minimises a divergence measure between both distributions. A typical choice for the variational family is a factorised Gaussian distribution, chosen for computational reasons. Early works on mean-field VI (MFVI) and related approaches require modifications of the model structure (Blundell et al., 2015) to perform a reparametrisation of the variational distribution. Recent work by Shen et al. (2024) developed an optimiser to ease the use of MFVI, and has shown good performance on large-scale models such as Res Nets (He et al., 2016) and GPT-2 (Radford et al., 2019). However, VI-based methods typically require Monte Carlo estimation to perform inferences, which can be problematic in practice due to additional computational overhead. A recent trend in BDL are post-hoc methods, such as the Laplace approximation (LA, Mac Kay, 1992a), which can be applied directly on the trained model without modification (Kristiadi et al., 2020; Daxberger et al., 2021a). Daxberger et al. (2021b) extended the applicability of LAs by showing that treating a subset of parameters Bayesian can still give good predictive uncertainties. Moreover, Immer et al. (2021b) proposed the linearised LA by performing a global linearisation, which is principled under the Generalised Gauss Newton approximation to the Hessian, and has shown promise in providing useful predictive uncertainties. Recent works applied post-hoc methods in various applications, such as large language models (Yang et al., 2024; Kampen et al., 2024), vision-language models (Baumann et al., 2024), dynamic neural networks (Meronen et al., 2024), and sequential learning (Scannell et al., 2024). 
In addition, various tailored ensemble-based methods for BDL have been proposed, such as Monte Carlo dropout (Gal & Ghahramani, 2016), deep ensembles (Lakshminarayanan et al., 2017), and stochastic weight averaging-Gaussian (Maddox et al., 2019). While some works on deep ensembles enable estimating the predictive distribution in a single forward pass (Eschenhagen et al., 2021; Havasi et al., 2021), most methods typically require multiple forward passes to estimate the predictive distribution and do not explicate an approximation to the posterior distribution.

More recently, there has been a trend towards exploring deterministic computations in BDL to avoid the need for sampling (Goulet et al., 2021; Giordano et al., 2024; Burroni et al., 2024). In particular, Wu et al. (2019) derived an analytical training objective for VI by using moment-matching at each layer of the network. However, the solutions to the moment-matching have to be derived manually for each type of activation function, making the approach cumbersome in practice. Goulet et al. (2021) proposed local linearisation of the network to perform message passing under a mean-field assumption. Moreover, Petersen et al. (2024) used a local linearisation of the network to propagate aleatoric uncertainties over the input through a deterministic network. In addition, Dhawan et al. (2023) investigated local linearisations of activation functions to estimate the function space distance of two neural networks, for example, relevant in continual learning settings. In contrast, our work disentangles the approximation of the posterior distribution and the computation of inferences w.r.t. the posterior distribution. Hence, it provides a streamlined framework to propagate all forms of uncertainties through Bayesian neural networks.

Figure 2: Illustration of our approach for different network architectures: (a) multi-layer perceptron (MLP), (b) attention layer. In MLPs, we can directly apply local Gaussian approximations and local linearisation of each layer. The distribution over activations is then propagated to the next layer. In attention layers, we treat the query Q and key K deterministically and only treat the value V as a random quantity, resulting in a straightforward propagation path. The resulting distribution is then propagated to the subsequent MLP layer.

In Bayesian deep learning (BDL), predicting the output $y$ (e.g., class label, regression value) for an input $x \in \mathcal{X}$ is performed by marginalising out the model parameters $\theta$ of the neural network $f_\theta(\cdot)$ instead of trusting a single point estimate, i.e.,

$$p(y \mid x, \mathcal{D}) = \int p(y \mid f_\theta(x))\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta, \quad (1)$$

where $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ denotes the training data and the posterior distribution $p(\theta \mid \mathcal{D}) = p(\theta, \mathcal{D})/p(\mathcal{D})$ is given by Bayes' rule. However, for most neural networks, integrating over the high-dimensional parameter space is intractable, necessitating the use of approximations to compute the posterior distribution $p(\theta \mid \mathcal{D})$ and the posterior predictive distribution $p(y \mid x, \mathcal{D})$. Recently, much progress has been made in efficiently approximating the posterior distribution for BDL, including scaling mean-field variational inference (Shen et al., 2024) to large-scale models and performing post-hoc estimation using Laplace approximations (Daxberger et al., 2021a).
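To make the contrast with the streamlined, sampling-free approach concrete, the following is a minimal sketch of the Monte Carlo approximation of Eq. (1) that most of the approaches above rely on at prediction time. The diagonal Gaussian posterior, the helper name `mc_predictive`, and the toy classifier are illustrative assumptions, not a specific library's API.

```python
# Baseline for Eq. (1): Monte Carlo integration of the posterior predictive with
# S parameter samples from an (assumed) diagonal Gaussian approximate posterior q.
import torch

@torch.no_grad()
def mc_predictive(model, params_mean, params_std, x, n_samples=50):
    """Average softmax predictions over parameter samples theta ~ q(theta)."""
    probs = []
    for _ in range(n_samples):
        # Sample a full parameter set from the diagonal Gaussian q.
        theta = {k: params_mean[k] + params_std[k] * torch.randn_like(params_mean[k])
                 for k in params_mean}
        logits = torch.func.functional_call(model, theta, (x,))
        probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0)  # MC estimate of p(y | x, D)

# Usage with a small classifier and a posterior given by per-parameter means/stds.
model = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(), torch.nn.Linear(16, 3))
params_mean = {k: v.detach().clone() for k, v in model.named_parameters()}
params_std = {k: 0.05 * torch.ones_like(v) for k, v in params_mean.items()}
print(mc_predictive(model, params_mean, params_std, torch.randn(2, 4)))
```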
A common thread is using a tractable distribution $q$ to approximate the posterior distribution, $q(\theta) \approx p(\theta \mid \mathcal{D})$, commonly chosen as a Gaussian distribution. Consequently, the posterior predictive distribution is typically approximated using Monte Carlo integration, i.e., by sampling from $q$ to estimate the integral in Eq. (1), with the exception of the linearised Laplace approximation (Immer et al., 2021b). However, while using a Gaussian approximation facilitates efficient computation of the approximate posterior distribution, sampling from the high-dimensional Gaussian approximation can be challenging (Vono et al., 2022) and result in high computational overhead.

We will now shift our focus to estimating integrals of the form of Eq. (1) and assume that an approximation to the posterior distribution $q(\theta)$ is given. Further, we will assume that $q$ is in the family of stable distributions. Note that a linear combination of two independent random variables following a stable distribution again follows a distribution from the same family. The Gaussian distribution is a typical example of a stable distribution. Marginalisation tasks such as in Eq. (1) appear in many scenarios, e.g., active learning (MacKay, 1992a; Gal et al., 2017; Smith et al., 2023), model selection (Immer et al., 2021a; MacKay, 1996), or outlier detection (Wilson & Izmailov, 2020), and pose a recurring challenge in downstream applications of BDL.

3.1 STREAMLINING COMPUTATIONS WITH LOCAL APPROXIMATIONS

Let the weights and biases of the $m$th linear layer of the network $f$ be denoted as $W^{(m)} \in \mathbb{R}^{D_\text{out} \times D_\text{in}}$ and $b^{(m)} \in \mathbb{R}^{D_\text{out}}$, respectively. Then the pre-activation $h^{(m)}$ is given as $h^{(m)} = W^{(m)} a^{(m-1)} + b^{(m)}$, where $a^{(m-1)} \in \mathbb{R}^{D_\text{in}}$ is the activation of the previous layer. In case $m = 1$, $a^{(0)}$ corresponds to the input $x$. We further denote the $k$th element of $h^{(m)}$ as $h^{(m)}_k = \sum_{i=1}^{D_\text{in}} W^{(m)}_{ki} a^{(m-1)}_i + b^{(m)}_k$ and drop the superscript if it is clear from the context. Given an approximate posterior distribution $q(\theta)$ with $\theta = \{W^{(m)}, b^{(m)}\}_{m=1}^M$, we aim to compute the probability distribution of the activation $a^{(m)}$ of each layer $m$. For this, we need to estimate the distribution of the pre-activation $h^{(m)}$ and then compute an approximation to the activation $a^{(m)}$ after the application of a non-linear activation function $g(\cdot)$.

Approximating the pre-activation distribution  In case the activation $a^{(m-1)}$ is deterministically given, i.e., for the input layer, we can compute the distribution over pre-activations analytically as a consequence of the stability of stable distributions under linear transformations (Petersen et al., 2024). However, for hidden layers, the distribution over pre-activations is generally not of the same family as the posterior distribution (Wolinski & Arbel, 2022). Nevertheless, we will apply a local Gaussian approximation to the pre-activation at every hidden layer. Specifically, we make the assumption:

Assumption 3.1. The activations of the previous layer $a^{(m-1)}_i$ and the parameters of the $m$th layer are independent.
Then, applying a Gaussian approximation to each product $a^{(m-1)}_i W^{(m)}_{ki}$, the mean of the pre-activation $h^{(m)}$ is given as

$$\mathbb{E}\big[h^{(m)}\big] = \mathbb{E}\big[W^{(m)}\big]\,\mathbb{E}\big[a^{(m-1)}\big] + \mathbb{E}\big[b^{(m)}\big], \quad (2)$$

and the covariance between the $k$th and the $l$th hidden unit is computed as

$$\mathrm{Cov}\big[h^{(m)}_k, h^{(m)}_l\big] = \sum_{1 \le i,j \le D_\text{in}} \mathrm{Cov}\big[a^{(m-1)}_i W^{(m)}_{ki},\, a^{(m-1)}_j W^{(m)}_{lj}\big] + \mathrm{Cov}\big[b^{(m)}_k, b^{(m)}_l\big] + \sum_{1 \le i \le D_\text{in}} \mathbb{E}\big[a^{(m-1)}_i\big]\Big(\mathrm{Cov}\big[W^{(m)}_{ki}, b^{(m)}_l\big] + \mathrm{Cov}\big[W^{(m)}_{li}, b^{(m)}_k\big]\Big), \quad (3)$$

$$\mathrm{Cov}\big[a^{(m-1)}_i W^{(m)}_{ki},\, a^{(m-1)}_j W^{(m)}_{lj}\big] = \mathbb{E}\big[a^{(m-1)}_i\big]\,\mathbb{E}\big[a^{(m-1)}_j\big]\,\mathrm{Cov}\big[W^{(m)}_{ki}, W^{(m)}_{lj}\big] + \mathbb{E}\big[W^{(m)}_{ki}\big]\,\mathbb{E}\big[W^{(m)}_{lj}\big]\,\mathrm{Cov}\big[a^{(m-1)}_i, a^{(m-1)}_j\big] + \mathrm{Cov}\big[a^{(m-1)}_i, a^{(m-1)}_j\big]\,\mathrm{Cov}\big[W^{(m)}_{ki}, W^{(m)}_{lj}\big]. \quad (4)$$

A detailed derivation alongside an empirical evaluation of the approximation quality can be found in Apps. A.1 and A.2. Depending on the structure of the covariance matrix, we can further simplify the computation of the covariance matrix, which we will discuss in Sec. 3.3.

Approximating the activation distribution  Let $g(\cdot)$ denote a non-linear activation function computing $a = g(h)$ for a pre-activation $h$. Inspired by the application of local linearisation in Bayesian filtering (e.g., Särkkä & Svensson, 2023), we use a first-order Taylor expansion of $g(\cdot)$ at the mean of the pre-activation $\mathbb{E}[h]$. Specifically, we approximate $g(h)$ using

$$g(h) \approx g(\mathbb{E}[h]) + J_g\big|_{h=\mathbb{E}[h]}\,(h - \mathbb{E}[h]), \quad (5)$$

where $J_g\big|_{h=\mathbb{E}[h]}$ is the Jacobian of $g(\cdot)$ at $h = \mathbb{E}[h]$. Then, as stable distributions are closed under linear transformations, the distribution of $a$ can be computed analytically and, in the case of a Gaussian distribution, is given as

$$a \sim \mathcal{N}\Big(g(\mathbb{E}[h]),\; J_g\big|_{h=\mathbb{E}[h]}\,\Sigma_h\, J_g\big|_{h=\mathbb{E}[h]}^\top\Big). \quad (6)$$

Note that the quality of the local linearisation will depend on the scale of the distribution over the input $h$. For ReLU activation functions, Petersen et al. (2024) have shown that local linearisation provides the optimal Gaussian approximation of a univariate Gaussian distribution in total variation. For classification tasks, we employ a probit approximation (MacKay, 1992b; Kristiadi et al., 2020).

Intuition  One way to understand the resulting approximation is as a piecewise linear function (or multilinear function). Globally, the function will still be non-linear, but locally it will behave linearly. In contrast to the original model, which composes piecewise linear functions in the case of a ReLU network, our approximation composes linear functions locally. We obtain a piecewise linear function due to the local composition, which allows us to capture the non-linear nature of the model.
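To make the propagation concrete, the following is a minimal sketch of the mean-field special case of Eqs. (2), (5), and (6), using the diagonal-covariance variance rule detailed in Sec. 3.3 (Eq. (9)), for a single linear layer followed by a ReLU. The function names and toy dimensions are illustrative assumptions; this is a sketch, not the interface of the released SUQ library.

```python
# Minimal sketch: mean-field propagation of a Gaussian posterior through one
# linear layer (Eqs. 2 and 9) and a locally linearised ReLU (Eqs. 5-6),
# assuming a diagonal covariance and Assumption 3.1.
import torch

def linear_propagate(m_a, v_a, m_W, v_W, m_b, v_b):
    """Mean/variance of h = W a + b with independent Gaussian W, b, and a."""
    m_h = m_W @ m_a + m_b                                 # E[h], Eq. (2)
    v_h = v_W @ m_a**2 + m_W**2 @ v_a + v_W @ v_a + v_b   # Var[h], Eq. (9)
    return m_h, v_h

def relu_propagate(m_h, v_h):
    """Local linearisation of ReLU at E[h]; the Jacobian is diag(1[E[h] > 0])."""
    jac = (m_h > 0).to(m_h.dtype)
    return torch.relu(m_h), jac**2 * v_h

# Toy usage: a 4 -> 3 layer whose approximate posterior is given by means/variances.
D_in, D_out = 4, 3
m_W, v_W = torch.randn(D_out, D_in), 0.01 * torch.ones(D_out, D_in)
m_b, v_b = torch.zeros(D_out), 0.01 * torch.ones(D_out)
m_a, v_a = torch.randn(D_in), torch.zeros(D_in)   # deterministic input: zero variance
m_h, v_h = linear_propagate(m_a, v_a, m_W, v_W, m_b, v_b)
m_out, v_out = relu_propagate(m_h, v_h)
print(m_out, v_out)
```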
3.2 ARCHITECTURE CHOICES

By combining local Gaussian approximations for linear layers and local linearisation for non-linear activation functions, we can analytically compute the distribution over activations at each layer in a single forward pass. In the case of a multi-layer perceptron (MLP) and common architecture choices, the described approach can be directly applied to each layer of the network. However, further considerations are needed to streamline the computation path for more complex architectures such as attention. Fig. 2 illustrates the computation path for MLPs and attention layers.

Attention layers  Each block in a transformer (Vaswani et al., 2017) consists of multi-head attention, an MLP, layer normalisation, and a residual connection. For the MLP part, the propagation is the same as previously described. Further, layer normalisation is a linear transformation, and the resulting distribution can be obtained analytically. For residual connections, assuming independence between the two branches, the resulting distribution can also be obtained analytically. Treating the multi-head attention block is more involved, as the softmax activation function squashes the distribution of the pre-activations. We describe our method below; further details are in App. A.6.

Given an input $H \in \mathbb{R}^{T \times D}$, where $T$ is the number of tokens in the input sequence and $D$ is the dimension of each token, denote the query, key, and value weight matrices as $W_Q \in \mathbb{R}^{D \times D}$, $W_K \in \mathbb{R}^{D \times D}$, and $W_V \in \mathbb{R}^{D \times D}$, respectively. Further, we denote the query, key, and value in an attention block as $Q = H W_Q$, $K = H W_K$, and $V = H W_V$. Then the output of the attention layer is given as

$$\mathrm{Attention}(H) = \mathrm{Softmax}\big(QK^\top / \sqrt{D}\big)\, V.$$

For computational reasons, we will assume the input distribution to the multi-head attention block has a diagonal covariance structure. Pushing random vectors through a softmax activation may require further approximations and will not result in an output with a distribution close to a Gaussian distribution. Hence, we treat the query and key matrices as deterministically given. A possible remedy is to leverage an approximation to the softmax function such as Lu et al. (2021). Consequently, the attention output is computed as

$$\mathrm{Attention}(H) = \mathrm{Softmax}\big(\mathbb{E}[H]\,\mathbb{E}[W_Q]\,(\mathbb{E}[H]\,\mathbb{E}[W_K])^\top / \sqrt{D}\big)\, V,$$

where $V$ follows a stable distribution. Due to linearity, the resulting distribution can again be obtained analytically.

3.3 COVARIANCE STRUCTURE

Computing the full covariance of the posterior is usually infeasible due to high computational and memory cost. We describe our method for the most common covariance approximations and briefly discuss the computational cost for full and diagonal covariance structures.

Full covariance  When the posterior has full covariance, for the $m$th linear layer the computational complexity of computing $\mathrm{Cov}[h_k, h_l]$ is $\mathcal{O}([D^{(m)}_\text{in}]^2)$. Consequently, computing the covariance of the activations for the $m$th layer adds up to $\mathcal{O}([D^{(m)}_\text{out}]^2[D^{(m)}_\text{in}]^2)$. Computing the local linearisation for element-wise activation functions results in a complexity of $\mathcal{O}([D^{(m)}_\text{out}]^2)$. Hence, we obtain a total cost of $\mathcal{O}\big(\sum_{m=1}^M [D^{(m)}_\text{out}]^2[D^{(m)}_\text{in}]^2 + [D^{(m)}_\text{out}]^2\big)$ for a network with $M$ layers. As the computational cost is directly linked to the number of parameters and their correlation structure, a natural way to reduce the computational cost is to either exploit the structure in the covariance matrix or consider only a subset of parameters, in the spirit of subnetwork Laplace (Daxberger et al., 2021b). We will focus on exploiting the structure of the covariance, as using a subset of parameters trivially extends from our discussion.

Figure 3: To retrieve the highlighted submatrix $\mathrm{Cov}[W[1, :], W[2, :]]$ of the covariance for $W \in \mathbb{R}^{2 \times 3}$, we identify the Kronecker blocks that contain the covariance of interest (II, III, V, and VI), explicate those blocks in memory, and then retrieve the relevant submatrix.
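The following is a minimal sketch of the block retrieval illustrated in Fig. 3, under the Kronecker-factored covariance discussed in the next subsection (Eq. (10)): it reconstructs Cov[W[k, :], W[l, :]] from the eigendecompositions of the two factors without materialising the full covariance matrix. The factor roles (one factor over inputs, one over outputs), the damping, and the vec-ordering are assumptions that must match the KFAC implementation in use; this is not the authors' released code.

```python
# Minimal sketch: retrieving Cov[W[k,:], W[l,:]] from a Kronecker-factored
# covariance (A kron G + lam^2 I)^{-1} without building the full matrix.
import torch

def kfac_row_cov(A, G, lam, k, l):
    """Din x Din block Cov[W[k,:], W[l,:]] of (A kron G + lam^2 I)^{-1}."""
    a_eigs, U_A = torch.linalg.eigh(A)          # A = U_A diag(a_eigs) U_A^T (input factor)
    g_eigs, U_G = torch.linalg.eigh(G)          # G = U_G diag(g_eigs) U_G^T (output factor)
    inv_d = 1.0 / (torch.outer(a_eigs, g_eigs) + lam**2)   # damped joint spectrum
    s = inv_d @ (U_G[k] * U_G[l])               # contract the output-side modes
    return (U_A * s) @ U_A.T                    # contract the input-side modes

# Toy check against the dense inverse for a 3-input, 2-output layer,
# assuming vec-index (i, k) -> i * Dout + k.
Din, Dout, lam = 3, 2, 0.1
A = torch.randn(Din, Din); A = A @ A.T + torch.eye(Din)
G = torch.randn(Dout, Dout); G = G @ G.T + torch.eye(Dout)
Sigma = torch.linalg.inv(torch.kron(A, G) + lam**2 * torch.eye(Din * Dout))
k, l = 0, 1
block = kfac_row_cov(A, G, lam, k, l)
dense = Sigma[k::Dout, :][:, l::Dout]           # same block read from the dense inverse
print(torch.allclose(block, dense, atol=1e-4))
```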
Diagonal approximation  In case the correlations between model parameters are ignored, as in mean-field variational inference, the computation of the pre-activation covariance reduces to

$$\mathrm{Cov}\big[h^{(m)}_k, h^{(m)}_l\big] = \sum_{1 \le i,j \le D_\text{in}} \mathbb{E}\big[W^{(m)}_{ki}\big]\,\mathbb{E}\big[W^{(m)}_{lj}\big]\,\mathrm{Cov}\big[a^{(m-1)}_i, a^{(m-1)}_j\big], \quad (8)$$

and the variance of the $k$th pre-activation is given as

$$\mathrm{Var}\big[h^{(m)}_k\big] = \sum_{1 \le i \le D_\text{in}} \Big( \mathbb{E}\big[a^{(m-1)}_i\big]^2\,\mathrm{Var}\big[W^{(m)}_{ki}\big] + \mathbb{E}\big[W^{(m)}_{ki}\big]^2\,\mathrm{Var}\big[a^{(m-1)}_i\big] + \mathrm{Var}\big[a^{(m-1)}_i\big]\,\mathrm{Var}\big[W^{(m)}_{ki}\big] \Big) + \mathrm{Var}\big[b^{(m)}_k\big]. \quad (9)$$

Hence, assuming a diagonal covariance structure can help in reducing the computational burden. If only the variance of the layer output is of interest, the computational cost can be further reduced and adds up to a total of $\mathcal{O}\big(\sum_{m=1}^M D^{(m)}_\text{out} D^{(m)}_\text{in} + D^{(m)}_\text{out}\big)$. Further details are given in App. A.3. An empirical run time analysis indicating little to no overhead is given in App. B.7.

Kronecker factorisation (KFAC)  Another common choice for approximating the posterior covariance is the use of a Kronecker factorisation (KFAC) (Martens & Grosse, 2015), popularised in the context of Laplace approximations (Ritter et al., 2018). In this case, the posterior covariance $\Sigma$ is given by a Kronecker product of two factors, i.e., $\Sigma = (A \otimes B + \lambda^2 I)^{-1}$, where $\otimes$ denotes the Kronecker product and $\lambda^2 I$ is a prior precision. Note that in the case of a non-zero prior precision, the covariance cannot be expressed as a single Kronecker product. Denoting the eigenvectors and eigenvalues of $A$ as $U_A$ and $\Lambda_A$, respectively (and analogously for $B$), we approximate $\Sigma$ as follows:

$$\Sigma = (A \otimes B + \lambda^2 I)^{-1} = \big((U_A \Lambda_A U_A^\top) \otimes (U_B \Lambda_B U_B^\top) + \lambda^2 I\big)^{-1} \approx (U_A \otimes U_B)\big((\Lambda_A + \lambda I_A) \otimes (\Lambda_B + \lambda I_B)\big)^{-1}(U_A \otimes U_B)^\top. \quad (10)$$

To compute the covariance of the pre-activations, we need to retrieve the covariance between the weights of the $k$th unit and the weights of the $l$th unit, which corresponds to the $k$th and $l$th rows of $W$, i.e., $\mathrm{Cov}[W[k, :], W[l, :]]$. In the case of a KFAC Laplace approximation, $\mathrm{Cov}[W[k, :], W[l, :]]$ cannot be accessed directly. Therefore, we developed a block retrieval method that obtains $\mathrm{Cov}[W[k, :], W[l, :]]$ without explicating the full covariance matrix in memory. The key idea is to first identify the Kronecker blocks that contain the covariance of interest and then retrieve the submatrix by reconstructing only the relevant blocks. Fig. 3 illustrates the idea in a toy example for a weight matrix $W \in \mathbb{R}^{2 \times 3}$ and a submatrix of interest $\mathrm{Cov}[W[1, :], W[2, :]]$. Our method only reconstructs blocks containing the sub-covariance of interest (II, III, V, and VI) and then retrieves the relevant submatrix. Further details are given in App. A.4.

4 EXPERIMENTS

We demonstrate the practical applicability of our approach on classification/regression tasks (Sec. 4.1), large-scale classification with ViT/GPT models (Sec. 4.2), and sensitivity estimation (Sec. 4.3). Additional experiments and results can be found in App. B.

Data sets  We use a selection of data sets from the UCI repository (Kelly et al., 2023) for the regression experiments. For classification, we experiment on MNIST (LeCun et al., 1998), FMNIST (Xiao et al.,
To assess our method on higher-dimensional settings, we experiment with CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), DTD (Cimpoi et al., 2014), RESISC (Cheng et al., 2017) and a subsampled version of Image Net-R (Hendrycks et al., 2021) with 100 classes to reduce the memory overhead for the LA. For the GPT model, we used the BOOLQ, WIC, and MRPC tasks from GLUE (Wang et al., 2019b) and Super GLUE (Wang et al., 2019a) benchmarks. Posterior approximations We adopt the Laplace approximation (LA, Mac Kay (1996)) and meanfield variational inference (MFVI, Blei et al., 2017) for approximating the posterior distribution of the network parameters. For the LA, we estimate the full covariance for the regression experiments, while we use diagonal or KFAC approximations for the covariance where applicable in the classification experiments. We compare our method using local Gaussian approximation and local linearisation against Monte Carlo (MC) sampling and a global linearised model (GLM, Immer et al., 2021b). For MFVI, we adopt the IVON optimiser (Shen et al., 2024) to obtain the posterior approximation with a diagonal covariance structure by default, which has been shown to be effective and scalable to large-scale classification tasks. Here, we compare our method against MC sampling from the posterior to make predictions as done in Shen et al. (2024). For the MFVI and LA sampling baselines, we used 1, 000 MC samples in the regression and classification experiments in Sec. 4.1, and 50 MC samples for the Vi T and GPT-2 in Sec. 4.2. For our method, we fit an additional scaling factor on the predictive variance by minimising the NLPD on a validation set, similar to the pseudo-count used in Ritter et al. (2018). Network architectures We experiment with one or two-layer multi-layer perceptron (MLP) on the UCI regression data sets with details given in App. B.1. For MNIST, FMNIST, Organ CMNIST and Organ SMNIST, we use an MLP with layers containing 784 128 64 C neurons, where C is the number of classes. For CIFAR-10/100, DTD, RESISC and Image Net-R, we fine-tune a Vision Transformer (Vi T) (Dosovitskiy et al., 2021) base model pre-trained on Image Net-1k (Deng et al., 2009). For the GPT model, we use the pre-trained GPT-2 base model from Hugging Face Transformers (Wolf et al., 2019) and fine-tune it on the respective tasks. Evaluation metrics For the regression experiments, we measure the negative log predictive density (NLPD) and root-mean-square error (RMSE) for each method. In the classification experiments, we use accuracy (ACC), NLPD, and expected calibration error (ECE) to compare the methods. We use a paired t-test with p = 0.05 to bold results with significant statistical differences when reporting the results. For assessing out-of-distribution (OOD) robustness, we use a Gaussian kernel density estimator with a variance of 0.25 on the histogram of the predictive entropy evaluated on the test set. 4.1 DOES OUR METHOD PROVIDE USEFUL UNCERTAINTY ESTIMATES? Regression We experiment on a selection of data sets from the UCI repository and run a 5-fold cross validation to report results for each data set. We use either MFVI or LA to obtain the posterior approximation and separately compare our method against their corresponding prediction approaches. Table 1 shows our method achieves better NLPD in general than the predictions with sampling for both MFVI and LA. Moreover, our method performs on par with the GLM, even though our method results in a locally linearised network w.r.t. the inputs. 
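As a concrete example of the calibration step described in the setup above (fitting a scaling factor on the predictive variance by minimising the NLPD on a validation set), the following is a minimal sketch for the regression case. The grid range and helper names are illustrative assumptions, not the released implementation.

```python
# Minimal sketch: grid-search a scalar s that rescales the predictive variance,
# selecting the value minimising the Gaussian NLPD on held-out validation data.
import torch

def gaussian_nlpd(y, mean, var):
    """Average negative log predictive density of y under N(mean, var)."""
    return 0.5 * (torch.log(2 * torch.pi * var) + (y - mean) ** 2 / var).mean()

def fit_variance_scale(y_val, mean_val, var_val, grid=None):
    """Pick the scale s minimising the NLPD of N(mean, s * var) on validation data."""
    grid = grid if grid is not None else torch.logspace(-2, 2, 101)
    nlpds = torch.stack([gaussian_nlpd(y_val, mean_val, s * var_val) for s in grid])
    return grid[torch.argmin(nlpds)]

# Usage: mean_val / var_val would come from the streamlined forward pass on held-out data.
y_val = torch.randn(128)
mean_val, var_val = y_val + 0.1 * torch.randn(128), 0.5 * torch.ones(128)
print(fit_variance_scale(y_val, mean_val, var_val))
```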
Similar conclusions are made inspecting Table 8. Table 1: Negative log predictive density on UCI regression data sets. Ours results in better or matching performance compared with sampling and GLM, indicating the effectiveness of our method. MFVI (Diagonal Covariance) Laplace Approximation (Full Covariance) (n, d) Sampling Ours Sampling GLM Ours SERVO (167, 4) 1.287 0.069 1.136 0.182 3.795 0.110 1.047 0.172 1.443 0.077 LD (345, 5) 1.346 0.280 1.369 0.440 2.221 0.110 1.495 0.580 1.474 0.648 AM (398, 7) 1.004 0.052 0.807 0.087 1.812 0.065 0.492 0.279 0.478 0.309 REV (414, 6) 1.076 0.059 0.925 0.091 1.932 0.045 0.859 0.129 0.833 0.156 FF (517, 12) 2.160 3.003 2.333 3.671 2.086 0.292 1.584 0.950 1.596 1.217 ITT (1020, 33) 0.937 0.047 0.841 0.065 1.681 0.069 0.825 0.095 0.756 0.164 CCS (1030, 8) 0.939 0.068 0.828 0.108 1.612 0.048 0.319 0.109 0.234 0.161 ASN (1503, 5) 0.962 0.054 0.899 0.065 1.788 0.045 0.422 0.109 0.396 0.133 CAC (1994, 127) 0.973 0.092 0.920 0.118 1.848 0.055 1.281 0.069 2.662 1.096 PT (5875, 19) 0.976 0.069 0.940 0.074 0.984 0.101 0.576 0.181 0.651 0.306 CCPP (9568, 4) 0.365 0.040 0.352 0.042 1.345 0.085 0.062 0.182 0.062 0.200 Bold Count 3/11 11/11 0/11 7/11 8/11 Published as a conference paper at ICLR 2025 Table 2: Performance metrics on the MNIST-like data sets for each method with the standard error for ACC and NLPD. Our method achieves better NLPD and ECE than the baselines. Metrics Methods MNIST FMNIST ORGANCMNIST ORGANSMNIST LA Sampling 0.972 0.002 0.868 0.004 0.734 0.008 0.591 0.010 LA GLM 0.975 0.002 0.882 0.003 0.824 0.007 0.634 0.009 ACC LA Ours 0.975 0.002 0.881 0.003 0.826 0.008 0.630 0.009 MFVI Sampling 0.974 0.002 0.843 0.004 0.620 0.010 0.467 0.010 MFVI Ours 0.974 0.002 0.842 0.004 0.630 0.010 0.454 0.010 LA Sampling 0.210 0.003 0.556 0.008 1.135 0.017 1.614 0.021 LA GLM 0.089 0.004 0.548 0.018 0.875 0.44 1.967 0.070 NLPD LA Ours 0.089 0.005 0.397 0.010 0.710 0.021 1.365 0.025 MFVI Sampling 0.179 0.014 2.010 0.051 4.775 0.140 6.095 0.141 MFVI Ours 0.086 0.005 0.529 0.011 1.492 0.026 1.895 0.022 LA Sampling 0.122 0.151 0.292 0.261 LA GLM 0.009 0.078 0.038 0.170 ECE LA Ours 0.004 0.012 0.122 0.151 MFVI Sampling 0.018 0.144 0.251 0.385 MFVI Ours 0.003 0.013 0.145 0.067 Table 3: Performance comparison between our method and Moment-Matching (MM) on the MNISTlike classification tasks. We report the ACC and NLPD with standard errors and the ECE. Our method outperforms MM despite being simpler and more applicable to various distributions. Metrics Methods MNIST FMNIST ORGANCMNIST ORGANSMNIST MM 0.971 0.002 0.882 0.003 0.771 0.004 0.600 0.010 ACC Ours 0.975 0.002 0.881 0.003 0.816 0.004 0.614 0.010 MM 0.103 0.005 0.463 0.014 0.838 0.015 1.401 0.036 NLPD Ours 0.090 0.005 0.435 0.010 0.733 0.010 1.282 0.024 MM 0.014 0.051 0.023 0.110 ECE Ours 0.006 0.022 0.127 0.081 Classification Here, we assess our method on MNIST-like classification tasks. For the LA, we use KFAC approximation of the covariance to reduce the memory overhead. In Table 2, we report the ACC and NLPD with their standard errors and the ECE for each method. Our method achieves similar ACC with the baselines, while outperforming them on the NLPD and ECE metrics. In App. B.2, we assess our method on robustness to OOD data. We evaluate an MLP trained on MNIST on rotated versions of the test set. Our method consistently reduces overconfidence on OOD data, cf., Fig. 8. Ours vs. moment-matching To verify the viability of local linearisation, we compare our method against moment-matching (MM) used in Wu et al. 
(2019). We apply MM instead of local linearisation to our setting, assuming a diagonal posterior approximation from LA. In Table 3, we show the results for our method against MM on MNIST-like classification tasks. Our approach outperforms MM across the data sets and the metrics, except for the ECE on Organ CMNIST. Compared to MM, our method is applicable to any differentiable activation function and any type of stable distribution (Petersen et al., 2024), while MM requires tailored derivations for each case and, hence, is less plug-and-play. 4.2 IS OUR METHOD SCALABLE? We demonstrate that our method is applicable to large-scale networks by experimenting with pretrained Vi T and GPT-2 models. In particular, we experiment with applying our method on either the attention layer or the MLP after the attention layer in the last two (Vi T) / four (GPT) transformer blocks. For each target data set, we fine-tune the layers we obtain the posterior approximation for. We assume a diagonal posterior to reduce the memory overhead. Table 4 shows the results when fine-tuning Vi T models and obtaining the posterior approximation from the attention layers. We observe that our method achieves better or on par NLPD and ECE compared to the baselines for both LA and MFVI across all data sets while maintaining similar ACC as the baselines. In Table 5 we show the results for a GPT-2 model with LA on the MLP layers. We observe that our method systematically outperforms sampling, while achieving similar performance to GLM in some cases. Thus indicating that our method is applicable to different application domains. We present additional results for Vi T models in App. B. We also assess the robustness to out-of-distribution (OOD) data for our method and the baselines. In particular, we take the Vi T network fine-tuned on CIFAR-10 and evaluate its predictive entropy on Published as a conference paper at ICLR 2025 Table 4: Performance metrics using Vi T with posterior approximation on the attention layers with the standard error for ACC and NLPD. Our method achieves better NLPD and ECE in general and achieves similar ACC compared to the baselines. Metrics Methods CIFAR-10 CIFAR-100 DTD RESISC IMAGENET-R LA Sampling 0.971 0.002 0.882 0.003 0.715 0.010 0.892 0.004 0.731 0.012 LA GLM 0.976 0.002 0.879 0.003 0.718 0.010 0.891 0.004 0.739 0.012 ACC LA Ours 0.976 0.002 0.880 0.003 0.719 0.010 0.892 0.004 0.739 0.012 MFVI Sampling 0.975 0.002 0.880 0.003 0.732 0.010 0.867 0.004 0.730 0.012 MFVI Ours 0.975 0.002 0.880 0.003 0.734 0.010 0.867 0.004 0.727 0.012 LA Sampling 0.170 0.004 0.444 0.012 1.238 0.028 0.461 0.009 1.208 0.048 LA GLM 0.092 0.007 0.459 0.012 1.197 0.029 0.385 0.010 1.180 0.047 NLPD LA Ours 0.086 0.006 0.456 0.012 1.068 0.035 0.353 0.012 1.264 0.043 MFVI Sampling 0.133 0.011 0.641 0.022 1.091 0.048 1.010 0.041 1.577 0.083 MFVI Ours 0.087 0.006 0.468 0.012 1.006 0.035 0.619 0.019 1.234 0.052 LA Sampling 0.006 0.022 0.197 0.129 0.070 LA GLM 0.011 0.024 0.155 0.053 0.057 ECE LA Ours 0.009 0.027 0.042 0.017 0.130 MFVI Sampling 0.015 0.070 0.075 0.079 0.118 MFVI Ours 0.007 0.024 0.040 0.018 0.043 Table 5: Performance on language understanding tasks. 
| Metric | Method | BOOLQ | WIC | MRPC |
|---|---|---|---|---|
| ACC | LA Sampling | 0.383 ± 0.009 | 0.500 ± 0.020 | 0.335 ± 0.011 |
| | LA GLM | 0.698 ± 0.008 | 0.611 ± 0.019 | 0.710 ± 0.011 |
| | LA Ours | 0.641 ± 0.008 | 0.500 ± 0.020 | 0.665 ± 0.011 |
| NLPD | LA Sampling | 0.814 ± 0.006 | 0.858 ± 0.024 | 1.219 ± 0.018 |
| | LA GLM | 0.650 ± 0.014 | 0.730 ± 0.027 | 0.660 ± 0.023 |
| | LA Ours | 0.684 ± 0.001 | 0.719 ± 0.010 | 0.629 ± 0.009 |
| ECE | LA Sampling | 0.259 | 0.261 | 0.484 |
| | LA GLM | 0.086 | 0.126 | 0.145 |
| | LA Ours | 0.129 | 0.118 | 0.054 |

the SVHN data set (Netzer et al., 2011). Fig. 4 shows the kernel density of the predictive entropy computed on the test sets of CIFAR-10 and SVHN, where the model should have high entropy for data different from the training data. Although our method is slightly underconfident on the in-distribution data, the entropy for in-distribution and OOD data is clearly separated, especially for MFVI.

4.3 CAN OUR METHOD ESTIMATE INPUT SENSITIVITIES?

We demonstrate that our method can estimate sensitivities w.r.t. the inputs to the network. For this, we use a 3-class MLP trained on the digits 0/6/8. Our goal is to estimate sensitivity maps by assuming that the input images $x \sim \mathcal{N}(x, \Sigma)$ are distributed according to a Gaussian centred at the pixel values with diagonal covariance $\Sigma$. We optimise the input covariance of each image by minimising the loss

$$\ell = \sum_{n=1}^N \text{cross-entropy}(f(x_n), y_n) - \mathcal{H}(\mathcal{N}(x_n, \Sigma_n)). \quad (11)$$

In words, we jointly minimise the cross-entropy loss, after analytically propagating the input distribution through the network, while maximising the entropy $\mathcal{H}(\mathcal{N}(x_n, \Sigma_n))$ of the input distribution. The optimisation is stopped once the difference in NLPD between the current iteration and the initial condition is more than 0.1. Fig. 5 shows examples of the resulting sensitivity maps for a deterministic MLP (MAP) and the same MLP with last-layer LA (Bayes). We observe that the largest sensitivity for the digits 0 and 8 is generally in the middle, while for 6 it is in the upper right corner. The Bayes model shows less spurious sensitivities across the pixels compared to the MAP model, indicating that incorporating all sources of uncertainty can lead to a more interpretable sensitivity analysis.

5 DISCUSSION & CONCLUSION

In this work, we proposed to streamline prediction in Bayesian deep learning through local linearisation and local Gaussian approximations of the network. For this, we discussed the propagation in different neural network architectures and covariance structures. In particular, we discussed how to handle Kronecker-factorised posterior covariances and transformer architectures. We showed through a series of experiments that our method obtains high predictive performance, provides useful predictive uncertainties, and can be used for sensitivity analysis. Our method helps to make BDL more useful in practice and expands the use cases and sources of uncertainties that can be considered.

Figure 4: Kernel density plots over the predictive entropy from a ViT network fine-tuned on CIFAR-10 (blue, in-distribution) and data from SVHN (red, out-of-distribution). Our method results in a clear separation between the in- and out-of-distribution data.

Figure 5: Pixel sensitivity maps of an MLP trained on a subset of MNIST digits (classes 0/6/8). The rows show sensitivities to pixel perturbations for the MLP (MAP) and the MLP with last-layer Laplace approximation (Bayes), respectively. The sensitivities are visualised in the range (0.5, 1.0). The Bayes MLP shows less spurious sensitivities across the pixels compared to MAP.
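As a sketch of the sensitivity-map objective in Eq. (11) of Sec. 4.3, the snippet below optimises a diagonal input covariance for a single image under strong simplifying assumptions: a toy deterministic (MAP) MLP with random weights, input variance propagated with the local linearisation of Sec. 3, a common probit-style adjustment of the logits, and a fixed number of steps instead of the NLPD-based stopping rule. It illustrates the objective rather than reproducing the experiment; all names are illustrative.

```python
# Minimal sketch of Eq. (11): learn per-pixel input variances that minimise the
# cross-entropy of the propagated prediction while maximising the input entropy.
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W1, b1 = torch.randn(64, 784) * 0.05, torch.zeros(64)   # toy MAP weights (assumption)
W2, b2 = torch.randn(3, 64) * 0.05, torch.zeros(3)

def propagate(x_mean, x_var):
    """Push an input Gaussian through the MLP; return mean/variance of the logits."""
    m, v = W1 @ x_mean + b1, W1**2 @ x_var                 # linear layer (diagonal case)
    jac = (m > 0).to(m.dtype)                              # local linearisation of ReLU
    m, v = torch.relu(m), jac**2 * v
    return W2 @ m + b2, W2**2 @ v

x, y = torch.rand(784), torch.tensor(0)                    # one image and its label
log_var = torch.full((784,), -4.0, requires_grad=True)     # diagonal input log-variance
opt = torch.optim.Adam([log_var], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    logit_mean, logit_var = propagate(x, log_var.exp())
    # Probit-style approximation of the expected softmax (a common choice, assumed here).
    probs = F.softmax(logit_mean / torch.sqrt(1 + math.pi / 8 * logit_var), dim=-1)
    ce = F.nll_loss(torch.log(probs).unsqueeze(0), y.unsqueeze(0))
    entropy = 0.5 * (math.log(2 * math.pi * math.e) + log_var).sum()
    (ce - entropy).backward()                               # Eq. (11): CE minus input entropy
    opt.step()
sigma = log_var.exp().sqrt().detach()  # pixels that must keep low variance are the sensitive ones
```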
In future work, we aim to apply our approach to tasks with a larger number of output classes, explore additional use-case scenarios in which our streamlined approach can be beneficial, and scale to even larger networks. Moreover, we aim to investigate further the computational benefits obtained by exploiting the posterior covariance structure and sparsity in the network. Limitations The local linearisation of activation functions induces an error that depends on both the activation function and the location and scale of the distribution over the input to the activation function. Moreover, we assume independence between the activations and model parameters for the local Gaussian approximation in linear layers and residual connections, which may incur a loss of information in the propagation. Especially, the independence assumption in the residual block is potentially harmful, and relaxing it would be a valuable future direction. Further, it would be interesting to estimate the induced approximation error to identify potential failure modes. Finally, we assume access to a validation set for fitting the scaling factor of the predictive posterior distribution, which is currently done using a grid search. An interesting future step is the use of the marginal likelihood to optimise the scaling factor. ACKNOWLEDGMENTS AS and RL acknowledge funding from the Research Council of Finland (grant number 339730, 362408). MT acknowledges funding from the Research Council of Finland (grant number 347279). MK acknowledges funding from the Finnish Center for Artificial Intelligence (FCAI). We acknowledge CSC IT Center for Science, Finland, for awarding this project access to the LUMI supercomputer, owned by the Euro HPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through CSC. We acknowledge the computational resources provided by the Aalto Science-IT project. Lastly, we thank Jonas Vestergaard for finding a bug in our code and his feedback on the manuscript. Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, and Martin Trapp. Post-hoc probabilistic vision-language models. ar Xiv preprint ar Xiv:2412.06014, 2024. 2 Published as a conference paper at ICLR 2025 Edmon Begoli, Tanmoy Bhattacharya, and Dimitri Kusnezov. The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence, 1(1):20 23, 2019. 1 David M Blei, Alp Kucukelbir, and Jon D Mc Auliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859 877, 2017. 2, 7 Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the 32th International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pp. 1613 1622. PMLR, 2015. 1, 2 Javier Burroni, Justin Domke, and Daniel Sheldon. Sample average approximation for blackbox variational inference. In Proceedings of the 40th Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2024. 2 Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865 1883, 2017. 7 Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3606 3613. IEEE Computer Society, 2014. 
7 Beau Coker, Wessel P Bruinsma, David R Burt, Weiwei Pan, and Finale Doshi-Velez. Wide meanfield bayesian neural networks ignore the data. In Proceedings of the twenty fifth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 131 of Proceedings of Machine Learning Research, pp. 5276 5333. PMLR, 2022. 1 Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. Laplace redux effortless Bayesian deep learning. In Advances in Neural Information Processing Systems 34 (Neur IPS), volume 34, pp. 20089 20103. Curran Associates, Inc., 2021a. 2, 3, 30 Erik Daxberger, Eric Nalisnick, James U Allingham, Javier Antorán, and José Miguel Hernández Lobato. Bayesian deep learning via subnetwork inference. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pp. 2510 2521. PMLR, 2021b. 2, 5 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248 255. IEEE Computer Society, 2009. 7 Nikita Dhawan, Sicong Huang, Juhan Bae, and Roger Baker Grosse. Efficient parametric approximations of neural network function space distance. In Proceedings of the 40th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, pp. 7795 7812. PMLR, 2023. 3 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021. 7 Runa Eschenhagen, Erik Daxberger, Philipp Hennig, and Agustinus Kristiadi. Mixtures of Lapkace approximations for improved post-hoc uncertainty in deep learning. In Neur IPS workshop on Bayesian deep learning, 2021. 2 Andrew Foong, David Burt, Yingzhen Li, and Richard Turner. On the expressiveness of approximate inference in Bayesian neural networks. In Advances in Neural Information Processing Systems 33 (Neur IPS), pp. 15897 15908. Curran Associates, Inc., 2020. 1 Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W Ober, Florian Wenzel, Gunnar Rätsch, Richard E Turner, Mark van der Wilk, and Laurence Aitchison. Bayesian neural network priors revisited. In International Conference on Learning Representations (ICLR), 2021. 1 Published as a conference paper at ICLR 2025 Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33th International Conference on Machine Learning (ICML), volume 48 of Proceedings of Machine Learning Research, pp. 1050 1059. PMLR, 2016. 1, 2 Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, pp. 1183 1192. PMLR, 2017. 3 Yoav Gelberg, Tycho F. A. van der Ouderaa, Mark van der Wilk, and Yarin Gal. Variational inference failures under model symmetries: Permutation invariant posteriors for Bayesian neural networks. In ICML 2024 Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2024. 1 Ryan Giordano, Martin Ingram, and Tamara Broderick. 
Black box variational inference with a deterministic objective: Faster, more accurate, and even more black box. Journal of Machine Learning Research, 25(18):1 39, 2024. 2 James-A Goulet, Luong Ha Nguyen, and Saeid Amiri. Tractable approximate Gaussian inference for Bayesian neural networks. Journal of Machine Learning Research, 22(251):1 23, 2021. 2 Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew Mingbo Dai, and Dustin Tran. Training independent subnetworks for robust prediction. In International Conference on Learning Representations (ICLR), 2021. 2 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778. IEEE Computer Society, 2016. 2 Dan Hendrycks, Kevin Zhao, Sebastian Basart, Jacob Steinhardt, and Dawn Song. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8340 8349. IEEE, 2021. 7 Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Khan Mohammad Emtiyaz. Scalable marginal likelihood estimation for model selection in deep learning. In Proceedings of the 38th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, pp. 4563 4573. PMLR, 2021a. 3 Alexander Immer, Maciej Korzepa, and Matthias Bauer. Improving predictions of Bayesian neural nets via local linearization. In Proceedings of the twenty forth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 of Proceedings of Machine Learning Research, pp. 703 711. PMLR, 2021b. 2, 3, 7 Peter JT Kampen, Gustav RS Als, and Michael Riis Andersen. Towards scalable Bayesian transformers: Investigating stochastic subset selection for nlp. In Proceedings of the 40th Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2024. 2 Raymond Kan. From moments of sum to moments of product. Journal of Multivariate Analysis, 99 (3):542 554, 2008. 18 Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. The UCI machine learning repository, 2023. URL: https://archive.ics.uci.edu. 6 Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being bayesian, even just a bit, fixes overconfidence in relu networks. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pp. 5436 5446. PMLR, 2020. 2, 5 Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, and Vincent Fortuin. Promises and pitfalls of the linearized Lapkace in Bayesian optimization. In Fifth Symposium on Advances in Approximate Bayesian Inference, 2023. 1 Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Toronto, ON, Canada, 2009. 7 Published as a conference paper at ICLR 2025 Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30 (Neur IPS), volume 30, pp. 6402 6413. Curran Associates, Inc., 2017. 2 Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. 6 Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, and Li Zhang. 
Soft: Softmax-free transformer with linear complexity. In Advances in Neural Information Processing Systems 34 (Neur IPS), pp. 21297 21309. Curran Associates, Inc., 2021. 5 David JC Mac Kay. Information-based objective functions for active data selection. Neural Computation, 4(4):590 604, 1992a. 2, 3 David JC Mac Kay. Bayesian interpolation. Neural computation, 4(3):415 447, 1992b. 5 David JC Mac Kay. Bayesian methods for backpropagation networks. In Models of Neural Networks III: Association, Generalization, and Representation, pp. 211 254. Springer, 1996. 3, 7 Wesley J. Maddox, Pavel Izmailov, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems 32 (Neur IPS), pp. 13132 13143. Curran Associates, Inc., 2019. 2 James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, pp. 2408 2417. PMLR, 2015. 6 Lassi Meronen, Martin Trapp, and Arno Solin. Periodic activation functions induce stationarity. In Advances in Neural Information Processing Systems 34 (Neur IPS), pp. 1673 1685. Curran Associates, Inc., 2021. 1 Lassi Meronen, Martin Trapp, Andrea Pilzer, Le Yang, and Arno Solin. Fixing overconfidence in dynamic neural networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2680 2690, 2024. 2 Rhiannon Michelmore, Matthew Wicker, Luca Laurenti, Luca Cardelli, Yarin Gal, and Marta Kwiatkowska. Uncertainty quantification with statistical guarantees in end-to-end autonomous driving control. In IEEE International Conference on Robotics and Automation (ICRA), pp. 7344 7350. IEEE, 2020. 1 Saralees Nadarajah and Tibor K Pogány. On the distribution of the product of correlated normal random variables. Comptes Rendus. Mathématique, 354(2):201 204, 2016. 18 Eric Thomas Nalisnick. On priors for Bayesian neural networks. University of California, Irvine, 2018. 1 Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, 2011. 9 Theodore Papamarkou, Maria Skoularidou, Konstantina Palla, Laurence Aitchison, Julyan Arbel, David B. Dunson, Maurizio Filippone, Vincent Fortuin, Philipp Hennig, José Miguel Hernández Lobato, Aliaksandr Hubin, Alexander Immer, Theofanis Karaletsos, Mohammad Emtiyaz Khan, Agustinus Kristiadi, Yingzhen Li, Stephan Mandt, Christopher Nemeth, Michael A. Osborne, Tim G. J. Rudner, David Rügamer, Yee Whye Teh, Max Welling, Andrew Gordon Wilson, and Ruqi Zhang. Position: Bayesian deep learning is needed in the age of large-scale ai. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235 of Proceedings of Machine Learning Research. PMLR, 2024. 1 Felix Petersen, Aashwin Ananda Mishra, Hilde Kuehne, Christian Borgelt, Oliver Deussen, and Mikhail Yurochkin. Uncertainty quantification via stable distribution propagation. In International Conference on Learning Representations (ICLR), 2024. 3, 4, 8 Published as a conference paper at ICLR 2025 Apostolos F Psaros, Xuhui Meng, Zongren Zou, Ling Guo, and George Em Karniadakis. Uncertainty quantification in scientific machine learning: Methods, metrics, and comparisons. 
Journal of Computational Physics, 477:111902, 2023. 1 Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Open AI blog, 2019. 2 Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations (ICLR), 2018. 6, 7 Simo Särkkä and Lennart Svensson. Bayesian Filtering and Smoothing. Cambridge University Press, 2023. 4 Aidan Scannell, Riccardo Mereu, Paul Edmund Chang, Ella Tamir, Joni Pajarinen, and Arno Solin. Function-space parameterization of neural networks for sequential learning. In International Conference on Learning Representations (ICLR), 2024. 2 Yuesong Shen, Nico Daheim, Bai Cong, Peter Nickl, Gian Maria Marconi, Bazan Clement Emile Marcel Raoul, Rio Yokota, Iryna Gurevych, Daniel Cremers, Mohammad Emtiyaz Khan, and Thomas Möllenhoff. Variational learning is effective for large deep networks. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235 of Proceedings of Machine Learning Research. PMLR, 2024. 2, 3, 7, 30 Freddie Bickford Smith, Andreas Kirsch, Sebastian Farquhar, Yarin Gal, Adam Foster, and Tom Rainforth. Prediction-oriented bayesian active learning. In Proceedings of the twenty sixth International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, pp. 7331 7348. PMLR, 2023. 3 Ba-Hien Tran, Simone Rossi, Dimitrios Milios, and Maurizio Filippone. All you need is a good functional prior for Bayesian deep learning. Journal of Machine Learning Research, 23(74):1 56, 2022. 1 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (Neur IPS). Curran Associates, Inc., 2017. 5, 16, 21 Maxime Vono, Nicolas Dobigeon, and Pierre Chainais. High-dimensional Gaussian sampling: a review and a unifying approach based on a stochastic proximal point algorithm. SIAM Review, 64 (1):3 56, 2022. 3 Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems 32 (Neur IPS), pp. 3266 3280. Curran Associates, Inc., 2019a. 7, 25 Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR), 2019b. 7, 25 Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pp. 10248 10259. PMLR, 2020. 1 Andrew Gordon Wilson. The case for Bayesian deep learning. ar Xiv preprint ar Xiv:2001.10995, 2020. 2 Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. In Advances in Neural Information Processing Systems 33 (Neur IPS), pp. 4697 4708. Curran Associates, Inc., 2020. 
1, 2, 3 Published as a conference paper at ICLR 2025 Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface s transformers: State-of-the-art natural language processing. ar Xiv preprint ar Xiv:1910.03771, 2019. 7 Pierre Wolinski and Julyan Arbel. Gaussian pre-activations in neural networks: Myth or reality? ar Xiv preprint ar Xiv:2205.12379, 2022. 4 Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E. Turner, José Miguel Hernández-Lobato, and Alexander L. Gaunt. Deterministic variational inference for robust Bayesian neural networks. In International Conference on Learning Representations (ICLR), 2019. 1, 2, 8 Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. 6 Adam X Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison. Bayesian low-rank adaptation for large language models. In International Conference on Learning Representations (ICLR), 2024. 2 Jiancheng Yang, Rui Shi, Donglai Wei, Zeju Liu, Lin Zhao, Bilian Ke, Ziyang Shi, Yunzhu Li, Xiaoyang Hu, Yang Gao, Ye Xu, Daniel L. Rubin, and Holger R. Roth. Medmnist v2: A large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):1 14, 2023. doi: 10.1038/s41597-023-02552-1. 7 Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008 2026, 2018. 2 Published as a conference paper at ICLR 2025 The appendices are structured as follows: App. A presents the derivations of our method in detail. App. B describes the experimental setup and additional experimental results. A DERIVATIONS In this section, we derive how to propagate the distribution deterministically. See Table 6 for the list of notations that will be used throughout this section. We first derive the general result in App. A.1 where the posterior covariance has a full structure in the linear layer and evaluate the quality of local Gaussian approximation in App. A.2. Next, in App. A.3 and App. A.4 we give the derivation for diagonal and KFAC covariance, respectively. Then, App. A.5 shows the derivation for activation functions. Finally, App. A.6 describes how we apply our method in a transformer network (Vaswani et al., 2017). Table 6: Notation. x lowercase bolder letter, vector W uppercase bold letter, matrix D set xi ith element of x Wki kth row, ith column of W W [k, :] kth row of a matrix k, l dimension of the output i, j dimension of the input d data feature dimension n, N number of data points C total number of classes m layer index A.1 DERIVATION FOR FULL COVARIANCE STRUCTURE Denote the weight and bias of the mth linear layer as W (m) RDout Din and b(m) RDout respectively, and its input as a(m 1) RDin. The pre-activation is then given as h(m) = W (m)a(m 1) + b(m) with its kth element being h(m) k = PDin i=1 W (m) ki a(m 1) i + b(m) k . We make the following assumptions to obtain a tractable distribution on the pre-activation: Assumption 1: We assume each a(m 1) i W (m) ki is a Gaussian distribution. Assumption 2: We assume that the activations of the previous layer a(m 1) i and parameters of the mth layer are independent. From assumption 1, because now a(m 1) i W (m) ki and b(m) k are all Gaussian distributions, h(m) k will follow Gaussian distribution as well. 
We call this a local Gaussian approximation, as we approximate each local component $a^{(m-1)}_i W^{(m)}_{ki}$ with a Gaussian. Since each $h^{(m)}_k$ is Gaussian, $h^{(m)}$ is jointly Gaussian. We derive its mean and covariance below, dropping the layer index when it is clear from the context.

Derivation of mean. As $a_i$ is assumed to be independent of $W_{ki}$ (Assumption 2), we have
$$\mathbb{E}[h_k] = \mathbb{E}\Big[\sum_{i=1}^{D_{\mathrm{in}}} W_{ki} a_i + b_k\Big] = \sum_{i=1}^{D_{\mathrm{in}}} \mathbb{E}[W_{ki} a_i] + \mathbb{E}[b_k] = \sum_{i=1}^{D_{\mathrm{in}}} \mathbb{E}[W_{ki}]\,\mathbb{E}[a_i] + \mathbb{E}[b_k]. \quad (13\text{--}15)$$

Derivation of covariance. The covariance between the $k$th and $l$th pre-activations can be written as
$$\mathrm{Cov}[h_k, h_l] = \mathrm{Cov}\Big[\sum_{i=1}^{D_{\mathrm{in}}} a_i W_{ki} + b_k,\ \sum_{j=1}^{D_{\mathrm{in}}} a_j W_{lj} + b_l\Big] = \sum_{1 \le i,j \le D_{\mathrm{in}}} \mathrm{Cov}[a_i W_{ki}, a_j W_{lj}] + \sum_{1 \le i \le D_{\mathrm{in}}} \big(\mathrm{Cov}[a_i W_{ki}, b_l] + \mathrm{Cov}[a_i W_{li}, b_k]\big) + \mathrm{Cov}[b_k, b_l]. \quad (16\text{--}17)$$
We first derive the form of $\mathrm{Cov}[a_i W_{ki}, a_j W_{lj}]$:
$$\begin{aligned} \mathrm{Cov}[a_i W_{ki}, a_j W_{lj}] &= \mathbb{E}\big[(a_i W_{ki} - \mathbb{E}[a_i W_{ki}])(a_j W_{lj} - \mathbb{E}[a_j W_{lj}])\big] \\ &= \mathbb{E}[a_i a_j W_{ki} W_{lj}] - \mathbb{E}[a_i W_{ki}]\,\mathbb{E}[a_j W_{lj}] \\ &\approx \mathbb{E}[a_i a_j]\,\mathbb{E}[W_{ki} W_{lj}] - \mathbb{E}[a_i]\,\mathbb{E}[W_{ki}]\,\mathbb{E}[a_j]\,\mathbb{E}[W_{lj}] \quad (\text{Assumption 2}) \\ &= \big(\mathbb{E}[a_i]\mathbb{E}[a_j] + \mathrm{Cov}[a_i, a_j]\big)\big(\mathbb{E}[W_{ki}]\mathbb{E}[W_{lj}] + \mathrm{Cov}[W_{ki}, W_{lj}]\big) - \mathbb{E}[a_i]\mathbb{E}[W_{ki}]\mathbb{E}[a_j]\mathbb{E}[W_{lj}] \\ &= \mathbb{E}[a_i]\mathbb{E}[a_j]\,\mathrm{Cov}[W_{ki}, W_{lj}] + \mathbb{E}[W_{ki}]\mathbb{E}[W_{lj}]\,\mathrm{Cov}[a_i, a_j] + \mathrm{Cov}[a_i, a_j]\,\mathrm{Cov}[W_{ki}, W_{lj}]. \end{aligned} \quad (18\text{--}22)$$
Then we derive the form of $\mathrm{Cov}[a_i W_{ki}, b_l]$:
$$\begin{aligned} \mathrm{Cov}[a_i W_{ki}, b_l] &= \mathbb{E}\big[(a_i W_{ki} - \mathbb{E}[a_i W_{ki}])(b_l - \mathbb{E}[b_l])\big] \\ &\approx \mathbb{E}[a_i W_{ki} b_l] - \mathbb{E}[a_i]\,\mathbb{E}[W_{ki}]\,\mathbb{E}[b_l] \quad (\text{Assumption 2}) \\ &\approx \mathbb{E}[a_i]\,\mathbb{E}[W_{ki} b_l] - \mathbb{E}[a_i]\,\mathbb{E}[W_{ki}]\,\mathbb{E}[b_l] \quad (\text{Assumption 2}) \\ &= \mathbb{E}[a_i]\big(\mathbb{E}[W_{ki}]\mathbb{E}[b_l] + \mathrm{Cov}[W_{ki}, b_l]\big) - \mathbb{E}[a_i]\,\mathbb{E}[W_{ki}]\,\mathbb{E}[b_l] = \mathbb{E}[a_i]\,\mathrm{Cov}[W_{ki}, b_l]. \end{aligned} \quad (23\text{--}27)$$
Putting it together, we have
$$\mathrm{Cov}[h_k, h_l] = \sum_{1 \le i,j \le D_{\mathrm{in}}} \mathrm{Cov}[a_i W_{ki}, a_j W_{lj}] + \sum_{i=1}^{D_{\mathrm{in}}} \big(\mathbb{E}[a_i]\,\mathrm{Cov}[W_{ki}, b_l] + \mathbb{E}[a_i]\,\mathrm{Cov}[W_{li}, b_k]\big) + \mathrm{Cov}[b_k, b_l], \quad (28)$$
where
$$\mathrm{Cov}[a_i W_{ki}, a_j W_{lj}] = \mathbb{E}[a_i]\mathbb{E}[a_j]\,\mathrm{Cov}[W_{ki}, W_{lj}] + \mathbb{E}[W_{ki}]\mathbb{E}[W_{lj}]\,\mathrm{Cov}[a_i, a_j] + \mathrm{Cov}[a_i, a_j]\,\mathrm{Cov}[W_{ki}, W_{lj}]. \quad (29)$$
Note that the term $\sum_{1 \le i,j \le D_{\mathrm{in}}} \mathrm{Cov}[a_i W_{ki}, a_j W_{lj}]$ in Eq. (28) can be rewritten in matrix form for an efficient implementation. Writing $\mathbb{E}[a] \in \mathbb{R}^{D_{\mathrm{in}}}$, $\Sigma_a = \mathrm{Cov}[a]$, and $C_{kl} = \mathrm{Cov}[W[k,:], W[l,:]] \in \mathbb{R}^{D_{\mathrm{in}} \times D_{\mathrm{in}}}$, the three contributions are the sums of all entries of the element-wise products
$$\big(\mathbb{E}[a]\,\mathbb{E}[a]^\top\big) \odot C_{kl}, \qquad \big(\mathbb{E}[W[k,:]]\,\mathbb{E}[W[l,:]]^\top\big) \odot \Sigma_a, \qquad \Sigma_a \odot C_{kl}, \quad (30\text{--}36)$$
which is equivalent to $\mathbb{E}[a]^\top C_{kl}\,\mathbb{E}[a] + \mathbb{E}[W[k,:]]^\top \Sigma_a\,\mathbb{E}[W[l,:]] + \mathrm{tr}\big(\Sigma_a C_{kl}^\top\big)$.
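To make the full-covariance propagation of App. A.1 concrete, the following NumPy sketch computes the moments in Eq. (15) and Eq. (28) for a single linear layer. It is only an illustration under our stated assumptions, not the SUQ implementation: the helper `cov_W_rows(k, l)`, which returns the block $\mathrm{Cov}[W[k,:], W[l,:]]$, is hypothetical, and weight–bias cross-covariances are omitted for brevity.

```python
import numpy as np

def propagate_linear_full(a_mean, a_cov, W_mean, b_mean, cov_W_rows, b_cov):
    """Push a Gaussian activation N(a_mean, a_cov) through a Bayesian linear layer.

    cov_W_rows(k, l) is assumed to return the (D_in x D_in) block Cov[W[k,:], W[l,:]];
    weight-bias cross-covariances are ignored here for brevity (they would add the
    E[a_i] Cov[W_ki, b_l] terms of Eq. (28)).
    """
    d_out = W_mean.shape[0]
    h_mean = W_mean @ a_mean + b_mean                      # Eq. (15)
    h_cov = np.array(b_cov, dtype=float, copy=True)        # Cov[b_k, b_l] term
    for k in range(d_out):
        for l in range(d_out):
            C_kl = cov_W_rows(k, l)                        # Cov[W[k,:], W[l,:]]
            h_cov[k, l] += (
                a_mean @ C_kl @ a_mean                     # E[a_i] E[a_j] Cov[W_ki, W_lj]
                + W_mean[k] @ a_cov @ W_mean[l]            # E[W_ki] E[W_lj] Cov[a_i, a_j]
                + np.sum(a_cov * C_kl)                     # Cov[a_i, a_j] Cov[W_ki, W_lj]
            )
    return h_mean, h_cov
```

The explicit double loop over output dimensions is only meant to mirror Eq. (28) term by term; a practical implementation would batch these operations.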
A.2 ERROR INDUCED THROUGH LOCAL GAUSSIAN APPROXIMATION

In this section, we analyse the error induced by the local Gaussian approximation. Recall the two assumptions made in the derivation:

Assumption 1: We assume each product $a^{(m-1)}_i W^{(m)}_{ki}$ is Gaussian distributed.
Assumption 2: We assume that the activations of the previous layer $a^{(m-1)}_i$ and the parameters of the $m$th layer are independent.

We first examine the error induced by Assumption 2 on the moments of $a_i W_{ki}$. Given two correlated univariate Gaussians $x_1$ and $x_2$ with joint distribution
$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mathbb{E}[x_1] \\ \mathbb{E}[x_2] \end{pmatrix}, \begin{pmatrix} \sigma^2_{x_1} & \mathrm{Cov}[x_1, x_2] \\ \mathrm{Cov}[x_1, x_2] & \sigma^2_{x_2} \end{pmatrix} \right),$$
it follows from Nadarajah & Pogány (2016) and Kan (2008) that, although the distribution of $x_1 x_2$ is no longer Gaussian and is intractable, its mean and variance can be computed analytically as
$$\mathbb{E}[x_1 x_2] = \mathbb{E}[x_1]\,\mathbb{E}[x_2] + \mathrm{Cov}[x_1, x_2], \quad (37)$$
$$\mathrm{Var}[x_1 x_2] = \sigma^2_{x_1}\sigma^2_{x_2} + \sigma^2_{x_1}\mathbb{E}[x_2]^2 + \sigma^2_{x_2}\mathbb{E}[x_1]^2 + 2\,\mathbb{E}[x_1]\,\mathbb{E}[x_2]\,\mathrm{Cov}[x_1, x_2] + \mathrm{Cov}[x_1, x_2]^2. \quad (38)$$
Applying the above result to our case, we have
$$\mathbb{E}[a_i W_{ki}] = \mathbb{E}[a_i]\,\mathbb{E}[W_{ki}] + \mathrm{Cov}[a_i, W_{ki}], \quad (39)$$
$$\mathrm{Var}[a_i W_{ki}] = \sigma^2_{a_i}\sigma^2_{W_{ki}} + \sigma^2_{a_i}\mathbb{E}[W_{ki}]^2 + \sigma^2_{W_{ki}}\mathbb{E}[a_i]^2 + 2\,\mathbb{E}[a_i]\,\mathbb{E}[W_{ki}]\,\mathrm{Cov}[a_i, W_{ki}] + \mathrm{Cov}[a_i, W_{ki}]^2. \quad (40)$$
As $\mathrm{Cov}[a_i, W_{ki}]$ is intractable, Assumption 2 ignores the correlation between $a_i$ and $W_{ki}$, which amounts to dropping the terms involving $\mathrm{Cov}[a_i, W_{ki}]$:
$$\mathbb{E}[a_i W_{ki}] \approx \mathbb{E}[a_i]\,\mathbb{E}[W_{ki}], \quad (41)$$
$$\mathrm{Var}[a_i W_{ki}] \approx \sigma^2_{a_i}\sigma^2_{W_{ki}} + \sigma^2_{a_i}\mathbb{E}[W_{ki}]^2 + \sigma^2_{W_{ki}}\mathbb{E}[a_i]^2. \quad (42)$$
Note that in the case of a diagonal posterior covariance, the parameters are independent of each other, and Assumption 2 holds automatically. In this case, we recover the correct mean and variance for $a_i W_{ki}$.

Now, we examine the error induced by Assumptions 1 and 2 through Monte Carlo estimation. Fig. 6 provides a simulation illustrating the error induced by the local Gaussian approximation on $a_i W_{ki}$. We plot the results for the weights with the largest absolute magnitude of an MLP trained on MNIST. The approximation works well in practice but fails to capture the potential skewness of the distributions.

Figure 6: Comparison between Monte Carlo estimates of the distribution over $a_i W_{ki}$ and our analytic Gaussian approximation (x-axes: pre-activation value).

A.3 DERIVATION FOR DIAGONAL COVARIANCE STRUCTURE

When the posterior has diagonal covariance, the mean $\mathbb{E}[h_k]$ stays the same. For the covariance with $k \ne l$, we have
$$\mathrm{Cov}[h_k, h_l] = \sum_{1 \le i,j \le D_{\mathrm{in}}} \mathrm{Cov}[a_i W_{ki}, a_j W_{lj}] + \sum_{i=1}^{D_{\mathrm{in}}} \big(\mathbb{E}[a_i]\,\mathrm{Cov}[W_{ki}, b_l] + \mathbb{E}[a_i]\,\mathrm{Cov}[W_{li}, b_k]\big) + \mathrm{Cov}[b_k, b_l] \quad (43)$$
$$= \sum_{1 \le i,j \le D_{\mathrm{in}}} \mathrm{Cov}[a_i W_{ki}, a_j W_{lj}] \quad (\text{weight--bias covariances vanish under a diagonal posterior}) \quad (44)$$
$$= \sum_{1 \le i,j \le D_{\mathrm{in}}} \mathbb{E}[a_i]\mathbb{E}[a_j]\,\mathrm{Cov}[W_{ki}, W_{lj}] + \mathbb{E}[W_{ki}]\mathbb{E}[W_{lj}]\,\mathrm{Cov}[a_i, a_j] + \mathrm{Cov}[a_i, a_j]\,\mathrm{Cov}[W_{ki}, W_{lj}] \quad (45)$$
$$= \sum_{1 \le i,j \le D_{\mathrm{in}}} \mathbb{E}[W_{ki}]\,\mathbb{E}[W_{lj}]\,\mathrm{Cov}[a_i, a_j], \quad (\mathrm{Cov}[W_{ki}, W_{lj}] = 0 \text{ for } k \ne l) \quad (46)$$
For $k = l$, we have
$$\mathrm{Var}[h_k] = \sum_{1 \le i,j \le D_{\mathrm{in}}} \mathrm{Cov}[a_i W_{ki}, a_j W_{kj}] + 2\sum_{i=1}^{D_{\mathrm{in}}} \mathbb{E}[a_i]\,\mathrm{Cov}[W_{ki}, b_k] + \mathrm{Var}[b_k] \quad (47)$$
$$= \sum_{1 \le i \le D_{\mathrm{in}}} \mathrm{Cov}[a_i W_{ki}, a_i W_{ki}] + \mathrm{Var}[b_k] \quad (\text{dropping the } i \ne j \text{ cross terms}) \quad (48)$$
$$= \sum_{1 \le i \le D_{\mathrm{in}}} \mathbb{E}[a_i]^2\,\mathrm{Var}[W_{ki}] + \mathbb{E}[W_{ki}]^2\,\mathrm{Var}[a_i] + \mathrm{Var}[a_i]\,\mathrm{Var}[W_{ki}] + \mathrm{Var}[b_k]. \quad (49)$$
Note that as
$$\mathrm{Var}\big[h^{(m)}_k\big] = \sum_{1 \le i \le D_{\mathrm{in}}} \mathbb{E}\big[a^{(m-1)}_i\big]^2\,\mathrm{Var}\big[W^{(m)}_{ki}\big] + \mathbb{E}\big[W^{(m)}_{ki}\big]^2\,\mathrm{Var}\big[a^{(m-1)}_i\big] + \mathrm{Var}\big[a^{(m-1)}_i\big]\,\mathrm{Var}\big[W^{(m)}_{ki}\big] + \mathrm{Var}\big[b^{(m)}_k\big], \quad (50\text{--}51)$$
the variance of $h^{(m)}_k$ only relies on the variances of the activations of the previous layer, i.e., $\mathrm{Var}[a^{(m-1)}_i]$. In the case of element-wise activation functions, $\mathrm{Var}[a^{(m-1)}_i]$ in turn only relies on $\mathrm{Var}[h^{(m-1)}_i]$, as the Jacobian of the activation is then diagonal. As a result, when only the variances are needed, we can drop the computation of $\mathrm{Cov}[h_k, h_l]$ and compute only the variances at each layer, which largely reduces the computational cost.
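Under a diagonal (mean-field) posterior, the recursion in Eqs. (49)–(51) only involves element-wise first and second moments, so a layer can be propagated with a few matrix products. A minimal NumPy sketch (our own illustration, not the SUQ API) is:

```python
import numpy as np

def propagate_linear_diag(a_mean, a_var, W_mean, W_var, b_mean, b_var):
    """Moment propagation through a linear layer with a diagonal weight posterior.

    a_mean, a_var: (D_in,) mean/variance of the incoming activation.
    W_mean, W_var: (D_out, D_in) mean/variance of the weights.
    b_mean, b_var: (D_out,) mean/variance of the bias.
    Returns the mean and variance of the pre-activation h, cf. Eqs. (49)/(51).
    """
    h_mean = W_mean @ a_mean + b_mean
    h_var = (W_var @ (a_mean ** 2)        # E[a_i]^2 Var[W_ki]
             + (W_mean ** 2) @ a_var      # E[W_ki]^2 Var[a_i]
             + W_var @ a_var              # Var[a_i] Var[W_ki]
             + b_var)
    return h_mean, h_var
```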
A.4 DERIVATION FOR KRONECKER COVARIANCE STRUCTURE

In KFAC, the Hessian is represented in Kronecker product form $\mathrm{Hess} = A \otimes B$. Denote the prior precision as $\lambda^2$; the posterior covariance is then
$$\Sigma = (\mathrm{Hess} + \lambda^2 I)^{-1} = (A \otimes B + \lambda^2 I)^{-1}. \quad (52)$$
As there is no closed form for this inverse, to express the covariance as a Kronecker product as well, we approximate it as
$$\Sigma = (A \otimes B + \lambda^2 I)^{-1} = \big((U_A \Lambda_A U_A^\top) \otimes (U_B \Lambda_B U_B^\top) + \lambda^2 I\big)^{-1} \quad \text{(eigendecomposition)} \quad (53)$$
$$\approx \big((U_A (\Lambda_A + \lambda I_A) U_A^\top) \otimes (U_B (\Lambda_B + \lambda I_B) U_B^\top)\big)^{-1} \quad (54)$$
$$= \big(U_A (\Lambda_A + \lambda I_A) U_A^\top\big)^{-1} \otimes \big(U_B (\Lambda_B + \lambda I_B) U_B^\top\big)^{-1}. \quad \big((A \otimes B)^{-1} = A^{-1} \otimes B^{-1}\big)$$
Recall that for an efficient implementation of $\sum_{1 \le i,j \le D_{\mathrm{in}}} \mathrm{Cov}[a_i W_{ki}, a_j W_{lj}]$ (cf. Eqs. (30)–(36)), we need to retrieve the covariance between the $k$th and the $l$th row of the weight matrix, which is a $D_{\mathrm{in}} \times D_{\mathrm{in}}$ matrix:
$$\mathrm{Cov}[W[k,:], W[l,:]] = \begin{pmatrix} \mathrm{Cov}[W_{k1}, W_{l1}] & \cdots & \mathrm{Cov}[W_{k1}, W_{lD_{\mathrm{in}}}] \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}[W_{kD_{\mathrm{in}}}, W_{l1}] & \cdots & \mathrm{Cov}[W_{kD_{\mathrm{in}}}, W_{lD_{\mathrm{in}}}] \end{pmatrix}. \quad (55)$$
As $\Sigma \approx C \otimes D$ with $C \in \mathbb{R}^{D_{\mathrm{in}} \times D_{\mathrm{in}}}$ and $D \in \mathbb{R}^{D_{\mathrm{out}} \times D_{\mathrm{out}}}$, the posterior covariance is represented by a total of $D_{\mathrm{in}} \times D_{\mathrm{in}}$ blocks, each of size $D_{\mathrm{out}} \times D_{\mathrm{out}}$. Retrieving a $D_{\mathrm{in}} \times D_{\mathrm{in}}$ submatrix from it is not trivial. In the toy example shown in Fig. 7, for a weight matrix $W$ with $D_{\mathrm{in}} = 3$ and $D_{\mathrm{out}} = 2$, its covariance is represented by a total of 9 ($D_{\mathrm{in}} \times D_{\mathrm{in}}$) blocks I, II, ..., IX, each of shape $2 \times 2$ ($D_{\mathrm{out}} \times D_{\mathrm{out}}$). To retrieve $\mathrm{Cov}[W[1,:], W[2,:]]$, we first decide which Kronecker blocks contain it (in this case blocks II, III, V, and VI) and reconstruct these Kronecker blocks. Then, we retrieve $\mathrm{Cov}[W[1,:], W[2,:]]$ from the reconstructed blocks.

Figure 7: To retrieve the highlighted submatrix $\mathrm{Cov}[W[1,:], W[2,:]]$ of the covariance for $W \in \mathbb{R}^{2 \times 3}$, we identify the Kronecker blocks that contain the covariance of interest (II, III, V, and VI), explicate those blocks in memory, and then retrieve the relevant submatrix.

In general, the retrieval process consists of two steps: (1) identifying the block indices within the Kronecker product matrix that correspond to the required covariance block, and (2) extracting the covariance of interest from the constructed blocks.

Identifying block indices. We first identify the Kronecker blocks that contain the covariance of interest. This is achieved by calculating the block indices for $C$, which are later used to construct the Kronecker blocks. Specifically, the start and end positions (0-indexed) of the covariance block corresponding to rows $k$ and $l$ within the full covariance matrix are
$$\mathrm{row\_start} = k\, D_{\mathrm{in}}, \quad \mathrm{row\_end} = (k+1)\, D_{\mathrm{in}}, \quad \mathrm{col\_start} = l\, D_{\mathrm{in}}, \quad \mathrm{col\_end} = (l+1)\, D_{\mathrm{in}}. \quad (56\text{--}59)$$
The Kronecker blocks that contain the covariance of interest can then be constructed as
$$C[\lfloor \mathrm{row\_start}/D_{\mathrm{out}} \rfloor : \lceil \mathrm{row\_end}/D_{\mathrm{out}} \rceil,\ \lfloor \mathrm{col\_start}/D_{\mathrm{out}} \rfloor : \lceil \mathrm{col\_end}/D_{\mathrm{out}} \rceil] \otimes D,$$
i.e., by integer division by $D_{\mathrm{out}}$, rounding the start down and the end up.

Extracting the covariance. Once we have constructed these blocks, and as we know that the covariance to retrieve has shape $D_{\mathrm{in}} \times D_{\mathrm{in}}$, we only need the start row and column index within the constructed block matrix, which are given by
$$\mathrm{select\_row\_start} = (k\, D_{\mathrm{in}}) \bmod D_{\mathrm{out}}, \quad (60)$$
$$\mathrm{select\_col\_start} = (l\, D_{\mathrm{in}}) \bmod D_{\mathrm{out}}. \quad (61)$$
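The two-step retrieval can be written compactly. The sketch below assumes that the covariance is stored only through its Kronecker factors $C$ ($D_{\mathrm{in}} \times D_{\mathrm{in}}$) and $D$ ($D_{\mathrm{out}} \times D_{\mathrm{out}}$), that $\mathrm{vec}(W)$ stacks the rows of $W$, and that all indices are 0-based; these conventions are our own choices for illustration, not the SUQ implementation.

```python
import numpy as np

def cov_weight_rows(C, D, k, l, d_in, d_out):
    """Retrieve Cov[W[k,:], W[l,:]] (a d_in x d_in block) from a Kronecker-factored
    covariance Sigma ~= kron(C, D), without ever forming the full Sigma.

    Assumes 0-based indices and that vec(W) stacks the rows of W, so the block we
    need covers rows k*d_in:(k+1)*d_in and columns l*d_in:(l+1)*d_in of Sigma.
    """
    # Positions of the required block in the full covariance matrix, Eqs. (56)-(59).
    row_start, row_end = k * d_in, (k + 1) * d_in
    col_start, col_end = l * d_in, (l + 1) * d_in
    # Step 1: identify and materialise only the Kronecker blocks that cover it.
    br0, br1 = row_start // d_out, -(-row_end // d_out)   # floor / ceil division
    bc0, bc1 = col_start // d_out, -(-col_end // d_out)
    patch = np.kron(C[br0:br1, bc0:bc1], D)
    # Step 2: extract the d_in x d_in submatrix from the materialised patch.
    r0 = row_start - br0 * d_out        # equals (k * d_in) % d_out, Eq. (60)
    c0 = col_start - bc0 * d_out        # equals (l * d_in) % d_out, Eq. (61)
    return patch[r0:r0 + d_in, c0:c0 + d_in]
```

Only the blocks that overlap the requested submatrix are ever materialised, so each query touches roughly $(D_{\mathrm{in}} + D_{\mathrm{out}})^2$ entries instead of the $(D_{\mathrm{in}} D_{\mathrm{out}})^2$ entries of the full covariance.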
A.5 DERIVATION FOR ACTIVATION LAYERS

For $a = g(h)$, where $h \sim \mathcal{N}(h; \mathbb{E}[h], \Sigma_h)$ and $g(\cdot)$ is the activation function, we use local linearisation to approximate the distribution of $a$. Specifically, we perform a first-order Taylor expansion of $g(\cdot)$ at $\mathbb{E}[h]$:
$$a = g(h) \approx g(\mathbb{E}[h]) + J_g\big|_{h=\mathbb{E}[h]}\,(h - \mathbb{E}[h]). \quad (62\text{--}63)$$
Given that the Gaussian distribution is closed under linear transformations, we have
$$h \sim \mathcal{N}(\mathbb{E}[h], \Sigma_h) \quad (64)$$
$$h - \mathbb{E}[h] \sim \mathcal{N}(0, \Sigma_h) \quad (65)$$
$$J_g\big|_{h=\mathbb{E}[h]}\,(h - \mathbb{E}[h]) \sim \mathcal{N}\big(0,\ J_g\big|_{h=\mathbb{E}[h]}\,\Sigma_h\,J_g\big|_{h=\mathbb{E}[h]}^\top\big) \quad (66)$$
$$g(\mathbb{E}[h]) + J_g\big|_{h=\mathbb{E}[h]}\,(h - \mathbb{E}[h]) \sim \mathcal{N}\big(g(\mathbb{E}[h]),\ J_g\big|_{h=\mathbb{E}[h]}\,\Sigma_h\,J_g\big|_{h=\mathbb{E}[h]}^\top\big) \quad (67)$$
$$a \overset{\mathrm{approx.}}{\sim} \mathcal{N}\big(a;\ g(\mathbb{E}[h]),\ J_g\big|_{h=\mathbb{E}[h]}\,\Sigma_h\,J_g\big|_{h=\mathbb{E}[h]}^\top\big). \quad (68)$$

A.6 TRANSFORMER BLOCK

There are four components in each transformer block (Vaswani et al., 2017): (1) multi-head attention; (2) MLP; (3) layer normalisation; and (4) residual connection. For the MLP blocks, the propagation is the same as described above. For layer normalisation and residual connections, as Gaussian distributions are closed under linear transformations, pushing distributions through them is straightforward. We describe how to push distributions through the attention layers below. Note that, for computational reasons, we always assume the input has diagonal covariance.

Given an input $H \in \mathbb{R}^{T \times D}$, where $T$ is the number of tokens in the input sequence and $D$ is the dimension of each token, denote the query, key, and value matrices as $W_Q \in \mathbb{R}^{D \times D}$, $W_K \in \mathbb{R}^{D \times D}$, and $W_V \in \mathbb{R}^{D \times D}$, respectively. The queries, keys, and values of an attention block are
$$Q = H W_Q, \quad K = H W_K, \quad V = H W_V, \quad (69)$$
and the output of the attention block is
$$\mathrm{Attention}(H) = \mathrm{Softmax}\big(Q K^\top / \sqrt{D}\big)\, V. \quad (70)$$
When the input $H$ is a distribution, $Q$, $K$, and $V$ will all be distributions as well. As pushing a distribution through the softmax requires further approximation, we ignore the distributions over $Q$ and $K$ for computational reasons and compute their values using the mean of the input:
$$Q = \mathbb{E}[H]\,\mathbb{E}[W_Q], \quad K = \mathbb{E}[H]\,\mathbb{E}[W_K]. \quad (71)$$
For $V$, for simplicity we describe our approximation for a single token $h$, whose value is $v = W_V h$ with $k$th element $v_k = \sum_{i=1}^{D} W_{V,ki} h_i$. Assuming $h$ is Gaussian, the covariance between the $k$th and $l$th value is
$$\mathrm{Cov}[v_k, v_l] = \mathrm{Cov}\Big[\sum_{i=1}^{D} W_{V,ki} h_i,\ \sum_{j=1}^{D} W_{V,lj} h_j\Big] = \sum_{1 \le i,j \le D} \mathrm{Cov}\big[W_{V,ki} h_i,\ W_{V,lj} h_j\big]. \quad (72\text{--}73)$$
When treating $W_V$ deterministically, we have
$$\mathrm{Cov}[v_k, v_l] = \sum_{1 \le i,j \le D} W_{V,ki}\,W_{V,lj}\,\mathrm{Cov}[h_i, h_j] \approx \sum_{i=1}^{D} W_{V,ki}\,W_{V,li}\,\mathrm{Var}[h_i],$$
where the last step ignores the correlations between the elements of $h$ for computational reasons. When $W_V$ is an isotropic Gaussian, we have
$$\mathrm{Cov}[v_k, v_l] = \sum_{1 \le i,j \le D} \mathrm{Cov}\big[W_{V,ki} h_i,\ W_{V,lj} h_j\big] \quad (74)$$
$$\approx \sum_{1 \le i,j \le D} \big(\mathbb{E}[h_i]\mathbb{E}[h_j] + \mathrm{Cov}[h_i, h_j]\big)\,\mathrm{Cov}[W_{ki}, W_{lj}] + \mathbb{E}[W_{ki}]\,\mathbb{E}[W_{lj}]\,\mathrm{Cov}[h_i, h_j] \quad (\text{Assumption 2})$$
$$= \sum_{1 \le i,j \le D} \mathbb{E}[W_{ki}]\,\mathbb{E}[W_{lj}]\,\mathrm{Cov}[h_i, h_j] \quad (W_V \text{ isotropic Gaussian})$$
$$\approx \sum_{1 \le i \le D} \mathbb{E}[W_{ki}]\,\mathbb{E}[W_{li}]\,\mathrm{Var}[h_i] \quad (\text{ignoring correlations in } h),$$
and for the variance
$$\mathrm{Var}[v_k] = \sum_{1 \le i,j \le D} \mathrm{Cov}\big[W_{V,ki} h_i,\ W_{V,kj} h_j\big] \approx \sum_{1 \le i,j \le D} \big(\mathbb{E}[h_i]\mathbb{E}[h_j] + \mathrm{Cov}[h_i, h_j]\big)\,\mathrm{Cov}[W_{ki}, W_{kj}] + \mathbb{E}[W_{ki}]\,\mathbb{E}[W_{kj}]\,\mathrm{Cov}[h_i, h_j] \quad (\text{Assumption 2})$$
$$\approx \sum_{1 \le i \le D} \big(\mathbb{E}[h_i]^2 + \mathrm{Var}[h_i]\big)\,\mathrm{Var}[W_{ki}] + \mathbb{E}[W_{ki}]^2\,\mathrm{Var}[h_i]. \quad (W_V \text{ isotropic Gaussian}) \quad (75)$$
Once we have the distribution over $V$, the distribution over $\mathrm{Attention}(H)$ is a linear combination of Gaussians, which is tractable. For multi-head attention, we assume that each attention head's output is independent, which allows us to compute the distribution over the final output in tractable form. As we assume the input has diagonal covariance, we only need to compute the variance for each dimension.
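As a concrete illustration of App. A.6, the sketch below pushes per-token means and diagonal variances through a single attention head for the case where $W_V$ is treated deterministically; queries and keys use only the input mean, as in Eq. (71). The independence assumption across tokens in the final combination is our own simplification for this sketch, not the SUQ implementation.

```python
import numpy as np

def attention_with_uncertainty(H_mean, H_var, W_Q, W_K, W_V):
    """Push per-token means/variances through one attention head.

    H_mean, H_var: (T, D) token-wise mean and (diagonal) variance of the input.
    W_Q, W_K, W_V: (D, D) projection matrices; W_V is treated as deterministic here
    (the isotropic-Gaussian W_V case would add the extra terms from App. A.6).
    """
    D = H_mean.shape[1]
    # Queries/keys use only the input mean, so the softmax stays deterministic (Eq. (71)).
    Q, K = H_mean @ W_Q, H_mean @ W_K
    logits = Q @ K.T / np.sqrt(D)
    P = np.exp(logits - logits.max(-1, keepdims=True))
    P /= P.sum(-1, keepdims=True)                      # attention probabilities
    # Value moments: v = W_V h with deterministic W_V and diagonal Cov[h].
    V_mean = H_mean @ W_V
    V_var = H_var @ (W_V ** 2)                         # Var[v_k] = sum_i W_V[i,k]^2 Var[h_i]
    # The output is a fixed linear combination of the values;
    # values of different tokens are treated as independent for the variance.
    out_mean = P @ V_mean
    out_var = (P ** 2) @ V_var
    return out_mean, out_var
```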
A.7 CONVOLUTIONAL NEURAL NETWORK

The derivation for convolutional layers is similar to that for fully connected layers, as a convolutional layer can be viewed as a fully connected layer with shared weights. We first give the derivation for convolutional layers and then discuss pooling layers in convolutional neural networks.

Denote the pixel value at position $(i, j)$ of the $c_{\mathrm{in}}$th input channel as $a_{c_{\mathrm{in}}}[i, j]$, the $c_{\mathrm{in}}$th channel of the convolutional kernel corresponding to the $c_{\mathrm{out}}$th output channel as $W_{c_{\mathrm{out}}, c_{\mathrm{in}}}[i, j]$, and the pixel value at position $(k, l)$ of the $c_{\mathrm{out}}$th output channel as $h_{c_{\mathrm{out}}}[k, l]$. Then, with $C_{\mathrm{in}}$ input channels in total and a kernel of size $K_h \times K_w$, the convolutional layer can be written as
$$h_{c_{\mathrm{out}}}[k, l] = \sum_{c_{\mathrm{in}}=1}^{C_{\mathrm{in}}} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} a_{c_{\mathrm{in}}}[k+i-1,\, l+j-1]\; W_{c_{\mathrm{out}}, c_{\mathrm{in}}}[i, j]. \quad (76)$$

Derivation of mean. Following our assumption that $a_{c_{\mathrm{in}}}[k+i-1, l+j-1]$ is uncorrelated with $W_{c_{\mathrm{out}}, c_{\mathrm{in}}}[i, j]$, we have
$$\mathbb{E}\big[h_{c_{\mathrm{out}}}[k, l]\big] = \sum_{c_{\mathrm{in}}=1}^{C_{\mathrm{in}}} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} \mathbb{E}\big[a_{c_{\mathrm{in}}}[k+i-1,\, l+j-1]\big]\; \mathbb{E}\big[W_{c_{\mathrm{out}}, c_{\mathrm{in}}}[i, j]\big]. \quad (77)$$

Derivation of covariance. The covariance between pixels of the $c_{\mathrm{out}}$th output channel is given as
$$\mathrm{Cov}\big[h_{c_{\mathrm{out}}}[k_1, l_1],\; h_{c_{\mathrm{out}}}[k_2, l_2]\big] = \sum_{c_{\mathrm{in}}, i_1, j_1} \sum_{c_{\mathrm{in}}', i_2, j_2} \mathrm{Cov}\big[a_{c_{\mathrm{in}}}[k_1{+}i_1{-}1,\, l_1{+}j_1{-}1]\, W_{c_{\mathrm{out}}, c_{\mathrm{in}}}[i_1, j_1],\ a_{c_{\mathrm{in}}'}[k_2{+}i_2{-}1,\, l_2{+}j_2{-}1]\, W_{c_{\mathrm{out}}, c_{\mathrm{in}}'}[i_2, j_2]\big]. \quad (78\text{--}82)$$
Using the earlier result from Eq. (22) and the shorthands $a_1 = a_{c_{\mathrm{in}}}[k_1{+}i_1{-}1, l_1{+}j_1{-}1]$, $a_2 = a_{c_{\mathrm{in}}'}[k_2{+}i_2{-}1, l_2{+}j_2{-}1]$, $W_1 = W_{c_{\mathrm{out}}, c_{\mathrm{in}}}[i_1, j_1]$, and $W_2 = W_{c_{\mathrm{out}}, c_{\mathrm{in}}'}[i_2, j_2]$, each summand takes the form
$$\mathrm{Cov}[a_1 W_1,\ a_2 W_2] = \mathbb{E}[a_1]\,\mathbb{E}[a_2]\,\mathrm{Cov}[W_1, W_2] + \mathbb{E}[W_1]\,\mathbb{E}[W_2]\,\mathrm{Cov}[a_1, a_2] + \mathrm{Cov}[a_1, a_2]\,\mathrm{Cov}[W_1, W_2]. \quad (83\text{--}88)$$

B ADDITIONAL EXPERIMENTS

In this appendix, we provide additional details on (i) the regression experiments (App. B.1), (ii) the classification experiments (App. B.2), and (iii) the image sensitivity experiment (App. B.3). In addition, we present additional experiments on (i) measuring the performance when varying the number of MC samples for the sampling baselines (App. B.4), (ii) estimating the degree of local linearity in our method (App. B.5), and (iii) comparing the runtime of our method against the baselines (App. B.7), to further demonstrate the benefits of our streamlined prediction with local linearisation and local Gaussian approximation.

B.1 REGRESSION

Table 7 gives the UCI regression data set information and the neural network structures we used. For all neural networks, we use the ReLU activation function. In Table 8, we report the root mean square error (RMSE). Our method matches or improves on the performance of sampling and the GLM, indicating its effectiveness. Note that since the mean of the posterior predictive of our method equals the prediction obtained by setting the network weights to the posterior mean, we obtain the same predictions as the GLM of the LA, and hence the same performance.

Table 7: UCI regression experiment setup.
Data set name | Shorthand | (n, d) | Network structure
SERVO | SERVO | (167, 4) | d-50-1
LIVER DISORDERS | LD | (345, 5) | d-50-1
AUTO MPG | AM | (398, 7) | d-50-1
REAL ESTATE VALUATION | REV | (414, 6) | d-50-1
FOREST FIRES | FF | (517, 12) | d-50-1
INFRARED THERMOGRAPHY TEMPERATURE | ITT | (1020, 33) | d-100-1
CONCRETE COMPRESSIVE STRENGTH | CCS | (1030, 8) | d-100-1
AIRFOIL SELF-NOISE | ASN | (1503, 5) | d-100-1
COMMUNITIES AND CRIME | CAC | (1994, 127) | d-100-1
PARKINSONS TELEMONITORING | PT | (5875, 19) | d-50-50-1
COMBINED CYCLE POWER PLANT | CCPP | (9568, 4) | d-50-50-1

B.2 CLASSIFICATION

Table 9 gives the classification data set information and the neural network structure used for the MLP experiments. We use the ReLU activation for the MLP.

OOD experiments with MLP. To test our method on out-of-distribution (OOD) data, we first evaluate the MNIST-trained MLP on rotated versions of the test set, as shown in Fig. 8. The rotation interval is 10° from 0° to 180°. We observe that with increasing rotation, our method achieves a lower NLPD than LA MAP and MFVI Sampling, while staying close to LA Sampling and the GLM. Moreover, our method achieves similar NLPD for both the LA and MFVI posterior approximations across rotations. All methods perform on par in terms of ACC. In Fig. 9, we show kernel density plots over the predictive entropy of an FMNIST-trained MLP evaluated on MNIST. Our method can distinguish between in-distribution and OOD data better than LA MAP and MFVI Sampling. Although our method underfits the in-distribution data, the separation from the OOD data is clear.

Table 8: Root mean square error on UCI regression data sets. Our method results in better or matching performance compared with sampling and the GLM, indicating its effectiveness.

Data set (n, d) | MFVI (Diag. Cov.) Sampling | MFVI (Diag. Cov.) Ours | LA (Full Cov.) Sampling | LA (Full Cov.) GLM | LA (Full Cov.) Ours
SERVO (167, 4) | 0.749±0.147 | 0.740±0.143 | 1.632±0.233 | 0.658±0.141 | 0.658±0.141
LD (345, 5) | 0.884±0.273 | 0.881±0.272 | 0.989±0.441 | 0.977±0.418 | 0.977±0.418
AM (398, 7) | 0.415±0.115 | 0.417±0.113 | 0.505±0.105 | 0.371±0.103 | 0.371±0.103
REV (414, 6) | 0.563±0.096 | 0.562±0.095 | 0.789±0.130 | 0.532±0.104 | 0.532±0.104
FF (517, 12) | 0.874±1.123 | 0.874±1.124 | 0.910±0.824 | 0.852±0.792 | 0.852±0.792
ITT (1020, 33) | 0.481±0.057 | 0.497±0.066 | 0.560±0.075 | 0.507±0.072 | 0.507±0.072
CCS (1030, 8) | 0.472±0.102 | 0.476±0.106 | 0.494±0.102 | 0.301±0.057 | 0.301±0.057
ASN (1503, 5) | 0.568±0.062 | 0.560±0.062 | 0.550±0.069 | 0.352±0.055 | 0.352±0.055
CAC (1994, 127) | 0.571±0.105 | 0.585±0.092 | 1.481±0.167 | 0.703±0.101 | 0.703±0.101
PT (5875, 19) | 0.601±0.067 | 0.590±0.068 | 0.479±0.081 | 0.410±0.076 | 0.410±0.076
CCPP (9568, 4) | 0.241±0.038 | 0.241±0.038 | 0.358±0.041 | 0.224±0.037 | 0.224±0.037
Bold count | 8/11 | 10/11 | 2/11 | 11/11 | 11/11

Table 9: Classification experiment setup.

Data set name | (n, d) | Network structure
MNIST | (50000, 784) | d-128-64-10
FMNIST | (50000, 784) | d-128-64-10
ORGANCMNIST | (12975, 784) | d-128-64-11
ORGANSMNIST | (13932, 784) | d-128-64-11

Figure 8: NLPD and ACC for the MNIST-trained MLP on rotated versions of the MNIST test set (methods: LA GLM, LA Sampling, LA Ours, MFVI Sampling, MFVI Ours, MAP). The rotation interval is 10° from 0° to 180°. Our method achieves similar NLPD for both LA and MFVI posterior approximations.

Figure 9: Kernel density plots over the predictive entropy from an MLP trained on FMNIST (blue, in-distribution) and data from MNIST (red, out-of-distribution).
Our method results in a clear separation between the in- and out-of-distribution data.

Our method applied to the MLPs in ViT. In Table 10, we report the results for fine-tuning the MLPs after the attention layers in the last two transformer blocks of ViT and subsequently treating them Bayesian. We observe that our method achieves better or on-par NLPD and ECE compared to the baselines for both LA and MFVI across all data sets, while maintaining similar ACC.

We present results for GPT-2 on tasks from the GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) benchmarks. These natural language understanding tasks can be turned into classification tasks with the prompts shown in Table 11. We add a classification layer on top of GPT-2 and use the embedding of the last token of each input for classification.

Table 10: Performance metrics using ViT with posterior approximation on the MLPs after the attention layers, with standard errors for ACC and NLPD. Our method achieves better NLPD and ECE in general and similar ACC compared to the baselines.

Metric | Method | CIFAR-10 | CIFAR-100 | DTD | RESISC | IMAGENET-R
ACC | LA Sampling | 0.971±0.002 | 0.855±0.004 | 0.656±0.011 | 0.812±0.005 | 0.589±0.013
ACC | LA GLM | 0.974±0.002 | 0.873±0.003 | 0.714±0.010 | 0.886±0.004 | 0.687±0.012
ACC | LA Ours | 0.976±0.002 | 0.884±0.003 | 0.716±0.010 | 0.908±0.004 | 0.713±0.012
ACC | MFVI Sampling | 0.978±0.001 | 0.896±0.003 | 0.727±0.010 | 0.870±0.004 | 0.733±0.012
ACC | MFVI Ours | 0.978±0.001 | 0.895±0.003 | 0.720±0.010 | 0.867±0.004 | 0.732±0.012
NLPD | LA Sampling | 0.169±0.004 | 1.043±0.010 | 2.035±0.022 | 1.304±0.011 | 2.330±0.041
NLPD | LA GLM | 0.089±0.005 | 0.602±0.011 | 1.260±0.029 | 0.568±0.011 | 1.584±0.045
NLPD | LA Ours | 0.088±0.006 | 0.457±0.013 | 1.078±0.036 | 0.318±0.012 | 1.339±0.047
NLPD | MFVI Sampling | 0.124±0.011 | 0.480±0.018 | 1.277±0.060 | 1.098±0.043 | 1.489±0.081
NLPD | MFVI Ours | 0.081±0.006 | 0.436±0.013 | 1.146±0.040 | 0.650±0.019 | 1.206±0.053
ECE | LA Sampling | 0.078 | 0.349 | 0.442 | 0.431 | 0.331
ECE | LA GLM | 0.005 | 0.097 | 0.174 | 0.157 | 0.155
ECE | LA Ours | 0.007 | 0.032 | 0.055 | 0.017 | 0.087
ECE | MFVI Sampling | 0.014 | 0.040 | 0.083 | 0.075 | 0.115
ECE | MFVI Ours | 0.006 | 0.030 | 0.053 | 0.028 | 0.040

Table 11: Prompt templates for fine-tuning GPT-2 on natural language understanding tasks.

Task | Prompt
MRPC | Answer whether sentence 2 is equivalent to sentence 1. Sentence 1: {sentence1}. Sentence 2: {sentence2}. Answer:
WiC | Select whether word {word} has the same meaning in these two sentences. Sentence 1: {sentence1}. Sentence 2: {sentence2}. Answer:
BoolQ | Answer the question with only True or False. Passage: {passage}. Question: {question}. Answer:

Last-layer Laplace approximation on ViT. In Table 12, we report the results for fine-tuning only the last classification layer of the ViT base model and subsequently treating it Bayesian. We observe that our method (LL-LA Ours) achieves better or on-par NLPD and ECE compared to the last-layer Laplace approximation baselines (LL-LA GLM/Sampling) across all data sets while maintaining similar ACC. Compared to the case where more layers are treated Bayesian (LA Ours; results taken from Table 4), last-layer approximations in general yield lower accuracies and higher NLPD and ECE, indicating the benefit of treating more layers Bayesian. In Table 13, we report the wall-clock run times for the last-layer Laplace approximation on CIFAR-10 in milliseconds (see App. B.7 for the runtime setup). Our method matches the speed of MAP and is slightly faster than the GLM.

Table 12: Performance metrics using ViT with posterior approximation on the last layer, with standard errors for ACC and NLPD.
In the last-layer Laplace approximation (LL-LA), our method achieves better NLPD and ECE in general and similar ACC compared to the baselines. Compared with the case where more intermediate layers are treated Bayesian (LA Ours), the last-layer Laplace approximation in general has lower accuracies and higher NLPD and ECE.

Metric | Method | CIFAR-10 | CIFAR-100 | DTD | RESISC | IMAGENET-R
ACC | LL-LA GLM | 0.965±0.002 | 0.825±0.004 | 0.681±0.006 | 0.506±0.012 | 0.592±0.013
ACC | LL-LA Sampling | 0.966±0.002 | 0.827±0.004 | 0.025±0.002 | 0.509±0.012 | 0.604±0.013
ACC | LL-LA Ours | 0.965±0.002 | 0.825±0.004 | 0.693±0.006 | 0.508±0.012 | 0.592±0.013
ACC | LA Ours | 0.976±0.002 | 0.880±0.003 | 0.719±0.010 | 0.892±0.004 | 0.739±0.012
NLPD | LL-LA GLM | 0.115±0.005 | 0.889±0.018 | 1.500±0.021 | 2.574±0.026 | 2.034±0.051
NLPD | LL-LA Sampling | 0.118±0.005 | 0.924±0.021 | 7.341±0.065 | 2.456±0.030 | 2.000±0.058
NLPD | LL-LA Ours | 0.110±0.005 | 0.874±0.019 | 1.411±0.026 | 2.319±0.032 | 2.020±0.052
NLPD | LA Ours | 0.086±0.006 | 0.456±0.012 | 1.068±0.035 | 0.353±0.012 | 1.264±0.043
ECE | LL-LA GLM | 0.011 | 0.062 | 0.137 | 0.333 | 0.145
ECE | LL-LA Sampling | 0.015 | 0.045 | 0.215 | 0.304 | 0.127
ECE | LL-LA Ours | 0.007 | 0.029 | 0.026 | 0.233 | 0.135
ECE | LA Ours | 0.009 | 0.027 | 0.042 | 0.017 | 0.130

Table 13: Wall-clock times for the last-layer ViT base model on CIFAR-10 in milliseconds.

Model | Method | Avg. runtime (± std)
Last-layer ViT | MAP | 3.732±0.091
Last-layer ViT | LA Sampling | 188.732±0.051
Last-layer ViT | LA GLM | 5.517±0.033
Last-layer ViT | Ours | 3.782±0.088

Figure 10: Pixel sensitivity of MLP classifiers trained on binary classification tasks (0/6/8) for MNIST digits (panels labelled y = 0, y = 6, y = 8). The rows show the sensitivity of the MAP predictor to pixel perturbations, and the pixel sensitivity for a last-layer Laplace approximation. The predictive distribution is approximated analytically in both cases. We observe that the Bayesian model using a Laplace approximation has fewer spurious sensitivities to pixel perturbations, indicating that it is more robust to input perturbations. The sensitivities are visualised in the range (0.5 …).

B.3 IMAGE PIXEL SENSITIVITY

We trained a 4-layer MLP classifier on MNIST digits zero and eight using a batch size of 64, a learning rate of 1e−3, and weight decay of 1e−5, for 50 epochs. We used a subset of 0.1% of the training data as a held-out validation set and assumed a full-covariance Gaussian distribution for each input, centred at the pixel values of the datum and with a fixed covariance of 1e−5. We then computed the pixel sensitivities for the trained model by learning the pixel-wise input covariance matrices, minimising the negative log-likelihood on the held-out validation set while jointly maximising the entropy of the input distributions. The optimisation was performed for each image independently using Adam with a learning rate of 5e−3, until the divergence of the validation loss from the initial loss dropped below 1e−2, which typically took around 700 iterations. Fig. 10 shows additional examples of the input-dependent sensitivities.
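The pixel-sensitivity procedure above can be sketched as follows. The function `analytic_nll(x_mean, x_var, y)` is a hypothetical stand-in for the closed-form predictive NLL obtained by pushing the input Gaussian through the locally linearised network; the fixed-step schedule and the exact objective below are our own simplifications for illustration and are not meant to reproduce the paper's stopping criterion.

```python
import torch

def pixel_sensitivities(analytic_nll, x, y, steps=700, lr=5e-3):
    """Sketch of the per-image pixel-sensitivity optimisation in App. B.3.

    analytic_nll(x_mean, x_var, y) is assumed to return the (differentiable) NLL of
    label y under the closed-form predictive for the input Gaussian N(x_mean, diag(x_var)).
    We learn a diagonal input covariance by trading the NLL off against the entropy
    of the input distribution; pixels that can take large variance without hurting the
    NLL are ones the prediction is insensitive to.
    """
    log_var = torch.full_like(x, -5.0, requires_grad=True)   # small initial variance
    opt = torch.optim.Adam([log_var], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        var = log_var.exp()
        entropy = 0.5 * (log_var + torch.log(torch.tensor(2 * torch.pi)) + 1).sum()
        loss = analytic_nll(x, var, y) - entropy             # fit the data, maximise entropy
        loss.backward()
        opt.step()
    return (-log_var).detach()   # low learned variance = high pixel sensitivity
```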
B.4 EFFECT OF THE NUMBER OF MC SAMPLES ON PERFORMANCE

We investigate the influence of the number of MC samples on performance. On regression tasks with a small-scale neural network (two-layer MLP), we run experiments with sample counts in [100, 500, 1000, 5000, 10000, 50000]. On classification tasks with a medium-scale neural network (four-layer MLP), we use the range [100, 500, 1000, 5000, 10000, 25000]. On classification tasks with a large-scale neural network (ViT-Base), we use the range [10, 20, ..., 100]. The results are reported in Figs. 11 to 13; the number of samples used to report results in the main paper is shown as a dashed line. In Fig. 11 and Fig. 13, we observe that for classification on the large-scale network and regression on the small-scale network, the performance saturates at 50 samples (classification with ViT-Base) and 1000 samples (regression with the two-layer MLP), respectively. In Fig. 12, the performance saturates at 1000 samples for LA. For MFVI the performance saturates at 5000 samples; however, as the NLPD improvement from 1000 to 5000 samples is marginal (from 2.13 to 2.11), we also set the number of samples to 1000 for MFVI in the experiments.

Figure 11: Effects of the number of MC samples on performance for LA and MFVI on CIFAR-10 with the ViT-Base model. In the results reported in the main paper, we set the number of MC samples to 50 (dashed line).

B.5 ESTIMATING DEGREE OF LOCAL LINEARITY

We performed an additional experiment to assess the degree of local linearity of a trained MLP with ReLU activation functions. In particular, for a trained MLP $f(\cdot)$, we estimate the expected absolute error
$$\delta_{\mathrm{Lin}} = \mathbb{E}_{z \sim p(z)}\big[\,\big|f\big(z(1+\epsilon)\big) - f(z)(1+\epsilon)\big|\,\big], \quad (89)$$
where $\epsilon \ge 0$, and $\delta_{\mathrm{Lin}}$ is zero for any $\epsilon$ if $f(\cdot)$ is linear around each $z$. In our experiments, we vary $\epsilon$ in the range $[10^{-6}, 10^{-5}, \ldots, 1]$ for a fully connected ReLU MLP with layer sizes [784, 128, 64, 10] trained on MNIST digits. After training, we remove the softmax operation on the last layer and measure the local linearisation error on the logits. We estimate the error on a random subset of 124 validation data points and estimate the range of the inputs and of the function outputs on the same subset. The range of input values is 3.246, and the range of the function outputs varies between 153.072 and 291.168. Fig. 14 shows the results for each of the ten output dimensions, scaled relative to their respective range. We observe that the trained ReLU MLP obtains a low expected absolute error and behaves locally linearly to a certain degree.
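A minimal NumPy sketch of this estimate (our own illustration, assuming the $(1+\epsilon)$ scaling convention of Eq. (89) and treating the validation subset as the Monte Carlo sample from $p(z)$) is:

```python
import numpy as np

def estimate_local_linearity(f, Z, epsilons):
    """Estimate the expected absolute linearisation error of Eq. (89).

    f: callable mapping a batch of inputs (N, D) to logits (N, C); the softmax is
    assumed to be removed so the error is measured on the logits.
    Z: (N, D) validation inputs used as the Monte Carlo sample from p(z).
    epsilons: iterable of perturbation sizes; the error is zero for every epsilon
    if f behaves linearly around each z.
    """
    errors = {}
    for eps in epsilons:
        err = np.abs(f(Z * (1.0 + eps)) - f(Z) * (1.0 + eps))   # |f(z(1+eps)) - (1+eps) f(z)|
        errors[eps] = err.mean(axis=0)                           # per output dimension
    return errors
```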
B.6 COMPARISON OF VALUE COVARIANCE IN TRANSFORMERS

One way to improve efficiency in the transformer is to drop the correlations between values, i.e., to drop the covariance terms $\mathrm{Cov}[v_k, v_l]$ given in Eq. (74). We compare the performance of both approximations; the results are given in Tables 14 to 16. Both approximations result in almost the same NLPD, ACC, and ECE.

Table 14: Negative log predictive density (NLPD) for ViT with posterior approximation in the attention layers. We compare considering only the variance of the values V against considering the full covariance. For both MFVI and LA, the two approximations give almost the same result.

Data set | MFVI Full covariance | MFVI Only variance | LA Full covariance | LA Only variance
CIFAR-10 | 0.087±0.006 | 0.088±0.006 | 0.086±0.006 | 0.086±0.006
CIFAR-100 | 0.468±0.012 | 0.467±0.012 | 0.456±0.012 | 0.456±0.012
DTD | 1.006±0.035 | 1.007±0.035 | 1.068±0.035 | 1.068±0.035
RESISC | 0.619±0.019 | 0.616±0.019 | 0.353±0.012 | 0.352±0.012
IMAGENET-R | 1.234±0.052 | 1.233±0.052 | 1.264±0.043 | 1.267±0.043

Table 15: Accuracy (ACC) for ViT with posterior approximation in the attention layers. We compare considering only the variance of the values V against considering the full covariance. For both MFVI and LA, the two approximations give almost the same result.

Data set | MFVI Full covariance | MFVI Only variance | LA Full covariance | LA Only variance
CIFAR-10 | 0.975±0.002 | 0.975±0.004 | 0.976±0.002 | 0.976±0.004
CIFAR-100 | 0.880±0.003 | 0.880±0.009 | 0.880±0.003 | 0.880±0.009
DTD | 0.734±0.010 | 0.734±0.012 | 0.719±0.010 | 0.719±0.012
RESISC | 0.867±0.004 | 0.867±0.009 | 0.892±0.004 | 0.892±0.008
IMAGENET-R | 0.727±0.012 | 0.728±0.012 | 0.739±0.012 | 0.739±0.012

Table 16: Expected calibration error (ECE) for ViT with posterior approximation in the attention layers. We compare considering only the variance of the values V against considering the full covariance. For both MFVI and LA, the two approximations give almost the same result.

Data set | MFVI Full covariance | MFVI Only variance | LA Full covariance | LA Only variance
CIFAR-10 | 0.007 | 0.008 | 0.009 | 0.008
CIFAR-100 | 0.024 | 0.026 | 0.027 | 0.023
DTD | 0.040 | 0.040 | 0.042 | 0.043
RESISC | 0.018 | 0.016 | 0.017 | 0.020
IMAGENET-R | 0.043 | 0.132 | 0.130 | 0.039

Figure 12: Effects of the number of MC samples on performance for LA and MFVI on FMNIST. In the results reported in the main paper, we set the number of MC samples to 1000 (dashed line).

Figure 13: Effects of the number of MC samples on performance for LA and MFVI on regression tasks (CCPP). In the results reported in the main paper, we set the number of MC samples to 1000 (dashed line).

Figure 14: Estimated divergence from a locally linear function (expected absolute error $\delta_{\mathrm{Lin}}$) as a function of $\epsilon$. Note that a value of zero means that the function behaves locally like a linear function.

B.7 RUNTIME EXPERIMENT

We compared the runtime of our method against sampling (using the torch-laplace library (Daxberger et al., 2021a) for the Laplace approximation and IVON (Shen et al., 2024)) and against the GLM implementation of the torch-laplace library for diagonal posterior covariances. For our streamlined approach on ViT, we assessed two cases: (i) propagating the $\mathrm{Cov}[v_k, v_l]$ covariance terms (cf. Eq. (74)) through the transformer (+Cov), and (ii) ignoring the $\mathrm{Cov}[v_k, v_l]$ covariance terms. We used a pre-trained ViT base model on CIFAR-10 and a pre-trained MLP on MNIST. For comparison, we also list the runtime of a single forward pass. We ran the experiments on an NVIDIA H100 80GB GPU for 400 data points with a batch size of one, and for each data point we repeated the measurement ten times. To account for code compilation overheads, we dropped the first run on each data point. We report the mean and standard deviation of the runtime (in milliseconds) over the remaining nine runs and all 400 data points. The results are shown in Table 17. For ViT, our method without the $\mathrm{Cov}[v_k, v_l]$ covariance terms has a runtime comparable to a single forward pass through the deterministic model.
When additionally accounting for covariance terms, we obtain slight speed improvements over GLM but overall comparable performance. Note that our implementation is not optimised for speed and larger speedups may be obtained by optimising the code. For MLP, we obtain slight speed improvements over LA GLM but overall comparable performance.

Table 17: Wall-clock times for ViT base on CIFAR-10 and MLP on MNIST in milliseconds.

Model | Method | Avg. runtime (± std)
ViT | MAP | 3.737±0.093
ViT | LA Sampling | 190.806±0.137
ViT | LA GLM | 17.191±0.734
ViT | MFVI Sampling | 207.854±0.307
ViT | Ours (+Cov) | 14.728±0.144
ViT | Ours | 4.350±0.079
MLP | MAP | 0.069±0.001
MLP | LA Sampling | 98.584±3.737
MLP | LA GLM | 1.656±0.049
MLP | MFVI Sampling | 190.302±0.466
MLP | Ours | 0.542±0.073