# Training-Free Neural Active Learning with Initialization-Robustness Guarantees

Apivich Hemachandra¹, Zhongxiang Dai¹, Jasraj Singh², See-Kiong Ng¹, Bryan Kian Hsiang Low¹

¹Department of Computer Science, National University of Singapore, Republic of Singapore. ²School of Computer Science and Engineering, Nanyang Technological University, Republic of Singapore. Correspondence to: Zhongxiang Dai.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Existing neural active learning algorithms have aimed to optimize the predictive performance of neural networks (NNs) by selecting data for labelling. However, other than a good predictive performance, being robust against random parameter initializations is also a crucial requirement in safety-critical applications. To this end, we introduce our expected variance with Gaussian processes (EV-GP) criterion for neural active learning, which is theoretically guaranteed to select data points which lead to trained NNs with both (a) good predictive performances and (b) initialization robustness. Importantly, our EV-GP criterion is training-free, i.e., it does not require any training of the NN during data selection, which makes it computationally efficient. We empirically demonstrate that our EV-GP criterion is highly correlated with both initialization robustness and generalization performance, and show that it consistently outperforms baseline methods in terms of both desiderata, especially in situations with limited initial data or large batch sizes.

## 1. Introduction

Deep neural networks (NNs) have recently achieved impressive performances in various applications thanks to the availability of massive amounts of labelled data. However, in some applications, data labelling is so costly that obtaining large datasets is infeasible. In this regard, a number of works have used the method of neural active learning to select a small number of data points to be labelled and subsequently used for the training of the NNs (Ren et al., 2021; Sener & Savarese, 2018; Kirsch et al., 2019; Ash et al., 2020).

Figure 1. Neural networks initialized randomly with different seeds (left), after training, can give consistent outputs (center) or highly varying outputs (right) depending on the training data.

When selecting the data for labelling, existing neural active learning methods have only aimed to optimize the predictive performance (e.g., generalization performance) of the trained NNs. However, the output from a neural network is heavily dependent on its training process, especially on its initialization (Cyr et al., 2019; Lee et al., 2020). Therefore, in some safety-critical applications, having small variability in the trained NNs with respect to model initialization is also a crucial requirement. For example, given a training set of patient data in a healthcare application (e.g., disease diagnosis), if the NNs trained using different random initializations have drastically disparate predictions, the resulting NNs would be too unreliable to be used for clinical diagnosis due to the high-stakes nature of the application (Esteva et al., 2019).
As a simple illustration, Fig. 1 shows that different training sets can indeed result in distinct sensitivities of the trained NNs to model initialization, and hence a badly constructed training set (e.g., the rightmost panel in Fig. 1) can cause the trained NN to be highly variable (and hence untrustworthy) w.r.t. the model initialization. Therefore, in these important applications, it is paramount for a neural active learning algorithm to select data points which lead to trained NNs with not only (a) good predictive performances but also (b) initialization robustness.

To this end, we introduce our Expected Variance with Gaussian Processes (EV-GP) criterion, which is theoretically shown to select data points which satisfy both criteria. Specifically, we firstly leverage the theory of the neural tangent kernel (NTK) (Jacot et al., 2018) and overparameterized NNs (Lee et al., 2020) to characterize the predictive distribution of trained NNs (Sec. 3.1) with respect to random model initialization. This allows us to measure the robustness of the trained NNs against model initialization via the output variance, i.e., a smaller output variance indicates a more initialization-robust NN. Next, we introduce an efficiently computable approximation of the output variance, and derive a theoretical guarantee on the approximation quality (Sec. 3.1). Subsequently, we show that this approximate output variance, which is an indicator of (the inverse of) initialization robustness, is also, interestingly, an upper bound on the generalization error of the trained NNs (Sec. 3.2). As a result, our EV-GP active learning criterion (Sec. 3.3), which is based on the minimization of the above-mentioned approximate output variance, is able to select data points which lead to both (a) good predictive performances (i.e., small generalization errors) and (b) initialization robustness.

Moreover, our EV-GP criterion also has the additional advantages of computational efficiency and generality. Firstly, during data selection, the computation of our EV-GP criterion does not require training of the NN, which is usually computationally costly. Therefore, our EV-GP criterion is computationally efficient and can be used to select all data points in a small number of batches or even a single batch. Secondly, our EV-GP criterion does not impose any requirement on the training process of the NN, and is hence applicable to generic NNs rather than requiring specific types of models such as Bayesian NNs. Furthermore, we introduce a theoretically grounded method for model selection which is able to automatically optimize the architecture of the NN based on the selected data points (Sec. 4).

We empirically show that our EV-GP criterion can accurately characterize the output variance (w.r.t. the random model initialization) and hence the initialization robustness of trained NNs (Secs. 5.1 and 5.2). Furthermore, we use extensive real-world regression (Sec. 5.2) and classification (Sec. 5.3) experiments to show that our EV-GP criterion outperforms existing baselines in terms of both initialization robustness and predictive performances. Of note, the performance advantage of our EV-GP criterion is particularly pronounced when there is limited labelled data or the batch size is large (Sec. 5.3). Lastly, we also empirically verify that our model selection method (Sec. 4) is able to further improve the performance of our EV-GP criterion (Sec. 5.4).
## 2. Problem Setup and Background

In our problem setup, consistent with previous works on active learning, we have access to an unlabelled pool of input data $X_U$ whose corresponding labels are expensive to obtain. Also, suppose we have a set of testing input data $X_T$ (whose corresponding labels are unknown), which we want our NN to eventually make predictions on. In practice, when a set of testing inputs is not available, we can simply let $X_T = X_U$, which we have done in our experiments. Additionally, we are given a set $L_0 = (X_0, y_0)$ of already-labelled data (which may be empty), and the algorithm proceeds to construct a set $L = (X_L, y_L)$ where $X_L \setminus X_0 \subseteq X_U$ and $|L \setminus L_0| \le k$. The parameter $k$ can be considered the budget for active learning, which is a limit on the number of queries to the oracle. To construct $L$, in each round of active learning, the algorithm selects a set of $b$ unlabelled input data points $X_b \subseteq X_U$ and submits them to the oracle to obtain their corresponding labels $y_b$. It is desirable to have minimal training of the NN between different rounds of querying in order to avoid incurring excessive computational costs and long delays between rounds.

Once the labelled training set $L$ is obtained after the active learning algorithm, we use it as the training set $\mathcal{D} = L$ to train an NN $f(\cdot\,; \theta)$, where $\theta$ denotes its parameters. The model parameters are firstly initialized, $\theta_0 \sim \mathrm{init}(\theta)$, and following the common practice (Glorot & Bengio, 2010), we assume that each parameter of $\theta_0$ is independently drawn from a Gaussian distribution with zero mean and a known variance. After initialization, the model is trained with gradient-based optimization to minimize the regularized mean-squared error (MSE) loss:

$$\ell(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \frac{1}{2} \left\| f(x; \theta) - y \right\|^2 + \frac{\lambda}{2} \left\| \theta \right\|^2. \qquad (1)$$

The loss function comprises a mean-squared error term (first term) and a regularization term (second term), in which $\lambda$ controls their trade-off. Note that the regularized MSE loss (1) is only required in our theoretical analysis; in practice, other loss functions such as the cross-entropy loss can also be used (Sec. 5.3). The NN model is assumed to be trained till convergence, resulting in the parameters $\theta^* = \mathrm{train}(\theta_0)$, where $\mathrm{train}(\cdot)$ denotes the function of the training process. As discussed in Sec. 1, we would like the final model predictions $f(x; \mathrm{train}(\theta_0))$ to achieve both small generalization errors and low output variances with respect to the randomness of $\theta_0$.

### 2.1. Neural Tangent Kernels

When training an NN on a training set $\mathcal{D} = (X, y)$ with gradient descent (GD), $\theta_{t+1} \leftarrow \theta_t - \eta \nabla_\theta \ell(\mathcal{D}; \theta_t)$ where $\eta$ is the learning rate, it has been shown (Lee et al., 2020) that as long as $\eta$ is small enough, the training can be approximated by continuous GD. As a result, the change in the predictive output $f(X; \theta_t)$ over time can be expressed as

$$\frac{\partial f(X; \theta_t)}{\partial t} = -\eta\, \underbrace{\nabla_\theta f(X; \theta_t)\, \nabla_\theta f(X; \theta_t)^\top}_{\hat{\Theta}_t(X, X)}\, \nabla_{f(X; \theta_t)} \ell(\mathcal{D}; \theta_t). \qquad (2)$$

The term $\hat{\Theta}_t(X, X') \triangleq \nabla_\theta f(X; \theta_t)\, \nabla_\theta f(X'; \theta_t)^\top$ is referred to as the (empirical) neural tangent kernel (NTK) (Jacot et al., 2018; Arora et al., 2019b). While kernels are defined over a tuple of elements from the input space, we will overload the notation for all kernels and use $\Theta(X, X') = (\Theta(x, x'))_{x \in X,\, x' \in X'}$ to represent the matrix constructed using values of the kernel. We will sometimes use the shorthand notation $\Theta_X$ to represent $\Theta(X, X)$ and $\Theta_{X'X}$ to represent $\Theta(X', X)$ when the context is clear, which is consistent with previous works on NTK (He et al., 2020).
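To make the definition of the empirical NTK concrete, below is a minimal self-contained NumPy sketch (ours, not the authors' implementation; the paper's experiments instead compute NTKs with PyTorch or the JAX-based Neural-Tangents package) for a one-hidden-layer scalar-output ReLU MLP, where the parameter gradients are written out analytically and $\hat{\Theta}(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$:

```python
import numpy as np

def init_params(d_in, width, seed=0):
    # Gaussian initialization of all parameters, as assumed in Sec. 2.
    rng = np.random.default_rng(seed)
    return {"W1": rng.standard_normal((width, d_in)),
            "b1": rng.standard_normal(width),
            "w2": rng.standard_normal(width)}

def forward(params, x):
    # f(x; theta) = w2 . relu(W1 x + b1) / sqrt(width)
    h = np.maximum(params["W1"] @ x + params["b1"], 0.0)
    return params["w2"] @ h / np.sqrt(len(h))

def param_gradient(params, x):
    # Analytic gradient of the scalar output w.r.t. (W1, b1, w2), flattened.
    z = params["W1"] @ x + params["b1"]
    h = np.maximum(z, 0.0)
    mask = (z > 0).astype(float)
    s = np.sqrt(len(h))
    dW1 = np.outer(params["w2"] * mask, x) / s
    db1 = params["w2"] * mask / s
    dw2 = h / s
    return np.concatenate([dW1.ravel(), db1, dw2])

def empirical_ntk(params, X):
    # Theta_hat(x, x') = <grad_theta f(x), grad_theta f(x')>
    J = np.stack([param_gradient(params, x) for x in X])
    return J @ J.T

X = np.random.default_rng(1).standard_normal((5, 3))
params = init_params(d_in=3, width=512)
print(empirical_ntk(params, X).shape)  # (5, 5), symmetric PSD
```

For wide networks, the resulting matrix is close to the deterministic kernel $\Theta$ discussed next.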
The work of Lee et al. (2020) has shown that when the width of the NN approaches infinity, the output of the NN can be approximated by a linear model:

$$f(x; \theta) \approx f(x; \theta_0) + \langle \nabla_\theta f(x; \theta_0),\, \theta - \theta_0 \rangle, \qquad (3)$$

and the empirical NTK $\hat{\Theta}_t(\cdot, \cdot)$ stays constant throughout training and approaches a deterministic kernel $\Theta(\cdot, \cdot)$ regardless of the initialization (Lee et al., 2020). Using the linear approximation (3), Lee et al. (2020) have shown that if a randomly initialized NN is trained on data $(X, y)$ with the mean-squared error loss ((1) with $\lambda = 0$) till convergence, then the predictions of the converged model on a testing set $X_T$ follow a normal distribution, $f(X_T) \sim \mathcal{N}(\mu_{\mathrm{NN}}(X_T | X, y),\, \Sigma_{\mathrm{NN}}(X_T | X))$, where the output mean is

$$\mu_{\mathrm{NN}}(X_T | X, y) = \Theta_{X_T X}\, \Theta_X^{-1}\, y, \qquad (4)$$

and the output covariance is

$$\Sigma_{\mathrm{NN}}(X_T | X) = K_{X_T} + \Theta_{X_T X}\, \Theta_X^{-1}\, K_X\, \Theta_X^{-1}\, \Theta_{X X_T} - \left( \Theta_{X_T X}\, \Theta_X^{-1}\, K_{X X_T} + K_{X_T X}\, \Theta_X^{-1}\, \Theta_{X X_T} \right). \qquad (5)$$

When the testing set $X_T$ consists of only a single point $x$, we use $\sigma^2_{\mathrm{NN}}(x | X) = \Sigma_{\mathrm{NN}}(X_T | X)$ to denote the predictive variance at $x$. The kernel $K$ in (5), which is defined as $K(x, x') = \mathbb{E}_{\theta_0 \sim \mathrm{init}(\theta)}[f(x; \theta_0)\, f(x'; \theta_0)]$, is the covariance of the NN output with respect to the random initialization. Similar to $\Theta$ above, we have used the shorthand notations $K_X$ and $K_{X'X}$ to represent $K(X, X)$ and $K(X', X)$, respectively. We will also use $\mu_{\mathrm{NN},f}$ and $\sigma^2_{\mathrm{NN},f}$ to represent the predictive mean and variance calculated using a particular model architecture $f$.

### 2.2. Gaussian Processes

To derive our active learning criterion (Sec. 3), we will make use of tools from the literature of Gaussian processes (GPs) (Rasmussen & Williams, 2006). A GP is fully specified by a prior mean function $\mu(x)$, which is usually assumed to be $\mu(x) = 0$ for all $x$, and a covariance function $K(x, x')$ (also called the kernel function). Given some data $(X, y)$, the GP posterior predictive distribution of the outputs $y_T$ at a testing set $X_T$ (in the noiseless setting) is given by $\mathcal{N}\big(K_{X_T X} K_X^{-1} y,\; K_{X_T} - K_{X_T X} K_X^{-1} K_{X X_T}\big)$. The principled uncertainty measures provided by GPs have been used for data selection in active learning (Krause & Guestrin, 2007; Krause et al., 2008; Hoang et al., 2014a;b; Ling et al., 2016; Zhang et al., 2016; Nguyen et al., 2021c; Xu et al., 2023). For example, Krause & Guestrin (2007) have used the mutual information between the selected data and the unlabelled data, calculated using GPs, as the active learning criterion.

## 3. Initialization-Robust Active Learning

In this section, we firstly introduce a computationally efficient approximation of the output variance of an NN w.r.t. random initializations, and derive a theoretical guarantee on the approximation quality (Sec. 3.1). Next, we show that the approximate output variance from Sec. 3.1 is also, interestingly, an upper bound on the generalization error of the NN (Sec. 3.2). Finally, we use the approximate output variance to design our active learning criterion (Sec. 3.3).

### 3.1. Approximation of Output Variance of Trained Neural Network

In the infinite-width regime, when an NN with initial parameters $\theta_0$ is trained till convergence to yield parameters $\mathrm{train}(\theta_0)$, the output of the NN on a test set $X_T$ follows a normal distribution with mean $\mu_{\mathrm{NN}}(X_T | X, y)$ (4) and covariance $\Sigma_{\mathrm{NN}}(X_T | X)$ (5) (Lee et al., 2020). The randomness in the output distribution results from the random initialization $\theta_0$. Given this, we see that $\Sigma_{\mathrm{NN}}$ can serve as a natural and principled measure of initialization robustness.
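The predictive moments in (4) and (5) reduce to a few matrix products once the kernel matrices are available. The following is a minimal NumPy sketch (ours; in practice the NTK matrices `Theta_*` and the initial-output-covariance matrices `K_*` would come from a package such as Neural-Tangents), using the symmetry $K_{X_T X} = K_{X X_T}^\top$:

```python
import numpy as np

def nn_predictive(Theta_TX, Theta_X, K_T, K_X, K_XT, y):
    # Predictive mean (4) and covariance (5) of an infinitely wide NN
    # trained to convergence on (X, y).
    A = np.linalg.solve(Theta_X, Theta_TX.T).T          # Theta_{X_T X} Theta_X^{-1}
    mean = A @ y                                        # eq. (4)
    cov = K_T + A @ K_X @ A.T - (A @ K_XT + K_XT.T @ A.T)  # eq. (5)
    return mean, cov
```

For a single test point $x$, the variance $\sigma^2_{\mathrm{NN}}(x|X)$ is simply the scalar diagonal entry of the returned covariance.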
However, $\Sigma_{\mathrm{NN}}$ presents a significant computational challenge because it requires computing two different kernels (i.e., $\Theta$ and $K$) and a number of matrix inversion and multiplication operations. Therefore, by drawing inspiration from GPs (Rasmussen & Williams, 2006), we introduce an approximation of $\Sigma_{\mathrm{NN}}$ which is both computationally efficient (Sec. 3.1.1) and equipped with a theoretical guarantee on the approximation quality (Sec. 3.1.2).

#### 3.1.1. Computational Efficiency

Given some data $(X, y)$, performing GP regression with the NTK as the covariance function (which we refer to as NTKGP, following He et al. (2020)) leads to the output distribution (on a testing set $X_T$) of $f(X_T) \sim \mathcal{N}(\mu_{\mathrm{NTKGP}}(X_T | X, y),\, \Sigma_{\mathrm{NTKGP}}(X_T | X))$, where $\mu_{\mathrm{NTKGP}} = \mu_{\mathrm{NN}}$ (4) and

$$\Sigma_{\mathrm{NTKGP}}(X_T | X) = \Theta_{X_T} - \Theta_{X_T X}\, \Theta_X^{-1}\, \Theta_{X X_T}. \qquad (6)$$

When $X_T$ contains a single point $x$, we use $\sigma^2_{\mathrm{NTKGP}}(x | X) = \Sigma_{\mathrm{NTKGP}}(X_T | X)$ to denote the output variance at $x$. Compared with $\Sigma_{\mathrm{NN}}$, the covariance $\Sigma_{\mathrm{NTKGP}}$ is more efficient to compute because (a) it only requires computing one kernel $\Theta$ (instead of two), which can also be easily approximated using the inner product of the parameter gradients at initialization (Sec. 2.1), and (b) the posterior distributions of GPs are well studied, allowing us to adopt existing tools to further reduce the computational cost of $\Sigma_{\mathrm{NTKGP}}$. Even though $\Sigma_{\mathrm{NTKGP}}$ incurs a cost of $\mathcal{O}(n^3)$ with $n$ queried data points due to the inversion of an $n \times n$ matrix, we are able to use results from linear algebra to perform low-rank updates to the GP covariance and reduce the running time to $\mathcal{O}(n^2)$. Furthermore, we can adopt techniques from the abundant literature on sparse GPs to significantly reduce the dependency on $n$ (Quiñonero-Candela & Rasmussen, 2005; Hoang et al., 2015). We discuss the various approximation methods for $\Sigma_{\mathrm{NTKGP}}$ in App. C.

#### 3.1.2. Guaranteed Approximation Quality

To provide a theoretical justification for our approximation $\Sigma_{\mathrm{NTKGP}}$ (6), we need to theoretically bound the difference between $\Sigma_{\mathrm{NTKGP}}$ and $\Sigma_{\mathrm{NN}}$. The work of He et al. (2020) has shown that $\Sigma_{\mathrm{NN}}(X' | X) \preceq \Sigma_{\mathrm{NTKGP}}(X' | X)$. In other words, for a single test point $x$, we have that $\sigma^2_{\mathrm{NN}}(x | X) \le \sigma^2_{\mathrm{NTKGP}}(x | X)$, which suggests that the NTKGP approximation (6) is an overestimation of the true output variance (5). However, this result is not enough for guaranteeing a small approximation error. Therefore, we prove here a stronger connection between $\sigma^2_{\mathrm{NN}}(x | X)$ and $\sigma^2_{\mathrm{NTKGP}}(x | X)$:

Theorem 3.1 (Informal). Let $f(\cdot\,; \theta)$ be an infinite-width NN with ReLU activation and $L \ge 2$ hidden layers whose NTK satisfies $|\Theta(x, x')| \le B$ for all $x, x' \in \mathcal{X}$. Then, there exist some constants $\alpha > 0$ and $\beta = \mathcal{O}\big(\mathrm{poly}(|X|, B, L, \lambda_{\min}(\Theta_X)^{-1})\big)$ such that

$$\left| \sigma^2_{\mathrm{NN}}(x | X) - \alpha\, \sigma^2_{\mathrm{NTKGP}}(x | X) \right| \le \beta.$$

Theorem 3.1 shows that $\sigma^2_{\mathrm{NN}}(x | X)$ and $\alpha\, \sigma^2_{\mathrm{NTKGP}}(x | X)$ have a bounded difference and are hence expected to behave similarly. Of note, when using the active learning criterion based on $\sigma^2_{\mathrm{NTKGP}}$ (Sec. 3.3) to select data points to query, what affects the selected points is only the relative ranking of the values of the criterion (at different inputs in the unlabelled pool); therefore, the presence of the constant $\alpha > 0$ does not affect our active learning algorithm. The approximation error in Theorem 3.1 depends on a number of factors, including the model architecture (which affects the difference between $\Theta$ and $K$), the number of points in $X$ (due to the accumulation of the approximation errors with more data points), and the eigenvalues of $\Theta_X$ (which affect how well-behaved the matrix $\Theta_X$ is).

As a brief sketch of the proof, firstly, it can be verified that if $K(x, x') = \alpha\, \Theta(x, x')$, then $\sigma^2_{\mathrm{NN}}(x | X) = \alpha\, \sigma^2_{\mathrm{NTKGP}}(x | X)$.
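To see this first step explicitly (our own one-line expansion, not part of the original text), substitute $K = \alpha\Theta$ into (5) and compare with (6):

$$\Sigma_{\mathrm{NN}}(X_T | X)\Big|_{K = \alpha\Theta} = \alpha\,\Theta_{X_T} + \alpha\,\Theta_{X_T X}\Theta_X^{-1}\Theta_X\Theta_X^{-1}\Theta_{X X_T} - 2\alpha\,\Theta_{X_T X}\Theta_X^{-1}\Theta_{X X_T} = \alpha\big(\Theta_{X_T} - \Theta_{X_T X}\Theta_X^{-1}\Theta_{X X_T}\big) = \alpha\,\Sigma_{\mathrm{NTKGP}}(X_T | X).$$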
However, in general, $K(x, x') \ne \alpha\, \Theta(x, x')$. Instead, we have managed to show that the ratio between $K(x, x')$ and $\Theta(x, x')$ is bounded, i.e., there exist some constants $a_- > 0$ and $a_+ > 0$ (which depend on $L$) such that

$$a_- \le K(x, x') / \Theta(x, x') \le a_+. \qquad (7)$$

As a result, (7) allows us to bound $\left| \sigma^2_{\mathrm{NN}}(x | X) - \alpha\, \sigma^2_{\mathrm{NTKGP}}(x | X) \right|$. The complete proof is presented in App. A. We will also provide empirical justifications for Theorem 3.1 in Sec. 5.1 by showing that $\sigma^2_{\mathrm{NTKGP}}$ is indeed highly correlated with the empirical output variance of the NN resulting from different model initializations, and is hence a reliable indicator of initialization robustness.

### 3.2. Connection with Generalization Error

In this section, we show that the approximate output variance $\sigma^2_{\mathrm{NTKGP}}$ (6) is also an upper bound on the generalization error of the trained NN and hence a good indicator of its predictive performance. To analyze the performance of the trained neural network, we make the following assumption about the groundtruth function $f^*$ in a manner similar to Vakili et al. (2021b).

Assumption 3.2. Assume that the groundtruth function $f^* \in \mathcal{H}_\Theta$. Specifically, $f^*$ lies in the reproducing kernel Hilbert space (RKHS) of the NTK $\Theta$, or equivalently, its RKHS norm is such that $\|f^*\|_{\mathcal{H}_\Theta} \le B$ for some $B \in \mathbb{R}_{\ge 0}$. Further assume that the function observation at any input $x_i$ is given by $y_i = f^*(x_i) + \xi_i$, in which every $\xi_i$ is i.i.d. observation noise drawn from an $R$-sub-Gaussian distribution: $\mathbb{E}[\exp(\eta \xi_i)] \le \exp(\eta^2 R^2 / 2)$ for all $\eta \in \mathbb{R}$.

Both assumptions in Assumption 3.2 are commonly made in the analysis of kernelized and neural bandits (Chowdhury & Gopalan, 2017; Kassraie & Krause, 2022). They allow us to show the following theoretical guarantee on the generalization error:

Theorem 3.3 (Informal). Suppose we train an infinitely wide NN $f(\cdot\,; \theta)$ with training dataset $(X, y)$ on the mean-squared error loss function using gradient descent until convergence. Then, there exists a constant $\zeta = \mathcal{O}(\mathrm{poly}(B, R))$ such that for any $x \in \mathcal{X}$, with high probability over the random observation noise $\xi$ and network initialization $\theta_0$,

$$\left| f^*(x) - f(x; \mathrm{train}(\theta_0)) \right| \le \zeta\, \sigma_{\mathrm{NTKGP}}(x | X). \qquad (8)$$

Theorem 3.3 shows that the generalization error of an NN trained via gradient descent is upper bounded by a term proportional to $\sigma_{\mathrm{NTKGP}}$ (6), where the constant of proportionality $\zeta$ is independent of $x$ and $X$. As a result, minimizing $\sigma_{\mathrm{NTKGP}}$ will not only (a) decrease the output variance (Sec. 3.1) and hence improve initialization robustness, but also (b) reduce the generalization error and hence enhance the predictive performance. The degree of correlation between $\sigma_{\mathrm{NTKGP}}$ and the generalization performance, represented by the constant $\zeta$, depends on the parameters $B$ and $R$: the easier the function $f^*$ is to learn (i.e., a smaller $B$) or the less noisy the observations are (i.e., a smaller $R$), the better the degree of correlation. Theorem 3.3 is also consistent with the empirically observed characteristics of over-parameterized NNs, because NN models with lower variance are also observed to have higher predictive accuracy (Neal et al., 2018). Theorem 3.3 is stated formally and proved in App. B.
### 3.3. Active Learning Criterion

Since we have shown that minimizing $\sigma_{\mathrm{NTKGP}}(x | X)$ can improve both initialization robustness (Sec. 3.1) and generalization performance (Sec. 3.2), we design our active learning criterion based on the minimization of $\sigma_{\mathrm{NTKGP}}(x | X)$ across all test input points $x \in X_T$. Specifically, our EV-GP criterion encourages the selection of input data points which result in a small expected output variance $\sigma^2_{\mathrm{NTKGP}}(x | X)$ across all test inputs, and we estimate the expected variance by averaging $\sigma^2_{\mathrm{NTKGP}}(x | X)$ over the available test set $X_T$:

$$\alpha_{\mathrm{EV}}(X) = \frac{1}{|X_T|} \sum_{x \in X_T} \left[ \sigma^2_{\mathrm{NTKGP}}(x | \varnothing) - \sigma^2_{\mathrm{NTKGP}}(x | X) \right]. \qquad (9)$$

We have added $\sigma^2_{\mathrm{NTKGP}}(x | \varnothing)$ to the criterion so that $\alpha_{\mathrm{EV}}(X) \ge 0$ and so that our criterion is to be maximized during active learning.¹ We will sometimes also use $\alpha_{\mathrm{EV}}(X; f)$ to indicate that the criterion uses the model architecture $f$.

¹This is to follow the convention of other active learning methods.

Our $\alpha_{\mathrm{EV}}$ criterion (9) has multiple computational benefits. Firstly, it is training-free, i.e., its calculation does not require any training of the NN and is hence able to sidestep significant computational costs resulting from model training. Secondly, it only requires calculating the variance at individual test points rather than the full covariance over the testing set. Thirdly, it can make use of the approximation techniques based on sparse GPs discussed in Sec. 3.1.1, for which we simply need to replace $\sigma^2_{\mathrm{NTKGP}}$ by its sparse GP counterparts in (9). Fourthly, it is monotone submodular, and therefore a greedy approach (i.e., selecting the point which gives the largest increase in the criterion in each selection round) is guaranteed to give a $(1 - 1/e)$-optimal solution (Nemhauser et al., 1978). We adopt the greedy approach in our experiments for simplicity (with some techniques for speedups which we discuss in App. E; a minimal sketch of the greedy selection is given at the end of this subsection), and leave the use of other more sophisticated submodular optimization techniques to future works.

Furthermore, another advantage of our $\alpha_{\mathrm{EV}}$ criterion is that it is label-independent, because the calculation of $\sigma^2_{\mathrm{NTKGP}}$ (6) does not require the observations. Therefore, our criterion does not need the heuristic of pseudo-labels which is required by previous active learning algorithms (Ash et al., 2020; Mohamadi et al., 2022). In addition to our $\alpha_{\mathrm{EV}}$ criterion (9), we can also use $\Sigma_{\mathrm{NTKGP}}$ to construct alternative criteria with different characteristics. We introduce two additional criteria in App. D, which are based on, respectively, mutual information (which requires computing the full covariance matrix) and the percentile variance (which is not submodular in general).
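As a concrete illustration, the following is a minimal NumPy sketch (ours, not the authors' released code) of greedy selection under (9). It assumes a precomputed NTK matrix `Theta` over the pooled points, with `pool_idx` and `test_idx` indexing the candidate pool and the test set. Since the prior-variance term $\sigma^2_{\mathrm{NTKGP}}(x|\varnothing)$ in (9) is constant, maximizing $\alpha_{\mathrm{EV}}$ is equivalent to minimizing the average posterior variance:

```python
import numpy as np

def ev_criterion(Theta, test_idx, sel_idx):
    # Negative mean posterior NTKGP variance over the test set, eq. (6);
    # equal to alpha_EV from eq. (9) up to the constant prior-variance term,
    # so the argmax is unchanged.
    if len(sel_idx) == 0:
        return -float(np.mean(np.diag(Theta)[test_idx]))
    T_X = Theta[np.ix_(test_idx, sel_idx)]
    X_X = Theta[np.ix_(sel_idx, sel_idx)]
    reduction = np.einsum("ij,ij->i", T_X, np.linalg.solve(X_X, T_X.T).T)
    post_var = np.diag(Theta)[test_idx] - reduction
    return -float(np.mean(post_var))

def greedy_ev_gp(Theta, pool_idx, test_idx, batch_size):
    # Greedily pick the pool point with the largest criterion increase;
    # monotone submodularity gives the (1 - 1/e) guarantee from Sec. 3.3.
    selected = []
    for _ in range(batch_size):
        remaining = [i for i in pool_idx if i not in selected]
        best = max(remaining,
                   key=lambda i: ev_criterion(Theta, test_idx, selected + [i]))
        selected.append(best)
    return selected
```

For clarity this sketch re-solves the linear system for every candidate; an efficient implementation would instead use the rank-one (low-rank) posterior updates mentioned in Sec. 3.1.1.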
## 4. Initialization-Robust Active Learning with Model Selection

A common issue with existing neural active learning algorithms is that a model architecture has to be fixed in advance and then used for the data point selection. However, in practice, especially when having no access to (labelled) data beforehand, it is infeasible to select the best model architecture prior to running the neural active learning algorithms. To this end, by leveraging the output distributions of overparameterized NNs in a similar way to Sec. 3.1, we extend our initialization-robust active learning algorithm (Sec. 3) to simultaneously select the data points to query (Sec. 3.3) and optimize the model architecture in a training-free manner.

Given a model architecture $f$ and a training set $\mathcal{D} = (X, y)$, the expected squared error of the trained model on a testing set $\mathcal{D}_T$ w.r.t. random parameter initializations is given by (proof in App. F)

$$\hat{\alpha}_{M, \mathcal{D}_T}(f; \mathcal{D}) \triangleq \mathbb{E}_{\theta_0 \sim \mathrm{init}(\theta)} \left[ \ell(\mathcal{D}_T; \mathrm{train}(\theta_0)) \right] = \frac{1}{|\mathcal{D}_T|} \sum_{(x, y) \in \mathcal{D}_T} \Big[ \underbrace{\left( y - \mu_{\mathrm{NN},f}(x | X) \right)^2}_{(\mathrm{i})} + \underbrace{\sigma^2_{\mathrm{NN},f}(x | X)}_{(\mathrm{ii})} \Big]. \qquad (10)$$

The first term in (10), (i), characterizes how well the trained model is able to fit the underlying function, which is related to its generalization performance. The second term, (ii), represents the predictive variance of the trained model, which is an indicator of the complexity of the model. A good model architecture should be expressive enough to fit the underlying function well (i.e., have a small (i)), while also not being too sensitive to the parameter initialization (i.e., have a small (ii)).

(10) allows us to design our model selection method based on cross-validation using the queried data $\mathcal{D}$ during active learning. Specifically, we adopt the cross-validation method of bootstrapping (Kohavi, 1995): we select a random subset $\mathcal{D}_T \subset \mathcal{D}$ of size $\kappa$ as the testing set, and compute (10) using $\mathcal{D} \setminus \mathcal{D}_T$ as the training set. As a result, this leads to the following criterion for model selection:²

$$\alpha_M(f; \mathcal{D}) = -\mathbb{E}_{\mathcal{D}_T \subset \mathcal{D};\, |\mathcal{D}_T| = \kappa} \left[ \hat{\alpha}_{M, \mathcal{D}_T}(f; \mathcal{D} \setminus \mathcal{D}_T) \right]. \qquad (11)$$

²The negative sign is added to convert to maximization.

In practice, the user chooses an appropriate $\kappa$ and computes the empirical mean to approximate the expectation in (11). Since $\alpha_M$ is computed far fewer times than $\alpha_{\mathrm{EV}}$ during the active learning process, it is reasonable to use $\sigma^2_{\mathrm{NN}}$ directly rather than approximating it with $\sigma^2_{\mathrm{NTKGP}}$. As a result, in our algorithm for model selection, given a set of candidate model architectures $\mathcal{M} = \{f_1, \ldots, f_m\}$, we evaluate (11) for every architecture $f \in \mathcal{M}$ and subsequently choose the architecture which maximizes this criterion. The full algorithm with both data and model selection, which we name EV-GP+MS, is shown in Algorithm 1. The algorithm alternates between two phases. The first phase uses a fixed model architecture to greedily maximize our $\alpha_{\mathrm{EV}}$ criterion (9) for data selection, and the second phase utilizes the queried data so far to select the best model architecture using our criterion in (11).

Algorithm 1 EV-GP+MS
Input: initial labelled data $(X_0, y_0)$, unlabelled pool $X_U$, candidate model architectures $\mathcal{M}$, batch size $b$
  $(X_L, y_L) \leftarrow (X_0, y_0)$; pick an initial model $f^* \in \mathcal{M}$
  repeat
    // Phase 1: Data selection
    for $b$ iterations do
      $x^* \leftarrow \arg\max_{x \in X_U \setminus X_L} \alpha_{\mathrm{EV}}(X_L \cup \{x\}; f^*)$
      $X_L \leftarrow X_L \cup \{x^*\}$
    end for
    Query the unlabelled points in $X_L$ for the labels $y_L$
    // Phase 2: Model selection
    $f^* \leftarrow \arg\max_{f \in \mathcal{M}} \alpha_M\big(f; (X_L, y_L)\big)$
  until budget exhausted
  return $(X_L, y_L)$, $f^*$
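As a rough sketch of how (11) could be estimated in code (ours; `mu_fn` and `var_fn` are hypothetical callables that evaluate $\mu_{\mathrm{NN},f}$ and $\sigma^2_{\mathrm{NN},f}$ from (4) and (5) for the architecture being scored):

```python
import numpy as np

def alpha_m(mu_fn, var_fn, X, y, kappa, n_splits=20, seed=0):
    # Monte-Carlo estimate of the model-selection criterion (11):
    # repeatedly hold out a random subset of size kappa and score the
    # negated expected squared error from (10) on the held-out points.
    rng = np.random.default_rng(seed)
    n, scores = len(X), []
    for _ in range(n_splits):
        test = rng.choice(n, size=kappa, replace=False)
        train = np.setdiff1d(np.arange(n), test)
        mu = mu_fn(X[train], y[train], X[test])    # mu_{NN,f}(x | X_train)
        var = var_fn(X[train], X[test])            # sigma^2_{NN,f}(x | X_train)
        scores.append(-np.mean((y[test] - mu) ** 2 + var))
    return float(np.mean(scores))

# Model selection: pick the candidate architecture maximizing alpha_m, e.g.
# f_star = max(candidates, key=lambda f: alpha_m(f.mu, f.var, X, y, kappa=5))
```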
## 5. Experiments

When reporting the model output variance or average performance, we train an NN 50 times (regression) or 25 times (classification) with the same architecture but with different model initializations, and then calculate the empirical mean and variance. All experiments are repeated 5 times unless stated otherwise, and their means and standard deviations are reported. Although the theoretical properties of $\Sigma_{\mathrm{NN}}$ and $\Sigma_{\mathrm{NTKGP}}$ are applicable to infinite-width NNs, we follow the practice of previous works on NTKs (He et al., 2020; Mohamadi et al., 2022) and use finite-width NNs, because they are able to achieve good performances. In our experiments, we test our algorithm using both the theoretical NTK (computed using the JAX-based (Bradbury et al., 2018) Neural-Tangents package (Novak et al., 2019)) and the empirical NTK (computed using PyTorch). We will use EV-GP-EMP to denote instances when we use the empirical NTK for our algorithm. We discuss the computation of NTKs further in App. G.2.

We adopt the MSE loss (1) for regression experiments and the cross-entropy loss for classification experiments, which is consistent with previous works on NTKs (Liu & Zenke, 2020; Shu et al., 2022a). Even though the NTK theory is developed based on the MSE loss, prior works utilizing NTKs have shown that it is also effective in predicting the behaviors of models trained under the cross-entropy loss.

We compare our algorithm with previous baselines which also require minimal model training between different batches and do not incur significant extra computation. These benchmarks are described further in App. G.4. In particular, we compare with random selection, K-MEANS++ (Arthur & Vassilvitskii, 2007), BADGE (Ash et al., 2020), and MLMOC (Mohamadi et al., 2022). The former two algorithms, like our algorithm, can select all the points in a single batch, whereas the latter two are not designed for such a setting but are applicable after modifications. We omit results comparing our algorithm against BATCHBALD (Kirsch et al., 2019) in the main paper because the method requires a Bayesian neural network; its results are included in App. H.4.2.

Figure 2. Correlation between our approximate output variance $\sigma^2_{\mathrm{NTKGP}}(x|X)$ and the empirical NN output variance. The full description of the graph is given in App. H.1.

We have chosen to focus on the cases with low query budgets, since we find this is when the neural networks tend to show a larger difference in predictive performances, and so the active learning algorithm needs to be more careful in selecting which data points to query, as they will have a larger impact on the final selected models. The setting studied in our experiments is realistic since it simulates situations where there is little or no initial labelled data and querying any data incurs a large cost. The code for the experiments can be found at https://github.com/apivich-h/init-robust-al. Other experimental details are deferred to App. G due to space limitation.

### 5.1. Correlations Between $\sigma^2_{\mathrm{NTKGP}}$ and Neural Network Output Variance

Here we study whether our approximate output variance $\sigma^2_{\mathrm{NTKGP}}$ (Sec. 3.1) can accurately reflect the output variance of NNs (w.r.t. the random initializations) and hence the initialization robustness. Fig. 2 plots the individual variances predicted by our NTKGP (i.e., $\sigma^2_{\mathrm{NTKGP}}(x|X)$ for some $x$ and $X$) against the empirically observed output variance resulting from different random initializations. The figure verifies that $\sigma^2_{\mathrm{NTKGP}}$ is highly correlated with the observed output variance of the NN and that the variances are generally confined within some region, which provides an empirical corroboration for our Theorem 3.1. This justifies our choice of using $\sigma^2_{\mathrm{NTKGP}}$ to measure the output variance w.r.t. the model initializations and hence the initialization robustness. In addition, in App. H.2, we present further experimental results to show that the output variance predicted using sparse GP approximations, which is more computationally efficient (Sec. 3.1.1), is also highly correlated with the observed output variance.

### 5.2. Experiments on Regression Tasks

Here we evaluate our EV-GP criterion³ (Sec. 3.3) using regression tasks.

³When reporting the value $\alpha_{\mathrm{EV}}$, we will ignore the $\sigma^2_{\mathrm{NN}}(x|\varnothing)$ terms and instead report the average of $\sigma^2_{\mathrm{NN}}(x|X)$.
In the experiments here, each algorithm is given an unlabelled pool of data and no initial labelled data, and all methods use a 2-layer MLP with ReLU activation.

Figure 3. Results of sequential data selection in regression tasks (discussed in Sec. 5.2, with detailed descriptions in App. H.1). The first column plots the 90th-percentile output variance against the EV criterion; the other columns plot output variance and test MSE against the number of selected points.

In Fig. 3, our EV-GP criterion is used to sequentially select the data points (i.e., the batch size is 1). In the first column of Fig. 3, we plot the 90th-percentile output variance (i.e., 90% of the test points have lower output variance than this value) as the vertical axis against the values of our EV-GP criterion as the horizontal axis. The gray dots show that our EV-GP criterion is highly correlated with the output variance, and the blue dots, which display the points selected by our EV-GP criterion during active learning, demonstrate that our EV-GP criterion is able to select points which lead to low output variance (since the selected points are mostly clustered in the bottom right corner). This is also corroborated by the middle column of Fig. 3, which shows that sequentially maximizing our EV-GP criterion indeed leads to the selection of points which progressively reduce the output variance, and that our EV-GP criterion consistently outperforms random search. The third column of Fig. 3 shows that the points selected by maximizing our EV-GP criterion also sequentially reduce the test MSE and hence improve the predictive performance of the NN. Therefore, the second and third columns of Fig. 3 combine to provide an empirical justification for our Theorem 3.3, which has theoretically shown that minimizing the approximate output variance $\sigma_{\mathrm{NTKGP}}$ (which is achieved by maximizing our EV-GP criterion) also improves the generalization performance of overparameterized NNs.

We have also tested our EV-GP criterion in the more practical active learning setting where a batch of points is selected in every round (with a batch size of 20). Fig. 4 shows that in the batch setting, our EV-GP criterion is still able to select batches of points which lead to both low output variance (first column) and small test error (second column), and outperforms the other baselines.

Figure 4. Results on regression tasks (including Robot Kinematics). Left: output variance of test predictions after training using the labelled active set. Right: test MSE. The x-axis represents the size of the selected active set. More details about the metrics are in App. G.3.

We include more experimental results for the regression tasks in App. H.3, which are consistent with those shown here (Fig. 3 and Fig. 4). A particularly interesting additional result is Fig. 12, which shows that an easier regression task leads to a larger degree of correlation between the output variance and the test error. This, interestingly, is consistent with Theorem 3.3, because it has theoretically shown that an easier task (i.e., a simpler groundtruth function, as indicated by a smaller $B$) reduces the value of $\zeta$ on the right-hand side of Theorem 3.3, which consequently increases the degree of correlation between the output variance (i.e., right-hand side) and the generalization error (i.e., left-hand side).
### 5.3. Experiments on Classification Tasks

Here we apply our EV-GP criterion to classification tasks. We use a wider variety of NN architectures, including MLPs (with ReLU activation), convolutional NNs (CNNs), and WideResNets (Zagoruyko & Komodakis, 2016), which have also been used in the experiments of Mohamadi et al. (2022).

Performance Comparison. Fig. 5(a-b) presents the comparison of our EV-GP criterion with other baselines, in which all methods use MLPs. The figures show that our EV-GP criterion is indeed able to select points which lead to both initialization robustness (i.e., low output entropy, plotted in the first column) and good generalization performances (i.e., high test accuracy, shown in the second column). The results are consistent with those for the regression tasks (Sec. 5.2). Moreover, our EV-GP criterion outperforms the other baselines in Fig. 5(a-b), especially in the earlier rounds when there is a small number of selected points. Fig. 5(c-d) plots the results using more sophisticated NN architectures (i.e., CNNs and WideResNets), in which our EV-GP criterion also consistently outperforms the other baselines in terms of the test accuracy. Further experimental results on other datasets and model architectures (such as ResNet-18) are also provided in App. H.4.1.

Figure 5. Results of active learning on classification tasks using NNs: (a) MNIST, 2-layer MLP; (b) EMNIST, 3-layer MLP; (c) EMNIST, 2-layer CNN; (d) SVHN, WideResNet. (a-b) show the output entropy and mean test accuracy of test predictions for experiments involving MLPs; (c-d) show the mean test accuracy for experiments involving more complex models with convolutions. More details about the metrics are provided in App. G.3.

Figure 6. Results on classification with varying batch sizes (left: MNIST with a 2-layer MLP; right: SVHN with a WideResNet).

Effects of the Batch Size. Here we examine the impact of the batch size on the performances of different active learning algorithms by fixing the total query budget and varying the batch size. The results in Fig. 6 show that EV-GP-EMP (which uses the empirical NTK) is minimally affected by the increasing batch size.⁴ This is reasonable because these algorithms do not require the labels; therefore, a smaller batch size (i.e., more frequent availability of the labels) has no impact on the performance. In contrast, when the batch size is increased, MLMOC experiences a large drop in performance for both datasets, and the performance of BADGE is significantly decreased for MNIST. This may be mainly attributed to their reliance on the labels, because a larger batch size reduces the frequency of the availability of, and hence their abilities to use, labels. Moreover, another factor which causes the detrimental effect of a larger batch size on MLMOC is that a larger batch size is likely to reduce the diversity of the selected points (Mohamadi et al., 2022). We provide further discussions and additional results involving varying query batch sizes in App. H.4.3.

⁴We omit RANDOM, K-MEANS++ and EV-GP from the graph since their selection algorithm is independent of the batch size.
Computation Costs of EV-GP. We find that the cost of running EV-GP alone is not significantly higher than that of the other active learning baseline algorithms, while EV-GP is still able to outperform these baselines. Nonetheless, in practical scenarios, the cost of querying labels will often dominate the cost of active data selection. Since EV-GP is training-free and minimally affected by the query batch size, in practical scenarios EV-GP will be less affected by these high labelling costs and hence be more advantageous than other active learning algorithms. We discuss the results regarding the computation time of EV-GP further in App. H.4.4.

Effects of the Network Width. We find that using a wider NN at training time improves both the model accuracy and the initialization robustness, while increasing the width of the NN for active learning (i.e., data selection) only affects the initialization robustness and has negligible effects on the resulting model accuracy. We explore these results further in App. H.4.5.

### 5.4. Active Learning with Model Selection

Here we evaluate the effect of our model selection algorithm (Sec. 4) by comparing the performance of our EV-GP+MS algorithm (with model selection) with that of our EV-GP algorithm using a fixed model architecture (a 2-layer MLP with ReLU activation) and with model selection using NASWOT (Mellor et al., 2021). The left figure in Fig. 7 shows that for some datasets for which the fixed architecture already leads to a good performance, our EV-GP+MS with model selection is able to perform on par with EV-GP with the fixed architecture. For some other datasets where the fixed architecture is not adequate (e.g., when a deeper model or a different activation function can better model the data), our EV-GP+MS is able to discover a better model architecture and hence achieve a lower loss (the right figure in Fig. 7).

Figure 7. Results of EV-GP+MS compared to EV-GP on a fixed model architecture (2-layer MLP with ReLU activation), on the Robot Kinematics (left) and Naval (right) datasets.

The performance of EV-GP+MS can be attributed to the fact that it is coherent with our active learning criterion. Specifically, although it is possible to run other model selection algorithms together with the EV-GP criterion, our model selection criterion (11) is developed based on the same theoretical foundation as our active learning criterion (9). Therefore, our model selection algorithm leads to a coherent framework for joint data and model selection, hence resulting in better empirical performances as shown. We provide further discussions and additional results for EV-GP+MS in App. H.5.

## 6. Related Works

The existing works on active learning are usually based on either diversity or uncertainty. Diversity-based active learning algorithms aim to select a diverse subset of data, in which the diversity of the data is measured based on a discriminator (Gissin & Shalev-Shwartz, 2019; Sinha et al., 2019; Kim et al., 2021), using some latent representation (Sener & Savarese, 2018; Ash et al., 2020), or based on the degree to which they match with the other unlabelled data (Chattopadhyay et al., 2012; Shui et al., 2020).
Uncertainty-based active learning algorithms select the unlabelled data based on how uncertain the trained model is about their predictions, in which the uncertainty can be captured through the softmax output (Ranganathan et al., 2017; He et al., 2019) or using a Bayesian NN (Gal et al., 2017; Kirsch et al., 2019; 2021). In addition, some previous works on active learning have also combined both diversity and uncertainty for data selection (Ash et al., 2020; Prabhu et al., 2021).

Some works have improved the efficiency of neural active learning by estimating the performance of the NNs using computationally efficient proxies, such as an NN with a smaller size (Coleman et al., 2020). Another line of work approximates the neural network performance using the theory of NTKs (Wang et al., 2021; 2022; Mohamadi et al., 2022), which is able to provide theoretical guarantees on how a model would behave without the need to perform expensive model training. In addition to active learning, NTKs also find use in other tasks such as data valuation (Wu et al., 2022), neural architecture search (Shu et al., 2022a;c), multi-armed bandits (Dai et al., 2023), Bayesian optimization (Dai et al., 2022b), and uncertainty quantification (He et al., 2020). NTKs have also been useful in understanding model generalization loss (Cao & Gu, 2019) and in linking neural network training with gradient descent to kernel ridge regression (Arora et al., 2019a; Vakili et al., 2021b). Moreover, the problem of active learning is also closely related to Bayesian optimization (BO), which also utilizes the principled uncertainty measures provided by GPs to optimize a black-box function (Daxberger & Low, 2017; Kharkovskii et al., 2020b; Balakrishnan et al., 2020; Nguyen et al., 2021d;e; Dai et al., 2022a). Specifically, active learning can be viewed as a variant of BO for pure exploration.

## 7. Conclusion

We have introduced a computationally efficient and theoretically grounded criterion for neural active learning, which can lead to the selection of points that result in both initialization robustness and good generalization performances. Extensive empirical results have shown that our criterion is highly correlated with both the initialization robustness and the generalization error, and that it consistently outperforms existing baselines. An interesting future direction is to incorporate our algorithm to select the initial points for other neural active learning algorithms to further enhance their performances, because our algorithm has shown impressive performances in scenarios with limited initial labelled data. Moreover, given the close connections between active learning and Bayesian optimization (BO) (Sec. 6), as well as the widespread real-world applications of BO, we will also explore extending our active learning algorithm to other real-world problem settings that have been considered by BO, such as problems with high-dimensional input spaces (Hoang et al., 2018), multi-fidelity observations (Zhang et al., 2017; 2019; Dai et al., 2019), delayed feedback (Verma et al., 2022), a necessity for rigorous privacy guarantees (Kharkovskii et al., 2020a), and a requirement for risk aversion (Nguyen et al., 2021a;b; Tay et al., 2022), as well as problems that fall under the federated/collaborative setting (Dai et al., 2020b; 2021; Sim et al., 2021) and the game-theoretic setting (Dai et al., 2020a; Tay et al., 2023). Furthermore, it is also a promising future topic to extend our model selection method (Sec. 4) by incorporating more advanced methods for neural architecture search (Shu et al., 2022a;b).
## Acknowledgements

This research/project is supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-RP-2020-018) and by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Programmatic Funds (Award A20H6b0151). We would like to thank Shu Yao for his valuable inputs to our paper.

## References

Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. In Proc. ICML, pp. 322–332. PMLR, May 2019a.

Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. In Proc. NeurIPS, number 731, pp. 8141–8150. Curran Associates Inc., Red Hook, NY, USA, December 2019b.

Arthur, D. and Vassilvitskii, S. K-means++: The advantages of careful seeding. In Proc. SODA, SODA '07, pp. 1027–1035, USA, January 2007. Society for Industrial and Applied Mathematics. ISBN 978-0-89871-624-5.

Ash, J. T., Zhang, C., Krishnamurthy, A., Langford, J., and Agarwal, A. Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds. In Proc. ICLR, April 2020.

Balakrishnan, S., Nguyen, Q. P., Low, B. K. H., and Soh, H. Efficient exploration of reward functions in inverse reinforcement learning via Bayesian optimization. In Proc. NeurIPS, pp. 4187–4198, 2020.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: Composable transformations of Python+NumPy programs, 2018.

Cao, Y. and Gu, Q. Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks. In Proc. NeurIPS, volume 32. Curran Associates, Inc., 2019.

Chattopadhyay, R., Wang, Z., Fan, W., Davidson, I., Panchanathan, S., and Ye, J. Batch Mode Active Sampling based on Marginal Probability Distribution Matching. In Proc. KDD, volume 2012, pp. 741–749, 2012. doi: 10.1145/2339530.2339647.

Chowdhury, S. R. and Gopalan, A. On Kernelized Multi-armed Bandits. In Proc. ICML, pp. 844–853. PMLR, July 2017.

Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. EMNIST: Extending MNIST to handwritten letters. In Proc. IJCNN, pp. 2921–2926, May 2017. doi: 10.1109/IJCNN.2017.7966217.

Coleman, C., Yeh, C., Mussmann, S., Mirzasoleiman, B., Bailis, P., Liang, P., Leskovec, J., and Zaharia, M. Selection via Proxy: Efficient Data Selection for Deep Learning, October 2020.

Cyr, E. C., Gulian, M. A., Patel, R. G., Perego, M., and Trask, N. A. Robust Training and Initialization of Deep Neural Networks: An Adaptive Basis Viewpoint, December 2019.

Dai, Z., Yu, H., Low, B. K. H., and Jaillet, P. Bayesian optimization meets Bayesian optimal stopping. In Proc. ICML, pp. 1496–1506, 2019.

Dai, Z., Chen, Y., Low, B. K. H., Jaillet, P., and Ho, T.-H. R2-B2: Recursive reasoning-based Bayesian optimization for no-regret learning in games. In Proc. ICML, pp. 2291–2301, 2020a.

Dai, Z., Low, B. K. H., and Jaillet, P. Federated Bayesian optimization via Thompson sampling. In Proc. NeurIPS, pp. 9687–9699, 2020b.

Dai, Z., Low, B. K. H., and Jaillet, P. Differentially private federated Bayesian optimization with distributed exploration. In Proc. NeurIPS, pp. 9125–9139, 2021.
Dai, Z., Chen, Y., Yu, H., Low, B. K. H., and Jaillet, P. On provably robust meta-Bayesian optimization. In Proc. UAI, 2022a.

Dai, Z., Shu, Y., Low, B. K. H., and Jaillet, P. Sample-then-optimize batch neural Thompson sampling. In Proc. NeurIPS, 2022b.

Dai, Z., Shu, Y., Verma, A., Fan, F. X., Low, B. K. H., and Jaillet, P. Federated neural bandits. In Proc. ICLR, 2023.

Daxberger, E. A. and Low, B. K. H. Distributed batch Gaussian process optimization. In Proc. ICML, pp. 951–960, 2017.

Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web], 2012.

Dua, D. and Graff, C. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml.

Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., Cui, C., Corrado, G., Thrun, S., and Dean, J. A guide to deep learning in healthcare. Nature Medicine, 25(1):24–29, January 2019. ISSN 1546-170X. doi: 10.1038/s41591-018-0316-z.

Gal, Y., Islam, R., and Ghahramani, Z. Deep Bayesian Active Learning with Image Data. In Proc. ICML, pp. 1183–1192. PMLR, July 2017.

Gissin, D. and Shalev-Shwartz, S. Discriminative Active Learning. CoRR, abs/1907.06347, 2019.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. AISTATS, pp. 249–256. JMLR Workshop and Conference Proceedings, March 2010.

He, B., Lakshminarayanan, B., and Teh, Y. W. Bayesian Deep Ensembles via the Neural Tangent Kernel. In Proc. NeurIPS, volume 33, pp. 1010–1022. Curran Associates, Inc., 2020.

He, T., Jin, X., Ding, G., Yi, L., and Yan, C. Towards Better Uncertainty Sampling: Active Learning with Multiple Views for Deep Convolutional Neural Network. In Proc. ICME, pp. 1360–1365, Shanghai, China, July 2019. IEEE. ISBN 978-1-5386-9552-4. doi: 10.1109/ICME.2019.00236.

Hoang, T. N., Low, B. K. H., Jaillet, P., and Kankanhalli, M. Active learning is planning: Nonmyopic ε-Bayes-optimal active learning of Gaussian processes. In Proc. ECML/PKDD, pp. 494–498, 2014a.

Hoang, T. N., Low, B. K. H., Jaillet, P., and Kankanhalli, M. Nonmyopic ε-Bayes-optimal active learning of Gaussian processes. In Proc. ICML, pp. 739–747, 2014b.

Hoang, T. N., Hoang, Q. M., and Low, B. K. H. A Unifying Framework of Anytime Sparse Gaussian Process Regression Models with Stochastic Variational Inference for Big Data. In Proc. ICML, pp. 569–578. PMLR, June 2015.

Hoang, T. N., Hoang, Q. M., and Low, B. K. H. Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI, pp. 3231–3238, 2018.

Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Proc. NeurIPS, volume 31. Curran Associates, Inc., 2018.

Kassraie, P. and Krause, A. Neural Contextual Bandits without Regret. In Proc. AISTATS, pp. 240–278. PMLR, May 2022.

Kharkovskii, D., Dai, Z., and Low, B. K. H. Private outsourced Bayesian optimization. In Proc. ICML, pp. 5231–5242, 2020a.

Kharkovskii, D., Ling, C. K., and Low, B. K. H. Nonmyopic Gaussian process optimization with macro-actions. In Proc. AISTATS, pp. 4593–4604, 2020b.

Kim, K., Park, D., Kim, K. I., and Chun, S. Y. Task-Aware Variational Adversarial Active Learning. In Proc. CVPR, pp. 8162–8171, Nashville, TN, USA, June 2021. IEEE. ISBN 978-1-66544-509-2. doi: 10.1109/CVPR46437.2021.00807.

Kirsch, A., van Amersfoort, J., and Gal, Y. BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning. In Proc. NeurIPS, volume 32. Curran Associates, Inc., 2019.
Kirsch, A., Rainforth, T., and Gal, Y. Test Distribution-Aware Active Learning: A Principled Approach Against Distribution Shift and Outliers, November 2021.

Koh, P. W. and Liang, P. Understanding Black-box Predictions via Influence Functions. In Proc. ICML, pp. 1885–1894. PMLR, July 2017.

Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proc. IJCAI, IJCAI '95, pp. 1137–1143, San Francisco, CA, USA, August 1995. Morgan Kaufmann Publishers Inc. ISBN 978-1-55860-363-9.

Krause, A. and Guestrin, C. Nonmyopic active learning of Gaussian processes: An exploration-exploitation approach. In Proc. ICML, ICML '07, pp. 449–456, New York, NY, USA, June 2007. Association for Computing Machinery. ISBN 978-1-59593-793-3. doi: 10.1145/1273496.1273553.

Krause, A., Singh, A., and Guestrin, C. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. The Journal of Machine Learning Research, 9:235–284, June 2008. ISSN 1532-4435.

Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. 2009.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124002, December 2020. ISSN 1742-5468. doi: 10.1088/1742-5468/abc62b.

Ling, C. K., Low, K. H., and Jaillet, P. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI, 2016.

Liu, T. and Zenke, F. Finding trainable sparse networks through Neural Tangent Transfer. In Proc. ICML, pp. 6336–6347. PMLR, November 2020.

Mellor, J., Turner, J., Storkey, A., and Crowley, E. J. Neural Architecture Search without Training. In Proc. ICML, pp. 7588–7598. PMLR, July 2021.

Minoux, M. Accelerated greedy algorithms for maximizing submodular set functions. In Stoer, J. (ed.), Optimization Techniques, volume 7, pp. 234–243. Springer-Verlag, Berlin/Heidelberg, 1978. ISBN 978-3-540-08708-3. doi: 10.1007/BFb0006528.

Mirzasoleiman, B., Badanidiyuru, A., Karbasi, A., Vondrak, J., and Krause, A. Lazier Than Lazy Greedy. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1), February 2015. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v29i1.9486.

Mohamadi, M. A., Bae, W., and Sutherland, D. J. Making Look-Ahead Active Learning Strategies Feasible with Neural Tangent Kernels. In Proc. NeurIPS, 2022.

Neal, B., Mittal, S., Baratin, A., Tantia, V., Scicluna, M., Lacoste-Julien, S., and Mitliagkas, I. A Modern Take on the Bias-Variance Tradeoff in Neural Networks. CoRR, abs/1810.08591, 2018.

Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, December 1978. ISSN 0025-5610, 1436-4646. doi: 10.1007/BF01588971.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

Nguyen, Q. P., Dai, Z., Low, B. K. H., and Jaillet, P. Optimizing conditional value-at-risk of black-box functions. In Proc. NeurIPS, pp. 4170–4180, 2021a.
Nguyen, Q. P., Dai, Z., Low, B. K. H., and Jaillet, P. Value-at-risk optimization with Gaussian processes. In Proc. ICML, pp. 8063–8072, 2021b.

Nguyen, Q. P., Low, B. K. H., and Jaillet, P. An information-theoretic framework for unifying active learning problems. In Proc. AAAI, pp. 9126–9134, 2021c.

Nguyen, Q. P., Tay, S., Low, B. K. H., and Jaillet, P. Top-k ranking Bayesian optimization. In Proc. AAAI, pp. 9135–9143, 2021d.

Nguyen, Q. P., Wu, Z., Low, B. K. H., and Jaillet, P. Trusted-maximizers entropy search for efficient Bayesian optimization. In Proc. UAI, pp. 1486–1495, 2021e.

Novak, R., Xiao, L., Hron, J., Lee, J., Alemi, A. A., Sohl-Dickstein, J., and Schoenholz, S. S. Neural Tangents: Fast and Easy Infinite Neural Networks in Python, December 2019.

Prabhu, V., Chandrasekaran, A., Saenko, K., and Hoffman, J. Active Domain Adaptation via Clustering Uncertainty-weighted Embeddings. In Proc. ICCV, pp. 8485–8494, Montreal, QC, Canada, October 2021. IEEE. ISBN 978-1-66542-812-5. doi: 10.1109/ICCV48922.2021.00839.

Quiñonero-Candela, J. and Rasmussen, C. E. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6(65):1939–1959, 2005. ISSN 1533-7928.

Ranganathan, H., Venkateswara, H., Chakraborty, S., and Panchanathan, S. Deep active learning for image classification. In Proc. ICIP, pp. 3934–3938, Beijing, September 2017. IEEE. ISBN 978-1-5090-2175-8. doi: 10.1109/ICIP.2017.8297020.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, Mass, 2006. ISBN 978-0-262-18253-9.

Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Gupta, B. B., Chen, X., and Wang, X. A Survey of Deep Active Learning. ACM Computing Surveys, 54(9):180:1–180:40, October 2021. ISSN 0360-0300. doi: 10.1145/3472291.

Sener, O. and Savarese, S. Active learning for convolutional neural networks: A core-set approach. In Proc. ICLR. OpenReview.net, 2018.

Shu, Y., Cai, S., Dai, Z., Ooi, B. C., and Low, B. K. H. NASI: Label- and Data-agnostic Neural Architecture Search at Initialization. In Proc. ICLR. OpenReview.net, 2022a.

Shu, Y., Chen, Y., Dai, Z., and Low, B. K. H. Neural ensemble search via Bayesian sampling. In Cussens, J. and Zhang, K. (eds.), Proc. UAI, volume 180 of Proceedings of Machine Learning Research, pp. 1803–1812. PMLR, 2022b.

Shu, Y., Dai, Z., Wu, Z., and Low, B. K. H. Unifying and Boosting Gradient-Based Training-Free Neural Architecture Search. In Proc. NeurIPS, 2022c.

Shui, C., Zhou, F., Gagné, C., and Wang, B. Deep Active Learning: Unified and Principled Method for Query and Training. In Proc. AISTATS, pp. 1308–1318. PMLR, June 2020.

Sim, R. H. L., Zhang, Y., Low, B. K. H., and Jaillet, P. Collaborative Bayesian optimization with fair regret. In Proc. ICML, pp. 9691–9701, 2021.

Sinha, S., Ebrahimi, S., and Darrell, T. Variational Adversarial Active Learning, October 2019.

Tay, S. S., Foo, C. S., Urano, D., Leong, R. C. X., and Low, B. K. H. Efficient distributionally robust Bayesian optimization with worst-case sensitivity. In Proc. ICML, 2022.

Tay, S. S., Nguyen, Q. P., Foo, C. S., and Low, B. K. H. No-regret sample-efficient Bayesian optimization for finding Nash equilibria with unknown utilities. In Proc. AISTATS, pp. 3591–3619. PMLR, 2023.

Vakili, S., Bouziani, N., Jalali, S., Bernacchia, A., and Shiu, D.-s. Optimal Order Simple Regret for Gaussian Process Bandits. arXiv:2108.09262 [cs, stat], August 2021a.
Uniform generalization bounds for overparameterized neural networks. CoRR, abs/2109.06099, 2021b.

van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605, 2008. ISSN 1533-7928.

Verma, A., Dai, Z., and Low, B. K. H. Bayesian optimization under stochastic delayed feedback. In Proc. ICML, pp. 22145–22167, 2022.

Wang, H., Huang, W., Wu, Z., Tong, H., Margenot, A. J., and He, J. Deep active learning by leveraging training dynamics. Advances in Neural Information Processing Systems, 35:25171–25184, December 2022.

Wang, Z., Awasthi, P., Dann, C., Sekhari, A., and Gentile, C. Neural active learning with performance guarantees. In Proc. NeurIPS, volume 34, pp. 7510–7521. Curran Associates, Inc., 2021.

Wu, Z., Shu, Y., and Low, B. K. H. DAVINZ: Data valuation using deep neural networks at initialization. In Proc. ICML, pp. 24150–24176. PMLR, June 2022.

Xu, X., Wu, Z., Verma, A., Foo, C. S., and Low, B. K. H. FAIR: Fair collaborative active learning with individual rationality for scientific discovery. In Proc. AISTATS, 2023.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In Wilson, R. C., Hancock, E. R., and Smith, W. A. P. (eds.), Proc. BMVC. BMVA Press, 2016.

Zhang, Y., Hoang, T. N., Low, K. H., and Kankanhalli, M. Near-optimal active learning of multi-output Gaussian processes. In Proc. AAAI, 2016.

Zhang, Y., Hoang, T. N., Low, B. K. H., and Kankanhalli, M. Information-based multi-fidelity Bayesian optimization. In Proc. NIPS Workshop on Bayesian Optimization, 2017.

Zhang, Y., Dai, Z., and Low, B. K. H. Bayesian optimization with binary auxiliary information. In Proc. UAI, pp. 1222–1232, 2019.

A. Proof of Theorem 3.1

In this section, we will show that the output variance of neural networks with ReLU activation can be represented using σ_NTKGP. For notation, we let ‖·‖_p denote the L_p-norm of a vector or matrix; when p is unspecified, we refer to the L_2-norm.

A.1. Dual Activations

In this subsection, we first define the concept of dual activation functions (Lee et al., 2020).

Definition A.1. Let ϕ : R → R be an activation function. Then, the dual activation function of ϕ is given by ϕ̌ : R^{2×2} → R where

ϕ̌(Λ) = E_{(u,v)∼N(0,Λ)}[ϕ(u) ϕ(v)].

Similarly, if we let ϕ′(x) = dϕ(x)/dx be the derivative of the activation function, then its dual activation ϕ̌′ is defined as

ϕ̌′(Λ) = E_{(u,v)∼N(0,Λ)}[ϕ′(u) ϕ′(v)].
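To make Definition A.1 concrete, the dual activation can be estimated by Monte-Carlo sampling and checked against the closed form for the scaled ReLU given later in (18). The following is a minimal numpy sketch; the function name is ours and is for illustration only:

```python
import numpy as np

def dual_activation_mc(phi, lam, n=200_000, seed=0):
    """Monte-Carlo estimate of the dual activation (Definition A.1):
    E_{(u,v) ~ N(0, lam)}[phi(u) phi(v)]."""
    rng = np.random.default_rng(seed)
    uv = rng.multivariate_normal(np.zeros(2), lam, size=n)
    return float(np.mean(phi(uv[:, 0]) * phi(uv[:, 1])))

# The sqrt(2)-scaled ReLU of Assumption A.2 / App. A.4.
phi = lambda z: np.sqrt(2.0) * np.maximum(z, 0.0)

r = 0.3
lam = np.array([[1.0, r], [r, 1.0]])
mc = dual_activation_mc(phi, lam)
# Closed form rho(r) from Eq. (18); the two should agree to ~2-3 decimals.
exact = (np.sqrt(1.0 - r**2) + (np.pi - np.arccos(r)) * r) / np.pi
print(mc, exact)
```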
A.2. Neural Network Assumption

We make an assumption about the parametrisation of our neural network. We use a common parametrisation from other NTK-related works, which can be found in, e.g., Lee et al. (2020).

Assumption A.2. Assume f(·; θ) is a multilayer perceptron with L hidden layers of widths k_1, k_2, ..., k_L. For simplicity, we assume that all hidden layers have the same width. Let the input dimension of the neural network be k_0 and the output dimension be k_{L+1}. Let the neural network be parametrised as

h^{(0)}(x) = x,
h^{(ℓ)}(x) = (1/√k_ℓ) ϕ( W^{(ℓ)} h^{(ℓ−1)}(x) + b^{(ℓ)} )  for ℓ ∈ [1, ..., L],
f(x) = W^{(L+1)} h^{(L)}(x) + b^{(L+1)},

where W^{(ℓ)}_{ij} ∼ N(0, σ_W²) and b^{(ℓ)}_i ∼ N(0, σ_b²) are model weights and biases initialized randomly from Gaussian distributions with variances σ_W² and σ_b² respectively, and ϕ is an activation function which is scaled so that

ϕ̌( [1 1; 1 1] ) = 1.

There are two main changes from the standard neural network assumption.

1. The difference in the parametrisation of the hidden layer, where a multiplicative factor of 1/√k_ℓ has been added. This simplifies the computation of expectation values.

2. The scaling of the activation function. In our case, we scale the activation function ϕ such that its dual activation output is fixed under a specific input. This is an additional assumption not applied in Lee et al. (2020), and is done here to simplify the computation when we have to repeatedly apply the activation function.

A.3. Relationship between K and Θ

We first present the relationship between K and Θ. The following lemma is also presented and proven in Jacot et al. (2018).

Lemma A.3. For a neural network in the infinite-width regime, K and Θ can be defined recursively as

K^{(0)}(x, x′) = x^⊤x′   (12)
Θ^{(0)}(x, x′) = 0   (13)
K^{(ℓ)}(x, x′) = ϕ̌( [K^{(ℓ−1)}(x, x)  K^{(ℓ−1)}(x, x′); K^{(ℓ−1)}(x′, x)  K^{(ℓ−1)}(x′, x′)] ) + σ_b²,  ℓ ∈ [1, ..., L+1]   (14)
Θ^{(ℓ)}(x, x′) = K^{(ℓ)}(x, x′) + Θ^{(ℓ−1)}(x, x′) K̇^{(ℓ)}(x, x′),  ℓ ∈ [1, ..., L+1]   (15)

with K(x, x′) = K^{(L+1)}(x, x′) and Θ(x, x′) = Θ^{(L+1)}(x, x′), where we define

K̇^{(ℓ)}(x, x′) = ϕ̌′( [K^{(ℓ−1)}(x, x)  K^{(ℓ−1)}(x, x′); K^{(ℓ−1)}(x′, x)  K^{(ℓ−1)}(x′, x′)] ).   (16)

Notice that (15) gives a recursive formula for computing Θ. If we unroll this formula, we obtain

Θ(x, x′) = ∑_{ℓ=1}^{L+1} K^{(ℓ)}(x, x′) ∏_{ℓ′=ℓ+1}^{L+1} K̇^{(ℓ′)}(x, x′).   (17)

Given (17), it is now simple to bound Θ(x, x′), since all that is required is to provide bounds for each K^{(ℓ)} and K̇^{(ℓ)} individually. This can be done by inspecting the dual activation functions.

A.4. Properties of ReLU Dual Activation Functions

For the remainder of this section, we will consider the case where the activation function is the scaled ReLU function, defined as ϕ(x) = √2 max(x, 0). Based on Lee et al. (2020), it can be shown that for ReLU activations, if

Λ = [x·x  x·y; y·x  y·y],

then

ϕ̌(Λ) = (1/π) ‖x‖‖y‖ ( sin θ + (π − θ) cos θ )  and  ϕ̌′(Λ) = (π − θ)/π,

where θ = arccos( x·y / (‖x‖‖y‖) ). Notice that the ReLU activation in this case has a √2 multiplicative factor, which scales the activation as required from above.

For convenience, define the function ρ : [−1, 1] → R where

ρ(r) = ϕ̌( [1 r; r 1] ) = (1/π) ( √(1 − r²) + (π − arccos r) r ).   (18)

This can be thought of as a re-parametrisation of the dual activation function ϕ̌ in the case where ‖x‖ = ‖y‖ = 1, and we only specify the cosine similarity r = cos θ between x and y. We can also define ρ′ as a similar function but for the dual activation ϕ̌′ instead,

ρ′(r) = ϕ̌′( [1 r; r 1] ) = (π − arccos r)/π.   (19)

It is simple to verify that for ReLU activations,

ϕ̌( [‖x‖²  ‖x‖‖y‖r; ‖x‖‖y‖r  ‖y‖²] ) = ‖x‖‖y‖ ρ(r)   (20)

and

ϕ̌′( [‖x‖²  ‖x‖‖y‖r; ‖x‖‖y‖r  ‖y‖²] ) = ρ′(r).   (21)
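The functions ρ and ρ′, together with the recursions of Lemma A.3 rewritten through them (see Lemmas A.6 and A.9 below), are straightforward to implement. The following numpy sketch mirrors the reconstruction above for a single pair of inputs; it is meant for illustration and sanity checks, not as the neural-tangents implementation:

```python
import numpy as np

def rho(r):
    """Dual activation of the sqrt(2)-scaled ReLU on unit-variance inputs, Eq. (18)."""
    r = np.clip(r, -1.0, 1.0)
    return (np.sqrt(1.0 - r**2) + (np.pi - np.arccos(r)) * r) / np.pi

def rho_dot(r):
    """Dual of the ReLU derivative, Eq. (19)."""
    r = np.clip(r, -1.0, 1.0)
    return (np.pi - np.arccos(r)) / np.pi

def relu_ntk(x, xp, L, sigma_b2=0.0):
    """NNGP kernel K and NTK Theta of an L-hidden-layer scaled-ReLU MLP via the
    recursions (12)-(17), expressed through rho and rho_dot (Lemmas A.6 / A.9)."""
    K = float(x @ xp)                        # K^{(0)}(x, x')
    Kx, Kxp = float(x @ x), float(xp @ xp)   # diagonal entries K^{(l)}(x,x), K^{(l)}(x',x')
    Theta = 0.0                              # Theta^{(0)}
    for _ in range(L + 1):
        u = np.sqrt(Kx * Kxp)                # sqrt(u^{(l-1)})
        r = K / u                            # r_u^{(l)}
        Kdot = rho_dot(r)                    # Kdot^{(l)}, Lemma A.9
        K = u * rho(r) + sigma_b2            # K^{(l)}, Lemma A.6
        Theta = K + Theta * Kdot             # Eq. (15)
        Kx, Kxp = Kx + sigma_b2, Kxp + sigma_b2
    return K, Theta
```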
For any function f, we define the notation f^m(x) = (f ∘ f ∘ ... ∘ f)(x), i.e., the function f applied to the input m times. For our dual activation function, we can show that repeated application preserves monotonicity.

Lemma A.4. For any n ∈ N, ρ^n(r) is non-decreasing with respect to r.

Proof. We can prove this by induction. For the case that n = 1, it is simple to see that

dρ/dr = (π − arccos r)/π.   (22)

We can see that arccos r ≤ π, and so dρ/dr ≥ 0. This means that ρ is non-decreasing. Furthermore, we are able to see that ρ(−1) = 0 and ρ(1) = 1. Since the function is non-decreasing, this means for any r ∈ [−1, 1], it is the case that ρ(r) ∈ [0, 1] ⊆ [−1, 1]. Now, assume that ρ^n is non-decreasing. Let r_1, r_2 ∈ [−1, 1]. If r_1 ≤ r_2, then it is the case that ρ^n(r_1) ≤ ρ^n(r_2). Since we know ρ^n(r_1), ρ^n(r_2) ∈ [−1, 1], we can therefore conclude that ρ^{n+1}(r_1) ≤ ρ^{n+1}(r_2). □

Similarly, we can show a similar claim for the dual activation with respect to ϕ′. Note that for ϕ′, we do not need to show that repeated application of the function keeps the output non-decreasing.

Lemma A.5. ρ′ is non-decreasing with respect to r.

Proof. It is simple to see that

dρ′/dr = 1 / ( π √(1 − r²) ).   (23)

Since square roots are always positive, it follows that dρ′/dr ≥ 0. This means that ρ′ is an increasing function. □

A.5. Bounding K in terms of ρ

We will first show the relationship between K and ρ.

Lemma A.6. For ℓ ∈ [1, ..., L+1],

K^{(ℓ)}(x, x′) = √(u^{(ℓ−1)}) ρ( r_u^{(ℓ)} ) + σ_b²   (24)

where u^{(ℓ)} = (‖x‖² + ℓσ_b²)(‖x′‖² + ℓσ_b²) and r_u^{(ℓ)} = K^{(ℓ−1)}(x, x′) / √(u^{(ℓ−1)}).

Proof. We can prove this by induction. In the case that ℓ = 1, we see that

K^{(1)}(x, x′) = ϕ̌( [K^{(0)}(x,x)  K^{(0)}(x,x′); K^{(0)}(x′,x)  K^{(0)}(x′,x′)] ) + σ_b²   (25)
= ϕ̌( [‖x‖²  x^⊤x′; x′^⊤x  ‖x′‖²] ) + σ_b²   (26)
= √(u^{(0)}) ρ( r_u^{(1)} ) + σ_b².   (27)

For the inductive step, assume that K^{(ℓ)}(x, x′) = √(u^{(ℓ−1)}) ρ( r_u^{(ℓ)} ) + σ_b² holds. Then, we see that

K^{(ℓ+1)}(x, x′) = ϕ̌( [K^{(ℓ)}(x,x)  K^{(ℓ)}(x,x′); K^{(ℓ)}(x′,x)  K^{(ℓ)}(x′,x′)] ) + σ_b²   (28)
= ϕ̌( [(‖x‖² + (ℓ−1)σ_b²)ρ(1) + σ_b²  K^{(ℓ)}(x,x′); K^{(ℓ)}(x′,x)  (‖x′‖² + (ℓ−1)σ_b²)ρ(1) + σ_b²] ) + σ_b²   (29)
= ϕ̌( [‖x‖² + ℓσ_b²  K^{(ℓ)}(x,x′); K^{(ℓ)}(x′,x)  ‖x′‖² + ℓσ_b²] ) + σ_b²   (30)
= √(u^{(ℓ)}) ρ( K^{(ℓ)}(x, x′) / √(u^{(ℓ)}) ) + σ_b²   (31)

where we use the fact that |K^{(ℓ)}(x, x′)| ≤ √( K^{(ℓ)}(x, x) K^{(ℓ)}(x′, x′) ). This proves the inductive step. □

The next proofs will attempt to bound the values of K^{(ℓ)}(x, x′) based on ρ.

Lemma A.7. For all ℓ ∈ [0, ..., L+1],

ρ_−^{(ℓ)} ≤ K^{(ℓ)}(x, x′) ≤ ρ_+^{(ℓ)}   (32)

where

ρ_±^{(ℓ)} = { ±‖x‖‖x′‖  if ℓ = 0;  √(u^{(ℓ−1)}) ρ( ρ_±^{(ℓ−1)} / √(u^{(ℓ−1)}) ) + σ_b²  if ℓ ≥ 1 }.   (33)

Proof. We can prove so by inspecting the result from Lemma A.6, and using proof by induction. In the case that ℓ = 1,

K^{(1)}(x, x′) = ‖x‖‖x′‖ ρ( x^⊤x′ / (‖x‖‖x′‖) ) + σ_b²   (34)

and it is simple to show our claim in this case from the fact that −‖x‖‖x′‖ ≤ x^⊤x′ ≤ ‖x‖‖x′‖ and that ρ is monotone. For the inductive step, assume that the claim is true for ℓ. We see that

K^{(ℓ+1)}(x, x′) = √(u^{(ℓ)}) ρ( K^{(ℓ)}(x, x′) / √(u^{(ℓ)}) ) + σ_b²   (35)
≤ √(u^{(ℓ)}) ρ( ρ_+^{(ℓ)} / √(u^{(ℓ)}) ) + σ_b² = ρ_+^{(ℓ+1)}   (36)

which uses the fact that ρ is non-decreasing. A similar logic can be used to show the lower bound. This proves the inductive step and hence proves our lemma. □

Corollary A.8. For all ℓ ∈ [1, ..., L+1],

√(u^{(ℓ−1)}) ρ( r̂_−^{(ℓ)} ) + σ_b² ≤ K^{(ℓ)}(x, x′) ≤ √(u^{(ℓ−1)}) ρ( r̂_+^{(ℓ)} ) + σ_b²   (37)

where

r̂_+^{(ℓ)} = 1   (38)
r̂_−^{(ℓ)} = { −1  if ℓ = 1;  ρ( r̂_−^{(ℓ−1)} ) (ℓ−2)/(ℓ−1)  if ℓ > 1 }.   (39)

Proof. Notice that by Lemma A.7 and due to the non-decreasing nature of ρ, proving the corollary above is equivalent to showing that ρ_+^{(ℓ−1)} / √(u^{(ℓ−1)}) ≤ r̂_+^{(ℓ)} and ρ_−^{(ℓ−1)} / √(u^{(ℓ−1)}) ≥ r̂_−^{(ℓ)}.

We will start by showing that ρ_+^{(ℓ−1)} / √(u^{(ℓ−1)}) ≤ r̂_+^{(ℓ)} = 1. It is simple to show that the statement is true for ℓ = 1. In the case that ℓ > 1, since ρ(r) ≤ 1 for all r, we can show that

( √(u^{(ℓ−2)}) + σ_b² )² ≤ u^{(ℓ−1)}   (40)

since

( √((‖x‖² + (ℓ−2)σ_b²)(‖x′‖² + (ℓ−2)σ_b²)) + σ_b² )²
= (‖x‖² + (ℓ−2)σ_b²)(‖x′‖² + (ℓ−2)σ_b²) + 2σ_b² √((‖x‖² + (ℓ−2)σ_b²)(‖x′‖² + (ℓ−2)σ_b²)) + σ_b⁴
≤ (‖x‖² + (ℓ−2)σ_b²)(‖x′‖² + (ℓ−2)σ_b²) + σ_b² ( (‖x‖² + (ℓ−2)σ_b²) + (‖x′‖² + (ℓ−2)σ_b²) ) + σ_b⁴
= (‖x‖² + (ℓ−1)σ_b²)(‖x′‖² + (ℓ−1)σ_b²)   (41)

from the inequality 2√(ab) ≤ a + b. This therefore means ρ_+^{(ℓ−1)} / √(u^{(ℓ−1)}) ≤ ( √(u^{(ℓ−2)}) + σ_b² ) / √(u^{(ℓ−1)}) ≤ 1, which proves the first part of the corollary.

For the second part, we will prove so by induction. It is easy to show that the statement is true for ℓ = 1. For the inductive step, consider some value of ℓ ≥ 2, and assume that the statement holds for ℓ − 1, meaning that ρ_−^{(ℓ−2)} / √(u^{(ℓ−2)}) ≥ r̂_−^{(ℓ−1)}. Then, we can see that

ρ_−^{(ℓ−1)} / √(u^{(ℓ−1)}) = ( √(u^{(ℓ−2)}) ρ( ρ_−^{(ℓ−2)} / √(u^{(ℓ−2)}) ) + σ_b² ) / √(u^{(ℓ−1)})   (43)
≥ ( √(u^{(ℓ−2)}) ρ( r̂_−^{(ℓ−1)} ) + σ_b² ) / √(u^{(ℓ−1)})   (44)
≥ ρ( r̂_−^{(ℓ−1)} ) √(u^{(ℓ−2)}) / √(u^{(ℓ−1)})   (45)
≥ ρ( r̂_−^{(ℓ−1)} ) (ℓ−2)/(ℓ−1) = r̂_−^{(ℓ)}   (47)

where in (46) we use the fact that

√(u^{(ℓ−2)}) / √(u^{(ℓ−1)}) = √( (‖x‖² + (ℓ−2)σ_b²)/(‖x‖² + (ℓ−1)σ_b²) ) √( (‖x′‖² + (ℓ−2)σ_b²)/(‖x′‖² + (ℓ−1)σ_b²) ) ≥ (ℓ−2)/(ℓ−1).   (48)

This proves the second part of our corollary. □
A.6. Bounding K̇ in terms of ρ′

We will first show that we can express K̇ in terms of ρ′.

Lemma A.9. For ℓ ∈ [1, ..., L+1],

K̇^{(ℓ)}(x, x′) = ρ′( r_u^{(ℓ)} )   (49)

where r_u^{(ℓ)} is defined in Lemma A.6.

Proof. We can show that

K̇^{(ℓ)}(x, x′) = ϕ̌′( [K^{(ℓ−1)}(x,x)  K^{(ℓ−1)}(x,x′); K^{(ℓ−1)}(x′,x)  K^{(ℓ−1)}(x′,x′)] )   (50)
= ϕ̌′( [(‖x‖² + (ℓ−2)σ_b²)ρ(1) + σ_b²  K^{(ℓ−1)}(x,x′); K^{(ℓ−1)}(x′,x)  (‖x′‖² + (ℓ−2)σ_b²)ρ(1) + σ_b²] )   (51)
= ϕ̌′( [‖x‖² + (ℓ−1)σ_b²  K^{(ℓ−1)}(x,x′); K^{(ℓ−1)}(x′,x)  ‖x′‖² + (ℓ−1)σ_b²] )   (52)
= ρ′( K^{(ℓ−1)}(x, x′) / √(u^{(ℓ−1)}) ) = ρ′( r_u^{(ℓ)} ).   (53)

□

Using the result above, we can now bound the values of K̇^{(ℓ)}(x, x′).

Corollary A.10.

ρ′( r̂_−^{(ℓ)} ) ≤ K̇^{(ℓ)}(x, x′) ≤ ρ′( r̂_+^{(ℓ)} )   (54)

where r̂_−^{(ℓ)} and r̂_+^{(ℓ)} are as defined in Corollary A.8.

Proof. From Corollary A.8, it is the case that r̂_−^{(ℓ)} ≤ r_u^{(ℓ)} ≤ r̂_+^{(ℓ)}. The corollary then follows from this fact and the monotonicity of ρ′ (Lemma A.5). □

A.7. Ratio between K and Θ

We will now prove the bound of the ratio between K and Θ.

Theorem A.11. For a neural network with L ≥ 2, if max(‖x‖, ‖x′‖) ≤ B, then

1 + ∑_{ℓ=1}^{L} ((ℓ−1)/(L+1)) ρ( r̂_−^{(ℓ)} ) ∏_{ℓ′=ℓ+1}^{L+1} ρ′( r̂_−^{(ℓ′)} ) ≤ Θ(x, x′)/K(x, x′) ≤ 1 + L / ρ( r̂_−^{(L+1)} ).   (55)

Proof. We will first prove the right-hand inequality. We see that

Θ(x, x′)/K(x, x′) = 1 + (1/K^{(L+1)}(x, x′)) ∑_{ℓ=1}^{L} K^{(ℓ)}(x, x′) ∏_{ℓ′=ℓ+1}^{L+1} K̇^{(ℓ′)}(x, x′)   (56)
≤ 1 + ∑_{ℓ=1}^{L} ( ( √(u^{(ℓ−1)}) ρ( r̂_+^{(ℓ)} ) + σ_b² ) / ( √(u^{(L)}) ρ( r̂_−^{(L+1)} ) + σ_b² ) ) ∏_{ℓ′=ℓ+1}^{L+1} ρ′( r̂_+^{(ℓ′)} )   (57)
≤ 1 + ∑_{ℓ=1}^{L} ( √(u^{(ℓ−1)}) + σ_b² ) / ( √(u^{(L)}) ρ( r̂_−^{(L+1)} ) )   (58)
≤ 1 + L / ρ( r̂_−^{(L+1)} )   (59)

where we use the facts that ρ( r̂_+^{(ℓ)} ) = ρ(1) = 1 and ρ′(·) ≤ 1, and that ( √(u^{(ℓ−1)}) + σ_b² ) / √(u^{(L)}) ≤ 1 based on a similar argument as used in (41). Similarly, for the left-hand inequality,

Θ(x, x′)/K(x, x′) ≥ 1 + ∑_{ℓ=1}^{L} ( ( √(u^{(ℓ−1)}) ρ( r̂_−^{(ℓ)} ) + σ_b² ) / ( √(u^{(L)}) ρ( r̂_+^{(L+1)} ) + σ_b² ) ) ∏_{ℓ′=ℓ+1}^{L+1} ρ′( r̂_−^{(ℓ′)} )   (60)
≥ 1 + ∑_{ℓ=1}^{L} ρ( r̂_−^{(ℓ)} ) ( √(u^{(ℓ−1)}) / ( √(u^{(L)}) + σ_b² ) ) ∏_{ℓ′=ℓ+1}^{L+1} ρ′( r̂_−^{(ℓ′)} )   (61)
≥ 1 + ∑_{ℓ=1}^{L} ((ℓ−1)/(L+1)) ρ( r̂_−^{(ℓ)} ) ∏_{ℓ′=ℓ+1}^{L+1} ρ′( r̂_−^{(ℓ′)} )   (62)

where we use the fact that √(u^{(L)}) + σ_b² ≤ √(u^{(L+1)}), so that √(u^{(ℓ−1)}) / ( √(u^{(L)}) + σ_b² ) ≥ (ℓ−1)/(L+1) based on a similar reasoning as (48). This proves our theorem. □

From Theorem A.11, we are therefore able to give a bound for the ratio between Θ and K. While the constants are defined recursively based on the function ρ, we claim that this is still useful since the constants are expressed in a form which allows them to be computed directly, and, more importantly, we are able to see that such constants exist.

A.8. Ratio between K and Θ when σ_b = 0

In the case that there is no bias, we can improve the bound on the ratio between K and Θ. The key fact is that in the case without bias, u^{(ℓ)} is equal for all values of ℓ, and we can simplify ρ_±^{(ℓ)} = ‖x‖‖x′‖ ρ^ℓ(±1). During the computation, the constants ‖x‖‖x′‖ will then cancel each other out.

Theorem A.12. For a neural network with L ≥ 2 and σ_b = 0,

1 + ∑_{ℓ=1}^{L} ( ρ^ℓ(−1) / ρ^{L+1}(1) ) ∏_{ℓ′=ℓ+1}^{L+1} (ρ′ ∘ ρ^{ℓ′−1})(−1) ≤ Θ(x, x′)/K(x, x′) ≤ 1 + ∑_{ℓ=1}^{L} ( ρ^ℓ(1) / ρ^{L+1}(−1) ) ∏_{ℓ′=ℓ+1}^{L+1} (ρ′ ∘ ρ^{ℓ′−1})(1).   (63)

Proof. It is easy to see from the recursive form in (14) that if σ_b = 0, then K^{(ℓ)}(x, x′) = ‖x‖‖x′‖ ρ^ℓ( x^⊤x′ / (‖x‖‖x′‖) ). Since we know ρ^ℓ is non-decreasing from Lemma A.4, we are able to bound

‖x‖‖x′‖ ρ^ℓ(−1) ≤ K^{(ℓ)}(x, x′) ≤ ‖x‖‖x′‖ ρ^ℓ(1).   (64)

Similarly, we can also see from (16) that K̇^{(ℓ)}(x, x′) = (ρ′ ∘ ρ^{ℓ−1})( x^⊤x′ / (‖x‖‖x′‖) ). Since ρ′ is non-decreasing based on Lemma A.5, we are able to bound

(ρ′ ∘ ρ^{ℓ−1})(−1) ≤ K̇^{(ℓ)}(x, x′) ≤ (ρ′ ∘ ρ^{ℓ−1})(1).   (65)

Given that we can write Θ(x, x′) in the recursive form given by (17), it is then simple to use (64) and (65) to bound Θ(x, x′)/K(x, x′). □

A.9. Bounding the Difference Between σ_NN and σ_NTKGP

From above, we are able to see that it is possible to bound the ratio

a_− ≤ K(x, x′)/Θ(x, x′) ≤ a_+   (66)

for some appropriately set a_− and a_+ according to either Theorem A.11 or Theorem A.12 depending on the value of σ_b (note that in the two theorems above we show the bounds for the reciprocal of what is stated in (66)). Given this bound, we are able to show the following main result.
Theorem A.13 (Formal Version of Theorem 3.1). For a neural network with ReLU activation and L ≥ 2 hidden layers, if Θ_X ≻ 0, then

| σ²_NN(x|X) − α σ²_NTKGP(x|X) | ≤ β   (67)

where n is the upper limit on the size of the training set, B ≥ |Θ(x, x′)|, α ∈ [a_−, a_+], γ = B max(α − a_−, a_+ − α), and β = γ + nγB²/λ_min(Θ_X)² + 2nγB/λ_min(Θ_X).

Proof. First, we know that we have K(x, x′)/Θ(x, x′) ∈ [a_−, a_+] by assumption. In the case that Θ(x, x′) ≥ 0, we can convert the multiplicative bound into an additive bound as

K(x, x′) ≥ a_− Θ(x, x′)   (68)
⟹ α Θ(x, x′) − K(x, x′) ≤ (α − a_−) Θ(x, x′)   (69)
≤ (α − a_−) B   (70)

K(x, x′) ≤ a_+ Θ(x, x′)   (71)
⟹ K(x, x′) − α Θ(x, x′) ≤ (a_+ − α) Θ(x, x′)   (72)
≤ (a_+ − α) B   (73)

which can be combined to give |K(x, x′) − α Θ(x, x′)| ≤ γ for γ as defined earlier. The same is the case when Θ(x, x′) < 0. Given this additive bound, we are then able to bound each term which appears in σ²_NN individually. We can see that

| K_{xX} Θ_X^{−1} Θ_{Xx} − α Θ_{xX} Θ_X^{−1} Θ_{Xx} | = | (K_{xX} − α Θ_{xX}) Θ_X^{−1} Θ_{Xx} |   (75)
≤ λ_max(Θ_X^{−1}) ‖K_{Xx} − α Θ_{Xx}‖ ‖Θ_{Xx}‖   (76)
≤ (1/λ_min(Θ_X)) (γ√n)(B√n)   (77)
= nγB / λ_min(Θ_X).   (78)

Similarly,

| Θ_{xX} Θ_X^{−1} K_X Θ_X^{−1} Θ_{Xx} − α Θ_{xX} Θ_X^{−1} Θ_{Xx} | = | Θ_{xX} Θ_X^{−1} (K_X − α Θ_X) Θ_X^{−1} Θ_{Xx} |   (79)
≤ λ_max(K_X − α Θ_X) ‖Θ_X^{−1} Θ_{Xx}‖²   (80)
≤ λ_max(K_X − α Θ_X) λ_max(Θ_X^{−1})² ‖Θ_{Xx}‖²   (81)
≤ nγB² / λ_min(Θ_X)².   (82)

Combining these results together, we obtain

| σ²_NN(x|X) − α σ²_NTKGP(x|X) |
= | K_x + Θ_{xX}Θ_X^{−1}K_XΘ_X^{−1}Θ_{Xx} − K_{xX}Θ_X^{−1}Θ_{Xx} − Θ_{xX}Θ_X^{−1}K_{Xx} − α Θ_x + α Θ_{xX}Θ_X^{−1}Θ_{Xx} |   (83)
≤ | K_x − α Θ_x | + | Θ_{xX}Θ_X^{−1}K_XΘ_X^{−1}Θ_{Xx} − α Θ_{xX}Θ_X^{−1}Θ_{Xx} | + | K_{xX}Θ_X^{−1}Θ_{Xx} − α Θ_{xX}Θ_X^{−1}Θ_{Xx} | + | Θ_{xX}Θ_X^{−1}K_{Xx} − α Θ_{xX}Θ_X^{−1}Θ_{Xx} |   (84)
≤ γ + nγB²/λ_min(Θ_X)² + 2nγB/λ_min(Θ_X) = β   (85)

which proves our claim. □

A.10. Bounding the Ratio Between α_EV using σ_NN and σ_NTKGP

It turns out that we are able to get a bound on the ratio of the EV criterion directly when we use σ_NN versus when we use σ_NTKGP. Considering the case of a single test point, we can see that using σ_NN, the EV criterion gives

α_EV,NN(x|X) = σ²_NN(x|∅) − σ²_NN(x|X) = 2 Θ_{xX}Θ_X^{−1}K_{Xx} − Θ_{xX}Θ_X^{−1}K_XΘ_X^{−1}Θ_{Xx} ≥ 0,   (86)

and using σ_NTKGP,

α_EV,NTKGP(x|X) = σ²_NTKGP(x|∅) − σ²_NTKGP(x|X) = Θ_{xX}Θ_X^{−1}Θ_{Xx} ≥ 0.   (87)

Based on this, we are able to show the following bound.

Lemma A.14. Assume X is such that Θ_X, K_X ≻ 0. Let ‖Θ_X‖, ‖Θ_{Xx}‖ ≤ B and a_− ≤ K(x, x′)/Θ(x, x′) ≤ a_+. Then,

2 a_− λ_min(Θ_X)/B − a_+ B/λ_min(Θ_X) ≤ α_EV,NN(x|X) / α_EV,NTKGP(x|X) ≤ 2 a_+ B/λ_min(Θ_X).   (88)

Proof. For the right-hand inequality, we can see that

α_EV,NN(x|X) / α_EV,NTKGP(x|X) ≤ 2 Θ_{xX}Θ_X^{−1}K_{Xx} / ( Θ_{xX}Θ_X^{−1}Θ_{Xx} )   (89)
≤ 2 λ_max(Θ_X^{−1}) ‖Θ_{Xx}‖ ‖K_{Xx}‖ / ( λ_min(Θ_X^{−1}) ‖Θ_{Xx}‖² )   (90)
= 2 λ_max(Θ_X) ‖K_{Xx}‖ / ( λ_min(Θ_X) ‖Θ_{Xx}‖ )   (91)
≤ 2 a_+ B / λ_min(Θ_X).   (92)

Meanwhile, for the left-hand inequality,

α_EV,NN(x|X) / α_EV,NTKGP(x|X) = 2 Θ_{xX}Θ_X^{−1}K_{Xx} / ( Θ_{xX}Θ_X^{−1}Θ_{Xx} ) − Θ_{xX}Θ_X^{−1}K_XΘ_X^{−1}Θ_{Xx} / ( Θ_{xX}Θ_X^{−1}Θ_{Xx} )   (93)
≥ 2 λ_min(Θ_X^{−1}) ‖Θ_{Xx}‖ ‖K_{Xx}‖ / ( λ_max(Θ_X^{−1}) ‖Θ_{Xx}‖² ) − λ_max(K_X) λ_max(Θ_X^{−1})   (94)
≥ 2 λ_min(Θ_X^{−1}) ‖K_{Xx}‖ / ( λ_max(Θ_X^{−1}) ‖Θ_{Xx}‖ ) − λ_max(K_X) / λ_min(Θ_X)   (95)
= 2 λ_min(Θ_X) ‖K_{Xx}‖ / ( λ_max(Θ_X) ‖Θ_{Xx}‖ ) − λ_max(K_X) / λ_min(Θ_X)   (96)
≥ 2 a_− λ_min(Θ_X)/B − a_+ B/λ_min(Θ_X)   (97)

which provides a lower bound. □

This provides another method of showing the agreement between using the σ²_NTKGP variance function for our criterion and using the σ²_NN variance function.
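For concreteness, the quantity α_EV,NTKGP in (87), averaged over a reference test set as done in our experiments, can be computed directly from a precomputed NTK Gram matrix. Below is a minimal numpy sketch; the function name and the index-based API are ours:

```python
import numpy as np

def alpha_ev_ntkgp(Theta, idx_L, idx_T, noise=1e-6):
    """EV criterion using the NTKGP variance (Eq. 87), averaged over a test set.
    Theta: precomputed NTK Gram matrix over pool and test points.
    idx_L: indices of the candidate labelled set X; idx_T: test/reference indices."""
    Th_X = Theta[np.ix_(idx_L, idx_L)] + noise * np.eye(len(idx_L))
    Th_TX = Theta[np.ix_(idx_T, idx_L)]                # rows Theta_{xX}, one per test point
    sol = np.linalg.solve(Th_X, Th_TX.T)               # Theta_X^{-1} Theta_{XT}
    # alpha_EV(x|X) = Theta_{xX} Theta_X^{-1} Theta_{Xx}, averaged over x in X_T.
    return float(np.mean(np.einsum('ij,ji->i', Th_TX, sol)))
```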
B. Proof of Theorem 3.3

In this section we state Theorem B.1 (the formal version of Theorem 3.3) and proceed to prove it.

Theorem B.1 (Formal version of Theorem 3.3). Let the function f* ∈ H_Θ and training set (X, y) follow Assumption 3.2. Suppose we train an infinitely wide neural network f(·; θ) with training dataset (X, y) and the mean-squared error loss function as given by (1), using gradient descent until convergence. Then, for any x in the input domain, with probability at least 1 − 2δ over the random observation noise and network initialization,

| f*(x) − f(x; train(θ_0)) | ≤ ( B + (R/√λ + 1) √(2 log δ^{−1}) ) σ_NTKGP(x|X)   (98)

where λ is the regularization coefficient of the loss function.

Proof. By the triangle inequality, we can show that

| f*(x) − f(x; train(θ_0)) | ≤ | f*(x) − µ(x|X) | + | µ(x|X) − f(x; θ_∞) |   (99)
≤ ( B + (R/√λ) √(2 log δ^{−1}) ) σ_NTKGP(x|X) + | µ(x|X) − f(x; θ_∞) |   (100)
≤ ( B + (R/√λ) √(2 log δ^{−1}) ) σ_NTKGP(x|X) + √(2 log δ^{−1}) σ_NN(x|X)   (101)
≤ ( B + (R/√λ + 1) √(2 log δ^{−1}) ) σ_NTKGP(x|X)   (102)

where (99) is true due to the triangle inequality, (100) is true with probability at least 1 − δ based on Theorem 1 of Vakili et al. (2021a), (101) is true with probability at least 1 − δ due to the Hoeffding inequality for sub-Gaussian random variables, and (102) is true since we know that σ_NN(x|X) ≤ σ_NTKGP(x|X) from He et al. (2020). By the union bound, the statement above is true with probability at least 1 − 2δ. □

The loss as decomposed in (99) can also be thought of as the sum of the model bias (i.e., how well our neural network architecture is able to fit the given underlying function), given by the first term, and the variance of the prediction due to the random network initialization, given by the second term. By Theorem B.1, we can show that we are minimising the predictive variance (the second term); however, our ability to do so will also depend on the bias term.

C. Efficient Computations of the NTKGP

In the practical deployment of active learning, the cost of data label querying is usually significant (in terms of both time and computation), so the cost of active set selection can easily be dominated by the querying costs. Note that our algorithm incurs less data label querying cost because it can run in large batches while maintaining competitive performance. Furthermore, another factor contributing significantly to the cost of active learning is the training of the neural networks, which our algorithm avoids due to its training-free nature. Nonetheless, we present some possible speedups that our algorithm can utilize, particularly in computing the value of σ²_NTKGP.

C.1. Incremental Computation of σ²_NTKGP

Computing σ²_NTKGP(·|X) for the active learning criteria is computationally expensive. Fortunately, in our active learning algorithm the labelled set X is always incremented by one element at a time, and so when we would like to compute the criterion α(X ∪ {x′}), rather than recomputing σ²_NTKGP(·|X ∪ {x′}) from scratch each time, we can obtain the updated value of σ²_NTKGP(·|X ∪ {x′}) from the already-known value of σ²_NTKGP(·|X). Suppose we let X′ = X ∪ {x′}. We can utilize the formula for the inversion of block matrices (e.g., Eqns. A.11 and A.12 of Rasmussen & Williams (2006)),

[A B; C D]^{−1} = [ (A − BD^{−1}C)^{−1}   −A^{−1}B(D − CA^{−1}B)^{−1} ;  −D^{−1}C(A − BD^{−1}C)^{−1}   (D − CA^{−1}B)^{−1} ],   (103)

and the matrix inversion lemma (e.g., Eqn. A.9 of Rasmussen & Williams (2006)),

(A + BCD)^{−1} = A^{−1} − A^{−1}B(C^{−1} + DA^{−1}B)^{−1}DA^{−1},   (104)

to see that

Θ_{X′}^{−1} = [Θ_X  Θ_{Xx′}; Θ_{x′X}  Θ_{x′}]^{−1}
= [ (Θ_X − Θ_{Xx′}Θ_{x′}^{−1}Θ_{x′X})^{−1}   −Θ_X^{−1}Θ_{Xx′}(Θ_{x′} − Θ_{x′X}Θ_X^{−1}Θ_{Xx′})^{−1} ;  −Θ_{x′}^{−1}Θ_{x′X}(Θ_X − Θ_{Xx′}Θ_{x′}^{−1}Θ_{x′X})^{−1}   (Θ_{x′} − Θ_{x′X}Θ_X^{−1}Θ_{Xx′})^{−1} ]   (105)

and therefore

σ²_NTKGP(x|X′) = Θ_x − Θ_{xX′}Θ_{X′}^{−1}Θ_{X′x}   (106)
= σ²_NTKGP(x|X) − Θ_{xX}Θ_X^{−1}Θ_{Xx′} (Θ_{x′} − Θ_{x′X}Θ_X^{−1}Θ_{Xx′})^{−1} Θ_{x′X}Θ_X^{−1}Θ_{Xx} + 2 Θ_{xX}Θ_X^{−1}Θ_{Xx′} (Θ_{x′} − Θ_{x′X}Θ_X^{−1}Θ_{Xx′})^{−1} Θ_{x′x} − Θ_{xx′} (Θ_{x′} − Θ_{x′X}Θ_X^{−1}Θ_{Xx′})^{−1} Θ_{x′x}.   (107)

Given that Θ_X^{−1} can also be reused from the previous round and updated iteratively using Eq. (105), this means the variance can be computed more efficiently.
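A minimal numpy sketch of the update in (107) is given below; it computes σ²_NTKGP(x|X ∪ {x′}) from quantities cached for X without re-inverting the enlarged kernel matrix (the function names are ours):

```python
import numpy as np

def posterior_var(theta_x, theta_xX, Theta_X):
    """Exact NTKGP posterior variance sigma^2_NTKGP(x | X)."""
    return theta_x - theta_xX @ np.linalg.solve(Theta_X, theta_xX)

def incremental_var(var_x_X, theta_xX, Theta_X_inv, theta_xxp, theta_Xxp, theta_xp):
    """sigma^2_NTKGP(x | X u {x'}) via the rank-one update (107).
    var_x_X:     sigma^2_NTKGP(x | X), cached from the previous round
    Theta_X_inv: cached inverse of Theta_X (updated per round via (105))
    theta_xxp:   Theta(x, x'); theta_Xxp: Theta(X, x'); theta_xp: Theta(x', x')."""
    v = Theta_X_inv @ theta_Xxp        # Theta_X^{-1} Theta_{Xx'}
    s = theta_xp - theta_Xxp @ v       # sigma^2_NTKGP(x' | X), a positive scalar
    a = theta_xX @ v                   # Theta_{xX} Theta_X^{-1} Theta_{Xx'}
    # The three correction terms of (107) collapse to -(a - Theta_{xx'})^2 / s.
    return var_x_X - (a - theta_xxp) ** 2 / s
```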
C.2. Approximating µ_NTKGP and Σ_NTKGP Using Sparse GPs

A benefit of using Σ_NTKGP for our criterion is that, unlike Σ_NN, covariance matrices of the form of Σ_NTKGP are well-studied in the Gaussian process literature, and methods of sparsifying the kernel matrix have been proposed. Below, we discuss how such approximation techniques can be used. Given a set of labelled training data X, we would like to compute the posterior variance Σ_NTKGP(X_T|X). Since it is expensive to compute the posterior directly, we will use techniques from sparse GPs to compute an approximate posterior. In particular, we will use the approximation framework as proposed by Hoang et al. (2015).

Let U = (X_U, y_U) be a set of inducing points which is representative of X_T. Note that in general, U does not have to be a subset of X, and is therefore not necessarily a valid candidate for the active set. For simplicity, we will apply the Fully-Independent Conditional (FIC) approximation, where we assume that y_T and y are conditionally independent given y_U, i.e.,

p(y_T|y) = ∫ p(y_T|y_U, y) p(y_U|y) dy_U ≈ ∫ p(y_T|y_U) p(y_U|y) dy_U.   (108)

We can assume that p(y_U|y) ≈ q(y_U), and the goal is to compute the distribution q(y_U) such that D_KL[q(y_U) ‖ p(y_U|y)] is minimised. Assume that q(y_U) is a normal distribution, i.e., q(y_U) ∼ N(µ_U, Σ_U), where the mean and the covariance depend on our choice of set X. According to Hoang et al. (2015), it can be shown that the µ_U and Σ_U which minimise the KL divergence will also maximise the ELBO as given by (Equation 6 of Hoang et al. (2015))

L(q) = ∫ q(f_U) ∫ q(f|f_U) log( p(f, y|f_U) / q(f|f_U) ) df df_U − D_KL[q(f_U) ‖ p(f_U)]   (109)

where f is the true model function (i.e., the noiseless version of y). In order to compute the distribution q(y_U) which maximises L, we can compute its gradient. In this case the gradient has a nice closed form, and is given by (Equation 17 of Hoang et al. (2015))

∂L/∂µ_U = −Θ_U^{−1} µ_U + ∑_{x∈X} ( G(U, x) − F(U, x) µ_U )   (110)

and (Equation 18 of Hoang et al. (2015))

∂L/∂Σ_U = Σ_U^{−1} − Θ_U^{−1} − ∑_{x∈X} F(U, x)   (111)

where G(U, x) = Θ_U^{−1} Θ_{Ux} Γ_x^{−1} y_x, F(U, x) = Θ_U^{−1} Θ_{Ux} Γ_x^{−1} Θ_{xU} Θ_U^{−1}, and Γ_x = Θ_x − Θ_{xU} Θ_U^{−1} Θ_{Ux}. We also let Θ_U = Θ(X_U, X_U) and Θ_{xU} = Θ(x, X_U). Using the closed-form expression of the gradient, the optimal µ_U and Σ_U can be computed through gradient ascent or by directly solving the equations ∂L/∂µ_U = 0 and ∂L/∂Σ_U = 0. In our experiments, we use the latter method. Given the optimal µ*_U and Σ*_U, the mean of the sparse NTKGP approximation can be computed as

µ^s_NTKGP(y_T|y) = Θ_{TU} Θ_U^{−1} µ*_U,   (112)

and the covariance by

Σ^s_NTKGP(y_T|y) = Θ_T − Θ_{TU} Θ_U^{−1} Θ_{UT} + Θ_{TU} Θ_U^{−1} Σ*_U Θ_U^{−1} Θ_{UT}.   (113)

The approximate posterior can now be computed in time cubic in the number of inducing points, rather than cubic in the number of training points. The equation can also be iteratively computed in quadratic time if Eq. (104) is used for matrix inversion.

From the decomposition in Eqs. (110) and (111), we see that the FIC approximation is convenient because the summations decompose into terms F(U, x) and G(U, x), each of which depends only on a single data point and is independent of the other elements of X. This is convenient when trying to add one element and see what effect it has on the inducing point distribution.
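As an illustration of Eqs. (110)-(113), the stationarity conditions above can be solved in closed form: Σ_U^{−1} = Θ_U^{−1} + ∑_x F(U, x) and µ_U = Σ_U ∑_x G(U, x). The numpy sketch below assumes a kernel function `ntk` returning NTK matrices between two input sets and a helper `ntk_diag` returning prior variances; both names are hypothetical:

```python
import numpy as np

def fic_posterior(ntk, ntk_diag, X_U, X_T, X, y, jitter=1e-6):
    """Sparse (FIC-style) NTKGP posterior over a test set X_T (Eqs. 112-113)."""
    m = len(X_U)
    Th_U = ntk(X_U, X_U) + jitter * np.eye(m)
    Th_U_inv = np.linalg.inv(Th_U)
    Th_UX = ntk(X_U, X)                                   # m x n
    # Gamma_x = Theta_x - Theta_xU Theta_U^{-1} Theta_Ux, for every x in X.
    gamma = ntk_diag(X) - np.einsum('ix,ij,jx->x', Th_UX, Th_U_inv, Th_UX)
    A = (Th_U_inv @ Th_UX) / gamma                        # column x: Th_U^{-1} Th_Ux / Gamma_x
    # Stationary point of the ELBO: Sigma_U^{-1} = Th_U^{-1} + sum_x F(U, x).
    Sigma_U = np.linalg.inv(Th_U_inv + A @ Th_UX.T @ Th_U_inv)
    mu_U = Sigma_U @ (A @ y)                              # since sum_x G(U, x) = A y
    # Eqs. (112)-(113): sparse posterior over X_T.
    P = ntk(X_T, X_U) @ Th_U_inv
    mean = P @ mu_U
    cov = ntk(X_T, X_T) - P @ ntk(X_T, X_U).T + P @ Sigma_U @ P.T
    return mean, cov
```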
Using this fact, it is also possible to apply a further linear approximation on the criterion (as we present in the next subsection) which allows for even more efficient computation.

C.3. Approximating the Incremental Change of µ_NTKGP and Σ_NTKGP

Given the inducing point posterior Σ_U(X) as computed by the method above, we could theoretically compute the approximate posterior Σ^s_NTKGP(X_T|X ∪ {x′}) when one point is added directly using the same method. However, this can still be expensive to perform. Therefore, we propose a simple technique to approximate the updated value of Σ_U(X ∪ {x′}), and therefore Σ^s_NTKGP(X_T|X ∪ {x′}), when we add a small number of points to the training set. The approximation technique is akin to the influence function as proposed by Koh & Liang (2017).

When we add a single point to the training set, the ELBO (Eq. (109)) will only change by a small amount. Let L′ = L + ΔL be the new loss function after adding x′ into the set X, where ΔL is the contribution specifically from the new element x′. Since L′ only changes by an amount ΔL, we also do not expect Σ_U to change by much. Let Σ_U(X ∪ {x′}) = Σ_U(X) + ΔΣ_U(x′; X), where ΔΣ_U(x′; X) is the change of the inducing point posterior after adding the training point x′. Following the influence-function approximation, the change in the inducing point posterior is given by

ΔΣ_U(x′; X) ≈ −( ∂²L/∂Σ_U² )^{−1} ∂ΔL/∂Σ_U.   (114)

Conveniently, the FIC approximation gives rise to a loss function whose gradient is easily decomposable. We can write the derivative of the loss-function contribution from x′ as

∂ΔL/∂Σ_U = −F(U, x′).   (115)

We also see that the Hessian of the original loss is equal to the Jacobian of the matrix inverse, i.e.,

∂²L/∂Σ_U² = ∂Σ_U^{−1}/∂Σ_U,   (116)

which can easily be computed in closed form and also using any auto-differentiation package. Expressions (115) and (116) can be plugged back into (114) to obtain the change of Σ_U when one new sample is added. Therefore, the change of any criterion α which depends on the sNTKGP approximation can itself be approximated as

Δα ≈ ⟨ ∂α/∂Σ_U , ΔΣ_U(x′; X) ⟩.   (117)

This expression can be used to compute the change in the active learning criterion when an additional data point is added. Because this term is based on the inner product of matrix quantities, it is parallelisable and in practice speeds up the algorithm by a large amount.

Experimentally, the Hessian can be expensive to compute. In order to speed up the computation, we can instead use an alternate parametrisation of the sparse GP called the natural parameters, which defines parameters that match more naturally the terms appearing in the Gaussian distribution. We refer the readers to the original paper (Hoang et al., 2015) for further details on this.
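Under the reconstruction of (114)-(116) above, the Hessian of the Σ_U^{−1} term acts on a matrix M as M ↦ −Σ_U^{−1} M Σ_U^{−1}, whose inverse action is M ↦ −Σ_U M Σ_U. The following is a minimal sketch of the resulting first-order update, with all names ours and under the stated assumptions:

```python
import numpy as np

def delta_sigma_u(Sigma_U, F_xp):
    """Influence-function style first-order change of Sigma_U when one point x'
    is added (Eqs. 114-116); F_xp = F(U, x') for the new point."""
    # Gradient of the added ELBO term (Eq. 115): dDeltaL/dSigma_U = -F(U, x').
    # Hessian of the original loss acts as M -> -Sigma_U^{-1} M Sigma_U^{-1}
    # (the Jacobian of the matrix inverse, Eq. 116), with inverse action
    # M -> -Sigma_U M Sigma_U. Plugging into (114) yields:
    return -Sigma_U @ F_xp @ Sigma_U

def delta_alpha(grad_alpha_sigma, Sigma_U, F_xp):
    """First-order change of a criterion alpha (Eq. 117): a Frobenius inner
    product between d(alpha)/d(Sigma_U) and the covariance change."""
    return float(np.sum(grad_alpha_sigma * delta_sigma_u(Sigma_U, F_xp)))
```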
D. Discussions on Active Learning Criteria

In this section, we discuss the active learning criteria based on the NTKGP which we use. We first provide further discussion on the expected variance criterion, then we discuss two other possible criteria omitted from the main text, based on the variance percentile and on mutual information. The experiments comparing the performances of these other active learning criteria can be found in App. H.3.3.

D.1. Expected Variance Criterion

The expected variance criterion has been introduced in the main paper in (9). In this section, we would like to further explore how the active set is selected based on this criterion. In Fig. 8, we embed our data onto a 2D space using t-SNE (van der Maaten & Hinton, 2008), then plot the predictive variance at model initialization in different regions. An interesting observation from the plot is that the output variance with respect to model initialization is actually heteroscedastic. This means there will be some inputs which have more varied model predictions than others at initialization. From the plot, we see that our algorithm does not necessarily pick the samples which evenly cover the whole input space, unlike what coreset-based algorithms may choose to do. Rather, our algorithm tends to also prioritise querying points from regions where the predictive variance is high, and balances this with querying from regions with a larger number of points.

Figure 8. Visualisation of which points are selected by the αEV criterion. The black points represent the unlabelled set of data from the Handwritten Digits dataset (from the UCI repository), mapped onto a 2D plane using t-SNE based on the predictive correlation (i.e., if the model outputs are more correlated, then the points are mapped closer together). The contours represent the predictive standard deviation of each input point, computed empirically with the model output at initialization. The crosses represent the points which are selected by our active learning algorithm using the αEV criterion.

D.1.1. RUNNING TIME

The computation of the criterion αEV(X) has a cost of O(|X_T| |X|²), where X_T is the reference test set. This is due to the fact that we have to compute the GP posterior variance for each of the elements in X_T. Computing the GP posterior is cubic with respect to |X|; however, it can be made quadratic due to the trick of incrementally computing the NTKGP, which we discuss in App. C.1. Note that since the NTK is assumed to be constant from initialization, it can be precomputed at the beginning and reused throughout the active learning process, incurring little cost during the actual active learning process.

D.1.2. PRACTICAL SPEEDUPS

In addition to the speedups in computation discussed in App. C.2, we also make further approximations in order to compute αEV. Namely, rather than computing the empirical mean of the output variance on the whole test set X_T, we first select a subset of the test set using simple diversification methods (e.g., K-MEANS++), then only compute the criterion on those data. We find that not only does this speed up the criterion computation, this technique also removes some bias in the test data distribution and gives a more representative measure of the overall variance across the whole space. We also optimize our computation of αEV further by only computing the variance σ²_NTKGP and not the whole covariance Σ_NTKGP.

D.2. Variance Percentile

Another possible idea is to ensure that some proportion of the test points has a low variance. This can be done by considering the r-th percentile of the variance, i.e.,

α^r_V(X) = percentile( {σ²_NTKGP(x|X) | x ∈ X_T}, r ).   (118)

In this case, letting r = 50 is equivalent to considering the median of the test output variance, and letting r = 100 is equivalent to considering the maximum test output variance. Unfortunately, the maximum function is not submodular in general, and therefore has no theoretical guarantees when points are selected using the greedy algorithm. Nonetheless, we conduct experiments which select points based on this criterion for the cases where r = 90 and r = 100.
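A direct implementation of (118) is simple given the posterior variances; the numpy sketch below reuses the Gram-matrix convention from the αEV sketch in App. A.10 (the index-based API is ours):

```python
import numpy as np

def alpha_var_percentile(Theta, idx_L, idx_T, r=90, noise=1e-6):
    """Variance-percentile criterion (Eq. 118): the r-th percentile of the
    NTKGP posterior variance over the test set (to be minimized, or negated
    when plugged into a greedy maximizer)."""
    Th_X = Theta[np.ix_(idx_L, idx_L)] + noise * np.eye(len(idx_L))
    Th_TX = Theta[np.ix_(idx_T, idx_L)]
    sol = np.linalg.solve(Th_X, Th_TX.T)
    var = np.diag(Theta)[idx_T] - np.einsum('ij,ji->i', Th_TX, sol)
    return float(np.percentile(var, r))
```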
D.3. Mutual Information Criterion

The previously discussed criteria often do not take the distribution of X_T fully into account. Through the lens of information theory, we can view the active learning problem as attempting to select points X_L with outputs y_L such that the most information can be obtained about y_T, so that the dependence between the data points is also taken into account. Similar to Krause et al. (2008), we can attempt to maximise the mutual information

α_MI(X_L) = I[y_L; y_{T\L}]   (119)
= H[y_{T\L}] − H[y_{T\L}|y_L]   (120)
= (1/2) log det Σ_NTKGP(X_{T\L}) − (1/2) log det Σ_NTKGP(X_{T\L}|X_L) + constant,   (121)

where we use the shorthand X_{T\L} = X_T \ X_L. We are able to arrive at the final line since we know the predictive distribution follows a multivariate Gaussian. In the case that the test set and the unlabelled pool are disjoint, we have y_{T\L} = y_T, which means H[y_{T\L}] is a constant regardless of which active set we choose. In this case, maximising α_MI is equivalent to minimising H[y_T|y_L]. Similar to the case of αEV, in experiments where the test set is separate from the unlabelled pool, we simply report the values of H[y_T|y_L].

D.3.1. NUMERICAL STABILITY

A particular issue with using the mutual information is the numerical stability when computing the entropy values. In particular, α_MI involves computing the entropy H[y_{T\L}|y_L], which involves computing the determinant det Σ_NN(X_{T\L}|X_L). However, when the test set X_{T\L} has points which are highly correlated with each other, the matrix Σ_NN(X_{T\L}|X_L) may be singular, causing the log-determinant to be undefined. Notice that the decision of using X_{T\L} instead of X_T already alleviates some of the issues regarding singular matrices. However, in order to prevent further issues, in our experiments we decide to pre-filter the points that are used for the test set. In particular, instead of using all points from the test set, we select only a subset of points such that H[y_T] is maximised. This means that the points in the selected subset will be as independent from each other as possible. Furthermore, using a subset of points instead of the full test set also reduces the matrix size, which speeds up our computation as well. The subset of points is selected using the K-MEANS++ initialization method. Furthermore, to avoid computing the determinant of a singular matrix, we add a diagonal noise term, and compute the determinant of Σ_NN(X_{T\L}|X_L) + σ²_n I instead. This corresponds to the case where the observations have some added noise with variance σ²_n.
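With the jittered log-determinants of App. D.3.1, the criterion (121) is straightforward to evaluate via slogdet; a minimal sketch (names ours, with the same Gram-matrix convention as earlier sketches):

```python
import numpy as np

def alpha_mi(Theta, idx_L, idx_rest, noise_var=1e-3):
    """Mutual-information criterion (Eq. 121), dropping constants, with a
    diagonal noise term for numerical stability (App. D.3.1).
    idx_rest: indices of X_{T\\L} (test points not in the active set)."""
    Th_X = Theta[np.ix_(idx_L, idx_L)] + noise_var * np.eye(len(idx_L))
    Th_R = Theta[np.ix_(idx_rest, idx_rest)]
    Th_RX = Theta[np.ix_(idx_rest, idx_L)]
    prior = Th_R + noise_var * np.eye(len(idx_rest))
    post = Th_R - Th_RX @ np.linalg.solve(Th_X, Th_RX.T) \
           + noise_var * np.eye(len(idx_rest))
    # 1/2 log det Sigma(X_{T\L}) - 1/2 log det Sigma(X_{T\L} | X_L).
    return 0.5 * (np.linalg.slogdet(prior)[1] - np.linalg.slogdet(post)[1])
```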
E. Further Details about the Greedy Algorithm

As denoted in Algo. 1, for the data selection stage, we utilise the simple greedy algorithm to find the set of suitable points. However, the greedy algorithm requires O(nk) criterion computations, which can be slow. There is a large literature on efficient submodular maximisation algorithms with cardinality constraints; this, however, is not the focus of this paper. Furthermore, since the data selection stage is independent of the true labels, our algorithm can query the labels of all selected points in a single batch once the selection is complete.

Nonetheless, for practical purposes, we choose to apply two simple algorithmic optimization techniques (a combined sketch is given at the end of this section).

The first optimization is based on the ACCELERATED-GREEDY algorithm (Minoux, 1978). For each element in the unlabelled data pool, we store the marginal gain value Δ_x(X) ≜ αEV(X ∪ {x}) − αEV(X) for some X that was previously computed. Then, as the greedy algorithm progresses, we continue to grow the selected active set into some X′ ⊇ X. From submodularity, we know that Δ_x(X) ≥ Δ_x(X′). This means that, in one particular round with set X′, if we have already checked another point x′ whose marginal gain in that round satisfies Δ_{x′}(X′) ≥ Δ_x(X), then there is no point in checking x, since we know the criterion value of X′ ∪ {x′} will definitely be at least as high as that obtained from X′ ∪ {x}. In our algorithm, when the empirical NTK is used, we reinitialize the random neural network after each batch of data (which changes the empirical NTK), and as a result also reset the stored values of Δ_x.

Another optimization used is that in each greedy round, we do not compute the criterion on all elements, but only on a random subset of elements per round; the maximum value amongst the elements tested is added to the active set. This is the technique used in the STOCHASTIC-GREEDY algorithm (Mirzasoleiman et al., 2015), which is able to achieve an approximation ratio arbitrarily close to 1 − 1/e in O(n) time (independent of the budget k). In our experiments, we set the per-round subset size to 1000 for experiments with smaller budgets and to 250 for experiments with larger budgets.
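The two optimizations combine naturally: the lazy upper-bound test of ACCELERATED-GREEDY can be applied within each random candidate subset of STOCHASTIC-GREEDY. The sketch below (code structure ours) works for any set criterion α such as αEV, with the pool given as a list of indices:

```python
import heapq
import random

def stochastic_lazy_greedy(alpha, pool, budget, sample_size=1000):
    """Greedy maximization of a (sub)modular set criterion alpha with lazy
    marginal-gain bounds (Minoux, 1978) applied within random candidate
    subsets per round (Mirzasoleiman et al., 2015)."""
    selected = []
    current = 0.0                                   # alpha of the empty set, assumed zero
    bounds = {x: float('inf') for x in pool}        # stale upper bounds on marginal gains
    for _ in range(budget):
        remaining = [x for x in pool if x not in selected]
        candidates = random.sample(remaining, min(sample_size, len(remaining)))
        heap = [(-bounds[x], x) for x in candidates]
        heapq.heapify(heap)                         # max-heap keyed on stale bounds
        best_gain, best_x = float('-inf'), None
        while heap:
            neg_bound, x = heapq.heappop(heap)
            if -neg_bound <= best_gain:
                break                               # no remaining bound can beat the best
            gain = alpha(selected + [x]) - current  # fresh evaluation
            bounds[x] = gain                        # submodularity: gains only shrink
            if gain > best_gain:
                best_gain, best_x = gain, x
        selected.append(best_x)
        current += best_gain
    return selected
```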
F. Proof of Eq. (10)

We verify (10) for the MSE loss. For a single test point (x, y), notice that we can write

E_{θ_0}[ ℓ((x, y), train(θ_0)) ] = E_{y′∼N(µ_NN(x|X), σ²_NN(x|X))}[ (y′ − y)² ]
= E_{y′}[(y′)²] − 2y E_{y′}[y′] + y²
= µ_NN(x|X)² + σ²_NN(x|X) − 2y µ_NN(x|X) + y²
= (y − µ_NN(x|X))² + σ²_NN(x|X)

where the first equality is based on the predictive distribution of the converged model.

G. Further Details on Experimental Setup

In this section, we provide further details of the experiments conducted. All the code for the experiments is also provided in the supplementary material.

G.1. Training Setup for Each Experiment

G.1.1. REGRESSION EXPERIMENTS

We use a combination of generated datasets and real-life data. For this paper, the randomly generated data (Random Model) is constructed from random points from a ball, with their output values generated from a random model initialization (which ensures that the neural network will be adequate to describe the data used). Meanwhile, the real-life training data are taken from the UCI Machine Learning Repository (Dua & Graff). For all datasets, we split the whole data in half, and let one half be the pool of data and the other be the test data which the algorithm has no access to. All the datasets used are regression datasets, with the exception of the Handwritten Digits dataset, which is a classification dataset where we perform regression on the label value (i.e., for the inputs corresponding to the digit 1, we assign y = 1). All datasets are also normalized such that they have a mean of zero and a variance of 1.

In all of the regression experiments, the model used is a two-layer multi-layer perceptron with a width of 512 and with bias. We set σ_W = 1.0 and σ_b = 0.1. The NNs are optimized using gradient descent with step size 0.01.

G.1.2. EXPERIMENTS ON THE CLASSIFICATION DATASETS

In the classification experiments, in order to reduce the training time, we restrict the unlabelled data pool to a random subset of the whole training set. For the MNIST dataset, we randomly select 10k points and use them as our unlabelled pool, while for the remaining classification experiments we randomly select 20k points for the unlabelled pool. All the inputs are also rescaled such that the input values lie in [−1, 1]. For all the models, we train using stochastic gradient descent with a learning rate of 0.1 and weight decay of 0.005. The models are trained with a training batch size of 32 for 100 epochs. Below, we provide a brief description of the datasets used and the model architectures used for training on the corresponding datasets.

MNIST (Deng, 2012). MNIST is a database whose inputs are single-channel images of size 28×28 corresponding to handwritten digits. For MNIST, we use a multi-layer perceptron with two hidden layers each of width 512, as we have done for the regression experiments. We find that using such a simple model without convolutions is sufficient for the MNIST dataset. In the experiments where we vary the network width at the active learning stage and the training stage (Fig. 21), we used a 3-layer MLP and experimented with network widths 128, 256, 512 and 1024. The total query budget was fixed at 600 and the query size at 200.

EMNIST (Cohen et al., 2017). The EMNIST dataset is an extension of the MNIST dataset, where the input contains handwritten characters as well as handwritten digits. There are a total of 47 output classes (since some of the characters that have a similar shape are grouped together in the same class). For EMNIST, we either use a multi-layer perceptron with three hidden layers each of width 512, or a simple convolutional neural network with 2 blocks of [CONV - MAXPOOL - RELU], followed by a fully-connected layer of width 512.

SVHN (Netzer et al., 2011). SVHN consists of RGB images of size 32×32 corresponding to photos of different digits. Each image in the SVHN dataset corresponds to one of 10 classes, each representing a digit. For SVHN, we use a 1-layer WideResNet (Zagoruyko & Komodakis, 2016). We modify the network by removing the batch normalization layers in order to make the module compatible with FUNCTORCH on PyTorch (see App. G.2), such that the NTK is computable by gradient inner products, and also to ensure positive-semidefiniteness of the resulting matrix. While this may result in a lower model accuracy, this is not a main concern of this work, since we are more interested in the relative accuracy between the active learning methods.

CIFAR100 (Krizhevsky, 2009). CIFAR100 is a dataset consisting of RGB images of size 32×32, each corresponding to one of 100 image categories. For CIFAR100, we use a 2-layer WideResNet with similar modifications as for SVHN.

G.1.3. MODEL SELECTION ALGORITHM

For the model selection experiments, the data used are from the UCI datasets (as in the regression experiments). We also construct an artificial dataset, Sinusoidal, whose output is generated from the sine function y = sin(ω^⊤x + b) for some random ω and b. The pool of models M consists of MLPs with depths 1, 2, 3 and 4, and with ReLU, GELU, leaky ReLU (a = 0.1) and erf activations. This makes up a total of 16 candidate models. For the Sinusoidal dataset, we also add an additional model architecture with sinusoidal activation, making a total of 20 candidate models in this case. All the models are implemented in JAX (in order to match the theoretical NTK, which is also computed in JAX). For the model selection algorithm, in order to approximate α_M, we choose κ = 0.3|L| for the labelled set L at that round, and compute the empirical mean of α_M by sampling the random subset 100 times.
The batch size for active learning is 10 or 20 depending on the size of the dataset. The models are trained using the Adam optimizer with learning rate 0.01. The Adam optimizer is used because some of the activation functions suffer from diminishing gradients, and therefore do not behave well under plain gradient descent. The training is all done in a single batch, for 500 epochs.

G.2. Computation of the Neural Tangent Kernel

G.2.1. THEORETICAL NTK USING NEURAL-TANGENTS

The theoretical NTK is computed using the NEURAL-TANGENTS package (Novak et al., 2019). Even though the final trained model will not have exactly the same network parameters as those assumed when computing the theoretical NTK (due to the different packages used during active learning and true model training), we still find that the kernel itself shows enough agreement and remains useful for predicting the model behaviour. The main disadvantage of using the theoretical NTK is that it is not available for all model architectures. In particular, NEURAL-TANGENTS does not work for max-pooling layers (only average pooling is supported), and also does not work for batch normalization or dropout layers placed arbitrarily. For this reason, experiments involving the theoretical NTK are restricted to ones which use an MLP. We also only use the theoretical NTK for experiments involving our algorithms, and do not use it for MLMOC, which uses the empirical kernel in order to predict how a specific model initialization will behave during training.

G.2.2. EMPIRICAL NTK USING PYTORCH

An alternative method of computing the NTK is to do so empirically. In order to compute the NTK empirically, we take the inner product of the model output gradients ∇_θ f(x; θ) with respect to the parameters. This requires us to compute the Jacobian ∇_θ f(x; θ), which is extremely expensive in terms of the required memory and the cost of performing the matrix multiplication. In particular, for a model with p parameters and o outputs, computing the NTK on an input of size n requires O(npo) space and O(n²po) running time for the matrix multiplication alone. In order to perform this operation efficiently, we follow the method used in PyTorch's FUNCTORCH tutorial⁵. Note that FUNCTORCH is not compatible with all neural network components (e.g., batch normalization), and therefore in our experiments we do not use those components in the models. Even though NEURAL-TANGENTS is able to compute the empirical NTK as well, we chose to compute the empirical NTK in PyTorch since the models and (most of) the model training are done in PyTorch anyway.

For classification instances where the model has multiple outputs, we also perform a further approximation of only using the gradient which contributes to a single model output (which is chosen at random). This is possible since the gradient inner products with respect to each output are independent of each other and have the same value in the limit. This is a similar trick to the one used by Mohamadi et al. (2022).

⁵See https://pytorch.org/functorch/stable/notebooks/neural_tangent_kernels.html.
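Following the tutorial referenced above, the per-sample Jacobian contraction can be written with torch.func. The sketch below keeps only a single output coordinate, as described, and assumes the model contains no FUNCTORCH-incompatible components:

```python
import torch
from torch.func import functional_call, jacrev, vmap

def empirical_ntk(model, x1, x2, out_idx=0):
    """Empirical NTK via per-sample Jacobians, following the functorch NTK
    tutorial pattern; only one output coordinate is used, as described above."""
    params = dict(model.named_parameters())

    def f(p, x):
        # Single-example forward pass; keep only one output coordinate.
        return functional_call(model, p, (x.unsqueeze(0),)).squeeze(0)[out_idx]

    # Per-sample gradients: jac[name] has shape (batch, *param_shape).
    jac1 = vmap(jacrev(f), (None, 0))(params, x1)
    jac2 = vmap(jacrev(f), (None, 0))(params, x2)
    # Sum of gradient inner products over all parameter tensors.
    return sum(torch.einsum('i...,j...->ij', jac1[k], jac2[k]) for k in jac1)
```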
G.3. Reported Metrics

To quantify the model performances, we use either the test mean-squared error (MSE) for regression problems, or the test accuracy for classification problems. To quantify the initialization robustness of the models, we use two different metrics as described below.

Output variance. This is used as an initialization-robustness measure for regression tasks, and is the empirical variance of the output of trained neural networks with respect to the different model initializations. In our experiments, we report the 90th percentile output variance, which is defined as the 90th percentile of the output variance across the test data. The 90th percentile is used instead of the 100th percentile (i.e., the maximum output variance) since the 100th percentile tends to focus on outliers rather than reflecting the output variance of the majority of the test data. We also chose to report the 90th percentile value instead of the average value since it indicates the output variance of the worst-case inputs and is therefore a better measure of initialization robustness over the whole space.

Output entropy. This is used as an initialization-robustness measure for classification tasks, and is defined as the empirical entropy of the predicted label. For a neural network, the predicted label for some input x is given by ŷ = argmax_i f(x; θ)_i. Given multiple trained models with different parameters θ_1, ..., θ_k, we obtain different predictions ŷ_1, ..., ŷ_k. The output entropy is then defined as −∑_{i=1}^{c} v_i log v_i, where v_i = (1/k) ∑_{j=1}^{k} 1[ŷ_j = i]. A lower output entropy corresponds to models whose predictions are more consistent with each other.
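The output entropy is computed per test point from the k predicted labels; a short numpy sketch of the definition above (function name ours):

```python
import numpy as np

def output_entropy(preds):
    """Empirical entropy of predicted labels across differently-initialized
    models (App. G.3). preds: array (k_models, n_points) of predicted classes."""
    k, n = preds.shape
    ent = np.zeros(n)
    for i in range(n):
        _, counts = np.unique(preds[:, i], return_counts=True)
        v = counts / k                      # fraction of models predicting each class
        ent[i] = -(v * np.log(v)).sum()     # zero when all k models agree
    return ent
```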
G.4. Description of Other Active Learning Algorithms

Below, we provide a brief description of the other active learning algorithms which we compare our method against.

RANDOM. The batch of points is selected uniformly at random without replacement.

K-MEANS++. The batch of points is selected based on the initialization method from Arthur & Vassilvitskii (2007). When using this algorithm, all the points are selected right from the start (and the order of selection by the algorithm is kept). When the user queries for a batch of size b, the algorithm returns the next b elements that it has chosen from the K-MEANS++ initialization algorithm.

BADGE (Ash et al., 2020). BADGE utilizes the hallucinated gradient space, or the gradient of the loss function with respect to the final output layer,

h(x) = ∇_{θ^{(−1)}} ℓ((x, ŷ), θ),

where θ^{(−1)} are the model parameters of the final output layer and ŷ is the pseudo-prediction from the model based on the model output (for example, in classification problems, ŷ would be the one-hot vector representing the class prediction given by the model output). Pseudo-labels are used since, in active learning, the algorithm does not have access to the true labels. Given the embedding h(x), BADGE then performs diversification on this embedded input. Theoretically, BADGE is able to provide a balance between selecting points which are diverse and selecting points which the model is uncertain about. Unlike in the original paper, for experiments involving BADGE, we do not provide the algorithm with an initial training set. Despite this, we find that BADGE is still able to select a good batch of points from the start, and is even able to beat our algorithm in some experiments.

MLMOC (Mohamadi et al., 2022). MLMOC uses the NTK to provide a prediction of how a neural network would behave when an additional point is added to the training set. Given this, the algorithm selects an active set which, when trained on, will cause the largest change in the model output. In our experiments, MLMOC randomly selects 100 initial points for the classification experiments. The random subset is counted as part of the budget used up by MLMOC. For the SVHN and CIFAR100 experiments, this is fewer points than provided in the original paper. However, this setup is chosen in order to test how the algorithm performs when there is little initial data available. We found that MLMOC performs more poorly when there are fewer initial data points available to it.

H. Further Details on Experimental Results

H.1. Description of Figures

In this section, we give a more detailed description of the figures plotted in the main paper.

Fig. 2. This figure shows the correlation between the variance predicted by the NTKGP, σ²_NTKGP(x|X), and the (empirical) model output variance. The horizontal axis represents the variance based on σ²_NTKGP(x|X) for a certain element x from the validation set and a certain subset X of the unlabelled data pool, while the vertical axis represents the empirical output variance with respect to different model initializations. In this experiment the Robot Kinematics dataset was used.

Fig. 3. This figure shows the relationship between the different criterion values, the 90th percentile output variance, and the test MSE. In each row, from left to right, the graphs show: the relationship between the 90th percentile variance and αEV; the 90th percentile output variance of models when trained on active sets of increasing sizes; and the test MSE when trained on active sets of increasing sizes. The blue line represents the active set which has been sequentially constructed based on the αEV criterion. Each row represents trials conducted on one dataset. SP and PE represent the Spearman rank correlation and Pearson correlation coefficients respectively. The metrics reported on the y-axis are described further in App. G.3.

H.2. Relationship Between σ_NTKGP, σ^s_NTKGP and the Output Variance

In Fig. 9, we provide further results comparing the variance obtained from the σ_NTKGP and σ^s_NTKGP approximations and the true output variance. We find that σ_NTKGP shows good agreement with the output variance, while σ^s_NTKGP provides a decent approximation of the true output variance. Furthermore, note that the graph shows the individual variances and not the sum of variances (as αEV requires), so when computing the criterion, the discrepancy between using the NTKGP or sNTKGP approximation and using the true output variance will be lower (see Fig. 10).

Figure 9. The correlation between the variance predicted by the NTKGP and sNTKGP, and the (empirical) model output variance, on the Robot Kinematics, Protein, Naval and Handwritten Digits (Regression) datasets. The x-axis represents the variance based on the NTKGP or the sNTKGP for a certain element x from the validation set and a certain subset X of the unlabelled data pool, while the y-axis represents the empirical output variance with respect to different model initializations.

H.3. Additional Results From Regression Experiments

H.3.1. EXPERIMENTS ON OTHER REGRESSION DATASETS

In Fig. 10 we show further results for other regression datasets in the sequential selection process, and in Fig. 11, we show the results in the larger-scale, batch setting. We see that the EV-GP criterion is able to outperform both the RANDOM selection and K-MEANS++ selection methods. The latter result is likely due to the fact that K-MEANS++ aims at selecting points which are as diverse as possible from each other in the input space.
This, however, makes K-MEANS++ more prone to selecting outliers which have large differences from other points in the input space yet have little correlation with the rest of the dataset. Therefore, these outliers provide little information for making predictions on the more common data points in the test set. Meanwhile, the EV-GP criterion directly incorporates the distribution of the dataset in the point selection process, and therefore is able to select more correlated points based on the existing dataset.

[Figure 10 panels: Boston (SP=-0.869, PE=-0.88), Naval (SP=-0.930, PE=-0.93), Handwritten Digits (Regression) (SP=-0.825, PE=-0.86) and Robot Kinematics (SP=-0.953, PE=-0.95); each panel plots the EV criterion, the 90th %ile output variance and the test MSE over 0-20 selection points.]

Figure 10. Relationship between the different criterion values, the 90th percentile output variance, and the test MSE. The information shown is presented in the same manner as Fig. 3 (see App. H.1).

[Figure 11 panels: Boston (20-40 selection points) and Naval (25-100 selection points), showing the 90th %ile output variance and the test MSE for Random, EV-GP (ours) and K-Means++.]

Figure 11. Active learning on regression datasets. The information shown is presented in the same manner as Fig. 4 (see App. H.1).

H.3.2. SUITABILITY OF αEV AND THE MSE LOSS

In Fig. 12, we study how well the ability of αEV to represent the test loss reflects how suitable the model is for the given dataset. We find that when αEV shows a higher correlation with the test loss, the achievable test loss (when the model is trained on the entire set) is also lower. This shows that if we have a model which matches the true data closely, then optimizing αEV is more likely to result in a low test loss. This fact can be related back to Theorem B.1, where the constant B represents how well the underlying function fits with the neural network model. A lower value of B means that
We find that α100V tends to perform poorly compared to the other criteria, and sometimes even worse than RANDOM in some instances (in Fig. 15 we did not show some of the results from α100V since it gave a training set which was unstable during training). We believe that this is due to the fact that α100V will sometimes tend to be too focused on reducing the variance of outliers that it does not provide a diverse subset of points. Reducing this threshold to the 90th percentile (and using α90V) is able to give a better subset however still is not as effective as αEV. Meanwhile, αMI often results in good performances, and sometimes even outperforming αEV. However, we find it is still less preferable since its advantage over αEV is not consistent and also due to the aforementioned issue of computational efficiency. H.3.4. RESULTS FOR APPROXIMATION METHODS OF NTKGP In Figs. 16 and 17, we present the results for active learning results for various approximation methods of the NTKGP, namely approximating the NTK empirically, and approximating the covariance using sparse GP techniques (as introduced in App. C.2). We can see that the empirical approximation of the NTK sees some drop in performance compared to the theoretical NTK, however still provides useful enough approximation for the true NTK for our purposes. Meanwhile, in the experiments where the covariance from the s NTKGP is used, we can see that the method also still provides a useful approximation for the criterion, and may be used instead of the true NTKGP, at least for smaller problems. Unfortunately, we find that applying a further linear approximation to the s NTKGP computation worsens the selection process. This suggests that the covariance function from the s NTKGP may not be smooth enough and so assuming a linear approximation method is inaccurate. In practice, we also find that for smaller datasets (where the number of selected points is not smaller than the number of inducing points), the s NTKGP approximation is not as beneficial since the computation required for the full-rank NTKGP is already low and s NTKGP requires more matrix computation that would otherwise. Nonetheless, the s NTKGP may be Training-Free Neural Active Learning with Initialization-Robustness Guarantees Random Model Boston 1.0 0.8 0.6 EV criterion SP=-0.939 PE=-0.95 0 5 10 15 20 Selection points 0.6 0.4 0.2 EV criterion SP=-0.787 PE=-0.82 0 5 10 15 20 Selection points Robot Kinematics Protein 0.6 0.4 EV criterion SP=-0.657 PE=-0.63 0 5 10 15 20 Selection points 0.6 0.4 0.2 EV criterion SP=-0.357 PE=-0.30 0 5 10 15 20 Selection points Figure 13. Plot comparing the values of αEV(X) and the MSE loss when trained on X on different regression datasets. For each dataset, the left plot shows αEV(X) and test MSE when trained on different subsets of X, and the right plot shows the test MSE when trained on subsets of different sizes. The blue line shows the subset selected by our greedy algorithm to maximize αEV. The datasets in the top row shows higher correlation between the criterion and the corresponding MSE loss, whereas the bottom row shows the opposite case. This correlation also reflects how well the chosen model architecture fits with the dataset. a useful approximation method if can be scaled, and can be an interesting future direction of making our method more scalable. H.4. Additional Results From Classification Experiments H.4.1. ADDITIONAL RESULTS FROM EXPERIMENTS INVOLVING NEURAL NETWORKS WITH CONVOLUTIONS In Fig. 
H.4. Additional Results From Classification Experiments

H.4.1. ADDITIONAL RESULTS FROM EXPERIMENTS INVOLVING NEURAL NETWORKS WITH CONVOLUTIONS

In Fig. 18(a-b), we show the corresponding output entropy of the experiments from Fig. 5(c-d). We find that even for convolutional experiments, our method can generally select points which result in good initialization robustness. In Fig. 18(c) we provide results for active learning on the SVHN dataset with ResNet18 (following the setup of Ash et al. (2020)), while Fig. 18(d-e) presents the results for active learning on the CIFAR100 dataset.

H.4.2. COMPARISON BETWEEN EV-GP AND BATCHBALD

We also conducted experiments to compare EV-GP with BATCHBALD (Kirsch et al., 2019), which is a popular uncertainty-based active learning algorithm. BATCHBALD was not included in the original benchmark due to its high computational cost (see Table 1). In Fig. 19, we present the results of these experiments. We find that EV-GP outperforms BATCHBALD, which shows that BATCHBALD is less suitable for scenarios with no initial training data or when large batch sizes are used (a use case not targeted, and hence not tested, by Kirsch et al. (2019)). BATCHBALD also does not take into consideration the overall distribution of the training dataset, which may also harm its performance.

H.4.3. ADDITIONAL RESULTS FROM VARYING ACTIVE LEARNING BATCH SIZES

In Fig. 20, we present further results for neural network active learning as the batch size varies. We find that our algorithm shows little change in both accuracy and output entropy as the batch size gets larger. Note that when the empirical NTK is used, the algorithm seems to perform slightly worse with larger batch sizes. This may be because, for some randomly initialized networks, the empirical NTK provides a slightly less accurate representation of the true uncertainty. With smaller batches, the random neural network used for the NTK computation is reinitialized more often across iterations, so such errors can be reduced. However, we find that in practice this effect is insignificant and we still obtain a good active set regardless.
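For reference, the output entropy reported here can be thought of as measuring how much trained networks disagree across random initializations. The sketch below is one plausible instantiation of such a metric; the function name, array shapes, and the choice of aggregating over seeds by predicted label are illustrative assumptions rather than our exact definition.

```python
import numpy as np

def prediction_entropy(probs):
    """probs: array of shape (n_seeds, n_points, n_classes) holding softmax
    outputs of the same architecture retrained from different random seeds.
    For each test point, form the empirical distribution of the predicted
    label across seeds and return its entropy (0 means all seeds agree)."""
    preds = probs.argmax(axis=-1)                # (n_seeds, n_points)
    n_classes = probs.shape[-1]
    entropy = np.zeros(preds.shape[1])
    for c in range(n_classes):
        p_c = (preds == c).mean(axis=0)          # fraction of seeds predicting c
        safe_log = np.log(np.where(p_c > 0, p_c, 1.0))   # avoid log(0)
        entropy -= p_c * safe_log
    return entropy

# Reported statistics would then be aggregates over the test set, e.g.
# prediction_entropy(probs).mean() or np.percentile(prediction_entropy(probs), 90).
```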
Table 1. Running time for experiments on the MNIST dataset with 2-layer MLPs (corresponding to the experiments in Fig. 5(a)).

Algorithm   | Time to query n points (in seconds, mean ± std.)
            | n = 100       n = 200       n = 300       n = 400
BADGE       | 12 ± 1        23 ± 1        35 ± 2        47 ± 2
BATCHBALD   | 2629 ± 1570   4499 ± 1571   6371 ± 1572   8227 ± 1574
MLMOC       | 88 ± 11       441 ± 62      974 ± 98      1637 ± 120
EV-GP       | 1120 ± 9      2089 ± 12     3159 ± 33     4283 ± 41

Table 2. Running time for the experiments using ResNet18 on the SVHN dataset (corresponding to the experiments in Fig. 18(c)).

Algorithm   | Time to query n points (in seconds, mean ± std.)
            | n = 200      n = 400      n = 600       n = 800       n = 1000
BADGE       | 34 ± 4       70 ± 8       105 ± 9       148 ± 13      204 ± 15
MLMOC       | 2486 ± 3     5485 ± 38    9816 ± 43     16564 ± 62    26607 ± 457
EV-GP       | 3768 ± 10    7846 ± 10    13020 ± 52    18755 ± 91    25018 ± 115

H.4.4. DISCUSSION ON RUNNING TIME OF ACTIVE LEARNING ALGORITHMS

Although the active set selection process can incur a long running time, in practical deployments of active learning the time required for active set selection is easily dominated by the costs of querying data labels and training the neural networks. To this end, EV-GP is able to reduce the practical running time in two aspects. Firstly, since active learning is usually required when querying data labels incurs high costs (e.g., costs in terms of time), an active learning algorithm which requires a smaller number of query batches (while ensuring good performances) is less time-consuming in practice. Secondly, a significant source of the time costs in active learning results from the training of the neural networks, which is further aggravated in active learning problems involving a large amount of data. Since EV-GP is both batch-size independent (as presented in Fig. 6) and training-free, it can reduce the practical costs in both aspects.

Regardless, we may still analyze the amount of time our active learning algorithm requires for active set selection. We present the running times of our experiments, which do not account for the time costs of querying data labels. Specifically, in Table 1 we present the running time for active set selection on the MNIST dataset with a 2-layer MLP, which corresponds to the experiments in Fig. 5(a). Meanwhile, in Table 2 we present the running time for active set selection on the SVHN dataset with ResNet18, which corresponds to the experiments in Fig. 18(c). These results show that the running time of EV-GP is not significantly larger than that of other algorithms which utilize the NTK (e.g., MLMOC) or other sequential greedy-based active learning algorithms (e.g., BATCHBALD). Therefore, our EV-GP outperforms the other algorithms in terms of both predictive performance and initialization robustness (as shown in our previous results) without incurring significantly more running time. As a consequence, in practical settings where the cost of labelling is higher, we expect EV-GP to have comparable or better time efficiency than other active learning algorithms due to its ability to run with large batches at little sacrifice to its performance.

H.4.5. RESULTS FROM VARYING NEURAL NETWORK WIDTH

Here we investigate how the width of the NN affects the performance of our EV-GP-EMP algorithm, by changing the widths of both the NN used for data selection (i.e., for calculating our EV-GP criterion (Sec. 3.3)) and the NN used for training. The left figure in Fig. 21 shows that increasing the width of either the NN for data selection (horizontal axis) or for training (different lines) improves the initialization robustness. This is because a wider NN for both data selection and training reduces the error between the empirical NTK and the exact NTK (Sec. 2.1), and therefore improves the accuracy of the approximation for σ²NTKGP (6) and our EV-GP criterion (9). As a result, this leads to more accurate data selection and hence better initialization robustness. The right figure in Fig. 21 shows that increasing the width of the NN for training improves the test accuracy, which can be attributed to the better expressivity of wider NNs (for a fixed width of the NN used for data selection). Meanwhile, in contrast to initialization robustness, increasing the width of the NN for data selection does not have a significant impact on the test accuracy. This suggests that the NN does not need to be extremely wide in order to select data points that lead to a good predictive performance of the trained NN.

H.4.6. EFFECT OF INITIAL POINTS ON OTHER ACTIVE LEARNING BENCHMARKS

For the BADGE and MLMOC algorithms, we find that the model performance is sometimes affected by how many initial labelled points are given to the algorithms. In Fig. 22, we plot these algorithms when some random initial data is available versus the case where no random initial data is given. We find that BADGE shows little difference whether the first batch of data is randomized or not.
This demonstrates that BADGE is able to select diverse points by itself, without needing extra information to kickstart the algorithm. MLMOC, on the other hand, seems to be more reliant on the initial dataset, with a noticeable drop in performance when the algorithm is given no initial data. This suggests that MLMOC is a poor choice in the low-data regime, where the active learning algorithm has to provide good performance right away even when little prior knowledge about the data is given.

H.5. Additional Results for the Model Selection Algorithm

H.5.1. RELATIONSHIP BETWEEN αM AND TRUE LOSS

In Fig. 23, we present how our model selection criterion αM relates to the true achievable MSE loss. Note that we do not expect αM to be a perfect reflection of the test MSE, since during active learning we do not know the true test set, and αM depends on the current labelled set L. Regardless, we find that αM gives an accurate reflection of the achievable MSE loss.

H.5.2. ADDITIONAL RESULTS ON OTHER REGRESSION DATASETS

In Fig. 24, we present further results for EV-GP+MS on other regression datasets. We find that on most datasets, our algorithm is able to select a suitable model architecture, which leads to a low MSE loss at test time. One issue we have found is that our algorithm is often sensitive to the obtained value of αM. In some cases, multiple models have similar values of αM but achieve different MSE losses after training. This often leads to selecting a suboptimal model (although one which still provides acceptable results). A future research direction is how the algorithm can be improved so that such noise can be mitigated.
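At a high level, the model selection step re-scores every candidate architecture on the current labelled set after each query round and keeps the best-scoring one. The sketch below illustrates this loop; `alpha_M`, `select_batch`, and the structure of the candidate set are hypothetical placeholders standing in for our criterion and batch selection routine, not a verbatim implementation.

```python
def active_learning_with_model_selection(pool, models, alpha_M, select_batch,
                                         n_rounds, batch_size):
    """Hypothetical sketch of an EV-GP+MS-style loop: interleave batch
    selection with model selection. `models` maps architecture names to
    candidate NN configurations, and `alpha_M(model, L)` is a stand-in
    for our model selection criterion evaluated on the labelled set L."""
    L = []                                  # labelled set, initially empty
    model = next(iter(models.values()))     # arbitrary starting architecture
    for _ in range(n_rounds):
        # select the next batch using the current model's criterion
        L += select_batch(model, pool, L, batch_size)
        # re-score every candidate architecture on the enlarged labelled set
        scores = {name: alpha_M(m, L) for name, m in models.items()}
        model = models[max(scores, key=scores.get)]
    return model, L
```

Because several architectures can attain near-identical αM values (the sensitivity issue noted above), tie-breaking or smoothing the scores across rounds would be a natural place to make this loop more robust.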
[Figure 14: one row of panels per dataset (including Robot Kinematics and Handwritten Digits), plotting the 90V, 100V, and MI criterion values and the 90th %ile output variance against the number of selection points for the EV, MI, 90V, 100V, and Random strategies, with per-panel SP/PE correlations (e.g., 90V: SP=-0.957, PE=-0.96 on the first dataset).]

Figure 14. Relationship between the different criteria values, the 90th percentile predictive variance, and the test MSE. Each grey point represents one instance of a randomly-selected active set. The coloured line represents the subset greedily selected to optimize the corresponding criterion. Each row represents trials conducted on one dataset. SP and PE represent the Spearman rank correlation and Pearson correlation coefficients, respectively. The metrics reported on the y-axis are described further in App. G.3.

[Figure 15: panels for Boston, Naval, Robot Kinematics, and Protein, plotting the 90th %ile output variance against the number of selection points for Random, EV-GP, MI-GP, and 90V-GP.]

Figure 15. Active learning on regression datasets with different criteria based on σ²NTKGP. The information shown is presented in the same manner as Fig. 4 (see App. H.1).

[Figure 16: panels for Boston, Naval, Robot Kinematics, and Protein, plotting the 90th %ile output variance against the number of selection points for EV-GP, EV-sGP, EV-sGP+LA, and Random.]

Figure 16. Active learning on regression problems with different approximations of σNTKGP. The graph shows the output variance on the test dataset when trained on selected active sets of various sizes, according to how the criterion was approximated.
The information presented is similar to that shown in the graphs on the right column of Fig. 3 (see App. H.1).

[Figure 17: panels for Naval and Robot Kinematics, plotting the 90th %ile output variance against the number of selection points for Random, EV-GP, EV-sGP, and EV-sGP+LA.]

Figure 17. Active learning on regression datasets with different approximations of σNTKGP and with the empirical NTK. The information shown is presented in the same manner as Fig. 4 (see App. H.1).

[Figure 18: panels plotting the mean output entropy and mean test accuracy against the number of selection points for Random, EV-GP-Emp (ours), K-Means++, and BADGE. Panels: (a) EMNIST, 2-layer CNN; (b) SVHN, Wide ResNet; (c) SVHN, ResNet18, batch size 200, budget 800; (d) CIFAR100, Wide ResNet, batch size 200, budget 800; (e) CIFAR100, Wide ResNet, batch size 500, budget 2000.]

Figure 18. (a-b) Achieved output entropy for active learning in the corresponding classification experiments conducted for Fig. 5(c-d). (c) Additional results on classification using ResNet18 on the SVHN dataset. (d-e) Additional results on classification using Wide ResNet on CIFAR100 with different batch sizes and query budgets.

[Figure 19: panels for MNIST (2-layer MLP) and EMNIST (CNN), plotting the 90th %ile output entropy and mean test accuracy against the number of selection points for EV-GP / EV-GP-Emp (ours) and BatchBALD.]

Figure 19. Comparison between EV-GP and BATCHBALD.

[Figure 20: panels for MNIST (2-layer MLP), EMNIST (3-layer MLP), SVHN (Wide ResNet), and EMNIST (CNN), plotting the 90th %ile output entropy and mean test accuracy against the batch size for Random, EV-GP (ours) / EV-GP-Emp (ours), K-Means++, BADGE, and MLMOC.]

Figure 20. Results on classification with varying batch sizes.

[Figure 21: MNIST (3-layer MLP).]

Figure 21. Performances of our EV-GP-EMP with varying NN widths for data selection (x-axis) and NN training (different lines).

[Figure 22: panels for MNIST (2-layer MLP), EMNIST (3-layer MLP), and SVHN (Wide ResNet), plotting the mean test accuracy against the number of selection points for Random, MLMOC, MLMOC (Rand. Init.), BADGE, and BADGE (Rand. Init.).]

Figure 22. Results on classification with BADGE and MLMOC when random initial data is given. For the variants with the (Rand. Init.) qualifier, the algorithm randomizes the first batch of data instead of using its own criterion to select those points.

[Figure 23: panels for Robot Kinematics and Naval, each showing scores at |L| = 20, scores at |L| = 200, and the score progression.]

Figure 23. Further results related to EV-GP+MS from the ablation experiments conducted for Fig. 7.
The graphs present how EV-GP+MS selects a model during one trial of active learning. Left and center: plots of αM against the obtained test MSE for each model architecture at different iterations of the active learning process. Points with the same shape are models with the same activation function but varying depth. Right: visualization of which models are selected by EV-GP+MS in each round. The gray crosses represent the training results when using a particular architecture from M. The red line represents the choice from EV-GP+MS, and the blue line represents the training if we had stuck with the 2-layer MLP with ReLU activation throughout.

[Figure 24: panels for Protein and Sinusoidal, each showing scores at |L| = 20, scores at |L| = 200, the score progression, and a comparison of EV-GP (Fixed M), EV-GP + MS, and EV-GP + NASWOT over 50-200 selected points.]

Figure 24. Additional experiments for EV-GP+MS conducted on other datasets. The data is presented in the same manner as Figs. 7 and 23.